Title: Conversion of str-Instances does not use Java's default charset
Type: behaviour
Severity: normal
Components:
Versions: Jython 2.5
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: omatz, santa4nt, zyasoft
Priority:
Keywords:

Created on 2013-07-31.10:04:26 by omatz, last changed 2014-05-21.21:16:32 by zyasoft.

File name Uploaded Description
omatz, 2013-08-14.08:31:08
msg8077 (view) Author: Oliver Matz (omatz) Date: 2013-07-31.10:04:25
strval = "schön"
hexval = ':'.join(x.encode('hex') for x in strval)
print hexval

correctly outputs: 73:63:68:c3:b6:6e
so the German o-umlaut (Unicode \u00f6) is correctly converted to the UTF-8 byte sequence "c3 b6"

It should be possible to configure jython in such a way that if I pass strval to a Java-Method, then the str is correctly converted back to "schön".

However, Jython invokes the Java method with the 6-character string value "schÃ¶n", i.e., the o-umlaut has been replaced by the two characters \u00c3 \u00b6.
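
The effect described above can be imitated in plain Python, without Jython or Java. This is a minimal sketch under one assumption: that the conversion widens each byte of the str to one character, which is the same thing decoding as ISO-8859-1 (latin-1) does. The variable names are illustrative only and are not part of the original report.

```python
# Illustration only: imitates the observed conversion in plain Python.
text = u"sch\u00f6n"                  # the intended 5-character text
utf8_bytes = text.encode("utf-8")     # the 6 bytes that the str holds

# Assumption: each byte becomes one character (equivalent to latin-1 decoding).
widened = utf8_bytes.decode("latin-1")

# What the report asks for: the bytes interpreted as UTF-8.
decoded = utf8_bytes.decode("utf-8")

print(len(widened))                       # 6
print(len(decoded))                       # 5
print(widened == u"sch\u00c3\u00b6n")     # True
```

Under that assumption, the Java side receives six characters because no charset is ever applied to the bytes.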
msg8078 (view) Author: Oliver Matz (omatz) Date: 2013-07-31.10:51:11
Sorry for my last note being incomplete.
I have the described encoding problem.  I have reproduced the problem with Jython 2.5.2 and 2.5.3 on Windows 7 with Java 7.
I know that I could change the Jython code and use unicode strings throughout.  However, this is not a desirable option for us because the Jython code is not under my control.

In any case I cannot imagine that the behaviour for multibyte-characters is intended by anybody.
I would expect that the conversion uses either Java's default charset, which can be specified by system property file.encoding, or some otherwise specifiable charset.
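
Where the calling code can be wrapped, one possible workaround is to decode byte strings with an explicitly chosen charset just before they cross into Java. The sketch below is hypothetical: `to_text` and its `charset` parameter are made-up names, not a Jython API, and the check against `bytes` assumes an interpreter that defines that name (on older interpreters `str` would take its place).

```python
def to_text(value, charset="utf-8"):
    """Hypothetical helper: decode byte strings with `charset`, pass text through."""
    if isinstance(value, bytes):
        return value.decode(charset)
    return value


raw = u"sch\u00f6n".encode("utf-8")       # byte string as in the report

print(to_text(raw) == u"sch\u00f6n")      # True: decoded as UTF-8
print(len(to_text(raw, "latin-1")))       # 6: one character per byte
print(to_text(u"sch\u00f6n") == u"sch\u00f6n")  # True: text is left untouched
```

Making the charset a parameter mirrors the request for "some otherwise specifiable charset"; it does not change how Jython itself converts str arguments.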
msg8079 (view) Author: Oliver Matz (omatz) Date: 2013-08-01.08:05:46
Another remark: in order for all this to work, my jython-code starts with the header below.  (I had experimented with other settings, too.)

# -*- coding: utf-8 -*-
from org.python.core import codecs
msg8081 (view) Author: Oliver Matz (omatz) Date: 2013-08-14.08:39:25
The attached program reproduces the problem.  I paste its output below; let us see what happens to the umlauts in the display.
Mind the difference between the two lines prefixed by java-hex:
the first reveals the error: the Java String has two characters for the German o-umlaut, obtained by padding each of the two bytes of its UTF-8 sequence with a zero byte.

# -*- coding: utf-8 -*-
from org.python.core import codecs
import Issue2073Main
print 'schön'
print 'python-hex:', ':'.join(x.encode('hex') for x in 'schön')
python-hex: 73:63:68:c3:b6:6e
javaPrint: schön
java-hex: 73:63:68:c3:b6:6e
expected output:
javaPrint: schön
java-hex: 73:63:68:f6:6e
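
The two java-hex lines can be reproduced in plain Python by dumping the code point of every character. This is only a sketch: `codepoint_hex` is a hypothetical helper, and the attached Java program may format its output differently. It again assumes that the reported behaviour is equivalent to decoding the bytes as latin-1.

```python
def codepoint_hex(text):
    """Hypothetical helper: colon-separated hex of each character's code point."""
    return ":".join("%02x" % ord(ch) for ch in text)


utf8_bytes = u"sch\u00f6n".encode("utf-8")

# One character per byte (assumed equivalent to the reported behaviour).
reported = codepoint_hex(utf8_bytes.decode("latin-1"))
# Bytes decoded as UTF-8 (the expected behaviour).
expected = codepoint_hex(utf8_bytes.decode("utf-8"))

print("reported: " + reported)   # reported: 73:63:68:c3:b6:6e
print("expected: " + expected)   # expected: 73:63:68:f6:6e
```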
msg8473 (view) Author: Jim Baker (zyasoft) Date: 2014-05-21.21:16:01
This is an unfortunate aspect of Python's str/unicode distinction. Jython used to ignore this distinction, but it caused significant incompatibility with running standard Python code.

The fact that you can construct bytestrings out of UTF-8 sequences in this way is very much orthogonal; I should also point out that the console is a bit problematic here ( I will stick with scripts:

# -*- coding: utf-8 -*-
val = u"schön"
print val

Does print out what we expect:

$ jython27

So please use unicode for unicode strings, str (or bytes) for byte strings.

Jython 3.x will make this work much better by making the usual string type be unicode, which will align better with Java usage, but that's very much vaporware for now.
Date                 User      Action  Args
2014-05-21 21:16:32  zyasoft   set     status: open -> closed
2014-05-21 21:16:02  zyasoft   set     resolution: wont fix; messages: + msg8473; nosy: + zyasoft
2013-08-14 08:39:26  omatz     set     messages: + msg8081
2013-08-14 08:31:08  omatz     set     files: +
2013-08-01 08:05:46  omatz     set     messages: + msg8079
2013-07-31 16:37:35  santa4nt  set     nosy: + santa4nt
2013-07-31 10:51:12  omatz     set     messages: + msg8078
2013-07-31 10:04:26  omatz     create