Title: Conversion of str-Instances does not use Java's default charset
Type: behaviour Severity: normal
Components: Versions: Jython 2.5
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: omatz, santa4nt
Priority: Keywords:

Created on 2013-07-31.10:04:26 by omatz, last changed 2013-08-14.08:39:26 by omatz.

File name Uploaded Description
omatz, 2013-08-14.08:31:08
msg8077 (view) Author: Oliver Matz (omatz) Date: 2013-07-31.10:04:25
strval = "schön"
hexval = ':'.join(x.encode('hex') for x in strval)
print hexval

correctly outputs: 73:63:68:c3:b6:6e
so the German o-umlaut (Unicode \u00f6) is correctly converted to the UTF-8 byte sequence "c3 b6".

It should be possible to configure Jython in such a way that if I pass strval to a Java method, the str is correctly converted back to "schön".

However, Jython invokes the Java method with the 6-character string value "schÃ¶n", i.e., the o-umlaut has been replaced by the two characters \u00c3 \u00b6.
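
The observed result is consistent with each byte of the str being mapped to one Java char of the same value, which is the effect of a Latin-1 (ISO-8859-1) decode. A minimal sketch in plain Python that reproduces this effect follows; it is an illustration under that assumption, not Jython's actual conversion code, and the names widened and hex_codepoints are made up for the example.

```python
# Illustration only: emulates a byte-per-character conversion with a
# Latin-1 decode. This is an assumption about the effect, not Jython's code.

text = u"sch\u00f6n"                  # the intended 5-character string
utf8_bytes = text.encode("utf-8")     # 6 bytes: 73 63 68 c3 b6 6e

# Each byte becomes one character with the same code point.
widened = utf8_bytes.decode("latin-1")


def hex_codepoints(s):
    """Join the code points of s as two-digit hex values."""
    return ":".join("%02x" % ord(ch) for ch in s)


print(len(widened))                    # 6
print(hex_codepoints(widened))         # 73:63:68:c3:b6:6e
print(widened == u"sch\u00c3\u00b6n")  # True
```

The hex line matches the byte sequence printed above, which is why the Java side sees six characters instead of five.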
msg8078 (view) Author: Oliver Matz (omatz) Date: 2013-07-31.10:51:11
Sorry for my last note being incomplete.
I have the described encoding problem. I have reproduced the problem with Jython 2.5.2 and 2.5.3 on Windows 7 with Java 7.
I know that I could change the Jython code and use unicode strings throughout. However, this is not a desirable option for us because the Jython code is not under my control.

In any case I cannot imagine that the behaviour for multi-byte characters is intended by anybody.
I would expect that the conversion uses either Java's default charset, which can be specified by system property file.encoding, or some otherwise specifiable charset.
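
The requested behaviour can be sketched as follows. The helper to_java_text and its charset parameter are hypothetical; they only stand in for whatever Jython would do with file.encoding or another configured charset. The sketch is plain Python, not Jython internals.

```python
# Hypothetical helper: decodes the raw bytes of a str with a configurable
# charset before the text is handed to Java. Not an existing Jython API.
def to_java_text(raw_bytes, charset="utf-8"):
    return raw_bytes.decode(charset)


raw = u"sch\u00f6n".encode("utf-8")   # the bytes as stored in the str
converted = to_java_text(raw, "utf-8")

print(len(converted))                                  # 5
print(":".join("%02x" % ord(ch) for ch in converted))  # 73:63:68:f6:6e
```

With a charset-aware decode the o-umlaut arrives as the single character \u00f6 rather than as two characters.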
msg8079 (view) Author: Oliver Matz (omatz) Date: 2013-08-01.08:05:46
Another remark: in order for all this to work, my jython-code starts with the header below.  (I had experimented with other settings, too.)

# -*- coding: utf-8 -*-
from org.python.core import codecs
msg8081 (view) Author: Oliver Matz (omatz) Date: 2013-08-14.08:39:25
The attached program reproduces the problem. I paste its output; let us see what happens to the umlauts in the display.
Mind the difference between the two lines prefixed by java-hex:
the first reveals the error: the Java String has two characters for the German o-umlaut, obtained by padding each of the two bytes of its UTF-8 sequence with a zero byte.

# -*- coding: utf-8 -*-
from org.python.core import codecs
import Issue2073Main
print 'schön'
print 'python-hex:', ':'.join(x.encode('hex') for x in 'schön')
python-hex: 73:63:68:c3:b6:6e
javaPrint: schÃ¶n
java-hex: 73:63:68:c3:b6:6e
expected output:
javaPrint: schön
java-hex: 73:63:68:f6:6e
Date                 User      Action  Args
2013-08-14 08:39:26  omatz     set     messages: + msg8081
2013-08-14 08:31:08  omatz     set     files: +
2013-08-01 08:05:46  omatz     set     messages: + msg8079
2013-07-31 16:37:35  santa4nt  set     nosy: + santa4nt
2013-07-31 10:51:12  omatz     set     messages: + msg8078
2013-07-31 10:04:26  omatz     create