Issue2073

classification
Title: Conversion of str-Instances does not use Java's default charset
Type: behaviour Severity: normal
Components: Versions: Jython 2.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: omatz, santa4nt, zyasoft
Priority: Keywords:

Created on 2013-07-31.10:04:26 by omatz, last changed 2014-08-28.15:04:34 by omatz.

Files
File name Uploaded Description
Issue2073Main.java omatz, 2013-08-14.08:31:08
Issue2073Test.java omatz, 2014-08-28.15:04:33 JUnit test that reproduces the problem; may be put into tests/java/org/python/tests/
Messages
msg8077 Author: Oliver Matz (omatz) Date: 2013-07-31.10:04:25
strval = "schön"
hexval = ':'.join(x.encode('hex') for x in strval)
print hexval

correctly outputs: 73:63:68:c3:b6:6e
so the German o-umlaut (Unicode \u00f6) is correctly converted to the UTF-8 byte sequence "c3 b6"

It should be possible to configure jython in such a way that if I pass strval to a Java-Method, then the str is correctly converted back to "schön".

However, Jython invokes the Java method with the 6-character string value "schÃ¶n", i.e., the o-umlaut has been replaced by the two characters \u00c3 \u00b6.
msg8078 Author: Oliver Matz (omatz) Date: 2013-07-31.10:51:11
Sorry for my last note being incomplete.
I have the described encoding problem.  I have reproduced the problem with Jython 2.5.2 and 2.5.3 on Windows 7 with Java 7.
I know that I could change the Jython code and use unicode strings throughout.  However, this is not a desirable option for us because the Jython code is not under my control.

In any case I cannot imagine that the behaviour for multibyte-characters is intended by anybody.
I would expect that the conversion uses either Java's default charset, which can be specified by system property file.encoding, or some otherwise specifiable charset.
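
A minimal Java sketch of the kind of conversion asked for here. The class CharsetAwareDecode and its decode method are hypothetical names chosen for illustration, not Jython API. The sketch only shows decoding the bytes of a str with an explicitly chosen charset, falling back to Charset.defaultCharset() (which the JVM typically derives from file.encoding) when none is given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jython: decodes the bytes of a str
// with a caller-supplied charset instead of widening each byte to a char.
final class CharsetAwareDecode {

    // Falls back to the JVM default charset when no charset is given.
    static String decode(byte[] bytes, Charset cs) {
        Charset effective = (cs != null) ? cs : Charset.defaultCharset();
        return new String(bytes, effective);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the example word: 73 63 68 c3 b6 6e
        byte[] utf8 = {0x73, 0x63, 0x68, (byte) 0xc3, (byte) 0xb6, 0x6e};

        String decoded = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.length());                 // 5
        System.out.println((int) decoded.charAt(3) == 0xf6);  // true

        // Platform dependent, so no expected value is given here.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

With an explicit UTF-8 charset the two bytes c3 b6 come back as the single character o-umlaut; what the default charset yields depends on the platform and JVM settings.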
msg8079 Author: Oliver Matz (omatz) Date: 2013-08-01.08:05:46
Another remark: in order for all this to work, my jython-code starts with the header below.  (I had experimented with other settings, too.)

# -*- coding: utf-8 -*-
from org.python.core import codecs
codecs.setDefaultEncoding('utf-8')
msg8081 Author: Oliver Matz (omatz) Date: 2013-08-14.08:39:25
The attached program reproduces the problem.  I paste its output; let us see what happens to the umlauts in the display.
Mind the difference between the two lines prefixed by java-hex:
the first reveals the error: the Java String has two characters for the German o-umlaut, obtained by padding each of the two bytes of its UTF-8 sequence with a zero byte.


Executing:
# -*- coding: utf-8 -*-
from org.python.core import codecs
import Issue2073Main
codecs.setDefaultEncoding('utf-8')
print 'schön'
print 'python-hex:', ':'.join(x.encode('hex') for x in 'schön')
Issue2073Main.javaPrint('schön')
--------------------------------------------
schön
python-hex: 73:63:68:c3:b6:6e
javaPrint: schön
java-hex: 73:63:68:c3:b6:6e
--------------------------------------------
expected output:
javaPrint: schön
java-hex: 73:63:68:f6:6e
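
For readers who want to reproduce the two java-hex lines without the attachment, here is a small Java sketch. Utf8Repair, javaHex and repair are hypothetical names; the code of Issue2073Main.java is not shown in this thread, so this is not that program. The repair method is a possible Java-side workaround under the assumption that the received String really consists of widened UTF-8 bytes (every char at most 0xff).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for illustration only.
final class Utf8Repair {

    // Colon-separated hex value of each char, in the style of the java-hex lines.
    static String javaHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    // Assumes each char is one byte of a UTF-8 sequence widened to a char.
    static String repair(String widened) {
        byte[] raw = widened.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The 6-character value that the Java method currently receives.
        String received = "sch\u00c3\u00b6n";
        System.out.println(javaHex(received));          // 73:63:68:c3:b6:6e
        System.out.println(javaHex(repair(received)));  // 73:63:68:f6:6e
    }
}
```

Such a repair is only safe when the caller knows the argument came from a UTF-8 encoded str; applied to a genuine unicode value it would corrupt the text.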
msg8473 Author: Jim Baker (zyasoft) Date: 2014-05-21.21:16:01
This is an unfortunate aspect of Python's str/unicode distinction. Jython used to ignore this distinction, but it caused significant incompatibility with running standard Python code.

The fact that you can construct bytestrings out of UTF-8 sequences in this way is very much orthogonal; I should also point out that the console is a bit problematic here (https://wiki.python.org/jython/ConsoleChoices). I will stick with scripts:

# -*- coding: utf-8 -*-
val = u"schön"
print val

Does print out what we expect:

$ jython27 test_utf8.py
schön

So please use unicode for unicode strings, str (or bytes) for byte strings.

Jython 3.x will make this work much better by making the usual string type be unicode, which will align better with Java usage, but that's very much vaporware for now.
msg8941 Author: Oliver Matz (omatz) Date: 2014-08-28.15:04:33
Draft:

@Jim Baker, msg8473: Thank you for taking the time to investigate my problem.
I absolutely agree that it was a good decision to change the string handling between 2.1 and 2.5, in that unicode strings are now the default string type.
You are right, if the jython code uses type unicode as opposed to str, everything works nicely.  Unfortunately, we have thousands of lines of jython code that (1) use str, (2) used to work correctly for jython 2.1 and (3) are not shipped with our software but are stored in databases at our customers' sites.
Changing that code manually to use unicode rather than str is tedious and error-prone.

I insist that the current behaviour is undesirable. If you use UTF-8 encoding and pass the German letter o-umlaut (Unicode: \u00f6) from Java to Jython and then back from Jython to Java, that single letter is currently replaced by the two characters \u00c3\u00b6.  That cannot be intended!

The bug is in the method org.python.core.util.StringUtil.fromBytes(byte[], int, int).
There you find the following snippet, mind the comment:
------------------------------------------------------
// Yes, I know the method is deprecated, but it is the fastest
// way of converting between between byte[] and String
return new String(buf, 0, off, len);
------------------------------------------------------
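
To make the effect of that deprecated constructor concrete, here is a small standalone Java sketch. DeprecatedCtorDemo, widen and decodeUtf8 are hypothetical names for illustration, not Jython code. The constructor String(byte[] ascii, int hibyte, int offset, int count) turns every byte into one char whose upper 8 bits are hibyte, so with hibyte 0 a two-byte UTF-8 sequence becomes two chars.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jython.
final class DeprecatedCtorDemo {

    // Same call shape as the quoted snippet: one char per byte, high byte 0.
    @SuppressWarnings("deprecation")
    static String widen(byte[] buf, int off, int len) {
        return new String(buf, 0, off, len);
    }

    // Charset-aware alternative: multi-byte sequences are decoded properly.
    static String decodeUtf8(byte[] buf, int off, int len) {
        return new String(buf, off, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the example word: 73 63 68 c3 b6 6e
        byte[] utf8 = {0x73, 0x63, 0x68, (byte) 0xc3, (byte) 0xb6, 0x6e};
        System.out.println(widen(utf8, 0, utf8.length).length());      // 6
        System.out.println(decodeUtf8(utf8, 0, utf8.length).length()); // 5
    }
}
```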

That method StringUtil.fromBytes(byte[], int, int) is used in a couple of places where the correct encoding is not obvious from the source code.  Fortunately, for the specific problem under consideration, that encoding *is* evident, so I propose the following fix:

Add two new methods in org.python.core.util.StringUtil, both overloading existing methods with an additional parameter of type Charset:
------------------------------------------------------
public static String fromBytes(ByteBuffer buf, Charset cs) {
  // the third argument is a byte count, not an end index
  return fromBytes(buf.array(), buf.arrayOffset() + buf.position(),
      buf.remaining(), cs);
}

public static String fromBytes(byte[] buf, int off, int len, Charset cs) {
  return new String(buf, off, len, cs);
}
------------------------------------------------------
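
A standalone sketch of how a ByteBuffer overload of this kind behaves when the buffer's position is not zero. ByteBufferDecode is a hypothetical class for illustration, not the Jython StringUtil. It assumes an array-backed buffer, and it passes buf.remaining() because the third argument of String(byte[], int, int, Charset) is a byte count rather than an end index.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jython.
final class ByteBufferDecode {

    // Assumes buf.hasArray() is true; decodes from position up to limit.
    static String fromBytes(ByteBuffer buf, Charset cs) {
        return new String(buf.array(), buf.arrayOffset() + buf.position(),
                buf.remaining(), cs);
    }

    public static void main(String[] args) {
        // Two leading '-' bytes followed by the UTF-8 bytes of the example word.
        byte[] data = {0x2d, 0x2d, 0x73, 0x63, 0x68, (byte) 0xc3, (byte) 0xb6, 0x6e};
        ByteBuffer buf = ByteBuffer.wrap(data);
        buf.position(2); // skip the two leading bytes

        String s = fromBytes(buf, StandardCharsets.UTF_8);
        System.out.println(s.length()); // 5
    }
}
```

Passing an end index instead of a count would only work by coincidence when both the array offset and the position are zero.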

Then, in method org.python.antlr.GrammarActions.extractString(Token, String, boolean) (line 520) replace

------------------------------------------------------
string = StringUtil.fromBytes(decoded);
------------------------------------------------------
by
------------------------------------------------------
string = StringUtil.fromBytes(decoded, cs); // Issue2073: cs!
------------------------------------------------------
History
Date User Action Args
2014-08-28 15:04:34 omatz set files: + Issue2073Test.java, messages: + msg8941
2014-05-21 21:16:32 zyasoft set status: open -> closed
2014-05-21 21:16:02 zyasoft set resolution: wont fix, messages: + msg8473, nosy: + zyasoft
2013-08-14 08:39:26 omatz set messages: + msg8081
2013-08-14 08:31:08 omatz set files: + Issue2073Main.java
2013-08-01 08:05:46 omatz set messages: + msg8079
2013-07-31 16:37:35 santa4nt set nosy: + santa4nt
2013-07-31 10:51:12 omatz set messages: + msg8078
2013-07-31 10:04:26 omatz create