Message8941

Author	omatz
Recipients	omatz, santa4nt, zyasoft
Date	2014-08-28.15:04:33
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1409238274.84.0.536437601379.issue2073@psf.upfronthosting.co.za>
In-reply-to

Content

Draft:

@Jim Baker, msg8473: Thank you for taking the time to investigate my problem.
I absolutely agree that it was a good decision to change the string handling between 2.1 and 2.5, in that unicode strings are now the default string type.
You are right: if the Jython code uses type unicode as opposed to str, everything works nicely. Unfortunately, we have thousands of lines of Jython code that (1) use str, (2) used to work correctly with Jython 2.1 and (3) are not shipped with our software but are stored in databases at our customers' sites.
Changing that code manually to use unicode rather than str is tedious and error-prone.

I insist that the current behaviour is undesirable. If you use UTF-8 encoding and pass the German letter o-umlaut (unicode: \u00f6) from Java to Jython and then back from Jython to Java, that single letter is currently replaced by the two characters \u00c3\u00b6. That cannot be intended!

The bug is in the method org.python.core.util.StringUtil.fromBytes(byte[], int, int).
There you find the following snippet, mind the comment:
------------------------------------------------------
// Yes, I know the method is deprecated, but it is the fastest
// way of converting between between byte[] and String
return new String(buf, 0, off, len);
------------------------------------------------------
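
To illustrate the effect outside of Jython, here is a minimal standalone sketch (the class and method names are made up for this example and are not part of Jython). It assumes only that the deprecated constructor String(byte[], int hibyte, int offset, int count) maps every byte to one char, which is what its Javadoc describes. With hibyte 0, the two UTF-8 bytes of o-umlaut become two separate characters:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jython.
public class HibyteDemo {

    // Mirrors the deprecated call quoted above: each byte becomes one char
    // whose upper 8 bits are the hibyte argument (0 here).
    @SuppressWarnings("deprecation")
    static String decodeHibyteZero(byte[] buf) {
        return new String(buf, 0, 0, buf.length);
    }

    // Decodes the same bytes with an explicit charset.
    static String decodeUtf8(byte[] buf) {
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // o-umlaut encoded as UTF-8 is the two bytes 0xC3 0xB6.
        byte[] utf8 = "\u00f6".getBytes(StandardCharsets.UTF_8);

        String wrong = decodeHibyteZero(utf8);
        String right = decodeUtf8(utf8);

        System.out.println(wrong.length());                // 2
        System.out.println(wrong.equals("\u00c3\u00b6"));  // true
        System.out.println(right.equals("\u00f6"));        // true
    }
}
```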

That method StringUtil.fromBytes(byte[], int, int) is used in a couple of places where the correct encoding is not obvious from the source code.  Fortunately, for the specific problem under consideration, that encoding *is* evident, so I propose the following fix:

Add two new methods in org.python.core.util.StringUtil, both overloading existing methods with an additional parameter of type Charset:
------------------------------------------------------
public static String fromBytes(ByteBuffer buf, Charset cs) {
  // The third argument is a length, so pass the number of remaining bytes.
  return fromBytes(buf.array(), buf.arrayOffset() + buf.position(),
      buf.remaining(), cs);
}

public static String fromBytes(byte[] buf, int off, int len, Charset cs) {
  return new String(buf, off, len, cs);
}
------------------------------------------------------
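
The two overloads can be tried in isolation with a sketch like the following. The class name is hypothetical and the methods are copies for illustration, not the actual StringUtil code. It assumes the buffer is backed by an accessible array and that the third parameter of the byte[] overload is a length, so the ByteBuffer variant passes buf.remaining(); this matters as soon as the buffer's position is not 0:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the proposed StringUtil overloads.
public class CharsetFromBytesSketch {

    // Decodes len bytes starting at off using the given charset.
    static String fromBytes(byte[] buf, int off, int len, Charset cs) {
        return new String(buf, off, len, cs);
    }

    // Decodes the bytes between position and limit of an array-backed buffer.
    static String fromBytes(ByteBuffer buf, Charset cs) {
        return fromBytes(buf.array(), buf.arrayOffset() + buf.position(),
                buf.remaining(), cs);
    }

    public static void main(String[] args) {
        // "x", o-umlaut, "y" in UTF-8: four bytes, the umlaut occupies bytes 1..2.
        byte[] raw = "x\u00f6y".getBytes(StandardCharsets.UTF_8);

        ByteBuffer buf = ByteBuffer.wrap(raw);
        buf.position(1);
        buf.limit(3);

        String s = fromBytes(buf, StandardCharsets.UTF_8);
        System.out.println(s.equals("\u00f6")); // true
    }
}
```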

Then, in method org.python.antlr.GrammarActions.extractString(Token, String, boolean) (line 520) replace

------------------------------------------------------
string = StringUtil.fromBytes(decoded);
------------------------------------------------------
by
------------------------------------------------------
string = StringUtil.fromBytes(decoded, cs); // Issue2073: cs!
------------------------------------------------------

History
Date	User	Action	Args
2014-08-28 15:04:34	omatz	set	messageid: <1409238274.84.0.536437601379.issue2073@psf.upfronthosting.co.za>
2014-08-28 15:04:34	omatz	set	recipients: + omatz, zyasoft, santa4nt
2014-08-28 15:04:34	omatz	link	issue2073 messages
2014-08-28 15:04:33	omatz	create