Message7559

Author jeff.allen
Recipients jeff.allen
Date 2012-12-31.13:26:16
Message-id <1356960377.48.0.581413613371.issue2002@psf.upfronthosting.co.za>
In-reply-to
Content
Codec encoders should report the number of elements consumed from their input so that incremental encoding can pick up correctly where it left off. On a wide CPython build these elements are genuine Unicode characters (code points); on a narrow CPython build they are UTF-16 code units (i.e. either BMP code points or surrogates).
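For comparison, here is a minimal sketch of the expected behaviour. It assumes CPython 3.3 or later (not Jython), where a str is always a sequence of code points; the helper name encode_with_count is only illustrative.

```python
import codecs

def encode_with_count(text, encoding="utf-8"):
    """Encode text and return (encoded bytes, input elements consumed)."""
    info = codecs.lookup(encoding)
    # A stateless codec encoder returns a 2-tuple: (output, length consumed).
    data, consumed = info.encode(text)
    return data, consumed

# The same six-character string as in the session in this report.
u = "\xe9\u0bf2\u0f84\u1770\u33af\U00010000"
data, consumed = encode_with_count(u)

# The consumed count should agree with len(u), so a caller can resume
# at u[consumed:] without skipping or repeating anything.
print(len(u), consumed)
```

If the count disagrees with the length as seen by the caller, any slicing of the input based on that count lands in the wrong place.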

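Jython's codecs do not do this correctly. Jython's unicode type behaves as if it were a CPython wide build, but its codecs behave like a CPython narrow build. This is because they process the java.lang.String representation of the unicode object, which is UTF-16. In the session below, len(u) is 6, yet each encoder reports 7 elements consumed, because the final character lies outside the BMP and occupies two UTF-16 units.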
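The two ways of counting can be compared directly. The following is a rough sketch in CPython 3 (an assumption; it does not touch Jython's internals), with utf16_units as a hypothetical helper that counts UTF-16 code units the way a java.lang.String would hold them.

```python
def utf16_units(text):
    """Count the UTF-16 code units needed to represent text."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" writes no BOM.
    return len(text.encode("utf-16-le")) // 2

u = "\xe9\u0bf2\u0f84\u1770\u33af\U00010000"

code_points = len(u)    # what a wide-build unicode object reports
units = utf16_units(u)  # what a UTF-16 based codec would count

# The two differ by one because U+10000 needs a surrogate pair.
print(code_points, units)
```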

>dist\bin\jython
Jython 2.7.0a2+ (, Dec 25 2012, 00:49:21)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_35
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs, sys, unicodedata
>>> u = u"\xe9\u0bf2\u0f84\u1770\u33af\U00010000"
>>> len(u)
6
>>> for c in u : print unicodedata.name(c)
...
LATIN SMALL LETTER E WITH ACUTE
TAMIL NUMBER ONE THOUSAND
TIBETAN MARK HALANTA
TAGBANWA LETTER SA
SQUARE RAD OVER S SQUARED
LINEAR B SYLLABLE B008 A
>>> c8 = codecs.lookup("utf-8")
>>> c8.encode(u)
('\xc3\xa9\xe0\xaf\xb2\xe0\xbe\x84\xe1\x9d\xb0\xe3\x8e\xaf\xf0\x90\x80\x80', 7)
>>> c7 = codecs.lookup("utf-7")
>>> c7.encode(u)
('+AOkL8g+EF3Azr9gA3AA-', 7)
>>> c16 = codecs.lookup("utf-16")
>>> c16.encode(u)
('\xfe\xff\x00\xe9\x0b\xf2\x0f\x84\x17p3\xaf\xd8\x00\xdc\x00', 7)
>>>

This probably requires reworking the built-in codecs to take PyUnicode arguments as code-point sequences instead of relying on the default conversion to java.lang.String. See the tangentially related issue #1128.
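To make the required correction concrete, here is a sketch of how a count in UTF-16 units maps back to a count in code points. It is written in CPython 3 purely as an illustration: units_to_code_points is a hypothetical helper, not part of any codec API, and it is not the Java-side rework proposed above.

```python
def units_to_code_points(text, units):
    """Translate a count of UTF-16 code units into a count of code points.

    Raises ValueError if the count falls inside a surrogate pair or
    runs past the end of text.
    """
    seen = 0  # UTF-16 units accounted for so far
    for index, ch in enumerate(text):
        if seen == units:
            return index
        # Characters beyond the BMP occupy two UTF-16 units.
        seen += 2 if ord(ch) > 0xFFFF else 1
        if seen > units:
            raise ValueError("count splits a surrogate pair")
    if seen == units:
        return len(text)
    raise ValueError("count exceeds the length of the text")

u = "\xe9\u0bf2\u0f84\u1770\u33af\U00010000"
# The encoders in the session reported 7 (UTF-16 units) for this string.
print(units_to_code_points(u, 7))
```

A codec that iterates over code points in the first place would not need this translation, which is the point of the proposed rework.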