Issue2002

classification

Title:	Codecs report consumed length in UTF-16 code units
Type:		Severity:	normal
Components:	Core, Library	Versions:	Jython 2.7
		Milestone:

process

Status:	open	Resolution:	remind
Dependencies:		Superseder:
Assigned To:		Nosy List:	fwierzbicki, jeff.allen, zyasoft
Priority:	normal	Keywords:

Created on 2012-12-31.13:26:17 by jeff.allen, last changed 2014-06-19.06:45:12 by zyasoft.

Messages
msg7559	Author: Jeff Allen (jeff.allen)	Date: 2012-12-31.13:26:16
Codec encoders should report the number of elements consumed from their input so that incremental encoding can pick up correctly where it left off. On a wide CPython build these units are genuine unicode characters; on a narrow CPython build they are UTF-16 units (i.e. either code points in the BMP or surrogates).

Jython's codecs do not do this correctly. Jython's unicode type behaves as if it were a CPython wide build, but its codecs behave like a CPython narrow build. This is because they process the java.lang.String representation of the unicode object, which is UTF-16.

>dist\bin\jython
Jython 2.7.0a2+ (, Dec 25 2012, 00:49:21)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_35
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs, sys, unicodedata
>>> u = u"\xe9\u0bf2\u0f84\u1770\u33af\U00010000"
>>> len(u)
6
>>> for c in u : print unicodedata.name(c)
...
LATIN SMALL LETTER E WITH ACUTE
TAMIL NUMBER ONE THOUSAND
TIBETAN MARK HALANTA
TAGBANWA LETTER SA
SQUARE RAD OVER S SQUARED
LINEAR B SYLLABLE B008 A
>>> c8 = codecs.lookup("utf-8")
>>> c8.encode(u)
('\xc3\xa9\xe0\xaf\xb2\xe0\xbe\x84\xe1\x9d\xb0\xe3\x8e\xaf\xf0\x90\x80\x80', 7)
>>> c7 = codecs.lookup("utf-7")
>>> c7.encode(u)
('+AOkL8g+EF3Azr9gA3AA-', 7)
>>> c16 = codecs.lookup("utf-16")
>>> c16.encode(u)
('\xfe\xff\x00\xe9\x0b\xf2\x0f\x84\x17p3\xaf\xd8\x00\xdc\x00', 7)
>>>

In each case the reported length is 7, the number of UTF-16 code units, although len(u) is 6.

This probably requires reworking the built-in codecs to take PyUnicode arguments as code-point sequences in place of the default conversion to java.lang.String. See tangentially related issue #1128.
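To make the 6 versus 7 discrepancy concrete, here is a minimal sketch that counts the same string both ways and converts a length given in UTF-16 code units back to a length in code points. The helper names utf16_length and code_points_consumed are hypothetical, invented for illustration; this is not a proposed patch, and it assumes an interpreter where each element of a unicode string is a whole code point (wide-build behaviour).

```python
# Hypothetical helpers for illustration only; not part of Jython or CPython.
# Assumes each element of a unicode string is a whole code point.

def utf16_length(text):
    """Number of UTF-16 code units needed to represent text."""
    # Code points above U+FFFF need a surrogate pair (two code units).
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def code_points_consumed(text, units):
    """Translate a consumed length in UTF-16 code units to code points."""
    seen = 0
    for count, ch in enumerate(text):
        if seen == units:
            return count
        seen += 2 if ord(ch) > 0xFFFF else 1
        if seen > units:
            raise ValueError("length splits a surrogate pair")
    if seen == units:
        return len(text)
    raise ValueError("length exceeds the input")


u = u"\xe9\u0bf2\u0f84\u1770\u33af\U00010000"
print(len(u))                       # 6 code points
print(utf16_length(u))              # 7 UTF-16 code units
print(code_points_consumed(u, 7))   # 6
```

A caller that resumes at u[7:] after being told 7 elements were consumed would slice by code points, which is where the mismatch could matter.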
msg8732	Author: Jim Baker (zyasoft)	Date: 2014-06-19.06:45:12
Looks like a pretty serious bug, which is only mitigated by the fact that we probably do the right thing in incremental/stream codecs.
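One way to probe both paths is sketched below, using only the standard codecs module. The function names are hypothetical. The first reads the consumed length from the stateless encoder; the second feeds an incremental encoder one element at a time and compares the bytes with the stateless result. Matching bytes only suggest the incremental path is sound; the incremental interface does not itself return a consumed length.

```python
import codecs

u = u"\xe9\u0bf2\u0f84\u1770\u33af\U00010000"


def stateless_consumed(text, encoding):
    """Consumed length reported by the stateless encoder."""
    _encoded, consumed = codecs.lookup(encoding).encode(text)
    return consumed


def incremental_matches_stateless(text, encoding):
    """Encode element by element and compare with one-shot encoding."""
    stateless, _consumed = codecs.lookup(encoding).encode(text)
    encoder = codecs.getincrementalencoder(encoding)()
    pieces = [encoder.encode(ch) for ch in text]
    pieces.append(encoder.encode(u"", final=True))  # flush any pending state
    return b"".join(pieces) == stateless


# Expect len(u) where lengths are counted in code points; the report
# above shows 7 on the affected Jython build.
print(stateless_consumed(u, "utf-8"))
print(incremental_matches_stateless(u, "utf-8"))
```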

History
Date	User	Action	Args
2014-06-19 06:45:12	zyasoft	set	nosy: + zyasoft messages: + msg8732
2013-02-20 00:21:43	fwierzbicki	set	priority: normal nosy: + fwierzbicki resolution: remind versions: + Jython 2.7, - 2.7a2
2012-12-31 13:26:17	jeff.allen	create