Issue2048

classification

Title:	Allow lone surrogates
Type:	behaviour	Severity:	normal
Components:	Core	Versions:	Jython 2.7
		Milestone:

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	amak	Nosy List:	Arfrever, amak, fwierzbicki, jeff.allen, serhiy.storchaka, zyasoft
Priority:		Keywords:	patch

Created on 2013-05-12.19:52:44 by serhiy.storchaka, last changed 2014-05-04.20:08:08 by zyasoft.

Files
File name	Uploaded	Description
unicode_surrogates.patch	serhiy.storchaka, 2013-05-12.19:52:43

Messages
msg8011	Author: Serhiy Storchaka (serhiy.storchaka)	Date: 2013-05-12.19:52:43
Jython doesn't support lone surrogates, even in string literals. This makes it incompatible with part of the CPython test suite and with the tests of some third-party projects (e.g. simplejson). Here is a patch which allows Jython to work with lone surrogates.
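For readers unfamiliar with the term: a lone surrogate is a 16-bit code unit in the range 0xD800-0xDFFF that is not part of a lead/trail pair. The following is a minimal Python sketch of that definition only. It is not taken from the attached patch, and the helper name `find_lone_surrogates` is hypothetical.

```python
def find_lone_surrogates(units):
    """Return the indexes of unpaired surrogates in a sequence of
    16-bit code units (plain integers). Hypothetical helper for
    illustration; not part of Jython or of the attached patch."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # Lead (high) surrogate: valid only if a trail surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Trail (low) surrogate with no lead surrogate before it.
            lone.append(i)
        i += 1
    return lone
```

With this sketch, `[0x41, 0xD800, 0xDC00, 0x42]` contains no lone surrogates, while `[0xD800, 0x41]` has one at index 0.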
msg8033	Author: Alan Kennedy (amak)	Date: 2013-05-28.02:09:16
We have already discussed this in detail in other bug reports:

Jython doesn't allow to use unmapped unicode codepoint
http://bugs.jython.org/issue1707

Invalid Unicode characters cause compile-time error (CPython divergence)
http://bugs.jython.org/issue1836

Unpaired surrogates are a deserialization issue. Unpaired 16-bit surrogates that appear in a stream of 16-bit words (e.g. encoded Python source) are invalid, unless you are decoding according to an encoding which accepts 16-bit values in the range 0xD800-0xDFFF as valid characters, which CPython does (because it uses UCS-2), but which Java does not, because it uses UTF-16.

Note also that even in UCS-2, these characters have no meaning [1].

"""
Code points U+D800 to U+DFFF
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that all UTF forms, including UTF-16, cannot encode these code points.
"""
http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2BD800_to_U.2BDFFF

I argue that the statement "The official Unicode standard says that all UTF forms, including UTF-16, cannot encode these code points" also asserts that UTF-16 cannot DECODE these code points when they are present in byte-BE/byte-LE/word serializations.

Unless you come up with a very good end-case reason why we should break standard Unicode deserialization, other than passing broken CPython-specific UCS-2 character decoding unit tests, I'm closing this bug as "won't fix".

[1] http://www.azillionmonkeys.com/qed/unicode.html
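The strictness argued for above can be observed with the standard codecs of a recent CPython 3. This is only an illustrative sketch under stated assumptions: it assumes CPython 3.4 or later, it does not run on Jython 2.7, and it does not reproduce the CPython 2 behaviour that the test suites in question rely on. The helper names `can_encode` and `can_decode` are hypothetical.

```python
LONE_LEAD = "\ud800"          # lead surrogate with no trail surrogate
VALID_PAIR = "\U00010000"     # first code point that needs a surrogate pair


def can_encode(text, encoding):
    """Return True if a strict encode of text succeeds (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def can_decode(data, encoding):
    """Return True if a strict decode of data succeeds (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Strict UTF-16 refuses to serialize or deserialize an unpaired surrogate.
print(can_encode(LONE_LEAD, "utf-16-le"))
print(can_decode(b"\x00\xd8A\x00", "utf-16-le"))

# A well-formed pair round-trips: U+10000 is D800 DC00 in UTF-16.
print(VALID_PAIR.encode("utf-16-le"))

# CPython offers an explicit opt-out via the "surrogatepass" error handler.
print(LONE_LEAD.encode("utf-16-le", "surrogatepass"))
```

The `surrogatepass` handler shows one possible design: keep the codec strict by default and require callers to ask for lone surrogates explicitly.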
msg8325	Author: Jim Baker (zyasoft)	Date: 2014-05-04.20:08:08
Agreed. This has been discussed before with the core Python developers: it's OK for UTF-16 to be an alternative internal encoding for Python.

History
Date	User	Action	Args
2014-05-04 20:08:08	zyasoft	set	status: open -> closed resolution: wont fix messages: + msg8325 nosy: + zyasoft
2013-05-28 02:09:17	amak	set	assignee: amak messages: + msg8033 nosy: + fwierzbicki, amak, jeff.allen
2013-05-20 09:46:28	Arfrever	set	nosy: + Arfrever
2013-05-12 19:52:44	serhiy.storchaka	create