Message8033

Author amak
Recipients Arfrever, amak, fwierzbicki, jeff.allen, serhiy.storchaka
Date 2013-05-28.02:09:16
Message-id <1369706957.11.0.226598244735.issue2048@psf.upfronthosting.co.za>
In-reply-to
Content
We have already discussed this in detail in other bug reports:

Jython doesn't allow to use unmapped unicode codepoint
http://bugs.jython.org/issue1707

Invalid Unicode characters cause compile-time error (CPython divergence)
http://bugs.jython.org/issue1836

Unpaired surrogates are a *deserialization* issue. Unpaired 16-bit surrogates that appear in a stream of 16-bit words (e.g. encoded Python source) are *invalid*, unless you are decoding according to an encoding that accepts 16-bit values in the range 0xD800-0xDFFF as valid characters. CPython does this (because it uses UCS-2); Java does not, because it uses UTF-16. Note also that even in UCS-2, these characters have no meaning[1].
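To make "unpaired" concrete, here is a minimal sketch of the pairing rule applied to a stream of 16-bit words. The function name find_unpaired_surrogates is hypothetical and is not taken from Jython or CPython; it only illustrates the UTF-16 rule that a lead surrogate (0xD800-0xDBFF) must be followed immediately by a trail surrogate (0xDC00-0xDFFF).

```python
def find_unpaired_surrogates(words):
    """Return the indices of 16-bit words that are surrogates lacking a valid partner.

    Hypothetical helper for illustration only; `words` is a sequence of ints
    in the range 0x0000-0xFFFF.
    """
    bad = []
    i = 0
    while i < len(words):
        w = words[i]
        if 0xD800 <= w <= 0xDBFF:
            # Lead surrogate: valid only if a trail surrogate follows directly.
            if i + 1 < len(words) and 0xDC00 <= words[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= w <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            bad.append(i)
        i += 1
    return bad


# A well-formed pair (U+1F600) between two ASCII letters: nothing to report.
print(find_unpaired_surrogates([0x0041, 0xD83D, 0xDE00, 0x0042]))
# A lone lead surrogate between two ASCII letters: index 1 is reported.
print(find_unpaired_surrogates([0x0041, 0xD800, 0x0042]))
```

A strict UTF-16 decoder would treat any index reported here as an error; a decoder that accepts every 16-bit value as a character would not.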

"""
Code points U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that all UTF forms, including UTF-16, cannot encode these code points.
"""

http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2BD800_to_U.2BDFFF

I argue that the statement "The official Unicode standard says that all UTF forms, including UTF-16, cannot encode these code points" also asserts that UTF-16 cannot DECODE these code points, when present in byte-BE/byte-LE/word serializations.
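The encode/decode symmetry can be checked with a short sketch. This assumes a recent CPython 3 (3.4 or later), whose UTF-16 codec rejects lone surrogates in both directions; it is not the behaviour of the CPython 2.x builds discussed in this issue. The helper name utf16_rejects is hypothetical.

```python
def utf16_rejects(word):
    """Return (encode_rejected, decode_rejected) for one 16-bit value.

    Hypothetical helper; assumes CPython 3.4+ codec behaviour.
    """
    try:
        # Try to encode the code point as UTF-16 little-endian.
        chr(word).encode("utf-16-le")
        encode_rejected = False
    except UnicodeEncodeError:
        encode_rejected = True

    try:
        # Try to decode the same value from its two-byte LE serialization.
        word.to_bytes(2, "little").decode("utf-16-le")
        decode_rejected = False
    except UnicodeDecodeError:
        decode_rejected = True

    return encode_rejected, decode_rejected


print(utf16_rejects(0x0041))  # ordinary character 'A'
print(utf16_rejects(0xD800))  # lone lead surrogate
print(utf16_rejects(0xDC00))  # lone trail surrogate
```

Under that assumption, the two surrogate values are refused by both the encoder and the decoder, while the ordinary character passes both.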

Unless you come up with a very good end-case reason why we should break standard Unicode deserialization, other than passing broken CPython-specific UCS-2 character decoding unit tests, I'm closing this bug as "won't fix".

[1] http://www.azillionmonkeys.com/qed/unicode.html
History
Date User Action Args
2013-05-28 02:09:17 amak set messageid: <1369706957.11.0.226598244735.issue2048@psf.upfronthosting.co.za>
2013-05-28 02:09:17 amak set recipients: + amak, fwierzbicki, jeff.allen, Arfrever, serhiy.storchaka
2013-05-28 02:09:16 amak link issue2048 messages
2013-05-28 02:09:16 amak create