Message7988

Author	gsnedders
Recipients	amak, fwierzbicki, gsnedders, jeff.allen
Date	2013-04-07.15:06:02
Message-id	<1365347162.68.0.0505202254274.issue1836@psf.upfronthosting.co.za>
In-reply-to

Content
Python 2 doesn't define the unicode type as a UCS-2 or UTF-32 string; it defines it as a sequence of code units: "The items of a Unicode object are Unicode code units. A Unicode code unit is represented by a Unicode object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time)."

As such, the validity constraints of UCS-2 and UTF-32 (and UTF-16) do not apply here: the type is none of these, but rather an abstract sequence of code units. The definition places no constraints on which Unicode ordinals (which I take to mean codepoints) are valid.

The Python 3 definition, for what it's worth, is clearer about what is allowed: "A string is a sequence of values that represent Unicode codepoints. All the codepoints in range U+0000 - U+10FFFF can be represented in a string." This makes it clear that lone surrogates are valid.
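
To illustrate the point about lone surrogates, here is a minimal sketch for CPython 3 (3.3 or later is assumed, since that is where sys.maxunicode became fixed at 0x10FFFF). The helper names describe and can_encode_utf8 are made up for this example and are not part of any library. It shows that a str can hold U+D800 as a single item, even though the strict UTF-8 codec refuses to encode it.

```python
import sys

# A lone (unpaired) high surrogate, U+D800, stored directly in a str.
lone = "\ud800"


def describe(s):
    """Return (length, list of code point values) for a str."""
    return len(s), [ord(ch) for ch in s]


def can_encode_utf8(s):
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The string itself is a perfectly valid str object of one item.
print(describe(lone))

# Encoding is where the UTF-8 validity rules kick in, not the str type.
print(can_encode_utf8(lone))

# Upper bound on the ordinals a str can hold in this interpreter.
print(hex(sys.maxunicode))
```

The distinction this is meant to show is that validity rules belong to the encodings, not to the string type itself. Passing errors="surrogatepass" to encode is the standard way to serialise such a string anyway.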
