Message9965

Author gsnedders
Recipients gsnedders, jeff.allen, zyasoft
Date 2015-04-25.23:36:25
Message-id <1430004986.38.0.942507280386.issue2340@psf.upfronthosting.co.za>
In-reply-to
Content
I have mixed feelings about this: it's nice to be able to do something like `eval(r'u"\uD800"')` to see whether the Python you're running on supports lone surrogates (rather than having to hardcode a list of implementations that do and don't). It also seems a bit evil, given that in a fair few cases it'll suddenly fail much later on when dealing with the string (when toString is called, which isn't obvious from the Python level). I'm not sure TypeError makes sense there, though I'm not sure what does.

To make a counter-proposal: allow them to parse, but have doing *anything* with them throw the error. This allows for something like:

```
try:
    eval(r'u"\uD800"[0]')
except (SyntaxError, TypeError):
    supportsSurrogates = False
else:
    supportsSurrogates = True

if supportsSurrogates:
    x = u"[\uD800-\uDFFF]"
else:
    x = u""
```

That said, that's quite a lot more work to do. Meh. Not sure what to suggest.

Also, just to point out: in principle, "\UFFFFFFFF" should still fail. Nothing else actually uses 32-bit code units, despite what the documentation suggests.
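
To illustrate that point, here is a minimal sketch, assuming CPython 3.3 or later rather than Jython: an escape beyond U+10FFFF is rejected when the literal is compiled, and `chr` refuses the same range at runtime. The helper name `parses` is made up for this example.

```python
def parses(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<escape-test>", "eval")
    except SyntaxError:
        return False
    return True

# U+10FFFF is the last code point, so this literal is accepted.
in_range = parses(r'u"\U0010FFFF"')

# 0xFFFFFFFF is far beyond U+10FFFF, so this literal is rejected.
out_of_range = parses(r'u"\UFFFFFFFF"')

# chr() enforces the same upper bound at runtime.
try:
    chr(0x110000)
    chr_accepts = True
except ValueError:
    chr_accepts = False
```

Whether Jython should report this as a SyntaxError in the same way is a separate question; the sketch only shows what CPython does.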

Speaking of PEP 393 and Python 3.3, note that the definition of the Unicode type changed to "a sequence of values that represent Unicode code points", which implicitly allows surrogates (both unpaired and paired), so the hypothetical Python 3 release of Jython will eventually have to deal with them properly somehow.
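
For reference, a short sketch of what that definition means in practice, assuming CPython 3.3 or later: a lone surrogate is an ordinary one-element `str`, and it is only strict UTF-8 encoding that objects to it (the `surrogatepass` error handler lets it through).

```python
lone = "\ud800"  # unpaired high surrogate

# The string holds one code point, with the surrogate's own value.
length = len(lone)
code_point = ord(lone)

# Strict UTF-8 encoding refuses surrogates...
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the "surrogatepass" error handler encodes them anyway.
raw = lone.encode("utf-8", "surrogatepass")

# A high/low pair stays as two code points; it is not merged into U+10000.
pair_length = len("\ud800\udc00")
```

So on that model the failure only shows up at an encoding boundary, not when the string is created or indexed.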