Message4761

Author zyasoft
Recipients dmbaggett, zyasoft
Date 2009-05-30.00:45:57
Message-id <1243644358.93.0.471744870654.issue1335@psf.upfronthosting.co.za>
In-reply-to
Content
This is a fundamental design decision: Jython does not allow isolated 
(unpaired) surrogates, because our unicode strings use the same 
underlying representation as Java, UTF-16. In Jython, unicode is just a 
wrapper around java.lang.String.

Wikipedia succinctly describes the issue here: "All possible code points 
from U+0000 through U+10FFFF, except for the surrogate code points 
U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 
regardless of the code point's current or future character assignment or 
use." (http://en.wikipedia.org/wiki/UTF-16).
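To make the quoted mapping concrete, here is a minimal sketch of the standard UTF-16 encoding arithmetic. The function name to_utf16_units is hypothetical and only for illustration; it is not part of Jython or CPython.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper for illustration; not a Jython or CPython API.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are not characters, so UTF-16 has no
        # encoding for one standing on its own.
        raise ValueError("surrogate code point")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single code unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a high
    # (lead) surrogate and a low (trail) surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

For example, to_utf16_units(0x1D11E) gives [0xD834, 0xDD1E], a valid pair, while to_utf16_units(0xD800) raises ValueError, which is the isolated-surrogate case described above.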

So the workaround is to special-case the range \uD800-\uDFFF on Jython, 
instead of using a regex as in msg4625.
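A rough sketch of what such a special case could look like follows. It assumes Jython can be detected by sys.platform starting with "java", and the helper names (IS_JYTHON, is_surrogate_code, has_surrogate) are hypothetical; the actual code from msg4625 is not reproduced here.

```python
import sys

# Assumption: Jython reports a sys.platform value beginning with "java".
IS_JYTHON = sys.platform.startswith("java")

SURROGATE_MIN = 0xD800
SURROGATE_MAX = 0xDFFF


def is_surrogate_code(value):
    """Return True if the integer value lies in the surrogate range."""
    return SURROGATE_MIN <= value <= SURROGATE_MAX


def has_surrogate(text):
    """Hypothetical helper: test each element with ord() instead of
    building a regex character class that contains lone surrogates."""
    return any(is_surrogate_code(ord(ch)) for ch in text)
```

Callers could branch on IS_JYTHON and use the range check only there, keeping the regex elsewhere. Note that on a build which stores strings as UTF-16 code units, has_surrogate would also flag the halves of a valid pair, so it may need adjusting to the use case.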

Similar considerations apply to other Unicode usage in CPython, 
notably the difference between UCS2 (narrow) and UCS4 (wide) builds.
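As a small illustration of the build difference, the sketch below uses sys.maxunicode, which is 0xFFFF on a narrow CPython build and 0x10FFFF on a wide one. The helper names are hypothetical.

```python
import sys


def is_narrow_build():
    """True when the interpreter stores unicode as 16-bit code units."""
    return sys.maxunicode == 0xFFFF


def elements_for(code_point):
    """Hypothetical helper: how many string elements one code point
    occupies on the running interpreter."""
    if code_point > 0xFFFF and is_narrow_build():
        # Stored as a surrogate pair on a narrow build.
        return 2
    return 1
```

On a narrow build, code that indexes or slices strings can therefore land between the two halves of a pair, which is the same kind of isolated surrogate discussed above.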

A similar problem was seen in Pygments, http://dev.pocoo.org/projects/pygments/ticket/358
History
Date                 User     Action  Args
2009-05-30 00:45:58  zyasoft  set     messageid: <1243644358.93.0.471744870654.issue1335@psf.upfronthosting.co.za>
2009-05-30 00:45:58  zyasoft  set     recipients: + zyasoft, dmbaggett
2009-05-30 00:45:58  zyasoft  link    issue1335 messages
2009-05-30 00:45:57  zyasoft  create