Title: html5lib trunk won't compile due to Jython Unicode pickiness
Type: Severity: normal
Components: Core Versions: 25rc4
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: zyasoft Nosy List: dmbaggett, zyasoft
Priority: Keywords:

Created on 2009-05-01.13:12:39 by dmbaggett, last changed 2009-05-30.00:45:58 by zyasoft.

msg4625 (view) Author: Dave Baggett (dmbaggett) Date: 2009-05-01.13:12:36
This line of code from the html5lib trunk file

invalid_unicode_re =

won't compile under Jython 2.5b3:

Sorry: UnicodeDecodeError: ('unicodeescape',
48, 55, 'illegal Unicode character')

It looks like Jython (via Java) is enforcing valid unicode in the
literal while standard Python is not.
msg4761 (view) Author: Jim Baker (zyasoft) Date: 2009-05-30.00:45:57
This is a fundamental design decision: we do not allow for isolated half 
surrogates in Jython, since we use the same underlying representation as 
Java, UTF-16, for our unicode strings. In Jython, unicode is just a 
wrapper around java.lang.String.

Wikipedia succinctly describes the issue here: "All possible code points 
from U+0000 through U+10FFFF, except for the surrogate code points 
U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 
regardless of the code point's current or future character assignment or 
use."

So the workaround is to special-case the surrogate range \uD800-\uDFFF 
for Jython, instead of using a regex as in msg4625.
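
One possible shape for such a special case, as a minimal sketch rather than html5lib's actual code: check code units numerically, so that no surrogate escape appears in a source literal and the file still compiles on Jython. The function name lone_surrogate_positions is hypothetical.

```python
def lone_surrogate_positions(text):
    """Return the indexes of surrogate code units that are not part of a
    valid high+low pair.

    Hypothetical helper for illustration only. It compares ord() values
    against the surrogate range instead of embedding \\uD800-style escapes
    in a regex literal.
    """
    positions = []
    i = 0
    n = len(text)
    while i < n:
        cp = ord(text[i])
        if 0xD800 <= cp <= 0xDBFF:
            # High surrogate: acceptable only when a low surrogate follows.
            if i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
                i += 2
                continue
            positions.append(i)
        elif 0xDC00 <= cp <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            positions.append(i)
        i += 1
    return positions
```

Code that needs to branch per implementation could test sys.platform.startswith("java"), which is how Jython identifies itself, and use a check like this one only there.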

Similar considerations would apply for other Unicode usage in CPython, 
notably UCS2 vs UCS4.
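
For the UCS2 vs UCS4 point, a small sketch of how code can detect which kind of CPython build it is running on. The function name unicode_build_width is hypothetical; sys.maxunicode is the standard attribute, and it is 0xFFFF on narrow (UCS2) builds.

```python
import sys


def unicode_build_width():
    """Report whether unicode strings are stored narrow (UCS2) or wide."""
    # Narrow builds cap sys.maxunicode at 0xFFFF, so code points above the
    # BMP are stored as surrogate pairs there.
    if sys.maxunicode == 0xFFFF:
        return "narrow"
    return "wide"
```

On a narrow build, a character above U+FFFF occupies two code units, so surrogate-range patterns can match halves of valid characters.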

A similar problem was seen in Pygments,
Date                 User       Action  Args
2009-05-30 00:45:58  zyasoft    set     status: open -> closed
                                        resolution: wont fix
                                        messages: + msg4761
2009-05-29 01:42:19  pjenvey    set     assignee: zyasoft
                                        nosy: + zyasoft
2009-05-01 13:12:39  dmbaggett  create