Issue1335

classification

Title:	html5lib trunk won't compile due to Jython Unicode pickiness
Type:		Severity:	normal
Components:	Core	Versions:	25rc4
		Milestone:

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	zyasoft	Nosy List:	dmbaggett, zyasoft
Priority:		Keywords:

Created on 2009-05-01.13:12:39 by dmbaggett, last changed 2009-05-30.00:45:58 by zyasoft.

Messages
msg4625	Author: Dave Baggett (dmbaggett)	Date: 2009-05-01.13:12:36
This line of code from the html5lib trunk file inputstream.py:

    invalid_unicode_re = re.compile(u"[\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uD800-\uDFFF\uFDD0-\uFDEF\uFFFE\uFFFF\U0001FFFE\U0001FFFF\U0002FFFE\U0002FFFF\U0003FFFE\U0003FFFF\U0004FFFE\U0004FFFF\U0005FFFE\U0005FFFF\U0006FFFE\U0006FFFF\U0007FFFE\U0007FFFF\U0008FFFE\U0008FFFF\U0009FFFE\U0009FFFF\U000AFFFE\U000AFFFF\U000BFFFE\U000BFFFF\U000CFFFE\U000CFFFF\U000DFFFE\U000DFFFF\U000EFFFE\U000EFFFF\U000FFFFE\U000FFFFF\U0010FFFE\U0010FFFF]")

won't compile under Jython 2.5b3:

    Sorry: UnicodeDecodeError: ('unicodeescape', 'u"[\\u0001-\\u0008\\u000B\\u000E-\\u001F\\u007F-\\u009F\\uD800-\\uDFFF\\uFDD0-\\uFDEF\\uFFFE\\uFFFF\\U0001FFFE\\U0001FFFF\\U0002FFFE\\U0002FFFF\\U0003FFFE\\U0003FFFF\\U0004FFFE\\U0004FFFF\\U0005FFFE\\U0005FFFF\\U0006FFFE\\U0006FFFF\\U0007FFFE\\U0007FFFF\\U0008FFFE\\U0008FFFF\\U0009FFFE\\U0009FFFF\\U000AFFFE\\U000AFFFF\\U000BFFFE\\U000BFFFF\\U000CFFFE\\U000CFFFF\\U000DFFFE\\U000DFFFF\\U000EFFFE\\U000EFFFF\\U000FFFFE\\U000FFFFF\\U0010FFFE\\U0010FFFF]"', 48, 55, 'illegal Unicode character')

It looks like Jython (via Java) is enforcing valid unicode in the literal while standard Python is not.
msg4761	Author: Jim Baker (zyasoft)	Date: 2009-05-30.00:45:57
This is a fundamental design decision: we do not allow isolated half surrogates in Jython, since we use the same underlying representation as Java, UTF-16, for our unicode strings. In Jython, unicode is just a wrapper around java.lang.String. Wikipedia succinctly describes the issue here: "All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use." (http://en.wikipedia.org/wiki/UTF-16). So the workaround is to special-case the \uD800-\uDFFF range for Jython, instead of writing it directly into the regex literal as in msg4625. Similar considerations apply to other Unicode usage in CPython, notably UCS2 vs UCS4 builds. A similar problem was seen in Pygments, http://dev.pocoo.org/projects/pygments/ticket/358
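One way the workaround described above might look is sketched here. It is an illustration and not html5lib's actual fix: the names SAFE_RANGES, surrogate_range and build_invalid_unicode_re are hypothetical, the astral noncharacters from msg4625 are left out for brevity, and the exact exception Jython raises for the surrogate escapes at run time is an assumption, so the probe catches broadly. The idea is to keep the surrogate range out of any source literal and add it only when the running implementation accepts it.

```python
import re

# Ranges that are safe to write as a literal on every implementation.
# BMP only: the astral noncharacters from msg4625 are omitted for brevity.
SAFE_RANGES = u"\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uFDD0-\uFDEF\uFFFE\uFFFF"


def surrogate_range():
    """Return the class fragment for U+D800-U+DFFF, or u"" if unsupported."""
    try:
        # eval defers parsing of the escapes to run time, so a failure can
        # be caught here instead of aborting compilation of the module.
        fragment = eval('u"\\uD800-\\uDFFF"')
    except Exception:
        # Assumption: an implementation that rejects lone surrogates
        # (such as Jython) raises some exception at this point.
        return u""
    # Expect exactly three characters: U+D800, "-", U+DFFF.
    return fragment if len(fragment) == 3 else u""


def build_invalid_unicode_re():
    """Compile the character class, with surrogates only where supported."""
    return re.compile(u"[" + SAFE_RANGES + surrogate_range() + u"]")


invalid_unicode_re = build_invalid_unicode_re()
```

Where lone surrogates are rejected, the compiled pattern simply omits that range, which costs nothing on such a platform because strings there cannot contain isolated surrogates in the first place.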

History
Date	User	Action	Args
2009-05-30 00:45:58	zyasoft	set	status: open -> closed resolution: wont fix messages: + msg4761
2009-05-29 01:42:19	pjenvey	set	assignee: zyasoft nosy: + zyasoft
2009-05-01 13:12:39	dmbaggett	create