Issue1583

classification

Title:	xml.dom.Node.data returns bytestrings of decoded unicode
Type:	behaviour	Severity:	normal
Components:	Library	Versions:	2.5.1
		Milestone:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	amak	Nosy List:	amak, morganwahl, pjenvey
Priority:		Keywords:	patch

Created on 2010-03-26.16:04:27 by morganwahl, last changed 2010-04-02.01:56:26 by pjenvey.

Files
File name	Uploaded	Description
unicode.xml	morganwahl, 2010-03-26.17:27:21	input file
unicode.py	morganwahl, 2010-03-26.17:27:33
bug1583.patch	amak, 2010-03-31.15:21:00

Messages
msg5594	Author: morgan wahl (morganwahl)	Date: 2010-03-26.16:10:02
I'm not sure where to write the bug description, but here goes: I'm parsing an XML file that is declared as UTF-8. When I call Node.data on a text node that contains the character U+00C5 (capital A with ring above), it returns a byte string whose repr is '\xc5'. 0xc5 is the ISO-8859-15 (and cp1252) encoding of U+00C5, but of course U+00C5 is undefined in ASCII. Thus, I get an error when joining the byte string returned by Node.data with a unicode string, since my default encoding is ascii. I'm using Jython 2.5.1.
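The byte values in this report can be checked with a short sketch. It is written for CPython 3 (an assumption; the report concerns Jython 2.5.1), where bytes and text are separate types, so the failing step is an explicit decode rather than the implicit one Python 2 performs when joining a byte string with a unicode string. The variable names are illustrative only.

```python
# Runs on CPython 3; the byte values match the ones described in the report.
char = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

latin9_bytes = char.encode("iso-8859-15")
cp1252_bytes = char.encode("cp1252")
utf8_bytes = char.encode("utf-8")

# The single-byte encodings agree with each other; UTF-8 needs two bytes.
print(latin9_bytes, cp1252_bytes, utf8_bytes)

# Decoding the single byte as ASCII fails, which is what an 'ascii'
# default encoding does implicitly on Python 2 when str meets unicode.
try:
    latin9_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
    print(ascii_error)
```

Note that a correctly handled UTF-8 document would never yield the lone byte 0xc5 for this character, which suggests the text was decoded properly at some point and then narrowed to a byte string afterwards.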
msg5595	Author: morgan wahl (morganwahl)	Date: 2010-03-26.16:42:59
I'm not sure if this is relevant, but I'm using xml.dom.pulldom to parse.
msg5596	Author: morgan wahl (morganwahl)	Date: 2010-03-26.17:27:21
I ran across this bug when using Django. Here are some minimal test-case files I've extracted from the Django code.
msg5597	Author: morgan wahl (morganwahl)	Date: 2010-03-26.17:29:41
Just confirmed that CPython 2.5.4 runs the test script fine.
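For readers without the attachments, here is a minimal sketch of the kind of check described above. It is not the attached unicode.py or unicode.xml, which are not reproduced here: the document content and the helper name `collect_text` are hypothetical, and the code targets CPython 3 rather than the Jython 2.5.1 / CPython 2.5.4 interpreters in the report.

```python
import io
from xml.dom import pulldom

# Hypothetical input standing in for the attached unicode.xml.
XML_BYTES = (
    '<?xml version="1.0" encoding="utf-8"?><place>\u00c5land</place>'
).encode("utf-8")


def collect_text(xml_bytes):
    """Join the data of every text node that pulldom reports."""
    chunks = []
    for event, node in pulldom.parse(io.BytesIO(xml_bytes)):
        if event == pulldom.CHARACTERS:
            chunks.append(node.data)
    return "".join(chunks)


text = collect_text(XML_BYTES)
# On a correct implementation Node.data is text, so joining it with
# another text string does not trigger any decode step.
joined = "Name: " + text
```

On an affected interpreter the equivalent of the last line is where the decode error would surface.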
msg5609	Author: Alan Kennedy (amak)	Date: 2010-03-31.15:17:48
OK, I found out where this is happening, if not why.

Several things have changed since the old xml.dom and xml.sax code was written. The most important is a change in the way that jython handles unicode.

The line at fault is line 188 of xml.sax.drivers2.drv_javasax, in the sax "characters" method. The method used to look like this:

    def characters(self, char, start, len):
        self._cont_handler.characters(str(String(char, start, len)))

This could never have been right, since it was returning a 'str', not unicode. The java.lang.String contains the correct data in this case, i.e. u"\xc5land". Casting it to "str" made jython think it was an ascii string, meaning that you would receive the decode error you saw.

As far as I can see, the correct conversion should be using 'unicode', i.e. the method should look like this:

    def characters(self, char, start, len):
        self._cont_handler.characters(unicode(String(char, start, len)))

But for some unknown reason, when I try to run this, it gives

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)

even though it should be trying to construct an org.python.core.PyUnicode from a java.lang.String which contains unicode. So I'm at a loss to explain why the UnicodeDecodeError is happening in this case: someone more familiar with the recent changes in unicode handling needs to take a look at this.

In the meantime, there is a quick fix you can make, which is to define the method like this:

    def characters(self, char, start, len):
        self._cont_handler.characters(String(char, start, len).getBytes('utf-8').tostring().decode('utf-8'))

It's kind of hacky, but it works, and should get you back up and running. I've attached a patch to the bug report. Thanks for reporting it.
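The quick fix works by taking the text through explicit UTF-8 bytes and decoding them again, which avoids any implicit ascii step. The following is only a CPython 3 analogue of that idea, not the Jython code from the patch: a list of characters stands in for the Java char array, and `chars_to_text` is a hypothetical helper name.

```python
def chars_to_text(char_buffer, start, length):
    """Mimic the SAX characters(char, start, len) slice, then round-trip it."""
    # Equivalent of String(char, start, len): take `length` characters
    # beginning at `start` out of the parser's buffer.
    text = "".join(char_buffer[start:start + length])
    # Equivalent of getBytes('utf-8') followed by decode('utf-8'):
    # the encoding is named on both sides, so nothing is lost.
    return text.encode("utf-8").decode("utf-8")


# Hypothetical buffer: the parser hands over more than the text node itself.
buffer = list("<p>\u00c5land</p>")
result = chars_to_text(buffer, 3, 5)
```

Because both directions name the same encoding, the round trip returns the original text unchanged; the cost is an extra encode and decode per callback, which is why it was described as a stopgap.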
msg5610	Author: Alan Kennedy (amak)	Date: 2010-03-31.15:21:00
Adding a patch which provides a simple solution to the problem.
msg5612	Author: morgan wahl (morganwahl)	Date: 2010-03-31.22:00:34
Thanks! I have no idea what the internals of Jython are, but my expectation would be that every time a string gets passed from Java, it gets put through decode('whatever-encoding-java-uses') to produce a unicode string (which is the Python type most similar to Java strings).
msg5613	Author: Alan Kennedy (amak)	Date: 2010-04-01.12:16:40
Fixes and tests checked in at r6994 and r6995.
msg5616	Author: Philip Jenvey (pjenvey)	Date: 2010-04-02.01:56:26
I fixed the underlying issue with the unicode() solution (that was #1563) and applied that change to this fix in r6997.

History
Date	User	Action	Args
2010-04-02 01:56:26	pjenvey	set	nosy: + pjenvey messages: + msg5616
2010-04-01 12:16:42	amak	set	status: open -> closed resolution: fixed messages: + msg5613
2010-03-31 22:00:35	morganwahl	set	messages: + msg5612
2010-03-31 15:21:01	amak	set	files: + bug1583.patch keywords: + patch messages: + msg5610
2010-03-31 15:17:50	amak	set	assignee: amak messages: + msg5609 nosy: + amak
2010-03-26 17:29:41	morganwahl	set	messages: + msg5597
2010-03-26 17:27:33	morganwahl	set	files: + unicode.py
2010-03-26 17:27:21	morganwahl	set	files: + unicode.xml messages: + msg5596
2010-03-26 16:42:59	morganwahl	set	messages: + msg5595
2010-03-26 16:10:03	morganwahl	set	messages: + msg5594
2010-03-26 16:04:27	morganwahl	create