Issue1583

classification
Title: xml.dom.Node.data returns bytestrings of decoded unicode
Type: behaviour Severity: normal
Components: Library Versions: 2.5.1
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: amak Nosy List: amak, morganwahl, pjenvey
Priority: Keywords: patch

Created on 2010-03-26.16:04:27 by morganwahl, last changed 2010-04-02.01:56:26 by pjenvey.

Files
File name Uploaded Description
unicode.xml morganwahl, 2010-03-26.17:27:21 input file
unicode.py morganwahl, 2010-03-26.17:27:33
bug1583.patch amak, 2010-03-31.15:21:00
Messages
msg5594 Author: morgan wahl (morganwahl) Date: 2010-03-26.16:10:02
I'm not sure where to write the bug description, but here goes:

I'm parsing an XML file that declares UTF-8 encoding. When I call Node.data on a text node that contains the character U+00C5 (capital A with ring above), it returns a byte string whose repr is '\xc5'. 0xc5 is the ISO-8859-15 (and cp1252) encoding of U+00C5, but of course U+00C5 is undefined in ASCII. Thus, I get an error when joining the byte string returned by Node.data with a unicode string, since my default encoding is ASCII.
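
The byte values described above can be checked with a short sketch. It is written for a current CPython rather than Jython 2.5.1, and the explicit .decode("ascii") call stands in for the implicit conversion that Python 2 performs when a str and a unicode object are joined. The helper name ascii_decode_error is made up for illustration.

```python
A_RING = u"\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

utf8_bytes = A_RING.encode("utf-8")          # what a UTF-8 file actually stores
iso_bytes = A_RING.encode("iso-8859-15")     # single-byte legacy encoding
cp1252_bytes = A_RING.encode("cp1252")       # single-byte Windows encoding


def ascii_decode_error(data):
    """Return the message raised when decoding data as ASCII, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decoding the single byte 0xc5 as ASCII fails, which mirrors the reported error.
message = ascii_decode_error(iso_bytes)
print(repr(utf8_bytes), repr(iso_bytes), repr(cp1252_bytes))
print(message)
```

The point of the comparison is that the byte 0xc5 on its own is not what the UTF-8 input contains, so the byte string must have been produced by a conversion step after parsing.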

I'm using Jython 2.5.1.
msg5595 Author: morgan wahl (morganwahl) Date: 2010-03-26.16:42:59
I'm not sure if this is relevant, but I'm using xml.dom.pulldom to parse.
msg5596 Author: morgan wahl (morganwahl) Date: 2010-03-26.17:27:21
I ran across this bug when using Django. Here are some minimal test-case files I've extracted from the Django code.
msg5597 Author: morgan wahl (morganwahl) Date: 2010-03-26.17:29:41
Just confirmed that CPython 2.5.4 runs the test script fine.
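
The attached unicode.xml and unicode.py are not reproduced on this page. A minimal stand-in for that kind of test case, assuming a one-element document and written for a current CPython, might look like the following. The XML content, the collect_text helper, and the "Name: " label are all hypothetical.

```python
import io
from xml.dom import pulldom

# Hypothetical stand-in for the attached input file: UTF-8 declared and UTF-8 encoded.
XML_BYTES = u'<?xml version="1.0" encoding="utf-8"?><place>\u00c5land</place>'.encode("utf-8")


def collect_text(xml_bytes):
    """Return the data of every text node reported by pulldom, in document order."""
    chunks = []
    for event, node in pulldom.parse(io.BytesIO(xml_bytes)):
        if event == pulldom.CHARACTERS:
            chunks.append(node.data)
    return chunks


chunks = collect_text(XML_BYTES)
# Joining with a text string is the step that failed in the report.
label = u"Name: " + u"".join(chunks)
print(len(chunks), len(label))
```

On an implementation without the bug, every chunk is a text (unicode) object and the join succeeds.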
msg5609 Author: Alan Kennedy (amak) Date: 2010-03-31.15:17:48
OK, I found out where this is happening, if not why.

Several things have changed since the old xml.dom and xml.sax code was written. The most important is a change in the way that Jython handles unicode.

The line at fault is line 188 of xml.sax.drivers2.drv_javasax, in the SAX "characters" method. The method used to look like this:

def characters(self, char, start, len):
    self._cont_handler.characters(str(String(char, start, len)))

That could never have been right, since it passed a 'str', not a unicode object, to the content handler.

The java.lang.String contains the correct data in this case, i.e. u"\xc5land". Casting it to "str" made Jython treat it as an ASCII byte string, which is why you received the decode error you saw.

As far as I can see, the correct conversion should use 'unicode', i.e. the method should look like this:

def characters(self, char, start, len):
    self._cont_handler.characters(unicode(String(char, start, len)))

But for some unknown reason, when I try to run this, it gives:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)

Even though it should be trying to construct an org.python.core.PyUnicode from a java.lang.String which contains unicode. So I'm at a loss to explain why the UnicodeDecodeError is happening in this case: someone more familiar with the recent changes in unicode handling needs to take a look at this.
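
Whatever the driver does internally, the contract being discussed is that the content handler's characters() method receives text rather than bytes. A check of that contract can be sketched as below. This uses CPython's default SAX parser, not the drv_javasax driver, and the RecordingHandler class and sample document are assumptions made for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class RecordingHandler(ContentHandler):
    """Collects every chunk passed to characters() so its type can be inspected."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


# Hypothetical sample input: UTF-8 declared and UTF-8 encoded.
XML_BYTES = u'<?xml version="1.0" encoding="utf-8"?><place>\u00c5land</place>'.encode("utf-8")

handler = RecordingHandler()
xml.sax.parseString(XML_BYTES, handler)
text = u"".join(handler.chunks)
print(len(handler.chunks), len(text))
```

A parser may deliver the text in one call or several, so the check joins the chunks before comparing.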

In the meantime, there is a quick fix you can make, which is to define the method like this:

def characters(self, char, start, len):
    self._cont_handler.characters(String(char, start, len).getBytes('utf-8').tostring().decode('utf-8'))

It's kind of hacky, but it works, and should get you back up and running. I've attached a patch to the bug report.
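
In plain Python terms, the workaround amounts to encoding the text to UTF-8 bytes and immediately decoding those bytes again. The sketch below is only an analogy: the Jython-specific calls (getBytes, tostring) are replaced by str.encode, and roundtrip_via_utf8 is a made-up name, not part of the patch.

```python
def roundtrip_via_utf8(text):
    """Encode text to UTF-8 bytes, then decode those bytes back to text."""
    raw = text.encode("utf-8")   # plays the role of String.getBytes('utf-8')
    return raw.decode("utf-8")   # plays the role of .tostring().decode('utf-8')


sample = u"\u00c5land"
result = roundtrip_via_utf8(sample)
print(result == sample)
```

Because UTF-8 can represent every Unicode code point, the round trip preserves the text. The extra step only forces the result to be a text object.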

Thanks for reporting it.
msg5610 Author: Alan Kennedy (amak) Date: 2010-03-31.15:21:00
Adding a patch which provides a simple solution to the problem.
msg5612 Author: morgan wahl (morganwahl) Date: 2010-03-31.22:00:34
thanks!

I have no idea what the internals of Jython are, but my expectation would be that every time a string gets passed from Java it gets put through decode('whatever-encoding-java-uses') to produce a unicode string (which is the Python type most similar to Java strings).
msg5613 Author: Alan Kennedy (amak) Date: 2010-04-01.12:16:40
Fixes and tests checked in at r6994 and r6995.
msg5616 Author: Philip Jenvey (pjenvey) Date: 2010-04-02.01:56:26
I fixed the underlying issue with the unicode() solution (that was #1563) and applied that change to this fix in r6997.
History
Date User Action Args
2010-04-02 01:56:26 pjenvey set nosy: + pjenvey; messages: + msg5616
2010-04-01 12:16:42 amak set status: open -> closed; resolution: fixed; messages: + msg5613
2010-03-31 22:00:35 morganwahl set messages: + msg5612
2010-03-31 15:21:01 amak set files: + bug1583.patch; keywords: + patch; messages: + msg5610
2010-03-31 15:17:50 amak set assignee: amak; messages: + msg5609; nosy: + amak
2010-03-26 17:29:41 morganwahl set messages: + msg5597
2010-03-26 17:27:33 morganwahl set files: + unicode.py
2010-03-26 17:27:21 morganwahl set files: + unicode.xml; messages: + msg5596
2010-03-26 16:42:59 morganwahl set messages: + msg5595
2010-03-26 16:10:03 morganwahl set messages: + msg5594
2010-03-26 16:04:27 morganwahl create