Message5609

Author amak
Recipients amak, morganwahl
Date 2010-03-31.15:17:48
Message-id <1270048670.48.0.314550527926.issue1583@psf.upfronthosting.co.za>
In-reply-to
Content
OK, I found out where this is happening, if not why.

Several things have changed since the old xml.dom and xml.sax code was written. The most important is a change in the way that Jython handles unicode.

The line at fault is line 188 of xml.sax.drivers2.drv_javasax, in the SAX "characters" method. The method used to look like this:

def characters(self, char, start, len):
    self._cont_handler.characters(str(String(char, start, len)))

This could never have been right, since it passes a 'str', not a unicode object, to the content handler.
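To illustrate what a content handler normally expects from "characters", here is a minimal sketch. It assumes the standard-library xml.sax with its default parser, not the Jython javasax driver discussed here; the TextCollector class and the sample document are made up for the example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers every chunk passed to characters()."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so store each one.
        self.chunks.append(content)


# Sample input: the UTF-8 bytes 0xC3 0x85 encode the single character U+00C5.
document = b'<?xml version="1.0" encoding="utf-8"?><place>\xc3\x85land</place>'

handler = TextCollector()
xml.sax.parseString(document, handler)
collected = u"".join(handler.chunks)
```

The handler ends up with decoded text containing U+00C5, not raw bytes, which is the behaviour the driver's "characters" method needs to preserve.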

The java.lang.String contains the correct data in this case, i.e. u"\xc5land". Converting it to 'str' made Jython treat it as an ASCII byte string, which is why you received the decode error you saw.
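The failure mode can be reproduced in plain Python. This is only a simulation under the assumption that the 'str' ends up holding one byte per character (modelled here with a latin-1 encode); it does not show Jython's actual internals.

```python
# Assumption: the byte string holds one byte per character of the text,
# which is modelled here by encoding with latin-1.
text = u"\xc5land"
raw = text.encode("latin-1")

try:
    # Any implicit conversion back to unicode with the ascii codec fails
    # on the first byte, because 0xc5 is outside the range 0-127.
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first byte is already outside the ASCII range, so the decode fails at position 0.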

As far as I can see, the correct conversion should be using 'unicode', i.e. the method should look like this:

def characters(self, char, start, len):
    self._cont_handler.characters(unicode(String(char, start, len)))

But for some unknown reason, when I try to run this, it gives:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)

This happens even though it should be constructing an org.python.core.PyUnicode from a java.lang.String that already contains unicode. I'm at a loss to explain why the UnicodeDecodeError is happening in this case; someone more familiar with the recent changes in unicode handling needs to take a look at this.

In the meantime, there is a quick fix you can make, which is to define the method like this:

def characters(self, char, start, len):
    self._cont_handler.characters(String(char, start, len).getBytes('utf-8').tostring().decode('utf-8'))

It's kind of hacky, but it works, and should get you back up and running. I've attached a patch to the bug report.
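The idea behind the quick fix is a round trip through UTF-8 bytes, which avoids the ascii codec entirely. A minimal plain-Python sketch of that round trip follows; round_trip_utf8 is a hypothetical helper name, and the Java-side getBytes/tostring calls are replaced here by an ordinary encode.

```python
def round_trip_utf8(text):
    """Encode text to UTF-8 bytes, then decode those bytes back to text."""
    data = text.encode("utf-8")   # stands in for String.getBytes('utf-8')
    return data.decode("utf-8")   # explicit codec, so ascii is never used


original = u"\xc5land"
restored = round_trip_utf8(original)
```

Because both steps name the codec explicitly, the non-ASCII character survives unchanged.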

Thanks for reporting it.
History
Date                 User  Action  Args
2010-03-31 15:17:50  amak  set     messageid: <1270048670.48.0.314550527926.issue1583@psf.upfronthosting.co.za>
2010-03-31 15:17:50  amak  set     recipients: + amak, morganwahl
2010-03-31 15:17:50  amak  link    issue1583 messages
2010-03-31 15:17:48  amak  create