Message5780

Author	amak
Recipients	amak, cbearden, fdb
Date	2010-05-25.21:58:08
SpamBayes Score	0.00052977743
Marked as misclassified	No
Message-id	<1274824689.84.0.388003618604.issue1614@psf.upfronthosting.co.za>
In-reply-to

Content

OK, I see now why this is happening.

CPython uses the xml.dom.expatbuilder module, which has special code to recognize when text nodes are adjacent *as they are being built*. This tries to ensure that all contiguous text in a document ends up in a single node, though I am not sure that is guaranteed. (And performance degrades as document size grows, because it adds strings together with s=s1+s2+..+sN, rather than s="".join([s1,s2,..,sN]).)
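To illustrate the performance remark, here is a minimal sketch of the two accumulation styles. The function names and sample chunks are made up for illustration and are not taken from expatbuilder.

```python
def concat_repeatedly(chunks):
    """Builds the result with s = s + c; each step may copy the whole string so far."""
    s = ""
    for c in chunks:
        s = s + c
    return s


def concat_with_join(chunks):
    """Builds the result in one pass over the collected pieces."""
    return "".join(chunks)


# Hypothetical pieces of one contiguous run of text.
chunks = ["some ", "contiguous ", "text"]
```

Both give the same string; the difference only shows up in running time as the number of pieces grows.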

Jython has no expat, and so cannot use expatbuilder. Instead it uses xml.dom.pulldom.parse(), which relies on the parse events of the underlying Java SAX2 parser. Since there is no parser (that I know of) that guarantees to report all contiguous text in a single SAX2 characters() call, this splitting of contiguous text into multiple nodes is virtually guaranteed on Jython.
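For anyone consuming SAX events directly, a rough sketch of how a handler can cope with text arriving in several characters() calls. The TextCollector class and the sample document are hypothetical; only the xml.sax calls are from the standard library, and how many chunks you actually get depends on the parser.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects the text of each element, however many chunks it arrives in."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []
        self.texts = {}

    def startElement(self, name, attrs):
        # Start a fresh buffer for this element (flat documents only).
        self.chunks = []

    def characters(self, content):
        # May be called several times for one contiguous run of text.
        self.chunks.append(content)

    def endElement(self, name):
        # Join once at the end rather than concatenating on every call.
        self.texts[name] = "".join(self.chunks)


handler = TextCollector()
xml.sax.parseString(b"<greeting>fish &amp; chips</greeting>", handler)
```

After the parse, handler.texts["greeting"] holds the whole text regardless of how the parser split it.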

I have the fix made in my local repo.

But I still haven't decided whether to check it in. Points to note:

1. CPython didn't always behave the way it does now, as you can see from the following thread on python-list from January 2003.

xml.dom.minidom.parse() splitting text nodes?
http://mail.python.org/pipermail/python-list/2003-January/801932.html
http://mail.python.org/pipermail/python-list/2003-January/819322.html

2. Introducing a normalize() call to guarantee normalization almost doubles the time required to run the minidom test suite. So this fix does not come without cost, a cost which I'm not too happy about forcing everyone to bear just to solve this one simple case. And that simple case could be solved by the user simply adding their own "dom.normalize()" call to their code.

3. Should the behaviour of Jython follow the XML specs (which, despite Charles's excellent links, are still open to interpretation) or CPython's behaviour?

4. Perhaps the best position is that we should follow CPython blind^H^H^H^H^Hfaithfully, and behave exactly as it does. If that makes performance suck, then users should translate their stuff to work with Java APIs instead.

Opinions welcome. I'm going to sleep on it.

Meantime, the OP can solve his problem by simply adding dom.normalize() to his code after the xml.dom.minidom.parseString() call.
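A small sketch of that workaround. Since CPython usually hands back a single text node already, the split is simulated here by appending a second text node by hand; the sample document is made up.

```python
from xml.dom import minidom

dom = minidom.parseString("<a>x</a>")
root = dom.documentElement

# Simulate the split a SAX-based builder might produce:
# add a second text node right next to the first.
root.appendChild(dom.createTextNode("y"))
count_before = len(root.childNodes)

# Merge adjacent text nodes throughout the document.
dom.normalize()
count_after = len(root.childNodes)
merged_text = root.firstChild.data
```

After normalize(), the element has one text child containing the combined text.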

History
Date	User	Action	Args
2010-05-25 21:58:09	amak	set	messageid: <1274824689.84.0.388003618604.issue1614@psf.upfronthosting.co.za>
2010-05-25 21:58:09	amak	set	recipients: + amak, fdb, cbearden
2010-05-25 21:58:09	amak	link	issue1614 messages
2010-05-25 21:58:08	amak	create