Message5782

Author	cbearden
Recipients	amak, cbearden, fdb
Date	2010-05-25.22:16:16
SpamBayes Score	3.917631e-05
Marked as misclassified	No
Message-id	<AANLkTilf8lNqpd6u4J3w1CfKZDdIV51IG9npIjrH4-F2@mail.gmail.com>
In-reply-to	<1274824689.84.0.388003618604.issue1614@psf.upfronthosting.co.za>

Content

On Tue, May 25, 2010 at 4:58 PM, Alan Kennedy <report@bugs.jython.org> wrote:
>
> Alan Kennedy <jython-dev@xhaus.com> added the comment:
>
> OK, I see now why this is happening.
>
> CPython uses the xml.dom.expatbuilder class, which has special code to recognize when text nodes are adjacent *as they are being built*. This tries to ensure that all contiguous text in a document will be contained in a single node. I am not sure if it is guaranteed. (And performance degrades as document size grows, because they are adding strings together with s=s1+s2+..+sN, rather than s="".join([s1,s2,..,sN])).
>
> Jython has no expat, and so cannot use expatbuilder. Instead it uses xml.dom.pulldom.parse(), which relies on the parse events of the underlying Java SAX2 parser. Since there is no parser (that I know of) that will guarantee to report all contiguous text in a single SAX2 characters() call, this splitting of contiguous text into multiple nodes is virtually guaranteed on Jython.
>
> I have the fix made in my local repo.
>
> But I still haven't decided to check it in. Points to note are
>
> 1. CPython didn't always behave the way it does now, as you can see from the following thread on python-list from January 2003.
>
> xml.dom.minidom.parse() splitting text nodes?
> http://mail.python.org/pipermail/python-list/2003-January/801932.html
> http://mail.python.org/pipermail/python-list/2003-January/819322.html
>
> 2. Introducing the normalize method to guarantee normalization almost doubles the time required to run the minidom test suite. So this fix does not come without cost, a cost which I'm not too happy about forcing everyone to bear just to solve this one simple case. And that simple case could be solved by the user simply adding their own "dom.normalize()" call to their code.
>
> 3. Should the behaviour of Jython follow XML specs (which despite Charles' excellent links are still open to interpretation) or CPython behaviour?
>
> 4. Perhaps the best position is that we should follow CPython blind^H^H^H^H^Hfaithfully, and behave exactly as it does. If that makes performance suck, then users should translate their stuff to work with Java APIs instead.
>
> Opinions welcome. I'm going to sleep on it.
>
> Meantime, the OP can solve his problem by simply adding dom.normalize() to his code after the xml.dom.minidom.parseString().
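
For reference, here is a minimal sketch of that workaround. It is only an illustration, not the OP's code: the document string and variable names are made up, and because CPython's parser already merges adjacent text, the split is simulated by appending a second text node by hand.

```python
from xml.dom import minidom

# Hypothetical input; any document with text content would do.
doc = minidom.parseString("<root>hello</root>")
root = doc.documentElement

# Simulate a parser that reported contiguous text in two pieces
# by adding a second, adjacent text node manually.
root.appendChild(doc.createTextNode(" world"))
before = len(root.childNodes)

# normalize() merges adjacent text nodes throughout the tree.
doc.normalize()
after = len(root.childNodes)
merged = root.firstChild.data
```

After the normalize() call the two adjacent text nodes should have been collapsed into one, which is the state the OP's code expects.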

For what it may be worth, I did a small (very ad hack) test of Java
behavior, in which I increased the size of a single text node in a
simple XML doc and ran a simple Java DOM program against it, and the
result I got was always one text node.  My test code is at
http://pastebin.com/CS5dKV1f (I won't be offended if you laugh).
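
A rough Python analogue of that experiment, one level down at the SAX layer, might look like the sketch below. It is not the Java code from the pastebin; the ChunkCounter class and collect_chunks function are names invented here. It records every characters() call for a single text node of a given length, so the number of calls can be compared across sizes and parsers. The number of calls is parser-dependent, so no particular count is assumed.

```python
import xml.sax


class ChunkCounter(xml.sax.ContentHandler):
    """Records each piece of text the parser reports via characters()."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def collect_chunks(text_length):
    """Parse a document with one text node of text_length characters
    and return the list of pieces passed to characters()."""
    handler = ChunkCounter()
    xml_bytes = ("<root>" + "x" * text_length + "</root>").encode("utf-8")
    xml.sax.parseString(xml_bytes, handler)
    return handler.chunks


if __name__ == "__main__":
    for size in (10, 1000, 100000):
        print(size, len(collect_chunks(size)))
```

Whatever the number of calls turns out to be, joining the chunks should always give back the original text; only how it is divided up is left to the parser.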

Chuck

History
Date	User	Action	Args
2010-05-25 22:16:16	cbearden	set	recipients: + cbearden, amak, fdb
2010-05-25 22:16:16	cbearden	link	issue1614 messages
2010-05-25 22:16:16	cbearden	create