Issue1614

classification
Title: minidom chunks the character input on multi-line values
Type: behaviour Severity: normal
Components: Library Versions: 2.5.1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: amak Nosy List: amak, cbearden, fdb
Priority: Keywords:

Created on 2010-05-25.10:36:31 by fdb, last changed 2010-06-25.14:25:06 by amak.

Files
File name Uploaded Description
minidom_test.py fdb, 2010-05-25.10:36:29 Simplest test case that shows the difference between Jython and CPython
Messages
msg5776 (view) Author: Frederik De Bleser (fdb) Date: 2010-05-25.10:36:29
The text of a node is not stored in a single node value when the parser delivers it in multiple chunks.

I'm not sure if this is a bug or if my implementation code is wrong, but the behavior is different from CPython.

In the attached example, the XML document has four lines. Java's SAX parser chunks the input two lines at a time. Only the first two lines are stored inside childNodes[0].nodeValue. The other two are in the next child node. CPython stores everything under childNodes[0].nodeValue, even for very large node values. (I tested with 7 million characters.)

To reproduce:
jython minidom_test.py

Expected result:
line1
line2
line3
line4

Actual result:
line1
line2

Actual result in Python:
line1
line2
line3
line4

Is this an error in the implementation or am I using minidom wrong?

I'm using Mac OS X 10.6.3 with Jython:

Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54) 
[Java HotSpot(TM) 64-Bit Server VM (Apple Inc.)] on java1.6.0_20
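
The attached minidom_test.py is not reproduced in this thread. A minimal sketch of what such a test might look like follows, assuming a four-line text value inside a single element; the document content and the helper names first_text_value and child_count are illustrative, not taken from the attachment.

```python
from xml.dom import minidom

# Hypothetical stand-in for the attached minidom_test.py (the real file is
# not shown in this thread). One element holding a four-line text value.
XML = "<doc>line1\nline2\nline3\nline4</doc>"


def first_text_value(xml_text):
    """Return nodeValue of the first child of the root element."""
    dom = minidom.parseString(xml_text)
    return dom.documentElement.childNodes[0].nodeValue


def child_count(xml_text):
    """Return how many child nodes the root element has after parsing."""
    dom = minidom.parseString(xml_text)
    return len(dom.documentElement.childNodes)


if __name__ == "__main__":
    # On CPython this prints all four lines and a child count of 1.
    # On the affected Jython version the text may be split across nodes.
    print(first_text_value(XML))
    print(child_count(XML))
```

Printing the child count alongside the value makes it easy to see whether the text was split across several nodes.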
msg5777 (view) Author: Alan Kennedy (amak) Date: 2010-05-25.18:17:28
Whether or not adjacent text nodes are added together depends on
whether or not the DOM has been "normalized".

http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-normalize

I can't find any reference for whether a DOM should be auto-normalized
when first loaded. If it should be automatically normalized, then
jython is defective here.

If the DOM should not be auto-normalized, then carrying out a
normalize() operation on the DOM will certainly ensure that adjacent
text nodes should be grouped together. If jython does not return the
expected results after a normalize() operation, then it is definitely
defective.

Please can you try the normalize() operation on the DOM, and see if that solves the problem?

Meantime, I will try to ascertain if the DOM should be auto-normalized.
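
To illustrate what normalize() does, here is a small sketch that builds a DOM with two adjacent text nodes by hand (simulating a parser that split the text) and then normalizes it. The helper name build_split_document and the sample text are assumptions made for the example.

```python
from xml.dom import minidom


def build_split_document():
    """Build <doc> with two adjacent text nodes, as a chunking parser might."""
    doc = minidom.Document()
    root = doc.createElement("doc")
    doc.appendChild(root)
    # Two separate text nodes holding what is logically one block of text.
    root.appendChild(doc.createTextNode("line1\nline2\n"))
    root.appendChild(doc.createTextNode("line3\nline4"))
    return doc


doc = build_split_document()
root = doc.documentElement

nodes_before = len(root.childNodes)   # two adjacent text nodes
doc.normalize()                       # merges adjacent text nodes in place
nodes_after = len(root.childNodes)    # a single text node remains
merged_value = root.childNodes[0].nodeValue
```

The same call can be made on the document returned by xml.dom.minidom.parseString() before reading any nodeValue.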
msg5778 (view) Author: Alan Kennedy (amak) Date: 2010-05-25.18:22:38
Lastly, I am loath to blindly follow the cpython behaviour in this case.

1. It may be an accident that this particular document delivers the expected result without normalization. Other documents may differ: needs testing.

2. Jython's behaviour should be based on the DOM standard. If that standard does not require auto-normalization, then auto-normalization just to match cpython behaviour will incur unnecessary performance penalties.

While we want jython to behave like cpython, in this case the DOM standard should be the guide for expected behaviour.
msg5779 (view) Author: Charles Bearden (cbearden) Date: 2010-05-25.18:45:36
The DOM Level 1 [1] and Level 2 [2] recommendations have this:

 "When a document is first made available via the DOM, there is only one
 Text node for each block of text."

In the source context, "first made available" sounds like the point at
which the DOM is hot off the press, er, parser.  The CPython minidom
implementation always presents text nodes as if normalized, in my
experience.

[1] <http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-1312295772>
[2] <http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/core.html#ID-1312295772>
msg5780 (view) Author: Alan Kennedy (amak) Date: 2010-05-25.21:58:08
OK, I see now why this is happening.

Cpython uses the xml.dom.expatbuilder class, which has special code to recognize when text nodes are adjacent *as they are being built*. This tries to ensure that all contiguous text in a document will be contained in a single node. I am not sure if it is guaranteed. (And performance degrades as the document sizes grow, because they are adding strings together with s=s1+s2+..+sN, rather than s="".join([s1,s2,..,sN])).

Jython has no expat, and so cannot use expatbuilder. Instead it uses xml.dom.pulldom.parse(), which relies on the parse events of the underlying java SAX2 parser. Since there is no parser (that I know of) that will guarantee to report all contiguous text in a single SAX2 characters() call, this splitting of contiguous text into multiple nodes is virtually guaranteed on jython.
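
As a rough illustration of the SAX behaviour described above, the sketch below collects every characters() call and joins the pieces once at the end, rather than concatenating strings repeatedly. The class name TextCollector and the function collect_text are made up for this example; this is not the code in pulldom or in the fix.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Accumulate character data that a parser may deliver in several chunks."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []
        self.calls = 0

    def characters(self, content):
        # A SAX parser is free to call this more than once for one text run.
        self.calls += 1
        self.chunks.append(content)

    def text(self):
        # Join once at the end instead of repeated s = s1 + s2 + ...
        return "".join(self.chunks)


def collect_text(xml_bytes):
    """Parse xml_bytes and return the handler holding the collected text."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler


handler = collect_text(b"<doc>line1\nline2\nline3\nline4</doc>")
```

Whatever the value of handler.calls turns out to be for a given parser, handler.text() yields the complete text, which is the property a DOM builder needs to put it in one node.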

I have the fix made in my local repo.

But I still haven't decided to check it in. Points to note are

1. Cpython didn't always behave the way it does now, as you can see from the following thread on python-list from January 2003.

xml.dom.minidom.parse() splitting text nodes?
http://mail.python.org/pipermail/python-list/2003-January/801932.html
http://mail.python.org/pipermail/python-list/2003-January/819322.html

2. Introducing the normalize method to guarantee normalization almost doubles the time required to run the minidom test suite. So this fix does not come without cost, a cost which I'm not too happy about forcing everyone to bear just to solve this one simple case. And that simple case could be solved by the user simply adding their own "dom.normalize()" call to their code.

3. Should the behaviour of jython follow XML specs (which despite Charles's excellent links are still open to interpretation) or cpython behaviour?

4. Perhaps the best position is that we should follow cpython blind^H^H^H^H^Hfaithfully, and behave exactly as it does. If that makes performance suck, then users should translate their stuff to work with java APIs instead.

Opinions welcome. I'm going to sleep on it.

Meantime, the OP can solve his problem by simply adding dom.normalize() to his code after the xml.dom.minidom.parseString().
msg5781 (view) Author: Alan Kennedy (amak) Date: 2010-05-25.22:02:30
I should have mentioned, I haven't been able to find *why* the behaviour of cpython expatbuilder was changed, because of the change of version control system from CVS to SVN that has taken place in the interim.

Perhaps there was a bug that was filed?

Perhaps it was judged too capricious/unusual/unpythonic for the user to get surprises like this.

If anyone finds any relevant links, please post them.
msg5782 (view) Author: Charles Bearden (cbearden) Date: 2010-05-25.22:16:16
For what it may be worth, I did a small (very ad hack) test of Java
behavior, in which I increased the size of a single text node in a
simple XML doc and ran a simple Java DOM program against it, and the
result I got was always one text node.  My test code is at
http://pastebin.com/CS5dKV1f (I won't be offended if you laugh).

Chuck
msg5836 (view) Author: Alan Kennedy (amak) Date: 2010-06-25.14:25:04
OK, I checked all of the java XML Object Models (DOM4J, JDOM, DOM, XOM) and they all behave the same: they normalise all text nodes.

And so does cpython's minidom.

So it's correct for jython's minidom to behave the same way.

Fix checked in at r7070.
History
Date User Action Args
2010-06-25 14:25:07 amak set status: open -> closed
resolution: fixed
messages: + msg5836
2010-05-25 22:16:16 cbearden set messages: + msg5782
2010-05-25 22:02:30 amak set messages: + msg5781
2010-05-25 21:58:09 amak set messages: + msg5780
2010-05-25 18:45:37 cbearden set nosy: + cbearden
messages: + msg5779
2010-05-25 18:22:39 amak set messages: + msg5778
2010-05-25 18:17:29 amak set assignee: amak
messages: + msg5777
nosy: + amak
2010-05-25 10:36:31 fdb create