Issue1774

classification
Title: xml.dom.pulldom exception for empty files not consistent with Python
Type:
Severity: normal
Components: Library
Versions: 2.5.2
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: amak
Nosy List: amak, pjac
Priority:
Keywords:

Created on 2011-07-21.15:09:12 by pjac, last changed 2012-03-17.22:39:43 by amak.

Messages
msg6574 Author: Peter (pjac) Date: 2011-07-21.15:09:12
Test case:

import sys
print sys.version
from StringIO import StringIO
from xml.dom import pulldom
from xml.sax import SAXParseException
handle = StringIO() # simulate empty file
try:
    for event,node in pulldom.parse(handle):
        print event
except SAXParseException, e:
    print repr(e)
    print "Line number", e.getLineNumber()
    print "Column number", e.getColumnNumber()
print "Done"


Reference output from (C) Python on Linux,

$ python2.5 sax_empty_xml.py 
2.5.5 (r255:77872, Jan 14 2011, 17:09:55) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)]
SAXParseException('no element found',)
Line number 1
Column number 0
Done


$ python2.6 sax_empty_xml.py 
2.6.6 (r266:84292, Aug 31 2010, 16:21:14) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)]
SAXParseException('no element found',)
Line number 1
Column number 0
Done


$ python2.7 sax_empty_xml.py 
2.7 (r27:82500, Jul 13 2010, 14:02:41) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)]
SAXParseException('no element found',)
Line number 1
Column number 0
Done


Inconsistent output from Jython,


$ jython sax_empty_xml.py 
2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06) 
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)]
SAXParseException(u'Premature end of file.',)
Line number 1
Column number 1
Done


Notice (a) different exception description, (b) different column number.

This bug was identified from a failing Biopython unit test, see:
https://redmine.open-bio.org/issues/3267
msg6591 Author: Alan Kennedy (amak) Date: 2011-07-30.00:32:50
This is fundamentally an interpretation issue. 

Does one interpret an empty document as a failure to provide parsable tokens from the input stream (the Java interpretation, i.e. the tokenizer raises the error), or does one interpret an empty document as a stream of tokens that is empty (the Python interpretation, i.e. the parser raises the error)?

Is there an XML declaration present in the file, i.e. does the stream contain something like '<?xml version="x.y" encoding="blah_encoding"?>'?

Or is the input stream completely empty, i.e. contains no characters other than whitespace?

If the latter, i.e. the document is pure whitespace, then I recommend a pragmatic solution, i.e.

document = document.strip()
if document:
    xml_parse(document)
else:
    raise MyException("A whitespace-only document is meaningless, no matter what its file extension is")
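
For anyone who wants to try the check above, here is a minimal runnable sketch of the same idea. It is written for a current Python 3 interpreter rather than the 2.5-era one in the report, and the names parse_events and EmptyDocumentError are hypothetical stand-ins for xml_parse and MyException, not part of any library.

```python
from xml.dom import pulldom


class EmptyDocumentError(ValueError):
    """Hypothetical exception for input with no non-whitespace content."""


def parse_events(text):
    """Return the pulldom event names for text, rejecting blank input first."""
    # Guard before the parser ever sees the data, so the outcome does not
    # depend on how a particular XML parser reports an empty stream.
    if not text.strip():
        raise EmptyDocumentError("document is empty or whitespace only")
    return [event for event, _node in pulldom.parseString(text)]
```

Calling parse_events("<a/>") returns the event names for a one-element document, while parse_events("   ") raises EmptyDocumentError on any platform without involving the XML parser at all.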

In the meantime, I will investigate whether an empty file or a file full of whitespace can meaningfully be described as an XML file.
msg6811 Author: Alan Kennedy (amak) Date: 2012-03-17.22:39:43
Having thought about this, I think that this is not a valid bug.

The correct exception is raised. The textual description contained in the exception cannot be expected to be identical across platforms.

As for the column number issue, it looks like expat (which CPython uses) is counting columns from zero, whereas the Java parser is counting from column 1. Both are valid interpretations, but I consider the Java one more appropriate, particularly on Jython, since there is no column zero when you open the file in a text editor, for example.
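
A test that needs to pass on both platforms can follow from the two points above: assert on the exception type only, and normalise the column before comparing. The sketch below is one possible shape, written for a current Python 3 interpreter. The helper names parse_failure and one_based_column are hypothetical, it uses xml.sax.parseString rather than pulldom to keep the example short, and whether a given parser counts from zero is an assumption the caller has to supply.

```python
import xml.sax
from xml.sax import SAXParseException
from xml.sax.handler import ContentHandler


def parse_failure(data):
    """Return the SAXParseException raised while parsing data, or None."""
    try:
        xml.sax.parseString(data, ContentHandler())
    except SAXParseException as exc:
        # Only the type is relied on; the message text differs by platform.
        return exc
    return None


def one_based_column(exc, parser_counts_from_zero):
    """Report the error column as 1-based.

    parser_counts_from_zero is supplied by the caller (an assumption about
    the underlying parser); it is not detected automatically.
    """
    column = exc.getColumnNumber()
    return column + 1 if parser_counts_from_zero else column
```

A cross-platform test would then check isinstance(parse_failure(b""), SAXParseException) and compare one_based_column(...) values, rather than matching 'no element found' or 'Premature end of file.' literally.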
History
Date                 User  Action  Args
2012-03-17 22:39:43  amak  set     status: open -> closed
                                   resolution: wont fix
                                   messages: + msg6811
2011-07-30 00:32:51  amak  set     assignee: amak
                                   messages: + msg6591
                                   nosy: + amak
2011-07-21 15:09:12  pjac  create