Title: xml.dom.pulldom exception for empty files not consistent with Python
Type:
Severity: normal
Components: Library
Versions: 2.5.2
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: amak
Nosy List: amak, pjac
Priority:
Keywords:

Created on 2011-07-21.15:09:12 by pjac, last changed 2012-03-17.22:39:43 by amak.

msg6574 (view) Author: Peter (pjac) Date: 2011-07-21.15:09:12
Test case:

import sys
print sys.version
from StringIO import StringIO
from xml.dom import pulldom
from xml.sax import SAXParseException
handle = StringIO() # simulate empty file
try:
    for event,node in pulldom.parse(handle):
        print event
except SAXParseException, e:
    print repr(e)
    print "Line number", e.getLineNumber()
    print "Column number", e.getColumnNumber()
print "Done"

Reference output from (C) Python on Linux,

$ python2.5 
2.5.5 (r255:77872, Jan 14 2011, 17:09:55) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)]
SAXParseException('no element found',)
Line number 1
Column number 0

$ python2.6 
2.6.6 (r266:84292, Aug 31 2010, 16:21:14) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)]
SAXParseException('no element found',)
Line number 1
Column number 0

$ python2.7 
2.7 (r27:82500, Jul 13 2010, 14:02:41) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)]
SAXParseException('no element found',)
Line number 1
Column number 0

Inconsistent output from Jython,

$ jython 
2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06) 
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)]
SAXParseException(u'Premature end of file.',)
Line number 1
Column number 1

Notice (a) different exception description, (b) different column number.

This bug was identified from a failing Biopython unit test, see:
msg6591 (view) Author: Alan Kennedy (amak) Date: 2011-07-30.00:32:50
This is fundamentally an interpretation issue. 

Does one interpret an empty document as a failure to provide parsable tokens from the input stream (the Java interpretation, i.e. the tokenizer raises the error), or as a stream of tokens that happens to be empty (the Python interpretation, i.e. the parser raises the error)?

Is there an XML declaration present in the file, i.e. does the stream contain something like <?xml version="x.y" encoding="blah_encoding"?>

Or is the input stream completely empty, i.e. contains no characters other than whitespace?

If the latter, i.e. the document is pure whitespace, then I recommend a pragmatic solution:

document = document.strip()
if not document:
    raise MyException("A whitespace-only document is meaningless, no matter what its file extension is")

In the meantime, I will investigate whether an empty file or a file full of whitespace can meaningfully be described as an XML file.
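
A minimal sketch of the pre-check described above, assuming the whole document is already in memory as a string. The names EmptyDocumentError and iter_events are hypothetical; only pulldom.parseString comes from the standard library.

```python
from xml.dom import pulldom


class EmptyDocumentError(ValueError):
    """Hypothetical exception for input with no non-whitespace characters."""


def iter_events(document):
    """Reject empty or whitespace-only input before any XML parser sees it.

    This is a plain function rather than a generator, so the check runs
    as soon as it is called, not when iteration starts.
    """
    if not document.strip():
        raise EmptyDocumentError("document is empty or contains only whitespace")
    # Hand the original (unstripped) text to the parser so that any line
    # and column numbers it reports still match the input.
    return pulldom.parseString(document)
```

With this in place the caller never depends on how a particular parser words or locates its "empty input" error, because that case is handled before parsing begins.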
msg6811 (view) Author: Alan Kennedy (amak) Date: 2012-03-17.22:39:43
Having thought about this, I think that this is not a valid bug.

The correct exception is raised. The textual description contained in the exception cannot be expected to be identical across platforms.

As for the column number issue, it looks like expat (which CPython uses) is counting from zero, whereas the Java parser is counting from column 1. Both are valid interpretations, but I consider the Java one more appropriate, particularly on Jython, since there is no column zero when you open the file in a text editor, for example.
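
To illustrate, here is a sketch of a caller that does not depend on which parser sits underneath: it relies only on the exception type and the line number (which both parsers reported identically in this report), and ignores the message text and the column. The function name first_parse_error_line is hypothetical, and the "except ... as" syntax assumes Python 2.6 or later.

```python
from xml.dom import pulldom
from xml.sax import SAXParseException


def first_parse_error_line(document):
    """Return the line number of the first parse error, or None if none occurs."""
    try:
        for event, node in pulldom.parseString(document):
            pass  # drain the event stream so the parser sees the whole input
    except SAXParseException as exc:
        # Deliberately ignore exc.getMessage() and exc.getColumnNumber():
        # the wording and the column base are parser-specific.
        return exc.getLineNumber()
    return None
```

A test written against this helper would check for a non-None result rather than comparing against a fixed message string or column value.
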
Date                 User  Action  Args
2012-03-17 22:39:43  amak  set     status: open -> closed; resolution: wont fix; messages: + msg6811
2011-07-30 00:32:51  amak  set     assignee: amak; messages: + msg6591; nosy: + amak
2011-07-21 15:09:12  pjac  create