Message11306

Author jeff.allen
Recipients Arfrever, akira, amak, fwierzbicki, jeff.allen, pekka.klarck, zyasoft
Date 2017-04-13.12:56:05
Message-id <1492088167.6.0.754079242489.issue1839@psf.upfronthosting.co.za>
In-reply-to
Content
This is related to #2536, in that I can't seem to fix that without getting the encoding of file paths straight too. I may have a "right answer".

Returning to Jython 2.7, at present we meet Jim's criterion:
>>> [java.io.File(p).exists() for p in os.listdir('.')]
[True, True, True, True, True, True, True, True, True, True]
and this is because we adapt to the file names Java gives us:
>>> os.listdir('.')
['argtest.py', u'c-\u5496\u5561', u'caf\xe9', 'dist', 'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

CPython does not unless asked specifically:
>>> os.listdir('.')
['argtest.py', 'c-??', 'caf\xe9', 'dist', 'mbcs.txt', 'p-??', 'test??.tmp', '?????????.txt', '??.txt', '??']
>>> os.listdir(u'.')
[u'argtest.py', u'c-\u5496\u5561', u'caf\xe9', u'dist', u'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

Why aren't all our file names unicode objects? Because it breaks too many tests. That's ok because we only really test with ascii. (It's not really ok.) They break because the more widely that unicode filenames spread, the more places we discover that either unicode is not expected at all, or we don't deal with it correctly. For example:
>>> import sys, os, os.path, java
>>> os.getcwd()
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561'
>>> os.path.abspath('x.tmp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 46-47: ordinal not in range(256)
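
The failure can be reproduced without Jython. The following is a minimal sketch in plain Python, assuming only that the absolute path is the working directory shown above joined to 'x.tmp'. It is not the code in ntpath.py, just the same encode step in isolation. The names `path`, `failure` and `utf8_bytes` are illustrative.

```python
# The absolute path that abspath() would have to produce in the session above.
path = u'C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561\\x.tmp'

try:
    path.encode('latin-1')
    failure = None
except UnicodeEncodeError as e:
    # e.start is the index of the first unencodable character and
    # e.end is one past the last character of the offending run.
    failure = (e.start, e.end)

# UTF-8 can represent every character in the path, so this cannot fail.
utf8_bytes = path.encode('utf-8')
```

Note that \xc9 in the user name is not the problem, since latin-1 covers it. Only the two CJK characters in the last directory name are out of range.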

It is attractive, but I'm not sure it is feasible, to have all paths be unicode objects "on demand", for example:
>>> os.__file__
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\jython-int\\dist\\Lib\\os$py.class'
I've tried this and it causes problems elsewhere (e.g. traceback.py and doctest.py) to which we could respond by making more and more things unicode, but there always seems to be more trouble, and we can't fix the same assumptions in user code.

An alternative is to encode to byte data the paths Java gives us, in enough places to keep the stdlib largely unmodified (such as the __file__ attribute). In that case we need to choose an encoding *for Jython*. Notice we are not trying to match an encoding chosen by the platform (e.g. 'mbcs' on Windows): Java has insulated us from that already. Rather we need an encoding simply because the stdlib forces byte paths on us (in Python 2), and the methods receiving paths in that form have to know how to *decode* them again for Java. There is no reason for this choice to vary with OS platform. UTF-8 is the obvious choice.
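
To make the idea concrete, here is a rough sketch of the pair of conversions I have in mind. The names `PATH_ENCODING`, `path_to_bytes` and `path_to_text` are hypothetical and do not exist in Jython; the point is only that one fixed encoding is used in both directions, whatever the platform.

```python
# Hypothetical constant: the single encoding Jython would use for byte paths.
PATH_ENCODING = 'utf-8'

def path_to_bytes(path):
    """Encode a text path (as obtained from Java) for code that expects bytes."""
    if isinstance(path, bytes):
        return path  # already byte data, assumed to be in PATH_ENCODING
    return path.encode(PATH_ENCODING)

def path_to_text(path):
    """Decode a byte path back to text before handing it to Java."""
    if isinstance(path, bytes):
        return path.decode(PATH_ENCODING)
    return path  # already text

# Some of the names from the directory listing above survive a round trip.
names = [u'argtest.py', u'c-\u5496\u5561', u'caf\xe9', u'\u4e2d\u6587.txt', u'\U0001f40d']
round_tripped = [path_to_text(path_to_bytes(n)) for n in names]
```

Because UTF-8 can encode any code point, the round trip is lossless for every name Java can give us, which is not true of 'latin-1' or of a platform code page.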

Now one could argue that this is different from CPython's use of "file system encoding", which tracks the platform's choice so that OS services may be called that have a bytes interface. We interact with these services via Java (even jnr-posix wants String arguments), so in a sense the *platform* encoding is None, or moot, or unknown, while for CPython it actually matters. Nevertheless, as we still need a conversion, for paths that Python requires be byte data, we should advertise that through sys.getfilesystemencoding(), as the answer to "how are byte paths encoded".
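
From the point of view of code that receives a path, this might look like the sketch below. The helper `as_text_path` is hypothetical, and the fallback to UTF-8 when no encoding is reported is my assumption, not existing behaviour.

```python
import sys

def as_text_path(path, encoding=None):
    """Return path as text, decoding byte paths with the advertised encoding."""
    if encoding is None:
        # Assumption: fall back to UTF-8 if the runtime reports no encoding.
        encoding = sys.getfilesystemencoding() or 'utf-8'
    if isinstance(path, bytes):
        return path.decode(encoding)
    return path  # text paths pass through unchanged
```

Code written this way does not need to know which encoding was chosen, only that sys.getfilesystemencoding() names the one in which byte paths are to be interpreted.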

This won't make all our problems go away immediately -- we're still doing the wrong thing in many places -- but I think it reduces them to problems for which there is a right answer, once you can figure out whether that java.lang.String is really a bytes object.