Message11285

Author jeff.allen
Recipients bstjean, jeff.allen, liuxy_hes86, zyasoft
Date 2017-03-30.21:47:40
Message-id <1490910462.03.0.299084191234.issue2356@psf.upfronthosting.co.za>
In-reply-to
Content
I think I have this pretty much beaten, but at the expense of turning affected file paths from str values into unicode values. I'm wondering if this is a harmless divergence, perhaps even a good one?

CPython 2.7 is quite tolerant about mixing unicode and str: it just promotes to unicode where necessary to match the types, in the result of concatenation, say, or when searching a unicode string for a str target. It's happy to open files with unicode names, and the path modules seem to work fine in unicode.
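A minimal sketch of that tolerance, for illustration only. The literal paths are made up, and `ntpath` is used so that Windows joining rules apply on any platform. The behaviour of interest is CPython 2.7, where the un-prefixed literals are byte `str`; on Python 3 both kinds of literal are already text, so the checks pass trivially.

```python
import ntpath  # Windows path rules, importable on any platform

# unicode on Python 2.7, str on Python 3
text_type = type(u'')

# Concatenation: the str operand is promoted and the result is unicode.
joined = u'C:\\Users\\Jeff' + '\\Documents'

# Searching a unicode string for a str target also works.
found = 'Jeff' in u'C:\\Users\\Jeff'

# The path functions accept the mixture and return unicode.
path = ntpath.join(u'C:\\Users\\Jeff', 'Documents')

print(isinstance(joined, text_type), found, isinstance(path, text_type))
```

Note that on 2.7 the promotion decodes the `str` operand as ascii, so this only works smoothly while the byte string itself is pure ascii.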

I spent some time working out what CPython does with non-ascii paths when importing modules. CPython 2.7 will find modules in non-ascii directories on the search path, and it will tolerate a non-ascii installation directory. However, this holds only as long as the directory name can be represented as bytes in the default encoding. (Which default encoding? Not sure. The one returned by locale.getpreferredencoding(), I think.) If you create a directory named 困难 (u'\u56f0\u96be') and put it on your PYTHONPATH, the environment variable comes through as '??', and if you add it to sys.path as a unicode object, CPython ignores it. If you install CPython into such a directory (make it PYTHONHOME), it crashes on startup.
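One plausible way to picture the '??' is a lossy conversion of the name to a byte encoding that cannot represent it. The sketch below only illustrates that idea: the choice of cp1252 is an assumption standing in for whatever encoding the platform actually applies, and the message does not confirm that this is the mechanism CPython or the OS uses.

```python
# The directory name from the message: two CJK characters.
name = u'\u56f0\u96be'

# ASSUMPTION: cp1252 stands in for the platform's byte encoding.
# With errors='replace', each unencodable character becomes '?'.
mangled = name.encode('cp1252', 'replace')

# A strict conversion fails outright instead of substituting.
try:
    name.encode('cp1252')
    representable = True
except UnicodeEncodeError:
    representable = False

print(repr(mangled), representable)
```

Once the name has been reduced to question marks, the original directory cannot be recovered from the byte string, which is consistent with the path then being unusable for imports.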

Jython is already better than this in that:

1. The environment variables come through as unicode values when they are not ascii (thanks to https://hg.python.org/jython/file/tip/src/org/python/modules/posix/PosixModule.java#l1348).

2. Paths internal to the sys module, coming from java.io.File, are unconditionally unicode objects, e.g. https://hg.python.org/jython/file/tip/src/org/python/core/PySystemState.java#l215, which emerges as:

    >>> sys.getCurrentWorkingDir()
    u'C:\\Users\\Jeff\\Documents\\Python2\\\u56f0\u96be'


However, Jython is less good than CPython in places where a str path is expected, because we allow only ascii there, rather than assume a dubiously guessed encoding. The bit we're missing, and that I propose to add, is the ability to create and support unicode paths (as opposed to byte str paths). Often these come from Java or environment variables, and are used as Java String objects, but we are tunnelling them through PyString objects (which allow only ascii), where I think we could use PyUnicode. With this change added experimentally, maybe 15 regression tests currently fail, but I think fixing them is a matter of following through consistently and, in a couple of places, allowing unicode where the test explicitly expects the str type.

Because this seems to spread quite widely, I feel I should ask: does this sound reasonable? Do we think this promotion to unicode should happen only when provoked by a non-ascii path, or is it better if affected values (sys.path directories mainly) become unicode unconditionally, in the way of sys.getCurrentWorkingDir()?
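The two options in that question can be sketched in plain Python. The helper name `path_from_java` and its `unconditional` flag are hypothetical, invented for this illustration. They are not Jython API, and the real change would live in the Java code that currently builds PyString objects.

```python
def path_from_java(text, unconditional=False):
    """Hypothetical helper: pick the Python type for a path that
    arrives as text (as a Java String would).

    unconditional=False: byte str for ascii paths, unicode otherwise.
    unconditional=True:  always unicode.
    """
    if unconditional:
        return text
    try:
        # Pure ascii: hand back a byte string, as happens today.
        return text.encode('ascii')
    except UnicodeEncodeError:
        # Non-ascii: promote to unicode rather than guess an encoding.
        return text


ascii_path = u'C:\\Users\\Jeff'
cjk_path = u'C:\\Users\\Jeff\\\u56f0\u96be'

print(repr(path_from_java(ascii_path)))
print(repr(path_from_java(cjk_path)))
print(repr(path_from_java(ascii_path, unconditional=True)))
```

With the conditional policy, the type of a sys.path entry depends on where the user happened to install things. The unconditional policy gives one type everywhere, at the cost of diverging from CPython even for ascii-only paths.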