Created on 2015-05-20.02:21:29 by liuxy_hes86, last changed 2017-04-13.12:58:38 by jeff.allen.
|OpenRefineProblem.txt||bstjean, 2017-03-16.03:37:46||OpenRefine error trace|
|msg10069 (view)||Author: liuxy (liuxy_hes86)||Date: 2015-05-20.02:21:28|
On a Windows 8.1 PC, running jython from cmd produces this error:
C:\Users\雪彦>jython
Exception in thread "main" java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
    at org.python.core.PyString.<init>(PyString.java:64)
    at org.python.core.PyString.<init>(PyString.java:70)
    at org.python.core.packagecache.PathPackageManager.addDirectory(PathPackageManager.java:201)
    at org.python.core.packagecache.PathPackageManager.addClassPath(PathPackageManager.java:232)
    at org.python.core.packagecache.SysPackageManager.findAllPackages(SysPackageManager.java:96)
    at org.python.core.packagecache.SysPackageManager.<init>(SysPackageManager.java:39)
    at org.python.core.PySystemState.initPackages(PySystemState.java:1127)
    at org.python.core.PySystemState.doInitialize(PySystemState.java:1057)
    at org.python.core.PySystemState.initialize(PySystemState.java:974)
    at org.python.core.PySystemState.initialize(PySystemState.java:930)
    at org.python.core.PySystemState.initialize(PySystemState.java:925)
    at org.python.util.jython.run(jython.java:263)
    at org.python.util.jython.main(jython.java:142)
|msg10070 (view)||Author: Jim Baker (zyasoft)||Date: 2015-05-20.06:38:26|
Likely a duplicate of #2348
|msg10258 (view)||Author: Jeff Allen (jeff.allen)||Date: 2015-09-13.16:42:31|
Probably same as test_os_jy failure in #2397.
|msg10265 (view)||Author: Jeff Allen (jeff.allen)||Date: 2015-09-19.09:51:23|
We're both right. Running Jython 2.7.1b1 founders on #2397, but running a version with that fix, it dies importing site packages.
C:\Users\用户名\Documents\Jython> %jt%\dist\bin\jython
Exception in thread "main" Traceback (most recent call last):
  File "C:\Users\Jeff\Documents\Eclipse\jython-trunk\dist\Lib\site.py", line 585, in <module>
    ...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-11: ordinal not in range(128)
Skip the site import and you can get a prompt.
C:\Users\用户名\Documents\Jython> %jt%\dist\bin\jython -S
Jython 2.7.1 (default:26d248c72b90+, Sep 19 2015, 08:44:17)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
>>>
I think it would do us all good to work under Chinese user names for a while!
|msg11237 (view)||Author: Benoit St-Jean (bstjean)||Date: 2017-03-16.03:37:45|
In the same vein, I have a similar exception (it originates from OpenRefine at startup). It looks like jython and/or java doesn't like my Windows 10 user name and bombs. My Windows 10 user name is "Benoît St-Jean" (notice the accented î).
|msg11261 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-03-22.07:14:03|
We're not very good with non-ascii paths and program text, certainly on Windows, and in more than one part of the code I suspect. E.g. I have to tweak even build.xml, when I'm logged in as "Épreuve". :( Minable. I'll give this some more time, as I've meant to for a while.
|msg11272 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-03-25.10:57:43|
I've fixed the build, the problem being that ANTLR would generate files in file.encoding and then we would compile them as UTF-8. It makes no difference to the *text*, but the *comments* contain the full source path. C:\Users\Épreuve\atelier\ ... blahblah ... . Now file.encoding=UTF-8.
I'm fighting the launcher now, in the shape of jython.py. One can easily create a complicated situation in which all sorts of encodings are in play. Just at the DOS and Python prompts:
> type argtest.py
# What do arguments appear as, when codepages intervene?
import sys, os, locale, subprocess
print sys.argv
for arg in sys.argv:
    print "%s ( %r )" % (arg, arg)
> chcp
Active code page: 850
> set TEST=Épreuve
> python -i argtest.py café crème %TEST%
['argtest.py', 'caf\xe9', 'cr\xe8me', '\xc9preuve']
argtest.py ( 'argtest.py' )
cafÚ ( 'caf\xe9' )
crÞme ( 'cr\xe8me' )
╔preuve ( '\xc9preuve' )
### Notice that sys.argv contains byte strings but they are
### not encoded with the console encoding cp850.
### The os module is using the same encoding.
>>> os.getcwd()
'C:\\Users\\\xc9preuve\\Documents\\Python2'
>>> print os.getcwd()
C:\Users\╔preuve\Documents\Python2
>>> print os.getcwdu()
C:\Users\Épreuve\Documents\Python2
>>> os.getenv('TEST')
'\xc9preuve'
### There are plenty of encodings to choose from.
>>> sys.stdout.encoding
'cp850'
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
>>> locale.getpreferredencoding()
'cp1252'
### But this one is consistent with what I'm seeing:
>>> for a in sys.argv: print a.decode(locale.getpreferredencoding())
...
argtest.py
café
crème
Épreuve
What fun! I *tentatively* conclude we must treat arguments and environment variables as encoded with locale.getpreferredencoding().
This also seems to be the acceptable encoding when we come to launch a subprocess:
>>> subprocess.call(["python", "argtest.py"] + sys.argv[1:])
['argtest.py', 'caf\xe9', 'cr\xe8me', '\xc9preuve']
argtest.py ( 'argtest.py' )
cafÚ ( 'caf\xe9' )
crÞme ( 'cr\xe8me' )
╔preuve ( '\xc9preuve' )
The point here is not that these print correctly, but that they print the same as they did when I ran this from the DOS prompt.
Now, in jython.py, it's all driven from sys.stdout.encoding, which is different. We may even be calling encode() where we should be decoding. Or possibly we could just leave everything as bytes in the seemingly-consistent encoding of CPython and Windows. I'll see what I can do. (I'll try not to break jython.py for Linux, though it seems the minority case here.)
Eventually, when Jython launches again, I'll get to the bug(s) our French and Chinese users are experiencing, which pop up first in site.py. But fighting jython.py has been instructive. There may be lessons from CPython here about what we should be doing internally to Jython when handling byte strings from the system via file system, environment and arguments.
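The mismatch in the argtest.py session can be reproduced without a Windows console. The sketch below takes the byte strings reported in sys.argv and decodes them twice, once as cp1252 (the locale.getpreferredencoding() value above) and once as cp850 (the console code page above). It is written in Python 3 syntax for convenience, whereas the session was Python 2.7, and the helper name decode_all is only illustrative.

```python
# Byte strings exactly as they appeared in sys.argv in the session above.
raw_args = [b"caf\xe9", b"cr\xe8me", b"\xc9preuve"]


def decode_all(args, encoding):
    """Decode every byte string in args with the given codec (illustrative helper)."""
    return [a.decode(encoding) for a in args]


# Decoding with the preferred (ANSI) encoding recovers what was typed.
as_cp1252 = decode_all(raw_args, "cp1252")   # café, crème, Épreuve

# Decoding the same bytes with the console code page gives the garbled
# text that "print arg" showed: cafÚ, crÞme, ╔preuve.
as_cp850 = decode_all(raw_args, "cp850")
```

The bytes are the same in both cases; only the codec used to interpret them differs, which is consistent with the tentative conclusion that arguments arrive in the preferred encoding rather than the console encoding.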
|msg11276 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-03-27.07:45:38|
I've re-written jython.py to use Unicode internally, decoding args and environment variables in-bound, and encoding for subprocess.call() out-bound. Both times we use locale.getpreferredencoding(), which is cp1252 on my system while the console encoding is cp850. It passes test_jython_launcher for a user named "Épreuve" as long as I suppress the site module with -S. Interestingly, both virtualenv and PyInstaller (on Python 2.7.13) fail for this user with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 ... .
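As a rough illustration of that decode-inbound, encode-outbound pattern (not the actual jython.py code), the sketch below uses hypothetical helpers decode_inbound and encode_outbound and a placeholder program name. It is written in Python 3 syntax; the encoding is fixed to cp1252 so the result is reproducible, where a launcher would presumably ask locale.getpreferredencoding().

```python
import locale


def decode_inbound(values, encoding):
    """Turn byte strings (args, environment values) into text; leave text as it is."""
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in values]


def encode_outbound(values, encoding):
    """Turn text back into byte strings for the child process; leave bytes as they are."""
    return [v if isinstance(v, bytes) else v.encode(encoding) for v in values]


# A launcher would normally use this value; it is shown only for reference.
preferred = locale.getpreferredencoding()

# Fixed here (an assumption) so the example behaves the same everywhere.
enc = "cp1252"

# In-bound: bytes as received from the command line.
args = decode_inbound([b"-S", b"\xc9preuve"], enc)

# Internally everything is text, so building the command is plain list work.
# "child-program" is a placeholder, not a real executable.
command = ["child-program"] + args

# Out-bound: encode once, at the boundary, with the same encoding.
wire = encode_outbound(command, enc)
```

Using one encoding in both directions means the child receives the same bytes the launcher was given, whatever the console code page happens to be.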
|msg11285 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-03-30.21:47:40|
I think I have this pretty much beaten, but at the expense of turning affected file paths from str values into unicode values. I'm wondering if this is a harmless divergence, perhaps even a good one? CPython 2.7 is quite tolerant about mixing unicode and str: it just promotes to unicode where necessary to match the types, in the result of concatenation, say, or when searching a unicode string for a str target. It's happy to open files with unicode names, and the path modules seem to work fine in unicode.
I spent some time working out what CPython does with non-ascii paths when importing modules. CPython 2.7 will find modules in non-ascii directories on the search path, and it will tolerate a non-ascii installation directory. However, this is only as long as that directory can be handled as bytes in the default encoding. (Which default encoding? Not sure. The one returned by locale.getpreferredencoding(), I think.) If you create a directory named 困难 (u'\u56f0\u96be') and put it on your PYTHONPATH, the environment variable comes through as '??', and if you add it to sys.path as a unicode, CPython ignores it. If you install CPython into such a directory (make it PYTHONHOME) it crashes on startup.
Jython is already better than this in that:
1. The environment variables come through as unicode values when they are not ascii (thanks to https://hg.python.org/jython/file/tip/src/org/python/modules/posix/PosixModule.java#l1348).
2. Paths internal to the sys module, coming from java.io.File, are unconditionally unicode objects, e.g. https://hg.python.org/jython/file/tip/src/org/python/core/PySystemState.java#l215, which emerges as:
>>> sys.getCurrentWorkingDir()
u'C:\\Users\\Jeff\\Documents\\Python2\\\u56f0\u96be'
However, Jython is less good than CPython in places where a str path is expected, because we only allow ascii, rather than assume a dubiously-guessed encoding.
The bit we're missing, and I propose to add, is to create and support unicode paths (as opposed to byte str paths). Often these come from Java or environment variables, and are used as Java String objects, but we are tunnelling them through PyString objects (that allow only ascii), where I think we could use PyUnicode. When added experimentally, maybe 15 regression tests currently fail, but I think this is a matter of following through consistently, and in a couple of places, allowing unicode where the str type is explicitly expected in the test. Because this seems to spread quite widely, I feel I should ask if this sounds reasonable? Do we think this promotion to unicode should only happen when provoked by a non-ascii path, or is it better if affected values (sys.path directories mainly) become unicode unconditionally in the way of sys.getCurrentWorkingDir()?
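The '??' reported above for the 困难 directory is what a lossy conversion to a single-byte code page would produce. The sketch below mimics that with Python's "replace" error handler; whether Windows performs exactly this substitution is an assumption, as is the choice of cp1252 for the code page. It is written in Python 3 syntax.

```python
name = u"\u56f0\u96be"  # the directory name from the experiment above

# Lossy conversion to an assumed ANSI code page: neither character exists
# in cp1252, so each one is replaced by '?'.
lossy = name.encode("cp1252", errors="replace")

# Strict ASCII, the situation where only ascii str paths are accepted,
# refuses the name outright instead of degrading it.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A Western European name survives cp1252 intact, which may be why such
# paths can work as byte strings in CPython 2.7 while the Chinese one cannot.
latin = u"\u00c9preuve".encode("cp1252")
```

Keeping such paths as unicode objects sidesteps both outcomes, since no conversion to a narrower encoding is attempted.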
|msg11307 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-04-13.12:58:38|
+ sys.getfilesystemencoding() is None although java.lang.System.getProperty('file.encoding') seems to work|
messages: + msg11307
|2017-03-30 21:47:41||jeff.allen||set||messages: + msg11285|
+ test failure causes|
messages: + msg11276
|2017-03-25 10:57:44||jeff.allen||set||messages: + msg11272|
|2017-03-22 07:14:04||jeff.allen||set||messages: + msg11261|
nosy: + bstjean
messages: + msg11237
|2015-09-19 09:51:24||jeff.allen||set||messages: + msg10265|
|2015-09-13 16:42:31||jeff.allen||set||assignee: jeff.allen|
messages: + msg10258
nosy: + jeff.allen
messages: + msg10070