Issue2356

classification
Title: java.lang.IllegalArgumentException while startup jython on Windows 8.1 with current username is not ASCII characters
Type: crash Severity: major
Components: Core Versions: Jython 2.7
Milestone: Jython 2.7.1
process
Status: closed Resolution: accepted
Dependencies: sys.getfilesystemencoding() is None although java.lang.System.getProperty('file.encoding') seems to work
View: 1839
Superseder:
Assigned To: jeff.allen Nosy List: bstjean, jeff.allen, liuxy_hes86, zyasoft
Priority: Keywords: test failure causes

Created on 2015-05-20.02:21:29 by liuxy_hes86, last changed 2017-06-09.04:39:27 by zyasoft.

Files
File name Uploaded Description
OpenRefineProblem.txt bstjean, 2017-03-16.03:37:46 OpenRefine error trace
Messages
msg10069 (view) Author: liuxy (liuxy_hes86) Date: 2015-05-20.02:21:28
On a Windows 8.1 PC, running jython from cmd produces the following error:

C:\Users\雪彦>jython
Exception in thread "main" java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
        at org.python.core.PyString.<init>(PyString.java:64)
        at org.python.core.PyString.<init>(PyString.java:70)
        at org.python.core.packagecache.PathPackageManager.addDirectory(PathPackageManager.java:201)
        at org.python.core.packagecache.PathPackageManager.addClassPath(PathPackageManager.java:232)
        at org.python.core.packagecache.SysPackageManager.findAllPackages(SysPackageManager.java:96)
        at org.python.core.packagecache.SysPackageManager.<init>(SysPackageManager.java:39)
        at org.python.core.PySystemState.initPackages(PySystemState.java:1127)
        at org.python.core.PySystemState.doInitialize(PySystemState.java:1057)
        at org.python.core.PySystemState.initialize(PySystemState.java:974)
        at org.python.core.PySystemState.initialize(PySystemState.java:930)
        at org.python.core.PySystemState.initialize(PySystemState.java:925)
        at org.python.util.jython.run(jython.java:263)
        at org.python.util.jython.main(jython.java:142)
msg10070 (view) Author: Jim Baker (zyasoft) Date: 2015-05-20.06:38:26
Likely a duplicate of #2348
msg10258 (view) Author: Jeff Allen (jeff.allen) Date: 2015-09-13.16:42:31
Probably same as test_os_jy failure in #2397.
msg10265 (view) Author: Jeff Allen (jeff.allen) Date: 2015-09-19.09:51:23
We're both right. Running Jython 2.7.1b1 founders on #2397, but running a version with that fix, it dies importing site packages.

C:\Users\用户名\Documents\Jython> %jt%\dist\bin\jython
Exception in thread "main" Traceback (most recent call last):
  File "C:\Users\Jeff\Documents\Eclipse\jython-trunk\dist\Lib\site.py", line 585, in <module>
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-11: ordinal not in range(128)

Skip the site import and you can get a prompt.

C:\Users\用户名\Documents\Jython> %jt%\dist\bin\jython -S
Jython 2.7.1 (default:26d248c72b90+, Sep 19 2015, 08:44:17)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
>>>

I think it would do us all good to work under Chinese user names for a while!
msg11237 (view) Author: Benoit St-Jean (bstjean) Date: 2017-03-16.03:37:45
In the same vein, I have a similar exception (it originates from OpenRefine at startup).  It looks like Jython and/or Java doesn't like my Windows 10 user name and bombs.  My Windows 10 user name is "Benoît St-Jean" (notice the accented î).
msg11261 (view) Author: Jeff Allen (jeff.allen) Date: 2017-03-22.07:14:03
We're not very good with non-ascii paths and program text, certainly on Windows, and in more than one part of the code I suspect. E.g. I have to tweak even build.xml, when I'm logged in as "Épreuve". :( Minable.

I'll give this some more time, as I've meant to for a while.
msg11272 (view) Author: Jeff Allen (jeff.allen) Date: 2017-03-25.10:57:43
I've fixed the build, the problem being that ANTLR would generate files in file.encoding and then we would compile them as UTF-8. It makes no difference to the *text*, but the *comments* contain the full source path. C:\Users\Épreuve\atelier\ ... blahblah ... . Now file.encoding=UTF-8.
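The mismatch described in this message can be reproduced in a few lines. This is only a sketch, written for Python 3 and assuming cp1252 stands in for the platform file.encoding; the path is an illustrative shortening of the one quoted above.

```python
# Sketch: a path written in one encoding and read back as UTF-8.
# Assumption: cp1252 stands in for the platform's file.encoding.
path = u"C:\\Users\\\u00c9preuve\\atelier"

# What a generator using file.encoding would emit into a comment.
written = path.encode("cp1252")

try:
    # What a compiler told to read UTF-8 then attempts.
    written.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

# Once both sides agree on UTF-8, the round trip is lossless.
round_trip = path.encode("utf-8").decode("utf-8")

print(readable_as_utf8, round_trip == path)
```

The lone byte 0xC9 is a UTF-8 lead byte with no valid continuation, which is why the generated comment cannot be read until both steps use the same encoding.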

I'm fighting the launcher now, in the shape of jython.py. One can easily create a complicated situation in which all sorts of encodings are in play. Just at the DOS and Python prompts:

> type argtest.py
# What do arguments appear as, when codepages intervene?
import sys, os, locale, subprocess
print sys.argv
for arg in sys.argv:
    print "%s ( %r )" % (arg, arg)

> chcp
Active code page: 850

> set TEST=Épreuve

> python -i argtest.py café crème %TEST%
['argtest.py', 'caf\xe9', 'cr\xe8me', '\xc9preuve']
argtest.py ( 'argtest.py' )
cafÚ ( 'caf\xe9' )
crÞme ( 'cr\xe8me' )
╔preuve ( '\xc9preuve' )

### Notice that sys.argv contains byte strings but they are
### not encoded with the console encoding cp850.
### The os module is using the same encoding.

>>> os.getcwd()
'C:\\Users\\\xc9preuve\\Documents\\Python2'
>>> print os.getcwd()
C:\Users\╔preuve\Documents\Python2
>>> print os.getcwdu()
C:\Users\Épreuve\Documents\Python2
>>> os.getenv('TEST')
'\xc9preuve'

### There are plenty of encodings to choose from.

>>> sys.stdout.encoding
'cp850'
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
>>> locale.getpreferredencoding()
'cp1252'

### But this one is consistent with what I'm seeing:

>>> for a in sys.argv: print a.decode(locale.getpreferredencoding())
...
argtest.py
café
crème
Épreuve

What fun! I *tentatively* conclude we must treat arguments and environment variables as encoded with locale.getpreferredencoding(). This also seems to be the acceptable encoding when we come to launch a subprocess:
>>> subprocess.call(["python", "argtest.py"] + sys.argv[1:])
['argtest.py', 'caf\xe9', 'cr\xe8me', '\xc9preuve']
argtest.py ( 'argtest.py' )
cafÚ ( 'caf\xe9' )
crÞme ( 'cr\xe8me' )
╔preuve ( '\xc9preuve' )

The point here is not that these print correctly, but they print the same as they did when I ran this from the DOS prompt.
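The garbled console output in the transcript can be reproduced by decoding the same bytes two ways. This is a minimal sketch for Python 3 (the session above is Python 2), using the byte values shown in sys.argv and the two code pages reported by chcp and locale.getpreferredencoding().

```python
# Bytes exactly as they appear in sys.argv in the transcript above.
raw_args = [b"caf\xe9", b"cr\xe8me", b"\xc9preuve"]

# Interpreted with the console code page (cp850): this is the mojibake
# that "print arg" produced.
as_console = [a.decode("cp850") for a in raw_args]

# Interpreted with the preferred encoding (cp1252): the intended text.
as_locale = [a.decode("cp1252") for a in raw_args]

for garbled, intended in zip(as_console, as_locale):
    print(repr(garbled), repr(intended))
```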

Now, in jython.py, it's all driven from sys.stdout.encoding, which is different. We may even be calling encode() where we should be decoding. Or possibly we could just leave everything as bytes in the seemingly-consistent encoding of CPython and Windows. I'll see what I can do. (I'll try not to break jython.py for Linux, though it seems the minority case here.)

Eventually, when Jython launches again, I'll get to the bug(s) our French and Chinese users are experiencing, which pop up first in site.py.

But fighting jython.py has been instructive. There may be lessons from CPython here about what we should be doing internally to Jython when handling byte strings from the system via file system, environment and arguments.
msg11276 (view) Author: Jeff Allen (jeff.allen) Date: 2017-03-27.07:45:38
I've re-written jython.py to use Unicode internally, decoding args and environment variables in-bound, and encoding for subprocess.call() out-bound. Both times we use locale.getpreferredencoding(), which is cp1252 on my system while the console encoding is cp850. It passes test_jython_launcher for a user named "Épreuve" as long as I suppress the site module with -S.
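The decode-in, encode-out pattern described here might look roughly like the following. It is a sketch only, written for Python 3 with explicit byte strings; the helper names decode_inbound and encode_outbound are hypothetical and are not taken from jython.py.

```python
import locale


def decode_inbound(raw_values, encoding=None):
    """Decode byte strings (arguments, environment values) to text."""
    encoding = encoding or locale.getpreferredencoding()
    return [value.decode(encoding) for value in raw_values]


def encode_outbound(values, encoding=None):
    """Encode text back to bytes before handing it to a subprocess."""
    encoding = encoding or locale.getpreferredencoding()
    return [value.encode(encoding) for value in values]


# cp1252 is passed explicitly so the example does not depend on the
# machine it runs on; the real launcher would rely on the default.
raw = [b"caf\xe9", b"\xc9preuve"]
args = decode_inbound(raw, "cp1252")
command = encode_outbound(args, "cp1252")

print(args, command == raw)
```

Using the same encoding in both directions means the bytes handed to the child process are identical to those received, whatever the console code page happens to be.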

Interestingly, both virtualenv and PyInstaller (on Python 2.7.13) fail for this user with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 ... .
msg11285 (view) Author: Jeff Allen (jeff.allen) Date: 2017-03-30.21:47:40
I think I have this pretty much beaten, but at the expense of turning affected file paths from str values into unicode values. I'm wondering if this is a harmless divergence, perhaps even a good one?

CPython 2.7 is quite tolerant about mixing unicode and str: it just promotes to unicode where necessary to match the types, in the result of concatenation, say, or when searching a unicode string for a str target. It's happy to open files with unicode names, and the path modules seem to work fine in unicode.

I spent some time working out what CPython does with non-ascii paths when importing modules. CPython 2.7 will find modules in non-ascii directories on the search path, and it will tolerate a non-ascii installation directory. However, this is only as long as that directory can be handled as bytes in the default encoding. (Which default encoding? Not sure. The one returned by locale.getpreferredencoding(), I think.) If you create a directory named 困难 (u'\u56f0\u96be') and put it on your PYTHONPATH, the environment variable comes through as '??', and if you add it to sys.path as a unicode, CPython ignores it. If you install CPython into such a directory (make it PYTHONHOME) it crashes on startup.
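The '??' seen for the directory name is what a lossy conversion to a single-byte code page produces. The sketch below assumes Python 3 and uses the codec's "replace" error handler as a stand-in for the conversion Windows performs itself, so it is an analogy rather than the actual mechanism.

```python
# The directory name from the message above.
name = u"\u56f0\u96be"

# cp1252 has no mapping for either character, so each becomes "?".
lossy = name.encode("cp1252", "replace")

# UTF-8 can represent the name, so nothing is lost.
intact = name.encode("utf-8")

print(lossy, intact.decode("utf-8") == name)
```

Once the name has been reduced to '??' the original path cannot be recovered, which is consistent with CPython being unable to use it.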

Jython is already better than this in that:

1. The environment variables come through as unicode values when they are not ascii (thanks to https://hg.python.org/jython/file/tip/src/org/python/modules/posix/PosixModule.java#l1348).

2. Paths internal to the sys module, coming from java.io.File, are unconditionally unicode objects, e.g. https://hg.python.org/jython/file/tip/src/org/python/core/PySystemState.java#l215, which emerges as:

    >>> sys.getCurrentWorkingDir()
    u'C:\\Users\\Jeff\\Documents\\Python2\\\u56f0\u96be'


However, Jython is less good than CPython in places where a str path is expected, because we only allow ascii, rather than assume a dubiously-guessed encoding. The bit we're missing, and I propose to add, is to create and support unicode paths (as opposed to byte str paths). Often these come from Java or environment variables, and are used as Java String objects, but we are tunnelling them through PyString objects (that allow only ascii), where I think we could use PyUnicode. When added experimentally, maybe 15 regression tests currently fail, but I think this is a matter of following through consistently, and in a couple of places, allowing unicode where the str type is explicitly expected in the test.

Because this seems to spread quite widely, I feel I should ask if this sounds reasonable? Do we think this promotion to unicode should only happen when provoked by a non-ascii path, or is it better if affected values (sys.path directories mainly) become unicode unconditionally in the way of sys.getCurrentWorkingDir()?
msg11307 (view) Author: Jeff Allen (jeff.allen) Date: 2017-04-13.12:58:38
These problems stem ultimately from how we treat paths that the stdlib essentially requires us to represent in bytes. See #msg11306 in issue #1839.
msg11337 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-01.14:28:41
In a significant change of approach (see #1839) I have addressed this by making sys.getfilesystemencoding() == 'utf-8' and it works pretty well. I've tweaked a lot of existing code. Some is quite old. I have published to here:

https://bitbucket.org/tournesol/jython-utf8

in case anyone sees a massive flaw. If not, I'll push to the main repo.

The current regression test runs for my user name "Épreuve" and passes, but not yet for "用户名". I think we are still assuming bytes are unicode in some places. So I estimate that Benoît is now ok, but there's more to do for 雪彦.

Just to show off a bit what we can do:

> dist\bin\jython
Jython 2.7.1rc1 (default:060e4e4a06d8, Apr 30 2017, 23:08:20)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, os, os.path
>>> os.getcwd()
'C:\\Users\\\xe7\x94\xa8\xe6\x88\xb7\xe5\x90\x8d\\Documents\\Jython\\utf-8'
>>> print os.getcwdu()
C:\Users\用户名\Documents\Jython\utf-8
>>> f = open(os.path.join(u'c-\u5496\u5561', u'\u56f0\u96be.txt'), 'wb')
>>> print f.name
c-咖啡\困难.txt
>>> f.close()
>>> f = open(os.path.join(u's-\U0001f40d', u'pythón'), 'wb')
>>> f
<open file u's-\U0001f40d\\pyth\xf3n', mode 'wb' at 0x3>

I observe that it is mostly having a non-ascii installation location, current directory or TMP/TEMP that causes the trouble. I can perhaps simulate those things without actually changing user name (which tends to break the tools I need). It's also a clue to a work-around.
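The relationship between the os.getcwd() and os.getcwdu() results in the session above can be checked directly: the byte string is the UTF-8 encoding of the Unicode path. A small sketch, assuming Python 3 and copying the values from the transcript:

```python
# Unicode form of the path, as printed by os.getcwdu() above.
cwd_text = u"C:\\Users\\\u7528\u6237\u540d\\Documents\\Jython\\utf-8"

# With a file system encoding of 'utf-8', the byte form is just this.
cwd_bytes = cwd_text.encode("utf-8")

print(cwd_bytes)
print(cwd_bytes.decode("utf-8") == cwd_text)
```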
msg11370 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-14.11:21:28
I am claiming success on this ... by which I mean, it passes test.regrtest -e, on Windows localised for Chinese, with code page 936 at the console, even if your user name is 用户名, and you run from a personal directory (with all that implies for home, temp and dist directories).

As noted above, I am pushing for the time being to
https://bitbucket.org/tournesol/jython-utf8
because this felt a little experimental at the start. However, the way in which this has panned out, with all the change being in code the Jython Project wrote, gives me confidence. We're actually more compatible with CPython than before. (One exception: I fixed lib2to3.test_main.py, but that's a bug in CPython too.)

I will next:
1. draw this down to my Linux machine and try it there, and
2. see if I can package it for installation (as a test).

----- DETAIL

In case anyone else wants to try this I will add that with an exotic user name, pretty much all the tools we use in development stop working. I therefore work under my own name and an ascii directory but copy the distribution to a "challenging location" each time I run, debugging by remote attachment. I have the usual lay-down when compiling:

├─.hg
.
├─build
├─cachedir
├─dist
├─grammar
.
├─src
└─tests

which is joined by a runtime environment:

├─h-故乡
│  ├─d-分配
│  │  ├─bin
│  │  ├─javalib
│  │  └─Lib
│  └─t-一时

When I run a test, the cwd is h-故乡, jython.exe launches from \d-分配\bin, and the temporary directory is (the full path of) h-故乡\t-一时 (by setting TMP and TEMP). Material generated in tests typically ends up like this:
│  └─t-一时
│      ├─tmp1y99jl
│      │  └─org
│      │      └─python
│      │          └─test
│      │              └─bark

The non-ascii temporary directory is important. It added loads to the failing tests when I first did it :).

BTW, the prefixes h-, d-, t- are there so I can type the names by filename completion.
msg11389 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-21.09:14:03
Solution now in the trunk in change sets culminating in https://hg.python.org/jython/rev/4ebf44457697
msg11404 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-25.06:14:34
This is not quite laid to rest. I find that during installation, ensurepip crashes with the famous "Cannot create PyString ... ".

I believe I have tracked this down to zipimporter where the archive name (a java.lang.String that contains a Unicode path) is exposed as a PyString by @ExposedGet. Now this seems to me a systematic problem with the exposer (contrast #1128), but for now I'll fix it just at this spot.
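As a rough illustration of why a Unicode path trips the "Cannot create PyString with non-byte value" error, the sketch below assumes the check amounts to requiring every character to have a code point below 256, as the wording of the message suggests. The function name fits_byte_string is hypothetical and this is Python 3, not the Java code in Jython.

```python
def fits_byte_string(text):
    """True if every character could be stored as a single byte value."""
    return all(ord(ch) < 256 for ch in text)


# An ascii path passes; the home directory from the original report
# contains characters far outside the byte range and does not.
ascii_path = u"C:\\Users\\Jeff"
chinese_path = u"C:\\Users\\雪彦"

print(fits_byte_string(ascii_path), fits_byte_string(chinese_path))
```

Under this assumption, any java.lang.String path holding such characters has to be exposed as a unicode object, or encoded to bytes first, rather than wrapped directly in a byte string.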
msg11405 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-25.08:11:33
I re-claim success at: https://hg.python.org/jython/rev/097a1441a68f
History
Date User Action Args
2018-03-14 23:09:27  jeff.allen  link  issue2348 superseder
2018-02-25 08:07:19  jeff.allen  link  issue2369 superseder
2017-06-09 04:39:27  zyasoft  set  status: pending -> closed
2017-05-25 08:11:34  jeff.allen  set  status: open -> pending; messages: + msg11405
2017-05-25 06:14:35  jeff.allen  set  status: pending -> open; messages: + msg11404
2017-05-21 09:14:04  jeff.allen  set  status: open -> pending; resolution: accepted; messages: + msg11389
2017-05-14 11:21:52  jeff.allen  set  milestone: Jython 2.7.0 -> Jython 2.7.1
2017-05-14 11:21:30  jeff.allen  set  messages: + msg11370
2017-05-01 14:28:42  jeff.allen  set  messages: + msg11337
2017-04-13 12:58:38  jeff.allen  set  dependencies: + sys.getfilesystemencoding() is None although java.lang.System.getProperty('file.encoding') seems to work; messages: + msg11307
2017-03-30 21:47:41  jeff.allen  set  messages: + msg11285
2017-03-27 07:45:39  jeff.allen  set  keywords: + test failure causes; messages: + msg11276
2017-03-25 10:57:44  jeff.allen  set  messages: + msg11272
2017-03-22 07:14:04  jeff.allen  set  messages: + msg11261
2017-03-16 03:37:47  bstjean  set  files: + OpenRefineProblem.txt; nosy: + bstjean; messages: + msg11237
2015-09-19 09:51:24  jeff.allen  set  messages: + msg10265
2015-09-13 16:42:31  jeff.allen  set  assignee: jeff.allen; messages: + msg10258; nosy: + jeff.allen
2015-05-20 06:38:27  zyasoft  set  nosy: + zyasoft; messages: + msg10070
2015-05-20 02:21:29  liuxy_hes86  create