Issue1839

classification
Title: sys.getfilesystemencoding() is None although java.lang.System.getProperty('file.encoding') seems to work
Type:
Severity: normal
Components:
Versions:
Milestone: Jython 2.7.1

process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: jeff.allen
Nosy List: Arfrever, akira, amak, fwierzbicki, jamesmudd, jeff.allen, pekka.klarck, zyasoft
Priority: normal
Keywords:

Created on 2012-02-13.07:37:37 by pekka.klarck, last changed 2017-06-21.15:58:04 by zyasoft.

Messages
msg6779 (view) Author: Pekka Klärck (pekka.klarck) Date: 2012-02-13.07:37:36
With Jython 2.5.2 and earlier, sys.getfilesystemencoding() always returns None. This breaks code that tries to encode or decode strings on the system boundary. Examples include decoding received command-line arguments and encoding or decoding environment variables when setting or getting them. A working sys.getfilesystemencoding() could apparently also fix os.stat on Windows (issue #1658).

I have tested that, at least on Ubuntu Linux and WinXP with a Western locale, the value returned by java.lang.System.getProperty('file.encoding') seems to be the correct encoding to use. On Ubuntu I get UTF-8 both with that approach and with Python using sys.getfilesystemencoding(). On Windows file.encoding is Cp1252 and sys.getfilesystemencoding() on Python returns mbcs. Both of these are fine, as the former is the actual encoding and the latter is a special encoding that the operating system later translates to the correct encoding. Note also that Jython doesn't support mbcs.

Based on my experimentation, I propose that sys.getfilesystemencoding() be implemented using java.lang.System.getProperty('file.encoding').
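
A minimal sketch of the kind of boundary code that a None result breaks, written for Python 3. The helper names boundary_encoding, decode_arg and encode_env are hypothetical, and the UTF-8 fallback is an assumption made for illustration, not something Jython does:

```python
import sys


def boundary_encoding(default="utf-8"):
    """Return the file system encoding, or `default` when the runtime reports None."""
    # `default` is an illustrative assumption, not a value taken from Jython.
    return sys.getfilesystemencoding() or default


def decode_arg(raw):
    """Decode a byte string received from the operating system."""
    return raw.decode(boundary_encoding())


def encode_env(text):
    """Encode text before passing it to a bytes-only system interface."""
    return text.encode(boundary_encoding())
```

Without the `or default` guard, a None result would make both decode() and encode() raise a TypeError, which is the breakage described above.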
msg6948 (view) Author: Pekka Klärck (pekka.klarck) Date: 2012-03-21.05:37:07
It turned out that using the 'file.encoding' property doesn't always work, because Jython doesn't support all the encodings supported by the JVM. That ought to be pretty easy to fix, though, and I submitted a separate issue, #1865, about it.
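
One way to guard against an encoding name that the Python side does not know is to look it up before using it. This is only a sketch: usable_encoding is a hypothetical helper, and falling back to UTF-8 is an assumption rather than anything proposed in this issue.

```python
import codecs


def usable_encoding(name, fallback="utf-8"):
    """Return the canonical codec name for `name`, or `fallback` if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # The runtime has no codec registered under this name.
        return fallback
```

The value passed as `name` would be whatever the platform reports, for example the 'file.encoding' property mentioned above.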
msg8309 (view) Author: Jim Baker (zyasoft) Date: 2014-04-25.16:44:58
This needs to be fixed for 2.7: a number of libraries we use depend on sys.getfilesystemencoding().

We can also remove the Jython-specific version of SimpleHTTPServer once this is resolved.
msg9292 (view) Author: Jim Baker (zyasoft) Date: 2015-01-04.17:27:57
The title of the issue is currently misleading, given that per Python 2.7 docs (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding):

> getfilesystemencoding()
> Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used.

This is different from the file.encoding system property. Although I was unable to find an authoritative source on this as a standard property, conventionally it sets http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#defaultCharset(); see http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
msg9293 (view) Author: Jim Baker (zyasoft) Date: 2015-01-04.17:28:24
Changing this current behavior is much, much harder than it first appears. I have partially addressed it with the fix for #2239, but the problem is that the file system encoding for Jython is in some sense None - Jython simply uses Unicode paths, much like Java. Also returning None is considered correct behavior: "returns None if the system default encoding is used" (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding)

Supporting anything else is a very big issue once we consider Java integration. In general, any solution should ensure that the following code snippet would always work:

[java.io.File(p).exists() for p in os.listdir()]

regardless of how wrapped these calls (java.io.File or os.listdir) might actually be.

Note that this is somewhat similar to Windows, which uses "mbcs" for its file system encoding. Also, this problem more or less goes away with Jython 3.

Priority set to low accordingly: there is no straightforward perfect fix, which makes sense because it's an integration issue.
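
The invariant in that snippet can be expressed without Java as well. The sketch below is a plain-Python analogue that uses os.path.exists in place of java.io.File(p).exists(); the helper names listing_is_consistent and make_sample_dir are hypothetical.

```python
import os
import tempfile


def listing_is_consistent(directory):
    """True if every name returned by os.listdir() can be found again on disk."""
    return all(
        os.path.exists(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


def make_sample_dir(names):
    """Create a temporary directory containing one empty file per name."""
    directory = tempfile.mkdtemp()
    for name in names:
        open(os.path.join(directory, name), "w").close()
    return directory
```

Any file system encoding scheme has to keep this check true for names outside ASCII too, whichever API produced the name and whichever API consumes it.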
msg9317 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.19:09:02
> Also this problem goes away more or less with Jython 3.

CPython 3 still has sys.getfilesystemencoding() and losslessly (i.e. without REPLACEMENT CHARACTERs) supports bytes paths...

$ rm -fr /tmp/some_dir
$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/ś
$ touch /tmp/some_dir/$'\x80'
$ touch /tmp/some_dir/aaa$'\x80\x81\x82\x83'aaa
$ python3.5 -c 'import os; print(os.listdir(b"/tmp/some_dir"))'
[b'aaa\x80\x81\x82\x83aaa', b'\xc5\x9b', b'\x80']
$ python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', 'ś', '\udc80']
$ LC_ALL="C" python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', '\udcc5\udc9b', '\udc80']
$ python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
utf-8
$ LC_ALL="C" python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
ascii
msg9318 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.19:23:43
Although:

>>> "\udcc5\udc9b" == "ś"
False

But both paths refer to the same file:

>>> os.path.exists("/tmp/some_dir/\udcc5\udc9b")
True
>>> os.path.exists("/tmp/some_dir/ś")
True
>>> os.stat("/tmp/some_dir/\udcc5\udc9b") == os.stat("/tmp/some_dir/ś")
True
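
The behaviour in the two messages above comes from the "surrogateescape" error handler that Python 3 applies to file names. A small sketch that reproduces the strings shown, using only str/bytes methods (the helper names decode_name and encode_name are hypothetical):

```python
# Python 3 only: undecodable bytes are mapped to lone surrogates U+DC80..U+DCFF
# and mapped back to the same bytes on encoding.
raw_names = [b"aaa\x80\x81\x82\x83aaa", b"\xc5\x9b", b"\x80"]


def decode_name(raw, encoding):
    """Decode a byte file name the way Python 3 does for the given encoding."""
    return raw.decode(encoding, "surrogateescape")


def encode_name(text, encoding):
    """Encode a text file name back to the bytes the operating system sees."""
    return text.encode(encoding, "surrogateescape")


utf8_names = [decode_name(raw, "utf-8") for raw in raw_names]
ascii_names = [decode_name(raw, "ascii") for raw in raw_names]
```

Under ASCII the name b"\xc5\x9b" decodes to "\udcc5\udc9b", and under UTF-8 it decodes to "\u015b". The two strings differ, but each encodes back to the same two bytes, which is why both paths reach the same file.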
msg9448 (view) Author: (akira) Date: 2015-01-23.20:26:01
sys.getfilesystemencoding() can't be None since Python 3.2 [1]

[1] https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding
msg11306 (view) Author: Jeff Allen (jeff.allen) Date: 2017-04-13.12:56:05
This is related to #2536, in that I can't seem to fix that without getting the encoding of file paths straight too. I may have a "right answer".

Returning to Jython 2.7, at present we meet Jim's criterion:
>>> [java.io.File(p).exists() for p in os.listdir('.')]
[True, True, True, True, True, True, True, True, True, True]
and this is because we adapt to the file names Java gives us:
>>> os.listdir('.')
['argtest.py', u'c-\u5496\u5561', u'caf\xe9', 'dist', 'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

CPython does not unless asked specifically:
>>> os.listdir('.')
['argtest.py', 'c-??', 'caf\xe9', 'dist', 'mbcs.txt', 'p-??', 'test??.tmp', '?????????.txt', '??.txt', '??']
>>> os.listdir(u'.')
[u'argtest.py', u'c-\u5496\u5561', u'caf\xe9', u'dist', u'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']
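
For comparison, Python 3 keeps the same "result type follows argument type" rule, with str and bytes in place of unicode and str. A sketch using a temporary directory and a single ASCII file name (both chosen only for illustration):

```python
import os
import tempfile

# Create a scratch directory with one file in it.
directory = tempfile.mkdtemp()
open(os.path.join(directory, "argtest.py"), "w").close()

text_names = os.listdir(directory)               # str argument, str names
byte_names = os.listdir(os.fsencode(directory))  # bytes argument, bytes names
```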

Why aren't all our file names unicode objects? Because it breaks too many tests. That's ok because we only really test with ascii. (It's not really ok.) They break because the more widely that unicode filenames spread, the more places we discover that either unicode is not expected at all, or we don't deal with it correctly. For example:
>>> import sys, os, os.path, java
>>> os.getcwd()
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561'
>>> os.path.abspath('x.tmp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 46-47: ordinal not in range(256)
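
The failure can be reproduced in isolation with plain string methods. In this sketch the path is an assumption: it is the working directory shown above joined with 'x.tmp', and encode_failure is a hypothetical helper.

```python
# Assumed value: the os.getcwd() result above with "x.tmp" appended.
path = u"C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561\\x.tmp"


def encode_failure(text, encoding):
    """Return (start, end) of the first unencodable run, or None on success."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return (err.start, err.end)
    return None
```

With this path, latin-1 accepts the \xc9 but not the two CJK characters, while UTF-8 encodes the whole string.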

It is attractive, but I'm not sure it is feasible, to have all paths be unicode objects "on demand", for example:
>>> os.__file__
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\jython-int\\dist\\Lib\\os$py.class'
I've tried this and it causes problems elsewhere (e.g. traceback.py and doctest.py) to which we could respond by making more and more things unicode, but there always seems to be more trouble, and we can't fix the same assumptions in user code.

An alternative is to encode to byte data the paths Java gives us, in enough places to keep the stdlib largely unmodified (such as the __file__ attribute). In that case we need to choose an encoding *for Jython*. Notice we are not trying to match an encoding chosen by the platform (e.g. 'mbcs' on Windows): Java has insulated us from that already. Rather we need an encoding simply because the stdlib forces byte paths on us (in Python 2), and the methods receiving paths in that form have to know how to *decode* them again for Java. There is no reason for this choice to vary with OS platform. UTF-8 is the obvious choice.

Now one could argue that this is different from CPython's use of "file system encoding", which tracks the platform's choice so that OS services may be called that have a bytes interface. We interact with these services via Java (even jnr-posix wants String arguments), so in a sense the *platform* encoding is None, or moot, or unknown, while for CPython it actually matters. Nevertheless, as we still need a conversion, for paths that Python requires be byte data, we should advertise that through sys.getfilesystemencoding(), as the answer to "how are byte paths encoded".
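
As a sketch of what that conversion amounts to, assuming 'utf-8' as the advertised value (to_byte_path and from_byte_path are hypothetical names, not functions in Jython):

```python
FS_ENCODING = "utf-8"  # assumed answer to "how are byte paths encoded"


def to_byte_path(path):
    """Encode a text path for stdlib code that expects byte strings."""
    return path.encode(FS_ENCODING)


def from_byte_path(raw):
    """Decode a byte path again before handing it to a text-based API."""
    return raw.decode(FS_ENCODING)


# A few of the names from the directory listing earlier in this message.
names = [u"argtest.py", u"c-\u5496\u5561", u"caf\xe9", u"\u4e2d\u6587.txt", u"\U0001f40d"]
round_tripped = [from_byte_path(to_byte_path(name)) for name in names]
```

Because UTF-8 can represent every code point in these names, the round trip is lossless on any platform, which a platform-dependent choice such as latin-1 or Cp1252 would not guarantee.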

This won't make all our problems go away immediately -- we're still doing the wrong thing in many places -- but I think it reduces them to problems for which there is a right answer, once you can figure out whether that java.lang.String is really a bytes object.
msg11314 (view) Author: Pekka Klärck (pekka.klarck) Date: 2017-04-20.11:24:54
Jeff, I assume you meant some other issue than #2536. Which itself is pretty interesting.

As a user I'd be happier if Jython returned bytes using the same encoding as CPython, but universally using UTF-8 is not bad at all. Looking forward to Jython 3 and an end to this madness. =)
msg11318 (view) Author: Jeff Allen (jeff.allen) Date: 2017-04-22.06:20:48
Thanks. I believe it would be the same on most Unices.

CPython on Windows isn't even the same as itself. To depend on consistency is to go against the advice given here:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx
concerning CP_ACP, which is the call CPython uses inside the 'mbcs' codec.

Yes, I meant #2356. Er, Notlob.
msg11335 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-01.13:58:22
I have implemented this with sys.getfilesystemencoding()=='utf-8' and it works pretty well. I've tweaked a lot of existing code, and some is quite old. I have published it here:

https://bitbucket.org/tournesol/jython-utf8

in case anyone sees a massive flaw.
msg11338 (view) Author: Pekka Klärck (pekka.klarck) Date: 2017-05-02.11:31:51
Cool! Are there going to be 2.7.1 preview releases soon that include this? I'd like to test how the change affects Robot Framework in different environments. We have a pretty extensive test suite that has uncovered various Jython (and IronPython and even CPython) bugs over the years.
msg11340 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-02.21:43:55
This is considered important to have in 2.7.1 (according to Jim), including the Chinese case, which is coming along ok. I can't say when there will be a release candidate with it in.
msg11341 (view) Author: Jim Baker (zyasoft) Date: 2017-05-02.22:00:25
We can expect it will be part of a release candidate, however; it's just a question of timing.
msg11342 (view) Author: James Mudd (jamesmudd) Date: 2017-05-03.20:15:37
@Jeff I checked out your bitbucket repo and ran the regrtest on linux. I have 2 failures I don't see on the standard master:

     [exec] 2 tests skipped:
     [exec]     test_codecmaps_hk test_curses
     [exec] 3 tests failed:
     [exec]     test_os_jy test_runpy test_ssl
     [exec] 3 fails unexpected:
     [exec]     test_os_jy test_runpy test_ssl
     [exec] Result: 1

On master at the moment I only see test_ssl fail (with what seems to be a spurious failure, I think netty related). With the UTF-8 fix I see test_os_jy and test_runpy fail as well.

Here is the relevant test output:

     [exec] test_os_jy
     [exec] bash: warning: setlocale: LC_ALL: cannot change locale (tr_TR.UTF-8)
     [exec] bash: warning: setlocale: LC_ALL: cannot change locale (tr_TR.UTF-8)
     [exec] test test_os_jy failed -- Traceback (most recent call last):
     [exec]   File "/home/james/Desktop/jython-utf8/dist/Lib/test/test_os_jy.py", line 208, in test_env
     [exec]     self.assertEqual(p.stdout.read().decode("utf-8"), u"首页")
     [exec] AssertionError: u'\xe9\xa6\x96\xe9\xa1\xb5' != u'\u9996\u9875'
     [exec] - \xe9\xa6\x96\xe9\xa1\xb5
     [exec] + \u9996\u9875


     [exec] test_runpy
     [exec] test test_runpy failed -- Traceback (most recent call last):
     [exec]   File "/home/james/Desktop/jython-utf8/dist/Lib/test/test_runpy.py", line 384, in test_zipfile_error
     [exec]     self._check_import_error(zip_name, msg)
     [exec]   File "/home/james/Desktop/jython-utf8/dist/Lib/test/test_runpy.py", line 322, in _check_import_error
     [exec]     self.assertRaisesRegexp(ImportError, msg, run_path, script_name)
     [exec] AssertionError: "can\'t\ find\ \'\_\_main\_\_\'\ module\ in\ \'\/tmp\/tmp0sX4w2\/test\_zip\.zip\'" does not match "can't find '__main__' module in u'/tmp/tmp0sX4w2/test_zip.zip'"

Both appear to be UTF-8 related? Not sure if you are also seeing these?
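
The first mismatch has the shape of UTF-8 bytes being read one byte per character. The sketch below shows one way such a value can arise; whether this is what actually happens in test_env is not established here.

```python
expected = u"\u9996\u9875"               # the text the test expects
utf8_bytes = expected.encode("utf-8")    # b'\xe9\xa6\x96\xe9\xa1\xb5'
misread = utf8_bytes.decode("latin-1")   # each byte taken as one character

# The misreading is reversible as long as nothing else has altered the text.
recovered = misread.encode("latin-1").decode("utf-8")
```

Here misread is the six-character string u'\xe9\xa6\x96\xe9\xa1\xb5' seen on the left-hand side of the assertion.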
msg11343 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-04.16:41:29
@James Thanks for running that. No, I didn't see them because they're on the expected failures list for Windows 🤦. I'm happy to note that in test_os_jy that's the only failure (now, and for French). test_runpy fails for me on unlink(), masking whatever else happens. But that's some weird escaping!

If there's a Cygwin user out there, it would be good to know I haven't broken the launcher for you. There is some Cygwin-specific code and I was having mintty problems (before and after).
msg11369 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-13.21:38:03
@James: in fact those failures in test_os_jy are doubly interesting as the tests encapsulate what we expect from os.listdir, which I'm changing to be more like CPython. 

Nearly there now (I think).
msg11371 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-14.11:27:14
Now looking good. I am pushing for the time being to
https://bitbucket.org/tournesol/jython-utf8

Further report in #2356, msg11370.
msg11390 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-21.09:16:26
Solution now in the trunk in change sets culminating in https://hg.python.org/jython/rev/4ebf44457697

@Pekka: this will be in the release candidate we're putting together shortly.
msg11391 (view) Author: James Mudd (jamesmudd) Date: 2017-05-21.16:06:36
Just tried running this on windows and the jython.exe launcher doesn't seem to work for me?

It fails with:
 contains neither jython-dev.jar nor jython.jar.
Try running this script from the 'bin' directory of an installed Jython or
setting JYTHON_HOME.

But I am running it from the bin directory? It does work if JYTHON_HOME is set.
msg11392 (view) Author: Jim Baker (zyasoft) Date: 2017-05-21.17:23:52
James, I'm also seeing the same issue when running from the installer, eg doing something like:

$ java -jar dist/jython-installer.jar -s -d ~/jython-2.7.1-test-RC1

(more discussion in http://bugs.jython.org/issue2570)

This looks like a transient problem in finding Jython, and only impacts the installation step of running ensurepip. OK, one more bug to fix :)

The workaround is to run jython -m ensurepip as a separate later step (ensurepip can be repeated without any problem until it works, or so it seems).
msg11393 (view) Author: Pekka Klärck (pekka.klarck) Date: 2017-05-21.20:52:37
@Jeff, looking forward to the RC!

@Jim, it's OT for this issue, but have you been discussing testing pip more on Jython with the PyPA people? It seems to me they only test with CPython (and possibly PyPy), which can cause problems with Jython and IronPython. Just yesterday I commented on this PR, which contains code that doesn't work on Jython:
https://github.com/pypa/pip/pull/4490
msg11395 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-22.07:13:47
@James Thanks for testing. I'll take a look tonight.

Trying it in multiple environments is a good test after tinkering with the launcher. I run in the dev environment of my machine, which of course is to repeat the same environment each time. (I do not usually have JYTHON_HOME set.)

I've been relying on test_jython_launcher to cover other combinations in the environment. I think this was one of the changes I tried manually at some stage, but we should add to test_jython_launcher every case we encounter that gives us trouble.
msg11397 (view) Author: Jeff Allen (jeff.allen) Date: 2017-05-23.08:18:36
Ok, simple enough. I've fixed that and will push a change soon. The pip installation happens smoothly now.

Building and installing is a good test generally. Doing so for my Chinese user name, I find the installation of pip still crashes. Regression tests are ok, except for test_ssl which shows one failure I don't understand and isn't reproduced in the "ascii environment". (More relevant to #2356 than here.)
History
Date | User | Action | Args
2017-06-21 15:58:04 | zyasoft | set | status: pending -> closed
2017-05-23 08:18:37 | jeff.allen | set | messages: + msg11397
2017-05-22 07:13:48 | jeff.allen | set | messages: + msg11395
2017-05-21 20:52:37 | pekka.klarck | set | messages: + msg11393
2017-05-21 17:23:53 | zyasoft | set | messages: + msg11392
2017-05-21 16:06:37 | jamesmudd | set | messages: + msg11391
2017-05-21 09:16:26 | jeff.allen | set | status: open -> pending; resolution: accepted; messages: + msg11390
2017-05-14 11:27:14 | jeff.allen | set | messages: + msg11371; milestone: Jython 2.7.1
2017-05-13 21:38:04 | jeff.allen | set | messages: + msg11369
2017-05-04 16:41:30 | jeff.allen | set | messages: + msg11343; milestone: Jython 2.7.1 -> (no value)
2017-05-03 20:15:39 | jamesmudd | set | nosy: + jamesmudd; messages: + msg11342
2017-05-02 22:00:25 | zyasoft | set | messages: + msg11341
2017-05-02 21:43:55 | jeff.allen | set | messages: + msg11340; milestone: Jython 2.7.1
2017-05-02 11:31:52 | pekka.klarck | set | messages: + msg11338
2017-05-01 13:58:22 | jeff.allen | set | messages: + msg11335
2017-04-22 06:20:49 | jeff.allen | set | messages: + msg11318
2017-04-20 11:24:55 | pekka.klarck | set | messages: + msg11314
2017-04-13 12:58:38 | jeff.allen | link | issue2356 dependencies
2017-04-13 12:56:07 | jeff.allen | set | priority: low -> normal; assignee: jeff.allen; messages: + msg11306; nosy: + jeff.allen
2015-01-23 20:26:01 | akira | set | nosy: + akira; messages: + msg9448
2015-01-06 19:23:43 | Arfrever | set | messages: + msg9318
2015-01-06 19:09:02 | Arfrever | set | messages: + msg9317
2015-01-06 18:47:28 | Arfrever | set | nosy: + Arfrever
2015-01-04 17:28:25 | zyasoft | set | messages: + msg9293
2015-01-04 17:27:58 | zyasoft | set | priority: low; messages: + msg9292
2014-04-25 16:44:58 | zyasoft | set | nosy: + zyasoft; messages: + msg8309
2013-02-26 23:45:48 | amak | set | nosy: + amak
2013-02-26 18:12:28 | fwierzbicki | set | nosy: + fwierzbicki
2012-03-21 05:37:07 | pekka.klarck | set | messages: + msg6948
2012-02-13 07:37:37 | pekka.klarck | create |