Issue1839

classification
Title: sys.getfilesystemencoding() is None although java.lang.System.getProperty('file.encoding') seems to work
Type: Severity: normal
Components: Versions:
Milestone:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: Arfrever, akira, amak, fwierzbicki, jeff.allen, pekka.klarck, zyasoft
Priority: normal Keywords:

Created on 2012-02-13.07:37:37 by pekka.klarck, last changed 2017-04-22.06:20:49 by jeff.allen.

Messages
msg6779 (view) Author: Pekka Klärck (pekka.klarck) Date: 2012-02-13.07:37:36
With Jython 2.5.2 and earlier, sys.getfilesystemencoding() always returns None. This breaks code that tries to encode or decode strings at the system boundary. Example uses include decoding received command-line arguments and encoding or decoding environment variables when setting or getting them. A working sys.getfilesystemencoding() could apparently also fix os.stat on Windows (issue #1658).

I have tested that, at least on Ubuntu Linux and on WinXP with a Western locale, the value returned by java.lang.System.getProperty('file.encoding') seems to be the correct encoding to use. On Ubuntu I get UTF-8 both with that approach and with sys.getfilesystemencoding() on Python. On Windows, file.encoding is Cp1252 and sys.getfilesystemencoding() on Python returns mbcs. Both of these are fine, as the former is the actual encoding and the latter is a special encoding that the operating system later translates to the correct encoding. Notice also that Jython doesn't support mbcs.

Based on my experimentation, I propose that sys.getfilesystemencoding() be implemented using java.lang.System.getProperty('file.encoding').
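
A minimal sketch of the boundary use case described above, in plain Python: decode a byte string received from the system with whatever sys.getfilesystemencoding() reports, and fall back when it reports None. The helper name decode_from_system and the UTF-8 fallback are assumptions made for illustration; neither is part of Jython or of this proposal.

```python
import sys


def decode_from_system(raw, fallback="utf-8"):
    """Decode bytes received from the OS (e.g. an argument or environment value).

    `fallback` is an assumed default, used only when the runtime reports None.
    """
    # getfilesystemencoding() may return None (the behaviour reported in this
    # issue), and bytes.decode(None) would fail, so pick a fallback first.
    encoding = sys.getfilesystemencoding() or fallback
    return raw.decode(encoding)
```

Code written this way keeps working whether the runtime returns None or a real encoding name.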
msg6948 (view) Author: Pekka Klärck (pekka.klarck) Date: 2012-03-21.05:37:07
It turned out that using the 'file.encoding' property doesn't always work, because Jython doesn't support all the encodings supported by the JVM. That ought to be pretty easy to fix, though, and I submitted a separate issue #1865 about it.
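
One way to guard against an encoding name that the Python side cannot handle is to ask the codec registry before using it. This is only a sketch: the function name usable_encoding and the UTF-8 fallback are hypothetical, and the name passed in would be whatever the 'file.encoding' property supplies.

```python
import codecs


def usable_encoding(name, fallback="utf-8"):
    """Return `name` if this Python runtime has a codec for it, else `fallback`."""
    try:
        # codecs.lookup raises LookupError for names the runtime does not know.
        codecs.lookup(name)
    except LookupError:
        return fallback
    return name
```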
msg8309 (view) Author: Jim Baker (zyasoft) Date: 2014-04-25.16:44:58
Need to fix for 2.7: a number of libraries we use depend on sys.getfilesystemencoding().

Can also remove Jython-specific version of SimpleHTTPServer once this is resolved.
msg9292 (view) Author: Jim Baker (zyasoft) Date: 2015-01-04.17:27:57
The title of the issue is currently misleading, given that, per the Python 2.7 docs (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding):

> getfilesystemencoding()
> Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used.

This is different than the file.encoding system property. Although I was unable to find an authoritative source on this as a standard property, conventionally this sets http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#defaultCharset(); see http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
msg9293 (view) Author: Jim Baker (zyasoft) Date: 2015-01-04.17:28:24
Changing this current behavior is much, much harder than it first appears. I have partially addressed it with the fix for #2239, but the problem is that the file system encoding for Jython is in some sense None - Jython simply uses Unicode paths, much like Java. Also returning None is considered correct behavior: "returns None if the system default encoding is used" (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding)

Supporting anything else is a very big issue once we consider Java integration. In general, any solution should ensure that the following code snippet would always work:

[java.io.File(p).exists() for p in os.listdir()]

regardless of how wrapped these calls (java.io.File or os.listdir) might actually be.
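
The invariant can be phrased as a reusable check. In this sketch os.path.exists stands in for java.io.File(p).exists() so that it runs without a JVM; the function name and the `exists` parameter are hypothetical.

```python
import os


def all_listed_names_exist(directory, exists=os.path.exists):
    """True if every name returned by os.listdir() is accepted by `exists`."""
    return all(
        exists(os.path.join(directory, name))
        for name in os.listdir(directory)
    )
```

Under Jython, a Java-backed predicate could be passed as `exists` to test the integration case described here.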

Note that this is somewhat similar to Windows which uses "mbcs" for its file system encoding. Also this problem goes away more or less with Jython 3.

Set priority accordingly to low: there is no straightforward perfect fix, which makes sense because it's an integration issue.
msg9317 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.19:09:02
> Also this problem goes away more or less with Jython 3.

CPython 3 still has sys.getfilesystemencoding() and losslessly (i.e. without REPLACEMENT CHARACTERs) supports bytes paths...

$ rm -fr /tmp/some_dir
$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/ś
$ touch /tmp/some_dir/$'\x80'
$ touch /tmp/some_dir/aaa$'\x80\x81\x82\x83'aaa
$ python3.5 -c 'import os; print(os.listdir(b"/tmp/some_dir"))'
[b'aaa\x80\x81\x82\x83aaa', b'\xc5\x9b', b'\x80']
$ python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', 'ś', '\udc80']
$ LC_ALL="C" python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', '\udcc5\udc9b', '\udc80']
$ python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
utf-8
$ LC_ALL="C" python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
ascii
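
The lossless behaviour in the listing above comes from the surrogateescape error handler. The sketch below (Python 3 only) applies the handler explicitly with UTF-8, rather than going through the interpreter's actual filesystem encoding, so the result does not depend on the locale.

```python
raw = b"aaa\x80\x81\x82\x83aaa"  # not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN
# instead of U+FFFD REPLACEMENT CHARACTER.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
```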
msg9318 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.19:23:43
Although:

>>> "\udcc5\udc9b" == "ś"
False

But both paths refer to the same file:

>>> os.path.exists("/tmp/some_dir/\udcc5\udc9b")
True
>>> os.path.exists("/tmp/some_dir/ś")
True
>>> os.stat("/tmp/some_dir/\udcc5\udc9b") == os.stat("/tmp/some_dir/ś")
True
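
Both strings reach the same file because they encode to the same bytes. A sketch, assuming a UTF-8 filesystem encoding with the surrogateescape handler (Python 3 only):

```python
escaped = "\udcc5\udc9b"  # the name as decoded under LC_ALL="C"
decoded = "\u015b"        # the name as decoded under a UTF-8 locale

# The strings differ as text ...
same_text = escaped == decoded

# ... but the surrogates U+DCC5 and U+DC9B turn back into bytes 0xC5 0x9B,
# which is also the UTF-8 encoding of U+015B.
escaped_bytes = escaped.encode("utf-8", "surrogateescape")
decoded_bytes = decoded.encode("utf-8", "surrogateescape")
```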
msg9448 (view) Author: (akira) Date: 2015-01-23.20:26:01
sys.getfilesystemencoding() can't be None since Python 3.2 [1]

[1] https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding
msg11306 (view) Author: Jeff Allen (jeff.allen) Date: 2017-04-13.12:56:05
This is related to #2536, in that I can't seem to fix that without getting the encoding of file paths straight too. I may have a "right answer".

Returning to Jython 2.7, at present we meet Jim's criterion:
>>> [java.io.File(p).exists() for p in os.listdir('.')]
[True, True, True, True, True, True, True, True, True, True]
and this is because we adapt to the file names Java gives us:
>>> os.listdir('.')
['argtest.py', u'c-\u5496\u5561', u'caf\xe9', 'dist', 'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

CPython does not unless asked specifically:
>>> os.listdir('.')
['argtest.py', 'c-??', 'caf\xe9', 'dist', 'mbcs.txt', 'p-??', 'test??.tmp', '?????????.txt', '??.txt', '??']
>>> os.listdir(u'.')
[u'argtest.py', u'c-\u5496\u5561', u'caf\xe9', u'dist', u'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

Why aren't all our file names unicode objects? Because it breaks too many tests. That's ok because we only really test with ascii. (It's not really ok.) They break because the more widely that unicode filenames spread, the more places we discover that either unicode is not expected at all, or we don't deal with it correctly. For example:
>>> import sys, os, os.path, java
>>> os.getcwd()
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561'
>>> os.path.abspath('x.tmp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 46-47: ordinal not in range(256)
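
The failure can be reproduced outside Jython. In this sketch the path literal is an assumption: the working directory shown above joined with 'x.tmp'. Encoding it as latin-1 fails on the two CJK characters, while UTF-8 accepts it and round-trips.

```python
# Assumed input: the os.getcwd() value shown above joined with 'x.tmp'.
path = u"C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561\\x.tmp"

try:
    path.encode("latin-1")
    failed_span = None
except UnicodeEncodeError as err:
    # U+5496 and U+5561 lie above U+00FF, so latin-1 cannot represent them.
    # err.end is exclusive; subtract one for an inclusive span.
    failed_span = (err.start, err.end - 1)

# UTF-8 can encode any of these characters and decodes back unchanged.
round_trip_ok = path.encode("utf-8").decode("utf-8") == path
```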

It is attractive, but I'm not sure it is feasible, to have all paths be unicode objects "on demand", for example:
>>> os.__file__
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\jython-int\\dist\\Lib\\os$py.class'
I've tried this and it causes problems elsewhere (e.g. traceback.py and doctest.py) to which we could respond by making more and more things unicode, but there always seems to be more trouble, and we can't fix the same assumptions in user code.

An alternative is to encode to byte data the paths Java gives us, in enough places to keep the stdlib largely unmodified (such as the __file__ attribute). In that case we need to choose an encoding *for Jython*. Notice we are not trying to match an encoding chosen by the platform (e.g. 'mbcs' on Windows): Java has insulated us from that already. Rather we need an encoding simply because the stdlib forces byte paths on us (in Python 2), and the methods receiving paths in that form have to know how to *decode* them again for Java. There is no reason for this choice to vary with OS platform. UTF-8 is the obvious choice.
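
A sketch of what that pairing could look like. The names FS_ENCODING, encode_path and decode_path are hypothetical and do not come from the Jython code base; the sample names are taken from the listing earlier in this message.

```python
FS_ENCODING = "utf-8"  # assumed fixed choice, the same on every OS


def encode_path(text_path):
    """Unicode path (as Java supplies it) -> byte path for the stdlib."""
    return text_path.encode(FS_ENCODING)


def decode_path(byte_path):
    """Byte path received from Python code -> unicode path for Java."""
    return byte_path.decode(FS_ENCODING)


names = [u"caf\xe9", u"c-\u5496\u5561", u"\u4e2d\u6587.txt", u"\U0001f40d"]
round_tripped = [decode_path(encode_path(n)) for n in names]
```

Because UTF-8 can represent every Unicode code point, the round trip is lossless for any name Java can return as a String.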

Now one could argue that this is different from CPython's use of "file system encoding", which tracks the platform's choice so that OS services may be called that have a bytes interface. We interact with these services via Java (even jnr-posix wants String arguments), so in a sense the *platform* encoding is None, or moot, or unknown, while for CPython it actually matters. Nevertheless, as we still need a conversion, for paths that Python requires be byte data, we should advertise that through sys.getfilesystemencoding(), as the answer to "how are byte paths encoded".

This won't make all our problems go away immediately -- we're still doing the wrong thing in many places -- but I think it reduces them to problems for which there is a right answer, once you can figure out whether that java.lang.String is really a bytes object.
msg11314 (view) Author: Pekka Klärck (pekka.klarck) Date: 2017-04-20.11:24:54
Jeff, I assume you meant some other issue than #2536. Which itself is pretty interesting.

As a user I'd be happier with Jython returning bytes using the same encoding as CPython, but universally using UTF-8 is not bad at all. Looking forward to Jython 3 and an end to this madness. =)
msg11318 (view) Author: Jeff Allen (jeff.allen) Date: 2017-04-22.06:20:48
Thanks. I believe it would be the same on most Unices.

CPython on Windows isn't even the same as itself. To depend on consistency is to go against the advice given here:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx
concerning CP_ACP, which is the call CPython uses inside the 'mbcs' codec.

Yes, I meant #2356. Er, Notlob.
History
Date                 User          Action  Args
2017-04-22 06:20:49  jeff.allen    set     messages: + msg11318
2017-04-20 11:24:55  pekka.klarck  set     messages: + msg11314
2017-04-13 12:58:38  jeff.allen    link    issue2356 dependencies
2017-04-13 12:56:07  jeff.allen    set     priority: low -> normal; assignee: jeff.allen; messages: + msg11306; nosy: + jeff.allen
2015-01-23 20:26:01  akira         set     nosy: + akira; messages: + msg9448
2015-01-06 19:23:43  Arfrever      set     messages: + msg9318
2015-01-06 19:09:02  Arfrever      set     messages: + msg9317
2015-01-06 18:47:28  Arfrever      set     nosy: + Arfrever
2015-01-04 17:28:25  zyasoft       set     messages: + msg9293
2015-01-04 17:27:58  zyasoft       set     priority: low; messages: + msg9292
2014-04-25 16:44:58  zyasoft       set     nosy: + zyasoft; messages: + msg8309
2013-02-26 23:45:48  amak          set     nosy: + amak
2013-02-26 18:12:28  fwierzbicki   set     nosy: + fwierzbicki
2012-03-21 05:37:07  pekka.klarck  set     messages: + msg6948
2012-02-13 07:37:37  pekka.klarck  create