Issue1839

classification
Title: sys.getfilesystemencoding() is None although java.lang.System.getProperty('file.encoding') seems to work
Type:
Severity: normal
Components:
Versions:
Milestone:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, akira, amak, fwierzbicki, pekka.klarck, zyasoft
Priority: low
Keywords:

Created on 2012-02-13.07:37:37 by pekka.klarck, last changed 2015-01-23.20:26:01 by akira.

Messages
msg6779 Author: Pekka Klärck (pekka.klarck) Date: 2012-02-13.07:37:36
With Jython 2.5.2 and earlier, sys.getfilesystemencoding() always returns None. This breaks code that tries to encode or decode strings at the system boundary. Examples include decoding received command line arguments and encoding or decoding environment variables when setting or getting them. A working sys.getfilesystemencoding() could apparently also fix os.stat on Windows (issue #1658).

I have tested that, at least on Ubuntu Linux and on WinXP with a Western locale, the value returned by java.lang.System.getProperty('file.encoding') seems to be the correct encoding to use. On Ubuntu I get UTF-8 both with that approach and with Python using sys.getfilesystemencoding(). On Windows file.encoding is Cp1252 and sys.getfilesystemencoding() on Python returns mbcs. Both of these are fine, as the former is the actual encoding and the latter is a special encoding that the operating system later translates to the correct encoding. Notice also that Jython doesn't support mbcs.

Based on my experimentation, I propose that sys.getfilesystemencoding() be implemented using java.lang.System.getProperty('file.encoding').
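
To illustrate the kind of boundary code that breaks when sys.getfilesystemencoding() returns None, here is a minimal sketch. The helper names boundary_encoding and decode_boundary_value, and the "utf-8" fallback, are assumptions made for this example and are not part of Jython or of any library mentioned in this issue.

```python
import sys


def boundary_encoding(fallback="utf-8"):
    """Return the reported filesystem encoding, or `fallback` if it is None.

    The "utf-8" default is an assumption for this sketch, not a Jython rule.
    """
    return sys.getfilesystemencoding() or fallback


def decode_boundary_value(raw, encoding=None):
    """Decode a byte string received from the OS (argument, environment value).

    Values that are already text are returned unchanged.
    """
    if not isinstance(raw, bytes):
        return raw
    return raw.decode(encoding or boundary_encoding())
```

Code that calls raw.decode(sys.getfilesystemencoding()) directly fails with a TypeError when the result is None, which is why a wrapper of this kind ends up being needed on Jython 2.5.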
msg6948 Author: Pekka Klärck (pekka.klarck) Date: 2012-03-21.05:37:07
It turned out that using the 'file.encoding' property doesn't always work because Jython doesn't support all the encodings supported by the JVM. That ought to be pretty easy to fix, though, and I submitted a separate issue #1865 about it.
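
One way to guard against an encoding name that the JVM knows but the Python side does not is to check it with codecs.lookup before using it. The sketch below is only an illustration: usable_codec is a hypothetical helper name, and under Jython the name argument would come from java.lang.System.getProperty('file.encoding') as proposed above.

```python
import codecs


def usable_codec(name, fallback=None):
    """Return the normalized codec name if Python can look it up, else `fallback`."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # The JVM may report a charset for which no Python codec is registered.
        return fallback
```

A caller could then treat a None result as "keep the current behavior" instead of failing later during encoding or decoding.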
msg8309 Author: Jim Baker (zyasoft) Date: 2014-04-25.16:44:58
Need to fix for 2.7; a number of libraries we use depend on sys.getfilesystemencoding().

Can also remove the Jython-specific version of SimpleHTTPServer once this is resolved.
msg9292 Author: Jim Baker (zyasoft) Date: 2015-01-04.17:27:57
The title of the issue is currently misleading, given that per Python 2.7 docs (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding):

> getfilesystemencoding()
> Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used.

This is different from the file.encoding system property. Although I was unable to find an authoritative source on this as a standard property, conventionally it sets the default charset, http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#defaultCharset(); see http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
msg9293 Author: Jim Baker (zyasoft) Date: 2015-01-04.17:28:24
Changing the current behavior is much, much harder than it first appears. I have partially addressed it with the fix for #2239, but the problem is that the file system encoding for Jython is in some sense None: Jython simply uses Unicode paths, much like Java. Also, returning None is considered correct behavior: "returns None if the system default encoding is used" (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding).

Supporting anything else is a very big issue once we consider Java integration. In general, any solution should ensure that the following code snippet would always work:

[java.io.File(p).exists() for p in os.listdir('.')]

regardless of how these calls (java.io.File or os.listdir) might actually be wrapped.
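
The same round-trip requirement can be written as a small check. Since java.io.File is only available under Jython, the sketch below is a pure-Python analogue that uses os.path.lexists in its place. The function name listing_round_trips is an assumption for this example.

```python
import os


def listing_round_trips(directory):
    """Return True if every name from os.listdir() can be found again.

    os.path.lexists is used so that a dangling symlink still counts as found.
    """
    return all(
        os.path.lexists(os.path.join(directory, name))
        for name in os.listdir(directory)
    )
```

Whatever value sys.getfilesystemencoding() reports, a check of this shape should keep returning True for directories that contain non-ASCII names.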

Note that this is somewhat similar to Windows, which uses "mbcs" for its file system encoding. Also this problem goes away more or less with Jython 3.

Set priority to low accordingly: there is no straightforward perfect fix, which makes sense because it's an integration issue.
msg9317 Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.19:09:02
> Also this problem goes away more or less with Jython 3.

CPython 3 still has sys.getfilesystemencoding() and losslessly (i.e. without REPLACEMENT CHARACTERs) supports bytes paths...

$ rm -fr /tmp/some_dir
$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/ś
$ touch /tmp/some_dir/$'\x80'
$ touch /tmp/some_dir/aaa$'\x80\x81\x82\x83'aaa
$ python3.5 -c 'import os; print(os.listdir(b"/tmp/some_dir"))'
[b'aaa\x80\x81\x82\x83aaa', b'\xc5\x9b', b'\x80']
$ python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', 'ś', '\udc80']
$ LC_ALL="C" python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', '\udcc5\udc9b', '\udc80']
$ python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
utf-8
$ LC_ALL="C" python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
ascii
msg9318 Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.19:23:43
Although:

>>> "\udcc5\udc9b" == "ś"
False

But both paths refer to the same file:

>>> os.path.exists("/tmp/some_dir/\udcc5\udc9b")
True
>>> os.path.exists("/tmp/some_dir/ś")
True
>>> os.stat("/tmp/some_dir/\udcc5\udc9b") == os.stat("/tmp/some_dir/ś")
True
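
The reason both strings refer to the same file can be shown without touching the file system. CPython 3 decodes undecodable bytes with the surrogateescape error handler (PEP 383), so different text forms can map back to the same raw bytes. In this sketch the names decode_name and encode_name are hypothetical helpers, and the encodings are passed explicitly rather than taken from the locale.

```python
RAW_NAME = b"\xc5\x9b"  # UTF-8 bytes of the file name "ś"


def decode_name(raw, encoding):
    """Decode a raw file name the way CPython 3 does for listdir results."""
    return raw.decode(encoding, "surrogateescape")


def encode_name(name, encoding):
    """Encode a text file name back to the bytes handed to the OS."""
    return name.encode(encoding, "surrogateescape")


utf8_view = decode_name(RAW_NAME, "utf-8")    # the single character "ś"
ascii_view = decode_name(RAW_NAME, "ascii")   # two lone surrogates

# The two text forms differ, but both encode to the same bytes on disk.
same_bytes = encode_name(utf8_view, "utf-8") == encode_name(ascii_view, "utf-8")
```

Here utf8_view and ascii_view compare unequal, yet same_bytes is True, which matches the os.path.exists and os.stat results above.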
msg9448 Author: (akira) Date: 2015-01-23.20:26:01
sys.getfilesystemencoding() cannot return None since Python 3.2 [1].

[1] https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding
History
Date                 User          Action  Args
2015-01-23 20:26:01  akira         set     nosy: + akira; messages: + msg9448
2015-01-06 19:23:43  Arfrever      set     messages: + msg9318
2015-01-06 19:09:02  Arfrever      set     messages: + msg9317
2015-01-06 18:47:28  Arfrever      set     nosy: + Arfrever
2015-01-04 17:28:25  zyasoft       set     messages: + msg9293
2015-01-04 17:27:58  zyasoft       set     priority: low; messages: + msg9292
2014-04-25 16:44:58  zyasoft       set     nosy: + zyasoft; messages: + msg8309
2013-02-26 23:45:48  amak          set     nosy: + amak
2013-02-26 18:12:28  fwierzbicki   set     nosy: + fwierzbicki
2012-03-21 05:37:07  pekka.klarck  set     messages: + msg6948
2012-02-13 07:37:37  pekka.klarck  create