Created on 2012-02-13.07:37:37 by pekka.klarck, last changed 2017-04-22.06:20:49 by jeff.allen.
|msg6779 (view)||Author: Pekka Klärck (pekka.klarck)||Date: 2012-02-13.07:37:36|
With Jython 2.5.2 and earlier, sys.getfilesystemencoding() always returns None. This breaks code that tries to encode or decode strings at the system boundary. Example uses include decoding received command line arguments and encoding or decoding environment variables when setting or getting them. A working sys.getfilesystemencoding() could apparently also fix os.stat on Windows (issue #1658). I have tested that, at least on Ubuntu Linux and on WinXP with a Western locale, the value returned by java.lang.System.getProperty('file.encoding') seems to be the correct encoding to use. On Ubuntu I get UTF-8 both with that approach and with Python using sys.getfilesystemencoding(). On Windows file.encoding is Cp1252 and sys.getfilesystemencoding() on Python returns mbcs. Both of these are fine, as the former is the actual encoding and the latter is a special encoding that the operating system later translates to the correct encoding. Notice also that Jython doesn't support mbcs. Based on my experimentation, I propose that sys.getfilesystemencoding() be implemented using java.lang.System.getProperty('file.encoding').
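A minimal sketch of the kind of boundary code this affects, written for Python 3 so that it runs on a current interpreter. The helper name `decode_from_system` and the UTF-8 fallback are assumptions made for illustration; neither is something Jython provides. It shows why a None return value forces every caller to choose its own fallback.

```python
import sys


def decode_from_system(raw, fallback="utf-8"):
    """Decode bytes received from the OS (argv, environment, file names).

    `fallback` is a hypothetical default, used only when
    sys.getfilesystemencoding() returns None as described above.
    """
    encoding = sys.getfilesystemencoding() or fallback
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already text, nothing to decode


print(decode_from_system(b"example.txt"))
```

With a reliable sys.getfilesystemencoding(), the `fallback` argument would never be consulted.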
|msg6948 (view)||Author: Pekka Klärck (pekka.klarck)||Date: 2012-03-21.05:37:07|
It turned out that using the 'file.encoding' property doesn't always work, because Jython doesn't support all the encodings supported by the JVM. That ought to be pretty easy to fix, though, and I submitted a separate issue, #1865, about it.
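One way to guard against a JVM charset name that Python cannot handle is to look it up in the codec registry first. This is only a sketch: the function name `python_codec_for` and the UTF-8 fallback are assumptions, and in Jython the input would come from java.lang.System.getProperty('file.encoding') rather than from a literal.

```python
import codecs


def python_codec_for(java_charset, fallback="utf-8"):
    """Return a Python codec name for a JVM charset name such as 'Cp1252'.

    Falls back to `fallback` (an assumption of this sketch) when Python
    has no codec registered under that name.
    """
    try:
        return codecs.lookup(java_charset).name
    except LookupError:
        return fallback


for name in ("Cp1252", "UTF-8", "x-no-such-charset"):
    print(name, "->", python_codec_for(name))
```

Whether silently falling back is acceptable, or whether an unknown charset should raise, is a design decision this sketch does not settle.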
|msg8309 (view)||Author: Jim Baker (zyasoft)||Date: 2014-04-25.16:44:58|
Need to fix for 2.7; a number of libraries we use depend on sys.getfilesystemencoding(). Can also remove the Jython-specific version of SimpleHTTPServer once this is resolved.
|msg9292 (view)||Author: Jim Baker (zyasoft)||Date: 2015-01-04.17:27:57|
The title of the issue is currently misleading, given that per Python 2.7 docs (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding):

> getfilesystemencoding()
> Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used.

This is different than the file.encoding system property. Although I was unable to find an authoritative source on this as a standard property, conventionally this sets http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#defaultCharset(); see http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
|msg9293 (view)||Author: Jim Baker (zyasoft)||Date: 2015-01-04.17:28:24|
Changing this current behavior is much, much harder than it first appears. I have partially addressed it with the fix for #2239, but the problem is that the file system encoding for Jython is in some sense None - Jython simply uses Unicode paths, much like Java. Also returning None is considered correct behavior: "returns None if the system default encoding is used" (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding).

Supporting anything else is a very big issue once we consider Java integration. In general, any solution should ensure that the following code snippet would always work:

[java.io.File(p).exists() for p in os.listdir('.')]

regardless of how wrapped these calls (java.io.File or os.listdir) might actually be. Note that this is somewhat similar to Windows, which uses "mbcs" for its file system encoding. Also this problem goes away more or less with Jython 3.

Set priority accordingly to low: there is no straightforward perfect fix, which makes sense because it's an integration issue.
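The invariant above can be approximated without Java as a round-trip check: every name a directory listing returns must be usable to find the same file again. The sketch below is a CPython 3 analogue only, using os.path.exists in place of java.io.File(p).exists(); the function name `listing_round_trips` and the sample file names are made up for illustration.

```python
import os
import tempfile


def listing_round_trips(directory):
    """True if every name returned by os.listdir() can be found again."""
    return all(
        os.path.exists(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


with tempfile.TemporaryDirectory() as tmp:
    # One ASCII name and one non-ASCII name, created only for this check.
    for name in ("plain.txt", "caf\u00e9.txt"):
        open(os.path.join(tmp, name), "w").close()
    print(listing_round_trips(tmp))
```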
|msg9317 (view)||Author: Arfrever Frehtes Taifersar Arahesis (Arfrever)||Date: 2015-01-06.19:09:02|
> Also this problem goes away more or less with Jython 3.

CPython 3 still has sys.getfilesystemencoding() and losslessly (i.e. without REPLACEMENT CHARACTERs) supports bytes paths...

$ rm -fr /tmp/some_dir
$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/ś
$ touch /tmp/some_dir/$'\x80'
$ touch /tmp/some_dir/aaa$'\x80\x81\x82\x83'aaa
$ python3.5 -c 'import os; print(os.listdir(b"/tmp/some_dir"))'
[b'aaa\x80\x81\x82\x83aaa', b'\xc5\x9b', b'\x80']
$ python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', 'ś', '\udc80']
$ LC_ALL="C" python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['aaa\udc80\udc81\udc82\udc83aaa', '\udcc5\udc9b', '\udc80']
$ python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
utf-8
$ LC_ALL="C" python3.5 -c 'import sys; print(sys.getfilesystemencoding())'
ascii
|msg9318 (view)||Author: Arfrever Frehtes Taifersar Arahesis (Arfrever)||Date: 2015-01-06.19:23:43|
Although:

>>> "\udcc5\udc9b" == "ś"
False

But both paths refer to the same file:

>>> os.path.exists("/tmp/some_dir/\udcc5\udc9b")
True
>>> os.path.exists("/tmp/some_dir/ś")
True
>>> os.stat("/tmp/some_dir/\udcc5\udc9b") == os.stat("/tmp/some_dir/ś")
True
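The two strings name the same file because CPython 3 decodes file names with the surrogateescape error handler (PEP 383): each undecodable byte 0xNN becomes the lone surrogate U+DCNN and is turned back into the original byte on encoding. A small sketch of that round trip, independent of any file system; the variable names are illustrative only.

```python
raw = b"\xc5\x9b"  # the UTF-8 bytes of the file name 'ś'

as_utf8 = raw.decode("utf-8")                      # what a UTF-8 locale sees
as_ascii = raw.decode("ascii", "surrogateescape")  # what LC_ALL=C sees

# The two str objects differ ...
print(as_utf8 == as_ascii)
print(ascii(as_ascii))

# ... but each encodes back to exactly the bytes stored on disk.
print(as_utf8.encode("utf-8", "surrogateescape") == raw)
print(as_ascii.encode("ascii", "surrogateescape") == raw)
```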
|msg9448 (view)||Author: (akira)||Date: 2015-01-23.20:26:01|
sys.getfilesystemencoding() can't be None since Python 3.2   https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding
|msg11306 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-04-13.12:56:05|
This is related to #2536, in that I can't seem to fix that without getting the encoding of file paths straight too. I may have a "right answer".

Returning to Jython 2.7, at present we meet Jim's criterion:

>>> [java.io.File(p).exists() for p in os.listdir('.')]
[True, True, True, True, True, True, True, True, True, True]

and this is because we adapt to the file names Java gives us:

>>> os.listdir('.')
['argtest.py', u'c-\u5496\u5561', u'caf\xe9', 'dist', 'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

CPython does not unless asked specifically:

>>> os.listdir('.')
['argtest.py', 'c-??', 'caf\xe9', 'dist', 'mbcs.txt', 'p-??', 'test??.tmp', '?????????.txt', '??.txt', '??']
>>> os.listdir(u'.')
[u'argtest.py', u'c-\u5496\u5561', u'caf\xe9', u'dist', u'mbcs.txt', u'p-\u87d2\u86c7', u'test\u5496\u5561.tmp', u'\u05e9\u05b4\u05c1\u05d1\u05b9\u05bc\u05dc\u05b6\u05ea.txt', u'\u4e2d\u6587.txt', u'\U0001f40d']

Why aren't all our file names unicode objects? Because it breaks too many tests. That's ok because we only really test with ascii. (It's not really ok.) They break because the more widely that unicode filenames spread, the more places we discover that either unicode is not expected at all, or we don't deal with it correctly.
For example:

>>> import sys, os, os.path, java
>>> os.getcwd()
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\encoding\\c-\u5496\u5561'
>>> os.path.abspath('x.tmp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
  File "C:\Users\Épreuve\Documents\Python2\jython-int\dist\Lib\ntpath.py", line 471, in abspath
    path = sys.getPath(path).encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 46-47: ordinal not in range(256)

It is attractive, but I'm not sure it is feasible, to have all paths be unicode objects "on demand", for example:

>>> os.__file__
u'C:\\Users\\\xc9preuve\\Documents\\Python2\\jython-int\\dist\\Lib\\os$py.class'

I've tried this and it causes problems elsewhere (e.g. traceback.py and doctest.py) to which we could respond by making more and more things unicode, but there always seems to be more trouble, and we can't fix the same assumptions in user code.

An alternative is to encode to byte data the paths Java gives us, in enough places to keep the stdlib largely unmodified (such as the __file__ attribute). In that case we need to choose an encoding *for Jython*. Notice we are not trying to match an encoding chosen by the platform (e.g. 'mbcs' on Windows): Java has insulated us from that already. Rather we need an encoding simply because the stdlib forces byte paths on us (in Python 2), and the methods receiving paths in that form have to know how to *decode* them again for Java. There is no reason for this choice to vary with OS platform. UTF-8 is the obvious choice.

Now one could argue that this is different from CPython's use of "file system encoding", which tracks the platform's choice so that OS services may be called that have a bytes interface.
We interact with these services via Java (even jnr-posix wants String arguments), so in a sense the *platform* encoding is None, or moot, or unknown, while for CPython it actually matters. Nevertheless, as we still need a conversion for paths that Python requires to be byte data, we should advertise that through sys.getfilesystemencoding(), as the answer to "how are byte paths encoded".

This won't make all our problems go away immediately -- we're still doing the wrong thing in many places -- but I think it reduces them to problems for which there is a right answer, once you can figure out whether that java.lang.String is really a bytes object.
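The conversion being proposed amounts to a fixed, platform-independent pair of functions. The following is a sketch in Python 3 syntax, not the Jython implementation: the names `PATH_ENCODING`, `to_byte_path` and `from_byte_path` are hypothetical, and the sample path is borrowed from the directory listing earlier in this message.

```python
PATH_ENCODING = "utf-8"  # the fixed choice argued for above; the name is hypothetical


def to_byte_path(text_path):
    """Encode a text path, as Java would supply it, into a byte path."""
    return text_path.encode(PATH_ENCODING)


def from_byte_path(byte_path):
    """Decode a byte path back to text before handing it to Java."""
    return byte_path.decode(PATH_ENCODING)


original = "c-\u5496\u5561"
encoded = to_byte_path(original)
print(ascii(encoded))
print(from_byte_path(encoded) == original)
```

Because both directions use the same codec on every platform, the round trip is lossless for any path Java can represent as a well-formed string.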
|msg11314 (view)||Author: Pekka Klärck (pekka.klarck)||Date: 2017-04-20.11:24:54|
Jeff, I assume you meant some other issue than #2536, which itself is pretty interesting. As a user I'd be happier with Jython returning bytes using the same encoding as CPython, but universally using UTF-8 is not bad at all. Looking forward to Jython 3 and an end to this madness. =)
|msg11318 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-04-22.06:20:48|
Thanks. I believe it would be the same on most Unices. CPython on Windows isn't even the same as itself. To depend on consistency is to go against the advice given here: https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx concerning CP_ACP, which is the call CPython uses inside the 'mbcs' codec. Yes, I meant #2356. Er, Notlob.
|2017-04-22 06:20:49||jeff.allen||set||messages: + msg11318|
|2017-04-20 11:24:55||pekka.klarck||set||messages: + msg11314|
|2017-04-13 12:58:38||jeff.allen||link||issue2356 dependencies|
|2017-04-13 12:56:07||jeff.allen||set||priority: low -> normal|
messages: + msg11306
nosy: + jeff.allen
messages: + msg9448
|2015-01-06 19:23:43||Arfrever||set||messages: + msg9318|
|2015-01-06 19:09:02||Arfrever||set||messages: + msg9317|
|2015-01-06 18:47:28||Arfrever||set||nosy: + Arfrever|
|2015-01-04 17:28:25||zyasoft||set||messages: + msg9293|
|2015-01-04 17:27:58||zyasoft||set||priority: low|
messages: + msg9292
messages: + msg8309
|2013-02-26 23:45:48||amak||set||nosy: + amak|
|2013-02-26 18:12:28||fwierzbicki||set||nosy: + fwierzbicki|
|2012-03-21 05:37:07||pekka.klarck||set||messages: + msg6948|