Issue2239

classification
Title: os.listdir() broken with unicode characters in filenames
Type: Severity: normal
Components: Versions: Jython 2.7
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: zyasoft Nosy List: Arfrever, jeff.allen, zyasoft
Priority: urgent Keywords:

Created on 2014-12-27.04:43:01 by Arfrever, last changed 2015-01-07.05:27:39 by zyasoft.

Messages
msg9266 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2014-12-27.04:43:00
os.listdir() is broken with unicode characters in filenames, probably since one of the following revisions, which introduced 'throw new IllegalArgumentException("Cannot create PyString with non-byte value")' lines:
https://hg.python.org/jython/rev/f0c63b42e552
https://hg.python.org/jython/rev/521823de34a5

CPython 2.7 works correctly.

$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/ś
$ python2.7 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['\xc5\x9b']
$ jython2.7 -c 'import os; print(os.listdir("/tmp/some_dir"))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
        at org.python.core.PyString.<init>(PyString.java:57)
        at org.python.core.PyString.<init>(PyString.java:63)
        at org.python.core.PyString.createInstance(PyString.java:765)
        at org.python.modules.posix.PosixModule.listdir(PosixModule.java:499)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
$
msg9274 (view) Author: Jim Baker (zyasoft) Date: 2014-12-30.15:53:13
So os.listdir has interesting semantics with respect to working with Unicode paths that I was not aware of until now (being US/ASCII centric I suppose ;)

I created the following directory layout:
unicode
└── 首页

(Note that at least on OSX, the tree and ls commands don't work from the command line. cd does. So I manually created the above tree diagram!)

Python 2.7
>>> os.listdir(u"unicode")
[u'\u9996\u9875']
>>> os.listdir("unicode")
['\xe9\xa6\x96\xe9\xa1\xb5']

Trying again with Jython 2.7 trunk:
>>> os.listdir(u"unicode")
[u'\u9996\u9875']
>>> os.listdir("unicode")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
	at org.python.core.PyString.<init>(PyString.java:60)
	at org.python.core.PyString.<init>(PyString.java:66)
	at org.python.core.PyString.createInstance(PyString.java:776)
	at org.python.modules.posix.PosixModule.listdir(PosixModule.java:499)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

Similar behavior is seen in say glob.glob("unicode/*") vs glob.glob(u"unicode/*"). os.stat can work with paths that are Unicode strings, but not as bytestrings:

os.stat('unicode/\xe9\xa6\x96\xe9\xa1\xb5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory: '/Users/jbaker/test/unicode/\xe9\xa6\x96\xe9\xa1\xb5'

There is the outstanding and possibly related bug #2110 re updating our JNR Posix jar
msg9275 (view) Author: Jim Baker (zyasoft) Date: 2014-12-30.16:34:31
So this bug has been in Jython since at least 2.5:

$ jython25
Jython 2.5.4 (2.5:5ce837b1a1d8+, Dec 30 2014, 09:01:23)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> glob.glob("unicode/*")
['unicode/\u9996\u9875']
>>> os.stat('unicode/\u9996\u9875')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory: '/Users/jbaker/test/unicode/\\u9996\\u9875'

Jeff's fixes to ensure that no 16-bit characters sneak into PyString simply made it fail earlier, which is a very good thing indeed.

Probably the right thing to do is for os.listdir and similar os functions to emit PyUnicode if any characters > 127; we can see the obvious bug here from PosixModule.java:

    public static PyList listdir(PyObject path) {
        String absolutePath = absolutePath(path);
        File file = new File(absolutePath);
        String[] names = file.list();

        if (names == null) {
            // Can't read the path for some reason. stat will throw an error if it can't
            // read it either
            FileStat stat = posix.stat(absolutePath);
            // It exists, maybe not a dir, or we don't have permission?
            if (!stat.isDirectory()) {
                throw Py.OSError(Errno.ENOTDIR, path);
            }
            if (!file.canRead()) {
                throw Py.OSError(Errno.EACCES, path);
            }
            throw Py.OSError("listdir(): an unknown error occurred: " + path);
        }

        PyList list = new PyList();
        PyString string = (PyString) path;
        for (String name : names) {
            list.append(string.createInstance(name));
        }
        return list;

The point of string.createInstance(name) is that it will construct a PyString or PyUnicode based on the starting path; but this is also what caused the quiet bug earlier, and now the immediate failure we are seeing with Jeff's fixes.

The alternative is to try to simulate CPython here and return paths encoded using the underlying filesystem encoding (maybe UTF-8, maybe something else). But this is going too far: Java file paths are inherently already in Unicode.

But the most important reason is that we should be able to take os.listdir output and use with java.io.File, etc, etc. Java interoperability remains the most important reason for using Jython, after all.
msg9276 (view) Author: Jim Baker (zyasoft) Date: 2014-12-30.18:02:20
A related bug is that one cannot start Jython with the current working directory being Unicode. This one fails in a similar fashion:

$ cd 首页
jimbaker:首页 jbaker$ jython27
Exception in thread "main" Traceback (most recent call last):
  File "/Users/jbaker/jythondev/jython27/dist/Lib/site.py", line 62, in <module>
    import os
  File "/Users/jbaker/jythondev/jython27/dist/Lib/os.py", line 45, in <module>
    from posix import *
java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
	at org.python.core.PyString.<init>(PyString.java:60)
	at org.python.core.PyString.<init>(PyString.java:66)
	at org.python.core.Py.newString(Py.java:626)
	at org.python.modules.posix.PosixModule.getEnviron(PosixModule.java:896)
	at org.python.modules.posix.PosixModule.classDictInit(PosixModule.java:113)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
...
msg9277 (view) Author: Jeff Allen (jeff.allen) Date: 2014-12-30.18:11:33
I made it fail ugly as I expected there would be breakage, but didn't know how to look for it. (I found a few things.) The test suite does not stretch far in this direction, so it passes.

Interesting mention of java.io.File. I wonder how that deals with the conversion from what the filesystem might return and Unicode file names?
msg9278 (view) Author: Jim Baker (zyasoft) Date: 2014-12-30.18:31:04
@Jeff,

It's pretty straightforward - all paths are Unicode in Java. Apparently so are environment variables and their values. So if we look at PosixModule, it's trying to replicate what's specified (more or less) in https://docs.python.org/2/library/os.html#os.listdir by intercepting the returned String and making a PyString/PyUnicode as appropriate.

I believe the right choice for us is if the path is PyString, to only return a PyString if ascii, otherwise PyUnicode, because we don't actually support encoded strings anyway.

Likewise, we have a similar problem in os.environ, as supported by PosixModule.getEnviron:

    private static PyObject getEnviron() {
        PyObject environ = new PyDictionary();
        Map<String, String> env;
        try {
            env = System.getenv();
        } catch (SecurityException se) {
            return environ;
        }
        for (Map.Entry<String, String> entry : env.entrySet()) {
            environ.__setitem__(Py.newString(entry.getKey()), Py.newString(entry.getValue()));
        }
        return environ;
    }

https://github.com/jythontools/jython/blob/master/src/org/python/modules/posix/PosixModule.java#L896

Note that Python 3 separates out os.environ and os.environb:
https://docs.python.org/3/library/os.html#os.environ

So when I run on Python 3, os.environ has this entry:
'PWD': '/Users/jbaker/test/unicode/首页'

whereas on Python 2.7:
'PWD': '/Users/jbaker/test/unicode/\xe9\xa6\x96\xe9\xa1\xb5'

Compare Java: 
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#getenv()

And using Jython 2.5 in this same directory:
>>> System.getenv().get("PWD")
u'/Users/jbaker/test/unicode/\u9996\u9875'

And it's all related to that curious entity of surrogateescape
msg9279 (view) Author: Jim Baker (zyasoft) Date: 2014-12-30.19:23:01
Something like the following would work, with all current unit tests passing. (More unit tests to replicate this failing behavior would need to be added, of course.)

Note that bestString and castString are not the greatest names. Also, we have similar requirements in a variety of other places in the code, so it's reasonable to consider refactoring and placing in say org.python.core.Py. 

diff -r 07959ff1d8e9 src/org/python/modules/posix/PosixModule.java
--- a/src/org/python/modules/posix/PosixModule.java	Mon Dec 29 23:44:05 2014 -0700
+++ b/src/org/python/modules/posix/PosixModule.java	Tue Dec 30 12:19:50 2014 -0700
@@ -17,6 +17,7 @@
 import java.util.Iterator;
 import java.util.Map;
 
+import com.google.common.base.CharMatcher;
 import jnr.constants.Constant;
 import jnr.constants.platform.Errno;
 import jnr.posix.FileStat;
@@ -40,6 +41,7 @@
 import org.python.core.PyString;
 import org.python.core.PySystemState;
 import org.python.core.PyTuple;
+import org.python.core.PyUnicode;
 import org.python.core.imp;
 import org.python.core.io.FileDescriptors;
 import org.python.core.io.FileIO;
@@ -474,6 +476,18 @@
         "path: path of directory to list\n\n" +
         "The list is in arbitrary order.  It does not include the special\n" +
         "entries '.' and '..' even if they are present in the directory.");
+
+    private static PyString bestString(PyObject kind, String s) {
+        if (kind instanceof PyUnicode) {
+            return Py.newUnicode(s);
+        }
+        if (CharMatcher.ASCII.matchesAllOf(s)) {
+            return Py.newString(s);
+        } else {
+            return Py.newUnicode(s);
+        }
+    }
+
     public static PyList listdir(PyObject path) {
         String absolutePath = absolutePath(path);
         File file = new File(absolutePath);
@@ -494,9 +508,8 @@
         }
 
         PyList list = new PyList();
-        PyString string = (PyString) path;
         for (String name : names) {
-            list.append(string.createInstance(name));
+            list.append(bestString(path, name));
         }
         return list;
     }
@@ -884,6 +897,15 @@
      * Initialize the environ dict from System.getenv. environ may be empty when the
      * security policy doesn't grant us access.
      */
+
+    private static PyString castString(String s) {
+        if (CharMatcher.ASCII.matchesAllOf(s)) {
+            return Py.newString(s);
+        } else {
+            return Py.newUnicode(s);
+        }
+    }
+
     private static PyObject getEnviron() {
         PyObject environ = new PyDictionary();
         Map<String, String> env;
@@ -893,7 +915,7 @@
             return environ;
         }
         for (Map.Entry<String, String> entry : env.entrySet()) {
-            environ.__setitem__(Py.newString(entry.getKey()), Py.newString(entry.getValue()));
+            environ.__setitem__(castString(entry.getKey()), castString(entry.getValue()));
         }
         return environ;
     }
msg9281 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2014-12-31.16:21:04
I would suggest to try to return list with elements of the same values and types as in CPython.
It will be probably needed in Jython 3, since Python 3 has clear separation between bytes and strings.
msg9282 (view) Author: Jim Baker (zyasoft) Date: 2014-12-31.16:31:21
The problem with returning encoded strings, vs unicode, is that we cannot simply use the results of Python os operations, including ones that build on it, like glob, with Java IO, since they do not know about encoded strings.

For once, this may be one advantage of the conflation of strings and Unicode strings in Python 2.x. With Jython 3.x, there will also be a seamless transition, since it will be unicode anyway unless bytes are asked for specifically.

The other advantage is that this is then an easy fix, per the diff in my earlier message. Getting similar Java IO seamlessness with returning encoded strings from os.listdir is involved (of unknown scope), and doesn't correspond to anything else we have done. But Jython has always tried to make for a seamless experience between Python and Java; is it why people use Jython.

Boundaries always present challenges like these. Providing unicode simplifies what we need to do.
msg9283 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2014-12-31.20:33:00
How will Jython handle filenames containing sequence of bytes not decodable to unicode?
(Only NULL byte and "/" are not allowed in filenames in filesystems native in GNU/Linux.)
CPython 2.7 keeps them as bytes even when unicode string is passed to os.listdir().
CPython >=3.1 decodes bytes to strings using surrogateescape error handler.

$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/$'\x80'
$ python2.7 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['\x80']
$ python2.7 -c 'import os; print(os.listdir(u"/tmp/some_dir"))'
['\x80']
$ python3.5 -c 'import os; print(os.listdir(b"/tmp/some_dir"))'
[b'\x80']
$ python3.5 -c 'import os; print(os.listdir("/tmp/some_dir"))'
['\udc80']
msg9284 (view) Author: Jim Baker (zyasoft) Date: 2015-01-01.05:06:34
Arfrever, great question. But the underlying Java platform doesn't support this scenario anyway:

>>> from java.io import File
>>> File("/tmp/somedir").list()

returns None

I wouldn't rule out supporting bytestring filenames, with C support in a future release. But it's a lot of work, and it's not going to be in 2.7.0, per what we can reasonably triage :)
msg9291 (view) Author: Jim Baker (zyasoft) Date: 2015-01-04.17:09:19
Fixed as of https://hg.python.org/jython/rev/ea036792f304

There are two remaining issues:

1. Supporting non Unicode paths, if available, through C integration. Moved to #2245

2. Supporting bytes through all filesystem integration points including Java (eg java.io.File). Moved to #1839
msg9311 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.16:17:57
> But the underlying Java platform doesn't support this scenario anyway:
> 
> >>> from java.io import File
> >>> File("/tmp/somedir").list()
> 
> returns None

I get array with REPLACEMENT CHARACTER, not None:

$ jython2.7 -c 'from java.io import File; print(File("/tmp/some_dir").list())'
array(java.lang.String, [u'\ufffd'])
$ jython2.7 -c 'import unicodedata; print(unicodedata.name(u"\ufffd"))'
REPLACEMENT CHARACTER

I use IcedTea 2.5.3 (Java 1.7.0_71).
msg9312 (view) Author: Jim Baker (zyasoft) Date: 2015-01-06.17:09:16
Arfrever, that's a very interesting difference. When I tested, I was using OS X 10.10.1 (Mavericks) with Java 7 release 21.

However, paths returned with replacement characters are not going to be very useful. This seems like a fundamental limitation of Java because it has decided to use Unicode paths uniformly even if the underlying OS can support a more general path model.
msg9316 (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) Date: 2015-01-06.18:45:36
By the way, after https://hg.python.org/jython/rev/ea036792f304 os.listdir() also returns such REPLACEMENT CHARACTER for any unicode-incompatible byte.

$ rm -fr /tmp/some_dir
$ mkdir /tmp/some_dir
$ touch /tmp/some_dir/$'\x80'
$ touch /tmp/some_dir/aaa$'\x80\x81'aaa
$ jython2.7 -c 'import os; print(os.listdir("/tmp/some_dir"))'
[u'aaa\ufffd\ufffdaaa', u'\ufffd']
$

I think that this bug can be closed now.
History
Date User Action Args
2015-01-07 05:27:39zyasoftsetstatus: pending -> closed
2015-01-06 18:45:37Arfreversetmessages: + msg9316
2015-01-06 17:09:16zyasoftsetmessages: + msg9312
2015-01-06 16:17:58Arfreversetmessages: + msg9311
2015-01-04 17:10:21zyasoftsetstatus: open -> pending
resolution: fixed
2015-01-04 17:09:20zyasoftsetmessages: + msg9291
2015-01-01 05:06:35zyasoftsetmessages: + msg9284
2014-12-31 20:33:01Arfreversetmessages: + msg9283
2014-12-31 16:31:21zyasoftsetmessages: + msg9282
2014-12-31 16:21:05Arfreversetmessages: + msg9281
2014-12-30 19:23:02zyasoftsetmessages: + msg9279
2014-12-30 18:31:04zyasoftsetmessages: + msg9278
2014-12-30 18:11:33jeff.allensetmessages: + msg9277
2014-12-30 18:02:21zyasoftsetmessages: + msg9276
2014-12-30 16:34:32zyasoftsetassignee: zyasoft
messages: + msg9275
2014-12-30 15:53:14zyasoftsetnosy: + zyasoft
messages: + msg9274
2014-12-27 17:22:14zyasoftsetpriority: urgent
2014-12-27 04:43:01Arfrevercreate