Issue2820

classification
Title: Importing non-existing module fails with UnicodeDecodeError if sys.path contains non-ASCII characters
Type:
Severity: normal
Components: Core
Versions: Jython 2.7.2
Milestone: Jython 2.7.2
process
Status: open
Resolution: accepted
Dependencies:
Superseder:
Assigned To:
Nosy List: jeff.allen, pekka.klarck
Priority: normal
Keywords:

Created on 2019-11-03.22:09:04 by pekka.klarck, last changed 2019-11-05.22:07:30 by pekka.klarck.

Messages
msg12742 Author: Pekka Klärck (pekka.klarck) Date: 2019-11-03.22:09:04
Noticed this regression when testing Jython 2.7.2b2 on Linux. See the example below for a demonstration. It works fine with Jython 2.7.0 but seems to fail with Jython 2.7.1 as well.


Jython 2.7.2b2 (v2.7.2b2:b9b60766cabe, Nov 1 2019, 07:46:45) 
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_201
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append('hyv\xe4')
>>> import re       # existing modules can be imported fine
>>> import nonex    # this should fail with ImportError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 3: unexpected end of data
msg12744 Author: Jeff Allen (jeff.allen) Date: 2019-11-04.07:57:53
Thanks for testing. It's the same on Windows. As you can see, the problem is that when we encounter bytes in the context of file paths, we assume they are utf-8 encoded. A simpler test is:

f = open('hyv\xe4', 'w')

This works:

f = open(u'hyv\xe4', 'w')

But it means something different. (I now have a file called "hyvä".) Similarly sys.path.append(u'hyv\xe4') produces the effect you expect.
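
To make the difference concrete, here is a minimal sketch in plain Python (it should run on Python 2.7 or 3). It assumes, as described above, that bytes are decoded strictly as UTF-8. The helper name `try_decode` is invented for this illustration and is not a Jython API.

```python
# The byte string from the report: 'hyv' followed by the single byte 0xE4.
raw = b'hyv\xe4'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE4 starts a three-byte UTF-8 sequence, but nothing follows it,
# so strict UTF-8 decoding fails ("unexpected end of data").
as_utf8 = try_decode(raw, 'utf-8')

# In Latin-1 the same byte is simply the character U+00E4.
as_latin1 = try_decode(raw, 'latin-1')

# The Unicode string u'hyv\xe4' encodes to a different, valid UTF-8 byte sequence.
utf8_bytes = u'hyv\xe4'.encode('utf-8')
```

So the byte string only names the directory "hyvä" if it is read as Latin-1 (or a similar single-byte encoding), whereas the UTF-8 form of that name is `b'hyv\xc3\xa4'`.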

There is an argument (and it won amongst the developers of CPython) that file names are arbitrary sequences of bytes. Unfortunately (?), Java wants a String, and generally we have lost the encoding of the bytes by the time we need to produce it (since this does not just affect file names).

I found this helpful: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters . But we do not read locale information to set the file system encoding. A UTF-8 locale is almost universal these days on Linux.

Is the bug that, despite what would happen in an open() statement, the invalid directory should result in an ImportError? I.e. that should be the result of *anything* that goes wrong during an import?
msg12746 Author: Pekka Klärck (pekka.klarck) Date: 2019-11-04.11:19:46
I did some further investigation and compared how Jython and CPython behave on Linux and Windows.

CPython:

1. When you set `PYTHONPATH=föö` and run Python, `sys.path` contains `föö` as bytes encoded using the system encoding (UTF-8 on Linux, Windows-1252 on my Windows machine).

2. Programmatically it's possible to set new items to `sys.path` using Unicode strings.

3. System-encoded byte strings also work when set programmatically. This is understandable because the interpreter itself uses that format when `PYTHONPATH` is set externally.

4. `sys.path` entries in bytes that use an encoding other than the system encoding seem to be ignored. At least they don't cause any problems.


Jython 2.7.2b2:

1. When you set `JYTHONPATH=föö` and run Jython, `sys.path` contains `föö` as a Unicode string regardless of the operating system. This is different to CPython, but I don't think it matters because CPython also accepts Unicode entries in `sys.path`.

2. Programmatically it's possible to use Unicode strings. This is the same as with CPython.

3. Byte strings work only if you use UTF-8, regardless of the operating system. This is different to CPython and, in my opinion, it would be better to support system-encoded byte strings as CPython does. In practice this only affects Windows because other OSes generally use UTF-8. On Windows it's also possible to use Unicode strings, so at least there's a workaround.

4. Byte strings that aren't UTF-8 cause UnicodeDecodeError. It occurs with non-existing modules when such entries are at the end of `sys.path`, but it occurs *always* if these entries are at the beginning. This is, in my opinion, pretty severe. Combined with point 3 above, this means that a system-encoded `sys.path` entry that works with CPython can break *all* imports with Jython.
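
The Unicode workaround from point 3 can be sketched roughly like this. The function name `as_unicode_entry` and its `system_encoding` parameter are invented for the illustration. The caller has to supply the real system encoding (for example 'cp1252' on a Western European Windows machine), because the interpreter is not assumed here to report it.

```python
import sys


def as_unicode_entry(entry, system_encoding):
    """Decode a system-encoded byte string to Unicode; pass Unicode through unchanged."""
    if isinstance(entry, bytes):
        return entry.decode(system_encoding)
    return entry


# b'f\xf6\xf6' is 'föö' in Windows-1252. It is not valid UTF-8, so it is
# decoded explicitly before it goes anywhere near sys.path.
entry = as_unicode_entry(b'f\xf6\xf6', 'cp1252')
sys.path.append(entry)
```

Since both CPython and Jython accept Unicode entries, code written this way should behave the same on either.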
msg12748 Author: Pekka Klärck (pekka.klarck) Date: 2019-11-04.11:33:05
In my opinion the problem with `sys.path` entries that are bytes and cannot be decoded should be fixed. If I've understood the problem correctly, it shouldn't require anything more than ignoring such `sys.path` entries and moving to the next one.
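
As a rough sketch of that idea, written in plain Python and not taken from Jython's actual import machinery: the function name `usable_path_entries` is hypothetical, and UTF-8 is assumed as the encoding applied to byte-string entries.

```python
def usable_path_entries(entries, encoding='utf-8'):
    """Return path entries as text, skipping byte strings that cannot be decoded."""
    usable = []
    for entry in entries:
        if isinstance(entry, bytes):
            try:
                entry = entry.decode(encoding)
            except UnicodeDecodeError:
                # Treat a badly encoded entry like a directory that
                # does not exist and move on to the next one.
                continue
        usable.append(entry)
    return usable


# The undecodable first entry is dropped; the others survive as text.
result = usable_path_entries([b'hyv\xe4', u'hyv\xe4', b'lib'])
```

With this behaviour a missing module would end in the usual ImportError once all remaining entries have been searched.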

It would also be good to support the system encoding (not blindly UTF-8) for `sys.path` entries, to be compatible with CPython. Because Unicode strings work with both, I don't think this is a very high priority. It is probably better to direct limited development resources to more important issues or to Jython 3, where `sys.path` entries are always Unicode anyway.
msg12758 Author: Jeff Allen (jeff.allen) Date: 2019-11-05.20:51:51
Java insulates us from the encoding used for file names. As long as we see the file system exclusively through Java's operations, we can choose a conventional FS encoding. We choose UTF-8.

It may be that jnr.posix punctures this bubble. It would be the case, I suppose, if C (or CPython) wrote a directory listing to a file, taking a bytes approach to names, and we then read that into Jython. In that case, you would need to know the FS encoding to transcode it to UTF-8. I think this will be a problem for very few people.
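
For what it's worth, the transcoding step itself is small once the FS encoding is known. In this sketch the function name `transcode_names` is invented, and the listing is assumed to hold one raw name per line.

```python
def transcode_names(listing, fs_encoding):
    """Re-encode a newline-separated listing of raw file names as UTF-8 bytes."""
    names = listing.split(b'\n')
    return [name.decode(fs_encoding).encode('utf-8') for name in names if name]


# A listing written by a bytes-oriented program on a Latin-1 file system.
names = transcode_names(b'hyv\xe4\nplain\n', 'latin-1')
```

The hard part is not the conversion but knowing `fs_encoding` in the first place, which the bytes alone do not reveal.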

I agree there is a case for treating a badly-encoded sys.path entry (case 4 of your investigations) as no such directory, rather than surfacing the decoding error. I'm tentatively identifying it with 2.7.2.
msg12760 Author: Pekka Klärck (pekka.klarck) Date: 2019-11-05.22:07:30
I agree UTF-8 is a good encoding for Jython to use. We just needed to adjust our codebase to use it, rather than the real system encoding, when running on Jython and processing CLI arguments, environment variables, and so on. Luckily that wasn't a big task.

The issue with invalid sys.path entries isn't a huge problem for us. We have just one test for that functionality, and I can disable it with Jython if needed. It would still be nice if it were fixed.
History
Date                 User          Action  Args
2019-11-05 22:07:30  pekka.klarck  set     messages: + msg12760
2019-11-05 20:51:51  jeff.allen    set     resolution: accepted; messages: + msg12758; milestone: Jython 2.7.2
2019-11-04 11:33:05  pekka.klarck  set     messages: + msg12748
2019-11-04 11:19:47  pekka.klarck  set     messages: + msg12746
2019-11-04 07:57:54  jeff.allen    set     priority: normal; nosy: + jeff.allen; messages: + msg12744; components: + Core; versions: + Jython 2.7.2
2019-11-03 22:09:04  pekka.klarck  create