Issue2820

classification
Title: Import fails with UnicodeDecodeError if sys.path contains invalid UTF-8 bytes
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7.2
Milestone: Jython 2.7.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: jeff.allen, pekka.klarck
Priority: normal Keywords:

Created on 2019-11-03.22:09:04 by pekka.klarck, last changed 2020-02-01.08:54:41 by jeff.allen.

Messages
msg12742 (view) Author: Pekka Klärck (pekka.klarck) Date: 2019-11-03.22:09:04
Noticed this regression when testing Jython 2.7.2b2 on Linux. See the example below for a demonstration. Works fine with Jython 2.7.0 but seems to fail also with Jython 2.7.1.


Jython 2.7.2b2 (v2.7.2b2:b9b60766cabe, Nov 1 2019, 07:46:45) 
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_201
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append('hyv\xe4')
>>> import re       # existing modules can be imported fine
>>> import nonex    # this should fail with ImportError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 3: unexpected end of data
msg12744 (view) Author: Jeff Allen (jeff.allen) Date: 2019-11-04.07:57:53
Thanks for testing. It's the same on Windows. As you can see, the problem is that when we encounter bytes in the context of file paths, we assume they are utf-8 encoded. A simpler test is:

f = open('hyv\xe4', 'w')

This works:

f = open(u'hyv\xe4', 'w')

But it means something different. (I now have a file called "hyvä".) Similarly sys.path.append(u'hyv\xe4') produces the effect you expect.
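
The difference between the two spellings can be checked directly. What follows is a minimal sketch in plain Python, not Jython code; the explicit `b` prefix is only there so that it also runs on Python 3, whereas in Jython 2.7 the literal 'hyv\xe4' is already a byte string. The variable names are illustrative.

```python
raw = b'hyv\xe4'     # byte string from the report: 0xE4 is a-umlaut in Latin-1
text = u'hyv\xe4'    # Unicode string: U+00E4 is a-umlaut

# In UTF-8, 0xE4 announces a three-byte sequence, but the data ends there.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

as_latin1 = raw.decode('latin-1')   # the same bytes read as Latin-1
as_utf8 = text.encode('utf-8')      # the UTF-8 byte string for the same name
```

Here the UTF-8 decode fails, while the Latin-1 reading of `raw` equals `text`, and the UTF-8 form of the name needs two bytes for the last character.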

There is an argument (and it won amongst the developers of CPython) that file names are arbitrary sequences of bytes. Unfortunately (?), Java wants a String, and generally we have lost the encoding of the bytes by the time we need to produce it (since this does not just affect file names).

I found this helpful: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters . But we do not read locale information to set the file system encoding. A UTF-8 locale is almost universal these days on Linux.

Is the bug that, despite what would happen in an open() statement, the invalid directory should result in an ImportError? I.e. that should be the result of *anything* that goes wrong during an import?
msg12746 (view) Author: Pekka Klärck (pekka.klarck) Date: 2019-11-04.11:19:46
I did some further investigation and compared how Jython and CPython behave on Linux and Windows.

CPython:

1. When you set `PYTHONPATH=föö` and run Python, `sys.path` contains `föö` as a byte string encoded using the system encoding (UTF-8 on Linux, Windows-1252 on my Windows machine).

2. Programmatically it's possible to add new items to `sys.path` using Unicode strings.

3. System-encoded byte strings also work when set programmatically. This is understandable because the interpreter itself uses that format when `PYTHONPATH` is set externally.

4. `sys.path` entries that are byte strings in an encoding other than the system encoding seem to be ignored. At least they don't cause any problems.


Jython 2.7.2b2:

1. When you set `JYTHONPATH=föö` and run Jython, `sys.path` contains `föö` as a Unicode string regardless of the operating system. This is different from CPython, but I don't think it matters because CPython also accepts Unicode entries in `sys.path`.

2. Programmatically it's possible to use Unicode strings. This is the same as with CPython.

3. Byte strings work only if you use UTF-8, regardless of the operating system. This is different from CPython and, in my opinion, it would be better to support system-encoded byte strings the way CPython does. In practice this only affects Windows because other OSes generally use UTF-8. On Windows it's also possible to use Unicode strings, so at least there's a workaround.

4. Byte strings that aren't UTF-8 cause UnicodeDecodeError. It occurs with non-existing modules when such entries are at the end of `sys.path`, but it occurs *always* if these entries are at the beginning. This is, in my opinion, pretty severe. Together with point 3 above, this means that a system-encoded `sys.path` entry that works with CPython can break *all* imports with Jython.
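
Point 4 of the Jython list can be pictured with a toy model of a path search that decodes each entry only when it reaches it. This is a hypothetical sketch, not Jython's import code: `find_module_dir`, `available` and `outcome` are invented names, and the dictionary stands in for the file system.

```python
def find_module_dir(name, path, available):
    """Return the first directory in `path` that provides module `name`.

    `available` is a stand-in for the file system (an assumption of this
    sketch): it maps a directory name to the module names found there.
    """
    for entry in path:
        # Byte-string entries are decoded as UTF-8 when the search reaches them.
        directory = entry.decode('utf-8') if isinstance(entry, bytes) else entry
        if name in available.get(directory, set()):
            return directory
    raise ImportError('No module named ' + name)


available = {u'Lib': {'re'}}
bad = b'hyv\xe4'    # not valid UTF-8


def outcome(name, path):
    """Return the directory found, or the name of the exception raised."""
    try:
        return find_module_dir(name, path, available)
    except (ImportError, UnicodeDecodeError) as e:
        return type(e).__name__
```

With the bad entry last, `outcome('re', [u'Lib', bad])` finds the module before the entry is ever decoded, and only a missing module reaches it. With the bad entry first, every lookup hits the decoding error.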
msg12748 (view) Author: Pekka Klärck (pekka.klarck) Date: 2019-11-04.11:33:05
In my opinion the problem with `sys.path` entries that are bytes and cannot be decoded should be fixed. If I've understood the problem correctly, it shouldn't require anything more than ignoring such `sys.path` entries and moving to the next one.

It would also be good to support system encoding (not blindly UTF-8) with `sys.path` entries to be compatible with CPython. Because Unicode strings work with both, I don't think this is too high priority. Probably better to direct limited development resources to more important issues or to Jython 3 where `sys.path` entries are always Unicode anyway.
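
Both suggestions could be combined roughly as follows. This is only a sketch under assumptions: `decode_path_entry` is a hypothetical helper, not part of Jython or CPython, and returning None is just one way to signal "treat as no such directory".

```python
import sys


def decode_path_entry(entry, encoding=None):
    """Return `entry` as text, or None if it is a byte string that won't decode.

    When `encoding` is None, the file system encoding reported by the
    interpreter is used, falling back to UTF-8.
    """
    if not isinstance(entry, bytes):
        return entry                  # already a Unicode string
    if encoding is None:
        encoding = sys.getfilesystemencoding() or 'utf-8'
    try:
        return entry.decode(encoding)
    except UnicodeDecodeError:
        return None                   # caller skips this entry
```

For example, the Windows-1252 bytes for `föö` decode when `encoding='cp1252'` is passed, but yield None under `encoding='utf-8'`, so a caller would simply move on to the next entry.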
msg12758 (view) Author: Jeff Allen (jeff.allen) Date: 2019-11-05.20:51:51
Java insulates us from the encoding used for file names. As long as we see the file system exclusively through Java's operations, we can choose a conventional FS encoding. We choose UTF-8.

It may be that jnr.posix punctures this bubble. It would also be the case, I suppose, if C (or CPython), taking a bytes approach to names, wrote a directory listing to a file and we then read that into Jython. In that case, you would need to know the FS encoding to transcode it to UTF-8. I think this will be a problem for very few people.
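
The transcoding step in that scenario amounts to very little, provided the original encoding is known. A minimal sketch, in which `transcode_names` is a hypothetical helper and Latin-1 is an assumed source encoding chosen for illustration:

```python
def transcode_names(raw_names, fs_encoding):
    """Re-encode byte-string file names from `fs_encoding` to UTF-8."""
    return [name.decode(fs_encoding).encode('utf-8') for name in raw_names]


listing = [b'hyv\xe4', b'plain']    # names as a Latin-1 system would write them
utf8_names = transcode_names(listing, 'latin-1')
```

Without knowledge of `fs_encoding` the bytes alone do not say which character 0xE4 was meant to be.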

I agree there is a case for treating a badly-encoded sys.path entry (case 4 of your investigations) as no such directory, rather than surfacing the decoding error. I'm tentatively identifying it with 2.7.2.
msg12760 (view) Author: Pekka Klärck (pekka.klarck) Date: 2019-11-05.22:07:30
I agree UTF-8 is a good encoding for Jython to use. Just needed to adjust our codebase to use that, not the real system encoding, with Jython and when processing CLI arguments, environment variables, and so on. Luckily wasn't a big task.

The issue with invalid sys.path entries isn't a huge problem for us. We just have one test for that functionality but I can disable it with Jython if needed. Would anyway be nice if it was fixed.
msg12860 (view) Author: Jeff Allen (jeff.allen) Date: 2019-12-20.08:24:13
I believe I know where to handle this: seems like two or three places separately.

I think logging it as a non-fatal error is sensible, to give you a clue what is going on (at least if you give Jython a -v or two).
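
The idea of logging and carrying on can be sketched in plain Python. This is an illustration only, not the change made in Jython: the logger name `import_sketch` and the function `usable_path_entries` are hypothetical, and the standard `logging` module stands in for whatever mechanism -v controls.

```python
import logging

log = logging.getLogger('import_sketch')    # hypothetical logger name


def usable_path_entries(path, encoding='utf-8'):
    """Yield path entries as text, logging and skipping undecodable ones."""
    for entry in path:
        if isinstance(entry, bytes):
            try:
                entry = entry.decode(encoding)
            except UnicodeDecodeError:
                # Non-fatal: note the problem and go on to the next entry.
                log.info('Cannot decode path entry %r', entry)
                continue
        yield entry
```

The message only becomes visible once the logger's level is raised, for instance with `logging.basicConfig(level=logging.INFO)`, which plays the role of the -v option in this sketch.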
msg12866 (view) Author: Jeff Allen (jeff.allen) Date: 2019-12-21.19:48:28
It now behaves (locally) like this (with one -v given):

>>> import sys
>>> sys.path[:0] = ["hyv\xe4"]
>>> import csv
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG import csv # precompiled from ...\dist\Lib\csv$py.class
org.python.import CONFIG import _csv # builtin org.python.modules._csv._csv
org.python.import CONFIG import cStringIO # builtin org.python.modules.cStringIO

And for a non-existent package:

>>> import zzz
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG import encodings.gbk # precompiled from C:\Users\Jeff\Documents\Eclipse\jython-trunk\dist\Lib\encodings\gbk$py.class
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
org.python.import CONFIG Cannot decode path entry 'hyv\xe4'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named zzz

A couple of these are actually in the nested import of my console encoding, but it just shows how many times we consult sys.path.
msg12886 (view) Author: Jeff Allen (jeff.allen) Date: 2019-12-24.15:06:28
Now published at https://hg.python.org/jython/rev/3e46a80390fb
History
Date                 User          Action  Args
2020-02-01 08:54:41  jeff.allen    set     status: pending -> closed
2019-12-24 15:06:28  jeff.allen    set     status: open -> pending
                                           resolution: accepted -> fixed
                                           messages: + msg12886
                                           title: Importing non-existing module fails with UnicodeDecodeError if sys.path contains non-ASCII characters -> Import fails with UnicodeDecodeError if sys.path contains invalid UTF-8 bytes
2019-12-21 19:48:28  jeff.allen    set     type: behaviour
                                           messages: + msg12866
2019-12-20 08:24:14  jeff.allen    set     messages: + msg12860
2019-11-05 22:07:30  pekka.klarck  set     messages: + msg12760
2019-11-05 20:51:51  jeff.allen    set     resolution: accepted
                                           messages: + msg12758
                                           milestone: Jython 2.7.2
2019-11-04 11:33:05  pekka.klarck  set     messages: + msg12748
2019-11-04 11:19:47  pekka.klarck  set     messages: + msg12746
2019-11-04 07:57:54  jeff.allen    set     priority: normal
                                           nosy: + jeff.allen
                                           messages: + msg12744
                                           components: + Core
                                           versions: + Jython 2.7.2
2019-11-03 22:09:04  pekka.klarck  create