Issue1693

classification
Title: Unicode sys.path elements cause UnicodeErrors on import
Type: behaviour Severity: normal
Components: Core Versions: 2.5.2rc
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: otmarhumbel Nosy List: alex.gronholm, otmarhumbel, pjenvey
Priority: Keywords: patch

Created on 2011-01-03.23:31:15 by alex.gronholm, last changed 2011-01-07.01:15:58 by pjenvey.

Files
File name Uploaded Description
test_sys2_jy.py otmarhumbel, 2011-01-04.22:57:45 a unit test exposing the problem
1693-patch.txt otmarhumbel, 2011-01-06.21:23:59 proposed patch
imp_unicode_fix.diff pjenvey, 2011-01-07.00:21:54
Messages
msg6309 Author: Alex Grönholm (alex.gronholm) Date: 2011-01-03.23:31:15
Specifically, names with non-ascii characters in them. Whether the module you're trying to import exists or not is irrelevant.

CPython 2.5.5:
>>> sys.path.append(u'/home/alex/t/töö')
>>> import ttt
>>> 

Jython 2.5.2rc2:
>>> sys.path.append(u'/home/alex/t/töö')
>>> import ttt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alex/libs/jython2.5.2rc2/Lib/encodings/__init__.py", line 31, in <module>
    import codecs, types
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)
msg6311 Author: Oti Humbel (otmarhumbel) Date: 2011-01-04.22:57:45
The unit test in the attached file test_sys2_jy.py fails,
regardless of whether the first line
  # coding=latin2
is present or not.
msg6313 Author: Oti Humbel (otmarhumbel) Date: 2011-01-06.21:23:59
The changes in 1693-patch.txt solve the problem.

The fix is to encode a unicode string with latin-1 instead of ascii only (in __str__()).
javatests and regrtests all pass.

pjenvey: could you please review this? - thanks!
msg6314 Author: Alex Grönholm (alex.gronholm) Date: 2011-01-06.23:46:27
I'm sorry to say that this patch doesn't cut it, not by far.
Two reasons: first, using latin-1 encoding in PyUnicode breaks CPython compatibility (a UnicodeError should be thrown when u'åäö' is converted to str); second, you'd still get a UnicodeError when adding a path element with, say, Chinese characters. Why are sys.path elements being converted to bytestrings anyway?
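
Both objections can be checked with a small sketch. It assumes Python 2.6+ or 3.3+. The helper name `can_encode` and the second sample path (with Chinese characters) are made up for illustration and do not come from the report or the patch.

```python
def can_encode(text, encoding):
    """Return True if the unicode string can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = [
    u'/home/alex/t/t\xf6\xf6',   # the path from the report (t, o-umlaut, o-umlaut)
    u'/tmp/\u4e2d\u6587',        # hypothetical path with Chinese characters
]

for path in samples:
    for enc in ('ascii', 'latin-1', 'utf-8'):
        print('%r %s %s' % (path, enc, can_encode(path, enc)))
```

The ASCII codec rejects both paths. latin-1 accepts the first but still rejects the second, which is why switching the default codec to latin-1 only moves the failure.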
msg6315 Author: Philip Jenvey (pjenvey) Date: 2011-01-07.00:19:23
Alex is right. Our import system is converting sys.path items to Java Strings via item.__str__().toString(). CPython in this case converts unicode to strings by encoding them via the filesystem encoding.

We don't support a filesystem encoding on Jython (at this point anyway). Instead we've just been 'passing thru' unicode when it's requested (e.g. os.listdir).

That is technically broken (you could end up with a plain str where ord(somestr[0]) > 255), but I think we'll continue getting away with this strategy until 2.6. This is one of the few leftover bits of str/unicode weirdness carried over from 2.2, where unicode and str were pretty much the same object.
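
For comparison, the CPython behaviour mentioned in this message (encoding unicode path entries with the filesystem encoding) can be approximated as below. This is only a sketch, assuming Python 2.6+ or 3.3+. It is not CPython's or Jython's actual import code, `encode_path_entry` is a hypothetical helper, and the fallback to UTF-8 when no filesystem encoding is reported is an assumption made here.

```python
import sys

def encode_path_entry(entry, encoding=None):
    """Encode a unicode sys.path entry to a byte string.

    Uses the given encoding, else the filesystem encoding,
    else UTF-8 (an assumption of this sketch).
    """
    if isinstance(entry, bytes):
        return entry  # already a byte string, nothing to do
    encoding = encoding or sys.getfilesystemencoding() or 'utf-8'
    return entry.encode(encoding)

# Force UTF-8 here so the result does not depend on the local platform.
print(repr(encode_path_entry(u'/home/alex/t/t\xf6\xf6', 'utf-8')))
```

With a filesystem encoding that can represent the characters, the conversion succeeds where the implicit ASCII conversion in the traceback fails.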
msg6316 Author: Philip Jenvey (pjenvey) Date: 2011-01-07.00:21:54
something like this..
msg6317 Author: Philip Jenvey (pjenvey) Date: 2011-01-07.01:15:58
applied that patch and Oti's test in r7182
History
Date User Action Args
2011-01-07 01:15:58 pjenvey set status: open -> closed
resolution: fixed
messages: + msg6317
2011-01-07 00:21:54 pjenvey set files: + imp_unicode_fix.diff
keywords: + patch
messages: + msg6316
2011-01-07 00:19:23 pjenvey set messages: + msg6315
2011-01-06 23:46:27 alex.gronholm set messages: + msg6314
2011-01-06 21:24:00 otmarhumbel set files: + 1693-patch.txt
assignee: otmarhumbel
messages: + msg6313
nosy: + pjenvey
2011-01-04 22:57:45 otmarhumbel set files: + test_sys2_jy.py
nosy: + otmarhumbel
messages: + msg6311
2011-01-03 23:31:15 alex.gronholm create