Issue1592
Created on 2010-04-11.21:52:20 by pekka.klarck, last changed 2010-04-12.07:46:10 by pekka.klarck.
File name | Uploaded | Description
unnamed | pjenvey, 2010-04-11.22:50:07 |
msg5669 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2010-04-11.21:52:18 |
|
With CPython 2.6 on Ubuntu I can do:
args = [ unicode(a, sys.getfilesystemencoding()) for a in sys.argv[1:] ]
but on Jython 2.5.1 that fails with error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
The failure is most likely caused by sys.getfilesystemencoding() returning None on Jython, whereas on CPython it correctly returns UTF-8. The differences don't end there, though, because the arguments also arrive in a different format:
$ python -c "import sys; print sys.argv[1:]" ä €
['\xc3\xa4', '\xe2\x82\xac']
$ jython -c "import sys; print sys.argv[1:]" ä €
['\xe4', '\u20ac']
The values Jython gets would actually be correct without any decoding if their type were unicode instead of str. In their current form they cannot be used directly:
$ jython -c "import sys; print sys.argv[1] + u'\xe4'" ä
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
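The relationship between the two outputs above can be checked directly. The following is only an illustrative sketch that uses the literal values from the two runs, not code taken from either interpreter, and it assumes Python 2.6+ or Python 3. It shows that decoding the CPython byte strings as UTF-8 yields exactly the code points that Jython already places in its str objects.

```python
# Values copied from the two command-line runs above.
cpython_args = [b'\xc3\xa4', b'\xe2\x82\xac']  # UTF-8 encoded bytes from CPython
jython_code_points = [0xe4, 0x20ac]            # ordinals seen in Jython's str values

# Decoding the CPython bytes gives one character per argument ...
decoded = [raw.decode('utf-8') for raw in cpython_args]

# ... whose ordinals match what Jython holds without any decoding.
decoded_code_points = [ord(text) for text in decoded]
print(decoded_code_points == jython_code_points)
```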
|
msg5671 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2010-04-11.21:58:45 |
|
I think I found a workaround for this problem, or at least the following code prints the same correct results both on CPython 2.6 and Jython 2.5.1 on Ubuntu. Does anyone see problems in it or have some cleaner solution?
import sys
if sys.platform.startswith('java'):
    def _to_unicode(arg):
        return ''.join(unichr(ord(c)) for c in arg)
else:
    def _to_unicode(arg):
        return unicode(arg, sys.getfilesystemencoding())
args = [ _to_unicode(a) for a in sys.argv[1:] ]
for a in args:
    print a
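As a rough way to compare the two branches on a single interpreter, the same logic can be written with the platform check and the encoding passed in as parameters. This is a sketch only: arg_to_text is a hypothetical name, the inputs are the literal values reported earlier in this issue rather than a real sys.argv, and it assumes Python 2.6+ or Python 3.3+.

```python
try:
    _char = unichr  # Python 2 and Jython 2.x
except NameError:
    _char = chr     # Python 3, where str is already text


def arg_to_text(arg, on_jython, fs_encoding):
    """Hypothetical helper with the same two branches as the workaround."""
    if on_jython:
        # Each character already holds a code point; widen it to text.
        return u''.join(_char(ord(c)) for c in arg)
    # CPython 2 hands over encoded bytes, so decode them.
    return arg.decode(fs_encoding)


# Literal inputs standing in for what each interpreter reported for 'ä'.
from_cpython = arg_to_text(b'\xc3\xa4', False, 'utf-8')
from_jython = arg_to_text('\xe4', True, None)
print(from_cpython == from_jython)
```

Passing the flag explicitly is a choice made for this sketch; in real code it would come from sys.platform.startswith('java') as in the workaround.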
|
msg5672 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2010-04-11.22:14:57 |
|
This is a little better:
if sys.platform.startswith('java'):
    from java.lang import String
    _to_unicode = lambda arg: unicode(String(arg))
There's not much else we can do about this I think except wait for Python 3, so I'm closing this out for now =]
|
msg5674 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2010-04-11.22:46:53 |
|
Thanks for a better workaround. Couldn't that be done automatically for sys.argv?
|
msg5675 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2010-04-11.22:50:07 |
|
No, because it would be incompatible with Python 2. A plain str is expected.
|
msg5676 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2010-04-11.23:05:32 |
|
Personally I consider the current situation, where you get a wrong str, worse than getting a correct unicode. In the latter case `unicode(arg, sys.getfilesystemencoding())` would even work the same way in both CPython and Jython (although the fact that `unicode(x, None)` works on Jython at all is inconsistent with CPython).
Now that I know the workaround this isn't such a big problem anyway. Perhaps the best idea would be documenting this behavior somewhere.
|
msg5677 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2010-04-11.23:26:58 |
|
We just can't change objects that are expected to be str to unicode because they're incompatible in certain situations -- when you combine unicode with non-ascii strs you end up with UnicodeDecodeErrors.
Consider a value somehow created from or combined with part of the argv that a developer assumes is a str -- with this change it would become unicode. If that value is combined with a non-ascii str in some later part of his codebase a mysterious UnicodeDecodeError is raised.
Furthermore, tracking down the cause of that error can be really painful.
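The failure mode described here comes from the implicit decode that Python 2 applies to a str when it is combined with unicode, using the default 'ascii' encoding. A minimal sketch of that step, performed explicitly so that it behaves the same on Python 2.6+ and Python 3 (the variable names are illustrative only):

```python
non_ascii = b'\xe4'  # a non-ASCII byte string, like a Python 2 str holding Latin-1 'ä'

try:
    # Python 2 performs this decode implicitly when evaluating
    # u'...' + non_ascii with the default 'ascii' encoding.
    non_ascii.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

The error is raised wherever the combination happens, which may be far from where the value was created.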
|
msg5679 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2010-04-12.07:46:10 |
|
If it's not possible to actually fix this, I guess it's a matter of taste what kind of error is least problematic. Adding a note to the Jython documentation of sys.argv might be a good idea anyway.
In our code base adding a workaround for this problem revealed another Unicode issue, this time with os.listdir and non-ASCII files: issue #1593. It seems the root cause is the same as in this one.
|
|
Date | User | Action | Args
2010-04-12 07:46:10 | pekka.klarck | set | messages: + msg5679
2010-04-11 23:26:58 | pjenvey | set | messages: + msg5677
2010-04-11 23:05:32 | pekka.klarck | set | messages: + msg5676
2010-04-11 22:50:08 | pjenvey | set | files: + unnamed; messages: + msg5675
2010-04-11 22:46:53 | pekka.klarck | set | messages: + msg5674
2010-04-11 22:14:57 | pjenvey | set | status: open -> closed; resolution: invalid; messages: + msg5672; nosy: + pjenvey
2010-04-11 21:58:46 | pekka.klarck | set | messages: + msg5671
2010-04-11 21:52:20 | pekka.klarck | create |