Title: Converting cli arguments to Unicode fails
Type: Severity: normal
Components: Versions: 2.5.1
Status: closed Resolution: invalid
Dependencies: Superseder:
Assigned To: Nosy List: pekka.klarck, pjenvey
Priority: Keywords:

Created on 2010-04-11.21:52:20 by pekka.klarck, last changed 2010-04-12.07:46:10 by pekka.klarck.

Files: unnamed (uploaded by pjenvey, 2010-04-11.22:50:07)
msg5669 (view) Author: Pekka Klärck (pekka.klarck) Date: 2010-04-11.21:52:18
With CPython 2.6 on Ubuntu I can do:

    args = [ unicode(a, sys.getfilesystemencoding()) for a in sys.argv[1:] ]

but on Jython 2.5.1 that fails with error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

The failure is most likely caused by sys.getfilesystemencoding() returning None on Jython, whereas on CPython it correctly returns UTF-8. The differences don't end there, though, as the arguments arrive in a different format too:

$ python -c "import sys; print sys.argv[1:]" ä €
['\xc3\xa4', '\xe2\x82\xac']

$ jython -c "import sys; print sys.argv[1:]" ä €
['\xe4', '\u20ac']

The values Jython gets would actually be correct without any decoding if their type were unicode and not str: each character's ordinal is already the right code point. In this format they cannot be used directly:

$ jython -c "import sys; print sys.argv[1] + u'\xe4'" ä
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
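
To make the difference between the two argv formats concrete, here is a minimal sketch that runs on CPython 2.6+ or 3.x. The values are copied from the two outputs above; the variable names are hypothetical, and a UTF-8 terminal is assumed, as on the Ubuntu setup described.

```python
# Bytes CPython 2.6 reports for the arguments "ä" and "€"
# (assumption: the terminal encodes input as UTF-8).
cpython_args = [b'\xc3\xa4', b'\xe2\x82\xac']

# Decoding them with the terminal encoding gives the intended characters.
decoded = [a.decode('utf-8') for a in cpython_args]

# Jython 2.5.1 instead reports one str element per character, and each
# element's ordinal is already the Unicode code point.
jython_ordinals = [[0xe4], [0x20ac]]

# Compare the code points of the decoded text with what Jython exposes.
matches = [[ord(c) for c in text] == ords
           for text, ords in zip(decoded, jython_ordinals)]
print(matches)
```

If the comparison holds for both arguments, the Jython values need only a change of type, not a decode step.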
msg5671 (view) Author: Pekka Klärck (pekka.klarck) Date: 2010-04-11.21:58:45
I think I found a workaround for this problem, or at least the following code prints the same correct results both on CPython 2.6 and Jython 2.5.1 on Ubuntu. Does anyone see problems in it or have some cleaner solution?

import sys

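if sys.platform.startswith('java'):
    # Jython: each character already holds the right code point.
    def _to_unicode(arg):
        return ''.join(unichr(ord(c)) for c in arg)
else:
    # CPython: decode the raw bytes with the file system encoding.
    def _to_unicode(arg):
        return unicode(arg, sys.getfilesystemencoding())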

args = [ _to_unicode(a) for a in sys.argv[1:] ]
for a in args:
    print a
msg5672 (view) Author: Philip Jenvey (pjenvey) Date: 2010-04-11.22:14:57
This is a little better:

if sys.platform.startswith('java'):
    from java.lang import String
    _to_unicode = lambda arg: unicode(String(arg))

There's not much else we can do about this I think except wait for Python 3, so I'm closing this out for now =]
msg5674 (view) Author: Pekka Klärck (pekka.klarck) Date: 2010-04-11.22:46:53
Thanks for a better workaround. Couldn't that be done automatically for sys.argv?
msg5675 (view) Author: Philip Jenvey (pjenvey) Date: 2010-04-11.22:50:07
No, because it would be incompatible with Python 2. A plain str is expected.
msg5676 (view) Author: Pekka Klärck (pekka.klarck) Date: 2010-04-11.23:05:32
Personally I consider the current situation, where you get a wrong str, worse than getting a correct unicode object. In the latter case `unicode(arg, sys.getfilesystemencoding())` would even work the same way both in CPython and Jython (although the fact that `unicode(x, None)` works on Jython at all is inconsistent with CPython).

Now that I know the workaround this isn't such a big problem anyway. Perhaps the best idea would be documenting this behavior somewhere.
msg5677 (view) Author: Philip Jenvey (pjenvey) Date: 2010-04-11.23:26:58
We just can't change objects that are expected to be str to unicode because they're incompatible in certain situations -- when you combine unicode with non-ascii strs you end up with UnicodeDecodeErrors. 

Consider a value somehow created from or combined with part of the argv that a developer assumes is a str -- with this change it would become unicode. If that value is combined with a non-ascii str in some later part of his codebase a mysterious UnicodeDecodeError is raised. 

Furthermore, tracking down the cause of that error can be really painful.
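
The failure mode described above comes from Python 2 implicitly decoding a str with the default codec (ASCII, unless reconfigured) whenever it is combined with a unicode object. Below is a minimal sketch of that decode step, written so that it runs on both CPython 2.6+ and Python 3, where the implicit coercion no longer exists and has to be spelled out. The variable names are hypothetical.

```python
non_ascii_str = b'\xe4'   # what Python 2 calls a str holding a non-ASCII byte
text = u'caf'             # a unicode value from elsewhere in the code base

try:
    # Python 2 performs this decode implicitly for `text + non_ascii_str`.
    combined = text + non_ascii_str.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    combined = None
    error = exc

print(error)
```

The exception is raised at the point of combination, not where the offending value was created, which is why the origin can be hard to trace.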
msg5679 (view) Author: Pekka Klärck (pekka.klarck) Date: 2010-04-12.07:46:10
If it's not possible to actually fix this, I guess it's a matter of taste which kind of error is least problematic. Adding a note to the Jython documentation of sys.argv might be a good idea anyway.

In our code base adding a workaround for this problem revealed another Unicode issue, this time with os.listdir and non-ASCII files: issue #1593. It seems the root cause is the same as in this one.
History
Date                 User          Action  Args
2010-04-12 07:46:10  pekka.klarck  set     messages: + msg5679
2010-04-11 23:26:58  pjenvey       set     messages: + msg5677
2010-04-11 23:05:32  pekka.klarck  set     messages: + msg5676
2010-04-11 22:50:08  pjenvey       set     files: + unnamed; messages: + msg5675
2010-04-11 22:46:53  pekka.klarck  set     messages: + msg5674
2010-04-11 22:14:57  pjenvey       set     status: open -> closed; resolution: invalid; messages: + msg5672; nosy: + pjenvey
2010-04-11 21:58:46  pekka.klarck  set     messages: + msg5671
2010-04-11 21:52:20  pekka.klarck  create