Message11272

Author jeff.allen
Recipients bstjean, jeff.allen, liuxy_hes86, zyasoft
Date 2017-03-25.10:57:43
Message-id <1490439464.14.0.168783957337.issue2356@psf.upfronthosting.co.za>
In-reply-to
Content
I've fixed the build. The problem was that ANTLR generated files in file.encoding, and we then compiled them as UTF-8. That makes no difference to the *text*, but the *comments* contain the full source path, C:\Users\Épreuve\atelier\ ... blahblah ... . Now file.encoding=UTF-8.

I'm fighting the launcher now, in the shape of jython.py. One can easily create a complicated situation in which all sorts of encodings are in play. Just at the DOS and Python prompts:

> type argtest.py
# What do arguments appear as, when codepages intervene?
import sys, os, locale, subprocess
print sys.argv
for arg in sys.argv:
    print "%s ( %r )" % (arg, arg)

> chcp
Active code page: 850

> set TEST=Épreuve

> python -i argtest.py café crème %TEST%
['argtest.py', 'caf\xe9', 'cr\xe8me', '\xc9preuve']
argtest.py ( 'argtest.py' )
cafÚ ( 'caf\xe9' )
crÞme ( 'cr\xe8me' )
╔preuve ( '\xc9preuve' )

### Notice that sys.argv contains byte strings but they are
### not encoded with the console encoding cp850.
### The os module is using the same encoding.

>>> os.getcwd()
'C:\\Users\\\xc9preuve\\Documents\\Python2'
>>> print os.getcwd()
C:\Users\╔preuve\Documents\Python2
>>> print os.getcwdu()
C:\Users\Épreuve\Documents\Python2
>>> os.getenv('TEST')
'\xc9preuve'

### There are plenty of encodings to choose from.

>>> sys.stdout.encoding
'cp850'
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
>>> locale.getpreferredencoding()
'cp1252'

### But this one is consistent with what I'm seeing:

>>> for a in sys.argv: print a.decode(locale.getpreferredencoding())
...
argtest.py
café
crème
Épreuve
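
The garbling in the printed arguments can be reproduced away from a Windows console. What follows is a minimal sketch, assuming (as the session above suggests) that the argument bytes are cp1252 and the console renders them as cp850. The name as_console_shows is made up for illustration, and the code is written for Python 3, where text and bytes are distinct types:

```python
# -*- coding: utf-8 -*-
# Sketch: text encoded in the ANSI code page (assumed cp1252) and then
# rendered by a console that interprets the same bytes as cp850.

def as_console_shows(text, arg_encoding="cp1252", console_encoding="cp850"):
    """Return (raw bytes, the text a console_encoding console would display)."""
    raw = text.encode(arg_encoding)          # bytes as they arrive in sys.argv
    return raw, raw.decode(console_encoding)  # the same bytes, misread by the console

# One (bytes, displayed text) pair per argument used in the session.
results = [as_console_shows(word) for word in (u"café", u"crème", u"Épreuve")]
```

Each pair in results holds the byte string seen in sys.argv and the mis-rendered form the console displayed: 0xE9 becomes Ú, 0xE8 becomes Þ and 0xC9 becomes ╔.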

What fun! I *tentatively* conclude we must treat arguments and environment variables as encoded with locale.getpreferredencoding(). This also seems to be the acceptable encoding when we come to launch a subprocess:
>>> subprocess.call(["python", "argtest.py"] + sys.argv[1:])
['argtest.py', 'caf\xe9', 'cr\xe8me', '\xc9preuve']
argtest.py ( 'argtest.py' )
cafÚ ( 'caf\xe9' )
crÞme ( 'cr\xe8me' )
╔preuve ( '\xc9preuve' )

The point here is not that these print correctly, but they print the same as they did when I ran this from the DOS prompt.

Now, in jython.py, it's all driven from sys.stdout.encoding, which is different. We may even be calling encode() where we should be decoding. Or possibly we could just leave everything as bytes in the seemingly-consistent encoding of CPython and Windows. I'll see what I can do. (I'll try not to break jython.py for Linux, though it seems the minority case here.)
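
To make the decoding option concrete, here is a hedged sketch of a helper that decodes system-supplied byte strings using locale.getpreferredencoding() rather than sys.stdout.encoding. The function decode_system_bytes is hypothetical and is not taken from jython.py; it only illustrates the tentative conclusion above:

```python
import locale

def decode_system_bytes(values, encoding=None):
    """Decode byte strings from the OS (arguments, environment) to text.

    Assumption: the bytes are in locale.getpreferredencoding(), not in
    the console encoding reported by sys.stdout.encoding.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    # Items that are already text are passed through unchanged.
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in values]
```

Passing the encoding explicitly, as in decode_system_bytes([b'caf\xe9'], 'cp1252'), keeps the result independent of the machine it runs on.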

Eventually, when Jython launches again, I'll get to the bug(s) our French and Chinese users are experiencing, which pop up first in site.py.

But fighting jython.py has been instructive. There may be lessons from CPython here about what we should be doing internally to Jython when handling byte strings from the system via file system, environment and arguments.
History
Date User Action Args
2017-03-25 10:57:44 jeff.allen set messageid: <1490439464.14.0.168783957337.issue2356@psf.upfronthosting.co.za>
2017-03-25 10:57:44 jeff.allen set recipients: + jeff.allen, zyasoft, liuxy_hes86, bstjean
2017-03-25 10:57:44 jeff.allen link issue2356 messages
2017-03-25 10:57:43 jeff.allen create