Message11819

Author jeff.allen
Recipients amak, fwierzbicki, irmen, jeff.allen, ssoldatenko
Date 2018-03-17.09:16:50
Message-id <1521278211.27.0.467229070634.issue1807@psf.upfronthosting.co.za>
In-reply-to
Content
This is awfully old.

I used this modified program to investigate:
-------- y.py --------
# Jython issue 1807
# -*- coding: UTF-8 -*-

import sys
try:
    import java.lang
    enc = java.lang.System.getProperty("file.encoding")
except:
    enc = "cp936"

print enc
print sys.stdout.encoding
print sys.getdefaultencoding()

print(u"Ф".encode(enc))
print(u"Ф")
-------- y.py --------

The behaviour of Jython 2.7.2a1 observed on Windows is as expected at the console:

PS iss1807> jython "-Dfile.encoding=cp936" y.py
cp936
ms936
ascii
Ф
Ф
PS iss1807> jython "-Dfile.encoding=utf-8" y.py
utf-8
ms936
ascii
肖
Ф
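
For anyone puzzling over the second run, here is a small sketch of why the first character comes out as a CJK ideograph. It assumes the console is decoding bytes with code page 936 and that Python's "cp936" codec is a fair stand-in for it; the variable names are mine, not anything in y.py.

```python
# -*- coding: utf-8 -*-
# U+0424 is CYRILLIC CAPITAL LETTER EF, the character printed by y.py.
ef = u"\u0424"

utf8_bytes = ef.encode("utf-8")    # two bytes: D0 A4
cp936_bytes = ef.encode("cp936")   # the bytes a code page 936 console expects

# A console decoding with code page 936 reads the UTF-8 bytes D0 A4 as one
# double-byte character, so a CJK ideograph (U+8096) is shown instead.
misread = utf8_bytes.decode("cp936")

# Decoding with the codec that matches the bytes recovers the original.
recovered = cp936_bytes.decode("cp936")
```

So the explicit encode with file.encoding=utf-8 produces correct UTF-8, and it is only the console's interpretation of those bytes that differs.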

Any unicode written to a redirected stdout is written to file as UTF-16, little-endian with BOM. (I think we let Java handle it directly.) To my surprise, this encoding is chosen whatever the setting of the Java property file.encoding.

PS iss1807> jython "-Dfile.encoding=ms936" y.py > y936.txt
PS iss1807> filedump -Bx y936.txt

 ff fe 6d 00 73 00 39 00  ? ? m . s . 9 .
 33 00 36 00 0d 00 0a 00  3 . 6 . . . . .
 6d 00 73 00 39 00 33 00  m . s . 9 . 3 .
 36 00 0d 00 0a 00 61 00  6 . . . . . a .
 73 00 63 00 69 00 69 00  s . c . i . i .
 0d 00 0a 00 24 04 0d 00  . . . . $ . . .
 0a 00 24 04 0d 00 0a 00  . . $ . . . . .
                          EOF
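
As a cross-check on my reading of the dump, the bytes can be decoded back to text. This is only a sketch: the hex is copied by hand from the listing above, and the names DUMP_HEX, raw and lines are made up for the example.

```python
import codecs

# Hex bytes copied from the filedump listing of y936.txt above.
DUMP_HEX = """
ff fe 6d 00 73 00 39 00
33 00 36 00 0d 00 0a 00
6d 00 73 00 39 00 33 00
36 00 0d 00 0a 00 61 00
73 00 63 00 69 00 69 00
0d 00 0a 00 24 04 0d 00
0a 00 24 04 0d 00 0a 00
"""

# Strip all whitespace before converting the hex digits to bytes.
raw = bytes(bytearray.fromhex("".join(DUMP_HEX.split())))

# FF FE at the start is the UTF-16 little-endian byte order mark.
has_bom = raw.startswith(codecs.BOM_UTF16_LE)

# The "utf-16" codec consumes the BOM and picks the byte order from it.
text = raw.decode("utf-16")
lines = text.splitlines()
```

The decoded lines are the two encoding names, "ascii", and then U+0424 twice, so both the explicitly encoded print and the unicode print end up as the same character in the file.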

This is also how the ascii output from CPython is handled. Maybe it's the shell that is actually doing this?

sys.getdefaultencoding() is 'ascii'. It appears that CPython uses this encoding when a unicode object is printed to the redirected stdout (sys.stdout.encoding is None), since the program dies with an encoding error:

PS iss1807> python y.py > ycp.txt
Traceback (most recent call last):
  File "y.py", line 16, in <module>
    print(u"肖")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0424' in position 0: ordinal not in range(128)
PS iss1807> filedump -Bx ycp.txt

 ff fe 63 00 70 00 39 00  ? ? c . p . 9 .
 33 00 36 00 0d 00 0a 00  3 . 6 . . . . .
 4e 00 6f 00 6e 00 65 00  N . o . n . e .
 0d 00 0a 00 61 00 73 00  . . . . a . s .
 63 00 69 00 69 00 0d 00  c . i . i . . .
 0a 00 24 04 0d 00 0a 00  . . $ . . . . .
                          EOF
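
The failure can be imitated without redirecting anything. The helper below is hypothetical and only approximates what I believe CPython 2 does: use the stream's encoding if it has one, otherwise fall back to the default encoding ('ascii' here).

```python
def encode_for_stream(text, stream_encoding, default="ascii"):
    """Hypothetical helper: pick a codec roughly as print appears to."""
    return text.encode(stream_encoding or default)


ef = u"\u0424"  # CYRILLIC CAPITAL LETTER EF

# At the console, sys.stdout.encoding was reported as cp936: this succeeds.
console_bytes = encode_for_stream(ef, "cp936")

# Redirected, sys.stdout.encoding was None: the ascii fallback cannot
# represent the character, which matches the traceback above.
try:
    encode_for_stream(ef, None)
    redirect_error = None
except UnicodeEncodeError as exc:
    redirect_error = exc
```

This says nothing about who produces the UTF-16 in the file; it only illustrates why the unicode print dies while the explicitly encoded one does not.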

Overall, it feels to me like what we're doing is not wrong, and there is no reason to expect the contents of y.txt to be UTF-8 encoded as opposed to anything else. It may well be at the discretion of the shell and/or Java runtime, in which case fighting with it is likely to be a dispiriting experience. A comparison on Linux would be interesting.