Message11819
This is awfully old.
I used this modified program to investigate:
-------- y.py --------
# Jython issue 1807
# -*- coding: UTF-8 -*-
import sys
try:
    import java.lang
    enc = java.lang.System.getProperty("file.encoding")
except ImportError:
    # Not running on Jython (e.g. CPython): fall back to the console code page
    enc = "cp936"
print enc
print sys.stdout.encoding
print sys.getdefaultencoding()
print(u"Ф".encode(enc))
print(u"Ф")
-------- y.py --------
The behaviour of Jython 2.7.2a1 observed on Windows is as expected at the console:
PS iss1807> jython "-Dfile.encoding=cp936" y.py
cp936
ms936
ascii
Ф
Ф
PS iss1807> jython "-Dfile.encoding=utf-8" y.py
utf-8
ms936
ascii
肖
Ф
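The 肖 in the second run is what one would expect if the console decodes the UTF-8 bytes of Ф as cp936. A minimal sketch of that reading follows, written for Python 3 rather than Jython 2.7; the helper name as_seen_on_console and the assumption that the console decodes with cp936 are mine, not something taken from Jython itself.

```python
# -*- coding: utf-8 -*-
PHI = u"\u0424"  # CYRILLIC CAPITAL LETTER EF, the character printed by y.py


def as_seen_on_console(text, file_encoding, console_encoding="cp936"):
    """Encode text as y.py does, then decode it as the console is assumed to.

    Hypothetical helper: it models only the byte round trip, not the
    actual Jython or Windows console code paths.
    """
    return text.encode(file_encoding).decode(console_encoding)


# Encoding and console agree, so the character survives.
same = as_seen_on_console(PHI, "cp936")

# UTF-8 bytes read back as cp936 come out as a different character.
mangled = as_seen_on_console(PHI, "utf-8")

print(same == PHI)
print(mangled == u"肖")
```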
Any unicode written to a redirected stdout is written to file as UTF-16, little-endian with BOM. (I think we let Java handle it directly.) To my surprise, this encoding is chosen whatever the setting of the Java property file.encoding.
PS iss1807> jython "-Dfile.encoding=ms936" y.py > y936.txt
PS iss1807> filedump -Bx y936.txt
ff fe 6d 00 73 00 39 00 ? ? m . s . 9 .
33 00 36 00 0d 00 0a 00 3 . 6 . . . . .
6d 00 73 00 39 00 33 00 m . s . 9 . 3 .
36 00 0d 00 0a 00 61 00 6 . . . . . a .
73 00 63 00 69 00 69 00 s . c . i . i .
0d 00 0a 00 24 04 0d 00 . . . . $ . . .
0a 00 24 04 0d 00 0a 00 . . $ . . . . .
EOF
This is also how the ascii output from CPython is handled. Maybe it's the shell that is actually doing this?
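As a check on the reading of the y936.txt dump, the bytes can be decoded directly. This is a small sketch in Python 3; the hex string is copied from the filedump output above, and the variable names are only illustrative.

```python
# Bytes of y936.txt, copied row by row from the filedump output.
DUMP_HEX = (
    "ff fe 6d 00 73 00 39 00 "
    "33 00 36 00 0d 00 0a 00 "
    "6d 00 73 00 39 00 33 00 "
    "36 00 0d 00 0a 00 61 00 "
    "73 00 63 00 69 00 69 00 "
    "0d 00 0a 00 24 04 0d 00 "
    "0a 00 24 04 0d 00 0a 00"
)

raw = bytes.fromhex(DUMP_HEX)

# ff fe is the UTF-16 little-endian byte order mark.
has_le_bom = raw.startswith(b"\xff\xfe")

# The "utf-16" codec consumes the BOM and picks the byte order from it.
text = raw.decode("utf-16")
lines = text.split("\r\n")

print(has_le_bom)
print(lines)
```

Decoded this way, the file holds the same five lines as the console run, with CR LF line endings.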
sys.getdefaultencoding() is 'ascii'. It appears that CPython uses this encoding when a unicode object is printed to the redirected stdout (sys.stdout.encoding is None), since the program dies with an encoding error:
PS iss1807> python y.py > ycp.txt
Traceback (most recent call last):
File "y.py", line 16, in <module>
print(u"肖")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0424' in position 0: ordinal not in range(128)
PS iss1807> filedump -Bx ycp.txt
ff fe 63 00 70 00 39 00 ? ? c . p . 9 .
33 00 36 00 0d 00 0a 00 3 . 6 . . . . .
4e 00 6f 00 6e 00 65 00 N . o . n . e .
0d 00 0a 00 61 00 73 00 . . . . a . s .
63 00 69 00 69 00 0d 00 c . i . i . . .
0a 00 24 04 0d 00 0a 00 . . $ . . . . .
EOF
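Both CPython observations can be reproduced in miniature. The sketch below (Python 3) shows the ascii codec rejecting Ф, and then models the hypothesis that the shell decodes the program's cp936 bytes and re-encodes them as UTF-16LE. The function shell_reencode is a hypothetical model of the redirect, not a description of what PowerShell actually does internally.

```python
PHI = u"\u0424"  # the character y.py prints


def shell_reencode(raw, console_codec="cp936"):
    """Hypothetical model of the redirect: decode bytes, re-encode as UTF-16LE."""
    return raw.decode(console_codec).encode("utf-16-le")


# 1. The failure in the traceback: the default codec cannot represent the character.
try:
    PHI.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# 2. The line that did succeed: print(u"\u0424".encode("cp936")) emits cp936 bytes,
#    which under the assumed model end up in the file as UTF-16LE.
cp936_bytes = PHI.encode("cp936")
redirected = shell_reencode(cp936_bytes)

print(ascii_ok)
print(redirected == b"\x24\x04")
```

Under that assumption the result matches the 24 04 pair in the ycp.txt dump, which is consistent with the re-encoding happening outside the Python process.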
Overall, it feels to me like what we're doing is not wrong, and there is no reason to expect the contents of y.txt to be UTF-8 encoded as opposed to anything else. It may well be at the discretion of the shell and/or Java runtime, in which case fighting with it is likely to be a dispiriting experience. A comparison on Linux would be interesting.
Date | User | Action | Args
2018-03-17 09:16:51 | jeff.allen | set | messageid: <1521278211.27.0.467229070634.issue1807@psf.upfronthosting.co.za>
2018-03-17 09:16:51 | jeff.allen | set | recipients: + jeff.allen, fwierzbicki, amak, irmen, ssoldatenko
2018-03-17 09:16:51 | jeff.allen | link | issue1807 messages
2018-03-17 09:16:50 | jeff.allen | create |