Issue1807

classification
Title: Failed to set UTF-8 encoding for output redirected to file.
Type: behaviour
Severity: normal
Components: Core
Versions: Jython 2.7, Jython 2.5
Milestone:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: amak, fwierzbicki, irmen, jeff.allen, ssoldatenko
Priority: normal
Keywords:

Created on 2011-10-11.06:42:29 by ssoldatenko, last changed 2018-03-17.09:16:51 by jeff.allen.

Messages
msg6667 Author: Sam (ssoldatenko) Date: 2011-10-11.06:42:28
Failed to set UTF-8 encoding for output redirected to file.

--------x.py--------
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import sys

print sys.stdout.encoding
print sys.getdefaultencoding()

print(u"Ф")
--------x.py--------

=== Output to console works fine ===

$ java -jar jython.jar -Dpython.console.encoding=UTF-8 x.py
UTF-8
ascii
Ф

=== Output to file does not work ===

$ java -jar jython.jar -Dpython.console.encoding=UTF-8 x.py > tmp.txt 2>tmp2.txt
$ cat tmp.txt
None
ascii
$ cat tmp2.txt
Traceback (most recent call last):
  File "x.py", line 9, in <module>
    print(u"Ф")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0424' in position 0: ordinal not in range(128)
msg6711 Author: Irmen de Jong (irmen) Date: 2011-11-06.22:55:33
In PySystemState.java, initEncoding contains this code fragment:

            if (stdStream.isatty()) {
                stdStream.encoding = encoding;
            }

So the encoding is only applied if the stream is a tty.
When I removed the check, your code example worked fine.
I'm not sure why this check is there.
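
The effect of that check can be modelled in a few lines of Python. This is only an illustration of the logic in the quoted fragment, not Jython's code: FakeStream and init_encoding are hypothetical names made up for the sketch.

```python
class FakeStream(object):
    """Hypothetical stand-in for a standard stream; not a Jython class."""

    def __init__(self, tty):
        self._tty = tty
        self.encoding = None  # what sys.stdout.encoding would report

    def isatty(self):
        return self._tty


def init_encoding(stream, encoding):
    # Mirrors the quoted fragment: the encoding is set only for a tty.
    if stream.isatty():
        stream.encoding = encoding
    return stream.encoding


console = init_encoding(FakeStream(tty=True), "UTF-8")
redirected = init_encoding(FakeStream(tty=False), "UTF-8")
print(console, redirected)
```

The redirected stream keeps encoding None, which matches the "None" written to tmp.txt in the original report.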
msg6717 Author: Sam (ssoldatenko) Date: 2011-11-07.06:20:11
I think it is because of the property name '-Dpython.console.encoding=UTF-8'.
It is the CONSOLE encoding, not the output encoding...

Can we add the properties python.stdout.encoding and python.stderr.encoding?
Can we then change the stream initialization code to use them?

python.console.encoding - applied when stdout or stderr is a terminal.
python.stdout.encoding - applied when stdout is not a terminal, or when it is a terminal but python.console.encoding is not set.
python.stderr.encoding - the same rule as python.stdout.encoding, applied to stderr.
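
The proposed rules could be sketched roughly as follows. This is an assumption-laden illustration, not existing Jython behaviour: python.stdout.encoding and python.stderr.encoding are only the property names proposed above, resolve_encoding is a hypothetical helper, and a plain dict stands in for the Java system properties.

```python
def resolve_encoding(stream_name, is_tty, props):
    """Pick an encoding for 'stdout' or 'stderr' under the proposed rules.

    props is a plain dict standing in for the Java system properties.
    python.stdout.encoding and python.stderr.encoding are proposed names,
    not properties that Jython is known to read.
    """
    console = props.get("python.console.encoding")
    if is_tty and console is not None:
        # A terminal with the console property set uses the console encoding.
        return console
    # Redirected stream, or a terminal without python.console.encoding.
    return props.get("python.%s.encoding" % stream_name)


props = {
    "python.console.encoding": "cp866",
    "python.stdout.encoding": "UTF-8",
}
tty_choice = resolve_encoding("stdout", True, props)
file_choice = resolve_encoding("stdout", False, props)
stderr_choice = resolve_encoding("stderr", False, props)
print(tty_choice, file_choice, stderr_choice)
```

Returning None when no property applies leaves the stream without an encoding, which is the situation the report starts from.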
msg11819 Author: Jeff Allen (jeff.allen) Date: 2018-03-17.09:16:50
This is awfully old.

I used this modified program to investigate:
-------- y.py --------
# Jython issue 1807
# -*- coding: UTF-8 -*-

import sys
try:
    import java.lang
    enc = java.lang.System.getProperty("file.encoding")
except ImportError:
    enc = "cp936"

print enc
print sys.stdout.encoding
print sys.getdefaultencoding()

print(u"Ф".encode(enc))
print(u"Ф")
-------- y.py --------

The behaviour of Jython 2.7.2a1 observed on Windows is as expected at the console:

PS iss1807> jython "-Dfile.encoding=cp936" y.py
cp936
ms936
ascii
Ф
Ф
PS iss1807> jython "-Dfile.encoding=utf-8" y.py
utf-8
ms936
ascii
肖
Ф

Any unicode written to a redirected stdout is written to file as UTF-16, little-endian with BOM. (I think we let Java handle it directly.) To my surprise, this encoding is chosen whatever the setting of the Java property file.encoding.

PS iss1807> jython "-Dfile.encoding=ms936" y.py > y936.txt
PS iss1807> filedump -Bx y936.txt

 ff fe 6d 00 73 00 39 00  ? ? m . s . 9 .
 33 00 36 00 0d 00 0a 00  3 . 6 . . . . .
 6d 00 73 00 39 00 33 00  m . s . 9 . 3 .
 36 00 0d 00 0a 00 61 00  6 . . . . . a .
 73 00 63 00 69 00 69 00  s . c . i . i .
 0d 00 0a 00 24 04 0d 00  . . . . $ . . .
 0a 00 24 04 0d 00 0a 00  . . $ . . . . .
                          EOF

This is also how the ascii output from CPython is handled. Maybe it's the shell that is actually doing this?
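
The reading of the y936.txt dump as UTF-16 little-endian with a BOM can be checked by decoding the bytes. A minimal sketch, with the hex copied from the dump above; the variable names are only for illustration:

```python
import binascii
import codecs

# Hex bytes copied from the filedump of y936.txt.
dump = """
ff fe 6d 00 73 00 39 00
33 00 36 00 0d 00 0a 00
6d 00 73 00 39 00 33 00
36 00 0d 00 0a 00 61 00
73 00 63 00 69 00 69 00
0d 00 0a 00 24 04 0d 00
0a 00 24 04 0d 00 0a 00
"""

raw = binascii.unhexlify("".join(dump.split()))

# The first two bytes are the UTF-16 little-endian byte order mark.
has_le_bom = raw.startswith(codecs.BOM_UTF16_LE)

# The "utf-16" codec consumes the BOM and uses it to choose the byte order.
lines = raw.decode("utf-16").splitlines()
print(has_le_bom, lines)
```

Both of the last two lines decode to U+0424, so the encoded and the unicode print end up identical in the file.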

sys.getdefaultencoding() is 'ascii'. It appears that CPython uses this encoding when a unicode object is printed to the redirected stdout (sys.stdout.encoding is None), since the program dies with an encoding error:

PS iss1807> python y.py > ycp.txt
Traceback (most recent call last):
  File "y.py", line 16, in <module>
    print(u"肖")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0424' in position 0: ordinal not in range(128)
PS iss1807> filedump -Bx ycp.txt

 ff fe 63 00 70 00 39 00  ? ? c . p . 9 .
 33 00 36 00 0d 00 0a 00  3 . 6 . . . . .
 4e 00 6f 00 6e 00 65 00  N . o . n . e .
 0d 00 0a 00 61 00 73 00  . . . . a . s .
 63 00 69 00 69 00 0d 00  c . i . i . . .
 0a 00 24 04 0d 00 0a 00  . . $ . . . . .
                          EOF
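
The CPython failure can be reproduced without any redirection by encoding with the 'ascii' codec explicitly. The second half of this sketch shows one common way of making the encoding explicit, wrapping a byte stream in a UTF-8 writer; that is an assumption added here for illustration, not something proposed in this issue, and io.BytesIO merely stands in for a redirected stdout.

```python
import codecs
import io

text = u"\u0424"  # CYRILLIC CAPITAL LETTER EF, the character printed by y.py

# Encoding with the default 'ascii' codec fails, as in the traceback.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Wrapping a byte stream in a UTF-8 writer makes the encoding explicit.
# io.BytesIO stands in for a redirected stdout here.
buf = io.BytesIO()
writer = codecs.getwriter("utf-8")(buf)
writer.write(text + u"\n")
utf8_bytes = buf.getvalue()
print(ascii_failed, repr(utf8_bytes))
```

In a Python 2 style script the same wrapper could be applied to sys.stdout itself. Whether the shell then re-encodes those bytes on the way to the file is a separate question.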

Overall, it feels to me like what we're doing is not wrong, and there is no reason to expect the contents of y.txt to be UTF-8 encoded as opposed to anything else. It may well be at the discretion of the shell and/or Java runtime, in which case fighting with it is likely to be a dispiriting experience. A comparison on Linux would be interesting.
History
Date                 User         Action  Args
2018-03-17 09:16:51  jeff.allen   set     nosy: + jeff.allen
                                          messages: + msg11819
                                          versions: + Jython 2.7
2013-03-05 22:37:02  amak         set     keywords: - console
2013-02-25 22:02:28  amak         set     keywords: + console
2013-02-25 20:29:15  fwierzbicki  set     priority: normal
                                          nosy: + fwierzbicki
                                          versions: + Jython 2.5, - 2.5.2
2012-03-19 18:44:32  amak         set     nosy: + amak
2011-11-07 06:20:11  ssoldatenko  set     messages: + msg6717
2011-11-06 22:55:33  irmen        set     nosy: + irmen
                                          messages: + msg6711
2011-10-11 06:42:29  ssoldatenko  create