Issue2659

classification
Title: Determine console encoding without access violation (Java 9)
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone: Jython 2.7.2
process
Status: pending Resolution: fixed
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: amak, jeff.allen, otmarhumbel
Priority: normal Keywords: Java Roadmap, console

Created on 2018-03-26.22:11:00 by jeff.allen, last changed 2018-05-05.18:48:02 by jeff.allen.

Messages
msg11856 (view) Author: Jeff Allen (jeff.allen) Date: 2018-03-26.22:10:59
Related to #2656: WARNING: Illegal reflective access by org.python.core.PySystemState (file:/C:/Jython/2.7.2a1/jython.jar) to method java.io.Console.encoding()

Unlike most of the other illegal accesses found by the test suite, this one will pop up from more-or-less any interactive use of Jython, hence the separate ticket. Also, we may wish to discuss the solution separately.


Problem:

Bytes written to sys.stdout/err emerge on the real (OS/shell) console untranslated, eventually via System.out/err. And the reverse is true on the way in. So Python needs to know the encoding, when the data is not ascii text, and expects to be told it via sys.stdout.encoding (etc.).

In the case of the JLine console, which replaces System.in/out/err, we take the bytes written by Python and *decode* them to characters, so JLine can encode them again the other side of its character editing. You can never have too many codecs.

When nothing else tells us the console encoding, we obtain it by a reflective call to the private java.io.Console.encoding(), which Java 9 doesn't like and threatens to disallow. If even that fails, we use the property file.encoding, although this is dubious and generally misleading on Windows.


Solution proposed:

    do without the call that upsets Java 9,
    take a default supplied by the launcher (i.e. CPython), from interrogating sys.stdout.
    stop paying attention to file.encoding
    maybe use UTF-8 as a fixed last resort. (Or should it be None, meaning ASCII?)

I believe this makes the order of precedence (high to low):

    python.console.encoding (from the "post properties" supplied during initialisation)
    python.console.encoding (from system properties e.g. command line)
    python.console.encoding (from registry)
    PYTHONIOENCODING environment variable
    python.console.defaultencoding (from the launcher i.e. CPython) (NEW)
    UTF-8 (or None, meaning ASCII?) (NEW)

The last resort fixed encoding will only have effect if you don't use the launcher.

We can't simply specify python.console.encoding from the launcher because then this inference would take precedence over the registry and PYTHONIOENCODING.
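
A minimal sketch of the precedence order listed above, for illustration only. The function name `resolve_console_encoding` and its arguments are hypothetical: plain dictionaries stand in for the post properties, system properties, registry and environment, and `launcher_default` stands in for the proposed python.console.defaultencoding. It assumes PYTHONIOENCODING may take CPython's "encoding:errors" form; it is not the code in Jython.

```python
def resolve_console_encoding(post_props, system_props, registry, environ,
                             launcher_default=None, fallback="utf-8"):
    """Return the first console encoding found, highest precedence first."""
    key = "python.console.encoding"
    # PYTHONIOENCODING may be "encoding:errors"; keep only the encoding part.
    env_value = environ.get("PYTHONIOENCODING", "").split(":", 1)[0]
    candidates = [
        post_props.get(key),    # "post properties" given at initialisation
        system_props.get(key),  # e.g. -D on the command line
        registry.get(key),      # registry file
        env_value,              # PYTHONIOENCODING environment variable
        launcher_default,       # hypothetical stand-in for the launcher's hint
    ]
    for value in candidates:
        if value:
            return value
    # Fixed last resort (UTF-8 here is an assumption; the ticket leaves it open).
    return fallback
```

Because the launcher's hint sits below the registry and PYTHONIOENCODING in this list, an explicit user setting always wins over the inferred value.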
msg11857 (view) Author: Oti Humbel (otmarhumbel) Date: 2018-03-27.21:24:54
I was playing with the jdk9+ issues on a separate branch.

If you go to https://github.com/ohumbel/jython/tree/jdk9 and scroll down to the README you can find a table of the different encoding determination methods on some platforms.
To summarize:

 - the discouraged internal encoding() method only has an effect on Windows
 - there it returns the 'old' DOS code pages
 - on all other platforms, file.encoding equals defaultCharset()

So my conclusion was to use file.encoding on non-Windows platforms, and a subprocess 'cmd /c chcp' on Windows.

A possible implementation can be found in this commit: https://github.com/ohumbel/jython/commit/8fdf2c762aa054b7525eaaaf853b4e6a7bd69134

If this sounds reasonable, I would be happy to distill this into a new issue2659 branch and create a pull request.
msg11858 (view) Author: Oti Humbel (otmarhumbel) Date: 2018-03-27.21:40:17
chcp returns exactly the same encoding as java.io.Console.encoding() did. This way we can keep it backwards compatible.

The downside is the slower startup time on Windows. But maybe we only have to spawn a subprocess if none of the properties / registry / env vars is set.
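
A rough sketch of the idea of spawning chcp only when nothing else is configured. The names `parse_chcp_output` and `windows_console_encoding` are hypothetical, and mapping the number to a "cp" prefixed name is an assumption; the linked commit may do this differently. The parser takes the last run of digits because the wording around the number can vary.

```python
import re
import subprocess


def parse_chcp_output(text):
    """Extract the code page number from chcp output, e.g. 'cp936'."""
    # Take the last run of digits; the surrounding words may be localised.
    match = re.search(r"(\d+)\D*$", text.strip())
    return "cp" + match.group(1) if match else None


def windows_console_encoding(configured=None):
    """Return the configured encoding, spawning chcp only when none is set."""
    if configured:
        # A property, registry entry or environment variable already decided.
        return configured
    try:
        output = subprocess.check_output(["cmd", "/c", "chcp"])
    except (OSError, subprocess.CalledProcessError):
        # Not on Windows, or the command failed: let the caller fall back.
        return None
    return parse_chcp_output(output.decode("ascii", "replace"))
```

With this shape the subprocess cost is paid only on the path where no explicit setting exists.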
msg11859 (view) Author: Jeff Allen (jeff.allen) Date: 2018-03-28.08:01:54
That's great, Oti! It's really useful that you tried all those platform combinations. I was surprised you observed Console.encoding() to return null on Unix-like systems. That's another reason to do without it.

On balance, I think running chcp (for Windows) is better than depending on the launcher, since it covers Jython run in other ways. This is nicely done as a separate class. I would probably make it private so we have freedom to change. And raise ConsoleEncoding.get() to where getPlatformEncoding() is called. (It's a poor name, anyway.)

If it's a good approach for Windows, why not also for Unix-alikes? Is "locale charmap" portable?

I'm not sure about file.encoding, even as a last resort. It isn't for specifying the console encoding. I think going directly to a hard-coded fall-back is the honest choice (ascii or utf-8).

If you indicate you'd like to provide a change set I'll hold off. Otherwise I'll base something on your investigation (and credit you).
msg11863 (view) Author: Jeff Allen (jeff.allen) Date: 2018-04-01.07:00:35
It looks like sun.stdout.encoding might be a good source from Java 8 onwards. 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/d38fed7d2ea7
although, as the name suggests, we should be prepared not to find it set:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/native/libjava/System.c#l267
msg11876 (view) Author: Oti Humbel (otmarhumbel) Date: 2018-04-05.05:35:42
It is ok for me if you make a suggestion and give me some credit. Please go ahead!

And a private class sounds totally ok.

I am not sure about "locale charmap". I tried it out on Ubuntu, Fedora, OpenSUSE and openindiana. The result always was "UTF-8".
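
For comparison with the chcp approach, a small sketch of asking a Unix-like system for its character map. The function name `locale_charmap` is hypothetical, and returning None when the command is missing or fails is an assumption about how a caller would want to fall back.

```python
import subprocess


def locale_charmap():
    """Return the output of `locale charmap` (e.g. 'UTF-8'), or None."""
    try:
        output = subprocess.check_output(["locale", "charmap"])
    except (OSError, subprocess.CalledProcessError):
        # Command not present or it failed: the caller should fall back.
        return None
    name = output.decode("ascii", "replace").strip()
    return name or None
```

The result follows the environment of the spawned process, so it reflects variables such as LANG and LC_ALL as inherited from the caller.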
msg11946 (view) Author: Jeff Allen (jeff.allen) Date: 2018-05-05.08:24:26
Here's a puzzle. In a Powershell window where I have:
PS jython-jvm9> chcp
Active code page: 936

On Java 7:
Jython 2.7.2a1+ (default:6b912cfd485a+, May 4 2018, 23:21:27)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.io import InputStreamReader, BufferedReader
>>> from java.lang import ProcessBuilder
>>> pb = ProcessBuilder(["cmd", "/c", "chcp"])
>>> p = pb.start()
>>> r = BufferedReader( InputStreamReader( p.getInputStream() ) )
>>> r.readLine()
u'Active code page: 936'

Hurrah! On Java 8:
PS jython-jvm9> dist\bin\jython
Jython 2.7.2a1+ (default:6b912cfd485a+, May 4 2018, 23:21:27)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_151
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> from java.io import InputStreamReader, BufferedReader
>>> from java.lang import ProcessBuilder
>>> pb = ProcessBuilder(["cmd", "/c", "chcp"])
>>> p = pb.start()
>>> r = BufferedReader( InputStreamReader( p.getInputStream() ) )
>>> r.readLine()
u'Active code page: 850'

:( And on Java 9 it's the same.

In a further twist, if I call pb.inheritIO() before I spawn the command, then it senses the console encoding correctly, but the message I want goes directly to the console (Java 9):
>>> p = pb.inheritIO().start()
>>> Active code page: 936

It looks like spawning off chcp doesn't do the trick on all versions, but on exactly those where it fails (from Java 8 onwards) I have a registry entry:

>>> from java.lang import System
>>> System.getProperty("sun.stdout.encoding")
u'ms936'

Together these cover the observed cases, but it all feels a tad precarious. It depends on behaviours I don't think are guaranteed. OTOH if it fails I get cp850, which is not the end of the world and can be diagnosed by the user.
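
The combination described above could be sketched as follows. Everything here is illustrative: `default_console_encoding` is a hypothetical name, `properties` is a plain dictionary standing in for the Java system properties, `probe` is any callable that asks the OS (for example by spawning chcp), and the UTF-8 fallback is an assumption.

```python
def default_console_encoding(properties, probe, fallback="utf-8"):
    """Prefer sun.stdout.encoding, then an OS probe, then a fixed fallback."""
    name = properties.get("sun.stdout.encoding")
    if name:
        # Observed to be set from Java 8 onwards, but not guaranteed.
        return name
    try:
        name = probe()
    except Exception:
        # Any probe failure is treated as "unknown".
        name = None
    return name or fallback
```

Checking the property first means the subprocess is consulted only where the property is absent, which matches the cases observed in this message.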
msg11948 (view) Author: Jeff Allen (jeff.allen) Date: 2018-05-05.18:48:01
Now in the repository at: https://hg.python.org/jython/rev/9185f0a117f0

This was developed on Windows but now also verified on my Linux system to test the other code path:

jeff@amos ~/eclipse/jython-trunk $ LANG=el_GR.iso88597 dist/bin/jython -m test.regrtest -e
== 2.7.2a1+ (default:9185f0a117f0, May 5 2018, 15:11:55) 
== [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)]
== platform: java1.8.0_161
== encodings: stdin=ISO-8859-7, stdout=ISO-8859-7, FS=utf-8
== locale: default=('el_GR', 'ISO-8859-7'), actual=(None, None)

This must be the result of running "locale charmap" as the property "sun.stdout.encoding" is not present in this JVM. (Oti's idea extended to Linux.) It shows that the encoding ends up where you'd like, and the file system encoding is still utf-8.

Quite a few tests failed for me in the Greek locale, but the same is true before the change.
History
Date User Action Args
2018-05-05 18:48:02 jeff.allen set status: open -> pending, resolution: fixed, messages: + msg11948
2018-05-05 08:24:27 jeff.allen set messages: + msg11946
2018-04-05 05:35:42 otmarhumbel set messages: + msg11876
2018-04-01 08:07:44 jeff.allen link issue2656 dependencies
2018-04-01 07:00:36 jeff.allen set messages: + msg11863
2018-03-28 08:01:56 jeff.allen set messages: + msg11859
2018-03-27 21:40:18 otmarhumbel set messages: + msg11858
2018-03-27 21:24:55 otmarhumbel set nosy: + otmarhumbel, messages: + msg11857
2018-03-27 10:48:19 amak set nosy: + amak
2018-03-26 22:11:00 jeff.allen create