Issue2659

classification
Title: Determine console encoding without access violation (Java 9)
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone: Jython 2.7.2
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: amak, jeff.allen, otmarhumbel
Priority: normal Keywords: Java Roadmap, console

Created on 2018-03-26.22:11:00 by jeff.allen, last changed 2018-04-05.05:35:42 by otmarhumbel.

Messages
msg11856 (view) Author: Jeff Allen (jeff.allen) Date: 2018-03-26.22:10:59
Related to #2656: WARNING: Illegal reflective access by org.python.core.PySystemState (file:/C:/Jython/2.7.2a1/jython.jar) to method java.io.Console.encoding()

Unlike most of the other illegal accesses found by the test suite, this one will pop up from more-or-less any interactive use of Jython, hence the separate ticket. Also, we may wish to discuss the solution separately.


Problem:

Bytes written to sys.stdout/err emerge on the real (OS/shell) console untranslated, eventually via System.out/err. And the reverse is true on the way in. So, whenever the data is not ASCII text, Python needs to know the encoding, and expects to be told it via sys.stdout.encoding (etc.).

In the case of the JLine console, which replaces System.in/out/err, we take the bytes written by Python and *decode* them to characters, so JLine can encode them again the other side of its character editing. You can never have too many codecs.

When nothing else tells us the console encoding, we obtain it by a reflective call to the private java.io.Console.encoding(), which Java 9 doesn't like and threatens to disallow. If even that fails, we use the property file.encoding, but this is dubious and generally misleading on Windows.


Solution proposed:

    do without the call that upsets Java 9,
    take a default supplied by the launcher (i.e. CPython), from interrogating sys.stdout.
    stop paying attention to file.encoding
    maybe use UTF-8 as a fixed last resort. (Or should it be None, meaning ASCII?)

I believe this makes the order of precedence (high to low):

    python.console.encoding (from the "post properties" supplied during initialisation)
    python.console.encoding (from system properties e.g. command line)
    python.console.encoding (from registry)
    PYTHONIOENCODING environment variable
    python.console.defaultencoding (from the launcher i.e. CPython) (NEW)
    UTF-8 (or None, meaning ASCII?) (NEW)

The last resort fixed encoding will only have effect if you don't use the launcher.

We can't simply specify python.console.encoding from the launcher because then this inference would take precedence over the registry and PYTHONIOENCODING.
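
A rough Java sketch of the precedence listed above follows. It is an illustration only, not the code in PySystemState: the class and method names are hypothetical, the three property sources are passed in as plain Properties objects, and the environment is passed as a map so the logic can be tried without touching real process state. It assumes the launcher would hand over python.console.defaultencoding as a system property, and it shows UTF-8 as the last resort, which is only one of the two options under discussion.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: illustrates the proposed order of precedence only.
final class ConsoleEncodingChooser {

    static final String ENCODING = "python.console.encoding";
    static final String DEFAULT_ENCODING = "python.console.defaultencoding";
    static final String LAST_RESORT = "UTF-8"; // could equally be ASCII; undecided

    private ConsoleEncodingChooser() {}

    static String choose(Properties post, Properties system, Properties registry,
            Map<String, String> env) {
        // 1 to 3: an explicit python.console.encoding, highest precedence first.
        for (Properties p : Arrays.asList(post, system, registry)) {
            String enc = clean(p.getProperty(ENCODING));
            if (enc != null) {
                return enc;
            }
        }
        // 4: PYTHONIOENCODING has the form "encoding" or "encoding:errors".
        String io = clean(env.get("PYTHONIOENCODING"));
        if (io != null) {
            String enc = clean(io.split(":", 2)[0]);
            if (enc != null) {
                return enc;
            }
        }
        // 5: the launcher's guess (assumed here to arrive as a system property).
        String dflt = clean(system.getProperty(DEFAULT_ENCODING));
        if (dflt != null) {
            return dflt;
        }
        // 6: fixed last resort, reached only when nothing above is set.
        return LAST_RESORT;
    }

    // Treat null, empty and all-blank values as "not set".
    private static String clean(String s) {
        if (s == null) {
            return null;
        }
        s = s.trim();
        return s.isEmpty() ? null : s;
    }
}
```

Because the launcher's value is read from a different key, it cannot mask a registry entry or PYTHONIOENCODING, which is the point of not setting python.console.encoding from the launcher.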
msg11857 (view) Author: Oti Humbel (otmarhumbel) Date: 2018-03-27.21:24:54
I was playing with the jdk9+ issues on a separate branch.

If you go to https://github.com/ohumbel/jython/tree/jdk9 and scroll down to the README you can find a table of the different encoding determination methods on some platforms.
To summarize:

 - the discouraged internal encoding() method only has an effect on Windows
 - there it returns the 'old' DOS code pages
 - on all other platforms, file.encoding equals defaultCharset()

So my conclusion was to use file.encoding on non-Windows platforms, and a subprocess 'cmd /c chcp' on Windows.

A possible implementation can be found in this commit: https://github.com/ohumbel/jython/commit/8fdf2c762aa054b7525eaaaf853b4e6a7bd69134

If this sounds reasonable, I would be happy to distill this into a new issue2659 branch and create a pull request.
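
For readers without the commit to hand, the chcp idea might look roughly like the sketch below. It is not the class from the linked commit: the names WindowsCodePage, parse and query are hypothetical. It assumes the output ends in the code page number (for example "Active code page: 850"), relies only on the digits because the message text may be localised, and maps code page 65001 to UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the implementation in the linked commit.
final class WindowsCodePage {

    private static final Pattern DIGITS = Pattern.compile("(\\d+)");

    private WindowsCodePage() {}

    // Turn chcp output into an encoding name, or null if no number is found.
    static String parse(String chcpOutput) {
        if (chcpOutput == null) {
            return null;
        }
        Matcher m = DIGITS.matcher(chcpOutput);
        String last = null;
        while (m.find()) {
            last = m.group(1); // keep the last run of digits in the output
        }
        if (last == null) {
            return null;
        }
        // 65001 is the Windows code page number for UTF-8.
        return last.equals("65001") ? "UTF-8" : "cp" + last;
    }

    // Run "cmd /c chcp"; return null if it cannot be run or parsed.
    static String query() {
        try {
            Process p = new ProcessBuilder("cmd", "/c", "chcp")
                    .redirectErrorStream(true).start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    p.getInputStream(), StandardCharsets.US_ASCII))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            return p.waitFor() == 0 ? parse(out.toString()) : null;
        } catch (IOException e) {
            return null; // e.g. not on Windows, so there is no cmd to start
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

Keeping parse separate from query lets the parsing be checked on any platform, and a null result leaves the caller free to fall back to whatever comes next.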
msg11858 (view) Author: Oti Humbel (otmarhumbel) Date: 2018-03-27.21:40:17
chcp returns exactly the same encoding as java.io.Console.encoding() did. This way we can keep it backwards compatible.

The downside is the slower startup time on Windows. But maybe we only have to spawn a subprocess if none of the properties / registry / env vars is set.
msg11859 (view) Author: Jeff Allen (jeff.allen) Date: 2018-03-28.08:01:54
That's great, Oti! It's really useful that you tried all those platform combinations. I was surprised you observed Console.encoding() to return null on Unix-like systems. That's another reason to do without it.

On balance, I think running chcp (for Windows) is better than depending on the launcher, since it covers Jython run in other ways. This is nicely done as a separate class. I would probably make it private so we have freedom to change. And raise ConsoleEncoding.get() to where getPlatformEncoding() is called. (It's a poor name, anyway.)

If it's a good approach for Windows, why not also for Unix-alikes? Is "locale charmap" portable?

I'm not sure about file.encoding, even as a last resort. It isn't for specifying the console encoding. I think going directly to a hard-coded fall-back is the honest choice (ascii or utf-8).

If you indicate you'd like to provide a change set I'll hold off. Otherwise I'll base something on your investigation (and credit you).
msg11863 (view) Author: Jeff Allen (jeff.allen) Date: 2018-04-01.07:00:35
It looks like sun.stdout.encoding might be a good source from Java 8 onwards. 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/d38fed7d2ea7
although, as the name suggests, we should be prepared not to find it set:
http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/native/libjava/System.c#l267
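
A minimal sketch of reading that property defensively is given below. The class and method names are hypothetical, and checking the value against the charsets Java supports is an assumption made for the example, not something decided in this ticket.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: reads sun.stdout.encoding, tolerating its absence.
final class StdoutEncodingProperty {

    private StdoutEncodingProperty() {}

    // Return the trimmed value if Java recognises it as a charset, else null.
    static String fromProperty(String value) {
        if (value == null) {
            return null;
        }
        String name = value.trim();
        if (name.isEmpty()) {
            return null;
        }
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalCharsetNameException e) {
            return null; // syntactically illegal charset name
        }
    }

    // Look the property up; null means "not set or not usable".
    static String get() {
        try {
            return fromProperty(System.getProperty("sun.stdout.encoding"));
        } catch (SecurityException e) {
            return null; // a security manager refused access to the property
        }
    }
}
```

A null from get() simply means the search moves on to the next source.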
msg11876 (view) Author: Oti Humbel (otmarhumbel) Date: 2018-04-05.05:35:42
It is OK with me if you make a suggestion and give me some credit. Please go ahead!

And a private class sounds totally ok.

I am not sure about "locale charmap". I tried it out on Ubuntu, Fedora, openSUSE and OpenIndiana. The result was always "UTF-8".
History
Date                 User         Action  Args
2018-04-05 05:35:42  otmarhumbel  set     messages: + msg11876
2018-04-01 08:07:44  jeff.allen   link    issue2656 dependencies
2018-04-01 07:00:36  jeff.allen   set     messages: + msg11863
2018-03-28 08:01:56  jeff.allen   set     messages: + msg11859
2018-03-27 21:40:18  otmarhumbel  set     messages: + msg11858
2018-03-27 21:24:55  otmarhumbel  set     nosy: + otmarhumbel; messages: + msg11857
2018-03-27 10:48:19  amak         set     nosy: + amak
2018-03-26 22:11:00  jeff.allen   create