Issue2234

classification
Title: PythonInterpreter and parser mis-handle encoding
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen
Priority: normal Keywords: console

Created on 2014-11-27.21:18:47 by jeff.allen, last changed 2014-12-11.23:47:45 by jeff.allen.

Messages
msg9221 Author: Jeff Allen (jeff.allen) Date: 2014-11-27.21:18:45
http://permalink.gmane.org/gmane.comp.lang.jython.user/10517

This user has wrapped the InteractiveConsole in a BeanShell widget that supplies console streams with java.io.Reader/Writer interfaces. He is able to type in Chinese on these streams. In Jython 2.2, the print command would echo a literal string containing Chinese characters as typed. Since 2.5 this has not been possible, and the work we did on the Jython console in 2.7 has not restored the expected behaviour.

Looking at the user's code and our source, I estimate the problem is with the stream-handling in class PythonInterpreter, or perhaps with the way PyFileReader supports it. 16-bit characters are being accepted here as the content of a PyString, then mishandled.

Part of the problem here is not using the unicode type. However, there is also a bug in Jython.

The fix for #2037 will help, in that the content check on PyString would raise an error in this case, but it doesn't fix the problem that character text is being treated as bytes.

There may be a work-around by avoiding the Reader/Writer interface to PythonInterpreter and working only with encoded text as bytes.
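
As a rough illustration of why the bytes route is unambiguous, here is a minimal sketch. It runs on CPython 3 rather than Jython (an assumption, made so the snippet stands alone), and the names source_bytes and namespace are hypothetical. When source arrives as encoded bytes with a PEP 263 declaration, the compiler does the decoding itself and nobody has to guess whether a char is really a byte.

```python
# Sketch only: CPython 3 behaviour, used to illustrate the idea.
# This is not the Jython code path discussed in this issue.

# Source handed over as encoded bytes, with an explicit coding declaration.
source_bytes = (
    b"# coding: utf-8\n"
    + u"greeting = u'\u4f60\u597d'\n".encode("utf-8")
)

namespace = {}
# The compiler reads the declaration and decodes the bytes accordingly.
exec(source_bytes, namespace)
```

After this runs, namespace["greeting"] holds the two Chinese characters intact. Whether the equivalent InputStream route through PythonInterpreter behaves the same way in Jython 2.7 is exactly what remains to be checked.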
msg9223 Author: Jeff Allen (jeff.allen) Date: 2014-12-01.22:57:17
In what may be part of the same problem (or that arises from some attempts at a work-around), we may be wrong to look for an encoding declaration in the case where Unicode is supplied here:
https://hg.python.org/jython/file/849ec9c291db/src/org/python/core/ParserFacade.java#l281

At any rate, the logic seems fragile that decides when a String is really a byte-like object, and when it might actually be UTF-16. Is cflags.source_is_utf8 saying quite what we mean?
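
To make the fragility concrete, here is a small sketch of the kind of test one might be tempted to use. The function could_be_bytes is hypothetical and is not taken from the Jython source; it only illustrates that "every char fits in a byte" cannot tell encoded bytes from short Unicode text.

```python
def could_be_bytes(text):
    """Hypothetical check: True if every character fits in one byte (0..255)."""
    return all(ord(ch) < 256 for ch in text)


chinese = u"\u4e2d\u6587"   # real Unicode text, chars above 255
latin = u"caf\xe9"          # Unicode text whose chars all happen to be below 256

# UTF-8 bytes of the Chinese text, carried one byte per char (a "byte-like" string).
utf8_as_chars = chinese.encode("utf-8").decode("latin-1")

results = {
    "chinese": could_be_bytes(chinese),              # False: clearly not bytes
    "latin": could_be_bytes(latin),                  # True, yet it is text
    "utf8_as_chars": could_be_bytes(utf8_as_chars),  # True, and it is bytes
}
```

The last two cases give the same answer for opposite reasons, so the content of the String alone cannot settle which interpretation was intended.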
msg9227 Author: Jeff Allen (jeff.allen) Date: 2014-12-11.23:47:44
Ok, the behaviour of exec is exactly what PEP-263 requires:
>>> u
u'# Test encoding line\n# coding= iso-8859-15\nprint u"caf\xe9 du c\u0153ur"\n\n'
>>> exec(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 0
SyntaxError: encoding declaration in Unicode string

PythonInterpreter doesn't quite mirror this:

>>> from org.python.util import PythonInterpreter
>>> pi = PythonInterpreter()
>>> pi.exec(u)
café du c?ur

The reason is that both calls end up at PythonInterpreter.exec(String), which then treats the String as bytes.
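
One plausible way to arrive at the output above can be sketched in plain Python. This is an assumption about the mechanism and not a trace of the Jython code: each char is squeezed into a byte, and the result is then decoded with the declared iso-8859-15 encoding.

```python
# Sketch under assumptions: models "String treated as bytes", not Jython internals.
text = u"caf\xe9 du c\u0153ur"

# Treat each char as one byte. U+0153 does not fit in a byte and becomes '?'.
as_bytes = text.encode("latin-1", errors="replace")

# Decode the bytes using the encoding named in the coding declaration.
shown = as_bytes.decode("iso-8859-15")
```

Here shown matches the console output: the \xe9 survives as an accented e because it fits in a byte, while the \u0153 ligature is lost.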

A bunch of this code accepts either a String or a Reader, but in a few places quietly assumes char is byte. Things that accept an InputStream are unambiguous that one is dealing with bytes, but it's not entirely clear how the encoding is remembered and used. I'm extending test_pythoninterpreter_jy to a range of non-ascii cases.
History
Date                 User        Action  Args
2014-12-11 23:47:45  jeff.allen  set     priority: normal
                                         messages: + msg9227
2014-12-01 22:57:18  jeff.allen  set     assignee: jeff.allen
                                         messages: + msg9223
                                         title: PythonInterpreter.setIn(Reader) ignores encoding -> PythonInterpreter and parser mis-handle encoding
2014-11-27 21:18:47  jeff.allen  create