Issue1836

classification
Title: Invalid Unicode characters cause compile-time error (CPython divergence)
Type: behaviour
Severity: normal
Components: Core
process
Status: closed
Resolution: wont fix
Assignee: fwierzbicki
Nosy: amak, fwierzbicki, gsnedders, jeff.allen
Priority: normal

Created on 2012-02-02.22:14:18 by jeff.allen, last changed 2013-04-07.15:06:02 by gsnedders.

Messages
msg6768 (view) Author: Jeff Allen (jeff.allen) Date: 2012-02-02.22:14:17
In the present tip (2b4f725d4d29 date Tue Jan 03 09:34:18 2012 -0800), the Jython compiler rejects a string literal that contains invalid Unicode characters. This behaviour diverges from CPython. As a result, valid Python programs, including the regression test CPythonLib\test\test_bytes.py, fail to run.

In interactive mode, Jython seems to miss the end of the string:

Jython 2.6a0+ (, Feb 2 2012, 19:46:58)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_26
>>> a = u"Hello world\n\u1234\u5678\u9abc\udef0"
...
...
... "
  File "<stdin>", line 4
    "
    ^
SyntaxError: no viable alternative at character '"'
>>>

In the same situation, CPython accepts the literal, although a subsequent attempt to transcode it, for example to print it, may fail at run-time.

Python 2.7.2 (default, Jun 12 2011, 14:24:46) [MSC v.1500 64 bit (AMD64)] on win32
>>> a = u"Hello world\n\u1234\u5678\u9abc\udef0"
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python\27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 12-15: character maps to <undefined>
>>> for c in a: print "%x" % ord(c)
...
48
65
6c
6c
6f
20
77
6f
72
6c
64
a
1234
5678
9abc
def0
>>>
msg6769 (view) Author: Jeff Allen (jeff.allen) Date: 2012-02-02.22:44:34
Partial analysis ...

The error is effectively raised by org.python.core.PyString.hexescape(), which is trying to translate "\udef0" into a Unicode character. The method accepts a value that controls how it responds to an invalid character code. The options are "ignore" (i.e. don't insert it), "replace" (substitute the standard Unicode replacement character), or "strict", meaning throw this error.

hexescape() is called (indirectly) from the parser at org.python.antlr.GrammarActions.extractToken() to convert the text. That is the place where a "strict" error policy is chosen.

None of the existing policies corresponds to inserting a character "unchecked", which appears to be the CPython policy. Either a fourth should be defined, or the behaviour of an existing policy changed.

This is not the only point at which the error policy may determine behaviour, so the other implications of not being "strict" should be examined.
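The choice between the three existing policies and a possible fourth can be illustrated with a minimal sketch. This is not the Jython source (which is Java): the function name decode_u_escape and the policy name "unchecked" are hypothetical, the code is written for Python 3, and it assumes for simplicity that any code in the surrogate range counts as invalid, without looking for a matching partner.

```python
# Hypothetical sketch of the error policies discussed above.
# This is NOT the code of org.python.core.PyString.hexescape().

REPLACEMENT = "\ufffd"  # standard Unicode replacement character


def is_surrogate(code):
    """True if code lies in the UTF-16 surrogate range 0xD800-0xDFFF."""
    return 0xD800 <= code <= 0xDFFF


def decode_u_escape(hex_digits, errors="strict"):
    """Turn the four hex digits of a \\uXXXX escape into a string.

    Assumption: every surrogate is treated as invalid, paired or not.
    """
    code = int(hex_digits, 16)
    if not is_surrogate(code) or errors == "unchecked":
        # "unchecked" (hypothetical fourth policy): insert the code as-is.
        return chr(code)
    if errors == "ignore":
        return ""  # drop the invalid character
    if errors == "replace":
        return REPLACEMENT
    if errors == "strict":
        raise ValueError("illegal Unicode character \\u%s" % hex_digits)
    raise ValueError("unknown error policy: %r" % errors)
```

With "strict", the escape \udef0 raises; with the hypothetical "unchecked" policy it yields a one-character string whose ordinal is 0xDEF0, which is roughly what CPython is observed to do.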
msg6859 (view) Author: Alan Kennedy (amak) Date: 2012-03-19.20:25:53
The character "\udef0" is in the range 0xD800-0xDFFF, i.e. it is an "unpaired surrogate".

http://en.wikipedia.org/wiki/UTF-16

CPython accepts it, because CPython uses UCS-2, for which "\udef0" is a valid character.

Java, and thus Jython, uses UTF-16, which supports surrogate pairs for encoding characters outside the Basic Multilingual Plane.

If you retry your code with values outside the range 0xD800-0xDFFF, it will work.

Or provide a proper surrogate pair to decode.

This bug should be closed as "invalid" or "wont fix".
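The distinction between a proper surrogate pair and an unpaired surrogate can be sketched as follows. This is an illustration only, in Python 3, operating on a list of 16-bit code units; the helper names are hypothetical and none of this is Jython code. The arithmetic in combine is the standard UTF-16 decoding rule.

```python
def is_high_surrogate(unit):
    """High (leading) surrogates occupy 0xD800-0xDBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """Low (trailing) surrogates occupy 0xDC00-0xDFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def combine(high, low):
    """Combine a high/low surrogate pair into one code point above 0xFFFF."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def find_unpaired(units):
    """Return the indexes of surrogates that are not part of a valid pair."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2  # a well-formed pair: skip both units
            continue
        if is_high_surrogate(unit) or is_low_surrogate(unit):
            unpaired.append(i)
        i += 1
    return unpaired
```

Applied to the code units of the literal in this report, find_unpaired([0x1234, 0x5678, 0x9ABC, 0xDEF0]) flags only the last unit, a low surrogate with no high surrogate before it.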
msg6860 (view) Author: Frank Wierzbicki (fwierzbicki) Date: 2012-03-19.20:33:10
Thanks for the extra analysis Alan - that does indeed look like it is out of Jython's scope. It looks like it would be really hard for us to even store something like that given the Java constraints. I'll see if Jeff has anything further to say on the subject and if he agrees I'll close this.
msg6958 (view) Author: Jeff Allen (jeff.allen) Date: 2012-03-23.20:48:01
"If you retry your code with values outside the range 0xD800-0xDFFF, it will work."

It was not my code but Python's test_bytes.py that brought this up (by not compiling). The point about UCS-2 vs. UTF-16 is a good explanation. If we are living with that difference, we should perhaps not expect the same response to the invalid string.

I think the response by the parser leaves something to be desired, but I'll go quietly.
msg7032 (view) Author: Alan Kennedy (amak) Date: 2012-04-06.19:52:17
I understand the desire to have exactly the same behaviour as CPython.

But it is worth noting that the characters we're talking about are invalid characters in UCS-2 as well.

http://www.azillionmonkeys.com/qed/unicode.html

So real-world users will never see this situation: the code-points only ever appear in test code.

How hard do we want to work to make Jython behave the same as CPython? To the point of breaking UTF-16 behaviour in order to be the same as CPython's more limited UCS-2 behaviour? Just to make some UCS-2 specific (i.e. not portable) tests behave the same?

I vote no.
msg7843 (view) Author: Frank Wierzbicki (fwierzbicki) Date: 2013-02-27.18:01:10
I agree with Alan K on this one. For the foreseeable future we are UTF-16 and so will cause the "narrow-build" style of Python to live on. Maybe someday we might go crazy and back our unicode representation with a byte array or something and try to emulate a "wide-build" style, but that would be a ton of work and would leave us diverging from Java. I think for now we can close this.
msg7988 (view) Author: (gsnedders) Date: 2013-04-07.15:06:02
Python 2 doesn't define the unicode type as a UCS-2 or UTF-32 string: it defines it as a sequence of code units: "The items of a Unicode object are Unicode code units. A Unicode code unit is represented by a Unicode object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time)."

As such, validity constraints of UCS-2 and UTF-32 (and UTF-16) do not apply here, as it is none of them, but rather it is an abstract sequence of code units. It places no constraints on what Unicode ordinals (which I take to mean codepoints) are valid.

The Python 3 definition, for what it's worth, is clearer in terms of what is allowed: "A string is a sequence of values that represent Unicode codepoints. All the codepoints in range U+0000 - U+10FFFF can be represented in a string." This is clear that lone surrogates are valid.
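A small sketch, assuming CPython 3, of what that definition means in practice: the literal from this report is accepted as a string, and the lone surrogate only becomes a problem when the string is encoded. The helper name try_utf8 is hypothetical; the "surrogatepass" error handler is part of the standard codecs machinery.

```python
# The literal from the original report, as a Python 3 str.
text = "Hello world\n\u1234\u5678\u9abc\udef0"


def try_utf8(s):
    """Return (encoded_bytes, None) on success, or (None, index) where
    index is the position of the first character that cannot be encoded."""
    try:
        return s.encode("utf-8"), None
    except UnicodeEncodeError as exc:
        return None, exc.start


# The string itself is valid and holds the lone surrogate as its last item.
last_ordinal = ord(text[-1])

# Strict UTF-8 encoding fails at the surrogate.
encoded, bad_index = try_utf8(text)

# The "surrogatepass" handler encodes the surrogate and round-trips it.
passed = text.encode("utf-8", "surrogatepass")
round_trip = passed.decode("utf-8", "surrogatepass")
```

Here the failure is deferred to the transcoding step, which matches the run-time (rather than compile-time) error CPython 2.7 shows earlier in this report.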
History
Date User Action Args
2013-04-07 15:06:02 gsnedders set messages: + msg7988
2013-02-27 18:01:10 fwierzbicki set status: open -> closed; resolution: accepted -> wont fix; messages: + msg7843
2012-04-06 19:52:17 amak set messages: + msg7032
2012-03-23 20:48:01 jeff.allen set messages: + msg6958
2012-03-19 20:33:10 fwierzbicki set messages: + msg6860
2012-03-19 20:25:53 amak set nosy: + amak; messages: + msg6859
2012-02-13 16:53:18 fwierzbicki set priority: normal; assignee: fwierzbicki; resolution: accepted
2012-02-02 22:44:34 jeff.allen set messages: + msg6769
2012-02-02 22:14:18 jeff.allen create