Issue1707

classification
Title: Jython doesn't allow to use unmapped unicode codepoint
Type: behaviour
Severity: normal
Components: Core
Versions:
Milestone:

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: amak
Nosy List: amak, yyamano
Priority:
Keywords:

Created on 2011-02-16.03:34:00 by yyamano, last changed 2012-03-17.22:50:58 by amak.

Messages
msg6399 Author: Yuji Yamano (yyamano) Date: 2011-02-16.03:34:57
Jython rejects a string literal that contains an unmapped Unicode code point. CPython accepts the same literal.

% uname -a
Darwin amp.local 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

% cat foo.py
# Taken from CPythonLib/test/test_multibytecodec_support.py

unmappedunicode = u'\udeee' # a unicode codepoint that is not mapped.

% ./dist/bin/jython foo.py
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 2-8: illegal Unicode character
% ./dist/bin/jython --version
Jython 2.5.2rc3

% python2.6 foo.py
% python2.6 --version
Python 2.6.5

% ./dist/bin/jython
Jython 2.5.2rc3 (trunk:7195, 2 16 2011, 11:23:40) 
[Java HotSpot(TM) 64-Bit Server VM (Apple Inc.)] on java1.6.0_22
Type "help", "copyright", "credits" or "license" for more information.
>>> unmappedunicode = u'\udeee'
...
msg6430 Author: Alan Kennedy (amak) Date: 2011-03-12.13:56:26
It's arguable that the cpython behaviour is wrong in this case.

Why would you want to handle an unpaired surrogate?

Note that Java will not permit this. Consider the following session:

Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.5.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> import java
>>> import jarray
>>> bytes = [-34, -18]
>>> byte_array = jarray.array(bytes, 'b')
>>> java_string = java.lang.String(byte_array, "UTF-16")
>>> jython_string = unicode(java_string)
>>> jython_string
u'\ufffd'

Note that the result is u"\ufffd", which is a "Replacement Character".

"Replacement Character: A character used as a substitute for an uninterpretable character from another encoding. The Unicode Standard uses U+FFFD  replacement character for this function."

http://unicode.org/glossary/
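
The same effect can be approximated without Java. The sketch below is only an illustration, not Jython's actual code path: it assumes big-endian byte order (which is what Java's "UTF-16" charset falls back to when no byte order mark is present), and decode_signed_bytes is a hypothetical helper name.

```python
def decode_signed_bytes(signed_bytes, errors="strict"):
    """Decode Java-style signed bytes as big-endian UTF-16."""
    # Java bytes are signed (-128..127); mask them into the 0..255 range.
    raw = bytearray(b & 0xFF for b in signed_bytes)
    return raw.decode("utf-16-be", errors)


# [-34, -18] is 0xDE 0xEE, i.e. the single code unit U+DEEE,
# a low surrogate with no preceding high surrogate.
replaced = decode_signed_bytes([-34, -18], errors="replace")

# With the default "strict" error handling the same input is rejected.
try:
    decode_signed_bytes([-34, -18])
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With errors="replace" the unpaired surrogate becomes U+FFFD, matching the Jython session above; with strict handling the decoder refuses the input.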
msg6432 Author: Alan Kennedy (amak) Date: 2011-03-12.14:16:07
Another point: the behaviour differs because CPython uses UCS-2 for character representation, whereas Java (and thus Jython) uses UTF-16.

"\udeee" *is* a valid code point in UCS-2: it just doesn't represent anything. This is why cpython does not complain.

http://www.unicode.org/charts/PDF/UDC00.pdf

Jython's behaviour is arguably more correct in this case. 

Also cpython, because it uses UCS-2, cannot represent characters outside the "Basic Multilingual Plane", but jython, because it uses UTF-16, can.

http://en.wikipedia.org/wiki/UTF-16/UCS-2
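
To make the surrogate ranges concrete, here is a minimal sketch of how UTF-16 treats the value in question. The function names are made up for illustration; the ranges (U+D800..U+DBFF for high surrogates, U+DC00..U+DFFF for low surrogates) come from the Unicode Standard.

```python
def classify_code_unit(unit):
    """Say which kind of UTF-16 code unit a 16-bit value is."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "not a surrogate"


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


kind = classify_code_unit(0xDEEE)    # the value from the failing test
pair = to_surrogate_pair(0x1F600)    # a code point outside the BMP
```

U+DEEE falls in the low-surrogate range, so in UTF-16 it only has meaning as the second half of a pair such as the one computed for U+1F600.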
msg6434 Author: Yuji Yamano (yyamano) Date: 2011-03-13.14:33:27
> Why would you want to handle an unpaired surrogate?

It is taken from CPythonLib/test/test_multibytecodec_support.py.
I'd like to make the test work for japanese codec.
msg6435 Author: Alan Kennedy (amak) Date: 2011-03-13.18:15:01
> It is taken from CPythonLib/test/test_multibytecodec_support.py.
> I'd like to make the test work for japanese codec.

OK. So the only place that it appears is in test code, not a real-world use case.

I would make the argument that the test is broken. It can only work on python interpreters that use UCS-2, i.e. cpython: it cannot work on python interpreters that use UTF-16, i.e. jython, and possibly ironpython.

I think it should be reported on the cpython bug tracker as a non-portable, i.e. cpython-specific, test.

I will see if I can come up with a patch we can push upstream.

In the meantime, I advise you to treat those tests as broken. Several of the tests seem to use the illegal value only to trigger the error handler, not because of the value itself. In these cases, I advise you to simply use different test values, i.e. some other sequence that causes the user-supplied encode function to be called.

For example, in the following test, it is the return type of the myreplace function that is under test, not the unmappable sequence.

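def test_callback_wrong_objects(self):
    def myreplace(exc):
        # 'ret' is looked up when the handler runs, so it sees the loop value below.
        return (ret, exc.end)
    codecs.register_error("test.cjktest", myreplace)
    # Each value is the wrong type for a replacement, so encoding must raise TypeError.
    for ret in ([1, 2, 3], [], None, object(), 'string', ''):
        self.assertRaises(TypeError, self.encode, self.unmappedunicode,
                          'test.cjktest')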

Replacing self.unmappedunicode with some other sequence which causes invocation of "myreplace", at encode time, should permit the test to run, without needing an illegal UTF-16 sequence.
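
As a rough sketch of that suggestion (not the patch itself), the same check can be driven by a perfectly valid character that the target codec simply cannot encode. Everything here is an assumption made for illustration: the 'ascii' codec stands in for the CJK codecs, u'\u00e9' stands in for the unmappable input, and "example.wrongtype" and the helper names are invented. The two string values from the original loop are left out because whether a plain string is an acceptable return type differs between Python 2 and Python 3.

```python
import codecs


def make_handler(ret):
    """Build an error handler that returns 'ret' as its replacement."""
    def myreplace(exc):
        return (ret, exc.end)
    return myreplace


def raises_type_error(ret, text=u"\u00e9", encoding="ascii"):
    """Return True if encoding with a handler that returns 'ret' raises TypeError."""
    # Re-registering the same name replaces the previous handler.
    codecs.register_error("example.wrongtype", make_handler(ret))
    try:
        text.encode(encoding, "example.wrongtype")
    except TypeError:
        return True
    return False


# None of these is a valid replacement, so each call should raise TypeError.
results = [raises_type_error(ret) for ret in ([1, 2, 3], [], None, object())]
```

The handler is invoked because u'\u00e9' has no ASCII encoding, so the return-type check is exercised without any illegal UTF-16 sequence in the source.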
msg6813 Author: Alan Kennedy (amak) Date: 2012-03-17.22:50:57
Closing this bug: Jython's handling of this invalid input is correct, i.e. the exception

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 2-8: illegal Unicode character

is entirely accurate: the input is invalid and cannot be decoded.

As discussed, this is because java (and thus jython) uses UTF-16 as an internal representation.

Cpython, on the other hand, uses UCS-2, so this is a valid test for cpython.
History
Date                 User     Action  Args
2012-03-17 22:50:58  amak     set     status: open -> closed; resolution: rejected; messages: + msg6813
2011-03-13 18:15:13  amak     set     assignee: amak
2011-03-13 18:15:02  amak     set     messages: + msg6435
2011-03-13 14:33:27  yyamano  set     messages: + msg6434
2011-03-12 14:16:07  amak     set     messages: + msg6432
2011-03-12 13:56:26  amak     set     nosy: + amak; messages: + msg6430
2011-02-16 03:34:57  yyamano  set     messages: + msg6399
2011-02-16 03:34:00  yyamano  create