Issue2061

classification
Title: Behavior for invalid UTF8 differs from CPy
Type: behaviour
Severity: normal
Components: Core
Versions: Jython 2.7
Milestone:
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder:
Assigned To: amak
Nosy List: amak, oberstet, santa4nt
Priority:
Keywords:

Created on 2013-06-12.09:05:40 by oberstet, last changed 2013-06-15.11:47:36 by oberstet.

Messages
msg8043 Author: Tobias Oberstein (oberstet) Date: 2013-06-12.09:05:40
CPython 2.7.4

   oberstet@THINKPAD-T410S ~
   $ python
   Python 2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> s='\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
   >>> import json
   >>> json.dumps(s)
   '"\\u03ba\\u1f79\\u03c3\\u03bc\\u03b5\\ud800edited"'
   >>>

Jython 2.7b1

   C:\jython2.7b1\bin>jython
   Jython 2.7b1 (default:ac42d59644e9, Feb 9 2013, 15:24:52)
   [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import json
   >>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
   >>> s
   '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
   >>> json.dumps(s)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "C:\jython2.7b1\Lib\json\__init__.py", line 231, in dumps
       return _default_encoder.encode(obj)
     File "C:\jython2.7b1\Lib\json\encoder.py", line 195, in encode
       return encode_basestring_ascii(o)
     File "C:\jython2.7b1\Lib\json\encoder.py", line 48, in py_encode_basestring_ascii
       s = s.decode('utf-8')
     File "C:\jython2.7b1\Lib\encodings\utf_8.py", line 16, in decode
       return codecs.utf_8_decode(input, errors, True)
   UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
   >>> json.dumps(s, encoding = 'utf8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "C:\jython2.7b1\Lib\json\__init__.py", line 234, in dumps
       return cls(
     File "C:\jython2.7b1\Lib\json\encoder.py", line 193, in encode
       o = o.decode(_encoding)
     File "C:\jython2.7b1\Lib\encodings\utf_8.py", line 16, in decode
       return codecs.utf_8_decode(input, errors, True)
   UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
   >>>
msg8044 Author: Santoso Wijaya (santa4nt) Date: 2013-06-12.18:42:21
A simplified, minimal reproduction using the json module's base parts:


In CPython:

  Python 2.7.4 (default, Apr 19 2013, 18:28:01) 
  [GCC 4.7.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from json.encoder import encode_basestring
  >>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
  >>> o = s.decode('utf-8')
  >>> o
  u'\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited'
  >>> encode_basestring(o)
  u'"\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited"'


In Jython:

  Jython 2.7b1+ (default:3f971d6907b7+, Jun 12 2013, 11:30:15) 
  [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_21
  Type "help", "copyright", "credits" or "license" for more information.
  >>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
  >>> s.decode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/santa/Code/jython/dist/Lib/encodings/utf_8.py", line 16, in decode
      return codecs.utf_8_decode(input, errors, True)
  UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
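
The bytes at positions 11-13 in the error above are `\xed\xa0\x80`. As a rough sketch (the helper names `decode_three_byte` and `is_surrogate` are made up for illustration and are not part of CPython or Jython), applying the standard three-byte UTF-8 bit layout to them shows that they carry the code point U+D800, which lies in the surrogate range:

```python
def decode_three_byte(b0, b1, b2):
    """Combine the payload bits of a 3-byte UTF-8 sequence.

    Layout: 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits in total).
    Hypothetical helper; it does no validation of the lead or
    continuation bytes.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point):
    """True if the code point is in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


# The three bytes the Jython decoder complains about.
code_point = decode_three_byte(0xED, 0xA0, 0x80)
print(hex(code_point), is_surrogate(code_point))
```

The same arithmetic applied to `\xe1\xbd\xb9` gives U+1F79, matching the second character in the CPython output, so the two interpreters differ only on the surrogate.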
msg8046 Author: Alan Kennedy (amak) Date: 2013-06-15.11:37:11
\ud800 is an unpaired surrogate, which is illegal in the UTF-16 representation used by Jython.

It is legal in CPython's UCS-2, but it does not represent any real-world character.

It never appears in real-world data, only in CPython-specific tests, which break outside of CPython. Expecting these tests to pass on a platform that does not use UCS-2 is a mistaken expectation.

Resolving as a duplicate of #2048

http://bugs.jython.org/issue2048

Which itself is a duplicate of these bug reports

Jython doesn't allow to use unmapped unicode codepoint
http://bugs.jython.org/issue1707

Invalid Unicode characters cause compile-time error (CPython divergence)
http://bugs.jython.org/issue1836
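
To illustrate what "unpaired surrogate" means in the message above, here is a minimal sketch of a well-formedness check over UTF-16 code units. The function name `has_unpaired_surrogate` is hypothetical, and this is not how Jython or the JVM implements the check; it only encodes the rule that a high surrogate (0xD800-0xDBFF) must be immediately followed by a low surrogate (0xDC00-0xDFFF).

```python
def has_unpaired_surrogate(units):
    """Return True if a sequence of UTF-16 code units (ints) contains
    a surrogate that is not part of a valid high/low pair."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


# The code units of the string from this report: five Greek letters,
# the lone surrogate U+D800, then "edited".
reported = [0x03BA, 0x1F79, 0x03C3, 0x03BC, 0x03B5, 0xD800] + [ord(c) for c in "edited"]
print(has_unpaired_surrogate(reported))
```

A properly paired sequence such as `[0xD83D, 0xDE00]` passes the check, while the lone 0xD800 from this report does not.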
msg8047 Author: Tobias Oberstein (oberstet) Date: 2013-06-15.11:47:36
Yep, if the goal of Jython is not to 100% replicate CPython's behavior (including bugs), then this shouldn't be "fixed" in Jython.

In general, regarding UTF8 handling, both CPython and Java are broken (not 100% correct), e.g. the builtin UTF8 decoders cannot detect/decode exactly the set of valid UTF8.
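
As a point of comparison only, and assuming a Python 3 interpreter rather than the 2.7 versions discussed in this issue, the sketch below decodes the same bytes twice: once with the default strict error handling and once with the `surrogatepass` error handler, which exists in Python 3 but not in 2.7. The variable names are illustrative.

```python
# Same byte string as in the report, written as a bytes literal.
raw = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'

bad_start = None
try:
    raw.decode('utf-8')          # strict decoding
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_start = exc.start        # offset of the first offending byte

# 'surrogatepass' lets the encoded surrogate through as a lone code point.
lenient = raw.decode('utf-8', 'surrogatepass')

print(strict_ok, bad_start, ascii(lenient))
```

Under that assumption, strict decoding rejects the input at the same offset Jython reports, and `surrogatepass` yields the string CPython 2.7.4 produced.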
History
Date                 User      Action  Args
2013-06-15 11:47:36  oberstet  set     messages: + msg8047
2013-06-15 11:37:12  amak      set     status: open -> closed; assignee: amak; resolution: duplicate; messages: + msg8046; nosy: + amak
2013-06-12 18:42:21  santa4nt  set     messages: + msg8044
2013-06-12 17:45:28  santa4nt  set     nosy: + santa4nt; type: behaviour
2013-06-12 09:05:40  oberstet  create