Issue2061

classification

Title:	Behavior for invalid UTF8 differs from CPy
Type:	behaviour	Severity:	normal
Components:	Core	Versions:	Jython 2.7
		Milestone:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:	amak	Nosy List:	amak, oberstet, santa4nt
Priority:		Keywords:

Created on 2013-06-12.09:05:40 by oberstet, last changed 2013-06-15.11:47:36 by oberstet.

Messages
msg8043	Author: Tobias Oberstein (oberstet)	Date: 2013-06-12.09:05:40
CPython 2.7.4

oberstet@THINKPAD-T410S ~
$ python
Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s='\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> import json
>>> json.dumps(s)
'"\\u03ba\\u1f79\\u03c3\\u03bc\\u03b5\\ud800edited"'
>>>

Jython 2.7b1

C:\jython2.7b1\bin>jython
Jython 2.7b1 (default:ac42d59644e9, Feb 9 2013, 15:24:52)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> s
'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> json.dumps(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\jython2.7b1\Lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\jython2.7b1\Lib\json\encoder.py", line 195, in encode
    return encode_basestring_ascii(o)
  File "C:\jython2.7b1\Lib\json\encoder.py", line 48, in py_encode_basestring_ascii
    s = s.decode('utf-8')
  File "C:\jython2.7b1\Lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
>>> json.dumps(s, encoding = 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\jython2.7b1\Lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\jython2.7b1\Lib\json\encoder.py", line 193, in encode
    o = o.decode(_encoding)
  File "C:\jython2.7b1\Lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
>>>
msg8044	Author: Santoso Wijaya (santa4nt)	Date: 2013-06-12.18:42:21
A simplified, minimal reproduction using the json module's base parts:

In CPython:

Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from json.encoder import encode_basestring
>>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> o = s.decode('utf-8')
>>> o
u'\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited'
>>> encode_basestring(o)
u'"\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited"'

In Jython:

Jython 2.7b1+ (default:3f971d6907b7+, Jun 12 2013, 11:30:15)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/santa/Code/jython/dist/Lib/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
msg8046	Author: Alan Kennedy (amak)	Date: 2013-06-15.11:37:11
\ud800 is an unpaired surrogate, which is illegal in the UTF-16 representation used by Jython. It is legal in CPython's UCS-2, but does not represent any real-world character. It never appears in the real world, only in CPython-specific tests, which are broken outside of the CPython world. Expecting these tests to pass on any platform that does not use UCS-2 is a broken expectation.

Resolving as a duplicate of #2048
http://bugs.jython.org/issue2048

which is itself a duplicate of these bug reports:

Jython doesn't allow to use unmapped unicode codepoint
http://bugs.jython.org/issue1707

Invalid Unicode characters cause compile-time error (CPython divergence)
http://bugs.jython.org/issue1836
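
The byte arithmetic behind the error message can be checked by hand. The following is a minimal sketch that assumes only the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx). The helper name decode_three_byte is hypothetical and is not part of Jython or CPython. It avoids version-specific syntax, so it is expected to behave the same on either interpreter.

```python
def decode_three_byte(seq):
    """Combine the payload bits of a three-byte UTF-8 sequence into a code point.

    Hypothetical helper for illustration only; it performs no validity checks.
    """
    b0, b1, b2 = bytearray(seq)
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# The byte string from the report.
s = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'

# Bytes 11-13 are the ones named in the UnicodeDecodeError.
code_point = decode_three_byte(s[11:14])

# Code points D800..DFFF are reserved for UTF-16 surrogates.
is_surrogate = 0xD800 <= code_point <= 0xDFFF

print("U+%04X surrogate=%s" % (code_point, is_surrogate))
```

The three bytes ed a0 80 carry the bits of U+D800, which falls in the surrogate range. That is the sequence the Jython decoder rejects and the CPython 2.7 decoder in the transcripts above lets through.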
msg8047	Author: Tobias Oberstein (oberstet)	Date: 2013-06-15.11:47:36
Yep, if the goal of Jython is not to replicate CPython's behavior 100% (including bugs), then this shouldn't be "fixed" in Jython. In general, regarding UTF-8 handling, both CPython and Java are broken (not 100% correct): for example, the built-in UTF-8 decoders do not accept exactly the set of valid UTF-8 sequences.
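
For reference, "exactly the set of valid UTF-8" can be spelled out as a byte-range check. The sketch below follows the well-formed byte sequence table in the Unicode Standard (Table 3-7), as I understand it. The function name is_well_formed_utf8 is hypothetical, and this is an illustration of the rule rather than the decoder used by either implementation.

```python
def is_well_formed_utf8(data):
    """Return True if the byte string is well-formed UTF-8.

    Hypothetical helper; the ranges follow Unicode Standard Table 3-7.
    """
    b = bytearray(data)
    i, n = 0, len(b)
    while i < n:
        lead = b[i]
        if lead <= 0x7F:              # single-byte (ASCII)
            i += 1
            continue
        # need: number of continuation bytes
        # lo..hi: allowed range for the FIRST continuation byte
        if 0xC2 <= lead <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong forms
        elif lead == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates D800..DFFF
        elif 0xE1 <= lead <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong forms
        elif 0xF1 <= lead <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes values above U+10FFFF
        else:
            return False                   # C0, C1, F5..FF, or a stray continuation byte
        if i + need >= n:
            return False                   # truncated sequence
        if not (lo <= b[i + 1] <= hi):
            return False
        for k in range(2, need + 1):
            if not (0x80 <= b[i + k] <= 0xBF):
                return False
        i += need + 1
    return True
```

Under this rule the string from the original report is not well-formed, because a lead byte of ed may only be followed by a byte in the range 80..9f, and the report has a0 there.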

History
Date	User	Action	Args
2013-06-15 11:47:36	oberstet	set	messages: + msg8047
2013-06-15 11:37:12	amak	set	status: open -> closed assignee: amak resolution: duplicate messages: + msg8046 nosy: + amak
2013-06-12 18:42:21	santa4nt	set	messages: + msg8044
2013-06-12 17:45:28	santa4nt	set	nosy: + santa4nt type: behaviour
2013-06-12 09:05:40	oberstet	create