Issue2061
Created on 2013-06-12.09:05:40 by oberstet, last changed 2013-06-15.11:47:36 by oberstet.
Messages | |||
---|---|---|---|
msg8043 (view) | Author: Tobias Oberstein (oberstet) | Date: 2013-06-12.09:05:40 | |
CPython 2.7.4

oberstet@THINKPAD-T410S ~
$ python
Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s='\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> import json
>>> json.dumps(s)
'"\\u03ba\\u1f79\\u03c3\\u03bc\\u03b5\\ud800edited"'
>>>

Jython 2.7b1

C:\jython2.7b1\bin>jython
Jython 2.7b1 (default:ac42d59644e9, Feb 9 2013, 15:24:52)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> s
'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> json.dumps(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\jython2.7b1\Lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\jython2.7b1\Lib\json\encoder.py", line 195, in encode
    return encode_basestring_ascii(o)
  File "C:\jython2.7b1\Lib\json\encoder.py", line 48, in py_encode_basestring_ascii
    s = s.decode('utf-8')
  File "C:\jython2.7b1\Lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
>>> json.dumps(s, encoding = 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\jython2.7b1\Lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\jython2.7b1\Lib\json\encoder.py", line 193, in encode
    o = o.decode(_encoding)
  File "C:\jython2.7b1\Lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding
>>> |
|||
msg8044 (view) | Author: Santoso Wijaya (santa4nt) | Date: 2013-06-12.18:42:21 | |
A simplified, minimal reproduction using the json module's base parts:

In CPython:

Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from json.encoder import encode_basestring
>>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> o = s.decode('utf-8')
>>> o
u'\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited'
>>> encode_basestring(o)
u'"\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited"'

In Jython:

Jython 2.7b1+ (default:3f971d6907b7+, Jun 12 2013, 11:30:15)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/santa/Code/jython/dist/Lib/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 11-13: illegal encoding |
|||
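For callers that need to serialize such byte strings on both implementations, one possible workaround is to decode leniently before calling json.dumps. The sketch below assumes the `replace` error handler of `decode` is available and behaves comparably on the target implementation (not verified on Jython here); `dumps_lenient` is a hypothetical helper name, not part of the json module.

```python
import json

# Bytes from the report: UTF-8 for five Greek letters, then ED A0 80
# (the UTF-8-style encoding of the lone surrogate U+D800), then ASCII "edited".
RAW = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xed\xa0\x80edited'


def dumps_lenient(raw, encoding='utf-8'):
    """Hypothetical helper: decode leniently, then serialize to JSON."""
    # errors='replace' substitutes U+FFFD for any bytes the decoder
    # rejects, instead of raising UnicodeDecodeError.
    text = raw.decode(encoding, 'replace')
    return json.dumps(text)


result = dumps_lenient(RAW)
print(result)
```

On a decoder that rejects ED A0 80, the offending bytes come out as U+FFFD in the JSON output; a decoder that accepts them (as CPython 2.7 does in the session above) leaves the surrogate in place, so the two outputs can still differ.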
msg8046 (view) | Author: Alan Kennedy (amak) | Date: 2013-06-15.11:37:11 | |
\ud800 is an unpaired surrogate, which is illegal in the UTF-16 representation used by Jython. It is legal in CPython's UCS-2, but does not actually represent any real-world character. It never appears in the real world, only in CPython-specific tests, which are broken outside of the CPython world. Expecting these tests to pass on any platform that does not use UCS-2 is a broken expectation.

Resolving as a duplicate of #2048
http://bugs.jython.org/issue2048

Which itself is a duplicate of these bug reports:

Jython doesn't allow to use unmapped unicode codepoint
http://bugs.jython.org/issue1707

Invalid Unicode characters cause compile-time error (CPython divergence)
http://bugs.jython.org/issue1836 |
|||
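To make "unpaired surrogate" concrete: in UTF-16, a high surrogate (U+D800..U+DBFF) is only valid when immediately followed by a low surrogate (U+DC00..U+DFFF), and a low surrogate is only valid in that second position. A minimal sketch of that check follows, assuming the input is a unicode string whose elements are treated as UTF-16 code units; `has_unpaired_surrogate` is a hypothetical name, not a Jython or CPython API.

```python
def has_unpaired_surrogate(text):
    """Return True if text holds a surrogate that is not part of a high+low pair."""
    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        if 0xD800 <= cp <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows directly.
            if i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
                i += 2  # skip the whole pair
                continue
            return True
        if 0xDC00 <= cp <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


print(has_unpaired_surrogate(u'\u03ba\u1f79\u03c3\u03bc\u03b5\ud800edited'))
```

Applied to the decoded string from this report, the check flags the \ud800 that sits directly before "edited".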
msg8047 (view) | Author: Tobias Oberstein (oberstet) | Date: 2013-06-15.11:47:36 | |
Yep, if the goal of Jython is not to replicate CPython's behavior 100% (including bugs), then this shouldn't be "fixed" in Jython. In general, regarding UTF-8 handling, both CPython and Java are broken (not 100% correct), e.g. the built-in UTF-8 decoders cannot detect/decode exactly the set of valid UTF-8. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2013-06-15 11:47:36 | oberstet | set | messages: + msg8047 |
2013-06-15 11:37:12 | amak | set | status: open -> closed assignee: amak resolution: duplicate messages: + msg8046 nosy: + amak |
2013-06-12 18:42:21 | santa4nt | set | messages: + msg8044 |
2013-06-12 17:45:28 | santa4nt | set | nosy: + santa4nt type: behaviour |
2013-06-12 09:05:40 | oberstet | create |