Message11226
test_codecencodings_tw fails on Java 8 (not on Java 7) like this:
======================================================================
FAIL: test_errorhandle (__main__.Test_Big5)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "...\dist\Lib\test\test_multibytecodec_support.py", line 56, in test_errorhandle
    self.assertEqual(result, expected,
AssertionError: 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')=u'abc\ufffd\ufffd\u8b10' != u'abc\ufffd\u8b10'
----------------------------------------------------------------------
The same failure affects some other multibyte codecs whose tests are currently "expected failures" in regrtest.py because they also fail in other places.
The cause is a change in the behaviour of the built-in Java codecs. The behaviour in Python 2.7.13, and in Jython 2.7 on Java 7, is:
>>> 'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u8b10def'
>>> 'abc\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u6514ef'
The Java 7 codec seems to read \x80\x80 as the two-byte code 8080, which it reports as UNMAPPED[2], meaning an unmapped code of source length two.
The codec in Java 8 reads the first \x80 as MALFORMED[1], meaning that the byte cannot begin a code (Big5 leading bytes are in the range \x81-\xfe) and that the malformed input is one byte long. It then reports the same for the second \x80.
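As a rough illustration of the two policies, here is a toy decoder. It is only a sketch, not the Java or CPython implementation: the function name, the `check_lead_byte` flag and the two-entry table are hypothetical, and the only mappings in the table are the two visible in the transcripts above.

```python
REPLACEMENT = u'\ufffd'

# Hypothetical stand-in for a Big5 table: just the two mappings that
# appear in the transcripts above, nothing more.
TOY_TABLE = {
    b'\xc1\xc4': u'\u8b10',
    b'\xc4d': u'\u6514',
}


def toy_decode_replace(data, check_lead_byte):
    """Decode bytes with 'replace' semantics under one of two policies.

    check_lead_byte=False: any non-ASCII byte starts a two-byte code, and
        an unknown code costs one replacement for both bytes (UNMAPPED[2]).
    check_lead_byte=True: a non-ASCII byte outside 0x81-0xfe is rejected on
        its own, costing one replacement for one byte (MALFORMED[1]).
    """
    raw = bytearray(data)
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(chr(b))
            i += 1
        elif check_lead_byte and not (0x81 <= b <= 0xfe):
            # Not a valid leading byte: replace this byte only.
            out.append(REPLACEMENT)
            i += 1
        else:
            # Treat this byte and the next as one code; replace if unknown.
            pair = bytes(raw[i:i + 2])
            out.append(TOY_TABLE.get(pair, REPLACEMENT))
            i += 2
    return u''.join(out)
```

With the input from the failing test, `toy_decode_replace(b'abc\x80\x80\xc1\xc4', False)` returns u'abc\ufffd\u8b10' and the same call with `True` returns u'abc\ufffd\ufffd\u8b10', which are the two results being compared in the assertion.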
I think the Java 8 policy is more correct, and not ruled out by the Python documentation. Our tests should perhaps allow it.
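One possible shape for a more tolerant check is sketched below. This is an assumption about how the test could be relaxed, not the code in test_multibytecodec_support.py: the helper name `assert_replace_result` and the `ACCEPTED` tuple are hypothetical, and the two strings are the ones from the assertion failure above.

```python
# Both outcomes seen for 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace').
ACCEPTED = (
    u'abc\ufffd\u8b10',        # one replacement for the \x80\x80 pair
    u'abc\ufffd\ufffd\u8b10',  # one replacement per \x80 byte
)


def assert_replace_result(result, accepted=ACCEPTED):
    """Raise AssertionError unless result is one of the accepted strings."""
    if result not in accepted:
        raise AssertionError(
            '%r is not one of the accepted results %r' % (result, accepted))
```

Whether to accept both forms for every codec, or only for the ones backed by Java charsets, would still need deciding.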
Date | User | Action | Args
2017-03-14 06:17:17 | jeff.allen | set | recipients: + jeff.allen
2017-03-14 06:17:17 | jeff.allen | set | messageid: <1489472237.48.0.810895633667.issue2571@psf.upfronthosting.co.za>
2017-03-14 06:17:17 | jeff.allen | link | issue2571 messages
2017-03-14 06:17:16 | jeff.allen | create |