Author jeff.allen
Recipients jeff.allen
Date 2017-03-14.06:17:16
test_codecencodings_tw fails on Java 8 (not on Java 7) like this:

FAIL: test_errorhandle (__main__.Test_Big5)
Traceback (most recent call last):
  File "...\dist\Lib\test\", line 56, in test_errorhandle
    self.assertEqual(result, expected,
AssertionError: 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')=u'abc\ufffd\ufffd\u8b10' != u'abc\ufffd\u8b10'


The same failure affects some other multibyte codecs, whose tests are currently listed as "expected failures" because they also fail in other places.

The cause is a change in the behaviour of the built-in codecs. The behaviour in Python 2.7.13 and Jython 2.7 with Java 7 is:

>>> 'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
>>> 'abc\x80\xc1\xc4def'.decode('big5', 'replace')

The Java 7 codec seems to read \x80\x80 as the code 8080, which it reports as UNMAPPED[2], meaning an unmapped code of source length two.

The codec in Java 8 reads \x80 as MALFORMED[1], meaning that the byte cannot begin a code (Big5 leading bytes are in the range \x81-\xfe) and that the malformed text is one byte long. It then reports the same for the second \x80.
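
The difference between the two policies can be made visible from Python by recording the byte span of each error the codec raises. The following is only a sketch: the function name malformed_spans and the handler name 'record_spans' are made up for illustration, and the spans observed depend on which codec implementation is underneath. A codec following the two-byte policy would be expected to report a single span (3, 5); one following the one-byte policy would report (3, 4) and then (4, 5).

```python
import codecs


def malformed_spans(data, encoding):
    """Decode data and return (text, spans).

    spans lists the (start, end) byte offsets of every sequence the
    codec rejected, in the order the errors were raised.
    """
    spans = []

    def record(exc):
        # exc.start and exc.end delimit the offending bytes in the input.
        spans.append((exc.start, exc.end))
        # Substitute U+FFFD and resume decoding just after the span,
        # which mirrors what the built-in 'replace' handler does.
        return (u'\ufffd', exc.end)

    # 'record_spans' is a hypothetical handler name used only in this sketch.
    codecs.register_error('record_spans', record)
    text = data.decode(encoding, 'record_spans')
    return text, spans


text, spans = malformed_spans(b'abc\x80\x80\xc1\xc4', 'big5')
print(repr(text), spans)
```

The number of spans recorded equals the number of U+FFFD characters in the result, which is exactly the quantity test_errorhandle disagrees about.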

I think the Java 8 policy is more correct, and not ruled out by the Python documentation. Our tests should perhaps allow it.
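
One way the test might allow both behaviours is to accept either result. This is a sketch of the idea only, not a patch to the existing test module: the class name TolerantBig5Test and the tuple ACCEPTED are invented here, and the two accepted strings are the ones quoted in the failure above.

```python
import unittest

# Both values come from the failure report: one U+FFFD if \x80\x80 is taken
# as a single unmapped two-byte code, two if each \x80 is a malformed byte.
ACCEPTED = (u'abc\ufffd\u8b10', u'abc\ufffd\ufffd\u8b10')


class TolerantBig5Test(unittest.TestCase):  # hypothetical test class
    def test_errorhandle_replace(self):
        result = b'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')
        # Pass under either error-reporting policy.
        self.assertIn(result, ACCEPTED)
```

Saved to a file, this could be run with python -m unittest. The cost of the approach is that the test no longer pins down which policy is in force, so a regression from one to the other would go unnoticed.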
Date                 User        Action  Args
2017-03-14 06:17:17  jeff.allen  set     recipients: + jeff.allen
2017-03-14 06:17:17  jeff.allen  set     messageid: <>
2017-03-14 06:17:17  jeff.allen  link    issue2571 messages
2017-03-14 06:17:16  jeff.allen  create