Message11226

Author	jeff.allen
Recipients	jeff.allen
Date	2017-03-14.06:17:16
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1489472237.48.0.810895633667.issue2571@psf.upfronthosting.co.za>
In-reply-to

Content

test_codecencodings_tw fails on Java 8 (not on Java 7) like this:

======================================================================
FAIL: test_errorhandle (__main__.Test_Big5)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "...\dist\Lib\test\test_multibytecodec_support.py", line 56, in test_errorhandle
    self.assertEqual(result, expected,
AssertionError: 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')=u'abc\ufffd\ufffd\u8b10' != u'abc\ufffd\u8b10'

----------------------------------------------------------------------

The same failure affects some other multibyte codecs, whose tests are currently listed as "expected failures" in regrtest.py because they also fail in other places.

The cause is a change in the behaviour of the codecs built into Java. The behaviour in Python 2.7.13 and Jython 2.7 with Java 7 is:

>>> 'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u8b10def'
>>> 'abc\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u6514ef'

This behaviour seems to read \x80\x80 as the single code 8080, which the Java 7 codec reports as UNMAPPED[2], meaning an unmapped code of source length two.

The codec in Java 8 reads \x80 as MALFORMED[1], meaning that the byte does not belong in the code (Big5 leading bytes are in the range \x81-\xfe), and the length of the malformed text is one. It then reports the same thing for the second \x80.
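
The difference between the two policies can be illustrated with a toy model. This is only a sketch under stated assumptions: toy_decode, TOY_TABLE and the strict_lead flag are hypothetical names, the table holds just the two codes that appear in this message, and nothing here is taken from the Java or CPython codec sources.

```python
REPLACEMENT = u'\ufffd'

# Hypothetical mini-table: only the two Big5 codes seen in this message.
TOY_TABLE = {
    (0xc1, 0xc4): u'\u8b10',
    (0xc4, 0x64): u'\u6514',
}


def toy_decode(data, strict_lead):
    """Toy 'replace' decoder modelling the two error policies.

    strict_lead=False: any byte >= 0x80 starts a two-byte code; an
        unknown code is replaced as a unit (like UNMAPPED[2]).
    strict_lead=True: a byte outside the lead range 0x81-0xfe is
        replaced on its own (like MALFORMED[1]).
    """
    data = bytearray(data)  # yields ints on both Python 2 and 3
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(u'%c' % b)  # plain ASCII byte
            i += 1
        elif strict_lead and not (0x81 <= b <= 0xfe):
            out.append(REPLACEMENT)  # one bad byte, one replacement
            i += 1
        else:
            pair = tuple(data[i:i + 2])
            out.append(TOY_TABLE.get(pair, REPLACEMENT))
            i += 2
    return u''.join(out)


# The lenient policy reproduces the two results quoted above.
assert toy_decode(b'abc\x80\x80\xc1\xc4def', False) == u'abc\ufffd\u8b10def'
assert toy_decode(b'abc\x80\xc1\xc4def', False) == u'abc\ufffd\u6514ef'
# The strict policy reproduces the result from the failing test.
assert toy_decode(b'abc\x80\x80\xc1\xc4', True) == u'abc\ufffd\ufffd\u8b10'
```

In this model the only difference is whether \x80 is rejected before a second byte is consumed, which is enough to account for one replacement character versus two.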

I think the Java 8 policy is more correct, and not ruled out by the Python documentation. Our tests should perhaps allow it.
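
One way the tests might allow both behaviours is to compare the decoded result against a set of acceptable outcomes rather than a single expected string. This is a rough sketch only: check_replace and ACCEPTED are hypothetical names, not part of test_multibytecodec_support.py, and the real test harness is structured differently.

```python
# Both outcomes reported in this message for 'abc\x80\x80\xc1\xc4'.
ACCEPTED = {
    u'abc\ufffd\u8b10',        # \x80\x80 replaced as one two-byte code
    u'abc\ufffd\ufffd\u8b10',  # each \x80 replaced separately
}


def check_replace(data, encoding, accepted):
    """Decode with the 'replace' handler and accept any listed result."""
    result = data.decode(encoding, 'replace')
    assert result in accepted, '%r not in %r' % (result, sorted(accepted))
    return result


result = check_replace(b'abc\x80\x80\xc1\xc4', 'big5', ACCEPTED)
```

The check passes whichever of the two policies the underlying codec follows, while still failing on any other output.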

History
Date	User	Action	Args
2017-03-14 06:17:17	jeff.allen	set	recipients: + jeff.allen
2017-03-14 06:17:17	jeff.allen	set	messageid: <1489472237.48.0.810895633667.issue2571@psf.upfronthosting.co.za>
2017-03-14 06:17:17	jeff.allen	link	issue2571 messages
2017-03-14 06:17:16	jeff.allen	create