Title: Error handling in test_codecencodings_tw fails on Java 8
Type: behaviour Severity: normal
Components: Library Versions: Jython 2.7
Status: pending Resolution: fixed
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen
Priority: normal Keywords:

Created on 2017-03-14.06:17:17 by jeff.allen, last changed 2017-03-20.08:02:29 by jeff.allen.

msg11226 Author: Jeff Allen (jeff.allen) Date: 2017-03-14.06:17:16
test_codecencodings_tw fails on Java 8 (not on Java 7) like this:

FAIL: test_errorhandle (__main__.Test_Big5)
Traceback (most recent call last):
  File "...\dist\Lib\test\", line 56, in test_errorhandle
    self.assertEqual(result, expected,
AssertionError: 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')=u'abc\ufffd\ufffd\u8b10' != u'abc\ufffd\u8b10'


The same failure affects some other multibyte codecs the tests for which are currently "expected failures" in, because they also fail in other places.

The cause is a change in the behaviour of the built-in codecs. The behaviour in Python 2.7.13 and Jython 2.7 with Java 7 is:

>>> 'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
>>> 'abc\x80\xc1\xc4def'.decode('big5', 'replace')

The decoder seems to read \x80\x80 as the two-byte code 8080, which the Java 7 codec reports as UNMAPPED[2], meaning an unmapped code of source length two.

The codec in Java 8 reads \x80 as MALFORMED[1], meaning that the byte cannot begin a valid code (Big5 leading bytes are in the range \x81-\xfe) and that the length of the malformed input is one. It then reports the same thing for the second \x80.
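
To make the difference concrete, here is a toy model of the two policies. It is only a sketch: the function name toy_big5_replace and its bad_lead_skip parameter are made up for illustration, it borrows the built-in big5 codec to decode each two-byte code, and it is not how either the Java or the CPython codec is actually implemented.

```python
REPLACEMENT = u"\ufffd"


def toy_big5_replace(data, bad_lead_skip):
    """Toy model only: decode Big5 bytes, replacing invalid input.

    bad_lead_skip is the number of bytes consumed when a byte that cannot
    start a Big5 code (outside 0x81-0xFE) is met: 2 models the older
    behaviour described above, 1 models the newer one.
    """
    data = bytearray(data)
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) character.
            out.append(chr(b))
            i += 1
        elif 0x81 <= b <= 0xFE:
            # Plausible lead byte: try it together with the next byte.
            try:
                out.append(bytes(data[i:i + 2]).decode("big5"))
                i += 2
            except UnicodeDecodeError:
                # Arbitrary choice for this sketch: replace, skip one byte.
                out.append(REPLACEMENT)
                i += 1
        else:
            # 0x80 or 0xFF cannot start a Big5 code.
            out.append(REPLACEMENT)
            i += bad_lead_skip
    return u"".join(out)


data = b"abc\x80\x80\xc1\xc4"
print(repr(toy_big5_replace(data, 2)))  # older policy: one replacement
print(repr(toy_big5_replace(data, 1)))  # newer policy: two replacements
```

For the input in the traceback, the two settings reproduce the two strings in the assertion message.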

I think the Java 8 policy is more correct, and not ruled out by the Python documentation. Our tests should perhaps allow it.

msg11227 Author: Jeff Allen (jeff.allen) Date: 2017-03-14.07:48:32
Aha! Python 3 agrees with Java 8:

>>> b'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')

This will be the result of this change set:
and this issue:
where, in a last-minute change of mind, CPython decided not to back-port the fix to 2.7 and 3.2.

However, nothing in the Python documentation seems to guarantee either behaviour. Given that we have good reasons for using the Java codec, I'll give us a custom test that is either sensitive to the Java version or accepts both results.
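
A minimal sketch of the "accepts both" option, assuming plain unittest. The class name TolerantBig5Test and the run() helper are hypothetical and are not part of the existing test module; the two accepted strings are the ones quoted in the failure report.

```python
import unittest

DATA = b"abc\x80\x80\xc1\xc4"

# Both results quoted in the failure report: one replacement character
# for the pair \x80\x80, or one for each invalid byte.
ACCEPTED = (u"abc\ufffd\u8b10", u"abc\ufffd\ufffd\u8b10")


class TolerantBig5Test(unittest.TestCase):
    """Hypothetical test that tolerates either resynchronisation policy."""

    def test_replace_either_way(self):
        result = DATA.decode("big5", "replace")
        self.assertIn(result, ACCEPTED)


def run():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TolerantBig5Test)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The drawback of this form is that it would not notice a codec switching from one behaviour to the other, which is the argument for a version-sensitive test instead.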

msg11248 Author: Jeff Allen (jeff.allen) Date: 2017-03-19.17:36:31
Darn. Turns out later versions of Java 7 are affected too. I think we must tolerate either behaviour, if we are not to write our own replacement processing.

msg11250 Author: Jeff Allen (jeff.allen) Date: 2017-03-20.08:02:28
I've dealt with this through a more sophisticated test.test_support.get_java_version(). I'm assuming >(1,7,0,60) means that codecs will perform fast resynchronisation after an invalid byte. This seems to work for us on the build bots.
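
Roughly, the version-sensitive expectation looks like the sketch below. It assumes the Java version is available as a tuple of integers such as (1, 7, 0, 60). The names parse_java_version, resyncs_fast and expected_big5_replace are hypothetical, and the parsing shown is not necessarily what test.test_support.get_java_version() does.

```python
import re


def parse_java_version(text):
    """Hypothetical helper: turn '1.7.0_60' into (1, 7, 0, 60)."""
    return tuple(int(n) for n in re.findall(r"\d+", text))


def resyncs_fast(version):
    """Assumption taken from this issue: versions after (1, 7, 0, 60)
    resynchronise immediately after a single invalid byte."""
    return version > (1, 7, 0, 60)


def expected_big5_replace(version):
    """Expected result of 'abc\\x80\\x80\\xc1\\xc4'.decode('big5', 'replace')."""
    if resyncs_fast(version):
        # One replacement character per invalid byte.
        return u"abc\ufffd\ufffd\u8b10"
    # One replacement character for the pair \x80\x80.
    return u"abc\ufffd\u8b10"


for text in ("1.7.0_60", "1.7.0_79", "1.8.0_121"):
    version = parse_java_version(text)
    print(text, version, repr(expected_big5_replace(version)))
```

Tuple comparison is element by element, so any 1.8 version compares greater than (1, 7, 0, 60) regardless of its update number.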

Any problems with this on JVMs I don't have to hand?

Date                 User        Action  Args
2017-03-20 08:02:29  jeff.allen  set     status: open -> pending
                                         resolution: fixed
                                         messages: + msg11250
2017-03-19 17:36:31  jeff.allen  set     messages: + msg11248
2017-03-14 07:48:32  jeff.allen  set     versions: + Jython 2.7
                                         messages: + msg11227
                                         priority: normal
                                         assignee: jeff.allen
                                         components: + Library
                                         type: behaviour
2017-03-14 06:17:17  jeff.allen  create