Issue2571

classification

Title:	Error handling in test_codecencodings_tw fails on Java 8
Type:	behaviour	Severity:	normal
Components:	Library	Versions:	Jython 2.7
		Milestone:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	jeff.allen	Nosy List:	jeff.allen
Priority:	normal	Keywords:

Created on 2017-03-14.06:17:17 by jeff.allen, last changed 2017-06-09.04:46:21 by zyasoft.

Messages
msg11226 (view)	Author: Jeff Allen (jeff.allen)	Date: 2017-03-14.06:17:16
test_codecencodings_tw fails on Java 8 (not on Java 7) like this:

======================================================================
FAIL: test_errorhandle (__main__.Test_Big5)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "...\dist\Lib\test\test_multibytecodec_support.py", line 56, in test_errorhandle
    self.assertEqual(result, expected,
AssertionError: 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')=u'abc\ufffd\ufffd\u8b10' != u'abc\ufffd\u8b10'
----------------------------------------------------------------------

The same failure affects some other multibyte codecs, whose tests are currently "expected failures" in regrtest.py because they also fail in other places.

The cause is a change in the behaviour of the built-in codecs. The behaviour in Python 2.7.13 and Jython 2.7 with Java 7 is:

>>> 'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u8b10def'
>>> 'abc\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u6514ef'

The codec seems to read \x80\x80 as the code 8080, which the Java 7 codec reports as UNMAPPED[2], meaning an unmapped code of source length two. The codec in Java 8 reads \x80 as MALFORMED[1], meaning that the byte does not belong in the code (Big5 leading bytes are in the range \x81-\xfe), and the length of the malformed text is one. Then it reports that again for the second \x80.

I think the Java 8 policy is more correct, and not ruled out by the Python documentation. Our tests should perhaps allow it.
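The two policies described above can be illustrated with a toy model. This is only a sketch: it is not the Java or CPython decoder, the function name decode_replace and its fast_resync flag are invented for illustration, and it assumes a lead byte outside \x81-\xfe is the only kind of malformed input worth modelling. Valid two-byte codes are handed to the host's own big5 codec.

```python
# Toy model of 'replace' error handling for a Big5-like double-byte codec.
# Hypothetical helper, not part of Jython, CPython or Java.

REPLACEMENT = u"\ufffd"


def decode_replace(data, fast_resync):
    """Decode Big5 bytes, substituting U+FFFD for anything undecodable.

    fast_resync=True  models MALFORMED[1]: a bad lead byte consumes one byte.
    fast_resync=False models UNMAPPED[2]: a bad lead byte also swallows
    the byte that follows it.
    """
    data = bytearray(data)
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(bytes(data[i:i + 1]).decode("ascii"))
            i += 1
        elif 0x81 <= b <= 0xFE and i + 1 < len(data):
            # Plausible lead byte: let the host codec map the pair.
            pair = bytes(data[i:i + 2])
            try:
                out.append(pair.decode("big5"))
            except UnicodeDecodeError:
                out.append(REPLACEMENT)
            i += 2
        else:
            # Invalid lead byte (or a lead byte with nothing after it).
            out.append(REPLACEMENT)
            i += 1 if fast_resync else 2
    return u"".join(out)
```

With the input from the failing test, fast_resync=False yields one replacement character for \x80\x80 and fast_resync=True yields two, which is the difference the assertion trips over.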
msg11227 (view)	Author: Jeff Allen (jeff.allen)	Date: 2017-03-14.07:48:32
Aha! Python 3 agrees with Java 8:

>>> b'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
'abc\ufffd\ufffd\u8b10def'

This will be the result of this change set:
https://hg.python.org/cpython/rev/16cbd84de848
and this issue:
http://bugs.python.org/issue12016
where, in a last-minute change of mind, CPython decided not to back-port the fix to 2.7 and 3.2.

However, nothing in the Python documentation seems to guarantee one or the other behaviour. Given we have good reasons for using the Java codec, I'll give us a custom test that is either sensitive to version or accepts both results.
msg11248 (view)	Author: Jeff Allen (jeff.allen)	Date: 2017-03-19.17:36:31
Darn. Turns out later versions of Java 7 are affected too. I think we must tolerate either behaviour, if we are not to write our own replacement processing.
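A test that tolerates either behaviour might look roughly like the following. It is a sketch only, not the test that was committed: the class name and the two constants are invented here, and the expected strings are the ones quoted earlier in this issue.

```python
import unittest

# Results quoted in this issue for 'abc\x80\x80\xc1\xc4def':
SLOW_RESYNC = u"abc\ufffd\u8b10def"        # Python 2.7.13, early Java 7
FAST_RESYNC = u"abc\ufffd\ufffd\u8b10def"  # Python 3, Java 8


class TolerantBig5Test(unittest.TestCase):
    """Hypothetical test case that accepts both replacement policies."""

    def test_errorhandle_replace(self):
        result = b"abc\x80\x80\xc1\xc4def".decode("big5", "replace")
        # Either policy is acceptable; anything else is a real failure.
        self.assertIn(result, (SLOW_RESYNC, FAST_RESYNC))
```

The cost of this approach is that the test can no longer detect an unintended switch from one policy to the other on a given platform.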
msg11250 (view)	Author: Jeff Allen (jeff.allen)	Date: 2017-03-20.08:02:28
I've dealt with this through a more sophisticated test.test_support.get_java_version(). I'm assuming >(1,7,0,60) means that codecs will perform fast resynchronisation after an invalid byte. This seems to work for us on the build bots.

https://hg.python.org/jython/rev/cc731a59c5eb

Any problems with this on JVMs I don't have to hand?
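The version-sensitive approach can be sketched as a tuple comparison. The helpers below are hypothetical stand-ins, not the actual test.test_support.get_java_version() from the change set: parse_java_version assumes a version string shaped like '1.8.0_121', and the (1, 7, 0, 60) threshold is the assumption stated in the message above.

```python
import re


def parse_java_version(text):
    """Turn a string such as '1.8.0_121' into the tuple (1, 8, 0, 121)."""
    return tuple(int(n) for n in re.findall(r"\d+", text))


def expects_fast_resync(version):
    """Assume codecs resynchronise after one bad byte beyond 1.7.0_60."""
    return version > (1, 7, 0, 60)


def expected_replace_result(version):
    """Expected decoding of 'abc\\x80\\x80\\xc1\\xc4def' with 'replace'."""
    if expects_fast_resync(version):
        return u"abc\ufffd\ufffd\u8b10def"
    return u"abc\ufffd\u8b10def"
```

Tuples compare element by element, so a version like (1, 8, 0, 121) sorts after the threshold without any special handling of the major version.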

History
Date	User	Action	Args
2017-06-09 04:46:21	zyasoft	set	status: pending -> closed
2017-03-20 08:02:29	jeff.allen	set	status: open -> pending resolution: fixed messages: + msg11250
2017-03-19 17:36:31	jeff.allen	set	messages: + msg11248
2017-03-14 07:48:32	jeff.allen	set	versions: + Jython 2.7 messages: + msg11227 priority: normal assignee: jeff.allen components: + Library type: behaviour
2017-03-14 06:17:17	jeff.allen	create