Issue2571
Created on 2017-03-14.06:17:17 by jeff.allen, last changed 2017-06-09.04:46:21 by zyasoft.
msg11226 |
Author: Jeff Allen (jeff.allen) |
Date: 2017-03-14.06:17:16 |
|
test_codecencodings_tw fails on Java 8 (not on Java 7) like this:
======================================================================
FAIL: test_errorhandle (__main__.Test_Big5)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "...\dist\Lib\test\test_multibytecodec_support.py", line 56, in test_errorhandle
    self.assertEqual(result, expected,
AssertionError: 'abc\x80\x80\xc1\xc4'.decode('big5', 'replace')=u'abc\ufffd\ufffd\u8b10' != u'abc\ufffd\u8b10'
----------------------------------------------------------------------
The same failure affects several other multibyte codecs whose tests are currently listed as "expected failures" in regrtest.py, because they also fail in other places.
The cause is a change in the behaviour of the codecs built into Java. The behaviour in Python 2.7.13, and in Jython 2.7 with Java 7, is:
>>> 'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u8b10def'
>>> 'abc\x80\xc1\xc4def'.decode('big5', 'replace')
u'abc\ufffd\u6514ef'
It seems to read \x80\x80 as the code 8080, which the Java 7 codec reports as UNMAPPED[2], meaning an unmapped code of source length two.
The codec in Java 8 reads \x80 as MALFORMED[1], meaning that the byte does not belong in the code (Big5 leading bytes are in the range \x81-\xfe) and that the length of the malformed text is one. It then reports the same again for the second \x80.
I think the Java 8 policy is more correct, and not ruled out by the Python documentation. Our tests should perhaps allow it.
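To make the difference concrete, here is a toy model of the two recovery policies. It is only a sketch: TOY_TABLE holds just the two Big5 mappings quoted above, decode_replace and its resync_fast flag are names made up for illustration, and the real Java codecs are certainly more involved than this.

```python
# Toy model of the two error-recovery policies; not a real Big5 codec.
# TOY_TABLE contains only the two mappings quoted in this report.
TOY_TABLE = {b'\xc1\xc4': u'\u8b10', b'\xc4d': u'\u6514'}
REPLACEMENT = u'\ufffd'


def decode_replace(data, resync_fast):
    """Decode with 'replace' semantics under one of two policies.

    resync_fast=False: any byte >= 0x80 is taken as the start of a
        two-byte code; an unmapped pair becomes one U+FFFD (UNMAPPED[2]).
    resync_fast=True: a byte outside the lead range 0x81-0xfe is
        replaced on its own (MALFORMED[1]) and decoding resumes at
        the very next byte.
    """
    data = bytearray(data)
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))          # plain ASCII
            i += 1
        elif resync_fast and not (0x81 <= b <= 0xfe):
            out.append(REPLACEMENT)     # malformed, length one
            i += 1
        else:
            pair = bytes(data[i:i + 2])
            out.append(TOY_TABLE.get(pair, REPLACEMENT))
            i += 2                      # mapped or unmapped, length two
    return u''.join(out)
```

With resync_fast=False this reproduces both Java 7 results shown above, including the swallowed "d" in the second example; with resync_fast=True the first input yields two replacement characters.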
|
msg11227 |
Author: Jeff Allen (jeff.allen) |
Date: 2017-03-14.07:48:32 |
|
Aha! Python 3 agrees with Java 8:
>>> b'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
'abc\ufffd\ufffd\u8b10def'
This will be the result of this change set:
https://hg.python.org/cpython/rev/16cbd84de848
and this issue:
http://bugs.python.org/issue12016
where, in a last-minute change of mind, CPython decided not to back-port the fix to 2.7 and 3.2.
However, nothing in the Python documentation seems to guarantee one behaviour or the other. Given we have good reasons for using the Java codec, I'll give us a custom test that is either sensitive to the Java version, or accepts both results.
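A minimal sketch of the "accepts the two" option, using the two results quoted in this issue. The names LEGACY, RESYNC and is_acceptable are invented for illustration and are not part of the actual test suite.

```python
# Sketch only: accept either documented outcome of the 'replace' decode.
LEGACY = u'abc\ufffd\u8b10def'          # \x80\x80 consumed as one unmapped code
RESYNC = u'abc\ufffd\ufffd\u8b10def'    # each \x80 replaced separately


def is_acceptable(result):
    """True if result matches either behaviour described in this issue."""
    return result in (LEGACY, RESYNC)


# Whichever policy the underlying codec follows, this should hold.
decoded = b'abc\x80\x80\xc1\xc4def'.decode('big5', 'replace')
assert is_acceptable(decoded)
```

The drawback of this approach is that it cannot detect a platform silently switching from one behaviour to the other.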
|
msg11248 |
Author: Jeff Allen (jeff.allen) |
Date: 2017-03-19.17:36:31 |
|
Darn. Turns out later versions of Java 7 are affected too. I think we must tolerate either behaviour, if we are not to write our own replacement processing.
|
msg11250 |
Author: Jeff Allen (jeff.allen) |
Date: 2017-03-20.08:02:28 |
|
I've dealt with this through a more sophisticated test.test_support.get_java_version(). I'm assuming a version > (1,7,0,60) means that codecs will perform fast resynchronisation after an invalid byte. This seems to work for us on the build bots.
https://hg.python.org/jython/rev/cc731a59c5eb
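Roughly, the idea is the following. This is a sketch, not the code in the change set: it assumes the Java version is available as a tuple such as (1, 7, 0, 60), and RESYNC_AFTER and expected_big5_replace are hypothetical names.

```python
# Sketch only: choose the expected result from the Java version tuple.
RESYNC_AFTER = (1, 7, 0, 60)    # threshold assumed in this message


def expected_big5_replace(java_version):
    """Expected value of 'abc\\x80\\x80\\xc1\\xc4'.decode('big5', 'replace')."""
    if java_version > RESYNC_AFTER:
        # Fast resynchronisation: each \x80 is replaced on its own.
        return u'abc\ufffd\ufffd\u8b10'
    # Older behaviour: \x80\x80 is one unmapped two-byte code.
    return u'abc\ufffd\u8b10'
```

Tuples compare element by element, so (1, 8, 0, 121) and (1, 7, 0, 79) both count as later than the threshold, while (1, 7, 0, 60) itself does not.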
Any problems with this on JVMs I don't have to hand?
|
|
Date | User | Action | Args
2017-06-09 04:46:21 | zyasoft | set | status: pending -> closed
2017-03-20 08:02:29 | jeff.allen | set | status: open -> pending; resolution: fixed; messages: + msg11250
2017-03-19 17:36:31 | jeff.allen | set | messages: + msg11248
2017-03-14 07:48:32 | jeff.allen | set | versions: + Jython 2.7; messages: + msg11227; priority: normal; assignee: jeff.allen; components: + Library; type: behaviour
2017-03-14 06:17:17 | jeff.allen | create |