Issue2364
Created on 2015-05-31.11:30:16 by ztane, last changed 2015-09-22.17:28:40 by zyasoft.
msg10092 (view) |
Author: Antti Haapala (ztane) |
Date: 2015-05-31.11:30:16 |
|
bytearray uses `Character.is*` methods to implement the various bytearray.isxxx methods. This is not compatible with CPython's behaviour: the Jython bytearray tests imply a Latin-1 character encoding, whereas CPython tests strictly against 7-bit ASCII.
CPython 2.7.9:
>>> bytearray('\xc0').isalpha()
False
and Jython:
>>> bytearray('\xc0').isalpha()
True
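The difference between the two readings can be sketched in plain Python. This is only an illustration: both helper names are hypothetical and come from neither CPython nor Jython, and the Latin-1 variant merely imitates the interpretation described above by decoding the byte before testing it.

```python
# Sketch only: two ways to classify a single byte value (0-255).
# Both helper names are hypothetical; neither exists in CPython or Jython.

def isalpha_ascii(b):
    """7-bit ASCII test: only A-Z and a-z count as alphabetic."""
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A


def isalpha_latin1(b):
    """Latin-1 test: treat the byte as a Unicode code point, then classify."""
    return bytes(bytearray([b])).decode('latin-1').isalpha()


# 0x41 is 'A', 0xC0 is A-grave, 0xD7 is the multiplication sign.
for value in (0x41, 0xC0, 0xD7):
    print(hex(value), isalpha_ascii(value), isalpha_latin1(value))
```

The two functions agree on 0x41 and 0xD7 and disagree on 0xC0, which is the byte used in the report.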
|
msg10094 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2015-06-01.09:05:27 |
|
The docs say it's locale-dependent:
https://docs.python.org/2/library/stdtypes.html#str.isalpha
Jython's locale support is weak, and in our code you can see us fall back on Latin-1, as a rule. However, I agree that on examination CPython seems to have an ASCII interpretation hard-wired.
I guess they forgot to update the docs when dealing with: http://bugs.python.org/issue5793
The policy has been made explicit in Python 3.5 docs:
https://docs.python.org/3.5/library/stdtypes.html#bytearray.isalpha
In Python 3:
>>> '\xc0'.isalpha()
True
>>> b'\xc0'.isalpha()
False
>>> bytearray(b'\xc0').isalpha()
False
>>> (u'\xc0').isalpha()
True
I think consistency with Python 3 is sensible. (Differing views?)
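A quick way to check the Python 3 policy beyond the single byte 0xC0 is to enumerate all 256 byte values. The sketch below assumes a Python 3 interpreter; the variable names are only illustrative.

```python
# Python 3 only: list every byte value that bytes.isalpha() accepts and
# compare the result with the ASCII letters.
import string

alpha_bytes = [b for b in range(256) if bytes([b]).isalpha()]
ascii_letters = sorted(string.ascii_letters.encode('ascii'))

print(alpha_bytes == ascii_letters)

# For contrast, str.isalpha() applied to the same code points follows
# Unicode, so it also accepts letters above 0x7F such as U+00C0.
alpha_chars = [c for c in range(256) if chr(c).isalpha()]
print(0xC0 in alpha_chars, 0xC0 in alpha_bytes)
```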
|
msg10097 (view) |
Author: Jim Baker (zyasoft) |
Date: 2015-06-03.17:34:13 |
|
Jeff, +1. bytearray was backported from Python 3, so its behavior on CPython 3 must be considered canonical. Good to see that this is actually documented now.
|
msg10249 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2015-09-10.23:05:03 |
|
In order to fix this I have:
1. implemented character classifiers (isalpha, etc.) in BaseBytes.
2. re-implemented the BaseBytes methods using these classifiers.
3. made PyUnicode not depend on PyString for these operations.
4. given PyString implementations that use BaseBytes.isalpha, etc.
Benchmarks show the new PyString methods to be a little quicker than the old ones (as you might hope, given the simplification). Change sets:
https://hg.python.org/jython/rev/50082331db8d
and successors address this.
There are still parts of PyString that use Character.is* methods, for example the transformation methods lower, upper, title.
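The shape of the fix listed above (per-byte classifiers, with the string-level methods built on top of them) can be sketched as follows. This is a hypothetical Python rendering for illustration only: the actual change is Java code in BaseBytes and PyString, and none of the function names below are taken from it.

```python
# Hypothetical sketch of the design: ASCII-only classifiers for single
# byte values, and whole-sequence tests built from them.

def _isupper(b):
    return 0x41 <= b <= 0x5A        # 'A'..'Z'

def _islower(b):
    return 0x61 <= b <= 0x7A        # 'a'..'z'

def _isalpha(b):
    return _isupper(b) or _islower(b)

def _isdigit(b):
    return 0x30 <= b <= 0x39        # '0'..'9'

def _isalnum(b):
    return _isalpha(b) or _isdigit(b)

def _isspace(b):
    return b == 0x20 or 0x09 <= b <= 0x0D   # space, \t \n \v \f \r


def _all(data, classify):
    """True if data is non-empty and every byte satisfies classify."""
    values = bytearray(data)
    return len(values) > 0 and all(classify(b) for b in values)


def bytes_isalpha(data):
    return _all(data, _isalpha)

def bytes_isdigit(data):
    return _all(data, _isdigit)

def bytes_isalnum(data):
    return _all(data, _isalnum)

def bytes_isspace(data):
    return _all(data, _isspace)


print(bytes_isalpha(b'abc'), bytes_isalpha(b'\xc0'), bytes_isalpha(b''))
```

Keeping the classifiers as plain range checks on the byte value avoids any dependence on locale or on Unicode character tables, which is the property the thread asks for.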
|
Date | User | Action | Args
2015-09-22 17:28:40 | zyasoft | set | status: pending -> closed
2015-09-10 23:05:03 | jeff.allen | set | status: open -> pending; resolution: fixed; messages: + msg10249
2015-06-03 17:34:13 | zyasoft | set | messages: + msg10097
2015-06-01 09:05:28 | jeff.allen | set | priority: normal; assignee: jeff.allen; type: behaviour; messages: + msg10094; nosy: + jeff.allen, zyasoft
2015-05-31 11:30:16 | ztane | create |