Issue2100
Created on 2013-10-27.07:24:58 by jeff.allen, last changed 2014-12-15.20:40:39 by jeff.allen.
msg8163 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2013-10-27.07:24:57 |
|
Our implementation of the unicode type does not always deal correctly with those codepoints represented by surrogate pairs. For example:
>>> s = u"\U00010000a"
>>> s.index('a')
2
>>> s[1]
u'a'
This definitely affects the "find" family of methods (find, rfind, index, rindex) in their simplest form. In other cases, the fault is more subtle, being revealed only when a sub-range is the effective target.
>>> s = u"\U00010000hello world"
>>> s.startswith("hell",1)
False
>>> s.startswith("hell",2)
True
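To make the discrepancy concrete, here is a minimal sketch that compares the two kinds of index for the first example. It assumes CPython 3.3 or later (where str is indexed by code point), not Jython itself, and `utf16_units` is a hypothetical helper written for this illustration only.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

s = "\U00010000a"
units = utf16_units(s)

# Index of 'a' counted in UTF-16 code units: the surrogate pair for
# U+10000 occupies two units, so 'a' is the third unit.
code_unit_index = units.index(ord("a"))

# Index of 'a' counted in code points, which is what Python semantics require.
code_point_index = s.index("a")

print(code_unit_index, code_point_index)
```

The value 2 reported by s.index('a') above matches the code unit index, which suggests the result of the underlying UTF-16 search is being returned without translation back to a character index.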
|
msg8180 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2013-11-25.07:57:10 |
|
Correcting my spelling and adding a solution idea ...
A clean solution might be a general change to the way we index PyUnicode. In general, the UTF-16 representation contains a scattering of surrogate pairs, so that the code unit index is offset from the character (code point) index. We could keep a table of offsets for converting one index to the other.
Mulling this over, it seems common string operations (find, replace etc.) could still use the java.lang.String implementations, and index translation (if necessary at all) would be constant-time.
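As a rough sketch of the offset-table idea (not the code that was eventually committed), the class below records where the supplementary characters fall and uses that to translate between the two indices. The class and method names are hypothetical, and it is written for CPython 3 rather than Jython. Note that it uses binary search, so translation here is logarithmic in the number of surrogate pairs, not the constant time hoped for above; when the string has no supplementary characters the table is empty and the translation is the identity.

```python
from bisect import bisect_left

class SurrogateIndex:
    """Hypothetical helper mapping code point indices to UTF-16 code unit
    indices and back, for a string that may contain supplementary characters."""

    def __init__(self, text):
        # Code point positions of characters that need a surrogate pair.
        self.supplementary = [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]
        # Code unit position of each pair's high surrogate: the k-th pair
        # is preceded by k extra code units.
        self.pair_units = [p + k for k, p in enumerate(self.supplementary)]

    def to_code_unit(self, cp_index):
        # Each supplementary character before cp_index adds one extra unit.
        return cp_index + bisect_left(self.supplementary, cp_index)

    def to_code_point(self, unit_index):
        # Each pair starting before unit_index removes one unit.
        return unit_index - bisect_left(self.pair_units, unit_index)

idx = SurrogateIndex("\U00010000hello world")
print(idx.to_code_unit(1))   # where "hell" starts, in code units
print(idx.to_code_point(2))  # the same position, in code points
```

With a translation like this, a call such as s.startswith("hell", 1) could convert the start index to a code unit index before delegating to the java.lang.String operation, and a find result could be converted back before being returned.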
|
msg8461 (view) |
Author: Jim Baker (zyasoft) |
Date: 2014-05-21.20:34:57 |
|
Target beta 4
It would be nice if this were constant time, but correctness comes first.
|
msg8887 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2014-07-25.20:01:48 |
|
The time is ripe to try this ...
|
msg9243 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2014-12-15.20:40:39 |
|
Now fixed at
https://hg.python.org/jython/rev/191a9854396d
with subsequent recovery of constant-time performance at:
https://hg.python.org/jython/rev/b96c8402f7ba
|
Date | User | Action | Args
2014-12-15 20:40:39 | jeff.allen | set | status: open -> closed; messages: + msg9243
2014-09-06 17:39:54 | jeff.allen | link | issue2197 dependencies
2014-07-25 20:01:49 | jeff.allen | set | assignee: jeff.allen; messages: + msg8887
2014-05-21 20:34:57 | zyasoft | set | resolution: accepted; messages: + msg8461; nosy: + zyasoft
2013-11-25 07:57:10 | jeff.allen | set | messages: + msg8180; title: Deficientcies in PyUnicode beyond the BMP -> Deficiencies in PyUnicode beyond the BMP
2013-10-27 07:24:58 | jeff.allen | create |