Title: Deficiencies in PyUnicode beyond the BMP
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen, zyasoft
Priority: Keywords:

Created on 2013-10-27.07:24:58 by jeff.allen, last changed 2014-12-15.20:40:39 by jeff.allen.

msg8163 (view) Author: Jeff Allen (jeff.allen) Date: 2013-10-27.07:24:57
Our implementation of the unicode type does not always deal correctly with those codepoints represented by surrogate pairs. For example:
>>> s = u"\U00010000a"
>>> s.index('a')
>>> s[1]

This definitely affects the "find" family of methods (find, rfind, index, rindex) in their simplest form. In other cases, the fault is more subtle, being revealed only when a sub-range is the effective target.

>>> s = u"\U00010000hello world"
>>> s.startswith("hell",1)
>>> s.startswith("hell",2)
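
The mismatch behind these examples can be reproduced outside Jython. The following is a minimal sketch for CPython 3, where str is indexed by code point, that simulates the UTF-16 code unit view a java.lang.String would have. The helper names utf16_units and unit_find are hypothetical and are not part of Jython.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def unit_find(s, sub):
    """Index of sub in s counted in UTF-16 code units, or -1 if absent.

    This is the kind of index java.lang.String.indexOf reports,
    which counts chars (code units), not code points.
    """
    units, target = utf16_units(s), utf16_units(sub)
    for i in range(len(units) - len(target) + 1):
        if units[i:i + len(target)] == target:
            return i
    return -1


s = "\U00010000a"
print(len(s), len(utf16_units(s)))     # code points vs. code units
print(s.find("a"), unit_find(s, "a"))  # code point index vs. code unit index
```

The two indices differ by one for every surrogate pair that precedes the match, which is why the fault only shows up for strings containing characters beyond the BMP.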
msg8180 (view) Author: Jeff Allen (jeff.allen) Date: 2013-11-25.07:57:10
Correcting my spelling and adding a solution idea ...

A clean solution might be a general change to the way we index PyUnicode. In general, the UTF-16 representation contains a scatter of surrogate pairs, so that the code unit index is offset from the character index. We could keep a table of offsets for converting one index to the other.

Mulling this over, it seems common string operations (find, replace etc.) could still use the java.lang.String implementations, and index translation (if necessary at all) would be constant-time.
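
One way to picture the table-of-offsets idea is sketched below in CPython 3. It is an illustration only, not the scheme eventually committed to Jython: the function build_tables and the list names unit_of and cp_of are hypothetical, and the sketch assumes that spending memory proportional to the string length is acceptable in exchange for constant-time lookups in both directions.

```python
def build_tables(s):
    """Return (unit_of, cp_of) lookup lists for the string s.

    unit_of[i] is the UTF-16 code unit index at which code point i starts.
    cp_of[u] is the index of the code point that contains code unit u.
    Both lists carry one extra entry for the end-of-string position.
    """
    unit_of, cp_of = [], []
    for i, ch in enumerate(s):
        unit_of.append(len(cp_of))
        # A character beyond the BMP occupies two code units (a surrogate pair).
        width = 2 if ord(ch) > 0xFFFF else 1
        cp_of.extend([i] * width)
    unit_of.append(len(cp_of))
    cp_of.append(len(s))
    return unit_of, cp_of


s = "\U00010000hello world"
unit_of, cp_of = build_tables(s)

# A Python-level start index of 1 corresponds to code unit 2 in UTF-16,
# so a code-unit-based search would begin there.
start_unit = unit_of[1]

# A result found at code unit 2 translates back to character index 1.
found_cp = cp_of[2]
print(start_unit, found_cp)
```

With tables like these, a search could run on the underlying UTF-16 data, with each index translated by a single list lookup on the way in and on the way out. A string with no surrogate pairs would need no table at all, since the two indices coincide.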
msg8461 (view) Author: Jim Baker (zyasoft) Date: 2014-05-21.20:34:57
Target beta 4

It would be nice if this were constant time, but correctness comes first.
msg8887 (view) Author: Jeff Allen (jeff.allen) Date: 2014-07-25.20:01:48
The time is ripe to try this ...
msg9243 (view) Author: Jeff Allen (jeff.allen) Date: 2014-12-15.20:40:39
Now fixed at
with subsequent recovery of constant-time performance at:
Date | User | Action | Args
2014-12-15 20:40:39 | jeff.allen | set | status: open -> closed; messages: + msg9243
2014-09-06 17:39:54 | jeff.allen | link | issue2197 dependencies
2014-07-25 20:01:49 | jeff.allen | set | assignee: jeff.allen; messages: + msg8887
2014-05-21 20:34:57 | zyasoft | set | resolution: accepted; messages: + msg8461; nosy: + zyasoft
2013-11-25 07:57:10 | jeff.allen | set | messages: + msg8180; title: Deficientcies in PyUnicode beyond the BMP -> Deficiencies in PyUnicode beyond the BMP
2013-10-27 07:24:58 | jeff.allen | create |