Issue2100

classification
Title: Deficiencies in PyUnicode beyond the BMP
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
process
Status: open Resolution: accepted
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen, zyasoft
Priority: Keywords:

Created on 2013-10-27.07:24:58 by jeff.allen, last changed 2014-07-25.20:01:49 by jeff.allen.

Messages
msg8163 (view) Author: Jeff Allen (jeff.allen) Date: 2013-10-27.07:24:57
Our implementation of the unicode type does not always deal correctly with those codepoints represented by surrogate pairs. For example:
>>> s = u"\U00010000a"
>>> s.index('a')
2
>>> s[1]
u'a'

This definitely affects the "find" family of methods (find, rfind, index, rindex) in their simplest form. In other cases, the fault is more subtle, being revealed only when a sub-range is the effective target.

>>> s = u"\U00010000hello world"
>>> s.startswith("hell",1)
False
>>> s.startswith("hell",2)
True
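
For comparison, here is a minimal sketch of where the discrepancy comes from, written for CPython 3, whose str type is indexed by code point. It is not Jython code: the helpers utf16_units and unit_index_of are hypothetical names introduced only for illustration. The index reported above matches the position counted in UTF-16 code units, not in characters.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def unit_index_of(text, sub):
    """Index of sub within text, counted in UTF-16 code units (hypothetical helper)."""
    units, target = utf16_units(text), utf16_units(sub)
    for i in range(len(units) - len(target) + 1):
        if units[i:i + len(target)] == target:
            return i
    return -1


s = "\U00010000a"

# Index counted in code points: the result a Python user expects.
code_point_index = s.index("a")          # 1

# Index counted in UTF-16 code units: the value seen in the report.
code_unit_index = unit_index_of(s, "a")  # 2

# The sub-range case: slicing from code point 1 should skip the whole pair.
starts_at_one = "\U00010000hello world".startswith("hell", 1)  # True
```

The one supplementary character occupies two code units (a surrogate pair), so every later index is shifted by one when counting units.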
msg8180 (view) Author: Jeff Allen (jeff.allen) Date: 2013-11-25.07:57:10
Correcting my spelling and adding a solution idea ...

A clean solution might be a general change to the way we index PyUnicode. In general, the UTF-16 representation contains a scatter of surrogate pairs, so that the code unit index is offset from the character (code point) index. We could keep a table of offsets for converting one index to the other.

Mulling this over, it seems common string operations (find, replace etc.) could still use the java.lang.String implementations, and index translation (if necessary at all) would be constant-time.
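
The offset-table idea can be sketched as follows. This is an illustration in CPython 3 under stated assumptions, not the eventual Jython implementation: the class Utf16Index and its method names are hypothetical, and this version uses a binary search over the positions of supplementary characters, so translation is logarithmic in the number of surrogate pairs rather than constant-time. A string with no supplementary characters has an empty table, and translation is then the identity.

```python
import bisect


class Utf16Index:
    """Hypothetical helper mapping between code point and UTF-16 code unit indices."""

    def __init__(self, text):
        # Code point indices of characters that need a surrogate pair.
        self.supplementary = [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]
        # Code unit index of the high surrogate of each such character:
        # the k-th one is preceded by k extra units.
        self.unit_positions = [cp + k for k, cp in enumerate(self.supplementary)]

    def to_unit_index(self, cp_index):
        # Each supplementary character strictly before cp_index adds one unit.
        return cp_index + bisect.bisect_left(self.supplementary, cp_index)

    def to_code_point_index(self, unit_index):
        # Each surrogate pair starting strictly before unit_index removes one.
        # An index pointing at a low surrogate maps to the character owning it.
        return unit_index - bisect.bisect_left(self.unit_positions, unit_index)


index = Utf16Index("\U00010000a")
unit_of_a = index.to_unit_index(1)              # 2
code_point_of_a = index.to_code_point_index(2)  # 1
```

With such a table, a search could run on the underlying UTF-16 string and have its result translated back to a character index, and a start or end argument could be translated to a code unit index before the search.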
msg8461 (view) Author: Jim Baker (zyasoft) Date: 2014-05-21.20:34:57
Target beta 4

It would be nice if this were constant time, but correctness comes first.
msg8887 (view) Author: Jeff Allen (jeff.allen) Date: 2014-07-25.20:01:48
The time is ripe to try this ...
History
Date User Action Args
2014-09-06 17:39:54 jeff.allen link issue2197 dependencies
2014-07-25 20:01:49 jeff.allen set assignee: jeff.allen
messages: + msg8887
2014-05-21 20:34:57 zyasoft set resolution: accepted
messages: + msg8461
nosy: + zyasoft
2013-11-25 07:57:10 jeff.allen set messages: + msg8180
title: Deficientcies in PyUnicode beyond the BMP -> Deficiencies in PyUnicode beyond the BMP
2013-10-27 07:24:58 jeff.allen create