Issue2197

classification
Title: Out of bounds in unicode.count() with non-BMP point codes
Type: behaviour
Severity: normal
Components: Core
Versions: Jython 2.7
Milestone:
process
Status: closed
Resolution:
Dependencies: Deficiencies in PyUnicode beyond the BMP
View: 2100
Superseder:
Assigned To: jeff.allen
Nosy List: jeff.allen, zyasoft
Priority:
Keywords:

Created on 2014-08-31.22:16:28 by jeff.allen, last changed 2014-12-15.20:35:11 by jeff.allen.

Messages
msg8942 Author: Jeff Allen (jeff.allen) Date: 2014-08-31.22:16:27
I found this while working on #2100, but I think it is sufficiently separate to be its own issue. It seems that when PyUnicode.isBasicPlane() is false, count() resorts to the more complex implementation, and that implementation fails here.

>jython
Jython 2.7b3 (default:e81256215fb0, Aug 4 2014, 02:39:51)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
Type "help", "copyright", "credits" or "license" for more information.
>>> u = u"aaabbc"
>>> v = u"aaa\U00010002bcc"
>>> u.count(u'b')
2
>>> v.count(u'b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
java.lang.StringIndexOutOfBoundsException: String index out of range: 8
        at java.lang.String.charAt(String.java:658)
        at org.python.core.PyUnicode$SubsequenceIteratorImpl.nextCodePoint(PyUnicode.java:353)
        at org.python.core.PyUnicode$SubsequenceIteratorImpl.next(PyUnicode.java:342)
        at org.python.core.PyUnicode.unicode_count(PyUnicode.java:1031)
        at org.python.core.PyUnicode$unicode_count_exposer.__call__(Unknown Source)
        at org.python.core.PyObject.__call__(PyObject.java:407)
        at org.python.pycode._pyx4.f$0(<stdin>:1)
        at org.python.pycode._pyx4.call_function(<stdin>)
        at org.python.core.PyTableCode.call(PyTableCode.java:166)
        at org.python.core.PyCode.call(PyCode.java:18)
        at org.python.core.Py.runCode(Py.java:1312)
        at org.python.core.Py.exec(Py.java:1356)
        at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:231)
        at org.python.util.InteractiveInterpreter.runcode(InteractiveInterpreter.java:89)
        at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:70)
        at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:46)
        at org.python.util.InteractiveConsole.push(InteractiveConsole.java:112)
        at org.python.util.InteractiveConsole.interact(InteractiveConsole.java:93)
        at org.python.util.jython.run(jython.java:396)
        at org.python.util.jython.main(jython.java:145)

java.lang.StringIndexOutOfBoundsException: java.lang.StringIndexOutOfBoundsException: String index out of range: 8

Furthermore, this causes an endless loop:

>>> v.count(u'')
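
For reference, CPython defines the count of an empty substring as the number of code points plus one, so a counting loop must advance by at least one position per step in order to terminate. The following is a minimal Python sketch of that rule, not Jython's actual Java code; `count_code_points` is a hypothetical name, and the sketch assumes a Python in which a string is indexed by code point (as in Python 3).

```python
def count_code_points(haystack, needle):
    """Count non-overlapping occurrences of needle in haystack.

    Hypothetical helper for illustration; this is not
    PyUnicode.unicode_count. Indices are code point indices.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        # An empty needle matches before every code point and once at the end.
        return n + 1
    count = 0
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            count += 1
            i += m  # skip past the match: occurrences must not overlap
        else:
            i += 1  # always advance, so the loop terminates
    return count
```

With the strings from this report, the sketch gives 1 for `count_code_points(u"aaa\U00010002bcc", u"b")` and 8 for the empty needle, since the string holds 7 code points.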

At present, I'm working on #2100 by providing index translation where necessary to deal with supplementary characters. The current implementation uses custom iterators heavily, and mostly successfully, but I wonder whether we could use the same implementation as we do for BMP strings, together with the index translation. (I think so, but only if there are no unpaired surrogates.)
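
To illustrate what index translation means here, the following Python sketch maps a code point index to an index into the UTF-16 code units that a Java String stores. It is only an illustration under stated assumptions, not the code in PyUnicode: `utf16_units` and `unit_index` are hypothetical names, the scan is O(n), and every surrogate is assumed to be properly paired.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def unit_index(units, cp_index):
    """Translate a code point index into a UTF-16 code unit index.

    Hypothetical helper: a linear scan that assumes all surrogates
    in units are properly paired.
    """
    i = 0
    for _ in range(cp_index):
        if i >= len(units):
            raise IndexError("code point index out of range")
        # A high (lead) surrogate begins a two-unit code point.
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
    return i
```

For v = u"aaa\U00010002bcc" there are 7 code points but 8 code units, so code point index 4 (the 'b') translates to unit index 5. In Java, String.offsetByCodePoints performs a comparable translation.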
msg8943 Author: Jim Baker (zyasoft) Date: 2014-09-01.03:32:44
Thanks for finding this. One thing that I found useful in testing the non-BMP implementation, when it was under development, is running the regrtest with PyUnicode#isBasicPlane always returning false.

Sounds good about the progress on index translation. Adding an additional O(n) factor for indexing is not very nice, and certainly surprising to code that assumes it is constant (if possibly expensive).
msg8944 Author: Jeff Allen (jeff.allen) Date: 2014-09-01.08:07:47
You mean this: http://hg.python.org/jython/file/83cd10f1826d/src/org/python/core/PyUnicode.java#l134

I spotted that, once I needed it, although I'd looked at it many times before without understanding. Turning it on made test_unicode hang. For this report I reproduced the problem with an 'honest' build.

Skipping the test of count(), I find replace() also fails when the search string and its replacement are both ''.
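
For comparison, CPython treats an empty search string in replace() as matching before every code point and once at the end, so replacing '' with '' returns the string unchanged. The following is a minimal Python sketch of that behaviour, assuming code point indexing as in Python 3; `replace_code_points` is a hypothetical name and this is not Jython's implementation.

```python
def replace_code_points(s, old, new):
    """Replace every occurrence of old in s with new.

    Hypothetical helper for illustration, working on code points.
    """
    if old == "":
        # Empty pattern: insert new before each code point and at the end.
        return new.join([""] + list(s) + [""])
    parts = []
    i = 0
    while True:
        j = s.find(old, i)
        if j < 0:
            break
        parts.append(s[i:j])
        parts.append(new)
        i = j + len(old)  # old is non-empty, so i always moves forward
    parts.append(s[i:])
    return "".join(parts)
```

Under these assumptions, `replace_code_points(v, u'', u'')` returns v itself for v = u"aaa\U00010002bcc", and the loop cannot stall because each step moves past at least one code point.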
msg8954 Author: Jeff Allen (jeff.allen) Date: 2014-09-06.17:39:54
Seems necessary to take this as part of #2100 after all.
msg9239 Author: Jim Baker (zyasoft) Date: 2014-12-15.18:39:53
Looks like this is now fixed with all the recent improvements in str/unicode handling.
msg9242 Author: Jeff Allen (jeff.allen) Date: 2014-12-15.20:35:10
>>> u = u"aaabbc"
>>> v = u"aaa\U00010002bcc"
>>> u.count(u'b')
2
>>> v.count(u'b')
1
>>> u.count(u'')
7

I agree. It is hard to say exactly which change fixed it, but it was somewhere leading up to this merge:
https://hg.python.org/jython/rev/776cae0189ed
History
Date                 User        Action  Args
2014-12-15 20:35:11  jeff.allen  set     status: open -> closed; messages: + msg9242
2014-12-15 18:39:54  zyasoft     set     messages: + msg9239
2014-09-06 17:39:54  jeff.allen  set     assignee: jeff.allen; dependencies: + Deficiencies in PyUnicode beyond the BMP; messages: + msg8954
2014-09-01 08:07:47  jeff.allen  set     messages: + msg8944
2014-09-01 03:32:44  zyasoft     set     nosy: + zyasoft; messages: + msg8943
2014-08-31 22:16:29  jeff.allen  create