Issue2197

classification

Title:	Out of bounds in unicode.count() with non-BMP point codes
Type:	behaviour	Severity:	normal
Components:	Core	Versions:	Jython 2.7
		Milestone:

process

Status:	closed	Resolution:
Dependencies:	Deficiencies in PyUnicode beyond the BMP (#2100)	Superseder:
Assigned To:	jeff.allen	Nosy List:	jeff.allen, zyasoft
Priority:		Keywords:

Created on 2014-08-31.22:16:28 by jeff.allen, last changed 2014-12-15.20:35:11 by jeff.allen.

Messages
msg8942	Author: Jeff Allen (jeff.allen)	Date: 2014-08-31.22:16:27
I found this working on #2100, but it is sufficiently separate I think to be its own issue. It seems that when PyUnicode.isBasicPlane() is false, count() resorts to the more complex implementation, and this fails here.

>jython
Jython 2.7b3 (default:e81256215fb0, Aug 4 2014, 02:39:51)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
Type "help", "copyright", "credits" or "license" for more information.
>>> u = u"aaabbc"
>>> v = u"aaa\U00010002bcc"
>>> u.count(u'b')
2
>>> v.count(u'b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
java.lang.StringIndexOutOfBoundsException: String index out of range: 8
        at java.lang.String.charAt(String.java:658)
        at org.python.core.PyUnicode$SubsequenceIteratorImpl.nextCodePoint(PyUnicode.java:353)
        at org.python.core.PyUnicode$SubsequenceIteratorImpl.next(PyUnicode.java:342)
        at org.python.core.PyUnicode.unicode_count(PyUnicode.java:1031)
        at org.python.core.PyUnicode$unicode_count_exposer.__call__(Unknown Source)
        at org.python.core.PyObject.__call__(PyObject.java:407)
        at org.python.pycode._pyx4.f$0(<stdin>:1)
        at org.python.pycode._pyx4.call_function(<stdin>)
        at org.python.core.PyTableCode.call(PyTableCode.java:166)
        at org.python.core.PyCode.call(PyCode.java:18)
        at org.python.core.Py.runCode(Py.java:1312)
        at org.python.core.Py.exec(Py.java:1356)
        at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:231)
        at org.python.util.InteractiveInterpreter.runcode(InteractiveInterpreter.java:89)
        at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:70)
        at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:46)
        at org.python.util.InteractiveConsole.push(InteractiveConsole.java:112)
        at org.python.util.InteractiveConsole.interact(InteractiveConsole.java:93)
        at org.python.util.jython.run(jython.java:396)
        at org.python.util.jython.main(jython.java:145)
java.lang.StringIndexOutOfBoundsException: java.lang.StringIndexOutOfBoundsException: String index out of range: 8

Furthermore, this causes an endless loop:

>>> v.count(u'')

At present, I'm working on #2100, by providing index translation when necessary to deal with supplementary characters. The current implementation uses custom iterators heavily, and mostly successfully, but I wonder if we could not use the same implementation as we do for BMP strings with the index translation. (I think so, but only if there are no un-paired surrogates.)
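For readers unfamiliar with the distinction at stake, here is a minimal sketch (in Python 3, not the Java of PyUnicode) of counting over code points when the underlying storage is UTF-16. The helper names utf16_units, code_points and count_code_points are hypothetical and do not come from the Jython source. The sketch only illustrates why the string v above has 8 UTF-16 code units but 7 code points, and how a count loop can be written so that it terminates for an empty pattern. It is not the fix that was applied.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s, as a Java String would hold them."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def code_points(units):
    """Decode UTF-16 code units into code points, pairing surrogates.

    An unpaired surrogate is passed through as a code point of its own.
    """
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def count_code_points(haystack, needle):
    """Count non-overlapping occurrences of needle, working in code points."""
    h = code_points(utf16_units(haystack))
    n = code_points(utf16_units(needle))
    if not n:
        # The empty pattern matches at every boundary, so no loop is needed.
        return len(h) + 1
    count = 0
    i = 0
    while i + len(n) <= len(h):  # bound is in code points, not code units
        if h[i:i + len(n)] == n:
            count += 1
            i += len(n)
        else:
            i += 1
    return count
```

With v = "aaa\U00010002bcc", utf16_units(v) has 8 elements and code_points(utf16_units(v)) has 7. The traceback's "index out of range: 8" is therefore a read at the position just past the last code unit.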
msg8943	Author: Jim Baker (zyasoft)	Date: 2014-09-01.03:32:44
Thanks for finding this. One thing that I found useful in testing the non-BMP implementation, when it was under development, is running the regrtest with PyUnicode#isBasicPlane always returning false. Sounds good about the progress on index translation. Adding an additional O(n) factor for indexing is not very nice, and certainly surprising to code that assumes it is constant (if possibly expensive).
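The indexing cost mentioned here can be illustrated with a rough sketch, again in Python 3 and with hypothetical names (utf16_index_scan, supplementary_positions, utf16_index_lookup). The first function translates a code point index to a UTF-16 index by scanning, which is O(n) per lookup. The second approach, a sorted table of supplementary character positions built once, is only one possible way to avoid the repeated scan and is an assumption for illustration, not a description of what #2100 implemented.

```python
from bisect import bisect_left


def utf16_index_scan(s, cp_index):
    """Translate a code point index into a UTF-16 index by scanning: O(n)."""
    units = 0
    for ch in s[:cp_index]:
        # Characters beyond the BMP occupy two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return units


def supplementary_positions(s):
    """Code point indexes of the supplementary characters (built once, O(n))."""
    return [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]


def utf16_index_lookup(positions, cp_index):
    """Translate using the precomputed table: O(log k) for k supplementary chars.

    Each supplementary character before cp_index contributes one extra unit.
    """
    return cp_index + bisect_left(positions, cp_index)
```

For "aaa\U00010002bcc" the table is [3], and the 'b' at code point index 4 maps to UTF-16 index 5 by either route. A string with no supplementary characters has an empty table and the translation is the identity, which matches the isBasicPlane() fast path in spirit.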
msg8944	Author: Jeff Allen (jeff.allen)	Date: 2014-09-01.08:07:47
You mean this:
http://hg.python.org/jython/file/83cd10f1826d/src/org/python/core/PyUnicode.java#l134
I spotted that, once I needed it, although I'd looked at it many times before without understanding. Turning it on made test_unicode hang. For this report I reproduced the problem with an 'honest' build. Skipping the test of count, I find replace() also fails when the find and target are both ''.
msg8954	Author: Jeff Allen (jeff.allen)	Date: 2014-09-06.17:39:54
Seems necessary to take this as part of #2100 after all.
msg9239	Author: Jim Baker (zyasoft)	Date: 2014-12-15.18:39:53
Looks like this is now fixed with all the recent improvements in str/unicode handling
msg9242	Author: Jeff Allen (jeff.allen)	Date: 2014-12-15.20:35:10
>>> u = u"aaabbc"
>>> v = u"aaa\U00010002bcc"
>>> u.count(u'b')
2
>>> v.count(u'b')
1
>>> u.count(u'')
7

I agree. Hard to say quite where, but leading up to this merge:
https://hg.python.org/jython/rev/776cae0189ed

History
Date	User	Action	Args
2014-12-15 20:35:11	jeff.allen	set	status: open -> closed messages: + msg9242
2014-12-15 18:39:54	zyasoft	set	messages: + msg9239
2014-09-06 17:39:54	jeff.allen	set	assignee: jeff.allen dependencies: + Deficiencies in PyUnicode beyond the BMP messages: + msg8954
2014-09-01 08:07:47	jeff.allen	set	messages: + msg8944
2014-09-01 03:32:44	zyasoft	set	nosy: + zyasoft messages: + msg8943
2014-08-31 22:16:29	jeff.allen	create