Message9969

Author	zyasoft
Recipients	gsnedders, jeff.allen, zyasoft
Date	2015-04-26.01:10:34
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1430010635.3.0.762188904042.issue2340@psf.upfronthosting.co.za>
In-reply-to

Content

Jeff, your work on clever indexing is not in vain!

Geoffrey, your approach sounds simpler, but it would be nice to have forward support for Python 3 in this work. Also, supporting literals at some level is going to necessitate storing them in some way, just based on how we do the support now.

I'm going to look at PyUCS4 as simply a spike I did this morning to scratch an itch: it seems to demonstrate that an int[]-based version could work without immediately failing because it was going through a Java code path that required String. This makes sense because such code paths would promote str -> String -> unicode; that certainly happens in some code, but not in pure Python code, or our tests would be constantly breaking.

So here's what I suggest: we take a PEP 393 approach, and provide at least two different representations: UTF16 (using clever indexing) and UCS4 (no Java integration via __tojava__). This will be forward compatible with Python 3 as well. UCS4 can be restricted to sys.maxunicode, addressing Geoffrey's concern, so that we would see the same behavior as on CPython 2.7 or 3.4:

```
>>> u"\UFFFFFFFF"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
```
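As a rough illustration of restricting a UCS4 representation to sys.maxunicode, here is a minimal sketch in plain Python. The function name `make_ucs4` is hypothetical, and the `array("L")` buffer is only a stand-in for a Java int[]; this is not the PyUCS4 spike itself.

```python
import sys
from array import array


def make_ucs4(code_points):
    """Build an integer buffer of code points, rejecting out-of-range values.

    Hypothetical helper: array("L") stands in for a Java int[] here.
    """
    buf = array("L")
    for cp in code_points:
        # Mirror the CPython behavior shown above: anything beyond
        # sys.maxunicode is an illegal Unicode character.
        if not 0 <= cp <= sys.maxunicode:
            raise ValueError("illegal Unicode character: %#x" % cp)
        buf.append(cp)
    return buf
```

With this check, `make_ucs4([0xFFFFFFFF])` fails up front instead of storing a value that no codec could represent.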

UTF16 will use the existing codebase, and allow for minimal cost when going back and forth to Java. (IIRC, this code does check for isolated surrogates, so it is not without some cost.)
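To make the surrogate cost concrete, here is a minimal sketch of what a check for isolated surrogates over UTF-16 code units involves. It is a naive linear scan written for illustration, not the clever indexing in the existing codebase; the helper names `utf16_units` and `code_points` are made up for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]


def code_points(units):
    """Combine surrogate pairs into code points.

    Raises ValueError on an isolated (unpaired) surrogate.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError("isolated high surrogate at unit %d" % i)
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("isolated low surrogate at unit %d" % i)
        out.append(u)
        i += 1
    return out
```

Every code unit has to be inspected once, which is the kind of cost the parenthetical above refers to.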

To make this proposal even more ambitious, we could do two more things:

1. Use byte[] to back PyString, avoiding the extra byte of overhead for each byte actually used. This will help break the current subclassing of PyUnicode from PyString, which is not true in Python itself (they are both subclasses of basestring, but not of each other). This extra byte of overhead is also seen in various benchmarks that test str performance, such as working with files.

2. Remove PyStringMap for __dict__ and use PyDictionary in its place. This will also fix #1152612 (PyStringMap stopped supporting only str as keys a long time ago, but it's been hard to remove.)
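For point 1, the overhead is easy to see with a little arithmetic. The sketch below is only an illustration in plain Python: `array("H")` stands in for storage in 16-bit units (like a Java char), and `storage_cost` is a hypothetical name. It assumes the usual 2-byte item size for "H".

```python
from array import array


def storage_cost(data):
    """Return (bytes used with 16-bit units, bytes used with 8-bit units)."""
    # One 16-bit unit per byte of data, as when a str is backed by chars.
    # list(data) is needed so each byte becomes its own item.
    as_chars = array("H", list(data))
    as_bytes = bytes(data)
    return as_chars.itemsize * len(as_chars), len(as_bytes)
```

For an 11-byte str such as `b"hello world"`, the 16-bit backing needs twice the payload of the byte-backed one.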

History
Date	User	Action	Args
2015-04-26 01:10:35	zyasoft	set	messageid: <1430010635.3.0.762188904042.issue2340@psf.upfronthosting.co.za>
2015-04-26 01:10:35	zyasoft	set	recipients: + zyasoft, gsnedders, jeff.allen
2015-04-26 01:10:35	zyasoft	link	issue2340 messages
2015-04-26 01:10:34	zyasoft	create