Issue2340

classification
Title: Support isolated surrogate codepoints in unicode with UCS4 encoding
Type: Severity: normal
Components: Versions: Jython 2.7
Milestone: Jython 2.7.2
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gsnedders, jeff.allen, zyasoft
Priority: Keywords:

Created on 2015-04-25.16:20:06 by zyasoft, last changed 2015-12-29.23:44:46 by zyasoft.

Messages
msg9958 (view) Author: Jim Baker (zyasoft) Date: 2015-04-25.16:20:05
It's likely a losing battle for us if we don't support isolated surrogates in some limited fashion. Here's my proposal:

Create a new subclass of PyUnicode, PyUCS4, which uses a 32-bit int encoding (so int[]) of codepoints. This class will be used in the runtime when constructing from a isolated surrogate literal (so usable when compiling, the most usual case), as well as with unichr and composition. Since users should not be constructing directly (unless they explicitly import from org.python.core), exposing of this subclass should not be required, so long as we reverse the normal pattern we have done of final exposed methods and implementation methods:

```
--- a/src/org/python/core/PyUnicode.java	Thu Apr 23 07:49:39 2015 -0600
+++ b/src/org/python/core/PyUnicode.java	Sat Apr 25 10:03:10 2015 -0600
@@ -679,12 +679,13 @@
 
     @Override
     public PyString __repr__() {
-        return unicode___repr__();
+        return new PyString("u" + encode_UnicodeEscape(getString(), true));
+
     }
 
     @ExposedMethod(doc = BuiltinDocs.unicode___repr___doc)
     final PyString unicode___repr__() {
-        return new PyString("u" + encode_UnicodeEscape(getString(), true));
+        return __repr__();
     }
```

Then the implementation begins like so:

```
package org.python.core;

import java.util.Arrays;

public class PyUCS4 extends PyUnicode {

    private final int[] codepoints;

    public PyUCS4(int[] codepoints) {
        this.codepoints = codepoints;
    }

    @Override
    public int getCodePointCount() {
        return codepoints.length;
    }

    @Override
    public PyString __repr__() {
	    // replace with a real representation
        return new PyString("ucs4<" + Arrays.toString(codepoints) + ">");
    }

    @Override
    public Object __tojava__(Class<?> c) {
        if (c.isAssignableFrom(String.class)) {
	        throw Py.TypeError("Cannot convert unicode with isolated surrogates to Java");
        }
		return super.__tojava__(c);
    }

    @Override
    public String toString() {
        throw Py.TypeError("Cannot convert unicode with isolated surrogates to Java");
    }

}
```

So the risk in this approach is that the Jython runtime pervasively uses free conversion from unicode to String, and back, using __tojava__ and toString; and of course also does something like this for str. (As we know, this causes some amount of grief and numerous bugs.) However, my initial attempt that I have summarized here suggests that something usable can be done; and typical usage is not hitting these boundaries; and that the implementation will be straightforward and probably easy.
msg9959 (view) Author: Jim Baker (zyasoft) Date: 2015-04-25.16:26:19
An alternative, less hacky approach would be to refactor PyUnicode so that a given encoding can be plugged in. See https://www.python.org/dev/peps/pep-0393/ for what was done with CPython 3.
msg9963 (view) Author: Jim Baker (zyasoft) Date: 2015-04-25.21:02:32
See also https://github.com/pypa/virtualenv/pull/746#issuecomment-93758587
msg9965 (view) Author: Geoffrey Sneddon (gsnedders) Date: 2015-04-25.23:36:25
I have mixed feelings about this: it's nice to be able to do something like `eval(r'u"\uD800"')` to see if the Python being run on supports lone surrogates (and not having to hardcode a list of implementations that do/don't). Also seems a bit evil given in a fair few cases it'll suddenly fail much later on when dealing with the string (when toString is called, which isn't obvious from the Python level), and I'm not sure TypeError makes sense there, though I'm not sure what does.

To make a counter-proposal: allow them to parse, but doing *anything* with them throw the error. This allows for something like:

```
try:
    eval(r'u"\uD800"[0]')
except SyntaxError, TypeError:
    supportsSurrogates = False
else:
    supportsSurrogates = True

if supportsSurrogates:
    x = u"[\uD800-\uDFFF]"
else:
    x = u""
```

That said, that's quite a lot more work to do. Meh. Not sure what to suggest.

Also, just to point out: in principle, "\UFFFFFFFF" should still fail. Nothing else actually uses 32-bit code units, despite what the documentation suggests.

Speaking of PEP 939 and Python 3.3, note the definition of the Unicode type changed to be "a sequence of values that represent Unicode code points", which implicitly allows surrogates (both unpaired and paired), so the hypothetical Python 3 release of Jython will eventually have to deal with them properly somehow.
msg9966 (view) Author: Jeff Allen (jeff.allen) Date: 2015-04-26.00:01:25
I favour an approach like that of PEP 393: a unicode object is an immutable sequence of Unicode code points. It is based on an int[] array if it contains any *supplementary* code points. Is the PyUCS4 subclass effectively this pluggable representation?

There are several ways to define a conversion to a Java String, or back, depending on how surrogates and supplementary characters are treated. (The most tolerant is not reversible.) We should choose consciously the one we invoke each time.

The methods of PyUnicode would no longer be implemented using java.lang.String, but by our own code. In the end, I think that is no more difficult to get right as we have byte versions already in bytearray.

I guess I knew that clever indexing trick I provided would eventually be redundant :(
msg9969 (view) Author: Jim Baker (zyasoft) Date: 2015-04-26.01:10:34
Jeff, your work on clever indexing is not in vain!

Geoffrey, your approach sounds simpler, but it would be nice to have forward support for Python 3 in this work. Also, supporting literals at some level is going to necessitate storing them in some way, just based on how we do the support now.

I'm going to look at PyUCS4 as simply a spike I did this morning to scratch an itch: it seems to demonstrate that an int[] based-version could work without immediately failing because it was going through a Java code path that required String. This makes sense because such code paths would promote str -> String -> unicode, and while certainly in some code, not in pure Python code, or our tests would be constantly breaking.

So here's what I suggest: we take a PEP 394 approach, and provide at least two different representations: UTF16 (using clever indexing) and UCS4 (no Java integration via __tojava__). This will be forward compatible with Python 3 as well. UCS4 can be restricted to sys.maxunicode, addressing Geoffrey's concern, so that we would see the same behavior as on CPython 2.7 or 3.4:

```
>>> u"\UFFFFFFFF"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
```

UTF16 will use the existing codebase, and allows for minimal cost of going back and forth to Java. (IIRC, this code does check for isolated surrogates, so not without some cost.)

To make this proposal even more ambitious, we could do two more things:

1. Use byte[] to back PyString, avoiding the extra byte of overhead for each byte actually used. This will help break the current subclassing of PyUnicode from PyString, which is not true in Python itself (they are both subclasses of basestring, but not of each other). This extra byte of overhead is also seen in various benchmarks that test str performance, such as working with files.

2. Remove PyStringMap for __dict__ and use PyDictionary in its place. This will also fix #1152612 (PyStringMap stopped supporting only str as keys a long time ago, but it's been hard to remove.)
msg9970 (view) Author: Geoffrey Sneddon (gsnedders) Date: 2015-04-26.01:28:57
Jim, I was imagining storing them (because who doesn't love AST introspection?) but with them poisoned somehow. But looking at the code that'd mean overriding every single method in PyUnicode and that just seems a hellish amount of work.

In case it was unclear, my point about Python 3 compatibility is that lone surrogates have to work everywhere — it may well be the case that being able to convert (Py3) str objects into (Java) String objects is beneficial in many cases, but everything will have to support the case where they can't be converted. (Well, I imagine some of the Jython APIs will require no lone surrogates, but that's rather out of the scope of what the Python code can observe).
History
Date User Action Args
2015-12-29 23:44:46zyasoftsetmilestone: Jython 2.7.1 -> Jython 2.7.2
2015-04-26 01:28:57gsnedderssetmessages: + msg9970
2015-04-26 01:10:35zyasoftsetmessages: + msg9969
2015-04-26 00:01:26jeff.allensetmessages: + msg9966
2015-04-25 23:36:26gsnedderssetnosy: + gsnedders
messages: + msg9965
2015-04-25 21:02:32zyasoftsetmessages: + msg9963
2015-04-25 16:26:19zyasoftsetmessages: + msg9959
2015-04-25 16:20:06zyasoftcreate