Message9958

Author zyasoft
Recipients jeff.allen, zyasoft
Date 2015-04-25.16:20:05
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1429978806.73.0.616426690099.issue2340@psf.upfronthosting.co.za>
In-reply-to
Content
It's likely a losing battle for us if we don't support isolated surrogates in some limited fashion. Here's my proposal:

Create a new subclass of PyUnicode, PyUCS4, which uses a 32-bit int encoding (so int[]) of codepoints. This class will be used in the runtime when constructing from a isolated surrogate literal (so usable when compiling, the most usual case), as well as with unichr and composition. Since users should not be constructing directly (unless they explicitly import from org.python.core), exposing of this subclass should not be required, so long as we reverse the normal pattern we have done of final exposed methods and implementation methods:

```
--- a/src/org/python/core/PyUnicode.java	Thu Apr 23 07:49:39 2015 -0600
+++ b/src/org/python/core/PyUnicode.java	Sat Apr 25 10:03:10 2015 -0600
@@ -679,12 +679,13 @@
 
     @Override
     public PyString __repr__() {
-        return unicode___repr__();
+        return new PyString("u" + encode_UnicodeEscape(getString(), true));
+
     }
 
     @ExposedMethod(doc = BuiltinDocs.unicode___repr___doc)
     final PyString unicode___repr__() {
-        return new PyString("u" + encode_UnicodeEscape(getString(), true));
+        return __repr__();
     }
```

Then the implementation begins like so:

```
package org.python.core;

import java.util.Arrays;

public class PyUCS4 extends PyUnicode {

    private final int[] codepoints;

    public PyUCS4(int[] codepoints) {
        this.codepoints = codepoints;
    }

    @Override
    public int getCodePointCount() {
        return codepoints.length;
    }

    @Override
    public PyString __repr__() {
	    // replace with a real representation
        return new PyString("ucs4<" + Arrays.toString(codepoints) + ">");
    }

    @Override
    public Object __tojava__(Class<?> c) {
        if (c.isAssignableFrom(String.class)) {
	        throw Py.TypeError("Cannot convert unicode with isolated surrogates to Java");
        }
		return super.__tojava__(c);
    }

    @Override
    public String toString() {
        throw Py.TypeError("Cannot convert unicode with isolated surrogates to Java");
    }

}
```

So the risk in this approach is that the Jython runtime pervasively uses free conversion from unicode to String, and back, using __tojava__ and toString; and of course also does something like this for str. (As we know, this causes some amount of grief and numerous bugs.) However, my initial attempt that I have summarized here suggests that something usable can be done; and typical usage is not hitting these boundaries; and that the implementation will be straightforward and probably easy.
History
Date User Action Args
2015-04-25 16:20:06zyasoftsetrecipients: + zyasoft, jeff.allen
2015-04-25 16:20:06zyasoftsetmessageid: <1429978806.73.0.616426690099.issue2340@psf.upfronthosting.co.za>
2015-04-25 16:20:06zyasoftlinkissue2340 messages
2015-04-25 16:20:05zyasoftcreate