Message9958
It's likely a losing battle for us if we don't support isolated surrogates in some limited fashion. Here's my proposal:
Create a new subclass of PyUnicode, PyUCS4, which stores code points in a 32-bit int encoding (so int[]). The runtime will use this class when constructing from an isolated surrogate literal (so it is usable when compiling, the most usual case), as well as with unichr and composition. Since users should not be constructing it directly (unless they explicitly import from org.python.core), exposing this subclass should not be required, so long as we reverse our normal pattern of final exposed methods and implementation methods, so that the exposed method delegates to the overridable one:
```
--- a/src/org/python/core/PyUnicode.java Thu Apr 23 07:49:39 2015 -0600
+++ b/src/org/python/core/PyUnicode.java Sat Apr 25 10:03:10 2015 -0600
@@ -679,12 +679,13 @@
     @Override
     public PyString __repr__() {
-        return unicode___repr__();
+        return new PyString("u" + encode_UnicodeEscape(getString(), true));
+
     }
     @ExposedMethod(doc = BuiltinDocs.unicode___repr___doc)
     final PyString unicode___repr__() {
-        return new PyString("u" + encode_UnicodeEscape(getString(), true));
+        return __repr__();
     }
```
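The reversed pattern in the diff can be illustrated without any Jython classes. The following is only a minimal sketch: `BaseText`, `CodepointText`, `repr` and `exposedRepr` are hypothetical stand-ins (not Jython API) for PyUnicode, PyUCS4, `__repr__` and `unicode___repr__`. Because the final "exposed" method delegates to the overridable one, a subclass that is never exposed still controls what both entry points return.

```java
import java.util.Arrays;

// Hypothetical stand-in for PyUnicode; not a Jython class.
class BaseText {
    private final String value;

    BaseText(String value) {
        this.value = value;
    }

    // Overridable implementation method (plays the role of __repr__).
    public String repr() {
        return "u'" + value + "'";
    }

    // Final exposed method (plays the role of unicode___repr__).
    // It delegates to repr(), so subclass overrides are honored.
    public final String exposedRepr() {
        return repr();
    }
}

// Hypothetical stand-in for PyUCS4; not a Jython class.
class CodepointText extends BaseText {
    private final int[] codepoints;

    CodepointText(int[] codepoints) {
        super("");
        this.codepoints = codepoints;
    }

    @Override
    public String repr() {
        return "ucs4<" + Arrays.toString(codepoints) + ">";
    }
}

class ReversedPatternDemo {
    public static void main(String[] args) {
        BaseText plain = new BaseText("abc");
        BaseText ucs4 = new CodepointText(new int[] {0xD800});
        System.out.println(plain.exposedRepr());
        System.out.println(ucs4.exposedRepr());
    }
}
```

With the usual direction (implementation method calling the final exposed one), the override in the subclass would be bypassed whenever the exposed method is invoked directly.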
Then the implementation begins like so:
```
package org.python.core;

import java.util.Arrays;

public class PyUCS4 extends PyUnicode {

    private final int[] codepoints;

    public PyUCS4(int[] codepoints) {
        this.codepoints = codepoints;
    }

    @Override
    public int getCodePointCount() {
        return codepoints.length;
    }

    @Override
    public PyString __repr__() {
        // replace with a real representation
        return new PyString("ucs4<" + Arrays.toString(codepoints) + ">");
    }

    @Override
    public Object __tojava__(Class<?> c) {
        if (c.isAssignableFrom(String.class)) {
            throw Py.TypeError("Cannot convert unicode with isolated surrogates to Java");
        }
        return super.__tojava__(c);
    }

    @Override
    public String toString() {
        throw Py.TypeError("Cannot convert unicode with isolated surrogates to Java");
    }
}
```
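One reason to hold code points in an int[] rather than a Java String is that UTF-16 cannot keep two adjacent isolated surrogates apart: once joined, a high and a low surrogate read back as one supplementary code point. The sketch below uses only the Java standard library; the class and method names are made up for illustration and are not part of Jython.

```java
final class SurrogateFusionDemo {

    // True if the UTF-16 string contains a surrogate that is not part of a valid pair.
    static boolean hasIsolatedSurrogate(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // codePointAt returns the raw surrogate value when it is unpaired.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    // Code point count after joining two code points in a UTF-16 String.
    static int countAfterStringConcat(int first, int second) {
        String joined = new StringBuilder()
                .appendCodePoint(first)
                .appendCodePoint(second)
                .toString();
        return joined.codePointCount(0, joined.length());
    }

    // Code point count after joining the same two code points in an int[].
    static int countAfterArrayConcat(int first, int second) {
        int[] joined = {first, second};
        return joined.length;
    }

    public static void main(String[] args) {
        int high = 0xD800;
        int low = 0xDC00;
        // The String form fuses the two surrogates into one code point.
        System.out.println(countAfterStringConcat(high, low));
        // The int[] form keeps them as two separate code points.
        System.out.println(countAfterArrayConcat(high, low));
        System.out.println(hasIsolatedSurrogate(String.valueOf((char) high)));
    }
}
```

A check along the lines of `hasIsolatedSurrogate` is one way the runtime could decide when a literal needs the int[]-backed class. That is an assumption about how the proposal might be wired up, not something the patch above does.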
The risk in this approach is that the Jython runtime pervasively converts freely between unicode and String, in both directions, using __tojava__ and toString; and of course it does something similar for str. (As we know, this causes some amount of grief and numerous bugs.) However, my initial attempt, summarized here, suggests that something usable can be done, that typical usage does not hit these boundaries, and that the implementation will be straightforward and probably easy.

| Date | User | Action | Args |
|---|---|---|---|
| 2015-04-25 16:20:06 | zyasoft | set | recipients: + zyasoft, jeff.allen |
| 2015-04-25 16:20:06 | zyasoft | set | messageid: <1429978806.73.0.616426690099.issue2340@psf.upfronthosting.co.za> |
| 2015-04-25 16:20:06 | zyasoft | link | issue2340 messages |
| 2015-04-25 16:20:05 | zyasoft | create | |