Message9200

Author jeff.allen
Recipients Dolda2000, fwierzbicki, jeff.allen, zyasoft
Date 2014-11-09.18:07:22
Message-id <1415556443.7.0.368358759848.issue2037@psf.upfronthosting.co.za>
In-reply-to
Content
The check is easy to add, but it exposes a number of places in the core where we have not thought carefully about the difference between str and a Java String. (I know there are historical reasons.) An existing behaviour is:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s
?
>>> a = s.charAt(0)
>>> a
'\u0111'
>>> type(a)
<type 'str'>
>>> hex(ord(a))
'0x111'

With a check in the PyString constructor, I get:
>>> from java.lang import String, StringBuilder
>>> s = String(u"\u0111")
>>> s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character not in range(256)

Basically, we can no longer reliably repr() a Java object, because its toString() may contain characters outside range(256).
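
For illustration only, the kind of range check described here can be modelled in a few lines of Python. This is a sketch, not the actual Java code in the PyString constructor; the function name check_byte_range is hypothetical, and only the error message is taken from the traceback shown earlier.

```python
def check_byte_range(text):
    """Hypothetical model of the check: reject any character above 255."""
    for ch in text:
        if ord(ch) > 0xFF:
            # Same message as the ValueError in the traceback
            raise ValueError("character not in range(256)")
    return text


# Every code point is below 256, so this passes unchanged
ok = check_byte_range(u"abc")

# U+0111 is code point 273, so the check rejects it
try:
    check_byte_range(u"\u0111")
    message = None
except ValueError as err:
    message = str(err)
```

Any caller that passes the result of toString() through such a check inherits the failure, which is why the problem surfaces in repr().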

My initial attempts to recover seem only to move the problem on. In many places, obj.toString() is casually wrapped in a PyString, and these now raise. Taking my cue from Py.java2py, I believe most of these should create PyUnicode objects instead, unless the String systematically represents byte-like data.
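
The rule proposed here (unicode by default, a byte string only for byte-like data) might be sketched in Python as follows. The helper wrap_java_string and its byte_like flag are assumptions made for the example; they do not exist in Jython, and the real decision would be made in Java at each call site.

```python
def wrap_java_string(text, byte_like=False):
    """Hypothetical helper: pick a Python representation for a Java String."""
    if byte_like:
        # latin-1 maps code points 0..255 one-to-one onto byte values;
        # encode() raises UnicodeEncodeError for anything above 255
        return text.encode("latin-1")
    # Default: keep the value as unicode text, whatever characters it holds
    return text


as_text = wrap_java_string(u"\u0111")                  # stays unicode
as_bytes = wrap_java_string(u"\xff", byte_like=True)   # becomes one byte
```

With this split, only the call sites that claim byte-like data can fail, and they fail because the claim was wrong rather than because of what the text happened to contain.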

Our choice of PyString in some quite basic core code has often bothered me: what is the encoding, for example? I think it is a good thing that we are forced to get this straight. (I'll keep an eye on divergence from CPython.)
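
To show why the encoding question matters, here is a small Python sketch: the same two bytes give different text depending on which encoding is assumed. The byte values are simply the UTF-8 encoding of the character used in the examples above, and the variable names are illustrative.

```python
raw = b"\xc4\x91"

# Read as UTF-8, the two bytes form the single character U+0111
as_utf8 = raw.decode("utf-8")

# Read as latin-1, each byte becomes its own character: U+00C4 and U+0091
as_latin1 = raw.decode("latin-1")
```

A byte string with no stated encoding leaves the reader to guess between such interpretations.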

Do we agree the result above should be:
>>> s = String(u"\u0111")
>>> s
u'\u0111'