Message9200

Author	jeff.allen
Recipients	Dolda2000, fwierzbicki, jeff.allen, zyasoft
Date	2014-11-09.18:07:22
Message-id	<1415556443.7.0.368358759848.issue2037@psf.upfronthosting.co.za>
In-reply-to

Content
The check is easy to add, but it exposes a number of places in the core where we have not thought carefully about the difference between str and a Java String. (I know there are historical reasons.) An existing behaviour is:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s
?
>>> a = s.charAt(0)
>>> a
'\u0111'
>>> type(a)
<type 'str'>
>>> hex(ord(a))
'0x111'

With a check in the PyString constructor, I get:
>>> from java.lang import String, StringBuilder
>>> s = String(u"\u0111")
>>> s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character not in range(256)

Basically, we can't reliably repr() any Java types now.
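
For readers who want the check spelled out, here is a minimal sketch in plain Python. It is a model only, not the Jython source: the function name check_byte_range is hypothetical, and the real check sits in the Java PyString constructor. Only the error message is taken from the traceback above.

```python
def check_byte_range(text):
    """Return text unchanged if every character fits in one byte.

    Hypothetical model of the range check discussed above; the
    name and signature are illustrative, not Jython API.
    """
    for ch in text:
        # A str in Jython 2 is meant to hold byte values only (0-255).
        if ord(ch) > 255:
            raise ValueError("character not in range(256)")
    return text
```

Under this model, u"\u0111" (code point 0x111) is rejected, which is why wrapping the result of toString() in a PyString now fails for such a String.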

My initial attempts to recover seem only to move the problem on. In many places, obj.toString() is casually wrapped in a PyString, and these now raise. Taking my cue from Py.java2py, I believe most of these should create PyUnicode objects instead, unless the String systematically represents byte-like data.
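
The rule proposed here can be sketched as a small decision function. This is a plain-Python illustration under stated assumptions: choose_wrapper and its byte_like flag are hypothetical and do not exist in Jython; the sketch only names which wrapper type the rule would pick.

```python
def choose_wrapper(text, byte_like=False):
    """Name the Jython type that the proposed rule would use for text.

    byte_like is a hypothetical flag: True only where the Java String
    is known to carry byte-like data systematically.
    """
    if byte_like:
        # PyString is only valid when every character is in range(256).
        if any(ord(ch) > 255 for ch in text):
            raise ValueError("character not in range(256)")
        return "PyString"
    # Default for general Java Strings, such as the result of toString().
    return "PyUnicode"
```

With byte_like left at its default, any Java String maps to PyUnicode and cannot trip the range check; the check only fires where a caller has asserted byte-like data and the data contradicts that.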

Our choice of PyString in some quite basic core code has often bothered me: what is the encoding, for example? I think it is a good thing that we are forced to get it straight. (I'll keep an eye on divergence from CPython.)

Do we agree the result above should be:
>>> s = String(u"\u0111")
>>> s
u'\u0111'

History
Date	User	Action	Args
2014-11-09 18:07:23	jeff.allen	set	messageid: <1415556443.7.0.368358759848.issue2037@psf.upfronthosting.co.za>
2014-11-09 18:07:23	jeff.allen	set	recipients: + jeff.allen, fwierzbicki, zyasoft, Dolda2000
2014-11-09 18:07:23	jeff.allen	link	issue2037 messages
2014-11-09 18:07:22	jeff.allen	create