Message9210

Author jeff.allen
Recipients Dolda2000, fwierzbicki, jeff.allen, zyasoft
Date 2014-11-14.23:17:42
Message-id <1416007063.18.0.314024864608.issue2037@psf.upfronthosting.co.za>
In-reply-to
Content
>>> s = String(u"\u0111")
>>> s
u'\u0111'
"... it's the only sane thing it could be." Perhaps, but it's not what CPython would do, if it could do it. :)

Clearly, this works, and should:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s.toString()
u'\u0111'

I think it follows that s.__str__() and s.__repr__(), if not overridden, should return the same as s.toString(), and therefore a PyUnicode. CPython tolerates a unicode return from these methods, grudgingly:

class Foo(object):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return "str " + self.value
    def __repr__(self):
        return "repr " + self.value

Then in CPython:

>>> Foo(u"hello").__str__()
u'str hello'
>>> str(Foo(u"hello"))
'str hello'

If the value contains non-ASCII characters, __str__() still returns the unicode object, but str() raises an error:
>>> Foo(u"caf\u00e9").__str__()
u'str caf\xe9'
>>> str(Foo(u"caf\u00e9"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128)
>>> Foo(u"abc\u0111").__str__()
u'str abc\u0111'
>>> str(Foo(u"abc\u0111"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0111' in position 7: ordinal not in range(128)

This is the behaviour we should have for str() to address this issue.
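
As a rough illustration of why str() fails above: CPython 2 implicitly encodes a unicode result from __str__ with the default codec, which is ASCII. The sketch below reproduces only that encoding step with an explicit call; the helper name to_byte_string is hypothetical and not part of CPython or Jython.

```python
# Sketch only: mimics the implicit strict ASCII encoding that str() applies
# in CPython 2 when __str__ returns a unicode object.
# to_byte_string is a hypothetical helper, not a CPython or Jython API.

def to_byte_string(text, encoding="ascii"):
    """Encode text strictly; raises UnicodeEncodeError on unencodable characters."""
    return text.encode(encoding)

# Pure ASCII content encodes without trouble.
ok = to_byte_string(u"str hello")

# A non-ASCII character fails, and the error records where it was found.
try:
    to_byte_string(u"str caf\u00e9")
    failure = None
except UnicodeEncodeError as err:
    failure = (err.encoding, err.start)
```

Here failure holds the codec name and the index of the offending character, which correspond to the "'ascii' codec" and "position 7" in the traceback above.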

It's worth thinking about repr() too. In CPython, repr(), including the implicit call at the prompt, behaves the same way as str():
>>> Foo(u"hello").__repr__()
u'repr hello'
>>> repr(Foo(u"hello"))
'repr hello'
>>> Foo("hello")
repr hello

But notice that in the last case we don't see the value wrapped in u"" quotes: defining __repr__ expresses how you want the object to look. That's why I don't think String(u"\u0111") should echo as u'\u0111'. If we can't have:
>>> String(u"\u0111")
đ
then I think it should raise an error.

In CPython, if you want anything but ascii, you're out of luck:
>>> Foo(u"\u0111")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0111' in position 5: ordinal not in range(128)

This happens irrespective of the console encoding. See http://bugs.python.org/issue5876#msg195996, which also sheds light on the unicode __repr__ policy.

The Jython interactive interpreter does not currently behave like CPython: it respects the console encoding as it would a file encoding:

>chcp 850
Active code page: 850

>dist\bin\jython -i repl.py
>>> Foo(u"caf\u00e9") # in cp850 é is 0x82
repr café
>>> Foo(u"\u0111")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\hg\jython-int\dist\Lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0111' in position 5: character maps to <undefined>
>>> exit()

>chcp 1250
Active code page: 1250

>dist\bin\jython -i repl.py
>>> Foo(u"caf\u00e9")
repr café
>>> Foo(u"\u0111")  # in cp1250 letter đ is 0xf0
repr đ
>>>
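
The difference between the two sessions comes down to whether the console code page has a slot for the character. A minimal sketch of that check, using only the standard codecs, is given here; console_bytes is a hypothetical helper, and this is not how the Jython console is actually implemented.

```python
# Sketch only: checks whether a character can be encoded in a given
# console code page. console_bytes is a hypothetical helper name.

def console_bytes(text, console_encoding):
    """Return the encoded bytes, or None if the code page lacks the character."""
    try:
        return text.encode(console_encoding)
    except UnicodeEncodeError:
        return None

e_acute = u"\u00e9"   # é
d_stroke = u"\u0111"  # đ

results = {
    ("cp850", "e_acute"): console_bytes(e_acute, "cp850"),
    ("cp850", "d_stroke"): console_bytes(d_stroke, "cp850"),
    ("cp1250", "e_acute"): console_bytes(e_acute, "cp1250"),
    ("cp1250", "d_stroke"): console_bytes(d_stroke, "cp1250"),
}
```

Only the cp850/đ combination yields None, which matches the single traceback in the sessions above.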

I emphasise that this is what we do currently (before any change), and I intend to leave it like that. It seems useful, and no one has reported the divergence from CPython as a bug. With the proposed fix then, where supported by the console encoding, I see:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s
đ
>>> s.__repr__()
u'\u0111'
>>> repr(s)
u'\u0111'
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0111' in position 0: ordinal not in range(128)