Issue2037

classification
Title: Byte-string containing elements greater than 255
Type: behaviour
Severity: normal
Components: Core
Versions: Jython 2.7
Milestone:
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: jeff.allen
Nosy List: Dolda2000, fwierzbicki, jeff.allen, zyasoft
Priority: high
Keywords:

Created on 2013-04-06.03:02:20 by Dolda2000, last changed 2014-12-02.22:29:21 by jeff.allen.

Messages
msg7987 Author: Fredrik Tolf (Dolda2000) Date: 2013-04-06.03:02:19
Byte-strings can contain elements that aren't bytes. The problem is easily reproduced, like this:

$ jython
Jython 2.5.2 (Debian:hg/91332231a448, May 8 2012, 09:50:46) 
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_27
>>> foo = str(java.lang.String(u"\u1234"))
>>> print foo
?
>>> foo
'\u1234'

I can't say I know what the proper solution to this problem would be, but it seems strange that byte-strings should be able to contain non-byte elements.

It also seems like a bug in itself that the repr() representation of such an object does not reproduce the same object when eval'ed:

>>> eval(repr(foo))
'\\u1234'

It is also worth noting that such strings are poison even to Unicode codecs that should be able to handle any bytestring without choking:

>>> unicode(foo, "latin1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'latin-1' codec can't decode byte 0x34 in position 0: ordinal not in range(256)

Perhaps str() should raise an exception when such objects would be created?
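
For illustration, a minimal sketch of the kind of check suggested above, in plain Python rather than in Jython's Java core. The helper name check_byte_like and its error message are hypothetical, and an ordinary text string stands in for the java.lang.String being wrapped:

```python
def check_byte_like(text):
    """Return text unchanged if every code point fits in one byte.

    Hypothetical helper: it models a check that str() could make before
    wrapping a Java String. It is not Jython's actual implementation.
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFF:
            raise ValueError(
                "character U+%04X at position %d is not a byte"
                % (ord(ch), index))
    return text


# A string whose elements are all below 256 passes through unchanged.
assert check_byte_like(u"caf\u00e9") == u"caf\u00e9"

# The string from the report is rejected instead of being smuggled in.
try:
    check_byte_like(u"\u1234")
except ValueError as err:
    print("rejected:", err)
```

The scan is linear in the length of the string, so it adds a cost to every construction that goes through it.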
msg8329 Author: Jim Baker (zyasoft) Date: 2014-05-04.20:17:07
Wrapping a java.lang.String with str should make this check
msg8460 Author: Jim Baker (zyasoft) Date: 2014-05-21.20:32:34
Target beta 4
msg9192 Author: Jeff Allen (jeff.allen) Date: 2014-11-08.12:30:48
It's easy to add a check, which I'll do as a first step in the constructor(s).

The cost of doing this check grates a little.

When one steps through the example, a notable feature is that the construction happens twice: once by calling __str__(), which wraps the result of toString() in a PyString, and once by shelling that PyString to wrap its implementation string again. A simple fix ends up checking the String a second time. And it's not the only place where we still check strings that we can tell are clean a priori. I'll look for ways to avoid that where it might be frequent, as a second step.

I note this is tagged 2.5. I'll fix in the tip. Do we intend to back-port it?
msg9194 Author: Jim Baker (zyasoft) Date: 2014-11-08.17:21:23
It's unlikely we will backport fixes to 2.5 unless they are truly critical. Instead we should expect 2.7 to be used.
msg9200 Author: Jeff Allen (jeff.allen) Date: 2014-11-09.18:07:22
The check is easy to add, but it exposes a number of places in the core where we have not thought carefully about the difference between str and a Java String. (I know there are historical reasons.) An existing behaviour is:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s
?
>>> a = s.charAt(0)
>>> a
'\u0111'
>>> type(a)
<type 'str'>
>>> hex(ord(a))
'0x111'

With a check in the PyString constructor, I get:
>>> from java.lang import String, StringBuilder
>>> s = String(u"\u0111")
>>> s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character not in range(256)

Basically, we can't reliably repr() any Java types now.

My initial attempts to recover seem only to move the problem on. In many places, obj.toString() is casually wrapped in a PyString, and these now raise. Taking my cue from Py.java2py, I believe most of these should create PyUnicode objects instead, unless the String represented systematically byte-like data.

Our choice of PyString in some quite basic core code has often bothered me: what's the encoding, for example? I think it's a good thing we should be forced to get it straight. (I'll keep an eye on divergence from CPython.)

Do we agree the result above should be:
>>> s = String(u"\u0111")
>>> s
u'\u0111'
msg9201 Author: Jim Baker (zyasoft) Date: 2014-11-09.23:15:55
@Jeff, agreed with your conclusions and specifically that representation, it's the only sane thing it could be. Jython 2.2 and earlier did not differentiate unicode/str. We fixed some/most of the problems, but not all.

Incidentally this shows this will not be something we will backport to 2.5, given that it's a breaking change. Retagging accordingly.
msg9210 Author: Jeff Allen (jeff.allen) Date: 2014-11-14.23:17:42
>>> s = String(u"\u0111")
>>> s
u'\u0111'
"... it's the only sane thing it could be." Perhaps, but it's not what CPython would do, if it could do it. :)

Clearly, this works, and should:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s.toString()
u'\u0111'

I think it follows that s.__str__() and s.__repr__(), if not overridden, should return the same as s.toString(), therefore a PyUnicode. CPython tolerates that, grudgingly:

class Foo(object):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return "str " + self.value
    def __repr__(self):
        return "repr " + self.value

Then in CPython:

>>> Foo(u"hello").__str__()
u'str hello'
>>> str(Foo(u"hello"))
'str hello'

If the value contains non-ascii characters, that raises an error:
>>> Foo(u"caf\u00e9").__str__()
u'str caf\xe9'
>>> str(Foo(u"caf\u00e9"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128)
>>> Foo(u"abc\u0111").__str__()
u'str abc\u0111'
>>> str(Foo(u"abc\u0111"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0111' in position 7: ordinal not in range(128)

This is the behaviour we should have for str() to address this issue.
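
A rough model of that rule, written so that it runs on CPython 3, where text plays the part of Python 2 unicode and bytes the part of Python 2 str. The function name py2_style_str is hypothetical, and the sketch assumes the default encoding is ascii, as in the tracebacks above:

```python
def py2_style_str(obj):
    """Model Python 2's str(): a unicode __str__ result is encoded as ASCII.

    Assumption: Python 3 text stands in for Python 2 unicode, and
    Python 3 bytes for the Python 2 str type.
    """
    result = obj.__str__()
    if isinstance(result, bytes):
        return result                # already a byte-string, nothing to do
    return result.encode("ascii")    # raises UnicodeEncodeError outside ASCII


class Foo(object):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return u"str " + self.value


# Pure ASCII text is accepted and comes back as a byte-string.
print(py2_style_str(Foo(u"hello")))

# Anything outside ASCII raises rather than producing a bad byte-string.
try:
    py2_style_str(Foo(u"abc\u0111"))
except UnicodeEncodeError as err:
    print("raised:", err)
```

The point of the model is that a character above 127 never reaches the byte-string: the conversion fails first.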

It's worth thinking about repr(). repr(), including when it is implicit at the prompt, behaves the same way as str() in CPython:
>>> Foo(u"hello").__repr__()
u'repr hello'
>>> repr(Foo(u"hello"))
'repr hello'
>>> Foo("hello")
repr hello

But notice that in the last case we don't see the value wrapped in u"" quotes: defining __repr__ expresses how you want the object to look. That's why I don't think String(u"\u0111") should echo as u'\u0111'. If we can't have:
>>> String(u"\u0111")
đ
then I think it should raise an error.

In CPython, if you want anything but ascii, you're out of luck:
>>> Foo(u"\u0111")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0111' in position 5: ordinal not in range(128)

This happens irrespective of the encoding. See http://bugs.python.org/issue5876#msg195996, which also sheds light on the unicode __repr__ policy.

The Jython interactive interpreter does not currently behave like CPython: it respects the console encoding as it would a file encoding:

>chcp 850
Active code page: 850

>dist\bin\jython -i repl.py
>>> Foo(u"caf\u00e9") # in cp850 é is 0x82
repr café
>>> Foo(u"\u0111")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\hg\jython-int\dist\Lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0111' in position 5: character maps to <undefined>
>>> exit()

>chcp 1250
Active code page: 1250

>dist\bin\jython -i repl.py
>>> Foo(u"caf\u00e9")
repr café
>>> Foo(u"\u0111")  # in cp1250 letter đ is 0xf0
repr đ
>>>
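
The two sessions can be approximated outside the interpreter with the standard codecs. This is only a sketch for CPython 3: the helper name encode_for_console is hypothetical, and it models the console by simply encoding the text with the code page's codec:

```python
def encode_for_console(text, encoding):
    """Return the bytes a console using `encoding` would be sent, or None.

    None means the code page has no representation for some character,
    which is the case that raises UnicodeEncodeError in the sessions above.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


for encoding in ("cp850", "cp1250"):
    for text in (u"caf\u00e9", u"\u0111"):
        print(encoding, encode_for_console(text, encoding))
```

Both code pages can represent é, but only cp1250 has a code for đ, which is why the second session succeeds where the first raises.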

I emphasise that this is what we do currently (before any change), and I intend to leave it like that. It seems useful, and the divergence from CPython is not recorded as a bug. With the proposed fix then, where supported by the console encoding, I see:
>>> from java.lang import String
>>> s = String(u"\u0111")
>>> s
đ
>>> s.__repr__()
u'\u0111'
>>> repr(s)
u'\u0111'
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0111' in position 0: ordinal not in range(128)
msg9222 Author: Jeff Allen (jeff.allen) Date: 2014-11-30.08:58:50
Issue #2234 reveals another example of "character smuggling" in the Jython str:
>>> from java.io import StringReader
>>> from org.python.core import PyFileReader
>>> r = StringReader(u"\u86c7\u541e\u8c61")
>>> pfr = PyFileReader(r)
>>> pfr.read(1)
'\u86c7'
>>> type(pfr.read(1))
<type 'str'>

The cost of scanning every character during construction of a PyString nags a bit, but catching this kind of error (which it does) appears worthwhile. A constructor from byte[] (or ByteBuffer), that needs no checking, would be handy when the client already has bytes, and is presumably the future.
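
To make the trade-off concrete, here is a toy model in Python of a checked constructor next to an unchecked one that starts from bytes. The class ByteString, its from_bytes method and the _trusted flag are all hypothetical; this is not the PyString API:

```python
class ByteString(object):
    """Toy stand-in for a byte-oriented string type (hypothetical)."""

    def __init__(self, text, _trusted=False):
        if not _trusted:
            # Checked path: scan every character, as when wrapping a String.
            for ch in text:
                if ord(ch) > 0xFF:
                    raise ValueError("character not in range(256)")
        self._text = text

    @classmethod
    def from_bytes(cls, data):
        # Unchecked path: each byte is 0..255 by construction, and latin-1
        # maps it to the code point of the same value, so no scan is needed.
        return cls(bytes(data).decode("latin-1"), _trusted=True)

    def __len__(self):
        return len(self._text)


clean = ByteString.from_bytes(bytearray([0xF0, 0x82]))
print(len(clean))

try:
    ByteString(u"\u86c7\u541e\u8c61")
except ValueError as err:
    print("rejected:", err)
```

The design choice is that only callers who can prove their data is byte-like get to skip the scan. Everything that arrives as an arbitrary string still pays for the check.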
msg9224 Author: Jeff Allen (jeff.allen) Date: 2014-12-02.22:29:21
I claim this is fixed in https://hg.python.org/jython/rev/f0c63b42e552,
but I've also made subsequent tweaks that avoid the check in some cases when we know it is not necessary.
History
Date                 User         Action  Args
2014-12-02 22:29:21  jeff.allen   set     status: open -> closed; messages: + msg9224
2014-11-30 08:58:50  jeff.allen   set     messages: + msg9222
2014-11-14 23:17:43  jeff.allen   set     messages: + msg9210
2014-11-09 23:15:56  zyasoft      set     messages: + msg9201; versions: - Jython 2.5
2014-11-09 18:07:23  jeff.allen   set     messages: + msg9200
2014-11-08 17:21:23  zyasoft      set     messages: + msg9194
2014-11-08 12:30:49  jeff.allen   set     messages: + msg9192; versions: + Jython 2.7
2014-11-03 23:18:55  jeff.allen   set     assignee: jeff.allen; nosy: + jeff.allen
2014-06-18 17:51:53  zyasoft      set     priority: high
2014-05-21 20:32:34  zyasoft      set     messages: + msg8460
2014-05-04 20:17:07  zyasoft      set     resolution: accepted; messages: + msg8329; nosy: + zyasoft
2013-04-08 17:31:24  fwierzbicki  set     nosy: + fwierzbicki
2013-04-06 03:02:20  Dolda2000    create