Issue2638

classification
Title: str not default-decoded in str-unicode operations
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen, stefan.richthofer, zyasoft
Priority: normal Keywords:

Created on 2017-10-30.07:54:52 by jeff.allen, last changed 2017-11-18.10:59:58 by jeff.allen.

Messages
msg11637 (view) Author: Jeff Allen (jeff.allen) Date: 2017-10-30.07:54:51
Incidental to working on #2632, I noticed that mixed comparisons of unicode and str do not produce the same results in Jython as in CPython.

CPython:

>>> u = u"caf\xe9"
>>> u == u.encode('latin-1')
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

Jython:

>>> u = u"caf\xe9"
>>> u == u.encode('latin-1')
True

CPython converts the str (or whatever is on the other side of the operator) into a unicode, if it can, using the default encoding. Jython just compares the internal Java string without reference to the default encoding. The default is normally ASCII, but may be changed with sys.setdefaultencoding('utf-8'), which is likely to happen in exactly the situations where this kind of coercion matters.
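For illustration, the CPython 2 rule described above can be sketched in Python 3 terms. mixed_eq is a hypothetical helper, not anything in CPython or Jython: it tries to decode the bytes side with the default encoding, and on failure warns and treats the pair as unequal, mirroring the UnicodeWarning behaviour shown above.

```python
import warnings


def mixed_eq(u, b, default_encoding="ascii"):
    """Sketch of CPython 2's rule for u == b (unicode vs. byte string):
    decode the bytes side with the default encoding if possible,
    otherwise warn and report the operands as unequal."""
    try:
        return u == b.decode(default_encoding)
    except UnicodeDecodeError:
        warnings.warn(
            "Unicode equal comparison failed to convert both arguments "
            "to Unicode - interpreting them as being unequal",
            UnicodeWarning)
        return False


u = "caf\xe9"
print(mixed_eq(u, u.encode("latin-1")))          # False: 0xe9 is not ASCII
print(mixed_eq(u, u.encode("utf-8"), "utf-8"))   # True once decoding succeeds
```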

We should check all the binary operations of PyUnicode for this problem.
msg11638 (view) Author: Jeff Allen (jeff.allen) Date: 2017-10-30.08:04:16
Assigning to me.
msg11640 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-01.19:54:42
Many unicode methods like find(s), endswith(s) and split(s) should accept a str argument and apply the default decoding. I've created (surprisingly many) failing tests for this that pass under CPython. Because coercion is centralised, most of them are fixed by changes in just a couple of places.

I'm a bit puzzled by the code around concatenation and comparison in PyObject, PyString and PyUnicode. We seem not to have defined __add__ and __radd__ after the CPython pattern, and the delegation (using PyType.isSubType) is not the same in PyObject._add as in PyObject._eq, for example.

Just for extra fun, there's the special-casing of PyShadowString which I thought should work automatically. Maybe it would if the foundation had been like CPython?
msg11641 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-11-01.20:31:11
Jeff, unfortunately there is nothing like __req__ in Python AFAIK.
That's why PyString needs to be aware of PyShadowString. We could have something like __req__ internally, but I stuck with the simplest implementation possible so far. I did not touch __cmp__, because os.name and sys.platform are usually checked for equality in platform detection.

At least we'll get rid of this in Jython 3...
msg11643 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-02.23:12:16
The special-casing hasn't been an obstacle for this issue. I just wondered if it stemmed from not having quite reproduced the CPython pattern. However, that may be a defect in my understanding. But if not finding __req__ determined the approach, it may be worth another look. __eq__ is its own reverse. You can see this at work here:
https://hg.python.org/jython/file/tip/src/org/python/core/PyObject.java#l1464
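The point that __eq__ is its own reverse can be shown at the Python level (a minimal sketch with made-up classes, not the actual PyString/PyShadowString code): for ==, there is no __req__ because the reflected method is __eq__ itself, and when the right-hand operand's type is a subclass of the left's, its __eq__ is consulted first.

```python
class Base:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Base) and self.value == other.value


class Shadow(Base):
    # A subclass on the right-hand side gets first shot:
    # for ==, the "reflected" method is __eq__ itself.
    def __eq__(self, other):
        return True  # match anything (illustration only)


print(Base(1) == Shadow(2))  # → True: Shadow.__eq__ ran first
```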

Also, when I worked recently on PyType I noticed for the first time the ExposeAsSuperclass marker interface. I wondered if that might be a way to have shadow strings that identify as str, without a new built-in type, but that act as you need them to.
https://hg.python.org/jython/file/tip/src/org/python/core/PyType.java#l373

On the present issue ... I got passing tests, then discovered that startswith and endswith are more complicated than I'd remembered. Mixed tuples produce some interesting cases (CPython 2.7.14):
>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> 'caf\xc3\xa9'.endswith(('zz',u'f\xe9'))
True
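The mixed-tuple case above can be sketched in Python 3 terms (endswith_mixed is a hypothetical helper, not CPython's implementation): a unicode suffix in the tuple promotes the comparison to unicode, so the byte string on the left is decoded with the default encoding before the test.

```python
def endswith_mixed(s: bytes, suffixes, default_encoding="ascii"):
    """Sketch of CPython 2's str.endswith with a mixed tuple of
    byte-string and unicode suffixes."""
    for suf in suffixes:
        if isinstance(suf, str):
            # Unicode suffix: decode self with the default encoding first.
            if s.decode(default_encoding).endswith(suf):
                return True
        else:
            # Byte-string suffix: compare the raw bytes.
            if s.endswith(suf):
                return True
    return False


# b"caf\xc3\xa9" is u"caf\xe9" encoded as UTF-8, as in the example above.
print(endswith_mixed(b"caf\xc3\xa9", (b"zz", "f\xe9"), "utf-8"))  # → True
```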
msg11644 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-11-03.14:51:55
To let PyShadowString perform __eq__, PyString's __eq__ would still have to check for PyShadowString and return PyNotImplemented in that case, so as to leverage PyShadowString's own __eq__, wouldn't it?

If there's a way to get rid of ShadowString as a builtin type, that would be fine. I would just regard this as low priority for now, especially given that it won't migrate to Jython 3.
msg11656 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-14.08:40:38
This turns out to be more complicated than I thought, but I'm making reasonable progress (given other calls on time).

This issue has led me to re-work coercion generally in PyString and PyUnicode. The obvious thing to do is, in PyUnicode, to decode arguments to unicode (as a PyUnicode or a bare Java String) whenever they represent bytes, that is, whenever they have the buffer interface. At the moment (on my machine) I get this behaviour:

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
coffee
coffee

And if we feed it some non-ascii bytes, it fails identically in all four cases:

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'memoryview'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Whereas if we align the default encoding it works again:

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
café
café
>>>

The problem with this is that it is too good. In CPython, the first two cases work (str and buffer), but bytearray and memoryview raise errors.

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>>

As a result I fail a few tests where they expect these errors to be raised. However, I really prefer the consistency I'm getting to the prospect of inserting code just to make Jython inconsistent in the same way CPython is. :(
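The uniform rule proposed above — decode anything with the buffer interface using the default encoding — can be sketched in Python 3 terms (coerce_to_text is a hypothetical helper; Jython's actual coercion lives in PyUnicode's Java code):

```python
def coerce_to_text(obj, default_encoding="ascii"):
    """Sketch of the proposed coercion: if the operand exposes the
    buffer interface, decode its bytes with the default encoding;
    otherwise refuse, echoing CPython's error message."""
    if isinstance(obj, str):
        return obj
    try:
        view = memoryview(obj)  # any buffer-API object qualifies
    except TypeError:
        raise TypeError(
            "coercing to Unicode: need string or buffer, %s found"
            % type(obj).__name__)
    return view.tobytes().decode(default_encoding)


# All three buffer-API types are treated the same, unlike CPython 2:
for T in (bytes, bytearray, memoryview):
    print("" + coerce_to_text(T(b"coffee")))  # → coffee, three times
```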
msg11657 (view) Author: Jim Baker (zyasoft) Date: 2017-11-14.21:24:33
Agreed that the consistency you are proposing makes far more sense.
msg11659 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-18.10:37:18
Jim: good. I'll go that way then, by which I mean make the unicode methods consistently tolerant. Also, for unicode a and b, a + T(b.encode()) == a + b. However:

1. T(a.encode()) + b will still be a T or an error. (It's up to T.__add__.) I thought I should have a unicode.__radd__, but on second thoughts, no.

2. Tests for equality should do what CPython does: I'm thinking of the surprises assertEqual() and containers indexed by strings might throw up.

There are a few places around the codebase where "if it's not ascii it's not acceptable". In *some* of them, I think the CPython behaviour is "if it doesn't decode it's not acceptable". Because any default encoding other than ASCII is deprecated, this is not really exercised by the regression suite. Fixing things here will probably shift the focus to those places for a few users.
msg11660 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-18.10:59:58
... although this is plain weird (CPython 2.7.14):

>>> "hello"==buffer("hello")
False
>>> u"hello"==buffer("hello")
True

>>> "hello"==memoryview("hello")
True
>>> u"hello"==memoryview("hello")
False
History
Date User Action Args
2017-11-18 10:59:58  jeff.allen         set     messages: + msg11660
2017-11-18 10:37:19  jeff.allen         set     messages: + msg11659
2017-11-14 21:24:33  zyasoft            set     messages: + msg11657
2017-11-14 08:40:40  jeff.allen         set     messages: + msg11656
2017-11-03 14:51:55  stefan.richthofer  set     messages: + msg11644
2017-11-02 23:12:18  jeff.allen         set     messages: + msg11643
2017-11-01 20:31:12  stefan.richthofer  set     messages: + msg11641
2017-11-01 19:54:43  jeff.allen         set     nosy: + zyasoft, stefan.richthofer
                                                messages: + msg11640
2017-10-30 08:04:16  jeff.allen         set     assignee: jeff.allen
                                                messages: + msg11638
                                                title: str not encoded by default to unicode where CPython would -> str not default-decoded in str-unicode operations
2017-10-30 07:54:52  jeff.allen         create