Issue2638

classification
Title: str not default-decoded in str-unicode operations
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone: Jython 2.7.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen, stefan.richthofer, zyasoft
Priority: normal Keywords:

Created on 2017-10-30.07:54:52 by jeff.allen, last changed 2018-03-07.22:03:49 by jeff.allen.

Messages
msg11637 (view) Author: Jeff Allen (jeff.allen) Date: 2017-10-30.07:54:51
Incidental to working on #2632, I noticed that mixed comparisons of unicode and str do not produce the same results in Jython as in CPython.

CPython:

>>> u = u"caf\xe9"
>>> u == u.encode('latin-1')
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

Jython:

>>> u = u"caf\xe9"
>>> u == u.encode('latin-1')
True

CPython converts the str (or whatever is opposite on the operator) into a unicode, if it can, using the default encoding. Jython just compares the internal Java string without reference to the default encoding. The default is ASCII normally, but may be changed with sys.setdefaultencoding('utf-8'), which is likely to happen exactly when this kind of encoding is significant to use.

We should check all the binary operation of PyUnicode for this problem.
msg11638 (view) Author: Jeff Allen (jeff.allen) Date: 2017-10-30.08:04:16
Assigning to me.
msg11640 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-01.19:54:42
Many unicode methods like find(s), endswith(s) and split(s) should accept a str argument and apply the default decoding. I've created (surprisingly many) failing tests about this that pass for CPython. Because coercion is centralised, most of them are fixed by changes in just a couple of places.

I'm a bit puzzled by the code around concatenation and comparison in PyObject, PyString and PyUnicode. We seem not to have defined __add__ and __radd__ after the CPython pattern, and the delegation (using PyType.isSubType) is not the same in PyObject._add as in PyObject._eq, for example.

Just for extra fun, there's the special-casing of PyShadowString which I thought should work automatically. Maybe it would if the foundation had been like CPython?
msg11641 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-11-01.20:31:11
Jeff, unfortunately there is nothing like __req__ in Python AFAIK.
That's why PyString needs to be aware of PyShadowString. We could have something like __req__ internally, but I sticked to the most simplistic implementation possible so far. I did not touch __cmp__, because os.name and sys.platform are usually checked by equality for platform detection.

At least we'll get rid of this in Jython 3...
msg11643 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-02.23:12:16
The special-casing hasn't been an obstacle for this issue. I just wondered if it stemmed from not having quite reproduced the CPython pattern. However, that may be a defect in my understanding. But if not finding __req__ determined the approach, it may be worth another look. __eq__ is its own reverse. You can see this at work here:
https://hg.python.org/jython/file/tip/src/org/python/core/PyObject.java#l1464

Also, when I worked recently on PyType I noticed for the first time the ExposeAsSuperclass marker interface. I wondered if that might be a way to have shadow strings that identify as str, without a new built-in type, but that act as you need them to.
https://hg.python.org/jython/file/tip/src/org/python/core/PyType.java#l373

On the present issue ... I got passing tests, then discovered that startswith and endswith are more complicated than I'd remembered. Mixed tuples produce some interesting cases (CPython 2.7.14):
>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> 'caf\xc3\xa9'.endswith(('zz',u'f\xe9'))
True
msg11644 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-11-03.14:51:55
To let PyShadowString perform __eq__, PyString's __eq__ would still have to check for PyShadowString and return PyNotImplemented on this to leverage PyShadowString's own __eq__, wouldn't it?

If there's a way to get rid of ShadowString as builtin type that would be fine. I would just regard this low priority for now. Especially given that it won't migrate to Jython 3.
msg11656 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-14.08:40:38
This turns out to be more complicated than I thought, but I'm making reasonable progress (given other calls on time).

This issue has led me to re-work coercion generally in PyString and PyUnicode. The obvious thing to do is, in PyUnicode, to decode arguments to unicode (as a PyUnicode or a bare Java String) whenever they represent bytes, that is, whenever they have the buffer interface. At the moment (on my machine) I get this behaviour:

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
coffee
coffee

And if we feed it some non-ascii bytes, it fails identically in all four cases:

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'memoryview'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Whereas if we align the default encoding it works again:

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
café
café
>>>

The problem with this is that it too good. In CPython, the first two work (str and buffer), but bytearray and memoryview raise errors.

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>>

As a result I fail a few tests where they expect these errors to be raised. However, I really prefer the consistency I'm getting to the prospect of inserting code just to make Jython inconsistent in the same way CPython is. :(
msg11657 (view) Author: Jim Baker (zyasoft) Date: 2017-11-14.21:24:33
Agreed that the consistency you are proposing makes far more sense.
msg11659 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-18.10:37:18
Jim: good. I'll go that way then, by which I mean make the unicode methods consistently tolerant. Also for unicode a and b, a + T(b.encode()) = a + b. However:

1. T(a.encode()) + b will still be a T or an error. (It's up to T.__add__.) I thought I should have a unicode.__radd__, but not on second thoughts.

2. Tests for equality should do what CPython does: I'm thinking of the surprises assertEqual() and containers indexed by strings might throw up.

There are a few places around the codebase where "if it's not ascii it's not acceptable". In *some* of them, I think the CPython behaviour will be "if it doesn't decode it's not acceptable". Because any other default encoding is deprecated, this is not really tested in the regression suite. Fixing things here will probably move the focus along to those for a few users.
msg11660 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-18.10:59:58
... although this is plain weird (CPython 2.7.14):

>>> "hello"==buffer("hello")
False
>>> u"hello"==buffer("hello")
True

>>> "hello"==memoryview("hello")
True
>>> u"hello"==memoryview("hello")
False
msg11669 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-21.23:22:50
We now have fairly complete support for default encoding when mixing bytes and unicode, thanks to these two change sets:
https://hg.python.org/jython/rev/78482073e91f
https://hg.python.org/jython/rev/f71e0b2cfaf7

People who call sys.setdefaultencoding should now have an experience closer to CPython. I'm hoping this will help with #2633 in these circumstances.

Reading test_unicode_jy.DefaultDecodingTestCase gives a pretty good account of where Jython diverges from CPython, by doing more. The test passes for CPython 2.7.14, thanks to a few if-statements testing for Jython. I've made what I think is a reasonable compromise between CPython behaviour and consistency in the comparisons/equality.

One loose end: str.find(unicode) returns an index in the encoded string, not a byte offset in the original. I think this is wrong, but is what CPython does.

Oh, and I may have broken shadowstring ... is there a test? I'd quite like to modify startswith.
msg11671 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-11-21.23:44:00
> I may have broken shadowstring ... is there a test?

No, a test is a todo.

Easiest way to check this is to try the commands illustrated in http://bugs.jython.org/issue2557.

Additionally, ctypes via JyNI relies on this, especially on Windows. So an "easy" way to check its workability is to run JyNI's ctypes test. I can check that, given that JyNI isn't pip-installable yet.

> I'd quite like to modify startswith.

Do as you like, I'll take care to fix shadowstring afterwards.

I'd appreciate if this issue would not be closed until I give my okay.
Generally speaking I just want to make sure that 2.7.2 wouldn't break JyNI.
msg11672 (view) Author: Jim Baker (zyasoft) Date: 2017-11-22.00:02:21
+1 on keeping shadowstring/JyNI compatibility. But this should hopefully be straightforward.
msg11680 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-22.21:42:41
As we care whether it works, we should have a test for it to alert us when we break it. I'll add one as you indicate, Stefan and you can improve it if it fails to capture the essence.

By "modify startswith" I meant the one in PyShadowstring, because it copied code from PyString that I've now changed. Ideally, it would call the one in PyString instead, or maybe both call common code. This might be easier if the shadow string it held were a PyString itself. I would feel safer trying that with a test in place.
msg11681 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-11-23.16:07:16
Alright, Jeff. Thanks for adding a test!
Regarding "modify startswith" just go ahead, I'll review and fix things as necessary. I'll maybe find time for that on the sprint next weekend.
msg11758 (view) Author: Jeff Allen (jeff.allen) Date: 2018-03-07.18:53:41
Stefan wrote:

> I'd appreciate if this issue would not be closed until I give my okay.
> Generally speaking I just want to make sure that 2.7.2 wouldn't break JyNI.

OK?
msg11764 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-03-07.19:57:58
Sorry, I forgot to follow up on this. Give me a day or so to check...
msg11765 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2018-03-07.21:37:12
Okay, JyNI works as usual with current trunk. Can be closed I suppose.
msg11767 (view) Author: Jeff Allen (jeff.allen) Date: 2018-03-07.22:03:49
A fine example of European co-operation.
History
Date User Action Args
2018-03-07 22:03:49jeff.allensetstatus: pending -> closed
resolution: accepted -> fixed
messages: + msg11767
2018-03-07 21:37:12stefan.richthofersetmessages: + msg11765
2018-03-07 19:57:58stefan.richthofersetmessages: + msg11764
2018-03-07 18:53:42jeff.allensetstatus: open -> pending
messages: + msg11758
milestone: Jython 2.7.2
2017-11-23 16:07:17stefan.richthofersetmessages: + msg11681
2017-11-22 21:42:41jeff.allensetstatus: pending -> open
messages: + msg11680
2017-11-22 00:02:21zyasoftsetmessages: + msg11672
2017-11-21 23:44:00stefan.richthofersetmessages: + msg11671
2017-11-21 23:22:50jeff.allensetstatus: open -> pending
resolution: accepted
messages: + msg11669
2017-11-18 10:59:58jeff.allensetmessages: + msg11660
2017-11-18 10:37:19jeff.allensetmessages: + msg11659
2017-11-14 21:24:33zyasoftsetmessages: + msg11657
2017-11-14 08:40:40jeff.allensetmessages: + msg11656
2017-11-03 14:51:55stefan.richthofersetmessages: + msg11644
2017-11-02 23:12:18jeff.allensetmessages: + msg11643
2017-11-01 20:31:12stefan.richthofersetmessages: + msg11641
2017-11-01 19:54:43jeff.allensetnosy: + zyasoft, stefan.richthofer
messages: + msg11640
2017-10-30 08:04:16jeff.allensetassignee: jeff.allen
messages: + msg11638
title: str not encoded by default to unicode where CPython would -> str not default-decoded in str-unicode operations
2017-10-30 07:54:52jeff.allencreate