Message11656

Author jeff.allen
Recipients jeff.allen, stefan.richthofer, zyasoft
Date 2017-11-14.08:40:38
Message-id <1510648840.19.0.213398074469.issue2638@psf.upfronthosting.co.za>
In-reply-to
Content
This turns out to be more complicated than I thought, but I'm making reasonable progress (given other calls on my time).

This issue has led me to re-work coercion generally in PyString and PyUnicode. The obvious thing to do is, in PyUnicode, to decode arguments to unicode (as a PyUnicode or a bare Java String) whenever they represent bytes, that is, whenever they have the buffer interface. At the moment (on my machine) I get this behaviour:

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
coffee
coffee

And if we feed it some non-ascii bytes, it fails identically in all four cases:

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'memoryview'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Whereas if we align the default encoding it works again:

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
café
café
>>>
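
The uniform rule described above (decode anything that exposes the buffer interface, using the default encoding) can be sketched as follows. This is only an illustration in Python 3, not the Java code in PyUnicode: the names coerce_to_text and concat are hypothetical, bytes stands in for the Python 2 str, and the encoding parameter stands in for sys.getdefaultencoding().

```python
def coerce_to_text(obj, encoding="ascii"):
    """Return obj as text, decoding it if it exposes the buffer interface.

    Hypothetical helper: 'encoding' plays the role of the default encoding.
    """
    if isinstance(obj, str):
        return obj  # already text, nothing to decode
    try:
        view = memoryview(obj)  # succeeds for any bytes-like object
    except TypeError:
        raise TypeError("cannot coerce %s to text" % type(obj).__name__)
    # Every bytes-like type takes the same path, so all of them succeed
    # or fail identically for a given encoding.
    return view.tobytes().decode(encoding)


def concat(text, other, encoding="ascii"):
    """Mimic u"" + other with uniform coercion of the right operand."""
    return text + coerce_to_text(other, encoding)


raw = "caf\xe9".encode("utf-8")
for T in (bytes, bytearray, memoryview):
    # ascii() keeps the printed result 7-bit safe on any console
    print(T.__name__, ascii(concat("", T(raw), encoding="utf-8")))
```

With encoding left at "ascii", the same loop raises UnicodeDecodeError for all three types, which mirrors the identical failures in the second session above.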

The problem with this behaviour is that it is too good. In CPython, the first two work (str and buffer), but bytearray and memoryview raise errors.

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>>
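
For comparison, matching CPython would mean special-casing two of the bytes-like types before the common decoding path. A rough sketch of that, again in Python 3 with hypothetical names (REJECTED, coerce_cpython_style) and with bytes standing in for the Python 2 str; the error messages are copied from the CPython sessions above:

```python
# Types that CPython 2.7 refuses to coerce, with the messages it reports.
REJECTED = {
    bytearray: "decoding bytearray is not supported",
    memoryview: "coercing to Unicode: need string or buffer, memoryview found",
}


def coerce_cpython_style(obj, encoding="ascii"):
    """Decode bytes-like obj to text, except for the types CPython rejects.

    Hypothetical helper: 'encoding' plays the role of the default encoding.
    """
    if isinstance(obj, str):
        return obj  # already text
    for rejected_type, message in REJECTED.items():
        if isinstance(obj, rejected_type):
            # The extra branch exists only to reproduce CPython's errors.
            raise TypeError(message)
    return memoryview(obj).tobytes().decode(encoding)
```

The extra loop is the "inserted code" in question: without it, bytearray and memoryview would decode exactly as the other bytes-like types do.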

As a result I fail a few tests where they expect these errors to be raised. However, I really prefer the consistency I'm getting to the prospect of inserting code just to make Jython inconsistent in the same way CPython is. :(