Message11656
This turns out to be more complicated than I thought, but I'm making reasonable progress (given other calls on time).
This issue has led me to re-work coercion generally in PyString and PyUnicode. The obvious thing to do is, in PyUnicode, to decode arguments to unicode (as a PyUnicode or a bare Java String) whenever they represent bytes, that is, whenever they have the buffer interface. At the moment (on my machine) I get this behaviour:
>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
coffee
coffee
And if we feed it some non-ASCII bytes, it fails identically in all four cases:
>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'memoryview'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Whereas if we align the default encoding it works again:
>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
café
café
>>>
The problem with this is that it is too good. In CPython, the first two work (str and buffer), but bytearray and memoryview raise errors:
>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>>
As a result, I fail a few tests that expect these errors to be raised. However, I really prefer the consistency I'm getting to the prospect of inserting code just to make Jython inconsistent in the same way CPython is. :(
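To make the trade-off concrete, here is a rough sketch of the two coercion policies discussed above. It is only an illustration, not the actual PyUnicode code (which is Java): the function names and the encoding parameter are hypothetical, and it is written in Python 3 syntax so it runs on a current interpreter, where there is no buffer type and str is already text.

```python
# Illustrative sketch only; not Jython's implementation.
# coerce_uniform and coerce_like_cpython2 are hypothetical names.

def coerce_uniform(obj, encoding="ascii"):
    """Decode anything that exposes the buffer interface, all alike."""
    if isinstance(obj, str):
        return obj  # already text, nothing to decode
    try:
        # memoryview() accepts only objects with the buffer interface
        view = memoryview(obj)
    except TypeError:
        raise TypeError("coercing to text: need string or buffer, %s found"
                        % type(obj).__name__)
    # Copy the bytes out and decode with the (default) encoding
    return view.tobytes().decode(encoding)


def coerce_like_cpython2(obj, encoding="ascii"):
    """Same, but special-case the two types CPython 2 refuses."""
    if isinstance(obj, bytearray):
        raise TypeError("decoding bytearray is not supported")
    if isinstance(obj, memoryview):
        raise TypeError("coercing to Unicode: need string or buffer, "
                        "memoryview found")
    return coerce_uniform(obj, encoding)
```

In this sketch the uniform version treats bytes, bytearray and memoryview identically, so whether the decode succeeds depends only on the encoding. The CPython-like version needs two extra type checks purely to reproduce the errors the tests expect.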
Date | User | Action | Args
2017-11-14 08:40:40 | jeff.allen | set | messageid: <1510648840.19.0.213398074469.issue2638@psf.upfronthosting.co.za>
2017-11-14 08:40:40 | jeff.allen | set | recipients: + jeff.allen, zyasoft, stefan.richthofer
2017-11-14 08:40:40 | jeff.allen | link | issue2638 messages
2017-11-14 08:40:38 | jeff.allen | create |