Message11656

Author	jeff.allen
Recipients	jeff.allen, stefan.richthofer, zyasoft
Date	2017-11-14.08:40:38
Message-id	<1510648840.19.0.213398074469.issue2638@psf.upfronthosting.co.za>
In-reply-to

Content

This turns out to be more complicated than I thought, but I'm making reasonable progress (given other calls on my time).

This issue has led me to re-work coercion generally in PyString and PyUnicode. The obvious thing to do is, in PyUnicode, to decode arguments to unicode (as a PyUnicode or a bare Java String) whenever they represent bytes, that is, whenever they have the buffer interface. At the moment (on my machine) I get this behaviour:

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
coffee
coffee

And if we feed it some non-ASCII bytes, it fails identically in all four cases:

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'memoryview'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Whereas if we align the default encoding with the data (UTF-8), it works again:

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
café
café
>>>
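
For illustration only, the uniform rule described above (decode whenever the argument has the buffer interface) might be sketched as follows. This is a Python 3 sketch, not the Java code in PyUnicode: the name coerce_to_unicode and its encoding parameter are hypothetical, the explicit parameter stands in for the interpreter's default encoding, and Python 3's bytes plays the role of the Python 2 str (there is no buffer type in Python 3).

```python
def coerce_to_unicode(obj, encoding="ascii"):
    """Hypothetical helper: decode any bytes-like argument the same way.

    `encoding` stands in for the interpreter's default encoding.
    """
    if isinstance(obj, str):
        return obj  # already text, nothing to decode
    try:
        # memoryview() accepts any object exposing the buffer interface
        view = memoryview(obj)
    except TypeError:
        raise TypeError(
            "coercing to Unicode: need string or buffer, %s found"
            % type(obj).__name__
        )
    # Every buffer-backed type takes this one decoding path
    return view.tobytes().decode(encoding)


if __name__ == "__main__":
    s = "caf\xe9".encode("utf-8")
    for T in (bytes, bytearray, memoryview):
        try:
            print(coerce_to_unicode(T(s), "utf-8"))
        except Exception as e:
            print(T, e)
```

Because every bytes-like type goes through the same path, a decoding failure (or success) is the same for all of them, which is the consistency seen in the sessions above.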

The problem with this is that it is too good. In CPython, the first two work (str and buffer), but bytearray and memoryview raise errors, whatever the default encoding:

>>> s = "coffee"
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
coffee
coffee
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> s = u"caf\xe9".encode('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
<type 'str'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'buffer'> 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found

>>> import sys; reload(sys).setdefaultencoding('utf-8')
>>> for T in (str, buffer, bytearray, memoryview):
...     try:
...         print u"" + T(s)
...     except Exception as e:
...         print T, e
...
café
café
<type 'bytearray'> decoding bytearray is not supported
<type 'memoryview'> coercing to Unicode: need string or buffer, memoryview found
>>>
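
To make the CPython behaviour concrete, here is a rough Python 3 sketch of a rule that reproduces the errors shown in these sessions. It is an assumption-laden illustration, not CPython's actual implementation: the function name cpython_style_coerce and the encoding parameter (standing in for the default encoding) are hypothetical, bytes stands in for the Python 2 str, and the buffer type has no Python 3 equivalent so it is omitted. The error messages are copied from the output above.

```python
def cpython_style_coerce(obj, encoding="ascii"):
    """Hypothetical helper mimicking the special cases seen in CPython 2."""
    if isinstance(obj, str):
        return obj  # already text
    # These two types are rejected outright, even though they hold bytes
    if isinstance(obj, bytearray):
        raise TypeError("decoding bytearray is not supported")
    if isinstance(obj, memoryview):
        raise TypeError(
            "coercing to Unicode: need string or buffer, memoryview found"
        )
    # Anything else exposing the buffer interface is decoded
    return memoryview(obj).tobytes().decode(encoding)


if __name__ == "__main__":
    s = "caf\xe9".encode("utf-8")
    for T in (bytes, bytearray, memoryview):
        try:
            print(cpython_style_coerce(T(s), "utf-8"))
        except Exception as e:
            print(T, e)
```

The two isinstance checks are exactly the kind of extra code that would have to be inserted to reproduce the inconsistency.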

As a result, I fail a few tests that expect these errors to be raised. However, I really prefer the consistency I'm getting to the prospect of inserting code just to make Jython inconsistent in the same way CPython is. :(

History
Date	User	Action	Args
2017-11-14 08:40:40	jeff.allen	set	messageid: <1510648840.19.0.213398074469.issue2638@psf.upfronthosting.co.za>
2017-11-14 08:40:40	jeff.allen	set	recipients: + jeff.allen, zyasoft, stefan.richthofer
2017-11-14 08:40:40	jeff.allen	link	issue2638 messages
2017-11-14 08:40:38	jeff.allen	create