Issue2164

classification
Title: codecs do not accept memoryview objects for decoding
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone:
process
Status: open Resolution: remind
Dependencies: Superseder:
Assigned To: Nosy List: jeff.allen, santa4nt, zyasoft
Priority: Keywords:

Created on 2014-06-10.20:57:28 by zyasoft, last changed 2014-09-18.02:33:02 by zyasoft.

Messages
msg8624 (view) Author: Jim Baker (zyasoft) Date: 2014-06-10.20:57:27
Difference between CPython and Jython seen with this example:

# -*- coding: utf-8 -*-

import codecs

data = memoryview(b"中文")
text, decoded_bytes = codecs.utf_8_decode(data)
assert text == u"中文"
assert type(text) is unicode
assert decoded_bytes == 6

This works fine on CPython. On Jython, it fails with TypeError: utf_8_decode(): 1st arg can't be coerced to String

Current workaround is to use tobytes on the memoryview object:

text, decoded_bytes = codecs.utf_8_decode(data.tobytes())
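
A minimal sketch of how calling code could wrap this workaround so that it runs unchanged on both CPython and an affected Jython build. The helper name utf8_decode_compat is hypothetical and not part of codecs; it assumes the failure always surfaces as a TypeError, as reported above.

```python
# -*- coding: utf-8 -*-
import codecs


def utf8_decode_compat(data):
    """Decode UTF-8 from a bytes-like object (hypothetical helper).

    Tries the direct call first and, if the codec rejects a memoryview
    with TypeError, retries with a bytes copy from tobytes().
    """
    try:
        return codecs.utf_8_decode(data)
    except TypeError:
        if isinstance(data, memoryview):
            # Workaround from this issue: copy the view into a byte string.
            return codecs.utf_8_decode(data.tobytes())
        raise


# Escapes are used so the sample does not depend on the source encoding.
raw = u"\u4e2d\u6587".encode("utf-8")
text, consumed = utf8_decode_compat(memoryview(raw))
```

The fallback costs one copy of the data, so it is only a stopgap until the codecs accept buffer objects directly.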
msg8625 (view) Author: Jim Baker (zyasoft) Date: 2014-06-10.20:57:38
Target beta 4
msg8633 (view) Author: Jeff Allen (jeff.allen) Date: 2014-06-12.19:20:08
I'd happily take this on unless someone is itching to get to know the buffer interface better.
msg8638 (view) Author: Santoso Wijaya (santa4nt) Date: 2014-06-13.18:04:19
Sounds interesting to me. Any tips?
msg8639 (view) Author: Jeff Allen (jeff.allen) Date: 2014-06-13.20:38:07
I decided step 1 was to make PyBuffer extend AutoCloseable, because this work by Indra Talip would have been neater:
http://hg.python.org/jython/rev/355bb70327e0

Been meaning to since Java 7. So I've done that (testing now, maybe push tonight). You can take over from there if you like.

This article is about the buffer protocol: https://wiki.python.org/jython/BufferProtocol , but it needs to be updated with the change I just made.

If you look into how some choice codecs work, at the bottom they all seem to depend on entry points in modules/_codecs.java, so it's those that need changing. For a start, accept a PyObject obytes argument, then something like:
if (obytes instanceof BufferProtocol) {
    try (PyBuffer bytes = ((BufferProtocol)obytes).getBuffer(PyBUF.SIMPLE)) {
        ...
    }
} else {
    throw Py.TypeError("must be string or buffer, not " ... )
}

You should then find the existing code bytes.charAt() still works, or it might be better to say this stuff really is bytes now. The soft option is to ask for it as a String again, but IMO that's perpetuating a misdemeanor.

My worry was that a lot of helper methods, and maybe some clients of these methods, would have to change signature, so it would end up really quite extensive. Maybe they should anyway.
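
For readers more at home in Python than in the Java internals, the dispatch described above can be modelled roughly as follows. This is only an illustrative analogue, not the Jython implementation: decode_from_buffer is a hypothetical name, memoryview() stands in for getBuffer(PyBUF.SIMPLE), and the wording of the error message after "not " is an assumption.

```python
import codecs


def decode_from_buffer(obj, encoding="utf-8"):
    """Hypothetical model of a decoder entry point taking any buffer object."""
    try:
        # Stand-in for ((BufferProtocol) obytes).getBuffer(PyBUF.SIMPLE).
        view = memoryview(obj)
    except TypeError:
        # Assumed completion of the message; the original snippet elides it.
        raise TypeError("must be string or buffer, not " + type(obj).__name__)
    try:
        # Treat the content strictly as bytes, then hand it to the codec.
        return codecs.getdecoder(encoding)(view.tobytes())
    finally:
        # Mirrors the try-with-resources close(); memoryview.release()
        # only exists on Python 3.2 and later, so look it up defensively.
        release = getattr(view, "release", None)
        if release is not None:
            release()
```

The point of the shape is that acquisition, use, and release of the buffer are confined to one place, which is what making PyBuffer AutoCloseable allows on the Java side.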

I couldn't find a test that exposes this problem, so I was going to add to test_codecs_jy.py, something like:
def round_trip(self, u, name):
    # Method of a TestCase; the test module needs "import codecs".
    s = u.encode(name)
    dec = codecs.getdecoder(name)
    for B in (buffer, memoryview, bytearray):
        self.assertEqual(u, dec(B(s))[0])

(I think that's correct.) Then call it with a variety of unicode strings and codec names.
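
A self-contained sketch of how that round-trip idea might look as a complete test. The class name, the sample strings, the codec list, and the run_tests helper are all assumptions for illustration; buffer is included only where the interpreter provides it (Python 2 and Jython 2.7).

```python
import codecs
import unittest

try:
    # Python 2 / Jython 2.7 also have the old-style buffer type.
    WRAPPERS = (buffer, memoryview, bytearray)
except NameError:
    WRAPPERS = (memoryview, bytearray)


class BufferDecodeTest(unittest.TestCase):
    """Decoders should accept any object supporting the buffer protocol."""

    def round_trip(self, u, name):
        s = u.encode(name)
        dec = codecs.getdecoder(name)
        for B in WRAPPERS:
            # dec() returns (text, bytes_consumed); compare the text only.
            self.assertEqual(u, dec(B(s))[0])

    def test_round_trip(self):
        samples = (u"", u"abc", u"caf\u00e9")
        for name in ("utf-8", "utf-16", "latin-1"):
            for u in samples:
                self.round_trip(u, name)
        # Non-Latin text, as in the original report.
        self.round_trip(u"\u4e2d\u6587", "utf-8")


def run_tests():
    """Run the case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(BufferDecodeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() should pass on CPython and, per the report above, fail with TypeError on a Jython build that still has this bug.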
msg8642 (view) Author: Jeff Allen (jeff.allen) Date: 2014-06-14.14:40:54
Ok, I committed the helpful change to PyBuffer and made the Wiki change.
msg8688 (view) Author: Jim Baker (zyasoft) Date: 2014-06-19.00:34:58
Jeff, thanks, sounds like a reasonable set of changes that we need to propagate through the codecs implementation.
msg9006 (view) Author: Jim Baker (zyasoft) Date: 2014-09-18.02:33:02
Target beta 4
History
Date                 User        Action  Args
2014-09-18 02:33:02  zyasoft     set     resolution: remind
                                         messages: + msg9006
2014-06-19 00:34:58  zyasoft     set     messages: + msg8688
2014-06-14 14:40:54  jeff.allen  set     messages: + msg8642
2014-06-13 20:38:08  jeff.allen  set     messages: + msg8639
2014-06-13 18:04:19  santa4nt    set     messages: + msg8638
2014-06-12 19:20:08  jeff.allen  set     nosy: + jeff.allen
                                         messages: + msg8633
2014-06-11 01:51:44  santa4nt    set     type: behaviour
2014-06-11 01:51:37  santa4nt    set     nosy: + santa4nt
                                         components: + Core
                                         versions: + Jython 2.7
2014-06-10 20:57:39  zyasoft     set     messages: + msg8625
2014-06-10 20:57:28  zyasoft     create