Issue2632

classification
Title: Handle unicode data appropriately in csv module
Type: behaviour Severity: normal
Components: Library Versions:
Milestone: Jython 2.7.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen
Priority: normal Keywords:

Created on 2017-10-21.07:16:14 by jeff.allen, last changed 2018-11-04.15:01:36 by jeff.allen.

Messages
msg11625 Author: Jeff Allen (jeff.allen) Date: 2017-10-21.07:16:13
As reported in https://github.com/jythontools/jython/issues/90, an attempt to write non-ascii text via csv results in the infamous:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

Being old-school Python 2, this module thinks in bytes and leaves encoding to the user. By convention (?), the content of a CSV will be interpreted as UTF-8, so clients sensitive to the problem will supply encoded data. We reverse this philosophy in Python 3.

Almost certainly we should use a ByteBuffer where we presently use a StringBuilder since the file is in binary mode and the user should expect to encode the text before calling csv.writer.writerow().
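A minimal sketch of the client-side convention described above, in which the caller encodes text as UTF-8 so that the csv writer only ever sees bytes. The helper name encode_row and the sample row are hypothetical, not part of the csv module or of Jython, and the helper only handles string fields.

```python
def encode_row(row, encoding="utf-8"):
    """Hypothetical helper: return the row with every text field encoded to bytes."""
    encoded = []
    for field in row:
        if isinstance(field, bytes):
            # Already a byte string (str in Python 2): pass it through unchanged.
            encoded.append(field)
        else:
            # Text (unicode in Python 2): encode explicitly, UTF-8 by assumption.
            encoded.append(field.encode(encoding))
    return encoded


# Sample data, invented for illustration.
row = encode_row([u"name", u"K\xf6ln"])
```

Under Python 2 or Jython 2.7, the result would then be passed to csv.writer(f).writerow(row), with f opened in binary mode ('wb').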
msg11631 Author: Jeff Allen (jeff.allen) Date: 2017-10-25.16:03:31
Actually, non-ascii text is ok unless you supply it as a unicode object. In that case, we buffer up the Java chars internally (UTF-16), and then try to treat this String as bytes, hence the error. If the client supplies a unicode object, I believe we should be encoding it with the default encoding. In the same circumstances, CPython says something like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 12: ordinal not in range(128)

So the StringBuilder can stay, but we ought to encode unicode objects as they arrive, if only so that we can fail the way CPython does.
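The failure mode can be reproduced without the csv module, as a rough illustration only. The sketch assumes the default encoding is 'ascii' (as in CPython 2) and hard-codes it rather than reading it from the interpreter; the sample text is invented.

```python
text = u"Gr\xf6\xdfe"   # invented sample; the first non-ASCII character is at index 2

failure = None
try:
    # Encoding with 'ascii' stands in for Python 2's default encoding (assumption).
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Record which codec failed, where, and why.
    failure = (exc.encoding, exc.start, exc.reason)

# An explicit UTF-8 encode of the same text succeeds and yields bytes.
encoded = text.encode("utf-8")
```

Encoding each unicode object as it arrives means the error is raised at the offending field, rather than later when the buffered characters are treated as bytes.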
msg11670 Author: Jeff Allen (jeff.allen) Date: 2017-11-21.23:31:50
Possibly fixed in https://hg.python.org/jython/rev/08978c4d1ab0

We now accept unicode objects and write them with the default encoding (like CPython).
History
Date                 User        Action  Args
2018-11-04 15:01:36  jeff.allen  set     status: pending -> closed
                                         resolution: accepted -> fixed
2017-11-21 23:31:50  jeff.allen  set     status: open -> pending
                                         resolution: accepted
                                         messages: + msg11670
                                         milestone: Jython 2.7.2
2017-10-25 16:03:31  jeff.allen  set     messages: + msg11631
                                         title: Handle byte data transparently in csv module -> Handle unicode data appropriately in csv module
2017-10-21 07:16:14  jeff.allen  create