Title: Handle unicode data appropriately in csv module
Type: behaviour Severity: normal
Components: Library Versions:
Status: open Resolution:
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: jeff.allen
Priority: normal Keywords:

Created on 2017-10-21.07:16:14 by jeff.allen, last changed 2017-10-25.16:03:31 by jeff.allen.

msg11625 Author: Jeff Allen (jeff.allen) Date: 2017-10-21.07:16:13
As reported in, an attempt to write non-ascii text via csv results in the infamous:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

Being old-school Python 2, this module thinks in bytes and leaves encoding to the user. By convention (?), the content of a CSV will be interpreted as UTF-8, so clients sensitive to the problem will supply encoded data. Python 3 reverses this philosophy: there the csv module works with text, and the file object takes care of the encoding.
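
For comparison, here is a minimal sketch of the Python 3 model, where csv deals in text and the stream does the encoding. The helper name write_rows_utf8 and the sample rows are made up for illustration, and UTF-8 is chosen only because it is the convention mentioned above.

```python
import csv
import io


def write_rows_utf8(rows):
    """Write rows as CSV text and return the UTF-8 encoded bytes.

    Hypothetical helper: in Python 3 the csv writer only sees text,
    and the wrapping text stream performs the encoding.
    """
    raw = io.BytesIO()
    # newline="" leaves line endings under the control of the csv module.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
    csv.writer(text).writerows(rows)
    text.flush()  # push buffered text through the encoder into raw
    return raw.getvalue()


# Non-ascii text is passed as ordinary str objects (escapes used for clarity).
data = write_rows_utf8([["name", "city"], ["Bj\u00f6rn", "Malm\u00f6"]])
```

The Python 2 module discussed in this issue works the other way round: the client is expected to hand over fields that are already encoded.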

Almost certainly we should use a ByteBuffer where we presently use a StringBuilder since the file is in binary mode and the user should expect to encode the text before calling csv.writer.writerow().
msg11631 Author: Jeff Allen (jeff.allen) Date: 2017-10-25.16:03:31
Actually, non-ascii text is ok unless you supply it as a unicode object. In that case, we buffer up the Java chars internally (UTF-16), and then try to treat the resulting String as bytes, hence the error. If the client supplies a unicode object, I believe we should be encoding it with the default encoding. In the same circumstances, CPython says something like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 12: ordinal not in range(128)

So the StringBuilder can stay, but we ought to encode unicode objects as they arrive, if only so that we can fail the way CPython does.
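
A rough sketch of the "encode as they arrive" idea, written in Python 3 purely as an illustration and not as the Jython implementation: here str stands in for the Python 2 unicode type and bytes for the Python 2 str type. The names DEFAULT_ENCODING, encode_field and encode_row are hypothetical, and 'ascii' is assumed as the default encoding because that is the codec named in the CPython message above.

```python
# Assumption: the default encoding is 'ascii', as in the CPython 2 error above.
DEFAULT_ENCODING = "ascii"


def encode_field(value, encoding=DEFAULT_ENCODING):
    """Return one field as bytes, encoding text as soon as it arrives."""
    if isinstance(value, bytes):
        # Already encoded by the client: pass it through untouched.
        return value
    # Text (or any other object, via str()) is encoded immediately, so a
    # non-ascii character fails here with UnicodeEncodeError.
    return str(value).encode(encoding)


def encode_row(row, encoding=DEFAULT_ENCODING):
    """Encode every field of a row before it reaches the byte-oriented writer."""
    return [encode_field(field, encoding) for field in row]
```

With this shape, pre-encoded UTF-8 bytes pass straight through, while a field such as "Bj\u00f6rn" supplied as text raises UnicodeEncodeError at the point where it is handed over, rather than failing later when the buffer is treated as bytes.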
Date                 User        Action  Args
2017-10-25 16:03:31  jeff.allen  set     messages: + msg11631
                                         title: Handle byte data transparently in csv module -> Handle unicode data appropriately in csv module
2017-10-21 07:16:14  jeff.allen  create