Issue 1840479

classification
Title: coding: utf-8 and PEP 0263?
Type:
Severity: normal
Components: Core
Versions:
Milestone:
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: cgroves, hoehle, otmarhumbel, zyasoft
Priority: normal
Keywords:

Created on 2007-11-28.18:47:31 by hoehle, last changed 2008-09-13.23:21:16 by zyasoft.

Messages
msg2022 Author: Jörg Höhle (hoehle) Date: 2007-11-28.18:47:31
Hi,

My understanding of PEP 0263 is that the "coding: utf-8" declaration in
the first line should determine how the .py file is read.
Alas, the PEP says "Python-Version: 2.3",
whereas jython-2.2 is documented as corresponding to Python 2.2.
http://www.python.org/dev/peps/pep-0263/
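
For reference, the declaration that PEP 263 looks for can be sketched in a few lines. This is only an illustration written for Python 3, not Jython's implementation: the function name find_declared_encoding is made up for the example, the pattern is the one given in the current text of PEP 263, and the sketch ignores details such as a byte order mark.

```python
import re

# Pattern from PEP 263: a comment containing "coding:" or "coding=" and a name.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_declared_encoding(source_bytes, default="ascii"):
    """Return the encoding named in the first two lines, or `default`."""
    for line in source_bytes.splitlines()[:2]:
        # Decode as Latin-1 so that every byte maps to a character and
        # the lookup itself can never fail on non-ASCII bytes.
        match = CODING_RE.match(line.decode("latin-1"))
        if match:
            return match.group(1)
    return default


header = b"# foo.py -*- coding: utf-8 -*- http://www.python.org/peps/pep-0263.html\n"
print(find_declared_encoding(header + b"x = 1\n"))  # prints: utf-8
```

A reader that honours the declaration would then decode the rest of the file with the returned name instead of a platform default.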

So possibly mine is not a bug, but a feature request.

How can I use UTF-8 umlauts in my .py files with Jython?

# foo.py -*- coding: utf-8 -*- http://www.python.org/peps/pep-0263.html
inlineds =  "zäöü!"
inlinedu = u"zäöü!"
explicits=  "z\u00e4\u00f6\u00fc!"
explicitu= u"z\u00e4\u00f6\u00fc!"
all4=[inlineds,inlinedu,explicits,explicitu]
print all4, [len(s) for s in all4]

On a RedHat 5 system this produces:
['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]
Jython 2.2 on java1.6.0_05-ea
uname -a
Linux foo.xy 2.6.9-55.0.9.ELsmp #1 SMP Tue Sep 25 02:16:15 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
LANG=de_DE@UTF-8

Debian produces expected results:
['z\xE4\xF6\xFC!', u'z\xE4\xF6\xFC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [5, 5, 20, 5]
Jython 2.2 on java1.6.0_02
uname -a
Linux debianbasic 2.6.18-5-686 #1 ... i686 GNU/Linux
LANG=de_DE.UTF-8

However, even on the Debian system changing $LANG gives
LANG=C ./jython.sh foo.py
[u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]

It all behaves as if Jython reads the .py file using Java's default
encoding (which is influenced by $LANG but, AFAIK, cannot be set directly).

Either of these calls yields Java's default encoding:
java.nio.charset.Charset.defaultCharset()
java.io.OutputStreamWriter(java.io.ByteArrayOutputStream()).getEncoding()
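
The three kinds of output seen so far can be reproduced by decoding the same UTF-8 bytes with different charsets. A small sketch in Python 3 (not Jython 2.2), assuming the file is saved as UTF-8 and that the JVM defaults involved behave like UTF-8, ISO-8859-1 and US-ASCII; the variable names are only illustrative:

```python
# Bytes as they sit on disk when "zäöü!" is saved as UTF-8.
raw = "z\u00e4\u00f6\u00fc!".encode("utf-8")

as_utf8 = raw.decode("utf-8")                     # charset matches the file
as_latin1 = raw.decode("iso-8859-1")              # each byte becomes one character
as_ascii = raw.decode("ascii", errors="replace")  # non-ASCII bytes become U+FFFD

for label, text in [("utf-8", as_utf8),
                    ("iso-8859-1", as_latin1),
                    ("ascii", as_ascii)]:
    print(label, ascii(text), len(text))
```

The lengths come out as 5, 8 and 8, which matches the Debian, RedHat and LANG=C results respectively.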

I've now installed 2.2.1 and the results have changed, although still
not satisfactorily. The Debian system now always yields:
['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]
like RedHat before, regardless of $LANG.

Thus jython-2.2.1 seems to strictly assume ISO-8859-1 in .py files. At least the 2.2.1 behaviour is consistent between the
RedHat and Debian systems I tested.

Regards,
 Jörg Höhle
msg2023 Author: Jörg Höhle (hoehle) Date: 2007-11-28.18:51:00
I should mention that I'm using standalone mode (for ease of use for my Java colleagues).
msg2024 Author: Oti Humbel (otmarhumbel) Date: 2007-11-28.21:08:56
I am pretty sure it is a missing feature, since I've been missing it too.
Standalone mode should not make any difference here.
msg2025 Author: Charlie Groves (cgroves) Date: 2007-12-08.22:22:55
Yes, this is just a missing feature. One of the major changes for 2.2.1 was to no longer use Charset.defaultCharset: it introduces unpredictable behavior between platforms, as you saw. PEP 263 will definitely appear in the next major version of Jython. For now, you're stuck using explicit Unicode escapes to get umlauts in .py files.
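
In the meantime, non-ASCII text can be converted to escapes mechanically. A rough sketch in Python 3, meant to be run outside Jython as a one-off conversion: to_unicode_escapes is a made-up helper name, it assumes every character lies in the Basic Multilingual Plane (four hex digits), and it does not escape quotes or backslashes already present in the text.

```python
def to_unicode_escapes(text):
    """Replace every non-ASCII character with a \\uXXXX escape (BMP only)."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                  # plain ASCII stays as it is
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)    # e.g. U+00E4 -> \u00e4
        else:
            raise ValueError("character outside the BMP: %r" % ch)
    return "".join(parts)


print('u"%s"' % to_unicode_escapes("z\u00e4\u00f6\u00fc!"))
```

As the explicits and explicitu lines in the first message show, the escapes are only interpreted inside u"..." literals; in a plain "..." string they stay as literal backslash sequences.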
msg3554 Author: Jim Baker (zyasoft) Date: 2008-09-13.23:21:16
Fixed in 2.5 and tested with test_pep263
History
Date                 User     Action  Args
2008-09-13 23:21:16  zyasoft  set     status: open -> closed
                                      nosy: + zyasoft
                                      resolution: fixed
                                      messages: + msg3554
2007-11-28 18:47:31  hoehle   create