Message2022

Author hoehle
Date 2007-11-28.18:47:31
Content
Hi,

My understanding of PEP 263 is that the "coding: utf-8" declaration in the first
line should determine how .py files are read.
Alas, the PEP says: Python-Version: 2.3
whereas jython-2.2 is documented as corresponding to Python 2.2.
http://www.python.org/dev/peps/pep-0263/
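
For reference, PEP 263 says the declaration is a comment on line 1 or 2 that matches the regular expression coding[:=]\s*([-\w.]+). The following is only a minimal sketch of that lookup, written for CPython 3 rather than Jython 2.2; the helper name declared_encoding is hypothetical, and a real tokenizer also handles details such as a byte-order mark.

```python
import re

# Pattern quoted in PEP 263 for the magic comment on line 1 or 2.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None.

    Hypothetical helper for illustration; not Jython's actual parser.
    """
    for line in source_lines[:2]:
        stripped = line.lstrip()
        if stripped.startswith("#"):
            match = CODING_RE.search(stripped)
            if match:
                return match.group(1)
    return None


# The first line of foo.py from this report.
first_line = "# foo.py -*- coding: utf-8 -*- http://www.python.org/peps/pep-0263.html"
print(declared_encoding([first_line]))
```

An implementation of PEP 263 would pick up utf-8 from the first line of foo.py this way and decode the rest of the file accordingly.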

So possibly mine is not a bug, but a feature request.

How can I use UTF-8 umlauts in my .py files with Jython?

# foo.py -*- coding: utf-8 -*- http://www.python.org/peps/pep-0263.html
inlineds =  "zäöü!"
inlinedu = u"zäöü!"
explicits=  "z\u00e4\u00f6\u00fc!"
explicitu= u"z\u00e4\u00f6\u00fc!"
all4=[inlineds,inlinedu,explicits,explicitu]
print all4, [len(s) for s in all4]

On a Red Hat 5 system this produces:
['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]
Jython 2.2 on java1.6.0_05-ea
uname -a
Linux foo.xy 2.6.9-55.0.9.ELsmp #1 SMP Tue Sep 25 02:16:15 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
LANG=de_DE@UTF-8

Debian produces expected results:
['z\xE4\xF6\xFC!', u'z\xE4\xF6\xFC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [5, 5, 20, 5]
Jython 2.2 on java1.6.0_02
uname -a
Linux debianbasic 2.6.18-5-686 #1 ... i686 GNU/Linux
LANG=de_DE.UTF-8

However, even on the Debian system, changing $LANG changes the result:
LANG=C ./jython.sh foo.py
[u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]

Everything behaves as if Jython read the .py file using Java's default
encoding (which is influenced by $LANG but, AFAIK, cannot be set directly).

Either of the following expressions yields Java's default encoding:
java.nio.charset.Charset.defaultCharset()
java.io.OutputStreamWriter(java.io.ByteArrayOutputStream()).getEncoding()
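
The three patterns seen above can be reproduced outside Jython by decoding the same UTF-8 bytes with different charsets. This is a sketch in CPython 3 that only simulates the hypothesis; it does not show what Jython does internally, and the helper read_as is a made-up name.

```python
# The string from foo.py, written with escapes so this sketch is ASCII-only.
TEXT = "z\u00e4\u00f6\u00fc!"

# What an editor configured for UTF-8 stores on disk for that literal.
file_bytes = TEXT.encode("utf-8")


def read_as(charset):
    """Decode the file bytes with the given charset (hypothetical helper).

    "replace" substitutes U+FFFD for every byte the charset cannot decode.
    """
    return file_bytes.decode(charset, "replace")


for charset in ("utf-8", "iso-8859-1", "ascii"):
    decoded = read_as(charset)
    print(charset, ascii(decoded), len(decoded))
```

Decoding as UTF-8 gives 5 characters, as in the Debian run with LANG=de_DE.UTF-8. ISO-8859-1 gives 8 characters, one per byte, as in the Red Hat run. ASCII gives 8 characters with six U+FFFD replacements, as in the LANG=C run.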

I've now installed 2.2.1, and the results change, although they are still
not satisfactory. The Debian system now always yields:
['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]
like Red Hat before, regardless of $LANG.

Thus Jython 2.2.1 seems to strictly assume ISO-8859-1 in .py files. At least the
2.2.1 behaviour is consistent between the Red Hat and Debian systems I tested.
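
If the file really is read as ISO-8859-1, the misreading is lossless and can in principle be undone at run time by re-encoding to ISO-8859-1 and decoding as UTF-8. This is a sketch of that idea in CPython 3, under that assumption only; it has not been tried on Jython 2.2.1, and repair is a hypothetical name.

```python
def repair(misread):
    """Undo a UTF-8 source literal having been decoded as ISO-8859-1.

    Hypothetical helper: ISO-8859-1 maps every byte to exactly one
    character, so encoding back recovers the original UTF-8 bytes.
    """
    return misread.encode("iso-8859-1").decode("utf-8")


# The value reported for inlinedu under Jython 2.2.1 in this report.
misread = "z\xc3\xa4\xc3\xb6\xc3\xbc!"
fixed = repair(misread)
print(ascii(fixed), len(fixed))
```

This does not help in the LANG=C case under 2.2, where the bytes were already replaced by U+FFFD. Writing the literals with \u escapes inside u"..." strings, like explicitu above, gave the expected 5 characters in every run reported here.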

Regards,
 Jörg Höhle
History
Date User Action Args
2008-02-20 17:18:07 admin link issue1840479 messages
2008-02-20 17:18:07 admin create