Issue1840479

classification

Title:	coding: utf-8 and PEP 0263?
Type:		Severity:	normal
Components:	Core	Versions:
		Milestone:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	cgroves, hoehle, otmarhumbel, zyasoft
Priority:	normal	Keywords:

Created on 2007-11-28.18:47:31 by hoehle, last changed 2008-09-13.23:21:16 by zyasoft.

Messages
msg2022	Author: Jörg Höhle (hoehle)	Date: 2007-11-28.18:47:31
Hi,

My understanding of PEP 0263 is that the "coding: utf-8" in the first line should influence how .py files are read. Alas, the PEP says "Python-Version: 2.3", whereas jython-2.2 is documented as corresponding to Python 2.2.
http://www.python.org/dev/peps/pep-0263/
So possibly mine is not a bug but a feature request. How can I use UTF-8 umlauts in my .py files with Jython?

    # foo.py -- coding: utf-8 -- http://www.python.org/peps/pep-0263.html
    inlineds = "zäöü!"
    inlinedu = u"zäöü!"
    explicits= "z\u00e4\u00f6\u00fc!"
    explicitu= u"z\u00e4\u00f6\u00fc!"
    all4=[inlineds,inlinedu,explicits,explicitu]
    print all4, [len(s) for s in all4]

On a RedHat 5 system this produces:

    ['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]

    Jython 2.2 on java1.6.0_05-ea
    uname -a
    Linux foo.xy 2.6.9-55.0.9.ELsmp #1 SMP Tue Sep 25 02:16:15 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
    LANG=de_DE@UTF-8

Debian produces the expected results:

    ['z\xE4\xF6\xFC!', u'z\xE4\xF6\xFC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [5, 5, 20, 5]

    Jython 2.2 on java1.6.0_02
    uname -a
    Linux debianbasic 2.6.18-5-686 #1 ... i686 GNU/Linux
    LANG=de_DE.UTF-8

However, even on the Debian system, changing $LANG gives:

    LANG=C ./jython.sh foo.py
    [u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]

Everything happens as if Jython read the .py file using Java's default encoding (which is influenced by $LANG but cannot be set directly, AFAIK). Both of the following yield Java's default encoding:

    java.nio.charset.Charset.defaultCharset()
    java.io.OutputStreamWriter(java.io.ByteArrayOutputStream()).getEncoding()

I've now installed 2.2.1 and the results change, although they are still not satisfactory. The Debian system now always yields:

    ['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]

like RedHat before, regardless of $LANG. Thus jython-2.2.1 seems to strictly assume ISO-8859-1 in .py files. At least the 2.2.1 behaviour is consistent between the RedHat and Debian systems I tested.

Regards,
Jörg Höhle
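
The lengths reported above (5 versus 8) can be reproduced without Jython by decoding the same UTF-8 bytes in three ways. This is a minimal sketch in plain Python, not Jython's actual file reader; the variable names are illustrative, and the ASCII-with-replacement case is only an assumption about what Java does under LANG=C, chosen because it matches the U+FFFD output in the report.

```python
# The intended 5-character string, written with escapes so this file is pure ASCII.
text = u"z\u00e4\u00f6\u00fc!"

# The bytes that a UTF-8 editor stores in foo.py for that string.
utf8_bytes = text.encode("utf-8")

# Reading the bytes as UTF-8 recovers the original 5 characters.
as_utf8 = utf8_bytes.decode("utf-8")

# Reading them as ISO-8859-1 maps every byte to one character: 8 characters,
# matching the 'z\xC3\xA4\xC3\xB6\xC3\xBC!' output.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Assumption: under LANG=C the decoder is ASCII and substitutes U+FFFD
# for each of the six non-ASCII bytes, again giving 8 characters.
as_ascii = utf8_bytes.decode("ascii", "replace")

print(len(as_utf8))
print(len(as_latin1))
print(len(as_ascii))
```

Each umlaut occupies two bytes in UTF-8, which is why any byte-per-character reading turns three umlauts into six characters.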
msg2023	Author: Jörg Höhle (hoehle)	Date: 2007-11-28.18:51:00
I should mention that I'm using standalone-mode (for ease of use for my Java colleagues).
msg2024	Author: Oti Humbel (otmarhumbel)	Date: 2007-11-28.21:08:56
I am pretty sure it is a missing feature, since I've been missing it too. Standalone mode should not make any difference here.
msg2025	Author: Charlie Groves (cgroves)	Date: 2007-12-08.22:22:55
Yes, this is just a missing feature. One of the major changes for 2.2.1 was to no longer use Charset.defaultCharset: it introduces unpredictable behavior between platforms as you saw. PEP 263 will definitely appear in the next major version of Jython. For now you're stuck using explicit unicode escapes to get umlauts in .py files.
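
The escape workaround is robust because a line containing only ASCII bytes decodes identically under every encoding that appears in this report. A minimal sketch under that assumption follows; `source_line`, `readings` and `same_everywhere` are illustrative names only. Note that, as the original report shows, in Jython 2.2 the escapes are only interpreted inside a u"" literal; in a plain "" literal they stay as 20 literal characters.

```python
# The workaround line as it would be stored on disk: ASCII bytes only.
source_line = b'explicitu = u"z\\u00e4\\u00f6\\u00fc!"'

# An ASCII-only line decodes to the same text under all three encodings,
# so it no longer matters which one the reader assumes.
readings = [source_line.decode(enc) for enc in ("ascii", "iso-8859-1", "utf-8")]
same_everywhere = readings[0] == readings[1] == readings[2]

# The parser resolves the escapes, giving the intended 5 characters.
explicitu = u"z\u00e4\u00f6\u00fc!"

print(same_everywhere)
print(len(explicitu))
```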
msg3554	Author: Jim Baker (zyasoft)	Date: 2008-09-13.23:21:16
Fixed in 2.5 and tested with test_pep263
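
For reference, PEP 263 looks for the encoding declaration in a comment on the first or second line of the file. The sketch below illustrates that lookup using the regular expression given in the PEP; it is not Jython's implementation, and `declared_encoding` is a hypothetical helper name. It returns None when no declaration is found and leaves the choice of default to the caller.

```python
import re

# Regular expression from PEP 263 for a coding declaration in a comment.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Return the encoding named in the first two lines, or None."""
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


# The first line of foo.py from the original report.
header = b"# foo.py -- coding: utf-8 -- http://www.python.org/peps/pep-0263.html\n"
print(declared_encoding(header + b"inlinedu = 1\n"))
```

With a lookup like this, the source bytes can be decoded with the declared encoding rather than with a platform default or a fixed ISO-8859-1.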

History
Date	User	Action	Args
2008-09-13 23:21:16	zyasoft	set	status: open -> closed nosy: + zyasoft resolution: fixed messages: + msg3554
2007-11-28 18:47:31	hoehle	create