Issue1841

classification

Title:	Non-ASCII environment variables are encoded incorrectly in os.environ
Type:		Severity:	normal
Components:		Versions:
		Milestone:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	amak, fwierzbicki, pekka.klarck, zyasoft
Priority:		Keywords:

Created on 2012-02-17.22:45:43 by pekka.klarck, last changed 2015-03-17.14:22:09 by zyasoft.

Messages
msg6782 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2012-02-17.22:45:42
On my Linux machine with UTF-8 system encoding I got the following:

    $ a=ä python
    Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
    [GCC 4.4.5] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.environ['a']
    '\xc3\xa4'
    >>> _.decode('UTF-8')
    u'\xe4'

    $ a=ä jython
    Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
    [Java HotSpot(TM) Server VM (Sun Microsystems Inc.)] on java1.6.0_21
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.environ['a']
    '\xe4'

I have seen Jython return similarly wrong bytes earlier (e.g. #1592 and #1593) and know that I can decode them using this hack:

    >>> from java.lang import String
    >>> String(os.environ['a']).toString()
    u'\xe4'

The problem is that if I set environment variables myself and encode them correctly, the hack doesn't work:

    >>> os.environ['b'] = u'\xe4'.encode('UTF-8')
    >>> String(os.environ['b']).toString()
    u'\xc3\xa4'

In other words, I need to know whether the value was set before or during the execution. It turns out that I actually can do that using java.lang.System.getenv, which only knows about the former:

    >>> from java.lang.System import getenv
    >>> getenv('a')
    u'\xe4'
    >>> getenv('b') is None
    True

Notice also how getenv above returned the correct value as Unicode.
msg6830 (view)	Author: Alan Kennedy (amak)	Date: 2012-03-19.18:18:19
What is the setting of "python.console.encoding" in your registry file? Is it set to the actual encoding of your shell?

Note also that you should really be passing an encoding to the String constructor when decoding from bytes, i.e.

    >>> os.environ['b'] = u'\xe4'.encode('UTF-8')
    >>> String(os.environ['b'], "UTF-8").toString()

If you don't specify an encoding, the bytes are unlikely to be decoded properly.
msg7799 (view)	Author: Frank Wierzbicki (fwierzbicki)	Date: 2013-02-26.18:27:45
No answer in a long time, closing as out of date.
msg7815 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2013-02-26.21:29:35
Sorry, hadn't noticed Alan's question. Where is the registry file stored? I certainly haven't touched it.
msg7816 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2013-02-26.21:32:24
Found the registry. "python.console.encoding" is commented out.
msg7817 (view)	Author: Frank Wierzbicki (fwierzbicki)	Date: 2013-02-26.22:00:11
Pekka: by default it is commented out, I think Alan is suggesting that you specify an encoding. Opening back up.
msg9331 (view)	Author: Jim Baker (zyasoft)	Date: 2015-01-07.06:50:02
Behavior as of https://hg.python.org/jython/rev/ea036792f304 - note that a unicode object is now returned, not a str:

    $ a=ä jython27 -c "import os; print repr(os.environ['a'])"
    u'\xe4'

It's really the only choice we have since we are on Java. But at least it's not an incorrect bytestring.
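For code that has to run both on CPython 2 (where os.environ yields byte strings) and on a Jython build with this change (where it yields unicode), one possible way to cope is to decode only when the value came back as bytes. This is a minimal sketch: the helper name getenv_unicode, its environ parameter, and the UTF-8 default are assumptions for illustration, not part of Jython or CPython.

```python
import os


def getenv_unicode(name, encoding="utf-8", environ=None):
    """Return an environment value as text, decoding only if it is bytes.

    `encoding` is an assumed default; the caller should pass whatever
    encoding the surrounding system really uses.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        value = value.decode(encoding)
    return value
```

Passing a plain dict as environ makes the helper easy to exercise without touching the real process environment.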
msg9379 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2015-01-13.17:06:51
Inconsistency with CPython is not ideal, but returning correct Unicode is definitely better than returning incorrect bytes. How does setting environment variables work now? Should it also be set as Unicode and not as bytes like with CPython? Do you get correct Unicode out if you later query the value?
msg9430 (view)	Author: Jim Baker (zyasoft)	Date: 2015-01-20.23:29:35
Setting os.environ should be done with Unicode, but we can support bytes as well if we have an encoding. Otherwise we will get an error like so:

    ======================================================================
    ERROR: test_env_bytes (__main__.OSUnicodeTestCase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "dist/Lib/test/test_os_jy.py", line 147, in test_env_bytes
        newenv["TEST_HOME"] = u"首页".decode("utf-8")
      File "/Users/jbaker/jythondev/jython27/dist/Lib/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

But what to choose from?

* ascii - a good default choice that prevents silent error propagation
* sys.getdefaultencoding() appears to always return "ascii"; it's not really changeable, https://docs.python.org/2/library/sys.html#sys.setdefaultencoding Let's have someone running a non-US system double-check this.
* sys.getfilesystemencoding() always returns None, which is legal ("or None if the system default encoding is used", https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding). Let's not change this.
* python.console.encoding really should be about the console
* python.io.encoding (= env variable PYTHONIOENCODING) is new in 2.7 and corresponds to what is available since 2.6 in CPython
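To illustrate why a strict ascii default "prevents silent error propagation", here is a small sketch in plain Python. The helper name decode_env_bytes is hypothetical; the point is only that ASCII decoding fails loudly on non-ASCII bytes, while a permissive single-byte codec such as Latin-1 accepts anything and can quietly produce the wrong text.

```python
def decode_env_bytes(value, encoding="ascii"):
    """Decode a byte-string value; strict error handling is the default."""
    return value.decode(encoding)


utf8_bytes = u"\xe4".encode("utf-8")  # the two bytes 0xC3 0xA4

# Strict ASCII refuses the input instead of guessing.
try:
    decode_env_bytes(utf8_bytes)
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Latin-1 never fails, so a wrong guess goes unnoticed.
silently_wrong = decode_env_bytes(utf8_bytes, "latin-1")

# Only the encoding that was really used gives back the original text.
correct = decode_env_bytes(utf8_bytes, "utf-8")
```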
msg9431 (view)	Author: Jim Baker (zyasoft)	Date: 2015-01-20.23:40:57
Writing the new proposed test properly would have been better - don't mix up encode/decode ;)

    def test_env_bytes(self):
        with test_support.temp_cwd(name=u"tempcwd-中文"):
            newenv = os.environ.copy()
            newenv["TEST_HOME"] = u"首页".encode("utf-8")
            p = subprocess.Popen([sys.executable, "-c",
                                  'import sys,os;' \
                                  'sys.stdout.write(os.getenv("TEST_HOME"))'],
                                 stdout=subprocess.PIPE,
                                 env=newenv)
            self.assertEqual(p.stdout.read().decode("utf-8"), u"首页")

    ======================================================================
    FAIL: test_env_bytes (__main__.OSUnicodeTestCase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "dist/Lib/test/test_os_jy.py", line 153, in test_env_bytes
        self.assertEqual(p.stdout.read().decode("utf-8"), u"首页")
    AssertionError: u'\xe9\xa6\x96\xe9\xa1\xb5' != u'\u9996\u9875'
    - \xe9\xa6\x96\xe9\xa1\xb5
    + \u9996\u9875

So one can pass through bytes in this fashion, but an additional level of UTF-8 encoding was added; to "fix" the assertion requires

    self.assertEqual(p.stdout.read().decode("utf-8").decode("utf-8"), u"首页")

which is obviously not desirable.
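The value in the assertion error is consistent with each UTF-8 byte having been treated as one character and then UTF-8 encoded a second time. The sketch below reproduces that effect in plain Python without a subprocess. The byte-to-character widening step is an assumption about what happens inside Jython, modeled here with the Latin-1 codec.

```python
text = u"\u9996\u9875"             # u"首页"
utf8_bytes = text.encode("utf-8")  # what the test stores in newenv

# Assumed step: every byte is widened to one character (Latin-1 style).
widened = utf8_bytes.decode("latin-1")

# The child's output is then UTF-8 encoded a second time.
child_output = widened.encode("utf-8")

# Decoding once gives the value seen in the AssertionError.
decoded_once = child_output.decode("utf-8")

# Undoing the widening and decoding again recovers the original text.
recovered = decoded_once.encode("latin-1").decode("utf-8")
```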
msg9432 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2015-01-21.01:23:10
1) I'm fine with getting and setting environment variables, cli arguments, etc. as Unicode on Jython. With Jython-only code it would avoid the need to actually know the system encoding, and compatibility with CPython only requires one 'if' to decide whether encoding/decoding is needed.

2) Getting the system encoding with Python is surprisingly complicated. With Robot Framework we use the first encoding found by trying these alternatives in this order:

- Any: sys.getfilesystemencoding()
- Jython: System.getProperty('file.encoding')
- POSIX: Environment variables LANG, LC_CTYPE, LANGUAGE, and LC_ALL
- Windows: ctypes.cdll.kernel32.GetACP()

Here's the code for anyone interested: https://github.com/robotframework/robotframework/blob/master/src/robot/utils/encodingsniffer.py

3) If there's a reliable way to get the correct encoding, I'd prefer sys.getfilesystemencoding() to return it.

4) The test that requires double decoding definitely looks wrong.
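The lookup order in point 2 could be sketched roughly as follows. This is not the Robot Framework implementation linked above. The function names, the "ascii" fallback, and the way the locale value is parsed are all assumptions, and the Jython and Windows steps are only marked in a comment so that the sketch runs anywhere.

```python
import codecs
import os
import sys

# Variable order taken from the list above.
LOCALE_VARIABLES = ("LANG", "LC_CTYPE", "LANGUAGE", "LC_ALL")


def encoding_from_locale(value):
    """Extract a known codec name from a value like 'fi_FI.UTF-8', else None."""
    if not value or "." not in value:
        return None
    # Drop the language part and any '@modifier' suffix.
    candidate = value.split(".", 1)[1].split("@", 1)[0]
    if not candidate:
        return None
    try:
        codecs.lookup(candidate)
    except LookupError:
        return None
    return candidate


def sniff_encoding(fs_encoding, environ, default="ascii"):
    """Return the first usable encoding; `default` is an assumed fallback."""
    if fs_encoding:
        return fs_encoding
    # A Jython build would consult System.getProperty('file.encoding') here,
    # and Windows would call ctypes.cdll.kernel32.GetACP(); both are omitted.
    for name in LOCALE_VARIABLES:
        encoding = encoding_from_locale(environ.get(name))
        if encoding:
            return encoding
    return default
```

A caller would typically invoke it as sniff_encoding(sys.getfilesystemencoding(), os.environ). Taking both inputs as parameters keeps the lookup order testable with plain dictionaries.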
msg9667 (view)	Author: Jim Baker (zyasoft)	Date: 2015-03-17.14:22:09
Closing out - it's an interesting question of supporting bytes instead of unicode (msg9430), but Java natively wants unicode. We might be able to support this with JNR in the future, but note that currently JNR returns String as well, not byte[] - https://github.com/jnr/jnr-posix/blob/master/src/main/java/jnr/posix/LazyPOSIX.java#L332

History
Date	User	Action	Args
2015-03-17 14:22:09	zyasoft	set	status: pending -> closed messages: + msg9667
2015-01-21 01:23:12	pekka.klarck	set	messages: + msg9432
2015-01-20 23:40:58	zyasoft	set	messages: + msg9431
2015-01-20 23:29:36	zyasoft	set	messages: + msg9430
2015-01-13 17:06:52	pekka.klarck	set	messages: + msg9379
2015-01-07 06:50:02	zyasoft	set	status: open -> pending resolution: fixed messages: + msg9331 nosy: + zyasoft
2013-02-26 22:00:11	fwierzbicki	set	status: closed -> open resolution: out of date -> (no value) messages: + msg7817
2013-02-26 21:32:24	pekka.klarck	set	messages: + msg7816
2013-02-26 21:29:35	pekka.klarck	set	messages: + msg7815
2013-02-26 18:27:45	fwierzbicki	set	status: open -> closed resolution: out of date messages: + msg7799 nosy: + fwierzbicki
2012-03-19 18:18:19	amak	set	messages: + msg6830
2012-03-19 17:50:35	amak	set	nosy: + amak
2012-02-17 22:45:43	pekka.klarck	create