Title: Non-ASCII environment variables are encoded incorrectly in os.environ
Type:
Severity: normal
Components:
Versions:
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: amak, fwierzbicki, pekka.klarck, zyasoft
Priority:
Keywords:

Created on 2012-02-17.22:45:43 by pekka.klarck, last changed 2015-03-17.14:22:09 by zyasoft.

msg6782 (view) Author: Pekka Klärck (pekka.klarck) Date: 2012-02-17.22:45:42
On my Linux machine with UTF-8 system encoding I got the following:

$ a=ä python
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['a']
>>> _.decode('UTF-8')

$ a=ä jython
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06) 
[Java HotSpot(TM) Server VM (Sun Microsystems Inc.)] on java1.6.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['a']

I have seen Jython return similarly wrong bytes before (e.g. #1592 and #1593), and I know that I can decode them using this hack:

>>> from java.lang import String
>>> String(os.environ['a']).toString()

The problem is that if I set environment variables myself and encode them correctly, using the hack doesn't work:

>>> os.environ['b'] = u'\xe4'.encode('UTF-8')
>>> String(os.environ['b']).toString()

In other words, I need to know whether the value was set before or during execution. It turns out that I actually can do that using java.lang.System.getenv, which only knows about the former:

>>> from java.lang.System import getenv
>>> getenv('a')
>>> getenv('b') is None

Notice also how getenv above returned the correct value as Unicode.
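
The following is only a rough model of why the hack helps for inherited values but not for values encoded by hand. It assumes (the REPL output is not shown in this report, so this is a guess) that Jython 2.5 returned one byte per code point and that the String hack simply widens each byte to the code point with the same number. The helper name `widen` is hypothetical.

```python
# -*- coding: utf-8 -*-

def widen(raw):
    """Model of the String hack (assumption): byte N becomes code point N."""
    return raw.decode("latin-1")

# Assumed shape of an inherited value on Jython 2.5: one byte per code point.
inherited = u"\xe4".encode("latin-1")
# Value set at runtime and encoded correctly as UTF-8, as in os.environ['b'].
set_at_runtime = u"\xe4".encode("utf-8")

# The hack recovers the character from the inherited value...
assert widen(inherited) == u"\xe4"
# ...but turns correctly encoded UTF-8 into two unrelated characters.
assert widen(set_at_runtime) == u"\xc3\xa4"
# Only a real UTF-8 decode is right for the second value.
assert set_at_runtime.decode("utf-8") == u"\xe4"
```

Under these assumptions no single conversion is correct for both kinds of value, which is why the origin of the value has to be known.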
msg6830 (view) Author: Alan Kennedy (amak) Date: 2012-03-19.18:18:19
What is the setting of "python.console.encoding" in your registry file?

Is it set to the actual encoding of your shell?

Note also that you should really be passing an encoding to the String constructor when decoding from bytes, i.e.

>>> os.environ['b'] = u'\xe4'.encode('UTF-8')
>>> String(os.environ['b'], "UTF-8").toString()

If you don't specify an encoding, the bytes are unlikely to be decoded properly.
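
As a plain-Python analogue of the point above (a sketch, not Jython's String constructor): decoding with the encoding that produced the bytes round-trips, while falling back to a default such as ASCII fails outright. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-

raw = u"\xe4".encode("UTF-8")  # the two bytes 0xC3 0xA4

# Naming the encoding that produced the bytes gives back the original text.
assert raw.decode("UTF-8") == u"\xe4"

# A default of ASCII cannot decode these bytes at all.
try:
    raw.decode("ascii")
    ascii_worked = True
except UnicodeDecodeError:
    ascii_worked = False

assert ascii_worked is False
```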
msg7799 (view) Author: Frank Wierzbicki (fwierzbicki) Date: 2013-02-26.18:27:45
No answer in a long time, closing as out of date.
msg7815 (view) Author: Pekka Klärck (pekka.klarck) Date: 2013-02-26.21:29:35
Sorry, hadn't noticed Alan's question. Where is the registry file stored? I certainly haven't touched it.
msg7816 (view) Author: Pekka Klärck (pekka.klarck) Date: 2013-02-26.21:32:24
Found the registry. "python.console.encoding" is commented out.
msg7817 (view) Author: Frank Wierzbicki (fwierzbicki) Date: 2013-02-26.22:00:11
Pekka: by default it is commented out; I think Alan is suggesting that you specify an encoding. Opening back up.
msg9331 (view) Author: Jim Baker (zyasoft) Date: 2015-01-07.06:50:02
Behavior as of - note that a unicode object is now returned, not a str

$ a=ä jython27 -c "import os; print repr(os.environ['a'])"

It's really the only choice we have since we are on Java. But at least it's not an incorrect bytestring.
msg9379 (view) Author: Pekka Klärck (pekka.klarck) Date: 2015-01-13.17:06:51
Inconsistency with CPython is not ideal, but returning correct Unicode is definitely better than returning incorrect bytes.

How does setting environment variables work now? Should it also be set as Unicode and not as bytes like with CPython? Do you get correct Unicode out if you later query the value?
msg9430 (view) Author: Jim Baker (zyasoft) Date: 2015-01-20.23:29:35
Setting os.environ should be done with Unicode, but we can support bytes as well if we have an encoding. Otherwise we will get an error like so:

ERROR: test_env_bytes (__main__.OSUnicodeTestCase)
Traceback (most recent call last):
  File "dist/Lib/test/", line 147, in test_env_bytes
    newenv["TEST_HOME"] = u"首页".decode("utf-8")
  File "/Users/jbaker/jythondev/jython27/dist/Lib/encodings/", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

But which encoding should we choose?

* ascii - a good default choice that prevents silent error propagation
* sys.getdefaultencoding() appears to always return "ascii"; it's not really changeable. Let's double check this by having someone run it on a non-US system.
* sys.getfilesystemencoding() always returns None, which is legal ("or None if the system default encoding is used"). Let's not change this.
* python.console.encoding really should be about the console
* (= env variable PYTHONIOENCODING) is new in 2.7 and corresponds to what is available since 2.6 in CPython
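
For anyone who wants to check what their own interpreter reports for these candidates, here is a small sketch. It only reads values that the standard library exposes; the Jython registry setting python.console.encoding is not covered, and the function name `encoding_candidates` is made up for this example.

```python
import os
import sys

def encoding_candidates():
    """Return (label, value) pairs for the candidates that stdlib exposes."""
    return [
        ("ascii (fixed default)", "ascii"),
        ("sys.getdefaultencoding()", sys.getdefaultencoding()),
        # May be None on some interpreters, as noted in the list above.
        ("sys.getfilesystemencoding()", sys.getfilesystemencoding()),
        # None when the environment variable is not set.
        ("PYTHONIOENCODING", os.environ.get("PYTHONIOENCODING")),
    ]

for label, value in encoding_candidates():
    print("%-30s %r" % (label, value))
```

Values differ between interpreters and platforms, so no expected output is given here.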
msg9431 (view) Author: Jim Baker (zyasoft) Date: 2015-01-20.23:40:57
Writing the new proposed test properly would have been better - don't mix up encode/decode ;)

    def test_env_bytes(self):
        with test_support.temp_cwd(name=u"tempcwd-中文"):
            newenv = os.environ.copy()
            newenv["TEST_HOME"] = u"首页".encode("utf-8")
            p = subprocess.Popen([sys.executable, "-c",
                                  'import sys,os;' \
            self.assertEqual("utf-8"), u"首页")

FAIL: test_env_bytes (__main__.OSUnicodeTestCase)
Traceback (most recent call last):
  File "dist/Lib/test/", line 153, in test_env_bytes
    self.assertEqual("utf-8"), u"首页")
AssertionError: u'\xe9\xa6\x96\xe9\xa1\xb5' != u'\u9996\u9875'
- \xe9\xa6\x96\xe9\xa1\xb5
+ \u9996\u9875

So one can pass through bytes in this fashion, but an additional level of UTF-8 encoding was added; to "fix" the assertion requires

self.assertEqual("utf-8").decode("utf-8"), u"首页")

which is obviously not desirable.
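
The extra level of encoding can be reproduced in plain Python. This sketch assumes the extra layer comes from the UTF-8 bytes being treated as individual characters and then encoded again; that is one way to arrive at the value in the AssertionError above, not necessarily what Jython does internally.

```python
# -*- coding: utf-8 -*-

text = u"\u9996\u9875"          # the two characters used in the test
once = text.encode("utf-8")     # six bytes: e9 a6 96 e9 a1 b5

# Assumption: each byte is widened to a character and encoded a second time.
twice = once.decode("latin-1").encode("utf-8")

# A single decode gives the left-hand side of the failing assertion.
first_decode = twice.decode("utf-8")
assert first_decode == u"\xe9\xa6\x96\xe9\xa1\xb5"
assert first_decode != text

# Undoing the extra layer recovers the original text.
recovered = first_decode.encode("latin-1").decode("utf-8")
assert recovered == text
```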
msg9432 (view) Author: Pekka Klärck (pekka.klarck) Date: 2015-01-21.01:23:10
1) I'm fine with getting and setting environment variables, CLI arguments, etc. as Unicode on Jython. Jython-only code would then not need to know the system encoding at all, and compatibility with CPython requires only one 'if' to decide whether encoding/decoding is needed.

2) Getting the system encoding in Python is surprisingly complicated. In Robot Framework we use the first encoding found by trying these alternatives in this order:

- Any: sys.getfilesystemencoding()
- Jython: System.getProperty('file.encoding')
- POSIX: Environment variables LANG, LC_CTYPE, LANGUAGE, and LC_ALL
- Windows: ctypes.cdll.kernel32.GetACP()

Here's the code for anyone interested:

3) If there's a reliable way to get the correct encoding, I'd prefer sys.getfilesystemencoding() to return it.

4) The test that requires double decoding definitely looks wrong.
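
Points 1) and 2) can be sketched roughly as follows. This is not the Robot Framework code; the function names `guess_system_encoding` and `to_text`, the "ascii" fallback, and the way the locale variables are parsed are all assumptions made for illustration.

```python
import os
import sys

def guess_system_encoding(default="ascii"):
    """Try the alternatives from point 2) in order; return the first hit."""
    # Any interpreter: may return None (see msg9430 for Jython).
    encoding = sys.getfilesystemencoding()
    if encoding:
        return encoding
    # Jython: ask the JVM. The import only works on Jython.
    if sys.platform.startswith("java"):
        from java.lang import System
        encoding = System.getProperty("file.encoding")
        if encoding:
            return encoding
    # POSIX: values look like "en_US.UTF-8" (parsing is an assumption).
    for name in ("LANG", "LC_CTYPE", "LANGUAGE", "LC_ALL"):
        value = os.environ.get(name, "")
        if "." in value:
            return value.split(".", 1)[1].split("@")[0]
    # Windows: ANSI code page, e.g. 1252 becomes "cp1252".
    if os.name == "nt":
        import ctypes
        return "cp%d" % ctypes.cdll.kernel32.GetACP()
    return default

def to_text(value, encoding=None):
    """The single 'if' from point 1): decode only when given bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding or guess_system_encoding())
    return value

home = to_text(os.environ.get("HOME"))
```

On an interpreter that already returns Unicode from os.environ, `to_text` passes the value through unchanged, so the same calling code works on both.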
msg9667 (view) Author: Jim Baker (zyasoft) Date: 2015-03-17.14:22:09
Closing out - it's an interesting question of supporting bytes instead of unicode (msg9430), but Java natively wants unicode.

We might be able to support this with JNR in the future, but note that currently JNR returns String as well, not byte[] -
Date                 User          Action  Args
2015-03-17 14:22:09  zyasoft       set     status: pending -> closed
                                           messages: + msg9667
2015-01-21 01:23:12  pekka.klarck  set     messages: + msg9432
2015-01-20 23:40:58  zyasoft       set     messages: + msg9431
2015-01-20 23:29:36  zyasoft       set     messages: + msg9430
2015-01-13 17:06:52  pekka.klarck  set     messages: + msg9379
2015-01-07 06:50:02  zyasoft       set     status: open -> pending
                                           resolution: fixed
                                           messages: + msg9331
                                           nosy: + zyasoft
2013-02-26 22:00:11  fwierzbicki   set     status: closed -> open
                                           resolution: out of date -> (no value)
                                           messages: + msg7817
2013-02-26 21:32:24  pekka.klarck  set     messages: + msg7816
2013-02-26 21:29:35  pekka.klarck  set     messages: + msg7815
2013-02-26 18:27:45  fwierzbicki   set     status: open -> closed
                                           resolution: out of date
                                           messages: + msg7799
                                           nosy: + fwierzbicki
2012-03-19 18:18:19  amak          set     messages: + msg6830
2012-03-19 17:50:35  amak          set     nosy: + amak
2012-02-17 22:45:43  pekka.klarck  create