Message8825

Author jeff.allen
Recipients jeff.allen, kasso, rpan, zyasoft
Date 2014-06-25.21:13:27
Message-id <1403730808.26.0.598447160794.issue2123@psf.upfronthosting.co.za>
Content
On Windows we do not have the UTF-8 option. This is what CPython does with code page 936:
>python
Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> s = "使用"
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
使用
>>>

Jython now behaves exactly the same way, except that Java likes to call the encoding ms936. (Java actually tests the range of the code page number to decide whether to call it cp* or ms*; I assume there's a good reason.)
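
As a rough illustration of the naming point only: CPython's own codec registry also treats these names as aliases of one codec. This is a sketch run on Python 3, not on Jython, and it says nothing about how Java picks its names; the variable names are just for illustration.

```python
import codecs

# Both spellings of code page 936 resolve to the same codec in CPython,
# which registers them as aliases of "gbk".
cp_name = codecs.lookup("cp936").name
ms_name = codecs.lookup("ms936").name

print(cp_name, ms_name)
```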

A str is a sequence of bytes, not characters. When you just type s at the prompt, Python prints repr(s), a "safe" representation of the kind you might have written in ASCII source code. When you execute print s, Python pushes the bytes out through sys.stdout, and what you see is the result of the (Windows) console interpreting those bytes, in this case as code page 936. On my console, which uses code page 1252, the same bytes come out like this:
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
Ê¹ÓÃ
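
The same effect can be reproduced without a Windows console. The following is a minimal sketch in Python 3, where a bytes object stands in for the Python 2 str used in the transcripts above; decoding with a named code page is only an approximation of what the console does when it renders the bytes. ascii() is used for the output so that the result does not depend on the encoding of whatever terminal runs it.

```python
# The four bytes held by s in the sessions above.
raw = b"\xca\xb9\xd3\xc3"

# repr() is the "safe" form shown when you type the name at the prompt;
# it does not depend on any code page.
safe = repr(raw)

# Interpreting the same bytes under two different code pages.
as_cp936 = raw.decode("cp936")    # two Chinese characters
as_cp1252 = raw.decode("cp1252")  # four Western European characters

print(safe)
print(len(raw), len(as_cp936), len(as_cp1252))
print(ascii(as_cp936))
print(ascii(as_cp1252))
```

The bytes never change; only the interpretation does. Under code page 936 they form two characters, under code page 1252 four.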

At a time when CPython only dealt with bytes, Jython chose to allow UTF-16 characters in strings, interchangeably with Java. Since then, Python has evolved to support unicode as a distinct type, and later Jython versions conform to that design.

Bottom line: this aspect of Jython is correct now (probably). Thanks for making us think about it.