Message8825

Author	jeff.allen
Recipients	jeff.allen, kasso, rpan, zyasoft
Date	2014-06-25.21:13:27
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1403730808.26.0.598447160794.issue2123@psf.upfronthosting.co.za>
In-reply-to

Content

On Windows we do not have the UTF-8 option. This is what CPython does with code page 936:
>python
Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> s = "使用"
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
使用
>>>

Jython is now exactly the same, except that Java likes to call the encoding ms936. (Java actually tests the range of the code page number so it can call some of them cp* and some ms*; I assume there's a good reason.)
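As a rough check on the naming point, here is a small sketch that asks CPython's codec registry what the two names resolve to. It assumes a Python 3 interpreter (the session above is Python 2.7) and only shows how CPython treats the names; it says nothing about how Java picks between cp* and ms*.

```python
import codecs

# The two spellings mentioned above: CPython's name and Java's name.
names = ["cp936", "ms936"]

# codecs.lookup() returns a CodecInfo whose .name is the canonical codec name.
resolved = {name: codecs.lookup(name).name for name in names}

for name in names:
    print(name, "->", resolved[name])

# If both are aliases of one codec, the canonical names match.
print(resolved["cp936"] == resolved["ms936"])
```

If the last line prints True, the difference between the two interpreters is only in the label reported by sys.stdin.encoding, not in the bytes.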

A str is a sequence of bytes, not characters. When you just type s at the prompt, Python actually prints repr(s), which gives you a "safe" representation, such as you might have written in ASCII source code. When you execute print s, it pushes the bytes out through sys.stdout unchanged, and what you see is the result of the (Windows) console interpreting those bytes, in this case as code page 936. The same bytes would normally come out on my console like this (code page 1252):
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
Ê¹ÓÃ
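
The two console results can be reproduced without changing the console code page by decoding the same four bytes explicitly. This is only a sketch, written for Python 3, where bytes and text are separate types; the variable names are mine, and the explicit decode stands in for what the console does when it displays the bytes.

```python
# The four bytes shown by repr(s) in the sessions above.
raw = b"\xca\xb9\xd3\xc3"

# Interpret the same bytes under each code page.
as_cp936 = raw.decode("cp936")    # what a code page 936 console displays
as_cp1252 = raw.decode("cp1252")  # what a code page 1252 console displays

# Compare against the text seen in the two sessions (printing booleans
# avoids depending on the encoding of the terminal running this sketch).
print(as_cp936 == "使用")
print(as_cp1252 == "Ê¹ÓÃ")

# Same bytes, different number of characters.
print(len(raw), len(as_cp936), len(as_cp1252))
```

The bytes never change; only the interpretation applied to them does, which is why print s looks right on one console and wrong on another.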

At a time when CPython strings only held bytes, Jython chose to allow UTF-16 characters in str objects, interchangeably with Java strings. Since then, Python has evolved to support unicode as a distinct type, and later Jython versions conform to that design.

Bottom line: this aspect of Jython is correct now (probably). Thanks for making us think about it.

History
Date	User	Action	Args
2014-06-25 21:13:28	jeff.allen	set	messageid: <1403730808.26.0.598447160794.issue2123@psf.upfronthosting.co.za>
2014-06-25 21:13:28	jeff.allen	set	recipients: + jeff.allen, zyasoft, rpan, kasso
2014-06-25 21:13:28	jeff.allen	link	issue2123 messages
2014-06-25 21:13:27	jeff.allen	create