Message11337

Author	jeff.allen
Recipients	bstjean, jeff.allen, liuxy_hes86, zyasoft
Date	2017-05-01.14:28:41
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1493648922.14.0.386784708518.issue2356@psf.upfronthosting.co.za>
In-reply-to

Content

In a significant change of approach (see #1839) I have addressed this by making sys.getfilesystemencoding() == 'utf-8' and it works pretty well. I've tweaked a lot of existing code, some of it quite old. I have published it here:

https://bitbucket.org/tournesol/jython-utf8

in case anyone sees a massive flaw. If not, I'll push to the main repo.

The current regression test runs for my user name "Épreuve" and passes, but not yet for "用户名". I think we are still assuming bytes are unicode in some places. So I estimate that Benoît is now ok, but there's more to do for 雪彦.

Just to show off a bit what we can do:

> dist\bin\jython
Jython 2.7.1rc1 (default:060e4e4a06d8, Apr 30 2017, 23:08:20)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_60
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, os, os.path
>>> os.getcwd()
'C:\\Users\\\xe7\x94\xa8\xe6\x88\xb7\xe5\x90\x8d\\Documents\\Jython\\utf-8'
>>> print os.getcwdu()
C:\Users\用户名\Documents\Jython\utf-8
>>> f = open(os.path.join(u'c-\u5496\u5561', u'\u56f0\u96be.txt'), 'wb')
>>> print f.name
c-咖啡\困难.txt
>>> f.close()
>>> f = open(os.path.join(u's-\U0001f40d', u'pythón'), 'wb')
>>> f
<open file u's-\U0001f40d\\pyth\xf3n', mode 'wb' at 0x3>
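For anyone reading the session above, here is a minimal sketch of how the two working-directory results relate, assuming the file system encoding is 'utf-8' as described. The byte string is copied from the os.getcwd() output; the names raw, text, expected and mangled are only illustrative. The last step shows what "assuming bytes are unicode" amounts to: treating each byte as one character.

```python
# Byte string as returned by os.getcwd() in the session above.
raw = b'C:\\Users\\\xe7\x94\xa8\xe6\x88\xb7\xe5\x90\x8d\\Documents\\Jython\\utf-8'

# Decoding with the file system encoding (assumed to be 'utf-8') gives
# the text that os.getcwdu() prints.
text = raw.decode('utf-8')
expected = u'C:\\Users\\\u7528\u6237\u540d\\Documents\\Jython\\utf-8'
print(text == expected)

# The suspected mistake: one character per byte. Decoding as latin-1
# has exactly that effect, so each Chinese character turns into three
# unrelated characters and the path no longer names the real directory.
mangled = raw.decode('latin-1')
print(len(text), len(mangled))
```

The sketch only demonstrates the encoding arithmetic; it does not claim to show where in the code base the faulty assumption is made.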

I observe that it is mostly having a non-ASCII installation location, current directory or TMP/TEMP that causes the trouble. I can perhaps simulate those things without actually changing my user name (which tends to break the tools I need). It's also a clue to a work-around.
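A rough sketch of the kind of simulation I have in mind, not what the regression tests currently do. The helper make_non_ascii_env is hypothetical, and it assumes the platform's file system can represent the chosen name. It creates a directory with a non-ASCII name and builds an environment in which TMP, TEMP and TMPDIR point at it, for use with a child process.

```python
import os
import tempfile


def make_non_ascii_env(name=u'\u7528\u6237\u540d', base=None):
    """Hypothetical helper: return (path, env) for a non-ASCII temp location.

    path is a newly created directory whose last component is `name`;
    env is a copy of os.environ with the temp-directory variables
    pointing at that directory.
    """
    # Create a fresh parent so repeated runs do not collide.
    root = tempfile.mkdtemp(dir=base)
    path = os.path.join(root, name)
    os.mkdir(path)

    # Copy rather than modify os.environ, so only a child process
    # started with env=... sees the non-ASCII location.
    env = dict(os.environ)
    for var in ('TMP', 'TEMP', 'TMPDIR'):
        env[var] = path
    return path, env


path, env = make_non_ascii_env()
print(os.path.isdir(path))
```

The returned env could be passed as the env argument of subprocess.Popen, with cwd=path to make the current directory non-ASCII as well. The same idea suggests the work-around: point TMP/TEMP and the working directory at ASCII-only locations.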

History
Date	User	Action	Args
2017-05-01 14:28:42	jeff.allen	set	messageid: <1493648922.14.0.386784708518.issue2356@psf.upfronthosting.co.za>
2017-05-01 14:28:42	jeff.allen	set	recipients: + jeff.allen, zyasoft, liuxy_hes86, bstjean
2017-05-01 14:28:42	jeff.allen	link	issue2356 messages
2017-05-01 14:28:41	jeff.allen	create