Issue1865

classification

Title:	Jython does not support all encodings available on JVM
Type:		Severity:	normal
Components:		Versions:
		Milestone:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:		Nosy List:	amak, fwierzbicki, pekka.klarck, zyasoft
Priority:		Keywords:

Created on 2012-03-21.05:29:02 by pekka.klarck, last changed 2014-05-21.21:53:29 by zyasoft.

Messages
msg6947 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2012-03-21.05:29:01
I expected Jython to support at least the same encodings that the JVM supports. It turned out I was wrong:

$ jython
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) Server VM (Sun Microsystems Inc.)] on java1.6.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.nio.charset import Charset
>>> for c in sorted(Charset.availableCharsets()):
...     print c
...     'x'.encode(c)
...
Big5
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
LookupError: unknown encoding 'big5'
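The LookupError in the session above is raised by the Python codec registry. As a rough sketch of how the gap could be measured, the helpers below (hypothetical names `is_known_encoding` and `missing_encodings`, not part of Jython) report which names from a list the registry cannot resolve. On Jython the list could presumably be the keys of Charset.availableCharsets(); here it is just a plain list of strings.

```python
import codecs


def is_known_encoding(name):
    """Return True if the Python codec registry can resolve `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def missing_encodings(names):
    """Return the entries of `names` that the codec registry cannot resolve."""
    return [n for n in names if not is_known_encoding(n)]
```

Calling missing_encodings() with the JVM charset names would list every encoding affected by this issue, rather than stopping at the first failure as the loop above does.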
msg6949 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2012-03-21.05:43:41
I would expect this to be pretty easy to fix. I haven't looked at how Jython implements encoding look-ups, but it should at least be possible to use java.nio.charset.Charset directly if the look-up fails. If this has a chance to make it into 2.5.3 (or later 2.7), I can take a look at it myself. One problem with Jython not supporting all the encodings supported by the JVM is that it prevents using the file.encoding property to implement the missing sys.getfilesystemencoding() (issue #1839).
msg7862 (view)	Author: Alan Kennedy (amak)	Date: 2013-02-28.01:10:50
The problem with using java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder is that they don't have a customizable replacement mechanism, which Python codecs require in order to implement the 'xmlcharrefreplace' and 'backslashreplace' error handling methods (see http://docs.python.org/2/library/codecs.html). To support these error handling methods, the input has to be processed character by character, checking whether each character can be encoded. This approach can be seen in the "escape" method of this jsoup code: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Entities.java The problem with this is that the documentation for "canEncode" says "The default implementation of this method is not very efficient; it should generally be overridden to improve performance." Having looked at the implementation, it is indeed very inefficient: performance would be very poor.
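For reference, this is what the error handlers in question produce in CPython. The snippet is only an illustration with a made-up sample string. It shows why a single fixed replacement (as with 'replace') is not enough: 'xmlcharrefreplace' and 'backslashreplace' have to compute a different replacement for each unencodable character.

```python
# Sample text containing two characters that ASCII cannot represent:
# U+00E9 (e with acute accent) and U+20AC (euro sign).
text = u"caf\u00e9 \u20ac"

# 'replace' substitutes the same fixed character for every failure.
as_replace = text.encode("ascii", "replace")

# 'xmlcharrefreplace' emits a decimal character reference per character.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# 'backslashreplace' emits a Python-style escape per character.
as_backslash = text.encode("ascii", "backslashreplace")
```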
msg7863 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2013-02-28.07:53:17
The general case is that encoding succeeds. Would it be complicated to implement this so that this case is fast and slow error handlers are used only if there actually are errors?
msg7865 (view)	Author: Alan Kennedy (amak)	Date: 2013-02-28.10:13:54
Yes, it is the case that we could use the Java Charsets in the case where the errors flag is "strict", "replace" or "ignore", and fall back to character-by-character processing for the other error types. But this is not high-priority: there are many other issues to resolve. We might look into it for 2.7. In the meantime, patches are welcome.
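The strategy discussed in the last two messages could look roughly like the sketch below. It is not Jython's implementation: `encode_with_fallback` and `_backslash_escape` are hypothetical names, the code uses Python 3 syntax, and the built-in str.encode() merely stands in for a JVM CharsetEncoder. The slow per-character loop only runs when a whole-string strict encode has already failed. Encoding one character at a time is assumed to be safe, which does not hold for stateful encodings or for encodings that emit a byte order mark.

```python
# Error modes that a bulk encoder can handle without custom replacements.
FAST_ERRORS = ("strict", "replace", "ignore")


def _backslash_escape(ch):
    """Return the Python-style escape for one unencodable character."""
    code = ord(ch)
    if code < 0x100:
        return "\\x%02x" % code
    if code < 0x10000:
        return "\\u%04x" % code
    return "\\U%08x" % code


def encode_with_fallback(text, encoding, errors="strict"):
    """Encode `text`, using the slow path only when errors actually occur."""
    if errors in FAST_ERRORS:
        return text.encode(encoding, errors)
    if errors not in ("xmlcharrefreplace", "backslashreplace"):
        raise LookupError("unknown error handler name %r" % errors)
    try:
        # Fast path: the common case is that everything can be encoded.
        return text.encode(encoding, "strict")
    except UnicodeEncodeError:
        pass
    # Slow path: encode character by character, replacing the failures.
    pieces = []
    for ch in text:
        try:
            pieces.append(ch.encode(encoding, "strict"))
        except UnicodeEncodeError:
            if errors == "xmlcharrefreplace":
                replacement = "&#%d;" % ord(ch)
            else:
                replacement = _backslash_escape(ch)
            pieces.append(replacement.encode(encoding, "strict"))
    return b"".join(pieces)
```

With this shape, input that encodes cleanly pays only for one bulk encode, and the per-character cost is paid only by input that contains unencodable characters.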
msg8487 (view)	Author: Jim Baker (zyasoft)	Date: 2014-05-21.21:53:29
Duplicate of #1066, although we might quibble: that issue is about support for the available Python encodings (as seen in CPython), but I believe Python's set is a superset of what Java supports anyway.

History
Date	User	Action	Args
2014-05-21 21:53:29	zyasoft	set	status: open -> closed resolution: duplicate messages: + msg8487 nosy: + zyasoft
2013-02-28 10:13:54	amak	set	messages: + msg7865
2013-02-28 07:53:17	pekka.klarck	set	messages: + msg7863
2013-02-28 01:10:50	amak	set	messages: + msg7862
2013-02-26 23:45:33	amak	set	nosy: + amak
2013-02-26 18:30:15	fwierzbicki	set	nosy: + fwierzbicki
2012-03-21 05:43:41	pekka.klarck	set	messages: + msg6949
2012-03-21 05:29:02	pekka.klarck	create