Issue1865

classification
Title: Jython does not support all encodings available on JVM
Type:
Severity: normal
Components:
Versions:
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder:
Assigned To:
Nosy List: amak, fwierzbicki, pekka.klarck, zyasoft
Priority:
Keywords:

Created on 2012-03-21.05:29:02 by pekka.klarck, last changed 2014-05-21.21:53:29 by zyasoft.

Messages
msg6947 Author: Pekka Klärck (pekka.klarck) Date: 2012-03-21.05:29:01
I expected Jython to support at least the same encodings that the JVM supports. It turned out I was wrong:

$ jython
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06) 
[Java HotSpot(TM) Server VM (Sun Microsystems Inc.)] on java1.6.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.nio.charset import Charset     
>>> for c in sorted(Charset.availableCharsets()):
...   print c
...   'x'.encode(c)
... 
Big5
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
LookupError: unknown encoding 'big5'
msg6949 Author: Pekka Klärck (pekka.klarck) Date: 2012-03-21.05:43:41
I would expect this to be pretty easy to fix. I haven't looked at how Jython implements encoding look-ups, but it should at least be possible to use java.nio.charset.Charset directly if the look-up fails. If this has a chance to make it into 2.5.3 (or, later, 2.7), I can take a look at it myself.

One problem with Jython not supporting all the encodings supported by the JVM is that it prevents using the file.encoding property to implement the missing sys.getfilesystemencoding() (issue #1839).
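
The fall-back look-up suggested in this message could be sketched with a codec search function, which Python only consults after earlier look-ups have failed. This is a minimal illustration for CPython 3, not Jython's actual implementation: the FALLBACK_CHARSETS table and the "x-demo-latin1" name are hypothetical stand-ins for what would, under Jython, presumably be a query against java.nio.charset.Charset.

```python
import codecs

# Hypothetical table standing in for the JVM's list of available charsets.
# It maps a made-up charset name to a codec that Python already knows.
FALLBACK_CHARSETS = {"x_demo_latin1": "latin-1"}


def fallback_search(name):
    """Codec search function, reached only when earlier look-ups fail."""
    # Depending on the Python version, the name arrives with hyphens
    # either preserved or converted to underscores; normalise both forms.
    target = FALLBACK_CHARSETS.get(name.lower().replace("-", "_"))
    if target is None:
        return None  # not ours: the caller ends up with LookupError
    info = codecs.lookup(target)
    return codecs.CodecInfo(
        name=name,
        encode=info.encode,
        decode=info.decode,
        incrementalencoder=info.incrementalencoder,
        incrementaldecoder=info.incrementaldecoder,
        streamreader=info.streamreader,
        streamwriter=info.streamwriter,
    )


codecs.register(fallback_search)

# The previously unknown name now resolves through the fall-back.
encoded = u"caf\xe9".encode("x-demo-latin1")
```

Names that the fall-back does not recognise still raise LookupError, so the behaviour for genuinely unknown encodings is unchanged.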
msg7862 Author: Alan Kennedy (amak) Date: 2013-02-28.01:10:50
The problem with using java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder is that they don't have a customizable replacement mechanism, which is required for python codecs, to implement the 'xmlcharrefreplace' and 'backslashreplace' error handling methods.

http://docs.python.org/2/library/codecs.html

In order to support these error handlers, the input has to be processed character by character, checking whether each character can be encoded. This approach can be seen in this jsoup code:

https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Entities.java

See the "escape" method.

The problem with this is that the documentation for "canEncode" says "The default implementation of this method is not very efficient; it should generally be overridden to improve performance."

Having looked at the implementation, it is indeed very inefficient: performance would be very poor.
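
The character-by-character approach described in this message can be sketched in plain Python 3. This is only an illustration of the technique, not the jsoup code or anything from Jython: encode_char_by_char is a hypothetical helper, and it assumes a stateless target encoding (a codec that emits a byte-order mark or shift sequences would produce wrong output when fed one character at a time).

```python
def encode_char_by_char(text, encoding, errors="xmlcharrefreplace"):
    """Encode one character at a time, substituting an escape on failure."""
    out = []
    for ch in text:
        try:
            # Strict encoding of a single character: raises if unencodable.
            out.append(ch.encode(encoding))
        except UnicodeEncodeError:
            cp = ord(ch)
            if errors == "xmlcharrefreplace":
                escape = u"&#%d;" % cp
            elif errors == "backslashreplace":
                if cp <= 0xFF:
                    escape = u"\\x%02x" % cp
                elif cp <= 0xFFFF:
                    escape = u"\\u%04x" % cp
                else:
                    escape = u"\\U%08x" % cp
            else:
                raise  # any other handler name: propagate the error
            # The escape itself is plain ASCII, so it encodes in any
            # ASCII-compatible target encoding.
            out.append(escape.encode(encoding))
    return b"".join(out)


sample = u"caf\xe9 \u20ac"
as_xml = encode_char_by_char(sample, "ascii", "xmlcharrefreplace")
as_backslash = encode_char_by_char(sample, "ascii", "backslashreplace")
```

Every character costs one encode attempt here, which is the per-character overhead the message is concerned about.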
msg7863 Author: Pekka Klärck (pekka.klarck) Date: 2013-02-28.07:53:17
The general case is that encoding succeeds. Would it be complicated to implement this so that this case is fast and slow error handlers are used only if there actually are errors?
msg7865 Author: Alan Kennedy (amak) Date: 2013-02-28.10:13:54
Yes, we could use the Java Charsets when the errors flag is "strict", "replace" or "ignore", and fall back to character-by-character processing for the other error types.

But this is not high-priority: there are many other issues to resolve.

We might look into it for 2.7.

In the meantime, patches are welcome.
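
The dispatch discussed in the last two messages might look roughly like the following Python 3 sketch. It is an assumption-laden illustration rather than a proposed patch: encode_fast_then_slow, NATIVE_ERRORS and SLOW_HANDLERS are hypothetical names, and Python's own str.encode stands in for the bulk Java encoder that Jython would call.

```python
# Error modes the bulk encoder is assumed to handle by itself.
NATIVE_ERRORS = ("strict", "replace", "ignore")

# Handlers that need per-character processing (only one shown here).
SLOW_HANDLERS = {
    "xmlcharrefreplace": lambda ch: u"&#%d;" % ord(ch),
}


def encode_fast_then_slow(text, encoding, errors="strict"):
    """Bulk-encode when possible; go character by character only on failure."""
    if errors in NATIVE_ERRORS:
        return text.encode(encoding, errors)  # bulk path handles these modes

    handler = SLOW_HANDLERS.get(errors)
    if handler is None:
        raise LookupError("unknown error handler name %r" % errors)

    try:
        # Common case: everything is encodable, so no slow path is needed.
        return text.encode(encoding, "strict")
    except UnicodeEncodeError:
        pass

    # Slow path, reached only when the input really contains a character
    # that the target encoding cannot represent.
    pieces = []
    for ch in text:
        try:
            pieces.append(ch.encode(encoding, "strict"))
        except UnicodeEncodeError:
            pieces.append(handler(ch).encode(encoding, "strict"))
    return b"".join(pieces)


result = encode_fast_then_slow(u"price: 5\u20ac", "latin-1", "xmlcharrefreplace")
```

With this shape, the cost of the per-character loop is paid only by inputs that actually fail to encode.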
msg8487 Author: Jim Baker (zyasoft) Date: 2014-05-21.21:53:29
Duplicate of #1066, although we might quibble: that issue is about supporting the Python encodings available in CPython, but I believe Python's set is a superset of what Java supports anyway.
History
Date                 User          Action  Args
2014-05-21 21:53:29  zyasoft       set     status: open -> closed
                                           resolution: duplicate
                                           messages: + msg8487
                                           nosy: + zyasoft
2013-02-28 10:13:54  amak          set     messages: + msg7865
2013-02-28 07:53:17  pekka.klarck  set     messages: + msg7863
2013-02-28 01:10:50  amak          set     messages: + msg7862
2013-02-26 23:45:33  amak          set     nosy: + amak
2013-02-26 18:30:15  fwierzbicki   set     nosy: + fwierzbicki
2012-03-21 05:43:41  pekka.klarck  set     messages: + msg6949
2012-03-21 05:29:02  pekka.klarck  create