Issue1865
Created on 2012-03-21.05:29:02 by pekka.klarck, last changed 2014-05-21.21:53:29 by zyasoft.
msg6947 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2012-03-21.05:29:01 |
|
I expected Jython to support at least the same encodings that the JVM supports. It turned out I was wrong:
$ jython
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) Server VM (Sun Microsystems Inc.)] on java1.6.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.nio.charset import Charset
>>> for c in sorted(Charset.availableCharsets()):
...     print c
...     'x'.encode(c)
...
Big5
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
LookupError: unknown encoding 'big5'
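The same check can be done without stopping at the first failure. The following is a minimal sketch, not Jython code: the helper name `unknown_codecs` and the sample names are hypothetical, and it only relies on the standard `codecs.lookup` call. On Jython the list of names could come from `Charset.availableCharsets()` as in the session above.

```python
import codecs


def unknown_codecs(names):
    """Return the names that the Python codec registry cannot resolve."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            # Raised when no registered codec matches the given name.
            missing.append(name)
    return missing


# Illustrative names only; the last one is deliberately bogus.
sample = ["UTF-8", "ascii", "no-such-charset"]
missing = unknown_codecs(sample)
```

Collecting the misses in a list gives a complete picture of the gap between the two registries instead of a single traceback.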
|
msg6949 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2012-03-21.05:43:41 |
|
I would expect this to be pretty easy to fix. I haven't looked at how Jython implements encoding look-ups, but it should at least be possible to use java.nio.charset.Charset directly if the look-up fails. If this has a chance to make it into 2.5.3 (or later 2.7), I can take a look at it myself.
One problem with Jython not supporting all the encodings supported by the JVM is that it prevents using the file.encoding property to implement the missing sys.getfilesystemencoding() (issue #1839).
|
msg7862 (view) |
Author: Alan Kennedy (amak) |
Date: 2013-02-28.01:10:50 |
|
The problem with using java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder is that they don't have a customizable replacement mechanism, which Python codecs require in order to implement the 'xmlcharrefreplace' and 'backslashreplace' error handling methods.
http://docs.python.org/2/library/codecs.html
In order to support these error handling methods, the input has to be processed character by character, checking for each character whether it can be encoded. This approach can be seen in this jsoup code:
https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Entities.java
See the "escape" method.
The problem with this is that the documentation for "canEncode" says "The default implementation of this method is not very efficient; it should generally be overridden to improve performance."
Having looked at the implementation, it is indeed very inefficient: performance would be very poor.
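To make the character-by-character approach concrete, here is a minimal sketch in plain Python (written for CPython 3, not Jython's actual codec machinery). The function name `encode_per_char` is hypothetical, and the sketch assumes a stateless, ASCII-compatible target encoding, since encoding one character at a time would repeat the byte-order mark for something like UTF-16.

```python
def encode_per_char(text, encoding, errors="xmlcharrefreplace"):
    """Encode text one character at a time, substituting unencodable ones.

    Assumes a stateless target encoding in which the replacement text
    (ASCII digits and punctuation) is itself encodable.
    """
    out = []
    for ch in text:
        try:
            out.append(ch.encode(encoding))
        except UnicodeEncodeError:
            cp = ord(ch)
            if errors == "xmlcharrefreplace":
                rep = "&#%d;" % cp
            elif errors == "backslashreplace":
                if cp <= 0xFF:
                    rep = "\\x%02x" % cp
                elif cp <= 0xFFFF:
                    rep = "\\u%04x" % cp
                else:
                    rep = "\\U%08x" % cp
            else:
                # Any other handler name is treated as strict here.
                raise
            out.append(rep.encode(encoding))
    return b"".join(out)
```

Every character pays for its own encode attempt here, even when the whole input is encodable, which is the cost described above.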
|
msg7863 (view) |
Author: Pekka Klärck (pekka.klarck) |
Date: 2013-02-28.07:53:17 |
|
The general case is that encoding succeeds. Would it be complicated to implement this so that this case is fast and slow error handlers are used only if there actually are errors?
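One way this could look, sketched in plain Python (CPython 3) rather than against Jython's internals: encode in bulk, and only when the encoder reports an error use the position it carries to patch the offending run. The function name `encode_fast_path` is hypothetical, only 'xmlcharrefreplace' is shown, and a stateless target encoding is assumed because the input is encoded in separate chunks after a failure.

```python
def encode_fast_path(text, encoding):
    """Bulk-encode text; emit character references only where encoding fails."""
    out = []
    pos = 0
    while pos < len(text):
        try:
            # Common case: the rest of the input encodes in one call.
            out.append(text[pos:].encode(encoding))
            break
        except UnicodeEncodeError as exc:
            # exc.start and exc.end delimit the unencodable run,
            # relative to the slice that was being encoded.
            start, end = pos + exc.start, pos + exc.end
            out.append(text[pos:start].encode(encoding))
            refs = "".join("&#%d;" % ord(c) for c in text[start:end])
            out.append(refs.encode(encoding))
            pos = end
    return b"".join(out)
```

Input without errors costs a single bulk call. Each failure re-slices the remaining text, so input with many errors is slower, which matches the trade-off being asked about.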
|
msg7865 (view) |
Author: Alan Kennedy (amak) |
Date: 2013-02-28.10:13:54 |
|
Yes, we could use the Java Charsets when the errors flag is "strict", "replace" or "ignore", and fall back to character-by-character processing for the other error types.
But this is not high-priority: there are many other issues to resolve.
We might look into it for 2.7.
In the meantime, patches are welcome.
|
msg8487 (view) |
Author: Jim Baker (zyasoft) |
Date: 2014-05-21.21:53:29 |
|
Duplicate of #1066, although we might quibble: that issue is about support for the available Python encodings (as seen in CPython), but I believe the set of Python encodings is a superset of what Java supports anyway.
|
|
Date | User | Action | Args
2014-05-21 21:53:29 | zyasoft | set | status: open -> closed; resolution: duplicate; messages: + msg8487; nosy: + zyasoft
2013-02-28 10:13:54 | amak | set | messages: + msg7865
2013-02-28 07:53:17 | pekka.klarck | set | messages: + msg7863
2013-02-28 01:10:50 | amak | set | messages: + msg7862
2013-02-26 23:45:33 | amak | set | nosy: + amak
2013-02-26 18:30:15 | fwierzbicki | set | nosy: + fwierzbicki
2012-03-21 05:43:41 | pekka.klarck | set | messages: + msg6949
2012-03-21 05:29:02 | pekka.klarck | create |