Issue1066

classification
Title: Need CJKCodecs - multibytecodecs
Type: rfe Severity: normal
Components: Core, Library Versions: Jython 2.7
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: zyasoft Nosy List: cgroves, fwierzbicki, jeff.allen, pjenvey, yyamano, zyasoft
Priority: high Keywords: patch

Created on 2008-06-26.00:04:03 by pjenvey, last changed 2014-06-23.17:53:46 by zyasoft.

Files
File name Uploaded Description
cjkcodecs-patch-20120907-1335 yyamano, 2012-09-10.04:23:37
shift_jis.patch zyasoft, 2014-06-12.21:53:38
Messages
msg3306 (view) Author: Philip Jenvey (pjenvey) Date: 2008-06-26.00:04:02
CPython 2.4 included the CJKCodecs package: http://cjkpython.i18n.org/

which provides codecs for Chinese, Japanese, Korean, and related charsets, 
implemented in C.

The lack of these codecs causes these tests to fail:

+        test_codecencodings_cn
+        test_codecencodings_hk
+        test_codecencodings_jp
+        test_codecencodings_kr
+        test_codecencodings_tw
+        test_codecmaps_cn
+        test_codecmaps_hk
+        test_codecmaps_jp
+        test_codecmaps_kr
+        test_codecmaps_tw
msg3307 (view) Author: Philip Jenvey (pjenvey) Date: 2008-06-26.00:08:07
cjk also includes the _multibytecodec module, which affects these tests:

test_multibytecodec
test_multibytecodec_support
msg3879 (view) Author: Philip Jenvey (pjenvey) Date: 2008-12-08.05:27:14
We should utilize the nio charsets for these. One gotcha is they encode 
to/decode from actual bytes, not chars (as they should) -- and of course 
our byte bucket (str) is based on chars.

In that case we could probably make the streaming from/to our 'byte 
bucket' more efficient by faking a ByteBuffer that gave back bytes 
from/put back bytes to an underlying char array. That'd avoid an extra 
conversion pass.

The Encoder/Decoder implementations seem to go through the actual 
ByteBuffer methods -- i.e. not through the underlying Buffer arrays 
directly. That'd allow this hack

A CharsetEncoder can take a ByteBuffer instance to fill into -- we'd 
have to use that for this hack, since Charset.encode returns an entirely 
new ByteBuffer

This hack would be kind of lame, but would go away in Jython 3. Or we 
could just do the extra pass

Another gotcha would be -- can we still retain our error handling 
behavior with Java's Charsets? Briefly looking at them, they seem to 
have fairly similar error handling facilities
msg3880 (view) Author: Philip Jenvey (pjenvey) Date: 2008-12-08.05:58:23
Java supports most of the cjkcodecs but not these:

cp932 (mskanji)
euc_jis_2004 (Japanese)
euc_jisx0213 (Japanese)
hz (Simplified Chinese)
iso2022_jp_1 (iso2022 variants)
iso2022_jp_2
iso2022_jp_2004
iso2022_jp_3
iso2022_jp_ext
shift_jis_2004 (Shiftjis variants)
shift_jisx0213

Determined from: http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html

and:
$ grep getcodec *
big5.py:codec = _codecs_tw.getcodec('big5')
big5hkscs.py:codec = _codecs_hk.getcodec('big5hkscs')
cp932.py:codec = _codecs_jp.getcodec('cp932')
cp949.py:codec = _codecs_kr.getcodec('cp949')
cp950.py:codec = _codecs_tw.getcodec('cp950')
euc_jis_2004.py:codec = _codecs_jp.getcodec('euc_jis_2004')
euc_jisx0213.py:codec = _codecs_jp.getcodec('euc_jisx0213')
euc_jp.py:codec = _codecs_jp.getcodec('euc_jp')
euc_kr.py:codec = _codecs_kr.getcodec('euc_kr')
gb18030.py:codec = _codecs_cn.getcodec('gb18030')
gb2312.py:codec = _codecs_cn.getcodec('gb2312')
gbk.py:codec = _codecs_cn.getcodec('gbk')
hz.py:codec = _codecs_cn.getcodec('hz')
iso2022_jp.py:codec = _codecs_iso2022.getcodec('iso2022_jp')
iso2022_jp_1.py:codec = _codecs_iso2022.getcodec('iso2022_jp_1')
iso2022_jp_2.py:codec = _codecs_iso2022.getcodec('iso2022_jp_2')
iso2022_jp_2004.py:codec = _codecs_iso2022.getcodec('iso2022_jp_2004')
iso2022_jp_3.py:codec = _codecs_iso2022.getcodec('iso2022_jp_3')
iso2022_jp_ext.py:codec = _codecs_iso2022.getcodec('iso2022_jp_ext')
iso2022_kr.py:codec = _codecs_iso2022.getcodec('iso2022_kr')
johab.py:codec = _codecs_kr.getcodec('johab')
shift_jis.py:codec = _codecs_jp.getcodec('shift_jis')
shift_jis_2004.py:codec = _codecs_jp.getcodec('shift_jis_2004')
shift_jisx0213.py:codec = _codecs_jp.getcodec('shift_jisx0213')
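The same question can be asked of a running interpreter rather than of the docs. A minimal sketch, using only the codec names from the grep output above; `missing_codecs` is a hypothetical helper name, not something from the patch:

```python
import codecs

# Codec names taken from the grep output above.
CJK_CODECS = [
    "big5", "big5hkscs", "cp932", "cp949", "cp950",
    "euc_jis_2004", "euc_jisx0213", "euc_jp", "euc_kr",
    "gb18030", "gb2312", "gbk", "hz",
    "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004",
    "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr",
    "johab", "shift_jis", "shift_jis_2004", "shift_jisx0213",
]


def missing_codecs(names):
    """Return the names for which codecs.lookup() fails."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


# Prints the codecs this interpreter cannot find.
print(missing_codecs(CJK_CODECS))
```

Running this on the interpreter under test gives the list of codecs still to be provided, without depending on the JDK documentation being current.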
msg4243 (view) Author: Jim Baker (zyasoft) Date: 2009-03-12.08:21:29
Deferred to 2.5.1
msg4992 (view) Author: Charlie Groves (cgroves) Date: 2009-08-05.16:50:20
When I looked at this, the nio charsets have similar default error 
handlers, but there's no way to make custom ones.  I think that rules 
out using these charsets with Python, since codecs picked up the ability 
to use a user-defined error handling function in 2.3.  It has been a 
couple years since I looked at this though, so I may be misremembering 
things.
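For reference, this is the Python-side feature that any nio-based implementation would have to honor. A small sketch using `codecs.register_error`; the handler name `hypothetical_qmark` is made up for illustration:

```python
import codecs


def hypothetical_qmark(exc):
    """Replace each unencodable character with '?' and resume after it."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return (replacement text, position at which encoding resumes).
    return (u"?" * (exc.end - exc.start), exc.end)


codecs.register_error("hypothetical_qmark", hypothetical_qmark)

# U+20AC (euro sign) has no mapping in shift_jis, so the handler is called.
encoded = u"a\u20acb".encode("shift_jis", "hypothetical_qmark")
```

The handler receives the exception object (with its `start`, `end` and `reason`) and decides both what to substitute and where to resume, which is more than a fixed replace/ignore/report policy can express.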
msg5020 (view) Author: Philip Jenvey (pjenvey) Date: 2009-08-12.07:07:21
Actually it seems like we could do callable error handlers via nio's 
report error action. That would make the encoder/decoder return a 
CoderResult upon failure but without resetting its state

So we should be able to create a UnicodeError with its start/end/reason 
info from that CoderResult and the input Buffer (to pass to our error 
handler). Then we act upon the handler's result, restarting the 
encoder/decoder from where it left off if necessary
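The control flow described here can be sketched in plain Python, with the built-in strict decoder standing in for a nio decoder set to report errors. This is only an illustration of the loop, not the eventual Jython code: `decode_with_handler` is a hypothetical name, and the sketch assumes a stateless encoding so that decoding can restart from a slice.

```python
import codecs


def decode_with_handler(data, encoding, errors="strict"):
    """Decode bytes, routing each failure through a registered error handler.

    Assumes a stateless encoding; escape-based encodings such as iso2022_jp
    would need the real decoder state to be kept across restarts.
    """
    handler = codecs.lookup_error(errors)
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode(encoding, "strict"))
            break
        except UnicodeDecodeError as exc:
            # Translate slice-relative offsets back to the full input.
            start, end = pos + exc.start, pos + exc.end
            # Keep everything that decoded cleanly before the failure.
            pieces.append(data[pos:start].decode(encoding, "strict"))
            # Build the error the handler expects: start/end/reason.
            err = UnicodeDecodeError(encoding, data, start, end, exc.reason)
            replacement, resume = handler(err)
            if resume < 0:
                resume += len(data)
            if not start < resume <= len(data):
                # Simplification: insist on forward progress.
                raise IndexError("handler returned an unusable position")
            pieces.append(replacement)
            pos = resume
    return u"".join(pieces)
```

For example, `decode_with_handler(b"abc\xffdef", "shift_jis", "replace")` takes the failure at offset 3, asks the `replace` handler what to do, and restarts at the position the handler returns.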
msg5027 (view) Author: Charlie Groves (cgroves) Date: 2009-08-15.20:03:22
Ahh, that does sound workable.  Nice!
msg6055 (view) Author: Jim Baker (zyasoft) Date: 2010-09-09.05:48:16
Let's see if we can write wrappers of NIO in time for 2.5.2.
msg6202 (view) Author: Jim Baker (zyasoft) Date: 2010-10-22.22:20:52
I'm going to try to get this into 2.5.2rc2, so marking high. I think I know the respective APIs well enough now to write a pure Jython version that leverages java.nio, following Phil's suggestion.
msg6216 (view) Author: Jim Baker (zyasoft) Date: 2010-11-01.15:25:14
This will not make 2.5.2 unless there's an RC3. I recommend we release it as a separate package on PyPI.

Because of how one needs to do the buffering, it's necessary to use Java to manage the loop for reasonable performance.
msg6483 (view) Author: Philip Jenvey (pjenvey) Date: 2011-04-13.20:19:58
FYI Yuji Yamano made some good progress on this task during the PyCon '11 sprint. He actually got it to the point that you could begin encoding Asian characters via the codecs module.

I have a preliminary patch from him in a pastebin but I'm sure he'll eventually send us a later version of this patch, and then maybe we can get this in for 2.6
msg7456 (view) Author: Yuji Yamano (yyamano) Date: 2012-09-10.04:23:37
Here is the work-in-progress patch for the svn trunk.

* Some tests don't pass yet.
* There are still some problems, but I don't remember exactly :-<
* Too much debug logging.
msg7552 (view) Author: Jeff Allen (jeff.allen) Date: 2012-12-27.14:57:00
These codecs have become standard in Python 2.7 so the updated test_codecs regression test now fails (or acquires skips). Note related issue #2000. I observe that Python 2.7 has given us *codecs* for the missing Asian script encodings but they depend on built-in modules I assume Yuji's patch aims to provide.

Is anyone competent and willing to review the patch?
msg7557 (view) Author: Yuji Yamano (yyamano) Date: 2012-12-28.01:38:12
I'm working on syncing the patch with the latest jython. 
See https://bitbucket.org/yyamano/jython/src/89bbdf124e6b/?at=issue1066
msg8338 (view) Author: Jim Baker (zyasoft) Date: 2014-05-05.20:18:58
Yuji, what's the status of your branch to provide this functionality? Would it be possible to have this synced against Jython trunk?

For such syncing, please note the bitbucket mirror is currently down and has been in that state for the last couple of months; see https://bitbucket.org/site/master/issue/9315/https-bitbucketorg-jython-jython-no-longer, so you will need to sync with hg.python.org/jython
msg8339 (view) Author: Jim Baker (zyasoft) Date: 2014-05-06.23:55:27
Targeting beta 4 of 2.7; required for work on https://github.com/html5lib/html5lib-python/pull/150
msg8495 (view) Author: Jim Baker (zyasoft) Date: 2014-05-21.23:02:38
Currently working on this with the assumption we will use CoderResult for error management
msg8628 (view) Author: Jim Baker (zyasoft) Date: 2014-06-12.03:50:51
I've started to make good progress, using shift_jis as a representative encoding. About 1/4 of the shift_jis tests now pass in test_codecencodings_jp, which seems to be pretty good considering this is mostly covering various error cases.

Given chunking, I suspect we can keep this in Python for now, although we can revisit at a later time.
msg8635 (view) Author: Jim Baker (zyasoft) Date: 2014-06-12.21:52:42
Completed patch for shift_jis - all shift_jis tests pass in test_codecencodings_jp assuming that the following is changed from using a surrogate (not supportable in Jython unicode) in test_multibytecodec_support.py:

unmappedunicode = u'\ufffe'

The next step will be to register all encodings available in Java, ideally without a lot of boilerplate.
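One way to avoid per-encoding boilerplate is a single search function passed to `codecs.register`, which builds a `CodecInfo` on demand. A rough sketch of that mechanism follows. Everything named `demo_*` and the `BACKEND_CHARSETS` table are hypothetical stand-ins: existing Python codecs play the role that `java.nio.charset.Charset` would play in Jython, so the sketch runs anywhere.

```python
import codecs

# Hypothetical stand-in for "the charsets the backend knows about".
# In Jython this lookup would presumably consult java.nio instead.
BACKEND_CHARSETS = {"demo_sjis": "shift_jis", "demo_eucjp": "euc_jp"}


def make_codec_info(name, target):
    """Build a CodecInfo whose encode/decode delegate to the backend."""
    def encode(input, errors="strict"):
        return codecs.encode(input, target, errors), len(input)

    def decode(input, errors="strict"):
        return codecs.decode(bytes(input), target, errors), len(input)

    return codecs.CodecInfo(encode=encode, decode=decode, name=name)


def search(name):
    """Search function: return a CodecInfo, or None if the name is unknown."""
    target = BACKEND_CHARSETS.get(name)
    if target is None:
        return None
    return make_codec_info(name, target)


codecs.register(search)
```

After registration, `u"\u3042".encode("demo_sjis")` goes through the search function like any other codec lookup, so no module per encoding is needed.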
msg8637 (view) Author: Jeff Allen (jeff.allen) Date: 2014-06-13.07:07:19
Congratulations on the progress.

0xfffe is a codepoint that is not a character (but it's not technically a surrogate).
http://www.unicode.org/charts/PDF/UFFF0.pdf
Is a unicode object a sequence of code points? Controversial area.
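To make the distinction concrete, here is a small sketch of the two classifications as Unicode defines them. The function names are mine, not from any library:

```python
def is_surrogate(cp):
    """True for code points in the UTF-16 surrogate range."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the last two
    code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


checks = (is_surrogate(0xFFFE), is_noncharacter(0xFFFE))
```

U+FFFE falls in the second group only, so it can be stored in a Java String as an ordinary char, which is why it works as the unmapped test value where a lone surrogate does not.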
msg8640 (view) Author: Jim Baker (zyasoft) Date: 2014-06-14.00:46:28
Fixed in http://hg.python.org/jython/rev/6c718e5e9ae9 to the extent possible by using java.nio.charset.Charset. Here are the codecs not available, more or less what Philip identified in msg3880:

euc_jis_2004
euc_jisx0213  
hz
iso2022_jp_1
iso2022_jp_2004
iso2022_jp_3
iso2022_jp_ext 
shift_jis_2004

hz could potentially be supported by preprocessing: it encodes each GB2312 character as two 7-bit bytes, with escaping provided by ~{...~}. ICU4J could potentially help as well.
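A sketch of what the decoding half of such preprocessing might look like: rewrite HZ input into 8-bit GB2312 (EUC-CN) bytes, which an existing GB2312 decoder can then handle. `hz_to_euc_cn` is a hypothetical helper, and the escape rules follow my reading of RFC 1843 (`~{` and `~}` switch modes, `~~` is a literal tilde, `~` plus newline is a line continuation).

```python
def hz_to_euc_cn(data):
    """Rewrite HZ-encoded bytes as EUC-CN (8-bit GB2312) bytes."""
    data = bytearray(data)
    out = bytearray()
    in_gb = False
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if not in_gb:
            if b != 0x7E:                      # ordinary ASCII byte
                out.append(b)
                i += 1
                continue
            if i + 1 >= n:
                raise ValueError("truncated escape at offset %d" % i)
            nxt = data[i + 1]
            if nxt == 0x7E:                    # '~~' is a literal tilde
                out.append(0x7E)
            elif nxt == 0x7B:                  # '~{' enters GB mode
                in_gb = True
            elif nxt != 0x0A:                  # '~\n' is a line continuation
                raise ValueError("bad escape at offset %d" % i)
            i += 2
        else:
            if i + 1 >= n:
                raise ValueError("truncated pair at offset %d" % i)
            if b == 0x7E and data[i + 1] == 0x7D:   # '~}' leaves GB mode
                in_gb = False
            else:
                # Set the high bit on both bytes of the GB2312 pair.
                out.append(b | 0x80)
                out.append(data[i + 1] | 0x80)
            i += 2
    return bytes(out)
```

Usage would be along the lines of `hz_to_euc_cn(raw).decode("gb2312")`. Encoding would need the mirror image, and error offsets reported by the inner decoder would have to be mapped back to positions in the original HZ input, which this sketch does not attempt.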

We also potentially gain other encodings, such as cp1047, as needed by http://bugs.jython.org/issue550200, supporting EBCDIC.

The one remaining issue I see here is that there are a couple of minor corner cases around errors for trailing bytes when the input is not final. It's not clear to me what can really be done in this case, since it seems to be a property of the decoder; at the very least it's something that's picked up by our unit tests, so it's visible.
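For comparison, this is how CPython's incremental decoder treats a trailing lead byte, depending on the `final` flag. It is only a reference point for the behavior the tests expect, not a statement about what the java.nio decoders do; `fails_when_final` is a hypothetical helper:

```python
import codecs

decoder = codecs.getincrementaldecoder("shift_jis")()
# A lone lead byte with final=False is buffered rather than reported.
first = decoder.decode(b"\x82")
# The trail byte completes the pair; final=True flushes the decoder.
second = decoder.decode(b"\xa0", True)


def fails_when_final(data):
    """True if decoding `data` as the final chunk raises an error."""
    d = codecs.getincrementaldecoder("shift_jis")()
    try:
        d.decode(data, True)
    except UnicodeDecodeError:
        return True
    return False
```

Here `first` is empty and `second` carries the decoded character, while `fails_when_final(b"\x82")` shows the same lead byte becoming an error once the input is declared complete.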
msg8644 (view) Author: Jeff Allen (jeff.allen) Date: 2014-06-14.22:35:32
I get test failures from test_email and test_email_renamed about the decoding of euc-jp. In a sense this is an improvement, since that bit of the test is skipped if there is no such codec. But now there is ...

======================================================================
FAIL: test_body_encode (email.test.test_email.TestCharset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\hg\jython-int\dist\Lib\email\test\test_email.py", line 2981, in test_body_encode
    eq('\x1b$B5FCO;~IW\x1b(B',
  File "D:\hg\jython-int\dist\Lib\email\test\test_email.py", line 2981, in test_body_encode
    eq('\x1b$B5FCO;~IW\x1b(B',
AssertionError: '\x1b$B5FCO;~IW\x1b(B' != '\x1b$B5FCO;~IW'

Is this the same for you?
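The two strings in the assertion differ only by the trailing `\x1b(B`, which is the ISO-2022-JP escape sequence that switches back to ASCII. A quick check of whether an interpreter's encoder emits it; the input character is an arbitrary choice of mine, not the one from test_email:

```python
# Encode one hiragana character and inspect the escape sequences around it.
encoded = u"\u3042".encode("iso2022_jp")

starts_in_jisx0208 = encoded.startswith(b"\x1b$B")   # ESC $ B: enter JIS X 0208
resets_to_ascii = encoded.endswith(b"\x1b(B")        # ESC ( B: back to ASCII
```

On CPython both flags are true. If `resets_to_ascii` comes out false on Jython, that would match the failure above and point at the encoder not flushing its final state.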
msg8805 (view) Author: Jim Baker (zyasoft) Date: 2014-06-23.17:53:46
Jeff, I'm seeing the same issues in test_email, but we will fix separately. Since it's in the regrtest, we see it every time it fails.
History
Date User Action Args
2014-06-23 17:53:46 zyasoft set messages: + msg8805
2014-06-23 17:52:16 zyasoft set status: pending -> closed
2014-06-14 22:35:33 jeff.allen set messages: + msg8644
2014-06-14 00:46:29 zyasoft set status: open -> pending
resolution: accepted -> fixed
messages: + msg8640
2014-06-13 07:07:19 jeff.allen set messages: + msg8637
2014-06-12 21:53:39 zyasoft set files: + shift_jis.patch
2014-06-12 21:53:17 zyasoft set files: - shift_jis.patch
2014-06-12 21:52:42 zyasoft set messages: + msg8635
2014-06-12 03:50:52 zyasoft set files: + shift_jis.patch
keywords: + patch
messages: + msg8628
2014-05-21 23:02:38 zyasoft set messages: + msg8495
2014-05-07 22:45:14 jeff.allen link issue2123 dependencies
2014-05-06 23:55:28 zyasoft set assignee: zyasoft
resolution: accepted
messages: + msg8339
2014-05-05 20:18:58 zyasoft set assignee: zyasoft -> (no value)
messages: + msg8338
2013-07-03 04:10:12 pjenvey link issue2065 dependencies
2013-02-20 00:28:23 fwierzbicki set versions: + Jython 2.7, - 2.5.1, 2.7a1, 2.7a2
2012-12-28 01:38:12 yyamano set messages: + msg7557
2012-12-27 14:57:00 jeff.allen set nosy: + jeff.allen
messages: + msg7552
components: + Library
versions: + 2.7a1, 2.7a2
2012-09-10 04:23:38 yyamano set files: + cjkcodecs-patch-20120907-1335
messages: + msg7456
2011-04-13 20:19:59 pjenvey set nosy: + yyamano
messages: + msg6483
2010-11-01 15:25:14 zyasoft set messages: + msg6216
2010-10-22 22:20:52 zyasoft set priority: normal -> high
messages: + msg6202
2010-09-09 05:48:16 zyasoft set priority: low -> normal
messages: + msg6055
2009-08-15 20:03:22 cgroves set messages: + msg5027
2009-08-12 07:07:21 pjenvey set messages: + msg5020
2009-08-05 16:50:21 cgroves set nosy: + cgroves
messages: + msg4992
2009-08-05 14:35:19 fwierzbicki set nosy: + fwierzbicki
2009-03-21 13:04:14 zyasoft set priority: low
2009-03-12 08:21:29 zyasoft set messages: + msg4243
versions: + 2.5.1, - 2.5alpha1
2008-12-08 05:58:25 pjenvey set messages: + msg3880
2008-12-08 05:27:23 pjenvey set messages: + msg3879
2008-10-26 18:55:12 zyasoft set assignee: zyasoft
2008-10-14 17:43:53 zyasoft set title: Need CJKCodecs for CPython 2.4 -> Need CJKCodecs - multibytecodecs
2008-10-14 17:43:13 zyasoft set nosy: + zyasoft
2008-06-26 00:08:07 pjenvey set messages: + msg3307
2008-06-26 00:04:03 pjenvey create