Issue1066
Created on 2008-06-26.00:04:03 by pjenvey, last changed 2013-02-20.00:28:23 by fwierzbicki.
| msg3306 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2008-06-26.00:04:02 |
|
CPython 2.4 included the CJKCodecs package: http://cjkpython.i18n.org/
which provides codecs for Chinese/Japanese/Korean etc. charsets,
implemented in C.
The lack of these codecs causes these tests to fail:
+ test_codecencodings_cn
+ test_codecencodings_hk
+ test_codecencodings_jp
+ test_codecencodings_kr
+ test_codecencodings_tw
+ test_codecmaps_cn
+ test_codecmaps_hk
+ test_codecmaps_jp
+ test_codecmaps_kr
+ test_codecmaps_tw
|
| msg3307 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2008-06-26.00:08:07 |
|
CJKCodecs also includes the _multibytecodec module, which affects these tests:
test_multibytecodec
test_multibytecodec_support
|
| msg3879 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2008-12-08.05:27:14 |
|
We should utilize the nio charsets for these. One gotcha is they encode
to/decode from actual bytes, not chars (as they should) -- and of course
our byte bucket (str) is based on chars.
In that case we could probably make the streaming from/to our 'byte
bucket' more efficient by faking a ByteBuffer that gave back bytes
from/put back bytes to an underlying char array. That'd avoid an extra
conversion pass.
The Encoder/Decoder implementations seem to go through the actual
ByteBuffer methods -- i.e. not through the underlying Buffer arrays
directly. That'd allow this hack.
A CharsetEncoder can take a ByteBuffer instance to fill into -- we'd
have to use that for this hack, since Charset.encode returns an entirely
new ByteBuffer.
This hack would be kind of lame, but would go away in Jython 3. Or we
could just do the extra pass.
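A minimal sketch of the "extra pass" option, written as a Python 3 model rather than Jython internals. It assumes a `str` whose characters all fall in the range 0-255 stands in for the Jython 2 byte bucket; the function names are hypothetical and not part of any patch on this issue.

```python
def bucket_to_bytes(bucket):
    """Extra pass: copy a char-based 'byte bucket' into real bytes.

    Assumes every character is in range 0-255, as in a Jython 2 str.
    """
    return bucket.encode("latin-1")


def bytes_to_bucket(data):
    """Reverse pass: widen each byte back into one character."""
    return data.decode("latin-1")


def decode_bucket(bucket, encoding):
    """Decode a byte bucket holding text in `encoding` to Unicode text."""
    return bucket_to_bytes(bucket).decode(encoding)


def encode_to_bucket(text, encoding):
    """Encode Unicode text and return the result as a byte bucket."""
    return bytes_to_bucket(text.encode(encoding))
```

The cost is one additional copy of the data in each direction, which is the overhead the fake-ByteBuffer idea is meant to avoid.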
Another gotcha would be -- can we still retain our error handling
behavior with Java's Charsets? Briefly looking at them, they seem to
have fairly similar error handling facilities.
|
| msg3880 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2008-12-08.05:58:23 |
|
Java supports most of the cjkcodecs but not these:
cp932 (mskanji)
euc_jis_2004 (Japanese)
euc_jisx0213 (Japanese)
hz (Simplified Chinese)
iso2022_jp_1 (iso2022 variants)
iso2022_jp_2
iso2022_jp_2004
iso2022_jp_3
iso2022_jp_ext
shift_jis_2004 (Shiftjis variants)
shift_jisx0213
Determined from: http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html
and:
$ grep getcodec *
big5.py:codec = _codecs_tw.getcodec('big5')
big5hkscs.py:codec = _codecs_hk.getcodec('big5hkscs')
cp932.py:codec = _codecs_jp.getcodec('cp932')
cp949.py:codec = _codecs_kr.getcodec('cp949')
cp950.py:codec = _codecs_tw.getcodec('cp950')
euc_jis_2004.py:codec = _codecs_jp.getcodec('euc_jis_2004')
euc_jisx0213.py:codec = _codecs_jp.getcodec('euc_jisx0213')
euc_jp.py:codec = _codecs_jp.getcodec('euc_jp')
euc_kr.py:codec = _codecs_kr.getcodec('euc_kr')
gb18030.py:codec = _codecs_cn.getcodec('gb18030')
gb2312.py:codec = _codecs_cn.getcodec('gb2312')
gbk.py:codec = _codecs_cn.getcodec('gbk')
hz.py:codec = _codecs_cn.getcodec('hz')
iso2022_jp.py:codec = _codecs_iso2022.getcodec('iso2022_jp')
iso2022_jp_1.py:codec = _codecs_iso2022.getcodec('iso2022_jp_1')
iso2022_jp_2.py:codec = _codecs_iso2022.getcodec('iso2022_jp_2')
iso2022_jp_2004.py:codec = _codecs_iso2022.getcodec('iso2022_jp_2004')
iso2022_jp_3.py:codec = _codecs_iso2022.getcodec('iso2022_jp_3')
iso2022_jp_ext.py:codec = _codecs_iso2022.getcodec('iso2022_jp_ext')
iso2022_kr.py:codec = _codecs_iso2022.getcodec('iso2022_kr')
johab.py:codec = _codecs_kr.getcodec('johab')
shift_jis.py:codec = _codecs_jp.getcodec('shift_jis')
shift_jis_2004.py:codec = _codecs_jp.getcodec('shift_jis_2004')
shift_jisx0213.py:codec = _codecs_jp.getcodec('shift_jisx0213')
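As a complement to the list above, here is a small sketch that probes the Python-side codec registry for the same names. Note the assumption: it reports what `codecs.lookup` can resolve in the running interpreter, not what Java's `Charset` supports, so it only shows which of these encodings are still unavailable from Python code. The helper name `missing_codecs` is made up for this example.

```python
import codecs

# Python spellings of the codecs listed above as missing from Java.
CANDIDATES = [
    "cp932", "euc_jis_2004", "euc_jisx0213", "hz",
    "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004",
    "iso2022_jp_3", "iso2022_jp_ext",
    "shift_jis_2004", "shift_jisx0213",
]


def missing_codecs(names):
    """Return the names that codecs.lookup cannot resolve."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_codecs(CANDIDATES))
```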
|
| msg4243 (view) |
Author: Jim Baker (zyasoft) |
Date: 2009-03-12.08:21:29 |
|
Deferred to 2.5.1
|
| msg4992 (view) |
Author: Charlie Groves (cgroves) |
Date: 2009-08-05.16:50:20 |
|
When I looked at this, the nio charsets had similar default error
handlers, but there was no way to make custom ones. I think that rules
out using these charsets with Python, since codecs gained the ability
to use a user-defined error handling function in 2.3. It has been a
couple of years since I looked at this though, so I may be misremembering
things.
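For reference, this is roughly what a user-defined error handler looks like on the Python side, using `codecs.register_error`. The sketch uses Python 3 syntax for readability; the handler name `hex_escape` and its output format are invented for illustration.

```python
import codecs


def hex_escape(error):
    """Replace undecodable bytes with a visible <0xNN> marker.

    A handler receives the UnicodeError instance and returns a
    (replacement, resume_position) tuple.
    """
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join("<0x%02x>" % b for b in bad)
    return replacement, error.end


# "hex_escape" is a made-up handler name for this sketch.
codecs.register_error("hex_escape", hex_escape)

decoded = b"ab\xffcd".decode("euc_jp", errors="hex_escape")
```

Any codec implementation has to support this protocol: give the handler the failing range, then continue from the position it returns.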
|
| msg5020 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2009-08-12.07:07:21 |
|
Actually it seems like we could do callable error handlers via nio's
REPORT error action. That would make the encoder/decoder return a
CoderResult upon failure but without resetting its state.
So we should be able to create a UnicodeError with its start/end/reason
info from that CoderResult and the input Buffer (to pass to our error
handler). Then we act upon the handler's result, restarting the
encoder/decoder from where it left off if necessary.
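The control flow described here can be modelled in plain Python. This is only a sketch under stated assumptions: strict decoding stands in for nio's REPORT action, a caught `UnicodeDecodeError` stands in for the CoderResult, and `decode_with_handler` is a hypothetical name. A real implementation would drive a Java decoder instead, and this slice-and-retry version would not preserve state for stateful encodings such as the ISO-2022 family.

```python
import codecs


def decode_with_handler(data, encoding, handler):
    """Decode `data`, calling `handler` on each error and then resuming.

    Strict decoding plays the role of the REPORT action: it stops at the
    first bad sequence and says where it is; this loop decides what next.
    """
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode(encoding, "strict"))
            break
        except UnicodeDecodeError as exc:
            # Keep everything that decoded cleanly before the failure.
            pieces.append(data[pos:pos + exc.start].decode(encoding, "strict"))
            # Rebuild the error with offsets relative to the whole input.
            error = UnicodeDecodeError(
                encoding, data, pos + exc.start, pos + exc.end, exc.reason)
            replacement, resume = handler(error)
            if resume <= error.start:
                raise error  # refuse a handler that would not make progress
            pieces.append(replacement)
            pos = resume
    return "".join(pieces)


if __name__ == "__main__":
    replace = codecs.lookup_error("replace")
    print(repr(decode_with_handler(b"ab\xffcd", "euc_jp", replace)))
```

Re-decoding the clean prefix is wasteful; it is done here only to keep the sketch short.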
|
| msg5027 (view) |
Author: Charlie Groves (cgroves) |
Date: 2009-08-15.20:03:22 |
|
Ahh, that does sound workable. Nice!
|
| msg6055 (view) |
Author: Jim Baker (zyasoft) |
Date: 2010-09-09.05:48:16 |
|
Let's see if we can write wrappers of NIO in time for 2.5.2.
|
| msg6202 (view) |
Author: Jim Baker (zyasoft) |
Date: 2010-10-22.22:20:52 |
|
I'm going to try to get this into 2.5.2rc2, so marking high. I think I know the respective APIs well enough now to write a pure Jython version that leverages java.nio, following Phil's suggestion.
|
| msg6216 (view) |
Author: Jim Baker (zyasoft) |
Date: 2010-11-01.15:25:14 |
|
This will not make 2.5.2 unless there's an RC3. I recommend we release it as a separate package on PyPI.
Because of how one needs to do the buffering, it's necessary to use Java to manage the loop for reasonable performance.
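One buffering concern any multibyte codec has to handle is a chunk boundary that falls in the middle of a character. Whether that is the specific cost meant here is an assumption; the sketch below only shows the behaviour, using CPython's incremental decoder API. The function name `decode_in_chunks` is hypothetical.

```python
import codecs


def decode_in_chunks(data, encoding, chunk_size):
    """Decode `data` in fixed-size chunks with an incremental decoder.

    The decoder keeps any trailing partial multibyte sequence between
    calls, so a chunk boundary may fall in the middle of a character.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for offset in range(0, len(data), chunk_size):
        pieces.append(decoder.decode(data[offset:offset + chunk_size]))
    # final=True flushes the decoder and raises on a dangling partial sequence.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)
```

Calling this with `chunk_size=1` on EUC-JP input forces every two-byte character to be split, which is the worst case for a per-chunk loop.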
|
| msg6483 (view) |
Author: Philip Jenvey (pjenvey) |
Date: 2011-04-13.20:19:58 |
|
FYI Yuji Yamano made some good progress on this task during the PyCon '11 sprint. He actually got it to the point that you could begin encoding Asian characters via the codecs module.
I have a preliminary patch from him in a pastebin, but I'm sure he'll eventually send us a later version of this patch, and then maybe we can get this in for 2.6.
|
| msg7456 (view) |
Author: Yuji Yamano (yyamano) |
Date: 2012-09-10.04:23:37 |
|
Here is the work-in-progress patch for the svn trunk.
* Some tests don't pass yet.
* There are still some problems, but I don't remember exactly :-<
* Too much debug logging.
|
| msg7552 (view) |
Author: Jeff Allen (jeff.allen) |
Date: 2012-12-27.14:57:00 |
|
These codecs have become standard in Python 2.7, so the updated test_codecs regression test now fails (or acquires skips). Note related issue #2000. I observe that Python 2.7 has given us *codecs* for the missing Asian script encodings, but they depend on built-in modules that I assume Yuji's patch aims to provide.
Is anyone competent and willing to review the patch?
|
| msg7557 (view) |
Author: Yuji Yamano (yyamano) |
Date: 2012-12-28.01:38:12 |
|
I'm working on syncing the patch with the latest Jython.
See https://bitbucket.org/yyamano/jython/src/89bbdf124e6b/?at=issue1066
|
|
| Date |
User |
Action |
Args |
| 2013-02-20 00:28:23 | fwierzbicki | set | versions:
+ Jython 2.7, - 2.5.1, 2.7a1, 2.7a2 |
| 2012-12-28 01:38:12 | yyamano | set | messages:
+ msg7557 |
| 2012-12-27 14:57:00 | jeff.allen | set | nosy:
+ jeff.allen messages:
+ msg7552 components:
+ Library versions:
+ 2.7a1, 2.7a2 |
| 2012-09-10 04:23:38 | yyamano | set | files:
+ cjkcodecs-patch-20120907-1335 messages:
+ msg7456 |
| 2011-04-13 20:19:59 | pjenvey | set | nosy:
+ yyamano messages:
+ msg6483 |
| 2010-11-01 15:25:14 | zyasoft | set | messages:
+ msg6216 |
| 2010-10-22 22:20:52 | zyasoft | set | priority: normal -> high messages:
+ msg6202 |
| 2010-09-09 05:48:16 | zyasoft | set | priority: low -> normal messages:
+ msg6055 |
| 2009-08-15 20:03:22 | cgroves | set | messages:
+ msg5027 |
| 2009-08-12 07:07:21 | pjenvey | set | messages:
+ msg5020 |
| 2009-08-05 16:50:21 | cgroves | set | nosy:
+ cgroves messages:
+ msg4992 |
| 2009-08-05 14:35:19 | fwierzbicki | set | nosy:
+ fwierzbicki |
| 2009-03-21 13:04:14 | zyasoft | set | priority: low |
| 2009-03-12 08:21:29 | zyasoft | set | messages:
+ msg4243 versions:
+ 2.5.1, - 2.5alpha1 |
| 2008-12-08 05:58:25 | pjenvey | set | messages:
+ msg3880 |
| 2008-12-08 05:27:23 | pjenvey | set | messages:
+ msg3879 |
| 2008-10-26 18:55:12 | zyasoft | set | assignee: zyasoft |
| 2008-10-14 17:43:53 | zyasoft | set | title: Need CJKCodecs for CPython 2.4 -> Need CJKCodecs - multibytecodecs |
| 2008-10-14 17:43:13 | zyasoft | set | nosy:
+ zyasoft |
| 2008-06-26 00:08:07 | pjenvey | set | messages:
+ msg3307 |
| 2008-06-26 00:04:03 | pjenvey | create | |
|