Issue2548

classification
Title: Unicode u'\N{name}' frequently broken, because ucnhash.dat outdated
Type: behaviour Severity: normal
Components: Core Versions: Jython 2.7
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: stefan.richthofer Nosy List: asmeurer, stefan.richthofer
Priority: normal Keywords:

Created on 2017-02-02.16:17:21 by stefan.richthofer, last changed 2017-02-27.04:49:45 by zyasoft.

Messages
msg11064 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-02-02.16:17:21
CPython 2.7 allows writing something like
s = u'\N{DOUBLE-STRUCK ITALIC SMALL D}'
In principle, Jython supports that character via
s = u'\u2146'
but the \N{name} escape notation itself is not supported.
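
For reference, a minimal sketch of the expected behaviour, assuming an interpreter whose name table includes this character (the variable names are only illustrative). It checks that the named escape, the numeric escape, and a lookup by name all yield the same character:

```python
import unicodedata

# The same character written three ways.
by_name = u'\N{DOUBLE-STRUCK ITALIC SMALL D}'   # named escape (the form that fails here)
by_code = u'\u2146'                              # numeric escape (works in Jython)
looked_up = unicodedata.lookup('DOUBLE-STRUCK ITALIC SMALL D')

# All three should compare equal when the name table knows the character.
same = (by_name == by_code == looked_up)
print(same)
```

If the name table is too old to know the character, the first literal is rejected at compile time, so the numeric escape is the only workaround.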

This issue is a blocker for sympy-support, see #1777.
msg11067 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-02-02.17:38:11
Okay, it looks like ucnhash.dat hasn't been updated for quite some time (was it ever?). Misc/make_ucnhashdat.py still has "UnicodeData-3.0.0.txt" hard-coded. After some investigation I found that UnicodeData-3.0.0.txt was released in 2001. The newest release as of this writing is 9.0.

So this issue seems to be just a matter of updating ucnhash.dat. Changing title accordingly.
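
A quick way to see which Unicode data version a runtime was built against is sketched below. It relies on CPython's unicodedata.unidata_version attribute; whether a given Jython build exposes the same attribute is an assumption, and HARD_CODED_BASELINE and parse_version are hypothetical names used only for this illustration:

```python
import unicodedata

# Version of the data file that the generator script has hard-coded (from the message above).
HARD_CODED_BASELINE = (3, 0, 0)

def parse_version(text):
    """Turn a dotted version string such as '9.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Compare the runtime's Unicode data version against the old baseline.
runtime_version = parse_version(unicodedata.unidata_version)
is_newer_than_baseline = runtime_version > HARD_CODED_BASELINE
print(unicodedata.unidata_version, is_newer_than_baseline)
```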
msg11068 (view) Author: Aaron Meurer (asmeurer) Date: 2017-02-02.18:40:19
Feel free to nosy me on any SymPy blockers.
msg11070 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-02-02.20:07:55
Aaron: Alright, my pleasure.

Creating ucnhash.dat for the current Unicode 9.0 turns out to be more challenging than I expected. The script responsible for this, Misc/make_ucnhashdat.py, seems to be ancient. It writes several values as 16-bit numbers, but with the current UnicodeData.txt those values exceed the 16-bit maximum of 65535, e.g.:

Raw size = 184608
    writeUcnhashDat()
  File "/data/workspace/linux/Jython/stewori/jython/Misc/make_ucnhashdat.py", line 340, in writeUcnhashDat
    raw.writeto(outf)
  File "/data/workspace/linux/Jython/stewori/jython/Misc/make_ucnhashdat.py", line 188, in writeto
    file.write(struct.pack("!H", self.size()))
struct.error: 'H' format requires 0 <= number <= 65535

So I'll have to switch some of these fields to 32-bit numbers, which will also require changes in the parser ucnhash.java.
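
The overflow can be reproduced in isolation. This is only a sketch of the failure mode, not the actual change to make_ucnhashdat.py; the helper name pack_size is hypothetical, and the value is the "Raw size" printed above:

```python
import struct

RAW_SIZE = 184608  # the "Raw size" value reported before the traceback

def pack_size(value, fmt):
    """Pack value with the given struct format; return None if it does not fit."""
    try:
        return struct.pack(fmt, value)
    except struct.error:
        return None

# "!H" is a big-endian unsigned 16-bit field (max 65535): too small for RAW_SIZE.
packed16 = pack_size(RAW_SIZE, "!H")
# "!I" is a big-endian unsigned 32-bit field: RAW_SIZE fits and round-trips.
packed32 = pack_size(RAW_SIZE, "!I")
round_trip = struct.unpack("!I", packed32)[0]
print(packed16, len(packed32), round_trip)
```

Widening a field this way changes the file layout, so any reader of the file has to consume four bytes instead of two at the same position.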

Stay tuned...
msg11072 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-02-03.18:28:50
Fixed as of https://github.com/jythontools/jython/commit/ebb7f49b47290fe20b1d72991b7a2d37f256fd92.
History
Date User Action Args
2017-02-27 04:49:45 zyasoft set status: pending -> closed
2017-02-03 18:28:50 stefan.richthofer set status: open -> pending; resolution: fixed; messages: + msg11072
2017-02-02 20:07:55 stefan.richthofer set messages: + msg11070
2017-02-02 18:40:19 asmeurer set nosy: + asmeurer; messages: + msg11068
2017-02-02 17:38:11 stefan.richthofer set messages: + msg11067; title: Unicode notation u'\N{charachter name}' not supported. -> Unicode u'\N{name}' frequently broken, because ucnhash.dat outdated
2017-02-02 16:17:21 stefan.richthofer create