Issue2772

classification
Title: No-break space and some other non-ASCII spaces are not considered space
Type: behaviour Severity: normal
Components: Versions:
Milestone: Jython 2.7.2
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: jeff.allen, pekka.klarck
Priority: Keywords:

Created on 2019-05-13.14:55:32 by pekka.klarck, last changed 2019-05-14.20:02:06 by pekka.klarck.

Messages
msg12516 Author: Pekka Klärck (pekka.klarck) Date: 2019-05-13.14:55:31
On Jython no-break space (u'\xa0'), figure space (u'\u2007') and narrow no-break space (u'\u202F') are not considered to be space characters. Other space characters listed at https://www.compart.com/en/unicode/category/Zs are.

This also affects string methods like `strip()` and `split()`, but the `re` module doesn't seem to be affected.

Jython 2.7.0 (default:9987c746f838, Apr 29 2015, 02:25:11) 
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_201
Type "help", "copyright", "credits" or "license" for more information.
>>> for ordinal in '0020 00A0 1680 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 200A 202F 205F 3000'.split():
...     char = unichr(int(ordinal, 16))
...     if not char.isspace():
...         print '%s is not space' % ordinal
... 
00A0 is not space
2007 is not space
202F is not space
>>> 
>>> u'\xa0...\u1680'.strip()
u'\xa0...'
>>> u'.\xa0.'.split()
[u'.\xa0.']
>>> import re
>>> re.split(r'\s+', u'.\xa0.', flags=re.UNICODE)
[u'.', u'.']
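
Since `re` with `re.UNICODE` splits on the no-break space in the session above, a workaround for affected versions could lean on it. The following is only a minimal sketch: `unicode_strip` and `unicode_split` are hypothetical helper names, not part of Jython or the standard library, and the report only demonstrates the `re` behaviour for u'\xa0', so coverage of u'\u2007' and u'\u202F' on Jython 2.7.0 is an assumption.

```python
import re

# Hypothetical helpers (not part of Jython or the standard library).
# Assumption: with re.UNICODE, \s matches the space characters that
# unicode.isspace() misses on affected Jython versions.
_WS_EDGES = re.compile(r'^\s+|\s+$', re.UNICODE)
_WS_RUN = re.compile(r'\s+', re.UNICODE)


def unicode_strip(text):
    """Remove leading and trailing whitespace as matched by the re module."""
    return _WS_EDGES.sub(u'', text)


def unicode_split(text):
    """Split on whitespace runs, dropping empty strings like split() with no arguments."""
    stripped = unicode_strip(text)
    return _WS_RUN.split(stripped) if stripped else []
```

Calling `unicode_strip(u'\xa0...\u1680')` and `unicode_split(u'.\xa0.')` would then stand in for the `strip()` and `split()` calls shown above.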
msg12518 Author: Jeff Allen (jeff.allen) Date: 2019-05-14.06:08:07
I'm happy to report that this works in the development tip, thanks to: https://hg.python.org/jython/rev/a1f68d091a1c .

Jython 2.7.2a1+ (default:a1ae652df5e3+, May 12 2019, 09:17:21)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_80
Type "help", "copyright", "credits" or "license" for more information.
>>> for ordinal in '0020 00A0 1680 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 200A 202F 205F 3000'.split():
...     char = unichr(int(ordinal, 16))
...     if not char.isspace():
...         print '%s is not space' % ordinal
...
>>> u'\xa0...\u1680'.strip()
u'...'
>>> u'.\xa0.'.split()
[u'.', u'.']
>>> import re
>>> re.split(r'\s+', u'.\xa0.', flags=re.UNICODE)
[u'.', u'.']
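
The same check can be written without a hard-coded list of ordinals by asking `unicodedata` for every character in category Zs. This is a sketch only: `zs_not_space` is a hypothetical name, and it assumes the interpreter's `unicodedata` tables are themselves up to date. The `unichr` fallback is there so the snippet also runs on Python 3.

```python
import unicodedata

try:
    unichr  # Python 2 / Jython 2.7
except NameError:
    unichr = chr  # Python 3


def zs_not_space():
    """Return hex ordinals of BMP characters in category Zs for which isspace() is False."""
    failures = []
    # All current Zs characters lie in the Basic Multilingual Plane,
    # so scanning up to U+FFFF also avoids narrow-build limits on unichr().
    for code in range(0x10000):
        char = unichr(code)
        if unicodedata.category(char) == 'Zs' and not char.isspace():
            failures.append('%04X' % code)
    return failures
```

An empty list from `zs_not_space()` would correspond to the loop above printing nothing.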
msg12520 Author: Pekka Klärck (pekka.klarck) Date: 2019-05-14.20:02:06
Awesome, thanks Jeff!
History
Date                 User          Action  Args
2019-05-14 20:02:06  pekka.klarck  set     messages: + msg12520
2019-05-14 06:08:07  jeff.allen    set     status: open -> closed
                                           nosy: + jeff.allen
                                           messages: + msg12518
                                           resolution: out of date
                                           milestone: Jython 2.7.2
                                           type: behaviour
2019-05-13 14:55:32  pekka.klarck  create