Issue2772

classification

Title:	No-break space and some other non-ASCII spaces are not considered space
Type:	behaviour	Severity:	normal
Components:		Versions:
		Milestone:	Jython 2.7.2

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	jeff.allen, pekka.klarck
Priority:		Keywords:

Created on 2019-05-13.14:55:32 by pekka.klarck, last changed 2019-05-14.20:02:06 by pekka.klarck.

Messages
msg12516 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2019-05-13.14:55:31
On Jython, no-break space (u'\xa0'), figure space (u'\u2007') and narrow no-break space (u'\u202F') are not considered to be space characters. Other space characters listed at https://www.compart.com/en/unicode/category/Zs are. This also affects string methods like `strip()` and `split()`, but the `re` module doesn't seem to be affected.

Jython 2.7.0 (default:9987c746f838, Apr 29 2015, 02:25:11)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_201
Type "help", "copyright", "credits" or "license" for more information.
>>> for ordinal in '0020 00A0 1680 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 200A 202F 205F 3000'.split():
...     char = unichr(int(ordinal, 16))
...     if not char.isspace():
...         print '%s is not space' % ordinal
...
00A0 is not space
2007 is not space
202F is not space
>>>
>>> u'\xa0...\u1680'.strip()
u'\xa0...'
>>> u'.\xa0.'.split()
[u'.\xa0.']
>>> import re
>>> re.split(r'\s+', u'.\xa0.', flags=re.UNICODE)
[u'.', u'.']
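Since the `re` module appears unaffected, a regex-based fallback may be usable on releases that show this behaviour. The following is only a sketch of that idea, not something taken from Jython itself; the helper names `unicode_split` and `unicode_strip` are hypothetical, and it assumes `\s` with `re.UNICODE` matches the affected characters, as the session above suggests.

```python
import re

# Assumption: with re.UNICODE, \s matches no-break space and friends,
# as observed for u'\xa0' in the report.
_WS_RUN = re.compile(r'\s+', re.UNICODE)
_WS_EDGES = re.compile(r'^\s+|\s+$', re.UNICODE)


def unicode_split(text):
    """Split on runs of Unicode whitespace, dropping empty parts
    (similar to split() called with no arguments)."""
    return [part for part in _WS_RUN.split(text) if part]


def unicode_strip(text):
    """Remove leading and trailing Unicode whitespace
    (similar to strip() called with no arguments)."""
    return _WS_EDGES.sub(u'', text)
```

With these helpers, `unicode_split(u'.\xa0.')` and `unicode_strip(u'\xa0...\u1680')` would stand in for the `split()` and `strip()` calls from the report.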
msg12518 (view)	Author: Jeff Allen (jeff.allen)	Date: 2019-05-14.06:08:07
I'm happy to report that this works in the development tip, thanks to: https://hg.python.org/jython/rev/a1f68d091a1c .

Jython 2.7.2a1+ (default:a1ae652df5e3+, May 12 2019, 09:17:21)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_80
Type "help", "copyright", "credits" or "license" for more information.
>>> for ordinal in '0020 00A0 1680 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 200A 202F 205F 3000'.split():
...     char = unichr(int(ordinal, 16))
...     if not char.isspace():
...         print '%s is not space' % ordinal
...
>>> u'\xa0...\u1680'.strip()
u'...'
>>> u'.\xa0.'.split()
[u'.', u'.']
>>> import re
>>> re.split(r'\s+', u'.\xa0.', flags=re.UNICODE)
[u'.', u'.']
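The interactive check used in this issue could also be written as a small reusable function that derives the space separators from `unicodedata` instead of hard-coding the list. This is a sketch under assumptions: the name `zs_not_space` is hypothetical, and it presumes `unicodedata.category` is available and classifies these characters as `Zs` on the interpreter under test.

```python
import unicodedata

try:
    unichr  # Python 2 / Jython 2.7
except NameError:
    unichr = chr  # Python 3 has no unichr; chr covers the same range


def zs_not_space():
    """Return the BMP code points (as 4-digit hex strings) that are in
    Unicode category Zs but for which isspace() is false."""
    failures = []
    for code_point in range(0x10000):
        char = unichr(code_point)
        if unicodedata.category(char) == 'Zs' and not char.isspace():
            failures.append('%04X' % code_point)
    return failures
```

An empty list from `zs_not_space()` would correspond to the clean run shown above; on a build with the reported behaviour the list should contain the code points from the original report, provided `unicodedata` classifies them as `Zs` there.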
msg12520 (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2019-05-14.20:02:06
Awesome, thanks Jeff!

History
Date	User	Action	Args
2019-05-14 20:02:06	pekka.klarck	set	messages: + msg12520
2019-05-14 06:08:07	jeff.allen	set	status: open -> closed nosy: + jeff.allen messages: + msg12518 resolution: out of date milestone: Jython 2.7.2 type: behaviour
2019-05-13 14:55:32	pekka.klarck	create