Issue 10567: Unicode space character \u200b unrecognised a space

Created on 2010-11-28 17:51 by pbnan, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (14)
msg122690	Author: (pbnan)	Date: 2010-11-28 17:51
Python:
Python 2.7 (r27:82500, Oct 20 2010, 03:21:03)
[GCC 4.5.1] on linux2
Code:
>>> c = u'\u200b'
>>> c.isspace()
False
In both 2.6 and 3.1 it works.
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
msg122692	Author: SilentGhost (SilentGhost) *	Date: 2010-11-28 18:05
It returns False on the latest py3k checkout as well.
msg122694	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-28 18:20
The category of U+200B was changed in Unicode 4.0.1:
"""
The main new features in Unicode 4.0.1 are the following:
...
* Changed: general category of U+200B ZERO WIDTH SPACE
"""
http://unicode.org/versions/Unicode4.0.1/
msg122699	Author: Martin v. Löwis (loewis) *	Date: 2010-11-28 18:40
In 2.6, there was a manually maintained list, probably dating back to before Unicode 4.0. Python uses the following criterion for determining white space characters:
/* Returns 1 for Unicode characters having the bidirectional type
   'WS', 'B' or 'S' or the category 'Zs', 0 otherwise. */
Since r75272, this is generated from the current Unicode database, and should thus always be correct. Unless you can somehow prove that the criterion should be changed, or that Python computes it incorrectly, I'm closing this report as invalid.
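The criterion quoted in the comment above can be reproduced from Python with the unicodedata module. This is a minimal sketch assuming Python 3; the helper name is_unicode_space is made up for illustration and is not a CPython API, and the results depend on the Unicode database bundled with the interpreter that runs it.

```python
import unicodedata


def is_unicode_space(ch):
    """Apply the quoted criterion to one character (hypothetical helper)."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


# Print category, bidirectional type, the criterion result and str.isspace()
# side by side for a few sample characters.
for ch in (" ", "\t", "\n", "\u00a0", "\u2028", "\u200b"):
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          unicodedata.bidirectional(ch),
          is_unicode_space(ch),
          ch.isspace())
```

For U+200B the category is Cf and the bidirectional type is BN, so neither half of the criterion applies.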
msg122701	Author: SilentGhost (SilentGhost) *	Date: 2010-11-28 18:52
It's not just this character. isspace() is also False for \u200c and \u200d (from the same category), and for \u2060, \u2800 and \ufeff.
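A quick way to inspect the code points mentioned in this thread is to list their general category and bidirectional type next to the isspace() result. This is a sketch assuming Python 3; the describe helper and the CODE_POINTS list are illustrative names, and the values printed come from whatever Unicode database the running interpreter ships with.

```python
import unicodedata

# Code points mentioned in this issue.
CODE_POINTS = [0x200B, 0x200C, 0x200D, 0x2060, 0x2800, 0xFEFF]


def describe(cp):
    """Return (label, name, category, bidi type, isspace) for a code point."""
    ch = chr(cp)
    return (
        "U+%04X" % cp,
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        ch.isspace(),
    )


rows = [describe(cp) for cp in CODE_POINTS]
for row in rows:
    print(*row, sep="\t")
```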
msg122703	Author: Martin v. Löwis (loewis) *	Date: 2010-11-28 18:59
> It's not just this character. isspace() is also False for \u200c and \u200d (from the same category). and \u2060, \u2800 and \ufeff
What reason do you have to believe that they should be classified as whitespace, other than the web page you are quoting (which is apparently out of date) and the fact that Python 2.6 was classifying them this way (which is also out of date, for the very same reason)?
msg122704	Author: SilentGhost (SilentGhost) *	Date: 2010-11-28 19:00
I'm not quoting anything. Thank you very much.
msg122706	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-28 19:07
Martin v. Löwis wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> In 2.6, there was a manually maintained list, probably dating back to before Unicode 4.0.
That's not quite correct: Python 1.6.x - 2.5.x used tables for the PyUnicode_ISSPACE() function that were created from the Unicode database. Python 2.6.x introduced a short-cut table for ASCII whitespace, but still reverted back to the generated tables for non-ASCII code points.
The tables were never manually maintained, but we also did not update Python for each new Unicode version:
Python 1.6: Unicode 3.0
Python 2.0: Unicode 3.0
Python 2.1: Unicode 3.0
Python 2.2: Unicode 3.0
Python 2.3: Unicode 3.2
Python 2.4: Unicode 3.2
Python 2.5: Unicode 4.1
Python 2.6: Unicode 5.1
Python 2.7: Unicode 5.2
> Python uses the following criterion for determining white space characters:
>
> /* Returns 1 for Unicode characters having the bidirectional type
>    'WS', 'B' or 'S' or the category 'Zs', 0 otherwise. */
This definition has been used since Python 1.6.x.
msg122710	Author: Martin v. Löwis (loewis) *	Date: 2010-11-28 19:18
> I'm not quoting anything. Thank you very much.
Oops, sorry - I confused you with the OP.
msg122711	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-28 19:20
On Sun, Nov 28, 2010 at 2:07 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
> The tables were never manually maintained, but we also did not update
> Python for each new Unicode version:
>
> Python 1.6: Unicode 3.0
> Python 2.0: Unicode 3.0
> Python 2.1: Unicode 3.0
> Python 2.2: Unicode 3.0
> Python 2.3: Unicode 3.2
> Python 2.4: Unicode 3.2
> Python 2.5: Unicode 4.1
> Python 2.6: Unicode 5.1
> Python 2.7: Unicode 5.2
>
Thank you for the summary. Note that the Python reference pages have been updated even less frequently. [1]
Since the Python language and standard library definitions are now (in 3.x) closely tied to the Unicode definition, I wonder whether unicodedata.unidata_version should be featured more prominently in the docs. (Possibly even included in the Python CLI banner, but that is probably overkill.)
[1] http://mail.python.org/pipermail/docs/2010-November/002074.html
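The attribute mentioned above, unicodedata.unidata_version, can be read at runtime to see which Unicode database a given interpreter was built with. The sketch below assumes Python 3 and prints one line in the same shape as the version table quoted in this thread; the unicode_banner function is a hypothetical helper, not something the interpreter provides.

```python
import sys
import unicodedata


def unicode_banner():
    """Format 'Python X.Y: Unicode Z' for the running interpreter."""
    py = "%d.%d" % sys.version_info[:2]
    return "Python %s: Unicode %s" % (py, unicodedata.unidata_version)


print(unicode_banner())
```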
msg122712	Author: Martin v. Löwis (loewis) *	Date: 2010-11-28 19:31
>> In 2.6, there was a manually maintained list, probably dating back to before Unicode 4.0.
>
> That's not quite correct: Python 1.6.x - 2.5.x used tables for the
> PyUnicode_ISSPACE() function that were created from the Unicode database.
That used to be the case until r39757, when you made this change:
------------------------------------------------------------------------
r39757 | lemburg | 2005-10-20 21:06:35 +0200 (Thu, 20 Oct 2005) | 7 lines
Changed paths:
   M /python/trunk/Objects/unicodectype.c
Enhance the performance of two important Unicode character type lookups: whitespace and linebreak. These lookup tables are from the Python 1.6 version with the addition of the 205F code point which was added as whitespace code point to Unicode since then.
------------------------------------------------------------------------
In 2.5 and 2.6, there was no table lookup anymore, but a switch statement. Not sure how you arrived at the code; the commit message doesn't say (but the wording suggests it was manually computed). It was not updated in 2.6.
msg122713	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-28 19:35
It is still strange that the .isspace() property value changed, since the code point has not changed in the recent Unicode versions:
4.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
5.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
5.2.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
6.0.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
based on http://www.unicode.org/Public/<version>/ucd/UnicodeData.txt
True
> python2.5 -c 'print u"\u200b".isspace()'
True
> python2.6 -c 'print u"\u200b".isspace()'
True
> python2.7 -c 'print u"\u200b".isspace()'
False
Looking at the code again: Now I know why...
The tables in unicodectype.c were generated from the Unicode database, but not by the makeunicodedata.py script. I used a script to generate those tables for Python 1.6.0 and it seems that they were never updated since then. Python 2.7 then replaced them with the data from the makeunicodedata.py script.
That's probably why Martin thought they were manually maintained.
msg122714	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-28 19:39
Going back further shows the change:
3.0.1: 200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
3.2.0: 200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
4.0.1: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
4.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
5.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
5.2.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
6.0.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
Interesting that no one noticed in all these years.
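The UnicodeData.txt records quoted above can be run through the whitespace criterion from earlier in the thread to see where the result flips. This is a sketch assuming Python 3; the RECORDS mapping copies the lines from this thread, and parse_record and is_space_by_criterion are illustrative helpers rather than anything from makeunicodedata.py. It relies on the semicolon-separated layout in which field 0 is the code point, field 1 the name, field 2 the general category and field 4 the bidirectional class.

```python
# UnicodeData.txt lines for U+200B, keyed by Unicode version (from this thread).
RECORDS = {
    "3.0.1": "200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;",
    "3.2.0": "200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;",
    "4.0.1": "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
    "4.1.0": "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
    "5.1.0": "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
    "5.2.0": "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
    "6.0.0": "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
}


def parse_record(line):
    """Pick out the fields of a UnicodeData.txt line that matter here."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "bidirectional": fields[4],
    }


def is_space_by_criterion(record):
    """Bidirectional type WS, B or S, or general category Zs."""
    return (record["bidirectional"] in ("WS", "B", "S")
            or record["category"] == "Zs")


results = {
    version: is_space_by_criterion(parse_record(line))
    for version, line in RECORDS.items()
}
for version, flag in results.items():
    print(version, flag)
```

With these records only the 3.0.1 and 3.2.0 entries satisfy the criterion, through category Zs; from 4.0.1 on the category is Cf and the bidirectional type BN, so the criterion fails.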
msg122715	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-28 19:42
On Sun, Nov 28, 2010 at 2:40 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
> Going back further shows the change:
>
> 3.0.1: 200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
> 3.2.0: 200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
> 4.0.1: 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
Yes, see above.

History
Date	User	Action	Args
2022-04-11 14:57:09	admin	set	github: 54776
2010-11-28 19:42:19	belopolsky	set	messages: +
2010-11-28 19:39:57	lemburg	set	messages: +
2010-11-28 19:35:10	lemburg	set	messages: +
2010-11-28 19:31:54	loewis	set	messages: +
2010-11-28 19:20:24	belopolsky	set	messages: +
2010-11-28 19:18:31	loewis	set	messages: +
2010-11-28 19:10:08	SilentGhost	set	nosy: - SilentGhost
2010-11-28 19:07:39	lemburg	set	nosy: + lemburg; messages: +
2010-11-28 19:00:47	SilentGhost	set	nosy: loewis, belopolsky, SilentGhost, pbnan; messages: +
2010-11-28 18:59:19	loewis	set	messages: +; title: Some unicode space characters are not recognized as a space -> Unicode space character \u200b unrecognised a space
2010-11-28 18:54:40	SilentGhost	set	title: Unicode space character \u200b unrecognised a space -> Some unicode space characters are not recognized as a space
2010-11-28 18:52:15	SilentGhost	set	messages: +
2010-11-28 18:40:37	loewis	set	status: pending -> closed; nosy: + loewis; messages: +
2010-11-28 18:20:42	belopolsky	set	status: open -> pending; resolution: not a bug; messages: +
2010-11-28 18:11:35	belopolsky	set	nosy: + belopolsky
2010-11-28 18:06:01	SilentGhost	set	versions: + Python 3.2
2010-11-28 18:05:11	SilentGhost	set	nosy: + SilentGhost; messages: +
2010-11-28 17:51:43	pbnan	create