Issue 8859: split() splits on non-whitespace char when there is no separator given.

Created on 2010-05-30 18:54 by PeterL, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (8)
msg106773 - (view)	Author: Peter Landgren (PeterL)	Date: 2010-05-30 18:54
When the variable label is equal to '\xc5\xa0 Z\nX W' this line sequence

    label = " ".join(label.split())
    label = unicode(label)

results in:

    7347: ERROR: gramps.py: line 138: Unhandled exception
    Traceback (most recent call last):
      File "C:\Program Files (x86)\gramps\gui\views\listview.py", line 660, in row_changed
        self.uistate.modify_statusbar(self.dbstate)
      File "C:\Program Files (x86)\gramps\DisplayState.py", line 521, in modify_statusbar
        name, obj = navigation_label(dbstate.db, nav_type, active_handle)
      File "C:\Program Files (x86)\gramps\Utils.py", line 1358, in navigation_label
        label = unicode(label)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

While this line sequence:

    label = unicode(label)
    label = " ".join(label.split())

gives correct result and no error.

With the error the variable label changes from '\xc5\xa0 Z\nX W' to '\xc5 Z X W' by the line:

    label = " ".join(label.split())

Note '\xa0' has been dropped, interpreted as "whitespace"? This happens on Windows. It works perfectly well on Linux.
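Why a dropped byte leads to the traceback can be reproduced in isolation. The sketch below uses Python 3 bytes objects in place of the Python 2 str from the report, and it simulates the reported Windows behaviour by explicitly replacing the 0xA0 byte with a space. That replacement is an assumption made for illustration, not a description of what str.split() does internally.

```python
# Python 3 sketch; bytes stands in for the Python 2 str in the report.
raw = b'\xc5\xa0 Z\nX W'          # UTF-8 bytes for "Š Z\nX W"

# Simulate 0xA0 being treated as whitespace (assumption for illustration).
collapsed = b" ".join(raw.replace(b'\xa0', b' ').split())
print(collapsed)                   # b'\xc5 Z X W'

# The lone lead byte 0xC5 is no longer valid UTF-8, so decoding fails.
try:
    collapsed.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)               # True
```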
msg106774 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-05-30 19:12
Both on Linux and Windows I get:

    >>> '\xa0'.isspace()
    False
    >>> u'\xa0'.isspace()
    True

The Unicode char u'\xa0' is U+00A0 NO-BREAK SPACE, so unicode.split correctly considers it whitespace. However, '\xa0' is not whitespace in a byte string, so str.split ignores it. The correct solution is to convert your string to Unicode and then split.

I'd close this as invalid, but first I'd like you to confirm that the example I posted and 'split' return the same result on both Linux and Windows (the fact that it works on Linux is probably caused by something else, e.g. the label is already Unicode).
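The distinction between byte strings and Unicode strings can be checked directly. The following is a sketch in Python 3 syntax, where bytes plays the role of the Python 2 str and str plays the role of unicode; the variable names are illustrative only.

```python
raw = b'\xc5\xa0 Z\nX W'     # UTF-8 bytes for "Š Z\nX W"

# In a bytes object only ASCII whitespace counts, so 0xA0 is kept.
print(b'\xa0'.isspace())     # False
print(raw.split())           # [b'\xc5\xa0', b'Z', b'X', b'W']

# In a text string U+00A0 NO-BREAK SPACE is whitespace.
print('\xa0'.isspace())      # True

# Decode first, then split: the character U+0160 survives intact.
text = raw.decode('utf-8')
label = " ".join(text.split())
print(ascii(label))          # '\u0160 Z X W'
```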
msg106775 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-30 19:13
What do you mean, "works perfectly well under Linux"? The error also happens under Linux here, and is expected: you can't call unicode() without an encoding and expect it to decode properly non-ASCII chars (and \xa0 is a non-ASCII char).
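In Python 2, unicode() called without an encoding argument falls back to the interpreter's default encoding, which is normally ASCII. A rough Python 3 equivalent is sketched here, with an explicit ASCII decode standing in for that default; this is an illustration under that assumption, not a reproduction of the Gramps environment.

```python
raw = b'\xc5\xa0'   # UTF-8 bytes for "Š"

# Stand-in for Python 2's unicode(raw) under an ASCII default encoding.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)              # False

# Naming the encoding removes the dependence on any default.
text = raw.decode('utf-8')
print(ascii(text))           # '\u0160'
```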
msg106776 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-30 19:16
Oh, and I agree with Ezio, this is most likely not a bug at all and should probably be closed.
msg106778 - (view)	Author: Peter Landgren (PeterL)	Date: 2010-05-30 20:03
I am not sure I can follow you. I will try to be more specific. The test string consists originally of one character; the Czech Š.

1. On Linux with Python 2.6.4

1.1 If I keep the original code line order:

    label = obj.get()
    print type(label), repr(label)
    label = " ".join(label.split())
    print type(label), repr(label)
    label = unicode(label)
    if len(label) > 40:
        label = label[:40] + "..."

Both lines print type(label), repr(label) gives:

    <type 'str'> '\xc5\xa0'

1.2 If I change order and take the unicode conversion first:

    label = obj.get()
    label = unicode(label)
    print type(label), repr(label)
    label = " ".join(label.split())
    print type(label), repr(label)
    if len(label) > 40:
        label = label[:40] + "..."

Both lines print type(label), repr(label) gives:

    <type 'unicode'> u'\u0160'

2. On Windows with Python 2.6.5

2.1 The original code line order: The lines print type(label), repr(label) gives

    <type 'str'> '\xc5\xa0'
    <type 'str'> '\xc5'
    8217: ERROR: gramps.py: line 138: Unhandled exception ....

2.2 If I change order and take the unicode conversion first: Both lines print type(label), repr(label) gives:

    <type 'unicode'> u'\u0160'

3. If I use this little code:

    # -- coding: utf-8 --
    label = 'Š'
    print type(label), repr(label)
    label = " ".join(label.split())
    print type(label), repr(label)

I get

    <type 'str'> '\xc5\xa0'
    <type 'str'> '\xc5\xa0'

on both Linux and Windows.

The examples above under 1. and 2. comes from an application, Gramps. There is still something I don't understand.
msg106779 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-05-30 20:20
I think the problem is in the default encoding used when you call unicode() without specifying any encoding.

    >>> '\xc5\xa0'.decode('iso-8859-1').split()
    [u'\xc5']
    >>> '\xc5\xa0'.decode('utf-8').split()
    [u'\u0160']
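The same comparison can be sketched in Python 3 syntax. Under ISO-8859-1 the byte 0xA0 decodes to U+00A0 NO-BREAK SPACE and is then stripped by split(), whereas under UTF-8 the two bytes together form the single character U+0160.

```python
raw = b'\xc5\xa0'

# Each byte becomes one character; the second one is whitespace.
latin1_parts = raw.decode('iso-8859-1').split()

# Both bytes form one character, so nothing is split off.
utf8_parts = raw.decode('utf-8').split()

print(ascii(latin1_parts))   # ['\xc5']
print(ascii(utf8_parts))     # ['\u0160']
```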
msg106785 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-05-31 04:13
I also agree this should be closed.
msg106787 - (view)	Author: Peter Landgren (PeterL)	Date: 2010-05-31 07:13
So, as a summary of what Ezio Melotti said: I should always specify the encoding when calling split() to be sure nothing nasty happens? (I believe Ezio Melotti meant "calling split()", not "calling unicode()", in his last answer?) Thanks for pointing this out.

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53105
2010-05-31 07:13:58	PeterL	set	messages: +
2010-05-31 04:13:06	rhettinger	set	status: open -> closed; nosy: + rhettinger; messages: +
2010-05-30 20:20:52	ezio.melotti	set	messages: +
2010-05-30 20:03:47	PeterL	set	messages: +
2010-05-30 19:16:22	pitrou	set	messages: +
2010-05-30 19:13:22	pitrou	set	status: pending -> open; nosy: + pitrou; messages: +
2010-05-30 19:12:34	ezio.melotti	set	status: open -> pending; nosy: + ezio.melotti; messages: +; resolution: not a bug
2010-05-30 18:54:12	PeterL	create