Issue 5240: time.strptime fails to match data and format with Unicode whitespaces (Py3)

On Python3, strptime raises a ValueError with some "Unicode whitespaces" even if they are present both in the 'string' and 'format' args in the same position:

>>> from time import strptime
>>> strptime("Thu\x20Feb", "%a\x20%b")  # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b")  # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'

I wrote a small script to find out other chars where it fails (it needs ~5 minutes to run):

>>> import unicodedata
>>> l = []
>>> for char in map(chr, range(0xFFFF)):
...     try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
...     except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029', '\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs', 'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE', 'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE', 'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE', 'FIGURE SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE', 'ZERO WIDTH SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR', 'NARROW NO-BREAK SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE']

All these chars (except % and some control chars) are whitespace and they are removed by the .strip() method, so I guess that something similar happens in strptime too.

The Unicode categories are:
"Cc" = "Other, Control"
"Zs" = "Separator, Space"
"Cf" = "Other, Format"
"Zl" = "Separator, Line"
"Zp" = "Separator, Paragraph"
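The whitespace observation can be spot-checked on a handful of characters without the full five-minute scan. This is only a minimal sketch: the sample list is picked by hand for illustration, and it says nothing about what strptime does internally.

```python
import unicodedata

# Hand-picked samples (an assumption for illustration): a plain space,
# three of the failing Unicode spaces, and the percent sign.
samples = ["\x20", "\xa0", "\u2003", "\u3000", "%"]

# Map each character to (Unicode category, str.isspace(), stripped repr).
report = {
    ch: (unicodedata.category(ch), ch.isspace(), repr(ch.strip()))
    for ch in samples
}

for ch, (category, is_space, stripped) in report.items():
    print("U+{0:04X} {1} isspace={2} strip={3}".format(ord(ch), category, is_space, stripped))
```

The plain space and the three Unicode spaces are all reported as category "Zs" and stripped to an empty string, while "%" is "Po" and survives .strip().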

Everything seems to work fine on Py2.x (tested on 2.4 and 2.6)

I think you have found the problem: strptime probably uses \s with the re.ASCII flag, which fails to match the Unicode whitespace characters:

>>> import re
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029', '\u202f', '\u205f', '\u3000']
>>> [bool(re.match(r'^\s$', char, re.ASCII)) for char in l]
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
>>> [bool(re.match(r'^\s$', char)) for char in l]
[True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]
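The suspected mechanism can be reproduced in isolation. The sketch below is not the strptime implementation; `whitespace_to_regex` and `matches` are hypothetical helpers that assume the format's whitespace is rewritten to \s+ and the result is compiled with or without re.ASCII.

```python
import re

def whitespace_to_regex(text):
    # Hypothetical helper (not the stdlib code): replace each run of
    # whitespace, as seen by Unicode-aware \s, with the regex token \s+.
    return re.sub(r"\s+", lambda m: r"\s+", text)

def matches(data, fmt, flags=0):
    # Build a pattern from fmt and test it against the whole of data.
    pattern = whitespace_to_regex(fmt)
    return re.fullmatch(pattern, data, flags) is not None

print(matches("Thu Feb", "Thu Feb", re.ASCII))        # True: plain space
print(matches("Thu\xa0Feb", "Thu\xa0Feb", re.ASCII))  # False: \s is ASCII-only
print(matches("Thu\xa0Feb", "Thu\xa0Feb"))            # True: Unicode-aware \s
```

Under these assumptions, the no-break space is turned into \s+ by the Unicode-aware substitution but can no longer be matched once the pattern is compiled with re.ASCII, which would explain why identical data and format strings fail to match.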

This bug is then related to #5239, and the proposed fix should work for both. We can close this as a duplicate and include this problem in #5239.

Good work!