msg81847 |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2009-02-13 01:46 |
On Py3 strptime("２００９", "%Y") fails:

>>> strptime("２００９", "%Y")
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python3.0/_strptime.py", line 454, in _strptime_time
    return _strptime(data_string, format)[0]
  File "/usr/local/lib/python3.0/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data '２００９' does not match format '%Y'

but non-ASCII numbers are supported elsewhere:

>>> int("２００９")
2009
>>> re.match("^\d{4}$", "２００９").group()
'２００９'

The problem seems to be at line 265 of _strptime.py:

    return re_compile(self.pattern(format), IGNORECASE | ASCII)

The ASCII flag prevents the regex from working properly with '２００９':

>>> re.match("^\d{4}$", "２００９", re.ASCII)
>>>

I tried removing the ASCII flag and it worked fine.

On Py2.x the problem is the same:

>>> strptime(u"２００９", "%Y")
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.5/_strptime.py", line 330, in strptime
    (data_string, format))
ValueError>>>
>>> int(u"２００９")
2009
>>> re.match("^\d{4}$", u"２００９")

Here the re.UNICODE flag probably needs to be added at line 265 (untested):

    return re_compile(self.pattern(format), IGNORECASE | UNICODE)

in order to make it work:

>>> re.match("^\d{4}$", u"２００９", re.U).group()
u'\uff12\uff10\uff10\uff19' |
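The behaviour reported above can be reproduced on a current Python 3 with the sketch below. It uses only `re` and `int()`; the names `FULLWIDTH_YEAR` and `YEAR_PATTERN` are illustrative and do not appear in `_strptime.py`.

```python
import re

# Fullwidth digits U+FF12 U+FF10 U+FF10 U+FF19, i.e. "２００９".
FULLWIDTH_YEAR = "\uff12\uff10\uff10\uff19"

# Same pattern as in the report, written as a raw string.
YEAR_PATTERN = r"^\d{4}$"

# Default (Unicode) matching: \d accepts any Unicode decimal digit.
unicode_match = re.match(YEAR_PATTERN, FULLWIDTH_YEAR)

# With re.ASCII, \d is restricted to [0-9], so the match fails.
ascii_match = re.match(YEAR_PATTERN, FULLWIDTH_YEAR, re.ASCII)

# int() also accepts Unicode decimal digits.
parsed_year = int(FULLWIDTH_YEAR)

print(unicode_match is not None)
print(ascii_match is None)
print(parsed_year)
```

This only shows the regex-level difference that the report attributes to the `ASCII` flag; it does not exercise `strptime` itself.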
|
msg81928 |
Author: Hirokazu Yamamoto (ocean-city) *  |
Date: 2009-02-13 13:52 |
This patch comes from . I think a test case is needed. I'll try to write one if I can. |
|
|
msg81932 |
Author: Hirokazu Yamamoto (ocean-city) *  |
Date: 2009-02-13 14:13 |
Hmm, this fails on python2 too. Maybe re.ASCII was added for backward compatibility? Again, I'm not familiar with unicode, so I won't call remove_ascii_flag.patch a *fix*. |
|
|
msg81934 |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-13 14:24 |
> Hmm, this fails on python2 too. Maybe re.ASCII is added for backward
> compatibility? Again, I'm not familiar with unicode, so I won't call
> remove_ascii_flag.patch as *fix*.

re.ASCII was added to many stdlib modules because I wanted to minimize the potential for breakage when I converted the re library to use unicode matching by default.

If it is desirable for strptime() and friends to match unicode digits as well as pure-ASCII digits (which sounds like a reasonable request to me), then re.ASCII can probably be dropped without any regret. (py3k doesn't have to be 100% compatible with python2 :-)) |
|
|
msg81938 |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2009-02-13 14:44 |
I think Py3 with re.ASCII is the same as Py2 without re.UNICODE (and Py3 without re.ASCII is the same as Py2 with re.UNICODE). It's probably a good idea to have a coherent behavior between Py2 and Py3, so if we remove re.ASCII from Py3 we should add re.UNICODE to Py2. |
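The Python 3 half of this comparison can be checked by inspecting the flags on compiled patterns. This is a minimal sketch for a current Python 3 only; it says nothing about what a Python 2 interpreter would report, and the variable names are illustrative.

```python
import re

# A str pattern compiled without flags uses Unicode matching by default.
default_pattern = re.compile(r"\d")

# re.ASCII switches \d, \w, \s and friends to ASCII-only matching.
ascii_pattern = re.compile(r"\d", re.ASCII)

default_is_unicode = bool(default_pattern.flags & re.UNICODE)
ascii_is_unicode = bool(ascii_pattern.flags & re.UNICODE)

# U+FF12 is FULLWIDTH DIGIT TWO.
fullwidth_two = "\uff12"
default_matches = default_pattern.match(fullwidth_two) is not None
ascii_matches = ascii_pattern.match(fullwidth_two) is not None

print(default_is_unicode, ascii_is_unicode)
print(default_matches, ascii_matches)
```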
|
|
msg81939 |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-13 14:50 |
On Friday, 13 February 2009 at 14:44 +0000, Ezio Melotti wrote:
> It's probably a good idea to have a coherent behavior between Py2 and
> Py3, so if we remove re.ASCII from Py3 we should add re.UNICODE to Py2.

Removing re.ASCII in py3k is a no-brainer, because unicode is how strings work by default. On the other hand, strings in 2.x are 8-bit, so it would probably be better to keep strptime as is.

As I said, py3k doesn't have to be compatible with 2.x; that's even the whole point of it. |
|
|
msg81940 |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2009-02-13 15:27 |
> Removing re.ASCII in py3k is a no-brainer, because unicode is how
> strings work by default.

I meant from the line 265 of _strptime.py, not from Python :P |
|
|
msg81941 |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-13 15:30 |
> > Removing re.ASCII in py3k is a no-brainer, because unicode is how
> > strings work by default.
>
> I meant from the line 265 of _strptime.py, not from Python :P

That's what I understood. |
|
|
msg81948 |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2009-02-13 16:26 |
Sorry, I misunderstood the meaning of "no-brainer".

If we add re.UNICODE on Py2, strptime should work fine with unicode strings, but it could fail somehow with normal strings. Is it more important to provide a way to use Unicode chars that works only with unicode strings, or to have a coherent behavior between str and unicode?

I don't think that adding re.UNICODE will break any existing code, but it may cause problems if someone tries to use an encoded str instead of unicode (though that shouldn't work already). Also note that encoded strings should be a problem only if they have to match a strptime directive (e.g. %Y); the other chars should be compared as they are, so it should work with str and unicode as long as they are not mixed (I think that whitespace is treated differently, though).

I'll try to add re.UNICODE and see what happens. |
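The concern about encoded strings can be illustrated on Python 3, using `bytes` as a rough stand-in for a Python 2 encoded `str`. That analogy is an assumption made for illustration, not something the thread establishes, and the choice of UTF-8 is likewise arbitrary.

```python
import re

YEAR_PATTERN_TEXT = r"^\d{4}$"    # str pattern: Unicode-aware \d
YEAR_PATTERN_BYTES = rb"^\d{4}$"  # bytes pattern: \d is always [0-9]

fullwidth_year = "\uff12\uff10\uff10\uff19"  # "２００９"
encoded_year = fullwidth_year.encode("utf-8")  # 12 bytes, none an ASCII digit

# Matching the encoded form fails: the regex sees raw bytes, not digits.
encoded_match = re.match(YEAR_PATTERN_BYTES, encoded_year)

# Decoding first and matching as text succeeds.
decoded_match = re.match(YEAR_PATTERN_TEXT, encoded_year.decode("utf-8"))

# Plain ASCII digits match in the bytes case, as expected.
ascii_bytes_match = re.match(YEAR_PATTERN_BYTES, b"2009")

print(encoded_match is None)
print(decoded_match is not None)
print(ascii_bytes_match is not None)
```

Under that analogy, a directive such as %Y can only match Unicode digits once the input has been decoded to text.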
|
|
msg81949 |
Author: Hirokazu Yamamoto (ocean-city) *  |
Date: 2009-02-13 16:30 |
I added a test, but it requires the fix in order to pass on Windows. (I used "\u3000" instead of "\xa0" because "\xa0" cannot be decoded with mbcs on Windows.) |
|
|
msg81952 |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-13 16:57 |
> If we add re.UNICODE on Py2, strptime should work fine with unicode
> strings, but it could fail somehow with normal strings. Is it more
> important to provide a way to use Unicode chars that works only with
> unicode strings or to have a coherent behavior between str and unicode?

I'd say the latter, since str and unicode are often interchangeable in 2.x. |
|
|
msg84665 |
Author: Hirokazu Yamamoto (ocean-city) *  |
Date: 2009-03-30 21:52 |
This issue seems to be fixed on py3k by r70755. |
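A quick way to check the fixed behaviour is the sketch below. It assumes a Python 3 interpreter whose `_strptime` compiles its pattern without the `ASCII` flag; other versions may behave differently.

```python
import time

# Fullwidth digits U+FF12 U+FF10 U+FF10 U+FF19, i.e. "２００９".
FULLWIDTH_YEAR = "\uff12\uff10\uff10\uff19"

# With Unicode-aware matching, %Y accepts any four decimal digits,
# and the matched text is then converted with int().
parsed = time.strptime(FULLWIDTH_YEAR, "%Y")

print(parsed.tm_year)
```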
|
|
msg84669 |
Author: Brett Cannon (brett.cannon) *  |
Date: 2009-03-30 21:54 |
As Hirokazu pointed out, this was fixed. |
|
|