[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
Fredrik Lundh <effbot@telia.com>
Sat, 13 May 2000 14:56:41 +0200
in the current 're' engine, a newline is chr(10) and nothing else.
however, in the new unicode-aware engine, I used the new LINEBREAK predicate instead, and that turned out to break one of the tests in the current test suite:
sre.match('a.b', 'a\rb') => None
(unicode adds chr(13), chr(28), chr(29), chr(30), and also unichr(133), unichr(8232), and unichr(8233) to the list of line breaking codes)
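A quick way to see the two definitions side by side, as a sketch against present-day Python 3 rather than the 2000-era sre under discussion: str.splitlines() treats every code listed above as a line boundary, while the '.' of the current re module still excludes only chr(10). The helper names are made up for illustration, and chr() stands in for the unichr() of the message.

```python
import re

# chr(10) plus the codes the message lists as Unicode line breaks.
LINEBREAK_CODES = [10, 13, 28, 29, 30, 133, 8232, 8233]

def splits_line(code):
    # True if str.splitlines() treats chr(code) as a line boundary.
    return len(("a" + chr(code) + "b").splitlines()) == 2

def dot_matches(code):
    # True if '.' (without re.DOTALL) matches chr(code).
    return re.match("a.b", "a" + chr(code) + "b") is not None

for code in LINEBREAK_CODES:
    print(code, splits_line(code), dot_matches(code))
```

The second column is the "LINEBREAK" view of each character, the third is the "newline is chr(10)" view that the failing test relies on.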
what's the best way to deal with this? I see three alternatives:
a) stick to the old definition, and use chr(10) also for unicode strings
b) use different definitions for 8-bit strings and unicode strings; if given an 8-bit string, use chr(10); if given a 16-bit string, use the LINEBREAK predicate.
c) use LINEBREAK in either case.
I think (c) is the "right thing", but it's the only one that may break existing code...
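The three alternatives can be sketched as newline predicates. This is an illustration only, not the sre implementation: the function names and the is_unicode flag are hypothetical, and the LINEBREAK set is just the list of codes quoted above.

```python
# Codes quoted in the message as line breaks, plus chr(10) itself.
LINEBREAKS = frozenset([10, 13, 28, 29, 30, 133, 8232, 8233])

def is_newline_a(ch, is_unicode):
    # (a) chr(10) only, whatever the string type.
    return ord(ch) == 10

def is_newline_b(ch, is_unicode):
    # (b) chr(10) for 8-bit strings, LINEBREAK for unicode strings.
    if is_unicode:
        return ord(ch) in LINEBREAKS
    return ord(ch) == 10

def is_newline_c(ch, is_unicode):
    # (c) LINEBREAK in either case.
    return ord(ch) in LINEBREAKS

def dot_matches_under(is_newline, ch, is_unicode):
    # '.' matches any character the chosen predicate does not call a newline.
    return not is_newline(ch, is_unicode)

# The test from the message: does '.' match '\r' in an 8-bit string?
for name, pred in [("a", is_newline_a), ("b", is_newline_b), ("c", is_newline_c)]:
    print(name, dot_matches_under(pred, "\r", is_unicode=False))
```

Under this sketch only (c) changes the result of the existing 8-bit test; (b) changes behaviour only when the subject is a unicode string.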