[Python-Dev] Possibly inconsistent behavior in re groupdict

Gordon R. Burgess gordon at parasamgate.com
Sun Sep 25 20:25:31 EDT 2016


I've been lurking for a couple of months, working up the confidence to ask the list about this behavior - I've searched through the PEPs but couldn't find any specific reference to it.

In a nutshell, in the Python 3.5 re library the pattern and the search buffer must both be unicode strings or both be byte strings - but the keys in the groupdict are always returned as str in either case.
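The behavior can be reproduced in a few lines. What follows is a minimal sketch, assuming Python 3; the pattern, the group name `word`, and the sample data are illustrative and are not taken from the script further down.

```python
import re

# Illustrative bytes pattern with one named group, matched against bytes data.
bytes_pattern = re.compile(br"(?P<word>\w+)")
match = bytes_pattern.match(b"hello world")
groups = match.groupdict()

# The captured value is bytes, but the group name (the key) is str.
key_types = {type(k).__name__ for k in groups}
value_types = {type(v).__name__ for v in groups.values()}
print(key_types, value_types)

# Mixing a str pattern with bytes data raises TypeError on Python 3.
str_pattern = re.compile(r"(?P<word>\w+)")
try:
    str_pattern.match(b"hello world")
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)
print(mixed_error)
```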

I don't know whether or not this is by design, but it would make more sense to me if when searching a bytes object with a bytes pattern the keys returned in the groupdict were bytes as well.
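If bytes keys are wanted, they can be produced after the fact. The sketch below uses a hypothetical helper, `bytes_groupdict`, which is not part of `re`; it assumes Python 3 and that the group names are plain ASCII.

```python
import re

def bytes_groupdict(match):
    """Return match.groupdict() with its keys encoded to bytes.

    Hypothetical helper for illustration; not part of the re module.
    """
    # Assumption: group names are ASCII identifiers.
    return {name.encode("ascii"): value
            for name, value in match.groupdict().items()}

m = re.match(br"(?P<ordinal>\w+) .*\((?P<type>\w+)\)",
             b"second string (bytes)")
converted = bytes_groupdict(m)
print(converted)
```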

I reworked the example a little just now so it would run on 2.7 as well; on 2.7 the keys in the dictionary correspond to the mode of the pattern as expected (and bytes and unicode are interconverted silently).

Thanks for your time,

Gordon

[Code]

import sys
import re
from datetime import datetime

data = (u"first string (unicode)",
        b"second string (bytes)")

pattern = [re.compile(u"(?P<ordinal>\w+) .*\((?P<type>\w+)\)"),
           re.compile(b"(?P<ordinal>\w+) .*\((?P<type>\w+)\)")]

print("*** re consistency check ***\nRun: %s\nVersion: Python %s\n" %
      (datetime.now(), sys.version))
for p in pattern:
    for d in data:
        try:
            result = "groupdict: %s" % (p.match(d) and p.match(d).groupdict())
        except Exception as e:
            result = "error: %s" % e.args[0]
        print("mode: %s\npattern: %s\ndata: %s\n%s\n" %
              (type(p.pattern).__name__, p.pattern, d, result))

[Output]

gordon at w540:~/workspace/regex_demo$ python3 regex_demo.py
*** re consistency check ***
Run: 2016-09-25 20:06:29.472332
Version: Python 3.5.2+ (default, Sep 10 2016, 10:24:58)
[GCC 6.2.0 20160901]

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {'ordinal': 'first', 'type': 'unicode'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: b'second string (bytes)'
error: cannot use a string pattern on a bytes-like object

mode: bytes
pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
data: first string (unicode)
error: cannot use a bytes pattern on a string-like object

mode: bytes
pattern: b'(?P<ordinal>\\w+) .*\\((?P<type>\\w+)\\)'
data: b'second string (bytes)'
groupdict: {'ordinal': b'second', 'type': b'bytes'}

gordon at w540:~/workspace/regex_demo$ python regex_demo.py
*** re consistency check ***
Run: 2016-09-25 20:06:23.375322
Version: Python 2.7.12+ (default, Sep  1 2016, 20:27:38)
[GCC 6.2.0 20160822]

mode: unicode
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {u'ordinal': u'first', u'type': u'unicode'}

mode: unicode
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: second string (bytes)
groupdict: {u'ordinal': 'second', u'type': 'bytes'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: first string (unicode)
groupdict: {'ordinal': u'first', 'type': u'unicode'}

mode: str
pattern: (?P<ordinal>\w+) .*\((?P<type>\w+)\)
data: second string (bytes)
groupdict: {'ordinal': 'second', 'type': 'bytes'}
