[Python-Dev] urlparse brokenness

Paul Jimenez pj at place.org
Wed Nov 23 06:04:55 CET 2005


It is my assertion that urlparse is currently broken. Specifically, I think that urlparse breaks an abstraction boundary with ill effect.

In writing a mail client, I wished to allow my users to specify their IMAP server as a URL, such as 'imap://user:password@host:port/'. That worked fine. I then thought that the natural extension to support configuration of IMAP over SSL would be 'imaps://user:password@host:port/', which failed: user:password@host:port got parsed as the path of the URL instead of the network location. It turns out that urlparse keeps a table of URL schemes that 'use netloc', that is to say, that have a 'user:password@host:port' part to their URL.

I think this 'special knowledge' about particular schemes 1) breaks an abstraction boundary, by having a function whose charter is to pull apart a particularly-formatted string behave differently based on the meaning of the string instead of its structure, and 2) fails to be extensible or forward compatible due to hardcoded 'magic' strings. If schemes were somehow 'registerable' as 'netloc using' or not, then the second objection might be nullified, but the first would still stand.
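To make the complaint concrete, here is a simplified model of a table-driven splitter. It is not the urlparse source: the name table_driven_split, the contents of USES_NETLOC, and the port numbers in the sample URLs are assumptions chosen for illustration only.

```python
# Illustrative subset only; NOT the real scheme table from urlparse.
USES_NETLOC = {"http", "https", "ftp", "imap"}


def table_driven_split(url):
    """Return (scheme, netloc, rest), consulting a hardcoded scheme table."""
    scheme, sep, rest = url.partition(":")
    if not sep:
        # No scheme delimiter at all: treat the whole string as the rest.
        return "", "", url
    netloc = ""
    # The network location is only recognised for schemes in the table.
    if scheme in USES_NETLOC and rest.startswith("//"):
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        netloc, rest = rest[2:end], rest[end:]
    return scheme, netloc, rest


print(table_driven_split("imap://user:password@host:143/"))
print(table_driven_split("imaps://user:password@host:993/"))
```

The two URLs have the same structure, yet only the first yields a network location; for the second, everything after the scheme is left in the remainder because 'imaps' is missing from the table.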

So I propose that urlsplit, the main offender, be replaced with something that looks like:

def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    # _parse_cache, MAX_CACHE_SIZE and clear_cache are the existing
    # module-level names in urlparse.
    key = url, scheme, allow_fragments, default
    cached = _parse_cache.get(key, None)
    if cached:
        return cached
    if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
        clear_cache()

    if "://" in url:
        uscheme, npqf = url.split("://", 1)
    else:
        uscheme = scheme
        if not uscheme:
            uscheme = default[0]
        npqf = url
    pathidx = npqf.find('/')
    if pathidx == -1:  # not found
        netloc = npqf
        # path, query and fragment are elements 2, 3 and 4 of default
        path, query, fragment = default[2:5]
    else:
        netloc = npqf[:pathidx]
        pqf = npqf[pathidx:]
        # Split off the fragment first, so it is found whether or not
        # the URL has a query.
        if allow_fragments and '#' in pqf:
            pq, fragment = pqf.split('#', 1)
        else:
            pq, fragment = pqf, default[4]
        if '?' in pq:
            path, query = pq.split('?', 1)
        else:
            path, query = pq, default[3]
    result = (uscheme, netloc, path, query, fragment)
    _parse_cache[key] = result
    return result

Note that I'm not sold on the _parse_cache, but I'm assuming it was there for a reason so I'm leaving that functionality as-is.
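As a point of comparison for the caching question, the following is a minimal sketch of a structure-only split memoized with functools.lru_cache from a modern Python standard library. The name structural_split, the cache bound of 20, and the sample URLs are assumptions for this sketch, and it splits the fragment and query before the network location, which differs slightly from the code above. Like that code, it treats everything before the first '/' as the network location, so it is only meaningful for URLs that carry one.

```python
from functools import lru_cache


@lru_cache(maxsize=20)  # the bound is an arbitrary choice for this sketch
def structural_split(url, allow_fragments=True):
    """Split purely on delimiters; no per-scheme table is consulted."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        scheme, rest = "", url
    fragment = query = ""
    # Peel components off from the right: fragment, then query.
    if allow_fragments and "#" in rest:
        rest, fragment = rest.split("#", 1)
    if "?" in rest:
        rest, query = rest.split("?", 1)
    # Whatever precedes the first '/' is taken as the network location.
    slash = rest.find("/")
    if slash == -1:
        netloc, path = rest, ""
    else:
        netloc, path = rest[:slash], rest[slash:]
    return scheme, netloc, path, query, fragment


print(structural_split("imaps://user:password@host:993/"))
print(structural_split("imap://host/INBOX?x=1#top"))
```

Here the decorator takes over the bookkeeping that the hand-written dictionary does, and both schemes are parsed identically because only the delimiters are inspected.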

If this isn't the right forum for this discussion, or the right place to submit code, please let me know. Also, please cc: me directly on responses as I'm not subscribed to the firehose that is python-dev.

--pj


