Issue 32462: validate mime types loaded from system files. Document that system files take precedence.

On a Windows 7 system, entering the following:

>>> mime, encoding = mimetypes.guess_type('Untitled.sql')
>>> mime
'text\\plain'

That is, the return value is 'text\plain' instead of 'text/plain'. Tracking this down, the cause is that the .sql entry is loaded from the Windows registry, and the registry entry uses a backslash instead of a forward slash.

The mimetypes.guess_type() documentation states:

The return value is a tuple (type, encoding) where type is None if the type can’t be guessed (missing or unknown suffix) or a string of the form 'type/subtype', usable for a MIME content-type header.

I don't know if guess_type() (or add_type()) should check for valid types, if .sql should be added to the valid types (it's on the IANA page), or if the documentation should be fixed so it doesn't look like a guarantee. Or all three. :-)

You can get the same "bad" behavior on a POSIX system by having a mimetypes file with an incorrect entry in it. That would be a system misconfiguration, as is your Windows registry case, and is outside of Python's control. I suppose we could make it clearer (i.e., in that intro paragraph) that the system files are read by default (that is, the built-in tables are only defaults unless you specify otherwise).

It is unfortunately true that the mime types in the Windows registry are less reliable than those on Unix systems. This has nothing to do with the mimetypes module itself, though ;) I wonder if we should have made loading the Windows registry non-strict by default, but that ship has sailed, I think.

Checking for at least minimal validity (xxx/yyy) would at least make things a little better on Windows, so I wouldn't object to adding that.
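
For illustration only, a minimal "xxx/yyy" check could look something like the sketch below. The helper name looks_like_mime_type and the exact set of allowed characters are assumptions made for this example, not anything the mimetypes module currently provides:

```python
import re

# Hypothetical helper, not part of the mimetypes module.
# The allowed characters are an assumption: letters, digits and a few
# punctuation characters commonly seen in MIME type names.
_TOKEN = r"[A-Za-z0-9!#$&^_.+-]+"
_MIME_RE = re.compile(rf"{_TOKEN}/{_TOKEN}")


def looks_like_mime_type(value):
    """Return True if value has the minimal 'type/subtype' shape."""
    return isinstance(value, str) and _MIME_RE.fullmatch(value) is not None
```

With this check, 'text/plain' passes, while the registry value from the report, 'text\plain', is rejected because it contains no forward slash.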

To summarize, my suggestion would be to add a note to the intro paragraph that system files/registry are read by default and override the built-in tables, and add a minimal sanity check on the mime type values read. Adding .sql to the strict list is a separate issue, and would not change the behavior here (unless I'm missing something, which is possible).

There are issues around adding even a minimal validity check, though: do we backport that? Do we silently ignore strings in the wrong format? Do we "fix" a backslash to be a slash? Do we issue a warning for any problems we find? These questions should be discussed if we decide to go this route.
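
To make those options concrete, here is a rough sketch of how the three behaviors (silently ignore, repair the backslash, warn) might differ. The function sanitize_mime_type, its policy argument and the choice of RuntimeWarning are all hypothetical and exist only for discussion; none of this is existing mimetypes behavior:

```python
import re
import warnings

# Hypothetical minimal 'type/subtype' pattern; the character set is an assumption.
_MIME_RE = re.compile(r"[A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+")


def sanitize_mime_type(value, policy="ignore"):
    """Return a usable MIME type string, or None if the value is rejected.

    policy="ignore": silently drop malformed values.
    policy="fix":    replace backslashes with slashes, then re-check.
    policy="warn":   drop malformed values, but emit a warning.
    """
    if policy not in ("ignore", "fix", "warn"):
        raise ValueError(f"unknown policy: {policy!r}")
    if _MIME_RE.fullmatch(value):
        return value  # already well-formed, nothing to do
    if policy == "fix":
        repaired = value.replace("\\", "/")
        return repaired if _MIME_RE.fullmatch(repaired) else None
    if policy == "warn":
        warnings.warn(f"ignoring malformed MIME type {value!r}", RuntimeWarning)
    return None
```

Under the "fix" policy the value from the report would come back as 'text/plain'; under the other two it would be dropped, so guess_type() would presumably fall back to whatever the built-in tables say, if anything.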