[Python-Dev] Encoding detection in the standard library?

Bill Janssen janssen at parc.com
Tue Apr 22 17:14:43 CEST 2008


IMHO, more research has to be done into this area before a "standard" module can be added to Python's stdlib... and who knows, perhaps we're lucky and by then everyone is using UTF-8 anyway :-)

I walked over to our computational linguistics group and asked. Encoding detection is often combined with language guessing (which uses a similar approach, but using characters instead of bytes), and apparently it can usually be done with high confidence. Of course, they're usually looking at clean texts, not random "stuff". I'll see if I can get some references and report back -- most of the research on this was done in the 1990s.
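To make the "statistics over bytes" idea concrete, here is a rough sketch. It assumes the approach is a byte-bigram frequency profile per candidate encoding, which this message does not actually spell out, so treat it as one plausible reading rather than what the linguistics group uses. The names byte_bigrams, build_profiles, score and guess_encoding are made up for illustration, and the training sample is a toy.

```python
import math
from collections import Counter


def byte_bigrams(data):
    """Count adjacent byte pairs in a bytes object."""
    return Counter(zip(data, data[1:]))


def build_profiles(sample_text, encodings):
    """Hypothetical helper: one bigram profile per candidate encoding,
    built by encoding the same training text each way."""
    return {enc: byte_bigrams(sample_text.encode(enc)) for enc in encodings}


def score(data, profile):
    """Log-likelihood of data under a profile, with add-one smoothing
    over all 256 * 256 possible byte pairs."""
    total = sum(profile.values())
    vocab = 256 * 256
    return sum(
        math.log((profile[pair] + 1) / (total + vocab))
        for pair in zip(data, data[1:])
    )


def guess_encoding(data, profiles):
    """Return the candidate encoding whose profile scores data highest."""
    return max(profiles, key=lambda enc: score(data, profiles[enc]))


# Toy training text (an assumption; real profiles need far more data).
SAMPLE = "Grüße aus München, schöne Straße, über Äpfel und Öl"
PROFILES = build_profiles(SAMPLE, ["utf-8", "latin-1", "utf-16-le"])

unknown = "Müller hört schöne Lieder über Köln".encode("latin-1")
print(guess_encoding(unknown, PROFILES))
```

Language guessing would look the same with character pairs in place of byte pairs. With a sample this small the result is only meaningful for text resembling the training data, which matches the caveat about clean texts versus random "stuff".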

Bill


