[Python-Dev] Encoding detection in the standard library?

M.-A. Lemburg mal at egenix.com
Tue Apr 22 23:34:20 CEST 2008


On 2008-04-22 18:33, Bill Janssen wrote:

The 2002 paper "A language and character set determination method based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go about this.

Thanks for the reference.

Looks like the existing research on this just hasn't made it into the mainstream yet.

Here's their current project: http://www.language-observatory.org/ Looks like they are focusing more on language detection.

Another interesting paper using n-grams: "Language Identification in Web Pages" by Bruno Martins and Mário J. Silva http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf
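Just to illustrate the general idea behind these n-gram approaches, here's a rough sketch of rank-order character n-gram profiles (in the style of the well-known Cavnar/Trenkle "out-of-place" measure). The training snippets, the trigram length and the profile size are made-up toy values, not anything taken from the papers:

from collections import Counter

def ngram_profile(text, n=3, top=300):
    # Most frequent character n-grams of `text`, ranked by frequency.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, candidate):
    # Sum of rank differences; n-grams missing from the reference
    # profile get a maximum penalty.  Smaller means more similar.
    ranks = {g: i for i, g in enumerate(profile)}
    penalty = len(profile)
    return sum(abs(ranks.get(g, penalty) - i) for i, g in enumerate(candidate))

def guess_language(text, profiles):
    # Pick the reference profile closest to the text's own profile.
    cand = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(profiles[name], cand))

# Toy training data -- real profiles would be built from sizable corpora.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 50),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 50),
}
print(guess_language("the fox jumps over the dog", profiles))   # -> en

The same machinery works on raw bytes instead of characters, which is essentially what turns a language guesser into an encoding guesser.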

And one using compression: "Text Categorization Using Compression Models" by Eibe Frank, Chang Chui, Ian H. Witten http://portal.acm.org/citation.cfm?id=789742
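The compression idea can be sketched in a few lines as well: assign a text to the category whose training corpus compresses it with the least overhead. The snippet below uses zlib as a stand-in for the models in the paper, and the corpora are toy placeholders; it only shows the principle, not their actual setup:

import zlib

def extra_bytes(corpus, text):
    # How many additional compressed bytes does `text` cost on top of `corpus`?
    base = len(zlib.compress(corpus))
    return len(zlib.compress(corpus + b"\n" + text)) - base

def categorize(text, corpora):
    # The best category is the one whose corpus "already knows" the text's
    # byte patterns, i.e. the one with the smallest compression overhead.
    return min(corpora, key=lambda cat: extra_bytes(corpora[cat], text))

# Toy corpora -- stand-ins for per-category training data.
corpora = {
    "python": b"def class import return yield lambda self print " * 200,
    "html": b"<div> <span> href= </p> <table> <body> <head> " * 200,
}
print(categorize(b"import os\nclass Foo(object): pass", corpora))   # -> python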

They're looking at "LSE"s, language-script-encoding triples; a "script" is a way of using a particular character set to write in a particular language.

Their system has these requirements:

R1. the response must be either "correct answer" or "unable to detect", where "unable to detect" includes "other than registered" [the registered set of LSEs];
R2. applicable to multi-LSE texts;
R3. never accept a wrong answer, even when the program does not have enough data on an LSE; and
R4. applicable to any LSE text.

So, no wrong answers. The biggest disadvantage would seem to be that the registration data for a particular LSE is kind of bulky; on the order of 10,000 shift-codons, each of three bytes, about 30K uncompressed. http://portal.acm.org/ftgateway.cfm?id=772759&type=pdf
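For illustration, the "correct answer or unable to detect" contract (R1/R3) could be approximated along these lines, with plain byte-trigram sets standing in for the paper's shift-codon registrations and a made-up margin threshold. This is not their actual algorithm, just a sketch of the refuse-when-ambiguous behaviour:

def byte_trigrams(data):
    # All three-byte sequences occurring in `data` (bytes in, set of bytes out).
    return {data[i:i + 3] for i in range(len(data) - 2)}

def detect_lse(data, registered, margin=0.2):
    # Return the name of the best-matching registration, or None for
    # "unable to detect" when no registration clearly dominates.
    grams = byte_trigrams(data)
    if not grams:
        return None
    scored = sorted((len(grams & known) / len(grams), name)
                    for name, known in registered.items())
    best_score, best_name = scored[-1]
    runner_up = scored[-2][0] if len(scored) > 1 else 0.0
    if best_score - runner_up < margin:     # too close to call: refuse
        return None
    return best_name

# Toy registrations; real ones would come from per-LSE training data.
registered = {
    "utf-8/english": byte_trigrams(("the quick brown fox " * 50).encode("utf-8")),
    "latin-1/german": byte_trigrams(("über schöne grüße " * 50).encode("latin-1")),
}
print(detect_lse("the brown fox".encode("utf-8"), registered))   # -> utf-8/english
print(detect_lse(b"\x00\x01\x02", registered))                   # -> None

Returning None rather than the top score is what keeps the R3 guarantee: a wrong guess is worse than no guess.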

For a server based application that doesn't sound too large.

Unless you're using a very broad scope, I don't think that you'd need more than a few hundred LSEs for a typical application - nothing you'd want to put in the Python stdlib, though.

Bill

IMHO, more research has to be done into this area before a "standard" module can be added to Python's stdlib... and who knows, perhaps we're lucky and by that time everyone is using UTF-8 anyway :-)

Bill Janssen wrote in an earlier message:

I walked over to our computational linguistics group and asked. This is often combined with language guessing (which uses a similar approach, but using characters instead of bytes), and apparently can usually be done with high confidence. Of course, they're usually looking at clean texts, not random "stuff". I'll see if I can get some references and report back -- most of the research on this was done in the 90's.

Bill

-- Marc-Andre Lemburg eGenix.com

Professional Python Services directly from the Source (#1, Apr 22 2008)

Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...              http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...           http://python.egenix.com/


:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
        Registered at Amtsgericht Duesseldorf: HRB 46611

