[Python-Dev] Encoding detection in the standard library?
M.-A. Lemburg mal at egenix.com
Tue Apr 22 12:31:34 CEST 2008
On 2008-04-21 23:31, Martin v. Löwis wrote:
>> This is useful when you get a hunk of data which should be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content).
>
> I don't think that should be part of the standard library. People will mistake what it tells them for certain.
+1
I also think that it's better to educate people to add (correct) encoding information to their text data, rather than give them a guess mechanism...
http://chardet.feedparser.org/docs/faq.html#faq.yippie
chardet is based on the Mozilla algorithm and at least in my experience that algorithm doesn't work too well.
The Mozilla algorithm may work for Asian encodings due to the fact that those encodings are usually also bound to a specific language (and you can then use character and word frequency analysis), but for encodings which can encode far more than just a single language (e.g. UTF-8 or Latin-1), the correct detection rate is rather low.
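To illustrate why byte-level validity alone says little for such encodings, here is a minimal sketch (Python 3 syntax; the sample bytes and the list of candidate encodings are made up for the example). Every candidate accepts the same bytes and yields different text, so only language statistics could tell them apart:

```python
# Hypothetical sample: three bytes with the high bit set.
data = b"\xe4\xf6\xfc"

# Candidate single-byte encodings, chosen only for illustration.
# Each of them assigns a character to all three byte values.
candidates = ["latin-1", "iso8859-7", "koi8-r"]

# Decoding succeeds for every candidate; none of them raises an error.
decoded = {name: data.decode(name) for name in candidates}

for name, text in decoded.items():
    print(name, repr(text))
```

All three decodes succeed, giving Western European, Greek and Cyrillic letters respectively. A detector has to guess which result looks like plausible text, and that guess gets weaker the more languages an encoding can represent.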
The problem becomes even more difficult when leaving the normal text domain or when mixing languages in the same text, e.g. when trying to detect the encoding of source code with comments written in a non-ASCII encoding.
The "trick" to just pass the text through a codec and see whether it roundtrips also doesn't necessarily help: Latin-1, for example, will always round-trip, since each of the 256 possible byte values maps directly to one of the first 256 Unicode code points, so decoding can never fail.
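A small sketch of the round-trip check, in Python 3 syntax; the helper name `roundtrips` is made up for this example and is not an existing API:

```python
def roundtrips(data, encoding):
    """Return True if data decodes with encoding and re-encodes to the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The bytes are not valid in this encoding.
        return False


# Every possible byte value, i.e. data that is certainly not meaningful text.
every_byte = bytes(range(256))

print(roundtrips(every_byte, "latin-1"))  # True: Latin-1 accepts anything
print(roundtrips(every_byte, "utf-8"))    # False: 0x80 alone is invalid UTF-8

# Text that really is UTF-8 also passes the Latin-1 check.
utf8_text = "L\u00f6wis".encode("utf-8")
print(roundtrips(utf8_text, "latin-1"))   # True, although the guess is wrong
```

So a successful round-trip through Latin-1 carries no information, while a failed UTF-8 round-trip at least rules UTF-8 out.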
IMHO, more research has to be done in this area before a "standard" module can be added to Python's stdlib... and who knows, perhaps we're lucky and by that time everyone is using UTF-8 anyway :-)
-- Marc-Andre Lemburg eGenix.com