[Python-Dev] Encoding detection in the standard library?
M.-A. Lemburg mal at egenix.com
Tue Apr 22 22:54:35 CEST 2008
[CCing python-dev again]
On 2008-04-22 12:38, Greg Wilson wrote:
> > I don't think that should be part of the standard library. People will mistake what it tells them for certain.
> [etc]
>
> These are all good arguments, but the fact remains that we can't control our inputs (e.g., we're archiving mail messages sent to lists managed by DrProject), and some of those inputs don't tell us how they're encoded. Under those circumstances, what would you recommend?
I haven't done much research into this, but in general, I think it's better to:
* first try to look at other characteristics of a text message, e.g. language, origin, topic, etc.,
* then narrow down the number of encodings which could apply,
* rank them to try to avoid ambiguities, and
* then try to see what percentage of the text you can decode using each of the encodings in reverse ranking order (i.e. more specialized encodings should be tested first, latin-1 last).
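
As a rough illustration of the last two steps, here is a minimal Python sketch. It assumes the candidate list has already been narrowed down by hand and is ordered from most specialized to most general. The function names decodable_fraction and rank_encodings are made up for this example and are not part of any library. Latin-1 goes last because it maps every byte to a character, so it always scores 100% and tells you nothing by itself.

```python
def decodable_fraction(data: bytes, encoding: str) -> float:
    """Return the share of bytes in `data` that decode cleanly as `encoding`."""
    if not data:
        return 1.0
    bad = 0
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the remainder decoded without errors
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice passed above
            bad += exc.end - exc.start
            pos += max(exc.end, 1)  # skip the offending bytes and resume
    return 1.0 - bad / len(data)


def rank_encodings(data: bytes, candidates: list[str]) -> list[tuple[str, float]]:
    """Score each candidate; ties keep the caller's order (specialized first)."""
    scored = [(enc, decodable_fraction(data, enc)) for enc in candidates]
    # sorted() is stable, so equal scores stay in candidate order
    return sorted(scored, key=lambda pair: -pair[1])


if __name__ == "__main__":
    sample = b"Gr\xfc\xdfe"  # "Gruesse" with umlaut and sharp s, latin-1 bytes
    print(rank_encodings(sample, ["utf-8", "latin-1"]))
```

The score is only a heuristic: a high percentage says the bytes are valid in that encoding, not that the decoded text is what the sender meant, which is why the earlier narrowing steps still matter.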
-- Marc-Andre Lemburg eGenix.com