[Python-Dev] PEP 263 considered faulty (for some Japanese)
Tom Emerson tree@basistech.com
Wed, 13 Mar 2002 09:41:01 -0500
Stephen J. Turnbull writes:
>>>>> "Martin" =3D=3D Martin v Loewis <martin@v.loewis.de> writes: =20 Martin> Reliable detection of encodings is a good thing, though, =20 I would think that UTF-8 can be quite reliably detected without the "BOM".
Detecting UTF-8 is relatively straightforward: Martin Dürst has presented on this at the last few Unicode conferences. Implementing it is trivial for anyone who thinks about it.
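As a concrete illustration of how cheap the check is, here is a minimal sketch in Python (not the detector Dürst describes; the function name and the non-ASCII requirement are my own for this example): a byte string that strictly decodes as UTF-8 and actually contains multibyte sequences is almost certainly UTF-8.

    def looks_like_utf8(data: bytes) -> bool:
        """Heuristic: well-formed UTF-8 that contains at least one
        non-ASCII byte is almost certainly intended as UTF-8."""
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False
        # Pure ASCII also decodes cleanly; only non-ASCII bytes carry evidence.
        return any(b >= 0x80 for b in data)

    print(looks_like_utf8("Martin Dürst".encode("utf-8")))       # True
    print(looks_like_utf8("Martin Dürst".encode("iso-8859-1")))  # False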
> I suppose you could construct short ambiguous sequences easily for
> ISO-8859-[678] (which are meaningful in the corresponding natural
> language), but it seems that even a couple dozen characters would make
> the odds astronomical that "in the wild" syntactic UTF-8 is intended to
> be UTF-8 Unicode (assuming you're expecting a text file, such as Python
> source). Is that wrong? Have you any examples? I'd be interested to see
> them; we (XEmacs) have some ideas about "statistical" autodetection of
> encodings, and they'd be useful test cases.
The problem with the ISO-8859-x family is that the encoding space is identical for all of them, making it difficult to determine which one you are looking at without statistical or lexical methods. EUC-CN and EUC-KR have a similar problem: just looking at the bytes themselves, you cannot immediately tell whether a document is Chinese or Korean. Compare this with Big5, Shift JIS, or any of the ISO-2022 encodings, which are pretty easy to detect just by looking at the byte sequences.
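A quick sketch of why byte-level validity is useless here (the sample bytes are arbitrary, chosen only because they are assigned letters in all three character sets): the same bytes decode without complaint as Arabic, Greek, or Hebrew, so only statistics about which sequences are plausible in each language can separate them.

    # Four arbitrary high bytes; each one is an assigned letter in all
    # three of these character sets.
    sample = bytes([0xE4, 0xE5, 0xE6, 0xE7])

    for charset in ("iso-8859-6", "iso-8859-7", "iso-8859-8"):
        # Strict decoding succeeds for every one of them; nothing at the
        # byte level says which language the author intended.
        print(charset, sample.decode(charset, errors="strict"))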
There are a bunch of statistical language/encoding identifiers out in the wild, and frankly most of them suck on real text. Anyone working in the space usually starts with Cavnar and Trenkle's "N-Gram-Based Text Categorization" (see the sketch after the example below) and then trains it with whatever random data they have (http://odur.let.rug.nl/~vannoord/TextCat/competitors.html has a list of tools). Finding enough text in the languages you are interested in can be hard. For example, Lextek claims to support 260 languages, but examining the list shows that they used the UNHCR text as their training corpus: languages whose UNHCR translation is provided as GIFs or PDFs are not included in Lextek's tool. So, while it can detect text written in a relatively obscure South American language, it does not detect Chinese, Japanese, Korean, or Arabic. Further, because of the small corpus size, it is very easy to confuse. My standard test for a language/encoding identifier is to type the test string in UPPER CASE. For example, go to
http://odur.let.rug.nl/~vannoord/TextCat/Demo/
and enter
This is a test of the emergency broadcasting system.
and it will decide that the text is in English. If you enter
THIS IS A TEST OF THE EMERGENCY BROADCASTING SYSTEM.
then it cannot identify the text; at least it admits as much. Lextek's identifier classifies that same text as Catalan or something.
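For readers who have not seen the Cavnar and Trenkle paper, here is a minimal sketch of the N-gram profile approach in Python. It is not TextCat or Lextek; the profile size and the rank "out-of-place" distance follow the paper, while the function names and the toy training data are invented for illustration.

    from collections import Counter

    def ngram_profile(text, max_n=3, size=300):
        """Rank the most frequent 1..max_n character n-grams of a text."""
        counts = Counter()
        padded = " " + " ".join(text.split()) + " "
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
        return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}

    def out_of_place(doc_profile, lang_profile):
        """Sum of rank differences; n-grams missing from the language
        profile pay a maximum penalty."""
        max_penalty = len(lang_profile)
        return sum(abs(rank - lang_profile.get(gram, max_penalty))
                   for gram, rank in doc_profile.items())

    def identify(text, profiles):
        """Return the language whose profile is closest to the document's."""
        doc = ngram_profile(text)
        return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

    # Toy training data; a real detector needs megabytes of clean text per
    # language/encoding pair, as discussed below.
    profiles = {
        "english": ngram_profile("the quick brown fox jumps over the lazy dog " * 50),
        "german": ngram_profile("der schnelle braune fuchs springt über den faulen hund " * 50),
    }

    print(identify("this is a test of the emergency broadcasting system.", profiles))
    print(identify("THIS IS A TEST OF THE EMERGENCY BROADCASTING SYSTEM.", profiles))

Note how uppercasing the input defeats these toy profiles for exactly the reason described above: the profiles were built from lowercase text, so essentially every n-gram in the document misses, the distances collapse toward the maximum penalty, and the answer is meaningless.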
The other issue to deal with when finding training data is its cleanliness. Spidering the web can be hard because English is everywhere. If you don't strip markup, then the markup can overwhelm the text and result in a bogus profile.
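A minimal sketch of the kind of markup stripping meant here, using Python's standard html.parser (the class and helper names are invented; a real pipeline also has to deal with entities, encodings declared in the markup itself, and boilerplate navigation text):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect only character data, dropping tags, scripts, and styles."""
        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1

        def handle_data(self, data):
            if not self._skip:
                self.chunks.append(data)

    def strip_markup(html):
        parser = TextExtractor()
        parser.feed(html)
        return " ".join(" ".join(parser.chunks).split())

    print(strip_markup("<html><body><p>Bonjour <b>le</b> monde</p></body></html>"))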
Anyway, statistical detection is good and doable, but it has to be thought out, and you need enough data (we use at least 5 megabytes, and often 10-15 megabytes, of clean text for each language and encoding pair we support in our detector) to build a good model. The more languages/encodings you support, the more data you need.
> But the Web in general provides (mandatory) protocols for identifying
> content-type, yet I regularly see HTML files with incorrect http-equiv
> meta elements, and XHTML with no encoding declaration containing Shift
> JIS. Microsoft software for Japanese apparently ignores Content-Type
> headers and the like in favor of autodetection (probably because the
> same MS software regularly relies on users to set things like charset
> parameters in MIME Content-Type).
Mandatory protocols are useless if no one actually pays attention to them, which is why browser manufacturers generally ignore the Content-Type header. At the very least, if something claims to be ISO-8859-1, it probably isn't. And worse than an XHTML document with no encoding declaration containing Shift JIS, I've seen XHTML documents that explicitly declare UTF-8 but contain Shift JIS.
How does Java deal with this? Are all files required to be in UTF-8?
-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"