[Python-Dev] repr vs. str and locales again

Peter Funk pf@artcom-gmbh.de
Sun, 21 May 2000 17:54:06 +0200 (MEST)


Hi!

Ka-Ping Yee:

> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does not affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
>
> Very sorry! I didn't actually look to see where the patch was being applied. But then how can this have any effect on squishdot's indexing?

Sigh. Let me explain this in some detail.

What do you see here: äöüÄÖÜß? If all went well, you should see some umlauts, which occur quite often in German words like "Begrüssung", "ätzend" or "Grützkacke" and so on.

During the late 80s we here in Germany spent a lot of our free time patching open source tools like 'elm', 'B-News', 'less' and others to make them "8-Bit clean". This was needed, for example, on ancient Unices like SCO Xenix, where the implementations of C library functions like 'isprint' and 'islower' were out of reach.

After several years everybody seemed to agree on ISO-8859-1 as the new European standard character set, which was also often loosely called 8-Bit ASCII, because ASCII is a proper subset of ISO Latin-1. Even Windows, at least in its German versions, used ISO-8859-1.

As the WWW began to gain popularity, nobody in their right mind really used these splendid ASCII escapes, for example '&auml;' instead of 'ä'. The same holds true for the TeX user community, where everybody was happy to type real umlauts instead of the ugly backslash escape sequences used before: \"a\"o\"u ...

To make a long story short: a lot of effort has been spent to make ALL programs 8-Bit clean, that is, to move the bytes through without translating them from or into a bunch of incompatible multi-byte sequences, which nobody can read or even wants to look at.
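To illustrate what "8-Bit clean" means in practice, here is a minimal Python 3 sketch. The two filter functions are hypothetical and are not taken from any of the patched tools mentioned above; they only contrast a filter that passes every byte through unchanged with one that masks off the eighth bit.

```python
def pass_through(data: bytes) -> bytes:
    """Hypothetical 8-bit clean filter: hands every byte on unchanged."""
    return bytes(data)


def strip_high_bit(data: bytes) -> bytes:
    """Hypothetical 7-bit filter: clears the eighth bit of every byte."""
    return bytes(b & 0x7F for b in data)


# "Grütze" in ISO-8859-1: the umlaut is the single byte 0xFC.
original = "Grütze".encode("latin-1")

clean = pass_through(original)      # bytes survive untouched
mangled = strip_high_bit(original)  # 0xFC & 0x7F == 0x7C, i.e. '|'

print(clean.decode("latin-1"))
print(mangled.decode("latin-1"))
```

The second filter turns the umlaut into an unrelated ASCII character, which is the kind of damage the 8-bit clean patches were meant to prevent.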

Now to get back to your question: There are several nice HTML indexing engines out there. I personally use HTDig. At least on Linux these programs deal fine with HTML files containing 8-bit chars.

But if for some reason umlauts end up as octal escapes ('\344' instead of 'ä') because a Python 'print some_tuple' was used while creating the HTML files, a search engine will be unable to find the words containing those escaped umlauts.
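The effect can be sketched in present-day Python 3, with two assumptions: the built-in ascii() stands in for the escaping repr() of the interpreter discussed here, and it emits hex escapes ('\xe4') where the old interpreter printed octal ones ('\344'). The variable names are made up for illustration.

```python
words = ("Begrüssung", "ätzend")

# Converting a tuple to text formats every element with repr(), not
# str(), so str() and repr() of the tuple are identical.
same_text = str(words) == repr(words)

# ascii() escapes all non-ASCII characters, similar to what the old
# repr() did to 8-bit strings inside a printed tuple.
escaped = ascii(words)

# Formatting the elements yourself keeps the umlauts intact, so an
# indexer reading the resulting HTML can still find the words.
readable = ", ".join(words)

print(escaped)
print(readable)
```

A page generator that writes the joined string instead of the printed tuple leaves the umlauts searchable.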

Mit freundlichen Grüßen, Peter

P.S.: Hope you didn't find my explanation boring or off-topic.