[Python-Dev] Re: Alternative Implementation for PEP 292: Simple String Substitutions

Stephen J. Turnbull stephen at xemacs.org
Mon Sep 13 06:21:32 CEST 2004


"Fredrik" == Fredrik Lundh <fredrik at pythonware.com> writes:

Fredrik> M.-A. Lemburg wrote:

>>> (google for "stringlib" for some work I'm doing in this area)

>> Ah, now I know where you're coming from :-) Shift tables don't
>> work well in the Unicode world with its large alphabet.

Fredrik> since most real-life text use characters from only a
Fredrik> small number of regions in that alphabet,

This is true of "most real-life text", but it's going to be false most of the time for a large (and rapidly growing) minority of users: those working with texts composed mostly of Asian ideographs. Unihan (spread over about 80 256-character rows) has a potentially big problem: because it is ordered by root (radical), then stroke count, the simpler (and usually more frequently used) ideographs with a common root cluster near the root. Whether those clusters frequently overlap under a simple compression method like "lowest 5 bits", I don't know offhand.
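To make the "lowest 5 bits" worry concrete, here is a minimal sketch. It assumes the compression means "fold each character into a 32-bit mask, using the low 5 bits of its code point as the bit index"; the function names are made up for illustration and are not taken from stringlib.

```python
def low5_mask(text):
    """Fold every character of text into a 32-bit mask keyed by its low 5 bits.

    Assumption: this is what "lowest 5 bits" compression means here.
    """
    mask = 0
    for ch in text:
        mask |= 1 << (ord(ch) & 0x1F)
    return mask


def bits_used(text):
    """Count how many of the 32 mask bits the text occupies."""
    return bin(low5_mask(text)).count("1")


def may_contain(mask, ch):
    """False means ch is definitely absent; True may be a collision."""
    return bool(mask & (1 << (ord(ch) & 0x1F)))


# Any 32 consecutive code points fill the whole mask, so the filter
# can no longer rule anything out. U+4E00 is the start of the main
# CJK Unified Ideographs block.
run = "".join(chr(0x4E00 + i) for i in range(32))
saturated = low5_mask(run)
```

With `saturated`, `may_contain` answers True for every character, including ones that never occur in `run`. Whether real ideographic text gets close to that depends on exactly the frequency clustering in question; the sketch only shows the mechanism, not how often it bites.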

I don't know whether the composed Hangul (~ 40 rows) would show clustering; that would depend on phonetic frequencies in the Korean language.

Of course the find algorithm you present is almost surely a big win over the brute-force method, even in the presence of some degree of clustering in Unihan and Hangul. But I worry that it's the exception rather than the rule when you rely on assumptions like "real-life text uses characters drawn from a small number of short contiguous regions in the alphabet."
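As a rough illustration of why a mask-based search beats brute force, and of how a crowded mask erodes the gain, here is a simplified sketch. It is not the stringlib algorithm: it is a plain Sunday-style skip that assumes a 32-bit mask keyed by the low 5 bits of each code point, and `char_bit` and `masked_find` are hypothetical names.

```python
def char_bit(ch):
    """Mask bit for ch, assuming the low 5 bits of its code point are the key."""
    return 1 << (ord(ch) & 0x1F)


def masked_find(haystack, needle):
    """Return (index, windows_compared); index is -1 when needle is absent."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0, 0
    mask = 0
    for ch in needle:
        mask |= char_bit(ch)
    i = 0
    compared = 0
    while i <= n - m:
        compared += 1
        if haystack[i:i + m] == needle:
            return i, compared
        nxt = i + m  # first character past the current window
        if nxt < n and not (mask & char_bit(haystack[nxt])):
            # That character is definitely not in needle, so no window
            # covering it can match: jump past it.
            i = nxt + 1
        else:
            # Possibly in needle (or a mask collision): advance by one.
            i += 1
    return -1, compared


haystack = "x" * 100
_, sparse = masked_find(haystack, "y" * 32)  # needle sets a single mask bit
full = "".join(chr(0x4E00 + k) for k in range(32))
_, dense = masked_find(haystack, full)       # needle sets all 32 mask bits
```

Here `sparse` stays at a handful of window comparisons, while `dense` equals the brute-force count of `len(haystack) - len(needle) + 1`. The result is still correct in the saturated case; only the skipping is lost.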

--
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


