How to Sort Unicode Strings Alphabetically in Python (original) (raw)
Sorting strings in Python is something that you can almost take for granted if you do it all the time. Yet, it can become surprisingly tricky considering all the edge cases lurking in the vast Unicode standard. Understanding Unicode isn’t an easy feat, so prepare yourself for a whirlwind tour of surprising edge cases and effective ways of dealing with them.
Python pioneered and popularized a robust sorting algorithm called Timsort, which now ships with several major programming languages, including Java, Rust, and Swift. When you call sorted() or list.sort(), Python uses this algorithm under the surface to rearrange elements. As long as your sequence contains comparable elements of compatible types, then you can sort numbers, strings, and other data types in the expected order:
Unless you specify otherwise, Python sorts these elements by value in ascending order, comparing them pairwise. Each pair must contain elements that are comparable using either the less than (<) or greater than (>) operator. Sometimes, these comparison operators are undefined for specific data types like complex numbers or between two distinct types. In these cases, a comparison will fail:
In most cases, you’ll be working with homogeneous sequences comprising elements of the same type, so you’ll rarely run into this problem in the wild. However, things may start to fall apart once you throw strings with non-Latin characters into the mix, such as letters with diacritics or accents:
In this example, you run into a common challenge associated with sorting strings. When a string contains characters whose ordinal values extend beyond the usual ASCII range, then you might get unexpected results like what you have here. The name Łukasz ends up after Zbigniew, even though the letter Ł (pronounced like a w sound in English) occurs earlier than Z in the Polish alphabet. Why’s that?
Python sorts strings lexicographically by comparing Unicode code points of the individual characters from left to right. It just so happens that the letter Ł has a higher ordinal value in Unicode than the letter Z, making it greater than any of the Latin letters:
Code points are unique yet arbitrary identifiers of Unicode characters, and they don’t inherently agree with the alphabetical order of spoken languages. So, lexicographic sorting won’t be appropriate for languages other than English.
Different cultures follow different rules for sorting strings, even when they share the same alphabet or parts of it with other cultures. For example, the ch digraph is considered two separate letters (c and h) in Polish, but it becomes a stand-alone letter placed between h and i in the Czech alphabet. This is known as a contraction in Unicode.
Moreover, the sorting order can sometimes differ within the same culture, depending on the context. For example, most German phone books tend to treat letters with an umlaut (ä, ö, ü) similarly to the ae, oe, and ue letter combinations. However, other countries overwhelmingly treat these letters the same as their Latin counterparts (a, o, u).
There’s no universally correct way to sort Unicode strings. You need to tell Python which rules to apply to get the desired ordering. So, how do you sort Unicode strings alphabetically in Python?
How to Sort Strings Using the Unicode Collation Algorithm (UCA)
The problem of sorting Unicode strings isn’t unique to Python. It’s a common challenge in any programming language or database. To address it, Technical Report #10 in the Unicode Technical Standard (UTS) describes the collation of Unicode strings, which is a consistent way of comparing two strings to establish their sorting order.
The Unicode Collation Algorithm (UCA) assigns a hierarchy of numeric weights to each character, allowing the creation of binary sort keys that account for accents and other special cases. These keys are defined at four levels that determine various features of a character:
- Primary: The base letter
- Secondary: The accents
- Tertiary: The letter case
- Quaternary: Other features
Later in this tutorial, you’ll learn how to leverage these weight levels to customize the Unicode collation algorithm by, for example, ignoring the letter case in case-insensitive sorting.
While the UCA supplies the Default Unicode Collation Element Table (DUCET), you should generally customize this default collation table to the specific needs of a particular language and application. It’s virtually impossible to ensure the desired sort order for all languages using only one character table. Therefore, software libraries implementing the UCA usually rely on the Common Locale Data Repository (CLDR) to provide such customization.
This repository contains several XML documents with language-specific information. For example, the collation rules for sorting text in Polish explain the relationship between the letters z, ź, and ż in both uppercase and lowercase:
Without going into the technical details of the collation rule syntax, you can observe that z comes before ź, which comes before ż. The same rule applies regardless of whether the letters are uppercase or lowercase.
To use the Unicode Collation Algorithm in Python, you can install the pyuca library into your virtual environment. While this library is slightly obsolete and only supports Unicode up to version 10, you can download the latest DUCET table from Unicode’s official website and supply the corresponding filename when creating a collator object:
Alternatively, you can create a collator instance without specifying the file, in which case it’ll use an older version that was distributed with the library. Calling the collator’s .sort_key()
method on a string reveals a tuple of weights that you can use in a comparison.
Notice that weights at the first index in both tuples are identical, meaning that the letters z and ż are essentially treated as equal, disregarding accents. This is an intended behavior of the default collation table in UCA when you leave it without customization, but as you’ll soon learn, this isn’t ideal.
The second weight, which corresponds to the following letter, determines the sorting order of both strings. The remaining weights are irrelevant at this point.
You can sort a sequence of Unicode strings using pyuca by providing the reference to your collator’s .sort_key()
method as the key function:
Remember to omit the parentheses at the end of the method name to avoid calling it! You want to pass a reference to a function or method that Python will later call for each item in the list. If you were to call your method here, then it would execute immediately, returning a single value instead of the callable object that sorted()
expects.
As you can see, the default collation table in UCA doesn’t conform to the Polish language rules, as the name Żaneta should sort after Zbigniew. Unfortunately, pyuca doesn’t let you specify custom rules to alleviate that. While you could manually try editing the downloaded text file before supplying it to the collator, it’s a tedious and error-prone task.
Not only is pyuca limited in this regard, but it also hasn’t been actively maintained for a few years now. As a pure-Python library, it might negatively affect the performance of more demanding applications. Overall, you’re better off finding an alternative tool like PyICU, which is a Python binding to IBM’s open-source International Components for Unicode (ICU) library implemented in C++ and Java.
Once you have PyICU installed, getting started is pretty straightforward and doesn’t look much different than using pyuca. Here’s an example of how you can create a collator and supply an optional set of rules specific to a given language:
In this case, you use the Polish flavor of the Unicode collation rules, which results in the expected sort order of names. To get a complete list of all available languages, use the following code snippet:
These languages are built into the library and shipped with it using a binary format, independently of your operating system, which may support a different set of languages. You can use any of these to customize your collator instance.
Additionally, you may build a collator from scratch with entirely custom rules expressed in a format similar to the one you’ve seen in the XML file before:
You’ve created an instance of the RuleBasedCollator
with the collation rules for sorting text in Polish, which you borrowed from the Common Locale Data Repository (CLDR). As a result, the Polish names are now arranged correctly.
This is barely scratching the surface of what’s possible with PyICU. The library isn’t just for sorting Unicode strings. It can also help you develop multilingual applications for users worldwide.
The biggest downside of PyICU is its potentially cumbersome installation. It’s also a relatively large library, both in terms of size on disk and an extensive API. If one of these becomes a deal-breaker for you, or if you can’t use any third-party libraries, then consider alternative options.
Next up, you’ll learn about the locale
module in Python’s standard library.
Leverage Python’s locale
Module to Sort Unicode Strings
Python’s official sorting how-to guide recommends using the standard library module locale to respect cultural and geographical conventions when sorting strings. It’s a Python interface to C localization functions that expose the regional settings on a POSIX-compatible operating system. Some of them include:
- Address and phone formats
- Currency symbol
- Date and time format
- Decimal symbol (for example, point vs comma)
- Language
- Measurement units
- Paper size (for example, A4 vs Letter)
These parameters are collectively known as locale. You can control them globally through a set of environment variables, like LC_ALL
, which will apply to your entire operating system. Alternatively, you can selectively override some of them within your program when it runs.
The locale is about so much more than a language. For example, English has several dialects spoken on many continents. Each dialect may have distinct traits, such as these:
American | British | |
---|---|---|
Spelling | color | colour |
Vocabulary | truck | lorry |
Measurement System | imperial | metric |
Currency | dollars | pounds |
Pronunciation | uh-loo-muh-nuhm | al-yoo-min-ee-um |
Even regions of the same country sometimes exhibit slight variations. To account for these possible differences, a locale identifier consists of two mandatory and two optional parts:
- Language: A two-letter, lowercase language code (ISO 639)
- Territory: A two-letter, uppercase country code (ISO 3166)
- Charmap: An optional character encoding
- Modifier: An optional modifier for specific locale variations
For example, the identifier ca_ES.UTF-8@valencia
represents the Catalan language spoken in Spain using the Valencian dialect with the UTF-8 character encoding. On the other hand, en_US
is another valid identifier that corresponds to English spoken in the United States with your operating system’s default character encoding.
If you’re on macOS or a Linux distribution, then you can check your current locale by typing the locale
command at your command prompt:
You get information about specific locale categories and their values. The LC_COLLATE
category controls how strings are compared, which determines the sorting order of Unicode strings.
To reveal the complete list of available locales installed in your system, append the -a
flag to the locale
command:
The C
and POSIX
locales are defaults defining a minimal set of regional settings that must comply with the C programming language on a POSIX-compliant system.
To install an additional locale, pick one from a list of the supported locales and then generate the corresponding localization files from the chosen template. If you’re on Linux, then you can list the supported locales by revealing the contents of a special file:
These commands enable a new locale for Portuguese as spoken in Brazil. If you’d like your operating system to pick it up, then you must also install the corresponding language pack before changing the current locale:
The package name that you should install may vary. For instance, the one above is suitable for Linux distributions based on Debian. After changing the current locale, most of the standard Unix commands will reflect that in their output:
When you try to print the contents of a missing file with cat
, you get an error message in Portuguese. Note that setting environment variables has only a temporary effect that will last until you close the current terminal session. To apply the new locale permanently, edit your shell configuration file accordingly.
Now, how do you control the locale in Python? You’ve already seen that the default lexicographic sorting in Python may lead to incorrect results. To fix that, you can try using locale.strxfrm(), which transforms a string into a locale-aware counterpart, as the sort key:
Depending on your current locale, the list may end up sorted differently than without the custom key. However, the result can still be incorrect when you don’t specify the right locale. In such a case, Python relies on your operating system’s default locale, which may be unsuitable for the given language.
When sorting strings, you always want to override the collation rules to match the language in question:
At last, you get the expected result by setting the collation category to the Polish locale.
Unfortunately, calling locale.setlocale()
is a global setting that won’t help you with sorting words in more than one language simultaneously. It also isn’t thread-safe, so you’d typically invoke it once after starting your program, as opposed to handling the incoming HTTP requests in a multi-threaded web application, for example.
Apart from being able to handle at most one language at a time, locale
will only work on POSIX-compliant systems, making it less portable than other solutions. It requires the correct locale to be installed in your system, and it supports ISO 14651, which is a subset of the complete Unicode Collation Algorithm. That said, it may be enough in more straightforward use cases.
Transliterate Foreign Characters Into Latin Equivalents
You don’t always care about precise collation rules of foreign languages. Instead, you may prefer to make the sort order appear correct to an English speaker by converting each string into a Latin script that best approximates the original. In linguistics, such conversion is known as the romanization of text, and it can take a few forms, including a combination of these:
- Transliteration: Swapping letters from one script with another
- Transcription: Representing a foreign script phonetically
The basic idea behind transliteration is to replace every non-ASCII letter with its closest Latin counterpart before comparing the strings. You can also use this technique to represent Unicode on a legacy system that only supports ASCII characters or to create a clean URL. Note that simply stripping any character beyond ASCII would alter the text, removing too much information.
Here are a few examples of words transliterated from different alphabets:
Alphabet | Original | Transliteration |
---|---|---|
Cyrillic | Петроград | Petrograd |
Greek | Κνωσός | Knosos |
Kurdish | Xirabreşkê | Xirabreske |
Norwegian | Svolvær | Svolvaer |
Polish | Łódź | Lodz |
This method works best with alphabets derived from Latin or Greek, as opposed to other alphabets like Arabic, Chinese, or Hebrew. To handle non-Roman languages, you’ll need to devise a way of transcribing sounds associated with the visual symbols.
As an example, take a look at these city names and their transcriptions:
Alphabet | Original | Transcription |
---|---|---|
Arabic | الرياض | Riyadh |
Chinese | 北京 | Beijing |
Hebrew | יְרוּשָׁלַיִם | Jerusalem |
Japanese | 東京 | Dongjing (Tokyo) |
Korean | 서울 | Seoul |
Transcribing a given text is inherently more complicated than transliterating it, so you’ll only consider transliteration from now on.
One way to strip accents or diacritics from Latin letters is by taking advantage of Unicode equivalence, which states that alternative sequences of code points can encode semantically identical characters. Even if two Unicode strings look the same to a human reader, they may have different lengths and contents, affecting the sort order. For example, the letter é has two canonically equivalent Unicode representations:
Canonical Equivalence of Unicode Strings
The representation on the left encodes the last letter in the word café using Unicode code point U+00E9. Conversely, the representation on the right encodes the same letter using two code points: U+0065 for the Latin letter e, followed by U+0301, which is the acute accent that combines with the preceding letter. The combining characters like these don’t stand on their own but modify how other characters render.
The Unicode standard specifies four normal forms and the corresponding algorithms for converting between them through a process called Unicode normalization. In Python, you can use the standard-library module unicodedata to perform such a normalization:
Notice that the two variables render the same glyphs when you evaluate them, but the underlying strings have different lengths and don’t compare as equal. Normalizing either of the strings into the common Unicode representation allows for an apples-to-apples comparison.
Unicode’s NFD form allows you to decompose a Unicode string into plain Latin letters while discarding any combining characters that may result from this normalization. To make your life easier, define the following helper function, which you’ll evolve later:
After decomposing a string, you .encode()
it into a sequence of bytes while ignoring ordinal values that can’t fit in the ASCII range. Then, you .decode()
the resulting bytes into a Python string again, which you return to the caller.
To verify if your function works as expected, you can test it against a few pangrams, which are humorous and often nonsensical sentences containing every letter from the given alphabet:
You define and iterate through a Python dictionary of three pangrams before printing their transliterations on the screen. Unfortunately, while your solution works fine with most letters, it sometimes removes them instead of finding a replacement.
Apparently, certain letters with diacritics don’t have a corresponding combining character in Unicode. For example, the Polish letter Ł is a standalone character with an inseparable stroke, making it already normalized:
Trying to decompose Ł yields the same letter because it’s already decomposed. You can mitigate this unwanted behavior by allowing your transliteration function to accept an optional character mapping to .translate() the string with:
You use a generator expression to prepare a string with all the combining characters in Unicode. If the caller provided a custom mapping, then you separate its keys and values into two strings called src
and dst
, which you pass to str.maketrans() to build a translation table. Otherwise, you build a translation table that just removes all combining characters. Finally, you normalize the text to the NFD form before applying the translation table.
This new function helps deal with special cases, but it doesn’t scale too well, as you still have to manually build the translation table yourself each time. Moreover, it doesn’t let you map a single code point, such as U+00E6 representing the letter æ, into more than one character, which might be necessary:
To handle most corner cases of transliteration and even address elements of transcription, you can install Unidecode or another third-party library:
Unlike your earlier attempts at transliteration, calling the unidecode()
function can make the resulting string longer than the original, so no information gets lost during the process.
Now, if you want to sort a sequence of Unicode strings according to their transliterations but without altering them, then you can use your transliteration function as the sort key:
This won’t give you an accurate result, but it should look right to most English-speaking audiences.
Another common challenge associated with sorting Unicode strings, which also requires Unicode normalization, is case-insensitive comparison. It’s quite widespread in real-life scenarios, so you’ll read about it now.
Perform Case-Insensitive Sorting of Unicode Strings
When you sort strings comprising only ASCII characters, then uppercase letters always come first because of their underlying ordinal values:
Whether this is desired depends on the language and context. Either way, string sorting in Python is case-sensitive by default. To disregard the letter case when comparing strings, you can convert all letters to lowercase or uppercase and use them as the sort key:
Now, your strings are sorted alphabetically regardless of their letter case. Unfortunately, while this quick fix works for most English words, it breaks down when encountering accented letters like the French é:
Under the surface, you still sort your strings lexicographically using character code points. So, even though you compare a lowercase letter é against lowercase h, the latter comes first because it has a smaller ordinal value.
It’s worth noting that the concept of uppercase and lowercase letters isn’t universal to all writing systems like it appears in the Latin alphabet. Besides, as you spend more time working with foreign texts, you may run into some really surprising edge cases in Unicode.
For instance, the German letter ß (sharp S) never occurs at the start of a word. Therefore, for a long time, it was only available as a lowercase letter without an uppercase variant in the alphabet:
When you try uppercasing it in Python, you get a double-S digraph instead. But, some fonts do provide a capital form of sharp S, and Unicode even assigns a dedicated code point (U+1E9E) to it:
Nevertheless, you can’t restore its original capital form using chained calls to .lower()
and .upper()
in Python. At the same time, various dialects of the German language in countries like Switzerland don’t use sharp S at all.
Another interesting quirk is the Greek letter Σ (sigma), which has two alternative lowercase forms depending on where you put it in a word:
At the end of a word, a lowercase sigma takes the final form ς, but everywhere else, it appears as the more traditional σ.
Uppercase versions of the Latin b, Greek β, and Cyrillic в have an identical visual appearance despite having distinct code points:
As you can see, there are far too many corner cases to try and cover them all by hand.
Fortunately, Python 3.3 introduced a new string method called str.casefold(), which adds support for Unicode case folding for the purpose of caseless comparison of text. According to the Unicode glossary, folding lets you temporarily ignore certain distinctions between characters. In general, case folding works similarly to lowercasing a string, but there are a few notable exceptions:
Special cases like the German sharp S are handled differently, letting you compare words like GROSS and groß as equal. Also, starting from Unicode version 8.0, Cherokee letters like U+13AD representing Ꭽ fold to uppercase letters in order to give the stability guarantees required by case folding.
That’s great! Can you now use str.casefold()
instead of str.lower()
or str.upper()
as your key function for case-insensitive sorting of Unicode strings in Python? Well, not exactly.
First of all, it won’t address the problem associated with arbitrary ordinal values of accented letters. Secondly, the strings that you wish to compare must share a common Unicode representation. Because case folding doesn’t always preserve the normal form of the original string, you should generally perform the normalization after folding your strings.
So, to implement a case-insensitive key function for sorting strings, as specified by the Unicode Standard, you must combine str.casefold()
with Unicode normalization:
Using the functools module, you define a partial function nfd()
that takes a string as an argument and applies the Unicode canonical decomposition to it. You then call this function twice to normalize the string before and after the case folding. Calling it twice acts as an extra safeguard against rare edge cases involving one particular combining character in Greek that requires special treatment to fully comply with the Unicode Standard.
Having this new key function in place, you can now sort your French animals alphabetically while ignoring the letter case:
Success! While exploring the quirks of Unicode is undoubtedly a great learning experience, you can achieve a similar effect with much less effort using the Unicode Collation Algorithm (UCA).
If you already have the PyICU library installed, then you can ignore the letter case entirely by tweaking your collator object:
By decreasing the collator’s strength to secondary, you make it disregard the subsequent levels of character weights in UCA, including the tertiary level responsible for uppercase and lowercase differences. You can also decide whether uppercase letters should come first or the other way around by configuring the CaseFirst parameter.
Comparing strings in a case-insensitive way is one of the most common requirements when sorting or searching for strings. But, there are many more special cases that you may need to consider, like the handling of punctuation, whitespace characters, or numbers. Libraries like PyICU let you customize the collation rules accordingly.
Next up, you’ll learn about the natural sort order, which your users might expect in some scenarios.
Sort Strings With Numbers in Natural Order
Arranging strings in the natural sort order is similar to alphabetical sorting, but it takes digits that may be part of the string into account. A typical example of that would involve listing a directory full of log files whose names contain a version number:
While such ordering looks natural to most humans, the lexicographic sorting in Python and other programming languages produces a less-than-intuitive result for the same input sequence of strings:
Because Python compares your strings by value, all digits end up treated as ordinary character code points instead of numbers. So, a string that contains "10"
comes before "2"
, disrupting the natural ordering of numbers. To fix this, you can try using regular expressions by isolating the individual alphanumeric chunks of the string into separate elements to compare:
Your sort key function, natural_order()
, returns a tuple whose elements are either strings or integers. This compound key allows Python to compare the corresponding tuple elements one by one from left to right using the appropriate comparison for the data type.
Note that this is a fairly straightforward implementation that can’t handle signed numbers, floating-point numbers, or any of the edge cases that you’ve learned so far. To deal with special cases like locale-aware and case-insensitive sorting, you should reach for a hassle-free library like natsort, which plays nicely with PyICU:
Use the locale
module to set the desired language and country code. When natsort detects PyICU in your virtual environment, it’ll use it and only fall back to locale
if it can’t find this third-party library. Just make sure to set the locale before importing natsort
. Otherwise, it’ll fall back to your operating system’s default locale.
So far, you’ve been sorting plain strings in Python, which is rarely the case in real-world programming projects. You’re more likely to sort complex data structures or objects by their specific attributes. In the next section, you’ll learn how to do just that.
Sort Complex Objects by More Than One Key
Ordering people by name is a classic example of sorting complex objects, which often comes up in practice. Suppose that you need to compile a list of employees in a company according to their first and last names for an annual report. You can leverage Python’s named tuple to represent each person:
Because each person is a tuple consisting of the first and last names, Python knows how to compare these objects without requiring you to implement any special methods in your class. When you pass a list of Person
instances to the sorted()
function, each object becomes a compound key. Unfortunately, sorting still relies on the lexicographic order of Unicode strings, so you don’t get the expected result.
You can fix the situation by defining your own compound key using the pyuca collator:
This time, the result comes out a bit better but still not quite correct, as Zbigniew is placed after Żaneta. Do you remember why?
It’s the same problem that you ran into earlier while using pyuca, which relies on the default collation table without letting you customize it for the specific locale. Fortunately, you can define a compound key using PyICU equally well:
Overall, this code looks very much the same but uses a slightly different API that exposes the locale customization. When you configure suitable language rules, you get the desired result.
If you’re curious, then you can experiment by swapping the attributes in your compound key to observe how that affects the sorting order:
In this case, you’ve sorted your employees primarily by last name. You then sort individuals who share the same last name by first name.
Now you know how to sort complex objects using a custom compound key, which takes locale-aware Unicode collation rules into account. With that skill under your belt, you conclude this comprehensive tutorial.
Conclusion
Congratulations on successfully navigating the intricate task of sorting Unicode strings in Python! You’re now aware of the potential hurdles that you might face in real-world projects, as well as strategies to overcome them. You’ve gained insight into the Unicode Collation Algorithm (UCA) and several Python libraries that expose it to you.
You know the difference between alphabetical and lexicographic sorting, understand what collation rules are, and can explain how the canonical decomposition in Unicode works. You’re able to customize your sorting order to make it locale-aware, reflecting specific cultural conventions. Furthermore, you can romanize non-Latin texts through transliteration and transcription for compatibility with systems that only support ASCII.
Your newfound expertise lets you safely perform case-insensitive sorting and natural sorting within the Unicode domain. Finally, you can construct compound sort keys for ordering complex objects based on a subset of attributes. Armed with this thorough knowledge, you’re well prepared to develop reliable multilingual applications catered to a global user base.