curl -v linux.ars (Internationalization)

This week in Linux.Ars: GNOME tweaks, Sun and Wal-Mart, plus much, much more

Introduction

We're back. Did you miss us? You shouldn't have to in the future, as we strongly believe we now have the infrastructure in place to dish up fresh servings on a weekly basis.

This week, Linux.Ars looks at internationalization and localization of the Linux desktop, something at which the system shines, as well as Ghost for Unix, a portable hard drive imaging program. Additionally, everyone's favorite retailer is getting even deeper into the Linux game.

Intrusion on www.gnome.org

Several of the GNOME Project servers were compromised last week, leaving various services unavailable. All critical GNOME web sites and the main FTP archive are running again; only minor sites, such as art.gnome.org, remain unavailable. As a result, the release of GNOME 2.6 was delayed until today, even though no code was compromised. The initial discovery of the intrusion is detailed here. Updates about the intrusion can be found in this post to the gnome-hackers mailing list.

Wal-Mart sells more PCs with Linux

The world's largest retailer, Wal-Mart, has begun selling Microtel PCs bundled with Sun Microsystems' Java Desktop System, Sun's Linux distribution. There are several models available, ranging from US$298 to US$698. The US$398 Microtel SYSWM8003 comes with an AMD Athlon XP 2400+ processor, 128MB of memory, a CD-ROM drive, a 40GB hard drive and Sun's StarOffice software suite, but no monitor. The US$698 SYSWM8006 has an Intel P4 processor, 256MB of memory, an 80GB hard drive and a CD-RW/DVD-ROM combination drive. These are not the only Linux PCs that Wal-Mart sells; it also ships PCs with LindowsOS or Lycoris Desktop/LX installed. Wal-Mart seems determined to be the lowest-cost PC retailer around, and if it can convince customers that not having Windows XP is no problem, it could be the one spearheading the adoption of Linux on the desktop.

TTT: Tools, Tips and Tweaks

Internationalization and localization, or how to write badly in many languages

With software and hardware getting cheaper and easier to access, computing is becoming increasingly international in scope, and demand is growing for the ability to compute in non-English languages and non-Roman scripts. Over the past few years, releases from commercial operating system and productivity software vendors have gained support for input, display and printing that complies with national standards for scores of locales. Fortunately for us, Linux has excellent multilingual support.

Internationalization (i18n, for "I", 18 letters, "N") and localization (l10n, for "L", 10 letters, "N") are terms used to describe the typical efforts involved in getting a piece of software to speak different languages.
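The two abbreviations follow the same rule: keep the first and last letters and replace everything in between with a count. A minimal sketch of that rule follows; the function name numeronym is ours and is not part of any library.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 3:
        return word.lower()  # too short to have any inner letters to count
    return f"{word[0]}{len(word) - 2}{word[-1]}".lower()


print(numeronym("Internationalization"))
print(numeronym("Localization"))
```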

Internationalization refers to the ability of software to deal with input and output in various locales. Internationalized software provides an interface that can handle the characters of the language used in the user's locale, and it presents items such as date and time formats, digit grouping, currency units and units of measurement according to that locale's conventions.
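To make the digit-grouping point concrete, here is a small sketch that formats the same number under two sets of conventions. The CONVENTIONS table is hand-written for illustration and is an assumption of this example; real software would take these values from the system's locale data rather than hard-coding them.

```python
# Hand-written conventions, for illustration only (an assumption of this
# sketch); real programs read these from the system locale database.
CONVENTIONS = {
    "en_US": {"group": ",", "decimal": "."},
    "de_DE": {"group": ".", "decimal": ","},
}


def format_number(value: float, locale_name: str) -> str:
    """Format value with two decimals using the separators of locale_name."""
    conv = CONVENTIONS[locale_name]
    # Format with fixed ASCII separators first, then swap in the local ones.
    whole, frac = f"{value:,.2f}".split(".")
    return whole.replace(",", conv["group"]) + conv["decimal"] + frac


for name in CONVENTIONS:
    print(name, format_number(1234567.89, name))
```

The program logic is identical for both locales; only the data describing the conventions changes, which is the essence of internationalization.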

Localization is a related concept. It refers to the ability of software to provide a user interface in the language specified by the locale. Usually, this is accomplished by translating all the text that the software presents into the languages that the software supports, and depending on the locale, choosing the appropriate translation to present to the user.
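The choose-a-translation step can be sketched with a plain dictionary lookup. The catalog contents and the translate function below are hypothetical and exist only for this example; frameworks such as gettext store translations in their own catalog files and do considerably more.

```python
# Hypothetical in-memory catalogs keyed by language code. Real frameworks
# load translations from catalog files instead of a literal like this.
CATALOGS = {
    "es": {"File": "Archivo", "Quit": "Salir"},
    "de": {"File": "Datei"},
}


def translate(message: str, locale_name: str) -> str:
    """Return the translation for the locale's language, or the original text."""
    # "es_US.UTF-8" -> "es": drop the codeset, then the territory.
    language = locale_name.split(".")[0].split("_")[0]
    return CATALOGS.get(language, {}).get(message, message)


print(translate("Quit", "es_US.UTF-8"))  # translated
print(translate("Quit", "de_DE"))        # no German entry, falls back to English
```

Falling back to the untranslated string is why partially translated desktops show a mix of languages rather than failing outright.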

A locale usually encompasses the specific dialect of a language used in a region (often a country), occasionally specifying the character set used for the script, which standardizes the representation of the alphabet, numerals, diacritic marks and symbols used in text written in the language.
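Locale names such as es_US.UTF-8 or hi_IN.UTF-8 pack these pieces into one string: a language, an optional territory and an optional character set. The sketch below splits a name into those parts. It assumes the common language_TERRITORY.codeset form and simply ignores a trailing @modifier if one is present; the Locale type and parse_locale function are names invented for this example.

```python
import re
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]


# language[_TERRITORY][.codeset][@modifier]; the modifier is matched but dropped.
_PATTERN = re.compile(
    r"^([A-Za-z]+)(?:_([A-Za-z0-9]+))?(?:\.([A-Za-z0-9_-]+))?(?:@.+)?$"
)


def parse_locale(name: str) -> Locale:
    """Split a locale name into language, territory and codeset."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized locale name: {name!r}")
    return Locale(*match.groups())


print(parse_locale("es_US.UTF-8"))
print(parse_locale("en_US"))
```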

Increasingly, the character set of choice is Unicode. An encoding is a mapping from the machine's numerical representation of a character to the textual representation of that character (not necessarily the actual glyph displayed, for glyphs can result from the combination of letters, diacritics and the like). Certain Unicode-based encodings are more popular than others, such as UTF-8 (a variable-length encoding in which the first 128 code points are encoded as single bytes identical to ASCII, and therefore to the lower half of the ISO 8859-1 Latin 1 character set used for most Western European languages) and UCS-2 (a 16-bit encoding of a subset of Unicode used pervasively by Windows NT and derivatives). In Linux, the most popular Unicode encoding is UTF-8. Other encodings tend to be popular in certain locales; for instance, in the US and many Western European nations, ISO 8859-1 (Latin 1) and ISO 8859-15 (Latin 9) are popular; in Taiwan and China, the Big5 and GB2312 encodings, respectively, are widely used; and in Japan, the EUC-JP and Shift-JIS encodings are frequently used. The reasons for using non-Unicode character sets are varied; for instance, the national encodings may be richer than the Unicode representation of the script, or the use of the character set may be deeply entrenched.
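The difference between a character and its encoded bytes is easy to see by encoding the same characters several ways. This sketch uses the codecs that ship with Python's standard library; the helper name encoded_bytes is ours.

```python
from typing import Optional


def encoded_bytes(char: str, encoding: str) -> Optional[bytes]:
    """Encode one character, or return None if the encoding cannot represent it."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


# "A", "e with acute accent" (U+00E9) and the CJK ideograph U+4E2D.
for char in ("A", "\u00e9", "\u4e2d"):
    for encoding in ("utf-8", "iso-8859-1", "big5", "gb2312"):
        data = encoded_bytes(char, encoding)
        shown = data.hex() if data is not None else "not representable"
        print(f"U+{ord(char):04X} in {encoding}: {shown}")
```

ASCII letters come out as the same single byte everywhere, accented Latin letters take one byte in ISO 8859-1 but two in UTF-8, and the ideograph cannot be written in ISO 8859-1 at all.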

In Linux, there is no standardized method for developers to internationalize or localize their applications; the method used depends on the software's user interface toolkit, licensing and so on. For instance, GTK+ and GNOME applications frequently use the GNU gettext library (LGPL-licensed), a convenient framework for incorporating and maintaining translations of the application's text into various languages, and the Pango library (LGPL-licensed) to lay out text in the Unicode character set. Applications using the Qt widget toolkit can use Qt's built-in means for dealing with translations, or (in the case of applications using the KDE framework) can use gettext. Applications such as MULE for XEmacs have their own mechanisms for internationalization and localization. Conversions between encodings can be accomplished with the iconv library (LGPL-licensed).
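Conceptually, converting between encodings means decoding bytes into abstract characters and re-encoding them in the target encoding. The sketch below does this with Python's built-in codecs rather than iconv itself, so it illustrates the idea and not the iconv API; transcode is a name chosen for this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from the source encoding to the target encoding."""
    text = data.decode(source)   # bytes -> abstract characters
    return text.encode(target)   # characters -> bytes in the new encoding


# "Strasse" spelled with a sharp s, stored as ISO 8859-1 (one byte, 0xDF).
latin1 = b"Stra\xdfe"
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(latin1.hex(), "->", utf8.hex())
```

Converting in the other direction can fail when the target encoding has no representation for a character, which is one reason Unicode encodings are attractive as a common denominator.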

However, as far as end users are concerned, things are much simpler. On a system-wide scale, the locale can be set by fiddling with a number of environment variables in the configuration file of your favorite shell (e.g. /etc/bash.bashrc for bash, /etc/csh.cshrc for csh and tcsh, /etc/zshenv for zsh, /etc/profile for sh, ksh and pdksh, and so on) and in configuration files for various components, such as /etc/gdm/gdm.conf for the GNOME Display Manager (the graphical login on GNOME systems). On a per-user scale, these settings can be made in your shell's configuration file, e.g. .bashrc for bash, .cshrc for csh/tcsh, .zshrc for zsh, and so on. If you log in graphically, it might also help to set them in your .xsession (graphical login script) if you have one. There are a number of available knobs to turn:

Variously, LANG and LANGUAGE: these set the language for display and input in many applications.
LC_ALL: this sets all of the following in compliant applications. Frequently LANG is treated as its equivalent; however, if set, LC_ALL takes precedence.
LC_COLLATE: this is frequently honored for sorting text strings; e.g., in German locales, the letter "ß" (Eszett, or sharp "s") is sorted as though it were "ss".
LC_CTYPE: this indicates the locale according to which character categorization (uppercase letters, lowercase letters, printable, etc.) and conversion are performed.
LC_MESSAGES: this indicates the language in which text is presented to the user.
LC_NUMERIC: this indicates how numbers are represented; the characters used for digit grouping, decimal point, digit group sizes, etc. are decided by this variable.
LC_TIME: this governs the representation of time, for instance, whether time should be represented on a 12-hour or 24-hour clock.
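How these variables interact can be sketched as a lookup order. The text above states only that LC_ALL takes precedence; the rest of the order used here (the specific LC_* variable next, then LANG, then the default "C" locale) is the usual POSIX behavior and is an assumption of this sketch, as is the function name effective_locale. LANGUAGE is left out because it is handled separately by software that supports it.

```python
from typing import Mapping


def effective_locale(env: Mapping[str, str], category: str) -> str:
    """Return the locale that applies to one category, e.g. "LC_TIME"."""
    # Assumed order: LC_ALL overrides everything, then the category's own
    # variable, then LANG. Unset or empty variables are skipped.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # nothing set: fall back to the default "C" locale


env = {"LANG": "es_US.UTF-8", "LC_TIME": "en_US.UTF-8"}
print(effective_locale(env, "LC_TIME"))
print(effective_locale(env, "LC_NUMERIC"))
```

Passing os.environ as env would report what applies in the current session.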

Usually, just setting LANG (for many applications) and LANGUAGE (for software such as GNOME) is sufficient:

# My language is Spanish as is written in the U.S., using the Unicode
# character set, in the UTF-8 encoding.
LANG=es_US.UTF-8
LANGUAGE=es_US.UTF-8
export LANG LANGUAGE

The locales you intend to use must be generated first; to do this, you edit /etc/locale.gen and run the locale-gen utility. Here's a sample /etc/locale.gen:

en_US ISO-8859-1
en_US.UTF-8 UTF-8
es_US.UTF-8 UTF-8

Running locale-gen results in this output:

root@athena:~# /usr/sbin/locale-gen
Generating locales...
  en_US.ISO-8859-1... done
  en_US.UTF-8... done
  es_US.UTF-8... done
Generation complete.
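Judging from the sample above, each line of /etc/locale.gen pairs a locale name with a character set, and the generated locale is reported as name.charset when the name does not already carry a codeset. The sketch below reproduces that mapping. It is inferred from the sample output only, and the handling of "#" comment lines is an assumption; generated_names is a name made up for this example.

```python
from typing import List


def generated_names(locale_gen_text: str) -> List[str]:
    """List the locale names that the given locale.gen content asks for."""
    names = []
    for line in locale_gen_text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumed comment syntax
        if not line:
            continue
        locale_name, charset = line.split()
        # A bare name such as "en_US" is qualified with its character set.
        names.append(locale_name if "." in locale_name else f"{locale_name}.{charset}")
    return names


sample = """\
en_US ISO-8859-1
en_US.UTF-8 UTF-8
es_US.UTF-8 UTF-8
"""
print(generated_names(sample))
```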

To configure the system for text input in a certain language using a particular keyboard layout, you can use the XKB framework with XFree86 via the X Keyboard extension. To do this, edit your XF86Config or XF86Config-4 file, usually found in /etc/X11 or /usr/X11R6/lib/X11. Alternatively, you can use the setxkbmap tool.

There are various XKB settings that can be set:

Keyboard model: this indicates the physical keys available on your keyboard, e.g. pc104compose for a regular 104-key keyboard with support for switching languages using the right-hand "Windows logo" key.
Keyboard layout: this indicates the language-specific layout or layouts that you want to use, for instance us for the U.S. keyboard layout for the Roman script, ar for the 101-key Arabic keyboard layout, dev for the INSCRIPT keyboard layout for the Devanagari script, etc.
Options: these indicate things such as optional mappings (such as that for the Euro currency symbol), the language switching keys (in XFree86 4.3.0 or later, where up to four languages can be specified) and so on.

There are other settings; for more information, see the XFree86.org documentation on XKB. The available choices for these and other settings can be found in the file /usr/X11R6/lib/X11/xkb/xfree86.lst.

The configuration looks like this:

Section "InputDevice"
    Identifier "Keyboard1"
    Driver "Keyboard"

    # We want the US keyboard layout with an optional Arabic
    # keyboard layout. (You can specify multiple layouts -- up
    # to four -- only with XFree86 4.3.0 or later.)
    Option "XkbLayout" "us,ar"

    # 104-key PC keyboard with the right-hand Windows Logo key
    # mapped to the Compose key to combine letters and accents.
    Option "XkbModel" "pc104compose"

    # We want to use the Alt-Shift key combination
    # to switch languages. We also want to swap the left-hand
    # Ctrl and Caps Lock keys.
    Option "XkbOptions" "grp:alt_shift_toggle+ctrl:caps_ac"
EndSection

You can try out the settings in the current session using the setxkbmap utility.

setxkbmap -layout us,ar -model pc104compose -option grp:alt_shift_toggle+ctrl:caps_ac
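If you switch between several layout sets, the command line can be assembled programmatically. The sketch below only builds the argument list from the -layout, -model and -option flags shown above and enforces the four-layout limit mentioned for XFree86 4.3.0 or later; it does not run anything. The function name is ours, and the option string is passed through verbatim rather than validated.

```python
import shlex
from typing import List, Sequence


def setxkbmap_args(layouts: Sequence[str], model: str, option: str = "") -> List[str]:
    """Build a setxkbmap command line for the given layouts, model and options."""
    if not 1 <= len(layouts) <= 4:
        raise ValueError("between one and four layouts can be specified")
    args = ["setxkbmap", "-layout", ",".join(layouts), "-model", model]
    if option:
        args += ["-option", option]  # passed through as given, not checked
    return args


args = setxkbmap_args(["us", "ar"], "pc104compose", "grp:alt_shift_toggle+ctrl:caps_ac")
print(shlex.join(args))
```

The resulting list could be handed to subprocess.run inside an X session to apply the settings.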

Users of languages where it isn't easy to use a keyboard layout for text entry (especially Chinese, Japanese and Korean) can frequently use input method editors using the XIM (X Input Method) API. There is a good HOWTO on this topic.

Of course, in order to view text in a particular language, you need a font that provides glyphs for that language in the character set of your choice. One of the most easily obtainable Unicode fonts with support for several scripts is the GNU Freefont collection. Another set of fonts that carries most Latin glyphs, as well as scripts such as Arabic, Hebrew and Cyrillic, is Microsoft's core fonts for the web. There are various web sites dedicated to information about fonts available for many languages in different encodings.

Putting all of this together, it is possible to have a desktop environment in one's native language (even if that language isn't English) by making a few settings. For instance, the following screenshot shows a recent GNOME 2.5 snapshot (mostly) in the Hindi language (locale hi_IN.UTF-8, XKB layout dev):

[Screenshot: GNOME 2.5 in Hindi]

You might notice that the quality and extent of the translation varies from software to software and translator to translator. Localization is a painstaking procedure, and not all translations are alike in quality and availability.

If you'd like to participate in localization efforts for your language, several open-source software projects have internationalization and localization projects that could use your help. KDE, GNOME and OpenOffice.org all have localization projects; there is also the Free Software Translation Project.

Modifying or designing software to allow for internationalization is beyond the scope of this write-up. However, there are several good resources available.

A note: While Mozilla the web browser has excellent multilingual support, it tends to fall down a bit on displaying complex scripts such as Thai and several Indic scripts, such as Devanagari. There is a Bugzilla report filed and a patch in the works, with a patched binary of Mozilla available for download. This patched binary works well for the complex scripts, but falls down on right-to-left support, which the regular unpatched Gecko engine gets right. We hope that these issues will be resolved soon.
