Cork encoding

From Wikipedia, the free encyclopedia

Latin script character encoding used by LaTeX

The Cork (also known as T1 or EC) encoding is a character encoding used for encoding glyphs in fonts.[1] It is named after the city of Cork in Ireland, where a new encoding for LaTeX was introduced at a TeX Users Group (TUG) conference in 1990.[1] It contains 256 characters and supports most Western and Eastern European languages written in the Latin alphabet.[2]

In 8-bit TeX engines, the font encoding has to match the encoding of the hyphenation patterns, and hyphenation patterns are where the Cork encoding is most commonly used.[3] In LaTeX, one can switch to this encoding with \usepackage[T1]{fontenc}, while in ConTeXt MkII it is already the default encoding. In modern engines such as XeTeX and LuaTeX, Unicode is fully supported and the 8-bit font encodings are obsolete.

Cork encoding

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0x | ` 0060 | ´ 00B4 | ˆ 02C6 | ˜ 02DC | ¨ 00A8 | ˝ 02DD | ˚ 02DA | ˇ 02C7 | ˘ 02D8 | ¯ 00AF | ˙ 02D9 | ¸ 00B8 | ˛ 02DB | ‚ 201A | ‹ 2039 | › 203A |
| 1x | “ 201C | ” 201D | „ 201E | « 00AB | » 00BB | – 2013 | — 2014 | ZWSP [a] 200B | ₀ [b] 2080 | ı [c] 0131 | ȷ [c] 0237 | ﬀ FB00 | ﬁ FB01 | ﬂ FB02 | ﬃ FB03 | ﬄ FB04 |
| 2x | SP | ! | " | # | $ | % | & | ’ 2019 | ( | ) | * | + | , | - | . | / |
| 3x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4x | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5x | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 6x | ‘ 2018 | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7x | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | SHY [d] |
| 8x | Ă 0102 | Ą 0104 | Ć 0106 | Č 010C | Ď 010E | Ě 011A | Ę 0118 | Ğ 011E | Ĺ 0139 | Ľ 013D | Ł 0141 | Ń 0143 | Ň 0147 | Ŋ 014A | Ő 0150 | Ŕ 0154 |
| 9x | Ř 0158 | Ś 015A | Š 0160 | Ş 015E | Ť 0164 | Ţ 0162 | Ű 0170 | Ů 016E | Ÿ 0178 | Ź 0179 | Ž 017D | Ż 017B | Ĳ 0132 | İ 0130 | đ 0111 | § 00A7 |
| Ax | ă 0103 | ą 0105 | ć 0107 | č 010D | ď 010F | ě 011B | ę 0119 | ğ 011F | ĺ 013A | ľ 013E | ł 0142 | ń 0144 | ň 0148 | ŋ 014B | ő 0151 | ŕ 0155 |
| Bx | ř 0159 | ś 015B | š 0161 | ş 015F | ť 0165 | ţ 0163 | ű 0171 | ů 016F | ÿ 00FF | ź 017A | ž 017E | ż 017C | ĳ 0133 | ¡ 00A1 | ¿ 00BF | £ 00A3 |
| Cx | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
| Dx | Ð [e] | Ñ | Ò | Ó | Ô | Õ | Ö | Œ 0152 | Ø | Ù | Ú | Û | Ü | Ý | Þ | SS [f] 1E9E |
| Ex | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
| Fx | ð | ñ | ò | ó | ô | õ | ö | œ 0153 | ø | ù | ú | û | ü | ý | þ | ß 00DF |

The hexadecimal value after each character in the table is its Unicode code point.
The first 12 characters are often used as combining characters.

[a] 0x17 is called the “compound word mark” (CWM) in the Cork encoding and is an innovation of this standard. It is an invisible character that separates the parts of a compound word, for instance in German, in order to prevent aesthetic ligatures across compound boundaries.[2] It is mapped to the Unicode “zero-width space” (ZWSP, U+200B), which was defined at about the same time and serves a similar, if not identical, purpose.
[b] 0x18 is a “small o”, used to compose ‰ or ‱ (or signs for arbitrarily smaller quantities) from the percent sign (%).[2]
[c] Dotless i and dotless j may be used to compose accented variants such as i with macron (ī).
[d] 0x7F is the hyphenation character, which is not strictly a soft hyphen (SHY) as defined by Unicode.
[e] 0xD0 is used both as Eth (Ð, U+00D0) and as D with stroke (Đ, U+0110), which can cause problems in some situations, such as copying text from a PDF or hyphenation.
[f] 0xDF contains SS (two letters S). It allows TeX to automatically convert the German lowercase ß into its uppercase form.
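
As a rough illustration of how the table and its notes translate into a lookup, the following Python sketch builds a Cork-to-Unicode mapping from the hexadecimal values listed in the table. The names `ROWS`, `OVERRIDES`, `build_table`, `decode_cork` and `encode_cork` are hypothetical and not part of any TeX distribution. Mapping 0x7F to U+00AD is an assumption, since the table gives no code point for that slot, and accepting Đ as an alias for slot 0xD0 follows the note on 0xD0.

```python
# Rows that differ entirely from ASCII/Latin-1; hex values copied from the table.
ROWS = {
    0x00: "0060 00B4 02C6 02DC 00A8 02DD 02DA 02C7 02D8 00AF 02D9 00B8 02DB 201A 2039 203A",
    0x10: "201C 201D 201E 00AB 00BB 2013 2014 200B 2080 0131 0237 FB00 FB01 FB02 FB03 FB04",
    0x80: "0102 0104 0106 010C 010E 011A 0118 011E 0139 013D 0141 0143 0147 014A 0150 0154",
    0x90: "0158 015A 0160 015E 0164 0162 0170 016E 0178 0179 017D 017B 0132 0130 0111 00A7",
    0xA0: "0103 0105 0107 010D 010F 011B 0119 011F 013A 013E 0142 0144 0148 014B 0151 0155",
    0xB0: "0159 015B 0161 015F 0165 0163 0171 016F 00FF 017A 017E 017C 0133 00A1 00BF 00A3",
}

# Individual slots that deviate inside otherwise ASCII/Latin-1 rows.
OVERRIDES = {
    0x27: 0x2019, 0x60: 0x2018,  # curly quotes replace ' and `
    0xD7: 0x0152, 0xF7: 0x0153,  # OE ligatures replace the multiplication and division signs
    0xDF: 0x1E9E, 0xFF: 0x00DF,  # SS slot and lowercase sharp s
    0x7F: 0x00AD,                # ASSUMPTION: hyphenation character treated as soft hyphen
}


def build_table():
    """Return a dict mapping each of the 256 Cork slots to a Unicode character."""
    # Slots 0x20-0x7E and 0xC0-0xFF start out identical to ASCII/Latin-1.
    table = {code: chr(code) for code in (*range(0x20, 0x7F), *range(0xC0, 0x100))}
    for base, row in ROWS.items():
        for offset, hexcode in enumerate(row.split()):
            table[base + offset] = chr(int(hexcode, 16))
    table.update({slot: chr(cp) for slot, cp in OVERRIDES.items()})
    return table


CORK_TO_UNICODE = build_table()
UNICODE_TO_CORK = {char: slot for slot, char in CORK_TO_UNICODE.items()}
UNICODE_TO_CORK["\u0110"] = 0xD0  # D with stroke shares the Eth slot


def decode_cork(data):
    """Decode Cork-encoded bytes into a Unicode string."""
    return "".join(CORK_TO_UNICODE[byte] for byte in data)


def encode_cork(text):
    """Encode text as Cork bytes; raises KeyError for unsupported characters."""
    return bytes(UNICODE_TO_CORK[char] for char in text)
```

With this sketch, `decode_cork(bytes([0x50, 0xB0, 0xED]))` gives "Pří". Encoding "Đ" produces the byte 0xD0, but decoding that byte returns "Ð", which is the ambiguity described in the note on 0xD0.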

Supported languages

The encoding supports most European languages written in the Latin alphabet. Notable exceptions are:

Esperanto and Maltese (using IL3)
Latvian and Lithuanian (using L7X)
Welsh

Languages with slightly suboptimal support include:

Galician, Portuguese and Spanish – due to the lack of the characters ª and º, which are not simply superscript versions of lowercase "a" and "o" (superscripts are thinner) and are often underlined
Croatian, Bosnian and Serbian – due to the slot for Đ being shared with Ð
Turkish – due to dotted and dotless i having uppercase and lowercase pairings that differ from those in other languages
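
The Turkish casing issue can be illustrated outside TeX. The sketch below uses Python's built-in Unicode string methods, not TeX's case tables, and the helper names `turkish_upper` and `turkish_lower` are hypothetical. It shows that the default, language-independent case mappings pair i with I, whereas Turkish pairs i with İ and ı with I.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_upper(text):
    """Uppercase text using the Turkish pairs i/İ and ı/I."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkish_lower(text):
    """Lowercase text using the Turkish pairs İ/i and I/ı."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


# The default mappings send both i and dotless i to the same capital I,
# so the distinction is lost without language-specific handling.
default_upper_of_i = "i".upper()
default_upper_of_dotless_i = DOTLESS_SMALL_I.upper()
```

An 8-bit encoding with a single fixed case table per font faces the same choice: one uppercase partner per slot, which cannot satisfy Turkish and the other supported languages at once.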

[1] Petrlik, Lukas (1996-06-19). "The Czech and Slovak Character Encoding Mess Explained". cs-encodings-faq. 1.10. Archived from the original on 2016-06-21. Retrieved 2016-06-21.
[2] Ferguson, Michael (1990). "Report on Multilingual Activities" (PDF). TUGboat. 11 (4): 514–516.
[3] TeX hyphenation patterns.