Is there a standard for Cyrillic (similar to ASCII/ANSI)?

Andras Kornai posted this (or its predecessor version 1.2) to the
RUSTEX-L list. I've his permission to post it here.

Note, also, that comp.sources.misc recently (this week?) had a posting
of "translit", a seemingly sophisticated, configurable transliteration
program by Jan Labanowski. Its distributed tables seem to cover most
if not all Cyrillic variations (e.g., GOST to KOI8; KOI8 to Library of
Congress, etc.), _plus_ KOI-8 to LaTeX (using wncyr10).

To get the translit package, I quote from the readme file, omitting
some details which precede or follow:

>Via FTP (if you are on Internet):
>---------------------------------
> ftp kekule.osc.edu (or ftp 128.146.36.48)
> Login: anonymous
> Password: Your_email_address (Please...)
> ftp> ascii (or binary if you retrieve binary files)
> ftp> cd pub/russian/translit
> ftp> get file_name
> ..... (for each file)
> ftp> quit

>Via E-mail:
>-----------
> Send message:
> send translit/file_name from russian
> to OSCP...@osc.edu or OSCP...@OHSTPY.BITNET. You can retrieve more files
> with a single message by placing several lines of the above format.
> The file will be forwarded to your mailbox automatically.

As well as being available as a collection of individual files, it is
available as a package in any of these files, to suit your taste (again
from its readme.doc file):

>translit.tar.Z --- Compressed tar file with the whole distribution.
>translit.tar.z.uu --- uuencoded file from the above. It can be transmitted
> via e-mail, but it is a large file, and if your mailer
> sets limits on your messages, it may not be correctly
>translit.zip --- This is a "zipped" file (i.e., compressed with a ZIP
>translit.zip.uu --- Uuencoded file from above. Can be sent via e-mail but
> it is big.

Andras Kornai's Cyrillic encoding FAQ follows. (I've not checked
between the FAQ and the translit program to see if they agree 100%.)

-KH

CYRILLIC ENCODING FAQ Version 1.3, March 13 1993

ACKNOWLEDGEMENTS Most of the information was provided by the following:

David J. Birnbaum djbpi...@pitt.edu
Frank da Cruz f...@watsun.cc.columbia.edu
Bur Davis bda...@adobe.com
George Fowler gfow...@ucs.indiana.edu
Richard B. Paine RPA...@CCNODE.Colorado.EDU
Slava Paperno P...@CORNELLA.cit.cornell.edu
Keld J. Simonsen Keld.Simon...@dkuug.dk
Glenn E. Thobe th...@getunx.info.com
Dimitri Vulis D...@CUNYVMS1.BITNET
Johan W. van Wingen pre...@rulmvs.leidenuniv.nl

Thanks to all who contributed -- I am responsible for the errors that
still remain.

Andras Kornai (and...@calera.com, kor...@csli.stanford.edu)

Q: What are the commonly used computer encodings for Cyrillic?
A: Broadly speaking, there are three kinds of schemes in use: those that
replace Cyrillic characters by 7-bit ascii values, those that use the
full 8-bit range 0-255, and those using multi-byte codes. Presently
only the first two types are in wide use, but for reference purposes I
will also discuss the third type.

Q: What kind of transliteration schemes are there?
A: The most important one is called KOI-7: the Russian alphabet is given
by the ASCII characters (note the exchange of upper and lower cases):

UPPER CASE: abwgde$vzijklmnoprstufhc~{}"yx|`q
lower case: ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\@Q

The following extensions to the official standard KOI-7 are supported in
Glenn Thobe's conversion programs for invertibility: '"'=YER, '#'=yo,
'$'=YO, '<'=left guillemet, '>'=right guillemet.
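To make the two rows above easier to use, here is a minimal sketch that
looks up the position of a KOI-7 character in the Russian alphabet. The
function name koi7_letter_index and the 0-based numbering (yo in 7th
place, 33 letters in all) are my own choices for illustration, not part
of any standard or of Glenn Thobe's programs.

```c
#include <stddef.h>
#include <string.h>

/* The two KOI-7 rows from the listing above, in Russian alphabetical
   order.  '$', '#' and '"' are the extensions mentioned in the text,
   not part of official KOI-7. */
static const char koi7_upper[] = "abwgde$vzijklmnoprstufhc~{}\"yx|`q";
static const char koi7_lower[] = "ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\\@Q";

/* Hypothetical helper: return the 0-based alphabet position of a KOI-7
   character, or -1 if it is not a letter.  *is_upper is set to 1 for an
   upper-case Cyrillic letter and to 0 for a lower-case one; it is left
   untouched when -1 is returned. */
int koi7_letter_index(int ch, int *is_upper)
{
    const char *p;

    if (ch <= 0 || ch > 127)
        return -1;              /* also keeps strchr away from the NUL */
    if ((p = strchr(koi7_upper, ch)) != NULL) {
        *is_upper = 1;
        return (int)(p - koi7_upper);
    }
    if ((p = strchr(koi7_lower, ch)) != NULL) {
        *is_upper = 0;
        return (int)(p - koi7_lower);
    }
    return -1;
}
```

For example, koi7_letter_index('a', &up) gives 0 with up set to 1,
because ASCII lower-case 'a' stands for upper-case Cyrillic a.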

A slightly different (multicharacter) scheme is employed by Steve
Gaardner's (gaar...@theory.tc.cornell.edu) conversion code from Old
KOI-8, included below. This particular scheme provides easy
readability but suffers from some transliteration weirdness, such as
mapping short ii and yeri on the same character. Since proper
transliteration often requires context-sensitive rules, and differs
from language to language within the same script, a fuller discussion
is beyond the scope of the present document. For an overview of the
major Cyrillic to Latin transliteration schemes used in the US, see pp
457-460 of the Style Manual of the US Government Printing Office, for
sale by the Superintendent of Documents, USGPO, Washington DC 20402,
Stock Number 021-000-00120-1 (paper) or 021-000-00120-0 (hardbound).
See also the Chicago Manual of Style, and Transliteracija russkikh
slov latinskimi bukvami, GOST 16876-71.

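#include <stdio.h>

/* Latin strings for Old KOI-8 codes 0xc0-0xff, indexed by code - 0xc0:
   32 lower-case letters followed by 32 upper-case letters. */
char transtbl[64][5] =
{"yu", "a", "b", "ts", "d" , "e", "f", "g", "kh", "i", "y" , "k", "l",
"m", "n", "o", "p", "ya", "r" , "s", "t", "u", "zh", "v", "'",
"y", "z", "sh", "e", "shch", "ch", "`",
"YU", "A", "B", "TS", "D" , "E", "F", "G", "KH", "I", "Y" , "K", "L",
"M", "N", "O", "P", "YA", "R" , "S", "T", "U", "ZH", "V", "'",
"Y", "Z", "SH", "E", "SHCH", "CH", "`" };

int main(void)
{
    int c;

    while ((c = getchar()) != EOF)
    {   /* Fold 8-bit codes onto the 7-bit range.  The test is >= rather
           than >, so that 0x80 cannot index past the end of the table. */
        if ( c >= 0x80) c -= 0x80;
        if ( c < 0x40) putchar(c);
        /* 0x40-0x7f: table lookup (so KOI-7 input is handled as well) */
        else printf("%s",transtbl[c-0x40]);
    }
    return 0;
}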

Q: What are the eight-bit schemes?

A: For the IBM mainframe world, which includes the ES (edinaja sistema)
clones of 360-370 mainframes, the basic scheme, called DKOI-8, extends
EBCDIC by putting the Cyrillic letters in the unused slots, mostly in
the rectangle 0x8a to 0xff (first hex digit >=8, second digit >=a). The
mysteries of EBCDIC/ASCII conversion go beyond the scope of this
document, and in the table that follows I will ignore 8-bit ascii values
below 0xa0 and refer the reader to Dimitri Vulis' excellent document,
which sheds some light on the IBM meaning of the characters 0x80-0x9f,
which are reserved in both ISO 8859-1 (Latin-1) and 8859-5 (Cyrillic).

/* From 8859-5 to DKOI-8. ebcdic(isoval) = isotoibm[isoval-160] */

int isotoibm[96] = {
0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
};
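The lookup described in the comment at the head of the table can be
sketched as follows. The table is copied from the listing above; the
wrapper names iso_to_dkoi and dkoi_to_iso, and the convention of
returning -1 for codes outside the table, are assumptions made for this
example only.

```c
/* Table copied from the listing above: index = 8859-5 code - 160. */
static const int isotoibm[96] = {
    0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
    0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
    0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
    0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
    0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
    0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
    0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
    0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
    0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
    0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
    0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
    0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
};

/* Hypothetical wrapper: DKOI-8 code for an 8859-5 code in 0xa0-0xff,
   or -1 for anything the table does not cover. */
int iso_to_dkoi(int isoval)
{
    if (isoval < 0xa0 || isoval > 0xff)
        return -1;
    return isotoibm[isoval - 0xa0];
}

/* Hypothetical inverse: the first 8859-5 code whose table entry equals
   the given DKOI-8 code, or -1 if the table has no such entry. */
int dkoi_to_iso(int ebcdic)
{
    int i;

    for (i = 0; i < 96; i++)
        if (isotoibm[i] == ebcdic)
            return i + 0xa0;
    return -1;
}
```

A linear search is enough for the inverse here; a program converting
large files would more likely build a second 256-entry table once.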

There are minor variations to DKOI, called Cyrillic Extended Code Page
037 (most common on BITNET), CECP 500 (which is the definitive one), the
"JNET" and the "FORTRAN" mappings. The differences between these are
tabulated below. Notice that EBCDIC/DKOI, unlike ASCII, is not uniquely
defined even on the 0-127 range:

8859-5  037   500   JNET  FORTRAN

0x21    0x5a  0x4f  0x5a  0x4f     exclamation point (bang)
0x5b    0xba  0x4a  0xad  0x4a     opening square bracket
0x5d    0xbb  0x5a  0xbd  0x5a     closing square bracket
0x5e    0xb0  0x5f  0x5f  0x5f     circumflex accent
0x7c    0x4f  0xbb  0x6a  0x4f     logical or (vertical bar)
[a2]    0x4a  0xb0  0x43  0x43     centsign (in 037)/capital dje (in 500)
[ac]    0x5f  0xba  0x54  0x54     logical not (in 037)/capital kje (in 500)
0xd5    0xef  0xef  0xbb  0xad     small ie
0xe3    0x46  0x46  0x4a  0xbb     small u
0xe5    0x47  0x47  0xfc  0xbd     small kha
0xfc    0xdc  0xdc  0x6a  0xfc     small kje
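One way to use this table in a program is to keep a single base mapping
and consult the table only for the positions where the variants differ.
The sketch below does just that. The enum, struct and function names are
hypothetical, and I have read the bracketed rows [a2] and [ac] as the
codes 0xa2 and 0xac, which is an interpretation on my part.

```c
#include <stddef.h>

enum ebcdic_variant { CP037, CP500, JNET, FORTRAN };

struct variant_row {
    int iso;        /* code in the first column of the table */
    int ebcdic[4];  /* indexed by enum ebcdic_variant */
};

/* Rows copied from the table above; [a2] and [ac] are entered as
   0xa2 and 0xac (an assumption, see the text). */
static const struct variant_row variant_rows[] = {
    { 0x21, { 0x5a, 0x4f, 0x5a, 0x4f } },
    { 0x5b, { 0xba, 0x4a, 0xad, 0x4a } },
    { 0x5d, { 0xbb, 0x5a, 0xbd, 0x5a } },
    { 0x5e, { 0xb0, 0x5f, 0x5f, 0x5f } },
    { 0x7c, { 0x4f, 0xbb, 0x6a, 0x4f } },
    { 0xa2, { 0x4a, 0xb0, 0x43, 0x43 } },
    { 0xac, { 0x5f, 0xba, 0x54, 0x54 } },
    { 0xd5, { 0xef, 0xef, 0xbb, 0xad } },
    { 0xe3, { 0x46, 0x46, 0x4a, 0xbb } },
    { 0xe5, { 0x47, 0x47, 0xfc, 0xbd } },
    { 0xfc, { 0xdc, 0xdc, 0x6a, 0xfc } },
};

/* Hypothetical helper: return the variant-specific code if the table
   lists one for this position, else the caller's fallback (typically
   the value from a base EBCDIC/DKOI mapping). */
int variant_code(int iso, enum ebcdic_variant v, int fallback)
{
    size_t i;

    for (i = 0; i < sizeof variant_rows / sizeof variant_rows[0]; i++)
        if (variant_rows[i].iso == iso)
            return variant_rows[i].ebcdic[v];
    return fallback;
}
```

For instance, variant_code(0x21, CP500, -1) yields 0x4f, while a
position not listed in the table simply returns the fallback value.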

For the Internet, the most important code seems to be Old KOI-8, widely used
in the Relcom groups (but probably not a whole lot elsewhere). Old KOI-8
(GOST 19768-74) from 1974 more or less follows Latin transliteration order
and does not include upper-case hard sign, or letters common to other Slavic
Cyrillic alphabets (Bulgarian, Macedonian, Serbian, Ukrainian...). In the
0-127 range it is identical with ascii, and for the 192-254 region see the
transtbl array above. Some software, including uunpack (also used in
Sergej Ryzhkov's bml, aka Beauty Mail system for PCs) which is distributed
by Relcom, force upper-case hard sign to 255, others (and the standard!)
declare this incorrect, or perhaps reserve 255 for DEL. In an earlier
version of Andrew Hume's (and...@research.att.com) tcs, which supports
conversion across a wide variety of Cyrillic encodings, this was called the
"mystery DOS Cyrillic encoding", except that his sha and shcha seem to be
interchanged. Tcs is available by anon ftp from research.att.com as
/dist/tcs.shar.Z. The semantics of 128-191 in Old KOI is unclear
to me. If there is an official code page (it was suggested that Xenix users
might have one), please post it.
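The transliteration program given earlier folds Old KOI-8 letters onto
the 7-bit range by subtracting 0x80, and for the letters in 192-254 the
result lines up with the KOI-7 rows listed above (0xc1 becomes 'A',
lower-case a; 0xe1 becomes 'a', upper-case A). A minimal sketch of that
relation follows. The function name koi8_to_koi7 is made up for this
example, and leaving 255 untouched is my own choice, given the
disagreement about that position.

```c
/* Hypothetical helper: map an Old KOI-8 letter (192-254) to the KOI-7
   character that has the same low seven bits.  Everything else,
   including the disputed position 255 and the unclear 128-191 range,
   is returned unchanged. */
int koi8_to_koi7(int c)
{
    if (c >= 192 && c <= 254)
        return c - 128;
    return c;
}
```

Note that the output is only readable on a terminal that displays KOI-7;
on a plain ascii terminal it comes out as the case-swapped Latin letters
of the KOI-7 rows.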

For the PC community, Code Page 866 seems to be quite important. This is
what Microsoft is using in its russified version of MS-DOS. In 0-31
ascii control chars are replaced by a random selection of dingbats. In
32-126 it is identical to ascii, and in 127 it has something that looks
like a little house (the interpretation of such positions seems to be
subject to much uncertainty). The Russian part (128-255) is identical to
Brjabrin's alternativnyj variant, except for 242-251, where some of the
accents/symbols of AV are replaced by non-Russian Cyrillic characters
and other symbols. Unfortunately CP 866 covers only Ukrainian and
Belorussian, with the vague suggestion that e.g. Macedonian users could
redefine the six non-Russian Cyrillic positions. This problem is
largely resolved in Code Page 1251, the Microsoft Cyrillic Windows 3.1
character set, (also endorsed by WordPerfect and Adobe), which contains
all Cyrillic letters used by modern Slavic languages. CP 1251 is fully
compatible with ascii on 0-127 (leaves control positions undefined), has
the Russian alphabet (in order, but without io) in 192-255, and puts the
non-Russian Cyrillic, Russian io, and a few symbols in 128-191.
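Because both code pages keep the Russian letters in alphabetical order,
converting Russian text from CP 866 to CP 1251 comes down to a few
offsets. The sketch below is an illustration under stated assumptions:
the function name is hypothetical, and the exact ranges (CP 866 letters
in 128-175 and 224-239 with io at 240/241; CP 1251 upper case in 192-223,
lower case in 224-255, io at 0xa8/0xb8) are taken from the code page
charts as I know them rather than from the description above. Graphics
characters and the non-Russian positions are not handled.

```c
/* Hypothetical helper: convert one CP 866 byte to CP 1251.
   Returns -1 for positions this sketch does not cover (graphics
   characters, non-Russian Cyrillic, other symbols). */
int cp866_to_cp1251(int c)
{
    if (c >= 0 && c < 0x80)
        return c;               /* ascii range is passed through */
    if (c >= 0x80 && c <= 0xaf)
        return c + 0x40;        /* upper-case a..ya, lower-case a..pe */
    if (c >= 0xe0 && c <= 0xef)
        return c + 0x10;        /* lower-case er..ya */
    if (c == 0xf0)
        return 0xa8;            /* capital io */
    if (c == 0xf1)
        return 0xb8;            /* small io */
    return -1;
}
```

A real converter would also have to decide what to do with the graphics
characters in 176-223, which have no counterpart in CP 1251.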

Brjabrin's Alternativnyj Variant (AV) is also widely used on PCs. It
has Russian in 128 to 175 in alphabetical order except for yo, graphics
characters in 176 to 223, again Russian in 224-241. The same set of
graphics characters, but not in the same order, is used in Brjabrin's
Osnovnoj Variant: they are similar ...
