Sentence-Dictionary Linking - EDRDG Wiki

To enable dictionary systems, apps, etc. to use the Japanese-English sentences from the Tanaka Corpus/Tatoeba as examples, a set of word-level indices has been compiled and associated with each sentence (at present about 150,000 sentences have indices). These indices are maintained within the Tatoeba system (there is a special GUI for this) and are periodically downloaded for use with dictionary systems. The indices are particularly associated with the JMdict/EDICT2 dictionary files, but may also be used elsewhere.

Index Format

The indices for a sentence consist of a line of text with space-delimited index elements for each word in the sentence. The following is an example:

Sentence: その家はかなりぼろ屋になっている。

Indices: 其の[01]{その} 家(いえ)[01] は 可也{かなり} ぼろ屋[01]~ になる[01]{になっている}

The format of the index elements is as follows:

Some indices are followed by a "|" character and a digit. These are an artefact from a former maintenance system, and can be safely ignored.

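The fields that follow the indexing headword, namely (), [], {} and ~, are each optional, as the example above shows, but those that are present must appear in that order.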
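As a rough illustration of the element syntax, the following Python sketch splits an index line on whitespace and pulls each element apart with a regular expression. It is not the code used by Tatoeba or WWWJDIC. The names `IndexElement`, `parse_element` and `parse_indices` are made up for this example, and the fields are named after their delimiters (`paren`, `bracket`, `brace`, `tilde`) rather than after their meaning. The sketch assumes that a headword never contains any of the delimiter characters, and it drops a legacy "|digit" wherever it occurs in an element.

```python
import re
from typing import List, NamedTuple, Optional

# Headword, then the optional (), [], {} and ~ fields in that fixed order.
# Assumption: a headword never contains a delimiter character or whitespace.
ELEMENT_RE = re.compile(
    r"^(?P<headword>[^()\[\]{}~|\s]+)"
    r"(?:\((?P<paren>[^)]*)\))?"
    r"(?:\[(?P<bracket>[^\]]*)\])?"
    r"(?:\{(?P<brace>[^}]*)\})?"
    r"(?P<tilde>~)?$"
)

# Legacy "|digit" suffix left over from a former maintenance system.
LEGACY_RE = re.compile(r"\|\d")


class IndexElement(NamedTuple):
    headword: str
    paren: Optional[str]    # text inside (), or None if absent
    bracket: Optional[str]  # text inside [], or None if absent
    brace: Optional[str]    # text inside {}, or None if absent
    tilde: bool             # True if the element ends with ~


def parse_element(text: str) -> IndexElement:
    """Parse one index element, ignoring any legacy '|digit'."""
    cleaned = LEGACY_RE.sub("", text)
    m = ELEMENT_RE.match(cleaned)
    if m is None:
        raise ValueError(f"unrecognised index element: {text!r}")
    return IndexElement(
        m["headword"], m["paren"], m["bracket"], m["brace"],
        m["tilde"] is not None,
    )


def parse_indices(line: str) -> List[IndexElement]:
    """Parse a whole line of whitespace-delimited index elements."""
    return [parse_element(token) for token in line.split()]
```

Applied to the example indices above, `parse_indices` yields six elements; the fifth has headword `ぼろ屋`, bracket `01` and `tilde` set. An element whose fields are out of order fails the match and raises `ValueError`, which may be a useful check when validating a downloaded file.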

File Format

A file of the Japanese-English sentence pairs with the indices can be downloaded from the Tatoeba site. This file, which is generated once each week, is in UTF-8 encoding, and has the following format:

Jpn_seq_no[TAB]Eng_seq_no[TAB]Japanese sentence[TAB]English sentence[TAB]Indices
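A line in this layout can be split on tab characters. The sketch below is one possible reader, not an official one. `SentencePair` and `parse_pair_line` are hypothetical names, and the sketch assumes that both sequence numbers are decimal integers and that no field contains an embedded tab.

```python
from typing import NamedTuple


class SentencePair(NamedTuple):
    jpn_seq: int
    eng_seq: int
    japanese: str
    english: str
    indices: str  # raw, space-delimited index elements


def parse_pair_line(line: str) -> SentencePair:
    """Split one tab-separated line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    jpn_seq, eng_seq, japanese, english, indices = fields
    # Assumption: sequence numbers are plain decimal integers.
    return SentencePair(int(jpn_seq), int(eng_seq), japanese, english, indices)
```

Because the file is UTF-8, it would normally be opened with `open(path, encoding="utf-8")` and each line passed to `parse_pair_line`.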

Another version, which is used by the WWWJDIC servers, has the sentences and indices on separate lines. The format is:

A: Japanese sentence[TAB]English sentence#ID=Engseq_Jpnseq

B: Indices

This file can be downloaded in either EUC-JP or UTF-8 encoding.
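The two-line layout can be read by pairing each "A:" line with the "B:" line that follows it. The following is a minimal sketch under several assumptions: each prefix is followed by a single space, every "A:" line is immediately followed by its "B:" line, and the two sequence numbers after "#ID=" are decimal integers in the order given above. `Example` and `iter_examples` are names invented for this illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Trailing "#ID=Engseq_Jpnseq" marker on the A: line.
ID_RE = re.compile(r"#ID=(\d+)_(\d+)$")


class Example(NamedTuple):
    japanese: str
    english: str
    eng_seq: int
    jpn_seq: int
    indices: str  # raw, space-delimited index elements


def iter_examples(lines: Iterable[str]) -> Iterator[Example]:
    """Yield one Example per A:/B: line pair."""
    pending: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("A: "):
            pending = line[3:]
        elif line.startswith("B: ") and pending is not None:
            japanese, _, rest = pending.partition("\t")
            m = ID_RE.search(rest)
            if m is None:
                raise ValueError(f"missing #ID= marker: {pending!r}")
            english = rest[:m.start()]
            yield Example(japanese, english, int(m[1]), int(m[2]), line[3:])
            pending = None
```

The caller chooses the encoding when opening the file, for example `open(path, encoding="euc_jp")` for the EUC-JP version or `encoding="utf-8"` for the UTF-8 one, and passes the file object to `iter_examples`.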