wikokit

Introduction

This page describes the Java files that must be added to the parser in order to parse one more Wiktionary language edition (WLE).

To parse a new WLE you have to know its formatting rules, because each WLE has its own community and its own rules. See the interwiki links on the enwikt formatting rules page.

1. Adding new files and classes

Example of existing parser module (Russian Wiktionary)

The parser already contains files and classes that extract information from the Russian Wiktionary.

The Russian language code "ru" (ISO 639) is used as the suffix Ru in the names of Java files and as the suffix .ru in the names of Java packages related to the Russian language.

There is a special package wikokit.base.wikt.multi which contains one subpackage for each WLE: en, ru.

Russian Wiktionary module files in the package wikokit.base.wikt.multi.ru.name:

LanguageTypeRu.java: names of languages in Russian and links to the LanguageType codes;
POSRu.java: names of parts of speech in Russian and the links to the POS objects;
RelationRu.java: names of semantic relations in Russian and the links to the Relation objects;
LabelRu.java: names of context labels in Russian;
LabelCategoryRu.java: names of categories of context labels in Russian.

Russian Wiktionary module files in the package wikokit.base.wikt.multi.ru:

each file in this package corresponds to a file in the package wikokit.base.wikt.word (see the next section), which in turn corresponds to one level (header, subsection) of a Wiktionary entry;
an additional file, POSTemplateRu.java, holds the correspondences between part of speech (POS) templates in ruwikt and POS names.

New parser module (copy and paste)

Suppose we want to extend the parser with files and classes for parsing, e.g., the French Wiktionary (language code "fr"). The following "copy and paste" method can simplify the work:

copy the folder (package) wikokit.base.wikt.multi.ru to wikokit.base.wikt.multi.fr;
copy wikokit.base.wikt.multi.ru.name to wikokit.base.wikt.multi.fr.name;
rename the files in these folders (e.g. LanguageTypeRu.java to LanguageTypeFr.java, etc.);
change the code of these files in accordance with the formatting rules of your (e.g. French) Wiktionary: the most fun and the hardest part of the work :)
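As an illustration of the last step, here is a minimal sketch of what a name table in the spirit of POSRu.java could look like after being adapted for French. Everything in it is an assumption made for the example: the class POSFrSketch, its nested Pos enum (a stand-in for the project's real POS objects) and the three French names. The real list has to be taken from the formatting rules of the French Wiktionary.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the actual wikokit class.
final class POSFrSketch {

    // Stand-in for the project's POS objects (assumption for this example).
    enum Pos { NOUN, VERB, ADJECTIVE, UNKNOWN }

    // French part-of-speech name -> POS value.
    private static final Map<String, Pos> NAME_TO_POS = new HashMap<>();

    static {
        // Illustrative entries only; the real names come from the frwikt rules.
        NAME_TO_POS.put("nom", Pos.NOUN);
        NAME_TO_POS.put("verbe", Pos.VERB);
        NAME_TO_POS.put("adjectif", Pos.ADJECTIVE);
    }

    private POSFrSketch() {}

    // Looks up a French POS name, ignoring case and surrounding whitespace.
    static Pos get(String frenchName) {
        if (frenchName == null) {
            return Pos.UNKNOWN;
        }
        String key = frenchName.trim().toLowerCase(Locale.ROOT);
        return NAME_TO_POS.getOrDefault(key, Pos.UNKNOWN);
    }
}
```

Returning an explicit UNKNOWN value instead of null is a choice made for this sketch; the existing Ru classes may handle unknown names differently.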

2. Adding new code to existing files and classes

Each file in the package wikokit.base.wikt.word contains a call to the WLE parser module. E.g. WLanguage.java contains the code:

LanguageType wikt_lang; // language of Wiktionary

if(l == LanguageType.ru) {
  lang_sections = WLanguageRu.splitToLanguageSections(page_title, text);
} else if(l == LanguageType.en) {
  lang_sections = WLanguageEn.splitToLanguageSections(page_title, text);
} else {
  throw new NullPointerException("Null LanguageType");
}

Recall that we want to parse the French Wiktionary. This code should then be extended:

if(l == LanguageType.ru) {
  lang_sections = WLanguageRu.splitToLanguageSections(page_title, text);

// these two lines were added
} else if(l == LanguageType.fr) {
  lang_sections = WLanguageFr.splitToLanguageSections(page_title, text);

} else if(l == LanguageType.en) {
  lang_sections = WLanguageEn.splitToLanguageSections(page_title, text);
} else {
  throw new NullPointerException("Null LanguageType");
}

This now calls the class WLanguageFr, defined in WLanguageFr.java, which should be located in the package wikokit.base.wikt.multi.fr.
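To give an idea of what the French counterpart has to do, here is a rough sketch of splitting an entry into language sections. It is not the project's implementation. The class name WLanguageFrSketch, the Map return type and the assumed header form "== {{langue|fr}} ==" are all assumptions for illustration; the real method should mirror the return type of WLanguageRu.splitToLanguageSections and follow the current French Wiktionary formatting rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real WLanguageFr must match the project's own types.
final class WLanguageFrSketch {

    // Assumed second-level header form: == {{langue|fr}} ==
    private static final Pattern LANGUAGE_HEADER = Pattern.compile(
            "^==\\s*\\{\\{langue\\|([^}|]+)\\}\\}\\s*==\\s*$", Pattern.MULTILINE);

    private WLanguageFrSketch() {}

    // Returns language code -> text of that language section, in page order.
    // pageTitle is unused here; it is kept only to mirror the call shown above.
    static Map<String, String> splitToLanguageSections(String pageTitle, String text) {
        Map<String, String> sections = new LinkedHashMap<>();
        if (text == null || text.isEmpty()) {
            return sections;
        }
        Matcher m = LANGUAGE_HEADER.matcher(text);
        String code = null;   // language code of the section being collected
        int bodyStart = 0;    // offset where that section's body starts
        while (m.find()) {
            if (code != null) {
                // The previous section ends where the next header begins.
                sections.put(code, text.substring(bodyStart, m.start()).trim());
            }
            code = m.group(1).trim();
            bodyStart = m.end();
        }
        if (code != null) {
            // The last section runs to the end of the page.
            sections.put(code, text.substring(bodyStart).trim());
        }
        return sections;
    }
}
```

Text that precedes the first language header is dropped in this sketch, and a language code that appears twice on a page keeps only its last section.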

Comments

Don't forget about unit tests: they are the best documentation of our code. Every nontrivial class and function in this project has unit tests, e.g. the class WLanguageRu has unit tests in the file WLanguageRuTest.java.
The Wiktionary language edition to parse is defined as an input parameter in the file Main.java of the wikt_parser project:

LanguageType wikt_lang; // language of Wiktionary

Code should also be added to this file (Main.java) in order to parse the French Wiktionary.
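How Main.java actually reads its parameter is not shown here, so the following is only a hedged sketch of the kind of check that would need to learn about "fr". The class MainSketch, the method resolveWiktionaryLanguage and the list of supported codes are hypothetical; the real code works with LanguageType objects rather than plain strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the actual Main.java of wikt_parser.
final class MainSketch {

    // Codes of the editions the parser has modules for; "fr" is the new one.
    private static final List<String> SUPPORTED = Arrays.asList("en", "ru", "fr");

    private MainSketch() {}

    // Normalizes the language code given on the command line and
    // rejects editions that have no parser module yet.
    static String resolveWiktionaryLanguage(String arg) {
        if (arg == null) {
            throw new IllegalArgumentException("Missing Wiktionary language code");
        }
        String code = arg.trim().toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(code)) {
            throw new IllegalArgumentException(
                    "Unsupported Wiktionary language edition: " + arg);
        }
        return code;
    }
}
```

Failing with an IllegalArgumentException that names the rejected code is a choice made for this sketch, so that an unsupported edition is reported clearly at startup.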