Buckwalter Arabic Morphological Analyzer Version 2.0

Introduction

Buckwalter Arabic Morphological Analyzer Version 2.0 was developed by Tim Buckwalter at the Linguistic Data Consortium (LDC). It contains a Perl script for morphological analysis and part-of-speech (POS) tagging of Arabic text. The release also includes lexicons totaling approximately 83,000 entries of Arabic prefixes, suffixes, and stems, as well as compatibility tables that the script consults when analyzing text.

For each Arabic word token, the analyzer considers every possible prefix-stem-suffix segmentation and lists all candidate annotation solutions it finds, together with their POS labels and glosses. Users may then review the generated output and select the most appropriate annotation from among the candidates.
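
The segmentation step described above can be sketched in a few lines of Python. This is only an illustration and is not the Perl script shipped with the release: the function name `segmentations` and the length limits `max_prefix` and `max_suffix` are assumptions introduced here, not values stated on this page. The sketch enumerates every way to split a transliterated token into a prefix, a non-empty stem, and a suffix.

```python
def segmentations(word, max_prefix=4, max_suffix=6):
    """Return every (prefix, stem, suffix) split of `word` with a non-empty stem.

    The default length limits are illustrative assumptions, not documented values.
    """
    results = []
    # The prefix may be empty, but at least one character must be left for the stem.
    for p in range(0, min(max_prefix, len(word) - 1) + 1):
        remaining = len(word) - p
        # The suffix may be empty; again, keep at least one character for the stem.
        for s in range(0, min(max_suffix, remaining - 1) + 1):
            stem_end = len(word) - s
            results.append((word[:p], word[p:stem_end], word[stem_end:]))
    return results


if __name__ == "__main__":
    for prefix, stem, suffix in segmentations("wktb"):
        print(repr(prefix), repr(stem), repr(suffix))
```

For a four-character token such as `wktb` this produces 10 candidate splits. Most candidates would be discarded in a later step, once each part is looked up in the lexicons.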

This tool has been used frequently for LDC releases of annotated Arabic text.

Data

The data consists primarily of the Perl script, lexicons, and compatibility tables.

The release contains three Arabic-English lexicon files:

Prefixes (299 entries)
Suffixes (618 entries)
Stems (82,158 entries representing 38,600 lemmas)

The lexicons are supplemented by three morphological compatibility tables used for controlling possible word part combinations:

Prefix-stem (1,648 entries)
Stem-suffix (1,285 entries)
Prefix-suffix (598 entries)
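
As a rough sketch of how three such tables can constrain word part combinations, the Python example below accepts a segmentation only if all three pairings (prefix-stem, stem-suffix, prefix-suffix) are listed as compatible. Everything in it is hypothetical: the toy lexicon entries, the category labels (`P_CONJ`, `S_VERB`, and so on), and the function name `analyze` are invented for illustration and do not reproduce the contents or format of the released files.

```python
# Toy lexicons: surface form -> list of (category, gloss). All entries are invented.
PREFIXES = {"": [("P_NONE", "")], "w": [("P_CONJ", "and")]}
STEMS = {"ktb": [("S_VERB", "write")], "ktAb": [("S_NOUN", "book")]}
SUFFIXES = {"": [("X_NONE", "")], "t": [("X_VERB_SUBJ", "verb subject marker")]}

# Toy compatibility tables: sets of allowed category pairs (also invented).
PREFIX_STEM = {("P_NONE", "S_VERB"), ("P_CONJ", "S_VERB"),
               ("P_NONE", "S_NOUN"), ("P_CONJ", "S_NOUN")}
STEM_SUFFIX = {("S_VERB", "X_NONE"), ("S_VERB", "X_VERB_SUBJ"),
               ("S_NOUN", "X_NONE")}
PREFIX_SUFFIX = {("P_NONE", "X_NONE"), ("P_NONE", "X_VERB_SUBJ"),
                 ("P_CONJ", "X_NONE"), ("P_CONJ", "X_VERB_SUBJ")}


def analyze(word):
    """Return all (prefix, stem, suffix, categories) solutions allowed by the tables."""
    solutions = []
    for i in range(len(word)):                 # i = end of prefix (may be 0)
        for j in range(i + 1, len(word) + 1):  # j = end of stem (stem is non-empty)
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            for pcat, _ in PREFIXES.get(prefix, []):
                for scat, _ in STEMS.get(stem, []):
                    for xcat, _ in SUFFIXES.get(suffix, []):
                        # A solution must pass all three compatibility checks.
                        if ((pcat, scat) in PREFIX_STEM
                                and (scat, xcat) in STEM_SUFFIX
                                and (pcat, xcat) in PREFIX_SUFFIX):
                            solutions.append((prefix, stem, suffix, (pcat, scat, xcat)))
    return solutions


if __name__ == "__main__":
    print(analyze("wktbt"))  # accepted under the toy tables
    print(analyze("ktAbt"))  # rejected: noun stem with a verb suffix
```

With this toy data, `wktbt` yields a single solution, while `ktAbt` yields none because the pair (`S_NOUN`, `X_VERB_SUBJ`) is absent from the stem-suffix table. Storing each table as a set of category pairs keeps every check a constant-time membership test.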

The documentation consists of a readme file that describes the lexicon files, the morphological compatibility tables, and the morphological analysis algorithm. It also provides a summary of stem morphological categories and a table of the author's Arabic transliteration system.

Samples

To see an example of the analyzer's output, please examine this sample.

Updates

There are no updates available at this time.

Additional Licensing Instructions