Buckwalter Arabic Morphological Analyzer Version 2.0

Introduction

Buckwalter Arabic Morphological Analyzer Version 2.0 was developed by Tim Buckwalter at the Linguistic Data Consortium (LDC). It contains a Perl script for morphological analysis and part-of-speech (POS) tagging of Arabic text. The release includes lexicons with approximately 83,000 entries of Arabic prefixes, suffixes, and stems, as well as compatibility tables that the script consults when analyzing text.

The analyzer considers every possible prefix-stem-suffix segmentation of each Arabic word token and lists all known or possible annotation solutions, together with their POS labels and glosses. Users may then review the generated output and select the most appropriate annotation from among the candidates.
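The segmentation step can be illustrated with a minimal Python sketch. This is not the released Perl script: the function name `segmentations` is hypothetical, and the length limits (a prefix of at most 4 characters, a suffix of at most 6, and a non-empty stem) are assumptions made for illustration. The input is assumed to be a transliterated word.

```python
def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split of word with a non-empty stem.

    The length limits are illustrative assumptions, not values taken
    from the release documentation.
    """
    n = len(word)
    # The prefix may be empty but must leave at least one character for the stem.
    for p in range(0, min(max_prefix, n - 1) + 1):
        # The suffix may be empty but must also leave at least one stem character.
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


if __name__ == "__main__":
    for prefix, stem, suffix in segmentations("wktb"):
        print(repr(prefix), repr(stem), repr(suffix))
```

Each candidate split would then be looked up in the lexicons, and only splits whose three parts are all known would go on to produce annotation solutions.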

This tool has been used frequently for LDC releases of annotated Arabic text.

Data

The data consists primarily of the Perl script, lexicons, and compatibility tables.

There are three Arabic-English lexicon files, one each for prefixes, stems, and suffixes.

The lexicons are supplemented by three morphological compatibility tables that control which word parts may combine: prefix with stem, prefix with suffix, and stem with suffix.
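How lexicons and compatibility tables might work together can be sketched in Python as follows. Everything here is hypothetical: the entries, category names (`PREF_CONJ`, `STEM_VERB`, and so on), and glosses are toy data invented for illustration, not contents of the released files, and the function `analyze` is not the author's Perl implementation.

```python
# Toy lexicons: surface form -> list of (morphological category, gloss).
# All entries are invented for illustration only.
PREFIXES = {"": [("PREF_NONE", "")], "w": [("PREF_CONJ", "and")]}
STEMS = {"ktb": [("STEM_VERB", "write"), ("STEM_NOUN", "books")]}
SUFFIXES = {"": [("SUFF_NONE", "")], "t": [("SUFF_VERB", "I/you/she")]}

# Toy compatibility tables: sets of category pairs allowed to combine.
PREFIX_STEM = {
    ("PREF_NONE", "STEM_VERB"), ("PREF_NONE", "STEM_NOUN"),
    ("PREF_CONJ", "STEM_VERB"), ("PREF_CONJ", "STEM_NOUN"),
}
PREFIX_SUFFIX = {
    ("PREF_NONE", "SUFF_NONE"), ("PREF_NONE", "SUFF_VERB"),
    ("PREF_CONJ", "SUFF_NONE"), ("PREF_CONJ", "SUFF_VERB"),
}
STEM_SUFFIX = {
    ("STEM_VERB", "SUFF_NONE"), ("STEM_NOUN", "SUFF_NONE"),
    ("STEM_VERB", "SUFF_VERB"),
}


def analyze(word):
    """Return all solutions whose three parts are known and mutually compatible."""
    solutions = []
    n = len(word)
    for p in range(n):              # prefix length; stem keeps >= 1 character
        for s in range(n - p):      # suffix length; stem keeps >= 1 character
            prefix, stem, suffix = word[:p], word[p:n - s], word[n - s:]
            for pcat, pgloss in PREFIXES.get(prefix, []):
                for scat, sgloss in STEMS.get(stem, []):
                    for xcat, xgloss in SUFFIXES.get(suffix, []):
                        # A solution must pass all three pairwise checks.
                        if ((pcat, scat) in PREFIX_STEM
                                and (pcat, xcat) in PREFIX_SUFFIX
                                and (scat, xcat) in STEM_SUFFIX):
                            solutions.append({
                                "segments": (prefix, stem, suffix),
                                "categories": (pcat, scat, xcat),
                                "glosses": (pgloss, sgloss, xgloss),
                            })
    return solutions


if __name__ == "__main__":
    for solution in analyze("ktbt"):
        print(solution)
```

With this toy data, the noun reading of the stem is rejected when followed by the verbal suffix because that pair is absent from the stem-suffix table, while both readings survive after the conjunction prefix. This is the kind of filtering the compatibility tables are described as providing.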

The documentation consists of a readme file that describes the lexicon files, the morphological compatibility tables, and the morphological analysis algorithm. It also includes a summary of stem morphological categories and a table of the author's Arabic transliteration system.

Samples

To see an example of the analyzer's output, please examine this sample.

Updates

There are no updates available at this time.

Additional Licensing Instructions

This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.

Portions (c) 2002-2004 QAMUS LLC (www.qamus.org), (c) 2002-2004 Trustees of the University of Pennsylvania