CTK: Champollion Tool Kit
Built around LDC's champollion sentence aligner kernel, the Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible.
Champollion relies heavily on lexical information but also uses sentence length information. A translation lexicon is required. Past experiments indicate that champollion's performance improves as the translation lexicon becomes larger.
All source code is written in Perl.
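To make the idea of combining lexical and length information concrete, here is a minimal sketch in Python. It is not champollion's actual scoring function (the kernel itself is written in Perl and its details are in the paper listed under Research Papers). The function names, the `length_weight` parameter, and the toy English-French lexicon are all assumptions made up for illustration.

```python
def lexical_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a lexicon translation found in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    # A source token counts as a hit if any of its listed translations
    # appears in the target sentence.
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    return hits / len(src_tokens)


def length_ratio(src_tokens, tgt_tokens):
    """Ratio of the shorter sentence length to the longer one, in [0, 1]."""
    a, b = len(src_tokens), len(tgt_tokens)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def toy_similarity(src_tokens, tgt_tokens, lexicon, length_weight=0.2):
    """Hypothetical score: mostly lexical evidence, with a smaller length term."""
    lex = lexical_overlap(src_tokens, tgt_tokens, lexicon)
    length = length_ratio(src_tokens, tgt_tokens)
    return (1 - length_weight) * lex + length_weight * length


# Toy data (invented for this example, not taken from any CTK lexicon).
small_lexicon = {"house": {"maison"}}
larger_lexicon = {"house": {"maison"}, "red": {"rouge"}}

src = ["the", "red", "house"]
tgt_match = ["la", "maison", "rouge"]
tgt_other = ["il", "pleut"]

score_match = toy_similarity(src, tgt_match, larger_lexicon)
score_other = toy_similarity(src, tgt_other, larger_lexicon)
score_small = toy_similarity(src, tgt_match, small_lexicon)
```

In this toy setup the true translation scores higher than the unrelated sentence, and the same pair scores higher with the larger lexicon than with the smaller one, which mirrors the observation that a larger translation lexicon helps.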
Contributions are very welcome, especially the following:
- wrapper scripts for new languages
- sample (or full) translation lexicons
- hand-aligned development data
Please be aware that although champollion is designed for aligning noisy parallel text (text with deletions and insertions), it is not capable of aligning comparable text.
What's New
[2005-08-25] CTK 1.1 released!
[2004-07-01] CTK 1.0 released!
Downloads
All software is available from: http://sourceforge.net/projects/champollion/.
All software is licensed under the OSI-approved GNU General Public License. Please contact us if you would like the software under another license.
Available Language Pairs
- English-Arabic
- English-Chinese (GB)
Content of Distributions
- Source code for Champollion
- Wrapper for each language pair
- Hand-aligned evaluation corpus, containing 3,684 English-Chinese sentence pairs from three different sources. The corpus can be used for development and evaluation of new/existing sentence aligners.
Mailing Lists
- champollion-announce - sign up for CTK announcements (moderated, low volume)
- champollion-devel - send any questions and bug reports to this list (unmoderated)
Documentation
Minimal documentation is available in the README file in the distribution package. Full documentation is coming soon!
Research Papers
http://papers.ldc.upenn.edu/LREC2006/Champollion.pdf
Xiaoyi Ma
Champollion: A Robust Parallel Text Sentence Aligner
LREC 2006: Fifth International Conference on Language Resources and Evaluation, Genova, Italy, 2006
Linguistic Data Consortium
The CTK-related efforts are based at the Linguistic Data Consortium at the University of Pennsylvania. The research and development of CTK is funded by the TIDES Machine Translation Project.