Maryam Aminian | The George Washington University
PhD Student, Computer Science Department
Supervisors: Mona Diab
Address: New York, United States
Papers by Maryam Aminian
We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa’s creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from existing monolingual and bilingual resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide-coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both theoretical and computational linguistics research.
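As a rough illustration of what a three-way entry with the listed annotations could look like, the sketch below models one record and a lookup index keyed by the dialectal form. The class name, field names, romanized forms and sample values are all assumptions made for this example; they are not Tharwa's actual schema or data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LexiconEntry:
    """One hypothetical three-way entry (field names are assumptions)."""
    dialectal: str                     # Egyptian Arabic form (romanized here)
    msa: str                           # Modern Standard Arabic correspondent
    english: str                       # English correspondent
    pos: Optional[str] = None          # part of speech
    gender: Optional[str] = None
    rationality: Optional[str] = None
    number: Optional[str] = None
    root: Optional[str] = None         # left empty when unknown
    pattern: Optional[str] = None      # left empty when unknown


def build_index(entries: List[LexiconEntry]) -> Dict[str, List[LexiconEntry]]:
    """Group entries by dialectal form; one form may map to several entries."""
    index: Dict[str, List[LexiconEntry]] = {}
    for entry in entries:
        index.setdefault(entry.dialectal, []).append(entry)
    return index


# Toy data for illustration only, not taken from the lexicon.
entries = [
    LexiconEntry("tarabeza", "tawila", "table", pos="noun",
                 gender="feminine", rationality="irrational",
                 number="singular"),
    LexiconEntry("dilwaqti", "al-aan", "now", pos="adverb"),
]
index = build_index(entries)
```

Keeping a list per dialectal form leaves room for ambiguous forms that have more than one MSA or English correspondent.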
Automatic induction of semantic verb classes is one of the most challenging tasks in computational lexical semantics, with a wide variety of applications in natural language processing. The large number of Persian speakers and the lack of such semantic classes for Persian verbs have motivated us to use unsupervised algorithms for Persian verb clustering. In this paper, we report experiments on inducing the semantic classes of Persian verbs based on Levin’s theory of verb classes. Syntactic information extracted from dependency trees is used as base features for clustering the verbs. Since there has been no manual classification of Persian verbs prior to this paper, we have prepared a manual classification of 265 verbs into 43 semantic classes. We show that the spectral clustering algorithm outperforms KMeans and improves on the baseline algorithm by about 17% in F-measure and 0.13 in Rand index.
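For readers unfamiliar with the Rand index reported above: it is the fraction of item pairs on which two clusterings agree, that is, pairs placed together in both or apart in both. The following is a minimal sketch of that definition; the function name and the toy labels are assumptions for illustration and do not reproduce the paper's evaluation code or data.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(gold: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Fraction of item pairs treated the same way by both clusterings."""
    if len(gold) != len(predicted):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(gold)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agreements = sum(
        (gold[i] == gold[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: four verbs in two gold classes.
gold_classes = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # identical grouping, different label names
crossed = [0, 1, 0, 1]          # splits both gold classes

score_same = rand_index(gold_classes, same_partition)
score_crossed = rand_index(gold_classes, crossed)
```

Because only pair membership is compared, the measure does not depend on how cluster labels are named, which is why it suits unsupervised output that has no fixed label order.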
Dialects and standard forms of a language typically share a set of cognates that could bear the same meaning in both varieties or only be shared homographs but serve as faux amis. Moreover, there are words that are used exclusively in the dialect or the standard variety. Both phenomena, faux amis and exclusive vocabulary, are considered out of vocabulary (OOV) phenomena. In this paper, we present this problem of OOV in the context of machine translation. We present a new approach for dialect to English Statistical Machine Translation (SMT) enhancement based on normalizing dialectal language into standard form to provide equivalents to address both aspects of the OOV problem posed by dialectal language use. We specifically focus on Arabic to English SMT. We use two publicly available dialect identification tools, AIDA and MADAMIRA, to identify and replace dialectal Arabic OOV words with their Modern Standard Arabic (MSA) equivalents. The results of evaluation on two blind test sets show that using AIDA to identify dialectal words and replace them with MSA equivalents enhances translation results by 0.4% absolute BLEU (1.6% relative BLEU), and using MADAMIRA achieves 0.3% absolute BLEU (1.2% relative BLEU) enhancement over the baseline. We show that our replacement scheme reaches a noticeable enhancement in SMT performance for faux amis words.
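The replacement step described above can be pictured as a preprocessing pass over the source sentence before it reaches the SMT system. The sketch below is only a schematic under stated assumptions: `is_dialectal` is a hypothetical stand-in for a dialect identification tool (it is not the AIDA or MADAMIRA API), and the romanized word pairs are toy data.

```python
from typing import Callable, Dict, List


def normalize_to_msa(
    tokens: List[str],
    is_dialectal: Callable[[str], bool],   # hypothetical identifier callback
    dialect_to_msa: Dict[str, str],        # hypothetical equivalence table
) -> List[str]:
    """Replace tokens flagged as dialectal with an MSA equivalent, if known."""
    normalized = []
    for token in tokens:
        if is_dialectal(token) and token in dialect_to_msa:
            normalized.append(dialect_to_msa[token])
        else:
            # Keep the token when it is not flagged or has no known equivalent.
            normalized.append(token)
    return normalized


# Toy data for illustration only.
toy_table = {"izzay": "kayfa", "dilwaqti": "al-aan"}
toy_dialect_words = {"izzay", "dilwaqti", "keda"}

result = normalize_to_msa(
    ["izzay", "al-hal", "dilwaqti", "keda"],
    is_dialectal=lambda tok: tok in toy_dialect_words,
    dialect_to_msa=toy_table,
)
```

Tokens that are flagged but have no entry in the table pass through unchanged, so the pass never removes material from the input sentence.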