Mohammed Attia | Columbia University

Papers by Mohammed Attia

Multi-Dialect Arabic POS Tagging: A CRF Approach

This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, 350 tweets per dialect, with appropriate train/test/development splits for 5-fold cross-validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect, examine the effect of cross-dialect and joint-dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we were able to train a joint model that can correctly tag four different dialects with an average accuracy of 89.3%.

Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks

This paper describes a language-independent model for multi-class sentiment analysis using a simple neural network architecture of five layers (Embedding, Conv1D, GlobalMaxPooling and two Fully-Connected). The advantage of the proposed model is that it does not rely on language-specific features such as ontologies, dictionaries, or morphological or syntactic pre-processing. Equally important, our system does not use pre-trained word2vec embeddings, which can be costly to obtain and train for some languages. In this research, we also demonstrate that oversampling can be an effective approach for correcting class imbalance in the data. We evaluate our methods on three publicly available datasets for English, German and Arabic, and the results show that our system’s performance is comparable to, or even better than, the state of the art for these datasets. We make our source code publicly available.
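The abstract above does not say which oversampling scheme was used. As a minimal sketch, assuming plain random oversampling (duplicating minority-class examples until every class matches the largest one), the idea could look like this; the function name `oversample` and the toy data are hypothetical:

```python
import random
from collections import Counter

def oversample(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    has as many examples as the largest class (an assumed scheme)."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    balanced = []
    for y, xs in by_label.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced

# Toy, made-up data: three positive, one negative, one neutral example.
texts = ["good", "fine", "great", "bad", "meh"]
labels = ["pos", "pos", "pos", "neg", "neu"]
balanced = oversample(texts, labels)
counts = Counter(y for _, y in balanced)
```

Oversampling of this kind should be applied to the training portion only, so that duplicated examples do not leak into evaluation data.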

Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

ArXiv, 2017

Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the problem as a ranking problem, where an SVM ranker picks the best segmentation, and as a sequence labeling problem, where a bi-LSTM RNN coupled with CRF determines where best to segment words. We are able to achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.

Effective multi-dialectal Arabic POS tagging

Natural Language Engineering, 2020

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a data set of 350 tweets per dialect.
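To make the split arithmetic concrete: a 70/10/20 split of 350 tweets gives 245 training, 35 development, and 70 test tweets per dialect. The sketch below assumes a simple shuffled split; the abstract does not say how tweets were assigned to each portion, and the helper name and tweet identifiers are hypothetical.

```python
import random

def split_70_10_20(items, seed=0):
    """Shuffle and split into train/dev/test portions of 70/10/20.
    A shuffled split is an assumption, not the authors' stated procedure."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_dev = len(items) * 10 // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Placeholder identifiers standing in for the 350 tweets of one dialect.
tweets = [f"tweet_{i}" for i in range(350)]
train, dev, test = split_70_10_20(tweets)
sizes = (len(train), len(dev), len(test))
```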

The Power of Language Music: Arabic Lemmatization through Patterns

The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. While roots provide the consonantal building blocks, patterns provide the syllabic vocalic moulds. While roots provide abstract semantic classes, patterns realize these classes in specific instances. In this way both roots and patterns are indispensable for understanding the derivational, morphological and, to some extent, the cognitive aspects of the Arabic language. In this paper we perform lemmatization (a high-level lexical processing task) without relying on a lookup dictionary. We use a hybrid approach that consists of a machine learning classifier to predict the lemma pattern for a given stem, and mapping rules to convert stems to their respective lemmas with the vocalization defined by the pattern.

A jellyfish dictionary for Arabic

In a festschrift to Martin Gellerstam (Gottlieb and Mogensen, 2007), an article was published by John Sinclair in which he introduced the concept of a jellyfish dictionary. It presented the idea of a self-updating dictionary that is able to automatically monitor language change. "It would, so to speak, float on top of a corpus, rather like a jelly-fish, its tendrils constantly sensing the state of the language." We think that an electronic jellyfish dictionary should be able to perform three major tasks. It should be able to tell which words have newly appeared in a language, which words are not in use anymore, and which word usages or senses have changed based on contemporary data. In this paper we explain our methodology for realizing a jellyfish dictionary for Arabic by automatically performing the three tasks: detecting new words, flagging obsolete words, and discovering word senses.

Diacritization of Maghrebi Arabic Sub-Dialects

The diacritization process attempts to restore the short vowels in written Arabic text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion's share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we present our contribution and results on the automatic diacritization of two sub-dialects of Maghrebi Arabic, namely Tunisian and Moroccan, using a character-level deep neural network architecture of two stacked bi-LSTM layers followed by a CRF output layer. The model achieves word error rates of 2.7% and 3.6% for Moroccan and Tunisian respectively and is capable of implicitly identifying the sub-dialect of the input.
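For readers unfamiliar with the metric, word error rate in diacritization is usually taken to be the share of words whose predicted diacritized form differs from the reference in any way. The sketch below assumes that definition and aligned word lists; the romanized words are invented placeholders, not data from the paper.

```python
def word_error_rate(reference, predicted):
    """Fraction of words whose predicted form differs from the reference.
    Assumes both lists are aligned word by word."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        return 0.0
    wrong = sum(r != p for r, p in zip(reference, predicted))
    return wrong / len(reference)

# Invented placeholder words; one of five differs in a single vowel.
ref = ["kataba", "alwaladu", "darsahu", "fi", "albayti"]
hyp = ["kataba", "alwalada", "darsahu", "fi", "albayti"]
wer = word_error_rate(ref, hyp)
```

A single wrong vowel counts the whole word as an error, which is why word-level rates are higher than character-level rates on the same output.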

An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modeling Finite State Networks

Morphological ambiguity is a major concern for syntactic parsers, POS taggers and other NLP tools. For example, the greater the number of morphological analyses given for a lexical entry, the longer a parser takes in analyzing a sentence, and the greater the number of parses it produces. Xerox Arabic Finite State Morphology and Buckwalter Arabic Morphological Analyzer are two of the best known, well documented, morphological analyzers for Modern Standard Arabic (MSA). Yet there are significant problems with both systems in design as well as coverage that increase the ambiguity rate. This paper shows how an ambiguity-controlled morphological analyzer for Arabic is built in a rule-based system that takes the stem as the base form using finite state technology. The paper also points out sources of legal and illegal ambiguities in MSA, and how ambiguity in the new system is reduced without compromising precision. At the end, an evaluation of Xerox, Buckwalter, and our system is conducte...

Explicit Fine grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic

Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon and Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language Processing (NLP) applications. Noun-noun compounds exhibit special behavior in most languages, impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive syntactic representation of the Arabic language is the LDC Arabic Treebank (ATB). In the ATB, ICs are not explicitly labeled and, furthermore, there is no distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a ...

PoS, Morphology and Dependencies Annotation Guidelines for Arabic

Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach

Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from MSA and even from one another, which accounts for the noted difficulty of mutual intelligibility between dialects. In this paper we present our research and benchmark results on the automatic diacritization of two Maghrebi sub-dialects, namely Tunisian and Moroccan, using Conditional Random Fields (CRF). Aside from using character n-grams as features, we also employ character-level Brown clusters, which are hierarchical clusters of characters based on the contexts in which they appear. We achieved word-level diacritization errors of 2.9% and 3.8% for Moroccan and Tu...

An Automatically Built Named Entity Lexicon for Arabic

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most ma...

Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies. Because of the shared task, the test data was kept hidden and not released together with the training and development data of UD 2.0. Therefore this release complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983) to a full release of UD treebanks. In addition, the present release contains 18 new parallel test sets and 4 test sets in surprise languages. The present r...

CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings

This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. On the test set, our system achieves an F-measure of 88.1% (79.0% for TRUE only) for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balan...

GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics

Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.

POS Tagging for Improving Code-Switching Identification in Arabic

Proceedings of the Fourth Arabic Natural Language Processing Workshop

When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms the state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intrasentential code-switching, the selection of lexical items is constrained by POS categories, where function words tend to come more often from the dialectal language while the majority of content words come from the standard language.

GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks

Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a deep neural network that combines word- and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among those participating in the shared task, achieving an FB1 average of 70.09%.

A Neural Architecture for Dialectal Arabic Segmentation

Proceedings of the Third Arabic Natural Language Processing Workshop

The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent tokens is an important processing step for natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or reliance on lexical features or linguistic resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that heavily depend on additional resources.
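Casting segmentation as character-level sequence labeling means assigning one tag to each character so that segment boundaries can be read back from the tags. The sketch below assumes a simple two-tag scheme (`B` for the first character of a segment, `I` otherwise) and a `+` separator in the gold segmentation; the paper's actual tag set is not given in the abstract, and the transliterated example word is illustrative only.

```python
def to_char_labels(segmented, sep="+"):
    """Turn a gold segmentation such as 'w+ktb+hA' into characters
    and per-character B/I labels (an assumed labeling scheme)."""
    chars, labels = [], []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels

def from_char_labels(chars, labels, sep="+"):
    """Rebuild the segmented string from characters and B/I labels."""
    out = []
    for i, (ch, label) in enumerate(zip(chars, labels)):
        if label == "B" and i > 0:
            out.append(sep)  # a new segment starts here
        out.append(ch)
    return "".join(out)

# Illustrative transliterated word split into conjunction, stem, and pronoun.
chars, labels = to_char_labels("w+ktb+hA")
restored = from_char_labels(chars, labels)
```

With this encoding, any sequence labeler that predicts one tag per character can be trained on the label sequences and its output decoded back into segments.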

Learning from Relatives: Unified Dialectal Arabic Segmentation

Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Arabic dialects do not just share a common koiné, but there are shared pandialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016

This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and MSA-Egyptian dialect. The system makes use of word- and character-level representations to identify code-switching. For the MSA-Egyptian dialect the system does not rely on any kind of language-specific knowledge or linguistic resources, such as part-of-speech (POS) taggers, morphological analyzers, gazetteers or word lists, to obtain state-of-the-art performance.

Research paper thumbnail of Multi-Dialect Arabic POS Tagging: A CRF Approach

This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with... more This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets for each dialect with appropriate train/test/development splits for 5-fold cross validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect and examine the effect of cross and joint dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we were able to train a joint model that can correctly tag four different dialects with an average accuracy of 89.3%.

Research paper thumbnail of Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks

This paper describes a language-independent model for multi-class sentiment analysis using a simp... more This paper describes a language-independent model for multi-class sentiment analysis using a simple neural network architecture of five layers (Embedding, Conv1D, GlobalMaxPooling and two Fully-Connected). The advantage of the proposed model is that it does not rely on language-specific features such as ontologies, dictionaries, or morphological or syntactic pre-processing. Equally important, our system does not use pre-trained word2vec embeddings which can be costly to obtain and train for some languages. In this research, we also demonstrate that oversampling can be an effective approach for correcting class imbalance in the data. We evaluate our methods on three publicly available datasets for English, German and Arabic, and the results show that our system’s performance is comparable to, or even better than, the state of the art for these datasets. We make our source-code publicly available.

Research paper thumbnail of Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

ArXiv, 2017

Arabic word segmentation is essential for a variety of NLP applications such as machine translati... more Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the problem as a ranking problem, where an SVM ranker picks the best segmentation, and as a sequence labeling problem, where a bi-LSTM RNN coupled with CRF determines where best to segment words. We are able to achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.

Research paper thumbnail of Effective multi-dialectal arabic POS tagging

Natural Language Engineering, 2020

This work introduces robust multi-dialectal part of speech tagging trained on an annotated data s... more This work introduces robust multi-dialectal part of speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a data set of 350 tweets per dialect.

Research paper thumbnail of The Power of Language Music: Arabic Lemmatization through Patterns

The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologis... more The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. While roots provide the consonantal building blocks, patterns provide the syllabic vocalic moulds. While roots provide abstract semantic classes, patterns realize these classes in specific instances. In this way both roots and patterns are indispensable for understanding the derivational, morphological and, to some extent, the cognitive aspects of the Arabic language. In this paper we perform lemmatization (a high-level lexical processing) without relying on a lookup dictionary. We use a hybrid approach that consists of a machine learning classifier to predict the lemma pattern for a given stem, and mapping rules to convert stems to their respective lemmas with the vocalization defined by the pattern.

Research paper thumbnail of A jellyfish dictionary for Arabic

In a festschrift to Martin Gellerstam (Gottlieb and Mogensen, 2007), an article was published by ... more In a festschrift to Martin Gellerstam (Gottlieb and Mogensen, 2007), an article was published by John Sinclair in which he introduced the concept of a jellyfish dictionary. It presented the idea of a self-updating dictionary that is able to automatically monitor language change. "It would, so to speak, float on top of a corpus, rather like a jelly-fish, its tendrils constantly sensing the state of the language." We think that an electronic jellyfish dictionary should be able to perform three major tasks. It should be able to tell which words have newly appeared in a language, which words are not in use anymore, and which word usages or senses have changed based on contemporary data. In this paper we explain our methodology for realizing a jellyfish dictionary for Arabic by automatically performing the three tasks: detecting new words, flagging obsolete words, and discovering word senses.

Research paper thumbnail of Diacritization of Maghrebi Arabic Sub-Dialects

Diacritization process attempt to restore the short vowels in Arabic written text; which typicall... more Diacritization process attempt to restore the short vowels in Arabic written text; which typically are omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we present our contribution and results on the automatic diacritization of two sub-dialects of Maghrebi Arabic, namely Tunisian and Moroccan, using a character-level deep neural network architecture that stacks two bi-LSTM layers over a CRF output layer. The model achieves word error rate of 2.7% and 3.6% for Moroccan and Tunisian respectively and is capable of implicitly identifying the sub-dialect of the input.

Research paper thumbnail of An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modeling Finite State Networks

Morphological ambiguity is a major concern for syntactic parsers, POS taggers and other NLP tools... more Morphological ambiguity is a major concern for syntactic parsers, POS taggers and other NLP tools. For example, the greater the number of morphological analyses given for a lexical entry, the longer a parser takes in analyzing a sentence, and the greater the number of parses it produces. Xerox Arabic Finite State Morphology and Buckwalter Arabic Morphological Analyzer are two of the best known, well documented, morphological analyzers for Modern Standard Arabic (MSA). Yet there are significant problems with both systems in design as well as coverage that increase the ambiguity rate. This paper shows how an ambiguity-controlled morphological analyzer for Arabic is built in a rule-based system that takes the stem as the base form using finite state technology. The paper also points out sources of legal and illegal ambiguities in MSA, and how ambiguity in the new system is reduced without compromising precision. At the end, an evaluation of Xerox, Buckwalter, and our system is conducte...

Research paper thumbnail of Explicit Fine grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic

Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena inc... more Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon and Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language processing (NLP) applications. Noun-Noun compounds exhibit special behavior in most languages impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive syntactic representation of the Arabic language is the LDC Arabic Treebank (ATB). In the ATB, ICs are not explicitly labeled and furthermore, there is no distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a ...

Research paper thumbnail of PoS, Morphology and Dependencies Annotation Guidelines for Arabic

Research paper thumbnail of Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach

Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted... more Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though Automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from MSA and even from one another, which accounts for the noted difficulty of mutual intelligibility between dialects. In this paper we present our research and benchmark results on the automatic diacritization of two Maghrebi sub-dialects, namely Tunisian and Moroccan, using Conditional Random Fields (CRF). Aside from using character n-grams as features, we also employ character-level Brown clusters, which are hierarchical clusters of characters based on the contexts in which they appear. We achieved word-level diacritization errors of 2.9% and 3.8% for Moroccan and Tu...

Research paper thumbnail of An Automatically Built Named Entity Lexicon for Arabic

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most ma...

Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies. Because of the shared task, the test data was kept hidden and not released together with the training and development data of UD 2.0. Therefore, this release complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983) to form a full release of UD treebanks. In addition, the present release contains 18 new parallel test sets and 4 test sets in surprise languages. The present r...

CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings

This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. On the test set, our system achieves an f-measure of 88.1% (79.0% for TRUE only) for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balan...

GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics

Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.
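One way to read the vector-space side of this hypothesis is as a comparison of cosine similarities. The sketch below is only an illustration under stated assumptions, not the submitted system: the function names, the `margin` threshold, and the two-dimensional toy vectors are all hypothetical, and real word embeddings would replace the toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_discriminative(concept1, concept2, attribute, margin=0.2):
    """True if the attribute is clearly closer to concept1 than to concept2.

    `margin` is a hypothetical threshold chosen for illustration only.
    """
    return cosine(concept1, attribute) - cosine(concept2, attribute) > margin

# Made-up 2-d vectors standing in for real embeddings.
banana = [0.9, 0.1]
apple = [0.2, 0.8]
yellow = [1.0, 0.0]

first_way = is_discriminative(banana, apple, yellow)
second_way = is_discriminative(apple, banana, yellow)
```

With these toy values the attribute separates the first concept from the second but not the reverse, which mirrors the asymmetric task definition.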

POS Tagging for Improving Code-Switching Identification in Arabic

Proceedings of the Fourth Arabic Natural Language Processing Workshop

When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms the state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intrasentential code-switching, the selection of lexical items is constrained by POS categories, where function words tend to come more often from the dialectal language while the majority of content words come from the standard language.

GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks

Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a deep neural network that combines word and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among those participating in the shared task, achieving an FB1 average of 70.09%.

A Neural Architecture for Dialectal Arabic Segmentation

Proceedings of the Third Arabic Natural Language Processing Workshop

The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent tokens is an important processing step for natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or reliance on lexical features or linguistic resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that heavily depend on additional resources.
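Treating segmentation as character-level sequence labeling can be sketched as a pair of conversions between a segmented word and per-character tags. This is a minimal illustration, not the paper's implementation: the B/M/E/S tag set, the `+` delimiter, the transliterated example word, and the function names are assumptions, and the neural tagger that would predict the labels is omitted.

```python
def segments_to_labels(segmented, delimiter="+"):
    """Turn a gold-segmented word into (characters, labels).

    Labels (assumed scheme): B = segment start, M = middle,
    E = segment end, S = single-character segment.
    """
    chars, labels = [], []
    for seg in segmented.split(delimiter):
        if not seg:
            continue  # skip empty pieces from doubled delimiters
        if len(seg) == 1:
            tags = ["S"]
        else:
            tags = ["B"] + ["M"] * (len(seg) - 2) + ["E"]
        chars.extend(seg)
        labels.extend(tags)
    return chars, labels

def labels_to_segments(chars, labels, delimiter="+"):
    """Rebuild the segmented word from characters and predicted labels."""
    out = []
    for ch, tag in zip(chars, labels):
        # A new segment begins at every B or S label except the first.
        if tag in ("B", "S") and out:
            out.append(delimiter)
        out.append(ch)
    return "".join(out)

# Illustrative transliterated word: conjunction + stem + pronoun clitic.
chars, labels = segments_to_labels("w+ktb+hA")
restored = labels_to_segments(chars, labels)
```

A tagger trained on such label sequences only has to predict one tag per character, so no lexicon or morphological analyzer is needed at inference time.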

Learning from Relatives: Unified Dialectal Arabic Segmentation

Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Arabic dialects do not just share a common koiné, but there are shared pandialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016

This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and MSA-Egyptian dialect. The system makes use of word and character level representations to identify code-switching. For the MSA-Egyptian dialect, the system does not rely on any kind of language-specific knowledge or linguistic resources, such as part-of-speech (POS) taggers, morphological analyzers, gazetteers or word lists, to obtain state-of-the-art performance.