Mohamed Abdel-Hady | Microsoft Research

Papers by Mohamed Abdel-Hady

Cross-lingual Twitter Polarity Detection via Projection across Word-Aligned Corpora

In this paper, we propose an unsupervised framework that leverages the sentiment resources and tools available in English to automatically generate stand-alone polarity lexicons and classifiers for languages with scarce subjectivity resources, thus avoiding the need for labor-intensive manual annotation. Starting with a list of English sentiment-bearing words, we expand this lexicon using WordNet synsets. For each sentence pair in a given bilingual parallel corpus, the high-precision English polarity lexicon is applied to the English side, and the resulting sentiment label is projected onto the target-language side via statistically derived word alignments. The resulting lexicon is then applied to a large pool of unlabeled tweets in the target language to automatically label tweets as training data for a polarity classifier. Our experiments using Spanish and Portuguese as target languages show that the resulting classifiers improve polarity classification performance over lexicon-based classification for under-resourced languages in social media.
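
To make the projection step concrete, here is a minimal sketch of how the sentence-level label projection and lexicon induction could look. The toy lexicon, the alignment format (lists of (en_idx, tgt_idx) pairs), and the frequency/precision thresholds are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the label-projection step, assuming sentence pairs are
# pre-tokenized and word alignments are given as (en_idx, tgt_idx) pairs.
from collections import defaultdict

EN_LEXICON = {"great": +1, "love": +1, "terrible": -1, "awful": -1}  # toy lexicon

def sentence_polarity(en_tokens, lexicon=EN_LEXICON):
    """Label the English sentence with the sign of its summed lexicon scores."""
    score = sum(lexicon.get(tok.lower(), 0) for tok in en_tokens)
    return 0 if score == 0 else (1 if score > 0 else -1)

def project_labels(parallel_corpus):
    """Accumulate projected sentence labels onto aligned target-language words."""
    counts = defaultdict(lambda: [0, 0])  # word -> [positive_count, negative_count]
    for en_tokens, tgt_tokens, alignments in parallel_corpus:
        label = sentence_polarity(en_tokens)
        if label == 0:
            continue  # skip neutral sentences to keep the projected lexicon high-precision
        for en_idx, tgt_idx in alignments:
            if EN_LEXICON.get(en_tokens[en_idx].lower(), 0) != 0:
                counts[tgt_tokens[tgt_idx].lower()][0 if label > 0 else 1] += 1
    # Keep target words whose projected labels are strongly one-sided
    # (the thresholds below are arbitrary, for illustration only).
    lexicon = {}
    for word, (pos, neg) in counts.items():
        total = pos + neg
        if total >= 3 and max(pos, neg) / total >= 0.8:
            lexicon[word] = +1 if pos > neg else -1
    return lexicon
```

The induced target-language lexicon would then be applied to unlabeled tweets in the same way `sentence_polarity` is applied to English sentences, producing the automatically labeled training data for the polarity classifier.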

Unsupervised Active Learning of CRF model for Cross-Lingual Named Entity Recognition

Manual annotation of the training data of information extraction models is a time-consuming and expensive process, but it is necessary for building information extraction systems. Active learning has proven effective in reducing manual annotation effort for supervised learning tasks, where a human judge is asked to annotate the examples that are most informative with respect to a given model. However, reliable human judges are not available for all languages. In this paper, we propose a cross-lingual unsupervised active learning paradigm (XLADA) that generates high-quality automatically annotated training data from a word-aligned parallel corpus. To evaluate the paradigm, we applied XLADA to English-French and English-Chinese bilingual corpora and trained French and Chinese information extraction models. The experimental results show that XLADA can produce effective models without manually annotated training data.
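
The hedged sketch below illustrates the two ingredients described above: projecting source-side NER tags through word alignments, and selecting the automatically annotated target sentences that the current target model is least confident about. `source_tagger`, `target_marginals`, and the selection size `k` are placeholder assumptions, not the paper's actual interfaces.

```python
# Hedged sketch of cross-lingual annotation projection plus confidence-based selection.

def project_entity_tags(en_tokens, tgt_tokens, alignments, source_tagger):
    """Tag the English side, then copy each token's tag to its aligned target token."""
    en_tags = source_tagger(en_tokens)              # e.g. ["B-PER", "I-PER", "O", ...]
    tgt_tags = ["O"] * len(tgt_tokens)
    for en_idx, tgt_idx in alignments:
        tgt_tags[tgt_idx] = en_tags[en_idx]
    return tgt_tags

def select_informative(auto_labeled, target_marginals, k=1000):
    """Keep the auto-labeled sentences the target CRF is least confident about."""
    scored = []
    for tgt_tokens, tgt_tags in auto_labeled:
        marginals = target_marginals(tgt_tokens)    # per-token {tag: probability} dicts
        confidence = sum(max(m.values()) for m in marginals) / len(marginals)
        scored.append((confidence, tgt_tokens, tgt_tags))
    scored.sort(key=lambda item: item[0])           # least confident first
    return [(tokens, tags) for _, tokens, tags in scored[:k]]
```

The selected sentences, which already carry projected labels, would be added to the training set and the target-language CRF retrained, repeating until the model stops improving.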

Class-Dependent Canonical Correlation Analysis for Scalable Cross-Lingual Document Categorization

Canonical Correlation Analysis (CCA) is used to infer a semantic space into which text documents written in different languages can be mapped to a language-independent representation, called latent topics. This greatly reduces the complexity of dealing with different languages, since we can train a document classifier on labeled documents in one language and then apply it to classify documents in another language. This topic modeling task is usually performed in a class-independent manner. The performance of CCA depends on the number of documents used to infer the semantic space; however, CCA has high computational complexity with respect to the number of training documents. In this paper, we propose CD-CCA, a scalable variant of CCA in which the projection is performed in a class-dependent manner: a separate semantic space is generated for each category, and a binary document classifier is trained for each category in its own semantic space. CD-CCA was applied to English-Chinese document classification. The experimental results show that CD-CCA can handle large training sets without hurting the performance of the underlying classifiers compared to traditional CCA. CD-CCA also opens the door to distributed training of the semantic spaces of the different categories.
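
As a rough illustration of the class-dependent setup, the sketch below fits one CCA per category (here with scikit-learn's CCA) and trains a one-vs-rest classifier in that category's latent space. The data layout, document vectors, and hyperparameters are assumptions made for illustration, not the paper's configuration.

```python
# Hedged sketch: one semantic space (CCA) and one binary classifier per category.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

def train_cd_cca(paired_src, paired_tgt, labeled_src, labels, categories, n_components=50):
    """Return {category: (fitted CCA, binary classifier in that category's latent space)}.

    paired_src[c] / paired_tgt[c]: aligned source/target document vectors for category c.
    labeled_src, labels: labeled source-language document vectors and their categories.
    """
    models = {}
    for c in categories:
        # Each CCA is fitted only on the documents of one category, which keeps the
        # per-space training cost small and lets the categories be trained in parallel.
        cca = CCA(n_components=n_components)
        cca.fit(paired_src[c], paired_tgt[c])
        z_src = cca.transform(labeled_src)                # project labeled source docs
        y_bin = (np.asarray(labels) == c).astype(int)     # one-vs-rest target
        models[c] = (cca, LinearSVC().fit(z_src, y_bin))
    return models

def classify_target(doc_vec_tgt, models):
    """Score a target-language document in every category's latent space."""
    scores = {}
    for c, (cca, clf) in models.items():
        # Project through the target-side loadings of this category's space
        # (the centering/scaling applied at fit time is omitted here for brevity).
        z = doc_vec_tgt @ cca.y_rotations_
        scores[c] = float(clf.decision_function(z.reshape(1, -1)))
    return max(scores, key=scores.get)
```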

Domain adaptation for cross-lingual query classification using search query logs

A query intent classifier is used by a search engine to determine whether an online search query carries a certain type of intent, such as adult or commercial intent. Training such a classifier for each language is a supervised machine learning task that requires a large amount of labeled training queries. Manual annotation of training queries for each newly emerging language by human judges is expensive, error-prone, and time-consuming. In this paper, we leverage existing query classifiers in a source language and the abundant unlabeled queries in the query log of the underserved target language to reduce the cost and automate the training data annotation process. The most-clicked search results of a query, rather than human judges, are used to predict the query's intent. A document classifier is trained on hidden topics extracted by latent semantic indexing from the translations of source-language documents into the target language. The experimental results, using English as the source language and Arabic as the target, show that the proposed unsupervised method trains support vector machines as Arabic query classifiers that detect both commercial and health intent without the need for human-judged Arabic queries. The unsupervised classifiers outperform classifiers based on direct query translation, and the decision fusion of both classifiers is superior.
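
A hedged sketch of this pipeline is shown below, under assumed data formats: `translated_docs`/`doc_labels` are the source documents machine-translated into the target language together with their intent labels, and `query_log` maps each unlabeled target-language query to the text of its most-clicked results. The scikit-learn components (TF-IDF plus truncated SVD as the LSI step, linear SVMs as classifiers) are illustrative choices, not necessarily the paper's exact configuration.

```python
# Hedged sketch: LSI-based document intent classifier, click-based query labeling,
# and a target-language query intent classifier trained on the auto-labeled queries.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_query_classifier(translated_docs, doc_labels, query_log, n_topics=200):
    # 1) Latent semantic indexing over the translated documents, then an SVM
    #    document-level intent classifier in the topic space.
    doc_clf = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=n_topics), LinearSVC())
    doc_clf.fit(translated_docs, doc_labels)

    # 2) Label each query with the majority intent predicted for its most-clicked
    #    results, replacing the human judges.
    queries, query_labels = [], []
    for query, clicked_docs in query_log.items():
        preds = list(doc_clf.predict(clicked_docs))
        queries.append(query)
        query_labels.append(max(set(preds), key=preds.count))

    # 3) Train the target-language query intent classifier on the auto-labeled queries.
    query_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    query_clf.fit(queries, query_labels)
    return query_clf
```

The same construction would support decision fusion: a second query classifier trained on directly translated queries could be combined with this one, which is the comparison reported in the abstract.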
