Rania Al-sabbagh - Academia.edu

Papers by Rania Al-sabbagh

Interactive Annotation for Event Modality in Modern Standard and Egyptian Arabic Tweets

Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop, 2014

We present an interactive procedure to annotate a large-scale corpus of Modern Standard and Egyptian Arabic tweets for event modality that comprises obligation, permission, commitment, ability, and volition. The procedure splits up the annotation process into a series of simplified questions, dispenses with the requirement of expert linguistic knowledge, and captures nested modality triggers and their attributes semi-automatically.

Unsupervised Construction of a Lexicon and a Repository of Variation Patterns for Arabic Modal Multiword Expressions

Proceedings of the 10th Workshop on Multiword Expressions (MWE), 2014

We present an unsupervised approach to build a lexicon of Arabic Modal Multiword Expressions (AM-MWEs) and a repository of their variation patterns. These novel resources are likely to boost the automatic identification and extraction of AM-MWEs.

Mining the Web for the Induction of a Dialectical Arabic Lexicon

This paper describes the first phase of building a lexicon of Egyptian Cairene Arabic (ECA), one of the most widely understood dialects in the Arab World, and Modern Standard Arabic (MSA). Each ECA entry is mapped to its MSA synonym, Part-of-Speech (POS) tag and top-ranked contexts based on Web queries; thus each entry is provided with basic syntactic and semantic information for a generic lexicon compatible with multiple NLP applications. Moreover, through their MSA synonyms, ECA entries gain access to the NLP tools and resources built for MSA, which are widely available. Using an associationist approach based on the correlations between word co-occurrence patterns in both dialects, we change the direction of the acquisition process from parallel to circular to overcome a bottleneck of current research on Arabic dialects, namely the lack of parallel corpora, and to improve accuracy rates when using unrelated Web documents, which are more frequently available. Manually eva...

A Supervised POS Tagger for Written Arabic Social Networking Corpora

This paper presents an implementation of Brill's Transformation-Based Part-of-Speech (POS) tagging algorithm trained on a manually annotated Twitter-based Egyptian Arabic corpus of 423,691 tokens and 70,163 types. Unlike standard morphosyntactic POS annotation schemes, which label each word based on its word-level morphosyntactic features, we use a function-based annotation scheme in which words are labeled based on their grammatical functions rather than their morphosyntactic structures, given that the two do not necessarily map onto each other. While a standard morphosyntactic scheme makes comparisons with other work easier, the function-based scheme is assumed to be more efficient for building higher-level tools such as base-phrase chunkers and dependency parsers, and for NLP applications like subjectivity and sentiment analysis. The function-based scheme also gives new insights into linguistic structural realizations specific to Egyptian Arabic, which is currently an under-resourced language.

YADAC: Yet Another Dialectal Arabic Corpus

This paper presents the first phase of building YADAC – a multi-genre Dialectal Arabic (DA) corpus – that is compiled using Web data from microblogs (i.e., Twitter), blogs/forums and online knowledge market services in which both questions and answers are user-generated. In addition to introducing two new genres to the current efforts of building DA corpora (i.e., microblogs and question-answer pairs extracted from online knowledge market services), the paper highlights and tackles several new issues related to building DA corpora that have not been handled in previous studies: function-based Web harvesting and dialect identification, vowel-based spelling variation, linguistic hypercorrection and its effect on spelling variation, and unsupervised Part-of-Speech (POS) tagging and base-phrase chunking for DA. Although the algorithms for both POS tagging and base-phrase chunking are still under development, the results are promising.

A Unified Framework to Identify and Extract Uncertainty Cues, Holders, and Scopes in One Fell-Swoop

Lecture Notes in Computer Science, 2015

We present a unified framework based on supervised sequence labeling methods to identify and extract uncertainty cues, holders, and scopes in one fell swoop, with an application to Arabic tweets. The underlying technology employs Support Vector Machines with a rich set of morphological, syntactic, lexical, semantic, pragmatic, dialectal, and genre-specific features, and yields an average F1 score of 0.759.

Using the Semantic-Syntactic Interface for Reliable Arabic Modality

We introduce a novel modality scheme where triggers are words and phrases that convey modality meanings and subcategorize for clauses and verbal phrases. This semantic-syntactic working definition of modality enables us to design practical and replicable annotation guidelines and procedures that alleviate some shortcomings of current purely semantic modality annotation schemes and yield high inter-annotator agreement rates. We use this scheme to annotate a tweet-based Arabic corpus for modality information. This novel language resource, being the first of its kind, initiates NLP research on Arabic modality.

3arif: A Corpus of Modern Standard and Egyptian Arabic Tweets Annotated for Epistemic Modality Using Interactive Crowdsourcing

We present 3arif, a large-scale corpus of Modern Standard and Egyptian Arabic tweets annotated for epistemic modality. To create 3arif, we design an interactive crowdsourcing annotation procedure that splits up the annotation process into a series of simplified questions, dispenses with the requirement for expert linguistic knowledge and captures nested modality triggers and their attributes semi-automatically. This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/

3arif is pronounced ʕa:rif in Arabic IPA and written EArif in Buckwalter's transliteration scheme. It means "I/he know(s)". 3arif is available at http://www.rania-alsabbagh.com/3arif.html

Example 17:
البرادعي عارف ومتاكد ان [نسبة 12% بس ھتنتخبه] وعلشان كدة مش ھيرشح نفسه
AlbrAdEy EArf wmtAkd An [nsbp 12% bs htntxbh] wEl$An kdp m$ hyr$H nfsh
Elbaradei knows and is sure that [only 12% will vote for him]. So, he will not run for presidency.
rep. USER, MOD PRS TRUE, (AlbrAdEy, STRG PRS TRUE, (nsbp 12% bs htntxbh))

Using Web Mining Techniques to Build a Multi-Dialect Lexicon of Arabic

Variations on the same theme

Studies in Arabic Linguistics, 2014

Arabic Anaphora Resolution: A Distributional, Monolingual and Bilingual Approach

This paper presents an algorithm for Anaphora Resolution (AR) in Arabic. The paper is motivated by the poor performance of current Arabic-English Machine Translation (MT) systems in terms of AR and the fact that AR is an understudied issue in Arabic Natural Language Processing (ANLP). The algorithm suggested follows a distributional, monolingual and bilingual bootstrapping approach to acquire AR-related features that cannot be provided by monolingual resources, using a second language (here English). ...

A Bilingual Approach for Arabic Paraphrases Acquisition: Preliminary Experiments

Arabic paraphrase acquisition; a research which is motivated by the importance of paraphrasing for overcoming sparseness of data and its importance for many NLP applications such as Question Answering (QA) and Information Retrieval (IR). The proposed approach develops an unsupervised bilingual algorithm to acquire Arabic paraphrases at the phrase level, which is rather more challenging than elementary word-level paraphrasing and is less efficiently handled by current Arabic paraphrasing systems. ...

A Supervised POS Tagger for Written Arabic Social Networking Corpora

netfiles.uiuc.edu, 2012

This paper presents an implementation of Brill's Transformation-Based Part-of-Speech (POS) tagging algorithm trained on a manually-annotated Twitter-based Egyptian Arabic corpus of 423,691 tokens and 70,163 types. Unlike standard POS morphosyntactic ...

Mining the web for the induction of a dialectical arabic lexicon

LREC. European Language Resources Association, May 1, 2010

This paper describes the first phase of building a lexicon of Egyptian Cairene Arabic (ECA) – one of the most widely understood dialects in the Arab World – and Modern Standard Arabic (MSA). Each ECA entry is mapped to its MSA synonym, Part-of-Speech (POS) tag and top-ranked contexts based on Web queries; and thus each entry is provided with basic syntactic and semantic information for a generic lexicon compatible with multiple NLP applications. Moreover, through their MSA synonyms, ECA entries acquire access to ...

YADAC: Yet another Dialectal Arabic Corpus

… of the 8th International Conference on …, 2012

This paper presents the first phase of building YADAC – a multi-genre Dialectal Arabic (DA) corpus – that is compiled using Web data from microblogs (i.e., Twitter), blogs/forums and online knowledge market services in which both questions and answers are user-generated. In addition to introducing two new genres to the current efforts of building DA corpora (i.e., microblogs and question-answer pairs extracted from online knowledge market services), the paper highlights and tackles several new issues related ...

A Web-Based Approach for Arabic PP Attachment

Proceedings of the 6th International Conference on Informatics and Systems, 2008

This paper presents a web-based algorithm for Arabic PP attachment. The paper is motivated by the importance of PP attachment for NLP applications and tasks and by the poor performance of current Arabic parsers. The algorithm uses web frequencies to measure the collocational association between the PP and candidate binders; the binder with the highest association is selected as the correct one. The algorithm achieves a performance rate of ≈82%, which is higher than the baseline performance (≈79%).
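The selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which association measure the paper uses, so pointwise mutual information (PMI) is swapped in here as an assumption; the function names, the count dictionaries, and the English example counts are all hypothetical stand-ins for real web hit counts.

```python
import math


def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information from raw counts.

    Returns -inf when the pair (or either item) was never observed.
    """
    if joint == 0 or count_a == 0 or count_b == 0:
        return float("-inf")
    return math.log2((joint * total) / (count_a * count_b))


def attach_pp(pp, candidates, freq, cooc, total):
    """Pick the candidate binder with the highest association to the PP.

    freq  -- hypothetical frequency of each binder and of the PP
    cooc  -- hypothetical co-occurrence counts keyed by (binder, pp)
    total -- assumed size of the collection the counts come from
    """
    scores = {
        c: pmi(cooc.get((c, pp), 0), freq.get(c, 0), freq.get(pp, 0), total)
        for c in candidates
    }
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


if __name__ == "__main__":
    # Invented counts for "ate pizza with a fork": verb vs. noun attachment.
    freq = {"ate": 1000, "pizza": 500, "with a fork": 200}
    cooc = {("ate", "with a fork"): 50, ("pizza", "with a fork"): 5}
    print(attach_pp("with a fork", ["ate", "pizza"], freq, cooc, 1_000_000))
```

With real data the counts would come from web queries, and any other association measure could replace `pmi` without changing the argmax step.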

Arabic anaphora resolution using the Web as corpus

Proceedings of the 7th Conference on Language Engineering (CLE’07), Dec 5, 2007

This paper presents a dynamic algorithm for Anaphora Resolution (AR) in Arabic unrestricted texts. The poor performance of current Arabic/English Machine Translation (MT) systems in terms of AR and the fact that AR is an understudied issue in Arabic Natural Language Processing (ANLP) are the main motivations for this paper. The algorithm suggested follows a statistical approach to AR and makes use of the web as corpus to overcome the inherent problem of statistical ...

Cue-based bootstrapping of Arabic semantic features

JADT, 2008

Motivated by the fact that semantic features are understudied in Arabic Natural Language Processing (ANLP) in spite of being essential for some Natural Language Processing (NLP) tasks such as Anaphora Resolution (AR), Word Sense Disambiguation (WSD) and Prepositional Phrase (PP) attachment, this paper presents a cue-based algorithm to build an Arabic lexicon that tackles such semantic features. The lexicon, whose entries are extracted from the World Wide Web (WWW) using bilingual and monolingual ...

The Location of Sentential Negation in Arabic Varieties

Brill's Annual of Afroasiatic Languages and Linguistics, 2013
