Karen McNeil | Georgetown University
Papers by Karen McNeil
Social and technological changes over the past several decades have led to widespread writing of "spoken" Arabic dialects. However, there is little quantitative research on this phenomenon, and most existing research is limited to Egypt and Morocco. In addition, little is known about the characteristics of these newly written vernaculars, even though encoding an unwritten language in writing is not merely a technical assignment of sound to letter. Rather, it is a complex process that must balance practical considerations with ideological stances, such as autonomy from the standard language (Mühleisen 2005). The spread of vernacular into writing and the accompanying tension over its form constitute the process of vernacularization. This dissertation documents and analyzes this vernacularization as it is occurring in Tunisia, examining how Tunisians writing in dɛrja collectively position themselves in relation to Standard Arabic, French, and other Arabic vernaculars. Using a 32-million-word online corpus and an innovative method for quantifying language choice, I found that the proportion of Tunisian Arabic on the online forum studied increased from 19.7% in 2010 to 69.9% in 2021.
The linguistic situation in the Arab world is in an important state of transition, with the "spoken" vernaculars increasingly functioning as written languages as well. While this fact is widely acknowledged and is the subject of a growing body of qualitative literature, there is little quantitative research detailing the process in action. The current project examines this development as it is occurring in Tunisia: I present the findings from a corpus study comparing the frequency of Tunisian Arabic-Standard Arabic equivalent pairs in online forum posts from 2010 with those from 2021. The findings show that the proportion of Tunisian lexical items, compared to their Standard Arabic equivalents, increased from a minority (19.7%) to a majority (69.9%) over this period. At the same time, metalinguistic comments on the forum reveal that, although its status is still contentious, Tunisian has become unmarked as a written language. These changes can be attributed to major developments in Tunisian society over the period of study, including internet access and the 2011 revolution. These findings suggest destabilization of the diglossic language situation in Tunisia and a privileging of national identity vis-à-vis the rest of the Arab world.
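The pair-based comparison described above can be illustrated with a minimal sketch. It assumes that the measure is the share of Tunisian items among all occurrences of paired items; the abstract does not spell out the exact procedure. The two word pairs (in simplified ASCII transliteration), the function name `tunisian_share`, and the token lists are all made up for illustration and are not taken from the study.

```python
from collections import Counter

# Illustrative (Tunisian, Standard Arabic) equivalents chosen for this sketch;
# they are NOT the study's actual pair list.
PAIRS = [
    ("barsha", "kathiran"),  # 'a lot'
    ("famma", "hunaka"),     # 'there is'
]

def tunisian_share(tokens, pairs=PAIRS):
    """Share of Tunisian items among all occurrences of paired items.

    Returns None when no member of any pair occurs in the tokens.
    """
    counts = Counter(tokens)
    tunisian = sum(counts[t] for t, _ in pairs)
    standard = sum(counts[s] for _, s in pairs)
    total = tunisian + standard
    return tunisian / total if total else None

# Toy token lists standing in for forum posts from two years (made-up data).
posts_by_year = {
    2010: ["hunaka", "kathiran", "barsha", "hunaka"],
    2021: ["barsha", "famma", "barsha", "hunaka"],
}
shares = {year: tunisian_share(toks) for year, toks in posts_by_year.items()}
```

Restricting the count to known equivalent pairs means that words shared by both varieties do not influence the ratio, which may be one reason to prefer such a measure over whole-text language identification.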
Perspectives on Arabic Linguistics 34, 2023
Social and technological changes over the past several decades have led to widespread writing of "spoken" Arabic dialects. In Tunisia, there has been a noticeable growth of vernacular prose literature, part of a larger development of Tunisian Arabic as a written language. Tunisia does not have a history of colloquial literature: previously, even the use of "derja" in literary dialogue was rare. From this nearly non-existent base, a small "leak" of vernacular writing appeared in the latter part of the 20th century, followed by a flood, first online and increasingly in print, in the first two decades of the 21st. This has culminated in over a dozen vernacular novels and literary translations.
Fī in Tunisian Arabic is a preposition that describes a containment relationship and is roughly equivalent to the English prepositions ‘in’ and ‘into’. In addition to its use as a marker of spatial relationship, fī has developed an aspectual use in Tunisian Arabic as a marker of the progressive aspect, e.g. nušrub fī al-tāy ‘I’m drinking tea’. This feature has been sparsely attested in other varieties of Arabic, but only in Tunisian has it developed into an integral, obligatory part of the aspectual system. This work describes this aspectual usage of fī and explores its possible origins.
Because of the many varieties of Arabic, there can never be 'one' authoritative corpus of the language. To achieve the best results for language-learning resources and natural language processing, corpora for both the standard language and the spoken varieties are needed. To this end, the Tunisian Arabic Corpus (TAC) is a project, led by Karen McNeil and Miled Faiza, seeking to build a four-million-word corpus of Tunisian Spoken Arabic. There are many challenges to creating Arabic corpora, and dialectal corpora in particular, including those of sources, balance, and parsing. The corpus currently consists of only about 881,000 words, and issues of balance and parsing have not been completely solved. Nonetheless, the corpus has proved to be a useful resource for Arabic students and researchers, and it also presents a model for others who wish to create dialectal Arabic corpora.
Drafts by Karen McNeil
UNPUBLISHED PAPER COMPARING THE HAND-ROLLED PARSER/POS TAGGER USED IN TUNISIYA.ORG WITH SOME ML METHODS
This paper presents a comparison of several different part-of-speech taggers trained on a hand-annotated Tunisian Arabic sample of 6,000 words in 450 sentences. Despite the small size of the annotated corpus, the Trigram Tagger and Brill Tagger performed nearly as well as a custom rule-based tagger. This work not only contributes a new resource to Tunisian Arabic, a severely under-resourced language, but also provides insight for developing materials for other under-resourced languages.
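To make the idea of a trigram tagger concrete, here is a minimal pure-Python sketch of one common formulation: the tag for a word is looked up from the two preceding tags plus the word itself, backing off to a bigram context, then to the word alone, then to the most frequent tag. This is an assumption-laden illustration, not the tagger evaluated in the paper; the function names, the tag labels, and the three-word toy data (the source's example phrase in simplified ASCII transliteration) are hypothetical.

```python
from collections import Counter, defaultdict

def train(tagged_sents):
    """Count tags per context for trigram, bigram, and unigram lookups."""
    tri, bi, uni = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        prev2, prev1 = "<s>", "<s>"  # sentence-start padding
        for word, tag in sent:
            tri[(prev2, prev1, word)][tag] += 1
            bi[(prev1, word)][tag] += 1
            uni[word][tag] += 1
            all_tags[tag] += 1
            prev2, prev1 = prev1, tag
    default = all_tags.most_common(1)[0][0]  # fallback for unseen words
    return tri, bi, uni, default

def tag(words, model):
    """Tag left to right, backing off trigram -> bigram -> unigram -> default."""
    tri, bi, uni, default = model
    prev2, prev1 = "<s>", "<s>"
    out = []
    for word in words:
        lookups = ((tri, (prev2, prev1, word)), (bi, (prev1, word)), (uni, word))
        for table, key in lookups:
            if key in table:
                best = table[key].most_common(1)[0][0]
                break
        else:
            best = default
        out.append((word, best))
        prev2, prev1 = prev1, best
    return out

def accuracy(model, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag([w for w, _ in sent], model)
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total

# Toy data with hypothetical tags; far too small to be meaningful.
train_sents = [
    [("nushrub", "V"), ("fi", "PREP"), ("al-tay", "N")],
    [("fi", "PREP"), ("al-tay", "N")],
]
held_out = [[("nushrub", "V"), ("al-tay", "N")]]
model = train(train_sents)
score = accuracy(model, held_out)
```

With only a few thousand annotated words, most trigram contexts in new text will be unseen, so the backoff chain does much of the work; comparing `accuracy` on held-out sentences is the simplest way to set such a tagger against a rule-based one.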