Benchmarking Arabic AI with Large Language Models
Related papers
JASMINE: Arabic GPT Models for Few-Shot Learning
arXiv (Cornell University), 2022
Task-agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge about the capabilities of English-language autoregressive models such as GPT-3 that adopt this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties spoken by more than 400 million people, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 13 billion parameters. We pretrain our new models with large amounts of diverse data (~400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models, focused on investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models to interested researchers, along with code for experimenting with them.
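The zero- and few-shot evaluation described above typically works by scoring candidate completions with the autoregressive model rather than fine-tuning it. Below is a minimal sketch of that idea using the Hugging Face transformers API; the checkpoint name is an assumption (JASMINE checkpoints are not used here), and the two-example sentiment prompt is purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: any Arabic causal LM works the same way.
MODEL_ID = "aubmindlab/aragpt2-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that position t predicts token t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

# A 2-shot sentiment prompt; the model picks whichever label completion
# it finds more likely. Example sentences are toy data.
few_shot_prompt = (
    "النص: الخدمة ممتازة. التصنيف: إيجابي\n"   # "Text: the service is excellent. Label: positive"
    "النص: المنتج سيئ جدا. التصنيف: سلبي\n"     # "Text: the product is very bad. Label: negative"
    "النص: التجربة كانت رائعة. التصنيف:"        # query example
)
labels = [" إيجابي", " سلبي"]
scores = {lab: sequence_logprob(few_shot_prompt + lab) for lab in labels}
print(max(scores, key=scores.get))
```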
A Large and Diverse Arabic Corpus for Language Modeling
arXiv (Cornell University), 2022
Large Language Models (LLMs) have ushered in a major paradigm shift in Natural Language Processing (NLP), where large pre-trained language models (LMs) have become a fundamental component of most NLP tasks. These models learn relevant and meaningful representations of a language without any supervision, and fine-tuning them on typical NLP tasks yields substantially higher accuracy than conventional shallow learning techniques. However, training these models requires a massively large corpus that adequately represents a language. Owing to the availability of enormous corpora, English LLMs typically perform better than their counterparts in other languages. This effort focuses on the design and development of a large Arabic corpus. The corpus comprises over 500 GB of cleaned Arabic text, intended to improve the cross-domain knowledge and downstream generalization capability of LLMs. The corpus was employed in the training of a large Arabic LLM. To assess the efficacy of the LLM, it was fine-tuned on a variety of typical NLP tasks. The fine-tuned tasks exhibited a significant boost in accuracy, ranging between 4.5% and 8.5%, compared to those downstreamed from multilingual BERT (mBERT). To the best of our knowledge, this is currently the largest clean and diverse Arabic corpus ever assembled.
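Building a clean corpus at this scale hinges on normalization and deduplication. The sketch below illustrates one plausible cleaning pass for Arabic text (diacritic and tatweel removal, letter-variant unification, exact-duplicate filtering); the specific filters and thresholds are assumptions for illustration, not the pipeline used in the paper.

```python
import hashlib
import re

# Remove tashkeel (diacritics), dagger alif, and tatweel; strip characters outside
# the Arabic block, digits, Latin letters, and basic punctuation; collapse whitespace.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
NON_TEXT   = re.compile(r"[^\u0600-\u06FF0-9A-Za-z\s.,!?؟،؛:]")
WHITESPACE = re.compile(r"\s+")

def normalize(line: str) -> str:
    line = DIACRITICS.sub("", line)
    line = line.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")  # unify alif forms
    line = line.replace("ى", "ي")                                      # unify alif maqsura / ya
    line = NON_TEXT.sub(" ", line)
    return WHITESPACE.sub(" ", line).strip()

def clean_corpus(lines, min_tokens=5):
    """Yield normalized, deduplicated lines with at least `min_tokens` tokens."""
    seen = set()
    for raw in lines:
        text = normalize(raw)
        if len(text.split()) < min_tokens:      # drop very short fragments
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                      # exact-duplicate filtering
            continue
        seen.add(digest)
        yield text
```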
16th Conference of the European Chapter of the Association for Computational Linguistics, 2021
A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, attaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as large as ~10% F1 (NER) and 2% accuracy (POS tagging). We acquire even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the performance boost observed directly results from training data augmentation possible with DA examples via self-training. This opens up opportunities for developing DA models exploiting only MSA resources. Our approach can also be extended to other languages and tasks.
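The self-training recipe described here is model-agnostic: train on labeled source-variety data, pseudo-label unlabeled target-variety data, keep only confident predictions, and retrain on the augmented set. The toy sketch below shows that loop with a generic scikit-learn classifier on random features standing in for MSA and DA text; the confidence threshold and number of rounds are arbitrary choices, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins: X_msa/y_msa play the role of labeled MSA data,
# X_da the role of unlabeled dialectal data (with a slight domain shift).
X_msa = rng.normal(size=(200, 16))
y_msa = (X_msa[:, 0] > 0).astype(int)
X_da = rng.normal(loc=0.3, size=(500, 16))

X_train, y_train = X_msa.copy(), y_msa.copy()
unlabeled = X_da

for round_idx in range(3):                          # a few self-training rounds
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = model.predict_proba(unlabeled)
    confidence = probs.max(axis=1)
    keep = confidence >= 0.95                       # only trust confident pseudo-labels
    if not keep.any():
        break
    pseudo_labels = probs[keep].argmax(axis=1)
    # Augment the training set with confidently pseudo-labeled "DA" examples.
    X_train = np.vstack([X_train, unlabeled[keep]])
    y_train = np.concatenate([y_train, pseudo_labels])
    unlabeled = unlabeled[~keep]
    print(f"round {round_idx}: added {keep.sum()} pseudo-labeled examples")
```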
A Comparative Study of Deep Learning Approaches for Arabic Language Processing
Jordan Journal of Electrical Engineering, 2025
Arabic is a difficult language for natural language processing (NLP) because of its complicated morphology, dialectal differences, and limited annotated resources. Although deep learning algorithms have reached state-of-the-art results in many NLP tasks, comprehensive comparative studies for Arabic remain scarce. This paper addresses this gap by systematically evaluating three prominent deep learning architectures, namely Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers, across five essential Arabic NLP tasks: i) sentiment analysis, ii) named entity recognition, iii) machine translation, iv) text classification, and v) dialect identification. We compare the performance of models trained from scratch with fine-tuned versions of AraBERT, a powerful Transformer-based model pre-trained on a large Arabic corpus. Our experiments employ Arabic datasets already available in the literature and use accuracy, F1-score, and BLEU as the evaluation metrics. The results indicate the superiority of Transformer-based models, with AraBERT achieving the highest scores on every task. Notably, AraBERT attains 95.2% accuracy on sentiment analysis, higher than the accuracies of the RNN and CNN models. These improvements also appear in the other tasks, where AraBERT outperforms the RNN and CNN baselines by 3 BLEU points in machine translation and by 2.3% F1-score in dialect identification. This extensive assessment highlights the advantages and disadvantages of deep learning architectures for Arabic NLP, and AraBERT's strong performance demonstrates how transfer learning and the synergy between Transformer architectures and large-scale pre-training can significantly advance Arabic language technology.
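Fine-tuning AraBERT for a task such as sentiment analysis follows the standard encoder fine-tuning recipe. The sketch below uses the Hugging Face Trainer with a tiny in-memory toy dataset; the checkpoint name, hyperparameters, and example sentences are assumptions for illustration rather than the paper's experimental setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed checkpoint; any Arabic BERT-style encoder can be substituted.
CHECKPOINT = "aubmindlab/bert-base-arabertv02"

# Toy labeled examples (1 = positive, 0 = negative).
train_data = Dataset.from_dict({
    "text": ["الخدمة ممتازة والتوصيل سريع",   # "excellent service and fast delivery"
             "تجربة سيئة ولن أكررها",          # "bad experience, will not repeat it"
             "المنتج رائع وأنصح به",           # "great product, I recommend it"
             "جودة ضعيفة وسعر مرتفع"],          # "poor quality and high price"
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="arabert-sentiment",      # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```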
Zero-Resource Multi-Dialectal Arabic Natural Language Understanding
ArXiv, 2021
A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLMs) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on Modern Standard Arabic (MSA) data only, identifying a significant performance drop when evaluating such models on DA. To remedy this performance drop, we propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data, improving zero-shot MSA-to-DA transfer by as much as ~10% F1 (NER), 2% accuracy (POS tagging), and 4.5% F1 (SRD). We conduct an ablation experiment and show that the performance boost observed directly results from the unlabeled DA examples used for self-training.
Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
arXiv (Cornell University), 2023
We present Dolphin, a novel benchmark that addresses the need for a natural language generation (NLG) evaluation framework dedicated to the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including dialogue generation, question answering, machine translation, and summarization, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also offer a public leaderboard that is both interactive and modular and evaluate several models on our benchmark, allowing us to set strong baselines against which researchers can compare.
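With 13 tasks spread over 50 test splits, a benchmark like this needs a convention for collapsing many per-split scores into leaderboard numbers. The snippet below shows one common scheme, averaging within each task and then macro-averaging across tasks; the split names and scores are made up, and this is not necessarily Dolphin's own aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-split results: (task, split_name) -> score on that split's metric.
split_scores = {
    ("summarization", "split_a"): 21.4,
    ("summarization", "split_b"): 24.9,
    ("machine_translation", "split_a"): 33.1,
    ("question_answering", "split_a"): 61.7,
}

per_task = defaultdict(list)
for (task, _split), score in split_scores.items():
    per_task[task].append(score)

# Average within each task, then macro-average across tasks for one leaderboard number.
task_means = {task: mean(scores) for task, scores in per_task.items()}
overall = mean(task_means.values())

for task, score in task_means.items():
    print(f"{task:20s} {score:5.1f}")
print(f"{'overall':20s} {overall:5.1f}")
```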
Embed More Ignore Less (EMIL): enriched representations for Arabic NLP
2020
Our research focuses on the potential improvements from exploiting language-specific characteristics in the form of embeddings in neural networks. More specifically, we investigate the capability of neural techniques and embeddings to represent language-specific characteristics in two sequence labeling tasks: named entity recognition (NER) and part-of-speech (POS) tagging. In both tasks, our preprocessing is designed to use an enriched Arabic representation by adding diacritics to undiacritized text. In POS tagging, we test the ability of a neural model to capture syntactic characteristics encoded within these diacritics by incorporating an embedding layer for diacritics alongside embedding layers for words and characters. In NER, our architecture incorporates diacritic and POS embeddings alongside word and character embeddings. Our experiments are conducted on 7 datasets (4 NER and 3 POS). We show that embedding the information encoded in automatically acquired Arabic diacritics improves performance across all datasets on both tasks. Embedding the information in automatically assigned POS tags further improves performance on the NER task.
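Architecturally, the idea is to concatenate additional embedding channels with the usual word and character representations before the sequence tagger. The PyTorch sketch below shows a simplified version with word, character, and per-token diacritic embeddings feeding a BiLSTM tagger; vocabulary sizes, dimensions, and the per-token (rather than per-character) treatment of diacritics are simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DiacriticAwareTagger(nn.Module):
    """BiLSTM sequence tagger over concatenated word, character, and diacritic
    embeddings; a sketch of the general idea with made-up vocabulary sizes."""

    def __init__(self, n_words=10000, n_chars=60, n_diacritics=10, n_tags=17):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, 100, padding_idx=0)
        self.diac_emb = nn.Embedding(n_diacritics, 25, padding_idx=0)
        self.char_emb = nn.Embedding(n_chars, 25, padding_idx=0)
        # Characters of each word are summarized by a small bidirectional LSTM.
        self.char_lstm = nn.LSTM(25, 25, batch_first=True, bidirectional=True)
        self.tagger = nn.LSTM(100 + 25 + 50, 128, batch_first=True, bidirectional=True)
        self.out = nn.Linear(256, n_tags)

    def forward(self, word_ids, diac_ids, char_ids):
        # word_ids, diac_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        b, s, c = char_ids.shape
        _, (h, _) = self.char_lstm(self.char_emb(char_ids.view(b * s, c)))
        char_feats = h.transpose(0, 1).reshape(b, s, -1)      # (batch, seq, 50)
        feats = torch.cat([self.word_emb(word_ids),
                           self.diac_emb(diac_ids),
                           char_feats], dim=-1)
        hidden, _ = self.tagger(feats)
        return self.out(hidden)                                # (batch, seq, n_tags)

# Dummy forward pass on random ids to show the expected tensor shapes.
model = DiacriticAwareTagger()
logits = model(torch.randint(1, 10000, (2, 8)),
               torch.randint(1, 10, (2, 8)),
               torch.randint(1, 60, (2, 8, 12)))
print(logits.shape)   # torch.Size([2, 8, 17])
```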
From RNN techniques to pre-trained models: emphasis on the use in Arabic machine translation
IAES International Journal of Artificial Intelligence, 2024
In recent years, neural machine translation (NMT) has garnered significant attention due to its superior performance compared to traditional statistical machine translation. However, NMT's effectiveness can be limited when translating between languages with dissimilar structures, such as English and Arabic. To address this challenge, recent advances in natural language processing (NLP) have introduced unsupervised pre-training of large neural models, showing promise for enhancing various NLP tasks. This paper proposes a solution that leverages unsupervised pre-training of large neural models to enhance Arabic machine translation (MT). Specifically, we utilize pre-trained checkpoints from publicly available Arabic NLP models, like Arabic bidirectional encoder representations from transformers (AraBERT) and Arabic generative pre-trained transformer (AraGPT), to initialize and warm-start the encoder and decoder of our transformer-based sequence-to-sequence model. This approach enables us to incorporate Arabic-specific linguistic knowledge, such as word morphology and context, into the translation process. Through a comprehensive empirical study, we rigorously evaluated our models against commonly used approaches in Arabic MT. Our results demonstrate that our pre-trained models achieve new state-of-the-art performance in Arabic MT. These findings underscore the effectiveness of pre-trained checkpoints in improving Arabic MT, with potential real-world applications.
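Warm-starting a sequence-to-sequence model from encoder-only checkpoints is directly supported by the Hugging Face EncoderDecoderModel class, which loads the decoder copy as a causal LM and adds randomly initialized cross-attention. The sketch below initializes both sides from an AraBERT checkpoint and runs one toy forward pass; the checkpoint id and token-id choices are assumptions, and a real English-to-Arabic setup would pair a source-language encoder with the Arabic decoder.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Assumed checkpoint; any BERT-style Arabic encoder works the same way.
ARABERT = "aubmindlab/bert-base-arabertv02"

tokenizer = AutoTokenizer.from_pretrained(ARABERT)
# Warm-start both encoder and decoder from the pretrained checkpoint; the decoder
# gains cross-attention layers that are trained during fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ARABERT, ARABERT)

# Seq2seq training with an encoder-decoder needs these special-token ids set.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Toy Arabic-to-Arabic pair just to exercise the forward pass; a real MT corpus
# would supply source/target sentence pairs here.
src = tokenizer("كيف حالك اليوم؟", return_tensors="pt")
tgt = tokenizer("أنا بخير، شكرا لك.", return_tensors="pt")
outputs = model(input_ids=src.input_ids,
                attention_mask=src.attention_mask,
                labels=tgt.input_ids)
print(float(outputs.loss))   # cross-entropy on the toy pair
```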
2021
Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have mostly been demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and in messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pretrained on large multilingual ...
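A character-based language model of this kind is simply a standard language model whose vocabulary is the character set, which keeps the vocabulary tiny and robust to the spelling variability of NArabizi. The sketch below trains a toy character-level LSTM language model on a short NArabizi-like string; the text, model size, and training length are stand-ins, not the paper's setup.

```python
import torch
import torch.nn as nn

# Toy NArabizi-like text; a real setup would stream a large corpus instead.
corpus = "wesh rak? labas hamdoullah. wesh kayn jdid?"
chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}

class CharLM(nn.Module):
    """Character-level LSTM language model: predict the next character."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)

ids = torch.tensor([[stoi[c] for c in corpus]])
inputs, targets = ids[:, :-1], ids[:, 1:]

model = CharLM(len(chars))
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                # overfits the toy string, which is fine here
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, len(chars)), targets.reshape(-1))
    optim.zero_grad()
    loss.backward()
    optim.step()
print(f"final loss: {loss.item():.3f}")
```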