A Large and Diverse Arabic Corpus for Language Modeling
Related papers
Benchmarking Arabic AI with Large Language Models
arXiv, 2023
Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero- and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ∼296K data points, ∼46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperformed LLMs in zero-shot learning, with a few exceptions. Notably, larger computational models with few-shot learning techniques managed to reduce these performance gaps. Our findings provide valuable insights into the applicability of LLMs for Arabic NLP and speech processing tasks.
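The zero-shot setting described above can be reproduced in spirit with a simple prompting loop. The sketch below is a minimal illustration rather than the LAraBench pipeline; the model name, prompt wording, and label set are assumptions for a generic Arabic sentiment task.

```python
# Minimal zero-shot classification sketch with the OpenAI chat API.
# The prompt, labels, and model name are illustrative assumptions,
# not the exact setup used in LAraBench.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative", "neutral"]

def zero_shot_sentiment(text: str) -> str:
    prompt = (
        "Classify the sentiment of the following Arabic sentence as "
        f"one of {LABELS}. Answer with the label only.\n\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(zero_shot_sentiment("الخدمة كانت ممتازة"))  # expected: positive
```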
From RNN techniques to pre-trained models: emphasis on the use in Arabic machine translation
IAES International Journal of Artificial Intelligence, 2024
In recent years, neural machine translation (NMT) has garnered significant attention due to its superior performance compared to traditional statistical machine translation. However, NMT's effectiveness can be limited when translating between languages with dissimilar structures, such as English and Arabic. To address this challenge, recent advances in natural language processing (NLP) have introduced unsupervised pre-training of large neural models, showing promise for enhancing various NLP tasks. This paper proposes a solution that leverages unsupervised pre-training of large neural models to enhance Arabic machine translation (MT). Specifically, we utilize pre-trained checkpoints from publicly available Arabic NLP models, like Arabic bidirectional encoder representations from transformers (AraBERT) and Arabic generative pre-trained transformer (AraGPT), to initialize and warm-start the encoder and decoder of our transformer-based sequence-to-sequence model. This approach enables us to incorporate Arabic-specific linguistic knowledge, such as word morphology and context, into the translation process. Through a comprehensive empirical study, we rigorously evaluated our models against commonly used approaches in Arabic MT. Our results demonstrate that our pre-trained models achieve new state-of-the-art performance in Arabic MT. These findings underscore the effectiveness of pre-trained checkpoints in improving Arabic MT, with potential real-world applications.
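Warm-starting a sequence-to-sequence model from pre-trained checkpoints, as described above, can be sketched with Hugging Face's EncoderDecoderModel. The checkpoint name and token-id wiring below are assumptions for illustration; the paper's exact configuration may differ.

```python
# Sketch of warm-starting a seq2seq model from a pre-trained Arabic BERT.
# The checkpoint name is an assumption; any BERT-style Arabic encoder works.
from transformers import AutoTokenizer, EncoderDecoderModel

checkpoint = "aubmindlab/bert-base-arabertv02"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Initialize both encoder and decoder from the same pre-trained weights;
# the decoder's cross-attention layers are randomly initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)

# Wire the special tokens that training and generation need.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# A single (source, target) pair to show the training interface.
inputs = tokenizer("جملة المصدر", return_tensors="pt")
labels = tokenizer("target sentence", return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()  # fine-tune on parallel data from here
```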
hULMonA: The Universal Language Model in Arabic
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Arabic is a complex language with limited resources, which makes it challenging to produce accurate text classification systems for tasks such as sentiment analysis. The use of transfer learning (TL) has recently shown promising results for advancing the accuracy of text classification in English. TL models are pre-trained on large corpora and then fine-tuned on task-specific datasets. In particular, universal language models (ULMs), such as the recently developed BERT, have achieved state-of-the-art results in various NLP tasks in English. In this paper, we hypothesize that similar success can be achieved for Arabic. The work supports this hypothesis by developing the first Universal Language Model in Arabic (hULMonA, meaning "our dream"), demonstrating its use for Arabic classification tasks, and demonstrating how a pre-trained multilingual BERT can also be used for Arabic. We then conduct a benchmark study to evaluate both ULMs on Arabic sentiment analysis. Experimental results show that the developed hULMonA and the multilingual ULM generalize well to multiple Arabic datasets and achieve new state-of-the-art results in Arabic sentiment analysis on some of the tested sets.
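The transfer-learning recipe described above (pre-train on a large corpus, then fine-tune on a task-specific dataset) can be sketched with the Hugging Face Trainer. This is a generic illustration of fine-tuning multilingual BERT for Arabic sentiment with a toy in-memory dataset; it is not the hULMonA training code, and the hyperparameters are assumptions.

```python
# Generic fine-tuning sketch: multilingual BERT on a (toy) Arabic sentiment set.
# Dataset contents and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

train = Dataset.from_dict({
    "text": ["الفيلم رائع", "الخدمة سيئة جدا"],
    "label": [1, 0],
}).map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length",
                            max_length=64), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
)
trainer.train()
```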
Supporting Undotted Arabic with Pre-trained Language Models
arXiv, 2021
We observe a recent behaviour on social media in which users intentionally remove consonantal dots from Arabic letters in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning pre-trained language models, which have recently been employed by many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models to "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; in one of the tasks our method shows nearly perfect performance.
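The "undotting" phenomenon described above can be simulated directly: each dotted Arabic letter is mapped to a dotless character sharing the same skeleton. The mapping below is a partial, illustrative table rather than the paper's exact normalization, and is useful for generating undotted text or probing a model's robustness to it.

```python
# Partial dotted-to-undotted mapping for Arabic letters (illustrative, not exhaustive).
# Each dotted letter is replaced by a dotless character with the same skeleton (rasm).
UNDOT = str.maketrans({
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ",   # share the dotless beh skeleton
    "ج": "ح", "خ": "ح",
    "ذ": "د", "ز": "ر",
    "ش": "س", "ض": "ص", "ظ": "ط", "غ": "ع",
    "ف": "ڡ", "ق": "ٯ",
    "ي": "ى",
})

def undot(text: str) -> str:
    """Remove consonantal dots by mapping letters to their skeleton forms."""
    return text.translate(UNDOT)

print(undot("تجربة بسيطة"))  # the dotted letters lose their dots
```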
AraLegal-BERT: A pretrained language model for Arabic Legal text
arXiv, 2022
The effectiveness of the BERT model on multiple linguistic tasks is well documented. On the other hand, its potential for narrow, specific domains such as the legal domain has not been fully explored. In this paper, we examine how BERT can be used in the Arabic legal domain and customize this language model for several downstream tasks, using several domain-relevant training and testing datasets to train BERT from scratch. We introduce AraLegal-BERT, a bidirectional encoder Transformer-based model that has been thoroughly tested and carefully optimized with the goal of amplifying the impact of NLP-driven solutions for jurisprudence, legal documents, and legal practice. We fine-tuned AraLegal-BERT and evaluated it against three BERT variants for Arabic on three natural language understanding (NLU) tasks. The results show that the base version of AraLegal-BERT achieves better accuracy than the general and original BERT models on legal text.
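Domain-specific pre-training of the kind described above boils down to running masked-language-model training over in-domain text. The sketch below shows continued MLM pre-training of an existing Arabic BERT on legal text with Hugging Face tools; the file path, checkpoint name, and hyperparameters are assumptions, and AraLegal-BERT itself was trained from scratch rather than adapted from an existing checkpoint.

```python
# Sketch: masked-language-model training over in-domain (legal) Arabic text.
# Here we continue pre-training an existing checkpoint; training from scratch,
# as in AraLegal-BERT, would start from a fresh config and tokenizer instead.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)

checkpoint = "aubmindlab/bert-base-arabertv02"      # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# "legal_corpus.txt": one paragraph of legal Arabic per line (assumed path).
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aralegal-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```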
DziriBERT: a Pre-trained Language Model for the Algerian Dialect
arXiv, 2021
Pre-trained transformers are now the de facto models in Natural Language Processing, given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, and Arabic). Therefore, there are still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect, which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets and pre-trained the first Algerian language model: DziriBERT. When compared to existing models, DziriBERT achieves the best results on two Algerian downstream datasets. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our pre-trained model is publicly available to the community.
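Pre-training a dedicated dialect model as described above starts with building a vocabulary from the collected tweets. Below is a minimal WordPiece tokenizer-training sketch with the tokenizers library; the file name, vocabulary size, and special tokens are assumptions rather than DziriBERT's actual settings.

```python
# Sketch: train a WordPiece vocabulary on a dialect tweet corpus.
# "algerian_tweets.txt" (one tweet per line) and the hyperparameters are assumptions.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["algerian_tweets.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("dziribert-tokenizer")  # writes vocab.txt for later BERT pre-training
```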
A Comparative Study of Deep Learning Approaches for Arabic Language Processing
Jordan Journal of Electrical Engineering, 2025
Arabic is a difficult language for natural language processing (NLP) because of its complicated morphology, dialectal variation, and limited annotated resources. Although deep learning algorithms have reached state-of-the-art results in many NLP tasks, comprehensive comparative studies for Arabic remain scarce. This paper addresses this gap by systematically evaluating three prominent deep learning architectures, namely Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers, across five essential Arabic NLP tasks: i) sentiment analysis, ii) named entity recognition, iii) machine translation, iv) text classification, and v) dialect identification. We compare the performance of models trained from scratch with fine-tuned versions of AraBERT, a powerful Transformer-based model pre-trained on a large Arabic corpus. Our experiments employ Arabic datasets already available in the literature and use accuracy, F1-score, and BLEU as evaluation metrics. The results demonstrate the superiority of Transformer-based models, with AraBERT achieving the highest scores on every task. Notably, AraBERT attains 95.2% accuracy on sentiment analysis, higher than the RNN and CNN baselines. These improvements carry over to the other tasks, with AraBERT outperforming the RNN and CNN models by roughly 3 BLEU points in machine translation and 2.3% F1-score in dialect identification. This extensive assessment highlights the strengths and weaknesses of deep learning architectures for Arabic NLP, and AraBERT's strong performance demonstrates how transfer learning, Transformer architectures, and large-scale pre-training can significantly advance Arabic language technology.
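Of the architecture families compared above, the convolutional baseline is the simplest to sketch from scratch. The following is a minimal Kim-style CNN text classifier in PyTorch with assumed vocabulary size and hyperparameters; it illustrates the "trained from scratch" side of the comparison, not the paper's exact models.

```python
# Minimal CNN text classifier (trained from scratch), in the spirit of the
# CNN baseline compared against fine-tuned AraBERT. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=30_000, emb_dim=128, num_classes=3,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

model = TextCNN()
logits = model(torch.randint(1, 30_000, (4, 50)))     # 4 sentences, 50 tokens each
print(logits.shape)                                    # torch.Size([4, 3])
```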
Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
arXiv, 2023
We present Dolphin, a novel benchmark that addresses the need for a natural language generation (NLG) evaluation framework dedicated to the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including dialogue generation, question answering, machine translation, and summarization, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also offer a public leaderboard that is both interactive and modular, and we evaluate several models on our benchmark, setting strong baselines against which researchers can compare.
A Comparative Study on Various Deep Learning Techniques for Arabic NLP Syntactic Tasks
2022
It is well known that there are three basic tasks in natural language processing (NLP): tokenization, part-of-speech tagging, and named entity recognition, which can be divided into two levels, lexical and syntactic. The former level includes tokenization; the latter includes the part-of-speech (POS) and named entity recognition (NER) tasks. Recently, deep learning has been shown to perform well in various natural language processing tasks such as POS tagging, NER, sentiment analysis, and language modelling, and it does so without the need for manually designed external resources or time-consuming feature engineering. In this study, the focus is on using Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BLSTM), Bidirectional Long Short-Term Memory with Conditional Random Field (BLSTM-CRF), and Long Short-Term Memory with Conditional Random Field (LSTM-CRF) deep learning techniques for tasks at the syntactic level and comparing their performance. The models are trained and tested using the KALIMAT corpus. The obtained results show that the BLSTM-CRF model outperformed the other models in the NER task. As for the POS task, the BLSTM-CRF model obtained the highest F1-score compared to the other models.
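A minimal version of the BLSTM sequence tagger evaluated above can be sketched in PyTorch as follows. Dimensions and tag counts are assumptions; the CRF layer used in the best-performing BLSTM-CRF variant is omitted here but could be added on top of the emission scores (for example with the pytorch-crf package).

```python
# Minimal BiLSTM sequence tagger (POS/NER-style). Sizes are illustrative assumptions;
# a CRF decoding layer over the emissions would turn this into a BLSTM-CRF.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=30_000, emb_dim=100, hidden=128, num_tags=17):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)  # per-token tag scores

    def forward(self, token_ids):           # (batch, seq_len)
        embedded = self.embedding(token_ids)
        outputs, _ = self.lstm(embedded)     # (batch, seq_len, 2*hidden)
        return self.emissions(outputs)       # (batch, seq_len, num_tags)

model = BiLSTMTagger()
scores = model(torch.randint(1, 30_000, (2, 12)))
loss = nn.CrossEntropyLoss()(scores.reshape(-1, 17), torch.randint(0, 17, (2 * 12,)))
loss.backward()
```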
Building TALAA, a Free General and Categorized Arabic Corpus
Arabic natural language processing (ANLP) has gained increasing interest over the last decade. However, the development of ANLP tools depends on the availability of large corpora. Unfortunately, the scientific community has a deficit of large and varied Arabic corpora, especially ones that are freely accessible. With the Internet continuing its exponential growth, Arabic Internet content has followed the trend, yielding large amounts of textual data available through different Arabic websites. This paper describes the TALAA corpus, a voluminous general Arabic corpus built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words, with 15,891,729 tokens contained in 57,827 different articles. Part of the TALAA corpus has been tagged to construct an annotated Arabic corpus of about 7,000 tokens, using a POS tagger with a set of 58 detailed tags; the annotated corpus was manually checked by two human experts. The methodology used to construct TALAA is presented and various metrics are applied to it, showing the usefulness of the corpus. The corpus can be made available to the scientific community upon authorisation.
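Corpus statistics like those reported for TALAA (article, word, and token counts) can be computed with a short script once the articles are stored as plain-text files. The directory layout and whitespace tokenization below are assumptions for illustration, not the TALAA pipeline.

```python
# Sketch: basic corpus statistics over a directory of plain-text articles.
# The directory name and whitespace tokenization are illustrative assumptions.
from collections import Counter
from pathlib import Path

corpus_dir = Path("talaa_articles")   # assumed: one article per .txt file
token_counts = Counter()
num_articles = 0

for article in corpus_dir.glob("*.txt"):
    num_articles += 1
    tokens = article.read_text(encoding="utf-8").split()
    token_counts.update(tokens)

print(f"articles: {num_articles}")
print(f"tokens:   {sum(token_counts.values())}")
print(f"types:    {len(token_counts)}")
print("most frequent:", token_counts.most_common(10))
```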