Detecting Arabic textual threats in social media using artificial intelligence: An overview

Towards Accurate Detection of Offensive Language in Online Communication in Arabic

Procedia Computer Science

We present the results of predictive modelling for the detection of antisocial behaviour in online communication in Arabic, such as comments which contain obscene or offensive words and phrases. We collected and labelled a large dataset of YouTube comments in Arabic which contains a broad range of both offensive and inoffensive comments. We used this dataset to train a Support Vector Machine classifier and experimented with combinations of word-level features, N-gram features and a variety of pre-processing techniques. We summarise the pre-processing steps and features that allow training a classifier which is more precise, with 90.05% accuracy, than classifiers reported by previous studies on Arabic text.

Automatic Detection of Cyberbullying and Abusive Language in Arabic Content on Social Networks: A Survey

Procedia Computer Science 189 (2021) 156–166

Online social networks have become a key part of today's world, providing a platform for expression and content distribution. This technology enables users to communicate easily with each other and share their data instantly. However, the internet is not generally protected; it can be a source of abusive and harmful content that causes harm to others. Because of the negative effects of abusive language and cyberbullying, there is a great need for approaches and strategies to address these issues. Arabic text is known for its challenges, its complexity, and the scarcity of its resources. Considerable effort has gone into automated detection of abusive language and cyberbullying in many languages, but much less for Arabic. This work analyzes 27 studies on the automatic detection of Arabic abusive language and cyberbullying and the approaches they use. The goal of this paper is to review the findings of previous studies on cyberbullying and abusive language detection in Arabic content on online social networks and to help future researchers develop automatic detection systems that are effective and realistic.

Detection of anti-social behaviour in online communication in Arabic

University of Limerick, 2019

Azalden Alakrot

Antisocial behaviour on social media cannot be easily ignored as it affects a large and growing percentage of the world's population. It often has a negative effect on people's lives; incidents of online abuse that may seem insignificant can have a cumulative impact on mental health. An increasing number of incidents of suicide and violence have reportedly been provoked by antisocial behaviour on social media. Most of the existing machine-learning approaches for detection of offensive language are specifically tailored for online communication in English. Solutions targeting the Arabic language are rare, while, as we also demonstrate in this thesis, offensive language is widespread in Arabic social media as well. Our hypothesis has been that Arabic may require a specific approach different from the solutions for English, due to the specific linguistic characteristics of Arabic text and the mixture of dialects, unique to Arabic, frequently observed within the same conversation on social media. The objective of this thesis is to contribute to the work on the automatic prevention of antisocial behaviour in online written communication in Arabic by introducing a large dataset of YouTube comments and proposing a text-mining pipeline for training a binary classifier. The main challenge to automatic detection of offensive language is the absence of appropriate training datasets. Thus, as part of this work we undertook to collect data from Arabic social media (Arabic YouTube channels) and construct a labelled dataset. We then used this dataset to experiment with a variety of text pre-processing techniques, feature-selection methods, and machine-learning classification algorithms in order to recommend a process for automatic detection of offensive language in online written communication in Arabic.
Our results are encouraging; they suggest that a Support Vector Machine classifier can be effectively deployed for the detection of offensive language in online written communication in Arabic. We believe that the proposed text-mining process will open the door for further research in this direction and will eventually result in effective automatic prevention of incidents of verbal abuse on Arabic social media.

In memory of my father. To my mother, with love and eternal appreciation. Azalden Alakrot

I would like to express my gratitude to everyone who supported me throughout the course of this PhD research. I am thankful for their inspiring guidance, invaluable constructive criticism and friendly advice during the work. First and foremost, I would like to thank the Libyan Ministry of Higher Education and Scientific Research for offering me a scholarship to complete this work. I would also like to thank my host, the Department of Computer Science and Information Systems (CSIS) at the University of Limerick, for the respectful atmosphere and the excellent workplace, where I spent most of my time in the last few years at my desk behind my computer working on this research project, and where I also met many kind people. I am indebted to my co-supervisors: Dr Nikola Nikolov, who has guided me through this academic journey with unlimited support, and Dr Liam Murray, who also guided me through this journey and gave me great encouragement to complete this work. I would also like to thank Prof. Tiziana Margaria, Head of CSIS, for her support, especially with my conference fees.

WOLI at SemEval-2020 Task 12: Arabic Offensive Language Identification on Different Twitter Datasets

2020

Communicating through social platforms has become one of the principal means of personal communication and interaction. Unfortunately, healthy communication is often disrupted by offensive language that can have damaging effects on users. A key to fighting offensive language on social media is the existence of an automatic offensive language detection system. This paper presents the results and the main findings of SemEval-2020 Task 12, OffensEval Sub-task A (Zampieri et al., 2020), on identifying and categorising offensive language in social media. The task was based on the Arabic OffensEval dataset (Mubarak et al., 2020). In this paper, we describe the system submitted by WideBot AI Lab for the shared task, which ranked 10th out of 52 participants with a Macro-F1 of 86.9% on the golden dataset under the CodaLab username “yasserotiefy”. We experimented with various models, and the best model is a linear SVM in which we use a combination of both character and word n-grams. We also introduced...
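As a rough illustration of what "a combination of both character and word n-grams" can look like as classifier input, the following minimal Python sketch counts both feature families in one sparse dictionary. It is not the WideBot system: the function names, the n-gram ranges, the `w:`/`c:` key prefixes and the English placeholder text are all assumptions made for this example. Python strings are Unicode, so the same functions accept Arabic text unchanged.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of text (spaces are kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_features(text, word_ns=(1, 2), char_ns=(2, 3)):
    """Count word and character n-grams in one sparse feature dictionary.

    Keys are prefixed with "w:" or "c:" (an assumption of this sketch)
    so that the two feature families never collide.
    """
    features = Counter()
    for n in word_ns:
        features.update("w:" + gram for gram in word_ngrams(text, n))
    for n in char_ns:
        features.update("c:" + gram for gram in char_ngrams(text, n))
    return features


# Placeholder comment; any Unicode string works the same way.
features = combined_features("good day good")
```

Character n-grams are often favoured for noisy social media text because they still match when spelling varies, while word n-grams keep short phrases intact; a dictionary like `features` would then be vectorised and passed to a linear classifier.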

NAYEL at SemEval-2020 Task 12: TF/IDF-Based Approach for Automatic Offensive Language Detection in Arabic Tweets

2020

In this paper, we present the system submitted to “SemEval-2020 Task 12”. The proposed system aims to automatically identify offensive language in Arabic tweets. A machine-learning-based approach has been used to design our system. We implemented a linear classifier with Stochastic Gradient Descent (SGD) as the optimization algorithm. Our model achieved f1-scores of 84.20% and 81.82% on the development set and test set respectively. The best-performing system and the lowest-ranked system reported f1-scores of 90.17% and 44.51% on the test set respectively.
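The general idea of a TF-IDF representation fed to a linear classifier trained with SGD can be sketched in plain Python as follows. This is not the NAYEL system: the smoothed IDF formula, the logistic loss, the learning rate, the epoch count and the tiny English toy dataset are all assumptions chosen only to keep the example self-contained.

```python
import math
import random
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn tokenised documents into sparse TF-IDF dictionaries."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors


def train_sgd(vectors, labels, epochs=50, lr=0.5, seed=0):
    """Fit a linear model with stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    weights, bias = defaultdict(float), 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order
        for i in order:
            x, y = vectors[i], labels[i]
            z = bias + sum(weights[t] * v for t, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            for t, v in x.items():
                weights[t] -= lr * error * v
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (offensive) if the linear score is positive, else 0."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in x.items())
    return 1 if z > 0 else 0


# Hypothetical toy data: 1 = offensive, 0 = not offensive.
docs = [["you", "are", "awful"], ["awful", "stupid", "person"],
        ["have", "a", "nice", "day"], ["nice", "video", "thanks"]]
labels = [1, 1, 0, 0]
vectors = tfidf_vectors(docs)
weights, bias = train_sgd(vectors, labels)
predictions = [predict(weights, bias, x) for x in vectors]
```

Each SGD step updates only the weights of terms present in the current document, which is why this family of models scales well to large, sparse text features.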

Offensive Language Detection in Social Networks for Arabic Language Using Clustering Techniques

2021

With the advent of social networks, users have obtained a golden opportunity to express their opinions using text and multimedia. However, some users have abused these platforms through acts such as cyberbullying and cyber-harassment. Despite the various negative health and social effects, the work proposed toward the detection of these acts is still limited, especially in non-English languages. In Arabic, few works have studied this phenomenon, and these works had limited datasets. As the number of available training datasets is limited, it is still hard to train classifiers to detect these acts. Therefore, clustering has been put forward as an alternative solution to tackle this difficulty. In this work, we propose the use of clustering to detect cyberbullying and cyber-harassment. We adopted various clustering algorithms, including K-Means and Expectation Maximization (EM). Moreover, we used various natural language processing (NLP) tools for this objective. The results illustrate that the...

Dataset Construction for the Detection of Anti-Social Behaviour in Online Communication in Arabic

Procedia Computer Science, 2018

Warning: this paper contains a range of words which may cause offence. In recent years, many studies target antisocial behaviour such as offensive language and cyberbullying in online communication. Typically, these studies collect data from various reachable sources, the majority of the datasets being in English. However, to the best of our knowledge, there is no dataset collected from the YouTube platform targeting Arabic text and overall there are only a few datasets of Arabic text, collected from other social platforms for the purpose of offensive language detection. Therefore, in this paper we contribute to this field by presenting a dataset of YouTube comments in Arabic, specifically designed to be used for the detection of offensive language in a machine learning scenario. Our dataset contains a range of offensive language and flaming in the form of YouTube comments. We document the labelling process we have conducted, taking into account the difference in the Arab dialects and the diversity of perception of offensive language throughout the Arab world. Furthermore, statistical analysis of the dataset is presented, in order to make it ready for use as a training dataset for predictive modelling.

Automatic Detection of Offensive Language for Urdu and Roman Urdu

IEEE Access

In recent years, unethical behavior in the cyber-environment has come to light. The presence of offensive language on social media platforms, and the automatic detection of such language, is becoming a major challenge in modern society. The complexity of natural language constructs makes this task even more challenging. Until now, most of the research has focused on resource-rich languages like English. Roman Urdu and Urdu are two scripts for writing the Urdu language on social media. The Roman script uses English-language characters, while the Urdu script uses Urdu-language characters. Urdu and Hindi are similar, differing only in their writing script, and the Roman scripts of both languages are similar. This study is about the detection of offensive language in user comments written in Urdu, a resource-poor language. We propose the first offensive dataset of Urdu containing user-generated comments from social media. We use individual and combined n-gram techniques to extract features at character level and word level. We apply seventeen classifiers from seven machine learning techniques to detect offensive language in both Urdu and Roman Urdu text comments. Experiments show that the regression-based models using character n-grams show superior performance in processing the Urdu language. Character-level tri-grams outperform the other word and character n-grams. LogitBoost and SimpleLogistic outperform the other models and achieve F-measure values of 99.2% and 95.9% on the Roman Urdu and Urdu datasets respectively. Our dataset is publicly available on GitHub for future research.

Hate Speech in the Arab Electronic Press and Social Networks

Revue d'Intelligence Artificielle, 2021

We are witnessing an open world, characterized by globalization and by technology through which information circulates without borders. Social networking sites, now the most common communication tool, give access through various applications to a large space for presenting many ideas, including extremist ideas, and for spreading hate speech. This paper introduces a system for detecting hate speech in the texts of Arabic read media and social media, based on the combined use of NLP and machine learning methods. The detection model is trained on a large dataset of articles, tweets and comments, which were collected, balanced and then tokenized using BERT in Arabic. The trained model detects hate speech in Arabic and various Arabic-based dialects by classifying the texts into two classes: Neutral and Abusive. The above-mentioned model is evaluated using precision metrics, recall and...

Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings

IEEE Access

Social networks facilitate communication between people from all over the world. Unfortunately, the excessive use of social networks leads to the rise of antisocial behaviors such as the spread of online offensive language, cyberbullying (CB), and hate speech (HS). Therefore, the detection of abusive/offensive language and hate speech has become a crucial part of addressing cyberharassment. Manual detection of cyberharassment is cumbersome, slow, and not even feasible for rapidly growing data. In this study, we addressed the challenges of automatic detection of offensive tweets in the Arabic language. The main contribution of this study is to design and implement an intelligent prediction system encompassing a two-stage optimization approach to distinguish offensive from non-offensive text. In the first stage, the proposed approach fine-tuned the pre-trained word embedding models by training them for several epochs on the training dataset. The embeddings of the vocabulary in the new dataset are trained and added to the old embeddings. In the second stage, it employed a hybrid approach of two classifiers, namely XGBoost and SVM, and a genetic algorithm (GA) to mitigate the drawback of the classifiers in finding the optimal hyperparameter values to run the proposed approach. We tested the proposed approach on the Arabic Cyberbullying Corpus (ArCybC), which contains tweets collected from four Twitter domains: gaming, sports, news, and celebrities. The ArCybC dataset has four categories: sexual, racial, intelligence, and appearance. The proposed approach produced superior results, in which the SVM algorithm with the Aravec SkipGram word embedding model achieved an accuracy of 88.2% and an F1-score of 87.8%. Index terms: Arabic harassment dataset, deep learning, evolutionary algorithm, fine-tuned word embedding, hate speech, offensive language, optimization.
