Homophobic and Hate Speech Detection Using Multilingual-BERT Model on Turkish Social Media
Related papers
Textual-based Turkish Offensive Language Detection Model
16. ULUSAL YAZILIM MÜHENDİSLİĞİ SEMPOZYUMU (UYMS 2022), 2023
Harmful social media comments and posts have a variety of unintended repercussions for individuals. Beyond psychological disorders, researchers have studied this problem as a possible cause of suicidal behavior. With approximately 16.1 million active users in 2022, Turkey is the sixth-largest Twitter community, reflecting a diversified demographic for its size. As a result, there is increasing demand for a high-quality Turkish hate speech detection model for use in social networks. The vast bulk of prior research has been conducted on small, label-imbalanced datasets. This study investigates traditional machine learning and recent deep learning algorithms for detecting hate speech in the Turkish language. We applied different classification methods and algorithms to detect offensive Turkish text on a large dataset of more than 53,000 posts. The results demonstrate that BERT features achieved promising performance, with BiLSTM and Logistic Regression performing best on the dataset. Across all models, the findings demonstrate the robustness of LightGBM, Logistic Regression, and BiLSTM for detecting offensive language, reaching around 95% ROC AUC.
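As an illustration of the BERT-features pipeline this abstract describes, here is a minimal sketch that mean-pools frozen BERT token states and feeds them to a Logistic Regression classifier. The checkpoint name, pooling scheme, and toy data are illustrative assumptions, not details from the paper.

```python
# Sketch: frozen BERT embeddings as features for a classical classifier.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
bert.eval()  # used as a frozen feature extractor; no fine-tuning here

texts = ["örnek bir tweet", "bir başka örnek"]  # placeholder Turkish posts
labels = [0, 1]                                  # 0 = not offensive, 1 = offensive

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state        # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)    # ignore padding when pooling
    feats = (hidden * mask).sum(1) / mask.sum(1)  # mean-pooled sentence vectors

clf = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)
print(clf.predict(feats.numpy()))
```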
Modeling Hate Speech Detection in Social Media Interactions Using Bert
2020
Hate speech propagation on social media sites has been happening for some time, and there is a need to identify and counter it accurately so that those offended can seek redress and offenders can be punished for perpetrating the vice. In this paper, we demonstrate how fine-tuning a pre-trained Google Bidirectional Encoder Representations from Transformers (BERT) model improves the accuracy of classifying tweets as either hate speech or not. Random forest and logistic regression algorithms were used to build baseline models with a publicly available Twitter dataset from hatebase.org. To validate the BERT model, we collected data using the Tweepy API and combined it with the hatebase.org data for training. The results show a 7.22% improvement in tweet classification accuracy over the baseline models.
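A minimal sketch of fine-tuning a pre-trained BERT checkpoint for binary tweet classification, in the spirit of the approach above; the checkpoint, hyperparameters, and toy data are assumptions for illustration.

```python
# Sketch: end-to-end fine-tuning of BERT for hate / not-hate classification.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

texts = ["example tweet one", "example tweet two"]  # placeholder tweets
labels = torch.tensor([0, 1])                       # 0 = not hate, 1 = hate

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    out = model(**enc, labels=labels)  # cross-entropy loss computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```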
HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection
2020
Hateful and toxic content has become a significant concern in today's world due to the exponential rise of social media. The increase in hate speech and harmful content has motivated researchers to dedicate substantial effort to the challenging task of hateful content identification. In this task, we propose an approach to automatically classify hate speech and offensive content. We used the datasets from the FIRE 2019 and 2020 shared tasks and performed experiments using transfer learning models. We observed that the pre-trained BERT model and the multilingual BERT model gave the best results. The code is publicly available at https://github.com/suman101112/hasoc-fire-2020
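A sketch of how the monolingual and multilingual checkpoints the abstract compares might be evaluated side by side; the data, metric, and the omission of the fine-tuning step are simplifications for brevity.

```python
# Sketch: comparing two pre-trained checkpoints on the same labeled data.
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def evaluate(name, texts, labels):
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    model.eval()
    with torch.no_grad():
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        preds = model(**enc).logits.argmax(dim=-1)
    return f1_score(labels, preds.numpy(), average="macro")

texts, labels = ["sample post", "another post"], [0, 1]  # placeholder data
for name in ["bert-base-cased", "bert-base-multilingual-cased"]:
    # In practice each model would be fine-tuned first; untrained heads
    # produce arbitrary predictions, so this only shows the comparison loop.
    print(name, evaluate(name, texts, labels))
```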
AI_ML_NIT_Patna @HASOC 2020: BERT Models for Hate Speech Identification in Indo-European Languages
2020
This paper describes the system submitted by team AI_ML_NIT_Patna. The task aims to identify offensive language in a code-mixed dataset of comments in Indo-European languages (English, German, and Hindi) collected from Twitter. We participated in both Sub-task A, which classifies comments into two classes, Hate and Offensive (HOF) and Non Hate-Offensive (NOT), and Sub-task B, which discriminates between hateful (HATE), profane (PRFN), and offensive (OFFN) comments. To address these tasks, we fine-tuned pre-trained multilingual transformer (BERT) based neural network models, which improved performance on the validation and test sets. Our model achieved a weighted F1-score of 0.88 for English in Sub-task A on the test dataset and ranked 3rd on the private-test leaderboard with a macro F1 average of 0.5078.
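A sketch of the three-way Sub-task B setup with a multilingual BERT classifier; the label index order and the placeholder comment are assumptions for illustration.

```python
# Sketch: three-class classification head (HATE / OFFN / PRFN) on mBERT.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

label_names = ["HATE", "OFFN", "PRFN"]  # assumed index order
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
model.eval()

enc = tokenizer(["ein Beispielkommentar"], return_tensors="pt")  # placeholder
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1).item()
print(label_names[pred])  # untrained head: output is arbitrary until fine-tuned
```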
PRHLT-UPV at SemEval-2020 Task 12: BERT for Multilingual Offensive Language Detection
2020
This paper describes the system submitted by the PRHLT-UPV team for Task 12 of SemEval-2020: OffensEval 2020. The official title of the task is Multilingual Offensive Language Identification in Social Media, and its goal is to identify offensive language in texts. The languages included in the task are English, Arabic, Danish, Greek, and Turkish. We propose a model based on the BERT architecture for the analysis of English texts. The approach leverages the knowledge within a pre-trained model and fine-tunes it for the particular task. For the other languages, Multilingual BERT, which has been pre-trained on a large number of languages, is used. In the experiments, the proposed method for English texts is compared with other approaches to analyze the relevance of the chosen architecture. Furthermore, simple models for the other languages are evaluated against the proposed one. The experimental results show that the model based on BERT outperforms...
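The per-language model choice described above reduces to a simple mapping from task language to pre-trained checkpoint; the checkpoint names below are assumptions for illustration.

```python
# Sketch: monolingual checkpoint for English, multilingual BERT elsewhere.
LANG_TO_CHECKPOINT = {
    "english": "bert-base-uncased",
    "arabic": "bert-base-multilingual-cased",
    "danish": "bert-base-multilingual-cased",
    "greek": "bert-base-multilingual-cased",
    "turkish": "bert-base-multilingual-cased",
}

def checkpoint_for(language: str) -> str:
    """Return the pre-trained checkpoint to fine-tune for a given language."""
    return LANG_TO_CHECKPOINT[language.lower()]

print(checkpoint_for("Turkish"))  # -> bert-base-multilingual-cased
```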
Enhancing Arabic offensive language detection with BERT-BiGRU model
Bulletin of Electrical Engineering and Informatics
With the advent of Web 2.0, various platforms and tools have been developed to allow internet users to express their opinions and thoughts on diverse topics and occurrences. Nevertheless, certain users misuse these platforms by sharing hateful and offensive speech, which has a negative impact on the mental health of internet users. Thus, the detection of offensive language has become an active area of research in natural language processing. Rapidly detecting offensive language on the internet and preventing it from spreading is of great practical significance for reducing cyberbullying and self-harm behaviors. Despite the crucial importance of this task, limited work has been done in this field for non-English languages such as Arabic. Therefore, in this paper, we aim to improve Arabic offensive language detection without the need for laborious preprocessing or feature engineering. To achieve this, we combine the bidirectional encoder representations from transformers (BERT) model with a bidirectional gated recurrent unit (BiGRU) layer to further enhance the extracted context and semantic features. The experiments were conducted on the Arabic dataset provided by SemEval 2020 Task 12. The evaluation results show the effectiveness of our model compared to the baseline and related-work models, achieving a macro F1-score of 93.16%.
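A minimal sketch of the BERT-BiGRU architecture this abstract describes: BERT token states feed a bidirectional GRU, and the GRU's final hidden states from both directions are concatenated and classified. The Arabic checkpoint, GRU hidden size, and head design are assumptions, not details from the paper.

```python
# Sketch: BERT encoder followed by a BiGRU layer and a linear classifier.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiGRU(nn.Module):
    def __init__(self, name="asafaya/bert-base-arabic", gru_hidden=128, num_labels=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.gru = nn.GRU(self.bert.config.hidden_size, gru_hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * gru_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        _, h_n = self.gru(hidden)                    # h_n: (2, batch, gru_hidden)
        final = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat both directions
        return self.classifier(final)

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
enc = tokenizer(["مثال"], return_tensors="pt")  # placeholder Arabic text
model = BertBiGRU()
print(model(enc["input_ids"], enc["attention_mask"]).shape)  # (1, 2) logits
```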