DEEP at HASOC2019: A Machine Learning Framework for Hate Speech and Offensive Language Detection
Related papers
2020
Social media is currently the most widely used platform, and everyone has the right to express their speculations, ideas, thoughts, etc. In this setting, hate speech and offensive content often spread like wildfire, with a detrimental impact on the world. It is important to identify and eradicate such offensive content from social media. This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2020 shared task. Our aim is to present deep learning models that detect hate speech and offensive content in three languages: English, Hindi, and German. Our team NSIT_ML_Geeks has developed models using Convolutional Neural Networks (CNN), Bi-directional Long Short-Term Memory (BiLSTM), and hybrid models (CNN+BiLSTM). GloVe and fastText word embeddings are used to convert our corpus into vectors of real numbers for training the models. Our best models for Hindi sub-task A and B secured First and Secon...
Forum for Information Retrieval Evaluation, 2021
The spread of offensive content online, such as hate speech, poses a growing societal problem. AI tools are necessary for supporting the moderation process at online platforms. Evaluating these identification tools requires continuous experimentation with data sets in different languages. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The data set was assembled from Twitter. This subtrack has two sub-tasks. Task A is a binary classification problem (Hate and Not Offensive) offered for all three languages. Task B is a fine-grained classification problem with three classes, HATE (hate speech), OFFENSIVE and PROFANITY, offered for English and Hindi. Overall, 652 runs were submitted by 65 teams. The best classification algorithms for task A achieved F1 measures of 0.91, 0.78 and 0.83 for Marathi, Hindi and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best performing algorithms were mainly variants of transformer architectures.
IIIT-Hyderabad at HASOC 2019: Hate Speech Detection
2019
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India. Abstract. Automatic identification of offensive language on social media platforms, especially Twitter, poses a great challenge to the AI community. The repercussions of such writing are hazardous to individuals, communities, organizations and nations. The HASOC shared task targets automatic detection of abusive language on Twitter in English, German and Hindi. As part of this task, we (team A3-108) submitted different machine learning and neural network based models for all the languages. Our best performing model was an ensemble of SVM, Random Forest and AdaBoost classifiers with majority voting.
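The majority-voting step mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `HOF`/`NOT` labels, and the hard-coded per-model predictions are assumptions made for illustration, standing in for the outputs of already-trained SVM, Random Forest and AdaBoost classifiers.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by the most models for a single item."""
    # most_common(1) yields [(label, count)] for the most frequent label.
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_model_labels):
    """Combine equal-length label lists (one per model) item by item."""
    return [majority_vote(votes) for votes in zip(*per_model_labels)]


# Hypothetical predictions from three classifiers on four posts.
svm_preds = ["HOF", "NOT", "HOF", "NOT"]
rf_preds = ["HOF", "HOF", "NOT", "NOT"]
ada_preds = ["NOT", "HOF", "HOF", "NOT"]

combined = ensemble_predict([svm_preds, rf_preds, ada_preds])
print(combined)
```

With three voters and two labels a tie cannot occur; other configurations would need an explicit tie-breaking rule.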
2020
The identification of Hate Speech in Social Media has received much attention in research recently. Demand is growing in particular for research in languages other than English. The Hate Speech and Offensive Content (HASOC) track has created resources for Hate Speech Identification in three different languages, namely Hindi, German, and English. We have participated in both Sub-tasks A and B of the 2020 shared task on hate speech and offensive content identification in Indo-European languages. Our approach combines a multilingual RoBERTa (a Robustly Optimized BERT Pretraining Approach) model with pre-trained vectors and a Random Forest model using Word2Vec, TF-IDF, and other textual features as input. Our system has achieved a maximum Macro F1-score of 50.28% for English Sub-task A, which is quite satisfactory relative to the performance of other systems, and secured 8th position among participating teams.
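Several of these abstracts report macro-averaged F1. As a rough sketch of how that metric is computed (per-class F1 averaged with equal weight per class), assuming illustrative `HOF`/`NOT` labels and toy data that do not come from any of the papers:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical gold labels and predictions).
gold = ["HOF", "HOF", "NOT", "NOT"]
pred = ["HOF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Because every class counts equally, a system that ignores a rare class is penalized more heavily than it would be under accuracy or micro-averaged F1.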
QMUL-NLP at HASOC 2019: Offensive Content Detection and Classification in Social Media
2019
With the development of the Internet, the Web has become an information dissemination platform, an information amplifier, and a new social medium. The information load and participation of the Internet far exceed those of existing traditional media, and various problems have emerged. There has been significant work in several languages, in particular for English. However, there is a lack of research on this recent and relevant topic for most other languages. This track intends to develop data and evaluation resources for several languages. The objectives are to stimulate research for these languages and to find out the quality of hate speech detection technology in other languages. The paper mainly describes the organization of the HASOC 2019 Task, a Shared Task on Hate Speech and Offensive Content Identification in Indo-European Languages. The task is organized in three related classification subtasks: subtask A is a coarse-grained binary classification to identify hate speech and offens...
2019
This paper reports the results obtained with Support Vector Machine and XGBoost methods by IRLab@IIT(BHU) on the HASOC shared task organized at FIRE 2019. The HASOC shared task has three subtasks, namely Hate speech identification, Offensive language identification and Fine-grained classification, for the English, Hindi and German languages. The best result for English is obtained by applying Support Vector Machine and XGBoost with a frequency-based feature for hate speech and offensive content identification.
2020
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), at the Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. Datasets for two Dravidian languages, viz. Malayalam and Tamil, of 4000 observations each, were shared by the HASOC organizers. These datasets are used to train the machine using different machine learning algorithms, based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels, offensive and not offensive. The machine is trained to classify such social media messages into these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of the Hate Speech text messages. These feature models are based on TF-IDF weights of n-grams. The referred work and respective experiments show that the features such as word, character ...
Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media
Lecture Notes in Electrical Engineering, 2021
Social networking platforms provide a conduit to disseminate our ideas, views and thoughts and proliferate information. This has led to the amalgamation of English with natively spoken languages. Hindi-English code-mixed data (Hinglish) is increasingly prevalent among most of the urban population all over the world. Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages. Thus, the worldwide hate speech detection rate of around 44% drops even further for content in Indian colloquial languages and slang. In this paper, we propose a methodology for efficient detection of hateful speech in unstructured code-mixed Hinglish text. Fine-tuning based approaches for Hindi-English code-mixed language are employed, utilizing contextual embeddings such as ELMo (Embeddings from Language Models), FLAIR, and transformer-based BERT (Bidirectional Encoder Representations from Transformers). Our proposed approach is compared against pre-existing methods on various datasets. Our model outperforms the other methods and frameworks.
2019
Recently, automated hate speech and offensive content identification has received significant attention due to the rapid propagation of cyberbullying, which undermines objective discussions in social media and adversely affects the outcome of online social democratic processes. A special type of Recurrent Neural Network (RNN) based deep learning approach called Long Short-Term Memory (LSTM) is implemented for automatic hate speech and offensive content identification. Separating offensive content is quite challenging because abusive language is subjective in nature and highly context dependent. This paper offers a language-agnostic solution in three Indo-European languages (English, German, and Hindi) since no pre-trained word embedding is used. Experimental results offer very attractive insights.
Automated Detection of Hate Speech and Profanity for Multiple and Mixed Languages
Detecting profanity, abuse or hate by building efficient Text Content Moderation systems has become an integral process and practice for various digital and online media platforms. Chat platforms and community discussion forums are essential customer services provided by many online media business platforms, allowing users to express their opinions. Many users tend to misuse this service to spread profane and abusive content online. The aim of this research is to design techniques and create independent models that are able to identify and detect instances of hate, abuse or profane content in English, Hindi and Hinglish. The inappropriate instances are further discriminated into classes, namely: (i) HATE (hate speech), (ii) OFFENSIVE (insulting, degrading) and (iii) PROFANE (swear words, cursing). This work, which aims to solve a prominent problem in the field of Natural Language Processing, leverages Machine Learning and Deep Learning models to perform text classification.