Offensive Text Detection Using Machine Learning Techniques

Quantitative Prediction of Offensiveness using Text Mining of Twitter Data

Virtual communities reflect worldwide connectivity and are an enabler of real-time information sharing and targeted advertising. Twitter has emerged as one of the most widely used microblogging services. It is a platform for sharing ideas, feelings and views about any event, and people are free to post Tweets about a particular event. The success of an event can be predicted from users’ responses, and individual interaction patterns can strongly indicate personalities. Garbage or nonsense replies can harm the credibility of an event. To make event feedback trustworthy, we performed sentiment analysis to predict offensiveness in Tweets. We collected data from the Twitter search and stream APIs. Text mining techniques (preprocessing, stemming, negation rule, tokenization and stop word removal) are used for cleaning data. Our approach can predict offensiveness in Tweets effectively. We also performed a comparative analysis of different machine learning classifiers, i.e., Naïve Bayes (NB), Support Vector Machine (SVM) and Logistic Regression (LR), to find sentiment polarity and found that SVM outperforms the others. An in-house tool, ‘Interaction Pattern Predictor’, was developed using the Python programming language. Our results are trustworthy as we have used three large data dictionaries to train our developed tool.
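The cleaning steps named in this abstract (tokenization, stop word removal, a negation rule and stemming) can be sketched as follows. This is a minimal illustration and not the authors' ‘Interaction Pattern Predictor’: the stop word list, the suffix-stripping stemmer, the `NOT_` prefix convention and the function names are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the paper's dictionary).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "this", "that",
              "to", "of", "and", "i", "at", "all"}
NEGATIONS = {"not", "no", "never"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip URLs/mentions/hashtags, tokenize, filter and stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = text.replace("n't", " not")          # crude: "isn't" -> "is not"
    tokens = re.findall(r"[a-z]+", text)        # tokenization

    result = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True                       # negation rule: mark next kept word
            continue
        if token in STOP_WORDS:
            continue                            # stop word removal
        stem = crude_stem(token)
        result.append("NOT_" + stem if negate else stem)
        negate = False
    return result


print(preprocess("This event isn't good at all! http://t.co/x @user"))
```

Marking the word after a negation keeps "good" and "not good" apart as features, which a plain bag-of-words representation would otherwise merge.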

Social media detection of political toxic speech using machine learning algorithms

Natural Language process, 2022

The world has seen a huge transnational adoption of online technologies, which encompass Twitter, WhatsApp, TikTok, Facebook and WeChat. This rise has been accompanied by a rise in hate speech on these platforms. The theoretical base of the research was Media Ecology Theory. The core premise of Media Ecology Theory is that society cannot avoid the effects of technology, and that technology will forever have an impact on nearly every aspect of modern life. Earlier studies have focused on hate speech, religious hate and political hate speech. Previous researchers have focused on single-language detection of hate speech, which needs high computing power. Furthermore, other researchers’ solutions were unable to collect sufficient crowdsourced data. The research used CRISP-DM to achieve the objectives. The data sets were extracted from Kaggle and Twitter, and all text data were cleaned before being fed to the classifiers in order to reduce noise. Tokenization, expansion of contractions, special character removal, stop word removal and lemmatization were the pre-processing steps used. Text cleaning was done using the nltk module with the help of the Python libraries TextBlob and VADER. Three classifiers, KNN, SVM and Naïve Bayes, were used in the study. The research used the compound score derived from the positive, negative and neutral sentiment scores to label data. Sentiment polarity score prediction reached 84% accuracy. Among the classifiers, SVM produced the best accuracy, 89%.
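Labelling by compound score, as described in this abstract, can be sketched as a simple thresholding step. The abstract does not state the cut-offs used; the ±0.05 threshold below is the convention commonly suggested for VADER and is an assumption here, as are the function name and the example scores.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound sentiment score in [-1, 1] to a class label.

    The threshold is an assumed value; the study's own cut-offs are not given.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores, e.g. as returned by a sentiment analyzer.
for score in (0.64, -0.51, 0.01):
    print(score, label_from_compound(score))
```

Labels produced this way are only as reliable as the underlying lexicon, so a classifier trained on them learns to reproduce the lexicon's judgments rather than human annotations.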

IRJET- Tweet Sentiment Analysis and Study and Comparison of Various Approaches and Classification Algorithms Used

IRJET, 2020

Since March 2006, Twitter has been a major face of social media. Twitter provides an efficient way for users to share data in textual, pictorial and video form. Users share multiple aspects of personal and public opinions on various events happening around them. This social interaction has both positive and negative aspects associated with it. Unfortunately, hate speech has been a major issue and, as a consequence, one of the major drawbacks of social media platforms. Despite numerous attempts, a perfect detection system is hard to develop because the definition of hate speech is vague and the intent of the writer is not always accurately reflected in the tweet. This study explains, in detail, the processes used to process the data (in the form of tweets) and classify them as positive, negative and neutral. This can help in reducing the rate of online harassment and hate speech. This study compares and evaluates various methods of data preprocessing, feature selection, and predictive algorithms. Text preprocessing techniques such as tokenization and vectorization, and supervised classification algorithms such as Logistic Regression, Decision Tree, Random Forest Classifier, kNN Classifier, Multinomial Naive Bayes and SVM-C, are evaluated in this study.

KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Codemix Social Media text

2020

This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), at the Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. Datasets for two Dravidian languages, viz. Malayalam and Tamil, each of 4000 observations, were shared by the HASOC organizers. These datasets are used to train the machine using different machine learning algorithms, based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels, offensive and not offensive. The machine is trained to classify such social media messages into these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of the hate speech text messages. These feature models are based on TF-IDF weights of n-grams. The referred work and respective experiments show that the features such as word, character ...
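TF-IDF weighting of word and character n-grams, the feature family this abstract refers to, can be sketched in a few lines. The abstract does not give the n-gram ranges or the exact weighting variant, so the version below uses the textbook form tf × log(N / df) and hypothetical helper names; it is an illustration, not the team's submitted system.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """All contiguous character n-grams of the text."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> list[str]:
    """All contiguous word n-grams, joined with single spaces."""
    words = text.split()
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each n-gram by relative term frequency times log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for grams in docs:
        df.update(set(grams))            # document frequency of each n-gram
    vectors = []
    for grams in docs:
        tf = Counter(grams)
        total = len(grams) or 1          # avoid division by zero on empty docs
        vectors.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in tf.items()}
        )
    return vectors


docs = [word_ngrams("you are bad", 1), word_ngrams("you are kind", 1)]
for vector in tfidf(docs):
    print(vector)
```

With this formula an n-gram that occurs in every document gets weight zero, so only the terms that distinguish one message from another contribute to the feature vector. Character n-grams are often preferred for code-mixed text because they tolerate spelling variation.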

An analysis of hateful contents detection techniques on social media

2016

Background: Detecting hateful content on social media has become a broad and important research area along with the popularity of social media. Objective: This paper aims primarily to understand the different techniques applied to detecting the use of hateful language on social media, and their strengths and challenges, to provide a solid and concrete reference for future researchers and practitioners. Methodology: In this paper, we investigated previous research in the domain of hateful content detection on social media. We selected relevant published journal articles and conference proceedings from 2010 to 2015. Results: We observed that the Support Vector Machine (SVM) algorithm is the most frequently applied for text classification. The data ambiguity problem, the classification of sarcastic sentences and the lack of necessary resources are identified as the difficulties for researchers in detecting the use of hateful content. Conclusion: Future researchers must pay more atten...

Multi-Class Sentiment Analysis of Social Media Data with Machine Learning Algorithms

Computers, materials & continua, 2021

The volume of social media data on the Internet is constantly growing. This has created a substantial research field for data analysts. The diversity of articles, posts, and comments on news websites and social networks astonishes the imagination. Nevertheless, most researchers focus on posts on Twitter that have a specific format and length restriction. The majority of them are written in the English language. As relatively few works have paid attention to sentiment analysis in the Russian and Kazakh languages, this article thoroughly analyzes news posts in the Kazakhstan media space. The amassed datasets include texts labeled according to three sentiment classes: positive, negative, and neutral. The datasets are highly imbalanced, with a significant predominance of the positive class. Three resampling techniques (undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE)) are used to resample the datasets to deal with this issue. Subsequently, the texts are vectorized with the TF-IDF metric and classified with seven machine learning (ML) algorithms: naïve Bayes, support vector machine, logistic regression, k-nearest neighbors, decision tree, random forest, and XGBoost. Experimental results reveal that oversampling and SMOTE with logistic regression, decision tree, and random forest achieve the best classification scores. These models are effectively employed in the developed social analytics platform.
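Of the three resampling techniques named in this abstract, plain random oversampling is the simplest to illustrate: minority-class examples are duplicated at random until every class matches the largest one. The sketch below is an assumption-laden illustration (hypothetical function name, toy data, fixed seed) and not the article's implementation; SMOTE differs in that it synthesizes new points by interpolating between neighbouring minority examples in feature space.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until all classes are equally large."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy data: three positive posts and one negative post.
texts = ["great news", "good result", "nice event", "terrible outcome"]
labels = ["positive", "positive", "positive", "negative"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.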

Hate Speech Classification Using SVM and Naive BAYES

2022

The spread of hatred that was formerly limited to verbal communications has rapidly moved over the Internet. Social media and community forums that allow people to discuss and express their opinions are becoming platforms for the spreading of hate messages. Many countries have developed laws to avoid online hate speech. They hold the companies that run the social media responsible for their failure to eliminate hate speech. But as online content continues to grow, so does the spread of hate speech. Manual analysis of hate speech on online platforms is infeasible because the huge amount of data makes it expensive and time consuming. Thus, it is important to automatically process online user content to detect and remove hate speech from online media. Many recent approaches suffer from an interpretability problem, which means that it can be difficult to understand why the systems make the decisions they do. In this work, some solutions to the problem of automatic detection of hate messages were proposed using Support Vector Machine (SVM) and Naïve Bayes algorithms. This achieved near state-of-the-art performance while being simpler and producing more easily interpretable decisions than other methods. Empirical evaluation of this technique has resulted in a classification accuracy of approximately 99% and 50% for SVM and NB respectively over the test set.

Hate Speech Detection in Social Media Using the Ensemble Learning Technique

Eswar Publications, 2023

Our lives have become intertwined with social media platforms such as Twitter, Facebook, LinkedIn, etc. They provide us with a platform to express our opinions and share our thoughts with the world. However, some individuals abuse the freedom of expression afforded to them by these platforms and use them to disseminate content that is derogatory and promotes hate speech. This has become a significant problem today, and detecting such content is a challenging task. In this research paper, we propose a solution for hate speech detection in social media using natural language processing techniques. We use a publicly available dataset provided by CrowdFlower and perform text pre-processing to clean the dataset. We then conduct feature engineering to extract key features that can be used in machine learning classification algorithms. We compare the performance of various algorithms on each feature set and conduct an in-depth analysis of the results obtained.

NIT_Agartala_NLP_Team at SemEval-2019 Task 6: An Ensemble Approach to Identifying and Categorizing Offensive Language in Twitter Social Media Corpora

Proceedings of the 13th International Workshop on Semantic Evaluation

The paper describes the systems submitted to OffensEval (SemEval 2019, Task 6) on 'Identifying and Categorizing Offensive Language in Social Media' by the 'NIT Agartala NLP Team'. An annotated Twitter dataset of 13,240 English tweets was provided by the task organizers to train the individual models, with the best results obtained using an ensemble model composed of six different classifiers. The ensemble model produced macro-averaged F1-scores of 0.7434, 0.7078 and 0.4853 on Subtasks A, B, and C, respectively. The paper highlights the overall low predictive nature of various linguistic features and surface level count features, as well as the limitations of a traditional machine learning approach when compared to a Deep Learning counterpart.
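The macro-averaged F1-score reported in this abstract is the unweighted mean of the per-class F1-scores, so a rare class counts as much as a frequent one. A minimal sketch of the computation follows; the function name, the labels and the predictions are illustrative assumptions, not the task's data or the team's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never true and never predicted scores 0 by convention.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up gold labels and predictions for a two-class task.
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
print(round(macro_f1(gold, pred), 4))
```

Because every class is weighted equally, a system that ignores a minority class is penalized heavily, which helps explain why scores tend to drop on fine-grained subtasks with skewed label distributions.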

Detection of Hate Speech and Offensive Language in Twitter Using Sentiment Analysis

IJRASET, 2021

The dramatic development of online media such as Twitter and community forums has revolutionized communication and content publishing, but is also increasingly exploited for the spread of hate speech and the organization of hate-based activities. The anonymity and mobility afforded by such media have made the breeding and spread of hate speech, eventually leading to hate crime, effortless in a virtual landscape beyond the realms of traditional law enforcement. Existing methods for the detection of hate speech primarily cast the problem as a supervised document classification task [33]. These can be divided into two categories: one relies on manual feature engineering, with the features then consumed by algorithms such as SVM, Naive Bayes, and Logistic Regression [3, 9, 11, 15, 19, 23, 35-39] (classic methods); the other represents the more recent deep learning paradigm, which employs neural networks to automatically learn multiple layers of abstract features from raw data [13, 26, 30, 34] (deep learning methods). In this work, we show that it is a much more challenging task, as our analysis of the language in the typical datasets shows that hate speech lacks unique, discriminative features and is therefore found in the 'long tail' of a dataset, where it is difficult to discover. We then propose Deep Neural Network structures serving as feature extractors that are particularly effective for capturing the semantics of hate speech. Our methods are evaluated on the largest collection of hate speech datasets based on Twitter, and are shown to outperform the state of the art by up to 6 percentage points in macro-average F1, or 9 percentage points in the more challenging case of identifying hateful content.
As a proxy to quantify and compare the linguistic characteristics of hate and non-hate Tweets, we also propose to study the 'uniqueness' of the vocabulary for each class.