Automatic Hate Speech Detection in English-Odia Code Mixed Social Media Data Using Machine Learning Techniques
Related papers
IJERT-Detection of Hate Speech using Text Mining and Natural Language Processing
International Journal of Engineering Research and Technology (IJERT), 2020
https://www.ijert.org/detection-of-hate-speech-using-text-mining-and-natural-language-processing https://www.ijert.org/research/detection-of-hate-speech-using-text-mining-and-natural-language-processing-IJERTV9IS110257.pdf Technology has connected people in remarkable ways. At the same time, the anonymity of social networks brings out the worst in some users in the form of hate speech. Hate speech on social media is a serious societal problem that can amplify violence ranging from lynching to ethnic cleansing. A critical task in automatic hate speech detection is distinguishing hate speech from other offensive language. Existing work that separates the two categories with lexical methods reported very low performance metrics, which led to major misclassification. Supervised machine learning approaches gave significantly better results in distinguishing hateful from offensive text, but the presence or absence of certain words in both classes can either help or hinder accurate classification. In this paper, tweets are classified into three classes (hate speech, offensive, and neither) using multi-class classifiers. Four classifiers are compared: Logistic Regression, Random Forest, Support Vector Machines (SVM), and Naïve Bayes. The Random Forest classifier performs well with almost all feature combinations, reaching a maximum accuracy of 0.90 with TF-IDF features.
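The TF-IDF features mentioned in the abstract above can be illustrated with a minimal sketch. It assumes one common textbook variant (term frequency normalized by document length, inverse document frequency as log(N / df)); the abstract does not state which variant the paper used, and the toy corpus is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes tf = count / document length and idf = log(N / df).
    This is one common variant, not necessarily the paper's formula.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, not taken from any dataset described here.
docs = ["you are awful", "you are kind", "have a kind day"]
vectors = tfidf(docs)
```

Terms that occur in fewer documents (here "awful") receive a larger weight than terms shared across documents (here "you"), which is what makes the weights useful as classifier inputs.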
2020
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), at the Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. Datasets for two Dravidian languages, Malayalam and Tamil, each of 4000 observations, were shared by the HASOC organizers. These datasets are used to train models with different machine learning algorithms, based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels, offensive and not offensive. The models are trained to classify such social media messages into these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of hate speech messages. These feature models are based on TF-IDF weights of n-grams. The referred work and respective experiments show that the features such as word, character ...
Social Media based Hate Speech Detection using Machine Learning
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2022
Hate speech is a crime that has been increasing in recent years, not only in person but also online. There are several causes for this. Social media has grown tremendously and promotes full freedom of expression through anonymity features. Freedom of expression is a human right, but hate speech directed at individuals or groups on the basis of race, caste, religion, ethnicity or nationality, gender, disability, gender identity, etc. is a violation of that right. It promotes violence and hate crimes, creates social imbalances, and undermines peace, trust and human rights. Detecting hate speech in social media discourse is a very important but complex task. On the one hand, the anonymity provided by the Internet, especially social networks, makes people more likely to engage in hostile behavior. On the other hand, the desire to express one's thoughts on the Internet has increased, leading to the spread of hate speech. Governments and social media platforms can benefit from detection and prevention technologies, as this kind of bigoted language can wreak havoc on society. We help resolve this dilemma by providing a systematic overview of research on this topic in this survey. This project aims to accurately predict the various forms of hate speech by addressing different categories of hate individually and examining a set of text mining functions.
An Improve Framework for hate speech detection using Machine Learning Approach
IJARCCE, 2021
Hate speech is any communication that disparages a person or a group on the basis of some characteristic such as race, identity, sex, sexual orientation, ethnicity, religion, or other characteristic. Harmful language (e.g., hate speech, abusive speech, or other offensive speech) primarily targets members of minority groups and can catalyze real violence towards them. The paper proposes an improved framework for hate speech detection using a machine learning approach. The system uses a Twitter dataset that contains tweets labelled as hate speech, as offensive language, or as neither. The dataset was downloaded from kaggle.com and contains a total of 24,784 tweets. The dataset is made up of 8 columns, which we later reduced to two columns by means of feature_extraction. The two remaining columns are the tweet column, which contains the tweeted messages, and the class column, which contains 0, 1 and 2, where 0 denotes hate speech, 1 denotes offensive language and 2 denotes neither hate speech nor offensive language. We trained our model using a support vector machine and a random forest classifier and obtained accuracies of 95% and 99%. We then deployed our model to the web using Python Flask for easy evaluation and testing. Our experimental results show that our proposed system had better performance in terms of classifying text as hate speech.
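The column reduction and label scheme described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the column names `tweet` and `class` follow the abstract's wording, while the other column names and the sample rows are hypothetical placeholders, not the actual Kaggle file.

```python
import csv
import io

# Label scheme as stated in the abstract.
LABELS = {0: "hate speech", 1: "offensive language", 2: "neither"}


def load_tweet_and_class(csv_file):
    """Keep only the 'tweet' and 'class' columns of each row."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append((row["tweet"], int(row["class"])))
    return rows


# Hypothetical stand-in for the downloaded file; the extra
# column names ('id', 'annotator_count') are invented for illustration.
sample = io.StringIO(
    "id,annotator_count,class,tweet\n"
    "1,3,2,good morning everyone\n"
    "2,3,1,placeholder offensive text\n"
)
data = load_tweet_and_class(sample)
named = [(text, LABELS[label]) for text, label in data]
```

Keeping the integer labels for training and mapping them to names only for display avoids ambiguity about which class each number stands for.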
Automatic Hate Speech Detection using Machine Learning: A Comparative Study
International Journal of Advanced Computer Science and Applications, 2020
The increasing use of social media and information sharing has given major benefits to humanity. However, this has also given rise to a variety of challenges including the spreading and sharing of hate speech messages. Thus, to solve this emerging issue in social media sites, recent studies employed a variety of feature engineering techniques and machine learning algorithms to automatically detect the hate speech messages on different datasets. However, to the best of our knowledge, there is no study to compare the variety of feature engineering techniques and machine learning algorithms to evaluate which feature engineering technique and machine learning algorithm outperform on a standard publicly available dataset. Hence, the aim of this paper is to compare the performance of three feature engineering techniques and eight machine learning algorithms to evaluate their performance on a publicly available dataset having three distinct classes. The experimental results showed that the...
Identification of HATE speech tweets in Pashto language using Machine Learning techniques
International Journal of Advanced Trends in Computer Science and Engineering, 2021
In recent years, researchers have been strongly drawn to sentiment analysis, especially hate speech detection systems. The proliferation of hate speech in different languages has drawn considerable attention on social media. Hate speech has a great impact on society; using hateful words harms the dignity of others. Hate speech detection systems are important to stop hateful words from turning into crimes. In this research, a framework is developed for a hate speech detection system in the Pashto language. A dataset is created from data collected from Twitter, because no related data is available. Most research in this domain has been done for other languages, where hate speech detection is very mature. For morphological languages, however, not much work has been done, especially for Pashto. This research collected tweets related to ethnicity and religion from Twitter. The collected data was annotated manually and categorized as hate or not by comparing it with offensive content. To assess the impact of different features/attributes on hate speech detection, this study ran experiments with existing classifiers, i.e., SVM, Naïve Bayes, Decision Tree and KNN. SVM produced the highest result among all classifiers, 74%, on a dataset of 500. KNN and Decision Tree produced the same result, 65.0%, on a dataset of 1500. On a dataset of 2800, Decision Tree produced the highest result, 72%, and SVM produced 71.9%.
Hate Speech Analysis Using Machine Learning
Hate speech is usually defined as any form of communication that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic. Opinionated text has created a new area of research in text analysis. The traditional fact- and information-centric view of text has been expanded to enable sentiment-aware applications. Nowadays, the increased use of the internet and online activities such as ticket booking, online transactions, e-commerce, social media communication, blogging and forums has led to the need for extraction, transformation and analysis of huge amounts of information. Hence, new methods are needed to analyze and summarize this information (Kumar et al 2015). Previous research faces the problem that users can obfuscate tweets to beat state-of-the-art hate speech detection by using new slang words or inventive spellings that are not available in popular pre-trained word embeddings such as Word2Vec or GloVe, but are highly common in hateful comments.
Hate Speech Classification Using SVM and Naive BAYES
2022
The spread of hatred that was formerly limited to verbal communication has rapidly moved onto the Internet. Social media and community forums that allow people to discuss and express their opinions are becoming platforms for spreading hate messages. Many countries have developed laws against online hate speech. They hold the companies that run social media responsible for failing to eliminate hate speech. But as online content continues to grow, so does the spread of hate speech. Manual analysis of hate speech on online platforms is infeasible because the huge amount of data makes it expensive and time consuming. Thus, it is important to process online user content automatically to detect and remove hate speech from online media. Many recent approaches suffer from an interpretability problem, meaning that it can be difficult to understand why the systems make the decisions they do. In this work, solutions to the problem of automatic detection of hate messages are proposed using Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms. This achieved near state-of-the-art performance while being simpler and producing more easily interpretable decisions than other methods. Empirical evaluation of this technique resulted in a classification accuracy of approximately 99% for SVM and 50% for NB on the test set.
Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media
Lecture Notes in Electrical Engineering, 2021
Social networking platforms provide a conduit to disseminate our ideas, views and thoughts and to proliferate information. This has led to the amalgamation of English with natively spoken languages. Hindi-English code-mixed data (Hinglish) is on the rise among most of the urban population all over the world. Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages. Thus, the worldwide hate speech detection rate of around 44% drops even further for content in Indian colloquial languages and slang. In this paper, we propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish text. Fine-tuning based approaches for Hindi-English code-mixed language are employed, utilizing contextual embeddings such as ELMo (Embeddings from Language Models), FLAIR, and transformer-based BERT (Bidirectional Encoder Representations from Transformers). Our proposed approach is compared against pre-existing methods on various datasets. Our model outperforms the other methods and frameworks.
2021
This paper describes the system submitted by our team KBCNMUJAL for Task 2 of the shared task "Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC)" at FIRE 2020. Datasets for two Dravidian languages, Malayalam and Tamil, of 4000 rows each, were shared by the HASOC organizers. These datasets are used to train models with different machine learning algorithms, based on classification and regression models. The datasets consist of Twitter messages with two class labels, "offensive" and "not offensive". The models are trained to classify such social media messages into these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of hate speech messages. These feature models are based on TF-IDF weights of n-grams. The referred work and respective experiments show that features such as word n-grams, character n-grams, and a combined model of word and character n-grams can be used to identify the term patterns of offensive text content. As part of the HASOC shared task at FIRE 2020, the test datasets were made available by the HASOC track organisers. The best performing classification models developed for both languages were applied to the test datasets. The model with the highest accuracy on the Malayalam training dataset was used to predict the labels of the corresponding test dataset. This system obtained an F1 score of 0.77 and a HASOC rank of 2. Similarly, the best performing model for Tamil obtained an F1 score of 0.87 and ranked 3rd in the shared task. The system is named HASOC_kbcnmujal after our team.
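The word, character, and combined n-gram feature sets described in the abstract above can be sketched as below. The function names, the n-gram ranges, and the `w:`/`c:` prefixes are assumptions made for illustration; the abstract does not specify the team's actual settings, and the raw counts produced here would still need TF-IDF weighting.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all runs of n consecutive characters, spaces included."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_features(text, word_n=(1, 2), char_n=(2, 3)):
    """Count word and character n-grams in a single feature dictionary.

    The prefixes keep a word unigram such as 'ab' distinct from the
    character bigram 'ab'.
    """
    features = Counter()
    for n in word_n:
        features.update("w:" + gram for gram in word_ngrams(text, n))
    for n in char_n:
        features.update("c:" + gram for gram in char_ngrams(text, n))
    return features
```

Character n-grams are often helpful for code-mixed or creatively spelled text, because a misspelled word still shares most of its character sequences with the standard spelling.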