FoSIL - Offensive language classification of German tweets combining SVMs and deep learning techniques
Related papers
Proceedings of the 13th International Workshop on Semantic Evaluation, 2019
This paper describes our system submissions as part of our participation (team name: JU ETCE 17 21) in the SemEval 2019 shared task 6: "OffensEval: Identifying and Categorizing Offensive Language in Social Media". We participated in all three sub-tasks: i) Sub-task A: offensive language identification, ii) Sub-task B: automatic categorization of offense types, and iii) Sub-task C: offense target identification. We employed both machine learning and deep learning approaches for the sub-tasks, using a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units, together with pre-trained word embeddings (both word2vec and GloVe). We obtained the best F1-score using the CNN-based model for sub-task A, the LSTM-based model for sub-task B, and the Logistic Regression-based model for sub-task C. Our best submissions achieved F1-scores of 0.7844, 0.5459, and 0.48 for sub-tasks A, B, and C respectively.
Association for Computational Linguistics, 2019
This paper describes the submissions of our team, HAD-Tübingen, for SemEval 2019 Task 6: "OffensEval: Identifying and Categorizing Offensive Language in Social Media". We participated in all three sub-tasks: Sub-task A, "Offensive language identification"; Sub-task B, "Automatic categorization of offense types"; and Sub-task C, "Offense target identification". As a baseline model we used a Long Short-Term Memory recurrent neural network (LSTM) to identify and categorize offensive tweets. For all the tasks we experimented with external databases in a postprocessing step to enhance our model's results. The best macro-averaged F1 scores obtained for sub-tasks A, B, and C are 0.73, 0.52, and 0.37, respectively.
upInf - Offensive Language Detection in German Tweets
As part of the GermEval 2018 shared task we developed a system that detects offensive speech in German tweets. To increase the size of the existing training set we built an application for gathering trending tweets in Germany, which also assists in the manual annotation of those tweets. The main part of the training data consists of the set provided by the organizers of the shared task. We implemented three different models. The first follows an n-gram approach. The second uses word vectors to create word clusters, which yields an additional set of features. The last is a composition of a recurrent and a convolutional neural network. We evaluated our approaches by splitting the given data into train, validation, and test sets. The final evaluation was done by the organizers of the task, who compared our predicted results with the unpublished ground truth.
Towards the Automatic Classification of Offensive Language and Related Phenomena in German Tweets
2018
In recent years the automatic detection of abusive language, offensive language, and hate speech in several different forms of online communication has received a lot of attention from the Computational Linguistics and Language Technology community. While most approaches work on English data, publications on other languages are rare. This paper, submitted to the GermEval 2018 Shared Task on the Identification of Offensive Language, reports the results of several experiments on classifying offensive language in German-language tweets.
RGCL at GermEval 2019: Offensive Language Detection with Deep Learning
KONVENS, 2019
This paper describes the system submitted by the RGCL team to GermEval 2019 Shared Task 2: Identification of Offensive Language. We experimented with five different neural network architectures for classifying tweets in terms of offensive language. By means of comparative evaluation, we selected the best-performing architecture for each of the three subtasks. Overall, we demonstrate that with only minimal preprocessing we are able to obtain competitive results.
Proceedings of the 13th International Workshop on Semantic Evaluation, 2019
Offensive language identification (OLI) in user-generated text is the automatic detection of any profanity, insult, obscenity, racism, or vulgarity that degrades an individual or a group. It is useful for hate speech detection, flame detection, and cyberbullying detection. Given the immense growth in access to social media, OLI helps curb abuse and harm. In this paper, we present deep and traditional machine learning approaches for OLI. In the deep learning approach, we build models using a bi-directional LSTM with different attention mechanisms; in the traditional machine learning approach, we use TF-IDF weighting schemes with two classifiers, Multinomial Naive Bayes and a Support Vector Machine trained with a Stochastic Gradient Descent optimizer. The approaches are evaluated on the OffensEval@SemEval2019 dataset, and our team SSN NLP submitted runs for the three tasks of the OffensEval shared task. The best runs of SSN NLP obtained F1 scores of 0.53, 0.48, and 0.3 and accuracies of 0.63, 0.84, and 0.42 for tasks A, B, and C respectively. Our approaches improved the baseline F1 scores by 12%, 26%, and 14% for tasks A, B, and C respectively.
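The traditional pipeline described above, TF-IDF features feeding an SVM trained by stochastic gradient descent, can be sketched in a few lines of scikit-learn. This is a minimal illustration under assumed settings, not the authors' implementation, and the toy tweets and labels are invented:

```python
# Minimal sketch of a TF-IDF + SVM-with-SGD pipeline (scikit-learn).
# The tiny training set below is invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

train_texts = ["you are awful", "have a nice day",
               "what an idiot", "great work everyone"]
train_labels = ["OFF", "NOT", "OFF", "NOT"]

# loss="hinge" makes SGDClassifier a linear SVM trained by SGD
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      SGDClassifier(loss="hinge", random_state=0))
model.fit(train_texts, train_labels)
print(model.predict(["have a great day"]))
```

A Multinomial Naive Bayes variant of the same pipeline would swap `SGDClassifier` for `MultinomialNB`.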
CUSATNLP@DravidianLangTech-EACL2021: Language Agnostic Classification of Offensive Content in Tweets
2021
Identifying offensive information in tweets is a vital language processing task, and recent work has concentrated mostly on English and a few other widely used languages. In this shared task on Offensive Language Identification in Dravidian Languages, part of the First Workshop on Speech and Language Technologies for Dravidian Languages at EACL 2021, the aim is to identify offensive content in code-mixed text in the Dravidian languages Kannada, Malayalam, and Tamil. Our team used language-agnostic BERT (Bidirectional Encoder Representations from Transformers) for sentence embedding together with a softmax classifier. This language-agnostic, representation-based classification achieved good performance for all three languages; the results for Malayalam were strong enough to place third among the participating teams.
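The second stage of the approach above, a softmax classifier over fixed sentence embeddings, can be sketched as follows. Random vectors stand in for the language-agnostic BERT embeddings and the labels are invented, so this illustrates only the classifier, not the embedding model:

```python
import numpy as np

# Softmax (multinomial logistic) classifier over fixed sentence embeddings.
# Random vectors stand in for 768-dimensional BERT sentence embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 768))            # stand-in sentence embeddings
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 0 = not offensive, 1 = offensive

W = np.zeros((768, 2))
b = np.zeros(2)
for _ in range(200):                      # gradient descent on cross-entropy
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)     # softmax probabilities
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1       # dL/dlogits for cross-entropy
    W -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean(axis=0)

acc = float((np.argmax(X @ W + b, axis=1) == y).mean())
print(acc)  # training accuracy on the toy data
```

In practice the embeddings would come from a frozen multilingual encoder, which is what makes the classifier itself language-agnostic.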
2020
In this paper, we present our approaches and results for SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval 2020). OffensEval 2020 had three subtasks: A) identifying whether a tweet is offensive (OFF) or non-offensive (NOT) for the Arabic, Danish, English, Greek, and Turkish languages; B) detecting whether an offensive tweet is targeted (TIN) or untargeted (UNT), for English; and C) categorizing offensive targeted tweets into three classes, namely individual (IND), group (GRP), or other (OTH), for English. We participated in all three subtasks. In our solution, we first use the pre-trained BERT model for subtasks A, B, and C, and then apply a BiLSTM model with an attention mechanism (Attn-BiLSTM) to the same. Our results demonstrate that the pre-trained model does not give good results for all types of languages and is compute- and memory-intensive, whereas the Attn-BiLSTM model is fast and gives good...
bhanodaig at SemEval-2019 Task 6: Categorizing Offensive Language in social media
Proceedings of the 13th International Workshop on Semantic Evaluation, 2019
This paper describes the work that our team bhanodaig at the Indian Institute of Technology (ISM) did for OffensEval, i.e. identifying and categorizing offensive language in social media. Of the three sub-tasks, we participated in Sub-task B: automatic categorization of offense types, determining whether a tweet is a targeted insult or untargeted. We used a Linear Support Vector Machine for classification. The official ranking metric is macro-averaged F1; our system achieved a score of 0.5282 with an accuracy of 0.8792. As new entrants to the field, we find these scores encouraging enough to work toward better results in the future.
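The setup described above, a Linear SVM for the targeted-insult vs. untargeted distinction evaluated with macro-averaged F1, can be sketched with scikit-learn. The toy tweets and labels are invented, and the features here are plain TF-IDF, an assumption rather than the authors' exact feature set:

```python
# Minimal sketch: Linear SVM for targeted (TIN) vs. untargeted (UNT)
# classification, scored with macro-averaged F1. Toy data, invented labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

texts = ["@user you are a fool", "this whole thing is garbage",
         "@user what a disgrace", "everything here is rubbish"]
labels = ["TIN", "UNT", "TIN", "UNT"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

# Macro-averaged F1 is the unweighted mean of the per-class F1 scores,
# so both classes count equally regardless of their frequency.
macro_f1 = f1_score(labels, clf.predict(texts), average="macro")
print(round(macro_f1, 2))
```

Macro averaging is the natural ranking metric here because offensive-language datasets are typically class-imbalanced.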