Text Mining Techniques for Cyberbullying Detection: State of the Art
Advances in Science, Technology and Engineering Systems Journal
The dramatic growth of social media in recent years has been accompanied by the emergence of new types of bullying. Platforms such as Facebook, Twitter, YouTube, and others are now privileged channels for disseminating all kinds of information. Indeed, the ability to communicate through social media without revealing one's real identity has created an ideal atmosphere for cyberbullying, where people can pour out their hatred. It has therefore become urgent to find automated methods for detecting cyberbullying through text mining techniques. Many researchers have recently investigated various approaches, and the number of scientific studies on this topic is growing rapidly. Nonetheless, the methods used to classify the phenomenon, and the methods used to evaluate them, are still under discussion. Consequently, comparing results across studies and assessing their performance remains difficult. The current systematic review was therefore conducted to survey the studies carried out so far by the research community on text-based cyberbullying classification. To direct future studies toward a more consistent and compatible perspective on recent work, we undertook an in-depth review of the evaluation methods, features, dataset size, language, and dataset source of the latest research in this field. We chose to focus on techniques that adopt neural networks and machine learning algorithms. After conducting systematic searches and applying the inclusion criteria, 16 studies were included. The best accuracy was achieved with deep learning approaches, particularly convolutional neural networks (CNN). The support vector machine (SVM) was found to be the most common classifier for both Arabic and Latin languages, and it outperformed the other classifiers. The most widely used features are N-grams, especially bigrams and trigrams.
Furthermore, the results show that Twitter is the main source of the collected datasets and that no unified datasets exist. There is also a shortage of studies on cyberbullying identification in Arabic texts compared with English texts.
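To make the N-gram features mentioned above concrete, the following is a minimal sketch of word-level bigram and trigram counting using only the Python standard library. It is not taken from any of the reviewed studies. The function names `word_ngrams` and `ngram_features`, the simple regular-expression tokenizer, and the example sentence are all illustrative assumptions.

```python
from collections import Counter
import re


def word_ngrams(text, n):
    """Return the word-level n-grams of `text` as a list of tuples."""
    # Assumption: lowercase the text and treat runs of word characters
    # (plus apostrophes) as tokens; real systems use richer preprocessing.
    tokens = re.findall(r"[\w']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, sizes=(2, 3)):
    """Count bigrams and trigrams, giving a sparse bag-of-n-grams vector."""
    counts = Counter()
    for n in sizes:
        counts.update(word_ngrams(text, n))
    return counts


# Hypothetical toy message, invented for illustration only.
features = ngram_features("you are so stupid you are so annoying")
print(features.most_common(3))
```

In a full pipeline, counts such as these would typically be converted into a fixed-length vector and passed to a classifier such as an SVM. That step is omitted here to keep the sketch free of external dependencies.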