Combining Psycho-linguistic, Content-based and Chat-based Features to Detect Predation in Chatrooms

A Learning-Based Approach for the Identification of Sexual Predators in Chat Logs

2012

The existence of sexual predators who enter chat rooms or forums and try to convince children to provide sexual favours is a socially worrying issue. Manually monitoring these interactions is one way to attack this problem. However, a manual approach cannot keep pace with the high number of conversations and the huge number of chat rooms and forums where these conversations take place every day. We need tools that automatically process massive amounts of conversations and raise alerts about possible offenses. The sexual predator identification challenge within PAN 2012 is a valuable way to promote research in this area. Our team approached this task as a machine learning problem and designed several innovative sets of features that guide the construction of classifiers for identifying sexual predation. Our methods are driven by psycholinguistic, chat-based, and tf/idf features and yield very effective classifiers.
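As a rough illustration of the tf/idf feature family mentioned in this abstract, the sketch below computes term weights for a few toy messages. It assumes whitespace tokenization and the plain `tf * log(N / df)` weighting; the abstract does not say which variant the authors used, and the function name `tfidf` and the sample messages are invented for this example.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, using raw tf * log(N / df)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Invented toy messages, not taken from any corpus.
chats = ["hi how are you", "how old are you", "see you at school"]
vectors = tfidf(chats)
```

A term that occurs in every document, such as "you" here, receives weight zero, which is why tf/idf tends to emphasize vocabulary that distinguishes one conversation from the others.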

Sexual predator detection in chats with chained classifiers

This paper describes a novel approach for sexual predator detection in chat conversations based on sequences of classifiers. The proposed approach divides documents into three parts, which, we hypothesize, correspond to the different stages that a predator employs when approaching a child. Local classifiers are trained for each part of the documents and their outputs are combined by a chain strategy: predictions of a local classifier are used as extra inputs for the next local classifier. Additionally, we propose a ring-based strategy, in which the chaining process is iterated several times, with the goal of further improving the performance of our method. We report experimental results on the corpus used in the first international competition on sexual predator identification (PAN'12). Experimental results show that the proposed method outperforms a standard (global) classification technique for the different settings we consider. In addition, the proposed method compares favorably with most methods evaluated in the PAN'12 competition.
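The chain strategy described in this abstract can be sketched at prediction time as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the local classifiers are passed in as plain callables (training is omitted), documents are split into equal-sized contiguous parts, and the ring variant is interpreted as feeding the last prediction back into the first classifier on the next pass. The names `split_into_parts`, `chain_predict`, `count_features`, and `toy_classifier` are all hypothetical.

```python
def split_into_parts(messages, n_parts=3):
    """Split a conversation (list of messages) into n_parts contiguous segments."""
    size, extra = divmod(len(messages), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional message.
        end = start + size + (1 if i < extra else 0)
        parts.append(messages[start:end])
        start = end
    return parts

def chain_predict(parts, featurize, local_classifiers, rounds=1):
    """Run local classifiers in order; each sees the previous prediction as an extra input.

    rounds > 1 is an assumed reading of the ring strategy: the final prediction
    of one pass becomes the extra input of the first classifier in the next pass.
    """
    previous = 0.0
    predictions = []
    for _ in range(rounds):
        predictions = []
        for part, classify in zip(parts, local_classifiers):
            features = featurize(part) + [previous]
            previous = classify(features)
            predictions.append(previous)
    return predictions

# Hypothetical stand-ins for real feature extraction and trained models.
def count_features(part):
    return [len(part)]

def toy_classifier(features):
    return min(1.0, 0.1 * features[0] + 0.5 * features[1])

conversation = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]
scores = chain_predict(split_into_parts(conversation), count_features,
                       [toy_classifier] * 3)
```

In a real system each entry of `local_classifiers` would be a separately trained model for its segment, and `featurize` would produce text features rather than a message count.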

Child Predator Detection in Online Chat Conversation using Support Vector Machine

2021

The growth of Internet use and easier access to social media platforms have helped predators establish online relationships with children, which has led to an increase in online solicitation. We propose a system that detects predators in online chats using text classification. In this paper, a machine learning algorithm, the support vector machine (SVM), is used to identify cyber predators. The main objective of our system is to detect child predators based on the chats, comments, and posts of social media accounts and to send the predator's record to the cyber cell admin; the PAN12 dataset is used for text classification. This paper presents our current progress toward building the child predator detection system using SVM text classification.

Detection of child exploiting chats from a mixed chat dataset as a text classification task

2011

Detection of child exploitation in Internet chatting is an important issue for the protection of children from prospective online paedophiles. This paper investigates the effectiveness of text classifiers in identifying Child Exploitation (CE) in chats. Because chatting takes place between two or more users who type text, the text of chat messages can be used as the data to be analysed by text classifiers. The problem of identifying CE chats can therefore be framed as a text classification problem in which chat logs are categorized into predefined CE types. Along with three traditional text categorization techniques, a new approach is introduced to accomplish the task. Psychometric and categorical information from LIWC (Linguistic Inquiry and Word Count) is used, and it improves the performance of some classifiers. For the experiments in this research, the chat logs were collected from various websites open to the public. Classification-via-Regression, J48 Decision Tree and Naïve Bayes classifiers are used. A comparison of the performance of the classifiers is shown in the results.

Automated Identification of Child Abuse in Chat Rooms by Using Data Mining

Data Mining Trends and Applications in Criminal Science and Investigations, 2000

Providing a safe environment for juveniles and children in online social networks is considered one of the major factors in improving public safety. Due to the prevalence of online conversations, mitigating the undesirable effects of child abuse in cyberspace has become essential. Using automatic ways to combat this kind of crime is challenging and demands efficient and scalable data mining techniques. The problem can be cast as a combination of textual preprocessing in data/text mining and pattern classification in machine learning. This chapter covers different data mining methods, including preprocessing, feature extraction and the popular ways of enriching features by extracting sentiments and emotional features. A brief tutorial on classification algorithms in the domain of automated predator identification is also presented in the chapter. Finally, the discussion is summarized and the challenges and open issues in this application domain are discussed.

Advanced Data Preprocessing for Detecting Cybercrime in Text-Based Online Interactions

Pattern Recognition and Artificial Intelligence, 2020

Social media provides a powerful platform for individuals to communicate globally. This capability has many benefits, but it can also be used by malevolent individuals, i.e. predators. Anonymity exacerbates the problem. The motivation of our work is to help protect our children from this potentially hostile environment, without excluding them from utilizing its benefits. In our research, we aim to develop an online sexual predator identification system, designed to detect cybercrime related to child grooming. We will use AI techniques to analyze chat interactions available from different social networks. However, before any meaningful analysis can be carried out, chats must be preprocessed into a consistent and suitable format. This task poses challenges in itself. In this paper we show how different and diverse chat formats can be automatically normalized into a consistent text-based format that can be subsequently used for analysis.
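The normalization step this abstract describes can be pictured with a small sketch. The two input layouts below are invented purely for illustration (the abstract does not list the actual chat formats handled), as are the `Message` record and the `normalize_line` and `normalize_log` helpers. The idea shown is only the general one: try a set of format-specific patterns and map every match onto a single common record.

```python
import re
from typing import NamedTuple, Optional

class Message(NamedTuple):
    time: str
    author: str
    text: str

# Hypothetical layouts:  "[12:01] alice: hello"  and  "bob (12:02) - hi"
PATTERNS = [
    re.compile(r"^\[(?P<time>\d{2}:\d{2})\]\s+(?P<author>[^:]+):\s*(?P<text>.*)$"),
    re.compile(r"^(?P<author>\S+)\s+\((?P<time>\d{2}:\d{2})\)\s+-\s+(?P<text>.*)$"),
]

def normalize_line(line: str) -> Optional[Message]:
    """Return a Message for the first pattern that matches, else None."""
    line = line.strip()
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return Message(match["time"], match["author"].strip(),
                           match["text"].strip())
    return None

def normalize_log(lines):
    """Normalize every parseable line; unparseable lines are skipped."""
    return [msg for msg in map(normalize_line, lines) if msg is not None]

log = normalize_log(["[12:01] alice: hello there", "bob (12:02) - hi", "???"])
```

Skipping unparseable lines is a simplification; a production pipeline would more likely log or quarantine them so that no part of a conversation is silently lost.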

The Detection of Sexual Harassment and Chat Predators Using Artificial Neural Network

Karbala international journal of modern science, 2021

The vast increase in the use of social media sites like Twitter and Facebook has led to frequent sexual harassment on the Internet, which is considered a major societal problem. This paper aims to detect sexual harassment and cyber predators at an early phase. We used deep learning, specifically bidirectional long short-term memory. Word representations that map text to real-valued vectors are carefully reviewed. For the chat sexual predator detection approach, the proposed model obtained its best result of 0.927 as measured by F0.5-score, with an accuracy of 97.27%. For the comment sexual harassment detection approach, the F0.5-score is 0.925 and the accuracy is 99.12%.

Recognizing Predatory Chat Documents using Semi-supervised Anomaly Detection

Electronic Imaging, 2016

Chat logs are informative documents available to today's social network providers. Providers and law enforcement tend to use these huge logs anonymously for automatic online Sexual Predator Identification (SPI), which is a relatively new area of application. The task plays an important role in protecting children and juveniles against exploitation by online predators. Pattern recognition techniques facilitate automatic identification of harmful conversations in cyberspace by law enforcement. These techniques usually require a large volume of high-quality training instances of both predatory and non-predatory documents. However, collecting non-predatory documents is not practical in real-world applications, since this category contains a large variety of documents on many topics, including politics, sports, science, and technology. We used a new semi-supervised approach to mitigate this problem by adapting an anomaly detection technique called One-class Support Vector Machine, which does not require non-predatory samples for training. We compared the performance of this approach against other state-of-the-art methods that use both positive and negative instances. We observed that although the anomaly detection approach uses only one class label for training (a very desirable property in practice), its performance is comparable to that of binary SVM classification. In addition, this approach outperforms the classic two-class Naïve Bayes algorithm, which we used as our baseline, in terms of both classification accuracy and precision.

Automatically Identifying Online Grooming Chats Using CNN-based Feature Extraction

2021

With the increasing importance of social media in everyone's life, the risk of its misuse by criminals is also increasing. Children in particular are at risk of becoming victims of online-related crime, especially sexual abuse. For example, sexual predators use online grooming to gain the trust of children and young adults. In this paper, a two-step approach using a CNN to identify sexual predators in social networks is proposed. For the identification of a sexual predator profile, an F0.5 score of 0.79 and an F2 score of 0.98 were obtained. The score was lower for the identification of the specific line that initiated the grooming process (F2 = 0.61).
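The F0.5 and F2 scores quoted in this abstract are instances of the general F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below implements that standard formula. The precision and recall values in the usage lines are made up to show how the two scores can diverge; they are not the paper's underlying numbers, which the abstract does not report.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only: a classifier with perfect recall but modest precision.
f_half = f_beta(precision=0.5, recall=1.0, beta=0.5)
f_two = f_beta(precision=0.5, recall=1.0, beta=2.0)
```

With these illustrative inputs the F2 score comes out well above the F0.5 score, the same direction as the gap reported in the abstract, which is what one would expect from a system whose recall exceeds its precision.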

A Novel Way of Identifying Cyber Predators

ArXiv, 2017

Recurrent Neural Networks with Long Short-Term Memory cells (LSTM-RNN) have an impressive ability in sequence data processing, particularly for language model building and text classification. This research proposes the combination of sentiment analysis, a new approach to sentence vectors and LSTM-RNN as a novel way to perform Sexual Predator Identification (SPI). An LSTM-RNN language model is applied to generate sentence vectors, which are the last hidden states in the language model. Sentence vectors are fed into another LSTM-RNN classifier in order to capture suspicious conversations. The hidden state makes it possible to generate vectors for sentences never seen before. fastText is used to filter the contents of conversations and generate a sentiment score in order to identify potential predators. The experiment achieves a record-breaking accuracy and precision of 100% with a recall of 81.10%, exceeding the top-ranked result in the SPI competition.