Chat mining: Predicting user and message attributes in computer-mediated communication (original) (raw)

A feature selection method for author identification in interactive communications based on supervised learning and language typicality

Engineering Applications of Artificial Intelligence, 2016

Authorship attribution, conceived as the identification of the origin of a text between different authors, has been a very active area of research in the scientific community mainly supported by advances in Natural Language Processing (NLP), machine learning and Computational Intelligence. This paradigm has been mostly addressed from a literary perspective, aiming at identifying the stylometric features and writeprints which unequivocally typify the writer patterns and allow their unique identification. On the other hand, the upsurge of social networking platforms and interactive messaging have undoubtedly made the anonymous expression of feelings, the sharing of experiences and social relationships much easier than in other traditional communication media. Unfortunately, the popularity of such communities and the virtual identification of their users deploy a rich substrate for cybercrimes against unsuspecting victims and other forms of illegal uses of social networks that call for the activity tracing of accounts. In the context of oneto-one communications this manuscript postulates the identification of the sender of a message as a useful approach to detect impersonation attacks in interactive communication scenarios. In particular this work proposes to select linguistic features extracted from messages via NLP techniques by means of a novel feature selection algorithm based on the dissociation between essential traits of the sender and receiver influences. The performance and computational efficiency of different supervised learning models when incorporating the proposed feature selection method is shown to be promising with real SMS data in terms of identification accuracy, and paves the way towards future research lines focused on applying the concept of language typicality in the discourse analysis field.

A framework for authorship identification of online messages: Writing-style features and classification techniques

Journal of The American Society for Information Science and Technology, 2006

With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. We developed a framework for authorship identification of online messages to address the identity-tracing problem. In this framework, four types of writing-style features (lexical, syntactic, structural, and content-specific features) are extracted and inductive learning algorithms are used to build feature-based classification models to identify authorship of online messages. To examine this framework, we conducted experiments on English and Chinese online-newsgroup messages. We compared the discriminating power of the four types of features and of three classification techniques: decision trees, backpropagation neural networks, and support vector machines. The experimental results showed that the proposed approach was able to identify authors of online messages with satisfactory accuracy of 70 to 95%. All four types of message features contributed to discriminating authors of online messages. Support vector machines outperformed the other two classification techniques in our experiments. The high performance we achieved for both the English and Chinese datasets showed the potential of this approach in a multiple-language context.

Conversationally-inspired Stylometric Features for Authorship Attribution in Instant Messaging

Authorship attribution (AA) aims at recognizing automati-cally the author of a given text sample. Traditionally applied to literary texts, AA faces now the new challenge of recog-nizing the identity of people involved in chat conversations. These share many aspects with spoken conversations, but AA approaches did not take it into account so far. Hence, this paper tries to fill the gap and proposes two novelties that improve the effectiveness of traditional AA approaches for this type of data: the first is to adopt features inspired by Conversation Analysis (in particular for turn-taking), the second is to extract the features from individual turns rather than from entire conversations. The experiments have been performed over a corpus of dyadic chat conversations (77 in-dividuals in total). The performance in identifying the per-sons involved in each exchange, measured in terms of area under the Cumulative Match Characteristic curve, is 89.5%.

Evaluation of the performance and efficiency of the automated linguistic features for author identification in short text messages using different variable selection techniques

The aim of this paper was to evaluate the efficiency of automated linguistic features to test its capacity or discriminating power as style markers for author identification in short text messages of the Facebook genre. The corpus used to evaluate the automated linguistics features was compiled from 221 Facebook texts (each text is about 2 to 3 lines/35-40 words) written in English, which were written in the same genre and topic and posted in the same year group, totaling 7530 words. To compose the dataset for linguistic features performance or evaluation, frequency values were collected from 16 linguistic feature types involving parts of speech, function words, word bigrams, character tri grams, average sentence length in terms of words, average sentence length in terms of characters, Yule's K measure, Simpson's D measure, average words length, FW/CW ratio, average characters, content specific key words, type/token ratio, total number of short words less than four characters, contractions, and total number of characters in words which were selected from five corpora, totalling 328 test features. The evaluation of the 16 linguistic feature types differ from those of other analyses because the study used different variable selection methods including feature type frequency, variance, term frequency/ inverse document frequency (TF.IDF), signal-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into accounts the nonlinear patterns among the data. There were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms function word and parts of speech usages based on TF.IDF technique and the efficiency of function word usages (=60%) and the efficiency of parts of speech frequencies (=50%). There were no similarities between the

Classifying text using machine learning models and determining conversation drift

Cornell University - arXiv, 2022

Text classification helps analyse texts for semantic meaning and relevance, by mapping the words against this hierarchy. An analysis of various types of texts is invaluable to understanding both their semantic meaning, as well as their relevance. Text classification is a method of categorising documents. It combines computer text classification and natural language processing to analyse text in aggregate. This method provides a descriptive categorization of the text, with features like content type, object field, lexical characteristics, and style traits. In this research, the authors aim to use natural language feature extraction methods in machine learning which are then used to train some of the basic machine learning models like Naive Bayes, Logistic Regression, and Support Vector Machine. These models are used to detect when a teacher must get involved in a discussion when the lines go off-topic.

IJERT-E-mail communication Analysis using Text Mining

International Journal of Engineering Research and Technology (IJERT), 2013

https://www.ijert.org/e-mail-communication-analysis-using-text-mining https://www.ijert.org/research/e-mail-communication-analysis-using-text-mining-IJERTV2IS100929.pdf Electronic messages are one of the mostly preferred ways of communication in present days. This paper focuses on analyzing a communication using their textual contents. Analyzing mail communications essential for determining who the parties are between them communication took place and what was the topic on which they were communicating. Online communication (using social networking sites or messengers) analysis is advantageous when we want to detect any criminal communication and also helpful for civilians and military. In this paper we are introducing text mining approaches for analyzing communication via e-messages. By this analysis, we can generate much information about user-behavior, their activities and mostly preferred title of discussions.

Style mining of electronic messages for multiple authorship discrimination

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '03, 2003

This paper considers the use of computational stylistics for performing authorship attribution of electronic messages, addressing categorization problems with as many as 20 different classes (authors). Effective stylistic characterization of text is potentially useful for a variety of tasks, as language style contains cues regarding the authorship, purpose, and mood of the text, all of which would be useful adjuncts to information retrieval or knowledge-management tasks. We focus here on the problem of determining the author of an anonymous message, based only on the message text. Several multiclass variants of the Winnow algorithm were applied to a vector representation of the message texts to learn models for discriminating different authors. We present results comparing the classification accuracy of the different approaches. The results show that stylistic models can be learned to determine an author's identity with promising accuracy. This work thus forms a baseline for future research in author attribution of electronic messages.

Utterances Assessment in Chat Conversations

Research in Computing …, 2010

With the continuous evolution of collaborative environments, the needs of automatic analyses and assessment of participants in instant messenger conferences (chat) have become essential. For these aims, on one hand, a series of factors based on natural language processing (including lexical analysis and Latent Semantic Analysis) and data-mining have been taken into consideration. On the other hand, in order to thoroughly assess participants, measures as Page's essay grading, readability and social networks analysis metrics were computed. The weights of each factor in the overall grading system are optimized using a genetic algorithm whose entries are provided by a perceptron in order to ensure numerical stability. A gold standard has been used for evaluating the system's performance.