Authorship Attribution for Online Social Media (original) (raw)
Related papers
Journal of The American Society for Information Science and Technology, 2006
With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. We developed a framework for authorship identification of online messages to address the identity-tracing problem. In this framework, four types of writing-style features (lexical, syntactic, structural, and content-specific features) are extracted and inductive learning algorithms are used to build feature-based classification models to identify authorship of online messages. To examine this framework, we conducted experiments on English and Chinese online-newsgroup messages. We compared the discriminating power of the four types of features and of three classification techniques: decision trees, backpropagation neural networks, and support vector machines. The experimental results showed that the proposed approach was able to identify authors of online messages with satisfactory accuracy of 70 to 95%. All four types of message features contributed to discriminating authors of online messages. Support vector machines outperformed the other two classification techniques in our experiments. The high performance we achieved for both the English and Chinese datasets showed the potential of this approach in a multiple-language context.
Optimizing Authorship Profiling of Online Messages
2016
Authorship profiling is of growing importance in the current information age, partly due to its application in digital forensics. Methodologies of profiling like any other authorship analysis consist majorly of feature extraction and application of analytical techniques. Choice of feature sets and analytical techniques may significantly affect the performance of authorship analysis. Hence, a need for methods that can help improve on the success of authorship profiling undertakings. The present study sought through experiments, the writing features, analytical technique and number of class labels that can help improve the effectiveness of profiling the country of affiliation of authors of online messages. The experiment showed that the most effective model was achieved when all feature set types in our study were used within a two-class dataset that was analysed with the Neural Network (Multilayer Perceptron) machine learning scheme. The study recommends a need for further studies in finding models that can maximize both effectiveness and efficiency in profiling the authorship of online messages.
Authorship Attribution using Content based Features and N-gram features
International Journal of Engineering and Advanced Technology, 2019
The internet is increasing exponentially with textual content primarily through social websites. The problems were also increasing with anonymous textual data in the internet. The researchers are searching for alternative techniques to know the author of an unknown document. Authorship Attribution is one such technique to predict the details of an unknown document. The researchers extracted various classes of stylistic features like character, lexical, syntactic, structural, content and semantic features to distinguish the authors writing style. In this work, the experiment performed with most frequent content specific features, n-grams of character, word and POS tags. A standard dataset is used for experimentation and identified that the combination of content based and n-gram features achieved best accuracy for prediction of author. Two standard classification algorithms were used for author prediction. The Random forest classifier attained best accuracy for prediction of author w...
arXiv Computer Science, 2020
In recent years, messages and text posted on the Internet are used in criminal investigations. Unfortunately, the authorship of many of them remains unknown. In some channels, the problem of establishing authorship may be even harder, since the length of digital texts is limited to a certain number of characters. In this work, we aim at identifying authors of tweet messages, which are limited to 280 characters. We evaluate popular features employed traditionally in authorship attribution which capture properties of the writing style at different levels. We use for our experiments a self-captured database of 40 users, with 120 to 200 tweets per user. Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%. These results are competitive in comparison to existing studies which employ short texts such as tweets or SMS.
Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features
2020 8th International Workshop on Biometrics and Forensics (IWBF), 2020
In recent years, messages and text posted on the Internet are used in criminal investigations. Unfortunately, the authorship of many of them remains unknown. In some channels, the problem of establishing authorship may be even harder, since the length of digital texts is limited to a certain number of characters. In this work, we aim at identifying authors of tweet messages, which are limited to 280 characters. We evaluate popular features employed traditionally in authorship attribution which capture properties of the writing style at different levels. We use for our experiments a self-captured database of 40 users, with 120 to 200 tweets per user. Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%. These results are competitive in comparison to existing studies which employ short texts such as tweets or SMS.
Analysis of authorship attribution technique on Urdu tweets empowered by machine learning
International Journal of Advanced Trends in Computer Science and Engineering, 2021
The process of identifying the author of an anonymous document from a group of unknown documents is called authorship attribution. As the world is trending towards shorter communications, the trend of online criminal activities like phishing and bullying are also increasing. The criminal hides their identity behind the screen name and connects anonymously. Which generates difficulty while tracing criminals during the cybercrime investigation process. This paper evaluates current techniques of authorship attribution at the linguistic level and compares the accuracy rate in terms of English and Urdu context, by using the LDA model with n-gram technique and cosine similarity, used to work on Stylometry features to identify the writing style of a specific author. Two datasets are used Urdu_TD and English_TD based on 180 English and Urdu tweets against each author. The overall accuracy that we achieved from Urdu_TD is 84.52% accuracy and 93.17% accuracy on English_TD. The task is done without using any labels for authorship.
Authorship Authentication of Short Messages from Social Networks Machines
Dataset consists of 17000 twits collected from Twitter, as 500 twits for each of 34 authors that meet certain criteria. Raw data is collected by using the software Nvivo. The collected raw data is preprocessed to extract frequencies of 200 features. In the data analysis 128 of features are eliminated since they are rare in tweets. For the beginning ten of these 34 authors are selected. Since artificial neural networks are more successful distinguishing two classes, 90 neural networks are trained for pairwise classification. These experts then organized as a special team (CANNT) to aggregate decisions of these 90 competing experts. To improve the accuracy of author authentication, a novel technique, batch identification is used and 100% accuracy is achieved.
The aim of this paper is to explore the usefulness of using features from different linguistic levels to email authorship identification. Using various email datasets provided by PAN'11 lab we tested several feature groups in both authorship attribution and authorship verification subtasks. The selected feature groups combined with Regularized Logistic Regression and One-Class SVM machine learning methods performed well above average in authorship attribution subtasks and below average in authorship verification subtasks.
An Empirical Method Exploring a Large Set of Features for Authorship Identification
2016
In this paper, we deal with the author identification issues of the document whose origin is unknown. To overcome these problems, we propose a new hybrid approach combining the statistical and stylistic analysis. Our introduced method is based on determining the lexical and syntactic features of the written text in order to identify the author of the document. These features are explored to build a machine learning process. We obtained promising results by relying on PAN@CLEF2014 English literature corpus. The experimental results are comparable to those obtained by the best state of the art methods.