An Empirical Method Exploring a Large Set of Features for Authorship Identification
Related papers
Authorship Analysis and Identification Techniques: A Review
International Journal of Computer Applications, 2013
Interest in data mining continues to grow. Nearly everything is now available over the internet, which also enables criminal and malicious activity, so establishing the identity behind available content has become a necessity. This content is in the form of text data. Authorship analysis is the statistical study of the linguistic and computational characteristics of documents written by individuals. This paper reviews various methods for authorship analysis and identification over a provided set of texts. Research in authorship analysis and identification will likely continue, and even increase, over the coming decades. In this article, we present our vision of future high-performance authorship analysis and identification, together with a solution for behavioral feature extraction from a set of text documents.
Design and Implementation of a Machine Learning-Based Authorship Identification Model
Scientific Programming
In this paper, a novel approach is presented for authorship identification in English and Urdu text using the LDA model with n-grams of the authors' texts and cosine similarity. The proposed approach uses similarity metrics to identify various learned representations of stylometric features and uses them to identify the writing style of a particular author. The proposed LDA-based approach emphasizes instance-based and profile-based classifications of an author’s text. Here, LDA suitably handles high-dimensional and sparse data by allowing a more expressive representation of text. The presented approach is an unsupervised computational methodology that can handle the heterogeneity of the dataset, diversity in writing, and the inherent ambiguity of the Urdu language. A large corpus has been used for performance testing of the presented approach. The results of experiments show superiority of the proposed approach over the state-of-the-art representations and other algorithms used for authors...
Unsupervised method for the authorship identification task
This paper presents an approach for tackling the authorship identification task. The approach measures the similarity between a given unknown document and the known documents using a number of different phrase-level and lexical-syntactic features, so that an unknown document can be classified as having been written by the same author if the similarity measures obtained are close to a predetermined threshold for each language in the task. The method has shown competitive results, achieving 6th place overall in the competition ranking.
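The threshold-based decision described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: character trigrams stand in for the paper's phrase-level and lexical-syntactic features, cosine similarity stands in for its similarity measures, and the function names and the default threshold of 0.5 are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_author(known_docs, unknown_doc, threshold=0.5, n=3):
    """True if the mean similarity to the known documents reaches the threshold.

    The threshold is a placeholder; the abstract only says one is fixed
    in advance for each language.
    """
    unknown = char_ngrams(unknown_doc, n)
    scores = [cosine(char_ngrams(doc, n), unknown) for doc in known_docs]
    return sum(scores) / len(scores) >= threshold
```

In practice the threshold would be tuned per language on held-out data, and several similarity measures would be combined rather than a single one.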
Author identification using writer-dependent and writer-independent strategies
2008
In this work we discuss author identification for documents written in Portuguese. Two different approaches were compared. The first is the writer-independent model, which reduces the pattern recognition problem to a single model and two classes and hence makes it possible to build a robust system even when few genuine samples per writer are available. The second is the personal model, which very often performs better but needs a larger number of samples per writer. We also introduce a stylometric feature set based on the conjunctions and adverbs of the Portuguese language. Experiments on a database of short articles from 30 different authors, with a Support Vector Machine (SVM) as the classifier, demonstrate that the proposed strategy can produce results comparable to the literature.
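One common way to reduce a many-author problem to "a single model and two classes" is to train on pairs of documents rather than on single documents. The sketch below assumes that formulation; the abstract does not spell out the transformation, so the element-wise absolute difference and the helper names are illustrative assumptions, not the paper's method.

```python
from itertools import combinations


def dissimilarity(x, y):
    """Element-wise absolute difference between two feature vectors."""
    return [abs(a - b) for a, b in zip(x, y)]


def make_pairs(samples):
    """Turn (author, feature_vector) samples into two-class training data.

    Label 1 means both documents share an author, 0 means they do not.
    """
    pairs = []
    for (author_a, x), (author_b, y) in combinations(samples, 2):
        pairs.append((dissimilarity(x, y), int(author_a == author_b)))
    return pairs
```

Because every pair of documents yields one training example, even a few samples per writer produce many examples for the single two-class classifier, which is the advantage the abstract attributes to the writer-independent model.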
Authorship identification for heterogeneous documents
2002
The study of authorship identification in Japanese has for the most part been restricted to literary texts using basic statistical methods. In the present study, authors of mailing list messages are identified using a machine learning technique (Support Vector Machines). In addition, the classifier trained on the mailing list data is applied to identify the author of Web documents in order to investigate performance in authorship identification for more heterogeneous documents. Experimental results show better identification performance when we use the features of not only conventional word N-gram information but also of frequent sequential patterns extracted by a data mining technique (PrefixSpan).
Authorship Attribution using Content based Features and N-gram features
International Journal of Engineering and Advanced Technology, 2019
Textual content on the internet is growing exponentially, primarily through social websites, and the problems caused by anonymous textual data are growing with it. Researchers are searching for techniques to identify the author of an unknown document. Authorship Attribution is one such technique for predicting the details of an unknown document. Researchers have extracted various classes of stylistic features, such as character, lexical, syntactic, structural, content and semantic features, to distinguish authors' writing styles. In this work, experiments were performed with the most frequent content-specific features and with n-grams of characters, words and POS tags. A standard dataset was used for experimentation, and the combination of content-based and n-gram features achieved the best accuracy for author prediction. Two standard classification algorithms were used for author prediction. The Random forest classifier attained best accuracy for prediction of author w...
A Survey on Authorship Analysis Tasks and Techniques
SEEU Review
Authorship Analysis (AA) is a natural language processing field that examines the previous works of writers to identify the author of a text based on its features. Studies in authorship analysis include authorship identification, authorship profiling, and authorship verification. Because of its relevance to many applications, the field has attracted considerable attention. It is widely used in the attribution of historical literature. Other applications include legal linguistics, criminal law, forensic investigations, and computer forensics. This paper aims to provide an overview of the work done and the techniques applied in the authorship analysis domain. The examination of recent developments in this field is the principal focus. Many different criteria can be used to define a writer’s style. This paper investigates stylometric features in different author-related tasks, including lexical, syntactic, semantic, structural, and content-specific ones. A lot of classification methods have been ap...
On the Empirical Evaluation of Hybrid Author Identification Method
2015
In this paper we focus on the identification of the author of a written text. We present a new hybrid method that combines a set of stylistic and statistical features in a machine learning process. We tested the effectiveness of the linguistic and statistical features combined with the inter-textual distance "Delta" on the PAN’@CLEF’2015 English corpus and we obtained 0.59 as c@1 precision.
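For readers unfamiliar with the c@1 figure reported above, the sketch below computes it under the usual definition, in which unanswered problems earn partial credit proportional to the accuracy on answered ones. The function and parameter names are hypothetical, and the example counts are invented for illustration; they are not the paper's results.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 = (n_correct + n_unanswered * n_correct / n_total) / n_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Illustrative counts only: 5 correct answers and 2 problems left unanswered out of 10.
example_score = c_at_1(5, 2, 10)
```

When no problem is left unanswered, c@1 reduces to plain accuracy.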
Authorship Verification based on Syntax Features
Authorship verification is a widely discussed topic these days. In the authorship verification problem, we are given examples of the writing of an author and are asked to determine whether given texts were or were not written by this author. In this paper we present an algorithm that uses the syntactic analysis system SET to verify the authorship of documents. We propose three variants of a two-class machine learning approach to authorship verification. Syntactic features are used as attributes in the suggested algorithms, and their performance is compared to established word-length distribution features. Results indicate that syntactic features provide enough information to improve the accuracy of authorship verification algorithms.
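The word-length distribution baseline mentioned in this abstract can be sketched as a normalized histogram of word lengths. The tokenization rule, the bin cap `max_len`, and the function name are assumptions for illustration; the paper's exact feature definition is not given here.

```python
import re
from collections import Counter


def word_length_distribution(text, max_len=10):
    """Relative frequency of word lengths 1..max_len; longer words share the last bin."""
    # Words are taken to be runs of letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text)
    total = len(words)
    if total == 0:
        return [0.0] * max_len
    counts = Counter(min(len(w), max_len) for w in words)
    return [counts[length] / total for length in range(1, max_len + 1)]
```

The resulting fixed-length vector can be fed to any two-class learner alongside, or in place of, syntactic features.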
Lexical-Syntactic and Graph-Based Features for Authorship Verification
In this paper we present the results obtained by an approach submitted to the author identification task of PAN 2013, which uses lexical, syntactic and graph-based features to construct a representation model of document authors. In particular, the features extracted from the graph representation were obtained by means of the SubDue mining tool. As a classification model we employed Support Vector Machines (SVM). The overall results ranked our approach in fifth place out of around 17 teams.