An essential rephrasing of the Zipf-Mandelbrot law to solve authorship attribution applications by Gaussian statistics

Using the complexity of the distribution of lexical elements as a feature in authorship attribution

2008

Traditional authorship attribution models extract normalized counts of lexical elements such as nouns, common words, and punctuation, and use these normalized counts or ratios as features for author fingerprinting. The text is viewed as a "bag-of-words", and the order of words and their position relative to other words is largely ignored. We propose a new method of feature extraction which quantifies the distribution of lexical elements within the text using Kolmogorov complexity estimates. Testing carried out on blog corpora indicates that such measures outperform ratios when used as features in an SVM authorship attribution model. Moreover, by adding complexity estimates to a model using ratios, we were able to increase the F-measure by 5.2-11.8%.
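The abstract does not specify the complexity estimator used, but a common proxy for Kolmogorov complexity is compressed size. A minimal sketch, assuming a zlib-based estimate over the occurrence sequence of a single lexical element (the function name and the normalization are illustrative, not from the paper):

```python
import zlib

def occurrence_complexity(tokens, target):
    """Estimate the complexity of how `target` is distributed across
    `tokens`, using the zlib-compressed size of the binary occurrence
    sequence as a proxy for Kolmogorov complexity."""
    # One byte per token: 1 where the target occurs, 0 elsewhere.
    bits = bytes(1 if t == target else 0 for t in tokens)
    if not bits:
        return 0.0
    # Normalize by sequence length so longer texts do not
    # automatically score higher.
    return len(zlib.compress(bits)) / len(bits)

text = "the cat sat on the mat and the dog sat on the rug".split()
score = occurrence_complexity(text, "the")
```

A regularly spaced word compresses well (low score), while an irregularly scattered one compresses poorly; a vector of such scores over several lexical elements could then feed an SVM alongside the usual ratios.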

A Probabilistic Model for the Distribution of Authorships: A Preliminary Report

1988

The purpose of this study was to develop a model for the distribution of authorships, based on the initial hypothesis that the distribution of authorships follows a shifted Waring distribution, and to test the derived model and some other discrete probability models for goodness-of-fit against empirical data. Bibliographic data from 15 abstracting journals covering the literature in six fields (engineering, medical, physical, mathematical and social sciences, and humanities) were used in testing the goodness-of-fit of the shifted Waring distribution and 13 other discrete probability models. The preliminary findings presented here are based on 60 data sets collected from 10 abstracting journals covering the literature in the mathematical and social sciences and humanities. They indicate that the promising models for the distribution of authorships are the shifted Waring, shifted generalized negative binomial, shifted negative binomial, shifted generalized Poisson, and shifted inverse Gaussian-Poisson distributions. Three advantages and possible practical applications of a model for the distribution of authorships include: (1) the ability to summarize the entire frequency distribution with a few model parameters; (2) estimation of the number of entries in an author index; and (3) usefulness in a simulation study designed to determine, subject to space constraints, the maximum number of authors per paper to be included in an author index. Forms of the discrete probability models are appended.
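As a toy illustration of the goodness-of-fit exercise the abstract describes, the sketch below fits one of the simplest shifted models, a shifted Poisson (X = 1 + Poisson(lam)), to hypothetical authors-per-paper counts by maximum likelihood and computes a chi-square statistic. The paper's actual candidate set (shifted Waring, shifted negative binomial, etc.) and its data are not reproduced here:

```python
import math
from collections import Counter

def fit_shifted_poisson(author_counts):
    """Fit X = 1 + Poisson(lam) to authors-per-paper data by MLE and
    return (lam, chi-square statistic). A minimal stand-in for the
    paper's 14-model comparison, not its method."""
    n = len(author_counts)
    lam = sum(author_counts) / n - 1.0   # MLE: sample mean minus the shift
    observed = Counter(author_counts)
    chi2 = 0.0
    for k, obs in sorted(observed.items()):
        # P(X = k) = P(Poisson(lam) = k - 1), since X is shifted by 1.
        p = math.exp(-lam) * lam ** (k - 1) / math.factorial(k - 1)
        chi2 += (obs - n * p) ** 2 / (n * p)
    return lam, chi2

data = [1, 1, 2, 1, 3, 2, 1, 1, 2, 4, 1, 2]   # hypothetical counts
lam, chi2 = fit_shifted_poisson(data)
```

Comparing such chi-square statistics (with appropriate degrees of freedom and pooled low-expectation cells, omitted here for brevity) across candidate distributions is the usual way to rank the models.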

A probabilistic model for the distribution of authorships

Journal of the American Society for Information Science, 1991


A survey of modern authorship attribution methods

Journal of the American Society for Information Science and Technology, 2009

Authorship attribution supported by statistical or computational methods has a long history, starting in the 19th century and marked by the seminal study on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances in automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.

Computational methods in authorship attribution

Journal of the American Society for Information Science and Technology, 2009

Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample.

Computer-Based Authorship Attribution Without Lexical Measures

Language Resources and Evaluation, 2001

The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead, we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool, using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to reliably distinguish the authors of a randomly chosen group and performs better than a lexically based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.
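To make the idea of non-lexical, token-level style markers concrete, here is a minimal sketch; the marker names and choices are illustrative, since the paper derives its markers from an existing Greek NLP tool rather than from regular expressions:

```python
import re

def token_level_markers(text):
    """Compute two token-level style markers of the general kind the
    paper describes: they capture structure and punctuation habits
    without depending on which words the author chooses."""
    # Crude sentence split on terminal punctuation; a real tool would
    # handle abbreviations and quotes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    punct = re.findall(r"[,;:.!?]", text)
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "punct_per_word": len(punct) / max(len(words), 1),
    }

sample = "Style matters. Word choice, however, is excluded; structure is not!"
markers = token_level_markers(sample)
```

Vectors of such markers, computed per document, can then be fed to any standard classifier, alone or combined with lexical ratios as in the paper's best-performing configuration.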