An Empirical Method Exploring a Large Set of Features for Authorship Identification

Abstract

In this paper, we address the problem of identifying the author of a document whose origin is unknown. We propose a new hybrid approach that combines statistical and stylistic analysis. The method extracts lexical and syntactic features from the written text, and these features are used to train a machine learning model that identifies the author of the document. We obtained promising results on the PAN@CLEF2014 English literature corpus. The experimental results are comparable to those obtained by the best state-of-the-art methods.
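To make the idea of lexical features concrete, the following is a minimal sketch of how a few common stylometric measurements could be computed from raw text. It is not the feature set used in the paper, which the abstract does not enumerate. The function name `lexical_features`, the short `FUNCTION_WORDS` list, and the chosen measurements are all assumptions made for illustration. Syntactic features are left out because they would typically require a part-of-speech tagger or parser.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); a real system would normally use a much larger list.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in")


def lexical_features(text):
    """Return a small dict of lexical style features for one document."""
    # Lowercase word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n = len(words) or 1  # avoid division by zero on empty input

    features = {
        "avg_word_length": sum(len(w) for w in words) / n,
        "type_token_ratio": len(counts) / n,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / n
    return features


features = lexical_features("The cat sat on the mat. The dog barked!")
```

Each document would be mapped to one such feature dictionary, and the resulting vectors, labelled with their known authors, could then be passed to any standard classifier. Which classifier the authors used is not stated in the abstract.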
