Towards a Computational Model to Thematic Typology of Literary Texts: A Concept Mining Approach

Addressing Subjectivity and Replicability in Thematic Classification of Literary Texts: Using Cluster Analysis to Derive Taxonomies of Thematic Concepts in Thomas Hardy's Prose Fiction

Thematic classification of Thomas Hardy's work has traditionally been based partly on textual content and partly on biographical considerations. These analyses and criticisms have been generated by what will henceforth be referred to as 'the philological method'; that is, by individual researchers' reading of printed materials and the intuitive abstraction of generalizations from that reading. A major problem with studies in this tradition is that they are not objective or replicable. In order to address issues of objectivity and replicability, this paper proposes an automated text clustering of the prose fiction works of Thomas Hardy using cluster analysis based on a vector space model (VSM) representation of the lexical content of the selected texts. The results reported here indicate that the proposed clustering structures yield usable results in understanding the thematic structure of Hardy's prose fiction texts and that they do so in an objective and replicable way. The remainder of this discussion is organized as follows: part 1 is the introduction, part 2 is methodology, part 3 covers data preparation, part 4 is hierarchical cluster analysis, part 5 is an interpretation of the results, and part 6 is the conclusion.
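The pipeline this abstract describes, representing each text as a lexical vector and grouping the vectors hierarchically, can be sketched in miniature. The snippet below is an illustrative stdlib-only sketch, not the paper's actual pipeline: the toy texts, the raw term-frequency weighting, and the naive single-linkage implementation are all assumptions for demonstration.

```python
from collections import Counter
import math

def vectorize(texts):
    """Build a term-frequency vector space model (VSM) over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    return [[Counter(doc)[w] for w in vocab] for doc in tokenized], vocab

def cosine_distance(a, b):
    """1 - cosine similarity, the usual VSM distance between two documents."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def single_link_cluster(vectors, n_clusters):
    """Naive agglomerative (single-linkage) clustering: repeatedly merge
    the two clusters whose closest members are nearest in cosine distance."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical mini-corpus: two "heath" passages and one "town" passage.
texts = [
    "the heath was dark and the night was dark",
    "dark night fell over the silent heath",
    "the market town bustled with trade and cattle",
]
vectors, vocab = vectorize(texts)
clusters = single_link_cluster(vectors, 2)
print(sorted(sorted(c) for c in clusters))  # → [[0, 1], [2]]
```

In a real study the vectors would cover full novels, likely with tf-idf weighting and a dendrogram cut rather than a fixed cluster count, but the logic is the same: texts that share lexis end up in the same branch.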

Toward discovering potential data mining applications in literary criticism

2006

Over the past decade, text mining techniques have been used for knowledge discovery in many domains, such as web documents, news articles, and biomedical literature. In the literary study domain, some data mining applications have emerged, among which document categorization may be the most successful example (Meunier 2005). Yet the overall progress of computer-assisted literary study remains modest.

Quantitative Analysis of Literary Texts: Computational Approaches in Digital Humanities Research

2024

The aim of this paper is to show how computing techniques offer new ways of improving our understanding of texts and their structure. Using statistical methods, scholars seek to uncover the covert patterns, trends, and meanings in literary texts in order to deepen our comprehension of literature in the digital realm. Methodology: The article surveys various quantitative analysis methods, including text mining, natural language processing (NLP), network analysis, and corpus linguistics. This approach employs computational tools and programs to explore large volumes of literary text, extract significant information, detect linguistic patterns, and visualize connections within the texts. Results and Discussion: This part reports findings from applying computational techniques to literary texts, including word frequency analysis, stylometric analysis, and sentiment analysis. Word frequency analysis indicates prominent concerns and points of emphasis within the texts, while stylometric analysis characterizes an author's writing style and linguistic features. Sentiment analysis measures the emotional tones of the texts, revealing their affective dimensions as well as their thematic content. Conclusion: The incorporation of quantitative analysis methods into literary studies marks notable progress for Digital Humanities as a field. By merging traditional qualitative approaches with computational tools and methodologies, researchers can explore the intricacies of literary texts, fostering interdisciplinary collaboration and making literary knowledge more accessible. As digital humanities continues to develop, quantitative analysis demonstrates the power of technology to reshape how we view modern literature and culture.
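Of the techniques this abstract lists, word frequency analysis is the simplest to make concrete. The following is a minimal stdlib sketch under assumed conventions (lowercasing, a simple regex tokenizer, no stop-word filtering); the sample sentence is invented for illustration.

```python
from collections import Counter
import re

def frequency_list(text, n=5):
    """Rank word types by raw frequency, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

sample = "The sea rose, and the sea fell; the waves broke on the shore."
print(frequency_list(sample, 2))  # → [('the', 4), ('sea', 2)]
```

In practice a frequency list only becomes interpretable once function words ("the", "and") are filtered or compared against a reference corpus, which is where keyword analysis takes over.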

Exploring Semantic Domains in the Works of Five Female 19th Century British Novelists: an Assessment of the Effectiveness of WMatrix in Identifying Major Literary Themes

The aim of this paper is to explore some of the new possibilities offered by the field of corpus stylistics for the analysis of fiction. In this paper, I show (1) how Wmatrix (Rayson, 2003, 2007), a new kind of method and tool for the statistical analysis of corpora that integrates part-of-speech tagging and semantic field tagging, provides the main semantic domains of any text or group of texts; and (2) how these semantic domains can be related to the major themes of such text(s). Leech (2008) asserts that the application of Wmatrix to literary texts 'is still in its infancy' (2008: 163). In an attempt to contribute to assessing the effectiveness of Wmatrix in identifying the major literary themes in fiction, this tool is applied to a selection of 19th-century fiction consisting of all the fictional works of five 19th-century female novelists, namely: Jane Austen, Charlotte Brontë, Emily Brontë, George Eliot and Elizabeth Gaskell. Using the BNC Sampler-Written as a reference corpus, the Wmatrix semantic tagger is used to identify the key semantic domains/fields of the compiled works of each female novelist. The results of the corpus analysis confirm the usefulness of corpus stylistics in studying literary texts, particularly in relation to their themes.
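The "key" in key semantic domains comes from comparing observed frequencies in the target corpus against a reference corpus, typically via Dunning's log-likelihood statistic, which underlies Wmatrix's keyness ranking. A minimal sketch of that statistic, with invented frequencies for illustration:

```python
import math

def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Dunning's log-likelihood keyness statistic for one word or semantic tag.
    freq_* are observed counts; size_* are total corpus token counts."""
    total = size_target + size_ref
    e1 = size_target * (freq_target + freq_ref) / total  # expected in target
    e2 = size_ref * (freq_target + freq_ref) / total     # expected in reference
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / e1)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e2)
    return 2 * ll

# Hypothetical: a tag occurring 120 times in a 50k-token corpus
# vs 40 times in a 100k-token reference corpus.
print(round(log_likelihood(120, 40, 50_000, 100_000), 2))  # → 116.16
```

A score above roughly 6.63 is conventionally treated as significant at p < 0.01 (chi-square, 1 d.f.); items are then ranked by score, and the highest-scoring semantic domains are read as candidate themes.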

Proceedings of the Fourth Workshop on Computational Linguistics for Literature

2015

Welcome to the 4th edition of the Workshop on Computational Linguistics for Literature. After the rounds in Montréal, Atlanta and Göteborg, we are pleased to see both the familiar and the new faces in Denver. We are eager to hear what our invited speakers will tell us. Nick Montfort, a poet and a pioneer of digital arts and poetry, will open the day with a talk on the use of programming to foster exploration and fresh insights in the humanities. He suggests a new paradigm useful for people with little or no programming experience. Matthew Jockers's work on macro-analysis of literature is well known and widely cited. He has published extensively on using digital analysis to view literature diachronically. Matthew will talk about his recent work on modelling the shape of stories via sentiment analysis. This year's workshop will feature six regular talks and eight posters. If our past experience is any indication, we can expect a lively poster session. The topics of the 14 accepted papers are diverse and exciting. Once again, there is a lot of interest in the computational analysis of poetry. Rodolfo Delmonte will present and demo SPARSAR, a system which analyzes and visualizes poems. Borja Navarro-Colorado will talk about his work on analyzing shape and meaning in the 16th- and 17th-century Spanish sonnets. Nina McCurdy, Vivek Srikumar & Miriah Meyer propose a formalism for analyzing sonic devices in poetry and describe an open-source implementation. This year's workshop will witness a lot of work on parallel texts and on machine translation of literary data. Laurent Besacier & Lane Schwartz describe preliminary experiments with MT for the translation of literature. In a similar vein, Antonio Toral & Andy Way explore MT on literary data but between related languages. Fabienne Cap, Ina Rösiger & Jonas Kuhn explore how parallel editions of the same work can be used for literary analysis. Olga Scrivner & Sandra Kübler also look at parallel editions in dealing with historical texts. Several other papers cover various aspects of literary analysis through computation. Prashant Jayannavar, Apoorv Agarwal, Melody Ju & Owen Rambow consider social network analysis for the validation of literary theories. Andreas van Cranenburgh & Corina Koolen investigate what distinguishes literary novels from less literary ones. Dimitrios Kokkinakis, Ann Ighe & Mats Malm use computational analysis and leverage literature as a historical corpus in order to study typical vocations of women in 19th-century Sweden. Markus Krug, Frank Puppe, Fotis Jannidis, Luisa Macharowsky, Isabella Reger & Lukas Weimar describe a coreference resolution system designed specifically with fiction in mind. Stefan Evert, Thomas Proisl, Thorsten Vitt, Christof Schöch, Fotis Jannidis & Steffen Pielström explain the success of Burrows's Delta in literary authorship attribution. Last but not least, there are papers which do not fit into any other bucket. Marie Dubremetz & Joakim Nivre will tell us about automatic detection of a rare but elegant rhetorical device called chiasmus. Julian Brooke, Adam Hammond & Graeme Hirst describe a tool much needed in the community: GutenTag, a system for accessing Project Gutenberg as a corpus. To be sure, there will be much to listen to, learn from and discuss for everybody with the slightest interest in either NLP or literature. We cannot wait for June 4 (-:). This workshop would not have been possible without the hard work of our program committee. Many people on the PC have been with us from the beginning. Everyone offers in-depth, knowledgeable advice to both the authors and the organizers. Many thanks to you all! We would also like to acknowledge the generous support of the National Science Foundation (grant No. 1523285), which has allowed us to invite such interesting speakers.

Classifying literary genres

Texto Livre: Linguagem e Tecnologia, 2020

Classifying literary genres has always been methodologically confined to philological methods and what is commonly known as Vector Space Clustering (VSC). The problem has been exacerbated by the widening gap between computational theory and traditional analysis of literary texts. Towards finding a solution to this problem, the current study utilizes a synergetic approach that brings together two established methods. First, a computational model of genre classification is drawn upon for identifying concept-based, rather than word-bound, topics, where the representation of texts is secured via the 'bag of concepts' (BOC) model as well as the sense-restricted knowledge and meaningful links holding between and among concepts; relatedly, the two model strands of explicit semantic analysis (ESA) and ConceptNet have enacted text classification. Second, a contextual lexical semantic approach (CRUSE, 1986, 2000) is employed so that the contextual variability of word meanings and concepts c...

The Potentialities of Corpus-Based Techniques for Analyzing Literature

This paper presents an attempt to explore the analytical potential of five corpus-based techniques: concordances, frequency lists, keyword lists, collocate lists, and dispersion plots. The basic question addressed concerns the contribution these techniques make to a more objective and insightful knowledge of how literary meanings are encoded and how literary language is organized. Three sizable English novels (Joyce's Ulysses, Woolf's The Waves, and Faulkner's As I Lay Dying) are subjected to corpus linguistic analysis. It is only by virtue of corpus-based techniques that such large amounts of literary data become analyzable; otherwise, the data would remain little more than a few lines of poetry or short excerpts of narrative. The corpus-based techniques presented throughout this paper contribute, in varying degrees, to a rigorous interpretation of literary texts, in contrast to the intuitive approaches usually employed in traditional stylistics.
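Two of the five techniques named above, concordances and dispersion plots, are easy to sketch directly. The snippet below is an illustrative stdlib-only sketch with an invented sample sentence, not the paper's tooling: a KWIC (Key Word In Context) concordance and a crude text-position dispersion plot for a node word.

```python
import re

def kwic(text, node, width=3):
    """Key Word In Context: each hit of `node` with `width` words of co-text."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{node}] {right}")
    return lines

def dispersion(text, node, buckets=10):
    """Crude dispersion plot: mark which equal-sized slices of the text
    contain the node word ('|' = at least one hit, '.' = none)."""
    tokens = re.findall(r"\w+", text.lower())
    hits = [0] * buckets
    for i, tok in enumerate(tokens):
        if tok == node:
            hits[min(buckets - 1, i * buckets // len(tokens))] += 1
    return "".join("|" if h else "." for h in hits)

sample = "I could hear the water. The water was dark and it moved past the house."
for line in kwic(sample, "water", width=2):
    print(line)
print(dispersion(sample, "water", buckets=5))  # → .||..
```

Real concordancers (AntConc, WordSmith, CQPweb) add alignment, sorting on left/right co-text, and graphical dispersion, but the underlying operations are exactly these: locate the node, slice the co-text, and bucket hit positions across the text.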