The Gutenberg English Poetry Corpus: Exemplary Quantitative Narrative Analyses

Abstract

This paper describes a corpus of about 3,000 English literary texts with about 250 million words extracted from the Gutenberg project that span a range of genres from both fiction and non-fiction written by more than 130 authors (e.g., Darwin, Dickens, Shakespeare). Quantitative narrative analysis (QNA) is used to explore a cleaned subcorpus, the Gutenberg English Poetry Corpus (GEPC), which comprises over 100 poetic texts with around two million words from about 50 authors (e.g., Keats, Joyce, Wordsworth). Some exemplary QNA studies show author similarities based on latent semantic analysis, significant topics for each author or various text-analytic metrics for George Eliot’s poem “How Lisa Loved the King” and James Joyce’s “Chamber Music,” concerning, e.g., lexical diversity or sentiment analysis. The GEPC is particularly suited for research in Digital Humanities, Computational Stylistics, or Neurocognitive Poetics, e.g., as training and test corpus for stimulus development and control in empirical studies.

Introduction

In his “The psycho-biology of language,” Zipf () introduced the law of linguistic change claiming that as the frequency of phonemes or of linguistic forms increases, their magnitude decreases. Zipf’s law elegantly expresses a tendency in languages to maintain an equilibrium between unit length and frequency, suggesting an underlying law of economy. Thus, Zipf speculated that humans strive to maintain an emotional equilibrium between variety and repetitiveness of environmental factors and behavior and that a speaker’s discourse must represent a compromise between variety and repetitiveness adapted to the hearer’s tolerable limits of change in maintaining emotional equilibrium. In a way, Zipf not only was a precursor of contemporary natural language processing/NLP (e.g., Natural Language Tool Kit/NLTK; Bird et al., ), quantitative narrative analysis (QNA), Computational Linguistics or Digital Humanities, but also of Psycholinguistics and Empirical Studies of Literature, since he theorized about “the hearers responses” to literature.

About 30 years later, when analyzing Baudelaire’s poem “Les chats,” Jakobson and Lévi-Strauss () counted text features like the number of nasals, dental fricatives, liquid phonemes or adjectives, and homonymic rhymes in different parts of the sonnet (e.g., the first quatrain) to support their qualitative analyses and interpretation of, e.g., oxymora that link stanzas, of the relation between the images of cats and women, or of the poem as an open system which progresses dynamically from the quatrain to the couplet. While their systematic structuralist pattern analysis of a poem, starting with formal metric, phonological, and syntactic features to prepare the final semantic analysis, provoked a controversy among literary scholars, it also prepared the ground for subsequent linguistic perspectives on the analysis (and reception) of literary texts called cognitive poetics (e.g., Leech, ; Tsur, ; Turner and Poeppel, ; Stockwell, ).

Today, technological progress has produced culturomics, i.e., computational analyses of huge text corpora (5,195,769 digitized books containing ~4% of all books ever published) enabling researchers to observe cultural trends and subject them to quantitative investigation (Michel et al., ). More particularly, Digital Literary Studies now “propose systematic and technologically equipped methodologies in activities where, for centuries, intuition and intelligent handling had played a predominant role” (Moretti, ; Ganascia, ).

One promising application of these techniques is in the emerging field of Neurocognitive Poetics which is characterized by neurocognitive (experimental) and computational research on the reception of more natural and ecologically valid stimuli focusing on literary materials, e.g., excerpts from novels or poems (Schrott and Jacobs, ; Jacobs, ,; Willems and Jacobs, ). These present a number of theoretical and methodological challenges (Jacobs and Willems, ) regarding experimental designs for behavioral and neurocognitive studies which—on the stimulus side—can be tackled by using advanced techniques of NLP, QNA, and machine learning (e.g., Mitchell, ; Pedregosa et al., ; Jacobs et al., ,, ; Jacobs and Kinder, , ). Recent examples for this approach are the prediction of the subjective beauty of single words (Jacobs, ), the literariness of metaphors (Jacobs and Kinder, ), or the most beautiful line of three Shakespeare sonnets (Jacobs, under revision1). Thus, using classifiers of the decision tree family (Geurts et al., ), Jacobs and Kinder identified a set of 11 features that could influence the literariness of metaphors, including their length, surprisal value, and sonority score (see below).

All these studies require training corpora as the basis for their computational predictions, and a particularly interesting challenge consists of finding or creating the optimal training corpus—especially for empirical scientific studies of literature (Jacobs, )—since standard corpora are not based on particularly literary texts. Recently, Bornet and Kaplan () introduced a literary corpus of 35 French novels with over five million word tokens for a named entity recognition study, but in the fields of psycholinguistics and Neurocognitive Poetics, such specific corpora still are practically absent. An exception is the Shakespeare corpus (Shakespeare Online, http://www.shakespeare-online.com/sonnets/sonnetintroduction.html; cf. Jacobs et al., ) we recently used to compute the surprisal2 values of entire sonnets, stanzas, or lines which are reliable and valid predictors of a number of response measures collected in empirical research on reading and literature, e.g., reading time or brain wave amplitudes (Frank, ). Surprisal computation requires a language model, usually based on trigrams (e.g., Jurafsky and Martin, ). However, it makes a big difference when trigram probabilities are computed on the basis of a nonliterary as compared to a literary or poetic training corpus, or when they are based on prose rather than poems (see below). As could be expected, when a contemporary corpus encompassing about six million sentences (SUBTLEX, Brysbaert and New, ) was used, a significantly higher mean surprisal (for all 154 sonnets) resulted than when the Shakespeare corpus was used (Jacobs et al., ). According to the Neurocognitive Poetics Model (Jacobs, , ,), sonnets/lines/words with higher surprisal—and thus foregrounding potential—should more likely produce higher liking ratings, smaller eye movements, and longer fixation durations than sonnets low on surprisal. 
Data from a recent eye-tracking study using short literary stories support these predictions (van den Hoven et al., ). Regarding potential neuroimaging studies on sonnet reception, Jacobs et al. () predicted a higher activation in several brain areas, e.g., the left inferior temporal sulcus, bilateral superior temporal gyrus, right amygdala, bilateral anterior temporal poles, and right inferior frontal sulcus, for sonnets with higher surprisal values. Therefore, the choice of the training corpus and the language model is crucial for such predictions, the selection of the stimulus materials for empirical studies, and the evaluation of the theoretical model’s descriptive accuracy and validity. A major goal of the Neurocognitive Poetics perspective is to develop and test training corpora of differing size, specificity, and representativeness in several languages (cf. Jacobs, under revision).
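As a rough illustration of how the training corpus enters the surprisal computation, the following minimal Python sketch estimates trigram surprisal, i.e., the negative log2 probability of a word given its two predecessors. The function names, the `<s>` padding symbol, the add-one smoothing, and the toy training text are assumptions made for this illustration; they are not taken from the studies cited above.

```python
import math
from collections import Counter

def train_trigram_model(tokens):
    """Count trigrams and their two-word contexts in a training corpus."""
    padded = ["<s>", "<s>"] + list(tokens)
    triples = list(zip(padded, padded[1:], padded[2:]))
    trigrams = Counter(triples)
    contexts = Counter((a, b) for a, b, _ in triples)
    vocab_size = len(set(tokens)) + 1  # +1 reserves mass for unseen words
    return trigrams, contexts, vocab_size

def surprisal(word, context, model):
    """Surprisal in bits of `word` after the two-word tuple `context`."""
    trigrams, contexts, vocab_size = model
    count = trigrams[(context[0], context[1], word)]
    # Add-one smoothing (an assumption) keeps unseen trigrams finite.
    prob = (count + 1) / (contexts[context] + vocab_size)
    return -math.log2(prob)

def mean_surprisal(line_tokens, model):
    """Average surprisal over all words of a line (0.0 for an empty line)."""
    padded = ["<s>", "<s>"] + list(line_tokens)
    values = [surprisal(padded[i], (padded[i - 2], padded[i - 1]), model)
              for i in range(2, len(padded))]
    return sum(values) / len(values) if values else 0.0

# Toy training text; a real study would train on a full corpus.
model = train_trigram_model("shall i compare thee to a summers day".split())
print(mean_surprisal("shall i compare thee".split(), model))
```

Training the same functions on two different corpora and comparing `mean_surprisal` for the same line shows directly how much the resulting values depend on the training material.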

In this paper, I describe a novel literary corpus assembled from the digitized books part of project Gutenberg (https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html), augmented by a Shakespeare corpus (Shakespeare Online, http://www.shakespeare-online.com/sonnets/sonnetintroduction.html; cf. Jacobs et al., ), henceforth called the Gutenberg Literary English Corpus (GLEC). The GLEC provides a collection of over 3,000 English texts from the Gutenberg project, spanning a wide range of genres, both fiction and non-fiction (novels, biographies, dramas, essays, short stories, novellas, tales, speeches and letters, science books, poetry; e.g., Austen, Bronte, Byron, Coleridge, Darwin, Dickens, Einstein, Eliot, Poe, Twain, Woolf, Wilde, Yeats) with about 12 million sentences and 250 million words.

Materials and Methods: The GLEC and GEPC

The GLEC, i.e., the original Gutenberg texts augmented by the Shakespeare corpus, contains over 900 novels, over 500 short stories, over 300 tales and stories for children, about 200 poem collections, poems and ballads and about 100 plays, as well as over 500 pieces of non-fiction, e.g., articles, essays, lectures, letters, speeches, or (auto-)biographies. Except for the poetry collection subcorpus further explored in this paper and henceforth called the Gutenberg English Poetry Corpus (GEPC), these texts are not (yet) edited, shortened, or cleaned.

For the present analyses, I cleaned (in large part manually) all 116 texts making up the GEPC, e.g., by deleting duplicate poems, prefaces, introductions, content tables, and indices of first lines, postscripts, biographical, and author notes, as well as footnotes (see text footnote 3) or line and page numbers, and by separating poems from plays or essays (e.g., in Yeats’s texts), so that only the poems themselves remain in the texts without any piece of prose. This was important to obtain a valid “poetry-only” subcorpus and a valid poetic language model for comparison with poetic texts or text fragments, such as metaphors (Jacobs and Kinder, , ). Without such cleaning, the computation of any ngram model, for instance, would be distorted by the prose parts. For the same reason, I also deleted poems in languages other than English, e.g., Lord Byron’s “Sonetto di Vitorelli,” PB Shelley’s “Buona Notte,” or TS Eliot’s “Dans le Restaurant.”

In a second step, I concatenated all poetic texts written by a specific author, which yielded a collection of 47 compound texts by the following authors: Aldous Huxley, Alexander Pope, Ambrose Bierce, Andrew Lang, Bret Harte, Charles Dickens, Charles Kingsley, DH Lawrence, Edgar Allan Poe, Elizabeth Barrett Browning, Ezra Pound, GK Chesterton, George Eliot, Herman Melville, James Joyce, James Russell Lowell, John Dryden, John Keats, John Milton, Jonathan Swift, Leigh Hunt, Lewis Carroll, Lord Byron, Lord Tennyson, Louisa May Alcott, Oscar Wilde, PB Shelley, Ralph Waldo Emerson, Robert Browning, Robert Frost, Robert Louis Stevenson, Robert Southey, Rudyard Kipling, Samuel Taylor Coleridge, Shakespeare, Sir Arthur Conan Doyle, Sir Walter Scott, Sir William Schwenck Gilbert, TS Eliot, Thomas Hardy, Walt Whitman, Walter de la Mare, William Blake, William Butler Yeats, William Dean Howells, William Makepeace Thackeray, and William Wordsworth. These 47 compound texts differ in a variety of surface and deep structure features, some of which are analyzed in the following sections. As can be seen in Table A1 in the Appendix, text length also varies considerably across authors (exponential distribution with a median of 23,000 words): the top three authors are Lord Byron (~210,000 words), PB Shelley (~165,000), and Wordsworth (~115,000); the “flop” three are Alcott (<400), Pound (~1,200), and Joyce (~1,200). The majority of texts have fewer than 40,000 words. The entire GEPC comprises 1,808,160 words (tokens) and 41,857 types.

Results: Some Exemplary Analyses of the GEPC

Like other fields (e.g., Computational Linguistics or Digital Humanities), Neurocognitive Poetics (Jacobs, ) uses text corpora for many purposes, e.g., the abovementioned computation of surprisal values.

Other purposes are similarity analyses, which can be based on features extracted by latent semantic, topic, or sentiment analyses (e.g., Deerwester et al., ; Turney and Littman, ; Schmidtke et al., ; Jacobs et al., ; Roe et al., ). Such features can then be used to train classifiers for identifying authors, periods of origin or main motifs, as well as for predicting ratings and other response data of poetic texts (e.g., van Halteren et al., ; Stamatatos, ; Jacobs and Kinder, , ; Jacobs et al., ).

Similarity Analyses

As an example of a similarity analysis, Figure 1 shows a multidimensional scaling representation of the 47 texts of the GEPC based on latent semantic analysis (document-term-matrix/DTM analysis; see text footnote 4). Not surprisingly, this analysis reveals, e.g., that the “Lake poets” (e.g., Coleridge, Wordsworth) cluster together, or that some poets like Pound or Joyce stand out from the rest. The latter finding is related to the fact that the GEPC is most representative of poetry from the nineteenth century, a limitation discussed below.
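The starting point of such a similarity analysis can be sketched in a few lines of Python. The sketch below only builds a document-term matrix and compares raw term-count vectors by cosine similarity; it deliberately omits the dimensionality reduction of latent semantic analysis and the multidimensional scaling step. The function names and the three toy texts are hypothetical.

```python
import math
from collections import Counter

def document_term_matrix(texts):
    """Return (vocabulary, rows), where rows[i][j] counts term j in text i."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-ins for three compound author texts.
texts = [
    "the lake the hills the quiet lake",
    "the quiet hills above the lake",
    "engine city neon engine",
]
vocab, dtm = document_term_matrix(texts)
for i in range(len(texts)):
    print([round(cosine_similarity(dtm[i], dtm[j]), 2) for j in range(len(texts))])
```

The resulting similarity matrix is the kind of input that a scaling or clustering procedure would then turn into a two-dimensional map of authors.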

Figure 2 shows a heat map comparing the 20 most significant topics (as extracted by Non-Negative Matrix Factorization/NMF; Pedregosa et al., ) for the 47 texts (a list of the 20 most significant words per topic is given in the Appendix). The color code is proportional to the probability of a given topic, i.e., all 20 values per author add up to 1.

The data summarized in Figure 2 and the topic list (Appendix) reveal, e.g., that the (statistically) most important topic for Shakespeare’s sonnets is topic #16 represented by the following 20 keyword stems: “natur spirit hath everi truth right hope think doth back find much faith art free round whole set drop.” By contrast, topic #2 appears to be the most important for Lord Tennyson (keyword stems: “king knight arthur round queen answer mine lancelot saw lord mother arm call name thine hall among child hath speak”).

Moreover, the data of Figure 2 reveal that texts like those of Walt Whitman cover only four of the 20 topics (#1, 3, 4, and 17 have probabilities >0), whereas other authors such as Charles Dickens cover a large range of topics (i.e., 15/20 with p > 0). Such data can be used further in deeper analyses of generic poetic texts such as Shakespeare sonnets that look at topics important for esthetic success (e.g., Simonton, ) or for evoking specific affective and esthetic reader responses (Jacobs et al., ).
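How per-author topic values that add up to 1 and a count of covered topics can be obtained is sketched below. It assumes that each author is described by a row of non-negative topic weights (as a factorization such as NMF would produce) and that the probabilities result from dividing each weight by the row sum; both the normalization and the function names are assumptions of this sketch.

```python
def topic_probabilities(weights):
    """Scale non-negative topic weights so that they sum to 1."""
    total = sum(weights)
    if total == 0:
        return [0.0] * len(weights)
    return [w / total for w in weights]

def covered_topics(probabilities):
    """1-based indices of all topics with probability > 0."""
    return [i for i, p in enumerate(probabilities, start=1) if p > 0]

# Hypothetical weights of one author on four topics.
probs = topic_probabilities([2.0, 0.0, 1.0, 1.0])
print(probs, covered_topics(probs))
```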

It should be noted5 that at least a part of these topics represents authors and works as much as (or instead of) actual abstract concepts, likely because proper names were not filtered out. Thus, topic #18 (with terms like “cuchullain,” “fintain,” and “laigair”) primarily describes Yeats’s work and topic #2 Tennyson’s “Idylls of the King.” Comparative topic analyses using different algorithms and filters for, e.g., higher-frequency function words like "hath" or "without" can help determine the generality of such topics but are beyond the scope of this first paper introducing a new corpus. As argued elsewhere (see text footnote 1), the non-trivial interpretation of such data-driven learned topics can benefit from augmenting it by top-down conceptual tools such as the Cambridge Advanced Learner’s Dictionary (Steyvers et al., ), expert knowledge in an iterative topic modeling process (Andrzejewski et al., ), or qualitative analyses concerning thematic richness or symbolic imagery (cf. Jacobs et al., ). This can help in the creative task of finding a superordinate label for, say, the 20 words describing topic #16. As argued by Roe et al. () who recently applied a topics analysis to the “French Encyclopedie,” “the usefulness of a topic model does not necessarily rest on its ability to provide meaningful topics (a subjective categorization) for the corpus being analyzed, but rather on the multiplicity of perspectives it can generate and, as a result, on the potential for discovery that some of these topics can offer.” In Neurocognitive Poetics, unsupervised topic modeling can also fulfill the role of a naïve “null-model” against which expert interpretations concerning focus and diversity (e.g., Vendler, ) can be gauged.

Comparing Word Uniqueness and Distinctiveness for Two Texts

A third and last example of how to use the present GEPC concerns a more detailed comparative analysis for a subset of the 47 texts including surface and semantic features. This is done for two authors with shorter texts of comparable length: Blake vs. Dickens (4,439 words vs. 3,758). We have recently provided an extensive comparative QNA of all 154 Shakespeare sonnets looking at both surface and deep semantic features. For example, we compared features such as poem or line surprisal, syntactic simplicity, deep cohesion, or emotion and mood potential (Jacobs et al., ). As an example of another interesting feature not considered in our previous study, here I will focus on word distinctiveness or keyness. In computing this feature, I closely followed the procedure proposed by DARIAH (Digital Research Infrastructure for the Arts and Humanities; https://de.dariah.eu/tatom/feature_selection.html). According to DARIAH’s operationalization, one way to consider words as distinctive is when they are found exclusively in texts associated with a single author (or group). For example, if Dickens uses the word “squire” in the present GEPC and Blake never does, one can count “squire” as distinctive or unique (in this comparative context). Vice versa, the word “mother” is distinctive of Blake in this GEPC comparison, because Dickens never uses it (see Table 1).

Author/words	SQUIRE	LUCI	MOTHER	FINE	LAMB
Blake	0	0	7.9	0	7.2
Dickens	11.2	8.7	0	7.5	0

Table 1. Five unique words with usage rates (1/1,000) in Blake’s and Dickens’ poems of the GEPC.

Identifying unique words simply requires calculating the average rate of word use across all texts for each author and then looking for cases where the average rate is zero for one author. Based on the DTMs for both texts, this yielded the results shown in Table 1.
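The uniqueness criterion can be sketched as follows: compute each author's usage rate per 1,000 tokens and keep the words whose rate is zero for the other author. The function names and the toy token lists are hypothetical, and the sketch works on one compound text per author rather than on a full DTM.

```python
from collections import Counter

def usage_rates(tokens, per=1000):
    """Usage rate of every word per `per` tokens of one author's text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: per * n / total for word, n in counts.items()}

def unique_words(rates_a, rates_b):
    """Words that author A uses and author B never does."""
    return sorted(w for w, r in rates_a.items() if r > 0 and rates_b.get(w, 0) == 0)

# Toy token lists standing in for the two compound texts.
blake = "little lamb who made thee mother mother".split()
dickens = "the fine old squire the fine squire".split()
rates_blake, rates_dickens = usage_rates(blake), usage_rates(dickens)
print(unique_words(rates_blake, rates_dickens))
print(unique_words(rates_dickens, rates_blake))
```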

Another approach to measuring keyness is to compare the average rate at which authors use a word by calculating the difference between the rates. Using this measure, I calculated the top five distinctive words in the Blake−Dickens comparison by dividing the difference in both authors’ average rates by the average rate across all 47 authors.

Thus, appearing only once in the entire text, Dickens’ word stem “outgleam” in the line “Behold outgleaming on the angry main!” appears to be distinctive, much as the other four word stems in Table 2.

Author/words	dol’	outgleam	chalon	toor	vithin
Blake	0	0	0	0	0
Dickens	0.83	0.41	0.41	0.83	0.41

Table 2. Top five distinctive words (stems) with usage rates (1/1,000) in Blake’s and Dickens’ poems of the GEPC.
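The rate-difference measure described above can be written as the difference between the two authors' rates divided by the mean rate across all authors. A minimal sketch under that reading follows; the function names, the ranking by absolute value, and the toy rates are assumptions of this illustration.

```python
def keyness(word, rates_a, rates_b, all_author_rates):
    """(rate_a - rate_b) divided by the mean rate across all authors."""
    mean_rate = sum(r.get(word, 0.0) for r in all_author_rates) / len(all_author_rates)
    if mean_rate == 0:
        return 0.0
    return (rates_a.get(word, 0.0) - rates_b.get(word, 0.0)) / mean_rate

def top_distinctive(rates_a, rates_b, all_author_rates, n=5):
    """The n words with the largest absolute keyness in the A vs. B comparison."""
    words = set(rates_a) | set(rates_b)
    scored = {w: keyness(w, rates_a, rates_b, all_author_rates) for w in words}
    return sorted(scored, key=lambda w: abs(scored[w]), reverse=True)[:n]

# Hypothetical rates (per 1,000 words) for three authors.
blake = {"lamb": 7.2}
dickens = {"squire": 11.2}
other = {"lamb": 1.8}
print(top_distinctive(blake, dickens, [blake, dickens, other], n=2))
```

Dividing by the corpus-wide mean rate favors words that are rare elsewhere, which is one way to understand why a word occurring only once can still rank as highly distinctive.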

A final quantitative comparison inspired by DARIAH’s approach to determining word distinctiveness uses a Bayesian group (or author) comparison. It involves estimating the belief that the observed word frequencies differ significantly, using a probability distribution called the sampling model. This model assumes that the rates come from two different normal distributions, and the question to be answered is how confident one can be that the means of the two normal distributions differ. The degree of confidence (i.e., a Bayesian probability) that the means are indeed different is then another probabilistic measure of distinctiveness.

Using a Gibbs sampler to get a distribution of posterior values for δ (see text footnote 6), the variable estimating the belief about the difference in authors’ word usage (for details, see https://de.dariah.eu/tatom/feature_selection.html; cf. Burrows, ), I computed the probability p(δ < 0) for the words “squire” and “fine,” both more characteristic of Dickens’ poems than of Blake’s (see Table 3).

Measure	SQUIRE	FINE
p (δ < 0)	0.23	0.09
Blake average	0	0
Dickens average	11.2	7.5

Table 3. Bayesian probability estimates (based on 2,000 samples) for two distinctive words (SQUIRE, FINE) with usage rates (1/1,000) in Blake’s and Dickens’ poems of the GEPC.
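Once posterior samples of δ are available, a probability such as p(δ < 0) is simply the share of samples below zero. The sketch below shows only this last step; it does not implement the Gibbs sampler, and the normally distributed draws are a hypothetical stand-in for real posterior samples.

```python
import random

def probability_below_zero(samples):
    """Monte Carlo estimate of p(delta < 0) from posterior samples."""
    return sum(1 for s in samples if s < 0) / len(samples)

# Stand-in for 2,000 posterior draws of delta (hypothetical values).
random.seed(0)
draws = [random.gauss(0.5, 1.0) for _ in range(2000)]
print(probability_below_zero(draws))
```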

According to this Bayesian analysis, “squire” appears more distinctive of Dickens’ poetry than “fine,” but since both words do not produce a high probability of differing from zero, I would not put much belief in them being specifically characteristic of Dickens in the GEPC (although they are most distinctive in comparison to Blake, see Tables 1 and 3). This Bayesian “feature selection” method can be extended to every word occurring in a corpus producing a useful ordering of characteristic words (for details, see https://de.dariah.eu/tatom/feature_selection.html).

Comparing Two Individual Poems

The above analyses dealt with the entire GEPC or two poem collections, respectively. Next, I focus on a more detailed—purely descriptive—comparison of two short individual texts from the GEPC that are far apart from each other (and the rest of the poems) in the similarity graph shown in Figure 1: George Eliot’s poem “How Lisa Loved the King” and James Joyce’s “Chamber Music.” I will give just a few illustrative statistics both for surface and for deeper semantic features that are of potential use in Digital Humanities and Neurocognitive Poetics studies (for review on the latter, see Jacobs, a; Jacobs et al., ).

Two features that are often used as indicators of linguistic complexity, poetic quality, or esthetic success are lexical diversity (measured by the type–token ratio) and the adjective–verb quotient: for example, “better” Shakespeare sonnets are distinguished by a higher type–token ratio, more unique words, and a higher adjective–verb quotient (e.g., Simonton, ). The number of types can also be considered a coestimate of the size of an author’s (active) mental lexicon and vocabulary profile. As can be seen in columns 2 and 3 of Table 4, both poems descriptively do not differ much on these features.

Author	Nbr. of word tokens/types/hapaxes/type–token ratio (lexical diversity)	Nbr. of nouns, verbs, adjectives/adjective–verb quotient	Most freq. nouns, verbs, adjectives	Most freq. bi- and trigram collocations	Mean sonority score	Mean positive and negative valence, and arousal/most positive, negative, and arousing word
Eliot	2,702, 1,467, 1,014, 0.5	1,111, 686, 642, 0.93	LOVE (19), LIFE (15), SOUL (12); love (7), see (5), live (3); little (13), high (9), good (9)	“King Pedro” (4), “day might” (2), “death tell” (2), “Six hundred years” (2), “Hundred years ago” (2), “T gentle Lisa” (1)	5.19	1.01, 0.84, 2.01; happiness, shame, happiness
Joyce	1,221, 654, 447, 0.53	507, 313, 270, 0.86	LOVE (23), HEART (18), AIR (9); love (7), come (3), sleep (2); sweet (13), soft (9), fair (5)	“true love” (4), “long hair” (3), “pretty air” (3), “combing long hair” (2), “would sweet bosom” (2), “singing merry air” (2)	5.26	1.02, 0.85, 2.03; happiness, sadness, happiness

Table 4. Some exemplary statistics for two poems.
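The lexical diversity measures in columns 2 and 3 of Table 4 follow directly from token counts. A minimal sketch is given below; the function names and the lowercasing of tokens are assumptions, and part-of-speech tagging (needed to obtain the adjective and verb counts) is left out.

```python
from collections import Counter

def lexical_stats(tokens):
    """Token, type, and hapax counts plus the type-token ratio."""
    counts = Counter(t.lower() for t in tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "hapaxes": hapaxes, "ttr": ttr}

def adjective_verb_quotient(n_adjectives, n_verbs):
    """Number of adjectives divided by number of verbs."""
    return n_adjectives / n_verbs

print(lexical_stats("Love love heart air".split()))
# Adjective and verb counts for Eliot's poem as listed in Table 4.
print(adjective_verb_quotient(642, 686))
```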

Looking at the three most frequent nouns, verbs, and adjectives, as well as significant bi- and trigram collocations in columns 4 and 5, the keywords suggest that both poems have much to say about one of three favorite poetry motifs, i.e., love. This is also evident from the two lexical dispersion plots shown in Figure 3, which show, among other things, that “love” appears well distributed across the entire poems, never letting the reader forget the poems’ central motif.
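The data behind a lexical dispersion plot are just the token positions at which a target word occurs. The following sketch computes these offsets and their relative positions in the text; the function names and the toy line are hypothetical, and the plotting itself is omitted.

```python
def dispersion_offsets(tokens, target):
    """Token positions (0-based) at which `target` occurs, ignoring case."""
    target = target.lower()
    return [i for i, t in enumerate(tokens) if t.lower() == target]

def relative_positions(tokens, target):
    """Offsets rescaled to the interval [0, 1) for comparing texts of different length."""
    n = len(tokens)
    return [i / n for i in dispersion_offsets(tokens, target)]

tokens = "my love is like a red red rose my love".split()
print(dispersion_offsets(tokens, "love"))
print(relative_positions(tokens, "love"))
```

Rescaling to relative positions makes the distribution of a word such as “love” comparable between a longer and a shorter poem.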

Poetic language expertly plays with the sound-meaning nexus, and our group has provided empirical evidence that sublexical phonological features play a role in (written) poetry reception (Schrott and Jacobs, ; Aryani et al., , ; Schmidtke et al., ; Jacobs, ,; Jacobs et al., , ; Ullrich et al., ). A sublexical phonological feature with poetic potential is the sonority score (Jacobs, ; Jacobs and Kinder, ; see Appendix A for details). It is based on the notion of sonority profile (cf. Clements, ; Stenneken et al., ) which rises maximally toward the peak and falls minimally toward the end, proceeding from left to right, for the universally preferred syllable type (Clements, , p. 301). Through a process of more or less unconscious phonological recoding, text sonority may play a role even in silent reading (Ziegler and Jacobs, ; Braun et al., ) and especially in reading poetic texts (Kraxenberger, ). Column six of Table 4 shows that the two poems differ little in their global sonority score. At a finer-grained level of individual lines or stanzas, sonority could still notably differ, however, and implicitly affect readers’ affective-esthetic evaluation (cf. Jacobs and Kinder, ).

An important task for QNA-based Neurocognitive Poetics studies is sentiment analysis, i.e., to estimate the emotional valence or mood potential of verbal materials (e.g., Jacobs et al., ). In principle, this is done with either of two methodological approaches: using word lists that provide values of word valence or arousal based on human rating data (e.g., Jacobs et al., ), or applying a method proposed by Turney and Littman () based on associations of a target word with a set of labels, i.e., keywords assumed to be prototypical for a certain affect or emotion. Following previous research (Westbury et al., ), I computed the lexical features valence and arousal according to a procedure described in Appendix B.
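The second, label-based approach can be illustrated with a simplified sketch: a word's valence is taken as its mean similarity to positive label words minus its mean similarity to negative label words. This is not the procedure of Appendix B; the use of cosine similarity, the averaging, the label words, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_valence(word_vec, positive_labels, negative_labels):
    """Mean similarity to positive labels minus mean similarity to negative ones."""
    pos = sum(cosine(word_vec, v) for v in positive_labels) / len(positive_labels)
    neg = sum(cosine(word_vec, v) for v in negative_labels) / len(negative_labels)
    return pos - neg

# Hypothetical 2-D vectors standing in for real word embeddings.
vectors = {
    "happiness": [0.9, 0.1],
    "shame": [0.1, 0.9],
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
}
for word in ("happiness", "shame"):
    score = label_valence(vectors[word], [vectors["good"]], [vectors["bad"]])
    print(word, round(score, 2))
```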

The mean values in the rightmost column of Table 4 indicate that at this global level, both poems practically do not differ on any of these three affective features. This can be visualized for the entire poems by the 3D plots of the principal components extracted from the three variables for all words in the poems: descriptively, they appear very similar (see Figure 4). All other things being equal, this suggests that, e.g., human ratings of the global affective meaning of both poems should not differ significantly (cf. Aryani et al., ). Of course, at the local level, a deeper qualitative analysis of both poems may reveal that they do in fact inhabit completely different esthetic universes which influence such ratings, as pointed out by one reviewer who also noted that “the formal differences (Joyce’s variety of line lengths, meters, and stanza shapes vs. Eliot’s fairly straight ahead iambic pentameter) have a strong impact on the atmosphere the poems create.” As we have repeatedly argued elsewhere (Jacobs, b; Jacobs et al., ; Abramo et al., under revision; see text footnote 7), QNA-based text analyses like these global affect scores should be complemented by qualitative analyses of style figures, done by interdisciplinary experts, at all levels of the 4 × 4 matrix proposed in Jacobs (b), i.e., metric, phonological, morpho-syntactic, semantic, as well as sublexical, lexical, interlexical, and supralexical. The Foregrounding Assessment Matrix recently proposed by Abramo et al. (under revision) is such a useful tool: it allows identifying density fields of overlapping style figures at several levels, e.g., sublexical–phonological (alliteration) and interlexical–semantic (metaphor). As a promising first result, the combined qualitative–quantitative analysis of Shakespeare’s sonnet 60 allowed these authors to predict the keyword score (i.e., words marked by readers as being “keywords” for understanding the sonnet) with an accuracy of about 90%.

It is thus important to note that QNA-based analyses like those in Table 4 are not meant to replace deep qualitative analyses of texts like Vendler’s (Vendler, ) interpretation of Shakespeare’s sonnets. However, for designing Neurocognitive Poetics studies, which involve selecting and matching complex verbal materials on a variety of feature dimensions, they are a necessity (cf. Jacobs and Willems, ).

Discussion

In this paper, I have briefly described a relatively big corpus of English literary texts, the GLEC, for use in studies of Computational Linguistics, Digital Humanities, or Neurocognitive Poetics. As a whole, the GLEC requires further processing (e.g., cleaning, regrouping according to subgenres, etc.) before it can be used as a training and/or test corpus for future studies. Using a smaller subcorpus already cleaned and consisting of 116 poetry collections, poems, and ballads from 47 authors, i.e., the GEPC, I presented a few exemplary QNA studies in detail. In these explorations of the GEPC, I showed how to use similarity and topic analyses for comparing and grouping texts, several methods for identifying distinctive words, and procedures for quantifying important features that can influence reader responses to literary texts, e.g., lexical diversity, sonority score, valence, or arousal. The GEPC thus can be applied to a variety of research questions such as authorship and period of origin classifications (cf. Stamatatos, ), the prediction of beauty ratings for metaphors (e.g., Jacobs and Kinder, , ), or the design of neuroimaging studies using literary stimuli (e.g., Bohrn et al., ; O’Sullivan et al., ).

The still relatively rare application of corpus-based QNA to poetry is an integral part of the Neurocognitive Poetics Perspective (e.g., Jacobs, ,; O’Sullivan et al., ; Willems and Jacobs, ; Jacobs and Willems, ; Jacobs et al., ; Nicklas and Jacobs, ; Jacobs and Kinder, ), because it offers the possibility of neurocognitive experiments with complex, natural verbal stimuli that can vary on a plethora of features (e.g., >70 in Jacobs & Kinder’s recent metaphor study). While QNA is a necessary first step for state-of-the-art statistical data analyses (e.g., in eye tracking or fMRI studies), augmenting it with interdisciplinary expert qualitative text analyses (e.g., see text footnote 7) is also necessary, because the rich esthetics of poetry is based on a complex author-(con-)text-reader nexus that QNA tools alone cannot cope with. While the detection, dynamic development (across poem parts), and interpretation of metaphors are a case in point, the above data-driven topic analyses also indicate both the potential and limitations of QNA not only for poetry but also for prose or scientific texts, where they can be considered a useful complement to traditional methods of close reading (e.g., Roe et al., ).

The GLEC and GEPC are two of many available training corpora and can be compared or also combined with much larger general corpora, such as ukwac (Baroni et al., ). I have discussed strengths and limits of a dozen training corpora useful for Neurocognitive Poetics and computational stylistics elsewhere (see text footnote 1). The obvious limitation of the present corpus lies in its texts being relatively “old”: due to copyright issues, the GLEC and GEPC contain only texts from 1623 to 1952, the majority of the GEPC stemming from the nineteenth century (Median = 1885). This limitation can at least partly be overcome by merging the GEPC with the contemporary ukwac corpus (>2 billion words), for example. To what extent this ukwac-GEPC merger is appropriate for studies using more modern or contemporary prose and poetry texts is an open theoretical and empirical question to be addressed in future comparative research. The successful application of the GLEC as a reliable language model (with a hit rate of 100%) for the computation of the surprisal values of 464 metaphors which also included contemporary ones (Katz et al., ) is encouraging in this respect (Jacobs and Kinder, ).

The development of appropriate open-access training corpora—which are both sufficiently specific and representative for the research materials and reader population under investigation—is one of four general desiderata of current computational stylistics and Neurocognitive Poetics (see text footnote 1) together with the development of combined qualitative–quantitative narrative analysis (Q2NA) and machine-learning tools for feature extraction, standard ecologically valid literary test materials (Hanauer, ), and open-access reader response data banks. These developments will not replace the art of close reading and interpreting literary texts, but—paraphrasing Hanauer—they may well lead to “stronger and more generalizable hypotheses about literary phenomena in the future” and thus attract and generate more cross-disciplinary research which ideally leads to a cross-fertilization between the humanities and sciences in the domain of literature, much in the spirit of Zipf (), Turner and Poeppel (), or Michel et al. ().

Statements

Author contributions

AJ conceived and wrote the MS and carried out all original work (data collection and analysis, Python programming, etc.) reported herein.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer, CC, and handling Editor declared their shared affiliation.

References

1. Andrzejewski, D., Zhu, X., and Craven, M. (2009). Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, 25–32. New York, NY: ACM.
2. Aryani, A., Jacobs, A. M., and Conrad, M. (2013). Extracting salient sublexical units from written texts: “Emophon,” a corpus-based approach to phonological iconicity. _Frontiers in Psychology_ 4:654. doi: 10.3389/fpsyg.2013.00654
3. Aryani, A., Kraxenberger, M., Ullrich, S., Jacobs, A. M., and Conrad, M. (2016). Measuring the basic affective tone of poems via phonological saliency and iconicity. _Psychology of Aesthetics, Creativity, and the Arts_ 10: 191–204. doi: 10.1037/aca0000033
4. Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. (2009). The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora. _Language Resources and Evaluation_ 43: 209–26. doi: 10.1007/s10579-009-9081-4
5. Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. Sebastopol, CA: O’Reilly Media, Inc.
6. Bohrn, I. C., Altmann, U., Lubrich, O., Menninghaus, W., and Jacobs, A. M. (2013). When we like what we know – a parametric fMRI analysis of beauty and familiarity. _Brain and Language_ 124: 1–8. doi: 10.1016/j.bandl.2012.10.003
7. Bornet, C., and Kaplan, F. (2017). A simple set of rules for characters and place recognition in French novels. _Frontiers in Digital Humanities_ 4:6. doi: 10.3389/fdigh.2017.00006
8. Braun, M., Hutzler, F., Ziegler, J. C., Dambacher, M., and Jacobs, A. M. (2009). Pseudo homophone effects provide evidence of early lexico-phonological processing in visual word recognition. _Human Brain Mapping_ 30: 1977–89. doi: 10.1002/hbm.20643
9. Brysbaert, M., and New, B. (2009). Moving beyond Kučera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. _Behavior Research Methods_ 41: 977–90. doi: 10.3758/BRM.41.4.977
10. Burrows, J. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. _Literary and Linguistic Computing_ 17: 267–87. doi: 10.1093/llc/17.3.267
11. Clements, G. N. (1990). The role of sonority in core syllabification. In Papers in Laboratory Phonology I. Between the Grammar and Physics of Speech, Edited by Kingston, J., and Beckman, M. E., 283–333. Cambridge: CUP.
12. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. _Journal of the American Society for Information Science_ 41: 391–407. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
13. Frank, S. L. (2013). Uncertainty reduction as a measure of cognitive load in sentence comprehension. _Topics in Cognitive Science_ 5: 475–94. doi: 10.1111/tops.12025
14. Ganascia, J.-G. (2015). The logic of the big data turn in digital literary studies. _Frontiers in Digital Humanities_ 2:7. doi: 10.3389/fdigh.2015.00007
15. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. _Machine Learning_ 63: 3–42. doi: 10.1007/s10994-006-6226-1
16. Hanauer, D. (2017). Towards a critical mass of accumulated knowledge in the field of scientific literary studies. _Scientific Study of Literature_ 7: 1–3. doi: 10.1075/ssol.7.1.01edi
17. Jacobs, A. M. (2011). Neurokognitive Poetik: Elemente eines Modells des literarischen Lesens [Neurocognitive poetics: Elements of a model of literary reading]. In Gehirn und Gedicht: Wie wir unsere Wirklichkeiten konstruieren [Brain and Poetry: How We Construct Our Realities], Edited by Schrott, R., and Jacobs, A. M., 492–520. Munich: Carl Hanser.
18. Jacobs, A. M. (2015a). Neurocognitive poetics: methods and models for investigating the neuronal and cognitive–affective bases of literature reception. _Frontiers in Human Neuroscience_ 9:186. doi: 10.3389/fnhum.2015.00186
19. Jacobs, A. M. (2015b). Towards a neurocognitive poetics model of literary reading. In Cognitive Neuroscience of Natural Language Use, Edited by Willems, R., 135–159. Cambridge, England: Cambridge University Press.
20. Jacobs, A. M. (2015c). The scientific study of literary experience: sampling the state of the art. _Scientific Study of Literature_ 5: 139–70. doi: 10.1075/ssol.5.2.01jac
21. Jacobs, A. M. (2017). Quantifying the beauty of words: a neurocognitive poetics perspective. _Frontiers in Human Neuroscience_ 11:622. doi: 10.3389/fnhum.2017.00622
22. Jacobs, A. M., Hofmann, M. J., and Kinder, A. (2016a). On elementary affective decisions: to like or not to like, that is the question. _Frontiers in Psychology_ 7:1836. doi: 10.3389/fpsyg.2016.01836
23. Jacobs, A. M., Lüdtke, J., Aryani, A., Meyer-Sickendiek, B., and Conrad, M. (2016b). Mood-empathic and aesthetic responses in poetry reception: a model-guided, multilevel, multimethod approach. _Scientific Study of Literature_ 6: 87–130. doi: 10.1075/ssol.6.1.06jac
24. Jacobs, A. M., and Kinder, A. (2017). The brain is the prisoner of thought: a machine-learning assisted quantitative narrative analysis of literary metaphors for use in Neurocognitive Poetics. _Metaphor and Symbol_ 32: 139–60. doi: 10.1080/10926488.2017.1338015
25. Jacobs, A. M., and Kinder, A. (2018). What makes a metaphor literary? Answers from two computational studies. _Metaphor and Symbol_, in press.
26. Jacobs, A. M., Schuster, S., Xue, S., and Lüdtke, J. (2017). What’s in the brain that ink may character ….: a quantitative narrative analysis of Shakespeare’s 154 sonnets for use in neurocognitive poetics. _Scientific Study of Literature_ 7: 4–51. doi: 10.1075/ssol.7.1.02jac
27. Jacobs, A. M., Võ, M. L.-H., Briesemeister, B. B., Conrad, M., Hofmann, M. J., Kuchinke, L., et al. (2015). 10 years of BAWLing into affective and aesthetic processes in reading: what are the echoes? _Frontiers in Psychology_ 6:714. doi: 10.3389/fpsyg.2015.00714
28. Jacobs, A. M., and Willems, R. M. (2018). The fictive brain: neurocognitive correlates of engagement in literature. _Review of General Psychology_, in press. doi: 10.1037/gpr0000106
29. Jakobson, R., and Lévi-Strauss, C. (1962). “Les chats” de Charles Baudelaire. _L’homme_ 2: 5–21. doi: 10.3406/hom.1962.366446
30. Jurafsky, D., and Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Upper Saddle River, NJ: Prentice-Hall.
31. Katz, A., Paivio, A., Marschark, M., and Clark, J. (1988). Norms for 204 literary and 260 non-literary metaphors on psychological dimensions. _Metaphor and Symbolic Activity_ 3: 191–214. doi: 10.1207/s15327868ms0304_1
32. Kraxenberger, M. (2017). On Sound-Emotion Associations in Poetry. Ph.D. thesis, Freie Universität Berlin.
33. Leech, G. N. (1969). A Linguistic Guide to English Poetry. London, UK: Longman.
34. Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Google Books Team, et al. (2011). Quantitative analysis of culture using millions of digitized books. _Science_ 331: 176–82.
35. Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.
36. Moretti, F. (2005). Graphs, Maps, Trees: Abstract Models for a Literary History. London: Verso.
37. Nicklas, P., and Jacobs, A. M. (2017). Rhetorics, neurocognitive poetics and the aesthetics of adaptation. _Poetics Today_ 38: 393–412. doi: 10.1215/03335372-3869311
38. O’Sullivan, N., Davis, P., Billington, J., Gonzalez-Diaz, V., and Corcoran, R. (2015). “Shall I compare thee”: the neural basis of literary awareness, and its benefits to cognition. _Cortex; A Journal Devoted to the Study of the Nervous System and Behavior_ 73: 144–57. doi: 10.1016/j.cortex.2015.08.014
39. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). scikit-learn: machine learning in Python. _The Journal of Machine Learning Research_ 12: 2825–30.
40. Roe, G., Gladstone, C., and Morrissey, R. (2016). Discourses and disciplines in the Enlightenment: topic modeling the French Encyclopédie. _Frontiers in Digital Humanities_ 2:8. doi: 10.3389/fdigh.2015.00008
41. Schmidtke, D. S., Schröder, T., Jacobs, A. M., and Conrad, M. (2014a). ANGST: affective norms for German sentiment terms, derived from the affective norms for English words. _Behavior Research Methods_ 46: 1108–18. doi: 10.3758/s13428-013-0426-y
42. Schmidtke, D. S., Conrad, M., and Jacobs, A. M. (2014b). Phonological iconicity. _Frontiers in Psychology_ 5:80. doi: 10.3389/fpsyg.2014.00080
43. Schrott, R., and Jacobs, A. M. (2011). Gehirn und Gedicht: Wie wir unsere Wirklichkeiten konstruieren (Brain and Poetry: How We Construct Our Realities). München, Germany: Hanser.
44. Simonton, D. K. (1989). Shakespeare’s Sonnets: a case of and for single–case historiometry. _Journal of Personality_ 57: 695–721. doi: 10.1111/j.1467-6494.1989.tb00568.x
45. Simonton, D. K. (1990). Lexical choices and aesthetic success: a computer content analysis of 154 Shakespeare sonnets. _Computers and the Humanities_ 24: 254–64.
46. Stamatatos, E. (2009). A survey of modern authorship attribution methods. _Journal of the American Society for Information Science & Technology_ 60: 538–56. doi: 10.1002/asi.21001
47. Stenneken, P., Bastiaanse, R., Huber, W., and Jacobs, A. M. (2005). Syllable structure and sonority in language inventory and aphasic neologisms. _Brain & Language_ 95: 280–92. doi: 10.1016/j.bandl.2005.01.013
48. Steyvers, M., Smyth, P., and Chemuduganta, C. (2011). Combining background knowledge and learned topics. _Topics in Cognitive Science_ 3: 18–47. doi: 10.1111/j.1756-8765.2010.01097.x
49. Stockwell, P. (2002). Cognitive Poetics: An Introduction. London: Routledge.
50. Tsur, R. (1983). _What is Cognitive Poetics?_ Tel Aviv: Katz Research Institute for Hebrew Literature.
51. Turner, F., and Poeppel, E. (1983). The neural lyre: poetic meter, the brain and time. _Poetry Magazine_ 12: 277–309.
52. Turney, P. D., and Littman, M. L. (2003). Measuring praise and criticism: inference of semantic orientation from association. _ACM Transactions on Information Systems (TOIS)_ 21: 315–46. doi: 10.1145/944012.944013
53. Ullrich, S., Aryani, A., Kraxenberger, M., Jacobs, A. M., and Conrad, M. (2017). On the relation between the general affective meaning and the basic sublexical, lexical, and interlexical features of poetic texts – a case study using 57 poems of H. M. Enzensberger. _Frontiers in Psychology_ 7:2073. doi: 10.3389/fpsyg.2016.02073
54. van den Hoven, E., Hartung, F., Burke, M., and Willems, R. (2016). Individual differences in sensitivity to style during literary reading: insights from eye-tracking. _Collabra: Psychology_ 2: 1–16. doi: 10.1525/collabra.39
55. van Halteren, H., Baayen, R. H., Tweedie, F., Haverkort, M., and Neijt, A. (2005). New machine learning methods demonstrate the existence of a human stylome. _Journal of Quantitative Linguistics_ 12: 65–77. doi: 10.1080/09296170500055350
56. Vendler, H. (1997). The Art of Shakespeare’s Sonnets. Cambridge, MA: Harvard University Press.
57. Westbury, C., Keith, J., Briesemeister, B. B., Hofmann, M. J., and Jacobs, A. M. (2014). Avoid violence, rioting, and outrage; approach celebration, delight, and strength: using large text corpora to compute valence, arousal, and the basic emotions. _Quarterly Journal of Experimental Psychology_ 68: 1599–622. doi: 10.1080/17470218.2014.970204
58. Willems, R., and Jacobs, A. M. (2016). Caring about Dostoyevsky: the untapped potential of studying literature. _Trends in Cognitive Sciences_ 20: 243–5. doi: 10.1016/j.tics.2015.12.009
59. Ziegler, J. C., and Jacobs, A. M. (1995). Phonological information provides early sources of constraint in the processing of letter strings. _Journal of Memory and Language_ 34: 567–93. doi: 10.1006/jmla.1995.1026
60. Ziegler, J. C., Stone, G. O., and Jacobs, A. M. (1997). What is the pronunciation for -ough and the spelling for /u/? A database for computing feedforward and feedback consistency in English. _Behavior Research Methods, Instruments, and Computers_ 29: 600–18. doi: 10.3758/BF03210615
61. Zipf, G. K. (1932). Selected Studies of the Principle of Relative Frequency in Language. Cambridge, MA: Harvard University Press.

Appendix

A. Computing the Sonority Score

Following previous work (Stenneken et al., ; Jacobs and Kinder, ) and considering that here we deal with written instead of spoken words, I used a simplified index based on the sonority hierarchy of English phonemes, which yields 10 ranks: [a] > [e o] > [i u j w] > [ɾ] > [l] > [m n ŋ] > [z v] > [f θ s] > [b d g] > [p t k]. Each grapheme of a word was assigned the rank value of the set it belongs to (10 for [a] down to 1 for [p t k]), and these values were summed. To control for word length, the sum was divided by the number of graphemes per word. Thus, MEMORY would get a value of 2 × 9 [e o] + 2 × 5 [m] + 1 × 7 [r] + 1 × 8 [y = /i/] = 43/6 ≈ 7.17, whereas SKUNK would get a value of 18/5 = 3.6. The final global sonority score of a poem is simply the mean of all word values in the poem. Of course, this simple additive model is only a first approximation, given the lack of any empirical data that would justify more complex models. Moreover, the fact that identical graphemes can have multiple context-dependent pronunciations in English, like the /a/ in “hAndbAll in the pArk” (Ziegler et al., ), is neglected in this first approximation, which considers written, not spoken, verbal materials.
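The word-level and poem-level computation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the script used for the paper: the letter-to-rank table is an assumption derived from the ten rank sets (with "y" treated as /i/, as in the MEMORY example), the phoneme symbols [ŋ] and [θ] have no single-letter counterpart and are left out, and letters that appear in none of the sets (c, h, q, x) are assumed to contribute 0. The names `RANK_SETS`, `word_sonority`, and `poem_sonority` are hypothetical.

```python
# Rank sets from the sonority hierarchy, ordered from rank 10 down to rank 1.
# Assumption: letters stand in for phonemes; "y" is counted as /i/ (rank 8).
RANK_SETS = ["a", "eo", "iujwy", "r", "l", "mn", "zv", "fs", "bdg", "ptk"]

GRAPHEME_RANK = {
    letter: 10 - index
    for index, letters in enumerate(RANK_SETS)
    for letter in letters
}


def word_sonority(word):
    """Sum of grapheme ranks divided by the number of graphemes in the word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    # Assumption: letters outside the ten rank sets add 0 to the sum.
    return sum(GRAPHEME_RANK.get(ch, 0) for ch in letters) / len(letters)


def poem_sonority(words):
    """Global sonority score: mean of the word values over all words."""
    scores = [word_sonority(w) for w in words if any(ch.isalpha() for ch in w)]
    return sum(scores) / len(scores) if scores else 0.0


print(word_sonority("skunk"))  # 18 / 5 = 3.6
```

Calling `poem_sonority` on a tokenized poem (a list of word strings) returns the unweighted mean of the word values, so short and long words count equally once each has been normalized by its length.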

Author	Nbr. of texts	Example text in GLEC, year of publication	GEPC text, length (nbr. of words)
1. Abraham Lincoln	16	Lincoln’s First Inaugural Address, 1861	–
2. Agatha Christie	2	The Secret Adversary, 1922	–
3. Albert Einstein	2	Relativity/The Special and General Theory, 1916	–
4. Aldous Huxley	3	Crome Yellow, 1921	The Defeat of Youth and Other Poems, 4,616
5. Alexander Pope	3	The Rape of the Lock and Other Poems, 1875	The Poetical Works, 82,870
6. Alfred Russel Wallace	5	Is Mars Habitable? 1907	–
7. Ambrose Bierce	18	A Cynic Looks at Life, 1912	Black Beetles in Amber, 23,815
8. Andrew Lang	60	Historical Mysteries, 1904	A Collection of Poems, 46,466
9. Anthony Trollope	71	The Eustace Diamonds, 1871	–
10. Arnold J. Toynbee	1	Turkey/A Past and a Future, 1917	–
11. Baroness Orczy	16	The Tangled Skein, 1907	–
12. Beatrix Potter	1	A Collection of Beatrix Potter Stories, 1902	–
13. Benjamin Disraeli	17	Vivian Grey, 1826	–
14. Benjamin Franklin	4	Autobiography of Benjamin Franklin, Version 4, 1791	–
15. Bertrand Russell	8	The Analysis of Mind, 1921	–
16. Bram Stoker	6	Dracula, 1897	–
17. Bret Harte	58	The Queen of the Pirate Isle, 1886	East and West, 6,737
18. Charles Darwin	20	The Expression of Emotion in Man and Animals, 1859	–
19. Charles Dickens	60	Oliver Twist, 1837	The Poems and Verses, 3,758
20. Charles Kingsley	44	True Words for Brave Men, 1884	Poems, 14,391
21. Charlotte Bronte	4	Jane Eyre, 1847	–
22. DH Lawrence	19	Women in Love, 1920	Collected Poems, 19,820
23. Edgar Allan Poe	11	The Masque of the Red Death, 1842	Complete Poetical Works, 8,117
24. Edgar Rice Burroughs	25	Tarzan of the Apes, 1912	–
25. Edmund Burke	15	Burke’s Speech on Conciliation with America, 1775	–
26. Edward P Oppenheim	53	The Zeppelin’s Passenger, 1918	–
27. Elizabeth B Browning		Sonnets From the Portuguese, 1850	The Poetical Works, 59,404
28. Emily Bronte	1	Wuthering Heights, 1847	–
29. Ezra Pound	2	Certain Noble Plays of Japan, 1916	Hugh Selwyn Mauberley, 1,181
30. George A Henty	89	Under Drake’s Flag, 1883	–
31. George Bernard Shaw	42	Pygmalion, 1912	–
32. George Eliot	13	Middlemarch, 1871	How Lisa Loved the King, 2,702
33. George Washington	1	State of the Union Addresses of George Washington, 1790	–
34. GK Chesterton	39	The Wisdom of Father Brown, 1914	Complete Poems, 29,867
35. Hamlin Garland	22	Money Magic, 1907	–
36. Harold Bindloss	43	Delilah of the Snows, 1907	–
37. Harriet EB Stowe	12	Uncle Tom’s Cabin, 1852	–
38. Hector Hugh Munro	7	The Toys of Peace, 1919	–
39. Henry David Thoreau	9	Walden and on the Duty of Civil Disobedience, 1854	–
40. Henry James	72	The Golden Bowl, 1904	–
41. Henry Rider Haggard	52	Love Eternal, 1918	–
42. Herbert George Wells	51	The War of the Worlds, 1897	–
43. Herbert Spencer	4	The Philosophy of Style, 1880	–
44. Herman Melville	16	Moby Dick, 1851	Poems, 19,088
45. Howard Pyle	11	The Merry Adventures of Robin Hood, 1883	–
46. Isaac Asimov	1	Youth, 1952	–
47. Jack London	48	The Sea-Wolf, 1904	–
48. Jacob Abbott	47	William the Conqueror, 1849	–
49. James Bowker	1	Goblin Tales of Lancashire, 1878	–
50. James F Cooper	36	The Last of the Mohicans, 1826	–
51. James Joyce	4	Ulysses, 1922	Chamber Music, 1,221
52. James Matthew Barrie	23	Peter Pan, 1911	–
53. James Otis (Kaler)	27	Dick in the Desert, 1893	–
54. James Russell Lowell	11	Abraham Lincoln, 1890	The Complete Poetical Works, 45,204
55. Jane Austen	8	Emma, 1815	–
56. Jerome K Jerome	30	Three Men in a Boat, 1898	–
57. John Bunyan	9	The Holy War, 1682	–
58. John Dryden	13	All for Love, 1678	The Poetical Works, 80,667
59. John Galsworthy	40	The Forsyte Saga, 1906–1921	–
60. John Keats	6	Endymion, 1818	Poems, 36,408
61. John Locke	3	An Essay Concerning Humane Understanding, 1689	–
62. John Maynard Keynes	1	The Economic Consequences of the Peace, 1919	–
63. John Morley	28	On Compromise, 1874	–
64. John Ruskin	42	A Joy For Ever, 1885	–
65. John Stuart Mill	11	Utilitarianism, 1861	–
66. Jonathan Swift	15	Gulliver’s Travels, 1726	The Poems, 85,834
67. Joseph Conrad	34	Lord Jim, 1899	–
68. Leigh Hunt	3	Stories From the Italian Poets/With Lives of the Writers, 1835	Captain Sword and Captain Pen, 2,260
69. Lewis Carroll	14	Symbolic Logic, 1896	Poems, 15,505
70. Lord Byron	12	Fugitive Pieces, 1806	Poetical Works, 207,977
71. Lord Tennyson	10	Lady Clara Vere de Vere, 1842	The Poems, 105,650
72. Louisa May Alcott	34	Little Women, 1869	Three Unpublished Poems, 386
73. Lucy M Montgomery	17	Anne of Green Gables, 1908	–
74. Lyman Frank Baum	42	The Wonderful Wizard of Oz, 1900	–
75. Mark Twain	46	The Adventures of Tom Sawyer, 1876	–
76. Mary Shelley	5	Frankenstein, 1818	–
77. Michael Faraday	2	Experimental Researches in Electricity, 1839	–
78. Mary Stewart Daggett	2	Mariposilla, 1895	–
79. Nathaniel Hawthorne	88	The Scarlet Letter, 1850	–
80. O Henry	14	The Gift of the Magi, 1905	–
81. Oscar Wilde	25	The Picture of Dorian Gray, 1890	Poems, 22,089
82. PB Shelley	7	Adonais, 1821	The Complete Poetical Works, 165,242
83. PG Wodehouse	35	A Damsel in Distress, 1919	–
84. Percival Lowell	2	The Soul of the Far East, 1896	–
85. Philip Kindred Dick	11	Mr. Spaceship, 1953	–
86. R M Ballantyne	88	The Red Eric, 1863	–
87. Rafael Sabatini	17	Scaramouche, 1921	–
88. Ralph Waldo Emerson	7	Nature, 1836	Poems, 29,446
89. Richard B Sheridan	5	Scarborough and the Critic, 1751	–
90. Robert Browning	7	Men and Women, 1855	Poems, 35,732
91. Robert Frost		A Boy’s Will, 1913	Poems, 15,518
92. Robert Hooke	1	Micrographia, 1665	–
93. Robert L Stevenson	79	A Child’s Garden of Verses, 1885	Poems, 33,755
94. Robert Southey	3	The Life of Horatio Lord Nelson, 1798	Poems, 23,857
95. Rudyard Kipling	42	The Jungle Book, 1894	Poems, 64,137
96. Samuel T Coleridge	13	The Rime of the Ancient Mariner, 1798	The Complete Poetical Works, 51,983
97. Sinclair Lewis	7	Babbitt, 1922	–
98. Sir Arthur Conan Doyle	57	The Adventures of Sherlock Holmes, 1892	Poems, 14,386
99. Sir Francis Galton	3	Inquiries Into Human Faculty and its Development, 1883	–
100. Sir Humphry Davy	1	Consolations in Travel, 1830	–
101. Sir Isaac Newton	3	Opticks, 1704	–
102. Sir Joseph Dalton Hooker	1	Himalayan Journals, 1854	–
103. Sir Richard Francis Burton	11	The Land of Midian, 1877	–
104. Sir Walter Scott	35	Ivanhoe, 1820	Poems, 46,846
105. Sir Winston Churchill	4	The River War, 1899	–
106. Sir William Schwenck Gilbert	5	Songs of a Savoyard, 1890	Poems, 31,138
107. Stephen Leacock	15	Frenzied Fiction, 1917	–
108. TS Eliot	4	The Waste Land, 1922	Poems, 4,661
109. Thomas Carlyle	32	History of Friedrich II of Prussia, 1895	–
110. Thomas Crofton Croker	1	A Walk From London to Fulham, 1813	–
111. Thomas Hardy	26	Tess of the d’Urbervilles, 1891	Poems, 62,756
112. Thomas Henry Huxley	44	Darwinian Essays, 1893	–
113. Thomas Robert Malthus	4	An Essay on the Principle of Population, 1798	–
114. Thornton Waldo Burgess	31	Mrs. Peter Rabbit, 1902	–
115. Ulysses Grant	3	State of the Union Addresses, 1875	–
116. Virginia Woolf	4	Night and Day, 1919	–
117. Walt Whitman	5	Leaves of Grass, 1855	Poems, 24,787
118. Walter de la Mare	10	The Return, 1910	Collected Poems, 15,765
119. Washington Irving	17	The Legend of Sleepy Hollow, 1820	–
120. Wilkie Collins	32	Hide and Seek, 1854	–
121. William Blake	3	Songs of Innocence, 1789	Poems, 4,439
122. William Butler Yeats	24	In the Seven Woods, 1903	Poems, 23,325
123. William Dean Howells	84	Annie Kilburn, 1888	Poems, 13,554
124. William Ewart Gladstone	1	On Books and the Housing of Them, 1890	–
125. William Henry Hudson	13	The Purple Land, 1885	–
126. William J Long	8	Ways of Wood Folk, 1899	–
127. William M Thackeray	30	Barry Lyndon, 1844	Ballads, 20,521
128. William Penn	2	A Brief Account of the Rise and Progress of the People Called Quakers, 1698	–
129. William Shakespeare	38	Macbeth, 1623	Sonnets, 8,721
130. William Somerset Maugham	13	Of Human Bondage, 1915	–
131. William Wordsworth	7	I Wandered Lonely as a Cloud, 1807	The Poetical Works, 116,683
132. Winston Churchill (novelist)	13	The Inside of the Cup, 1913	–

List of authors with example texts in the Gutenberg Literary English Corpus (GLEC) and total text lengths in the Gutenberg English Poetry Corpus (GEPC).

B. Computing Word Similarity, Valence, and Arousal

Following an early unsupervised learning approach proposed by Turney and Littman () and my own previous theory-guided research (Westbury et al., ), I computed the lexical features valence and arousal on the basis of (taxonomy-based) semantic associations of a target word with a set of labels, i.e., keywords assumed to be prototypical for a certain affect, e.g., positive valence. The procedure for computing valence and arousal, implemented as a Python script, was as follows. The script compared every target word with every word in the NLTK WordNet/WN database and computed the pairwise similarities (WNsim in Eq. A1 below, based on WN’s path-similarity metric), summed and averaged them for each target word, and then computed the difference between the means for the positive and negative lists (for valence; for arousal, the values were only summed and averaged):

$$\mathrm{valence}(\mathit{word}) = \frac{\mathrm{WNsim}(\mathit{word}, \mathrm{label}_{1pos}) + \dots + \mathrm{WNsim}(\mathit{word}, \mathrm{label}_{Npos})}{N_{pos}} - \frac{\mathrm{WNsim}(\mathit{word}, \mathrm{label}_{1neg}) + \dots + \mathrm{WNsim}(\mathit{word}, \mathrm{label}_{Nneg})}{N_{neg}} \tag{A1}$$

where label_1pos/1neg and label_Npos/Nneg are the first and last terms, respectively, in the valence lists given below, and N_pos and N_neg are the lengths of those lists.

The hit rates (i.e., overlap between words in the WN database and the present target words) were 80% for the Joyce poem and 77% for Eliot’s, which can be considered as reliable (Jacobs and Kinder, ).
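As a rough illustration of how such a hit rate could be obtained, the sketch below computes the share of target words that are found in a lexicon. It is not the original script: the function names are hypothetical, and counting word types (unique lower-cased words) rather than tokens is an assumption, since the text does not say which was used. The default lookup uses NLTK's `wordnet.synsets`, which requires NLTK and its WordNet data to be installed; any other membership test can be passed in instead.

```python
def in_wordnet(word):
    """True if WordNet has at least one synset for the word (needs NLTK data)."""
    from nltk.corpus import wordnet as wn  # imported lazily; optional dependency
    return bool(wn.synsets(word))


def hit_rate(target_words, in_lexicon=in_wordnet):
    """Share of unique target words that the lexicon lookup recognizes."""
    # Assumption: the rate is computed over word types, not tokens.
    types = {w.lower() for w in target_words}
    if not types:
        return 0.0
    hits = sum(1 for w in types if in_lexicon(w))
    return hits / len(types)


# Usage with a toy lexicon instead of WordNet:
toy_lexicon = {"love", "king"}
rate = hit_rate(["Love", "king", "thee", "love"], toy_lexicon.__contains__)
```

Passing the lookup as a parameter keeps the measure independent of any particular database, so the same function could be reused for other training corpora.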

Label words for the computation of positive and negative valences, as well as arousal (for details, see Westbury et al., , Table , row 2).

pos = ["contentment", "happiness", "pleasure", "pride", "relief", "satisfaction", "surprise"]
neg = ["disgust", "embarrassment", "fear", "sadness", "shame"]
aro = ["amusement", "anger", "contempt", "contentment", "disgust", "embarrassment", "excitement", "fear", "happiness", "interest", "pleasure", "relief", "sadness", "satisfaction"]
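The valence and arousal procedure described in this appendix might look roughly as follows in Python. This is a minimal sketch under stated assumptions rather than the script used for the paper: the function names are hypothetical, and taking the maximum path similarity over all synset pairs of two words (with 0.0 when no pair is comparable) is an assumption, because the text does not specify how multiple word senses were handled. The WordNet lookup relies on NLTK's `wordnet.synsets` and `Synset.path_similarity` and needs the WordNet data to be installed; the similarity function is a parameter so that the averaging logic can be checked without it.

```python
from statistics import mean

POS = ["contentment", "happiness", "pleasure", "pride", "relief",
       "satisfaction", "surprise"]
NEG = ["disgust", "embarrassment", "fear", "sadness", "shame"]
ARO = ["amusement", "anger", "contempt", "contentment", "disgust",
       "embarrassment", "excitement", "fear", "happiness", "interest",
       "pleasure", "relief", "sadness", "satisfaction"]


def wn_path_similarity(word_a, word_b):
    """WordNet path similarity between two words (needs NLTK + WordNet data)."""
    from nltk.corpus import wordnet as wn  # imported lazily; optional dependency
    scores = [
        first.path_similarity(second)
        for first in wn.synsets(word_a)
        for second in wn.synsets(word_b)
    ]
    scores = [s for s in scores if s is not None]
    # Assumption: best-matching sense pair; 0.0 if nothing is comparable.
    return max(scores, default=0.0)


def mean_similarity(word, labels, sim):
    """Average similarity of a target word to every label in a list."""
    return mean(sim(word, label) for label in labels)


def valence(word, sim=wn_path_similarity):
    """Mean similarity to positive labels minus mean similarity to negative ones."""
    return mean_similarity(word, POS, sim) - mean_similarity(word, NEG, sim)


def arousal(word, sim=wn_path_similarity):
    """Mean similarity to the arousal labels (no difference term)."""
    return mean_similarity(word, ARO, sim)
```

With NLTK installed and the WordNet data downloaded (for example via `nltk.download("wordnet")`), `valence("joy")` and `arousal("joy")` can be called directly; a poem-level score would then be an aggregate over the words of the poem that WordNet covers.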

Keywords

quantitative narrative analysis, digital literary studies, neurocognitive poetics, culturomics, language model, neuroaesthetics, affective-aesthetic processes, literary reading

Citation

Jacobs AM (2018) The Gutenberg English Poetry Corpus: Exemplary Quantitative Narrative Analyses. Front. Digit. Humanit. 5:5. doi: 10.3389/fdigh.2018.00005

Edited by

Robert J. Morrissey, University of Chicago, United States

Reviewed by

Charles Cooney, University of Chicago, United States; Jan Rybicki, Jagiellonian University, Poland

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Arthur M. Jacobs, ajacobs@zedat.fu-berlin.de

Specialty section: This article was submitted to Digital Literary Studies, a section of the journal Frontiers in Digital Humanities