Visualizing literature: artistic statistics

"It is a truth universally NOT acknowledged that" (Burrows, Computation 1), in a set of, say, twenty-five novels by five authors, all one needs to know to sort the books by their authors (that is, if one refuses to use such childish clues as titles and names on covers) are the frequencies of the thirty, fifty, hundred, or at worst a thousand most frequent words of such a corpus.[1] The reason why this truth is universally not acknowledged is that these most frequent words rarely go beyond function words, other "non-semantic" words, or those "semantic" words, such as "man" or "time," which owe their high frequency rank to being part of frequent idioms and set phrases. While cognitive linguists might look with approval on stylometrists who base their study of literature on those "grammatical" words, the traditional literary scholar (were he or she ever persuaded to count words as part of research) would be much more interested in words that "matter": God, country, brother, or love (or various symbolic obscurations thereof). Also, while Zipf's Law tells us that the thirty or fifty most frequent word types usually account for half of the word tokens in a novel (or in any other text), it does nothing to explain why these very frequent words, certainly used in a less deliberate (while, perhaps, highly deterministic) way by writers, should be enough to betray those writers' authorship through style. That is, if function-word choice can be called style. But then what else should it be called, if it so well defines and/or mirrors how (rather than what) writers write?

[1] I have shamelessly stolen my opening sentence from the paraphrase of Jane Austen's most famous opening sentence that opens the seminal monograph of stylometry, John Burrows's Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method.
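The counting step described above can be made concrete with a minimal Python sketch. It only illustrates how one might obtain the most frequent words of a corpus, their relative frequencies in each text, and the share of tokens they cover. The function names (tokenize, mfw_profile, coverage), the simple regular-expression tokenizer, and the toy texts are assumptions introduced for illustration, not a method taken from the text above, and the sketch does not show how such frequency profiles are then compared to group books by author.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def mfw_profile(texts, n_words):
    """Return the n_words most frequent words of the whole corpus and,
    for each text, the relative frequency of each of those words.

    texts: dict mapping a (hypothetical) text name to its raw string.
    """
    tokenized = {name: tokenize(raw) for name, raw in texts.items()}

    # Count word types over the whole corpus to find the most frequent words.
    corpus_counts = Counter()
    for tokens in tokenized.values():
        corpus_counts.update(tokens)
    mfw = [word for word, _ in corpus_counts.most_common(n_words)]

    # Relative frequency of each most frequent word within each text.
    profiles = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for an empty text
        profiles[name] = [counts[word] / total for word in mfw]
    return mfw, profiles


def coverage(tokens, n_words):
    """Share of a text's tokens accounted for by its n_words most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n_words))
    return covered / len(tokens)


if __name__ == "__main__":
    # Toy stand-ins for novels; real use would load full texts from files.
    toy_texts = {
        "text_a": "the cat and the dog",
        "text_b": "the the of",
    }
    words, profiles = mfw_profile(toy_texts, n_words=1)
    print(words, profiles)
    print(coverage(tokenize(toy_texts["text_a"]), n_words=1))
```

On real novels, calling coverage with n_words set to thirty or fifty is one way to check the proportion of tokens that the most frequent word types account for in a given text.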