corpkit: a tool for investigating text

Overview

corpkit is a tool for doing corpus linguistics.

It does a lot of the usual things, like parsing, concordancing and keywording, but also extends their potential significantly: you can concordance by searching for combinations of lexical and grammatical features, and can do keywording of lemmas, of subcorpora compared to corpora, or of words in certain positions within clauses.
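As a rough illustration of what keywording a subcorpus against a reference involves, here is a minimal sketch using the log-likelihood statistic that is common in corpus linguistics. It is not corpkit's own code or API: the function names, the toy counts, and the choice of log-likelihood as the keyness measure are all assumptions made for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Signed log-likelihood keyness (hypothetical helper, not corpkit's API).

    a: frequency of the word in the target (e.g. one subcorpus)
    b: frequency of the word in the reference
    c: total tokens in the target
    d: total tokens in the reference
    """
    # Expected frequencies if the word were spread evenly over both.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    ll *= 2
    # Negative score: the word is under-used in the target.
    return ll if a / c >= b / d else -ll


def keywords(target, reference):
    """Rank every word by keyness, most over-used in the target first."""
    c = sum(target.values())
    d = sum(reference.values())
    scores = {
        word: log_likelihood(target[word], reference[word], c, d)
        for word in set(target) | set(reference)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy lemma counts, invented for illustration only.
subcorpus = Counter({"risk": 8, "the": 10, "danger": 2})
rest_of_corpus = Counter({"risk": 2, "the": 30, "danger": 8})

for word, score in keywords(subcorpus, rest_of_corpus):
    print(f"{word:10s} {score:8.2f}")
```

The same function works for lemmas or for words restricted to a clause position, since it only needs two frequency tables; what differs is how the counts are collected.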

Corpus interrogations can be quickly edited and visualised in complex ways, or saved and loaded within projects, or exported to formats that can be handled by other tools.
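The kind of editing and exporting described above can be sketched in a few lines. This example turns absolute counts per subcorpus into relative frequencies, skips a chosen subcorpus, and writes the result as CSV for use in other tools. It uses only the Python standard library and does not reproduce corpkit's interface; the function names and the sample data are hypothetical.

```python
import csv
import io


def relative_frequencies(counts, skip=()):
    """Convert {subcorpus: {word: n}} to percentages of each subcorpus total.

    Subcorpora named in `skip` are left out of the result.
    """
    result = {}
    for name, freqs in counts.items():
        if name in skip:
            continue
        total = sum(freqs.values())
        result[name] = (
            {word: 100.0 * n / total for word, n in freqs.items()} if total else {}
        )
    return result


def to_csv(table):
    """Serialise {subcorpus: {word: value}} as CSV text, one row per subcorpus."""
    words = sorted({word for row in table.values() for word in row})
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["subcorpus"] + words)
    for name in sorted(table):
        # Words absent from a subcorpus are written as 0.0.
        writer.writerow([name] + [table[name].get(word, 0.0) for word in words])
    return buffer.getvalue()


# Invented counts, with subcorpora named by year.
counts = {
    "1990": {"risk": 2, "danger": 2},
    "1991": {"risk": 3, "danger": 1},
    "1992": {"risk": 1},
}

relative = relative_frequencies(counts, skip=["1992"])
print(to_csv(relative))
```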

corpkit accomplishes all of this by leveraging a number of sophisticated programming libraries, including pandas, matplotlib, scipy, Tkinter, tkintertable and Stanford CoreNLP.

Screenshots

Interrogating: Searching a corpus using constituency parses
Editing: Making relative frequencies, skipping subcorpora
Visualising: Visualising results as a line graph
Concordancing: Concordancing with constituency queries, manually coding results
Defining wordlists: Defining pre-installed and custom wordlists
Building, viewing trees: Building corpora, viewing parse tree output

Example figures: Risk Semantics project

Interrogating: Changing frequencies of risk processes
Editing: Nominalisation of risk
Risk and power: How often do certain social actors do risking?
Modals: Modal auxiliaries in the NYT
Sayers, increasing: Sayers in verbal processes, sorted by increasing frequency
Modal ocean: An ocean of modals in the NYT
Keyness: Using keywording with a list of politicians' names and no external reference corpus
To put at risk: Using subplots to demonstrate the rise of "to put at risk" in U.S. news

Key features

The main difference from other tools is that corpkit is designed to look at combinations of lexical and grammatical features in structured corpora. You can easily count or concordance the subjects of passive clauses, or the verbal groups that occur when a participant is pronominal. Furthermore, you can do this for every subcorpus in your dataset in turn, in order to understand how language might be similar or different across the different parts of your dataset.
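To make the idea of combining lexical and grammatical features concrete, the sketch below counts the subjects of passive clauses in each subcorpus of a tiny hand-built dataset. It assumes the text has already been parsed and that each token carries a dependency label in the style of Stanford typed dependencies (for example nsubjpass for a passive subject). The Token structure, the function name, and the data are hypothetical and do not reflect corpkit's internal representation.

```python
from collections import Counter, namedtuple

# Hypothetical token record: surface form, lemma, POS tag, dependency label.
Token = namedtuple("Token", "word lemma pos function")


def count_by_subcorpus(corpus, match, show=lambda token: token.lemma):
    """Count matching tokens separately for every subcorpus.

    corpus: {subcorpus name: list of sentences, each a list of Token}
    match:  predicate choosing which tokens to count
    show:   what to record for each match (the lemma by default)
    """
    return {
        name: Counter(
            show(token)
            for sentence in sentences
            for token in sentence
            if match(token)
        )
        for name, sentences in corpus.items()
    }


# Invented, pre-parsed sentences grouped into two subcorpora.
corpus = {
    "2001": [
        [Token("Risks", "risk", "NNS", "nsubjpass"),
         Token("were", "be", "VBD", "auxpass"),
         Token("taken", "take", "VBN", "root")],
    ],
    "2002": [
        [Token("They", "they", "PRP", "nsubj"),
         Token("took", "take", "VBD", "root"),
         Token("risks", "risk", "NNS", "dobj")],
        [Token("Lives", "life", "NNS", "nsubjpass"),
         Token("were", "be", "VBD", "auxpass"),
         Token("risked", "risk", "VBN", "root")],
    ],
}

# Grammatical feature only: every subject of a passive clause.
passive_subjects = count_by_subcorpus(corpus, lambda t: t.function == "nsubjpass")

# Lexical and grammatical features together: "risk" as a passive subject.
risk_as_passive_subject = count_by_subcorpus(
    corpus, lambda t: t.lemma == "risk" and t.function == "nsubjpass"
)

print(passive_subjects)
print(risk_as_passive_subject)
```

Keeping the match as a plain predicate means the same loop serves any combination of features, and the per-subcorpus dictionaries can be compared directly to see how usage differs across the dataset.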

Also unique to corpkit are:

The final key difference between corpkit and most current corpus linguistic software (AntConc, WMatrix, Sketch Engine, UAM Corpus Tool, Wordsmith Tools, etc.) is that corpkit is free and open-source, hackable, and provides both graphical and command-line interfaces, so that it may be useful for geek and non-geek alike.

Download

To download the most recent OSX version, use the link in the menu bar, or just click here. See the Setup page for (very simple) installation instructions.

Linux users can run the graphical interface by installing corpkit with pip install corpkit and then opening the GUI with python -m corpkit.gui.

Windows users will first need to install a Python interpreter and pip, and can then run pip install corpkit followed by python -m corpkit.gui.

Cite

If you want to cite corpkit, please use:

McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from
https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361