Text Mining Early Printed English

Early Print offers a range of tools for the computational exploration and analysis of early English print culture. While EEBO-TCP, a set of transcriptions of English print before 1700, provides access to a massive archive that promises to transform the way scholars approach this period, it is also an archive that presents significant technical and conceptual challenges for analysis. Both the scale and the complexity of the EEBO-TCP corpus mean that we need not only to make it technically tractable to computational approaches, but also to keep in mind, in any such analysis, the material and ideological conditions that underlie the production of this vast body of print. The site was designed to help scholars make sense of the incomparable textual archive produced by the EEBO Text Creation Partnership.

Early Print was conceived by Anupam Basu, Weil Fellow in Digital Humanities at Washington University; he aimed to render EEBO-TCP more susceptible to quantitative historical analysis through intensive reprocessing of the TCP texts and their metadata. The tools and visualizations here do not provide an interface to the full texts, but they offer an aggregate view of the TCP corpus that enables us to probe English lexical and orthographic history in ways that usefully complement the search capabilities of EEBO-TCP and the Oxford English Dictionary; they also help us to see early modern book culture in a new way, as a structured flow of words.

At the core of Early Print lies a vast database of over a hundred million n-grams extracted by year from EEBO-TCP. N-grams are short sequences of contiguous words extracted from text. They are central to many large-scale computational and corpus-linguistic approaches to language, and we hope that the initial set of tools provided here will provoke exploration and lead to involved research questions and collaborations. Our first set of visualizations affords a fresh view of EEBO-TCP as a collection of publications. The first, a graph of TCP books per year, includes as reference the number of publications recorded in the English Short-Title Catalogue (an updated version of the Pollard & Redgrave and Wing Short-title catalogues, on which the EEBO microfilm collection was based), allowing us to get a sense both of changes in English print publication over time and of the relative size and comparability of the EEBO-TCP sample. The second is a scatterplot showing each publication by word count over time, producing, among other patterns, a striking visualization of the sharp increase in pamphlet production in the 1640s. The tool most likely to claim an afternoon’s attention is the n-gram browser, which provides an interface for viewing and comparing the changing frequencies of words and word forms over time.
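For readers new to n-grams, the idea of counting them by year can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how the Early Print database was actually built: the function names, the naive whitespace tokenization, and the sample lines are all invented for the example, and real EEBO-TCP texts would first need the kind of orthographic reprocessing described above.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the contiguous n-word sequences in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def counts_by_year(texts, n):
    """Map each year to a Counter of n-gram occurrences.

    `texts` is an iterable of (year, text) pairs. Tokenization is a naive
    lowercase whitespace split, an assumption made purely for brevity.
    """
    by_year = defaultdict(Counter)
    for year, text in texts:
        tokens = text.lower().split()
        by_year[year].update(ngrams(tokens, n))
    return by_year


def relative_frequency(by_year, gram, year):
    """Share of a year's n-grams accounted for by `gram` (0.0 if none)."""
    counts = by_year.get(year, Counter())
    total = sum(counts.values())
    return counts[gram] / total if total else 0.0


# Invented sample lines for illustration; not taken from EEBO-TCP.
sample = [
    (1590, "the faerie queene and the gentle knight"),
    (1590, "the gentle knight was pricking on the plaine"),
    (1642, "the parliament and the king"),
]

bigrams = counts_by_year(sample, 2)
print(bigrams[1590][("the", "gentle")])
print(relative_frequency(bigrams, ("the", "king"), 1642))
```

Dividing each count by the year's total, as `relative_frequency` does, is one simple way to keep years with many surviving texts from swamping years with few when frequencies are compared over time.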

Early Print began as a means to an end. In pursuit of a study of the discursive construction of urban criminality, a study especially focused on the spatial imaginary of crime in early modern London, Dr. Basu wished to bring the resources of named entity recognition (NER) to EEBO-TCP, but neither the disambiguation routines built into EEBO searching nor the current state of the art in NER was up to the challenge of early modern orthography. As part of analytic work on an edition of the Collected Works of Edmund Spenser, Joseph Loewenstein hoped to confer greater precision on the long-standing general assessment of the archaism of Spenser’s language, and asked Basu how we might measure the orthographic and lexical temporality, the conservatism and innovation, of any given work or corpus against the larger tendencies of the print record. Early Print was devised as a response to these research problems.

We intend to make this site and its underlying databases responsive to others’ research needs. The portal itself can serve only as a glimpse into the potential uses of the database. We hope that the kind of exploration it allows will help develop further, more involved research questions that require more sustained engagement with, and analysis of, the database and the texts that lie behind it. In this sense Early Print is not so much a complete project as it is a provocation – a new, unfamiliar scale and perspective on a familiar body of texts – one that we hope will encourage scholars to come forward with their own projects and to think about ways of collaborating and of integrating corpus-scale analysis seamlessly into the ways we already read and think about early modern print culture.

The site is supported by the Humanities Digital Workshop at Washington University. Dr. Basu was assisted by Steve Pentecost, with advice from Douglas Knox and Joseph Loewenstein.

EEBO N-gram Browser

EEBO-TCP and ESTC Text Counts

EEBO-TCP Words Per Year

Recent Posts

Introduction to the N-gram Browser

By Anupam Basu

Using the N-gram Browser

By Anupam Basu

How to use the EEBO-TCP N-gram Browser