Text Mining Early Printed English

Early Print offers a range of tools for the computational exploration and analysis of English print culture before 1700. The site was designed to help scholars make sense of the incomparable textual archive produced by the EEBO Text Creation Partnership, a set of transcriptions of the first two centuries of English print. While EEBO-TCP provides access to a massive collection of texts that promises to transform the way scholars approach this period, it also presents significant technical and conceptual challenges. The relative accuracy of the EEBO-TCP corpus, remarkable given its scale, is what makes it such a valuable resource for scholars, but that same fidelity complicates computational analysis. The corpus faithfully reproduces the evolving and irregular orthographic and syntactic conventions of the early modern period, and retains much of the necessarily incomplete and irregular metadata drawn from the original title pages. Any computational approach to the EEBO-TCP corpus, therefore, must not only engage with the digital surrogates of early modern texts but also take into account the material and ideological conditions that underlie this first information revolution.

Early Print was conceived by Anupam Basu, Weil Fellow in Digital Humanities at Washington University; he aimed to render EEBO-TCP more susceptible to quantitative historical analyses by virtue of some intensive reprocessing of the TCP texts and their metadata. The tools and visualizations collected here do not provide an interface to the full texts, but they offer an aggregate view of the corpus that enables us to probe English lexical and orthographic history in ways that usefully complement the search capabilities of EEBO-TCP and the Oxford English Dictionary; they also help us to see early modern book culture in a new way, as a structured flow of words.

Our first set of visualizations affords a fresh view of EEBO-TCP as a collection of publications. The first, a graph of TCP books per year, includes as a reference the number of publications recorded in the English Short-Title Catalogue (an updated version of the Pollard & Redgrave and Wing short-title catalogues, on which the EEBO microfilm collection was based). This allows us to get a sense both of changes in English print publication over time and of the relative size and comparability of the EEBO-TCP sample. The second is a scatterplot showing each publication by word count over time, producing, among other patterns, a striking visualization of the sharp increase in pamphlet production in the 1640s. We will continue to add more interactive visualizations to this section as we explore more of the EEBO-TCP metadata.

The tool most likely to claim an afternoon’s attention is the n-gram browser, which provides an interface for viewing and comparing the changing frequencies of words and word forms over time. At its core lies a vast database of almost a billion n-grams extracted by year from EEBO-TCP. N-grams are short sequences of contiguous words extracted from text. They are the fundamental building blocks of many large-scale computational and corpus-linguistic approaches to language, and it is our hope that the dataset and initial tools provided here will provoke exploration and lead to more involved research questions and collaborations. Although the metadata required for dating these texts is highly irregular and sometimes necessarily ambiguous or incomplete, we extract either exact or approximate dates wherever possible to visualize the development of n-grams over time. The database also allows one to browse regularized spellings, lemmas (the dictionary headword form of a word), and part-of-speech tags generated with MorphAdorner, a suite of natural language processing tools developed at Northwestern University. The n-gram browser links to and is complemented by the keyword-in-context tool, which allows one to search multiple parallel versions of the EEBO-TCP corpus (original spelling, regularized, or lemmatized) with arbitrarily complex regular expressions and to limit those searches by dates or authors.
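To make the idea of n-grams counted by year concrete, here is a minimal sketch in Python. It is not the pipeline behind the Early Print database: the tokenizer is a crude stand-in, the sample texts and dates are invented for illustration, and the function names (`tokenize`, `ngrams`, `count_ngrams_by_year`, `relative_frequency`) are hypothetical. It only shows how contiguous word sequences can be extracted, tallied per year, and turned into the kind of relative frequency an n-gram browser might plot.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes.

    A deliberately crude stand-in for real tokenization.
    """
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every sequence of n contiguous tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams_by_year(docs, n):
    """Tally n-grams per year from (year, text) pairs."""
    counts = defaultdict(Counter)
    for year, text in docs:
        counts[year].update(ngrams(tokenize(text), n))
    return counts


def relative_frequency(counts, gram):
    """Share of each year's n-grams accounted for by `gram`."""
    result = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        result[year] = counter[gram] / total if total else 0.0
    return result


# Invented toy data (assumption): dates and texts are illustrative only.
SAMPLE_DOCS = [
    (1590, "the gentle knight did loue his ladie"),
    (1590, "loue is a gentle thing"),
    (1650, "the knight did love his lady"),
]

unigrams = count_ngrams_by_year(SAMPLE_DOCS, 1)
bigrams = count_ngrams_by_year(SAMPLE_DOCS, 2)
loue_trend = relative_frequency(unigrams, ("loue",))
```

In this toy sample the original spelling "loue" appears only in the texts dated 1590, so `loue_trend` falls to zero for 1650. Counting regularized spellings or lemmas instead of original spellings would merge "loue" and "love" into a single series.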

Early Print began as a means to an end. As part of analytic work on an edition of the Collected Works of Edmund Spenser, Joseph Loewenstein hoped to confer greater precision on the long-standing general assessment of the archaism of Spenser’s language, and asked Basu how we might measure the orthographic and lexical temporality, the conservatism and innovation, of any given work or corpus against the larger tendencies of the print record. In pursuit of a study of the discursive construction of urban criminality, Dr. Basu wished to bring the resources of named entity recognition and topic extraction to EEBO-TCP, but the current state of the art in those fields was not up to the challenge of early modern orthography. Early Print was devised as a response to these research problems.

We intend to make this site and its underlying databases responsive to others’ research needs. The portal itself can only serve as a glimpse into the potential uses of the database. We hope that the kind of exploration it allows will help develop further, more involved research questions that require more sustained engagement with, and analysis of, the database and the texts that lie behind it. In this sense Early Print is not so much a complete project as a provocation: a new, unfamiliar scale and perspective on a familiar body of texts, one that we hope will encourage scholars to come forward with their own projects and think about ways of collaborating and integrating corpus-scale analysis seamlessly into the ways we already read and think about early modern print culture.

The site is supported by the Humanities Digital Workshop at Washington University. Dr. Basu was assisted by Steve Pentecost, with advice from Douglas Knox and Joseph Loewenstein.

EEBO N-gram Browser

EEBO-TCP and ESTC Text Counts

EEBO-TCP Words Per Year

EEBO-TCP Key Words in Context

Recent Posts

Early Modern ‘Culturomics’?

By Anupam Basu

Trends Over Time: Early Modern ‘Culturomics’?

Using the N-gram Browser

By Anupam Basu

How to use the EEBO-TCP N-gram Browser