Text Mining Early Printed English

EEBO-TCP and ESTC Text Counts

Screenshot of visualization

EEBO-TCP Words Per Year

Screenshot of visualization

EEBO and EEBO-TCP: A Brief Introduction

By Joseph Loewenstein

Early English Books Online (EEBO) is one of the great resources for the study of early modern British culture. It aims to provide digital images of one copy each of the surviving books and broadsides printed in the British Isles and British North America between 1473 and 1700 and of the English-language books printed in Europe during the same period. EEBO is a composite. Its scans reproduce four microfilm collections produced by University Microfilms: Early English Books I, 1475-1640; Early English Books II, 1641-1700; the Thomason Tracts Collection; and The Early English Books Tract Supplement Collection. While EEBO’s digital collection reproduces most of the microfilms of Early English Books I (93%) and virtually all of the other three collections, the first two microfilm collections are themselves incomplete. The two collections were meant to represent all the volumes catalogued, respectively, in A. W. Pollard and G. R. Redgrave’s Short-title catalogue of books printed in England, Scotland and Ireland, and of English books printed abroad, 1475-1640 (STC I, 26,500 titles) and Donald G. Wing’s Short-title catalogue of books printed in England, Scotland, Ireland, Wales, and British America, and of English books printed in other countries, 1641-1700 (STC II, 90,000 titles). The microfilming is still ongoing, however, and is projected to take a few more years to complete.

For a more detailed account of Early English Books Online, see the official site.

EEBO is a commercial product, available for institutional purchase or license. A number of subscribing institutions funded the transcription of EEBO texts by a further subscription to the EEBO-Text Creation Partnership (TCP). The first phase of TCP transcription began in 1999 and completed its target of 25,000 transcriptions by 2009; the second phase began in 2009 and was roughly half complete by March of 2014. On January 1, 2015, the transcriptions of TCP Phase I will be made freely available; five years after the completion of Phase II, the Phase II transcriptions will be made similarly available. The Phase I EEBO transcriptions were the first of the TCP undertakings. In 2005, satisfied that they had proved the viability of the model, the board and staff of the Text Creation Partnership decided to approach the publishers of two other major scholarly databases, Eighteenth Century Collections Online (ECCO) and Evans Early American Imprints. Of the 150,000 titles in ECCO, over 2,200 have been transcribed, and these transcriptions are freely available. Of the 40,000 titles in the Evans database, which represent roughly two-thirds of the books, pamphlets, and broadsides printed between 1640 and 1800 in the territory that eventually became the United States, 6,000 have been transcribed. The Evans TCP will be made freely available by the end of June 2014.

For researchers with access to it, EEBO-TCP substantially extends the utility of EEBO, for the TCP texts may be easily quoted and variously searched. The TCP texts have been tagged, so that many of the component parts of the transcribed texts are distinguishable by form (verse, prose, drama) and function (dedication, table of contents). Boolean and proximity searches are possible, and the TCP search tools have been equipped to mitigate the notorious irregularities of early modern spelling. The results of keyword searches are displayed in context and, although the keyword-in-context (KWIC) displays could do with some refinement, it is reasonably easy to move from the KWIC results of a search to the relevant page image. The EEBO site makes available plenty of useful coaching on how to search by means of the “Help?” link on its menu bar, and the TCP site maintains an equally instructive and somewhat friendlier search interface. The works transcribed were selected from more than 125,000 books, broadsides, and manuscripts reproduced (from microfilm) as scans and available by subscription as Early English Books Online (EEBO).
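For readers who have not met a keyword-in-context display, the idea can be sketched in a few lines of Python. This is only an illustration, not the TCP or EEBO search tools themselves: the function names (`tokenize`, `normalize`, `kwic`) and the handful of spelling rules (collapsing u/v and i/j, reading "vv" as "w", dropping a final "-e") are assumptions made for the example, and real spelling regularization is a good deal more elaborate.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token):
    """Crude, hypothetical regularization of a few early modern spellings.

    These rules are assumptions for illustration only; they are not the
    rules used by any actual EEBO-TCP search tool.
    """
    token = token.replace("vv", "w")                   # vvorld -> world
    token = token.replace("v", "u").replace("j", "i")  # treat u/v and i/j alike
    if len(token) > 3 and token.endswith("e"):         # booke -> book
        token = token[:-1]
    return token


def kwic(text, keyword, width=3):
    """Return (left context, matched word, right context) for each hit.

    A token counts as a hit when its normalized form equals the
    normalized keyword; `width` is the number of context words per side.
    """
    tokens = tokenize(text)
    target = normalize(keyword)
    hits = []
    for i, tok in enumerate(tokens):
        if normalize(tok) == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # An invented sample line, not a quotation from any TCP text.
    sample = "For loue is strong as death, and love endureth; they that loven shall knowe it."
    for left, word, right in kwic(sample, "love", width=2):
        print(f"{left:>20} [{word}] {right}")
```

Searching the invented sample for "love" matches both "loue" and "love" because the two spellings normalize to the same form, while "loven" is left unmatched. The design choice worth noting is that matching happens on normalized forms but the display keeps the original spelling, so the reader still sees what the transcription actually says.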

Caveat Explorator:

There are limitations on the data in EEBO-TCP. Since they derive from microfilms of printed books (which often derive from scribal copies of autograph MS), the transcriptions in EEBO-TCP are quite vulnerable to transmissional error.

EEBO aims to reproduce the works represented in the Pollard & Redgrave and Wing catalogues, STC I and II. Since they list only surviving works, these catalogues may distort the profile of actual printed books.

The TCP transcriptions represent only a portion – to date, roughly 40% – of the (still incomplete) microfilm collections represented, as scans, in EEBO. The original selection of material for transcription focused on works now felt to be of literary or historical importance. As the transcription program advanced, this focus relaxed towards greater representativeness, but the TCP is not a random sample of early print.

The microfilm collections from which the transcriptions have been taken include images from single copies only. When a copy is somehow deficient, because of missing pages or poor inking, the protocols for TCP transcription require that the deficiencies go unremedied. And single copies are almost inevitably deficient in another sense: since early modern printing practices allow for the correction of apparent transmissional errors without the destruction of “misprinted” sheets, single copies cannot represent all the states that make up the highly variable flow of printed output.

For a much deeper account of the pitfalls of too-credulous scholarly reliance on EEBO, an account pertinent for the user of EEBO-TCP, see Diana Kichuk, “Metamorphosis: Remediation in Early English Books Online (EEBO),” Literary and Linguistic Computing 22, no. 3 (September 1, 2007): 291–303 and Ian Gadd, “The Use and Misuse of Early English Books Online,” Literature Compass 6, no. 3 (May 1, 2009): 680–692. Kichuk and Gadd make it clear why the TCP should be regarded as a large sample of the output of the early modern press.