Text Mining Early Printed English

Trends Over Time: Early Modern ‘Culturomics’?

Perhaps the most familiar recent use of n-grams is Google’s Ngram Viewer, which lets users explore trends in word and phrase usage over time in the Google Books corpus. Interesting usage patterns are easy to discern and often provide insights into shifting cultural paradigms. Google’s n-gram browser and its underlying data have found a variety of uses in corpus linguistics, including Mark Davies’ work at Brigham Young University and the much-cited article by Michel et al. in Science that is partly credited with popularizing so-called “culturomics,” the quantitative study of cultural trends. 1

Can the ability to explore the language of EEBO-TCP over time from the beginnings of English print to 1700 enable a similar “culturomics” approach for scholars of early modern England? Certainly, there are interesting trends that can be glimpsed and that will hopefully provoke engaging research questions. But, as is the case with so much computational analysis of early English print, our expectations must be tempered with an awareness of the material and cultural conditions within which these texts were produced. The relatively small number of texts surviving from the early decades of print, and the highly irregular and non-standardized spelling across much of the EEBO-TCP corpus constitute significant barriers to building a database that can in any way be compared to the richness, flexibility, and ease of use of the Google n-gram corpus.

A few central points should especially be kept in mind. EEBO-TCP is an ongoing project that is still adding texts, and we will update the database as new texts are released. However, even though we make n-gram data from the entirety of the current version of EEBO-TCP available here, the survival rates for at least the first century of print are too low and too irregular for us to draw anything more than tentative conjectures. The n-gram browser plots relative frequencies for each year: that is, the frequency of a certain n-gram as a proportion of all n-grams from that year. This normalization is necessary so that years with more available texts don’t simply overwhelm other years in terms of absolute counts of n-grams. But another consequence is that in years with a relatively small set of texts, a single text or a handful of texts with high frequencies of certain words can produce a very high relative frequency. For example, if a year has only a few relatively small texts, one of which has several instances of a word, it might cause an anomalous spike that would have been drowned out in a year with a much larger sample of texts. So we should hesitate to generalize about the early years of print without looking closely at the underlying texts and at issues that might show up as trend patterns. The section on exploring the EEBO-TCP and ESTC catalogs on this website discusses issues surrounding the number of texts and rates of survival in more detail.
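
The normalization described above, and the way a sparse year can exaggerate a word, can be illustrated with a minimal Python sketch. It is not the code behind the n-gram browser: the function name `relative_frequencies` and the toy counts are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
from collections import Counter

def relative_frequencies(counts_by_year, ngram):
    """Map each year to (count of ngram) / (count of all n-grams that year)."""
    result = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        # Guard against a year with no n-grams at all.
        result[year] = counts.get(ngram, 0) / total if total else 0.0
    return result

# Hypothetical toy data: a sparse early year and a well-populated later year.
counts_by_year = {
    1500: Counter({"plague": 5, "king": 10, "god": 35}),       # 50 n-grams
    1650: Counter({"plague": 50, "king": 4000, "god": 5950}),  # 10,000 n-grams
}

freqs = relative_frequencies(counts_by_year, "plague")
for year in sorted(freqs):
    print(year, freqs[year])
```

In this invented data the word occurs ten times more often in 1650 in absolute terms, yet its relative frequency is twenty times higher in 1500, because five occurrences make up a tenth of that year's tiny sample. A spike of this kind reflects the size of the sample, not necessarily a trend in usage.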

Early English orthography is highly irregular, and EEBO-TCP’s great value to researchers as a highly accurate transcribed corpus also proves to be the main barrier to computational analysis here, since it faithfully captures all the idiosyncrasies of early modern spelling. Of course, tools and algorithms exist that attempt to standardize spelling automatically, but the sheer variety of spelling in EEBO-TCP makes this an especially difficult computational problem. We present unigrams in original spelling, but use Morphadorner, a suite of tools developed at Northwestern University, to standardize spellings for a parallel n-gram corpus that is also part-of-speech tagged. While Morphadorner has continued to mature as a tool, there are always interesting instances that it cannot recognize or correct. Our database allows users to search with regular expressions; used judiciously, these permit broad searches that capture variants Morphadorner might have failed to standardize or POS tag correctly. Ultimately, however, one should take every care to go back to the texts and verify interesting or anomalous trends, rather than using the n-gram browser as a black-box tool that simply reveals trends in culture through simple queries.
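
As a rough illustration of how a regular expression can gather spelling variants, the following Python sketch matches several plausible early modern spellings of "kingdom". The token list and the pattern are invented for this example, and the sketch uses Python's `re` syntax; the regular expression dialect accepted by the database may differ.

```python
import re

# Hypothetical pattern: "i" or "y" in the first syllable, with an optional
# final "e", "es", or "s". IGNORECASE also catches capitalized forms.
pattern = re.compile(r"k[iy]ngdom(e|es|s)?", re.IGNORECASE)

# Invented sample of unigrams in original spelling.
tokens = ["kingdom", "kyngdome", "Kingdomes", "kingdoms", "king", "kingdomless"]

# fullmatch requires the whole token to fit the pattern, so longer words
# that merely contain "kingdom" are excluded.
matches = [t for t in tokens if pattern.fullmatch(t)]
print(matches)
```

A pattern like this trades precision for recall: widening it catches more variants but also risks pulling in unrelated words, which is one more reason to check the matched forms against the underlying texts.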

In spite of these significant caveats, it is our hope that these databases will allow scholars to explore early English print from a unique perspective and to raise interesting research questions about emerging orthographic conventions as well as larger cultural trends.

1 Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, no. 6014 (January 14, 2011): 176–182.

Anupam Basu