OCR, Voyant, Ngram, and Bookworm are all really useful tools for digital textual analysis. I found, however, that the older the text, the harder it is to get a useful result, and the more workarounds you need. For this blog post I played around with a couple of different texts, the oldest of which was Gervase Markham’s Country Contentment, or, The English Huswife, published in London in 1623. The Google Books copy is from the British Museum. The first thing I noticed, before I even got into any textual analysis, was how different it is to see this book online rather than in person. I only got the most basic sense of how large it is, what the cover is made of, and whether or not it had been re-bound. While those things are less relevant to the sort of textual analysis we’re looking at here, they would matter for analysis at a micro-level, if I were interested in the book as an object and not just as a text.
That aside, I started by looking at the OCR. 17th century print is distinctive and very different even from print in the 19th century, so I was expecting the OCR to be dirty. Unsurprisingly, it was.
Once you get used to reading the long s and the generally cramped nature of 17th century print, this edition isn’t terribly difficult to read. There’s some bleed through the paper, and the page layout, with headers in the margins, is a little strange, but the human eye can learn to see past or around those things. The OCR software, however, struggles, especially with the spacing of the text and the italicized headings in the margins. The software returns a text with strings of words smushed together and italicized margin headings interspersed through the body. The OCR also renders the long s as an f and reproduces the u/v substitutions wherever they occur in the text, although that is less detrimental to analysis than the run-together words, since anyone reading a 17th century text expects those quirks. Making the OCR useful is less about correcting spellings to fit a modern standard, and more about training the software to recognize the different page elements (such as the italicized margin notes and the catchwords at the bottom of each page).
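To give a concrete sense of what that kind of cleanup can look like, here is a minimal sketch in Python of a substitution-table approach. The word forms in the table are hypothetical examples I made up for illustration, not drawn from any real training set; a blanket f-to-s replacement is deliberately avoided, since it would corrupt genuine f words, and the script does nothing about the run-together words, which are the harder problem.

```python
import re

# Hand-built table of known early modern / OCR forms -> modern forms.
# These entries are invented examples, not a real correction dictionary.
SUBSTITUTIONS = {
    "feuer": "fever",     # u printed where we would expect v
    "fpoonful": "spoonful",  # long s misread by the OCR as f
}

def normalize(text: str) -> str:
    """Lower-case the text and apply the substitution table word by word."""
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(SUBSTITUTIONS.get(w, w) for w in words)

print(normalize("A good drinke for a feuer"))  # prints "a good drinke for a fever"
```

A dictionary like this only grows as fast as a human can vet entries, which is exactly why training the OCR itself on older fonts and layouts would be the better long-term fix.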
Since this book didn’t have chapters, I took the first fifteen pages or so and plugged them into Voyant. These pages were the introduction and dedication, a cataloging of the virtues of a good English housewife, and the beginning of the section on physic, consisting mostly of cures for various types of fever.
Logically, then, “fever” (“feuer”) and “wife” were two of the most-used words in this section. The other word that stands out is “good.” This was a bit of a surprise; going in, I was looking for data about the frequency of ingredients or ailments. That “good” shows up as often as it does tells me that this author is doing his best not only to provide receipts for fever cures, but also to convince his reader that they ought to use those receipts, particularly since “good” appears in conjunction with another frequently used word, “drinke,” on a fairly regular basis. It could merely be convention; this is where a more in-depth assessment of the book as a whole, and of this book in comparison to other 17th century cookbooks, would be helpful. The frequency is something I might not have noticed by simply reading the text, and tallying the word’s appearances across the whole text certainly couldn’t be done by reading alone.
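Under the hood, the frequency list Voyant produces is essentially a stopword-filtered word count. A rough sketch of that count, using an invented stand-in snippet rather than Markham’s actual pages:

```python
from collections import Counter
import re

# A minimal stopword list; Voyant's real list is much longer.
STOPWORDS = {"a", "the", "of", "and", "to", "for", "in", "it"}

def top_words(text: str, n: int = 3):
    """Return the n most common non-stopword tokens in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Invented sample in the style of the text, purely for illustration.
sample = ("a good drinke for the feuer, and a good wife to make it; "
          "the good huswife knowes the feuer and the drinke")
print(top_words(sample))  # prints [('good', 3), ('drinke', 2), ('feuer', 2)]
```

Even on this toy sample, “good” floats to the top, which is the kind of pattern that is easy to read past on the page but hard to miss in a count.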
Turning to Ngram and Bookworm, I decided to look at three slightly broader (though still topical) terms: sugar, cinnamon, and dessert. I wondered if these tools could help me pinpoint when each term came into use, and how that usage expanded or changed over time. Of course, just because a word shows up doesn’t mean it showed up in cookbooks, or that people actually used the product (or, in the case of dessert, the term).
Sugar was by far the most popular of the three words, which, again, makes sense. Ngram shows a spike in the use of the word sugar around 1620, a spike mirrored in the Bookworm graph. Why this is, I’m not entirely sure, especially since the use of the word drops off again significantly until the middle of the 18th century. The term “dessert” doesn’t become popular until the late 18th century, which makes sense, but when I looked into the books Google had analyzed (in the Ngram) dating to between 1600 and 1800, there were bills from the 1977 California assembly and Jacques Olivier’s 1623 Alphabet de l’imperfection et Malice de Femmes (the earliest book in Google’s set). This points to one of the biggest pitfalls of Ngram and Bookworm: both depend on OCR and on catalog metadata, neither of which is entirely reliable for older texts.
As a public historian, I could see how these tools would be really cool to use in an online exhibit or something along those lines. For the numerically inclined visitor to a site or exhibit, playing around with data about how something has changed over time can be another way to engage with the content. As a historian/digital humanist, particularly one interested in the 17th and 18th centuries, I wish that Google’s OCR were better trained to recognize and deal with older fonts, spacing, and page layouts. Another frustration I encountered as a researcher is that Ngram in particular is designed to search the entire corpus of digitized works. What if I wanted to search only cookbooks, and only those published between 1600 and 1800, but still wanted the data sets and visualizations that Ngram provides? I can search cookbooks on Bookworm, but for the date range I’m interested in (1600–1800) the data set is small, maybe too small to provide any useful information. These are certainly cool tools, and helpful, but I would need to refine my research interests and think hard about whether or not this data is useful to me before I commit to them wholeheartedly.
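For the corpus-restriction problem, one workaround is rolling your own miniature ngram: given a hand-picked set of dated texts (say, digitized cookbooks one trusts), a short script can tally a term’s relative frequency per year. The data below is invented purely for illustration, and a real version would of course inherit all the OCR problems discussed above.

```python
from collections import Counter
import re

def term_frequency_by_year(docs, term):
    """Given (year, text) pairs, return {year: relative frequency of term}."""
    totals = Counter()  # total word count per year
    hits = Counter()    # occurrences of the term per year
    for year, text in docs:
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Invented sample "cookbook" snippets, purely for illustration.
docs = [
    (1623, "take sugar and cinnamon and boile them in faire water"),
    (1685, "a dessert of sugar paste with sugar candied"),
]
print(term_frequency_by_year(docs, "sugar"))  # prints {1623: 0.1, 1685: 0.25}
```

The obvious trade-off is the one already mentioned for Bookworm: a corpus you curate yourself for 1600–1800 will be small, and small data sets make for shaky trend lines.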