Historical Documents in a Digital Library: OCR, Metadata, and Crowdsourcing

This was originally written as a paper for Chris Tomer’s graduate class on Digital Libraries at the University of Pittsburgh this past spring. It’s my attempt to articulate some ideas about what makes online historical documents usable — or not usable — for researchers. Comments and criticism are welcome!

Over the past decade, a vast number of historical materials from the past three centuries have been digitized and placed on the Internet. The majority of these have been printed sources: newspapers and books. Some have been digitized as part of a proprietary system (for example, Readex’s Early American Imprints). Others have been made publicly accessible (Google Books, or projects from the Library of Congress under American Memory). The grand hope of all of these was to provide searchable full text online, through the magic of optical character recognition software. Surely, librarians might have thought ten or fifteen years ago, software quality and processing power would improve rapidly, soon permitting quick and accurate reproduction of any text.

The promise of OCR has gone largely unfulfilled. While modern printed sources are easily read, older ones are not. This should lead us to reconsider how we think about these documents — how we categorize them. In a pre-digital world, there is not much difference between the modern newspaper and the eighteenth-century one. Both are opened and easily skimmed, column by column. Contrast that to a manuscript — a letter or diary — which is much harder to read.

But in the digital age, if images and computer-generated text are available over the web, the older newspapers have more in common with manuscripts than they do with newer printed materials. The latter are searchable; the former are not.

For a researcher to use a big digital collection of historical materials profitably, he or she needs to be able to search its contents, to winnow down centuries of text. In other words, he or she needs either quality OCR or quality metadata. A collection that is well OCR’d can get by without robust metadata. But in a collection that is poorly OCR’d, text search will not work, and robust metadata becomes essential for the library to be useful at all.

An example of the latter is the old microprint edition of Early American Imprints. The documents existed only in physical form and thus were not searchable at all. But the makers had created robust metadata, and this meant that when libraries began using digital catalogs, the metadata could be ported into them. Early American Imprints would be searchable along with the rest of the library’s holdings.

Printed materials and manuscript materials should be seen as parts of a larger continuum. Indeed, the major difference is not between manuscript and print, but between modern (post-1950) printed materials on the one hand and pre-modern printed works and manuscripts together on the other. This primary difference is based on the ability to create an accurate OCR text from a high-quality scan of the paper source.

Modern printed materials can be easily transformed into accurate, searchable text. Twentieth-century printing methods produced a clear, precise, and, importantly, regular type. A computer can convert those shapes into text with little trouble.

An example is the work done on the George Washington papers by the Library of Congress. Many of Washington’s letters and papers had been transcribed and printed, in editions published in 1898, 1931–1944, and 1976–1979. Even in 1998–2000, when OCR technology was significantly less powerful than it is today, librarians were able to achieve high accuracy rates: they claim 99.95%, or one error in every 2,000 characters. It is a mark of the technology’s advancement that today 99.98% (sixty percent fewer errors than the George Washington Papers project) is considered the bar for high accuracy.
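The arithmetic behind those figures is easy to verify. A quick sketch in Python, using only the accuracy rates quoted above:

```python
# Accuracy claims expressed as per-character rates.
gw_accuracy = 0.9995      # George Washington Papers project, 1998-2000
modern_accuracy = 0.9998  # today's bar for high accuracy

gw_error = 1 - gw_accuracy           # error rate per character
chars_per_error = 1 / gw_error
print(round(chars_per_error))        # 2000 -> one error per 2,000 characters

modern_error = 1 - modern_accuracy
reduction = (gw_error - modern_error) / gw_error
print(f"{reduction:.0%} fewer errors")  # 60% fewer errors
```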

On the other hand, handwriting recognition is still fraught with problems and errors, even when done with software designed to learn a particular person’s script. As a result, it isn’t currently possible for a computer to transcribe a historical manuscript into searchable text.

The category in the middle, pre-modern printed material, is deceptive. To the human eye, it is very similar to modern print. But to an OCR program, it is much more like handwriting. The deficiencies of current OCR when applied to older printed sources are illustrated by a series of rigorous tests performed on the British Library’s Nineteenth-Century Newspaper Project. That study found a character accuracy of 83.6 percent. Already this is alarming. But the corpus is searched not by character but by word, and word accuracy was significantly worse: 78 percent. Further, proper nouns (the names and places beginning with capital letters, and the words a researcher would be most likely to search for) were recognized only 63.4 percent of the time.
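The gap between character and word accuracy follows from simple probability. Under a naive model in which each character fails independently, a word is correct only if every character in it is correct; the five-letter average word length below is an assumption for illustration, not a figure from the study:

```python
char_accuracy = 0.836   # character accuracy from the British study
avg_word_len = 5        # assumed average word length, for illustration only

# If character errors were independent, a word would survive only when
# all of its characters are read correctly.
naive_word_accuracy = char_accuracy ** avg_word_len
print(f"{naive_word_accuracy:.1%}")  # 40.8%
```

That the study observed 78 percent word accuracy, roughly double the naive estimate, suggests that recognition errors are not spread evenly but cluster in hard regions of the page: smudged lines, broken type, bleed-through.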

Other researchers have found similarly appalling accuracy rates. A 2007 study by the Dutch National Library of the results produced by several OCR contractors found a significant amount of variance: “…the rates respondents gave for newspaper digitisation projects vary from 99.8% for 700,000 newspaper pages (word accuracy, manually corrected) to 68% (character accuracy, no correction) for 350,000 pages of early 20th century newspapers.” Another study, this one by the Australian Newspaper Digitisation Program, found similar variance: “In a sample of 45 pages to be representative of the libraries digitised newspaper collection 1803-1954, we found that raw OCR accuracy varied from 71% to 98.02% (character confidence, no correction).”

Clearly a raw character accuracy of 68 to 71% renders the resulting text useless for searching. Higher rates are more useful, yet even the 98% character accuracy at the high end of the Australian study will result in many missed words. If the British finding that proper nouns are recognized significantly less often than common nouns holds true here as well, then the power of full-text search is hampered even more. In sum, the researchers found that in their corpus of nineteenth-century newspapers, recall was high, precision was relatively low, and fallout was high.
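Those three retrieval measures have standard definitions, sketched here with hypothetical counts (not figures from either study):

```python
def retrieval_measures(relevant_found, relevant_total,
                       irrelevant_found, irrelevant_total):
    """Standard definitions: recall is the share of relevant documents
    retrieved; precision is the share of retrieved documents that are
    relevant; fallout is the share of irrelevant documents retrieved."""
    recall = relevant_found / relevant_total
    precision = relevant_found / (relevant_found + irrelevant_found)
    fallout = irrelevant_found / irrelevant_total
    return recall, precision, fallout

# Hypothetical search over a newspaper corpus: 90 of 100 relevant
# articles come back, but so do 210 false matches from 10,000
# irrelevant articles.
r, p, f = retrieval_measures(90, 100, 210, 10_000)
print(r, round(p, 2), f)  # 0.9 0.3 0.021
```

High recall with low precision is exactly the profile the researchers describe: the search finds most of what you want, buried in a mass of OCR-induced false matches.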

This can be contrasted to a pre-digital form of searching: the index. An example is the comprehensive index to the Virginia Gazette from 1737 to 1790, prepared by historians at Colonial Williamsburg in 1950. The index contains references to proper names (people and places) and to subject terms. (Colonial newspapers were generally filled with anonymous or pseudonymous pieces, so there are no author entries.) An index like this, rigorously compiled and checked, provides a very different profile: very high precision, moderate to good recall depending on the rigor, and low fallout.

#

What the Virginia Gazette index provides, in essence, is metadata. In a pre-digital world, this was the only way of “searching” the corpus. But in a world of digital libraries, such an index would seem unnecessary. And perhaps it would be, were the online text of newspapers acceptably accurate.

When digitizing the eighteenth-century run of the Virginia Gazette, the digital humanities specialists did not even seriously consider putting searchable text online. OCR quickly proved not to work well on the microfilm versions of the newspaper, and the cost of having the text keyed manually was far beyond their budget. Instead they went through the laborious process of scanning and OCRing the index (which, typeset in Courier in the mid-twentieth century, could be read with high accuracy). They then placed the index online in HTML format, with links leading to the scanned images of newspaper pages. In this they were helped by another feature of the print index: it listed not just the issue date but the page and column of each entry.
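The structure they exploited can be modeled simply: each index entry maps a term to citations of issue date, page, and column, which in turn resolve to scanned page images. A minimal sketch, in which the entries and the image-naming scheme are hypothetical:

```python
# Hypothetical index entries: term -> list of (issue_date, page, column),
# mirroring the citation format of the 1950 print index.
index = {
    "Smith, John": [("1766-05-02", 2, 1), ("1767-11-19", 3, 2)],
    "smallpox":    [("1768-03-10", 1, 3)],
}

def image_for(issue_date, page):
    # Hypothetical naming scheme for the scanned page images.
    return f"vg_{issue_date}_p{page}.jpg"

# A search on a term yields the exact page images (and columns) to open.
for date, page, col in index["Smith, John"]:
    print(image_for(date, page), f"column {col}")
```

Because the print index recorded page and column, each hit resolves directly to one image file, which is what makes the tab-back-and-forth workflow described below practical.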

The creators envisioned a workflow that took advantage of the diligent labor of the mid-century index compilers and married it to the speed and convenience of the digital library. When working with the digital Virginia Gazette, a researcher would first search the index web page for a relevant term. Then he or she could tab back and forth between the index and a set of open images, quickly running through a list of results. All in all, the technique was successful; the disadvantage, of course, is that it is not so easy to read a run of consecutive issues, or even consecutive pages.

Yet not every old newspaper was printed in a town that had been bought by the scions of a Gilded Age dynasty. The money that John D. Rockefeller pumped into Colonial Williamsburg for restoration of the colonial city, and the research to make that possible, was not present everywhere. Thus, most eighteenth- and nineteenth-century newspapers do not have detailed, proofread indexes waiting to serve as a gateway to a digital edition. Another source of metadata must be found.

OCR might be one option: it can read article titles with a moderate degree of accuracy, and, if it could pick out proper nouns with any consistency whatsoever, could index those. But given the poor quality of the microfilm from which newspaper scans are made, OCR simply cannot cope with the demands. The amount of cleanup required would mean that librarians might as well just read the articles and index the text themselves. At least that way they could index concepts and make a true subject index, something OCR software, which can only transcribe literally, cannot do.

A workaround, tried by the Australian Newspaper Digitisation Program, is to correct the OCR version of only the headline and the first several lines of text, in hopes that this would catch the most significant aspects of the article. But in practice this still takes a great deal of time. And, from the historical point of view, another problem emerges. The standard journalist’s model of writing in the present day, in which a story begins with a lead paragraph containing a summary and essential details, was simply not part of the eighteenth- or nineteenth-century repertoire. Articles from those periods are just as likely to unfold slowly, like an oration rather than a summary of facts.

To recap: for most old newspapers no preexisting metadata is available, software cannot reliably create that metadata, and the cost in time and money for librarians to create it themselves is unworkable. Yet the poor quality of OCR means that, without metadata, the digital text itself is an imperfect and unreliable reflection of the newspaper’s actual content.

What is to be done? The Australian Newspaper Digitisation Program came up with an innovative solution: crowdsourcing. They made it possible for users to “view, edit, correct and save OCR text on whatever article they happened to be looking at.” Knowing that particular documents had unusually bad OCR, they highlighted those images to encourage patrons to improve them. The crowdsourcing was an instant success. Within three months of the project’s launch, 1,200 individuals had edited “700,000 lines of text within 50,000 articles.” Further, the volunteer correctors were, based on information from the two-thirds who had registered for accounts rather than working anonymously, largely experts in the places and time period covered in the newspapers. This meant they were better able to use context to puzzle out difficult words.

As the project progressed, more and more users began to edit the newspaper text. They also developed elements of a community. Since the interface was very basic and there was no forum area, they took to using the comment mechanism as a way to interact with their fellow correctors. To their infinite credit, the Australian Newspaper Digitisation Program has not tried to exert particular control over user activities. Realizing that they have a good thing going, valuable work being performed by a vibrant community, they have instead stood back and watched that community develop. They found that “having no moderation and being open and free like the internet has raised many questions but has so far resulted in bringing more advantages than issues to the program.”

This is the sort of project over which librarians in the United States seem to drag their feet, unwilling to give up control. The best example of crowdsourced editing of historical newspapers I know of in this country is that put together by University of Virginia history professor Edward Ayers (now president of the University of Richmond). Ayers had his lecture classes, often several hundred students, go through nineteenth-century Virginia newspapers and cull local news to be put into a database by county. This is, of course, a different kind of project: the intent is to produce a refined database rather than improve the primary sources. But, once again, hundreds or thousands of people working on small bits of a project produced usable data far superior to what modern software would have come up with.

#

So in at least two cases crowdsourcing has worked as a way to produce usable, index-ready text from image files and low-quality OCR. Old newspapers are but one source for which this technique has potential. Other printed materials could be made accessible, and beyond print lies manuscript. Historical archives in the United States and elsewhere are notoriously conservative institutions. But it would take relatively little effort and not much more in the way of resources for them to provide the materials that could generate their own online community of researchers. It would be enough to provide a digital library of reasonably decent image files of manuscripts, and a web interface that allowed researchers to transcribe the material for their own use while also saving the transcription for other patrons’ benefit. Allow users to create tags for the material, as the Australian project does, and you also have the beginnings of a robust set of metadata.

Joshua Beatty @joshuabeatty