Joshua Beatty - Writing

Maple sugar and Marx's Grundrisse

During this spring of social distancing I sat at home and read Karl Marx’s Grundrisse. The Grundrisse is a working-out of the ideas that would eventually make up the three published volumes of Capital. But Marx died before he could put much of the material in the Grundrisse into a more finished form, so it remains maybe the best overview of Marx’s total vision of how capitalism functions.

Also this spring, the geographer David Harvey taught a course on the Grundrisse at CUNY and posted his lectures to YouTube. At one point he asked his students for feedback — what were they getting from the text and lectures? In appreciation of the work he’d put in, I also sent him an email.

The email’s as good a summary of where my research has been taking me as anything I’ve written, so I’m posting it to my own blog as a way to mark my progress.

I’m a librarian by trade, but my training and research interests are in history, and I’m currently researching the history of maple sugar production in the early American republic. Specifically, I’m looking at a set of three attempts, between 1790 and 1795, to produce maple sugar on a large enough scale that it could replace cane sugar imported from the Caribbean.

One of the key insights that I’ve picked up from the Grundrisse is that of viewing the circuits of capital as a totality. I’m finding that thinking about both how sugar was produced and how it was distributed and consumed is, well, productive.

Cane sugar was produced on large plantations with an almost industrial character — a great deal of investment in the form of fixed capital, but still reliant on enslaved laborers and a focus on a single staple product for export. After harvest and an initial refining process in the Caribbean, cane sugar was then shipped to the United States (or Europe), where it had to be refined again both to fix damage from travel, and to make it the purer, whiter grade desired by genteel consumers.

In contrast, maple sugar was produced in the northeast US and in Canada on small family farms, with little investment needed, free labor (whether family members only or occasional wage labor), and as an adjunct to the larger farm production, with most of the sugar kept for personal use and some being sold at local markets. Maple sugar was refined to a level similar to a Caribbean brownish muscovado but not further.

What the investors in large-scale maple sugar wanted to do was usurp cane’s place within the totality of production, distribution, and consumption, by mass-producing maple sugar in a quality similar to what was shipped from the Caribbean, transporting it to urban seaports, then having existing refineries transform it into a high-grade white sugar.

There were three major attempts to harvest maple sugar on a large scale between 1790 and 1794. The proprietors were William Cooper, in Cooperstown, NY; Henry Drinker, at Union Farm on the Susquehanna in southeast PA; and Gerrit Boon, near Utica, NY. These were independent but not unrelated attempts; Drinker was Cooper’s biggest investor, while Boon visited both Cooper and Drinker’s operations.

All three failed. What’s interesting to me, and what the Grundrisse gives me a lens for understanding, are the ways in which they failed, and how the initial vision for the projects gradually retreated.

Initially, as conceived by Cooper, Drinker, Benjamin Rush, and a circle of Philadelphia Quakers who invested in both Cooper and Drinker’s attempts, the maple sugar industry was intended to strike a deathblow to Caribbean sugar plantation slavery. Farmers and their families, settled in the forests of Pennsylvania and New York, would produce raw sugar equal in quality and quantity to that of the sugar islands. They would do so using simple tools rather than industrial facilities, and laboring for only a few weeks of the agricultural year.

But Cooper’s first attempt, in 1790, produced only relatively small amounts of a poor-quality sugar that the Philadelphia refiner Edward Penington had difficulty refining and selling. Cooper and Drinker had hoped to gain the endorsement of the country’s elite. But Thomas Jefferson and James Madison, having been sent samples of maple sugar, expressed support publicly but in private denigrated the quality and refused to serve it at their tables.

The next year Cooper tried to increase the quality of his maple sugar by building a refinery in the woods, based on descriptions of refineries at Caribbean plantations. We don’t know exactly what went wrong, only that he shipped a very small amount of sugar out of Cooperstown in 1791 and closed up his operation soon after.

Drinker’s operation at Union Farm attempted to produce maple sugar for several years but, like Cooper’s, never seems to have produced enough to create a return for its investors.

Boon’s also failed, but in a more interesting way. Boon seems to have had no interest in antislavery; he was a Dutchman employed by a sugar refining and mercantile house in Rotterdam, had experience on Caribbean sugar plantations, and owned slaves at his central New York home. He was simply looking to use his expertise to make a killing for his firm. His attempt was based around a significant investment in fixed capital: creating a system of troughs and gutters to collect the sap and draw it by gravity to the refinery. Modern maple syrup farmers can pull that off with a system of plastic tubing and a vacuum pump, but Boon’s handcrafted wooden troughs warped and leaked. The experiment was deemed a failure and Boon’s firm instead invested in the burgeoning land speculation market.

I’ve been thinking primarily about production, and a little about consumption, but before watching your lectures and reading the Grundrisse, I hadn’t considered the other circuits that flow within the totality of cane sugar/maple sugar production. Credit is one: from where do Cooper, Drinker, and Boon get their initial investment, and how does that affect the progress of their attempts? And I don’t know as much as I’d like about the farmers or laborers who did the actual work of harvesting and refining maple sugar.

→ 1:26 PM, Jun 8

Coronavirus, day 12: the weekend, running, media consumption

It’s surprising how little there is to write about from this weekend. I ran both mornings, a long run on Saturday and an easier one today, and saw almost nobody else either time. I’d expect that going around the old Air Force base as I did Saturday, but today, when I went out a little later and around some local trails, it was more surprising.

Inside, it’s just a round of reading (William Gibson, Agency), video games (Mario Kart, Animal Crossing) and YouTube (Bon Appétit, David Harvey’s lecture series on the Grundrisse). I’m reading along with the latter but have fallen about two weeks behind; I’d like to catch up before the end of the course.

Tomorrow spring break ends and classes return, and I have no idea what will happen.

→ 7:38 PM, Mar 22

Coronavirus, day 10: Animal Crossing

It’s become clear that we will all be socially distancing for some time to come. Since my immune system isn’t yet back to full strength I’ve been staying in the house for the past week. As of Sunday New York is instituting a ban on all public gatherings, no matter how many people, and closing all non-essential businesses.

So…I bought a Nintendo Switch yesterday. I’d been thinking about it for the past couple of years. With Animal Crossing: New Horizons being released today, the time seemed right. I’m a slow adopter on video games. I’ve never had an Animal Crossing title the same year it was released. So I’m excited to discover how the game unfolds at the same time as everyone else.

The library finally shut its doors entirely as of Thursday afternoon. The college has asked students not to return to the dorms — previously we’d expected about 700 to come back. Virtual classes start up on Monday. The library can’t support the students and faculty because the bureaucracy has blocked the implementation of the chat/reference software we purchased until they review the contract. Our IT team has been very good about setting things up so we can roll it out on short notice. But we don’t know when that will be.

→ 5:49 PM, Mar 20

Coronavirus day 8: 18 March 2020

Of course the coronavirus crisis is more than eight days old, but the day it began to change my life was last Wednesday, when the college announced the rest of the semester would be taught online. This week we — the library and IT — are continuing to set up support for the faculty and students, on very short notice. I’m awfully impressed by the amount the IT side has been able to get done, like setting up a virtual computer lab with 200 seats. And they, and our student workers, are all working in the building under what can’t be pleasant conditions. Tomorrow the VPs will decide if we can close the library; our leadership is pleading with them to do so.

→ 7:16 PM, Mar 18

Coronavirus: small routines in the midst of chaos

An odd day, even by the current standard. A flurry of emails received from different levels of college administration all morning until just before 10 AM, then — nothing. The clerks and secretaries’ union got them permission to go home, which is good. But as long as the library stays open that means someone will have to supervise the student workers, and I’ve seen no evidence the library will close.

Meanwhile, I’m working from home, setting up our systems to accommodate reference chat and virtual appointments. Morning runs, afternoon walks with Peanut, and a beer precisely at 5PM to mark the end of the day.

→ 6:04 PM, Mar 17

The coronavirus, online classes, and physical infrastructure

Back to work today, from the comfy chair in my home office. We began putting together the systems we’ll need to give students online access to reference services when they return from break. I like the system we’re using (LibAnswers, from Springshare). There are collaborative tools embedded that will allow the librarians to work together on reference questions in a way we rarely do. The real question is will students have reason to ask questions of us, and even if so will they actually ask those questions? The first depends on how the teaching faculty teach over the next two months, and as for the second the best we can do is figure out how to make it easy to engage with us, and to follow up quickly on questions.

But to take or teach a virtual class requires the right hardware and software, and many people need that set up by our IT side. Faculty have been instructed to move courses online and don’t have the equipment or knowledge to do so. They’re actually coming to the building more now than at any time during the semester because they need in-person help. Schools in the area have shut down and other colleges have sent their students back home. So there are more people than ever looking for a public space to work in, and we are the only one left. Plus we are going to have an estimated 700 students returning after spring break, some presumably from places with community transmission of Covid-19. So, when we are supposed to be spreading out and away from each other, the college and library will become more densely populated. I do not think that the system or the college will look good when this is over.

→ 6:42 PM, Mar 16

Preparing for social distancing

This morning I went to the library to grab the things I’ll need at home over the next few weeks. The library was closed — it always is on Sunday mornings, and with spring break this week it won’t open until Monday. It was quiet and dark and I took the chance to crank Deafheaven while I was packing up my things.

The county has closed all the schools, the local public library is shutting its doors, and the city-owned gym is going offline. That leaves our academic library as the only large public space that will still be open for at least twenty miles in any direction. I’m worried about my colleagues, and especially the staff and students who will be working at the service desks. The state needs to end this pretense of keeping the SUNY colleges open through the pandemic.

→ 8:28 PM, Mar 15

Holding our breath until classes resume

I worked from home today, so if there was any excitement on campus I missed it. But I doubt there was. Almost no email came across the wire. I suspect everyone is just holding their breath, waiting for a week and a half from now when spring break ends and the shift to online instruction begins.

It looks like I’ll have more days working from home. I talked to the nurse today, who thought that since my blood counts are still bouncing back post-chemotherapy I should stay away from the library and from crowds. And the dean had told me I could work from home when the crisis was starting. I think it will be good for my mental health as well. When I was at the library this week I mostly just felt helpless.

→ 7:52 PM, Mar 13

The library's role under coronavirus

The college is open today, and classes are in session, but it seems strange: why can’t we cancel classes for two days before spring break, before we move them all online? Keep offices open to support students, sure, but I can’t imagine holding classes did any good.

Our library is staying open through spring break, and will stay open after break ends, though likely on reduced hours. At most colleges that are moving instruction online, the library is staying open. There are two schools of thought on this. One is that the library provides essential services and has a duty to the community to maintain those services. The other is that by staying open the library becomes another potential site of community transmission. Nor should librarians and other employees have to place themselves in danger.

The former is the traditional view. But more and more, librarians are shifting to the latter position, in defense of both public and personal health.

Update: Timothy Burke has an excellent Twitter thread detailing the logic behind “close the libraries,” but from the teaching faculty perspective.

→ 7:28 PM, Mar 12

Coronavirus and the college

Today the governor announced that all SUNYs would be moving to online instruction for the rest of the semester. This wasn’t unexpected, but we’re underfunded and short-staffed and there’s no way we could be prepared.

The college isn’t closing. Students with nowhere else to go will stay in the dorms. Some lab classes and internships and experiential learning setups will keep going. And the library and computer labs are going to stay open. Faculty and commuter students are happy about the last, because we’re in a digital desert. Broadband is scarce, and cell networks erratic, once you drive a couple of miles out of the city. Ellen’s museum, fifteen miles north, gets 5 Mbps download speeds — and that’s on a good day.

I hope our faculty recognize this, and don’t try to replace in-class meetings with synchronous video sessions students won’t have the bandwidth to watch even if their hardware is good. But we’ve signed a contract with Zoom, and IT is pushing that service pretty hard. From the library side, most of our services have an online component anyway — an email account for reference service, online appointment booking, subject and citation guides, and of course our discovery system and databases. We’re adding a chat service, something we’d ditched a few years back due to lack of use. And we can use Zoom or Google Hangouts for meetings with students. Most of our classroom instruction is finished by this point of the semester so that’s good.

Ellen and I had planned a long weekend vacation in Québec City starting this Friday. That’s off now — even if we felt comfortable traveling, I need to stay here to help get the new library systems up and running quickly, and to make sure everyone is trained to use them.

→ 8:08 PM, Mar 11

Teaching history from the local

On April 6, 2019, I delivered a version of the following as the keynote speech for the SUNY Plattsburgh Department of History’s annual awards luncheon and honor society induction ceremony. I am the history librarian and occasionally teach as an adjunct for the department. With this talk I wanted to draw on my own experiences to inspire especially the students becoming teachers to embrace local resources as an excellent way to get their students engaged with history.

First, thanks to the History department faculty for inviting me to speak. And second, congratulations to all of the students being inducted into Phi Alpha Theta today. I work with a number of different departments on campus and I know how challenging your program is, and the level of work you have all done to excel.

Many of you, I know, are going to become teachers, in either primary or secondary school. Some of you are going to go into museum, library, or archival work. And perhaps some of you are going to follow your professors’ path and attend graduate school in history. In all these fields you will be educators. Even if you’re going into a different career, you’re still citizens, and you’ve been historically trained — you may find yourselves needing or wanting to educate your community about history.

This talk is about teaching history in all these aspects. I’ve titled it “Teaching History from the Local” and not “Teaching Local History” — you’ll see the difference.

I want to begin with my own, somewhat unusual, path into history teaching, because I think it will illuminate some of the other things I have to say.

My undergraduate degree was in anthropology, not history. I worked for a number of years as an archaeologist, mostly for Colonial Williamsburg. If you haven’t been there, Colonial Williamsburg is a large open-air living history museum in Tidewater Virginia. They have a research arm that does a great deal of historical, archaeological, and architectural research. So with CW I excavated historic sites — notably a seventeenth-century plantation house, a colonial theater, and a colonial coffeehouse.

Eventually I decided to pursue a doctorate in history and chose William and Mary because of its close links to Colonial Williamsburg and their history department’s willingness to accept students from material culture and archaeological backgrounds.

For eight summers in graduate school I taught in a program for high school students that my advisor ran. It was called the National Institute for American History and Democracy, or NIAHD for short. The course was for college credit and ran for three weeks. Every day we would take the students to a different historic site. Often, because of Prof. Whittenburg’s connections, we would get a behind-the-scenes tour. At the same time, the students would read a serious scholarly article dealing directly with the history of the site we were visiting.

When we visited Bacon’s Castle, a seventeenth-century brick house in Surry County, Virginia, they read an article by T. H. Breen about the origins of Bacon’s Rebellion in labor practices and race and class divisions. When we visited Colonial Williamsburg, where every day there were living-history performances of aspects of the American Revolution, they might read an article by Woody Holton on enslaved persons’ contribution to triggering the war in Virginia.

I found that the site visits and the readings reinforced each other, not because they said the same things, but because they said different things about the same topics.

As a broad generalization, historic sites tell a sort of Whig history, emphasizing the march of progress; we chose readings that cut across that with a social-history perspective. The combination really clicked with the students; by giving them two very different interpretations of the same people, place, or event it allowed them to begin contrasting those and, eventually, judging between them.

I’m not positing a dichotomy here between material culture as non-scholarly and historical reading as scholarly. As my advisor told me when I was first moving from archaeology to history, “Archaeology is too important to be left to the archaeologists.”

This program worked tremendously well at helping students engage with history. As I began teaching here at Plattsburgh, I thought back to that experience to try to understand why NIAHD worked, and how I could incorporate that into the American Revolution course I taught last year.

First, there are an amazing number of historic sites close by Williamsburg. Monticello is two hours away; Mount Vernon two and a half. There are a number of Civil War sites within ninety minutes: Fredericksburg, Richmond, Petersburg, Fort Monroe. And within twenty minutes there’s Colonial Williamsburg, Jamestown, and Yorktown.

Second, there is a massive amount of excellent scholarship published on Virginia in the seventeenth through nineteenth centuries. And this scholarship has been written over decades or centuries, incorporating many different historiographical perspectives examining the same people, places, and events.

In trying to replicate aspects of NIAHD with my Revolution class I found the latter to be the biggest problem. There is simply not enough scholarly history on Plattsburgh and the North Country, or for that matter on most places, to replicate what can be done with Tidewater Virginia.

There is history, but it’s not scholarly. There’s scholarship relating to the region, but we only get a passing mention, if that. And even when there are single works, they’re not part of a larger historiography. I’m thinking here of Philip G. White’s Beekmantown, an excellent quantitative social history study of Beekmantown and Plattsburgh. There are no other works in conversation with it.

We do, however, have a surprising number of historic sites and museums. Ticonderoga is only ninety minutes away, for example. And we have quite a few smaller museums here in Clinton County. But they aren’t as well funded, and not as deeply studied or interpreted, as those in Tidewater Virginia.

What I concluded as I was teaching my Revolution course was that students needed to read, in depth, local primary sources, and then interpret those in the light of secondary sources that dealt with the same topics in regions similar to ours.

In that course, students read, transcribed, and shared among themselves a series of letters from the 1810s between General Benjamin Mooers of Plattsburgh, commander of the New York militia in the War of 1812, and his wife Elizabeth. These are all held in the Kent-Delord collection in SUNY Plattsburgh’s Special Collections. We discovered that these letters have rarely been cited. Those who have used them were primarily interested in military history, and therefore in Benjamin’s letters rather than Elizabeth’s.

But, we found, there’s an absolute wealth of social and cultural history that can be mined from these letters. Some students wrote about the development of Gen. Mooers’s militia. Others focused on Benjamin and Elizabeth’s extensive discussions of travel and communication between Plattsburgh and Albany, where Benjamin frequently traveled after the war as a state legislator. Of course, this is the era when the Erie Canal, and the attendant transformation of travel and commerce in New York, was becoming more than just a gleam in a merchant’s eye.

Two other topics in particular revealed aspects of local history that are themselves worthy of further study. First, some students noticed that Elizabeth wrote a great deal about religious matters. These included an acrimonious split in her own congregation. This brought to mind the tensions of the Second Great Awakening, of which so much has been written, especially dealing with Western New York’s Burned-Over District of a decade later. Were the struggles here in Plattsburgh an early example of the same tensions there?

Second, a couple of students noted the letters in early 1816 commented on the weather. It was unusually cold and dark. There was a haze in the air, and plants bloomed late if at all. The Mooers family couldn’t have known, but they were experiencing the effects of the explosion of Mount Tambora in modern-day Indonesia, resulting in a volcanic winter and what became known as the “Year Without a Summer.” There are plenty of primary-source accounts of this unusual weather in the Northern United States, but Elizabeth noted a specific historical effect of the agricultural failures it caused: more Canadians were coming southwards across the border, looking for work, than ever before.

To return to your roles as future teachers: The point of all this isn’t to say that you and your students should all study Plattsburgh — though we would be delighted if you did — but that these neglected histories are out there, not just in Plattsburgh, but everywhere. And the resources to study them exist everywhere those histories do. But you’ll need to seek out those resources.

For secondary sources, you will unfortunately lose access to the college’s databases after you graduate. But your local public libraries often have subscriptions to these databases or can get materials through interlibrary loan. And the New York Public Library offers free library cards to state residents, and those cards allow you online access to many historical databases, including a number we don’t have access to via the college.

As for primary sources, there are understudied collections and underutilized resources everywhere. Every county in the state has a county historian. Many towns and cities have the same. These are great people to talk to — they often combine knowledge of the area with an academic history degree. There are local academic libraries like ours with archives. Public libraries often have their own archival collections. Local historical societies and museums often have resources of their own. In many cases, the collections haven’t been properly indexed, because that requires a great deal of staff time and funding to do well. So you and your students will need to spend time exploring. But the people involved with these institutions will be eager to work with you.

I’ll leave you with this: remember that the gap in scholarship on regions like Plattsburgh and the North Country provides an opportunity to begin doing original research. Yes, getting your facts right is essential, but you can do more, by not only asking “what happened here?” but also tying those local events, people and places to the larger historical trends that you’re uniquely placed to identify by way of your rigorous training as historians.

→ 9:47 AM, Apr 9

Reading notes: Matthew Thomas, Maple King: The Making of a Maple Sugar Empire 📚

Thomas, Matthew M. Maple King: The Making of a Maple Sugar Empire. Published by author, 2018.

The eponymous “Maple King” of Matthew Thomas’s book is George C. Cary (1864-1931), founder of the Cary Maple Sugar Company of St. Johnsbury, Vermont. In Thomas’s telling Cary was not only a producer of maple products, but a key figure in the modernization of the industry. Before Cary, maple sugar was produced a few hundred pounds at a time by small farmers and sold to consumers for household use. But after Cary, maple sugar had become a commodity. Small producers sold to large “packers,” who mixed their purchases together and sold a standardized product measured in tons rather than pounds. The primary purchaser was now large commercial interests rather than individuals.

Maple King is organized chronologically, and serves both as a biography of Cary himself and a history of the rise, fall, and rebirth of the Cary Maple Sugar Company and its associated brands. It makes sense to write the two together, for Cary and his business were, Thomas argues, inextricably intertwined. Cary began the company with the money he had made purchasing maple sugar to sell to tobacco manufacturers. It ran on credit through the 1910s and 1920s, much of that credit backed by Cary’s personal wealth. When Cary was no longer able to finance operations the company fell into bankruptcy.

The book begins with the limited scale of maple sugaring in the early nineteenth century and the industrial progress after the Civil War that transformed the maple products industry as it did the rest of American agriculture. It follows how Cary entered the maple sugar business and how he steadily moved from a role as middleman to become a producer, refiner, and marketer of maple products. Succeeding chapters detail Cary’s expansion beyond his own company, via the purchase of failing competitors, partnership with successful ones, and innovative marketing of his products. The culmination of this period was Cary’s dominance of bulk maple sugar production in the United States by the 1920s. But by the end of the decade business had slumped, hit hard by the Great Depression, as credit tightened and the tobacco companies that were Cary’s primary customers pushed for lower prices. The result was bankruptcy, after which pieces of the Cary empire were split off, sold, and sold again, up to the present day. Yet one brand in particular – Maple Grove Farms – has stayed strong.

One of my favorite things about Maple King is Thomas’s use of both documentary sources and material culture. Company documents, newspaper accounts, and archival photographs are key to the narrative. But Thomas is especially interested in the built landscape of the Cary maple empire, including the company plant and Cary’s residence in St. Johnsbury, the stores and restaurants that marketed Cary products to tourists, and the farms and sugarbushes in outlying areas where Cary sourced its raw materials.

Cary’s story is that of a self-made man, skilled in the world of business, not just rising to the top of an industry but creating one where there was none before. And what Cary created really was an industry: bulk purchasing, bulk refining, and the sale of a commoditized product. Today the iconic images associated with maple syrup include log cabins, trees hung with buckets, and a strapping man in a red and black checked shirt. These are very much the opposite of the landscape of mass production that Cary created. And yet, it was Cary’s own marketing machine that sold us the images of rustic sugaring.

When I first read Maple King I wondered how Cary’s business compared with another large maple sugar operation of the early twentieth century, Abbot Augustus Low’s Horse Shoe in the Adirondacks. I was happy to read on Thomas’s blog Maple Sugar History that his next book will be on Horse Shoe.

In the meantime, Maple King is an excellent narrative of the transformation of the maple sugar industry in the late nineteenth and early twentieth centuries, well worth reading for anyone interested in the histories of maple sugar and maple syrup, Vermont, or industrialized agriculture and forestry.

→ 3:35 PM, Feb 11

Historical Documents in a Digital Library: OCR, Metadata, and Crowdsourcing

This was originally written as a paper for Chris Tomer’s graduate class on Digital Libraries at the University of Pittsburgh this past spring. It’s my attempt to articulate some ideas about what makes online historical documents usable — or not usable — for researchers. Comments and criticism are welcome!

Over the past decade, a vast number of historical materials from the past three centuries have been digitized and placed on the Internet. The majority of these have been printed sources: newspapers and books. Some have been digitized as part of a proprietary system (for example, Readex’s Early American Imprints). Others have been made publicly accessible (Google Books, or projects from the Library of Congress under American Memory). The grand hope of all of these was to provide searchable full text online. This would be done through the magic of optical character recognition software. Surely, librarians might have thought ten or fifteen years ago, software quality and processing power would improve rapidly, soon permitting quick and accurate reproduction of any text.

The promise of OCR has gone largely unfulfilled. While modern printed sources are easily read, older ones are not. This should lead us to reconsider how we think about these documents — how we categorize them. In a pre-digital world, there is not much difference between the modern newspaper and the eighteenth-century one. Both are opened and easily skimmed, column by column. Contrast that to a manuscript — a letter or diary — which is much harder to read.

But in the digital age, if images and computer-generated text are available over the web, the older newspapers have more in common with manuscripts than they do with newer printed materials. The latter are searchable; the former are not.

To use a big digital collection of historic materials profitably, a researcher needs to be able to search the contents, to winnow down centuries of text. In other words, he or she needs either quality OCR or quality metadata. If a large collection is well OCR’d, you can get by without robust metadata. But if a collection is poorly OCR’d, text search will not work, and you need robust metadata for the library to be useful at all.

An example of the latter is the old microprint edition of Early American Imprints. The documents were in physical form and thus not searchable at all. But the makers had created robust metadata, which meant that, when libraries began using digital catalogs, the metadata could be ported into those catalogs. Early American Imprints would be searchable along with the rest of the library’s holdings.

Printed materials and manuscript materials should be seen as parts of a larger continuum. Indeed, the major difference is not between manuscript and print, but between modern (post-1950) printed materials on the one hand and pre-modern printed works and manuscripts together on the other. This primary difference is based on the ability to create an accurate OCR text from a high-quality scan of the paper source.

Modern printed materials can be easily transformed into accurate, searchable text. Twentieth-century printing methods produced a clear, precise, and, importantly, regular type. A computer can convert those shapes into text with little trouble.

An example is the work done on the George Washington papers by the Library of Congress. Many of Washington’s letters and papers had been transcribed and printed. These included works printed in the years 1898, 1931–1944, and 1976–1979. Even in 1998–2000, when OCR technology was significantly less powerful than it is today, librarians were able to achieve high accuracy rates: they claim 99.95%, or one error in every 2,000 characters. It is a sign of how far the technology has advanced that today 99.98% (sixty percent fewer errors than the George Washington Papers project) is considered the bar for high accuracy.
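The arithmetic behind those percentages can be checked with a short sketch. It uses exact fractions to avoid floating-point rounding; the function and variable names are made up for this example.

```python
from fractions import Fraction


def error_rate(accuracy_percent: str) -> Fraction:
    """Convert an accuracy percentage (given as a string) to an exact error rate."""
    return 1 - Fraction(accuracy_percent) / 100


washington = error_rate("99.95")  # George Washington Papers figure
modern_bar = error_rate("99.98")  # figure cited as today's bar

# Characters read, on average, per single error.
print(1 / washington)  # 2000
print(1 / modern_bar)  # 5000

# Share of errors eliminated by moving from 99.95% to 99.98%.
reduction = 1 - modern_bar / washington
print(reduction)  # 3/5
```

In other words, a change that looks tiny in the accuracy column (0.03 percentage points) more than doubles the stretch of clean text between errors.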

On the other hand, handwriting recognition is still fraught with problems and errors, even when done with software designed to learn a particular person’s script. As a result, it isn’t currently possible for a computer to transcribe a historical manuscript into searchable text.

The category in the middle, pre-modern printed materials, is deceptive. To a human eye, it is very similar to modern print. But for the OCR program, it is much more like handwriting. The deficiencies of current OCR capabilities when applied to older print sources are illustrated by a series of rigorous tests performed on the Nineteenth-Century Newspaper Project. This recent study found that character accuracy was 83.6 percent. Already this is alarming. But the corpus is searched not by character, but by word. And the word accuracy was significantly worse: 78 percent. Further, proper nouns (the names and places beginning with capital letters, and the words that a researcher would be most likely to search for) were recognized only 63.4 percent of the time.

Other researchers have found similarly appalling accuracy rates. A 2007 study by the Dutch National Library of the results produced by several OCR contractors found a significant amount of variance: “…the rates respondents gave for newspaper digitisation projects vary from 99.8% for 700,000 newspaper pages (word accuracy, manually corrected) to 68% (character accuracy, no correction) for 350,000 pages of early 20th century newspapers.” Another study, this one by the Australian Newspaper Digitisation Program, found similar variance: “In a sample of 45 pages to be representative of the libraries digitised newspaper collection 1803-1954, we found that raw OCR accuracy varied from 71% to 98.02% (character confidence, no correction).”

Clearly a raw character accuracy of 68 to 71% renders the resulting text useless for searching. Higher rates are more useful, yet even the 98% character accuracy at the high end of the Australian study will result in many missed words. If the Nineteenth-Century Newspaper Project’s finding that proper nouns are recognized significantly less often than their common-noun counterparts holds true here as well, then the power of full-text search is hampered even more. In sum, the researchers found that in their corpus of nineteenth-century newspapers, recall was high, precision was relatively low, and fallout was high.
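To see why even 98% character accuracy leaves many words unsearchable, here is a rough sketch. It assumes, purely for illustration, that each character is misread independently of its neighbors. Real OCR errors may well cluster on damaged parts of a page, so treat the output as a back-of-the-envelope estimate rather than a measurement. The function name `word_accuracy` is made up for this example.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character in a word is read correctly.

    Assumes character errors are independent, which is a simplification.
    """
    return char_accuracy ** word_length


# Estimated share of words read perfectly at 98% character accuracy.
for length in (4, 8, 12):
    print(length, round(word_accuracy(0.98, length), 3))
```

Under that assumption an eight-letter word survives intact only about 85% of the time, and longer words (surnames and place names among them) fare worse.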

The retrieval profile of OCR’d full text can be contrasted with a pre-digital form of searching: the index. An example is the comprehensive index to the Virginia Gazette from 1737 to 1790, prepared by historians at Colonial Williamsburg in 1950. This index contains references to proper names (people and places) and subject terms. (Colonial newspapers were generally populated by anonymous or pseudonymous pieces, so there are no author entries.) An index like this, rigorously compiled and checked, provides a very different profile: very high precision, moderate to good recall, depending on the rigor, and low fallout.
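For readers unfamiliar with the three retrieval measures used above, here is a minimal sketch of how they are usually computed. The counts in the example are invented for illustration and do not come from any of the studies cited; the class and field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SearchOutcome:
    relevant_retrieved: int    # relevant articles the search returned
    irrelevant_retrieved: int  # irrelevant articles the search returned
    relevant_missed: int       # relevant articles the search failed to return
    irrelevant_rejected: int   # irrelevant articles correctly left out

    def precision(self) -> float:
        """Share of returned articles that are actually relevant."""
        returned = self.relevant_retrieved + self.irrelevant_retrieved
        return self.relevant_retrieved / returned if returned else 0.0

    def recall(self) -> float:
        """Share of all relevant articles that the search found."""
        relevant = self.relevant_retrieved + self.relevant_missed
        return self.relevant_retrieved / relevant if relevant else 0.0

    def fallout(self) -> float:
        """Share of all irrelevant articles that were wrongly returned."""
        irrelevant = self.irrelevant_retrieved + self.irrelevant_rejected
        return self.irrelevant_retrieved / irrelevant if irrelevant else 0.0


# Invented numbers: a lookup in a carefully compiled index.
index_lookup = SearchOutcome(
    relevant_retrieved=40,
    irrelevant_retrieved=2,
    relevant_missed=10,
    irrelevant_rejected=948,
)
print(index_lookup.precision(), index_lookup.recall(), index_lookup.fallout())
```

A good index keeps the second count small, which is what pushes precision up and fallout down.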

#

What the Virginia Gazette index provides, in essence, is metadata. In a pre-digital world, this was the only way of “searching” the corpus. But in a world of digital libraries, such an index would seem unnecessary. And perhaps it would be, were the online text of newspapers acceptably accurate.

When digitizing the eighteenth-century run of the Virginia Gazette, the digital humanities specialists did not even seriously consider putting searchable text online. OCR was quickly found not to work well on the microfilm versions of the newspaper, and costs to have the text inputted manually were far beyond their budget. Instead they went through a laborious process of scanning and OCRing the index (which, typeset in Courier in the mid-twentieth century, could be done with high accuracy). They then placed the index online in HTML format, with links leading to the scanned images of newspaper pages. In this they were helped by another feature of the print index: it listed not just the issue date, but the page and column of the entry.

The creators envisioned a workflow that took advantage of the diligent labor of the mid-century index compilers and married it to the speed and convenience of the digital library. When working with the digital Virginia Gazette, a researcher would first search the index web page for a relevant term. Then he or she could tab back and forth between the index and a set of open images, quickly running through a list of results. All in all, the technique was successful; the disadvantage, of course, is that it is not so easy to read a run of consecutive issues, or even consecutive pages.
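The index-to-image linking described above can be sketched roughly as follows. This is not the Colonial Williamsburg implementation. The URL pattern, the class name, and the sample entry are all hypothetical, chosen only to show why having the page and column in the print index matters.

```python
from dataclasses import dataclass
from datetime import date
from html import escape

# Hypothetical address scheme for scanned page images.
IMAGE_URL = "https://example.org/gazette/{issue}/page-{page}.jpg"


@dataclass(frozen=True)
class IndexEntry:
    term: str         # proper name or subject heading from the print index
    issue_date: date  # date of the newspaper issue
    page: int         # page within the issue
    column: int       # column on that page


def image_url(entry: IndexEntry) -> str:
    """Build the address of the scanned page an index entry points to."""
    return IMAGE_URL.format(issue=entry.issue_date.isoformat(), page=entry.page)


def html_link(entry: IndexEntry) -> str:
    """Render one index entry as an HTML link to its page image."""
    where = f"{entry.issue_date.isoformat()}, p. {entry.page}, col. {entry.column}"
    return f'<a href="{escape(image_url(entry))}">{escape(entry.term)}: {where}</a>'


# Made-up entry for illustration only.
example = IndexEntry("Stamp Act", date(1766, 5, 16), page=2, column=1)
print(html_link(example))
```

Because each entry carries the page and column, the link can drop the researcher onto the right image, and the column number tells them where to look once it opens.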

Yet not every old newspaper was printed in a town that had been bought by the scions of a Gilded Age dynasty. The money that John D. Rockefeller pumped into Colonial Williamsburg for restoration of the colonial city, and the research to make that possible, was not present everywhere. Thus, most eighteenth- and nineteenth-century newspapers do not have detailed, proofread indexes waiting as a gateway to a digital edition. Another source of metadata must be found.

OCR might be one option: it can read article titles with a moderate degree of accuracy, and, if it could pick out proper nouns with any consistency whatsoever, could index those. But, given the poor quality of the microfilm that is used to make scans of newspaper pages, OCR simply can’t cope with the demands. The amount of cleanup required would mean that librarians might as well just read the articles and index the text themselves. At least in this way they could index concepts and make a true subject index — not something that literal OCR software can do.

A workaround, tried by the Australian Newspaper Digitisation Program, is to correct the OCR version of only the headline and the first several lines of text, in hopes that this would catch the most significant aspects of the article. But in practice this still takes a great deal of time. And, from the historical point of view, another problem emerges. The standard journalist’s model of writing in the present day, in which a story begins with a lead paragraph containing a summary and essential details, was simply not part of the eighteenth- or nineteenth-century repertoire. Articles from those periods are just as likely to unfold slowly, like an oration rather than a summary of facts.

To recap: for most old newspapers no preexisting metadata is available, the software to create that metadata correctly is similarly absent, and the costs in time and money for librarians to create the metadata on their own are unworkable. Yet the poor quality of OCR means that, without metadata, the digital text itself is an imperfect and unreliable reflection of the actual content of the newspaper.

What is to be done? The Australian Newspaper Digitisation Program came up with an innovative solution: crowdsourcing. They made it possible for users to “view, edit, correct and save OCR text on whatever article they happened to be looking at.” Knowing that particular documents had unusually bad OCR, they highlighted those images to encourage patrons to improve them. The crowdsourcing was an instant success. Within three months of the project’s launch, 1,200 individuals had edited “700,000 lines of text within 50,000 articles.” Further, the volunteer correctors were (based on information from the two-thirds who had registered for accounts rather than working anonymously) largely experts in the places and time period covered in the newspapers. This meant they were better able to use context to puzzle out difficult words.

As the project progressed, more and more users began to edit the newspaper text. They also developed elements of a community. Since the interface was very basic and there was no forum area, they took to using the comment mechanism as a way to interact with their fellow correctors. To their infinite credit, the staff of the Australian Newspaper Digitisation Program have not tried to exert particular control over user activities. Realizing that they’ve got a good thing going (valuable work being performed by a vibrant community), they have instead stood back and watched that community develop. They found that “having no moderation and being open and free like the internet has raised many questions but has so far resulted in bringing more advantages than issues to the program.”

This is the sort of project over which librarians in the United States seem to drag their feet, unwilling to give up control. The best example of crowdsourced editing of historical newspapers I know of in this country is the one put together by University of Virginia history professor Ed Ayers (now president of the University of Richmond). Ayers had his lecture classes, often several hundred students, go through nineteenth-century Virginia newspapers and cull local news to be put into a database by county. This is, of course, a different kind of project: the intent is to produce a refined database rather than improve the primary sources. But, once again, hundreds or thousands of people working on small bits of a project produced usable data far superior to what modern software would have come up with.

#

So in at least two cases crowdsourcing has worked as a way to produce usable, index-ready text from image files and low-quality OCR. Old newspapers are but one source for which this technique has potential. Other printed materials could be made accessible, and beyond print is manuscript. Historical archives in the United States and elsewhere are notoriously conservative institutions. But it would take relatively little effort and not much more in the way of resources for them to provide the materials that could generate their own online community of researchers. It would be enough to provide a digital library of reasonably decent image files of manuscripts, and a web interface that allowed researchers to transcribe the material for their own use while also saving the transcription for other patrons’ benefit. Allow users to create tags for the material, as the Australian project does, and you also have the beginnings of a robust set of metadata.
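As a rough illustration of how little machinery that proposal needs, here is a minimal in-memory sketch. It is not modeled on the Australian system or any real archive’s software. Every class, method, and file name is hypothetical, and a real service would add accounts, revision history, and persistent storage.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ManuscriptPage:
    page_id: str
    image_file: str            # scanned image shown to the researcher
    transcription: str = ""    # latest shared transcription, if any
    tags: set = field(default_factory=set)


class SharedArchive:
    """Toy archive: page images plus user-supplied transcriptions and tags."""

    def __init__(self) -> None:
        self.pages = {}
        self.tag_index = defaultdict(set)  # tag -> ids of pages carrying it

    def add_page(self, page_id: str, image_file: str) -> None:
        self.pages[page_id] = ManuscriptPage(page_id, image_file)

    def save_transcription(self, page_id: str, text: str) -> None:
        # Saved once, the text is available to every later patron.
        self.pages[page_id].transcription = text

    def add_tag(self, page_id: str, tag: str) -> None:
        normalized = tag.strip().lower()
        self.pages[page_id].tags.add(normalized)
        self.tag_index[normalized].add(page_id)

    def find_by_tag(self, tag: str) -> list:
        return sorted(self.tag_index.get(tag.strip().lower(), set()))


archive = SharedArchive()
archive.add_page("letter-001", "letter-001.jpg")
archive.save_transcription("letter-001", "Dear Sir, ...")
archive.add_tag("letter-001", "Maple Sugar")
print(archive.find_by_tag("maple sugar"))
```

The tag index is the point: it accumulates as a side effect of researchers doing their own work, and it can be searched even where no reliable full text exists.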

→ 8:35 PM, May 28