Data Mining, Collaboration, and Institutional Infrastructure for Transforming Research and Teaching in the Human Sciences and Beyond
Cathy N. Davidson, Duke University
CTWatch Quarterly
May 2007

The first generation of the digital humanities was all about data. The excitement and impetus of digital humanities throughout much of the 1990s and continuing to the present was that massive data bases could be digitized, searched, and combined with other data bases for interoperable searches that yielded more complex and complete results in a shorter amount of time than the human mind had ever imagined possible.1 In this way, revolutions in digital humanities were similar to those in other fields. In biological science, sequencing the genome could never have happened without dramatic increases in computational power. In natural science, we know more than ever about global warming due to such projects as the Millennium Ecosystem Assessment (2005), which evaluates the global changes to 24 separate life support systems (including biodiversity, ecosystems, and the atmosphere).2 In social sciences, human complex systems theory combines results based on social network theory, demography, migratory patterns, social regulation and laws to analyze movements of persons and goods globally. And in the human sciences or humanities, myriad projects digitize the texts and artifacts of world culture, from the beginning to the present, in order to create new understandings of the history of ideas.

Second-generation digital humanities are the scholarly equivalent of what Tim O’Reilly has dubbed “Web 2.0.” If Web 1.0 was the World Wide Web’s collection of websites and data bases (what human scientists would call “archives”), Web 2.0 is a fully developed platform that serves a variety of applications to its end users.3 However, there is also an important difference between the business and humanistic history of cyberinfrastructure. O’Reilly’s term “Web 2.0” was coined to differentiate what was coming next from what didn’t work in the burst dot.com economy. By creating a “before” and “after,” the concept of Web 2.0 was designed to encourage a new generation of investors in internet technologies. But there is no equivalent “bad before” in digital humanities. Rather, the current generation of digital humanities extends and builds upon the foundation of Humanities 1.0.

The transformation of archives into interoperable and professionally-constructed digital databases has changed the research and pedagogical questions of our age, by providing the individual researcher almost instantaneous access to far more data than any one person could gather in a lifetime and by allowing more people access to these materials than ever before. Let me give an example of how transformative this has been for teaching and education in the human sciences. Back in the 1980s and 1990s, when I taught courses on mass education, reading, and writing during the highly contentious political period following the American Revolution, I used to have graduate students do archival research in early American newspapers and magazines, some of which were available on microfilm or microfiche, unindexed.4 A student might have spent a hundred hours rolling the films in the dizzying light of those unwieldy machines (I had one in my office and used to call it, without affection, The Green Monster). If the student found one good example, it was a successful project. Two examples constituted a triumph. In many cases, the search was so frustrating that the student might well have applied for a scholarship to travel to an archive in New England, such as the American Antiquarian Society, where the resources were far richer.

If I teach that course now, my students can go to searchable data bases of early American imprints, of eighteenth-century European imprints, of South American and (growing) African archives, and of archives in Asia as well. A contemporary student could, in far less time, not only use digitized and indexed archives to search U.S. data bases but also make comparisons across and among popular political movements world-wide, and possibly make arguments about the spread of dissent along with commodities such as tea, sugar, or rice. The barbarism and ubiquity of the slave trade as part of the spread of global systems of capital also made for an exchange of ideas about personhood, statehood, individual rights, and human rights.

What is salient about that example is that, in the humanities, as in the sciences and social sciences, cyberinfrastructure does not simply change the quantity of information. It allows for the conceptualization of more complex, intertwined, and interconnected problems that are as vast as the data bases themselves. However, the immense intellectual ambition of projects enabled by new access to massive data sets is precisely what has spurred the evolution to what I’m calling second-generation digital humanities. As with O’Reilly’s Web 2.0, in the human sciences we are seeing far more user-generated content, customization, collaborative archiving, writing, and research, distributed among large numbers of scholars, students, and sometimes amateur intellectuals who, together, are arriving at new and often challenging concepts and not simply at ever-increasing amounts of data.

A project such as the International Dunhuang Project combines both first and second-generation digital humanities. It is both a professionally-archived digitization project and one that is collaborative across multiple sites. The city of Dunhuang was a crossroads on the trading route that would later become known as the “Silk Road.” When archeologists excavated Dunhuang in the nineteenth century, they divvied up the spoils to museums in Beijing, Berlin, London, Tokyo, St. Petersburg, and elsewhere. Now, shards or fragments of text in one physical location are being put together virtually with those in another to create legible artifacts that are changing our view of what is “West” about so-called Western culture. Dunhuang flourished from 100 BC to 1200 AD.5 Over 20 languages have been found in materials there, underscoring that cultural fusion and exchange were happening from East to West, North to South, from Africa to Japan, and from at least the time of Julius Caesar.

A second project is even more exemplary of how second-generation digital humanities work. The Law in Slavery and Freedom Project shows how laws in one country reverberate around the world, with consequences for humans, institutions, and states.6 In this project, much of the content (the archive itself) is located and digitized by students who are learning collaboratively even as they are making interoperable databases for others to learn from. Classes are coordinated across universities in the US, France, Germany, Brazil, Canada, and Cuba. New archives remake history, remake causalities. Like the Dunhuang project, this one is paradigm-changing in its content but also in its collaborative teaching/learning/research/archiving methods.

We live in an exciting time for the human sciences, yet the amount of material to be digitized is so vast that, in real terms, we are only at the tip of the data iceberg. In non-textual fields (such as art, music, performance studies, media studies) we are at the tip of that tip. As is well-rehearsed by now, the data needs of the humanities are incalculable. The Sloan Digital Sky Survey—the most ambitious astronomical study ever undertaken—uses 40 terabytes of data.7 By contrast, the Survivors of the Shoah’s Visual History project requires 200 terabytes of compressed data.8 These enormous data needs (exacerbated by the under-funding of the human sciences) result in impoverished resources in many areas, especially in data-intensive areas such as media studies. For example, the Museum of Television and Radio has an archive of 120,000 English-language programs, beginning with the 1918 speech of Labor Leader Samuel Gompers. Only 1500 of these have been digitized.9 None are searchable. And, as historian Timothy Lenoir reminds us, the situation is worst of all for New Media. He calls ours not the “Information Age” but the “Digital Dark Ages,”10 because we have preserved almost none of the archive of the virtual materials (early code, software, hardware, websites, the first on-line games) of Web 1.0. Even digitized financial records of major corporations and universities turn out to be inaccessible now because of rapidly-changing hardware and software that left brontobytes of data behind.

Yet, even acknowledging that we are only beginning to digitize the record of the world’s knowledge, how are we going to make sense of all that data? No one person can. Projects such as Dunhuang or Law in Slavery and Freedom require many scholars, working from different intellectual traditions, with different assumptions and different languages, pooling not only local archives but interpretations of those archives. And we need interpretations that are not conceptually rooted in Western ideas that create the intellectual binaries that pervade code and carry over into what currently constitutes AI (Artificial Intelligence).

Thus, in terms of next-generation cyberinfrastructure, we need to start at the most foundational level and envision and implement a globalized semantic web. The linguistic choices embedded in semantics-based searches must incorporate a humanistic and culturally-motivated understanding that terms themselves embody cultural ideologies and that concepts formulated in slightly different ways in different languages encode different epistemologies, ontologies, taxonomies, and histories. Moving from indexical to semantic searches has to be undertaken with a cultural awareness of what is or is not included in “semantics.”

That brings me to another point, which may appear tangential but which is at the heart of the matter. New ways of thinking need support. If, at present, academic rewards go to the author of a monograph, especially one that posits a different analytical or interpretive hypothesis, for Human Sciences 2.0 we need to think of ways to reward teams of scholars working cross-culturally on collaborative projects. Collaborative work should count, and here humanists can use models that scientists have developed for determining credit in co-authored projects with multiple investigators.

Bibliographic work, translation, and indexical scholarship should also have a place in the reward system of the humanities, as they did in the nineteenth century. The split between “interpretation” or “theoretical” or “analytical” work on the one hand and, on the other, “archival work” or “editing” falls apart when we consider the theoretical, interpretive choices that go into decisions about what will be digitized and how. Do we go with taxonomy (formal categorizing systems as evolved by trained archivists)? Or folksonomy (categories arrived at by users, many of which offer less precise organization than professional indexes but often more interesting ones that point out ambiguities and variabilities of usage and application)?

We also need to rethink paper as the gold standard of the humanities. If scholarship is better presented in an interactive 3-D data base, why does the scholar need to translate that work to a printed page in order for it to “count” towards tenure and promotion? It makes no sense at all if our academic infrastructures are so rigid that they require a “dumbing down” of our research in order for it to be visible enough for tenure and promotion committees.

As colleagues in the sciences and engineering will acknowledge, these are not simply humanistic issues by any means. Which brings me to a final point. Once we have changed what we value as scholarship, we need to think through the departmental and disciplinary systems within our universities. Unless we find ways to “link” the different kinds of knowledge and analysis offered by different disciplines, we will be generating data but not really understanding the implications and import of that data. This is exactly why HASTAC (“haystack”) was created. A voluntary network of cross-disciplinary scholars realized that we had to form a “virtual university” across disciplines where scholars could think together, without institutional boundaries, about what cyberinfrastructure is needed. We needed to conceive better collaborative models of participation, implementation, and interpretation.11

We are in an oddly contradictory age where revelations in the computational, natural, and biological sciences evoke the deepest issues about what it means to be human. And yet the present-day academy seems determined to undervalue exactly those disciplines—the humanities, arts, and interpretive social sciences—that offer the most sustained and rigorous methods and insights into the category of the “human.” In different areas across the human sciences, we have addressed the deeply contested definitions and applications of the “human” in ways that can challenge (and thus make better) and also support new scientific work. More and more of our nationally funded grants are requiring a social and ethical component in studies, precisely because so much work in science is moving into areas with implications that are profound (in hopeful or disturbing ways) for the future of humanity. Yet, within our universities, humanists are often not at the table when major scientific projects with humanistic implications are proposed. And when they are, the work they do in tandem with scientists often does not count towards tenure and promotion within their humanistic departments. This, too, is an academic infrastructure issue that can only impede the development of cyberinfrastructure. We must attend to these social, institutional, and infrastructural arrangements and make them as flexible—as interoperable—as other aspects of cyberinfrastructure.

1 “Our Cultural Commonwealth: The Final Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences,” December 13, 2006. http://www.acls.org/cyberinfrastructure/
2 Millennium Ecosystem Assessment - http://www.maweb.org/en/index.aspx
3 O’Reilly, T. “What is Web 2.0? Design Patterns and Business Models for the Next Generation of Software,” September 30, 2005. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
4 Davidson, C. N. Revolution and the Word: The Rise of the Novel in America (1986; New York and Oxford: Oxford University Press; Expanded Edition, 2004); and Davidson, ed., Reading in America: Literature and Social History (Baltimore: Johns Hopkins University Press, 1989; Second Edition, 1992).
5 International Dunhuang Project - http://idp.bl.uk/
6 The Law in Slavery and Freedom - http://sitemaker.umich.edu/law.slavery.freedom/
7 Sloan Digital Sky Survey - http://www.sdss.org/
8 USC Shoah Foundation Institute for Visual History and Education - http://www.usc.edu/schools/college/vhi/
9 The Museum of Television and Radio - http://www.mtr.org/
10 Lenoir, T. “Emerging from the Digital Dark Ages: Challenges and Opportunities for the History of Science and Technology in the Information Age,” in Roland Ris, ed., Technikforschung: Zwischen Reflexion und Dokumentation, Bern: Swiss Academy of Humanities and Social Sciences, 2004, pp. 11-26; and “Making Studies in New Media Critical,” in Oliver Grau, ed., MediaArtHistories, Cambridge, Mass.: MIT Press, 2007, pp. 355-380.
11 HASTAC (Humanities, Arts, Science, and Technology Advanced Collaboratory) - http://www.hastac.org/

URL to article: http://www.ctwatch.org/quarterly/articles/2007/05/data-mining-collaboration-and-institutional-infrastructure/