CTWatch Quarterly » The Shape of the Scientific Article in The Developing Cyberinfrastructure

The Shape of the Scientific Article in The Developing Cyberinfrastructure

Clifford Lynch, Coalition for Networked Information (CNI)

Scientific Literature that is Computed Upon, Not Merely Read by Humans

In the previous section, we explored a few of the ways in which human readers may expand and extend their interactions with the scientific literature through the mediation of a new generation of software. But the use of the corpus of scientific literature is already changing in other ways as well. Human beings still read (and interact with) the literature one article at a time, but we are also seeing the deployment of software that computes upon the entire corpus of scientific literature (perhaps restricted by discipline, and perhaps also federated with proprietary and/or unpublished literature or auxiliary databases). Such computation includes not only the now familiar and commonplace indexing by various search engines, but also the computational analysis, abstraction, correlation, anomaly identification and hypothesis generation that are often termed “data mining” or “text mining.”

The implications of this shift are extensive and complex, and I want to sketch only a few of them here. First, there will be greater demand for scientific literature corpora to be available as part of the cyberinfrastructure, and available in a way (technically, and in terms of licensing, legal and economic arrangements) that facilitates ongoing computation on the literature by groups of collaborating researchers, including groups (“virtual organizations”) that are often assembled fairly casually from across multiple institutions. The barriers here are formidable: access arrangements for publisher-controlled literature are most commonly established on an institutional basis; these licenses often specifically prohibit large-scale duplication of the text corpora for this kind of computational use; and today most publishers do not make technical provision for arbitrary computation of the type envisioned here.

More important to the changing nature of the individual article, as opposed to the literature as a whole, is that the computational techniques applied to the current literature base make extensive use of heuristics (as well as various auxiliary databases, dictionaries, ontologies and other resources). Basically, they use algorithms to guess (with increasingly good accuracy) whether “Washington” in a given bit of text refers to a person, a state, or a city (and if so, which one), or whether a given term is the name of a gene, a chemical compound, a species, or some other entity of interest. As we create new literature going forward, it makes sense to specifically tag some of these terms of interest to allow more accurate computation on the articles. There are also interesting possibilities for retrospectively tagging older literature, or even for running the current best heuristics to provisionally tag the older literature and then selectively (and perhaps collaboratively) applying human analysis to review the provisional tags that are most suspect (Greg Crane and his colleagues at the Perseus Project have run some fascinating pilots of this type in doing markup of classical texts). The questions here are what entities merit tagging, how standards are set for tagging such entities, and what combination of author and publisher should take responsibility for actually doing the markup. There are delicate issues about how to manage the evolution of the tagging over time, and also how to manage it across disciplines in such a way as to facilitate interdisciplinary work. There’s a difference, for example, between viewing the presence of tags as conclusive positive information and being able to count on the absence of a tag as conclusive negative information.
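The provisional tagging workflow described above can be made concrete with a small sketch in Python. It is only an illustration under stated assumptions: the `GAZETTEER` dictionary, its cue words, the add-one scoring rule and the `REVIEW_THRESHOLD` value are all hypothetical, invented here for the example. They are not drawn from the Perseus Project or from any production text-mining system, which would rely on far richer dictionaries, ontologies and statistical models.

```python
import re
from dataclasses import dataclass

# Hypothetical gazetteer: each term maps to candidate entity types,
# each with a few context cue words. All entries are illustrative.
GAZETTEER = {
    "Washington": [
        ("person", {"president", "george", "general"}),
        ("state", {"state", "seattle", "olympia"}),
        ("city", {"capital", "congress", "potomac"}),
    ],
    "BRCA1": [
        ("gene", {"gene", "mutation", "expression"}),
    ],
}

# Assumed cutoff: tags scoring below this are queued for human review.
REVIEW_THRESHOLD = 0.5


@dataclass
class Tag:
    term: str
    entity_type: str
    confidence: float


def tag_text(text):
    """Provisionally tag known terms, guessing each one's entity type."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    tags = []
    for term, senses in GAZETTEER.items():
        if term.lower() not in tokens:
            continue
        # Add-one scoring: every candidate type starts at 1 and gains
        # a point for each of its cue words found in the text.
        scores = [(1 + len(cues & tokens), etype) for etype, cues in senses]
        total = sum(score for score, _ in scores)
        best_score, best_type = max(scores, key=lambda pair: pair[0])
        tags.append(Tag(term, best_type, best_score / total))
    return tags


def needs_review(tags):
    """Select the most suspect provisional tags for human analysis."""
    return [tag for tag in tags if tag.confidence < REVIEW_THRESHOLD]


if __name__ == "__main__":
    samples = [
        "President Washington crossed the river with his general staff",
        "Washington reported record rainfall",
        "A BRCA1 mutation was observed",
    ]
    for sample in samples:
        tags = tag_text(sample)
        print(sample, tags, needs_review(tags))
```

In this sketch, a sentence that mentions “Washington” with no supporting cue words still receives a tag, but with low confidence, so it lands in the review queue. Note also that text containing a gene name missing from the dictionary produces no tag at all, which is why the absence of a tag cannot be read as conclusive negative information.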

