CTWatch Quarterly » Reinventing Scholarly Communication for the Electronic Age

Reinventing Scholarly Communication for the Electronic Age

J. Lynn Fink, University of California, San Diego
Philip E. Bourne, University of California, San Diego

BioLit: Blurring the Boundary Between Publications and Data

To initiate a community effort, we are developing BioLit, a set of open source tools that will facilitate the integration of open literature and biological data. Initially, these tools will be implemented using the entire corpus of the Public Library of Science (PLoS) and the Protein Data Bank (PDB) as testing platforms. The tools are being designed, however, to be generally applicable to all open access literature and other biological data.

The Public Library of Science (PLoS) is an ideal partner because it is leading the open access movement, a fundamental change in scientific publication that significantly improves the scientific community's access to literature. Articles are published under a Creative Commons Attribution License,⁶ whereby the community may use the article in any way they see fit, provided they attribute the original authors. Furthermore, the copyright to the material remains with the author. Once published, the article is freely available in its entirety to anyone. This means that all PLoS articles, which are of very high quality, are freely available and freely usable, and together constitute a large body of text covering biology, medicine, computational biology, genetics, pathology and a variety of other fields.

The Protein Data Bank (PDB) is one of the oldest databases in biology and contains all publicly accessible three-dimensional structures of biological macromolecules, currently over 44,000.⁷ The PDB is used by over 10,000 scientists every day and one structure is downloaded, on average, every second. Over the last decade or so, PDB structures have appeared in roughly 2% of all open access life science journal articles, making the PDB an obvious target for an effort to integrate data with open literature.

Specifically, the BioLit tools will capture meta-data from an article or manuscript by identifying relevant terms and identifiers and adding mark-up to the original NLM DTD-based XML document containing the article. Terms relating to the life sciences are identified using ontologies and controlled vocabularies specific to this field, such as the Gene Ontology⁸,⁹ and Medical Subject Headings (MeSH).¹⁰
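The term-identification step described above can be illustrated with a minimal Python sketch. It is not the BioLit implementation: the two-entry vocabulary, the function name `tag_terms`, and the choice to store the ontology identifier in the `content-type` attribute of a `named-content` element are all assumptions made for illustration, and the sketch only scans the text at the start of a paragraph, before any existing child element.

```python
import re
import xml.etree.ElementTree as ET

# Toy vocabulary (illustrative only): surface term -> ontology identifier.
VOCABULARY = {
    "apoptosis": "GO:0006915",
    "DNA repair": "GO:0006281",
}


def tag_terms(paragraph, vocabulary):
    """Wrap vocabulary terms in the leading text of a <p> element.

    Returns the number of terms that were marked up. Only paragraph.text
    (the text before any existing child element) is scanned.
    """
    text = paragraph.text or ""
    # Try longer terms first so multi-word terms win over their substrings.
    alternatives = "|".join(
        re.escape(term) for term in sorted(vocabulary, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    lookup = {term.lower(): ident for term, ident in vocabulary.items()}

    matches = list(pattern.finditer(text))
    if not matches:
        return 0

    # Text before the first match stays as the paragraph's leading text.
    paragraph.text = text[:matches[0].start()]
    following = matches[1:] + [None]
    for position, (match, nxt) in enumerate(zip(matches, following)):
        # Assumption: the identifier is carried in the content-type attribute.
        node = ET.Element(
            "named-content", {"content-type": lookup[match.group(0).lower()]}
        )
        node.text = match.group(0)
        # The text up to the next match (or the end) becomes this node's tail.
        end = nxt.start() if nxt is not None else len(text)
        node.tail = text[match.end():end]
        paragraph.insert(position, node)
    return len(matches)
```

Applied to a paragraph such as `<p>Defects in DNA repair can trigger apoptosis in cells.</p>`, the sketch wraps both terms in `named-content` elements and leaves the visible text of the paragraph unchanged. A real tool would need the full ontologies, handling of synonyms and ambiguity, and traversal of nested mark-up.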

These meta-data are captured in different ways depending on the status of the article. A tool we are developing with Microsoft, which will be implemented as a plug-in for Word, will allow this information to be captured while the manuscript is being written. This strategy gives the author full and fine control over the exact meta-data that are captured. The plug-in will prompt the author with choices, or will allow the author to customize the meta-data if no appropriate matches are found in the resources known to the plug-in. Cross-references to biological databases will also be detected and added to the meta-data, allowing the manuscript content to be more easily integrated with the database.
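As a rough illustration of how cross-references to a database might be detected, the following Python sketch looks for PDB identifiers in running text. A PDB identifier is four characters long: a digit from 1 to 9 followed by three alphanumeric characters. The requirement that the identifier follow a cue such as "PDB ID" or "PDB:" is an assumption of this sketch, chosen to reduce false matches on years and similar tokens, and the function name `find_pdb_ids` is hypothetical.

```python
import re

# Four-character PDB identifier preceded by a textual cue such as
# "PDB", "PDB ID", "PDB code", or "PDB entry" (the cue list is an assumption).
PDB_REFERENCE = re.compile(
    r"\bPDB(?:\s+(?:ID|code|entry))?[:\s]+([1-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def find_pdb_ids(text):
    """Return the distinct PDB identifiers cited in text, upper-cased,
    in order of first appearance."""
    seen = []
    for match in PDB_REFERENCE.finditer(text):
        pdb_id = match.group(1).upper()
        if pdb_id not in seen:
            seen.append(pdb_id)
    return seen
```

A pattern this simple will miss identifiers cited without a cue and can still be fooled by text such as "PDB 2007", so an interactive tool like the plug-in described above would presumably present candidates to the author for confirmation rather than accept them automatically.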

Articles that have already been published can be post-processed through a related tool that identifies the same types of meta-data and generates similar XML mark-up. With this approach the meta-data may not be as rich, because the author has had no direct input, but capturing any information is a significant advance. Processing all PLoS articles, and later all open access articles, with this tool will generate a considerable amount of meta-data, which will help establish the integration effort in the community.

The BioLit tools will bring about a change in the authoring process that is nearly transparent to the author, yet will capture significant new meta-data and establish an informative connection between the data and the article describing the data. Effective use of these tools will provide new views on the traditional literature and on biological databases. The literature will simply become another interface to biological data in a database, and the database can recall appropriate literature, not as abstract-sized or full-paper-sized chunks, but as knowledge objects that annotate the data being examined (see Figure 1 for an example).

Figure 1. A PDB entry enhanced with additional literature. Historically, PDB entries reference the original article in which the macromolecular structure was published (Primary Citation). The BioLit tools will add text and figures from other articles that reference this structure in order to provide data that describe other aspects of the protein (Additional Literature).
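The view in Figure 1 implies a lookup from a structure identifier to the passages that mention it. One way such a lookup could be organized is sketched below in Python. The `Passage` record, the function names, and the use of a plain in-memory dictionary are all hypothetical; the source does not describe how BioLit stores or retrieves this information.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """A small 'knowledge object': one passage from one article."""
    article_id: str  # hypothetical article identifier, e.g. a DOI
    text: str        # sentence or figure caption mentioning the structure


def build_literature_index(annotated_passages):
    """Group passages by the PDB identifier they mention.

    annotated_passages is an iterable of (pdb_id, Passage) pairs, such as
    might be produced by processing marked-up articles.
    """
    index = defaultdict(list)
    for pdb_id, passage in annotated_passages:
        index[pdb_id.upper()].append(passage)
    return dict(index)


def additional_literature(index, pdb_id, primary_article_id):
    """Return passages about pdb_id from articles other than the
    primary citation."""
    return [
        passage
        for passage in index.get(pdb_id.upper(), [])
        if passage.article_id != primary_article_id
    ]
```

Keying the index on the database identifier, rather than on the article, is what lets the database entry act as the starting point and the literature as the annotation.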

