Reinventing Scholarly Communication for the Electronic Age
J. Lynn Fink, University of California, San Diego
Philip E. Bourne, University of California, San Diego
CTWatch Quarterly
August 2007

Introduction

Cyberinfrastructure is integral to all aspects of conducting experimental research and distributing the results. However, it has yet to make a similar impact on the way we communicate that information. Peer-reviewed publications have long been the currency of scientific research, as they are the fundamental unit through which scientists communicate with and evaluate each other. Yet, in striking contrast to the data they describe, publications have barely benefited from the opportunities offered by cyberinfrastructure. While the means of distributing publications have vastly improved, publishers have done little else to capitalize on the electronic medium. In particular, semantic information describing the content of these publications is sorely lacking, as is the integration of this information with data in public repositories. This is perplexing, considering that many of the basic tools for marking up and integrating publication content in this manner already exist: a centralized literature database, relevant ontologies, and machine-readable document standards.

We believe that the research community is ripe for a revolution in scientific communication and that the current generation of scientists will be the one to push it forward. These scientists, generally graduate students and new post-docs, have grown up with cyberinfrastructure as a part of their daily lives, not just a specialized aspect of their profession. They have a natural ability to do science in an electronic environment without the need for printed publications or static documents and, in fact, can feel quite limited by the traditional format of a publication. Perhaps most importantly, they appreciate that the sheer amount of data and the number of publications make traditional methods of keeping current with the literature untenable.

To do our part to set this revolution in motion, we are working with the Public Library of Science1 and a major biological database, the RCSB Protein Data Bank,2 to destroy the traditional concept of a publication separate from a data repository and reinvent it as an integration of the two information sources. Here, we describe new authoring tools that are being developed to consummate the integration of literature and database content, tools being developed to facilitate the consumption of this integrated information, and the anticipated impact of these tools on the research community.

The Legacy of Scientific Publishing

Publications are the currency of research. They are the mechanism through which scientists communicate their results to their peers and the means by which they evaluate each other. This model is unlikely to change completely. However, the electronic age – the advent of cyberinfrastructure – is introducing some differences in this paradigm, a phenomenon that has been observed previously.3,4,5

One significant difference is the requirement of many publishers that authors deposit the data described in a publication in an appropriate public repository concomitant with publication of the manuscript. For example, macromolecular structure data must be deposited in the Protein Data Bank (PDB) when a manuscript describing the macromolecule's structure is published. As part of the deposition process, a reference to the publication is included. However, this is generally the only link between paper and data repository, even though the paper may contain a wealth of information that would be relevant to someone viewing the deposition record.

A similar scenario exists with publications that reference a record in one of these repositories. For example, if someone has used structural data describing a protein from the PDB, the structure will be referenced by ID in the publication, but this reference is usually not recorded in the PDB itself. Someone viewing the structure in the database will see only the citation and abstract of the publication describing the generation of the initial data. Much of the research performed subsequent to structure deposition concerns functional information about the protein, information that would surely be useful to anyone interested in that molecule but that is not trivial to obtain.

One possible reason why a link between such a secondary publication and the database is not made is that there is no value associated with database annotation.4 Scientists are valued for their peer-reviewed publications, not for their database annotations (which are not independently peer-reviewed). Furthermore, proper database annotation takes time and effort, so there is little incentive for the endeavor.

Another barrier contributing to the disconnection between publications and deposited data is scientific publishers' slow adoption of cyberinfrastructure. Most publishers have at least an online presence and generally make the articles they publish available online for download, viewing, or printing, but they do little else with the information they are communicating. For them, cyberinfrastructure is little more than a new means of distribution.

A significant issue complicating extensive use of publication content is intellectual property rights, an issue that is currently quite controversial. Some publishers have risen to the challenge and adopted the open access philosophy, publishing their articles under a Creative Commons Attribution License. This means that the content is free to use and distribute as long as the original attribution is maintained. The Public Library of Science (PLoS) and BioMed Central (BMC) are good examples of publishers in the life sciences who have embraced the open access model. Articles from these and other open access publishers are collected in a central repository, PubMed Central, which archives all open access articles in the life sciences. To date, it holds over 54,000 articles from over 300 journals – hardly a major representation of the field, but a solid beginning.

The centralization of open access articles is a significant step forward, but even more significant is the storage of these articles in a standardized, machine-readable format: the National Library of Medicine (NLM) DTD. This document format allows all open access articles to be archived as XML files, which include some semantic mark-up of the content as well as unique identifiers for the article itself and the objects (figures and tables) within it. This format also allows the articles to be parsed for relevant information. Unfortunately, little value is added to the article content itself. To recall the earlier example, most authors who reference a protein structure do not include a link to the structural data in the PDB. To find a mention of a PDB ID, one would have to perform a full-text search of the article content (including figure captions). Even then, a successful search result may not actually reference a PDB ID – the matching string could belong to a different database or have an entirely different meaning, since there is no semantic context for it. (This is not always true; some papers do include direct references to the PDB using the xlink tag.)
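To make the ambiguity concrete, here is a minimal sketch (in Python) of the kind of naive full-text scan just described. The ID pattern (a digit followed by three alphanumeric characters) and the input file name are illustrative assumptions, not part of any existing tool:

    # Naive scan for PDB-ID-like strings in an NLM DTD article file.
    # Without semantic mark-up, matches are only *candidates*: years
    # such as "2005" or tokens such as "4th1" match the same pattern.
    import re
    import xml.etree.ElementTree as ET

    PDB_ID_PATTERN = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

    def candidate_pdb_ids(nlm_xml_path):
        """Return strings that merely look like PDB IDs anywhere in
        the article text, including figure captions."""
        tree = ET.parse(nlm_xml_path)
        text = " ".join(tree.getroot().itertext())
        return sorted(set(PDB_ID_PATTERN.findall(text)))

    # "2HHB" (hemoglobin) would match, but so would many false positives.
    print(candidate_pdb_ids("journal.pcbi.0010034.xml"))  # hypothetical file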

Even if a link is included in the PDB to an article that mentions a PDB ID, it is not clear what the value of that reference is to the reader. Does the article describe the biochemical function of the protein or was the structure used in training a computational prediction algorithm? Rather than direct the reader to an article that may not be of interest, it would be useful to include some indication of the type of content of the article. Semantic mark-up of the article content is necessary. Using ontologies or controlled vocabularies within the framework of the NLM DTD would increase the usefulness of the article content dramatically.
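As a sketch of what such mark-up could look like, the fragment below decorates an NLM DTD-style cross-reference with a hypothetical attribute stating why the article cites the structure. The ext-link element and xlink:href attribute exist in the NLM DTD; the biolit-relation attribute and its controlled value are our assumptions:

    # Annotate a PDB cross-reference with the *kind* of relationship
    # the citing article has to the structure (hypothetical mark-up).
    import xml.etree.ElementTree as ET

    xref = ET.fromstring(
        '<ext-link xmlns:xlink="http://www.w3.org/1999/xlink" '
        'ext-link-type="pdb" xlink:href="2HHB">2HHB</ext-link>'
    )
    # e.g., the structure was used to train a prediction algorithm,
    # rather than having its biochemical function described.
    xref.set("biolit-relation", "used-as-training-data")
    print(ET.tostring(xref, encoding="unicode"))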

All of these tools exist – the standardized document format, the ability to create hyperlinks in electronic documents, field-specific ontologies – but they have yet to be used to full advantage. This may be due to the legacy of static manuscripts, which is largely perpetuated by scientists who did not have access to cyberinfrastructure during their formative years. Today's scientists do, and it is time to make this happen.

BioLit: Blurring the Boundary Between Publications and Data

In order to initiate a community effort, we are developing a set of open source tools that will facilitate the integration of open literature and biological data, a project we call BioLit. Initially, these tools will be implemented using the entire corpus of the Public Library of Science (PLoS) and the Protein Data Bank (PDB) as testing platforms. The tools are being designed, however, to be generally applicable to all open access literature and other biological data.

The Public Library of Science (PLoS) is an ideal partner since it is leading the open access movement, a fundamental change in scientific publication that represents a significant improvement in the scientific community's access to the literature. Articles are published under a Creative Commons Attribution License6 whereby the community may use an article in any way they see fit, provided they attribute the original authors. Furthermore, the copyright to the material remains with the author. Once published, the article is available free in its entirety to anyone. This means that all PLoS articles, which are of very high quality, are freely available and freely usable, and together they constitute a large body of text covering biology, medicine, computational biology, genetics, pathology, and a variety of other fields.

The Protein Data Bank (PDB) is one of the oldest databases in biology and contains all publicly accessible three-dimensional structures of biological macromolecules – currently over 44,000.7 The PDB is used by over 10,000 scientists every day, and one structure is downloaded, on average, every second. Over the last decade or so, PDB structures have appeared in roughly 2% of all open access life science journal articles, making the PDB an obvious target for an effort to integrate data with open literature.

Specifically, the BioLit tools will capture meta-data from an article or manuscript by identifying relevant terms and identifiers and adding mark-up to the original NLM DTD-based XML document containing the article. Terms relating to the life sciences are identified using ontologies and controlled vocabularies specific to the field, such as the Gene Ontology8,9 and Medical Subject Headings (MeSH).10
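The sketch below illustrates the basic idea of term identification. The three-entry vocabulary is a stand-in for the full Gene Ontology and MeSH, and the matching is far simpler than what the actual BioLit tools will do (which must handle synonyms, word boundaries, and ambiguity):

    # Identify controlled-vocabulary terms in article text and record
    # their ontology identifiers (tiny illustrative vocabulary only).
    VOCABULARY = {
        "signal transduction": "GO:0007165",  # Gene Ontology
        "apoptosis": "GO:0006915",            # Gene Ontology
        "hemoglobin": "D006454",              # MeSH descriptor
    }

    def find_terms(article_text):
        """Return (term, identifier) pairs found in the text."""
        lower = article_text.lower()
        return [(t, i) for t, i in VOCABULARY.items() if t in lower]

    sentence = "Hemoglobin mutants were assayed for effects on apoptosis."
    print(find_terms(sentence))
    # [('apoptosis', 'GO:0006915'), ('hemoglobin', 'D006454')]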

These meta-data are captured in different ways depending on the status of the article. A tool we are developing with Microsoft, implemented as a plug-in for Word, will allow this information to be captured while the manuscript is being written. This strategy gives the author full, fine-grained control over exactly which meta-data are captured. The plug-in will prompt the author with choices, or will allow the author to customize the meta-data if no appropriate matches are found in the resources known to the plug-in. Cross-references to biological databases will also be detected and added to the meta-data, allowing the manuscript content to be integrated more easily with the database.

Articles that have already been published can be post-processed with a related tool that identifies the same types of meta-data and generates similar XML mark-up. The meta-data may not be as rich with this approach, since the author has no direct input, but capturing any information at all is a significant advance. Processing all PLoS articles, and later all open access articles, with this tool will generate a considerable amount of meta-data, which will help establish the integration effort in the community.

The BioLit tools will effect a change in the authoring process that is nearly transparent to the author, yet will capture significant new meta-data and establish an informative connection between the data and the article describing them. Effective use of these tools will provide new views on the traditional literature and on biological databases. The literature will simply become another interface to biological data in a database, and the database will be able to recall the appropriate literature – not as abstracts or complete papers, but as knowledge objects that annotate the data being examined (see Figure 1 for an example).

Figure 1

Figure 1. A PDB entry enhanced with additional literature. Historically, PDB entries reference the original article in which the macromolecular structure was published (Primary Citation). The BioLit tools will add text and figures from other articles that reference this structure in order to provide data that describe other aspects of the protein (Additional Literature).

SciVee: Pioneering New Methods of Scientific Communication

In addition to the BioLit tools for authoring and data integration, we want to use cyberinfrastructure to its fullest advantage. Thanks to the increasing availability of high bandwidth and consumer-level video recording equipment, internet video is now wildly popular. We want to take advantage of this trend and use the medium to communicate science more effectively, while bearing in mind the need for quality content. To this end, we have developed SciVee,11 which allows authors to upload an article they have already published (open access, naturally) along with a video or podcast presentation, roughly ten minutes long, that describes the highlights of the paper. The author can then synchronize the video with the content of the article (text, figures, etc.) such that the relevant parts of the article appear as the author discusses them during the presentation. We call the result a pubcast. Figure 2 shows a typical view of a SciVee pubcast.

Figure 2

Figure 2. A SciVee pubcast. This figure shows how a video presentation is integrated with a published article. While the speaker is discussing a point from the article, the relevant figure or text is highlighted. The viewer can also download the original paper as a companion to the pubcast.
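One way the synchronization behind a pubcast might be represented is a simple track that maps playback-time ranges in the video to the unique identifiers of objects in the article's XML. The field names and layout below are our assumptions for illustration, not SciVee's actual format:

    # A hypothetical pubcast synchronization track: which article
    # object (figure, paragraph, etc.) to highlight at each moment.
    import json

    sync_track = [
        {"start": 0,   "end": 45,  "article_object": "abstract"},
        {"start": 45,  "end": 130, "article_object": "fig1"},
        {"start": 130, "end": 210, "article_object": "sec2-para3"},
    ]

    def object_at(seconds):
        """Return the article object to highlight at a playback time."""
        for cue in sync_track:
            if cue["start"] <= seconds < cue["end"]:
                return cue["article_object"]
        return None

    print(object_at(60))           # -> fig1
    print(json.dumps(sync_track))  # serialized for the video player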

Anyone can visit SciVee and view the pubcast. It is similar to attending a conference to hear a particular speaker, except that the pubcast is available on demand, can be viewed any number of times, and explicitly refers to the content of the original article. Another important feature of SciVee is the ability of any user to add or read comments on pubcasts. This allows a community to be established around an article and encourages discussion about the results and their impact on the field. We believe this activity will transform what has traditionally been a static document into a dynamic exchange.

SciVee makes it easier and faster to keep up with current literature by delivering the key points of articles in a portable and enjoyable medium. A reader can interact with several articles using this website in the time it would take to read a single full article in the traditional way.

Conclusion

We believe revitalizing journal articles will have a significant impact on the scientific community. The traditional article format no longer effectively supports research in the electronic age. The number of articles researchers must read to keep up with their field has increased significantly in recent decades. In addition, a growing number of articles report data generated in a high-throughput manner, and the primary method of exploring these data is through a database, not through the article itself. In part, these phenomena are due to the increasing reliance on cyberinfrastructure in performing research, so it is a natural response to use cyberinfrastructure to address the situation. Indeed, our initial tests have proven quite successful. A large group of students in the UCSD School of Pharmacy and Pharmaceutical Sciences was shown an eight-minute pubcast of a recently published paper and then quizzed on their comprehension. Their results were compared to those of students who were given the paper and eight minutes in which to read it. The pubcast group largely outperformed the paper group and, perhaps more importantly, greatly enjoyed the experience.

SciVee and the BioLit tools will complement similar efforts such as the Structured Digital Abstract,12 MutationFinder,13 and BioCreAtIvE.14 We hope that the scientific community will embrace these efforts and use cyberinfrastructure to its fullest capacity to make scientific communication more enjoyable and effective.

Acknowledgement
This work is supported by grants 0544575 and 0732706 from the National Science Foundation.
1Public Library of Science – http://www.plos.org
2RCSB Protein Data Bank – http://www.pdb.org
3Bourne, P. "Will a biological database be different from a biological journal?" PLoS Computational Biology 2005, Vol. 1, no. 3, pp. 179-181.
4Seringhaus, M. R., Gerstein, M. B. "Publishing perishing? Towards tomorrow's information architecture," BMC Bioinformatics 2007, Vol. 8, p. 17.
5Berners-Lee, T., Hall, W., Hendler, J., Shadbolt, N., Weitzner, D. J. "Creating a science of the Web," Science 2006, Vol. 313, no. 5788, pp. 769-771.
6Creative Commons Attribution License - http://www.plos.org/journals/license.html
7Kouranov, A., Xie, L., de la Cruz, J., Chen, L., Westbrook, J., Bourne, P. E., Berman, H. M. "The RCSB PDB information portal for structural genomics," Nucleic Acids Research 2006, Vol. 34 (Database issue), pp. D302-305.
8Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T. et al. "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium," Nature Genetics 2000, Vol. 25, no. 1, pp. 25-29.
9Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C. et al. "The Gene Ontology (GO) database and informatics resource," Nucleic Acids Research 2004, Vol. 32 (Database issue), pp. D258-261.
10Medical Subject Headings - http://www.nlm.nih.gov/mesh
11SciVee - http://www.scivee.com
12Gerstein, M., Seringhaus, M., Fields, S. "Structured digital abstract makes text mining easy," Nature 2007, Vol. 447, no. 7141, p. 142.
13Caporaso, J. G., Baumgartner, W. A., Jr., Randolph, D. A., Cohen, K. B., Hunter, L. "MutationFinder: A high-performance system for extracting point mutation mentions from text," Bioinformatics 2007.
14Hirschman, L., Yeh, A., Blaschke, C., Valencia, A. "Overview of BioCreAtIvE: critical assessment of information extraction for biology," BMC Bioinformatics 2005, Vol. 6, suppl. 1, p. S1.
