CTWatch
August 2006
Trends and Tools in Bioinformatics and Computational Biology
Introduction
Rick Stevens, Associate Laboratory Director, Computing and Life Sciences – Argonne National Laboratory, Professor, Computer Science Department – The University of Chicago


In this issue you will find a number of articles outlining the current trends and tool development strategies in the bioinformatics and computational biology community. It is by necessity an incomplete survey of today’s thinking about the directions of our field. In addition to the four submitted articles, I’ve enclosed my thoughts on a few of the questions likely to be of interest to CTWatch readers.

What is the most important trend today in biology research?

Probably the most important trend in modern biology is the increasing availability of high-throughput (HT) data. The earliest forms of HT data were genome sequences and, to a lesser degree, protein sequences; now, however, many forms of biological data are produced by automated or semi-automated experimental systems. These data include gene expression, protein expression, metabolomics, mass spectrometry data, imaging of all sorts, protein structures, and the results of mutagenesis and screening experiments conducted in parallel. So an increasing quantity and diversity of data are major trends. To gain biological meaning from these data, they must be integrated (finding and constructing correspondences between elements) and curated (checked for errors, linked to the literature and previous results, and organized). The challenges in producing high-quality, integrated datasets are immense and long term.

The second trend is the general acceleration of the pace at which we can ask those questions that can be answered by computation and by HT experiments. Using the computer, a researcher can be 10 or 100 times more efficient than by using wet lab experiments alone. Bioinformatics can identify the critical experiments necessary to address a specific question of interest. Thus the biologist who is able to leverage bioinformatics is in a fundamentally different performance regime than those who cannot.

The third trend is the emergence of simulation and modeling technologies that will eventually lead to predictive biological theory. Today, simulation and modeling applied at the whole-cell level is suggestive of what is to come: the ability to predict an organism's phenotype computationally from just a genome and environmental conditions. That capability is probably five years away for microbial organisms and 10 to 20 years away for complex eukaryotes (such as the mouse and human).

What is the role of cyberinfrastructure in biological research?

As I noted above, modern biology will become increasingly coupled to modern computing environments. This means that the rate of progress of some (but not all) biological investigations will become limited by the pace of cyberinfrastructure development. Certainly, once cyberinfrastructure is more developed and in place, it will be much easier for biologists to gain access to both data and computing resources (perhaps without their knowing it). Today, we have early signs of how some groups will use access to large-scale computing to support communities by developing gateways or portals that provide access to integrated databases and computing capabilities behind a web-based user interface. But that is just the beginning. It is possible to imagine that, in the future, laboratories will be directly linked to data archives and to each other, so that experimental results will flow from HT instruments directly to databases, which will be coupled to computational tools for automatically integrating the new data and performing quality-control checks in real time (not that dissimilar from how high-energy physics and astronomy work today). In field research, cyberinfrastructure can not only connect researchers to their databases and tools while they are in the field but also enable the development of automated instruments that will continue working in the field after the scientists and graduate students have returned home.


Wilfred W. Li, University of California, San Diego (UCSD), San Diego Supercomputer Center (SDSC)
Nathan Baker, Washington University in Saint Louis
Kim Baldridge, UCSD, SDSC
J. Andrew McCammon, UCSD
Mark H. Ellisman, UCSD, Center for Research In Biological Systems (CRBS)
Amarnath Gupta, UCSD, SDSC
Michael Holst, UCSD
Andrew D. McCulloch, UCSD
Anushka Michailova, UCSD
Phil Papadopoulos, UCSD, SDSC
Art Olson, The Scripps Research Institute (TSRI)
Michel Sanner, TSRI
Peter W. Arzberger, California Institute for Telecommunications and Information Technology (Calit2), CRBS, UCSD


Abstract — The mission of the National Biomedical Computation Resource (NBCR), begun in 1994, is to conduct, catalyze, and enable multiscale biomedical research by harnessing advanced computation and data cyberinfrastructure through multidisciplinary and multi-institutional integrative research and development activities. Here we report recent research and technology advances in building cyberinfrastructure for multiscale modeling activities.

The development of the cyberinfrastructure is driven by multiscale modeling applications, which focus on scientific research ranging in biological scale from the subatomic and molecular to the cellular, tissue, and organ levels. Application examples include quantum mechanics modeling with GAMESS; calculation of protein electrostatic potentials with APBS and the finite element toolkit FEtk; protein-ligand docking studies with AutoDock; cardiac systems biology and physiology modeling with Continuity; and molecular visualization using PMV and visual workflow programming in Vision. Real use cases demonstrate how these multiscale applications may be made available transparently on the grid to researchers in biomedicine and translational research, through integrative projects ranging from understanding the detailed mechanisms of HIV protease and integrase action, to neuromuscular junction research in myopathy, to heart arrhythmia and failure, and to emerging public health threats, as well as through collaborative projects with other research teams across the world.

The adoption of a service-oriented architecture enables the development of highly reusable software components and efficiently leverages international grid development activities. We describe an end-to-end prototype environment, exemplified by the adoption of key components of the Telescience project, that allows existing applications to run transparently on the grid, taking advantage of open source software that provides the following:

  • a portal interface using GridSphere,
  • transparent GSI authentication using GAMA,
  • a web service wrapper using Opal,
  • a metascheduler using CSF4,
  • a virtual filesystem using Gfarm, and
  • a grid-enabled cluster environment using Rocks.

Solutions to complex problems may be developed using workflow tools that coordinate different interoperable services. We also describe the development of ontology and semantic mediation tools such as PathSys and OntoQuest for data integration and interoperability, which may be efficiently coupled with the application services provided to the biomedical community.
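
To make the service-composition idea concrete, the following is a minimal Python sketch of a two-step workflow. The two functions are hypothetical stand-ins for grid application services (an APBS-style electrostatics step feeding an AutoDock-style docking step); they are not the actual Opal or Telescience interfaces, and the file names are invented for illustration.

from typing import Dict


def run_electrostatics_service(structure_file: str) -> Dict[str, str]:
    """Hypothetical stand-in for an electrostatics service (APBS-style)."""
    # A real grid service would stage the input, run the job remotely, and
    # return handles to the output files; here we only fabricate a name.
    return {"potential_map": structure_file + ".potential.dx"}


def run_docking_service(structure_file: str, potential_map: str) -> Dict[str, str]:
    """Hypothetical stand-in for a docking service (AutoDock-style)."""
    return {"docking_result": structure_file + ".docked"}


def workflow(structure_file: str) -> Dict[str, str]:
    """Coordinate the two interoperable services: electrostatics, then docking."""
    electrostatics = run_electrostatics_service(structure_file)
    docking = run_docking_service(structure_file, electrostatics["potential_map"])
    return {**electrostatics, **docking}


if __name__ == "__main__":
    # Example invocation with a hypothetical input structure.
    print(workflow("protease.pdb"))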


Eric Jakobsson, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign


All leading-edge research in biology now utilizes computation, as a result of the development of useful tools for data gathering, data management, analysis, and simulation of biological systems. While there is still much to be done to improve these tools, there is also a completely new frontier to be attacked. The new initiatives to be undertaken will require much more interaction between applications scientists and cyberinfrastructure architects than has previously been the case. The single word that provides a common thread for the new initiatives needed in the next few years is Integration, specifically

  • Integration of time and length scales of description.
  • Integration of informatics, dynamics, and physics-based approaches.
  • Integration of heterogeneous data forms.
  • Integration of basic science with engineering design.
  • Integration of algorithmic development with computing architecture design.

Integration of time and length scales of description

Biological systems display important dynamics on time scales ranging from femtoseconds and faster (e.g., interactions with electromagnetic radiation) to billions of years (evolution), and on length scales ranging from single atoms to the entire biosphere. Events at all time and length scales are linked to each other. For the most extreme example, the emergence of the photosynthetic reaction center (a protein that couples absorption of photons with synthesis of other biological molecules) over a billion years ago produced as a by-product a major change in the composition of the atmosphere (an increase in oxygen) that profoundly altered the course of biological evolution from that time on. Yet the vast majority of the computational tools that we use to understand biology are specialized to a particular narrow range of time and length scales. We badly need computing environments that will facilitate analysis and simulation across time and length scales, so we may achieve a quantitative understanding of how these scales link to each other.

Integration of informatics, dynamics, and physics-based approaches

There are three core foundations of computational biology: a) information-based approaches, exemplified by sequence-based informatics and correlational analysis of systems biology data; b) physics-based approaches, based on biological data analysis and simulation founded in physical and chemical theory; and c) approaches based on dynamical analysis and simulation, notably exemplified by successful dynamics models in neuroscience, ecology, and viral-immune system interactions. Typically these approaches are developed by different communities of computational biologists and pursued largely independently of each other. There is great synergy, however, in the three approaches when they are integrated in pursuing solutions to major biological problems. This can be seen notably in molecular and cellular neuroscience. Understanding of the entire field is largely organized around the dynamical systems model first put forth by Hodgkin and Huxley, which also had an underpinning of continuum physical chemistry and electrical engineering theory. Extension of the systems and continuum understanding to the molecular level depended on using informatics means to identify crystallizable versions of the membrane proteins underlying excitability. Physics-based computing has been essential to interpreting the structural data and to understanding the relationship between the structures and the function of the excitability proteins. All areas of biology need a comparable synergy between the different types of computing. As a corollary, we need to train computational biologists who can use, and participate in developing, all three types of approaches.
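
As a concrete illustration of the dynamical-systems pillar, here is a minimal Python sketch of the Hodgkin-Huxley membrane equations integrated with forward Euler, using standard textbook parameter values (voltages in mV, time in ms, conductances in mS/cm^2). It is meant only to show the style of model, not to represent any particular neuroscience package.

import math

# Membrane parameters (standard textbook values)
C_m = 1.0                              # membrane capacitance, uF/cm^2
g_Na, g_K, g_L = 120.0, 36.0, 0.3      # maximal conductances, mS/cm^2
E_Na, E_K, E_L = 50.0, -77.0, -54.4    # reversal potentials, mV

# Voltage-dependent rate functions for the gating variables m, h, n
def alpha_m(V): return 0.1 * (V + 40.0) / (1.0 - math.exp(-(V + 40.0) / 10.0))
def beta_m(V):  return 4.0 * math.exp(-(V + 65.0) / 18.0)
def alpha_h(V): return 0.07 * math.exp(-(V + 65.0) / 20.0)
def beta_h(V):  return 1.0 / (1.0 + math.exp(-(V + 35.0) / 10.0))
def alpha_n(V): return 0.01 * (V + 55.0) / (1.0 - math.exp(-(V + 55.0) / 10.0))
def beta_n(V):  return 0.125 * math.exp(-(V + 65.0) / 80.0)

def simulate(I_ext=10.0, dt=0.01, t_end=50.0):
    """Integrate the membrane equation under a constant applied current."""
    V, m, h, n = -65.0, 0.05, 0.6, 0.32          # near-resting initial state
    trace = []
    for step in range(int(t_end / dt)):
        # Ionic currents at the present state
        I_Na = g_Na * m ** 3 * h * (V - E_Na)
        I_K = g_K * n ** 4 * (V - E_K)
        I_L = g_L * (V - E_L)
        # Forward-Euler update of voltage and gating variables
        dV = (I_ext - I_Na - I_K - I_L) / C_m
        dm = alpha_m(V) * (1.0 - m) - beta_m(V) * m
        dh = alpha_h(V) * (1.0 - h) - beta_h(V) * h
        dn = alpha_n(V) * (1.0 - n) - beta_n(V) * n
        V, m, h, n = V + dt * dV, m + dt * dm, h + dt * dh, n + dt * dn
        trace.append((step * dt, V))
    return trace

if __name__ == "__main__":
    for t, V in simulate()[::500]:               # print every 5 ms
        print(f"t = {t:5.1f} ms   V = {V:7.2f} mV")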

Integration of Heterogeneous Data Forms

The types of data that are relevant to any particular biological problem are quite varied, including literature reports, sequence data, microarray data, proteomics data, a wide array of spectroscopies, diffraction data, time series of dynamical systems, simulation results, and many more. There is a major need for an integrated infrastructure that enables the researcher to search, visualize, analyze, and build models based on all of the data relevant to a particular biological problem. The Biology Workbench1 is a notable example of such integration in the specific domain of sequence data. This approach needs to be extended to much more varied and complex data forms.


Folker Meyer, Argonne National Laboratory


As genome data grows faster than Moore's law, biology will pose numerous cyber challenges over the next 10-15 years.

Biology is a scientific discipline that is using more and more computational resources as it turns into a data-driven science. Most notably, the emergence of genome and post-genome technology has made vast amounts of data available, demanding analysis. Hundreds of bacterial (more precisely, prokaryotic) genomes are available today and have already proven to be a very valuable resource for many applications. A prominent example is the reconstruction of the metabolic pathways1 of several bacterial organisms. The analysis of the rising number of genomes is already an application of cyber technologies2 and is to some extent limited by the available cyber resources. As more data becomes available, this trend is likely to continue.

An important factor in this equation is the fact that the number of available complete genomic sequences is doubling almost every 12 months3 at the current state of technology, whereas, according to Moore's law, available compute cycles double only every 18 months. The analysis of genomic sequences requires serious computational effort: most analysis techniques require pairwise comparison of genomes or of the genes within genomes. Since the number of pairwise comparisons grows as the square of the number of sequences involved, the computational cost of the sequence comparisons alone will become staggering. Whether we are trying to reconstruct the evolutionary history of a set of proteins, trying to characterize the shape they fold into, or attempting to determine correspondences between genes in distinct genomes, we are often using these pairwise operations, and the cost is rapidly climbing.
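
The mismatch can be made concrete with a back-of-the-envelope calculation, sketched below in Python: genome counts doubling every 12 months, compute capacity doubling every 18 months, and all-against-all comparisons growing quadratically. The starting count of 300 genomes is an illustrative assumption, not a measured figure.

def pairwise_comparisons(n_genomes: int) -> int:
    """All-against-all comparisons grow as n*(n-1)/2, i.e. roughly n^2."""
    return n_genomes * (n_genomes - 1) // 2

n0 = 300          # assumed number of complete genomes today (illustrative)
for year in range(0, 11):
    n = n0 * 2 ** year                    # genomes double every 12 months
    work = pairwise_comparisons(n)        # comparisons needed
    capacity = 2 ** (year / 1.5)          # Moore's law: doubles every 18 months
    # Work per unit of available compute, normalized to year 0
    relative_load = work / pairwise_comparisons(n0) / capacity
    print(f"year {year:2d}: genomes = {n:9d}, "
          f"relative load per CPU = {relative_load:10.1f}x")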

Today, traditional research teams in bioinformatics either rely entirely on resources provided by institutions like the National Center for Biotechnology Information (NCBI)4 for sequence analysis purposes5 or build up their own local resources. The NCBI provides services including comprehensive sequence databases and online sequence comparison via a browser interface. Researchers possessing private compute resources have the advantage of running algorithms of their choosing on their own machines; however, to keep up with the data flood, they either have to accept long waiting times or continue to invest in cluster resources to fulfill their growing sequence analysis needs.

However, as the number of available sequences grows, the number of algorithms available for their analysis also increases. So today, numerous bioinformatics techniques exist or are being developed that use considerably more computational power and yield different results for sequence comparison than the traditionally used BLAST algorithm. Over the last five years, the influx of machine learning techniques in particular has led to increased consumption of compute cycles in computational biology.

When researchers began to use Markov models to search for sequence similarities not visible with BLAST, and also began building databases of common sequence motifs represented as hidden Markov models (e.g., HMMER or InterPro), CPU requirements increased dramatically. While a BLAST search against NCBI's comprehensive, non-redundant collection of known proteins can be run in a matter of minutes, either locally or on NCBI's BLAST server, for several hundred query sequences (remember, a single genome contains thousands of genes), no resource exists that allows querying several hundred (let alone thousands of) proteins for protein motifs using the European Bioinformatics Institute's (EBI) InterPro tool.
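
In practice, groups with cluster access attack such motif searches by splitting the query set and running many independent jobs. The following Python sketch shows that pattern: a multi-FASTA file is cut into chunks of a fixed size, and one hypothetical search job is printed per chunk. The chunk size, file names, and the motif_search command are illustrative assumptions, not a specific scheduler's syntax.

def split_fasta(path: str, seqs_per_chunk: int = 100):
    """Write chunks of a multi-FASTA file and return the chunk file names."""
    chunks, records, current = [], [], []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
        if current:
            records.append("".join(current))

    for i in range(0, len(records), seqs_per_chunk):
        chunk_name = f"{path}.chunk{i // seqs_per_chunk:04d}.fasta"
        with open(chunk_name, "w") as out:
            out.writelines(records[i:i + seqs_per_chunk])
        chunks.append(chunk_name)
    return chunks

if __name__ == "__main__":
    for chunk in split_fasta("genome_proteins.fasta"):
        # Each chunk would be submitted as a separate cluster job, e.g. an
        # HMMER or InterProScan run against the motif database of choice.
        print(f"submit job: motif_search {chunk}")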

Today, few resources exist outside TeraGrid6 that could provide the computational power needed to run a comprehensive protein motif search for more than a few complete bacterial genomes. Only a massive, high-performance computing resource like TeraGrid can provide the CPU hours that will be required for this and other future challenges stemming from the increasing amount of sequence data.


Natalia Maltsev, Argonne National Laboratory

The past decade has completely changed the face of biology. The image of a biologist passionately chasing butterflies in the wilderness of an Amazon rainforest, or ruining his or her eyesight over long hours at a microscope, has been replaced by pictures of factory-like sequencing facilities and high-throughput automated experimental complexes. The technology has changed the entire fabric of biology from a science of lonely enthusiasts to a data-intensive science of large projects involving teams of specialists in various branches of the life sciences spread across multiple institutions.

The new generation of biology is tightly interlinked with progress in computer science. Indeed, in order to exploit the enormous scientific value of biological data for understanding living systems, the information must be integrated, analyzed, graphically displayed, and modeled computationally in a timely fashion. The development of computational models of an organism's functionality is essential for progress in medicine, biotechnology, and bioremediation. Such models allow prediction of the functions of genes in newly sequenced genomes and of the existence of particular metabolic pathways and physiological features. Conjectures developed during computational analysis of genomes provide invaluable aid to researchers in planning experiments and save an enormous amount of time and resources required for elucidation of an organism's biochemical and physiological characteristics.

Essential for fulfilling this task is the development of high-throughput computational environments that integrate (i) large amounts of genomic and experimental data, (ii) comprehensive tools and algorithms for knowledge discovery and data mining, and (iii) comprehensive user interfaces that provide tools for easy access, navigation, visualization, and annotation of biological information. To achieve such an integrated environment, computational biologists face four primary challenges:

1. Exponential growth of the biological data in the databases requires scalable public computational resources. In the past 10 years the amount of data in genomic databases has doubled or tripled each year. For example, the current 151.0 release of the largest genomic database, GenBank, contains 56 billion bases from 52 million sequences, and the rate of growth of GenBank is expected to increase dramatically as the cost of sequencing new genomes drops. To date, 394 genomes have been completely sequenced and 1644 are at various levels of completion. However, the development of highly integrated and scalable bioinformatics systems for interpreting newly sequenced genomes is a time- and resource-consuming task. While large sequencing or bioinformatics centers have the resources needed, a significant number of institutions are working on only a small number of genome projects. Indeed, of 1174 ongoing sequencing projects, 66% of institutions have only one sequencing project, and over 87% have four or fewer.1 The development of large, public computational systems that provide computing resources, integration of data, automated analyses, and the capability for expert-driven refinement of these analyses may significantly benefit the field of genomics. Such systems will provide state-of-the-art computational support to smaller genomics groups and will allow them to concentrate on biological scientific questions rather than on the development of bioinformatics environments.

2. Complexity of biological data. Increasingly, biological models draw on information from different branches of the life sciences: genomics, physiology, biochemistry, biophysics, proteomics, and many more. The development of such models is a task of unprecedented computational, semantic, and algorithmic complexity. It requires integration of various classes of biological information as well as similar classes of data from different resources. Such large-scale integration presents a number of computer science challenges because of the large volume and complexity of the data, the distributed character of this information residing in different databases, shortfalls of current biological ontologies, and generally poor naming conventions for biological objects (e.g., a large number of synonyms describing the same object or notion, or nonunique names describing different objects). More than 100 groups (e.g., GO, BioPAX, W3C; see the Open Biomedical Ontologies website for a partial list of such efforts) are developing ontologies for various branches of the biological sciences. For example, in order to satisfy diverse scientific communities, the ontological description of glycolysis has evolved into an enormously complex data structure integrating various classes of data, historical conventions of the communities, and links to the ontologies of other sciences (e.g., chemistry, biophysics, taxonomy). However, the development of most of these ontologies still depends on social consensus between the scientific communities, a task that seems to be of insurmountable social and scientific complexity. The development of new tools and algorithms for mining and clustering of existing scientific notions and terms may provide significant assistance to this process.

3. Algorithm development. The most popular bioinformatics tools (e.g., BLAST, FASTA) perform pairwise comparisons of the query sequence with all sequences in a specified database. Such a computationally intensive approach, at a time of exponentially growing amounts of sequence data, will inevitably lead to an N-squared problem. Bioinformatics will significantly benefit from the development of a new generation of algorithms that allow efficient data mining and identification of complex multidimensional patterns involving various classes of data (one illustrative strategy is sketched after this list). Visualization of multifarious information is another essential need of high-throughput biology: it allows for reducing the complexity of biological knowledge and developing much-needed overviews.

4. Development of collaborative environments. A typical biological project involves data sources and users distributed among various institutions. Such projects require a mature infrastructure that allows seamless integration, analysis, storage, and delivery of information to a distributed community of users. Warehousing the data and analyzing it at a single location will not be sufficient for the needs of biology in the future. Essential for the success of large biological projects is further development of collaborative environments that allow scientists residing in different locations, and sometimes even on different continents, to analyze, discuss, annotate, and view the data. Access Grid conferencing, shared interfaces, Web services, and other collaborative tools will allow groups to identify, discuss, and solve scientific problems efficiently.
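
As a small illustration of point 3 above, the Python sketch below shows one way (not a method proposed in this article) to blunt the N-squared problem: bin sequences by a cheap k-mer composition signature first, so that expensive pairwise comparisons are run only within bins. The toy sequences, k = 3, and the four-k-mer signature are illustrative assumptions.

from collections import Counter
from itertools import combinations

def kmer_signature(seq: str, k: int = 3, top: int = 4) -> tuple:
    """A cheap signature: the sequence's most frequent k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return tuple(sorted(kmer for kmer, _ in counts.most_common(top)))

sequences = {
    "seqA": "ATGGCGATGGCGATGGCG",
    "seqB": "ATGGCGATGGCGTTGGCG",
    "seqC": "TTTTAACCGGTTTTAACC",
}

# Group sequences sharing the same signature into bins.
bins = {}
for name, seq in sequences.items():
    bins.setdefault(kmer_signature(seq), []).append(name)

# Expensive pairwise work (e.g. alignment) is now restricted to each bin.
for signature, members in bins.items():
    for a, b in combinations(members, 2):
        print(f"compare {a} with {b}")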

The 21st century is considered to be the “age of biology.” Advances in genomic research will establish cures or therapies for numerous diseases that were considered incurable, and future genetically engineered bioproducts will contribute significantly to solving the global hunger problem. Such progress will, to a large extent, be driven by the formulation of new computational approaches for the analysis of biological data and by the timely transfer of technologies developed in other disciplines (e.g., physics, linguistics). Computing and the biological sciences will become intimately intertwined, opening new possibilities and causing unprecedented changes to life as we know it.

References
1 Statistics from the Genome Project Sequencing Center at the National Center for Biotechnology Information (NCBI), July 2, 2006.
