CTWatch Quarterly » Specifications for the Next-Generation Computational Biology Infrastructure

Specifications for the Next-Generation Computational Biology Infrastructure

Eric Jakobsson, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

All leading-edge research in biology now relies on computation, thanks to the development of useful tools for data gathering, data management, analysis, and simulation of biological systems. While much remains to be done to improve these tools, there is also a completely new frontier to be attacked. The new initiatives will require much more interaction between application scientists and cyberinfrastructure architects than has previously been the case. The single word that provides a common thread for the initiatives needed in the next few years is Integration, specifically:

Integration of time and length scales of description.
Integration of informatics, dynamics, and physics-based approaches.
Integration of heterogeneous data forms.
Integration of basic science with engineering design.
Integration of algorithmic development with computing architecture design.

Integration of time and length scales of description

Biological systems display important dynamics on time scales ranging from femtoseconds and faster (e.g., interactions with electromagnetic radiation) to billions of years (evolution), and on distance scales ranging from single atoms to the entire biosphere. Events at all time and length scales are linked to each other. To take the most extreme example, the emergence of the photosynthetic reaction center (a protein that couples absorption of photons with synthesis of other biological molecules) over a billion years ago produced, as a by-product, a major change in the composition of the atmosphere (an increase in oxygen) that profoundly altered the course of biological evolution from that time on. Yet the vast majority of the computational tools that we use to understand biology are specialized to a particular narrow range of time and length scales. We badly need computing environments that facilitate analysis and simulation across time and length scales, so that we may achieve a quantitative understanding of how these scales link to each other.

Integration of informatics, dynamics, and physics-based approaches

There are three core foundations of computational biology: a) information-based approaches, exemplified by sequence-based informatics and correlational analysis of systems biology data; b) physics-based approaches, in which biological data analysis and simulation are founded on physical and chemical theory; and c) approaches based on dynamical analysis and simulation, notably exemplified by successful dynamics models in neuroscience, ecology, and viral-immune system interactions. Typically these approaches are developed by different communities of computational biologists and pursued largely independently of each other. There is great synergy, however, when the three approaches are integrated in pursuing solutions to major biological problems. This can be seen notably in molecular and cellular neuroscience. Understanding of the entire field is largely organized around the dynamical systems model first put forth by Hodgkin and Huxley, which also had an underpinning of continuum physical chemistry and electrical engineering theory. Extension of the systems and continuum understanding to the molecular level depended on using informatics to identify crystallizable versions of the membrane proteins underlying excitability. Physics-based computing has been essential to interpreting the structural data and to understanding the relationship between the structures and the function of the excitability proteins. All areas of biology need a comparable synergy between the different types of computing. As a corollary, we need to train computational biologists who can use, and participate in developing, all three types of approach.
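As a concrete illustration of the dynamical systems style of modeling mentioned above, the following is a minimal sketch of the Hodgkin-Huxley membrane equations integrated with a forward Euler step. It is not taken from the article: the parameter values and rate functions are the commonly quoted textbook ones (resting potential near -65 mV), and the function names, step size, and stimulus current are assumptions chosen only for illustration.

```python
import math

# Commonly quoted textbook parameters (assumed here, not given in the article).
C_M = 1.0                               # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3       # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387   # reversal potentials, mV


def _ratio(x, scale):
    """Evaluate scale * x / (1 - exp(-x / 10)), using the limit at x = 0."""
    if abs(x) < 1e-7:
        return scale * 10.0
    return scale * x / (1.0 - math.exp(-x / 10.0))


def gating_rates(v):
    """Return (alpha, beta) rate pairs for the m, h, and n gates at voltage v (mV)."""
    a_m = _ratio(v + 40.0, 0.1)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = _ratio(v + 55.0, 0.01)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return (a_m, b_m), (a_h, b_h), (a_n, b_n)


def simulate(i_ext, t_stop=50.0, dt=0.01, v0=-65.0):
    """Integrate the membrane equation under a constant current i_ext (uA/cm^2).

    Returns the list of membrane voltages (mV), one per time step of dt (ms).
    """
    v = v0
    # Start each gate at its steady-state value for the initial voltage.
    m, h, n = (a / (a + b) for a, b in gating_rates(v))
    trace = [v]
    for _ in range(int(round(t_stop / dt))):
        (a_m, b_m), (a_h, b_h), (a_n, b_n) = gating_rates(v)
        i_na = G_NA * m ** 3 * h * (v - E_NA)   # sodium current
        i_k = G_K * n ** 4 * (v - E_K)          # potassium current
        i_l = G_L * (v - E_L)                   # leak current
        dv = (i_ext - i_na - i_k - i_l) / C_M
        # Forward Euler update of the voltage and the three gating variables.
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        v += dt * dv
        trace.append(v)
    return trace


if __name__ == "__main__":
    spiking = simulate(10.0)
    print("peak voltage with stimulus (mV):", round(max(spiking), 1))
```

In a sketch like this the conductances and gating kinetics are phenomenological inputs to the dynamical model; the integration argued for in the text would tie such quantities back to protein structures and physics-based simulation rather than treating them as fitted constants.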

Integration of Heterogeneous Data Forms

The types of data relevant to any particular biological problem are quite varied, including literature reports, sequence data, microarray data, proteomics data, a wide array of spectroscopies, diffraction data, time series of dynamical systems, simulation results, and many more. There is a major need for an integrated infrastructure that enables the researcher to search, visualize, analyze, and build models from all of the data relevant to any particular biological problem. The Biology Workbench¹ is a notable example of such integration in the specific domain of sequence data. This approach needs to be extended to much more varied and complex data forms.
