Introduction
Trends in Cyberinfrastructure for Bioinformatics and Computational Biology
Rick Stevens, Associate Laboratory Director, Computing and Life Sciences – Argonne National Laboratory, Professor, Computer Science Department – The University of Chicago
CTWatch Quarterly
August 2006

In this issue you will find a number of articles outlining current trends and tool development strategies in the bioinformatics and computational biology community. It is by necessity an incomplete survey of today’s thinking about the directions of our field. In addition to the four submitted articles, I’ve included my own thoughts on a few questions likely to be of interest to CTWatch readers.

What is the most important trend today in biology research?

Probably the most important trend in modern biology is the increasing availability of high-throughput (HT) data. The earliest forms of HT data were genome sequences and, to a lesser degree, protein sequences; now, however, many kinds of biological data are produced by automated or semi-automated experimental systems. These include gene expression data, protein expression data, metabolomics, mass spectrometry data, imaging of all sorts, protein structures, and the results of mutagenesis and screening experiments conducted in parallel. So an increasing quantity and diversity of data are major trends. To extract biological meaning from these data, they must be integrated (finding and constructing correspondences between elements) and curated (checked for errors, linked to the literature and to previous results, and organized). The challenges in producing high-quality, integrated datasets are immense and long term.
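
As a toy illustration of what the integration step means in practice, the sketch below joins a hypothetical expression table to a hypothetical annotation table on a shared gene identifier; all identifiers and values are invented, and the unmatched rows show the kind of gap that curation must resolve.

```python
# A minimal sketch of the "integration" step: joining two hypothetical
# high-throughput datasets on a shared gene identifier. The column names
# and data below are illustrative only.
import pandas as pd

expression = pd.DataFrame({
    "gene_id": ["b0001", "b0002", "b0003"],
    "log2_fold_change": [1.8, -0.4, 2.6],
})
annotation = pd.DataFrame({
    "gene_id": ["b0001", "b0002", "b0004"],
    "product": ["thr operon leader peptide", "aspartokinase I", "unknown"],
})

# An outer join exposes both the correspondences and the gaps that curation
# must resolve (b0003 has no annotation, b0004 has no measurement).
integrated = expression.merge(annotation, on="gene_id", how="outer")
print(integrated)
```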

The second trend is the general acceleration of the pace at which we can ask those questions that can be answered by computation and by HT experiments. Using the computer, a researcher can be 10 or 100 times more efficient than by using wet lab experiments alone, because bioinformatics can identify the critical experiments needed to address a specific question of interest. Thus the biologist who is able to leverage bioinformatics operates in a fundamentally different performance regime than one who cannot.

The third trend is the emergence of simulation and modeling technologies that will eventually lead to predictive biological theory. Today, simulation and modeling applied at the whole-cell level is suggestive of what is to come: the ability to predict an organism’s phenotype computationally from just a genome and environmental conditions. That capability is probably five years away for microbial organisms and 10 to 20 years away for complex eukaryotes (such as the mouse and human).

What is the role of cyberinfrastructure in biological research?

As I noted above, modern biology will become increasingly coupled to modern computing environments. This means that the progress of some (but not all) biological investigations will become rate-limited by the pace of cyberinfrastructure development. Certainly, once cyberinfrastructure is more developed and in place, it will be much easier for biologists to gain access to both data and computing resources (perhaps without their knowing it). Today, we have early signs of how some groups will use access to large-scale computing to support communities by developing gateways or portals that provide access to integrated databases and computing capabilities behind a web-based user interface. But that is just the beginning. It is possible to imagine that, in the future, laboratories will be directly linked to data archives and to each other, so that experimental results will flow from HT instruments directly into databases coupled to computational tools that integrate the new data and perform quality-control checks in real time (not that dissimilar from how high-energy physics and astronomy work today). In field research, cyberinfrastructure can not only connect researchers to their databases and tools while they are in the field, but also enable the development of automated instruments that continue working in the field after the scientists and graduate students have returned home.

What are some notable accomplishments in applying CI to biology research?

There are a handful of systems that have fundamentally changed how biologists work. The most important has been the system developed by the National Center for Biotechnology Information1 including Entrez, a Google-like search engine that supports searching across many types of biological data. There are similar systems in Europe2 and Japan.3 These systems and others like them have provided the global community access to sequence data (starting out as outgrowths of genome and protein sequence databases) and, more recently, to publications, annotations, linkage maps, expression data, phylogeny data, metabolic pathways, regulatory and signaling data, compounds, and molecular structures. Search techniques have expanded from keywords to computed properties (sequence similarity and, more generally, “associations”) that enable one to find connections between biological or chemical entities. While these systems have enormous user bases and require considerable computing capabilities for indexing and integration, they are essentially client/server in nature, and the computing that an end user can request is closely controlled.
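
Entrez can also be queried programmatically through NCBI’s E-utilities; the short sketch below runs a simple esearch query over HTTP. The database and search term are only examples.

```python
# A minimal sketch of a programmatic Entrez query via NCBI's E-utilities
# (the esearch service); the database and search term are only examples.
import json
import urllib.parse
import urllib.request

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "protein",   # which Entrez database to search
    "term": "DNA polymerase[Protein Name] AND Escherichia coli[Organism]",
    "retmode": "json",
}
with urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)) as resp:
    result = json.loads(resp.read().decode())["esearchresult"]

print("matching records:", result["count"])
print("first few UIDs:", result["idlist"][:5])
```

The UIDs returned here can then be passed to the companion esummary or efetch services to retrieve the full records.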

Approximately a decade ago a number of groups began to produce more flexible tools that support a more unstructured workflow, enabling users to construct their own mini-environments for pursuing computational approaches to problems. One of the first such systems was the Biology Workbench, developed at the University of Illinois and now hosted at the University of California, San Diego.4 Other systems were developed to provide access to a specific type of data (e.g. microbial genomes) through well-engineered data integrations. These systems are often associated with teams of curators. Three are particularly important: the Institute for Genomic Research’s Comprehensive Microbial Resource;5 the SEED, an annotation system developed by the Fellowship for the Interpretation of Genomes at the University of Chicago;6 and the DOE Joint Genome Institute’s Integrated Microbial Genomes resource.7 These systems give the user an integrated view of hundreds of genomes and provide a rich environment for discovery.

Are there some good road mapping documents available?

In the past couple of years there have been several worthwhile road-mapping documents written by the community. These reports generally attempt to identify the trends in the field and provide some structure for understanding its directions. The first is a report from the NSF committee on building a cyberinfrastructure for the biological sciences;8 the second is the National Academy of Sciences report on computing and biology.9 The third report is more oriented toward systems biology: it is a program roadmap developed by the DOE for their Genomes to Life program,10 and it contains a section on computing and infrastructure to support the build-out of systems biology, focused on microbial organisms, energy, and the environment. All three documents are worth reading to gain an understanding of where the field is going.

Are grids really being used that much for real biology?

There are several national and international projects developing grid infrastructures for biological research. Many of these projects are loosely affiliated by sharing services and technology, and all are working towards a vision of a BioGrid. Several are worth looking at.
The TeraGrid is sponsoring two Science Gateways for biology: the Biology and Biomedical Science Gateway developed by the Renaissance Computing Institute (RENCI) at UNC,11 and a gateway developed by the University of Chicago.12 Both of the TeraGrid gateways are aimed at enabling communities to leverage TeraGrid computing and data resources without needing to obtain a dedicated allocation. They are examples of an emerging concept of "community allocations," which are aimed at lowering the barrier to adopting cyberinfrastructure. The Open Science Grid also hosts biological applications such as the GADU virtual organization.13 In Europe, one of the most well developed life sciences grid projects is MyGrid.14 MyGrid is developing a comprehensive set of web-services-based tools and services.

There is a lot of talk about web services as a future direction for the Internet. How will web services impact biology?

Web services are the key to providing the ability for groups around the world to collaborate on building new tools that leverage each other’s data and computational services without prior coordination. Early web services deployments in life science suffered from poor implementations, poor performance and lack of high-quality data. More recent efforts are dramatically improving. The KEGG group in Japan recently published a comprehensive set of web services15 for accessing their data, which have proven to be robust and of moderate performance. I’ve used these services routinely for the last year and find them simple, yet useful. As web services interfaces become more common, it will be possible for many groups to build applications that leverage the major data sources. This is one of the most important trends, but it is still far from being generally demonstrated.
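
As a concrete illustration of this kind of programmatic access, here is a minimal sketch that pulls KEGG data over plain HTTP. It uses KEGG’s REST-style interface (rest.kegg.jp) rather than the SOAP services cited above, so the endpoint layout and output format are assumptions relative to the text.

```python
# A minimal sketch of programmatic access to KEGG data over plain HTTP,
# using KEGG's REST-style interface rather than the SOAP services described
# in the text; the endpoint layout and output format are assumptions.
import urllib.request

def kegg(*path):
    """Fetch a KEGG REST resource and return its plain-text body."""
    url = "https://rest.kegg.jp/" + "/".join(path)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()

# List pathways annotated for E. coli K-12 (KEGG organism code "eco");
# each line is a tab-separated (identifier, description) pair.
for line in kegg("list", "pathway", "eco").splitlines()[:5]:
    pathway_id, name = line.split("\t", 1)
    print(pathway_id, "->", name)
```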

Are there systems in use today that leverage web services?

One system worth exploring is Taverna.16 Taverna is a collaboration between the European Bioinformatics Institute (EBI), IT Innovation, the School of Computer Science at the University of Newcastle, the Newcastle Centre for Life, the School of Computer Science at the University of Manchester, and the Nottingham University Mixed Reality Lab. Additional development effort has come from the BioMOBY project, SeqHound, BioMart, and various individuals across the planet. Development is coordinated through the facilities provided by SourceForge.net and is predominantly driven by the requirements of biologists in the UK life science community. Taverna enables end users to compose bioinformatics web services into novel workflows in a graphical environment.
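
For readers without the graphical client, the sketch below expresses the same composition idea in script form: each workflow step wraps a service call, and the workflow is simply their composition. Both service functions are placeholder stubs invented for illustration; they are not real Taverna or web service APIs.

```python
# A rough sketch of the workflow idea that Taverna provides graphically:
# each step wraps a web service, and a workflow is the composition of steps.
# Both service functions are placeholder stubs, not real service APIs.

def find_homologs(sequence):
    """Stub for a sequence-similarity search service."""
    return ["P00722", "P06278"]  # pretend accession numbers

def fetch_record(accession):
    """Stub for a data-retrieval service."""
    return "record for " + accession

def homolog_report(sequence):
    """Workflow: search for homologs, then retrieve each hit."""
    return [fetch_record(acc) for acc in find_homologs(sequence)]

print(homolog_report("MKTAYIAKQR"))
```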

Is industry developing any new cyberinfrastructure tools that might alter the landscape in biology?

It is likely that several of the commercial search engine companies (e.g. Google and Microsoft) will explore coupling biological searches of the open literature and databases with computational services and with access to commercial tools and databases. Such offerings will most likely be early examples of coupling commercial technology (web services infrastructure, indexing, and search) with the best of the open science literature.

What about petascale computing?

Large-scale computational methods can address fundamental biological problems:

By studying the details of individual gene history and protein families, we can begin to understand the factors that influence molecular evolution, refine our strategies for building large-scale databases of protein structures, and lay the foundation for understanding the role of horizontal gene transfer in evolution.

Proteins are the building blocks for biological processes. Using modeling and simulation, we can begin to understand how proteins work, how they evolve to optimize their functions, how complexes are formed and function, and how we can modify proteins to alter their functions.

Many processes of interest to the biological community are mediated by proteins, ranging from biocatalysis of potential fuel stocks to the production of rare and unique compounds to the detoxification of organic waste products. Large-scale modeling and simulation can be used to attack the problem of rational protein design, whose solution may have long-term impact on our ability to address, in an environmentally sound manner, a wide variety of energy and environmental problems.

Understanding the function of gene regulation is one of the major challenges of 21st century biology. By employing a variety of mathematical techniques coupled with large-scale computing resources, researchers are beginning to understand how to reconstruct regulatory networks, map these networks from one organism to another, and ultimately develop predictive models that will shed light on development and disease.

The central dogma of molecular biology relates the transcription of DNA to messenger RNA, which is then translated to produce proteins. This is the foundation of the information-processing operation in all living organisms. The molecular complexes that mediate these processes are some of the most complex nanomachines in existence. Via large-scale modeling and simulation of protein/RNA complexes such as the ribosome and the spliceosome, we will improve our understanding of these fundamental processes of life.

Membranes are the means nature uses to partition biological functions and to support the complexes of proteins responsible for the cell’s ability to interact with its neighbors and the environment. Large-scale modeling is the means by which we can understand the formation, function, and dynamics of these complex molecular structures.

With the number of completed genome sequences reaching 1,000 in the next few years, we are on the verge of a new class of biological problem: reconstructing the function of entire genomes and building models that enable the prediction of phenotypes from the genotype. With petascale modeling it will become feasible to quickly produce a whole-genome-scale model for a newly sequenced organism and begin to understand the organism’s lifestyle prior to culturing it (a minimal sketch of one such model appears after this list).

Large-scale computing is making it feasible to model ecosystems by aggregating models of individuals. With petascale computing capabilities, this technique can begin to be applied to natural environments such as soils and to artificial environments such as bioreactors, in order to understand the interactions between different types of organisms and their ability to cooperatively metabolize compounds important for carbon cycling.
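
As one concrete, heavily simplified illustration of the genome-scale metabolic models mentioned above, the sketch below runs flux balance analysis on an invented three-reaction toy network; the network, bounds, and objective are made up for illustration, and flux balance analysis is only one of several modeling approaches implied here.

```python
# A heavily simplified flux balance analysis (FBA) sketch, one common
# formulation of a genome-scale metabolic model, on an invented toy network
# of three reactions and two metabolites. Real models have thousands of each.
import numpy as np
from scipy.optimize import linprog

# Stoichiometric matrix S (rows: metabolites A, B; columns: reactions)
# R1: uptake -> A,   R2: A -> B,   R3: B -> biomass (the objective)
S = np.array([
    [1, -1,  0],   # metabolite A
    [0,  1, -1],   # metabolite B
])

# Flux bounds per reaction; uptake through R1 is capped at 10 units
bounds = [(0, 10), (0, 1000), (0, 1000)]

# Maximize flux through R3 subject to the steady-state constraint S v = 0
# (linprog minimizes, so the objective is negated)
c = np.array([0, 0, -1])
res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")

print("predicted growth flux:", -res.fun)   # 10.0, limited by the uptake bound
```

Scaling this same linear program up to thousands of reactions and metabolites, with bounds derived from genome annotation, is what makes whole-genome models for newly sequenced organisms a large-scale computing problem.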

The following table gives examples of high-impact problems that could be addressed in the next two to three years on an open-access petascale platform and that leverage methods that have already been ported to the IBM BG/L platform.

Biology problem areas and estimated capability at 360 TF/s, 1,000 TF/s, and 5,000 TF/s:

Determining the detailed evolutionary history of each protein family ⇒ This will enable rational planning for structural biology initiatives and will provide a foundation for assessing protein function and diversity.
At 360 TF/s: 3,000 hours to build the reference database. At 1,000 TF/s: 300 hours. At 5,000 TF/s: 60 hours.

Determining the frequency and detailed nature of horizontal gene transfers in prokaryotes ⇒ This will shed light on the molecular and genetic mechanisms of evolution by means other than direct “Darwinian” descent and will contribute to our understanding of the acquisition of virulence and drug resistance in pathogens and the means by which prokaryotes adapt to the environment.
At 360 TF/s: 1,000 hours to study 200 gene families. At 1,000 TF/s: 1,000 hours to study 2,000 gene families. At 5,000 TF/s: 1,000 hours to study 10,000 gene families.

Automated construction of core metabolic models for all the sequenced DOE genomes ⇒ This will enable dramatic acceleration of the promise of the GTL program and the use of microbial systems to address DOE mission needs in energy, environment, and science.
At 360 TF/s: one hour per organism, 100 hours per metagenome. At 1,000 TF/s: 10 organisms per hour, 10 hours per metagenome. At 5,000 TF/s: 50 organisms per hour, two hours per metagenome.

Predict essential genes for all known sequenced micro-organisms ⇒ This will enable a broader class of genes and gene products to be targeted for potential drugs and will allow prediction of culturability conditions for environmental microbes.
At 360 TF/s: 300 hours for 1,000 organisms; 10 hours to predict culturability per organism. At 1,000 TF/s: 30 hours for 1,000 organisms; one hour to predict culturability per organism. At 5,000 TF/s: 30 hours for 5,000 organisms.

Computational screening of all known microbial drug targets against the public and private databases of chemical compounds to identify potential new inhibitors and potential drugs ⇒ The resulting database would be a major national biological research resource that would have a dramatic impact on worldwide health research and the fundamental science of microbiology.
At 360 TF/s: 2 M ligands per day per target (one year to screen all microbial targets). At 1,000 TF/s: 20 M ligands per day per target (~1 month to screen all microbial targets). At 5,000 TF/s: one machine-year to screen all known human drug targets.

Model and simulate the precise cellulose degradation and ethanol and butanol biosynthesis pathways at the protein/ligand level to identify opportunities for molecular optimization ⇒ This would result in a set of model systems to be further developed for optimization of the production of biofuels.
At 360 TF/s: simulate in detail the directed evolution of individual enzymes. At 1,000 TF/s: simulate the co-evolution and optimization of a degradation or biosynthesis pathway of up to five enzymes. At 5,000 TF/s: simulate the optimization of a complete cellulose-to-ethanol or -butanol production system of over a dozen enzymatic steps.

Model and simulate the replication of DNA to understand the origin of, and the repair mechanisms for, genetic mutations ⇒ This would result in dramatic progress in the fundamental understanding of how nature manages mutations and of which molecular factors determine the broad range of organism susceptibility to radiation and other mutagens.
At 360 TF/s: a 30 ns simulation of DNA polymerase. At 1,000 TF/s: 10 ensembles of different DNA repair enzymes. At 5,000 TF/s: a complete polymerase-mediated base-pair addition step.

Model and simulate the process of DNA transcription and protein translation and assembly ⇒ This would enable us to move forward on understanding post-transcriptional and post-translational modification and epigenetic regulation of protein synthesis.
At 360 TF/s: validate current understanding of ribosomal function. At 1,000 TF/s: explore spliceosome function and the evolution of intron/exon functions. At 5,000 TF/s: model the complete coupled processes of DNA transcription to protein translation, including regulatory processes.

Model and simulate the interlinked metabolisms of microbial communities ⇒ This project is relevant to understanding the biogeochemical cycles of extreme, natural, and disturbed environments and will lead to the development of strategies for the production of biofuels and the development of new bio-engineered processes based on exploiting communities rather than individual organisms.
At 360 TF/s: 20 organisms in a linked metabolic network. At 1,000 TF/s: 100 organisms in a linked metabolic network. At 5,000 TF/s: 200 organisms in a linked metabolic network.

In silico prediction of mutations and activity, conformational changes, and active-site alterations.
At 360 TF/s: one enzyme. At 1,000 TF/s: a five-enzyme pathway. At 5,000 TF/s: eight-enzyme pathway optimization.
1 http://www.ncbi.nlm.nih.gov/
2 http://www.ebi.ac.uk/
3 http://www.genome.jp/
4 http://workbench.sdsc.edu/
5 http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi
6 http://theseed.uchicago.edu/FIG/index.cgi
7 http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
8 http://research.calit2.net/cibio/archived/CIBIO_Overview_Report.pdf
9 http://darwin.nap.edu/books/030909612X/html/R1.html
10 http://doegenomestolife.org/roadmap/index.shtml
11 http://www.tgbioportal.org/
12 http://lsgw.mcs.anl.gov/about
13 http://compbio.mcs.anl.gov/gaduvo/gaduvo.cgi
14 http://www.mygrid.org.uk/
15 http://www.genome.jp/kegg/soap/
16 http://taverna.sourceforge.net/
