Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade
August 2006
As genome data accumulates faster than Moore's law allows compute power to grow, biology will pose numerous cyber challenges in the next 10-15 years.
Biology is becoming an increasingly heavy consumer of computational resources as it turns into a data-driven discipline. Most notably, the emergence of genome and post-genome technology has made vast amounts of data available and demanding analysis. Hundreds of bacterial (more precisely, prokaryotic) genomes are available today and have already proven to be a very valuable resource for many applications; a prominent example is the reconstruction of the metabolic pathways1 of several bacterial organisms. The analysis of the rising number of genomes is already an application of cyber technologies2 and is to some extent limited by the available cyber resources. As more data becomes available, this trend is likely to continue.
An important factor in this equation is that the number of available complete genomic sequences is doubling almost every 12 months3 at the current state of technology, whereas, according to Moore's law, available compute cycles double only every 18 months. The analysis of genomic sequences requires serious computational effort: most analysis techniques rely on pairwise (binary) comparison of genomes or of the genes within genomes. Since the number of pairwise comparisons grows with the square of the number of sequences involved, the computational cost of the sequence comparisons alone will become staggering. Whether we are trying to reconstruct the evolutionary history of a set of proteins, to characterize the shape they fold into, or to determine correspondences between genes in distinct genomes, we rely on these pairwise operations, and their cost is climbing rapidly.
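A rough back-of-the-envelope calculation illustrates how quickly this gap opens up. The sketch below (plain Python) uses the doubling times quoted above and an arbitrary, illustrative starting point of 100 complete genomes; it compares the growth of all-against-all comparison work with the growth of compute capacity under Moore's law.

```python
# Back-of-the-envelope: pairwise comparison work vs. Moore's law.
# Illustrative assumptions only: 100 complete genomes today,
# genome count doubles every 12 months, compute doubles every 18 months.

GENOME_DOUBLING_MONTHS = 12.0
MOORE_DOUBLING_MONTHS = 18.0
START_GENOMES = 100

def genomes(years):
    """Number of complete genomes after `years`."""
    return START_GENOMES * 2 ** (years * 12.0 / GENOME_DOUBLING_MONTHS)

def pairwise_work(years):
    """All-against-all comparisons grow with the square of the genome count."""
    n = genomes(years)
    return n * (n - 1) / 2

def compute_capacity(years):
    """Relative compute capacity under Moore's law (1.0 today)."""
    return 2 ** (years * 12.0 / MOORE_DOUBLING_MONTHS)

print(f"{'year':>4} {'genomes':>10} {'comparisons':>14} {'work per unit compute':>22}")
for year in range(0, 11, 2):
    work = pairwise_work(year)
    capacity = compute_capacity(year)
    print(f"{year:>4} {genomes(year):>10.0f} {work:>14.3e} {work / capacity:>22.3e}")
```

Even under these simple assumptions, the comparison workload per unit of available compute grows by roughly a factor of 2.5 every year, which is the quantitative core of the argument above.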
Today, traditional research teams in bioinformatics either rely entirely on resources provided by institutions like the National Center for Biotechnology Information (NCBI)4 for sequence analysis purposes5 or build up their own local resources. The NCBI provides services including comprehensive sequence databases and online sequence comparison via a browser interface. Researchers with private compute resources have the advantage of running algorithms of their own choosing on their machines; however, to keep up with the data flood, they either have to accept long waiting times or keep investing in cluster resources to meet their growing sequence analysis needs.
However, as the number of available sequences grows, the number of algorithms available for their analysis also increases. Numerous bioinformatics techniques now exist or are being developed that use considerably more computational power than the traditionally used BLAST algorithm and yield different results for sequence comparison. Over the last five years, the influx of machine learning techniques in particular has led to increased consumption of compute cycles in computational biology.
When researchers began to use Markov models to search for sequence similarities not visible with BLAST, and to build databases of common sequence motifs represented as hidden Markov models (e.g., HMMer or InterPro), CPU requirements increased dramatically. A BLAST search against NCBI's comprehensive, non-redundant collection of known proteins can be run in a matter of minutes for several hundred query sequences (remember that a single genome contains thousands of genes), either locally or on NCBI's BLAST server. In contrast, no resource exists that allows querying several hundred (let alone several thousand) proteins for protein motifs using the European Bioinformatics Institute's (EBI) InterPro tool.
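To get a feel for the difference in scale, the illustrative sketch below estimates the CPU time for annotating a single bacterial genome. Every constant in it is a hypothetical placeholder rather than a measured value; the point is only that per-genome costs measured in CPU-hours for BLAST can turn into hundreds of CPU-hours once every protein has to be scored against thousands of profile HMMs.

```python
# Illustrative cost estimate for motif annotation of one bacterial genome.
# All numbers below are hypothetical placeholders, not benchmark results.

PROTEINS_PER_GENOME = 4_000          # a typical bacterial genome encodes a few thousand proteins
HMM_MODELS_IN_LIBRARY = 10_000       # assumed size of a motif / profile-HMM library
SECONDS_PER_BLAST_QUERY = 1.0        # assumed average time for one BLAST query
SECONDS_PER_HMM_COMPARISON = 0.05    # assumed time to score one protein against one HMM

blast_hours = PROTEINS_PER_GENOME * SECONDS_PER_BLAST_QUERY / 3600
hmm_hours = (PROTEINS_PER_GENOME * HMM_MODELS_IN_LIBRARY
             * SECONDS_PER_HMM_COMPARISON / 3600)

print(f"BLAST, whole genome:        {blast_hours:8.1f} CPU-hours")
print(f"Motif search, whole genome: {hmm_hours:8.1f} CPU-hours")
print(f"Ratio:                      {hmm_hours / blast_hours:8.0f}x")
```

With these placeholder numbers the motif search is several hundred times more expensive per genome than the BLAST run, and the factor only grows with the size of the motif library.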
Today, few resources exist outside TeraGrid6 that could provide the computational power needed to run a comprehensive protein motif search for more than a few complete bacterial genomes. Only a massive, high-performance computing resource like TeraGrid can provide the CPU-hours that will be required for this and other future challenges stemming from the increasing amount of sequence data.
Figure 1 makes the reason for developing new algorithms and seeking more computational power apparent: as sequencing accelerates, we simply cannot generate annotations fast enough. Applying new bioinformatics techniques, together with high-throughput computing, therefore provides a much-needed means of narrowing the growing gap between the number of sequences and the number of annotations.
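High-throughput computing helps here because the comparison workload is embarrassingly parallel: each pairwise comparison is independent of all the others. The minimal sketch below shows how such a workload could be partitioned across worker processes; `compare` is a stand-in for whatever pairwise analysis (BLAST, motif search, and so on) is actually being run, and the toy sequences are invented.

```python
# Minimal sketch: distribute independent pairwise comparisons across processes.
# `compare` is a placeholder for a real pairwise analysis (e.g., a BLAST run).
from itertools import combinations
from multiprocessing import Pool

def compare(pair):
    """Placeholder pairwise comparison; returns a toy 'similarity' score."""
    a, b = pair
    return (a, b, abs(len(a) - len(b)))  # stand-in for a real alignment score

def all_against_all(sequences, workers=4):
    """Run every pairwise comparison, farming the independent jobs out to a pool."""
    pairs = list(combinations(sequences, 2))
    with Pool(processes=workers) as pool:
        return pool.map(compare, pairs)

if __name__ == "__main__":
    toy_genomes = ["ATGCC", "ATGCCGTA", "GGCATTA", "ATG"]
    for a, b, score in all_against_all(toy_genomes):
        print(a, b, score)
```

On a real cluster or on a resource like TeraGrid the same partitioning would be expressed through a batch scheduler rather than a local process pool, but the structure of the problem is identical, which is why large high-performance computing resources map onto it so naturally.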
This limitation is currently of interest mainly to people working in basic science. With the advent of more and more complete genomic sequences for crop plants, pathogens, and ultimately individual human beings, the demand for precise and fast bioinformatics analysis of genomes, not only bacterial but also plant and human, is going to grow quickly.
As daunting as our limited ability to generate annotations seems, we have so far only discussed a fraction of the challenges posed by biology. Annotations cover only the static components of the genome. They are a description of the gene load.
Ever since we learned that the human genome contains relatively few genes (estimates are still shifting, but all are below 50,000), it has become clear that the dynamics of gene expression and its regulation hold the key to understanding the organisms in question.
As long as we are unable to fully enumerate, let alone describe, the functional elements in the respective genomes, we are a long way from understanding the full complexity hidden in the static and dynamic components of the genome. Cyber technologies will play a key role in furthering our understanding of the data we are currently amassing. While important insights into organisms' lifestyles can already be obtained from studying the dynamic components of life (gene expression and regulation), we are at the beginning of another data deluge. The NCBI presents, as part of their training material, a comparison of the growth of sequence and gene expression data,7 highlighting the fact that both are growing dramatically.
The analysis of the growing volume of gene expression data becoming available from the various post-genomics technologies will present an even greater challenge than the annotation problem we face right now. A single gene expression experiment can generate data for thousands of genes at a time, and while gene expression studies have the potential to help us understand annotations much better, we are initially faced with data that not only needs to be integrated with the annotations but also exceeds them in volume and complexity.
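As a concrete illustration of this integration problem, the toy sketch below joins a small expression matrix (genes by conditions) with a gene-annotation table keyed by gene identifier; the identifiers, annotations, and values are all invented for the example. Even in this trivial form, the expression side of the join outweighs the annotation side, since each gene contributes one annotation but one measurement per condition per experiment.

```python
# Toy illustration: integrating expression measurements with annotations.
# Gene identifiers, annotations, and expression values are invented examples.

annotations = {
    "geneA": "ABC transporter, ATP-binding protein",
    "geneB": "hypothetical protein",
    "geneC": "DNA polymerase III, beta subunit",
}

# Expression matrix: one row per gene, one value per experimental condition.
conditions = ["control", "heat_shock", "stationary_phase"]
expression = {
    "geneA": [1.00, 2.40, 0.70],
    "geneB": [0.95, 0.90, 3.10],
    "geneC": [1.05, 1.10, 0.40],
}

# Join the two data sets on the gene identifier.
for gene, values in expression.items():
    annotation = annotations.get(gene, "no annotation available")
    profile = ", ".join(f"{c}={v:.2f}" for c, v in zip(conditions, values))
    print(f"{gene}: {annotation} | {profile}")
```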
While we are currently faced with the problem of generating annotations for the sequences we are producing, the next steps are already well defined, and it is clear that biology has a serious need for computational support, which in turn will require large-scale computation. Biology is in the middle of a paradigm shift towards becoming a fully data-driven science.