Computing and the “Age of Biology”
Natalia Maltsev, Argonne National Laboratory
CTWatch Quarterly
August 2006

The past decade has completely changed the face of biology. The image of a biologist passionately chasing butterflies in the wilderness of an Amazon rainforest, or losing their eyesight staring for hours into a microscope, has been replaced by pictures of factory-like sequencing facilities and high-throughput automated experimental complexes. Technology has changed the entire fabric of biology from a science of lone enthusiasts to a data-intensive science of large projects involving teams of specialists in various branches of the life sciences spread across multiple institutions.

The new generation of biology is tightly interlinked with progress in computer science. Indeed, in order to exploit the enormous scientific value of biological data for understanding living systems, the information must be integrated, analyzed, graphically displayed, and modeled computationally in a timely fashion. The development of computational models of an organism’s functionality is essential for progress in medicine, biotechnology, and bioremediation. Such models make it possible to predict the functions of genes in newly sequenced genomes and the existence of particular metabolic pathways and physiological features. Conjectures developed during computational analysis of genomes provide invaluable aid to researchers in planning experiments and save an enormous amount of time and resources otherwise required to elucidate an organism’s biochemical and physiological characteristics.

Essential for fulfilling this task is the development of high-throughput computational environments that integrate (i) large amounts of genomic and experimental data, (ii) comprehensive tools and algorithms for knowledge discovery and data mining, and (iii) comprehensive user interfaces that provide tools for easy access, navigation, visualization, and annotation of biological information. To achieve such an integrated environment, computational biologists face four primary challenges:

1. Exponential growth of biological data requires scalable public computational resources. Over the past 10 years, the amount of data in genomic databases has doubled or tripled each year. For example, the current 151.0 release of the largest genomic database, GenBank, contains 56 billion bases from 52 million sequences, and the rate of growth of GenBank is expected to increase dramatically as the cost of sequencing new genomes drops. To date, 394 genomes have been completely sequenced and 1644 are at various levels of completion. However, the development of highly integrated and scalable bioinformatics systems for interpreting newly sequenced genomes is a time- and resource-consuming task. While large sequencing and bioinformatics centers have the resources needed, a significant number of institutions are working on only a small number of genome projects: out of 1174 ongoing sequencing projects, 66% of institutions have only one sequencing project, and over 87% have four or fewer.1 The development of large, public computational systems that provide computing resources, integration of data, automated analyses, and the capability for expert-driven refinement of these analyses may significantly benefit the field of genomics. Such systems would provide state-of-the-art computational support to smaller genomics groups and allow them to concentrate on biological questions rather than on the development of bioinformatics environments.
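As a rough, back-of-the-envelope illustration of this scaling pressure, the sketch below projects database size under a sustained annual doubling. The starting figure is the GenBank number cited above; the doubling rate is an assumption chosen for illustration, not a measured growth rate.

```python
# Illustrative sketch only: projects database growth under an assumed
# sustained annual doubling, starting from the figure cited in the text.
BASES_2006 = 56e9  # bases in GenBank release 151.0

def projected_bases(years, annual_factor=2.0):
    """Database size after `years` of growth at `annual_factor` per year."""
    return BASES_2006 * annual_factor ** years

for years in (1, 5, 10):
    print(f"after {years:2d} years: {projected_bases(years):.2e} bases")
# Sustained doubling multiplies the database roughly a thousandfold in a decade.
```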

2. Complexity of biological data. Increasingly, biological models draw on information from different branches of the life sciences: genomics, physiology, biochemistry, biophysics, proteomics, and many more. The development of such models is a task of unprecedented computational, semantic, and algorithmic complexity. It requires integration of various classes of biological information, as well as of similar classes of data drawn from different resources. Such large-scale integration presents a number of computer science challenges because of the large volume and complexity of the data, the distributed character of this information residing in different databases, the shortfalls of current biological ontologies, and generally poor naming conventions for biological objects (e.g., a large number of synonyms describing the same object or notion, or nonunique names describing different objects). More than 100 groups (e.g., GO, BioPAX, W3C; see the Open Biomedical Ontologies website for a partial list of such efforts) are developing ontologies for various branches of the biological sciences. For example, in order to satisfy diverse scientific communities, the ontological description of glycolysis has evolved into an enormously complex data structure integrating various classes of data, historical conventions of the communities, and links to the ontologies of other sciences (e.g., chemistry, biophysics, taxonomy). However, the development of most of these ontologies still depends on social consensus among the scientific communities, a task that seems to be of insurmountable social and scientific complexity. The development of new tools and algorithms for mining and clustering existing scientific notions and terms may provide significant assistance to this process.
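To make the naming problem concrete, here is a minimal sketch of synonym normalization through a lookup table. The names and the identifier used are illustrative examples, not entries drawn from any particular ontology; real resolution requires the curated, community-built ontologies described above.

```python
# Minimal sketch of the naming problem: the same enzymatic activity may be
# reported under several synonyms, and a lookup table is one naive way to
# normalize them. All mappings here are hypothetical illustrations.
SYNONYMS = {
    "glucokinase": "EC 2.7.1.2",
    "hexokinase d": "EC 2.7.1.2",   # synonym for the same activity
    "glk": "EC 2.7.1.2",            # gene-style name for the same enzyme
}

def canonical_id(reported_name):
    """Map a reported name to a canonical identifier, if one is known."""
    return SYNONYMS.get(reported_name.strip().lower())

for name in ("Glucokinase", "hexokinase D", "unknown protein 17"):
    print(name, "->", canonical_id(name))
```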

3. Algorithm development. The most popular bioinformatics tools (e.g., BLAST, FASTA) perform pairwise comparisons of a query sequence with all sequences in a specified database. With the amount of sequence data growing exponentially, such a computationally intensive all-against-all approach inevitably leads to an N-squared scaling problem, as the sketch below illustrates. Bioinformatics would benefit significantly from a new generation of algorithms that allow efficient data mining and identification of complex multidimensional patterns involving various classes of data. Visualization of multifarious information is another essential need of high-throughput biology: it helps reduce the complexity of biological knowledge and build much-needed overviews.
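The following minimal sketch uses made-up toy sequences and a deliberately naive scoring function, standing in for the actual BLAST or FASTA algorithms, to show why the comparison count grows quadratically with the number of sequences.

```python
from itertools import combinations

def naive_similarity(a, b):
    # Toy score: count matching characters at aligned positions.
    return sum(1 for x, y in zip(a, b) if x == y)

def all_against_all(seqs):
    """Score every unordered pair of sequences: N*(N-1)/2 comparisons."""
    return {(i, j): naive_similarity(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

seqs = ["ATGCGT", "ATGAGT", "TTGCGA", "ATGCGT"]  # hypothetical toy data
print(len(seqs), "sequences ->", len(all_against_all(seqs)), "comparisons")
# Doubling the number of sequences roughly quadruples the number of comparisons.
```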

4. Development of collaborative environments. A typical biological project involves data sources and users distributed among various institutions. Such projects require a mature infrastructure that allows seamless integration, analysis, storage, and delivery of information to a distributed community of users. Warehousing the data and analyzing it at a single location will not be sufficient for the needs of biology in the future. Essential for the success of large biological projects is the further development of collaborative environments that allow scientists residing in different locations, and sometimes even on different continents, to analyze, discuss, annotate, and view the data. Access Grid conferencing, shared interfaces, Web services, and other collaborative tools will allow groups to identify, discuss, and solve scientific problems efficiently.
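As one hedged illustration of the Web-services piece of such an environment, the sketch below exposes a toy annotation record over HTTP using only the Python standard library. The gene identifier, the record contents, and the route convention are hypothetical; a production service would add authentication, provenance tracking, and a shared data store.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory annotation store; in practice this would sit in
# front of a database shared by the distributed project.
ANNOTATIONS = {"gene0001": {"product": "putative kinase", "curator": "site A"}}

class AnnotationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Treat the last path segment as the requested gene identifier.
        gene_id = self.path.rstrip("/").split("/")[-1]
        record = ANNOTATIONS.get(gene_id)
        self.send_response(200 if record else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(record or {"error": "not found"}).encode())

if __name__ == "__main__":
    HTTPServer(("", 8000), AnnotationHandler).serve_forever()
```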

The 21st century is considered to be the “age of biology.” Advances in genomic research will yield cures or therapies for numerous diseases once considered incurable, and future genetically engineered bioproducts will contribute significantly to solving the global hunger problem. Such progress will, to a large extent, be driven by the formulation of new computational approaches to the analysis of biological data and by the timely transfer of technologies developed in other disciplines (e.g., physics, linguistics). Computing and the biological sciences will become intimately intertwined, opening new possibilities and causing unprecedented changes to life as we know it.

1 Statistics from the Genome Project Sequencing Center at the National Center for Biotechnology Information, as of July 2, 2006.

URL to article: http://www.ctwatch.org/quarterly/articles/2006/08/computing-and-the-age-of-biology/