CTWatch
November 2006 B
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
David A. Bader, Georgia Institute of Technology
Kamesh Madduri, Georgia Institute of Technology
John R. Gilbert, UC Santa Barbara
Viral Shah, UC Santa Barbara
Jeremy Kepner, MIT Lincoln Laboratory
Theresa Meuse, MIT Lincoln Laboratory
Ashok Krishnamurthy, Ohio State University

2. SSCA #1: Bioinformatics Optimal Pattern Matching

Figure 1. Sequence alignment algorithms (SSCA#1) are used for protein structure prediction.

The intent of this SSCA is to develop a set of compact application kernels that perform a variety of analysis techniques used for optimal pattern matching. The chosen application area is an important optimization problem in bioinformatics and computational biology: aligning genomic sequences. The following references introduce the extensive literature on this problem space, some publicly available programs that address these problems, and the algorithms used in those programs: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

2.1 Bioinformatics

A genome is a linear sequence of the four deoxyribonucleic acid (DNA) nucleotides (bases); paired with its complementary strand, it forms the famous double helix. The DNA sequence contains the information needed to code the proteins that form the basis for life. Proteins are linear sequences of amino acids, typically 200-400 amino acids in length. Each different protein twists naturally into a specific, complex 3-dimensional shape. This shape is what primarily determines the protein’s function.

Each group of three adjacent DNA bases forms one of 64 (4 x 4 x 4) possible codons. Of these, 61 code for the 20 different amino acids, while the three remaining codons indicate a stop to the coding region for the current protein. A particular amino acid may have from one to six different encodings.
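The codon counts above can be checked with a short Python sketch. It assumes the standard genetic code, written in the compact layout commonly used for codon tables (bases ordered T, C, A, G, with `*` marking a stop codon); the names `BASES`, `AMINO_ACIDS`, and `CODON_TABLE` are illustrative and do not come from the benchmark specification.

```python
from collections import Counter
from itertools import product

# Assumption: the standard genetic code. Codons are enumerated in TCAG
# order (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each three-base codon to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons that end a coding region.
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Number of distinct codons that encode each amino acid.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print("codons:", len(CODON_TABLE))
print("coding codons:", sum(degeneracy.values()))
print("stop codons:", stop_codons)
print("amino acids:", len(degeneracy))
print("encodings per amino acid:",
      min(degeneracy.values()), "to", max(degeneracy.values()))
```

Under this table the script reports 64 codons, 61 coding codons, three stop codons, and 20 amino acids with between one and six encodings each, matching the figures in the text.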

Different organisms typically use similar proteins for similar purposes. A slight change in the amino acid sequence can cause anything from a slight to a profound change in the shape of the resulting protein. A slight change in the DNA sequence can cause anything from no change to a profound change in the amino acid sequence. Profound changes are almost always bad for the organism, but smaller changes may be good, bad, or neutral. Such changes (mutations) are continually occurring as a result of radiation, chemical agents, copying errors, etc.

Mutations can change individual bases, or can add or delete sections of DNA. Adding or deleting individual bases almost always produces a profound change, because it shifts the reading frame of every subsequent codon. Adding or deleting a sequence of 3n bases preserves the reading frame, so it may have only a slight effect: the subsequent amino acids remain unchanged.
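The difference between deleting one base and deleting three can be seen in a small Python sketch. The 21-base sequence and the `translate` helper are made up for illustration, and the codon table again assumes the standard genetic code (`*` marks a stop codon).

```python
from itertools import product

# Assumption: the standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon from position 0.

    Trailing bases that do not fill a whole codon are ignored, and stop
    codons are kept in the output as '*'.
    """
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )


# Hypothetical coding sequence: ATG GCT GAA ACC GGT CTG AAA
original = "ATGGCTGAAACCGGTCTGAAA"

# Delete one base (index 3): every later codon is read out of frame.
one_deleted = original[:3] + original[4:]

# Delete three bases (indices 3-5): exactly one codon disappears.
three_deleted = original[:3] + original[6:]

print(translate(original))       # MAETGLK
print(translate(one_deleted))    # MLKPV*
print(translate(three_deleted))  # METGLK
```

In this example the single-base deletion changes every amino acid after the first and introduces a premature stop codon, while the three-base deletion removes one amino acid and leaves the rest of the protein intact.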

Automated techniques have produced enormous libraries of DNA sequences identified by organism. Laboratory research has produced enormous libraries of protein sequences identified by organism and function. Today, much biological research depends on finding and evaluating approximate matches between sequences from these libraries.
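One classic way to score an approximate match between two sequences is dynamic-programming local alignment in the style of Smith-Waterman. The Python sketch below is a minimal illustration only: the function name and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for the example, and it uses a simple linear gap penalty rather than any scoring scheme defined by the benchmark.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Smith-Waterman-style recurrence with a linear gap penalty. Only the
    score is computed (no traceback), using two rows of the DP table.
    The default scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment may start
            # anywhere; the other two options open a gap in a or in b.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two hypothetical sequences that share the substring "ACGT".
print(local_alignment_score("GGACGTGG", "TTACGTTT"))
```

Because scores are clamped at zero, unrelated flanking regions do not penalize a strong local match, which is what makes this family of methods useful for finding similar regions inside long library sequences.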


Reference this article
"Designing Scalable Synthetic Compact Applications for Benchmarking High Productivity Computing Systems," CTWatch Quarterly, Volume 2, Number 4B, November 2006 B. http://www.ctwatch.org/quarterly/articles/2006/11/designing-scalable-synthetic-compact-applications-for-benchmarking-high-productivity-computing-systems/
