CTWatch Quarterly » Designing Scalable Synthetic Compact Applications for Benchmarking High Productivity Computing Systems

Designing Scalable Synthetic Compact Applications for Benchmarking High Productivity Computing Systems

David A. Bader, Georgia Institute of Technology
Kamesh Madduri, Georgia Institute of Technology
John R. Gilbert, UC Santa Barbara
Viral Shah, UC Santa Barbara
Jeremy Kepner, MIT Lincoln Laboratory
Theresa Meuse, MIT Lincoln Laboratory
Ashok Krishnamurthy, Ohio State University

2.2 Sequence Alignment

In this SSCA, we consider the polynomial time problem of pairwise sequence alignment and the potentially NP-hard problem of multiple sequence alignment. Algorithms that solve these problems are often integer-based and use a variety of algorithmic techniques and heuristics. Optimal algorithms exist and are practical for pairwise alignment. However, approximate or heuristic algorithms are required for multiple sequence alignment; there is no single, obviously best approach to the problem, and the simple algorithms are either NP-hard or approximations.

In biology, multiple sequence alignments are needed to:

Organize data to reflect sequence homology
Identify conserved sites/regions
Identify variable sites/regions
Uncover changes in gene structure
Identify probes for similar sequences in other organisms
Develop PCR primers
Perform phylogenetic analysis

No one alignment algorithm is suitable for all these applications, but most commonly used alignment programs use variants of a single basic approach, the dynamic programming algorithm for optimal pairwise sequence alignment. The kernels below explore several of these variations.

2.3 Data Generation and Kernels

The first kernel performs a local pairwise alignment of two long codon sequences, finding the end-points of subsequences that are good matches according to the specified criteria. The second kernel identifies actual codon sequences located by the first kernel, working backward from the given end-points. The third kernel uses the interesting subsequences found in the second sequence by the second kernel to search the first sequence for a set of the best complete matches to each interesting subsequence. The fourth kernel goes through each set of matches found by the third kernel and performs a global pairwise alignment at the nucleotide level for each pair. The fifth kernel performs multiple sequence alignment on each of the sets of alignments generated by the third kernel, using the simple, approximate “center star” algorithm.

2.3.1 Scalable Data Generator

This SSCA requires a scalable data generator to create genomic sequences of lengths from hundreds to potentially billions of nucleotide bases. At least four types of data are commonly used for sequence matching: DNA, RNA, codons, and amino acids.

Usually the division of DNA/RNA into codons is known, and matching at the codon level is the most informative and fastest. Matching at the nucleotide level is interesting for some applications, but is much slower. Matching at the amino acid level is important only if the corresponding DNA/RNA is unknown, and uses almost exactly the same algorithms as codon-level matching. For this SSCA we have chosen to specify DNA codon-level pairwise alignment and DNA nucleotide-level multiple sequence alignment.

2.3.2 Kernel 1: Pairwise Local Alignment of Sequences

Waterman¹⁴ states: “Surprising relationships have been discovered between sequences that overall have little similarity. These are dramatic cases when unexpectedly long matching segments have been located between viral and host DNA. [Smith-Waterman] is a dynamic programming algorithm to find these segments. This is probably the most useful dynamic programming algorithm for current problems in molecular biology. These alignments are called local alignments.” For this first kernel, we are given two sequences and wish to find the subsequences from these two that are most similar to each other as defined by Waterman. There may be several ‘equally similar’ subsequence pairs.

Pages: 1 2 3 4 5 6 7 8 9 10

CTWatch is a collaborative effort				Sponsored By