CTWatch Quarterly » Designing Scalable Synthetic Compact Applications for Benchmarking High Productivity Computing Systems

Designing Scalable Synthetic Compact Applications for Benchmarking High Productivity Computing Systems

David A. Bader, Georgia Institute of Technology
Kamesh Madduri, Georgia Institute of Technology
John R. Gilbert, UC Santa Barbara
Viral Shah, UC Santa Barbara
Jeremy Kepner, MIT Lincoln Laboratory
Theresa Meuse, MIT Lincoln Laboratory
Ashok Krishnamurthy, Ohio State University

2.3.3 Kernel 2: Sequence extraction

Using the end-point pairs and similarity scores produced by kernel 1, kernel 2 locates actual subsequence pairs with those scores. If there is more than one match for a particular end-point pair, only the best one should be reported. This kernel is specified separately from kernel 1, since keeping track of the actual sequences in kernel 1 would require some extra computation and perhaps a great deal of extra space.

2.3.4 Kernel 3: Locating similar sequences

For the second of each pair of subsequences produced by kernel 2, remove any gaps and search the first full sequence for the 100 best matches to the entire compacted subsequence.

2.3.5 Kernel 4: Aligning pairs of similar sequences

The result of kernel 4 is 100 sets of 100 highly similar subsequences, taken from the first of the two original full sequences.

For each of these sets, this kernel prepares the way for the kernel 5 multiple-sequence alignment algorithm by aligning each pair of subsequences and reporting their alignment score. The scoring algorithm does global matching using a scoring function, which operates at the nucleotide level and does not include any gap-start penalty. Instead of measuring similarity directly, it measures differences between the strings, so computing optimal alignments requires minimizing the difference-score rather than maximizing a similarity score.

2.3.6 Kernel 5: Multiple Sequence Alignment

The result of kernel 4 is 100 sets of 100 subsequences, pairwise aligned and scored for similarity at the nucleotide level. Kernel 5 then arranges each set of subsequences into a multiple alignment that approximates an alignment, which might be of interest to someone studying relationships between the subsequences within each set.

A Multiple Sequence Alignment (MSA) is defined as follows. A multiple alignment of strings S[1], . . . , S[k] is a series of strings S'[1], . . . , S'[k] with spaces (internal gaps), such that the new sequences S'[] are all of the same length n, and S'[j] is an extension of S[j] by insertion of spaces for 1 ≤ j ≤ k find an optimal multiple alignment.

For biological purposes an optimal multiple alignment is one which most clearly shows the relationships of the various sequences. These relationships may be evolutionary and/or structural, and may depend on additional data such as known kinship relationships or 3-dimensional protein shape correlations. Many different approaches are possible and this remains an active research area.

Sum-of-Pairs (SP) is one simple theoretical measure of multiple alignment. Given a specific multiple alignment (including spaces), the global similarity of each pair of sequences is calculated separately. The sum of these pairwise similarities is the SP measure. The minimum of SP over all possible multiple alignments is the theoretically optimal SP alignment.

Finding the optimal SP alignment is NP-hard; using dynamic programming it runs in O(k²2^kn^k) time, that is, exponential in the number of sequences. For n = 200 and k = 100, this is prohibitive. A number of different approximate methods are known, of varying performance, quality, and complexity, such as the Star, Tree, and various Progressive alignments. Star alignment runs in polynomial time, but the result can be far from optimal. Tree alignment maps the k sequences to a tree with k leaves, is NP-complete, and requires a tree selection heuristic.

Progressive alignments (such as ClustalW, PileUp, or T-Coffee) are perhaps the most practical and widely used methods, and are a hierarchical extension of pairwise alignments. They compare all sequences pairwise, perform cluster analysis on the pairwise data to generate a hierarchy for alignment (guide tree), and then build the alignment step by step according to the guide tree. The multiple alignment is built by first aligning the most similar pair of sequences, then adding another sequence or another pairwise alignment.

Progressive alignments often work well for similar sequences but are problematic for distantly-related sequences. Probabilistic methods are emerging, such as HMMER,⁵ that perform Profile Hidden Markov Models introduced by Gribskov.⁶ A profile HMM can be trained from unaligned sequences, if a trusted alignment is not yet known. HMMs have a consistent theory behind gap and insertion scores, and are useful in determining protein families. Many new MSA heuristics have been published in the last few years, including techniques such as MACAW, Clustal W, DIAlign, T-Coffee, and POA.

Multiple sequence alignment is a difficult problem; as described the best solutions are either very slow or very complex. For the purposes of this SSCA we choose the "center star" approximation method, as discussed in[⁸], a simple example of a progressive alignment. When used with an alphabet-weighted scoring scheme that satisfies the triangle inequality, this method produces a sum-of-pairs solution that is within a factor of two of optimal (and is usually much better).

Pages: 1 2 3 4 5 6 7 8 9 10

CTWatch is a collaborative effort				Sponsored By