Overview of the BLAST algorithm:
Fast approximation of optimal local alignment


Martin Tompa
CSE 427: Computational Biology
January 18, 2010

Outline of the BLAST algorithm

In what follows, w, T, S, E, Xu, A, and Xg are parameters that can be set by the user.

Inputs: query sequence Q, database D of sequences

  1. Find exact (or nearly exact, for protein sequences) "word matches" or "hits" between substrings of Q and D, each of length w (and alignment score ≥ T, for protein sequences).
  2. For each word match, do ungapped extension to the left and right to construct a "high-scoring segment pair" (HSP) of alignment score ≥ S.
  3. For each HSP, do gapped extension to the left and right, and output those final alignments with alignment score x and E-value ≤ E, where the "E-value" is the expected number of sequences in a randomly constructed database that have alignment score ≥ x.

Details of word matches

  1. For nucleotide sequences, these are exact matches of w consecutive residues. The default for NCBI BLAST is w=28.
  2. For protein sequences, construct a list of all possible words of length w and alignment score ≥ T when aligned with some length w substring of Q. The default for NCBI BLAST is w=3. The word list can be constructed in time "essentially proportional" to its length by branch and bound: Explore the complete tree with 20w leaves, but terminate any partial branch for which completing the word by perfect matches to the query does not yield score ≥ T. To search for all occurrences of the word list in D, employ the finite automaton for string matching with multiple patterns.
  3. There is a speed/sensitivity tradeoff in the choice of w (and T): increasing w decreases the number of word matches dramatically, but increases the chance of missing good alignments.

Details of ungapped extension

  1. Heuristic: continue to extend only if the alignment score is at most Xu (the "ungapped dropoff") below the best score seen in the extension up until this point.
  2. For protein sequences, only extend if there are two word matches on the same dynamic programming matrix diagonal within distance A.

Details of gapped extension

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool". Journal of Molecular Biology 215:403-410.
  2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Nucleic Acids Research, 25:3389-3402.