CSE 590TV: Computational Biology
Homework 4
February 5, 2003

Reading

  1. Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Research 22:4673-4680.
  2. Morgenstern, B., Frech, K., Dress, A., and Werner, T. (1998) "DIALIGN: finding local similarities by multiple sequence alignment". Bioinformatics 14:290-294.
  3. Kamvysselis, M., Patterson, N., Birren, B., Berger, B., and Lander, E. (2003) "Whole-genome comparative annotation and regulatory motif discovery in multiple yeast species". To appear in The Seventh Annual International Conference on Research in Computational Molecular Biology, Berlin, April 2003.

Due February 12

  1. Initially set n=500. Generate 5 strings, each of length n, over the alphabet {A,C,G,T}, where each character is chosen randomly, independently, and uniformly (i.e., each character takes each of the four possible values with probability 1/4). Now insert the string ACATATTACT once into each of these 5 strings, at varying positions somewhere between 100 and 300 characters from each string's right end. The resulting strings are intended to represent typical noncoding upstream sequences, with a short planted pattern ACATATTACT present in each one near its 3' end.

    Run CLUSTALW and DIALIGN on these 5 strings and observe how many of the 5 planted patterns each program succeeds in aligning. What is the smallest value of n (say, to the nearest multiple of 100) for which CLUSTALW fails to align any pair of these 5 planted patterns? Answer the same question for DIALIGN. If you cannot get one of these tools to fail in this way simply by increasing n, try also making small changes to the 5 planted instances of the pattern, for instance making each one differ from ACATATTACT in one position.

    Report the results of your experiments, providing for each of the two tools (1) the minimum value of n, (2) the alignment output (with the planted pattern highlighted somehow so that it's easy to see), and (3) how much you had to alter the pattern instances before failure (if at all).
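
The data generation in part 1 could be sketched in Python roughly as follows. This is only one possible reading of the instructions, not a required implementation. It assumes that "between 100 and 300 characters from the right end" is measured from the insertion point to the right end of the original random string, and that the tools will be given FASTA input. All function names (random_dna, mutate_one_position, plant, make_dataset, to_fasta) are hypothetical helpers introduced for illustration.

```python
import random

ALPHABET = "ACGT"
PATTERN = "ACATATTACT"


def random_dna(n, rng):
    """Return n characters drawn independently and uniformly from {A,C,G,T}."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def mutate_one_position(pattern, rng):
    """Return a copy of pattern that differs from it in exactly one position."""
    i = rng.randrange(len(pattern))
    new_base = rng.choice([b for b in ALPHABET if b != pattern[i]])
    return pattern[:i] + new_base + pattern[i + 1:]


def plant(background, instance, rng, min_dist=100, max_dist=300):
    """Insert instance so that min_dist..max_dist background characters
    lie to its right (assumed meaning of "from the right end")."""
    if len(background) < max_dist:
        raise ValueError("background is shorter than max_dist")
    dist = rng.randint(min_dist, max_dist)  # inclusive on both ends
    pos = len(background) - dist
    return background[:pos] + instance + background[pos:], pos


def make_dataset(n=500, count=5, mutate=False, seed=0):
    """Build count sequences of length n, each with one planted instance.
    Returns a list of (name, sequence, planted_position, instance)."""
    rng = random.Random(seed)
    records = []
    for k in range(count):
        instance = mutate_one_position(PATTERN, rng) if mutate else PATTERN
        seq, pos = plant(random_dna(n, rng), instance, rng)
        records.append((f"seq{k + 1}", seq, pos, instance))
    return records


def to_fasta(records, width=60):
    """Format the records as FASTA text, noting each planted position
    in the header line so it is easy to locate in the alignment later."""
    lines = []
    for name, seq, pos, _ in records:
        lines.append(f">{name} planted_at={pos}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Writing to_fasta(make_dataset(n=500)) to a file gives one input for both tools. Repeating with n=600, 700, and so on, and with mutate=True if needed, gives the series of experiments. Fixing the seed keeps each experiment reproducible for the report.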

  2. Use your 2-3 data sets of noncoding upstream sequences produced in Homework 3 for this part. (Don't throw them away, because we will use them again in a later homework.) If any upstream sequence in your sets is longer than 300 bp, truncate it to just the 300 bp at the 3' end, that is, retain the end closest to the start codon of its gene. Run CLUSTALW and DIALIGN on each data set, resulting in 4-6 alignments that you will turn in. In each alignment, highlight every aligned pattern (say 6 bp or longer) that looks to you to be surprisingly well conserved across the upstream sequences. (Some of your alignments may not contain any such patterns because of the problem exposed in part 1 above.) Any pattern found is a plausible candidate as a regulatory element for the downstream gene, that is, the binding site of some protein that regulates the gene's expression. With each alignment include the names of the prokaryotes and gene identification numbers, as well as which tool produced the alignment.
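
The truncation step in part 2 is small enough to sketch directly. The version below assumes each upstream sequence is stored as a string written 5' to 3' on the strand of its gene, so that the 3' end (the end nearest the start codon) is the right end of the string. The name truncate_upstream is hypothetical.

```python
def truncate_upstream(seq, keep=300):
    """Keep only the last `keep` characters (the 3' end, assuming the
    sequence is written 5' to 3'); shorter sequences are returned unchanged."""
    if len(seq) <= keep:
        return seq
    return seq[-keep:]
```

If your Homework 3 sequences were saved in the opposite orientation, reverse-complement them first rather than keeping the left end, so that all sequences in a data set share the same orientation relative to their genes.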

Extra credit

Send mail to the instructor if you did not get the handout. Here are the links: