CSE 590TV, Homework 3

CSE 590TV: Computational Biology
Homework 3
January 22, 2003

(revised January 28, 2003)

Reading

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." Journal of Molecular Biology 215:403-410.
The first half (on gapped BLAST) of Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research 25:3389-3402.

Due February 4

In this assignment you will find 2 or 3 proteins from your prokaryote that each have very similar proteins in a number of other prokaryotes. The gene for each of the proteins you find (in your prokaryote and in the others) should have at least 50 bp of noncoding DNA upstream from its start codon. (DNA is "noncoding" if and only if neither strand is part of the coding region of any gene.) We will use the protein sequences and upstream noncoding sequences in a later assignment, so don't lose them. Here are some hints on how you might proceed (though you may discover better methods yourself):

Start on the http://www.ncbi.nlm.nih.gov/COG/ web page and look for your prokaryote. (A few of you will not find your prokaryote on this page. If so, pick any other prokaryote to start this process.) Remember the code letter shown for your prokaryote. Click on "Phylogenetic patterns search". Next to your code letter click the "Yes" button, and click "Search". What you are now looking at is a list of COGs (Clusters of Orthologous Groups) that each include a gene from your prokaryote. A COG is a set of similar genes from various prokaryotes.
Click on a COG ID that has a specific description (avoid the words "predicted", "putative", "uncharacterized") and involves a large number of species (which you can see by the code letters at the left). Find your code letter on the page that comes up and click on the link. This will bring you to a blastp output page for your prokaryote's protein P in this COG. What you would like to find here are matches to P from several (at least 5, preferably more) other prokaryotes, each with extremely low E-value and moderately high percent identity. If you do not find such matches, go back and pick another COG.
From the Protein table link for each prokaryotic genome you can immediately determine how much noncoding upstream DNA there is until the next gene upstream. Be careful, though: if in the protein table the gene is marked as on strand "-", then the gene is on the complementary DNA strand. In this case you have to look at the higher index of the gene's 2 location indices and measure the distance to the next higher gene (on either strand) to get the upstream sequence length. If you don't have at least 5 related proteins each with at least 50 bp of upstream noncoding DNA, go back and pick another COG. If you do, use either the "GenBank format" or "FASTA format" links from the Microbial Genomes page to extract the corresponding noncoding upstream DNA sequences. Again, if the gene was on the "-" strand you have to go to its higher index, extract the intergenic DNA from here to the next higher gene, and then compute its reverse complement. Store these noncoding sequences for a future assignment. In that assignment it is going to be very important that you've consistently extracted upstream sequence that is on the same strand as the coding sequence!
Get a global multiple alignment from CLUSTALW (http://www.cmbi.kun.nl/bioinf/tools/clustalw.shtml) for the set of highly related proteins you have found. ("Protein" means you should be aligning amino acid sequences, not DNA coding sequences.)
Here's what to turn in, for each of your 2-3 sets of highly related proteins:
- For each protein in the set, list the name of the organism, the protein ID, the given description of the protein, and the length of the noncoding upstream sequence.
- Give the multiple alignment produced by CLUSTALW.
- Describe the procedure by which you found the sets of related proteins, if different from my hints above. In particular, describe how you dealt with the frustration of having to repeatedly choose a new COG because something was wrong with earlier choices.
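The strand rule in the hints above is easy to get wrong by hand, so here is a minimal Python sketch of it. It is only an illustration under stated assumptions, not a required method: the `Gene` tuple, the function names, and the toy genome are all hypothetical, coordinates are assumed to be 1-based and inclusive with start < end (as in a typical protein table), and the genome is treated as linear, so a gene near the origin of a circular chromosome would need extra care.

```python
from typing import List, Tuple

# Hypothetical gene record: (low index, high index, strand), 1-based inclusive.
Gene = Tuple[int, int, str]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the string."""
    return dna.translate(_COMPLEMENT)[::-1]


def upstream_noncoding(genome: str, gene: Gene, genes: List[Gene]) -> str:
    """Return the noncoding DNA between `gene` and its nearest upstream
    neighbor (on either strand), read on the same strand as `gene`.
    Returns "" if a neighboring gene abuts or overlaps the start."""
    low, high, strand = gene
    if strand == "+":
        # Upstream lies below the low index: find where the nearest
        # lower gene (on either strand) ends.
        lo = max((e for s, e, _ in genes if s < low), default=0) + 1
        hi = low - 1
    else:
        # Upstream lies above the high index: find where the nearest
        # higher gene (on either strand) begins.
        lo = high + 1
        hi = min((s for s, e, _ in genes if e > high), default=len(genome) + 1) - 1
    if lo > hi:
        return ""
    region = genome[lo - 1:hi]  # convert 1-based inclusive to a Python slice
    # For a "-" gene, reverse complement so the result is on the coding strand.
    return region if strand == "+" else reverse_complement(region)


# Toy example: two genes separated by the 4 bp intergenic region "AAGT".
toy_genome = "CCCCAAGTGGGGTTTT"
plus_gene = (9, 12, "+")
minus_gene = (1, 4, "-")
print(upstream_noncoding(toy_genome, plus_gene, [(1, 4, "+"), plus_gene]))
print(upstream_noncoding(toy_genome, minus_gene, [minus_gene, (9, 12, "+")]))
```

The length of the returned string is the upstream noncoding length to compare against the 50 bp requirement. Note that the same intergenic region is returned as is for the "+" gene but reverse complemented for the "-" gene, which is exactly the consistency the later assignment depends on.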

Extra credit

If you implement a method that can automatically produce lots of high quality data sets rather than just the 2-3 data sets required, please let the instructor know. By "high quality" I mean that each set of proteins has a really clean multiple alignment (lots of perfectly conserved columns and few gaps), so that we are convinced they are the "corresponding" proteins from different organisms. Each of the genes must have at least 50 bp of noncoding DNA upstream from its start codon, which your program should extract correctly. In this extra credit portion, do not bias the search toward your own prokaryote: your program should be capable of simply producing (say) 100 high quality data sets, each involving an arbitrary subset of the prokaryotes. My aim is to produce lots of good data sets for the sort of analysis on the upstream DNA that we are going to do later.
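One way an automated method might screen for "really clean" alignments is to score each alignment by its fraction of perfectly conserved columns and its fraction of gap characters. The Python sketch below is one possible reading of that criterion, not a prescribed measure: the function name is hypothetical, the input is assumed to be the aligned rows already parsed out of the CLUSTALW output (equal-length strings with "-" for gaps), and any cutoffs you apply to the two numbers are your own choice.

```python
from typing import List, Tuple


def alignment_quality(rows: List[str]) -> Tuple[float, float]:
    """Return (fraction of perfectly conserved columns, fraction of gaps).

    A column counts as perfectly conserved only if it contains no gap
    and every sequence has the same residue there.
    """
    if not rows or len({len(r) for r in rows}) != 1 or len(rows[0]) == 0:
        raise ValueError("need at least one row, all of equal nonzero length")
    ncols = len(rows[0])
    conserved = sum(
        1 for col in zip(*rows) if "-" not in col and len(set(col)) == 1
    )
    gaps = sum(r.count("-") for r in rows)
    return conserved / ncols, gaps / (ncols * len(rows))


# Toy alignment of three short sequences (made up for illustration).
toy_rows = ["MKV-L", "MKVAL", "MRVAL"]
conserved_fraction, gap_fraction = alignment_quality(toy_rows)
print(conserved_fraction, gap_fraction)
```

In the toy alignment, columns 1, 3, and 5 are perfectly conserved and there is a single gap among 15 positions. A program could keep only candidate sets whose conserved fraction is high and whose gap fraction is low, and then check the 50 bp upstream requirement for each gene in the set.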

tompa@cs.washington.edu
