CSE 590TV: Computational Biology
Homework 3
January 22, 2003
(revised January 28, 2003)
Reading
-
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman,
D.J. (1990)
"Basic local alignment search tool." Journal of Molecular Biology
215:403-410.
- The first half (on gapped BLAST) of
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ. (1997)
"Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs". Nucleic Acids Research, 25:3389-3402.
Due February 4
In this assignment you will find 2 or 3 proteins from your prokaryote
that each have very similar proteins in a number of other prokaryotes.
The gene for each of the proteins you find (in your and the other
prokaryotes) should have at least 50 bp of noncoding DNA upstream from
its start codon. (DNA is "noncoding" if and only if neither strand is
part of the coding region of any gene.) We will use the protein
sequences and upstream noncoding sequences in a later assignment, so
don't lose them. Here are some hints on how you might proceed (though
you may discover better methods yourself):
-
Start on the http://www.ncbi.nlm.nih.gov/COG/
web page and look for your prokaryote. (A few of you will not find
your prokaryote on this page. If so, pick any other prokaryote to
start this process.) Note the code letter shown for your
prokaryote. Click on "Phylogenetic patterns search". Next to your code
letter click the "Yes" button, and click "Search". What you are now
looking at is a list of COGs (Clusters of Orthologous Groups) that each
include a gene from your prokaryote. A COG is a set of similar genes
from various prokaryotes.
- Click on a COG ID that has a
specific description (avoid the words "predicted", "putative",
"uncharacterized") and involves a large number of species (which you
can see by the code letters at the left). Find your code letter on
the page that comes up and click on the link. This will bring you to
a blastp output page for your prokaryote's protein P in this COG.
What you would like to find here are matches to P from several (at
least 5, preferably more) other prokaryotes, each with extremely low
E-value and moderately high percent identity. If you do not find such
matches, go back and pick another COG.
-
From the Protein table link for each prokaryotic genome you can
immediately determine how much noncoding upstream DNA there is until
the next gene upstream. Be careful, though: if the protein table
marks the gene as on strand "-", then the gene is on the
complementary DNA strand. In this case you have to look at the
higher of the gene's 2 location indices and measure the
distance to the next gene at a higher position (on either strand) to get
the upstream sequence length. If you don't have at least 5 related
proteins each with at least 50 bp of upstream noncoding DNA, go back
and pick another COG. If you do, use either the "GenBank format" or
"FASTA format" links from the Microbial Genomes page to extract the
corresponding noncoding upstream DNA sequences. Again, if the gene is on
the "-" strand you have to go to its higher index, extract the
intergenic DNA from there to the next gene at a higher position, and then
compute its reverse complement. Store these noncoding sequences for a
future assignment. In that assignment it is going to be very
important that you've consistently extracted upstream sequence that is
on the same strand as the coding sequence!
-
Get a global multiple alignment from CLUSTALW (http://www.cmbi.kun.nl/bioinf/tools/clustalw.shtml)
for the set of highly related proteins you have found. ("Protein"
means you should be aligning amino acid sequences, not DNA coding
sequences.)
-
Here's what to turn in, for each of your 2-3 sets of highly related proteins:
-
For each protein in the set,
list the name of the organism, the protein ID,
the given description of the protein, and
the length of the noncoding upstream sequence.
-
Give the multiple alignment produced by CLUSTALW.
-
Describe the procedure by which you found the sets of related
proteins, if different from my hints above. In particular, describe
how you dealt with the frustration of having to repeatedly choose a
new COG because something was wrong with earlier choices.
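The strand rule in the hints above (measure from the gene's lower index for "+" genes, from its higher index for "-" genes, and reverse-complement the "-" case) can be sketched in Python. This is only an illustration under stated assumptions: the (low, high, strand) tuples, the function names, and the 1-based inclusive coordinates are hypothetical conventions chosen for this sketch, not the format of any NCBI file, and wrap-around on a circular chromosome is ignored.

```python
from typing import List, Optional, Tuple

# Hypothetical gene record: (low index, high index, strand), 1-based inclusive.
Gene = Tuple[int, int, str]
Interval = Tuple[int, int]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (other characters, e.g. N, are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_interval(gene: Gene, genes: List[Gene],
                      genome_len: int) -> Optional[Interval]:
    """Noncoding interval directly upstream of `gene`, or None if there is none.

    Genes on either strand count as coding. Assumption: no wrap-around
    across the origin of a circular genome.
    """
    low, high, strand = gene
    others = [g for g in genes if g != gene]
    if strand == "+":
        # Upstream is toward lower coordinates: stop at the nearest gene end.
        left = max((g[1] for g in others if g[0] <= low), default=0)
        lo, hi = left + 1, low - 1
    else:
        # "-" strand: upstream is toward higher coordinates.
        right = min((g[0] for g in others if g[1] >= high),
                    default=genome_len + 1)
        lo, hi = high + 1, right - 1
    return (lo, hi) if lo <= hi else None


def extract_upstream(genome: str, gene: Gene, genes: List[Gene],
                     min_len: int = 50) -> Optional[str]:
    """Upstream noncoding DNA on the same strand as the gene, or None if
    fewer than `min_len` bp are available."""
    interval = upstream_interval(gene, genes, len(genome))
    if interval is None:
        return None
    lo, hi = interval
    if hi - lo + 1 < min_len:
        return None
    seq = genome[lo - 1:hi]  # convert 1-based inclusive to a Python slice
    return reverse_complement(seq) if gene[2] == "-" else seq
```

Returning the sequence already reverse-complemented for "-" genes means every stored upstream sequence reads on the same strand as its coding sequence, which is the consistency the hints ask for.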
Extra credit
If you implement a method that can automatically produce lots of high
quality data sets rather than just the 2-3 data sets required, please
let the instructor know. By "high quality" I mean that each set of
proteins has a really clean multiple alignment (lots of perfectly
conserved columns and few gaps), so that we are convinced they are the
"corresponding" proteins from different organisms. Each of the genes
must have at least 50 bp of noncoding DNA upstream from its start
codon, which your program should extract correctly. In this extra
credit portion, do not bias the search toward your own prokaryote:
your program should be capable of simply producing (say) 100 high
quality data sets, each involving an arbitrary subset of the prokaryotes.
My aim is to produce lots of good data sets for the sort of analysis
on the upstream DNA that we are going to do later.
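One possible way to make "really clean multiple alignment" measurable is to count perfectly conserved columns and gap characters in the CLUSTALW output. The Python sketch below is an assumption about how that could be done, not a prescribed method: the function names are hypothetical, the parser handles only the plain CLUSTAL text layout (a header line, then blocks of "name sequence" lines), and the thresholds in `is_clean` are arbitrary placeholders to be tuned.

```python
from typing import Dict, List, Tuple


def parse_clustal(text: str) -> Dict[str, str]:
    """Collect aligned sequences from CLUSTAL-format text, keyed by name."""
    seqs: Dict[str, str] = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the conservation-symbol lines
        # (which start with whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


def alignment_stats(rows: List[str]) -> Tuple[float, float]:
    """Return (fraction of perfectly conserved columns, fraction of gaps)."""
    if not rows or not rows[0]:
        raise ValueError("empty alignment")
    ncols = len(rows[0])
    if any(len(r) != ncols for r in rows):
        raise ValueError("aligned sequences must have equal length")
    conserved = 0
    for col in zip(*rows):
        # A column is perfectly conserved if every row has the same residue
        # and that residue is not a gap.
        if col[0] != "-" and all(c == col[0] for c in col):
            conserved += 1
    gaps = sum(r.count("-") for r in rows)
    return conserved / ncols, gaps / (ncols * len(rows))


def is_clean(rows: List[str], min_conserved: float = 0.3,
             max_gaps: float = 0.1) -> bool:
    """Placeholder quality filter; both thresholds are arbitrary assumptions."""
    conserved, gaps = alignment_stats(rows)
    return conserved >= min_conserved and gaps <= max_gaps
```

A program producing many data sets could run such a filter on each candidate alignment and keep only those that pass, alongside the 50 bp upstream requirement.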
tompa@cs.washington.edu