CSE 590TV: Computational Biology
Homework 1
January 8, 2003

Reading

  1. Read Molecular Biology for Computer Scientists, a primer by Lawrence Hunter.
  2. Spend as much time as you can exploring the NCBI web site at http://www.ncbi.nlm.nih.gov/ to get a good idea of the databases and tools available.

Due January 15

In this assignment you will collect some simple information and statistics on your very own prokaryotic genome, which was assigned to you in class. For most of your analyses, follow your prokaryote's ProtTable link from the NCBI Microbial Genomes web page (http://www.ncbi.nlm.nih.gov:80/PMGifs/Genomes/micr.html), and from there the three RefSeq links. Turn in the requested material as an attachment to an email message to the TA with the subject line "CSE 590TV, HW1".
  1. Give the full name of your organism and write a background paragraph on its biology. (For instance: How does it differ from most other prokaryotes? Where is it found? Does it cause disease? If so, what are its hosts?)
  2. Construct a histogram of the protein length distribution, measured in number of amino acids, of all proteins in the genome. Choose an appropriate bucket size for the histogram, say in the range 20-50 amino acids. Give the mean and standard deviation of this length distribution. Compute the percent of the genome that codes for proteins: multiply the mean protein length by 3 (each amino acid is encoded by 3 nucleotides), multiply by the number of genes, divide by the total length of the genome in base pairs, and express the result as a percentage.
  3. Construct a histogram of the intergenic distance distribution (the number of base pairs between the end of one gene and the beginning of the adjacent gene). Choose an appropriate bucket size, say 20-50 bp. Give the mean and standard deviation of this distribution.
  4. Construct a table showing, for each of the 20 amino acids, the number of times it occurs over all genes and its frequency (i.e., its number of occurrences divided by the total number over all 20 amino acids; these frequencies will add up to 1).
  5. Construct a table showing, for each of the 64 codons, the name of the encoded amino acid, the number of times the codon occurs (in the correct reading frame) over all genes, and the frequency of this codon as a fraction of all the synonymous codons (i.e., codons that encode the same amino acid). For instance, if you found 30,000 occurrences of UUU (in the correct reading frame) and 20,000 of UUC, the corresponding frequencies for the Phe codons would be 0.6 and 0.4. Give your table in the same format as Table 1.1 of the course notes, e.g., with codons beginning with UU in the upper left corner, and with rows and columns labeled by nucleotides as in that table. Do you see any explanation for the codon bias? For instance, is there some correlation between codon bias and nucleotide distribution over the whole genome? (You can find the nucleotide distribution in the "BASE COUNT" entry of the GenBank format for your genome.)
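
The calculations in item 2 can be sketched in Python as follows. This is only an illustration, not a required approach: the function names, the bucket size, and the sample numbers are all made up, and the choice of population (rather than sample) standard deviation is an assumption. The coding percentage follows the formula in item 2 as written, so stop codons are not counted.

```python
import math
from collections import Counter

def length_histogram(lengths, bucket_size=25):
    """Count proteins per bucket; each key is the bucket's lower bound."""
    counts = Counter((n // bucket_size) * bucket_size for n in lengths)
    return dict(sorted(counts.items()))

def mean_and_std(values):
    """Return the mean and the population standard deviation (an assumption)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def percent_coding(lengths, genome_length_bp):
    """Mean protein length * 3 * number of genes / genome length, as a percent."""
    mean, _ = mean_and_std(lengths)
    return 100.0 * mean * 3 * len(lengths) / genome_length_bp

# Made-up protein lengths (amino acids) and genome size (bp), for illustration.
lengths = [100, 120, 250, 300, 30]
hist = length_histogram(lengths, bucket_size=50)
mean, std = mean_and_std(lengths)
coding = percent_coding(lengths, genome_length_bp=4800)
print(hist, mean, round(std, 1), coding)
```

The same `length_histogram` and `mean_and_std` helpers would apply unchanged to any other list of non-negative integers.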

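For item 3, one possible way to compute intergenic distances is sketched below. It assumes each gene is given as a (start, end) pair of 1-based inclusive coordinates, which is an assumption about how you extract the data rather than a description of any particular NCBI file format. The function name and sample coordinates are hypothetical. Overlapping genes produce a negative distance here, and the gap that wraps around the end of a circular chromosome is ignored; you may decide to treat either case differently.

```python
def intergenic_distances(genes):
    """genes: list of (start, end) pairs, 1-based inclusive coordinates.

    Returns the number of base pairs between each gene's end and the
    start of the next gene in genome order (negative if they overlap).
    """
    ordered = sorted(genes)
    return [next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]

# Made-up gene coordinates, deliberately out of order, for illustration.
genes = [(500, 900), (1, 300), (901, 1200), (1190, 1500)]
distances = intergenic_distances(genes)
print(distances)
```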

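The counts and frequencies in items 4 and 5 can be sketched as below. The sketch hard-codes the standard genetic code; if your organism uses a different translation table (some Mycoplasma species, for instance, read UGA as Trp), the table would need adjusting. All function names and sample sequences are hypothetical, and the inputs are assumed to be coding sequences already in the correct reading frame and protein sequences in one-letter code.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = set(AMINO) - {"*"}  # the 20 amino acids, one-letter codes

def codon_usage(coding_sequences):
    """Count in-frame codons over all genes; accepts DNA (T) or RNA (U)."""
    counts = Counter()
    for seq in coding_sequences:
        rna = seq.upper().replace("T", "U")
        for i in range(0, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    return counts

def synonymous_fractions(counts):
    """Each codon's count divided by the total for its amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}

def amino_acid_frequencies(proteins):
    """Frequency of each amino acid over all proteins; values sum to 1."""
    counts = Counter(aa for p in proteins for aa in p.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# One made-up gene with 3 UUU and 2 UUC codons, mirroring the 0.6 / 0.4 example.
counts = codon_usage(["ATG" + "TTT" * 3 + "TTC" * 2 + "TAA"])
fractions = synonymous_fractions(counts)
aa_freq = amino_acid_frequencies(["MFFFFF"])
print(counts["UUU"], counts["UUC"], fractions["UUU"], fractions["UUC"])
```

Codons that never occur are simply absent from `counts`, so a complete 64-entry table would need to iterate over `CODON_TABLE` and treat missing codons as zero.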
tompa@cs.washington.edu