Homework #1: Prokaryotic Gene Prediction

CSE 427: Computational Biology
January 5, 2012
Due: Tuesday, January 17, 2012, 10:00 a.m.
Last update: 1/12, 4:30 p.m.

For your very own personal prokaryote (from Homework #0), collect the answers to the following questions. (You can easily find the answers to all but the last 2 of these questions in two places: for some of the answers, check the table entries at the NCBI Microbial Genome page and for other answers, click on the name of your prokaryote in that table. If any of the answers happen to be missing for your prokaryote, you may just omit them, unless you feel like trying to search for the answer using Google or something similar. While on your individual prokaryote's NCBI page, read the other bits of information that describe your prokaryote: this will give you a sense of intimacy and pride in your new adoption.)
1. Is it an archaeon (A) or a bacterium (B)?
2. Is it pathogenic? If so, what disease does it cause and in what host?
3. What is its habitat?
4. What is the size of its genome, measured in Mbp? (If there is more than one chromosome or there are plasmids, add their sizes together.)
5. What is its genome's GC content as a percent? (This is the percent of its genome that is either G or C. If there is more than one chromosome or there are plasmids, take the average of their GC contents weighted by their sizes to find the overall GC content. Round to the nearest percent.)
6. How many proteins does it have?
7. What is the average length of its proteins, measured in number of amino acids? Round to the nearest integer.
8. What percent of its genome is protein-coding? Round to the nearest percent.
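The weighted GC calculation in question 5 can be sketched as follows. This is only an illustration: the function name `overall_gc_percent` and the replicon sizes and GC values are made up, not taken from any real genome.

```python
def overall_gc_percent(replicons):
    """replicons: list of (size_bp, gc_percent) pairs, one per chromosome or plasmid."""
    total_bp = sum(size for size, _ in replicons)
    # Each replicon contributes in proportion to its size.
    weighted = sum(size * gc for size, gc in replicons)
    return round(weighted / total_bp)


# Hypothetical genome: one 4 Mbp chromosome at 50% GC, one 1 Mbp plasmid at 40% GC.
example = [(4_000_000, 50.0), (1_000_000, 40.0)]
print(overall_gc_percent(example))  # prints 48
```

Note that a plain (unweighted) average of the two values would give 45, which overstates the influence of the small plasmid.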
For the average length of its proteins, find your prokaryote on the NCBI Microbial FTP page and click on its link. Download all the .ptt files you find there. These are the protein tables for your prokaryote. (If there is more than one, that means your prokaryote's genome consists of more than just a single chromosome; use all these files. Each file corresponds to what is called a replicon, which is either a chromosome or a plasmid.) Each line gives important information about a different protein. The third column is the length of that protein, measured in number of amino acids. Once you have the average of all these lengths, it is a simple calculation to answer the last question: add 1 to the average (for the stop codon), multiply this by 3 to get the average number of nucleotides occurring in codons, multiply by the number of proteins, and take the ratio of this result to the total genome size in bp (not Mbp), expressed as a percent. (This calculation ignores the fact that some neighboring genes will have a small amount of overlap, but that's all right for this assignment.)
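A minimal sketch of this calculation is given here, under some assumptions. It assumes the .ptt files are tab-separated and that data lines (as opposed to header lines) are the ones whose first column looks like `start..end`. The names `protein_lengths` and `coding_percent` and the sample lines are hypothetical.

```python
import re

# Assumption: a data line's first column has the form "start..end".
LOCATION = re.compile(r"^\d+\.\.\d+$")


def protein_lengths(ptt_lines):
    """Return the third-column protein length from every data line."""
    lengths = []
    for line in ptt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and LOCATION.match(fields[0]):
            lengths.append(int(fields[2]))
    return lengths


def coding_percent(lengths, genome_size_bp):
    """Return (average protein length, percent of genome that is protein-coding)."""
    average = sum(lengths) / len(lengths)
    # Add 1 codon for the stop, 3 nucleotides per codon, times the number of proteins.
    coding_bp = (average + 1) * 3 * len(lengths)
    return round(average), round(100 * coding_bp / genome_size_bp)


# Made-up lines for illustration, not a real protein table.
sample = [
    "Some replicon description\n",
    "Location\tStrand\tLength\tPID\n",
    "10..309\t+\t99\t1\n",
    "400..1005\t-\t201\t2\n",
]
lengths = protein_lengths(sample)
print(coding_percent(lengths, 2000))  # prints (150, 45)
```

For a genome with several replicons, the lengths from all the .ptt files would be pooled into one list before calling `coding_percent`, and `genome_size_bp` would be the sum of the replicon sizes.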
Write a program to find all the long open reading frames in your prokaryote's genome. An open reading frame is a contiguous sequence of nucleotide triplets that (a) starts with the "start codon" ATG, (b) ends with a stop codon (TAA, TAG, or TGA in the standard genetic code), (c) does not contain any other stop codon that "respects the triplet boundaries", and (d) is as long as possible subject to the conditions (a)-(c). For example, ATGCTAACCTAA qualifies as an open reading frame: its codons would be ATG, CTA, ACC, TAA; the TAA starting at position 5 is not in the correct "reading frame" to be a stop codon, that is, it does not respect the codon boundaries. An open reading frame will be called "long" if it contains at least 125 codons, that is, its length is at least 375 bp. (You should include the stop codon in this count of 375 bp.) These long open reading frames are simple, but not too inaccurate, predictions of the organism's protein-coding genes.
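One possible way to implement conditions (a)-(d) on a single strand is sketched below; it is not the only approach. It assumes an uppercase sequence and the standard stop codons TAA, TAG, and TGA. The function name `long_orfs` and the `min_bp` parameter are hypothetical. Condition (d) is handled by remembering only the first ATG seen since the previous in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def long_orfs(seq, min_bp=375):
    """Return sorted (start, end) pairs, 1-based and inclusive, for one strand."""
    orfs = []
    for frame in range(3):
        start = None  # 0-based position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                # i + 3 - start is the ORF length including the stop codon.
                if start is not None and i + 3 - start >= min_bp:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


# The example from the text, with the threshold lowered so that it qualifies.
print(long_orfs("ATGCTAACCTAA", min_bp=12))  # prints [(1, 12)]
```

The TAA at position 5 is visited only in the second reading frame, where no ATG precedes it, so it does not interrupt the open reading frame found in the first frame.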
Notice that there are 6 possible "reading frames": the start codon can start at a position that is either 0, 1, or 2 modulo 3, and it can be on either of the two DNA strands. (You will be given the DNA sequence of only one of the strands, called the + strand. You have to infer the DNA sequence of the other strand, called the - strand: it is simply the reverse complement of the + strand.) It is even possible that two long open reading frames overlap each other, either on the same strand or on opposite strands.
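The - strand can be derived from the + strand in a couple of lines. The sketch below is one way to do it, assuming an uppercase sequence over A, C, G, T; any other characters (such as N) are left unchanged by this version. The name `reverse_complement` is just illustrative.

```python
# Translation table mapping each nucleotide to its base-pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every nucleotide, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCTAACCTAA"))  # prints TTAGGTTAGCAT
```

Applying the function twice returns the original sequence, which is a handy sanity check.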
Given a DNA sequence, your program should output the start and end index in that DNA sequence of each long open reading frame. The first character of the input DNA sequence has index 1, not index 0. Since every open reading frame has length a multiple of 3, the difference between the end and start indices will always be 2 modulo 3. (I'm trying to save you from "off by 1" errors.) For open reading frames on the - strand, give the start and end indices on the complementary + strand. That is, the start index will actually be the index of the nucleotide that is base-paired to the last nucleotide of the stop codon, and the end index is the index of the nucleotide that is base-paired to the first nucleotide of the start codon. (I know this is confusing, but it will get your output in the right format for the comparison to the .ptt files, described below.)
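The index conversion for the - strand can be sketched as follows. This assumes you found the open reading frame by scanning the reverse complement, so that you hold 1-based indices into the reverse-complemented sequence. The function name `to_plus_strand` and the toy sequence are hypothetical.

```python
def to_plus_strand(start, end, genome_length):
    """Map 1-based (start, end) on the reverse complement to + strand indices.

    Position p on the reverse complement pairs with position
    genome_length - p + 1 on the + strand, so the two ends swap roles.
    """
    return genome_length - end + 1, genome_length - start + 1


# Toy + strand of length 15. Its reverse complement is CATGCTAACCTAAGG,
# which contains the ORF ATGCTAACCTAA at positions 2 through 13.
plus = "CCTTAGGTTAGCATG"
start, end = to_plus_strand(2, 13, len(plus))
print(start, end)             # prints 3 14
print(plus[start - 1:end])    # prints TTAGGTTAGCAT
print((end - start) % 3)      # prints 2
```

The reported start index 3 is the nucleotide paired with the last nucleotide of the stop codon, and the end index 14 is paired with the first nucleotide of the start codon.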
Go back to the NCBI Microbial FTP page and download all the .fna files for your prokaryote. These are the DNA sequences corresponding to the protein tables that you downloaded earlier. Together, these .fna files represent the whole genome of your prokaryote. Run your program on all these .fna files to produce a list of all the long open reading frames. These are your gene predictions.
You are now going to check your program's gene predictions against the expert annotations. In the .ptt files you downloaded earlier, the first column shows the start and end positions of each gene and the second column shows the strand. Tabulate the following numbers:
- TP = the number of genes for which your program predicted both the start and end positions correctly (True Positives).
- sTP = the number of genes for which your program predicted the stop codon correctly but got the start codon wrong (semi-True Positives). It turns out to be harder to get the start codon right than the stop codon. When calculating sTP, remember that for genes on the - strand this means your program got the start index right and the end index wrong.
- FN = the number of genes in the .ptt files for which your program made no prediction having the same stop codon (False Negatives). Most of these will probably be shorter than 125 amino acids long.
- FP = the number of long open reading frames your program found that do not have the same stop codon as any gene in the .ptt files (False Positives). Most of these will probably be close to 125 amino acids long or will have substantial overlap with some other gene.
To normalize these numbers so that they are comparable across different genomes, let A be the total number of proteins listed in all your .ptt files, and let B be the total number of long open reading frames your program predicted in all .fna files. Calculate the following normalized statistics, rounded to 4 decimal places:
- Sn = TP / A (Sensitivity)
- sSn = (TP + sTP) / A (semi-Sensitivity)
- FOR = FN / A (False Omission Rate)
- PPV = TP / B (Positive Predictive Value)
- sPPV = (TP + sTP) / B (semi-Positive Predictive Value)
- FDR = FP / B (False Discovery Rate)
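A sketch of one way to tabulate these numbers follows. It assumes both the annotated genes and your predictions have already been read into lists of `(replicon, start, end, strand)` tuples; the names `stop_key` and `tabulate` and the sample data are made up. The key idea is to identify each gene by its stop codon position, which is the end index on the + strand and the start index on the - strand.

```python
def stop_key(replicon, start, end, strand):
    """Identify a gene by its stop codon: the end index on +, the start index on -."""
    return (replicon, strand, end if strand == "+" else start)


def tabulate(annotated, predicted):
    """Each argument is a list of (replicon, start, end, strand) tuples."""
    predicted_exact = set(predicted)
    predicted_stops = {stop_key(*p) for p in predicted}
    annotated_stops = {stop_key(*a) for a in annotated}

    tp = sum(1 for a in annotated if a in predicted_exact)
    stp = sum(1 for a in annotated
              if a not in predicted_exact and stop_key(*a) in predicted_stops)
    fn = sum(1 for a in annotated if stop_key(*a) not in predicted_stops)
    fp = sum(1 for p in predicted if stop_key(*p) not in annotated_stops)

    a_total, b_total = len(annotated), len(predicted)
    return {
        "TP": tp, "sTP": stp, "FN": fn, "FP": fp,
        "Sn": round(tp / a_total, 4),
        "sSn": round((tp + stp) / a_total, 4),
        "FOR": round(fn / a_total, 4),
        "PPV": round(tp / b_total, 4),
        "sPPV": round((tp + stp) / b_total, 4),
        "FDR": round(fp / b_total, 4),
    }


# Made-up data: one exact match, one - strand gene with the wrong start codon,
# one short annotated gene that is missed, and one spurious prediction.
annotated = [("R", 1, 399, "+"), ("R", 500, 901, "-"), ("R", 1000, 1200, "+")]
predicted = [("R", 1, 399, "+"), ("R", 500, 1000, "-"), ("R", 2000, 2500, "+")]
stats = tabulate(annotated, predicted)
print(stats["TP"], stats["sTP"], stats["FN"], stats["FP"])  # prints 1 1 1 1
```

Every annotated gene falls into exactly one of TP, sTP, or FN, so TP + sTP + FN should equal A; checking this is a cheap way to catch bookkeeping mistakes.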
Extra credit: Investigate what went wrong in your predictions and how you could improve the accuracy of your program.
- For the semi-true positives, how could you refine the definition of open reading frame in order to predict the start codon more accurately? For instance, in E. coli approximately 88% of the start codons are ATG, 10% GTG, and 2% TTG. Can you use these biases to improve the start codon predictions?
- Are there any false negatives whose lengths are at least 125 amino acids? If so, why are they not open reading frames?
- Do some of the false positives and false negatives overlap, so that some algorithmic change would have turned a false positive into a true positive?
- Do many of your false positives have a significant overlap with a true positive, so that you could possibly eliminate them by disallowing long overlaps? You would, however, need some way of predicting which is the true positive and which is the false positive.
- Do many of your semi-true positives have a significant overlap with a true positive or another semi-true positive, so that you could possibly correct the start codon prediction by disallowing long overlaps? You would, however, need some way of predicting which are the correct start codons.
- If you decrease the length threshold below 125, you can probably decrease FN, but investigate the corresponding increase in FP.
- Your ideas here.

Turn-in Instructions

If you experiment with any extra credit enhancements to your program, do not include those enhancements in the basic version you run for the turn-in. Instead, describe your extra credit ideas and any results in a separate file.

You will actually run your program as described in part (2) above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Mesorhizobium loti MAFF303099. For each of these individually, you will produce the list of long open reading frames, and the 10 statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR.

Your turn-in will consist of the following files, named as shown. The first line of each of the first 3 files should contain your name and the name of the prokaryote.

community.txt: the results of running on the community prokaryote. After the name of the prokaryote, the next 10 lines should be the statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR in that order, one number per line with no labels. The remaining lines provide the list of long open reading frames, one per line, sorted in increasing order of replicon ID, and within one replicon ID sorted in increasing order of start index. (The replicon ID is the portion of the .ptt file name that precedes the .ptt extension. For instance, for the community prokaryote, the file NC_002678.ptt corresponds to replicon ID NC_002678.) Each of these lines consists of 4 items separated by tabs: replicon ID, start index, end index, and strand (denoted by the single character + or -). For example, a line in the community prokaryote's output file might look like this:
```
NC_002678    13076    13744    +
```
personalORF.txt: the results of running on your personal prokaryote. This should just be the list of long open reading frames, in the same format described above.
personalStats.txt: This file will have 19 lines for your personal prokaryote. After the name of the prokaryote, the next 10 lines will be the statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR in that order, one number per line with no labels. The remaining 8 lines will be the answers to the questions in part (1) for your personal prokaryote, in the form illustrated in the example below. If you could not find the answer to a particular question, write "N/A" on that line.
Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
Any files describing extra credit work, whose filenames should begin with the prefix "extra".
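The sorting and tab-separated line format required for the lists of long open reading frames can be sketched as follows. The function name `format_orfs` is hypothetical, and apart from the NC_002678 line quoted above, the entries are made up for illustration.

```python
def format_orfs(orfs):
    """orfs: iterable of (replicon_id, start, end, strand) tuples.

    Returns tab-separated lines sorted by replicon ID, then by start index.
    """
    ordered = sorted(orfs, key=lambda orf: (orf[0], orf[1]))
    return ["\t".join((rid, str(start), str(end), strand))
            for rid, start, end, strand in ordered]


orfs = [
    ("REPLICON_B", 40, 480, "-"),      # made-up entry
    ("NC_002678", 13076, 13744, "+"),  # the example line from the text
    ("NC_002678", 2000, 2500, "-"),    # made-up entry
]
for line in format_orfs(orfs):
    print(line)
```

Sorting on the integer start index, rather than on its string form, avoids placing 13076 before 2000.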

Submit these files to the homework drop box at https://catalyst.uw.edu/collectit/dropbox/tompa/19168.

Example of format for part (1) answers.

B
Opportunistic pathogen in humans
Multiple habitats
5.16
46
4473
317
83