GENOME 559, Winter 2010, Homework 6

Introduction to Statistical and Computational Genomics
GENOME 559
Department of Genome Sciences
University of Washington School of Medicine

Homework 6

Due: Thursday, 2/25/10, at start of class.

Problems:
(50 points) Lecture notes show how to build weight matrix models (AKA Position Specific Scoring Matrices) based on nucleotide counts (or freqencies), even when background nucleotide frequencies are nonuniform.
Write a Python program ps6wmm.py to do this. Your input is a file name specified on the command line. For a motif of length n, this file will contain 4 lines, each with n+1 integers. The 4 lines correspond to counts of A, C, G, T, resp., in the i-th position of observed motif instances (1 <= i <= n) for the first n columns, and in the background (column n+1). For simplicity, no counts are ≤ 0 and each column total is 100. (Use 10 x log₂, rounded, as in the notes. See ">>>help(math.log)" for base 2 logarithms; ">>>help(round)" for rounding.) The function for reading a matrix from the last assignment may be useful. Write the code for converting frequencies to log-ratio scores as a function. Additionally, write a separate function to compute total- and per-column relative entropies, as shown in the notes (also log₂). Print the input matrix, the resulting weight matrix scores and the relative entropies, per column and in total.
Apply your code to the "TATA Box" frequencies given in the notes (tweaked to avoid the zero in column 6), assuming uniform background frequencies; your scores should reproduce the weight matrix given in the notes. Use the file ps6-tata-unif.txt as your input. E.g., the output from my version is:
    Weight matrix scores:
      -36    19     1    12    10   -46 
      -15   -36    -8    -9    -3   -31 
      -13   -46    -6    -7    -9   -46 
       17   -31     8    -9    -6    19 
Rel Entropies: 
     0.97  1.59  0.15  0.38  0.22  1.65, Total =  4.95 
(NB: these data are old, incomplete; fine as an example, but don't use for a real study.)
Which column (1..6) of the WMM is least informative, i.e., expected to contribute least to discriminating between TATA boxes and background sequences? Which is most informative? If you scored 1000 randomly selected E. coli TATA boxes according to this WMM, what would you expect their average score to be (approximately)? Explain all 3 answers.

The uniform background assumption is very inaccurate for some organisms. E.g., the macronuclear genome of Tetrahymena thermophila is about 74% AT. Repeat (b) and (c), assuming this background frequency. Use the file ps6-tata-Tt.txt as your input. E.coli TATA box frequencies are not likely to be correct for T. thermophila, but if they were, scoring should change accordingly. Based on your intuition and your two score tables, do you think the WMM would be more or less accurate at finding these motifs in an AT-rich genome? Explain why.

Challenge Problem(s): Generalize your code to drop the assumption that column sums are 100. Instead, divide counts by column totals to get frequencies. Add a "pseudocount" of 1 to each entry before converting counts to frequencies. Implement the "scanning" algorithm, i.e., given a WMM and a nucleotide sequence, compute the score for each position (except the few at the end) of the sequence. Try it on E. coli. Can you find annotated transcription start sites against which to compare; how well does it work?
(10 points) "Scores" in many bioinformatic applications are quite arbitrary, and can be difficult to interpret, other than a general sense of "bigger is better." In contrast, the scores produced from a weight matrix model have a very precise probabilistic interpretation, under the assumption that "sites" versus "nonsites" arise by independent selection of nucleotides from successive columns of the foreground versus background models.

Under these assumptions, what does a score of "+90" mean? (Again assume scores are 10 x log₂, rounded. Hint: "Likelihood Ratio Test.")

Usual rules for "preponderance of scientific evidence" suggest that you should have at least 95% confidence before rejecting the null hypothesis that "this is just random DNA, not a TATA box" (or similar motif). What score does that correspond to? (Again, assume 10 x log₂.)

Challenge Problem: Many researchers would demand a higher score than that before being confident that a functional motif instance has been identified. Give one or two reasons why such a higher standard is prudent.

(10 points) Write a Python program ps6tab.py to read and write a tab-delimited file, where the output equals the input, except that columns 2 and 4 have been interchanged. Use the RegExp "substitute" function to accomplish the column swap. It is ok to assume that each line has 4 columns of data (but may or may not have more). Print your result on this test file ps6-colswap.txt.

(30 points) Go to the NCBI Genome page, click "Genome Projects" under "Microbial" in the left column, scroll to find "Methanocaldococcus jannaschii DSM 2661" and click its name. Read the brief description of it. Follow the Refseq FTP link and download two files: NC_000909.gbk and NC_000909.rnt. The .rnt file ("RNA Table") is a succinct summary of non-coding RNAs in this genome (in this case, limited to tRNAs and rRNAs). Your task is to reproduce part of the .rnt data by directly extracting it from the .gbk ("Genbank") file. Specifically, write a Python program ps6gbk.py that reads the .gbk file and prints one line for each tRNA, giving

its genomic coordinates,

length (in nucleotides), and

strand.

You must use the Python re module to identify the relevant lines of the .gbk file. I recommend using the re module's "parenthesized groups" facilities to extract coodinates, but this is not required; use Python's string functions if you prefer. Be sure to find tRNAs on both + and - strands. Check your results against the .rnt file.
Planning and Incremental Development: No one writes big programs from top to bottom correctly in one pass. Instead, build incrementally. Here's a suggestion. (Think about this for your other programs as well.)

Look at the .gbk and .rnt files to see what you're up against.

Write a program that takes a file name as its argument and prints the first (say) 10 or 20 lines of that file.

Modify it to print the first 10 or 20 lines of the file that match a specified RegExp pattern.

Play with the pattern. Absent a detailed, lawyerly specification of your input, one of the main issues with "parsing" a text input file (with or without RegExps) is to define patterns that are tight enough to exclude extraneous junk, yet loose enough to give you all matches of interest, including plausible variants of examples you've looked at closely. (Spaces or tabs? Always 42 of them? Does case matter? Are leading zeros allowed on numbers? Etc., etc.) For this problem, r'tRNA' is an obvious starting point, but you'll find it too loose. Tighten it until you've homed in on the lines defining tRNA genes. Compare to the .rnt file to make sure.

Modify your pattern to capture the genomic coordinates of each hit as "parenthesized groups".

Print coordinates, length and strand rather than the actual line.

Declare Victory. Celebrate.

Challenge Problem(s): Find rRNAs as well. For each tRNA, identify its associated amino acid. (Perhaps look in the Python docs for information about iterators and the "next" function.) Reproduce the .rnt file from information in the .gbk file (which is probably how NCBI generates .rnt files...). Extract the nucleotide sequence of each tRNA. What is the average G-C content of tRNAs versus the genome as a whole? Is this a surprise? Care to speculate about why?