Homework #4: Protein Homology Search Using a Profile HMM

CSE 427: Computational Biology
February 16, 2012
due: Friday, March 9, 2012, 10:00 p.m.
Last update: 3/3

Write a program that trains a profile hidden Markov model (HMM) M from a multiple protein alignment, and then finds the protein in a given prokaryote that has the highest scoring Viterbi score with respect to M.

You will use the multiple alignment A of 20 proteins from Homework #2 results to train the profile HMM M. (See Figure 5.2 of Durbin et al. for the topology of a profile HMM.) Rather than running the Viterbi algorithm on your own personal prokaryote and the community prokaryote (for which most of you already found a great match in Homework #2), you will all run it on the two personal prokaryotes for which the match scores were low in homework 2, Haloquadratum walsbyi DSM 16790 and Nitrosopumilus maritimus SCM1, to see if these prokaryotes have any better matches to this protein family. It will be exciting to see whether this HMM approach can find anything better than the Homework #2 approach.

Every column of this alignment A that has 10 or fewer gap characters will correspond to a Match state in the HMM, and those sequences with a gap in this column will take the path through the corresponding Delete state. Every column that has 11 or more gap characters will not correspond to any Match state, and those sequences with an amino acid in this column will take the path through the corresponding Insert state.

The Viterbi score you will be using to score each of your prokaryote's proteins S is the log likelihood ratio of the Viterbi probability of S and the background probability of S. See Equations (5.1) of Durbin et al. for the recurrences to use. You will need to modify this set of recurrences slightly: you need to add a recurrence with V_End(i) on the left hand side (hint: the End state has no emissions, so this recurrence looks very much like the recurrence for the Delete states) and you need to modify the recurrences for the states I₀, M₁, and D₁ (hint: these each have only two inputs, one of which is the Begin state). The basis for the recurrence is V_Begin(0) = 0, V_k(0) = -∞ if k is any Match, Insert, or End state, and V_Begin(i) = -∞ for i > 0. All your logarithms should be computed base 2.

As in Homework #2, you can find a list of all the prokaryote's protein sequences in the .faa file(s) on the prokaryote's NCBI Microbial FTP page. Compute the background frequency q_b of each of the 20 amino acids b from these files. The word "frequency" here means the fraction of amino acids that are b in the file(s); that is, the 20 values of q_b should sum to 1. (Throughout this homework, treat any letter in the .faa files that does not represent one of the 20 amino acids as though it were an A.) You will need these frequencies for the recurrences and for the adjustments described in the next paragraph.

Because there are only 20 characters in each column of the alignment, when training the HMM some emission probabilities would be set to 0, but this is actually an artifact of the small sample size of 20. Instead of using the actual counts E_k(b) of the number of times b was emitted in state k, use the adjusted counts E_k(b) + q_b (in both numerator and denominator) when computing the emission probability e_k(b). This way, e_k(b) will be a small positive number even if E_k(b) = 0. Similarly, use A_jk + Δ_jk rather than A_jk (in both numerator and denominator) when computing the transition probability a_jk, where Δ_jk = 0.9 if j and k are both the same type of state (that is, both Match states, or both Insert states, or both Delete states) and Δ_jk = 0.05 otherwise. If j is the Begin state, use Δ_jk = 0.1 if k is the Match state and Δ_jk = 0.45 otherwise. If j is the last Delete state, use Δ_jk = 0.9 if k is the End state and Δ_jk = 0.1 otherwise. If j is the last Match or last Insert state, use Δ_jk = 0.1 if k is the End state and Δ_jk = 0.9 otherwise.

Finally, for the protein S* that has the best log likelihood ratio score, determine its most probable path through the HMM M by doing the traceback of the Viterbi algorithm. From this path, align S* to the 20 other proteins in the input alignment.

Here is an outline of your entire program, to help you get started:

Training:

Read the multiple sequence alignment A into an array.
Mark the columns of A (the columns with 10 or fewer gap characters) that correspond to match states of M.
For each sequence R in A, follow the path of R through M determined by step 2, incrementing appropriate counts of emissions and transitions.
Compute background frequencies from the .faa file(s).
Compute all emission and transition probabilities.

Testing:

For each sequence S in the .faa file(s), use the recurrences of Equations (5.1) from Durbin et al. to compute score(S) = V_End(|S|).
Identify the S* that has the greatest value of score(S).
Trace the Viterbi maximizations backwards to find the most probable path for S* and the best alignment of S* with A. (Read the first extra credit problem below for more important details.)

Extra credit:

Let P be the maximum probability path that your best protein S* takes through the HMM, obtained from the Viterbi traceback. Suppose that P goes through some Insert state multiple times and that a few other sequences in the original multiple alignment go through the same Insert state multiple times. Then the traceback gives you no information about how to align this insertion in S* with the insertions in the other sequences. Use a separate multiple sequence alignment algorithm just on the corresponding insertions to find good gap placements. You may use some multiple alignment program available on the internet for this purpose if you like. (For the generic HW4 solution it is fine to make any arbitrary decision about gap placement within corresponding inserted sequences.)
Use the Forward algorithm instead of the Viterbi algorithm and compare the results. In the LLR-transformed Forward algorithm you are faced with the logarithm of a summation, which is not simple to handle.
Your ideas here.

Turn-in Instructions

If you experiment with any extra credit enhancements to your program, do not include those enhancements in the basic version you run for the turn-in. Instead, describe your extra credit ideas and any results in separate files.

You will actually run your program as described above on two separate genomes, Haloquadratum walsbyi DSM 16790 and Nitrosopumilus maritimus SCM1. (Be sure to use the background frequencies from the appropriate prokaryote to train and test the HMM.) For each of these two prokaryotes, you will report the protein that has the best log likelihood ratio score. Turn in the following items:

The protein identifier and name for the best-scoring protein S*, exactly as given in the .faa file, formatted on a single line as in this example:
```
>gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]
```
A very brief description of the function of S*, formatted on a single line. A good place to start is the PROSITE database, entering the protein name in PROSITE's Search box.
The log likelihood ratio score for S*, just the number, unrounded and with no additional text.
The most probable (Viterbi) path for S*, given as the sequence of state names on that path. Name your states M_j, I_j, and D_j, where j is the column number in the profile HMM, and use commas to separate the states. For example, a path might look like this:
```
M_0,M_1,I_1,I_1,I_1,M_2,D_3,D_4,M_5,...,M_601
```
M_0 is the Begin state and M_601 is the End state if there are (in this example) 600 columns with 10 or fewer gap characters in the alignment. If the path goes through an Insert state I_j x times, then I_j should be repeated x times in your path. The path should be on a single line of your file.
Your explanation for why S* is the same protein or a different protein than the best one found in Homework #2 (YP_657160.1 for Haloquadratum walsbyi DSM 16790 and YP_001583126.1 for Nitrosopumilus maritimus SCM1). Note that Kris verified that these are indeed the best matches using the Homework #2 method, so your explanation cannot be "the Homework #2 result was wrong". This explanation should all be on one line of your file.
The multiple alignment of the original 20 proteins plus S* as determined by the Viterbi algorithm, formatted as in the example, but without the first 4 lines and the last annotation line.

Your turn-in will consist of the following files, named as shown.

H.walsbyi.txt, containing the results for Haloquadratum walsbyi DSM 16790:
[your name]
[one blank line]
[protein ID and protein name for S*]
[brief description of protein S*]
[log likelihood ratio score of S*]
[Viterbi path for S*]
[comparison to Homework #2 and explanation]
[original 20-protein multiple alignment aligned with S*]
N.maritimus.txt, containing the results for Nitrosopumilus maritimus SCM1, in the same format as H.walsbyi.txt.
Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box at https://catalyst.uw.edu/collectit/dropbox/tompa/19168.