Homework #2: Protein Similarity Search

CSE 427: Computational Biology
January 17, 2012
due: Sunday, January 29, 2012, 10:00 p.m.
Last update: 1/23

Write a program that takes a protein sequence Q as input and finds the protein S in your very own personal prokaryote (from Homework #0) that has the highest scoring local alignment with Q. Use the algorithm from Section 5.1 of the CSE 527 lecture notes for each local alignment.

Use the BLOSUM62 scoring matrix to score any pair of amino acids. Use the * row and column to score any * character that appears in any of your protein sequences, and the X row and column to score any other character that is not one of the 20 amino acids. Use a score of -4 for each insertion or deletion (i.e., the alignment of the gap character with an amino acid).

You can find a list of all your prokaryote's protein sequences in the .faa file(s) on the prokaryote's NCBI Microbial FTP page. For the query protein Q, use the 573 amino acid sequence

MLRLPTVFRQMRPVSRVLAPHLTRAYAKDVKFGADARALMLQGVDLLADAVAVTMGPKGR
TVIIEQSWGSPKVTKDGVTVAKSIDLKDKYKNIGAKLVQDVANNTNEEAGDGTTTATVLA
RSIAKEGFEKISKGANPVEIRRGVMLAVDAVIAELKKQSKPVTTPEEIAQVATISANGDK
EIGNIISDAMKKVGRKGVITVKDGKTLNDELEIIEGMKFDRGYISPYFINTSKGQKCEFQ
DAYVLLSEKKISSIQSIVPALEIANAHRKPLVIIAEDVDGEALSTLVLNRLKVGLQVVAV
KAPGFGDNRKNQLKDMAIATGGAVFGEEGLTLNLEDVQPHDLGKVGEVIVTKDDAMLLKG
KGDKAQIEKRIQEIIEQLDVTTSEYEKEKLNERLAKLSDGVAVLKVGGTSDVEVNEKKDR
VTDALNATRAAVEEGIVLGGGCALLRCIPALDSLTPANEDQKIGIEIIKRTLKIPAMTIA
KNAGVEGSLIVEKIMQSSSEVGYDAMAGDFVNMVEKGIIDPTKVVRTALLDAAGVASLLT
TAEVVVTEIPKEEKDPGMGAMGGMGGGMGGGMF

This is actually a human protein sequence; one of the goals of this homework is to see how close a match there is to certain human proteins in your prokaryote, despite how distant humans are evolutionarily from prokaryotes.

It will be somewhat time-consuming to run the pairwise alignment algorithm for each of your prokaryote's protein sequences. Since your prokaryote has thousands of genes, I estimate that the total number of dynamic programming matrix entries you will need to compute is in the hundreds of millions, so be careful about efficiency. One thing you should do to save time is only compute the score of the optimal local alignment, not reconstruct the alignment itself except once at the very end for the single best-scoring protein S that you found. If there is more than one optimal alignment, you should pick one of them arbitrarily to output.

Extra credit:

Devise a way to preprocess all the protein sequences of your prokaryote so that it is then much faster, given a new query protein Q, to find the protein S most similar to Q among all your prokaryote's protein sequences.
Investigate the effect of using an affine gap penalty function on the alignment produced. How did you choose appropriate parameters for your gap penalty function?
Your ideas here.

Turn-in Instructions

If you experiment with any extra credit enhancements to your program, do not include those enhancements in the basic version you run for the turn-in. Instead, describe your extra credit ideas and any results in a separate file.

You will actually run your program as described above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Mesorhizobium loti MAFF303099. For each of these individually, you will report the protein S that has the best local alignment with Q. Turn in the following items:

The approximate elapsed time (in seconds) it took to run your program on all the prokaryote's protein sequences.

The protein identifier and name for the protein S given in the .faa file, e.g.,

>gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]

A very brief description of the function of this protein. A good place to start is the PROSITE database, entering the protein name in PROSITE's Search box.
The score of the optimal local alignment of Q and S.
An optimal local alignment of Q and S. Present your alignment in the format of the following example:
```
Q  359 NSIPPDEVIPIGAAIEAGILIGKENLLVEDSLMIECSARDILVKGVDESGASRFTVLFPS  418
        S+ PDEV+ IGAAI+ G+L G     V D L+++ +    L  G++  G    T L  +
S  356 KSVNPDEVVAIGAAIQGGVLKGD----VTDVLLLDVTP---LSLGIETLGGV-MTKLIDA  407

Q  419 GTPLPARRQHTLQAPGSISSVCLELYESDGKNSAKEETK-FAQVVLQDLDKKENGLRDIL  477
        T +P R+Q      G  +   +E++   G+     + K   +  L D+     G+  I 
S  408 NTTIPTRKQEVFSTAGD-NQTSVEVHVLQGERPMATDNKTLGRFHLGDIPPSPRGIPQIE  466
```
If the two aligned amino acids are identical, the middle line repeats that amino acid. If the BLOSUM62 scoring matrix gives a positive score to two nonidentical aligned amino acids, the middle line shows a "+". The indices at the beginning and end of each sequence show the start and end indices of the (locally) aligned sequences within their full protein sequences. If the alignment is too long to fit on one line, break it into multiple lines.

Your turn-in will consist of the following files, named as shown.

match.txt:

[your name]

Mesorhizobium loti
[running time on Mesorhizobium loti]
[protein ID and protein name for S]
[brief description of protein S]
[score of optimal local alignment of Q and S]
[optimal local alignment of Q and S]

[name of your personal prokaryote]
[running time on your personal prokaryote]
[protein ID and protein name for S]
[brief description of protein S]
[score of optimal local alignment of Q and S]
[optimal local alignment of Q and S]

Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box at https://catalyst.uw.edu/collectit/dropbox/tompa/19168.