GENOME 559, Winter 2010, Problem Set 7

Introduction to Statistical and Computational Genomics
GENOME 559
Department of Genome Sciences
University of Washington School of Medicine

Problem Set 7

Due: At the start of class, Tuesday, 3/9.

Electronic Turnin: is fine, but please use this Catalyst Drop Box in preference to email. (Email sometimes mangles Python indentation.) For electronic turnin, please name your solution ps7.py. If you do any of the challenge problems, please mark those parts of the code clearly, or put them in separate file(s) with "-challenge" added to the file name, e.g., ps7-challenge42.py. Also include any necessary Python module files you might have chosen to create.

Problems:
In this problem, you'll calculate the "Sum of Pairs" score for a multiple sequence alignment. As shown in the lecture notes, the SP score for an alignment is the sum of the SP scores for its columns, and the SP score for a column is the sum of the scores for all pairs of rows in the column. E.g., the SP score, based on BLOSUM62 scores for pairs, for a column containing (A,A,-,R) is +2: +4 for the single A-A pair, -1 for each of the two A-R pairs (and nothing for the 3 pairs involving '-'; yes, I'm ignoring gaps). The problem is broken into several parts. These may be done independently, and if stumped by one, moving to the next is fine, but I suspect the order below will be easiest for most people.
(20 points) Write a function readPhylip(filename) that, given a filename as its one parameter, reads a multiple sequence alignment from that file, assumed to be presented in (a restricted case of) Phylip format. In this format, the first line of the file has 2 white-space-separated integers m and n. Each of the subsequent m lines contains a sequence identifier of exactly 10 characters, immediately followed by exactly n sequence characters or '-'s (meaning alignment gaps). ["Restricted case" means that you need not worry about whitespace in the sequences, and sequences fit entirely on the same line as the sequence identifier.] A careful program would check that m and n on the header line match the rest of the file, but you need not do this; just ignore the header and grab the rest. (This is similar, but not identical, to the format shown in PS#3, challenge problem 9; if you did that problem, you may be able to reuse some of your code.) The value returned by your function should be a pair (2-tuple) of lists, one containing the m sequence identifiers, each a string of length 10, and the other contining a list of m sequences, each of length n. The two lists should be in the same order as the lines in the input file. Use ps7_hem6.txt as a test input.
Challenge Problem: Generalize this to handle multi-line Phylip sequences with embedded whitespace, in either "sequential format" or "interleaved format" (see link above; sequential is probably easier).

(15 points) Write another function printPhylip(msa) to print a multiple sequence alignment. Its input parameter should be a pair in the format returned by readPhylip(), and it should format its output as shown in the example below.
Challenge problem: add a maximum line length parameter, and print multiline output in the analog of the interleaved format, but with columns spaced as in the example below.
(15 points) Download the BLOSUM62 amino acid substitution matrix from ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 and save it as a file named BLOSUM62.txt. Write a function readSubMatrix(filename) to read such a matrix from the file named by its parameter. Note that as a white-space-separated file, the first row has one less column, since the first column has no item in the header row.
Your function should return a dictionary mapping amino acid pairs to integer scores. E.g., if b62 is the dictionary returned, then b62['AG'] gives the score for an alanine/glycine substitution.
Challenge problem: We expect the row and column labels to be in the same order, and that the scores are symmetric, i.e., score[i][j] = score[j][i]. Write code to verify this, printing a warning if not.

(5 points) Write a function extractColumn(i, msa) that extracts the i^th column from the multiple sequence alignment msa (as returned by readPhylip). Returned value should be a list of characters (including dashes, if present).
(10 points) Write a function SP1(column, score_map) that returns the integer-valued sum-of-pairs score for one "column" (a list of single characters, as produced by extractColumn, scored according to score_map, which is a dictionary like that returned by readSubMatrix.
Tip: A nested loop like
    for i in range(99):
        for j in range(99):
            do something with i & j
will, for example, visit the pair {1,2} twice; once with i=1, j=2 and again with i=2, j=1. It also includes the "pair" i=1, j=1. Sometimes that's just what you want, but not in this case; you'd be double-counting. To get each pair only once, use for j in range(i+1,99): for the inner loop.
(5 points) Write a function SPn(msa, score_map) to calculate the SP score for each column of an alignment. Returns a list of scores.
(15 points) Print the alignment, the score of each column, and the total score. Run it on ps7_hem6.txt, linked above. E.g., a portion of my answer is:
                 1   2   3   4   5   6   7   8   9  10  ...  35  36  37  38  39  40  
    P68871homo   -   V   H   L   T   P   E   E   K   S  ...   L   -   V   Y   P   W 
    P68873chmp   -   V   H   L   T   P   E   E   K   S  ...   L   -   I   Y   P   W 
    P02088mous   -   V   H   L   T   D   A   E   K   A  ...   L   I   V   Y   P   W 
    P02112chic   -   V   H   W   T   A   E   E   K   Q  ...   L   I   V   Y   P   W 
    Q802A3fugu   M   V   E   W   T   D   Q   E   R   T  ...   L   I   V   Y   P   W 
    Q540F0cowp   M   V   A   F   S   D   K   Q   E   G  ...   L   I   V   A   P   A 
    Score:       5  60  39  13  55  11  20  60  42   2  ...  60  24  55  60 105  95 
    Total: 1340
Challenge Problem: Create a Python class to hold a substitution matrix. Its constructor (__init__ method) should build the object by reading a file, as above. Your class definition should include two other methods, one returning the score for a single pair of amino acids, the other returning the SP score for a column of residues in a multiple alignment. Use your class to score the given alignment, as in a-g. For extra glory, make a class for alignments, too, perhaps with index/slice-like operations to enable you to extract columns or subalignments, print alignments, etc.
(15 points) The alignment in ps7_hem6.txt is from a ClustalW alignment of 5 vertebrate hemoglobins and a plant leghaemoglobin, except that I mucked around with the sequences and alignment in the last 5 columns. (SwissProt accession numbers are part of the "species" identifiers in the file, if you're curious about the real deal.) The alignment would get a better SP score if the "-V/-I" in columns 36-37 of the human/chimp alignments were changed to "-V/I-", hence aligning the chimp I to I's in the other 4 species. Explain why a progressive alignment method is unlikely to construct the later alignment.