Here is an outline of the process you will
automate for phase 1 of the project:
1. Pick a gene g and a bacterium b containing g.
2. Get b's accession number.
3. Look in b's protein table to find the pid of g's protein product.
4. Run the Perl script to download the BLAST result and gather the first x hits.
   This will produce x pairs of pid p_i and species name s_i, one for
   each BLAST hit.
Repeat steps 5 - 10 for each i:
5. Use the accession_species table to look up the accession number a_i
   corresponding to s_i.
6. Look in [a_i]_protein_table for pid = p_i.
7. Find the genomic indices of the upstream region.
8. Look up the genome.
9. Extract the upstream region and set it aside.
10. Look up the amino acid sequence (in [a_i]_amino_acid_table) and set it aside.
11. Pass the set of amino acid sequences to ClustalW to produce phylogeny T.
12. Pass T, the set of upstream regions, and appropriate parameters to
    FootPrinter.
13. Present the results.
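
The upstream-region steps (finding the genomic indices and extracting the region) can be sketched as follows. This is only an illustration under stated assumptions, not the required implementation: it assumes the protein table gives 1-based inclusive start and end coordinates plus a strand of "+" or "-", that the genome is available as a plain string, and that the genome can be treated as linear (no wrap-around for circular chromosomes). The function names and the `length` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def upstream_region(genome, start, end, strand, length=300):
    """Return up to `length` bases immediately upstream of a gene.

    Assumptions (not taken from the project description):
      - start and end are 1-based, inclusive, with start <= end
      - strand is "+" or "-"
      - the genome is treated as linear, so regions are clipped at the ends
    """
    if strand == "+":
        # Upstream of a forward-strand gene lies just before its start.
        lo = max(0, start - 1 - length)
        return genome[lo:start - 1]
    if strand == "-":
        # Upstream of a reverse-strand gene lies just after its end,
        # read on the opposite strand.
        hi = min(len(genome), end + length)
        return reverse_complement(genome[end:hi])
    raise ValueError("strand must be '+' or '-'")
```

For example, with the toy genome "AAACCCGGGTTT" and a gene at positions 4 to 6, a 3-base upstream region is "AAA" on the "+" strand and the reverse complement of "GGG" on the "-" strand.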
Eventually you will want to
- iterate over different choices of parameters to FootPrinter, choosing
  the "most interesting" results, and
- iterate over different starting genes, perhaps reporting only those
  that terminated in the "most interesting" FootPrinter results.
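
One way to organize the parameter iteration is a simple grid sweep, sketched below. Everything named here is an assumption made for illustration: `run` stands for whatever function you write to invoke FootPrinter with a given parameter dictionary, `score` stands for your own definition of "most interesting", and the parameter names in the usage example are placeholders rather than actual FootPrinter options.

```python
from itertools import product


def sweep(run, score, grid):
    """Try every combination of parameter values and rank the results.

    run   -- hypothetical callable: takes a dict of parameters, returns a result
    score -- hypothetical callable: takes a result, returns a number
             (higher means "more interesting")
    grid  -- dict mapping parameter name to a list of values to try
    """
    names = sorted(grid)
    ranked = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        ranked.append((score(run(params)), params))
    # Best-scoring parameter choices first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

A call might look like `sweep(run_footprinter, count_motifs, {"motif_size": [8, 10], "max_mutations": [0, 1]})`, where both function names and both parameter names are made up. The same loop structure can be wrapped in an outer loop over starting genes.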