So I didn't do any implementation on any of this since, well, I'm super busy with non-school things (as is typical). But since I spent way, way too long on Sunday finding that data by hand, I did have a few thoughts about not doing it by hand. So below are several ways this could be done in a non-lame way. :)

1) Write a little HTML parser and run the suggested method via a set of scripts. Hopefully do something vaguely smart and cache the protein table pages to save on download time. Then just churn through all the COGs looking for the best data. (There's a rough sketch of this at the end of the post.)

Pros: Well, minimal effort in terms of logic or thinking.

Cons: Would probably take a while; parsing HTML is actually much harder than one would think, even for Perl masters such as myself. Lots of redundant work is done by the servers, and we still probably aren't getting the best data we could.

2) If they had an XML feed, we could write some scripts to do the suggested method very easily. Just run through all the COGs and output the best data.

Pros: Super easy development, no logic to work out (since that is given in the suggested method), and dealing with XML feeds is really simple.

Cons: Again, it is going to take a long time, and we probably aren't going to get the absolute 'best' data.

3) Have the back-end implement protein filtering based on user-entered rules about proteins. Specifically, allow the user to specify a minimum non-coding upstream sequence length. Then do either the suggested method or a simpler version of 1 or 2 to get the data. The COG query would return only proteins that got past the protein filter.

Pros: Less searching needs to be done, since we don't even have to look at COGs that have fewer than five species (and based on my running into a number of COGs where most of the upstream non-coding sequences were very small or negative, I think it would make it easy to trim lots of data). Should also be very easy for the back-end to implement.

Cons: We don't really have access to the back-end. I mean, I guess we could download BLAST, but I don't know if we can download all the CGIs associated with the website we are using or what. And we still probably wouldn't get the 'best' data, since we are still only looking at COGs.

4) Run BLAST on all proteins vs. all proteins to form a large DB (relational, flat file, whatever) and turn the problem into a lookup/filtering problem. If you have 100,000 proteins, you could imagine a file for each protein where each line represents the BLAST score against the protein represented by the line number. So you would have 100,000 x 100,000 entries, which you can cut in half because the data is replicated. I.e., Protein1.dat would have 1 vs 2, 1 vs 3, etc., so Protein2.dat can start with 2 vs 3, since 1 vs 2 has already been done, and Protein3.dat can start with 3 vs 4, since 1 vs 3 and 2 vs 3 have already been done. (There's a sketch of this layout at the end of the post too.) And we don't really care that we have to access other files to get this info, because the additional work is basically just a larger constant. Anyway, we end up with half the entries, which is 5 billion. Assuming each entry needed 4 bytes to represent a BLAST score, you would need about 18 gigs to hold all the data. I would argue you could probably get away with only 2 bytes per BLAST score (if not less), which brings you down to 9 gigs. And that is before compression. It seems highly likely that, depending on how you arranged the data, you could get some decent compression (say 5:1), so you would be down to 1.8 gigs.
Of course, that is just for 100,000 proteins, which might be a vaguely reasonable estimate for our little prokaryotes, but what happens when we start doing larger numbers? Well, at 1,000,000 proteins you would need 931 gigs to store all the BLAST scores with 2 bytes per score and no compression. Which isn't entirely unreasonable.

Pros: Similarity can be computed with the most accurate algorithm you want, since it is a one-time cost. While I did the above size analysis with the data stored in one particular way, you can imagine it organized in best-match order with about the same space requirements after some reorganization. If the data is in best-match order it becomes super simple to get great data sets. Problems like our homework become trivial to solve (this is the last sketch at the end of the post): for each protein in our prokaryote, check whether it meets your requirement (more than 50 bases of non-coding upstream sequence). Then, for each of your prokaryote's proteins that passed, go to its protein file, look at the proteins that make up the top scores, and see if they pass your requirements as well. So now, for each protein in your prokaryote, you have the best match scores across all the others that meet your requirements. Look at which ones meet the five-different-species requirement, and take the best score.

Cons: Well, disk is probably the one most people would be concerned about, but it isn't really that big a deal. Disk is cheap. I have almost 300 gigs at my house, and at work we have some serious disk. :) I think a much bigger con is system complexity and design issues. Setting up such a massive amount of precomputed scores in a way that answers other problems as well as our hw3, while keeping reliability and ease of use at acceptable levels, would probably be the biggest headache, since testing and validation for that much data is a big PITA.
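Anyway, so this isn't all hand-waving, here are a few rough sketches of what I'm picturing, in Perl of course. First, the scrape-and-cache idea from option 1. The URL, the page layout, and the regex are all made up; the real thing would depend on what the protein table pages actually look like (and would probably want HTML::TableExtract or something instead of a regex):

    #!/usr/bin/perl
    # Option 1 sketch: fetch each COG's protein table page, cache it on disk
    # so reruns don't hammer the server, and pull rows out with a crude regex.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $cache_dir = "cache";
    mkdir $cache_dir unless -d $cache_dir;

    sub fetch_cog_page {
        my ($cog_id) = @_;
        my $cache_file = "$cache_dir/$cog_id.html";
        if (-e $cache_file) {                       # reuse the cached copy
            open my $fh, '<', $cache_file or die "can't read $cache_file: $!";
            local $/;                               # slurp the whole file
            my $html = <$fh>;
            return $html;
        }
        my $url  = "http://example.org/cogs/$cog_id/proteins.html";  # placeholder URL
        my $html = get($url) or die "couldn't fetch $url";
        open my $out, '>', $cache_file or die "can't write $cache_file: $!";
        print $out $html;
        close $out;
        return $html;
    }

    # Pull (protein id, upstream length) pairs out of the table rows.
    # The row format here is invented; adjust to the real markup.
    sub parse_protein_rows {
        my ($html) = @_;
        my @rows;
        while ($html =~ m{<tr>.*?<td>(\w+)</td>.*?<td>(-?\d+)</td>.*?</tr>}gs) {
            push @rows, { protein => $1, upstream_len => $2 };
        }
        return @rows;
    }

    for my $cog_id (@ARGV) {
        my @rows = parse_protein_rows(fetch_cog_page($cog_id));
        printf "%s: %d proteins\n", $cog_id, scalar @rows;
    }

The XML version from option 2 would look about the same, just with an XML module doing the parsing instead of that regex.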
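Second, the half-matrix layout from option 4. The score for a pair (i, j) with i < j lives in Protein<i>.dat at record (j - i - 1). The file naming, the 2-byte packing, and the big-endian format are just assumptions for the sketch; the size check at the bottom is the same back-of-the-envelope math as above:

    #!/usr/bin/perl
    # Option 4 sketch: map a protein pair to its spot in the half matrix and
    # read the precomputed score back out.
    use strict;
    use warnings;

    my $BYTES_PER_SCORE = 2;    # assuming 2 bytes is enough for a BLAST score

    # Where does the score for the pair ($a, $b) live?
    sub score_location {
        my ($a, $b) = @_;
        die "need two distinct proteins" if $a == $b;
        ($a, $b) = ($b, $a) if $a > $b;             # always file under the smaller id
        my $file   = sprintf "Protein%d.dat", $a;
        my $offset = ($b - $a - 1) * $BYTES_PER_SCORE;
        return ($file, $offset);
    }

    sub read_score {
        my ($a, $b) = @_;
        my ($file, $offset) = score_location($a, $b);
        open my $fh, '<:raw', $file or die "can't open $file: $!";
        seek $fh, $offset, 0 or die "seek failed: $!";
        read $fh, my $buf, $BYTES_PER_SCORE;
        return unpack "n", $buf;                    # 16-bit unsigned score
    }

    # Back-of-the-envelope size check: N*(N-1)/2 pairs at 2 bytes each.
    my $n     = 100_000;
    my $bytes = $n * ($n - 1) / 2 * $BYTES_PER_SCORE;
    printf "%d proteins -> %.1f gigs before compression\n", $n, $bytes / 2**30;

If the per-protein files were kept in best-match order instead, you'd lose the simple offset math and need a little index per file, but the space is about the same, which is the trade-off I was hand-waving about above.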
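And last, the lookup side for something like our hw3, assuming the per-protein files have already been reorganized into best-match order. The helper subs here are toy stand-ins so the sketch runs on its own; the real versions would read the annotation tables and the Protein*.dat files:

    #!/usr/bin/perl
    # Option 4 sketch, query side: filter our prokaryote's proteins by upstream
    # length, walk each one's best-match list, keep only hits that also pass
    # the filter, and stop once five species are covered.
    use strict;
    use warnings;

    my $MIN_UPSTREAM = 50;   # bases of non-coding upstream sequence required
    my $MIN_SPECIES  = 5;    # distinct species required

    # Toy stand-ins so this runs by itself.
    sub our_prokaryote_proteins { return qw(P1 P2) }
    sub upstream_length {
        my %u = (P1 => 120, P2 => 30,
                 Q1 => 200, Q2 => 80, Q3 => 75, Q4 => 60, Q5 => 90);
        return $u{ $_[0] } // 0;
    }
    sub species_of {
        my %s = (Q1 => 'species1', Q2 => 'species2', Q3 => 'species3',
                 Q4 => 'species4', Q5 => 'species5');
        return $s{ $_[0] } // 'unknown';
    }
    sub best_matches {
        # Would really read the given protein's score file; here it's canned data.
        return ( { protein => 'Q1', score => 900 },
                 { protein => 'Q2', score => 850 },
                 { protein => 'Q3', score => 820 },
                 { protein => 'Q4', score => 800 },
                 { protein => 'Q5', score => 780 } );
    }

    for my $protein (our_prokaryote_proteins()) {
        next unless upstream_length($protein) > $MIN_UPSTREAM;

        my (%species_seen, @kept);
        # best_matches() is assumed to return hits best-score-first,
        # so we can stop as soon as enough species are covered.
        for my $hit (best_matches($protein)) {
            next unless upstream_length($hit->{protein}) > $MIN_UPSTREAM;
            push @kept, $hit;
            $species_seen{ species_of($hit->{protein}) } = 1;
            last if keys(%species_seen) >= $MIN_SPECIES;
        }
        next unless keys(%species_seen) >= $MIN_SPECIES;

        printf "%s: best qualifying match %s (score %d)\n",
            $protein, $kept[0]{protein}, $kept[0]{score};
    }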