So I didn't do any implementation on any of this since, well, I'm super busy with non-school things (as is typical). But since I spent way, way too long on Sunday finding that data by hand, I did have a few thoughts about not doing it by hand. So below are several ways this could be done in a non-lame way. :)

1) Write a little HTML parser and run the suggested method via a set of scripts. Hopefully do something vaguely smart and cache the protein table pages to save on download time. Then just churn through all the COGs looking for the best data. (There's a rough sketch of this at the end of the post.)

Pros: Well, minimal effort in terms of logic or thinking.

Cons: Would probably take a while; parsing HTML is actually much harder than one would think, even for Perl masters such as myself. Lots of redundant work is done by the servers, and we still probably aren't getting the best data we could.

2) If they had an XML feed, we could write some scripts to do the suggested method very easily. Just run through all the COGs and output the best data.

Pros: Super easy development, no logic to work out (since that is given in the suggested method), and dealing with XML feeds is really simple.

Cons: Again, it is going to take a long time, and we probably aren't going to get the absolute 'best' data.

3) Have the back-end implement protein filtering based on user-entered rules about proteins. Specifically, allow the user to specify a minimum non-coding upstream sequence length. Then do either the suggested method or a simpler version of 1 or 2 to get the data. The COG query would return only proteins that got past the protein filter.

Pros: Less searching needs to be done, since we don't even have to look at COGs that have fewer than five species (and based on my running into a number of COGs where most of the upstream non-coding sequences were very small or negative, I think it would make it easy to trim lots of data). Should also be very easy for the back-end to implement.

Cons: We don't really have access to the back-end. I mean, I guess we could download BLAST, but I don't know if we can download all the CGIs associated with the website we are using or what. And we still probably wouldn't get the 'best' data, since we are still only looking at COGs.

4) Run BLAST on all proteins vs. all proteins to form a large DB (relational, flat file, whatever) and turn the problem into a lookup/filtering problem. If you have 100,000 proteins, you could imagine a file for each protein where each line represents the BLAST score against the protein represented by the line number. So you would have 100,000 x 100,000 entries, which you can cut in half because the data is replicated. I.e., Protein1.dat would have 1 vs 2, 1 vs 3, etc., so Protein2.dat can start with 2 vs 3, since 1 vs 2 has already been done, and Protein3.dat can start with 3 vs 4, since 1 vs 3 and 2 vs 3 have already been done. (There's a sketch of this layout at the end of the post too.) And we don't really care that we have to access other files to get this info, because the additional work is basically just a larger constant. Anyway, we end up with half the entries, which is 5 billion. Assuming each entry needed 4 bytes to represent a BLAST score, you would need about 18 gigs to hold all the data. I would argue you could probably get away with only 2 bytes per BLAST score (if not less), which brings you down to 9 gigs. And that is before compression. It seems highly likely that, depending on how you arranged the data, you could get some decent compression (say 5:1), so you would be down to 1.8 gigs.
Of course, that is just for 100,000 proteins, which might be a vaguely reasonable estimate for our little prokaryotes, but what happens when we start doing larger numbers? Well, at 1,000,000 proteins you would need 931 gigs to store all the BLAST scores with 2 bytes per score and no compression. Which isn't entirely unreasonable.

Pros: Similarity can be computed with the most accurate algorithm you want, since it is a one-time cost. While I did the above size analysis with the data stored in one particular way, you can imagine it organized in best-match order with about the same space requirements after some reorganization. If the data is in best-match order it becomes super simple to get great data sets. Problems like our homework become trivial to solve (this is the last sketch at the end of the post): for each protein in our prokaryote, check whether it meets your requirement (more than 50 bases of non-coding upstream sequence). Then, for each of your prokaryote's proteins that passed, go to its protein file, look at the proteins that make up the top scores, and see if they pass your requirements as well. So now, for each protein in your prokaryote, you have the best match scores across all the others that meet your requirements. Look at which ones meet the five-different-species requirement, and take the best score.

Cons: Well, disk is probably the one most people would be concerned about, but it isn't really that big a deal. Disk is cheap. I have almost 300 gigs at my house, and at work we have some serious disk. :) I think a much bigger con is system complexity and design issues. Setting up such a massive amount of precomputed scores in a way that answers other problems as well as our hw3, while keeping reliability and ease of use at acceptable levels, would probably be the biggest headache, since testing and validation for that much data is a big PITA.
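Anyway, so this isn't all hand-waving, here are a few rough sketches of what I'm picturing, in Perl of course. First, the scrape-and-cache idea from option 1. The URL, the page layout, and the regex are all made up; the real thing would depend on what the protein table pages actually look like (and would probably want HTML::TableExtract or something instead of a regex):

    #!/usr/bin/perl
    # Option 1 sketch: fetch each COG's protein table page, cache it on disk
    # so reruns don't hammer the server, and pull rows out with a crude regex.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $cache_dir = "cache";
    mkdir $cache_dir unless -d $cache_dir;

    sub fetch_cog_page {
        my ($cog_id) = @_;
        my $cache_file = "$cache_dir/$cog_id.html";
        if (-e $cache_file) {                       # reuse the cached copy
            open my $fh, '<', $cache_file or die "can't read $cache_file: $!";
            local $/;                               # slurp the whole file
            my $html = <$fh>;
            return $html;
        }
        my $url  = "http://example.org/cogs/$cog_id/proteins.html";  # placeholder URL
        my $html = get($url) or die "couldn't fetch $url";
        open my $out, '>', $cache_file or die "can't write $cache_file: $!";
        print $out $html;
        close $out;
        return $html;
    }

    # Pull (protein id, upstream length) pairs out of the table rows.
    # The row format here is invented; adjust to the real markup.
    sub parse_protein_rows {
        my ($html) = @_;
        my @rows;
        while ($html =~ m{<tr>.*?<td>(\w+)</td>.*?<td>(-?\d+)</td>.*?</tr>}gs) {
            push @rows, { protein => $1, upstream_len => $2 };
        }
        return @rows;
    }

    for my $cog_id (@ARGV) {
        my @rows = parse_protein_rows(fetch_cog_page($cog_id));
        printf "%s: %d proteins\n", $cog_id, scalar @rows;
    }

The XML version from option 2 would look about the same, just with an XML module doing the parsing instead of that regex.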
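Second, the half-matrix layout from option 4. The score for a pair (i, j) with i < j lives in Protein<i>.dat at record (j - i - 1). The file naming, the 2-byte packing, and the big-endian format are just assumptions for the sketch; the size check at the bottom is the same back-of-the-envelope math as above:

    #!/usr/bin/perl
    # Option 4 sketch: map a protein pair to its spot in the half matrix and
    # read the precomputed score back out.
    use strict;
    use warnings;

    my $BYTES_PER_SCORE = 2;    # assuming 2 bytes is enough for a BLAST score

    # Where does the score for the pair ($a, $b) live?
    sub score_location {
        my ($a, $b) = @_;
        die "need two distinct proteins" if $a == $b;
        ($a, $b) = ($b, $a) if $a > $b;             # always file under the smaller id
        my $file   = sprintf "Protein%d.dat", $a;
        my $offset = ($b - $a - 1) * $BYTES_PER_SCORE;
        return ($file, $offset);
    }

    sub read_score {
        my ($a, $b) = @_;
        my ($file, $offset) = score_location($a, $b);
        open my $fh, '<:raw', $file or die "can't open $file: $!";
        seek $fh, $offset, 0 or die "seek failed: $!";
        read $fh, my $buf, $BYTES_PER_SCORE;
        return unpack "n", $buf;                    # 16-bit unsigned score
    }

    # Back-of-the-envelope size check: N*(N-1)/2 pairs at 2 bytes each.
    my $n     = 100_000;
    my $bytes = $n * ($n - 1) / 2 * $BYTES_PER_SCORE;
    printf "%d proteins -> %.1f gigs before compression\n", $n, $bytes / 2**30;

If the per-protein files were kept in best-match order instead, you'd lose the simple offset math and need a little index per file, but the space is about the same, which is the trade-off I was hand-waving about above.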
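And last, the lookup side for something like our hw3, assuming the per-protein files have already been reorganized into best-match order. The helper subs here are toy stand-ins so the sketch runs on its own; the real versions would read the annotation tables and the Protein*.dat files:

    #!/usr/bin/perl
    # Option 4 sketch, query side: filter our prokaryote's proteins by upstream
    # length, walk each one's best-match list, keep only hits that also pass
    # the filter, and stop once five species are covered.
    use strict;
    use warnings;

    my $MIN_UPSTREAM = 50;   # bases of non-coding upstream sequence required
    my $MIN_SPECIES  = 5;    # distinct species required

    # Toy stand-ins so this runs by itself.
    sub our_prokaryote_proteins { return qw(P1 P2) }
    sub upstream_length {
        my %u = (P1 => 120, P2 => 30,
                 Q1 => 200, Q2 => 80, Q3 => 75, Q4 => 60, Q5 => 90);
        return $u{ $_[0] } // 0;
    }
    sub species_of {
        my %s = (Q1 => 'species1', Q2 => 'species2', Q3 => 'species3',
                 Q4 => 'species4', Q5 => 'species5');
        return $s{ $_[0] } // 'unknown';
    }
    sub best_matches {
        # Would really read the given protein's score file; here it's canned data.
        return ( { protein => 'Q1', score => 900 },
                 { protein => 'Q2', score => 850 },
                 { protein => 'Q3', score => 820 },
                 { protein => 'Q4', score => 800 },
                 { protein => 'Q5', score => 780 } );
    }

    for my $protein (our_prokaryote_proteins()) {
        next unless upstream_length($protein) > $MIN_UPSTREAM;

        my (%species_seen, @kept);
        # best_matches() is assumed to return hits best-score-first,
        # so we can stop as soon as enough species are covered.
        for my $hit (best_matches($protein)) {
            next unless upstream_length($hit->{protein}) > $MIN_UPSTREAM;
            push @kept, $hit;
            $species_seen{ species_of($hit->{protein}) } = 1;
            last if keys(%species_seen) >= $MIN_SPECIES;
        }
        next unless keys(%species_seen) >= $MIN_SPECIES;

        printf "%s: best qualifying match %s (score %d)\n",
            $protein, $kept[0]{protein}, $kept[0]{score};
    }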