UW Home     CSE Home   Announcements    Message Board    Contact Info 

 
 
 

CSE143 Winter 2005 Project #2 Part B

Searching your Gene Data

Due: Saturday, February 26, at 9:00 pm. No late assignments will be accepted.

(There are a few new terms for this part in a glossary at the end.  There is also an Important Caution about what to turn in.)

Background

Applications for a gene (or protein) search tool

Researchers who work on genetics often have a short sequence of bases that they want to compare with the genes stored in the database (or they have a sequence of amino acids that they want to compare with the proteins that the genes represent).  The short sequence might appear anywhere in the gene -- it might even be in there more than once.  And it might not be exactly the same as what's in the gene.  Often it's sequences that are similar, but not exactly the same, that are the most significant.

Since both the sequence they want to compare, and the sequences in the database, are represented by strings of letters, then comparing them is a problem of comparing strings.  This is usually called the "string matching" problem, though that is a bit misleading, since we want to find more than just exact matches.  Instead, we have a "pattern" that we want to compare to all parts of a sequence, and decide how similar it is to each part.  (If you read about techniques for comparing sequences in computer science texts, they'll use the word "string" to mean any kind of sequence, not just letters.  But all their examples will use letters...)

You've seen that both genes and proteins can be represented by strings of letters.  So the methods you'll develop for pattern matching can be used for both.

Here are just a few examples of what this pattern matching might be used for:
  • Recognizing sequences of amino acids in a protein that indicate a particular sort of shape that the protein will fold into.  The sequence of amino acids in a protein tells us little about what that protein does.  Proteins fold into elaborate three-dimensional shapes.  The biological activity of a protein depends on its shape and what chemical structures are exposed at each place on its surface.  One of the hottest topics in biology today is figuring out how a protein will fold from its amino acid sequence.  One technique involves cataloging short sequences of amino acids that have been found in various kinds of folds, and then matching them against a protein whose shape one wants to predict.

  • Figuring out what species might have shared a common ancestor, and how long ago, based on similarities between genes and proteins.  If we know the rate at which genes change (mutate), and measure how much difference there is between genes in two related species, we can estimate how long ago they diverged.

  • Looking for regions of a the same gene that differ in individual members of a species.  Differences in the gene might cause small (or drastic) changes in the protein it codes for, leading to variants of proteins that may not perform their function as well.  Detecting these differing regions (or "markers") can help predict who might be at risk of a disease that's due to the variant protein.  (The large data files you had for part A are from BRCA1 and BRCA2 genes -- women who have these gene variants are at higher risk of getting breast cancer.)

  • Detecting insertion of genes from viruses into bacteria, plant, or animal genomes.  Detecting swapping of genes between viruses.  A virus that infects a cell causes the cell to make copies of the virus's genome.  Along the way, it may also leave gene fragments that get copied into the host cell's chromosomes.  One hope for turning this "bug" into a feature is to use a virus to deliver working copies of a defective gene into cells.

Pattern matching and measures of "distance" between strings

In all of the background examples above, the comparison between two sequences isn't just deciding whether they match exactly or not, it's looking for how different they are.  We want to give a numerical value to this difference -- something that can let specify how close a match we want to our patterns.  That is, we want to provide something like a "distance" between two sequences.  ("Distance" is math jargon -- see the glossary.)

We've already noted that our sequences are represented just as strings of letters.  We'll use two popular distance measures for strings:
  • The Hamming distance compares the letters in the same positions in both strings, and counts up how many are different -- it's appropriate if the changes between one string and the other are due to replacing one letter by another.

  • Edit distance allows replacing, adding, or deleting letters, and counts the smallest number of changes needed to turn one string into the other.  We'll use a simple version of edit distance that only allows adding one letter, or replacing letters.
As you might imagine, a mutation that replaces one amino acid in a protein by another might not make too much difference, but a mutation that removes or adds a base in an exon will cause all the codons beyond that to change.  (Recall that an exon is a part of the gene that gets translated into protein, as opposed to an intron, which is a part that doesn't get translated.)  A gene with a single base addition or deletion in the wrong place may cause the unfortunate organism that possesses that gene not to be viable...

Administrivia

You should work with your assigned partner on this project using the pair programming style discussed in class. You and your partner will turn in a single set of files under the same name you used for part A.  After this final part of the project is done, each of you will individually produce a written report. (Details about that will be supplied separately.)

Grading:  This is the final code turnin for the project -- it is not a checkpoint. You and your partner will receive the same scores on the programming parts of the project. Programming projects will be evaluated both for how well the code works and how well it is written, each on a scale of 0 to 4 (see the project assessment guidelines on the main projects page for details). Be sure to include appropriate JavaDoc and other comments in your code, use meaningful names, indent sensibly, and so forth.

Overview

This part of the project is mainly about working with sequences of data -- strings, lists -- and more on the use of collections.  (Along the way, it should answer the question about how to return more than one piece of information from a method, in case you're still wondering.)  You'll also get some experience coding to a pre-defined API specification -- in industry, this is what you'd typically be doing (even if you were the one who wrote the spec...).

Imagine we're designing software to be used for the sorts of search tasks described above.  Here's a top-down view of what a user of our software will want to accomplish.  The tasks the user wants to perform will suggest what lower-level functions our software has to provide.

Finding sequences in the sequence collection

At the highest level, the user wants to find all sequences in the collection that contain some part that's similar to a pattern.  They want to supply a pattern of DNA bases and search among the gene sequences, but they also want to be able to supply an amino acid pattern, and search through the protein sequences that the gene sequences translate to.

The user will want to get results for all sequences that had matches to the pattern.  Besides any information about the matches within a sequence (e.g. where they are within the sequence), the user will need to know which sequence had those matches, so we'll need to give them identifying information too.

If we want to look for pattern matches in all sequences, we can build that on something that looks for matches in one sequence.

Searching for pattern matches in one sequence

We'd like to find regions in a long sequence that are similar to some pattern.  Because there might be more than one such region, and we shouldn't guess which one the user wants, we'll return the locations of all the regions in the sequence that were matches.

We've been offering to locate matches to a pattern within a sequence.  That implies we'll want something that can compare the pattern to each substring in the sequence.  We've also been saying things like "is similar to" and "matches" -- it's in the comparison methods that we pin down what this means.

Comparing strings

At the lowest level, we'll need to compare a pattern to parts of a sequence, so we'll want methods that compare two strings.  Because the user wants to specify how similar the sequence region should be to the pattern in order to regard it as a match, these will be "distance" measures.

Details

In order to make it more clear what you need to do, and also to make it possible for us to test your projects, we're going to give you two interfaces.  We'll ask you to provide concrete classes that implement these interfaces.  (You should not modify these interfaces at all!)
  • SequenceSearch is the "search engine" interface -- it lists the various searching and pattern matching tasks.

    Please use the name GeneDataSearch for your class that implements SequenceSearch.  (Your GeneDataSearch class will build on your part A work.  If you don't want to rename your existing part A class, you could just create an instance of it inside your GeneDataSearch.)

    Provide a constructor with no parameters.  The constructor should take care of all the setup needed for your part A code to work -- getting file names, reading in files, setting up your internal data structures.  (We will need to create instances of your class in our test procedures, so we'll need to know what it's called and how to make an instance of it.)

  • SearchResult describes a little wrapper class for returning search results.  (This is the "gene record" of part A -- if you have a class for this already, it should be simple to make it implement this interface.  It does not need to have a specific name.)
We're giving you a little text-based user interface, SimpleSearchTool, that shows how the methods described in the interfaces will be called, and how the return values are interpreted.  SimpleSearchTool creates an instance of GeneDataSearch.  Since we don't have yours yet we're providing a dummy version of GeneDataSearch just so SimpleSearchTool will compile -- replace this with your real version.

The two interfaces are linked from the project page -- it may be helpful to have copies of them available while reading the following descriptions of the various search and pattern matching tasks we're asking you to implement.

Here are more specific descriptions of what each task requires.  (We're starting at the lowest level this time, as the tasks build on each other.)  See the interfaces for specifications of parameters and return types.

Computing the distance between two strings

At the bottom of the hierarchy of tasks are the distance computations.  Here are more complete descriptions of the two distance measures we want to provide:

Hamming distance:  Compare the letters in the two strings position by position, and count up the number of mismatches.  (Note this is the same as the number of letters that would have to be changed to make one string into the other.)  Stop at the end of whichever string is shorter (i.e. don't count extra letters as a mismatch).

Here's an example:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
ABBDXF
^ ^
There are two differences, so the Hamming distance is two.

Edit distance with a single insertion allowed:  Edit distance is the minimum number of changes that need to be made to two strings to make them the same.  We're implementing a very restricted version of this that only allows one letter to be inserted in either string, as well as changing any letters.

We'll be trying out distances when we've inserted letters in all possible places in the strings.  Once we've chosen a place for the insertion, we can count up the mismatches to tell how many letters we'd need to replace.  (When we insert a letter, it only makes sense to choose the same letter as is in the other string at that position, as otherwise we'd be making the distance unnecessarily greater.)  The insertion itself is a change, so we add one to the distance to account for that.

And we should also check the distance if we don't make any insertions.

Whether we're inserting or not, we stop the comparison at the end of the shorter string.

Among all of these, we return the smallest.

This example shows all the cases that need to be checked when comparing the strings BARD and BEARD.  An insertion is indicated by an underscore.  (We don't bother showing the inserted letter because we would always add the matching letter from the other string.)

Cases in the comparison:   
Their distances:   
BARD
BEARD
3
_BARD
BEARD
2
B_ARD
BEARD
1
BA_RD
BEARD
2
BAR_D
BEARD
3
BARD
_BEARD
4
BARD
B_EARD
3
BARD
BE_ARD
3
BARD
BEA_RD
3

Here, the smallest distance is found when a letter is inserted before the A in BARD, and that's 1, so the edit distance is 1.

Finding locations of all the close matches in a long sequence

Now that we've got some distance methods that can compare a short pattern to part of a long string, we want to do that comparison starting at each location in the long string.  We could return a list of all the distances along the string, but most of these are not going to be close enough to be interesting.  So instead, we let the user say how close they want the distance to be, and return a list with just the locations that were at least that close.  A location value should be the index within the long string of the matching piece.

You can pick what kind of list you want to use, but it should be something that implements the List interface.  As you know, you can't put just bare ints in a List, so you'll need to put each one in an Integer.  If there are no good-enough matches, return a null to indicate that.

What the matches are will depend both on which distance measure the user wants, and what maximum distance they want to allow.

Here's an example:  Say the long string is ABRACADABRA, the pattern is ABA, we're using edit distance, and we want to hear about matches with distance no more than 1.

Here are the edit distances with the pattern compared at each location in the long string.  (This is just FYI -- you don't return this!  Also note we're letting the comparisons continue right to the end of the long string.  We did this for simplicity in the assignment, but if this were a real application, maybe the user wouldn't want those matches on the tail end.)
ABRACADABRA
12212121210
So the positions you'd return are 0, 3, 5, 7, 9, 10, wrapped in Integers and put in a List.

Searching for matches in the whole collection

Finally, we're going to do something with your collection of names and sequences.  We'll let the user provide a pattern and maximum distance, and do the above search for close matches on every sequence in the collection, and say whether they want to search gene sequences or their protein translations.

For each sequence that had any close-enough matches, we'll give the user the name of the sequence and the sequence itself along with a list of match locations like that returned by the search-one-sequence methods.  These should be packaged in objects that implement the SearchResult interface.

Put all the SearchResult objects into some Collection class (i.e. a class that implements the Collection interface).  It's your choice -- see the Java API docs for Collection and pick your favorite.  Return that.  (Our main reason for asking for a Collection is that it can provide an iterator, but doesn't pin down much of anything else.)

JUnit tests

Include JUnit tests for your Hamming, edit distance, and search-one-sequence methods (and we won't complain if you include them for anything else).

Hints and notes

Computing the edit distance -- There is a lot of room for "optimization" and code reuse here.  While we do expect you to get rid of obvious code redundancy, we don't expect you to get fancy (except for extra credit).  And you should get this working however it's simplest first, before trying any optimizations.  But if you have time, here's a hint for one not-too-labor-intensive optimization:

You've seen that you don't need to bother counting the inserted letter, because you'd always choose the "right" letter to insert anyway.  So...do you really need to bother actually inserting it?  After all, an insert in a string (which is essentially an array) is a rather expensive operation -- everything past the insert has to be shifted down one.  Could you split up the distance comparison so you could avoid doing the insert?

Java Strings and the cost of String operations that extract substrings  --  Some of you may be familiar with how Unix stores strings -- to tell where the end is, a null character is put after the last character in the string, rather than having a length specified.  When a string is stored that way, then in order to refer to a substring in the middle, the substring has to be copied and a null put on its end.  This makes extracting a substring an expensive operation.

Java doesn't do that.  Instead, Java represents Strings internally by an index into an array of chars and a length.  Strings can't be modified, so Java can let any number String objects point to various parts of one char array.  In particular, all its substring method has to do is make a String with the right initial index and length, pointing into the same char array as the original String.  So go ahead and use substring and other methods that return parts of Strings without hesitation.

That SearchResult interface...  --  So, why do we want to return the sequence as well as the name and match locations?  One reason is that you might have, in your collection, more than one sequence with the same name, so returning just names and locations won't tell which actual gene got a particular set of matches.  But another is that your collection that holds the names and sequences is surely a private instance variable (right??), so the user couldn't go find the sequence just given the name -- we haven't asked you to provide that sort of accessor.

If you are providing a GUI for extra credit  --  Remember the MVC architecture!  Please divide your project into model (the part that handles the files, data structures, and search and translation) and view (the GUI) classes.  You can keep the JFileChoosers for opening files with the model, since we've asked you to read in files in the GeneDataSearch constructor (just make their parent component null instead of your window).  Otherwise when we try to run out tests on your model, your GUI would pop up...

Why did we keep calling this a database?  --  We stopped calling it that for part B because the term was bothering some folks.  But it shouldn't -- here's why.  You've probably heard the word database mainly applied to products like Oracle Database, Microsoft SQL Server (and its much cheaper development kit), postgreSQL, and MySQL.  Those are "relational database management systems" (RDBMSs) -- they use a particular idea of how to handle references among the data, and how to safely process changes to to data even if lots of users are making changes at once.

But more generally, a database is any repository for information that lets someone perform "queries" -- ask questions that are answered based on the stored information.  The feature that makes a query different from just looking up an item is that a query can specify "constraints" on what information they're interested in, to select some subset of all the information -- they do not need to ask for all the items in the subset by name.  When we let users select genes that have regions closely matching a pattern, they're doing a query.  We could let them add other constraints, like searching only among genes from a particular species -- that would make this even more database-like.

Ok, so what's a relational database?  --  Some of you have noticed the issue that when your program exits, all your objects vanish.  And even if you could save the values in all your objects' instance variables, what if they were reference variables?  When you read your saved objects back in, how would you figure out what used to point to what?  After all, the value of the reference depends on where the object gets put in the computer's memory -- that's highly unlikely to be the same when your program is restarted again.  So even if you got your objects and their instance variable values back, all their references would be garbage. 

"Relational" databases solve this problem by not using references at all.  Instead, each "record" in the database is identified by a key.  When one record needs to refer to another, it includes the other's key in its data.  The keys remain the same if the database program is shut down and started again, so this way of referring to another record doesn't go stale like memory-location references do.  But a key doesn't "point" directly to the other record the way a memory-location reference does -- the program has to go search for the key.  So you can imagine how much slower this could be.  But good RDBMSs provide ways of quickly looking up keys (for instance, they might put the keys in a hash table, where the value is some information that tells the RDBMS how to locate the record in its data structures).

Java has a solution to this problem called "serialization".  There's a long post on this topic in the message board -- look for the subject "our database", dated 2/15.

Extra credit

NOTE!  If your extra credit features need gene data with particular information in it in order to show off your work (e.g. if you want those [key=value] pairs on more gene entries, or want duplicate names), you can add fake data easily by copying and modifying the records in the existing files.  Turn in your demo files with your code, and tell us how to try your features in your reports.

Explore ways to further optimize the edit distance calculation besides what's given in the hint above.  There may be some relatively simple improvements possible here.  (For a harder problem, consider how to optimize the comparison with one pattern all along a long sequence.  Even if all you want to do is check for an exact match, there are optimizations that range from somewhat scary to amazingly arcane.  See, for instance, Introduction to Algorithms, by Cormen, Leiserson, and Rivest -- look for the section on string matching.)

Give your application a GUI that lets the user enter a search pattern.  Display the results in the GUI.  Look at JTextArea for a widget that lets the user enter text, or lets you display it.  Its doc page has a link to the appropriate part of the Java tutorial.  (But see the important warning in the "Hints and notes" above.)

Show the areas in a gene that are good matches by displaying the whole gene sequence with the matches highlighted (e.g. in a different color, or bold, or with ^^^^ under it on the next line, etc., etc.).

Let the user restrict which genes to include a the search by specifying some gene names, etc.  If you interpreted and stored other info from the name line (e.g. the [key=value] pairs), you could let the user restrict their search by specifying a key and value.

Etc. etc.

Glossary

distance

A "distance" is a special kind of measure of how different things are, that obeys the "triangle inequality".  This says that the distance from A to B plus the distance from B to C is always at least as much as the distance from A to C.  We won't need to use this property here, but wanted to let you know why measures of difference are sometimes called distances...

genome

The complete set of genes that a species has.

What to Turn In

Use this online turnin form to turn in the Java source files that make up your project, including any JUnit tests. If your project uses any other files, such as images, turn those in also. If you have many files, including things like images, you can bundle them into an archive file (zip, jar, or tar) and turn that in. Multiple turnins are fine and highly recommended if you are planning to add extra credit features. Once you've got something that meets the basic requirements, turn that in. Then if you add to your project, turn in the extended version(s) later - we'll grade the last one you turn in.

Caution!  Several otherwise very nice projects on the project 1 failed to compile because they defined packages, but didn't have the directories to go with them.  This is causing your TAs grief.  Please consider whether you need to put your classes in packages at all -- that is usually for very large projects with zillions of classes.  If you really want to use packages, please contact your TA to find out how to make an archive of your classes and directories that will compile successfully.

Eclipse users:  If you're using eclipse projects, please don't turn those in.  Instead, make sure your code compiles without eclipse, and turn in your files either individually or in one of the types of archive files that the turnin server supports.