|
|
|
|
CSE143 Winter 2005 Project #2 Part B
Searching your Gene Data
Due: Saturday, February 26, at 9:00 pm. No late assignments
will
be accepted.
(There are a few new terms for
this part
in a glossary at the end.
There is also an Important Caution
about what to turn in.)
Background
Applications for a gene (or protein) search tool
Researchers who work on
genetics often have a short sequence
of bases that they want to compare with the genes stored in the
database (or they have a sequence of amino acids that they want to
compare with the proteins that the genes represent). The short
sequence might appear anywhere in the gene -- it might even be in there
more than once. And it might not be exactly the same as what's in
the gene. Often it's sequences that are similar, but not exactly
the same, that are the most significant.
Since both the sequence they want to compare, and the
sequences in the database, are represented by strings of letters, then
comparing them is a problem of comparing strings. This is usually
called the "string matching" problem, though that is a bit misleading,
since we want to find more than just exact matches. Instead, we
have a "pattern" that we want to compare to all parts of a sequence,
and decide how similar it is to each part. (If you read about
techniques for comparing sequences in computer
science texts, they'll use the word "string" to mean any kind of
sequence, not just letters. But all their examples will use
letters...)
You've seen that both genes and proteins can be represented by
strings of letters. So the methods you'll develop for pattern
matching can be used for both.
Here are just a few examples of what this pattern matching might be
used for:
- Recognizing
sequences of amino acids in a protein that indicate a particular sort
of shape that the protein will fold into. The sequence of
amino acids in a
protein tells us little about what that protein does. Proteins
fold
into elaborate three-dimensional
shapes. The biological activity of a protein depends on its shape
and what chemical
structures are exposed at each place on its surface. One of the
hottest topics
in biology today is figuring out how a protein will fold from its amino
acid sequence. One technique involves cataloging short sequences
of amino acids that have been found in various kinds of folds, and then
matching them against a protein whose shape one wants to predict.
- Figuring
out
what species might have shared a common ancestor, and how long ago,
based on similarities between genes and proteins. If we
know the rate at which genes
change (mutate), and measure how much difference there is between genes
in two related species, we can estimate how long ago they diverged.
- Looking
for regions of a the same gene that differ in individual members of a
species. Differences in the gene might cause small (or
drastic) changes in the protein it codes for, leading to variants of
proteins that may not perform their function as well. Detecting
these differing regions (or "markers") can help predict who might be at
risk of a disease that's due to the variant protein. (The large
data files you had for part A are from BRCA1 and BRCA2 genes -- women
who have these gene variants are at higher risk of getting breast
cancer.)
- Detecting
insertion of genes from viruses into bacteria, plant, or animal
genomes. Detecting swapping of genes between viruses.
A virus that infects a cell causes the cell to make copies of the
virus's genome. Along the way, it may also leave gene fragments
that
get copied into the host cell's chromosomes. One hope for turning
this
"bug" into a feature is to use a virus to deliver working copies of a
defective gene into cells.
Pattern matching and measures of "distance" between strings
In all of the background examples above, the comparison between two
sequences
isn't just
deciding whether they match exactly or not, it's looking for how different they are. We
want to give a numerical value to this difference -- something that can
let specify how close a match we want to our patterns. That is,
we want to provide something like a
"distance"
between two sequences. ("Distance" is math jargon -- see the
glossary.)
We've already noted that our sequences are represented just as strings
of letters. We'll use two popular distance measures for strings:
- The Hamming
distance compares the letters in the same positions in both
strings, and counts up how many are different -- it's appropriate if
the changes between one string and the other are due to replacing one
letter by another.
- Edit
distance allows replacing, adding, or deleting letters, and
counts the smallest number of changes needed to turn one string into
the other. We'll use a simple version of edit distance that only
allows adding one letter, or replacing letters.
As you might imagine, a mutation that replaces one amino acid in a
protein by
another might not make too much difference, but a mutation that removes
or adds a base in an exon will cause all the codons beyond that to
change. (Recall that an exon is a part of the gene that gets
translated into protein, as opposed to an intron, which is a part that
doesn't get translated.) A gene with a single base addition or
deletion in the wrong place may cause the unfortunate organism that
possesses that gene not to be viable...
Administrivia
You should work with your assigned partner on this
project using the pair programming style discussed in class. You and
your partner will turn in a single set of files under the same name you
used for part A. After this final part
of the project is done,
each of you will individually produce a written report. (Details about
that will be supplied separately.)
Grading: This
is
the final code turnin for
the project -- it is not a checkpoint. You and your partner will
receive the same scores on the programming parts of the project.
Programming projects will be evaluated both for how well the code works
and how well it is written, each on a scale of 0 to 4 (see the project
assessment guidelines on the main projects page for details). Be sure
to
include appropriate JavaDoc and other comments in your code, use
meaningful names, indent sensibly, and so forth.
Overview
This part of the project is mainly about working with sequences of data
-- strings, lists -- and more on the use of collections. (Along
the way, it should answer the question about how to return more than
one piece of information from a method, in case you're still
wondering.) You'll also get some experience coding to a
pre-defined API specification -- in industry, this is what you'd
typically be doing (even if you were the one who wrote the spec...).
Imagine we're designing software to be used for the sorts of
search tasks described above. Here's a top-down view of what a
user of our software will want to accomplish. The
tasks the user wants to perform will suggest what lower-level functions
our software has to provide.
Finding sequences in the sequence collection
At the highest level, the user wants to find all sequences in the
collection that contain some part that's similar to a pattern.
They want to supply a pattern of DNA bases and search among the gene
sequences, but they also want to be able to supply an amino acid
pattern, and search through the protein sequences that the gene
sequences translate to.
The user will want to get results for all sequences that had matches to
the pattern. Besides any information about the matches within a
sequence (e.g. where they are within the sequence), the user will need
to know which sequence had those matches, so we'll need to give them
identifying information too.
If we want to look for pattern matches in all sequences, we can build
that on something that looks for matches in one sequence.
Searching for pattern matches in one sequence
We'd like to find regions in a long sequence that are similar to some
pattern. Because there might be more than one such region, and we
shouldn't guess which one the user wants, we'll return the locations of
all
the regions in the sequence that were matches.
We've been offering to locate matches to a pattern within a
sequence. That implies we'll want something that can compare the
pattern to each substring in the sequence. We've also been saying
things like "is similar to" and "matches" -- it's in the comparison
methods that we pin down what this means.
Comparing strings
At the lowest level, we'll need to compare a pattern to parts of a
sequence, so we'll want methods that compare two strings. Because
the user wants to specify how similar the sequence region should be to
the pattern in order to regard it as a match, these will be "distance"
measures.
Details
In order to make it more clear what you need to do, and also to make it
possible for us to test your projects, we're going to give you two
interfaces. We'll
ask you to provide concrete classes that implement these
interfaces.
(You should not modify these interfaces at all!)
SequenceSearch is the "search engine"
interface -- it lists the various searching and pattern matching
tasks.
Please use the name GeneDataSearch for your class that
implements SequenceSearch . (Your GeneDataSearch
class will
build on your part A work.
If you don't want to rename your existing part A
class, you could just create
an instance of it inside your GeneDataSearch .)
Provide a constructor with no parameters. The constructor should
take care of all the setup needed for your part A code to work --
getting file names, reading in files, setting up your internal data
structures. (We will need to create instances of your class in
our test procedures, so we'll need to know what it's called and how to
make an instance of it.)
SearchResult
describes a little wrapper class for returning search results.
(This is the "gene record" of part A -- if you have a class for this
already, it should be simple to make it implement this interface.
It does not need to have a specific name.)
We're giving you a little text-based user interface, SimpleSearchTool ,
that shows how the methods described in the interfaces will be called,
and how the return values are interpreted. SimpleSearchTool
creates an instance of GeneDataSearch . Since we
don't have yours yet we're providing a dummy version of GeneDataSearch
just so SimpleSearchTool will compile -- replace this
with your real version.
The two interfaces are linked from the project page -- it may be
helpful to have copies of them available while reading the following
descriptions of the various search and pattern matching tasks we're
asking you to implement.
Here are more specific descriptions of what each task requires.
(We're starting at the lowest level this time, as the tasks build on
each other.) See the interfaces for specifications of parameters
and return types.
Computing the distance between two strings
At the bottom of the hierarchy of tasks are the distance
computations. Here are more complete descriptions of the two
distance measures we want to provide:
Hamming distance:
Compare the letters in the two strings position by position, and count
up the number of mismatches. (Note this is the same as the number
of letters that would have to be changed to make one string into the
other.) Stop at the end of whichever string is shorter (i.e.
don't count extra letters as a mismatch).
Here's an example:
ABCDEFGHIJKLMNOPQRSTUVWXYZ ABBDXF ^ ^
There are two differences, so the Hamming distance is two.
Edit
distance with a single insertion allowed: Edit distance is
the minimum number of changes that need to be made to two strings to
make them the same. We're implementing a very restricted version
of this that only allows one letter to be inserted in either string, as
well as changing any letters.
We'll be trying out distances when we've inserted letters in all
possible places in the strings. Once we've chosen a place for the
insertion, we can count up the mismatches to tell how many letters we'd
need to replace. (When we insert a
letter, it only makes sense to choose the same letter as is in the
other string at that position, as otherwise we'd be making the distance
unnecessarily greater.) The insertion itself is a change, so we
add one to the distance to account for that.
And we should also check the distance if we don't make any insertions.
Whether we're inserting or not, we stop the comparison at the end of
the shorter string.
Among all of these, we return the smallest.
This example shows all the cases that need to be checked when comparing
the strings BARD
and BEARD . An insertion is indicated by an
underscore. (We don't bother showing the inserted letter because
we
would always add the matching letter from the other string.)
Cases in the
comparison:
|
Their
distances:
|
BARD BEARD
|
3
|
_BARD BEARD
|
2
|
B_ARD BEARD
|
1
|
BA_RD BEARD
|
2
|
BAR_D BEARD
|
3
|
BARD _BEARD
|
4
|
BARD B_EARD
|
3
|
BARD BE_ARD
|
3
|
BARD BEA_RD
|
3
|
Here, the smallest distance is found when a letter is inserted before
the A in BARD , and that's 1, so the edit
distance is 1.
Finding locations of all the close matches in a long sequence
Now that we've got some distance methods that can compare a short
pattern to part of a long string, we want to do that comparison
starting at each location in the long string. We could return a
list of all the distances along the string, but most of these are not
going to be close enough to be interesting. So instead, we let
the user say how close they want the distance to be, and return a list with just the locations that were
at least that close. A location value should be the index
within the long string of the matching piece.
You can pick what kind of list you
want to use, but it should be something that implements the List
interface. As you know, you can't
put just bare ints in a List, so you'll need to put each one in an
Integer. If there are no good-enough matches, return a null to
indicate that.
What the matches are will depend both on which distance measure the
user wants, and what maximum distance they want to allow.
Here's an example: Say the long string is ABRACADABRA ,
the
pattern is ABA , we're using edit distance, and we want to
hear about
matches with distance no more than 1.
Here are the edit distances with
the pattern compared at each location in the long string. (This
is just FYI -- you don't return this! Also note we're letting the
comparisons continue right to the end of the long string. We did
this for simplicity in the assignment, but if this were a real
application, maybe the user wouldn't want those matches on the tail
end.)
ABRACADABRA 12212121210
So the positions you'd return are
0, 3, 5, 7, 9, 10, wrapped in Integers and put in a List.
Searching for matches in the whole collection
Finally, we're going to do something with your collection of names and
sequences. We'll let the user provide a pattern and maximum
distance, and do the above search for close matches on every sequence
in the collection, and say whether they want to search gene sequences
or their protein translations.
For each sequence that had any close-enough matches, we'll give the
user the name of the sequence and the sequence itself along with a list
of match locations like that returned by the search-one-sequence
methods. These should be packaged in objects that implement the SearchResult
interface.
Put all the SearchResult objects into some Collection
class (i.e. a class that implements the Collection
interface). It's your choice -- see the Java API docs for Collection
and pick your favorite. Return that. (Our main reason for
asking for a Collection is that
it can provide an iterator, but doesn't pin down much of anything else.)
JUnit tests
Include JUnit tests for your Hamming, edit distance, and
search-one-sequence methods (and we
won't complain if
you include them for anything else).
Hints and notes
Computing
the edit distance -- There is a lot of room for "optimization"
and code reuse here. While we do expect you to get rid of obvious
code redundancy, we don't expect you to get fancy (except for
extra credit). And you should get this working however it's
simplest first, before trying any optimizations. But if you have
time, here's a hint for one not-too-labor-intensive optimization:
You've seen that you don't need to bother counting the inserted letter,
because you'd always choose the "right" letter to insert anyway.
So...do you really need to bother actually inserting it? After
all, an insert in a string (which is essentially an array) is a rather
expensive operation --
everything past the insert has to be shifted down one. Could you
split up the distance comparison so you could avoid doing the insert?
Java Strings
and the cost of String operations that extract substrings
-- Some of you may be familiar with how Unix stores strings -- to
tell where the end is, a null character is put after the last character
in the string, rather than having a length specified. When a
string is stored that way, then in order to refer to a substring in the
middle, the substring has to be copied and a null put on its end.
This makes extracting a substring an expensive operation.
Java doesn't do that. Instead, Java represents String s
internally by an index into an array of chars and a length. String s
can't be modified, so Java can let any number String
objects point to various parts of one char array. In particular,
all its substring method has to do is make a String
with the right initial index and length, pointing into the same char
array as the original String . So go ahead and use substring
and other methods that return parts of String s without
hesitation.
That
SearchResult interface...
-- So, why do we want to return the sequence as well as the name
and match locations? One reason is that you might have, in your
collection, more
than one sequence with the same name, so returning just names and
locations won't tell which actual gene got a particular set of
matches. But another is that your collection that holds the names
and sequences is surely a private instance variable (right??), so the
user couldn't go find the sequence just given the name -- we haven't
asked you to provide that sort of accessor.
If you are
providing a GUI for extra credit -- Remember the MVC
architecture! Please divide your project into model (the part
that handles the files, data structures, and search and translation)
and view (the GUI) classes. You can keep the JFileChooser s
for opening files with the model, since we've asked you to read in
files in the GeneDataSearch constructor (just make their
parent component null instead of your window).
Otherwise when we try to run out tests on your model, your GUI would
pop up...
Why did we
keep calling this a database? -- We stopped calling
it that for part B because the term was bothering some folks. But
it shouldn't -- here's why. You've probably
heard the word database mainly applied to products like Oracle Database, Microsoft
SQL Server (and its much
cheaper development kit), postgreSQL,
and MySQL. Those are
"relational database management systems" (RDBMSs) -- they use a
particular idea of how to handle references among the data, and how to
safely process changes to to data even if lots of users are making
changes at once.
But more generally, a database is any repository for information that
lets someone perform "queries" -- ask questions that are answered based
on the stored information. The feature that makes a query
different from just looking up an item is that a query can specify
"constraints" on what information they're interested in, to select some
subset of all the information -- they do not need to ask for all the
items in the subset by name. When we let users select genes that
have regions closely matching a pattern, they're doing a query.
We could let them add other constraints, like searching only among
genes from a particular species -- that would make this even more
database-like.
Ok, so
what's a relational database? -- Some of you
have noticed the issue that when your program exits, all your
objects vanish. And even if you could save the values in all your
objects' instance variables, what if they were reference
variables? When you read your saved objects back in, how would
you figure out what used to point to what? After all, the value
of the reference depends on where the object gets put in the computer's
memory -- that's highly unlikely to be the same when your program is
restarted again. So even if you got your objects and their
instance variable values back, all their references would be
garbage.
"Relational" databases solve this problem by not using references at
all. Instead, each "record" in the database is identified by a
key. When one record needs to refer to another, it includes the
other's key in its data. The keys remain the same if the database
program is shut down and started again, so this way of referring to
another record doesn't go stale like memory-location references
do. But a key doesn't "point" directly to the other record the
way a memory-location reference does -- the program has to go search
for the key. So you can imagine how much slower this could
be. But good RDBMSs provide ways of quickly looking up keys (for
instance, they might put the keys in a hash table, where the value is
some information that tells the RDBMS how to locate the record in its
data structures).
Java has a solution to this problem called "serialization".
There's a long post on
this topic in the message board -- look for the subject "our database",
dated 2/15.
Extra credit
NOTE! If your extra
credit features need gene data with particular information in it in
order to show off your work (e.g. if you want those [key=value] pairs
on more gene entries, or want duplicate names), you can add fake
data easily by copying and modifying the records in the existing
files. Turn in your demo files with your code, and tell us how to
try your features in your reports.
Explore ways
to further optimize the edit distance calculation besides
what's given in the hint above. There may be some relatively
simple improvements possible here. (For a harder problem,
consider how to optimize the comparison with one pattern all along a
long sequence. Even if all you want to do is check for an exact
match, there are optimizations that range from somewhat scary to
amazingly arcane. See, for instance, Introduction to Algorithms, by
Cormen, Leiserson, and Rivest -- look for the section on string
matching.)
Give your
application a GUI that lets the user enter a search
pattern. Display the results in the GUI. Look at
JTextArea for
a
widget that lets the user enter text, or lets you display it. Its
doc page has a link to
the appropriate part of the Java tutorial. (But see the important
warning in the "Hints and notes" above.)
Show the
areas in a gene that are good matches by displaying the whole
gene sequence with the matches highlighted (e.g. in a different color,
or bold, or with ^^^^ under it on the next line, etc., etc.).
Let the user
restrict which genes to include a the search by
specifying some gene names, etc. If you interpreted and stored
other info from the name line (e.g. the [key=value] pairs), you could
let
the user restrict their search by specifying a key and value.
Etc. etc.
Glossary
distance
A
"distance" is a special kind of measure
of how different things are, that obeys the "triangle
inequality". This says that the distance from A to B plus the
distance from B to C is always at least as much as the distance from A
to C. We won't need to use this property here, but wanted to let
you know why measures of difference are sometimes called distances...
genome
The complete set of genes that a
species has.
What to Turn In
Use this
online turnin form to
turn in the Java source files that make up your project, including any
JUnit tests. If your project uses any other files, such as images, turn
those in also. If you have many files, including things like images,
you can bundle them into an archive file (zip, jar, or tar) and turn
that in. Multiple turnins are fine and highly recommended if you are
planning to add extra credit features. Once you've got something that
meets the
basic requirements, turn that in. Then if you add to your project, turn
in the extended version(s) later - we'll grade the last one you turn in.
Caution! Several otherwise very
nice projects on the project 1 failed to compile because they
defined packages, but didn't have the directories to go with
them. This is causing your TAs grief. Please consider
whether you need to put your classes in packages at all -- that is
usually for very large projects with zillions of classes. If you
really want to use packages, please contact your TA to find out how to
make an archive of your classes and directories that will compile
successfully.
Eclipse users: If you're using eclipse projects, please
don't turn those in. Instead, make sure your code compiles
without eclipse, and turn in your files either individually or in one
of the types of archive files that the turnin server supports.
|