CSE143 Winter 2005 Project #2 Part A
This
version has an alternative, more realistic, translation procedure,
which you may implement instead of the original one.
A Gene Database
Due: Wednesday, February 16, at 9:00 pm. No late assignments
will
be accepted.
Background
(We're using a few terms from
genetics. Terms that may
be new to you are defined where they're first used, but there is also a
glossary at the end. If we made up our own names for these
things, it would only be confusing if you tried to read up on them
elsewhere. Also at the end are a few hints, which you are free
to ignore -- you may find something you like better.)
In this project, you'll be working with data representing genes -- very long sequences of
molecules called DNA (deoxyribonucleic acid) bases. These sequences of
bases serve as a code that can tell a cell's protein-making apparatus
what sequence of amino acids it should put in when it builds a protein
molecule.
There are only four different bases: adenine, guanine,
cytosine, and thymine. But proteins are made up of 20 different
amino acids, so it will take more than one base to tell which amino
acid to use. In fact, a sequence of three bases is needed.
These triplets of bases are called codons.
Several different codons may represent the same amino acid. Some
codons serve as instructions to the protein-making apparatus.
Sometimes the meaning of a codon depends on where it is
in the gene...
In recent years, there has been a major effort to record the
sequences of as many genes as possible, and make them available to
genetics researchers. A lab that discovers the sequence of a
particular gene will submit it to a central database.
Bases are traditionally represented by the letters A (for
adenine), G (guanine),
C (cytosine), and T (thymine). So a DNA sequence can be
represented by a string of letters, e.g. CTTGGGTTTTATAAC...
Likewise amino acids can each be represented by a letter, so a protein
can also be represented by a string.
Researchers who work on genetics often have a short sequence
of bases that they want to compare with the genes stored in the
database (or they have a sequence of amino acids that they want to
compare with the proteins the genes represent). The short
sequence might appear anywhere in the gene -- it might even be in there
more than once. And it might not be exactly the same as what's in
the gene. Often it's sequences that are similar, but not exactly
the same, that are the most interesting.
Since both the sequence they want to compare and the
sequences in the database are represented by strings of letters,
comparing them is a problem of comparing strings. This is usually
called the "string matching" problem, though that is a bit misleading,
since we want to find more than just exact matches. Instead, we
have a "pattern" that we want to compare to all parts of a sequence,
and decide how similar it is to each part.
This project will have two parts. In the first part,
you'll read in files containing gene sequences, and store them in some
way that's convenient for searching -- this will be your own gene
database. You'll also write a translator that will take a gene
sequence and return the amino acid sequence it represents.
In the second part, you'll be writing code to allow searching
your database, and do the string matching.
Administrivia
You should work with your assigned partner on this
project using the pair programming style discussed in class. While you may want to think about the project,
sketch ideas, and
try them out on the computer yourself (when your partner is not
around), it works best if the two of you write the actual project
together at a single computer, trading off the keyboard at least every
15 or 20 minutes. You and
your partner will turn in a single set of files. After the final part
of the project is done,
each of you will individually produce a written report. (Details about
that will be supplied separately.)
Grading: This is
the first part of a two-part project. When the project is
complete,
your project will be evaluated both for how well the code works and how
well it is written and tested. For this intermediate part of the
project, we will try to give you quick feedback on the scale of 0-3:
3=no major problems, 2=something needs to be fixed, 1=serious trouble,
and 0=no credible effort. Be sure to include appropriate JavaDoc and
other comments in your code, use meaningful names, indent sensibly,
provide toString() methods where appropriate, and so forth.
Overview
This part of the project is mainly concerned with file reading using
the Java library classes for input. You'll be reading in two
sorts of files. One will be a collection of gene sequences, in a
format that's very much like the one used to submit gene data to "real"
gene databases. The other will be a table telling which codons
represent which amino acids, which you'll need for your
translator. You'll only need to read one table of codons.
But you might get any number of files to read into your database -- you
should be prepared to let a user enter multiple files.
You'll also need to come up with a way to represent the data you read
from the gene sequence files, and from the codon-to-amino-acid table.
Your gene-to-protein translator will take a sequence string as input,
and return a string with the equivalent amino acid string. (This
is not as simple a problem as it sounds -- some parts of the gene
sequence get skipped over. There are codons that tell when to
start and stop translating -- there can be any number of starts and
stops in a gene. We'll describe this in detail below.)
You should also provide some simple methods that will print out items
in
your database, and the results of doing a gene sequence to protein
sequence translation.
Details
Getting file names from the user
Your program should have a way to let a user enter file names.
You should ask them first for the file containing the table of
conversions from codons to amino acids. After you have that, you
should ask for the name of the gene sequence file.
Your code that gets file names should be prepared to handle errors,
e.g. if no file is found.
Use a JFileChooser to let the user select files.
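Here is a minimal sketch of one way to ask for a file and check it before you try to read it. The class and method names (FilePrompt, askForFile, requireReadable) are made up for illustration; only JFileChooser and File come from the Java library.

```java
import java.io.File;
import java.io.FileNotFoundException;
import javax.swing.JFileChooser;

/** Hypothetical helper: asks the user for a file and checks that it can be read. */
class FilePrompt {

    /** Shows an "open" dialog with the given title; returns null if the user cancels. */
    static File askForFile(String title) {
        JFileChooser chooser = new JFileChooser();
        chooser.setDialogTitle(title);
        int result = chooser.showOpenDialog(null);   // null parent: no owning window
        return (result == JFileChooser.APPROVE_OPTION) ? chooser.getSelectedFile() : null;
    }

    /** Returns the file unchanged, or throws if nothing was chosen or it cannot be read. */
    static File requireReadable(File file) throws FileNotFoundException {
        if (file == null) {
            throw new FileNotFoundException("no file was selected");
        }
        if (!file.isFile() || !file.canRead()) {
            throw new FileNotFoundException("cannot read file: " + file.getPath());
        }
        return file;
    }
}
```

You might call askForFile once for the codon table and then again for each gene sequence file, passing each result through requireReadable before opening it.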
Reading the table of codon to amino acid conversions
(This section will make more
sense after you've read the section on translating gene sequences to
proteins, below.)
This table has one or more lines for each amino acid, and for codons
that mean start and stop. Each line for an amino acid has one
letter at the beginning, which is the letter for the amino acid.
After that it has a space, then the three letters of the codon.
Since some amino acids are represented by more than one codon, there
can be several lines for each amino acid. Start and stop codons
will have the whole word "start" or "stop" at the beginning of the
line. You should ignore blank lines. Here are examples of a
few
lines:
A GCT
A GCC
N AAT
M ATG
Start ATG
Stop TAA
Stop TAG
There is a link on the assignment page to this file.
(In the glossary, there is a list of names of amino acids, with their
letters.)
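The line format described above could be read with something like the following sketch. CodonTable and its fields are hypothetical names, and the choices here are assumptions rather than requirements: single-letter labels are treated as amino acids, and any longer label (such as "Start" or "Stop") is stored, lower-cased, as a named signal, so nothing is hard-coded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical container for the parsed table; the names are illustrative only. */
class CodonTable {
    /** Maps a codon such as "GCT" to its amino acid letter such as "A". */
    final Map<String, String> aminoAcids = new HashMap<>();
    /** Maps a lower-cased label such as "start" or "stop" to its sequences. */
    final Map<String, Set<String>> signals = new HashMap<>();

    static CodonTable read(BufferedReader in) throws IOException {
        CodonTable table = new CodonTable();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;                              // ignore blank lines
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IOException("malformed table line: " + line);
            }
            String label = parts[0];
            String sequence = parts[1].toUpperCase();
            if (label.length() == 1) {
                table.aminoAcids.put(sequence, label.toUpperCase());
            } else {
                table.signals
                     .computeIfAbsent(label.toLowerCase(), k -> new HashSet<>())
                     .add(sequence);
            }
        }
        return table;
    }
}
```

Note that ATG ends up in both places: as the codon for M and as a "start" signal. Whether that suits your translator is a design decision for you to make.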
Storing codon to amino acid conversions
Your translator will use this information when it converts a string of
bases to a string representing a protein. You should look at what
the translator has to do (described below) and decide how to store this
information so it's convenient for the translator.
Reading in gene sequence files
Each file will have multiple gene records. In each record, there
are at least two lines, and probably more. The first line in each
record starts with a > which indicates the start of a new
record. There might be some spaces after that, which you should
ignore. After that is a string of letters, numbers, and some
punctuation characters (like dash, underscore, period) -- this is the
name of
the gene. After the name, there might be some spaces, or the end
of
the line, or there may be other descriptive information -- you can
ignore that. (But see the extra credit suggestion
below.)
After this line are one or more lines with strings of letters
representing bases. They end when the next record starts, or at
the end of the file. The sequence might take up more than one
line. In that case you should append each line after the first to
the right-hand end of the line before it.
Here's an example of what might be in these files:
>eIF4E [organism=Drosophila melanogaster] [strain=Oregon R]
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCA
> ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCATTGA
TGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
Here, eIF4E and ABC-1 are gene names.
The single line following the eIF4E line is its
sequence. The two lines after ABC-1 should be
joined.
Your code should be prepared to deal with errors while reading the
file. For instance, what if the
file you're reading is on a flash disk, and someone removes it while
you're reading the file? In general you should report errors by
throwing an exception.
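As a rough sketch of the record format just described, the following reads records from a BufferedReader and lets any IOException from the underlying device propagate to the caller. GeneFileReader and readRecords are hypothetical names. It assumes that blank lines can be skipped and that the gene name ends at the first space or [ character; neither is stated above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the gene sequence file format described above. */
class GeneFileReader {

    /** Returns one {name, sequence} pair per record, in file order. */
    static List<String[]> readRecords(BufferedReader in) throws IOException {
        List<String[]> records = new ArrayList<>();
        String name = null;                            // null until the first '>' line
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;                              // assumption: blank lines are ignored
            }
            if (line.startsWith(">")) {
                if (name != null) {                    // finish the previous record
                    records.add(new String[] { name, sequence.toString() });
                }
                String header = line.substring(1).trim();
                name = header.split("[\\s\\[]", 2)[0]; // name ends at whitespace or '['
                if (name.isEmpty()) {
                    throw new IOException("record has no gene name: " + line);
                }
                sequence.setLength(0);
            } else {
                if (name == null) {
                    throw new IOException("sequence data before the first '>' line");
                }
                sequence.append(line);                 // join continuation lines
            }
        }
        if (name != null) {                            // the last record ends at end of file
            records.add(new String[] { name, sequence.toString() });
        }
        return records;
    }
}
```

A String array is used here only to keep the sketch short; in your project you will probably want a proper gene record object instead.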
We'll give you several files to practice on for now. When you get
to part B, we'll provide larger files, to make searching more
interesting. Look for a link on the assignment page. You
may also want to make up some simple files with fake data as test
cases.
Storing gene data
For each gene record, you'll want to keep the name and the
sequence. Later, you might decide you want to keep other
information for each gene. You may want to have an object to hold
each gene's information.
Note that the names of genes may not be unique! There might be
several entries in the files for the same type of gene, each with
slightly different sequences. (This will be an issue if you want
to use a HashMap, which needs a unique key for each item.)
Once you have a gene record's information packaged up however you
choose, you'll want to put it in some structure that will be your
database. In part B, you'll be doing searches in your database,
in which you'll need to go through all the gene records, or maybe find
individual records by name. So you
should choose a way to store your gene records that you think will be
appropriate.
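One possible arrangement, sketched below, keeps every record in a list (for going through all of them) and also indexes them by name, with a list of records per name so that duplicate names are not a problem. GeneRecord and GeneDatabase are hypothetical names, and this is only one of several reasonable designs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical record holding one gene's name and sequence. */
class GeneRecord {
    private final String name;
    private final String sequence;

    GeneRecord(String name, String sequence) {
        this.name = name;
        this.sequence = sequence;
    }

    String getName() { return name; }
    String getSequence() { return sequence; }

    @Override
    public String toString() {
        return name + " (" + sequence.length() + " bases): " + sequence;
    }
}

/** Hypothetical database: a list for full scans plus a name index that allows duplicates. */
class GeneDatabase {
    private final List<GeneRecord> all = new ArrayList<>();
    private final Map<String, List<GeneRecord>> byName = new HashMap<>();

    void add(GeneRecord record) {
        all.add(record);
        byName.computeIfAbsent(record.getName(), k -> new ArrayList<>()).add(record);
    }

    /** Every record, in the order it was added. */
    List<GeneRecord> getAll() {
        return Collections.unmodifiableList(all);
    }

    /** All records with the given name; empty if there are none. */
    List<GeneRecord> findByName(String name) {
        List<GeneRecord> found = byName.get(name);
        if (found == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(found);
    }
}
```

The trade-off is a little extra bookkeeping in add in exchange for fast lookup by name later.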
Translating gene sequences to proteins
Several
people objected to the original translation procedure on the grounds
that it was not biologically accurate. So we're including a
somewhat more realistic procedure that you can implement instead if you
prefer. The more realistic procedure is still not what
really happens, but it's closer. Keep in mind that the intent of this
project is for you to learn something about Java input, string
processing, and data structures. We should have warned you that
there were no guarantees as to the accuracy of the "biology".
There are still no such guarantees.
More realistic procedure:
Each gene sequence starts out with some bases that do not translate to
amino acids, followed by several alternating regions that do get
translated (called exons) and
regions that don't get translated (called introns).
If you read from the beginning of the gene string,
you'll eventually find a "start" codon. (There happens to be only
one codon that represents "start" -- it's the sequence ATG.) Note
that there can be any number of bases
before the start codon -- those don't have to come in threes.
The amino acid indicated by the start codon should be included in the
translation.
After the start codon, each following codon (each set of three bases)
should be translated to the single amino acid letter it
represents, until the gene ends (see below) or a short sequence is
found that indicates the beginning of an intron. (This sequence
will be labeled "Donor" in the alternate version of the translation
file. There is only one of these included in the file -- it is
the two-base sequence GT.)
After a donor sequence is found, ignore anything until you find a
sequence that indicates the end of the intron. (This sequence
will be labeled Acceptor in the alternate version of the translation
file. There is only one of these -- AG.) Just like at
the beginning of the sequence, there can be any number of bases between
the donor and the next acceptor -- it doesn't have to be a multiple of
three.
After that acceptor, you can continue translating -- this continuation is
part of the same protein.
That is, the region to leave out starts with GT after a codon that gets
translated, and ends with AG:
. . . xxx xxx     GTx . . . xAG      xxx xxx . . .
<---translated--> <-not translated-> <--translated--->
You're done translating either if you get to the end of the gene
string, or if you're in an exon (i.e. in a region where you're
translating) and you find a "stop" codon. There may be a few
extra bases at the end, after the
last codon that gets translated -- these should be ignored.
You should not hard-code the start, stop, acceptor, or donor sequences
in your program, nor assume that there's
only one start, acceptor, or donor.
Example -- this sequence:
TTATGGAAATTCGAGTAAACTGCCAAATACTAAGTCACCATCATTGTGAATTTA
codes for:
MEIRSPSL
(What's still wrong with this
is that although there is always a GT at the beginning of an intron,
and AG always appears at the end, they may also appear when the region
is not an intron. There are other conditions besides just the
appearance of these short sequences that are required. But these
other conditions are apparently not completely known. Assuming
that a region is always an intron if it starts with GT after a codon,
and ends with AG, will clip out more than would normally get
omitted. For one thing, if GT always stops translation, then
valine will never appear in any translation -- all of its codons start
with GT. But we can't expect you to implement something that
isn't actually known, nor can you expect us to come up with a precise
description of an unknown procedure.)
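The more realistic procedure could be sketched roughly as follows. IntronAwareTranslator and its helper methods are hypothetical names, and the start, stop, donor, and acceptor sequences are passed in rather than hard-coded. The sketch makes a few assumptions of its own: a donor is only recognized directly after a translated codon, the search for an acceptor begins right after the donor, and a codon missing from the table is reported with an exception.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical translator for the "more realistic" procedure. */
class IntronAwareTranslator {

    static String translate(String gene, Map<String, String> aminoAcids,
                            Set<String> starts, Set<String> stops,
                            Set<String> donors, Set<String> acceptors) {
        StringBuilder protein = new StringBuilder();
        int[] start = findFirst(gene, 0, starts);       // may begin at any offset
        if (start == null) {
            return "";
        }
        protein.append(lookup(aminoAcids, gene.substring(start[0], start[1])));
        int pos = start[1];
        boolean afterCodon = true;                      // the start codon was just translated
        while (pos + 3 <= gene.length()) {
            String donor = afterCodon ? prefixAt(gene, pos, donors) : null;
            if (donor != null) {                        // an intron begins here
                int[] acceptor = findFirst(gene, pos + donor.length(), acceptors);
                if (acceptor == null) {
                    break;                              // intron runs to the end of the gene
                }
                pos = acceptor[1];                      // resume right after the acceptor
                afterCodon = false;
                continue;
            }
            String codon = gene.substring(pos, pos + 3);
            if (stops.contains(codon)) {
                break;                                  // stop codon inside an exon: done
            }
            protein.append(lookup(aminoAcids, codon));
            pos += 3;
            afterCodon = true;
        }
        return protein.toString();
    }

    /** Returns {start, end} of the earliest match at or after from, or null if none. */
    private static int[] findFirst(String text, int from, Set<String> patterns) {
        int[] best = null;
        for (String p : patterns) {
            int at = text.indexOf(p, from);
            if (at >= 0 && (best == null || at < best[0])) {
                best = new int[] { at, at + p.length() };
            }
        }
        return best;
    }

    /** Returns the pattern that occurs exactly at pos, or null if none does. */
    private static String prefixAt(String text, int pos, Set<String> patterns) {
        for (String p : patterns) {
            if (text.startsWith(p, pos)) {
                return p;
            }
        }
        return null;
    }

    private static String lookup(Map<String, String> aminoAcids, String codon) {
        String letter = aminoAcids.get(codon);
        if (letter == null) {
            throw new IllegalArgumentException("no amino acid for codon " + codon);
        }
        return letter;
    }
}
```

Passing the sequences in as sets is what keeps this from assuming there is only one start, donor, or acceptor.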
Original procedure:
Each gene sequence starts out with some bases that do not translate to
amino acids. If you read from the beginning of the gene string,
you'll eventually find a "start" codon. (There happens to be only
one codon that represents "start" -- it's the sequence ATG. But
you should not hard-code this in your program, nor assume that there's
only one "start".) Note that there can be any number of bases
before the start codon -- those don't have to come in threes.
After the start codon, each following codon (each set of three bases)
should be translated to the single amino acid letter it
represents. (In this procedure the start codon itself is not translated,
as the example below shows.) Translation should stop again when one of the "stop"
codons appears.
Then ignore anything until you find another start. Just like at
the beginning of the sequence, there can be any number of bases between
the stop and the next start -- it doesn't have to be a multiple of
three.
After that start, you can continue translating -- this continuation is
part of the same protein.
You're done translating when you get to the end of the gene
string. There may be one or two extra bases at the end, after the
last codon -- these should be ignored.
Your translator method should take the gene sequence as a String
parameter, and produce a String with the amino acid letters.
Here's an example. This sequence:
TTATGGAAATTCGATAAAACTGCCAAATACTATGTCACCATCATTGTGAATTTA
codes for:
EIRSPSL
We'll give you some translations that you can use to check your work --
look for a link on the assignment page.
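For comparison, the original procedure could be sketched as a simple two-state loop: slide along one base at a time while looking for a start, then step three bases at a time while translating. SimpleTranslator is a hypothetical name, and throwing an exception for a codon missing from the table is an assumption of this sketch.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical translator for the original procedure. */
class SimpleTranslator {

    static String translate(String gene, Map<String, String> aminoAcids,
                            Set<String> starts, Set<String> stops) {
        StringBuilder protein = new StringBuilder();
        boolean translating = false;
        int pos = 0;
        while (pos + 3 <= gene.length()) {
            String codon = gene.substring(pos, pos + 3);
            if (!translating) {
                if (starts.contains(codon)) {
                    translating = true;       // the start codon itself is not translated
                    pos += 3;
                } else {
                    pos += 1;                 // keep searching, one base at a time
                }
            } else if (stops.contains(codon)) {
                translating = false;          // look for the next start after this stop
                pos += 3;
            } else {
                String letter = aminoAcids.get(codon);
                if (letter == null) {
                    throw new IllegalArgumentException("no amino acid for codon " + codon);
                }
                protein.append(letter);
                pos += 3;
            }
        }
        return protein.toString();            // leftover bases at the end are ignored
    }
}
```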
Getting and displaying gene information
Give your gene record class a toString() method that
makes a nicely formatted String out of the name and sequence. You'll
probably also want accessor methods that return the name or sequence.
JUnit tests
Include JUnit tests for your translation code (and we won't complain if
you include them for anything else).
You might find you have a test file where the tests all need a
database. This would be a good opportunity to try out using a
setUp() method. In order to make testing your code more
automatic, you might want to have a way to add files to your database
using a method call to pass in the file name, rather than reading it
from the user.
Hints
File reading
-- You may want to use BufferedReader's readLine method to avoid having
to clean off return and linefeed characters at the ends of lines.
Detecting
input errors -- In the section on reading in gene sequence
files,
you're told to deal with errors. If there is a problem with the
device the
file is on, some code deep down in the Java runtime
system will throw an exception. If you want to handle the sort of
error that causes the runtime system to throw an exception, you'll need
to catch it. The Java API documentation lists most of the
relevant exceptions that API methods can throw.
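Here is a small sketch of the difference between letting an exception propagate and catching it. LineLoader and both method names are hypothetical. The low-level method declares IOException and lets it pass through; the top-level method catches it and reports the problem. Returning null on failure is just one possible convention.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example of propagating versus catching an I/O exception. */
class LineLoader {

    /** Low-level reader: any I/O problem propagates to the caller as an IOException. */
    static List<String> readLines(String fileName) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if readLine throws
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Top-level caller: catches the exception and reports it; returns null on failure. */
    static List<String> readLinesOrReport(String fileName) {
        try {
            return readLines(fileName);
        } catch (IOException e) {   // FileNotFoundException is a kind of IOException
            System.err.println("Could not read " + fileName + ": " + e.getMessage());
            return null;
        }
    }
}
```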
HashMap
-- The collection classes you've used so far use a numerical index to
identify items, and that index can change if you add items before the
end of the collection. If you need to have a way to refer to an
item that won't change as you add things, or if you have names or other
labels for your items, and don't want some arbitrary number instead,
then you might want to use a HashMap. It lets you store each item
with a "key" -- you then use the key instead of some number to get the
item back. Depending on how you design your classes for this
program, you might find a HashMap useful. The one caution is that
a HashMap needs a unique key for each item. If your data doesn't
have anything that is guaranteed to be unique, you might need to check
for duplicates before you add new items, and maybe add something to the
key to make it unique.
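The "add something to the key" idea might look like the sketch below. UniqueKeys and putUnique are hypothetical names, and the #2, #3, ... suffix is an arbitrary choice that assumes # never appears in a gene name.

```java
import java.util.Map;

/** Hypothetical helper that avoids overwriting an existing HashMap entry. */
class UniqueKeys {

    /**
     * Stores value under name if that key is free, otherwise under
     * name#2, name#3, and so on. Returns the key that was actually used.
     */
    static <V> String putUnique(Map<String, V> map, String name, V value) {
        String key = name;
        int suffix = 2;
        while (map.containsKey(key)) {      // check for duplicates before adding
            key = name + "#" + suffix;
            suffix++;
        }
        map.put(key, value);
        return key;
    }
}
```

The drawback is that a lookup by plain name then finds only the first entry, so you may prefer a structure that keeps several records per name.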
Extra credit
The gene sequence files may have additional information on the same
line as the name of the gene. This can include things like what
organism the gene was found in. Each item of information has the
form:
[key=value]
That is, it starts with a [, then anything between the [ and = is a
string that tells what kind of information it is. The part
between the = and the ] is the information itself. For instance:
> ras [organism=Mus musculus]
tells you that this ras gene
was found in a mouse. There can be several of these key / value
pairs on the line. You might read in this extra information, and
store it in your gene record.
(Some of this extra information might be useful to someone searching
your database. For instance, they might want to only search in
genes from a particular organism. So you might think about
whether this affects your decision about what sort of structure to use
for your database, in case you want to let users use this information
in searches, as extra credit in part B.)
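If you try this, a regular expression is one way to pull out the pairs. The sketch below uses a hypothetical class named HeaderAttributes; it assumes keys contain no = or ] characters and values contain no ], and it keeps the pairs in the order they appear.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the [key=value] items on a gene's name line. */
class HeaderAttributes {

    // '[', a key with no '=' or ']', then '=', a value with no ']', then ']'
    private static final Pattern PAIR = Pattern.compile("\\[([^=\\]]+)=([^\\]]*)\\]");

    /** Returns the key/value pairs found on the line, in order of appearance. */
    static Map<String, String> parse(String headerLine) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(headerLine);
        while (m.find()) {
            attributes.put(m.group(1).trim(), m.group(2).trim());
        }
        return attributes;
    }
}
```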
We've asked you to print out sequences as strings. But sometimes
sequences are displayed graphically, as rows of colored bars, with a
different color for each base or amino acid. In a gene sequence,
it might be interesting to have the regions that code for amino acids
highlighted.
Besides files, you might let the user download a file from the Web --
you could let them enter a URL instead of just a file name. (Note
that a URL can represent a local file too!)
Give your application a GUI that lets the user tell you when they want
to enter a file name, rather than you having to prompt for it.
Then they can enter new gene sequence files whenever they want.
Maybe let them enter the name of a gene to translate, and display the
results. (A JLabel is a simple Swing component that can display
text.) If you also let them load the codon to amino acid table
whenever they want, then it's an error if you need to translate
something before they've given you the file.
Glossary
amino acid
These are the molecules that are joined to form proteins.
(Although a protein is a single series of amino acids, it is not just a
long straight molecule -- it folds up into a complicated shape.
It's this shape that determines the protein's function. Figuring
out what shape a protein will fold into, given its amino acid sequence,
is just about the hottest topic in molecular biology today.) The
amino acids, with their single-letter abbreviations, are:
A alanine
R arginine
N asparagine
D aspartic acid
C cysteine
E glutamic acid
Q glutamine
G glycine
H histidine
I isoleucine
L leucine
K lysine
M methionine
F phenylalanine
P proline
S serine
T threonine
W tryptophan
Y tyrosine
V valine
base
A base is one of the molecules
that join to form a DNA (deoxyribonucleic acid) sequence, or similar
molecules like RNA (ribonucleic acid). The sets of bases for RNA
and DNA are slightly different -- here, we're only looking at
DNA. The bases for DNA, and their single-letter abbreviations,
are:
A adenine
C cytosine
G guanine
T thymine
codon
A codon is a sequence of three
bases. Each codon that is between a start and a stop codon in a
gene sequence represents an amino acid in the protein that the gene
encodes.
deoxyribonucleic acid (DNA)
See base.
exon
A region of a gene that gets
translated into amino acids.
gene
A sequence of DNA bases that
represents a protein.
intron
A region of a gene that does not
get translated into amino acids.
For more information
In case you're interested, here are a few Web sites with information on
genetics and protein coding.
Primer on Molecular Genetics from the U.S. Department of Energy
Our gene sequence files will be very much like those used to submit
sequences to the National Center for Biotechnology Information (NCBI),
to be included in GenBank. Here is the
NCBI's description of gene data files.
What to Turn In
Use this
online turnin form to
turn in the Java source files that make up your project, including any
JUnit tests. If your project uses any other files, such as images, turn
those in also. If you have many files, including things like images,
you can bundle them into an archive file (zip, jar, or tar) and turn
that in. Multiple turnins are fine and highly recommended if you are
planning to add extra credit features. Once you've got something that
meets the
basic requirements, turn that in. Then if you add to your project, turn
in the extended version(s) later - we'll grade the last one you turn in.