CSE143 Winter 2005 Project #2 Part A

This version has an alternative, more realistic, translation procedure, which you may implement instead of the original one.

A Gene Database

Due: Wednesday, February 16, at 9:00 pm. No late assignments will be accepted.

Background

(We're using a few terms from genetics.  Terms that may be new to you are defined where they're first used, but there is also a glossary at the end.  If we make up our own names for these things, it will only be confusing if you try to read up on it elsewhere...  Also at the end are a few hints, which you are free to ignore -- you may find something you like better.)

In this project, you'll be working with data representing genes -- very long sequences of molecules called DNA (deoxyribonucleic acid) bases.  These sequences of bases serve as a code that can tell a cell's protein-making apparatus what sequence of amino acids it should put in when it builds a protein molecule.

There are only four different bases: adenine, guanine, cytosine, and thymine.  But proteins are made up of 20 different amino acids, so it will take more than one base to tell which amino acid to use.  In fact, a sequence of three bases is needed.  These triplets of bases are called codons.

Several different codons may represent the same amino acid.  Some codons serve as instructions to the protein-making apparatus.  Sometimes the meaning of a codon depends on where it is in the gene...

In recent years, there has been a major effort to record the sequences of as many genes as possible, and make them available to genetics researchers.  A lab that discovers the sequence of a particular gene will submit it to a central database.

Bases are traditionally represented by the letters A (for adenine), G (guanine), C (cytosine), and T (thymine).  So a DNA sequence can be represented by a string of letters, e.g. CTTGGGTTTTATAAC...  Likewise amino acids can each be represented by a letter, so a protein can also be represented by a string.

Researchers who work on genetics often have a short sequence of bases that they want to compare with the genes stored in the database (or they have a sequence of amino acids that they want to compare with the proteins the genes represent).  The short sequence might appear anywhere in the gene -- it might even be in there more than once.  And it might not be exactly the same as what's in the gene.  Often it's sequences that are similar, but not exactly the same, that are the most interesting.

Since both the sequence they want to compare, and the sequences in the database, are represented by strings of letters, then comparing them is a problem of comparing strings.  This is usually called the "string matching" problem, though that is a bit misleading, since we want to find more than just exact matches.  Instead, we have a "pattern" that we want to compare to all parts of a sequence, and decide how similar it is to each part.

This project will have two parts.  In the first part, you'll read in files containing gene sequences, and store them in some way that's convenient for searching -- this will be your own gene database.  You'll also write a translator that will take a gene sequence and return the amino acid sequence it represents.

In the second part, you'll be writing code to allow searching your database, and do the string matching.

Administrivia

You should work with your assigned partner on this project using the pair programming style discussed in class. While you may want to think about the project, sketch ideas, and try them out on the computer yourself (when your partner is not around), it works best if the two of you write the actual project together at a single computer, trading off the keyboard at least every 5 or 20 minutes. You and your partner will turn in a single set of files. After the final part of the project is done, each of you will individually produce a written report. (Details about that will be supplied separately.)

Grading: This is the first part of a two-part project.  When the project is complete, your project will be evaluated both for how well the code works and how well it is written and tested. For this intermediate part of the project, we will try to give you quick feedback on the scale of 0-3: 3=no major problems, 2=something needs to be fixed, 1=serious trouble, and 0=no credible effort. Be sure to include appropriate JavaDoc and other comments in your code, use meaningful names, indent sensibly, provide toString() methods where appropriate, and so forth.

Overview

This part of the project is mainly concerned with file reading using the Java library classes for input.  You'll be reading in two sorts of files.  One will be a collection of gene sequences, in a format that's very much like the one used to submit gene data to "real" gene databases.  The other will be a table telling which codons represent which amino acids, which you'll need for your translator.  You'll only need to read one table of codons.  But you might get any number of files to read into your database -- you should be prepared to let a user enter multiple files.

You'll also need to come up with a way to represent the data you read from the gene sequence files, and from the codon-to-amino-acid table.

Your gene-to-protein translator will take a sequence string as input, and return a string with the equivalent amino acid string.  (This is not as simple a problem as it sounds -- some parts of the gene sequence get skipped over.  There are codons that tell when to start and stop translating -- there can be any number of starts and stops in a gene.  We'll describe this in detail below.)

You should also provide some simple methods that will print out items in your database, and the results of doing a gene sequence to protein sequence translation.

Details

Getting file names from the user

Your program should have a way to let a user enter file names.  You should ask them first for the file containing the table of conversions from codons to amino acids.  After you have that, you should ask for the name of the gene sequence file.

Your code that gets file names should be prepared to handle errors, e.g. if no file is found.

Use a JFileChooser to let the user select files.
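As a rough illustration of the two requirements above (letting the user pick a file, and being prepared for a file that isn't there), here is a minimal sketch. The class name FilePicker and its method names are made up for this example; only JFileChooser and the java.io classes come from the Java library.

```java
import java.io.File;
import java.io.FileNotFoundException;
import javax.swing.JFileChooser;

// Hypothetical helper class; the names are illustrative, not required by the assignment.
class FilePicker {
    /** Shows an "open" dialog. Returns the chosen file, or null if the user cancels. */
    static File chooseFile(String dialogTitle) {
        JFileChooser chooser = new JFileChooser();
        chooser.setDialogTitle(dialogTitle);
        int result = chooser.showOpenDialog(null);
        if (result != JFileChooser.APPROVE_OPTION) {
            return null;
        }
        return chooser.getSelectedFile();
    }

    /** Returns the file unchanged, or throws if it is null, missing, or unreadable. */
    static File requireReadable(File file) throws FileNotFoundException {
        if (file == null || !file.isFile() || !file.canRead()) {
            throw new FileNotFoundException("Cannot read file: " + file);
        }
        return file;
    }
}
```

A caller might ask for the codon table first with chooseFile("Select the codon table"), pass the result through requireReadable, and ask again if an exception is thrown.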

Reading the table of codon to amino acid conversions

(This section will make more sense after you've read the section on translating gene sequences to proteins, below.)

This table has one or more lines for each amino acid, and for codons that mean start and stop.  Each line for an amino acid has one letter at the beginning, which is the letter for the amino acid.  After that it has a space, then the three letters of the codon.  Since some amino acids are represented by more than one codon, there can be several lines for each amino acid.  Start and stop codons will have the whole word "start" or "stop" at the beginning of the line.  You should ignore blank lines.  Here are examples of a few lines:

A GCT
A GCC
N AAT
M ATG

Start ATG
Stop TAA
Stop TAG


There is a link on the assignment page to this file.

(In the glossary, there is a list of names of amino acids, with their letters.)

Storing codon to amino acid conversions

Your translator will use this information when it converts a string of bases to a string representing a protein.  You should look at what the translator has to do (described below) and decide how to store this information so it's convenient for the translator.
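One possible way to read and store the table is sketched below. It is only one design among many: the class name CodonTable and its methods are invented for this example. It assumes that any one-letter label is an amino acid and that any longer label (Start, Stop, and so on) names a signal, so nothing is hard-coded about which sequences mean what.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical storage class for the codon table; names are illustrative.
class CodonTable {
    private final Map<String, Character> aminoAcids = new HashMap<>();
    private final Map<String, Set<String>> signals = new HashMap<>();

    /** Reads "label sequence" lines; blank lines are skipped. */
    static CodonTable read(BufferedReader in) throws IOException {
        CodonTable table = new CodonTable();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IOException("Malformed codon table line: " + line);
            }
            String label = parts[0];
            String sequence = parts[1].toUpperCase();
            if (label.length() == 1) {
                // Assumption: a one-letter label is an amino acid letter.
                table.aminoAcids.put(sequence, label.charAt(0));
            } else {
                // Assumption: a longer label (Start, Stop, ...) names a signal.
                table.signals
                     .computeIfAbsent(label.toLowerCase(), k -> new HashSet<>())
                     .add(sequence);
            }
        }
        return table;
    }

    /** Returns the amino acid letter for a codon, or null if it is not in the table. */
    Character aminoAcid(String codon) {
        return aminoAcids.get(codon);
    }

    /** Returns all sequences listed under a label such as "start" (case-insensitive). */
    Set<String> signal(String label) {
        return signals.getOrDefault(label.toLowerCase(), Collections.<String>emptySet());
    }
}
```

Keying the amino acid map by codon makes the translator's main lookup a single get() call; the signal sets answer "is this a start?" or "is this a stop?" with contains().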

Reading in gene sequence files

Each file will have multiple gene records.  In each record, there are at least two lines, and probably more.  The first line in each record starts with a > which indicates the start of a new record.  There might be some spaces after that, which you should ignore.  After that is a string of letters, numbers, and some punctuation characters (like dash, underscore, period) -- this is the name of the gene.  After the name, there might be some spaces, or the end of the line, or there may be other descriptive information -- you can ignore that.  (But see the extra credit suggestion below.)

After this line are one or more lines with strings of letters representing bases.  They end when the next record starts, or at the end of the file.  The sequence might take up more than one line.  In that case you should append each line after the first to the right-hand end of the line before it.

Here's an example of what might be in these files:
>eIF4E [organism=Drosophila melanogaster] [strain=Oregon R]
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCA
> ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCATTGA
TGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
Here, eIF4E and ABC-1 are gene names.  The single line following the eIF4E line is its sequence.  The two lines after ABC-1 should be joined.

Your code should be prepared to deal with errors while reading the file.  For instance, what if the file you're reading is on a flash disk, and someone removes it while you're reading the file?  In general you should report errors by throwing an exception.
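The record format described above could be parsed along the following lines. This is a sketch under a few assumptions that the assignment leaves open: the gene name is taken to end at the first whitespace character or '[', blank lines are skipped, and malformed input is reported with an IOException. The names GeneFileParser and Entry are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the ">name ..." record format; names are illustrative.
class GeneFileParser {
    /** One parsed record: the gene name and its joined sequence. */
    static class Entry {
        final String name;
        final String sequence;

        Entry(String name, String sequence) {
            this.name = name;
            this.sequence = sequence;
        }
    }

    static List<Entry> parse(BufferedReader in) throws IOException {
        List<Entry> entries = new ArrayList<>();
        String name = null;
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (name != null) {
                    entries.add(new Entry(name, sequence.toString()));
                }
                String header = line.substring(1).trim();
                // Assumption: the name ends at the first whitespace or '['.
                int end = 0;
                while (end < header.length()
                        && !Character.isWhitespace(header.charAt(end))
                        && header.charAt(end) != '[') {
                    end++;
                }
                name = header.substring(0, end);
                if (name.isEmpty()) {
                    throw new IOException("Record header has no gene name: " + line);
                }
                sequence.setLength(0);
            } else if (name == null) {
                throw new IOException("Sequence data before the first '>' line: " + line);
            } else {
                // Later sequence lines are appended to the right-hand end.
                sequence.append(line);
            }
        }
        if (name != null) {
            entries.add(new Entry(name, sequence.toString()));
        }
        return entries;
    }
}
```

Because parse() takes a BufferedReader rather than a file name, the same code can be fed a FileReader in the program and a StringReader in a test. Any IOException thrown by readLine (for instance, if the disk disappears) simply propagates to the caller.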

We'll give you several files to practice on for now.  When you get to part B, we'll provide larger files, to make searching more interesting.  Look for a link on the assignment page.  You may also want to make up some simple files with fake data as test cases.

Storing gene data

For each gene record, you'll want to keep the name and the sequence.  Later, you might decide you want to keep other information for each gene.  You may want to have an object to hold each gene's information.

Note that the names of genes may not be unique!  There might be several entries in the files for the same type of gene, each with slightly different sequences.  (This will be an issue if you want to use a HashMap, which needs a unique key for each item.)

Once you have a gene record's information packaged up however you choose, you'll want to put it in some structure that will be your database.  In part B, you'll be doing searches in your database, in which you'll need to go through all the gene records, or maybe find individual records by name.  So you should choose a way to store your gene records that you think will be appropriate.
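To make the trade-offs above concrete, here is one possible structure, offered only as a sketch. It keeps every record in a list (for searches that visit all records) and also indexes them by name in a HashMap whose values are lists, so duplicate names are not a problem. GeneRecord and GeneDatabase are names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record class: just a name and a sequence.
class GeneRecord {
    private final String name;
    private final String sequence;

    GeneRecord(String name, String sequence) {
        this.name = name;
        this.sequence = sequence;
    }

    String getName() { return name; }
    String getSequence() { return sequence; }
}

// Hypothetical database: a list for full scans plus a name index that allows duplicates.
class GeneDatabase {
    private final List<GeneRecord> all = new ArrayList<>();
    private final Map<String, List<GeneRecord>> byName = new HashMap<>();

    void add(GeneRecord record) {
        all.add(record);
        byName.computeIfAbsent(record.getName(), k -> new ArrayList<>()).add(record);
    }

    /** Returns every record with this name (possibly none, possibly several). */
    List<GeneRecord> findByName(String name) {
        List<GeneRecord> found = byName.get(name);
        if (found == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    /** Returns all records in the order they were added. */
    List<GeneRecord> allRecords() {
        return Collections.unmodifiableList(all);
    }

    int size() { return all.size(); }
}
```

Whether the extra index is worth it depends on how often part B looks records up by name; a plain ArrayList alone would also satisfy the description.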

Translating gene sequences to proteins

Several people objected to the original translation procedure on the grounds that it was not biologically accurate.  So we're including a somewhat more realistic procedure that you can implement instead if you prefer.  Note that the more realistic procedure is still not what really happens, but it's closer.  Note that the intent of this project is for you to learn something about Java input, string processing, and data structures.  We should have warned you that there were no guarantees as to the accuracy of the "biology".  There are still no such guarantees.

More realistic procedure:

Each gene sequence starts out with some bases that do not translate to amino acids, followed by several alternating regions that do get translated (called exons) and regions that don't get translated (called introns).

If you read from the beginning of the gene string, you'll eventually find a "start" codon.  (There happens to be only one codon that represents "start" -- it's the sequence ATG.)  Note that there can be any number of bases before the start codon -- those don't have to come in threes.

The amino acid indicated by the start codon should be included in the translation.

After the start codon, each following codon (each set of three bases) should be translated to the single amino acid letter it represents, until the gene ends (see below) or a short sequence is found that indicates the beginning of an intron.  (This sequence will be labeled "Donor" in the alternate version of the translation file.  There is only one of these included in the file -- it is the two-base sequence GT.)

After a donor sequence is found, ignore anything until you find a sequence that indicates the end of the intron.  (This sequence will be labeled Acceptor in the alternate version of the translation file.  There is only one of these -- AG.)  Just like at the beginning of the sequence, there can be any number of bases between the donor and the next acceptor -- it doesn't have to be a multiple of three.

After the acceptor, you can continue translating -- this continuation is part of the same protein.

That is, the region to leave out starts with GT after a codon that gets translated, and ends with AG:

.  .  .  xxx  xxx  GTx   .  .  .  xAG  xxx  xxx  .  .  .
<---translated-->  <-not translated->  <--translated--->

You're done translating either if you get to the end of the gene string, or if you're in an exon (i.e. in a region where you're translating) and you find a "stop" codon.  There may be a few extra bases at the end, after the last codon that gets translated -- these should be ignored.

You should not hard-code the start, stop, acceptor, or donor sequences in your program, nor assume that there's only one start, acceptor, or donor.

Example -- this sequence:
TTATGGAAATTCGAGTAAACTGCCAAATACTAAGTCACCATCATTGTGAATTTA
codes for:

MEIRSPSL

(What's still wrong with this is that although there is always a GT at the beginning of an intron, and AG always appears at the end, they may also appear when the region is not an intron.  There are other conditions besides just the appearance of these short sequences that are required.  But these other conditions are apparently not completely known.  Assuming that a region is always an intron if it starts with GT after a codon, and ends with AG, will clip out more than would normally get omitted.  For one thing, if GT always stops translation, then valine will never appear in any translation -- all of its codons start with GT.  But we can't expect you to implement something that isn't actually known, nor can you expect us to come up with a precise description of an unknown procedure.)
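The more realistic procedure could be sketched roughly as follows. This is one reading of the description, not an official solution. The class name SplicingTranslator is invented, the table is passed in as plain maps and sets so that nothing is hard-coded, and two details are assumptions: the donor is checked before each codon is read, and a codon missing from the table causes an IllegalArgumentException.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical translator for the "more realistic" procedure; names are illustrative.
class SplicingTranslator {
    private static final int CODON_LENGTH = 3;
    private final Map<String, Character> aminoAcids;
    private final Set<String> starts, stops, donors, acceptors;

    SplicingTranslator(Map<String, Character> aminoAcids, Set<String> starts,
                       Set<String> stops, Set<String> donors, Set<String> acceptors) {
        this.aminoAcids = aminoAcids;
        this.starts = starts;
        this.stops = stops;
        this.donors = donors;
        this.acceptors = acceptors;
    }

    /** Length of a pattern that matches at pos, or -1 if none does. */
    private static int matchAt(String gene, int pos, Set<String> patterns) {
        for (String p : patterns) {
            if (!p.isEmpty() && gene.startsWith(p, pos)) {
                return p.length();
            }
        }
        return -1;
    }

    private char lookup(String codon) {
        Character letter = aminoAcids.get(codon);
        if (letter == null) {
            // Assumption: an unknown codon is treated as an error.
            throw new IllegalArgumentException("No amino acid for codon " + codon);
        }
        return letter;
    }

    String translate(String gene) {
        int i = 0;
        // Skip leading bases one at a time until a start codon is found.
        while (i < gene.length() && matchAt(gene, i, starts) < 0) {
            i++;
        }
        if (i >= gene.length()) {
            return "";
        }
        StringBuilder protein = new StringBuilder();
        int startLength = matchAt(gene, i, starts);
        protein.append(lookup(gene.substring(i, i + startLength))); // start is translated
        i += startLength;
        while (true) {
            int donorLength = matchAt(gene, i, donors);
            if (donorLength > 0) {
                // Intron: skip ahead to just past the next acceptor, if there is one.
                i += donorLength;
                int acceptorLength = -1;
                while (i < gene.length()
                        && (acceptorLength = matchAt(gene, i, acceptors)) < 0) {
                    i++;
                }
                if (acceptorLength < 0) {
                    break; // the gene ended inside the intron
                }
                i += acceptorLength;
                continue;
            }
            if (i + CODON_LENGTH > gene.length()) {
                break; // fewer than three bases left; ignore them
            }
            String codon = gene.substring(i, i + CODON_LENGTH);
            if (stops.contains(codon)) {
                break;
            }
            protein.append(lookup(codon));
            i += CODON_LENGTH;
        }
        return protein.toString();
    }
}
```

With ATG as the start, GT as the donor, AG as the acceptor, and TAA, TAG, TGA as stops, this sketch turns the example sequence above into MEIRSPSL.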

Original procedure:

Each gene sequence starts out with some bases that do not translate to amino acids.  If you read from the beginning of the gene string, you'll eventually find a "start" codon.  (There happens to be only one codon that represents "start" -- it's the sequence ATG.  But you should not hard-code this in your program, nor assume that there's only one "start".)  Note that there can be any number of bases before the start codon -- those don't have to come in threes.

After the start codon, each following codon (each set of three bases) should be translated to the single amino acid letter it represents.  Translation should stop again when one of the "stop" codons appears.

Then ignore anything until you find another start.  Just like at the beginning of the sequence, there can be any number of bases between the stop and the next start -- it doesn't have to be a multiple of three.

After that start, you can continue translating -- this continuation is part of the same protein.

You're done translating when you get to the end of the gene string.  There may be one or two extra bases at the end, after the last codon -- these should be ignored.

Your translator method should take the gene sequence as a String parameter, and produce a String with the amino acid letters.

Here's an example.  This sequence:
TTATGGAAATTCGATAAAACTGCCAAATACTATGTCACCATCATTGTGAATTTA
codes for:

EIRSPSL

We'll give you some translations that you can use to check your work -- look for a link on the assignment page.
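For comparison, the original procedure might be sketched like this. Again, this is one interpretation rather than a reference solution. SimpleTranslator is an invented name, the table is passed in as a map and two sets, and the sketch assumes that the start codon itself is not translated (which matches the example, where no M appears), that the stop codon is consumed before the search for the next start begins, and that an unknown codon is an error.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical translator for the original procedure; names are illustrative.
class SimpleTranslator {
    private static final int CODON_LENGTH = 3;
    private final Map<String, Character> aminoAcids;
    private final Set<String> starts;
    private final Set<String> stops;

    SimpleTranslator(Map<String, Character> aminoAcids,
                     Set<String> starts, Set<String> stops) {
        this.aminoAcids = aminoAcids;
        this.starts = starts;
        this.stops = stops;
    }

    String translate(String gene) {
        StringBuilder protein = new StringBuilder();
        boolean translating = false;
        int i = 0;
        while (i + CODON_LENGTH <= gene.length()) {
            String codon = gene.substring(i, i + CODON_LENGTH);
            if (!translating) {
                if (starts.contains(codon)) {
                    translating = true;   // the start codon itself is not translated
                    i += CODON_LENGTH;
                } else {
                    i++;                  // outside a coding region, move one base at a time
                }
            } else if (stops.contains(codon)) {
                translating = false;      // assumption: the stop codon is consumed
                i += CODON_LENGTH;
            } else {
                Character letter = aminoAcids.get(codon);
                if (letter == null) {
                    throw new IllegalArgumentException("No amino acid for codon " + codon);
                }
                protein.append(letter);
                i += CODON_LENGTH;
            }
        }
        return protein.toString();
    }
}
```

Given ATG as the only start and TAA, TAG, TGA as stops, the example sequence above comes out as EIRSPSL.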

Getting and displaying gene information

Give your gene record class a toString() method that makes a nicely formatted String out of the name and sequence.  You'll probably also want accessor methods that return the name or sequence.
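What counts as "nicely formatted" is up to you. As one possibility, the sketch below prints the name on a header line in the same ">name" style as the input files and wraps the sequence at 60 characters per line. SequenceFormatter, the ">" prefix, and the width of 60 are all choices made for this example; a toString() method could simply delegate to format().

```java
// Hypothetical formatting helper that a gene record's toString() could call.
class SequenceFormatter {
    /** Breaks a sequence into lines of at most width characters. */
    static String wrap(String sequence, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sequence.length(); i += width) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(sequence, i, Math.min(i + width, sequence.length()));
        }
        return out.toString();
    }

    /** Header line with the name, then the sequence wrapped at 60 characters. */
    static String format(String name, String sequence) {
        return ">" + name + "\n" + wrap(sequence, 60);
    }
}
```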

JUnit tests

Include JUnit tests for your translation code (and we won't complain if you include them for anything else).

You might find you have a test file where the tests all need a database.  This would be a good opportunity to try out using a setUp() method.  In order to make testing your code more automatic, you might want to have a way to add files to your database using a method call to pass in the file name, rather than reading it from the user.

Hints

File reading -- You may want to use BufferedReader's readLine method to avoid having to clean off return and linefeed characters at the ends of lines.

Detecting input errors -- In the section on reading in gene sequence files, you're told to deal with errors.  If there is a problem with the device the file is on, some code deep down in the Java runtime system will throw an exception.  If you want to handle the sort of error that causes the runtime system to throw an exception, you'll need to catch it.  The Java API documentation lists most of the relevant exceptions that API methods can throw.
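As a small illustration of catching such an exception, the sketch below opens a file by name, reads it line by line, and rethrows any IOException with the file name added to the message. SafeFileLoader and readAll are names invented for this example; what you do in the catch block (retry, ask the user for another file, give up) is your design decision.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper showing one way to catch and re-report an input error.
class SafeFileLoader {
    /** Reads every line of the named file, each followed by '\n'. */
    static String readAll(String fileName) throws IOException {
        // try-with-resources closes the reader even if readLine throws.
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            StringBuilder text = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
            return text.toString();
        } catch (IOException e) {
            // Add context so the caller can tell the user which file failed.
            throw new IOException("Problem reading " + fileName + ": " + e.getMessage(), e);
        }
    }
}
```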

HashMap -- The collection classes you've used so far use a numerical index to identify items, and that index can change if you add items before the end of the collection.  If you need to have a way to refer to an item that won't change as you add things, or if you have names or other labels for your items, and don't want some arbitrary number instead, then you might want to use a HashMap.  It lets you store each item with a "key" -- you then use the key instead of some number to get the item back.  Depending on how you design your classes for this program, you might find a HashMap useful.  The one caution is that a HashMap needs a unique key for each item.  If your data doesn't have anything that is guaranteed to be unique, you might need to check for duplicates before you add new items, and maybe add something to the key to make it unique.
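Here is a sketch of the "add something to the key" idea. It checks whether a name is already in the map and, if so, appends #2, #3, and so on until the key is unused. The class UniqueKeys and the "#" suffix are inventions for this example, and the sketch assumes that "#" never appears in a real gene name.

```java
import java.util.Map;

// Hypothetical helper for making HashMap keys unique; names are illustrative.
class UniqueKeys {
    /** Returns name if it is unused in the map, otherwise name#2, name#3, ... */
    static String uniqueKey(Map<String, ?> map, String name) {
        if (!map.containsKey(name)) {
            return name;
        }
        int n = 2;
        while (map.containsKey(name + "#" + n)) {
            n++;
        }
        return name + "#" + n;
    }
}
```

A caller would write something like map.put(UniqueKeys.uniqueKey(map, name), record). The cost of this approach is that looking up every record named "ras" then means checking ras, ras#2, ras#3, and so on.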

Extra credit

The gene sequence files may have additional information on the same line as the name of the gene.  This can include things like what organism the gene was found in.  Each item of information has the form:
[key=value]
That is, it starts with a [, then anything between the [ and = is a string that tells what kind of information it is.  The part between the = and the ] is the information itself.  For instance:
> ras [organism=Mus musculus]
tells you that this ras gene was found in a mouse.  There can be several of these key / value pairs on the line.  You might read in this extra information, and store it in your gene record.
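If you try this, a regular expression is one way to pull the pairs out of the header line. The sketch below is only a suggestion: HeaderAttributes is an invented name, keys and values are trimmed of surrounding spaces, and a key that appears twice on the same line keeps only its last value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the [key=value] items on a gene header line.
class HeaderAttributes {
    // "[", a key with no '=' or ']', "=", a value with no ']', then "]".
    private static final Pattern PAIR =
        Pattern.compile("\\[([^=\\]]*)=([^\\]]*)\\]");

    /** Returns the key/value pairs in the order they appear on the line. */
    static Map<String, String> parse(String headerLine) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(headerLine);
        while (m.find()) {
            attributes.put(m.group(1).trim(), m.group(2).trim());
        }
        return attributes;
    }
}
```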

(Some of this extra information might be useful to someone searching your database.  For instance, they might want to only search in genes from a particular organism.  So you might think about whether this affects your decision about what sort of structure to use for your database, in case you want to let users use this information in searches, as extra credit in part B.)

We've asked you to print out sequences as strings.  But sometimes sequences are displayed graphically, as rows of colored bars, with a different color for each base or amino acid.  In a gene sequence, it might be interesting to have the regions that code for amino acids highlighted.

Besides files, you might let the user download a file from the Web -- you could let them enter a URL instead of just a file name.  (Note that a URL can represent a local file too!)

Give your application a GUI that lets the user tell you when they want to enter a file name, rather than you having to prompt for it.  Then they can enter new gene sequence files whenever they want.  Maybe let them enter the name of a gene to translate, and display the results.  (A JLabel is a simple Swing component that can display text.)  If you also let them load the codon to amino acid table whenever they want, then it's an error if you need to translate something before they've given you the file.

Glossary

amino acid

These are the molecules that are joined to form proteins.  (Although a protein is a single series of amino acids, it is not just a long straight molecule -- it folds up into a complicated shape.  It's this shape that determines the protein's function.  Figuring out what shape a protein will fold into, given its amino acid sequence, is just about the hottest topic in molecular biology today.)  The amino acids, with their single-letter abbreviations, are:

A alanine
R arginine
N asparagine
D aspartic acid
C cysteine
E glutamic acid
Q glutamine
G glycine
H histidine
I isoleucine
L leucine
K lysine
M methionine
F phenylalanine
P proline
S serine
T threonine
W tryptophan
Y tyrosine
V valine

base

A base is one of the molecules that join to form a DNA (deoxyribonucleic acid) sequence, or similar molecules like RNA (ribonucleic acid).  The set of bases for RNA and DNA are slightly different -- here, we're only looking at DNA.  The bases for DNA, and their single-letter abbreviations, are:

A adenine
C cytosine
G guanine
T thymine


codon

A codon is a sequence of three bases.  Each codon that is between a start and a stop codon in a gene sequence represents an amino acid in the protein that the gene encodes.

deoxyribonucleic acid (DNA)

See base.

exon

A region of a gene that gets translated into amino acids.

gene


A sequence of DNA bases that represents a protein.

intron

A region of a gene that does not get translated into amino acids.

For more information

In case you're interested, here are a few Web sites with information on genetics and protein coding.

Primer on Molecular Genetics from the U.S. Department of Energy
Our gene sequence files will be very much like those used to submit sequences to the National Center for Biotechnology Information (NCBI), to be included in GenBank.  Here is the NCBI's description of gene data files.

What to Turn In

Use this online turnin form to turn in the Java source files that make up your project, including any JUnit tests. If your project uses any other files, such as images, turn those in also. If you have many files, including things like images, you can bundle them into an archive file (zip, jar, or tar) and turn that in. Multiple turnins are fine and highly recommended if you are planning to add extra credit features. Once you've got something that meets the basic requirements, turn that in. Then if you add to your project, turn in the extended version(s) later - we'll grade the last one you turn in.