Lecture n

Tuesday, December 4

Lecturer: Larry Ruzzo

Notes: Tobias Mann

Gene Prediction

Motivation:

There is a large amount of sequence data available, and there is a need for automated sequence annotation. Gene prediction algorithms seek to determine which parts of a DNA sequence will be expressed as a protein.

Biological Background:

DNA is transcribed into mRNA, and the mRNA is then translated into proteins. Each amino acid in a protein is coded as a triplet of DNA bases (b_i b_{i+1} b_{i+2}), where each b_j denotes an ‘A’, ‘T’, ‘G’, or ‘C’. These triplets are called codons. There are also codons that indicate that translation should start or stop. There are 64 distinct codons: 61 of them map to the 20 different amino acids, and the remaining 3 are stop codons.

Reading Frame:

A DNA sequence S = [b₁ b₂ … b_n] has three possible translations into a protein sequence, each of which is called a reading frame. The first translation starts at b₁, and is the amino acid sequence determined by: [(b₁ b₂ b₃) (b₄ b₅ b₆) …]. The second translation starts at b₂, and is the amino acid sequence determined by: [(b₂ b₃ b₄) (b₅ b₆ b₇) …]. The third translation starts at b₃, and is the amino acid sequence determined by: [(b₃ b₄ b₅) (b₆ b₇ b₈) …]. An open reading frame (ORF) is a stretch of a reading frame that contains no stop codons.
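The three reading frames can be sketched in a few lines of Python. This is only an illustration: the helper names (codons, translate, longest_stop_free_run) are made up for this example, frames are numbered 0, 1, 2 rather than starting at b₁, b₂, b₃, and the table is the standard genetic code written with DNA letters.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codons(seq, frame):
    """Yield successive complete codons of seq in reading frame 0, 1, or 2."""
    for i in range(frame, len(seq) - 2, 3):
        yield seq[i:i + 3]

def translate(seq, frame):
    """Translate one reading frame into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in codons(seq, frame))

def longest_stop_free_run(seq, frame):
    """Length, in codons, of the longest stretch of the frame with no stop codon."""
    best = current = 0
    for c in codons(seq, frame):
        current = 0 if CODON_TABLE[c] == "*" else current + 1
        best = max(best, current)
    return best

seq = "ATGGCCTAAGG"
for frame in range(3):
    print(frame, translate(seq, frame), longest_stop_free_run(seq, frame))
```

Trailing bases that do not fill a complete codon are simply dropped, which is one of several reasonable conventions.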

Statistics of Sequences:

Random DNA should have a stop codon about every 21 codons (roughly 64 bases) on average, since 3 of the 64 codons are stop codons, but the coding sequence of an average protein is about 1K bases. This suggests that long stretches of DNA with no stop codons are likely to be translated into proteins.
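The stop-codon arithmetic can be checked directly. The sketch below assumes the simplest model of random DNA (all four bases equally likely and independent, so each codon is a stop with probability 3/64); the function name prob_stop_free is made up for this example.

```python
STOP_CODONS = 3
TOTAL_CODONS = 64

# Probability that a uniformly random codon is a stop codon.
p_stop = STOP_CODONS / TOTAL_CODONS

# Mean spacing between stops (mean of a geometric distribution), in codons and bases.
expected_codons_between_stops = 1 / p_stop
expected_bases_between_stops = 3 * expected_codons_between_stops

def prob_stop_free(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - p_stop) ** n_codons

print(round(expected_codons_between_stops, 1), round(expected_bases_between_stops))
print(prob_stop_free(100))
```

Under this model a stop-free run of 100 codons already has probability below 1 percent, which is why long ORFs stand out against random sequence.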

Also, in random DNA, the ratio of Leucine:Alanine:Tryptophan should be about 6:4:1, because these amino acids are coded by 6, 4, and 1 codons respectively, whereas the ratios in proteins made by organisms can differ significantly from this. In addition, some species are biased in their use of codons, and use some synonymous codons for a given amino acid more often than others.
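The 6:4:1 ratio follows from counting how many codons code for each amino acid in the standard genetic code. A small sketch, assuming uniformly random bases so that every codon is equally likely:

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of synonymous codons per amino acid (one-letter codes).
codons_per_amino_acid = Counter(CODON_TABLE.values())

# L = Leucine, A = Alanine, W = Tryptophan.
print(codons_per_amino_acid["L"], codons_per_amino_acid["A"], codons_per_amino_acid["W"])
```

Comparing these expected proportions with the amino acid composition of a candidate translation is one way to see how far a sequence departs from random DNA.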

An Algorithm For Predicting Reading Frames

Assume that codons are i.i.d., and that codon (b_i b_j b_k) has frequency f(b_i b_j b_k). Then, a scheme to predict reading frames is to compute, for each offset i = 0, 1, 2,

P_i = f(b_{i+1} b_{i+2} b_{i+3}) · f(b_{i+4} b_{i+5} b_{i+6}) · f(b_{i+7} b_{i+8} b_{i+9}) · f(b_{i+10} b_{i+11} b_{i+12}) · …

where P_i can be interpreted as the probability of observing the sequence of codons read at offset i.

The most probable reading frame is the one with the largest P_i, that is, the one whose codons are closest to the distribution of codons in the organism whose DNA is being analyzed.

This works pretty well in prokaryotes, where most DNA is coding for proteins, and the ORFs are relatively long. However, note that not every ORF is expressed, and the situation is considerably more complicated for eukaryotes.
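The scoring scheme can be sketched as follows. Several choices here are assumptions made for the example rather than part of the lecture: scores are summed as logarithms to avoid numerical underflow on long sequences, codon frequencies are estimated from known coding sequences with a pseudocount so that no frequency is zero, all three frames are scored over the same number of codons so the products are comparable, and the training and query sequences are made-up toy data.

```python
import math
from collections import Counter
from itertools import islice, product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codons(seq, frame):
    """Yield successive complete codons of seq at offset frame (0, 1, or 2)."""
    for i in range(frame, len(seq) - 2, 3):
        yield seq[i:i + 3]

def estimate_codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate f(codon) from in-frame coding sequences, smoothed by a pseudocount."""
    counts = Counter()
    for s in coding_seqs:
        counts.update(codons(s, 0))
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def log_frame_score(seq, frame, freq, n_codons):
    """log P_i: sum of log codon frequencies over the first n_codons codons."""
    return sum(math.log(freq[c]) for c in islice(codons(seq, frame), n_codons))

def best_frame(seq, freq):
    """Return the offset i in {0, 1, 2} with the largest log P_i."""
    n_codons = max(0, (len(seq) - 2) // 3)  # codons available in every frame
    return max(range(3), key=lambda i: log_frame_score(seq, i, freq, n_codons))

# Toy data (hypothetical): an "organism" that mostly uses GCT and AAA.
freq = estimate_codon_frequencies(["GCTAAAGCTGCTAAA"])
print(best_frame("CGCTAAAGCTAAAG", freq))  # offset 1 reads GCT AAA GCT AAA
```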

Promoters:

Another way to identify genes is to find promoters, which are sequences upstream of genes that can cause the genes to be expressed. A classic example is the consensus sequence ‘TATAAT’, which is about 10 bp upstream of transcription start sites in E. coli. The ‘TATAAT’ sequence is not perfectly conserved, and there are other promoter sequences as well.

Weight matrices, which express the conditional probability of a base occurring as a function of position in the sequence, can be used to identify ‘TATAAT’ sequences. The similarity scores that weight matrices assign relative to the ‘TATAAT’ sequence have some correlation with the binding energy of RNA polymerase to that sequence. One reason for variation in the ‘TATAAT’ sequence may be that varying the binding affinity to RNA polymerase allows the level of expression of a gene to vary over several orders of magnitude: sequences closer to ‘TATAAT’ bind more readily to RNA polymerase and are thus expressed more often.
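A weight matrix scan can be sketched as below. The details are assumptions chosen for illustration: the matrix stores log-odds scores against a uniform background (one common variant of the per-position probabilities described above), a pseudocount smooths the counts, and the example sites are made-up variants of ‘TATAAT’, not real E. coli promoters.

```python
import math

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Per-position log-odds scores (base -> score) from aligned example sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix

def score(window, matrix):
    """Sum the per-position scores of a window the same length as the matrix."""
    return sum(col[b] for col, b in zip(matrix, window))

def best_hit(seq, matrix):
    """Start index of the highest-scoring window in seq."""
    w = len(matrix)
    return max(range(len(seq) - w + 1), key=lambda i: score(seq[i:i + w], matrix))

# Hypothetical aligned sites resembling the TATAAT consensus.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
matrix = build_weight_matrix(sites)
print(score("TATAAT", matrix), score("TATGAT", matrix), score("GCGCGC", matrix))
print(best_hit("GGCGCTATAATGCGC", matrix))
```

Windows closer to the consensus receive higher scores, so the score gives a graded measure of similarity rather than a yes/no match.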