University of Washington Computer Science & Engineering
 CSE 590 CB, Sp '05: Reading & Research in Comp. Bio.

 Course Info
CSE 590CB is a weekly seminar on Readings and Research in Computational Biology, open to all graduate students in computational, biological, and mathematical sciences.
When/Where:  Mondays, 3:30 - 4:50, EE1 025
Organizers:  Joe Felsenstein, Michal Linial, Bill Noble, Emily Rocke, Larry Ruzzo, Martin Tompa
Credit: 1-3 Variable
Grading: Credit/No Credit. Talk to the organizers if you are unsure of our expectations.
 Email
cse590cb at cs.washington.edu Course-related announcements and discussions
compbio-seminars at cs.washington.edu Biology seminar announcements from all around campus
compbio-group at cs.washington.edu Discussions about computational biology
 Schedule
Date | Presenters/Participants | Topic | Papers
03/28 | | Organizational Meeting |
04/04 | Michal Linial, UW & Department of Biological Chemistry, Institute of Life Sciences, Hebrew University, Jerusalem, Israel | Moving from Sequence to Structure and Function: Structural and Functional information from a million protein sequences | Abstract
04/11 | Max Robinson, UW GS | ifinder: high-quality intron identification by mapping ESTs and cDNAs to their genomic source locus. | Abstract
04/18 | Elon Portugaly, Hebrew U | Automatic Identification and Classification of Protein Domains (or aiming for an automatic Pfam) | Abstract
04/25 | Emily Rocke, UW GS | Properties of short sequence signals on a genomewide level | Abstract
05/02 | Mike Hawrylycz and Sayan Pathak, Allen Institute for Brain Science | Informatics of the Allen Brain Atlas | Abstract
05/09 | Goeff Macintyre, Latrobe U | An Integrated Framework for Motif Discovery | Abstract
05/16 | Raphael Gottardo, UW Stat | Probabilistic Segmentation and Intensity Estimation for Microarray Images | Abstract
05/23 | Amol Prakash, UW CSE | Statistics of local multiple alignments | Abstract
05/30 | | Holiday |

 Papers, etc.

  Note on Electronic Access to Journals

Links to full papers below are often to journals that require a paid subscription. The UW Library is generally a paid subscriber, and you can freely access these articles if you do so from an on-campus computer. For off-campus access, look at the library "proxy server" instructions.

During Autumn 2004, the Library deployed a new proxy server; if you are accessing this page from an off-campus computer, the "[offcampus]" links below may be more convenient. You will be prompted for your UW net ID and password once per session.


04/04: Moving from Sequence to Structure and Function: Structural and Functional information from a million protein sequences -- Michal Linial, UW & Department of Biological Chemistry, Institute of Life Sciences, Hebrew University, Jerusalem, Israel

   Abstract:   I will introduce some of our current research on the global organization of the protein sequence space. The main outcome of this classification effort is the ProtoNet system. Currently, ProtoNet includes over one million sequences that are organized automatically into thousands of protein families. We will present some tests to assess the quality of the system. Different applications will be introduced in brief: (i) A method that allows reducing the number of families to only a few thousand high-quality functional groups. (ii) Target selection for structural genomics initiatives. Structural genomics aims at narrowing the gap between the known protein sequences and their 3D structures. We will present a mode for ranking proteins according to their probability of belonging to a new superfamily. (iii) Illustrations of some cases in which functional relationships between very remote families may be proposed. We will end by presenting a concept according to which the level of consistency between protein annotations is used to trace annotation mistakes.

Suggested reading:

  • Kaplan, N., Sasson, O., Inbar, U., Friedlich, M., Fromer, M., Fleischer, H., Portugaly, E., Linial, N., Linial, M. (2005) ProtoNet 4.0: A hierarchical classification of one million protein sequences. Nucleic Acids Research 33, D216-D218.
  • Kaplan, N., Friedlich, M., Fromer, M., Linial, M. (2004) A functional hierarchical organization of the protein sequence space. BMC Bioinformatics 5, 196.
  • Kifer, I., Sasson, O., Linial, M. (2004) Predicting fold novelty based on ProtoNet hierarchical classification. Bioinformatics 21, 1020-1027.
  • Shachar, O., Linial, M. (2004) A robust method to detect structural and functional remote homologues. Proteins: Structure, Function, and Bioinformatics 57, 532-538.

04/11: ifinder: high-quality intron identification by mapping ESTs and cDNAs to their genomic source locus. -- Max Robinson, UW GS

   Abstract:   With the availability of a large and rapidly increasing set of sequenced eukaryotic genomes, automating the annotation of genes has become an important task. Experimental evidence of the exon-intron structure of a gene is available in the form of sequenced mRNA molecules, either as single sequence reads (expressed sequence tags, or ESTs) or as assembled complementary DNA (cDNA) clone sequences. Two programs, sim4 and BLAT, have been the primary tools used to perform this task. These tools perform fast, heuristic spliced alignment of the query EST or cDNA to the genome. Very recently, a new tool (GMAP; see reference) was reported to provide more accuracy using a "sandwich dynamic programming" algorithm to determine optimal exon-intron boundaries.

I will present a new tool, ifinder, for mapping ESTs and cDNAs to their source genome. Unlike previous approaches, which are focused on detecting and delimiting exons, ifinder was developed to detect and delimit introns unambiguously supported by the expressed sequences. ifinder separates detection of exonic sequence matches, performed by cross_match, from evaluation of exon-intron structure, performed by ifinder. In addition to using dynamic programming to optimize exon-intron boundaries, ifinder uses a splicing graph to evaluate potential gene structures. ifinder determines how many optimal exon-intron structures exist for each EST or cDNA query, allowing discrimination of introns with unambiguous versus ambiguous supporting evidence. I will present an evaluation of ifinder's performance on the C. elegans genome, comparing its output to the genomic annotation. I will also describe the current effort to improve ifinder, adding support for the minor class of U12-dependent introns, for detailed models of splicing signals to improve exon-intron boundary determination, and for identification of "micro-exons", exons too short to be identified in the initial genome-wide scan for matches to the query sequence.

Reference: T.D. Wu and C.K. Watanabe, "GMAP: a genomic mapping and alignment program for mRNA and EST sequences", Bioinformatics (2005 Feb 22; epub ahead of print). PMID: 15728110.
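
As a small illustration of the exon-intron boundary question discussed in this abstract, the sketch below reads the two dinucleotides at the ends of a proposed intron and checks them against a few commonly observed pairs (GT-AG, GC-AG, AT-AC). This is not ifinder's method; the function names, the coordinate convention (0-based, end-exclusive, forward strand) and the toy sequence are assumptions made for the example, and terminal dinucleotides alone do not determine whether an intron is U2- or U12-dependent.

```python
# Hypothetical helper names; a sketch only, not part of ifinder, sim4, BLAT or GMAP.

# Terminal dinucleotide pairs that are commonly observed at intron ends.
COMMON_TERMINI = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_termini(genome, intron_start, intron_end):
    """Return (donor, acceptor) dinucleotides of genome[intron_start:intron_end].

    Coordinates are 0-based and end-exclusive, on the forward strand.
    """
    intron = genome[intron_start:intron_end]
    if len(intron) < 4:
        raise ValueError("intron too short to have two dinucleotide termini")
    return intron[:2].upper(), intron[-2:].upper()


def has_common_termini(genome, intron_start, intron_end):
    """True if the proposed intron starts and ends with a commonly observed pair."""
    return intron_termini(genome, intron_start, intron_end) in COMMON_TERMINI


# Toy sequence: exon "ATGGCC", intron "GTAAGTTTTCAG", exon "GATTAA".
toy_genome = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
good = has_common_termini(toy_genome, 6, 18)     # boundaries placed on the intron
shifted = has_common_termini(toy_genome, 7, 19)  # same length, shifted by one base
```

Shifting both boundaries by one base leaves the cDNA match length unchanged but loses the GT...AG ends, which is the kind of ambiguity that a boundary-refinement step has to resolve.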

04/18: Automatic Identification and Classification of Protein Domains (or aiming for an automatic Pfam) -- Elon Portugaly, Hebrew U

   Abstract:   A very large number of protein sequences are already known; however, our knowledge of higher-level properties of proteins, such as their structure and function, is scarce. A large-scale classification of all proteins into families can help bridge this gap by facilitating homology modelling (e.g. the inference of a protein's function from the functions of other proteins of the same family) and by identifying unknown families of proteins as targets for future research. Proteins are typically composed of several domains: (semi-)autonomous functional subunits that are shuffled in a mix-and-match evolutionary process, generating new proteins. I will present EVEREST, a process we have developed for the identification and classification of protein domains in a comprehensive database of protein sequences. EVEREST combines methodologies of sequence similarity identification, graph-based clustering, machine learning, statistical modeling and iterative refinement. We achieve state-of-the-art results, recovering 63% of the known domain families and suggesting new families with about 40% fidelity. This is joint work with Michal Linial and Nati Linial.

04/25: Properties of short sequence signals on a genomewide level -- Emily Rocke, UW GS

   Abstract:   In this talk, I will present some previously uncharacterized patterns in the position of different short (~7 nucleotide) exact sequences relative to nearby genes, using whole genomes of invertebrate animals as a model. I will discuss both the computational techniques we are using to explore these patterns, and the potential biological implications for whole-genome regulatory motif discovery, gene categorization, and computational prediction of chromatin arrangement.
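
One simple way to tabulate where an exact short sequence falls relative to nearby genes is sketched below. It is only an illustration of the kind of positional profile the abstract describes, not the method used in the talk; the function name, the toy sequence, the gene start coordinates and the window size are all hypothetical, and only the forward strand is scanned.

```python
from collections import Counter


def kmer_offsets(genome, kmer, gene_starts, window=10):
    """Count occurrences of `kmer` by offset from each gene start.

    offset = (k-mer start position) - (gene start position), so negative
    offsets are upstream of the gene. Only offsets within +/- `window`
    are counted. Positions are 0-based, forward strand only.
    """
    counts = Counter()
    pos = genome.find(kmer)
    while pos != -1:
        for start in gene_starts:
            offset = pos - start
            if -window <= offset <= window:
                counts[offset] += 1
        # Advance by one base so overlapping occurrences are also found.
        pos = genome.find(kmer, pos + 1)
    return counts


# Toy data: two hypothetical genes starting at the "ATG" at positions 13 and 28.
toy_genome = "ACGTTATAAAGGCATGCCCTATAAAGGCATG"
profile = kmer_offsets(toy_genome, "TATAAAG", gene_starts=[13, 28], window=10)
```

On real data the same counts would be accumulated over every annotated gene in a genome, and the resulting offset histogram compared between different 7-mers.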

05/02: Informatics of the Allen Brain Atlas -- Mike Hawrylycz and Sayan Pathak, Allen Institute for Brain Science

   Abstract:   The Allen Brain Atlas (ABA) is an in-situ gene hybridization (ISH) map of the approximately 20,000 gene transcripts of the adult mouse brain. The main goal of our bioinformatics and image processing groups is the design and implementation of quantitative and analytic components of the atlas data pipeline including:

  • Construction of informatics tools to manipulate an anatomically correct mouse brain reference atlas and its annotation,
  • Registration of image data to the reference atlas and validation of the registration process, and
  • Quantification and measurement of expression patterns on gene expression data, and strategies for how to approach mining this database.

In this talk we will survey the methods and challenges of constructing the ABA and some preliminary results on quantitative measurement of expression signal by anatomic region. Two of the main distinguishing features of this project are its scope and high-throughput nature. These pose both limitations and interesting challenges on the types of methods that can be applied.

References

  • The main portal http://www.brain-map.org/ to the database contains several documents reviewing the ISH process and its informatics
  • Hantke, Sabine, The DIG system - Non-radioactive automated high-throughput in situ hybridization: a powerful tool for functional genomics research, Biochemica, No.1, 2004
  • Bankman, Isaac, Handbook of Medical Imaging, Academic Press, San Diego, 2000

05/09: An Integrated Framework for Motif Discovery -- Goeff Macintyre, Latrobe U

   Abstract:   The continuous increase in obtainable biological data types has enabled improvement of the biological relevance of existing methods in motif detection. Component based software design provides a platform for the extension and integration of existing regulatory motif detection methods.

I will present a software and probabilistic framework that is based on the work of Segal et al. The program is designed to take multiple data types as input to enhance the process of regulatory motif detection. In its current form, it uses an EM probabilistic framework to integrate microarray expression data and promoter sequences from multiple yeast species to determine regulatory modules. These modules associate specific predicted motifs with certain co-regulated genes. The component-based design allows integration of existing implemented clustering and motif detection programs.

  • Segal E, Yelensky R, Koller D., Genome-wide discovery of transcriptional modules from DNA sequence and gene expression, Bioinformatics, 19(Suppl 1), i273-82 (2003)

05/16: Probabilistic Segmentation and Intensity Estimation for Microarray Images -- Raphael Gottardo, UW Stat

   Abstract:   Algorithms for image segmentation and intensity estimation are crucial to the successful analysis of cDNA microarray experiments, but most procedures allow little or no flexibility. The methods described here are probabilistic and use Bayesian hierarchical modelling. Segmentation of images into foreground objects (spots) and background is more realistic in allowing a much wider variation in spots, including those with doughnut shapes, which are observed quite frequently in practice. We are also able to penalise the occurrence of small artefacts. Our intensity estimation is robust and avoids the common logical error that estimates of foreground may be less than those of corresponding background. Markov chain Monte Carlo provides an integrated approach to segmentation and estimation in sampling from the full posterior distribution of all the parameters, and this enables us to focus on any quantity of interest. We illustrate the potential of this modelling approach on two publicly available datasets and show that it leads to improved segmentation in comparison to other approaches. In addition to point estimates, we compute measures of uncertainty, and these can be useful in suggesting low quality spots.

  • Raphael Gottardo, Julian Besag, Alejandro Murua and Matthew Stephens, Probabilistic Segmentation and Intensity Estimation for Microarray Images, UW STAT TR 471 [offcampus]

05/23: Statistics of local multiple alignments -- Amol Prakash, UW CSE

   Abstract:   BLAST [Karlin and Altschul, 1990] statistics have been shown to be extremely useful for searching for significant similarity hits, for amino acid and nucleotide sequences. While these statistics are well understood for pairwise comparisons, there has been little success developing statistical scores for multiple alignments. In particular, there is no score for multiple alignment that is well founded and treated as a standard. We extend the BLAST theory to multiple alignments. Following some simple assumptions, we present and justify a significance score for multiple segments of a local multiple alignment. We demonstrate its usefulness in distinguishing high and moderate quality multiple alignments from low quality ones, with supporting experiments on orthologous vertebrate promoter sequences.
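
For reference, the pairwise Karlin-Altschul statistics that this talk extends can be sketched in a few lines. For an ungapped local alignment with raw score S between sequences of lengths m and n, the expected number of chance hits is E = K m n e^(-lambda S), the bit score is S' = (lambda S - ln K) / ln 2, and the probability of at least one chance hit is 1 - e^(-E). The function names and the example parameter values below are assumptions for illustration only; they are not taken from the talk, and the multiple-alignment score presented in the talk is not reproduced here.

```python
import math


def expected_hits(raw_score, m, n, lam, k):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e_value):
    """Probability of at least one hit scoring this well by chance: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Illustrative parameters only (assumed values, roughly those quoted for
# ungapped protein comparison); real values depend on the scoring system.
LAMBDA, K = 0.318, 0.134
e = expected_hits(raw_score=60, m=250, n=1_000_000, lam=LAMBDA, k=K)
p = p_value(e)
```

Using `math.expm1` keeps the P-value accurate when E is very small, where computing `1 - math.exp(-E)` directly would lose precision.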


 Other Seminars
Past quarters of CSE 590CB
COMBI & Genome Sciences Seminars
Applied Math Department Mathematical Biology Journal Club
Biostatistics Seminars
Microbiology Department Seminars
Zoology 525, Mathematical Biology Seminar Series

 Resources
Molecular Biology for Computer Scientists, a primer by Lawrence Hunter (46 pages)
A Quick Introduction to Elements of Biology, a primer by Alvis Brazma et al.
S-Star Bioinformatics Online Course Schedule, a collection of video primers
A very comprehensive FAQ at bioinformatics.org, including annotated references to online tutorials and lectures.
CSE 527: Computational Biology
CSE 590TV: Computational Biology (Professional Masters Program)
Genome 540/541: Introduction to Computational Molecular Biology: Genome and Protein Sequence Analysis

CSE's Computational Molecular Biology research group
Interdisciplinary Ph.D. program in Computational Molecular Biology


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to cse590cb-webmaster]