| Course Info |
CSE 590CB is a weekly seminar on Readings and Research in
Computational Biology, open to all graduate students in computational,
biological, and mathematical sciences.
| Schedule |
| Papers, etc. |
Links to full papers below are often to journals that require a
paid subscription. The UW Library is generally a paid
subscriber, and you can freely access these articles if you do
so from an on-campus computer. For off-campus access,
look at the
library "proxy server" instructions.
| Abstract: I will introduce some of our current research on the
global organization of the protein sequence space. The
main outcome of this classification effort is the ProtoNet
system. Currently, ProtoNet includes over one million sequences
that are organized automatically into thousands of protein
families. We will present some tests to assess the quality of
the system. Different applications will be introduced in brief:
(i) A method that reduces the number of families to only
a few thousand high-quality functional groups. (ii) Target
selection for structural genomics initiatives. Structural
genomics aims at narrowing the gap between the known protein
sequences and their 3D structures. We will present a model for
ranking proteins according to their probability of belonging to a
new superfamily. (iii) Illustrations of cases in which
functional relationships between very remote families may be
proposed. We will end by presenting a concept in
which the level of consistency between protein annotations is
used to trace annotation mistakes.
Suggested reading:
|
04/11: ifinder: high-quality intron identification
by mapping ESTs and cDNAs to their genomic source locus. -- Max Robinson, UW GS
| Abstract: With the availability of a large and rapidly increasing set of
sequenced eukaryotic genomes, automating the annotation of genes has
become an important task. Experimental evidence of the exon-intron
structure of a gene is available in the form of sequenced mRNA
molecules, either as single sequence reads (expressed sequence tags,
or ESTs) or as assembled complementary DNA (cDNA) clone sequences.
Two programs, sim4 and BLAT, have been the primary tools used to map
these sequences to the genome. These tools perform fast, heuristic spliced
alignment of the query EST or cDNA to the genome. Very recently, a
new tool (GMAP; see reference) was reported to provide more accuracy
using a "sandwich dynamic programming" algorithm to determine optimal
exon-intron boundaries.
I will present a new tool, ifinder, for mapping ESTs and cDNAs to their source genome. Unlike previous approaches, which are focused on detecting and delimiting exons, ifinder was developed to detect and delimit introns unambiguously supported by the expressed sequences. ifinder separates detection of exonic sequence matches, performed by cross_match, from evaluation of exon-intron structure, performed by ifinder. In addition to using dynamic programming to optimize exon-intron boundaries, ifinder uses a splicing graph to evaluate potential gene structures. ifinder determines how many optimal exon-intron structures exist for each EST or cDNA query, allowing discrimination of introns with unambiguous versus ambiguous supporting evidence.
I will present an evaluation of ifinder's performance on the C. elegans genome, comparing its output to the genomic annotation. I will also describe the current effort to improve ifinder, adding support for the minor class of U12-dependent introns, for detailed models of splicing signals to improve exon-intron boundary determination, and for identification of "micro-exons", exons too short to be identified in the initial genome-wide scan for matches to the query sequence.
Reference: T.D. Wu and C.K. Watanabe, "GMAP: a genomic mapping and alignment program for mRNA and EST sequences", Bioinformatics (2005 Feb 22; epub ahead of print). PMID: 15728110. |
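The idea of counting how many optimal exon-intron structures exist for a query can be illustrated with a small dynamic program over a directed acyclic graph. This is only a sketch under stated assumptions and not ifinder's actual algorithm: the graph, the node names, the integer scores, and the function `count_optimal_paths` are all hypothetical. Each node stands for a candidate exonic match, each edge for a compatible intron between two matches, and a path from a source to a sink for one candidate gene structure.

```python
from collections import defaultdict


def count_optimal_paths(nodes, edges, score):
    """Return (best_total, n_best) over all source-to-sink paths.

    nodes -- node names, assumed to be in topological order
    edges -- dict mapping a node to a list of its successors
    score -- dict mapping a node to an integer score (hypothetical)
    """
    # Invert the edge lists so each node knows its predecessors.
    preds = defaultdict(list)
    for u, successors in edges.items():
        for v in successors:
            preds[v].append(u)

    best = {}  # best[v]: highest score of any path ending at v
    ways = {}  # ways[v]: number of distinct paths reaching best[v]
    for v in nodes:
        if not preds[v]:
            best[v], ways[v] = score[v], 1
            continue
        top = max(best[u] for u in preds[v])
        best[v] = top + score[v]
        # Integer scores keep this equality test exact.
        ways[v] = sum(ways[u] for u in preds[v] if best[u] == top)

    sinks = [v for v in nodes if not edges.get(v)]
    total = max(best[v] for v in sinks)
    return total, sum(ways[v] for v in sinks if best[v] == total)


# Toy example: two equally good middle exons make the structure ambiguous.
nodes = ["A", "B1", "B2", "C"]
edges = {"A": ["B1", "B2"], "B1": ["C"], "B2": ["C"]}
tied = count_optimal_paths(nodes, edges, {"A": 10, "B1": 5, "B2": 5, "C": 8})
unique = count_optimal_paths(nodes, edges, {"A": 10, "B1": 5, "B2": 4, "C": 8})
```

In this toy setup, a count of one would correspond to unambiguous supporting evidence, and a count above one to ambiguous evidence.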
04/18: Automatic Identification and Classification
of Protein Domains (or aiming for an automatic Pfam) -- Elon Portugaly, Hebrew U
| Abstract: A very large number of protein sequences are already known; however, our knowledge about higher-level properties of proteins, such as their structure and function, is scarce. A large-scale classification of all proteins into families can help bridge this gap by facilitating homology modelling - e.g. the inference of a protein's function from the functions of other proteins of the same family - and by identifying unknown families of proteins as targets for future research. Proteins are typically composed of several domains - (semi) autonomous functional subunits that are shuffled in a mix-and-match evolutionary process, generating new proteins. I will present EVEREST, a process we have developed for the identification and classification of protein domains in a comprehensive database of protein sequences. EVEREST combines methodologies of sequence similarity identification, graph-based clustering, machine learning, statistical modeling and iterative refinement. We achieve state-of-the-art results, recovering 63% of the known domain families and suggesting new families with about 40% fidelity. This is joint work with Michal Linial and Nati Linial. |
04/25: Properties of short sequence signals on a
genomewide level -- Emily Rocke, UW GS
| Abstract: In this talk, I will present some previously uncharacterized patterns in the position of different short (~7 nucleotide) exact sequences relative to nearby genes, using whole genomes of invertebrate animals as a model. I will discuss both the computational techniques we are using to explore these patterns, and the potential biological implications for whole-genome regulatory motif discovery, gene categorization, and computational prediction of chromatin arrangement. |
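As a rough illustration of what measuring the position of a short exact sequence relative to nearby genes can look like, the sketch below records the signed offset from each occurrence of a k-mer to the nearest gene start. It is not the speaker's method. The function name `kmer_offsets`, the toy sequence, and the gene coordinates are hypothetical, and the sketch simplifies by scanning a single strand and ignoring gene orientation.

```python
from bisect import bisect_left


def kmer_offsets(genome, kmer, gene_starts):
    """Signed offsets (occurrence position minus nearest gene start)
    for every exact, possibly overlapping, occurrence of kmer."""
    starts = sorted(gene_starts)
    if not starts:
        raise ValueError("at least one gene start is required")

    offsets = []
    pos = genome.find(kmer)
    while pos != -1:
        # Only the gene starts immediately left and right of pos can be nearest.
        i = bisect_left(starts, pos)
        candidates = starts[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda s: abs(pos - s))
        offsets.append(pos - nearest)
        pos = genome.find(kmer, pos + 1)
    return offsets


# Toy example with a 4-mer; the talk concerns sequences of about 7 nucleotides.
toy_offsets = kmer_offsets("ACGTACGTTTACGT", "ACGT", [6])
```

A histogram of such offsets over a whole genome is one simple way to look for positional patterns; negative values lie before the gene start on the scanned strand and positive values after it.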
05/02: Informatics of the Allen Brain Atlas -- Mike Hawrylycz and Sayan Pathak, Allen Institute for Brain Science
| Abstract: The Allen Brain Atlas (ABA) is an in-situ gene hybridization
(ISH) map of the approximately 20,000 gene transcripts of the adult
mouse brain. The main goal of our bioinformatics and image
processing groups is the design and implementation of quantitative
and analytic components of the atlas data pipeline including:
In this talk we will survey the methods and challenges of constructing the ABA and some preliminary results on quantitative measurement of expression signal by anatomic region. Two of the main distinguishing features of this project are its scope and high-throughput nature. These pose both limitations and interesting challenges on the types of methods that can be applied. References |
05/09: An Integrated Framework for Motif Discovery -- Geoff Macintyre, La Trobe U
| Abstract: The continuous increase in obtainable biological data types has
enabled improvement of the biological relevance of existing
methods in motif detection. Component based software design
provides a platform for the extension and integration of
existing regulatory motif detection methods.
I will present a software and probabilistic framework that is based on the work of Segal et al. The program is designed to take multiple data types as input to enhance the process of regulatory motif detection. In its current form, it uses an EM probabilistic framework to integrate microarray expression data and promoter sequences from multiple yeast species to determine regulatory modules. These modules associate specific predicted motifs with certain co-regulated genes. The component-based design allows integration of existing implementations of clustering and motif detection programs. |
05/16: Probabilistic Segmentation and Intensity
Estimation for Microarray Images -- Raphael Gottardo, UW Stat
| Abstract: Algorithms for image segmentation and intensity estimation are crucial to the successful analysis of cDNA microarray experiments, but most procedures allow little or no flexibility. The methods described here are probabilistic and use Bayesian hierarchical modelling. Our segmentation of images into foreground objects (spots) and background is more realistic because it allows much wider variation in spots, including those with doughnut shapes, which are observed quite frequently in practice. We are also able to penalise the occurrence of small artefacts. Our intensity estimation is robust and avoids the common logical error in which estimates of foreground are less than those of the corresponding background. Markov chain Monte Carlo provides an integrated approach to segmentation and estimation by sampling from the full posterior distribution of all the parameters, and this enables us to focus on any quantity of interest. We illustrate the potential of this modelling approach on two publicly available datasets and show that it leads to improved segmentation in comparison to other approaches. In addition to point estimates, we compute measures of uncertainty, and these can be useful in flagging low-quality spots. |
05/23: Statistics of local multiple alignments -- Amol Prakash, UW CSE
| Abstract: BLAST [Karlin and Altschul, 1990] statistics have been shown to be extremely useful for searching for significant similarity hits, for amino acid and nucleotide sequences. While these statistics are well understood for pairwise comparisons, there has been little success developing statistical scores for multiple alignments. In particular, there is no score for multiple alignment that is well founded and treated as a standard. We extend the BLAST theory to multiple alignments. Following some simple assumptions, we present and justify a significance score for multiple segments of a local multiple alignment. We demonstrate its usefulness in distinguishing high and moderate quality multiple alignments from low quality ones, with supporting experiments on orthologous vertebrate promoter sequences. |
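For background, the pairwise statistics that the talk extends can be sketched in a few lines. In the Karlin-Altschul theory for ungapped local alignments, the scale parameter lambda is the unique positive solution of sum_k p_k * exp(lambda * s_k) = 1, where p_k is the probability of seeing score s_k at a random aligned position, and the expected number of chance hits with score at least S is E = K * m * n * exp(-lambda * S). The code below covers only this pairwise case, not the multiple-alignment score from the talk. The function names are hypothetical, and K is treated as a given input because computing it is more involved.

```python
import math


def karlin_altschul_lambda(probs, scores, tol=1e-12):
    """Solve sum(p * exp(lam * s)) = 1 for the positive root by bisection.

    Requires a negative expected score and at least one positive score,
    the standard conditions for the positive root to exist.
    """
    expected = sum(p * s for p, s in zip(probs, scores))
    if expected >= 0 or max(scores) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    # f is negative just above 0 and positive beyond the root: bracket it.
    lo, hi = 0.0, 1.0
    while f(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_hits(score, m, n, lam, K):
    """E-value: expected number of chance hits scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


# Uniform nucleotides with match = +1, mismatch = -1:
# P(match) = 1/4, P(mismatch) = 3/4, which gives lambda = ln 3.
lam = karlin_altschul_lambda([0.25, 0.75], [1, -1])
```

Here `m` and `n` are the lengths of the two sequences being compared; the value of `K` passed to `expected_hits` is an assumption that the caller must supply.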
CSE's Computational Molecular Biology research group
Interdisciplinary Ph.D. program in Computational Molecular Biology