Sequence-based heuristics for faster annotation of non-coding RNA families. | Paul G. Allen School of Computer Science & Engineering

Title	Sequence-based heuristics for faster annotation of non-coding RNA families.
Publication Type	Journal Article
Year of Publication	2006
Authors	Weinberg Z, Ruzzo WL
Journal	Bioinformatics (Oxford, England)
Volume	22
Issue	1
Pagination	35-39
Date or Month Published	2006 Jan 1
ISSN	1367-4803
Keywords	Algorithms; Computational Biology; Genome; Humans; Markov Chains; Models, Statistical; Nucleic Acid Conformation; Proteins; Protein Structure, Secondary; RNA; RNA, Transfer; RNA, Untranslated; ROC Curve; Sensitivity and Specificity; Sequence Alignment; Software
Abstract	MOTIVATION: Non-coding RNAs (ncRNAs) are functional RNA molecules that do not code for proteins. Covariance Models (CMs) are a useful statistical tool to find new members of an ncRNA gene family in a large genome database, using both sequence and, importantly, RNA secondary structure information. Unfortunately, CM searches are extremely slow. Previously, we created rigorous filters, which provably sacrifice none of a CM's accuracy, while making searches significantly faster for virtually all ncRNA families. However, these rigorous filters make searches slower than heuristics could be. RESULTS: In this paper we introduce profile HMM-based heuristic filters. We show that their accuracy is usually superior to heuristics based on BLAST. Moreover, we compared our heuristics with those used in tRNAscan-SE, whose heuristics incorporate a significant amount of work specific to tRNAs, where our heuristics are generic to any ncRNA. Performance was roughly comparable, so we expect that our heuristics provide a high-quality solution that--unlike family-specific solutions--can scale to hundreds of ncRNA families. AVAILABILITY: The source code is available under GNU Public License at the supplementary web site.
DOI	10.1093/bioinformatics/bti743
Downloads	http://www.ncbi.nlm.nih.gov/pubmed/16267089?dopt=Abstract
Alternate Journal	Bioinformatics
Citation Key	1880
PubMed ID	16267089