Title: Deep Curation: An Unsupervised Approach to Biological Data Curation
Advisor: Bill Howe
Supervisory Committee: Bill Howe (Chair), David Ribes (GSR, HCDE), Hoifung Poon (MSR), and Larry Ruzzo
Abstract: Public repositories such as the Gene Expression Omnibus (GEO) have seen rapid accumulation of biological data. However, high-quality, consistent annotation is generally unavailable, which severely limits the use of multiple datasets for new discoveries, reproducibility, and other computational tasks. Previous attempts to automate the curation task require hand-labeled training data, which is not generally available and must be reacquired whenever the ontology that provides the class labels changes.
We propose a new method that learns tissue type labels for GEO datasets with no training labels. We learn two classifiers in tandem, one over the free-text description of a dataset and another over the raw microarray signal itself, and have the two classifiers train each other iteratively. We applied this method to GEO to produce an expression-based classifier that outperforms the state-of-the-art supervised-learning method in accuracy without any hand-labeled training data.
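
The idea of two classifiers training each other can be illustrated with a minimal co-training sketch. This is not the method described above, only a toy under stated assumptions: initial labels are seeded by matching one hypothetical ontology term per tissue in the free text (the abstract does not say how labels are bootstrapped), the expression model is a nearest-centroid stand-in, and the text model is a simple naive Bayes stand-in. All function names and data are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_centroids(vectors, labels):
    """Expression view: mean vector per label (unlabeled samples are skipped)."""
    sums, counts = {}, Counter()
    for vec, lab in zip(vectors, labels):
        if lab is None:
            continue
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] += 1
    if not sums:
        raise ValueError("no labeled samples to train on")
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(lab):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[lab]))
    return min(centroids, key=dist)


def train_text(docs, labels):
    """Text view: per-label word counts for a multinomial naive Bayes model."""
    word_counts, label_counts, vocab = defaultdict(Counter), Counter(), set()
    for doc, lab in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[lab].update(tokens)
        label_counts[lab] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_text(model, doc):
    """Score each label with add-one smoothed log-probabilities."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())

    def score(lab):
        n = sum(word_counts[lab].values())
        s = math.log(label_counts[lab] / total)
        for tok in doc.lower().split():
            if tok in vocab:
                s += math.log((word_counts[lab][tok] + 1) / (n + len(vocab)))
        return s

    return max(label_counts, key=score)


def co_train(docs, vectors, seed_terms, rounds=5):
    """Alternate the two views until the labels stop changing.

    seed_terms maps each label to one keyword (an assumption of this sketch);
    a document is seeded only if exactly one label's keyword appears in it.
    """
    labels = []
    for doc in docs:
        tokens = set(doc.lower().split())
        hits = [lab for lab, term in seed_terms.items() if term in tokens]
        labels.append(hits[0] if len(hits) == 1 else None)

    for _ in range(rounds):
        # The text-derived labels train the expression classifier ...
        centroids = train_centroids(vectors, labels)
        expr_labels = [predict_centroid(centroids, v) for v in vectors]
        # ... and the expression predictions retrain the text classifier.
        text_model = train_text(docs, expr_labels)
        new_labels = [predict_text(text_model, d) for d in docs]
        if new_labels == labels:
            break
        labels = new_labels

    return train_centroids(vectors, labels), labels


# Toy usage with made-up descriptions and two-feature "expression" vectors.
docs = ["liver biopsy sample", "hepatic tissue from donor", "brain cortex sample",
        "neuronal tissue cortex", "liver hepatic extract", "cortex region tissue"]
vectors = [[5.0, 1.0], [4.8, 1.2], [1.0, 5.0], [1.1, 4.9], [5.2, 0.9], [0.9, 5.1]]
centroids, labels = co_train(docs, vectors, {"liver": "liver", "brain": "brain"})
```

In this toy, only three of the six descriptions contain a seed term; the expression view assigns labels to the other three, and the text view then picks up words such as "hepatic" and "cortex" that were never given as seeds. The returned centroids act as the expression-based classifier, which can be applied to samples that have no usable text.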