Title: Computational Curation of Open Science Data
Advisor: Bill Howe
Supervisory Committee: Bill Howe (Chair), David Ribes (GSR, HCDE), Hoifung Poon (MSR), and Larry Ruzzo
Abstract: Rapid advances in data collection, storage, and processing technologies are leading to a new, data-driven paradigm in science. In the life sciences, progress is driven by plummeting genome sequencing costs, which have opened up the new fields of bioinformatics, genomics, and systems biology. Yet the enormous investments in collecting and storing this data are undermined by a lack of curation, which leaves much of the data stagnant and unused. In this talk, we introduce several approaches aimed at making open scientific data more easily accessible and reusable.
First, in the Wide-Open project, we introduce a text mining system for detecting datasets that are referenced in published papers but are still kept private. After parsing over 1.5 million open access publications, Wide-Open identified hundreds of datasets overdue for publication, 400 of which were then released within one week.
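The detection step can be illustrated with a minimal sketch. It assumes that datasets are cited by GEO series accessions (identifiers of the form GSE followed by digits) and that a set of already-public accessions is available; the names `find_accessions`, `overdue_datasets`, and the `public` set are hypothetical and are not taken from the Wide-Open system itself.

```python
import re

# Assumption: datasets are cited by GEO series accessions such as "GSE12345".
GEO_ACCESSION = re.compile(r"\bGSE\d+\b")


def find_accessions(text):
    """Return the distinct GEO series accessions mentioned in text, in order."""
    seen = []
    for match in GEO_ACCESSION.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def overdue_datasets(paper_text, public_accessions):
    """Return accessions cited in the paper that are not in the public set."""
    return [acc for acc in find_accessions(paper_text)
            if acc not in public_accessions]


# Hypothetical inputs: a sentence from a paper and a stand-in for a
# repository lookup that reports which accessions are already public.
paper = "Expression data are available under GSE10001 and GSE10002 (see also GSE10001)."
public = {"GSE10001"}
print(overdue_datasets(paper, public))
```

In a real pipeline the `public` set would presumably be replaced by a query against the repository, and a cited accession that is missing there would only be a candidate for follow-up rather than proof that a dataset is overdue.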
Second, we propose a machine learning system, EZLearn, for annotating gene expression data into potentially thousands of classes without the manual work otherwise required to provide training labels. EZLearn is based on the observation that, in scientific domains, data samples often come with natural language descriptions meant for human consumption. We take advantage of those descriptions by introducing an auxiliary natural language processing system and training it together with the main classifier in a co-training fashion.
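The co-training idea can be sketched on toy data. This is only an illustration under loud assumptions: the nearest-centroid classifier, the word-vote text classifier, the seeding rule (a class name appearing in the description), and all function names are stand-ins chosen for brevity, not the models used in EZLearn.

```python
from collections import Counter, defaultdict


def centroid_fit(vectors, labels):
    """Average the vectors of each labeled class (unlabeled samples are None)."""
    sums, counts = {}, Counter()
    for vec, label in zip(vectors, labels):
        if label is None:
            continue
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def centroid_predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], vec)))


def text_fit(texts, labels):
    """Count, for every word, how often it occurs with each class label."""
    word_votes = defaultdict(Counter)
    for text, label in zip(texts, labels):
        if label is None:
            continue
        for word in set(text.lower().split()):
            word_votes[word][label] += 1
    return word_votes


def text_predict(word_votes, text):
    """Let every known word in the description vote for its classes."""
    votes = Counter()
    for word in set(text.lower().split()):
        votes.update(word_votes.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def co_train(vectors, texts, class_names, rounds=3):
    """Alternate between the two views, each trained on the other's labels."""
    # Seed labels (assumption): a class name appears verbatim in the description.
    text_labels = [next((c for c in class_names if c in t.lower().split()), None)
                   for t in texts]
    expr_labels = list(text_labels)
    for _ in range(rounds):
        centroids = centroid_fit(vectors, text_labels)
        expr_labels = [centroid_predict(centroids, v) for v in vectors]
        word_votes = text_fit(texts, expr_labels)
        text_labels = [text_predict(word_votes, t) for t in texts]
    return expr_labels


# Toy data: two-dimensional "expression" vectors with free-text descriptions.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
texts = ["liver biopsy tissue", "hepatocyte culture",
         "brain cortex tissue", "cortex neurons"]
labels = co_train(vectors, texts, ["liver", "brain"])
print(labels)
```

Only the first and third descriptions name a class, so they alone receive seed labels. The expression view then labels the other two samples, and the text view picks up words such as "hepatocyte" and "neurons" from those predictions. The sketch assumes at least one seed label exists; with none, the centroid step has nothing to fit.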
Finally, we will discuss several approaches that capture scientific claims from papers, validate them against the data associated with each paper, and then generalize and adapt the claims to other relevant datasets in the repository to gather additional statistical evidence. We evaluated the system by applying it to gene expression datasets and producing reports that summarize the evidence for or against each claim based on the entirety of the collected knowledge in the repository. We find that the claim-based algorithms we propose outperform conventional data integration methods and achieve high accuracy against manually validated claims.