Title: High-dimensional machine learning techniques for integrative analysis of heterogeneous molecular data

Advisor: Su-In Lee

Supervisory Committee: Su-In Lee (Chair), Meliha Yetisgen (GSR, Biomedical Informatics), Paul Crane (Public Health), Raymond David Hawkins (Genome Sciences), and Ali Shojaie (Biostatistics, Public Health)

Abstract: Mining robust and relevant information from molecular data is essential for identifying the mechanisms of complex biological phenotypes, including disease mechanisms. However, applying a standard machine learning approach to an individual molecular dataset poses three main challenges. First, most molecular data falls in the p >> n regime (i.e., the number of variables is much greater than the number of samples); for instance, there are about 20,000 genes in a human cell, while at most a few hundred patient samples are available in an individual gene expression dataset. This implies that models allowing for arbitrarily rich dependencies among variables (such as those used in deep learning methods) are highly likely to overfit the data. Second, any one study contains technical or experimental confounders, so features learned from an individual dataset do not necessarily generalize to other datasets. Finally, and most importantly, any biological or disease mechanism arises from the interplay of several different types of genetic and epigenetic elements, and focusing on any single type of those elements will provide a poor understanding of the underlying complex biology.

In this talk, I will introduce novel machine learning techniques to address these challenges. I will start by describing two methods we developed, MGL and MERGE, each of which learns a different kind of biologically interpretable feature from a single high-dimensional gene expression dataset. Then I will introduce the INSPIRE probabilistic model, which makes it possible to use all samples from multiple datasets with different sets of measured variables, increasing the statistical power to detect generalizable and robust features. I will show that, when used to combine datasets from ovarian cancer, INSPIRE reveals important molecular events underlying the disease. Finally, I will propose a network learning framework that builds on our previous work and enables the integration of heterogeneous data of different molecular types from Alzheimer's disease and leukemia patients. We anticipate that the features learned by the proposed method will provide a better understanding of complex disease mechanisms, which will lead to new therapeutic targets as well as better diagnosis and personalized treatment of patients.

Place: CSE 303
When: Monday, October 17, 2016 - 13:00