Title: Large Scale Analytics On Image Data

Advisor: Magdalena Balazinska

Supervisory Committee: Magdalena Balazinska (Chair), Ione Fine (GSR, Psychology), Su-In Lee, Hannaneh Hajishirzi, Ariel Rokem (UW eScience Institute)

Abstract: The world of science is changing: scientific discoveries are increasingly driven by the analysis of large volumes of data. A large proportion of this data consists of images, such as those captured by telescopes, satellites, microscopes, and magnetic resonance devices. For high-dimensional data such as images, deep learning has delivered paradigm-shifting performance. Traditional methods of building machine learning models can be iterative and arduous: scientists typically build tens to hundreds of models before settling on one. This problem is exacerbated for deep learning models by large input data and by the interpretability artifacts that must be generated to build, diagnose, and interpret these models. This imposes a significant data management burden on the scientist building them. We propose a doctoral thesis that aims to reduce this burden by reducing the amount of data scientists need to examine at each iteration of model building. Our approach relies on the insight that supervised deep learning models simultaneously learn a classifier and a low-dimensional representation of the input data. The key idea underpinning our approach is to identify and target model decision boundaries to provide effective and efficient samples that can be used to debug and interpret these models. The talk will focus on the following research ideas: (1) challenges associated with building, debugging, and interpreting deep learning models; (2) a characterization of the deep learning model building and debugging workload; and (3) the application of sampling and approximate query processing to this workload.

With advances in data collection and storage technologies, data analysis has been heralded as the fourth paradigm of science [1]. In many scientific fields, an increasing portion of this data is images [2, 3]. As an example, UK Biobank [4] is an international health resource with the aim of improving the prevention, diagnosis, and treatment of a wide range of serious and life-threatening illnesses, including cancer, diabetes, arthritis, eye disorders, depression, and forms of dementia. UK Biobank recruited 500,000 people aged between 40 and 69 years in 2006-2010. These participants have undergone measurements, provided blood, urine, and saliva samples for future analysis, given detailed information about themselves, and agreed to have their health followed. The size of this data is staggering: the Magnetic Resonance Imaging (MRI) data alone exceeds 200 TB [5]. This is not an isolated case; multiple other initiatives are similarly making large collections of image data available to researchers.

The increased availability of large-scale data; improvements in hardware, such as the emergence of GPU-based computing and cloud computing resources; better algorithms; and relatively easy-to-use open-source deep learning libraries with sizable communities are making deep learning a powerful tool for researchers working with image data [6, 7]. In contrast to the traditional approach of hand-selecting features before training a model, deep learning features are learned automatically from the input data, which is arranged in dense, high-dimensional numerical arrays such as images, audio, and text. It is with this kind of data that deep neural networks have achieved paradigm-shifting performance.
For instance, neural networks trained with deep learning now outperform humans at classifying images from the ImageNet dataset [8], which is commonly used as a benchmark for convolutional architectures, as illustrated in Figure 1.
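To make the two ideas in the abstract concrete, the following is a minimal sketch, not the thesis's actual system: it assumes a toy PyTorch classifier (ToyNet, boundary_indices, and all dimensions are hypothetical) in which the penultimate layer plays the role of the learned low-dimensional representation, and it flags examples whose top-two softmax probabilities are close, since those lie near the decision boundary and are natural candidates for debugging and interpretation.

    import torch
    import torch.nn as nn

    # Hypothetical toy classifier: the penultimate layer ("embed") is the
    # low-dimensional representation learned jointly with the classifier.
    class ToyNet(nn.Module):
        def __init__(self, in_dim=784, embed_dim=32, n_classes=10):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU())
            self.head = nn.Linear(embed_dim, n_classes)

        def forward(self, x):
            z = self.embed(x)        # low-dimensional representation
            return self.head(z), z   # class scores and embedding

    def boundary_indices(probs: torch.Tensor, k: int) -> torch.Tensor:
        """Indices of the k examples closest to the decision boundary,
        using the top-1/top-2 softmax margin as a proxy for distance."""
        top2 = probs.topk(2, dim=1).values
        margin = top2[:, 0] - top2[:, 1]  # small margin = near the boundary
        return margin.argsort()[:k]

    # Hypothetical usage on random data standing in for flattened images.
    model = ToyNet()
    x = torch.randn(100, 784)
    with torch.no_grad():
        logits, z = model(x)
        probs = logits.softmax(dim=1)
    idx = boundary_indices(probs, k=10)  # samples worth inspecting first
    print(z[idx].shape)                  # their 32-d learned representations

In this sketch, the margin between the two largest class probabilities stands in for distance to the decision boundary; the thesis itself may target boundaries with a different criterion.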

Place: 
CSE1 (Allen Center) 291 (Database lab)
When: 
Thursday, September 26, 2019 - 10:30 to 12:30