Machine Learning Seminars

Title: Theoretical and Algorithmic Foundations of Online Learning
Speaker: Sasha Rakhlin, Wharton School, University of Pennsylvania
When: Tuesday, April 8, 2014 - 12:30
Location: CSE 305

A major challenge for machine learning in the next decade is the development of methods that continuously predict and learn from a stream of data. The scale of data and the non-stationary nature of the environment make the prevalent "learn from a batch of i.i.d. data" paradigm of Statistical Learning inadequate. Despite the extensive literature on sequential prediction (online learning) methods, theoretical understanding of the subject has been lacking, and results have been obtained on a case-by-case basis. This stands in contrast to Statistical Learning, where the inherent complexities have been well understood for the last forty years. In this talk, we focus on no-regret online learning and develop the relevant notions of complexity in a surprising parallel to Statistical Learning Theory. This non-constructive study of inherent complexity is then augmented with a recipe for developing efficient online learning algorithms via a notion of a relaxation. To demonstrate the utility of our approach, we develop a new family of randomized methods, as well as new algorithms for collaborative filtering and node prediction.


Title: Scaling and Generalizing Variational Inference
Speaker: David Blei, Princeton University
When: Monday, March 31, 2014 - 12:30
Location: CSE 305

Latent variable models have become a key tool for the modern statistician, letting us express complex assumptions about the hidden structures that underlie our data. Latent variable models have been successfully applied in numerous fields including natural language processing, computer vision, electronic medical records, genetics, neuroscience, astronomy, political science, sociology, the digital humanities, and many others.

The central computational problem in latent variable modeling is posterior inference, the problem of approximating the conditional distribution of the latent variables given the observations. Posterior inference is central to both exploratory tasks, where we investigate hidden structures that underlie our data, and predictive tasks, where we use the inferred structures to generalize about future data. Approximate posterior inference algorithms have revolutionized Bayesian statistics, revealing its potential as a usable and general-purpose language for data analysis.

Bayesian statistics, however, has not yet reached this potential. First, statisticians and scientists regularly encounter massive data sets, but existing approximate inference algorithms do not scale well. Second, most approximate inference algorithms are not generic; each must be adapted to the specific model at hand. This often requires significant model-specific analysis, which precludes us from easily exploring a variety of models.

In this talk I will discuss our recent research on addressing these two limitations. First I will describe stochastic variational inference, an approximate inference algorithm for handling massive data sets. Stochastic inference is easily applied to a large class of Bayesian models, including time-series models, factor models, and Bayesian nonparametric models. I will demonstrate its application to probabilistic topic models of text conditioned on millions of articles. Stochastic inference opens the door to scalable Bayesian computation for modern data analysis.
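The update scheme behind stochastic variational inference can be sketched on a toy conjugate model (the model, constants, and step-size schedule below are illustrative choices, not taken from the paper): sample a data point, form the natural parameters the full dataset would imply if it consisted of replicates of that point, and take a decaying natural-gradient step toward them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: x_i ~ N(mu, 1) with prior mu ~ N(0, 1).
# Natural parameters of p(mu) propto exp(eta1*mu + eta2*mu^2): prior is
# (0, -0.5); each datum contributes (x_i, -0.5), so the exact posterior
# has natural parameters (sum_i x_i, -0.5 - N/2), i.e. mean sum(x)/(N+1).
N = 1000
x = rng.normal(2.0, 1.0, size=N)
exact_mean = x.sum() / (N + 1)

prior = np.array([0.0, -0.5])
eta = prior.copy()                               # initialize q at the prior
for t in range(5000):
    j = rng.integers(N)
    eta_hat = prior + N * np.array([x[j], -0.5])  # as if the data were N copies of x[j]
    rho = (t + 10.0) ** -0.7                      # Robbins-Monro step size
    eta = (1 - rho) * eta + rho * eta_hat

svi_mean = -eta[0] / (2 * eta[1])
```

The same pattern (subsample, scale up, average into the global variational parameters) is what makes the method applicable to much richer conjugate-exponential-family models such as topic models.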

Then I will discuss black box variational inference. Black box inference is a generic algorithm for approximating the posterior. We can easily apply it to many models with little model-specific derivation and few restrictions on their properties. Black box inference performs better than similarly generic sampling algorithms, such as Metropolis-Hastings inside Gibbs, and can be composed with stochastic inference to handle massive data. I will demonstrate its use on a suite of nonconjugate models of longitudinal healthcare data.
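The "black box" idea can be illustrated with the score-function gradient estimator on a toy posterior; the target, variational family, and all constants below are illustrative assumptions, not the paper's setup. Note that the only model-specific code is the log joint:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: posterior of a Gaussian mean. x_i ~ N(mu, 1), prior mu ~ N(0, 1),
# so the exact posterior is N(sum(x)/(n+1), 1/(n+1)).
n = 50
x = rng.normal(3.0, 1.0, size=n)
exact_mean = x.sum() / (n + 1)

def log_joint(z):
    # log p(x, z) up to an additive constant, for a vector of samples z
    return -0.5 * np.sum((x[None, :] - z[:, None]) ** 2, axis=1) - 0.5 * z ** 2

# Variational family q(z) = N(m, s^2); ascend the ELBO with the
# score-function (REINFORCE) gradient and a mean baseline for variance reduction.
m, log_s = 0.0, 0.0
lr, S = 0.01, 200
for t in range(2000):
    s = np.exp(log_s)
    z = m + s * rng.normal(size=S)
    log_q = -0.5 * ((z - m) / s) ** 2 - log_s    # up to a constant
    f = log_joint(z) - log_q                     # instantaneous ELBO terms
    f = f - f.mean()                             # baseline (control variate)
    grad_m = np.mean(((z - m) / s ** 2) * f)     # score of q w.r.t. m, times f
    grad_log_s = np.mean((((z - m) / s) ** 2 - 1.0) * f)
    m += lr * grad_m
    log_s += lr * grad_log_s
```

Because the gradient only queries `log_joint` at sampled points, swapping in a different (even nonconjugate) model requires changing nothing else, which is the sense in which the algorithm is generic.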

This is joint work based on these two papers:

M. Hoffman, D. Blei, J. Paisley, and C. Wang. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.

R. Ranganath and D. Blei. Black box variational inference. Artificial Intelligence and Statistics, 2014.


Title: On Estimating Many Effect-sizes, Selection Bias, and Local Adaptive Shrinkage
Speaker: Noah Simon, UW Biostatistics
When: Tuesday, February 4, 2014 - 12:30
Location: CSE 305

There is a growing scientific interest in simultaneously estimating the means (or effect-sizes) of many different features. Classical (but very unintuitive) results in Stein-estimation show that even if these features are independent, cleverly coupling their mean estimates can greatly improve estimation accuracy. These ideas have been extended to build locally adaptive estimators based on perspectives from non-parametric empirical Bayesian estimation (and compound decision theory). Unfortunately, these estimators are not simple, and are really only tractable in highly idealized scenarios.

In recent work, we introduce a different framework for this estimation problem. Our framework is intuitive and shows how estimates from "independent features" become coupled through selection bias. We also give a simple general estimator based on resampling which is applicable and performs well in a wide variety of scenarios.
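For background, the Stein phenomenon referenced above is easy to reproduce in simulation. This sketch of the (positive-part) James-Stein estimator, with arbitrary simulated effect sizes, shows coupled estimates beating the feature-by-feature maximum likelihood estimates:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate p independent means: one observation X_i ~ N(theta_i, 1) per feature.
p = 100
theta = rng.normal(0.0, 1.0, size=p)
x = theta + rng.normal(size=p)

# James-Stein estimator: shrink all raw observations toward zero by a
# common data-dependent factor (positive-part variant).
shrink = max(0.0, 1.0 - (p - 2) / np.sum(x ** 2))
theta_js = shrink * x

mse_mle = np.mean((x - theta) ** 2)      # risk of the raw observations
mse_js = np.mean((theta_js - theta) ** 2)
```

Even though the features are independent, the shared shrinkage factor couples the estimates and lowers the total squared error, which is exactly the "very unintuitive" classical result.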

In this talk I will discuss the intuition behind "coupling estimates from independent features." I will review empirical Bayes/Stein estimation, introduce our framework, explain the connections between these estimators, and show how they compare in practice.

This presentation is intended for a general machine learning audience.

This is joint work with Richard Simon (NIH) and recently Kean Ming Tan and Daniela Witten (UW) with occasional skepticism from Brad Efron (Stanford).


Title: Recovering Block-structured Activations Using Compressive Measurements
Speaker: Mladen Kolar, University of Chicago
When: Tuesday, January 14, 2014 - 12:30
Location: CSE 305

In recent years, with the advancement of large-scale data acquisition technology in various engineering, scientific, and socio-economic domains, traditional machine learning and statistical methods have started to struggle with massive amounts of increasingly high-dimensional data. Luckily, in many problems there is a simple structure underlying high-dimensional data that can be exploited to make learning feasible.

In this talk, I will focus on the problem of detection and localization of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive measurements. This is closely related to the problem of compressed sensing, where the task is to estimate a sparse vector using a small number of linear measurements. Contrary to results in compressed sensing, where it has been shown that neither adaptivity nor contiguous structure help much, we show that for reliable localization the magnitude of the weakest signals is strongly influenced by both structure and the ability to choose measurements adaptively. For detection, by contrast, neither adaptivity nor structure reduces the requirement on the magnitude of the signal. We characterize the precise tradeoffs between the various problem parameters, the signal strength, and the number of measurements required to reliably detect and localize the block of activation. The sufficient conditions are complemented with information-theoretic lower bounds.
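The compressed sensing subproblem mentioned above (estimating a sparse vector from a few linear measurements) can be sketched with the lasso solved by iterative soft thresholding; the dimensions, noise level, and regularization strength below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Recover a k-sparse vector from n < p noisy linear measurements y = A x + w
# by solving the lasso with ISTA (iterative soft thresholding).
n, p, k = 200, 500, 10
A = rng.normal(size=(n, p)) / np.sqrt(n)      # columns roughly unit norm
x_true = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k)
y = A @ x_true + 0.01 * rng.normal(size=n)

lam = 0.05                                    # regularization strength (hand-picked)
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the gradient
x_hat = np.zeros(p)
for _ in range(1000):
    grad = A.T @ (A @ x_hat - y)
    z = x_hat - grad / L                      # gradient step on the quadratic
    x_hat = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold

rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

The talk's setting replaces unstructured sparsity with a contiguous block of activation in a matrix, where (unlike here) structure and adaptivity change the achievable signal magnitudes for localization.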

Joint work with Sivaraman Balakrishnan, Alessandro Rinaldo and Aarti Singh.


Title: Using machine learning to achieve MOOC-scale grading and feedback for open-ended assignments
Speaker: Jonathan Huang, Stanford
When: Tuesday, November 12, 2013 - 12:15
Location: CSE Gates Commons

Massive open online courses (MOOCs) are one of the latest internet revolutions and may cure the “cost disease” of higher education. However, dealing with tens of thousands of homework submissions (particularly open-ended ones) remains a challenge in the online classroom. In courses where the student-teacher ratio can be ten thousand to one or worse, it is impossible for instructors to personally give feedback to students or to understand the multitude of approaches taken, and pitfalls encountered, by their students.

Peer grading serves as a critical tool for scaling the grading of complex, open-ended assignments to courses with tens or hundreds of thousands of students. But despite promising initial trials, it does not always deliver accurate results compared to human experts. I will talk about methods for estimating and correcting for grader biases and reliabilities, showing significant improvement in peer grading accuracy on real data with 63,199 peer grades from Coursera's HCI course offerings --- the largest peer grading networks analyzed to date. I will also discuss the relationship of grader biases and reliabilities to other student factors such as engagement, performance, and commenting style.
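A minimal sketch of bias correction in peer grading, assuming a simple additive model (observed grade = true score + grader bias + noise) and ignoring the per-grader reliability weights the talk discusses:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated peer grading: each submission is graded by a few random graders.
n_sub, n_graders, per_sub = 200, 30, 4
true_score = rng.normal(70.0, 8.0, size=n_sub)
bias = rng.normal(0.0, 5.0, size=n_graders)

graders = np.array([rng.choice(n_graders, size=per_sub, replace=False)
                    for _ in range(n_sub)])              # (n_sub, per_sub)
grades = (true_score[:, None] + bias[graders]
          + rng.normal(0.0, 2.0, size=graders.shape))

# Naive estimate: plain average of the peer grades for each submission.
naive = grades.mean(axis=1)

# Alternate between re-estimating scores (bias-corrected averages) and
# grader biases (average residuals over each grader's assignments).
b_hat = np.zeros(n_graders)
for _ in range(20):
    score_hat = (grades - b_hat[graders]).mean(axis=1)
    resid = grades - score_hat[:, None]
    counts = np.maximum(np.bincount(graders.ravel(), minlength=n_graders), 1)
    b_hat = np.bincount(graders.ravel(), weights=resid.ravel(),
                        minlength=n_graders) / counts
    b_hat -= b_hat.mean()       # biases are identified only up to a constant

rmse_naive = np.sqrt(np.mean((naive - true_score) ** 2))
rmse_corrected = np.sqrt(np.mean((score_hat - true_score) ** 2))
```

Even this stripped-down model removes most of the error that grader bias injects into plain averaging; modeling reliabilities as well, as in the work presented, weights trustworthy graders more heavily.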

I will also discuss scalable feedback in the context of coding assignments, which have more structure than arbitrary open-ended assignments. I will outline a novel way to decompose online coding submissions into a vocabulary of “code phrases”. Based on this vocabulary, we have architected a queryable index that allows fast searches into the massive dataset of student homework submissions. To demonstrate the utility of our homework search engine, we index over a million code submissions from users worldwide in Stanford’s Machine Learning MOOC and then (a) semi-automatically learn shared structure amongst homework submissions and (b) generate specific feedback for student mistakes.
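A toy version of the "code phrase" indexing idea, assuming phrases are simply token n-grams (the real system's decomposition and index are surely more sophisticated):

```python
from collections import defaultdict

def phrases(code, n=3):
    """Decompose a tokenized submission into its set of token n-grams."""
    toks = code.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Three tiny pre-tokenized "submissions" (illustrative, not real data).
submissions = {
    0: "for i in range ( n ) : total += x [ i ]",
    1: "total = sum ( x )",
    2: "for i in range ( n ) : total += x [ i ] ** 2",
}

# Inverted index: phrase -> set of submission ids containing it.
index = defaultdict(set)
for sid, code in submissions.items():
    for ph in phrases(code):
        index[ph].add(sid)

def search(snippet):
    """Return submission ids sharing at least one phrase with the snippet."""
    hits = set()
    for ph in phrases(snippet):
        hits |= index.get(ph, set())
    return hits

result = search("for i in range ( n ) : s += x [ i ]")
```

The query matches both loop-based submissions despite a renamed variable, because they still share many surrounding phrases; grouping submissions by shared phrases is what enables feedback written once to reach many students.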

This is joint work with Chris Piech, Andy Nguyen, Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller, and Leonidas Guibas.


Title: A Characterization of Online Learning Problems
Speaker: Ofer Dekel, MSR
When: Tuesday, October 29, 2013 - 12:30
Location: CSE Gates Commons

We study the learning rate of the adversarial online learning problem in a finite action space. The “adversarial multi-armed bandit” problem and “predicting with expert advice” are two well-known special cases of this fundamental online learning setting. We present a rich and complete characterization of the inherent difficulty of this problem as we vary the power of the adversary and the amount of information provided to the learning algorithm. Specifically, we classify the different variants of this problem into three distinct difficulty classes: easy problems (those with sqrt(T) regret), such as learning against oblivious adversaries; hard problems (with T^(2/3) regret), such as the multi-armed bandit with switching costs; and unlearnable problems (with linear regret), such as learning against adversaries that have an unbounded memory. Our new characterization has two important implications. First, we formally confirm the common knowledge that learning with bandit feedback can be strictly harder than learning with full information. Second, we prove that learning in the bandit setting requires extensive switching between actions.
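The "predicting with expert advice" special case admits a short illustration of the easy, sqrt(T)-regret regime via the standard Hedge (multiplicative weights) algorithm; the losses below come from a simulated oblivious adversary:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hedge over K experts for T rounds; with eta = sqrt(8 ln K / T) its regret
# is at most sqrt((T/2) ln K) against any loss sequence in [0, 1].
K, T = 10, 5000
eta = np.sqrt(8 * np.log(K) / T)
losses = rng.uniform(size=(T, K))        # an oblivious adversary's losses

w = np.ones(K)
alg_loss = 0.0
for t in range(T):
    p = w / w.sum()                      # play the weighted distribution
    alg_loss += p @ losses[t]            # expected loss this round
    w *= np.exp(-eta * losses[t])        # downweight poorly performing experts

best_expert_loss = losses.sum(axis=0).min()
regret = alg_loss - best_expert_loss
bound = np.sqrt(T / 2 * np.log(K))
```

The hard and unlearnable classes in the talk arise when feedback is restricted to the chosen action (bandit) or the adversary is allowed memory of the learner's past plays, settings this full-information sketch does not cover.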

Joint work with Raman Arora (TTI-C), Nicolo Cesa-Bianchi (Milano), Jian Ding (UChicago), Tomer Koren (Technion), Yuval Peres (MSR), Ohad Shamir (Weizmann), and Ambuj Tewari (UMICH)

The talk is intended for a general machine learning audience.


Title: Local Low-Rank Matrix Approximation
Speaker: Guy Lebanon, Amazon
When: Tuesday, October 1, 2013 - 12:30
Location: CSE 305

Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low rank. We propose a new matrix approximation model where we assume instead that the matrix is locally of low rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.
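A toy illustration of the local low-rank assumption (a caricature with hard blocks, not the paper's smoothly weighted sum of low-rank matrices): a matrix whose blocks are each rank one is poorly served by a single global rank-1 fit but recovered exactly by local rank-1 fits.

```python
import numpy as np

rng = np.random.default_rng(6)

def rank1(M):
    """Best rank-1 approximation of M via the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

# Globally rank-2 matrix built from two rank-1 blocks.
A = np.outer(rng.normal(size=20), rng.normal(size=20))
B = np.outer(rng.normal(size=20), rng.normal(size=20))
Z = np.zeros((20, 20))
M = np.block([[A, Z], [Z, B]])

# Global rank-1 fit vs. stitching together local rank-1 fits of each block.
global_err = np.linalg.norm(M - rank1(M))
local = np.block([[rank1(A), Z], [Z, rank1(B)]])
local_err = np.linalg.norm(M - local)
```

In the recommendation setting the "blocks" are soft neighborhoods of similar users and items, and the local factorizations are blended with kernel weights rather than stitched.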


Title: Method-of-Moment Algorithms for Learning Bayesian Networks
Speaker: David Sontag, NYU
When: Tuesday, July 16, 2013 - 12:30
Location: CSE 305

We present two algorithms for extracting structure from big data. The first is a new approach to learning Bayesian network structure based on a data-dependent complexity penalty. We show that the new scoring function has a small sample complexity and has the property that it becomes computationally easier to find the highest-scoring structure as the amount of data increases. Next, we present a new algorithm with provable guarantees for discrete factor analysis from binary data, enabling the discovery of hidden variables and their causal relationships with observed data. These methodologies have applications throughout computational biology, medicine, and the social sciences.


Title: Which Supervised Learning Method Works Best for What? An Empirical Comparison of Learning Methods and Metrics
Speaker: Rich Caruana, Microsoft Research
When: Tuesday, June 4, 2013 - 12:30
Location: CSE Gates Commons

Decision trees can be intelligible, but do they perform well enough that you should use them? Have SVMs replaced neural nets, or are neural nets still best for regression and SVMs best for classification? Boosting maximizes a margin similar to SVMs, but can boosting compete with SVMs? If it does, is it better to boost weak models or to boost stronger models? Bagging is easier than boosting - how well does it stack up against boosting? Breiman said Random Forests are better than bagging and as good as boosting. Was he right? And what about old friends like logistic regression, KNN, and naive Bayes? Should they be put out to pasture, or do they fill important niches?

In this talk we'll compare the performance of ten supervised learning methods on nine criteria: Accuracy, F-score, Lift, Precision/Recall Break-Even Point, Area under the ROC, Average Precision, Squared Error, Cross-Entropy, and Probability Calibration. The results show that no one learning method does it all, but some methods can be 'repaired' to do very well across all performance metrics. In particular, we show how to obtain good probabilities from max-margin methods such as SVMs and boosting via Platt's Method and Isotonic Regression. We also describe a meta-ensemble method that combines select models from these ten learning methods to yield even better performance than any of the individual learning methods. Although these ensembles perform extremely well, they are too complex for some real-world applications. We'll describe a model compression method that tries to fix that. Finally, if time permits, we'll discuss how the nine performance metrics relate to each other, and on which of the metrics you probably should (or shouldn't) depend.
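As a sketch of one of the repairs mentioned above, Platt's method fits a sigmoid to a classifier's margin scores by minimizing log loss; the scores below are simulated stand-ins for SVM outputs, and this plain gradient-descent fit omits the target smoothing of the original method:

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated margin scores: class-conditional Gaussians scaled so that
# pushing the raw score through a sigmoid gives overconfident probabilities.
n = 2000
y = (rng.uniform(size=n) < 0.5).astype(float)
s = np.where(y == 1, rng.normal(2.0, 3.0, n), rng.normal(-2.0, 3.0, n))

# Platt scaling: fit sigmoid(a*s + b) to the labels by gradient descent
# on the mean log loss.
a, b = 1.0, 0.0
for _ in range(2000):
    p = sigmoid(a * s + b)
    a -= 0.1 * np.mean((p - y) * s)     # gradient of log loss w.r.t. a
    b -= 0.1 * np.mean(p - y)           # gradient of log loss w.r.t. b

raw = log_loss(sigmoid(s), y)
calibrated = log_loss(sigmoid(a * s + b), y)
```

Isotonic regression plays the same calibrating role non-parametrically, fitting a monotone map from scores to probabilities instead of a two-parameter sigmoid.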


Title: Reproducibility and Probabilistic Tsunami Hazard Assessment
Speaker: Randall Leveque, UW Applied Math
When: Tuesday, May 28, 2013 - 12:30
Location: CSE Gates Commons

I will discuss two somewhat independent topics. The first is the importance of reproducibility in computational research and the need for higher standards and better tools to help ensure that published results can be reproduced (preferably by other researchers, but at least by the authors!). I will mention some of the tools already available to help with this.

Tsunami modeling is a motivating example since the results of numerical simulations are often used to inform public policy decisions that can have significant financial and safety implications. I will also briefly describe some of the techniques used to perform probabilistic hazard assessment and to solve the inverse problems necessary to do real-time forecasting, in hopes of identifying overlapping interests with the machine learning community.