## Tuesday, November 12, 2013 - 12:15

## Using machine learning to achieve MOOC-scale grading and feedback for open-ended assignments

**Speaker: ** Jonathan Huang, Stanford

**Location: ** CSE Gates Commons

Massive open online courses (MOOCs) are one of the latest internet revolutions and may cure the “cost disease” of higher education. However, dealing with tens of thousands of homework submissions (particularly open-ended ones) remains a challenge in the online classroom. In courses where the student-teacher ratio can be ten thousand to one or worse, it is impossible for instructors to personally give feedback to students or to understand the multitude of approaches taken, and pitfalls encountered by their students.

Peer grading serves as a critical tool for scaling the grading of complex, open-ended assignments to courses with tens or hundreds of thousands of students. But despite promising initial trials, it does not always deliver accurate results compared to human experts. I will talk about methods for estimating and correcting for grader biases and reliabilities, showing significant improvement in peer grading accuracy on real data with 63,199 peer grades from Coursera's HCI course offerings --- the largest peer grading networks analyzed to date. I will also discuss the relationship between grader biases and reliabilities to other student factors such as student engagement, performance as well as commenting style.

I will also discuss scalable feedback in the context of coding assignments, which have more structure than arbitrary open ended assignments. I outline a novel way to decompose online coding submissions into a vocabulary of “code phrases”. Based on this vocabulary, we have architected a queryable index that allows fast searches into the massive dataset of student homework submissions. To demonstrate the utility of our homework search engine we index over a million code submissions from users worldwide in Stanford’s Machine Learning massive open online course (MOOC) and then (a) semi-automatically learn shared structure amongst homework submissions and (b) generate specific feedback for student mistakes.

This is joint work with Chris Piech, Andy Nguyen, Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller and Leonidas Guibas

## Tuesday, October 29, 2013 - 12:30

## A Characterization of Online Learning Problems

**Speaker: ** Ofer Dekel, MSR

**Location: ** CSE Gates Commons

We study the learning rate of the adversarial online learning problem in a finite action space. The “adversarial multi-armed bandit” problem and “predicting with expert advice” are two well-known special cases of this fundamental online learning setting. We present a rich and complete characterization of the inherent difficulty of this problem as we vary the power of the adversary and the amount of information provided to the learning algorithm. Specifically, we classify the different variants of this problem into three distinct difficulty classes: easy problems (those with sqrt(T) regret), such as learning against oblivious adversaries; hard problems (with T^(2/3) regret), such as the multi-armed bandit with switching costs; and unlearnable problems (with linear regret), such as learning against adversaries that have an unbounded memory. Our new characterization has two important implications. First, we formally confirm the common knowledge that learning with bandit feedback can be strictly harder than learning with full information. Second, we prove that learning in the bandit setting requires extensive switching between actions.

Joint work with Raman Arora (TTI-C), Nicolo Cesa-Bianchi (Milano), Jian Ding (UChicago), Tomer Koren (Technion), Yuval Peres (MSR), Ohad Shamir (Weizmann), and Ambuj Tewari (UMICH)

The talk is intended for a general machine learning audience.

## Tuesday, October 1, 2013 - 12:30

## Local Low-Rank Matrix Approximation

**Speaker: ** Guy Lebanon, Amazon

**Location: ** CSE 305

Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low-rank. We propose a new matrix approximation model where we assume instead that the matrix is locally of low-rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low rank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.

## Tuesday, July 16, 2013 - 12:30

## Method-of-Moment Algorithms for Learning Bayesian Networks

**Speaker: ** David Sontag, NYU

**Location: ** CSE 305

We present two algorithms for extracting structure from big data. The first is a new approach to learning Bayesian network structure based on a data-dependent complexity penalty. We show that the new scoring function has a small sample complexity and has the property that it becomes computationally easier to find the highest-scoring structure as the amount of data increases. Next, we present a new algorithm with provable guarantees for discrete factor analysis from binary data, enabling the discovery of hidden variables and their causal relationships with observed data. These methodologies have applications throughout computational biology, medicine, and the social sciences.

## Tuesday, June 4, 2013 - 12:30

## Which Supervised Learning Method Works Best for What? An Empirical Comparison of Learning Methods and Metrics

**Speaker: ** Rich Caruana, Microsoft Research

**Location: ** CSE Gates Commons

Decision trees can be intelligible, but do they perform well enough that you should use them? Have SVMs replaced neural nets, or are neural nets still best for regression and SVMs best for classification? Boosting maximizes a margin similar to SVMs, but can boosting compete with SVMs? If it does, is it better to boost weak models or to boost stronger models? Bagging is easier than boosting - how well does it stack up against boosting? Breiman said Random Forests are better than bagging and as good as boosting. Was he right? And what about old friends like logistic regression, KNN, and naive Bayes? Should they be put out to pasture, or do they fill important niches?

In this talk we'll compare the performance of ten supervised learning methods on nine criteria: Accuracy, F-score, Lift, Precision/Recall Break-Even Point, Area under the ROC, Average Precision, Squared Error, Cross-Entropy, and Probability Calibration. The results show that no one learning method does it all, but some methods can be 'repaired' to do very well across all performance metrics. In particular, we show how to obtain good probabilities from max-margin methods such as SVMs and boosting via Platt's Method and Isotonic Regression. We also describe a meta-ensemble method that combines select models from these ten learning methods to yield even better performance than any of the individual learning methods. Although these ensembles perform extremely well, they are too complex for some real-world applications. We'll describe a model compression method that tries to fix that. Finally, if time permits, we'll discuss how the nine performance metrics relate to each other, and on which of the metrics you probably should (or shouldn't) depend.

## Tuesday, May 28, 2013 - 12:30

## Reproducibility and Probabilistic Tsunami Hazard Assessment

**Speaker: ** Randall Leveque, UW Applied Math

**Location: ** CSE Gates Commons

I will discuss two somewhat independent topics. The first is the importance of reproducibility in computational research and the need for higher standards and better tools to help insure that published results can be reproduced (preferably by other researchers, but at least by the authors!). I will mention some of the tools already available to help with this.

Tsunami modeling is a motivating example since the results of numerical simulations are often used to inform public policy decisions that can have significant financial and safety implications. I will also briefly describe some of the techniques used to perform probabilistic hazard assessment and to solve the inverse problems necessary to do real-time forecasting, in hopes of identifying overlapping interests with the machine learning community.

## Tuesday, May 21, 2013 - 12:30

## Local Privacy, Minimax Rates, and Learning

**Speaker: ** John Duchi, UC Berkeley

**Location: ** CSE 305

We study statistical estimation minimization problems under a privacy model in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical inference procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, as measured by convergence rate, of any statistical estimator or learning procedure.

Joint work with Michael Jordan and Martin Wainwright.

## Tuesday, May 14, 2013 - 12:30

## Recursive Deep Learning for Modeling Semantic Compositionality

**Speaker: ** Richard Socher, Stanford

**Location: ** CSE Gates Commons

Compositional and recursive structure is commonly found in different modalities, including natural language sentences and scene images. I will introduce several recursive deep learning models that, unlike standard deep learning methods can learn compositional meaning vector representations for phrases, sentences and images. These recursive neural network based models obtain state-of-the-art performance on a variety of syntactic and semantic language tasks such as parsing, paraphrase detection, relation classification and sentiment analysis.

Besides the good performance, the models capture interesting phenomena in language such as compositionality. For instance the models learn different types of high level negation and how it can change the meaning of longer phrases with many positive words. They can learn that the sentiment following a "but" usually dominates that of phrases preceding the "but." Furthermore, unlike many other machine learning approaches that rely on human designed feature sets, features are learned as part of the model.

I will focus on two recent results: Pushing the performance of the Stanford parser by 3.8% and improving its speed by 20% using syntactically untied RNNs and a recursive neural tensor network trained on a novel sentiment treebank.

## Tuesday, May 7, 2013 - 12:30

## Validating Network Classifiers and Pricing Information

**Speaker: ** Eric Bax

**Location: ** CSE Gates Commons

Validating Network Classifiers: Networks are fundamental to our lives, from the network of gene interactions that shapes our bodies to the social networks that can eat up the hours of our lives. :) Collective classification uses network structure to predict node information. For example, if your friends all like jazz, are you likely to as well? Since networks grow by adding nodes based on the nodes already in the network, nodes are not drawn i.i.d. This makes trouble for most machine learning approaches to validation of classifier performance. We will discuss a method to validate network classifiers that is based on understanding how the network grows.

Pricing Information: In auctions for online advertising, data providers tell advertisers which users are the best bets for their ads. So advertisers buy a combination of information (from data providers) and advertising space (from publishers like Yahoo). How much should advertisers pay for each? Let's have a hands-on experience to find out.

## Tuesday, April 30, 2013 - 12:30

## Interpretable patient-level predictive models

**Speaker: ** Tyler McCormick, UW Sociology & Statistics

**Location: ** CSE 403

This talk presents statistical methods which generate patent-level predictions that are both accurate and highly interpretable to healthcare providers and patients. In this context, an interpretable model should be able to pinpoint exactly why a particular prediction was made, and provide the reason in a clear and natural way. The talk begins by introducing the Hierarchical Association Rule Model (HARM) which sequentially predicts a patient's possible future medical conditions given the patient's current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as ``dyspepsia and epigastric pain imply heartburn") from a large set of candidate rules. We next present a model for traditional classification problems based on decision lists, which consist of a series of if...then... statements (for example, if high blood pressure, then stroke). Decision lists discretize the high-dimensional, multivariate feature space into a series of simple, readily interpretable decision statements. Our Bayesian framework, known as the Bayesian List Machine (BLM), introduces a formal relationship between sparsity and interpretability through a prior structure over lists. We compare our model with the CHADS2 score, actively used in clinical practice for estimating the risk of stroke in patients that have atrial fibrillation. Our model is as interpretable as CHADS2, but more accurate.

This is collaborative work with Cynthia Rudin, Ben Letham, and David Madigan.