Machine Learning Seminars

Title: Massive, Sparse, Efficient Multilabel Learning
Speaker: Charles Elkan, UCSD and Amazon
When: Tuesday, October 21, 2014 - 12:30
Location: TBA (not CSE 305)

Amazon has many applications whose core is multilabel
classification. This talk will present progress towards a multilabel
learning method that can handle 10^7 training examples, 10^6 features, and
10^5 labels on a single workstation. Sparse linear models are learned for
all labels simultaneously by stochastic gradient descent with L2 and L1
regularization. Tractability is achieved through careful use of sparse data
structures, and speed is achieved by using the latest stochastic gradient
methods that do variance reduction. Both theoretically and practically,
these methods achieve order-of-magnitude faster convergence than Adagrad.
We have extended them to handle non-differentiable L1 regularization. We
show experimental results on classifying biomedical articles into 26,853
scientific categories. [Joint work with Galen Andrew, ML intern at Amazon.]

Bio: Charles Elkan is the first Amazon Fellow, on leave from his position as
a professor of computer science at the University of California, San Diego.
In the past, he has been a visiting associate professor at Harvard and a
researcher at MIT. His published research has been mainly in machine
learning, data science, and computational biology. The MEME algorithm that
he developed with Ph.D. students has been used in over 3000 published
research projects in biology and computer science. He is fortunate to have
had inspiring undergraduate and graduate students who are now in leadership
positions, such as vice president at Google.


Title: Learning Mixtures of Ranking Models
Speaker: Pranjal Awasthi, Princeton University
When: Tuesday, October 7, 2014 - 12:30
Location: CSE 305

Probabilistic modeling of ranking data is an extensively studied
problem with applications ranging from understanding user preferences
in electoral systems and social choice theory, to more modern learning
tasks in online web search, crowd-sourcing and recommendation
systems. This work concerns learning the Mallows model -- one of the
most popular probabilistic models for analyzing ranking data. In this
model, the user's preference ranking is generated as a noisy version
of an unknown central base ranking. The learning task is to recover
the base ranking and the model parameters using access to noisy
rankings generated from the model.
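The generative process can be made concrete with the repeated insertion model, a standard way to sample from a Mallows distribution; the function below is a minimal sketch, not the authors' code:

```python
import random

def sample_mallows(base, phi, rng=random):
    """Draw a ranking from a Mallows model with central ranking `base` and
    dispersion phi in (0, 1], via repeated insertion: phi -> 0 concentrates
    mass on `base`, while phi = 1 gives the uniform distribution."""
    ranking = []
    for i, item in enumerate(base):
        # insert at position j with probability proportional to phi^(i - j)
        weights = [phi ** (i - j) for j in range(i + 1)]
        j = rng.choices(range(i + 1), weights=weights)[0]
        ranking.insert(j, item)
    return ranking

random.seed(0)
print(sample_mallows([1, 2, 3, 4, 5], phi=0.2))   # usually close to the base
```

The learning problem runs this process in reverse: given many such noisy samples (here, from a mixture of two base rankings), recover the bases and dispersions.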

Although well understood in the setting of a homogeneous population (a
single base ranking), the case of a heterogeneous population (mixture
of multiple base rankings) has so far resisted algorithms with
guarantees on worst case instances. In this talk I will present the
first polynomial time algorithm which provably learns the parameters
and the unknown base rankings of a mixture of two Mallows models. A
key component of our algorithm is a novel use of tensor decomposition
techniques to learn the top-k prefix of both base rankings. Before this
work, even the question of identifiability in the case of a mixture of
two Mallows models was unresolved.

Joint work with Avrim Blum, Or Sheffet and Aravindan Vijayaraghavan.


Title: Convex and Bayesian methods for link prediction using distributed representations
Speaker: Guillaume Bouchard, Xerox Research Europe
When: Monday, September 29, 2014 - 12:30
Location: CSE 305

Many applications involve multiple interlinked data sources, but existing
approaches to handling them are often based on latent factor models (i.e.
distributed representations), which are difficult to learn. At the same
time, recent advances in convex analysis, mainly based on the nuclear norm
(a relaxation of the matrix rank) and sparse structured approximations, have
shown strong theoretical and practical performance on very large
matrix factorization problems with non-Gaussian noise and missing data.

In this talk, we will show how multiple matrices or tensors can be jointly
factorized using a convex formulation of the problem, with a particular
focus on:

  • Multi-view learning: A popular approach is to assume that both the
    correlations between the views and the view-specific correlations have
    low-rank structure, leading to a model closely related to canonical
    correlation analysis called inter-battery factor analysis. We propose a
    convex relaxation of this model, based on a structured nuclear norm.
  • Collective matrix factorization: When multiple matrices are related, they
    share common latent factors, leading to a simple yet powerful way of
    handling complex data structures, such as relational databases. Again, a
    convex formulation of this approach is proposed. We also show that the
    Bayesian version of this model can be used to tune the multiple
    regularization parameters involved in such models, avoiding costly
    cross-validation.

Another contribution to knowledge base (KB) modeling relates to binary
tensor and matrix factorization with many zeros. We present a new learning
approach for binary data that scales linearly with the number of positive
examples. It is based on an iterative split of the tensor (or matrix) on
which the binary loss is approximated by a Gaussian loss, which itself can
be efficiently minimized.
Experiments on popular tasks such as data imputation, multi-label
prediction, link prediction in graphs and item recommendation illustrate
the benefit of the proposed approaches.

Bio: Guillaume Bouchard is a senior research scientist at Xerox Research
Centre Europe in Grenoble, France. After an engineering degree and a
master's in mathematics at the Université de Rouen, he obtained a PhD in
statistics from the Institut National de Recherche en Informatique et en
Automatique (INRIA) in 2004. Since then, he has worked for Xerox on multiple
machine learning research projects in big data analysis, including user
modelling, recommender systems, and natural language processing. He was
involved in the French and European research projects LAVA, FUPOL, Fusepool,
and Dynamicité. His current
research focuses on the development of distributed statistical relational
models for knowledge bases, applied to the development of virtual agents.


Title: A General Framework for Fast Stagewise Algorithms
Speaker: Ryan Tibshirani, Carnegie Mellon University
When: Tuesday, May 20, 2014 - 12:30
Location: CSE 305

Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: starting with all coefficients equal to zero, it iteratively updates the coefficient (by a small amount ε) corresponding to the variable that has maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it can be shown that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size ε goes to zero. Further, essentially the same equivalence holds outside of the regression setting, for minimizing a differentiable convex loss function subject to an l1 norm constraint (and the stagewise algorithm now updating the coefficient corresponding to the maximal absolute component of the gradient).
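The procedure described above is simple enough to sketch in a few lines; the data and step count here are illustrative:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Forward stagewise regression: start at zero and repeatedly move, by
    eps, the coefficient of the variable with the largest absolute inner
    product with the current residual."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ (y - X @ beta)          # inner products with residual
        j = np.argmax(np.abs(corr))          # most correlated variable
        beta[j] += eps * np.sign(corr[j])    # small step in its direction
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(np.round(forward_stagewise(X, y), 2))  # approximately recovers beta_true
```

Stopping the loop early yields the sparse intermediate estimates that, as ε shrinks, trace out the lasso path.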

Even when they do not match their l1-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the l1 norm and sparsity? This is the focus of the talk; we present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more.


Title: Bayesian Optimization of Machine Learning Algorithms
Speaker: Ryan Adams, Harvard University
When: Tuesday, May 6, 2014 - 12:30
Location: CSE 403

Machine learning algorithms frequently involve careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a "black art" requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. I will describe my recent work on solving this problem with Bayesian nonparametrics, using Gaussian processes. This approach of "Bayesian optimization" models the generalization performance as an unknown objective function with a GP prior. I will discuss new algorithms that account for variable cost in function evaluation and take advantage of parallelism in evaluation. These new algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation for text analysis, structured SVMs for protein motif finding, and convolutional neural networks for visual object recognition.
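As a toy illustration of the approach (not the speaker's actual system), here is a one-dimensional Bayesian optimization loop with a Gaussian-process posterior and the expected-improvement acquisition function; the kernel, jitter, grid, and objective are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel with unit amplitude and lengthscale ls."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, jitter=1e-6):
    """GP posterior mean and stddev at test points Xs (zero prior mean)."""
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = 1.0 - np.sum(Ks * sol, axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def objective(x):                         # the "black box" to minimize
    return np.sin(3 * x) + 0.5 * x

grid = np.linspace(0.0, 2.0, 200)
X = np.array([0.3, 1.7])                  # initial design points
y = objective(X)
for _ in range(10):
    mu, sd = gp_posterior(X, y, grid)
    best = y.min()
    z = (best - mu) / sd
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    x_next = grid[np.argmax(ei)]          # evaluate where EI is largest
    X, y = np.append(X, x_next), np.append(y, objective(x_next))
print(round(X[np.argmin(y)], 2))          # best point found so far
```

Real systems additionally infer the kernel hyperparameters and observation noise, and (as in the talk) account for evaluation cost and parallelism.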


Title: Sparse latent factor models to uncover structure in genomic data
Speaker: Barbara Engelhardt, Duke University
When: Tuesday, April 29, 2014 - 12:30
Location: CSE 305

Within high-dimensional genomic data, including genetic markers, complex phenotypes, and gene expression measurements, hides a substantial amount of low-dimensional structure. In order to extract this low-dimensional structure from high-dimensional data, latent factor models with appropriate priors and modeling assumptions can be applied. A key advantage of sparse latent factor models in genomic applications is interpretability: the low-dimensional structure can often be understood in a biological context. Model specification here is important, however, as genomic data generally includes small numbers of samples n relative to enormous numbers of features p. Sparse linear latent factor models have been used to identify population substructure. Applying these models to high-dimensional noisy phenotypes produces interpretable phenotypes that can subsequently be tested for association with genetic variants. This same idea can be considered for gene expression traits, where the low-dimensional structure captures groups of interacting genes, and the associated genetic variants represent master regulators. In this talk, I will discuss recent work on sparse latent factor models, and results on genomic data.


Title: Submodular combinatorial problems in Machine Learning: Algorithms and Applications
Speaker: Rishabh Iyer, UW Electrical Engineering
When: Tuesday, April 22, 2014 - 12:30
Location: CSE 305

Having long been recognized in combinatorics, the concept of submodularity is becoming increasingly important in machine learning, since it can capture intuitive and yet non-trivial interactions between variables. This expressiveness has found many applications in machine learning, particularly in understanding structured data like text, vision and speech. Submodular functions naturally capture aspects of information, coverage and diversity in maximization problems, while also modelling notions of cooperation, sparsity, complexity and economies of scale in minimization problems. In this talk, we shall consider a large class of submodular optimization problems and motivate them with several real world applications, particularly in machine learning. We shall then provide a unifying algorithmic framework for solving several of these optimization problems, and highlight the scalability and practicality of this approach. We will also highlight novel theoretical characterizations, which provide better connections between theory and practice. We shall ground this entire talk with a large number of applications in machine learning, including image segmentation, image correspondence, image collection summarization, feature selection, and data subset selection. This talk should be self-contained and shall not require prior knowledge of submodular functions and optimization.

(this is based on joint work with Jeff Bilmes, Stefanie Jegelka, Sebastian Tschiatschek and Kai Wei)
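For readers new to the topic, the workhorse result in this area is that greedy selection gives a (1 - 1/e)-approximation for monotone submodular maximization under a cardinality constraint. A minimal sketch with a coverage function (an illustrative example, not from the talk):

```python
def greedy_max_cover(sets, k):
    """Greedy maximization of the submodular coverage function
    f(S) = |union of chosen sets|, choosing at most k sets."""
    chosen, covered = [], set()
    for _ in range(k):
        # marginal gain of each remaining set given what is already covered
        gains = [(len(s - covered), i) for i, s in enumerate(sets)
                 if i not in chosen]
        gain, i = max(gains)
        if gain == 0:
            break                      # nothing new can be covered
        chosen.append(i)
        covered |= sets[i]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_cover(sets, k=2)
print(chosen, covered)   # picks set 2 first (gain 4), then set 0 (gain 3)
```

Diminishing marginal gains are exactly the submodularity property that makes this greedy strategy provably good.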


Title: Theoretical and Algorithmic Foundations of Online Learning
Speaker: Sasha Rakhlin, Wharton School, University of Pennsylvania
When: Tuesday, April 8, 2014 - 12:30
Location: CSE 305

A major challenge for machine learning in the next decade is the development of methods that continuously predict and learn from a stream of data. The scale of data and the non-stationary nature of the environment make the prevalent "learn from a batch of i.i.d. data" paradigm of Statistical Learning inadequate. Despite the extensive literature on sequential prediction (online learning) methods, the theoretical understanding of the subject has been lacking, and the results have been obtained on a case-by-case basis. This stands in contrast to Statistical Learning, where the inherent complexities have been well-understood in the last forty years. In this talk, we focus on no-regret online learning and develop the relevant notions of complexity in a surprising parallel to Statistical Learning Theory. This non-constructive study of inherent complexity is then augmented with a recipe for developing efficient online learning algorithms via a notion of a relaxation. To demonstrate the utility of our approach, we develop a new family of randomized methods, as well as new algorithms for collaborative filtering and node prediction.


Title: Scaling and Generalizing Variational Inference
Speaker: David Blei, Princeton University
When: Monday, March 31, 2014 - 12:30
Location: CSE 305

Latent variable models have become a key tool for the modern statistician, letting us express complex assumptions about the hidden structures that underlie our data. Latent variable models have been successfully applied in numerous fields including natural language processing, computer vision, electronic medical records, genetics, neuroscience, astronomy, political science, sociology, the digital humanities, and many others.

The central computational problem in latent variable modeling is posterior inference, the problem of approximating the conditional distribution of the latent variables given the observations. Posterior inference is central to both exploratory tasks, where we investigate hidden structures that underlie our data, and predictive tasks, where we use the inferred structures to generalize about future data. Approximate posterior inference algorithms have revolutionized Bayesian statistics, revealing its potential as a usable and general-purpose language for data analysis.

Bayesian statistics, however, has not yet reached this potential. First, statisticians and scientists regularly encounter massive data sets, but existing approximate inference algorithms do not scale well. Second, most approximate inference algorithms are not generic; each must be adapted to the specific model at hand. This often requires significant model-specific analysis, which precludes us from easily exploring a variety of models.

In this talk I will discuss our recent research on addressing these two limitations. First I will describe stochastic variational inference, an approximate inference algorithm for handling massive data sets. Stochastic inference is easily applied to a large class of Bayesian models, including time-series models, factor models, and Bayesian nonparametric models. I will demonstrate its application to probabilistic topic models of text conditioned on millions of articles. Stochastic inference opens the door to scalable Bayesian computation for modern data analysis.

Then I will discuss black box variational inference. Black box inference is a generic algorithm for approximating the posterior. We can easily apply it to many models with little model-specific derivation and few restrictions on their properties. Black box inference performs better than similarly generic sampling algorithms, such as Metropolis-Hastings inside Gibbs, and can be composed with stochastic inference to handle massive data. I will demonstrate its use on a suite of nonconjugate models of longitudinal healthcare data.
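The score-function idea behind black box variational inference can be sketched on a toy model: the ELBO gradient is estimated from samples of q using only log-density evaluations, with a simple baseline for variance reduction. The model and all settings below are illustrative choices of mine, not from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=50)   # toy model: z ~ N(0, 10), x_i|z ~ N(z, 1)

def log_joint(z):
    return -z ** 2 / 20 - 0.5 * np.sum((data - z) ** 2)

mu, log_sigma = 0.0, 0.0               # variational family q(z) = N(mu, sigma^2)
lr = 0.001
for step in range(4000):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=64)              # Monte Carlo samples from q
    f = np.array([log_joint(zi) for zi in z]) \
        - (-log_sigma - 0.5 * ((z - mu) / sigma) ** 2)  # log p(x,z) - log q(z)
    f -= f.mean()                                   # baseline reduces variance
    # score-function (REINFORCE) estimates of the ELBO gradient
    g_mu = np.mean(((z - mu) / sigma ** 2) * f)
    g_ls = np.mean((((z - mu) / sigma) ** 2 - 1) * f)
    mu += lr * g_mu
    log_sigma += lr * g_ls
print(round(mu, 1))    # close to the posterior mean, i.e. roughly the data mean
```

On this conjugate toy problem q converges toward the exact Gaussian posterior; the papers obtain much lower-variance estimators via Rao-Blackwellization and control variates.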

This is joint work based on these two papers:

M. Hoffman, D. Blei, J. Paisley, and C. Wang. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.

R. Ranganath and D. Blei. Black box variational inference. Artificial Intelligence and Statistics, 2014.


Title: On Estimating Many Effect-sizes, Selection Bias, and Local Adaptive Shrinkage
Speaker: Noah Simon, UW Biostatistics
When: Tuesday, February 4, 2014 - 12:30
Location: CSE 305

There is a growing scientific interest in simultaneously estimating the means (or effect-sizes) of many different features. Classical (but quite counterintuitive) results in Stein estimation show that even if these features are independent, cleverly coupling their mean estimates can greatly improve estimation accuracy. These ideas have been extended to build locally adaptive estimators based on perspectives from non-parametric empirical Bayesian estimation (and compound decision theory). Unfortunately, these estimators are not simple, and are really only tractable in highly idealized scenarios.
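For reference, the classical James-Stein estimator mentioned above shrinks all mean estimates toward zero jointly; this toy comparison is illustrative, not from the talk:

```python
import numpy as np

def james_stein(x):
    """James-Stein shrinkage for observed means x_i ~ N(theta_i, 1):
    shrink all estimates jointly toward zero (positive-part variant);
    dominates the raw estimate whenever the dimension is at least 3."""
    p = len(x)
    shrink = max(0.0, 1 - (p - 2) / np.sum(x ** 2))
    return shrink * x

rng = np.random.default_rng(0)
theta = rng.normal(0, 1, size=500)         # independent true effect sizes
x = theta + rng.normal(0, 1, size=500)     # one noisy observation each
mse_raw = np.mean((x - theta) ** 2)
mse_js = np.mean((james_stein(x) - theta) ** 2)
print(round(mse_raw, 2), round(mse_js, 2))  # shrinkage reduces the error
```

In this setup shrinkage roughly halves the error of the raw means even though the effects are completely independent, which is the counterintuitive coupling the talk starts from.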

In recent work, we introduce a different framework for this estimation problem. Our framework is intuitive and shows how estimates from "independent features" become coupled through selection bias. We also give a simple general estimator based on resampling which is applicable and performs well in a wide variety of scenarios.

In this talk I will discuss the intuition behind "coupling estimates from independent features." I will review empirical Bayes/Stein estimation, introduce our framework, explain the connections between these estimators, and show how they compare in practice.

This presentation is intended for a general machine learning audience.

This is joint work with Richard Simon (NIH) and recently Kean Ming Tan and Daniela Witten (UW) with occasional skepticism from Brad Efron (Stanford).