Machine Learning Seminars

Title: Graphical Modeling with the Bethe Approximation
Speaker: Tony Jebara, Department of Computer Science, Columbia University
When: Wednesday, May 6, 2015 - 12:30
Location: Room CSE 305, Allen Center
Inference (a canonical problem in graphical modeling) recovers a probability distribution over a subset of variables in a given model. It is known to be NP-hard for graphical models with cycles and large tree-width. Learning (another canonical problem) reduces to iterative marginal inference and is also NP-hard. How can we efficiently tackle these problems in practice? We will discuss the Bethe free energy as an approximation to the intractable partition function. Heuristics like loopy belief propagation (LBP) are often used to optimize the Bethe free energy. Unfortunately, in general, LBP may not converge at all, and if it does, it may not be to a global optimum. To do marginal inference, we instead explore a more principled treatment of the Bethe free energy using discrete optimization. We show that in attractive loopy models we can find the global optimum in polynomial time even though the resulting landscape is non-convex. To generalize to mixed loopy models, we use double-cover methods that bound the true Bethe global optimum in polynomial time. Finally, to do learning, we combine Bethe approximation with a Frank-Wolfe algorithm in the convex dual which circumvents the intractable partition function. The result is a new single-loop learning algorithm which is more efficient than previous double-loop methods that interleaved iterative inference with iterative parameter updates. We show applications of these methods in friendship link recommendation, in social influence estimation, in computer vision, and in power networks. We also combine the approaches with sparse structure learning to model several years of Bloomberg data. These graphical models capture financial and macro-economic variables and their response to news and social media topics.
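
The LBP heuristic the abstract contrasts with can be sketched in a few lines. The toy sum-product implementation below runs on a 3-variable attractive cycle; the tiny model, the flooding schedule, and all names are illustrative choices of mine, not material from the talk.

```python
import numpy as np

def lbp_marginals(unary, pair, edges, iters=100):
    """Sum-product loopy BP on a pairwise model with binary variables.

    unary: {node: array(2)}, pair: {(i, j): array(2, 2) indexed [x_i, x_j]},
    edges: list of (i, j) tuples.
    """
    # initialize messages in both directions along every edge
    msgs = {(i, j): np.ones(2) / 2 for i, j in edges}
    msgs.update({(j, i): np.ones(2) / 2 for i, j in edges})
    for _ in range(iters):
        new = {}
        for (i, j) in list(msgs):
            # product of unary potential and incoming messages except from j
            belief = unary[i].copy()
            for (k, l) in msgs:
                if l == i and k != j:
                    belief = belief * msgs[(k, i)]
            pot = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
            m = pot.T @ belief            # marginalize out x_i
            new[(i, j)] = m / m.sum()     # normalize for stability
        msgs = new
    # approximate marginals: unary times all incoming messages
    marg = {}
    for i in unary:
        b = unary[i].copy()
        for (k, l) in msgs:
            if l == i:
                b = b * msgs[(k, i)]
        marg[i] = b / b.sum()
    return marg
```

With mildly attractive potentials on a 3-cycle this converges; as the talk notes, convergence and global optimality are not guaranteed in general.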


Title: Degree, curvature, and mixing of random walks on the phylogenetic subtree-prune-regraft graph, and what it tells us about phylogenetic inference via MCMC
Speaker: Erick Matsen, Fred Hutchinson Cancer Research Center
When: Tuesday, January 27, 2015 - 12:30
Location: CSE 305
Statistical phylogenetics is the inference of a tree structure representing evolutionary history using biological sequence data (such as from DNA) under a likelihood model of sequence evolution. All such inferences perform either heuristic search or Markov chain Monte Carlo (MCMC) on a graph built with the various trees as vertices and edges representing tree modifications. Because this graph is connected with nonzero transition probabilities, MCMC is guaranteed to work in the large time limit, although inference using a finite number of steps is determined by mixing properties of MCMC on the graph. However, almost nothing is known about the large-scale structure of, or properties of random walks on, the relevant graphs. In this talk, I will first demonstrate significant graph effects on phylogenetic inference using the subtree-prune-regraft (SPR) graph, which is a popular such graph involving reconnection of subtrees of a tree in a different location. I will then recap what is known about degrees in the SPR graph and describe our work on Ricci-Ollivier curvature for representative pairs of phylogenetic trees, and give evidence that degree and curvature essentially determine the behavior of the simple lazy random walk on the SPR graph. This work is joint with my postdoc Chris Whidden.
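
To make the objects concrete, here is a lazy random walk (stay put with probability 1/2, otherwise move to a uniform random neighbor) on a small made-up graph. The talk's graph is the SPR graph rather than this toy, but the mechanics are the same: the walk is aperiodic by laziness, and its stationary distribution is proportional to vertex degree, which is one reason degrees matter for mixing.

```python
import numpy as np

def lazy_walk_matrix(adj):
    """Transition matrix of the lazy simple random walk on a graph."""
    deg = adj.sum(axis=1)
    P = adj / deg[:, None]                      # simple random walk
    return 0.5 * np.eye(len(adj)) + 0.5 * P     # lazy version

# a tiny connected graph with unequal degrees (illustrative only)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], float)
P = lazy_walk_matrix(adj)

deg = adj.sum(axis=1)
pi = deg / deg.sum()                            # degree-proportional stationary law
dist = np.linalg.matrix_power(P, 200)[0]        # distribution after 200 steps from node 0
```

After 200 steps the walk's distribution is numerically indistinguishable from pi; how fast that happens on the SPR graph is exactly the mixing question of the talk.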


Title: Driving Time Variability Prediction Using Mobile Phone Location Data
Speaker: Dawn Woodard, Cornell University
When: Tuesday, January 13, 2015 - 12:30
Location: CSE 305

We introduce a method to predict the variability in (probability distribution of) driving time on an arbitrary route in a road network at a given time, using mobile phone GPS data. Although commercial mapping services currently provide a high-quality estimate of driving time on a given route, there can be considerable uncertainty in that prediction due, for example, to unknown timing of traffic signals, uncertainty in traffic congestion levels, and differences in driver habits. For this reason, a distribution prediction can be more valuable than a deterministic prediction of driving time, by accounting not just for the measured traffic conditions and other available information, but also for the presence of unmeasured conditions that also affect driving time. Accurate distribution predictions can be used to report variability to the user, to provide risk-averse route recommendations, and as part of vehicle fleet decision support systems. Simple approaches to distribution prediction assume independence in driving time across road segments and as a result dramatically underestimate the variability in driving time. We propose a method that accurately accounts for dependencies in driving time across road segments, and apply it to large volumes of mobile phone GPS data from the Seattle metropolitan region.
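
The cost of the independence assumption is easy to see in a toy Monte Carlo: when segment times share a common congestion factor with pairwise correlation rho, the variance of the route time far exceeds the sum of per-segment variances. All numbers below are made up for illustration and have nothing to do with the talk's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_segments, rho, sigma = 10, 0.5, 1.0

# correlated segment times via a shared "congestion" factor:
# each segment time = sqrt(rho)*shared + sqrt(1-rho)*idiosyncratic,
# so Var(segment) = 1 and Cov(segment_i, segment_j) = rho
shared = rng.normal(size=100_000)
seg = (np.sqrt(rho) * shared[:, None]
       + np.sqrt(1 - rho) * rng.normal(size=(100_000, n_segments)))
route_corr = seg.sum(axis=1)

# independence assumption: just add the per-segment variances
var_indep = n_segments * sigma**2        # = 10
# true route variance: n + n*(n-1)*rho = 10 + 90*0.5 = 55
var_corr = route_corr.var()
```

The independent model reports a route variance of 10 while the correlated simulation gives about 55, i.e. the variability is understated by more than a factor of five.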


Title: TBA
Speaker: Yi Chang, Yahoo! Research
When: Tuesday, November 4, 2014 - 12:30
Location: CSE 305


Title: Deep Representation Learning: Challenges and New Directions
Speaker: Honglak Lee, University of Michigan
When: Thursday, October 30, 2014 - 12:30
Location: CSE 305

Machine learning is a powerful tool for tackling challenging problems
in artificial intelligence. In practice, the success of machine learning
algorithms critically depends on the feature representations of the input
data, which often become a limiting factor. To address this problem,
deep learning methods have recently emerged as successful techniques
to learn feature hierarchies from unlabeled and labeled data. In this
talk, I will present my perspectives on the progress, challenges, and
some new directions. Specifically, I will talk about my recent work to
address the following interrelated challenges: (1) how can we learn
invariant yet discriminative features, and furthermore disentangle
underlying factors of variation to model high-order interactions
between the factors? (2) how can we learn representations of the
output data when the output variables have complex high-order
dependencies? (3) how can we learn shared representations from
heterogeneous input data modalities?

Honglak Lee is an Assistant Professor of Computer Science and
Engineering at the University of Michigan, Ann Arbor. He received his
Ph.D. from the Computer Science Department at Stanford University in 2010,
advised by Prof. Andrew Ng. His primary research interests lie in
machine learning, spanning deep learning, unsupervised and
semi-supervised learning, transfer learning, graphical models, and
optimization. He also works on application problems in computer
vision, audio recognition, robot perception, and text processing. His
work received best paper awards at ICML and CEAS. He has served as a
guest editor of the IEEE TPAMI Special Issue on Learning Deep
Architectures, as well as an area chair for ICML and NIPS. He received
the Google Faculty Research Award in 2011, and was selected by IEEE
Intelligent Systems as one of AI's 10 to Watch in 2013.


Title: Massive, Sparse, Efficient Multilabel Learning
Speaker: Charles Elkan, UCSD and Amazon
When: Tuesday, October 21, 2014 - 12:30
Location: TBA (not CSE 305)

Amazon has many applications whose core is multilabel
classification. This talk will present progress towards a multilabel
learning method that can handle 10^7 training examples, 10^6 features, and
10^5 labels on a single workstation. A sparse linear model is learned for
each label simultaneously by stochastic gradient descent with L2 and L1
regularization. Tractability is achieved through careful use of sparse data
structures, and speed is achieved by using the latest stochastic gradient
methods that do variance reduction. Both theoretically and practically,
these methods achieve order-of-magnitude faster convergence than Adagrad.
We have extended them to handle non-differentiable L1 regularization. We
show experimental results on classifying biomedical articles into 26,853
scientific categories. [Joint work with Galen Andrew, ML intern at Amazon.]
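
As a rough sketch of the per-label setup, the snippet below trains a sparse logistic model for a single label. Plain SGD with an L1 proximal (soft-threshold) step stands in here for the variance-reduced methods of the talk, and all names, data, and hyperparameters are illustrative.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def train_one_label(X, y, lr=0.1, l1=1e-3, l2=1e-3, epochs=20, seed=0):
    """Sparse logistic regression for one binary label via proximal SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))       # predicted probability
            w -= lr * ((p - y[i]) * X[i] + l2 * w)    # smooth gradient step
            w = soft_threshold(w, lr * l1)            # L1 proximal step
    return w
```

In the multilabel setting one such sparse weight vector is learned per label; the sparse data structures and variance-reduced gradients of the talk are what make this feasible at the 10^5-label scale.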

Bio: Charles Elkan is the first Amazon Fellow, on leave from his position as
professor of computer science at the University of California, San Diego.
In the past, he has been a visiting associate professor at Harvard and a
researcher at MIT. His published research has been mainly in machine
learning, data science, and computational biology. The MEME algorithm that
he developed with Ph.D. students has been used in over 3000 published
research projects in biology and computer science. He is fortunate to have
had inspiring undergraduate and graduate students who now hold leadership
positions such as vice president at Google.


Title: Learning Mixtures of Ranking Models
Speaker: Pranjal Awasthi, Princeton University
When: Tuesday, October 7, 2014 - 12:30
Location: CSE 305

Probabilistic modeling of ranking data is an extensively studied
problem with applications ranging from understanding user preferences
in electoral systems and social choice theory, to more modern learning
tasks in online web search, crowd-sourcing and recommendation
systems. This work concerns learning the Mallows model -- one of the
most popular probabilistic models for analyzing ranking data. In this
model, the user's preference ranking is generated as a noisy version
of an unknown central base ranking. The learning task is to recover
the base ranking and the model parameters using access to noisy
rankings generated from the model.
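
For concreteness, a Mallows model under the Kendall-tau distance with dispersion parameter phi can be sampled exactly with the standard repeated insertion method. This is a textbook construction for generating the "noisy rankings" described above, not code from the talk; the central ranking here is fixed to 0..n-1.

```python
import numpy as np

def sample_mallows(n, phi, rng):
    """Draw one ranking from Mallows(identity, phi) by repeated insertion.

    Inserting item i at slot j (0 <= j <= i) creates i - j inversions,
    so slot j gets weight phi**(i - j); phi -> 0 concentrates on the
    central ranking, phi = 1 is uniform over permutations.
    """
    perm = []
    for i in range(n):
        w = phi ** np.arange(i, -1, -1)       # weights for slots 0..i
        j = rng.choice(i + 1, p=w / w.sum())
        perm.insert(j, i)
    return perm
```

Learning reverses this process: given many such draws (here from a single central ranking, in the talk from a mixture of two), recover the central ranking(s) and phi.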

Although well understood in the setting of a homogeneous population (a
single base ranking), the case of a heterogeneous population (mixture
of multiple base rankings) has so far resisted algorithms with
guarantees on worst case instances. In this talk I will present the
first polynomial time algorithm which provably learns the parameters
and the unknown base rankings of a mixture of two Mallows models. A
key component of our algorithm is a novel use of tensor decomposition
techniques to learn the top-k prefix in both the rankings. Before this
work, even the question of identifiability in the case of a mixture of
two Mallows models was unresolved.

Joint work with Avrim Blum, Or Sheffet and Aravindan Vijayaraghavan.


Title: Convex and Bayesian methods for link prediction using distributed representations
Speaker: Guillaume Bouchard, Xerox Research Europe
When: Monday, September 29, 2014 - 12:30
Location: CSE 305

Many applications involve multiple interlinked data sources, but existing
approaches to handling them are often based on latent factor models (i.e.,
distributed representations), which are difficult to learn. At the same
time, recent advances in convex analysis, mainly based on the nuclear norm
(a relaxation of the matrix rank) and sparse structured approximations, have
shown strong theoretical and practical performance on very large
matrix factorization problems with non-Gaussian noise and missing data.
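
The basic building block behind such nuclear-norm relaxations is the proximal operator of the nuclear norm, which soft-thresholds the singular values. A minimal sketch on toy data (the matrix, noise level, and threshold are illustrative, not from the talk):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)      # shrink singular values, drop small ones
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 20))   # rank-3 signal
noisy = low_rank + 0.01 * rng.normal(size=(20, 20))
denoised = svt(noisy, tau=0.5)        # recovers a rank-3 matrix
```

Iterating this operator inside a proximal-gradient loop yields the convex matrix factorization algorithms the paragraph above alludes to.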

In this talk, we will show how multiple matrices or tensors can be jointly
factorized using a convex formulation of the problem, with a particular
focus on:

  • Multi-view learning: A popular approach is to assume that both the
    correlations between the views and the view-specific correlations have
    low-rank structure, leading to a model closely related to canonical
    correlation analysis called inter-battery factor analysis. We propose a
    convex relaxation of this model, based on a structured nuclear norm.
  • Collective matrix factorization: When multiple matrices are related, they
    share common latent factors, leading to a simple yet powerful way of
    handling complex data structures, such as relational databases. Again, a
    convex formulation of this approach is proposed. We also show that the
    Bayesian version of this model can be used to tune the multiple
    regularization parameters involved in such models, avoiding costly
    cross-validation.

Another contribution to KB modeling relates to binary tensor and matrix
factorization with many zeros. We present a new learning approach for binary
data that scales linearly with the number of positive examples. It is based
on an iterative split of the tensor (or matrix) on which the binary loss is
approximated by a Gaussian loss, which itself can be efficiently minimized.
Experiments on popular tasks such as data imputation, multi-label
prediction, link prediction in graphs and item recommendation illustrate
the benefit of the proposed approaches.

Bio: Guillaume Bouchard is a senior research scientist at Xerox Research
Centre Europe in Grenoble, France. After an engineering degree and a master's
in mathematics at the Université de Rouen, he obtained a PhD in statistics
from the Institut National de Recherche en Informatique et en Automatique
(INRIA) in 2004. Since then, he has worked for Xerox on multiple machine
learning research projects in big data analysis, including user modeling,
recommender systems and natural language processing. He was involved in
French and European research projects called LAVA, FUPOL, Fusepool and
Dynamicité. His current research focuses on the development of distributed
statistical relational models for knowledge bases, applied to the
development of virtual agents.


Title: A General Framework for Fast Stagewise Algorithms
Speaker: Ryan Tibshirani, Carnegie Mellon University
When: Tuesday, May 20, 2014 - 12:30
Location: CSE 305

Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: starting with all coefficients equal to zero, it iteratively updates the coefficient (by a small amount ε) corresponding to the variable that has maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it can be shown that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size ε goes to zero. Further, essentially the same equivalence holds outside of the regression setting, for minimizing a differentiable convex loss function subject to an l1 norm constraint (and the stagewise algorithm now updating the coefficient corresponding to the maximal absolute component of the gradient).
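
The procedure described above fits in a few lines. This sketch uses a fixed step budget and my own variable names; it is an illustration of the stagewise strategy, not the speaker's code.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, steps=5000):
    """Forward stagewise regression with step size eps.

    At each step, nudge by eps the coefficient of the variable with
    maximal absolute inner product with the current residual.
    """
    n, d = X.shape
    beta = np.zeros(d)
    r = y.astype(float).copy()
    for _ in range(steps):
        c = X.T @ r                          # inner products with residual
        j = int(np.argmax(np.abs(c)))
        if abs(c[j]) < 1e-12:                # residual orthogonal to all columns
            break
        beta[j] += eps * np.sign(c[j])
        r = r - eps * np.sign(c[j]) * X[:, j]
    return beta
```

As eps shrinks, the sequence of iterates traces out (under the conditions mentioned above) the lasso path.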

Even when they do not match their l1-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the l1 norm and sparsity? This is the focus of the talk; we present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more.


Title: Bayesian Optimization of Machine Learning Algorithms
Speaker: Ryan Adams, Harvard University
When: Tuesday, May 6, 2014 - 12:30
Location: CSE 403

Machine learning algorithms frequently involve careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a "black art" requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. I will describe my recent work on solving this problem with Bayesian nonparametrics, using Gaussian processes. This approach of "Bayesian optimization" models the generalization performance as an unknown objective function with a GP prior. I will discuss new algorithms that account for variable cost in function evaluation and take advantage of parallelism in evaluation. These new algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation for text analysis, structured SVMs for protein motif finding, and convolutional neural networks for visual object recognition.
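
A bare-bones version of such a loop, with a GP posterior evaluated on a 1-D grid and expected improvement as the acquisition function, might look like the following. The RBF kernel, grid-based acquisition maximization, and the toy objective are illustrative simplifications, not the talk's actual setup (which targets expensive, noisy hyperparameter evaluations).

```python
import math
import numpy as np

def rbf(a, b, ell=0.2):
    """Squared-exponential kernel between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def expected_improvement(mu, sigma, best):
    """EI for minimization, given posterior mean/std and best value so far."""
    z = (best - mu) / np.maximum(sigma, 1e-9)
    Phi = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sigma * phi

def bayes_opt(f, n_iter=15, jitter=1e-6, seed=0):
    """Minimize f on [0, 1] with a GP surrogate and EI acquisition."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=3)            # small initial design
    y = f(X)
    grid = np.linspace(0.0, 1.0, 201)
    for _ in range(n_iter):
        K = rbf(X, X) + jitter * np.eye(len(X))
        Ks = rbf(grid, X)
        mu = Ks @ np.linalg.solve(K, y)          # GP posterior mean on grid
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
        acq = expected_improvement(mu, np.sqrt(np.maximum(var, 0.0)), y.min())
        x_next = grid[np.argmax(acq)]            # evaluate most promising point
        X = np.append(X, x_next)
        y = np.append(y, f(np.array([x_next]))[0])
    return X[np.argmin(y)], y.min()
```

The algorithms in the talk extend this basic pattern with cost-aware acquisition functions and parallel (pending-evaluation) variants.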