Many optimization algorithms can be viewed as a discretization of a continuous-time process (described using an ordinary differential equation, or a stochastic differential equation in the presence of noise). The continuous-time point of view can be useful for simplifying the analysis, drawing connections with physics, and streamlining the design of new algorithms and heuristics.
We will review results from continuous-time optimization, from the simple case of gradient flow, to more recent results on accelerated methods. In particular, we give simple interpretations of acceleration, and show how these interpretations motivate heuristics (restarting and adaptive averaging) which, empirically, can significantly improve the speed of convergence. We will then focus on the stochastic case, and study the interaction between acceleration and noise, and their effect on the convergence rates. We will conclude with a brief review of how the same tools can be applied in other problems, such as approximate sampling and non-convex optimization.
Bio:
Walid Krichene is at Google Research, where he works on large-scale optimization and recommendation. He received his Ph.D. in EECS in 2016 from UC Berkeley, where he was advised by Alex Bayen and Peter Bartlett, a M.A. in Mathematics from U.C. Berkeley, and a M.S. in Engineering and Applied Math from the Ecole des Mines Paristech. He received the Leon Chua Award and two outstanding GSI awards from U.C. Berkeley. His research interests include convex optimization, stochastic approximation, recommender systems, and online learning.

**Title:**Continuous-time dynamics for convex optimization

**Speaker:**Walid Krichene

**When:**Tuesday, February 20, 2018 - 12:00

**Location:**CSE 403

**Title:**Adaptive Averaging in Accelerated Descent Dynamics

**Speaker:**Walid Krichene (Google)

**When:**Tuesday, February 20, 2018 - 12:00

**Location:**CSE 403

We study accelerated descent dynamics for constrained convex optimization. This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate eta(t), and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights w(t). Using a Lyapunov argument, we give sufficient conditions on η and w to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging in continuous-time, and give numerical experiments in discrete-time to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.

**Title:**Active Learning for Convex Regression

**Speaker:**Kevin Jamieson

**When:**Tuesday, February 6, 2018 - 12:00

**Location:**CSE 403

In this work, we will introduce the first principled adaptive-sampling procedure for learning a convex function in the L∞ norm, a problem that arises often in economics, psychology, and the social sciences. We present a function-specific measure of complexity and use it to prove that our algorithm is information theoretically near-optimal in a strong, function-specific sense. We also corroborate our theoretical contributions with extensive numerical experiments, finding that our method substantially outperforms passive, uniform sampling for favorable synthetic and data-derived functions in low-noise settings with large sampling budgets. Our results also suggest an idealized “oracle strategy”, which we use to gauge the potential for deploying the adaptive-sampling strategy on any function in any particular setting.

**Title:**Subspace clustering via matching pursuits

**Speaker:**Michael Tschannen (ETH Zürich)

**When:**Tuesday, January 9, 2018 - 12:10

**Location:**CSE 403

Subspace clustering refers to the problem of clustering unlabeled high-dimensional data points into a union of low-dimensional linear subspaces, whose number, orientations, and dimensions are all unknown. Among the large variety of subspace clustering algorithms, sparsity-based algorithms have attracted special attention thanks to their excellent performance in practical applications. A prominent example is the sparse subspace clustering (SSC) algorithm, which performs spectral clustering based on an adjacency matrix obtained by sparsely representing each data point in terms of all the other data points via the Lasso. When the number of data points is large or the dimension of the ambient space is high, the computational complexity of SSC quickly becomes prohibitive. Replacing the Lasso in SSC by the greedy orthogonal matching pursuit (OMP) or matching pursuit (MP) algorithm significantly reduces computational complexity while often maintaining clustering accuracy. In this talk, we present an analytical performance characterization of SSC-OMP and SSC-MP for noisy data. Both SSC-OMP and SSC-MP are shown to succeed even when the subspaces intersect and when the data points are contaminated by severe noise. Analytical results in combination with numerical results indicate that both SSC-OMP and SSC-MP with a data-dependent stopping criterion automatically detect the dimensions of the subspaces underlying the data. Finally, to further reduce computational complexity of SSC-OMP and SSC-MP, we consider the scenario where the dimensionality of the data is reduced by random projections, and present corresponding performance guarantees.

**Title:**Unified optimization for self-learning robust penalties

**Speaker:**Sasha Aravkin

**When:**Tuesday, November 28, 2017 - 12:15

**Location:**CSE 403

We consider extended optimization formulations that unify tasks classically done sequentially or hierarchically. In particular, we develop self-tuning robust penalties, that require optimizing over both model parameters and shape parameters (which are typically computed with cross-validation or black-box optimization). If time permits, we will also show extended formulations that simultaneously learn a model and detect outliers in the input data.
Bio: Aleksandr Aravkin is a Washington Research Foundation Data Science Assistant Professor in Applied Mathematics at the University of Washington. He is also a Fellow at the UW eSciences Institute, and an adjunct professor in the Mathematics and Statistics. He received his M.S. in Statistics and PhD in Mathematics (Optimization) from the University of Washington in 2010. From 2010-2012, he was a postdoctoral fellow in Computer Science and Earth and Ocean Sciences at UBC, working on robust approaches for large-scale inverse problems. From 2012-2015, Dr. Aravkin was a Research Staff Member at the IBM TJ Watson Research center, and an adjunct professor at Columbia University in Computer Science and IEOR departments.

**Title:**Density Tree and Density Ranking in Singular Measures

**Speaker:**Yen-Chi Chen

**When:**Tuesday, October 17, 2017 - 12:00

**Location:**CSE 403

A density tree (also known as a cluster tree of a probability density function) is a tool in topological data analysis that uses a tree structure to represent the shape of a density function. Even if the density function is multivariate, a density tree can always be displayed on a two-dimensional plane, making it an ideal tool for visualizing the shape of a multivariate dataset. However, in complex datasets such as GPS data, the underlying distribution function is singular so the usual density function and density tree no longer exist. To analyze this type of data and generalize the density tree, we introduce the concept of density ranking and ranking tree (also called an alpha-tree). We then show that one can consistently estimate the density ranking and the ranking tree using a kernel density estimator. Based on the density ranking, we introduce several geometric and topological summary curves for analyzing GPS datasets.
Bio:
Yen-Chi Chen is an Assistant Professor in the Department of Statistics, a data science fellow in the eScience Institute, and a statistician in the National Alzheimer's Coordinating Center at the University of Washington. Yen-Chi obtained a PhD in Statistics from Carnegie Mellon University in June 2016, advised by Professor Larry Wasserman and Christopher Genovese. His current research is on nonparametric statistics, topological data analysis, cluster analysis, and applications in Astronomy, industrial engineering, and medical research.

**Title:**Causal Learning

**Speaker:**Bernhard Schölkopf

**When:**Wednesday, August 16, 2017 - 15:00

**Location:**CSE 305

In machine learning, we use data to automatically find dependences in the
world, with the goal of predicting future observations. Most machine
learning methods build on statistics, but one can also try to go beyond
this, assaying causal structures underlying statistical dependences. Can
such causal knowledge help prediction in machine learning tasks? We argue
that this is indeed the case, due to the fact that causal models are more
robust to changes that occur in real world datasets. We discuss
implications of causality for machine learning tasks, and argue that many
of the hard issues benefit from the causal viewpoint. This includes domain
adaptation, semi-supervised learning, transfer, life-long learning, and
fairness, as well as an application to the removal of systematic errors in
astronomical problems.

Semantic Scholar page: Bernhard Schölkopf

Semantic Scholar page: Bernhard Schölkopf

**Title:**Revisiting the PC-Algorithm: How Much Conditioning Is Needed?

**Speaker:**Ali Shojaie

**When:**Tuesday, May 9, 2017 - 12:00

**Location:**CSE 305

The PC-Algorithm, named after its inventors, Peter Sprites and Clark Glymour, is the gold standard for learning directed acyclic graphs (DAGs) from observational data. The algorithm was recently popularized by Kalisch and Bulhmann for learning sparse high-dimensional DAGs, containing many nodes, from smaller number of observations. However, the computational and sample complexity of the PC-Algorithm scale with the maximum degree of the network. The maximum degree grows with the number of nodes in many random graph families, including power-law graphs commonly observed in real-world networks. The algorithm is thus inappropriate for learning such networks. Moreover, the PC-Algorithm requires a stringent faithfulness assumption, which has been shown not to hold in high dimensions.

In this work, we exploit properties of large random graphs to show that DAGs can be learned by only conditioning on sets of small cardinality. In other words, for many graph families, we justify early stopping of PC-Algorithm’s search. While guaranteed to learn the skeleton of high-dimensional DAGs, this simple modification also turns out to greatly improve the computational and sample complexity of the PC-Algorithm. The new algorithm also requires a weaker faithfulness assumption and results in improved empirical performance in DAGs with hub nodes.

Semantic Scholar page: Ali Shojaie

In this work, we exploit properties of large random graphs to show that DAGs can be learned by only conditioning on sets of small cardinality. In other words, for many graph families, we justify early stopping of PC-Algorithm’s search. While guaranteed to learn the skeleton of high-dimensional DAGs, this simple modification also turns out to greatly improve the computational and sample complexity of the PC-Algorithm. The new algorithm also requires a weaker faithfulness assumption and results in improved empirical performance in DAGs with hub nodes.

Semantic Scholar page: Ali Shojaie

**Title:**Continuous State Machines and Grammars for Linguistic Structure Prediction

**Speaker:**Noah Smith

**When:**Tuesday, April 25, 2017 - 12:00

**Location:**CSE 305

Linguistic structure prediction infers abstract representations of text, like syntax trees and semantic graphs, enabling interpretation in applications like question answering, information extraction, and opinion analysis. This talk is about the latest family of methods for linguistic structure prediction, which make heavy use of representation learning via neural networks. I'll present these new methods as continuous generalizations of state machines and probabilistic grammars. I'll show how they've led to fast and accurate performance on several syntactic and semantic parsing problems.

Semantic scholar page: Noah Smith

Semantic scholar page: Noah Smith

**Title:**Sequence Modeling: From Spectral Methods and Bayesian Nonparametrics to Deep Learning

**Speaker:**Alex Smola

**When:**Tuesday, April 11, 2017 - 12:00

**Location:**MOR 220

In this talk, Alex Smola will summarize a few recent developments in the design and analysis of sequence models. Starting with simple parametric models such as HMMs for sequences we look at nonparametric extensions in terms of their ability to model more fine-grained types of state and transition behavior. In particular we consider spectral embeddings, nonparametric Bayesian models such as the nested Chinese Restaurant Franchise and the Dirichlet-Hawkes Process. We conclude with a discussion of deep sequence models for user return time modeling, time-dependent collaborative filtering, and large-vocabulary user profiling.