9:30 | breakfast (Lyceum Ballroom in the HUB) |

10:00 | Rich Caruana (MSR), "Clustering: Probably Approximately Useless?" |

10:30 | Marina Meila (UW Statistics), "Geometry Preserving Non-Linear Dimension Reduction" |

Video of talks from first session | |

11:00 | break |

11:30 | Ran Gilad-Bachrach (MSR), "Deep Learning (but not the kind you were thinking of)" |

12:00 | Mathias Drton (UW Statistics), "A Bayesian Information Criterion for Singular Models" |

12:30 | lunch |

2:00 | Misha Bilenko (MSR), "Feature induction and aggregation for large-scale learning" |

2:30 | Ben Taskar (UW CSE), "Probabilistic Models of Diversity: Determinantal Point Processes" |

3:00 | Scott Yih (MSR), "Multi-Relational Latent Semantic Analysis" |

3:30 | Raj Rao (UW CSE), "Opportunities and Challenges for Machine Learning in Brain-Computer Interfacing" |

4-6 | reception + poster session (Atrium of Paul G. Allen Center for Computer Science & Engineering) |

### Abstracts

#### Clustering: Probably Approximately Useless?, Rich Caruana (MSR)

Clustering never seems to live up to the hype. To paraphrase the popular saying, clustering looks good in theory, yet often fails to deliver in practice. Why? You would think that something so simple and elegant as finding groups of similar items in data would be incredibly useful. Yet often it isn't. The problem is that clustering rarely finds the groups *you* want, or expected, or that are most useful for the task at hand. There are so many good ways to cluster a dataset that the odds of coming up with the clustering that is best for what you're doing are small. How do we fix this and make clustering more useful in practice? How do we make clustering do what you want, while still giving it the freedom to "do its own thing" and surprise us?

#### Geometry preserving non-linear dimension reduction, Marina Meila (UW Statistics)

In recent years, manifold learning has become increasingly popular as a tool for performing non-linear dimensionality reduction. This has led to the development of numerous algorithms of varying degrees of complexity that aim to recover the underlying low-dimensional parameters of the data using either local or global features.

It is also widely recognized that the low dimensional parametrizations will typically distort the geometric properties of the original data, like distances, angles, areas and so on. These distortions depend both on the data and on the algorithm used.

Building on the Laplacian Eigenmap framework, we propose a new paradigm that offers a guarantee, under reasonable assumptions, that *any* manifold learning algorithm will preserve the geometry of a data set. Our approach is based on augmenting the output of an algorithm with geometric information, embodied in the Riemannian metric of the manifold. The Riemannian metric allows us to compute geometric quantities (such as angle, length, or volume) for any coordinate system or embedding of the manifold. This geometric faithfulness, which is not guarante edfor most algorithms, allows us to define geometric measurements that are inde pendent of the algorithm used, and hence move seamlessly from one algorithm to another. In this work, we provide an algorithm for estimating the Riemannian metric from data and demonstrate the advantages of our approach in a variety of examples.

As an application of this new framework, we develop a new, principled, unsupervised to selecting the scale parameter in manifold learning, based on optimizing the geometric self-consistency w.r.t the scale.

This talk will not require any knowledge of advanced mathematics or manifold learning.

Joint work with Dominique Perrault-Joncas.

#### Deep Learning (but not the kind you were thinking of), Ran Gilad-Bachrach (MSR)

Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. As an alternative, we define the depth of hypotheses and show that generalization and robustness can be bounded as a function of this depth. Therefore, we suggest using the median hypothesis, which is a deep hypothesis, and present algorithms for approximating it.

One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work.

The presentation is based on a joint work with Chris Burges.

#### A Bayesian information criterion for singular models, Mathias Drton (UW Statistics)

The Bayesian Information Criterion (BIC) is a widely used model selection technique that is inspired by the large-sample asymptotic behavior of Bayesian approaches to model selection. In this talk we will consider such approximate Bayesian model choice for problems that involve models whose Fisher-information matrices may fail to be invertible along other competing submodels. When models are singular in this way, the penalty structure in BIC generally does not reflect the large-sample behavior of their Bayesian marginal likelihood. While large-sample theory for the marginal likelihood of singular models has been developed recently, the resulting approximations depend on the true parameter value and lead to a paradox of circular reasoning. Guided by examples such as determining the number of components of mixture models, the number of factors in latent factor models or the rank in reduced-rank regression, we propose a resolution to this paradox and give a practical extension of BIC for singular model selection problems.

Joint work with Martyn Plummer.

**Bio**: Mathias Drton is Professor of Statistics at the University of Washington. A native of Germany, he received his PhD in Statistics
from UW in 2004. After a Postdoc in Mathematics at UC Berkeley, he spent seven years at the University of Chicago before returning to Seattle in 2012.

#### Feature induction and aggregation for large-scale learning, Misha Bilenko (MSR)

Learning from large-scale datasets with high-cardinality attributes is very common in information retrieval, online services and telemetry. Training on many examples of unbounded dimensionality requires adaptive yet efficient feature representations suitable for distributed computing platforms. We focus on one such representation that has been successfully used by practitioners for many years, yet received only casual mentions in literature: inducing features from aggregated conditional statistics. The talk will describe a generalization of this approach derived via gradient descent in functional space, and its connections to several alternative representations: feature hashing, matrix factorization and deep networks. The algorithm also provides a useful illustration to highlight requirements imposed by industrial engineering needs on large-scale learning methods: convenience of stage-wise training, ease of model monitoring and accuracy debugging, and suitability for conventional distributed computing infrastructure.

Joint work with Chris Meek.

#### Probabilistic Models of Diversity: Determinantal Point Processes in Machine Learning, Ben Taskar (UW CSE)

Many real-world problems involve negative interactions; we might want search results to be diverse, sentences in a summary to cover distinct aspects of the subject, or objects in an image to occupy different regions of space. However, traditional structured probabilistic models tend deal poorly with these kinds of situations; Markov random fields, for example, become intractable even to approximate. Determinantal point processes (DPPs), which arise in random matrix theory and quantum physics, behave in a complementary fashion: while they cannot encode positive interactions, they define expressive models of negative correlations that come with surprising and exact algorithms for many types of inference, including conditioning, marginalization, and sampling. I'll present our recent work on a novel factorization and dual representation of DPPs that enables efficient and exact inference for exponentially-sized structured sets. We develop an exact inference algorithm for DPPs conditioned on subset size and derive efficient parameter estimation for DPPs from several types of observations, as well as approximation algorithms for large-scale non-linear DPPs. I'll illustrate the advantages of DPPs on several natural language and computer vision tasks: document summarization, image search and multi-person pose estimation problems in images.

Joint work with Alex Kulesza, Jennifer Gillenwater, Raja Affandi and Emily Fox.

**Bio**: Ben Taskar received his bachelor's and doctoral degree in Computer Science from Stanford University. After a postdoc at the University of
California at Berkeley, he joined the faculty at the University of Pennsylvania. In the spring of 2013, he joined the University of Washington Computer Science & Engineering Department. His research interests include machine learning, natural language processing and computer vision. He has been awarded the Sloan Research Fellowship, the NSF CAREER Award, and selected for the Young Investigator Program by the Office of Naval Research and the DARPA Computer Science Study Group. His work on structured prediction has received best paper awards at several conferences.

#### Multi-Relational Latent Semantic Analysis, Scott Yih (MSR)

We present Multi-Relational Latent Semantic Analysis (MRLSA) which generalizes Latent Semantic Analysis (LSA). MRLSA provides an elegant approach to combining multiple relations between words by constructing a 3-way tensor. Similar to LSA, a low-rank approximation of the tensor is derived using a tensor decomposition. Each word in the vocabulary is thus represented by a vector in the latent semantic space and each relation is captured by a latent square matrix. The degree of two words having a specific relation can then be measured through simple linear algebraic operations. We demonstrate that by integrating multiple relations from both homogeneous and heterogeneous information sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.

#### Opportunities and Challenges for Machine Learning in Brain-Computer Interfacing, Raj Rao (UW CSE)

The field of brain-computer interfacing has seen rapid advances in recent years, with applications ranging from cochlear implants for the deaf to brain-controlled prosthetic arms for the paralyzed. This talk will provide a brief overview of the various types of brain-computer interfaces (BCIs) and the techniques they use for mapping brain signals to control outputs. I will then highlight some opportunities as well as challenges for machine learning in helping facilitate the transition of BCIs from the laboratory to the real world.

**Bio**: Rajesh Rao is an Associate Professor in the Computer Science & Engineering department at the University of Washington. He is the recipient of an NSF CAREER award, an ONR Young Investigator Award, a Sloan Faculty Fellowship, and a Packard Fellowship for Science and Engineering. He is the author of the new textbook Brain-Computer Interfacing: An Introduction (Cambridge University Press, 2013) and the co-editor of two volumes, Probabilistic Models of the Brain (MIT Press, 2002) and Bayesian Brain (MIT Press, 2007). With Adrienne Fairhall, he offered the first MOOC on computational neuroscience (about 50K registered students). His research spans the areas of computational neuroscience, AI, and brain-computer interfacing. His other interests include the 4000-year-old undeciphered Indus script (a topic on which he has given a TED talk) and Indian miniature painting.

Page last updated 6 October 2013.