Our world is becoming more data driven. With the spread of ubiquitous sensors, network connectivity, and massive storage capabilities, we are able to collect more and more data. But our computation and analysis capabilities have not increased at a comparable rate. Computer scientists are facing looming questions such as "How do we deal with the massive amounts of data we are collecting? How can we extract value out of data?" A sub-question relevant to machine learning researchers is "what role will machine learning and data mining play?" Through a survey of current sources of Big Data and analysis workflow patterns, this talk aims to shed light on the latter question.
Tuesday, November 29, 2011 - 12:30
Machine Learning and Big Data Analysis
Tuesday, November 22, 2011 - 12:30
Entire Relaxation Path for Maximum Entropy Models
We describe a relaxed and generalized notion of maximum entropy problems for multinomial distributions. By introducing a simple re-parametrization we are able to derive an efficient homotopy tracking scheme for the entire relaxation path using linear space and quadratic time. We also show that the Legendre dual of the relaxed maximum entropy problem is the task of finding the maximum likelihood estimator for an exponential distribution with L1 regularization. Hence, our solution can be used for problems such as language modeling with sparse parameter representation.
Tuesday, November 15, 2011 - 12:30
Learning Semantic Parsers for More Languages and with Less Supervision
Recent work has demonstrated effective learning algorithms for a variety of semantic parsing problems, where the goal is to automatically recover the underlying meaning of input sentences. Although these algorithms can work well, there is still a large cost in annotating data and gathering other language-specific resources for each new application. In this talk, I will describe efforts to address these challenges by developing scalable, probabilistic CCG grammar induction algorithms. I will present recent work on methods that incorporate new notions of lexical generalization, thereby enabling effective learning for a variety of different natural languages and formal meaning representations. I will also describe a new approach for learning semantic parsers from conversational data, which does not require any manual annotation of sentence meaning. Finally, I will sketch future directions, including our recurring focus on building scalable learning techniques while attempting to minimize the application-specific engineering effort. Joint work with Yoav Artzi, Tom Kwiatkowski, Sharon Goldwater, and Mark Steedman
Tuesday, November 1, 2011 - 12:30
Speech Recognition with Segmental Conditional Random Fields
Segmental Conditional Random Fields (SCRFs) are a mathematical model in which an observation sequence is segmented into labeled chunks. For example, an audio sequence may be segmented into words, or a video segmented into scenes. These models have two key characteristics. First, in contrast to frame-based systems, long-span features can be used - features which relate an entire span of observations to a label, and which may make reference to precisely defined segment boundaries. Second, the models are log-linear in form, and allow for the convenient integration of features derived from heterogeneous information sources. This talk will provide an overview of SCRFs, and in particular their use in speech recognition. First, we describe the basic model. Then, we present results in which SCRFs are used to add new information sources to improve the performance of a state-of-the-art system. Finally, we present initial results in which the framework is used to do recognition from the ground up.
Tuesday, October 18, 2011 - 12:30
Penalized Methods for Estimation of High Dimensional Directed (cyclic) Graphs
In graphical models theory, directed edges often represent causal relationships among random variables and discovering such relationships is of main interest in many areas of application, including biological and social problems. In addition to challenges in estimating directed graphs due to high dimensionality of the space and inadequacy of observational data for estimating the direction of causation, presence of cycles in graphs further complicates application of statistical methodologies to estimation of directed graphs. I will start by describing an efficient (polynomial time) algorithm, based on penalized likelihood methods, for estimation of directed graphs with known topological ordering, and discuss extensions of this methodology for estimation of general directed graphs, using (i) temporal observations, and (ii) joint perturbation screens and steady-state observations of the systems. Asymptotic properties of the proposed estimators and their applications in the context of gene regulatory networks are further discussed.
Tuesday, October 4, 2011 - 12:30
Direct Loss Minimization for Structural Labeling with Applications to Speech Recognition
In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or word error rate in speech recognition. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. In the first part of the talk we present a theorem stating that a certain perceptron-like learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. Empirical results on phonetic alignment are presented surpassing all previously reported results on this problem on the TIMIT corpus. In the second part of the talk, we describe a new algorithm which aims to minimize the regularized task loss. We state a PAC-Bayesian generalization bound, which gives an upper-bound on the expected task loss in terms of the empirical task loss. Our algorithm is derived by finding the gradient of the PAC-Bayesian bound and minimizing it by stochastic gradient descent. The resulting algorithm is iterative and easy to implement. Experiments on phoneme recognition on the TIMIT corpus, when the task loss is chosen to be the phoneme error rate, show that our method achieves the lowest phoneme error rate compared to other discriminative and generative models with the same expressive power. Joint work with David McAllester and Tamir Hazan.
Tuesday, June 28, 2011 - 12:30
Learning with Exploration
In the real world, we are constantly presented with problems where we first see some information, then make a choice, and then get feedback about that choice. Direct examples of this occur in medical testing and online ad auctions. The standard online supervised learning setting is almost the same, except that it provides feedback about all choices. With this weaker form of feedback provided by the real world, how can we best learn?
Tuesday, May 31, 2011 - 12:30
Convex Relaxation and High-Dimensional Matrices
Problems that require estimating high-dimensional matrices from noisy observations arise frequently in statistics and machine learning. Examples include dimensionality reduction methods (e.g., principal components and canonical correlation), collaborative filtering and matrix completion (e.g., Netflix and other recommender systems), system identification of state-space models, and graphical model learning. When the sample size is less than the matrix dimensions, all of these problems are ill-posed, so that some type of structure is required in order to obtain interesting results. In recent years, relaxations based on the nuclear norm and other types of convex matrix regularizers have become popular. By framing a broad class of problems as special cases of matrix regression, we present a single theoretical result that provides guarantees on the accuracy of such convex relaxations. Two properties of the convex relaxation turn out to be essential: decomposability of the regularizer, and restricted strong convexity of the loss function. Our general result can be specialized to obtain various explicit and non-asymptotic bounds, among them sharp rates for noisy forms of matrix sketching, completion, and decomposition. Based on joint work with Sahand Negahban.
Tuesday, May 17, 2011 - 12:30
Learning Transformations for Search and Adaptation
There are several scenarios where one needs to construct or learn a transformation of data from one space to another. One example is auto-tagging of images, where an image is mapped from image space to text space to find the most relevant words to describe an image. Another example is metric learning, where one maps all of the data from one space to another such that standard distance functions in the mapped space better capture the desired distance structure of the data. In this talk, I will describe some of my recent work in analyzing and developing techniques for the transformation learning problem. I will describe a general formulation for learning linear transformations, and then show that this formulation may be efficiently adapted to learn non-linear transformations through kernelization. Crucially, this kernelization yields transformations that can be applied inductively to new data, and the analysis generalizes a significant amount of previous research in this area. I will then describe some specific instantiations of this formulation for the problems of metric learning and domain adaptation. I will show some results on a variety of vision tasks, and will discuss connections to related work on online learning and hashing techniques.
Tuesday, May 10, 2011 - 12:30
Language: How Far Can We Get With Unlabeled Data?
Unlabeled text data is plentiful and computing power continues to grow exponentially. The accuracy difference between competing supervised learning methods tends to decline as more data becomes available, indicating that more data can be more important than cleverer algorithms. "AskMSR" showed that leveraging unlabeled textual Web data can do an impressive job at question-answering. Recent work has shown that vector space models combined with neural nets can solve standard natural language problems well, and very recent work using recurrent neural nets to predict sequences of characters gives very intriguing results. Inspired by all of this, we are investigating the following question: what is the simplest vector space model we can build that (1) learns from unlabeled data, (2) represents both words and sentences as vectors, (3) is generative (it can efficiently map sentence vectors to text), (4) is as simple as possible but can still do something interesting? This work is still in early development.