Problems that require estimating high-dimensional matrices from noisy observations arise frequently in statistics and machine learning. Examples include dimensionality reduction methods (e.g., principal components and canonical correlation), collaborative filtering and matrix completion (e.g., Netflix and other recommender systems), system identification of state-space models, and graphical model learning. When the sample size is less than the matrix dimensions, all of these problems are ill-posed, so that some type of structure is required in order to obtain interesting results. In recent years, relaxations based on the nuclear norm and other types of convex matrix regularizers have become popular. By framing a broad class of problems as special cases of matrix regression, we present a single theoretical result that provides guarantees on the accuracy of such convex relaxations. Two properties of the convex relaxation turn out to be essential: decomposability of the regularizer, and restricted strong convexity of the loss function. Our general result can be specialized to obtain various explicit and non-asymptotic bounds, among them sharp rates for noisy forms of matrix sketching, completion, and decomposition. Based on joint work with Sahand Negahban.
There are several scenarios where one needs to construct or learn a transformation of data from one space to another. One example is auto-tagging of images, where an image is mapped from image space to text space to find the most relevant words to describe an image. Another example is metric learning, where one maps all of the data from one space to another such that standard distance functions in the mapped space better capture the desired distance structure of the data. In this talk, I will describe some of my recent work in analyzing and developing techniques for the transformation learning problem. I will describe a general formulation for learning linear transformations, and then show that this formulation may be efficiently adapted to learn non-linear transformations through kernelization. Crucially, this kernelization yields transformations that can be applied inductively to new data, and the analysis generalizes a significant amount of previous research in this area. I will then describe some specific instantiations of this formulation for the problems of metric learning and domain adaptation. I will show some results on a variety of vision tasks, and will discuss connections to related work on online learning and hashing techniques.
Unlabeled text data is plentiful and computing power continues to grow exponentially. The accuracy difference between competing supervised learning methods tends to decline as more data becomes available, indicating that more data can be more important than cleverer algorithms. "AskMSR" showed that leveraging unlabeled textual Web data can do an impressive job at question-answering. Recent work has shown that vector space models combined with neural nets can solve standard natural language problems well, and very recent work using recurrent neural nets to predict sequences of characters gives very intriguing results. Inspired by all of this, we are investigating the following question: what is the simplest vector space model we can build that (1) learns from unlabeled data, (2) represents both words and sentences as vectors, (3) is generative (it can efficiently map sentence vectors to text), (4) is as simple as possible but can still do something interesting? This work is still in early development.
The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with previous deep architectures. For example, SPNs perform face completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex. (Joint work with Hoifung Poon.)
We recently developed a DBN-HMM (Deep Belief Net/Hidden Markov Model) architecture for large-scale speech recognition with three steps in learning and decoding: 1) Stacked RBM pre-training; 2) Stochastic gradient descent back-propagation in fine tuning of DBN; and 3) Tying DBN weights in the phone-bound temporal dimension and using dynamic programming to arrive at the decision rule for speech recognition (Dahl, Yu, and Deng, ICASSP 2011 and IEEE Trans. Audio, Speech and Language Proc., 2011). While achieving remarkable success with this approach compared with the state-of-the-art in discriminatively trained HMM, we are faced with the seemingly insurmountable problem of scalability in dealing with virtually unlimited amount of speech training data in all sorts of acoustic environments. The main difficulty comes from the stochastic gradient learning algorithm in the fine tuning step of DBN. This step of the DBN-HMM learning algorithm is extremely difficult to parallelize over many machines with practical advantages, after some detailed exploration. To overcome the scalability challenge, we have recently developed a novel deep learning architecture, which we call deep convex network (DCN). The learning algorithm in the DCN is a convex one in each layer of the DCN. And the learning algorithm is batch-based instead of stochastic, naturally lending it amenable for parallelization. The construction of the DCN is related to our earlier work on building the deep hidden CRF (Yu, Wang, Deng, IEEE J. Selected Topics in Signal Proc., Dec 2010), but the learning algorithm is much simpler in each layer, enabling the use of hundreds of layers in the overall DCN architecture. We have empirically found that without doing global error fitting over all layers, the DCN can already achieve excellent discrimination results as long as a very deep architecture is built. One version of the algorithm contains a module that makes use of nonlinear random projection inspired by the work of extreme learning machine (Huang, 2006). While this module gives reasonably good results after it is embedded in the deep structure, we found that replacing it with the restricted Boltzmann machine (RBM) contributes to over 30% of error reduction in the image recognition (MNIST) experiment, achieving the state of the art performance in this task. More recent experiments on speech data will be discussed also.
We discuss some recent advances in local learning. We describe and compare state-of-the-art local learning methods, including the local support vector machine, optimal weights for k-NN, local Bayesian discriminant analysis, and completely lazy learning. We highlight the importance of local learning when given similarity data. Experimental results include applications in computer vision, sonar, bioinformatics, and non-destructive evaluation.
Multiplicative updates multiply the parameters by nonnegative factors. These updates are motivated by a Maximum Entropy Principle and they are prevalent in evolutionary processes where the parameters are for example concentrations of species and the factors are survival rates. The simplest such update is Bayes rule and we give an in vitro selection algorithm for RNA strands that implements this rule in the test tube where each RNA strand represents a different model. In one liter of the RNA soup there are approximately 1020 different strands and therefore this is a rather high-dimensional implementation of Bayes rule. We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The ``blessing'' of these updates is that they learn very fast because the good parameters grow exponentially. However their ``curse'' is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature's methods parallel the ones developed in Machine Learning, but nature also has some additional tricks. This will be a high level talk. No background in online learning or evolutionary Biology will be required.
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters.
Deducing the state or structure of a system from partial, noisy measurements is a fundamental task throughout the sciences and engineering. A commonly encountered difficulty that arises in such inverse problems is the very limited availability of data relative to the ambient dimension of the signal to be estimated. We analyze a class of ill-posed linear inverse problems in which the underlying model of interest has simple algebraic structure. Specifically the focus of this talk is on the setting in which we are given a small number of linear measurements of a simple model, which is formed by adding a small number of elements from some atomic set. Examples of such models include previously studied cases such as sparse signals and low-rank matrices, as well as others such as low-rank tensors, binary vectors, orthogonal matrices, and matrices formed as the sum of a few permutations. Inspired by the success of the L1-norm and nuclear-norm heuristics, we propose a general convex regularization framework to recover such simple models. We discuss general recovery guarantees for our approach, and describe several example applications. Thus our work extends the catalog of objects and structures (beyond sparse vectors and low-rank matrices) that can be recovered from limited information via tractable convex programming. Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.
This talk is concerned with summarizing -- by means of statistical models -- of data that expresses preferences. This data is typically a set of rankings of n items by a panel of experts; the simplest summary is the "consensus ranking", or the "centroid" of the set of rankings. Such problems appear in many tasks, ranging from combining voter preferences to boosting of search engines. We study the problem in its more general form of estimating a parametric model known as the Generalized Mallows (GM) model. I will present an exact estimation algorithm, non-polynomial in theory, but extremely effective in comparison with existing algorithms. From a statistical point of view, we show that the GM model is an exponential family, and introduce the conjugate prior for this model class. Then we introduce the infinite GM model, corresponding to "rankings" over an infinite set of items, and show that this model is both elegant and of practical significance. Finally, the talk will touch upon the subject of multimodal distributions and clustering. Joint work with: Harr Chen, Alnur Ali, Bhushan Mandhani, Le Bao, Kapil Phadnis, Arthur Patterson and Jeff Bilmes.