Tuesday, April 19, 2011 - 12:30
Sum-Product Networks: A Principled Approach to Deep Learning
The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask: under what general conditions is the partition function tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent, it represents the partition function and all marginals of some graphical model, and we give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with previous deep architectures. For example, SPNs perform face completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex. (Joint work with Hoifung Poon.)
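As a toy illustration of the sum-product evaluation this abstract describes: an SPN is evaluated bottom-up, and setting all leaf indicators to 1 yields the partition function, while clamping some indicators yields marginals. The network structure, weights, and class names below are invented for the sketch, not taken from the talk.

```python
class Leaf:
    """Indicator for variable `var` taking value `val`."""
    def __init__(self, var, val):
        self.var, self.val = var, val
    def value(self, assignment):
        # A variable absent from the assignment is marginalized out:
        # its indicator is set to 1.
        v = assignment.get(self.var)
        return 1.0 if v is None or v == self.val else 0.0

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, assignment):
        out = 1.0
        for c in self.children:
            out *= c.value(assignment)
        return out

class Sum:
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (weight, node)
    def value(self, assignment):
        return sum(w * c.value(assignment) for w, c in self.weighted_children)

# A small complete and consistent SPN over two binary variables X1, X2.
x1_0, x1_1 = Leaf("X1", 0), Leaf("X1", 1)
x2_0, x2_1 = Leaf("X2", 0), Leaf("X2", 1)
root = Sum([
    (0.6, Product([x1_1, x2_1])),
    (0.4, Product([Sum([(0.3, x1_0), (0.7, x1_1)]), x2_0])),
])

Z = root.value({})                  # all indicators = 1: the partition function
p = root.value({"X1": 1, "X2": 1})  # unnormalized probability of one assignment
marg = root.value({"X2": 0})        # marginal of X2 = 0, summing out X1
```

A single upward pass thus answers partition-function and marginal queries in time linear in the network size, which is the tractability property the abstract emphasizes.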
Tuesday, April 5, 2011 - 12:30
Deep Convex Network: Architectures and Parallelizable Learning
We recently developed a DBN-HMM (Deep Belief Net/Hidden Markov Model) architecture for large-scale speech recognition with three steps in learning and decoding: 1) stacked RBM pre-training; 2) stochastic gradient descent back-propagation to fine-tune the DBN; and 3) tying DBN weights along the phone-bound temporal dimension and using dynamic programming to arrive at the decision rule for speech recognition (Dahl, Yu, and Deng, ICASSP 2011 and IEEE Trans. Audio, Speech and Language Proc., 2011). While this approach achieves remarkable success compared with the state-of-the-art discriminatively trained HMM, we face a seemingly insurmountable scalability problem in dealing with the virtually unlimited amounts of speech training data collected in all sorts of acoustic environments. The main difficulty comes from the stochastic gradient learning algorithm in the fine-tuning step of the DBN: after detailed exploration, we found this step extremely difficult to parallelize over many machines with any practical advantage. To overcome the scalability challenge, we have recently developed a novel deep learning architecture, which we call the deep convex network (DCN). Learning in the DCN is convex within each layer, and it is batch-based rather than stochastic, making it naturally amenable to parallelization. The construction of the DCN is related to our earlier work on the deep hidden CRF (Yu, Wang, Deng, IEEE J. Selected Topics in Signal Proc., Dec 2010), but the learning algorithm is much simpler in each layer, enabling the use of hundreds of layers in the overall DCN architecture. We have found empirically that, without global error fitting over all layers, the DCN can already achieve excellent discrimination results as long as a sufficiently deep architecture is built.
One version of the algorithm contains a module that makes use of nonlinear random projection, inspired by work on the extreme learning machine (Huang, 2006). While this module gives reasonably good results once embedded in the deep structure, we found that replacing it with a restricted Boltzmann machine (RBM) yields over 30% relative error reduction in the image recognition (MNIST) experiment, achieving state-of-the-art performance on this task. More recent experiments on speech data will also be discussed.
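A rough sketch of the per-layer convex learning described above: each module here uses a fixed random (ELM-style) hidden layer with a closed-form ridge-regression readout, so the only learning per module is convex, and modules are stacked by appending the previous module's predictions to the raw input. The sizes, tanh nonlinearity, and regularizer are illustrative assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_module(X, Y, n_hidden=64, reg=1e-3):
    """One module: fixed random lower weights; convex (ridge) upper weights."""
    W = rng.standard_normal((X.shape[1], n_hidden))
    H = np.tanh(X @ W)                           # fixed nonlinear features
    # Closed-form ridge regression: learning within the layer is convex.
    U = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, U

def module_predict(X, W, U):
    return np.tanh(X @ W) @ U

# Stack modules: each layer sees the raw input plus the previous predictions.
X = rng.standard_normal((200, 10))
Y = (X[:, :1] > 0).astype(float)                 # toy binary target
inputs = X
for layer in range(3):
    W, U = train_module(inputs, Y)
    pred = module_predict(inputs, W, U)
    inputs = np.hstack([X, pred])                # DCN-style stacking
```

Because each module's fit is a batch least-squares solve, layers can in principle be trained with straightforward data parallelism, which is the scalability point the abstract makes.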
Wednesday, March 9, 2011 - 12:30
Some Recent Advances in Local Learning
We discuss some recent advances in local learning. We describe and compare state-of-the-art local learning methods, including the local support vector machine, optimal weights for k-NN, local Bayesian discriminant analysis, and completely lazy learning. We highlight the importance of local learning when given similarity data. Experimental results include applications in computer vision, sonar, bioinformatics, and non-destructive evaluation.
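To make the "lazy" local-learning idea concrete, here is a minimal sketch: for each query point, classification uses only its k nearest training neighbors, with a distance-weighted vote. This is a deliberately simple stand-in for the local SVM and optimally weighted k-NN methods the abstract mentions; the weighting scheme and data are invented.

```python
import numpy as np

def local_knn_predict(X_train, y_train, X_query, k=5):
    """Lazy local learner: fit nothing globally; decide per query."""
    preds = []
    for x in X_query:
        d = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + 1e-9)              # closer neighbors weigh more
        votes = {}
        for label, weight in zip(y_train[nn], w):
            votes[label] = votes.get(label, 0.0) + weight
        preds.append(max(votes, key=votes.get))
    return np.array(preds)

# Two well-separated clusters as a sanity check.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
queries = np.array([[0.0, 0.0], [3.0, 3.0]])
pred = local_knn_predict(X, y, queries)
```

Note that only pairwise distances are used, which is why local methods apply directly when all that is given is similarity data.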
Wednesday, February 23, 2011 - 12:30
The Blessing and the Curse of the Multiplicative Updates
Multiplicative updates multiply the parameters by nonnegative factors. These updates are motivated by a maximum entropy principle and are prevalent in evolutionary processes, where the parameters are, for example, concentrations of species and the factors are survival rates. The simplest such update is Bayes' rule, and we give an in vitro selection algorithm for RNA strands that implements this rule in the test tube, where each RNA strand represents a different model. One liter of the RNA soup contains approximately 10^20 different strands, so this is a rather high-dimensional implementation of Bayes' rule. We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The "blessing" of these updates is that they learn very fast, because the good parameters grow exponentially. Their "curse", however, is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature's methods parallel the ones developed in machine learning, but nature also has some additional tricks. This will be a high-level talk; no background in online learning or evolutionary biology is required.
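The blessing and the curse can both be seen in a few lines. The sketch below runs a standard multiplicative (exponentiated-loss) weight update, together with one of the online-learning remedies the abstract alludes to: a "fixed share" step that mixes a small amount of weight back in uniformly, so wiped-out parameters can recover when the data shifts. The losses, learning rate eta, and share rate alpha are all illustrative assumptions.

```python
import numpy as np

def multiplicative_update(w, losses, eta=1.0, alpha=0.05):
    w = w * np.exp(-eta * losses)        # blessing: good parameters grow exponentially
    w = w / w.sum()
    # Remedy for the curse: share a small fraction of weight uniformly,
    # so no parameter is ever driven irrecoverably close to zero.
    return (1 - alpha) * w + alpha / len(w)

w = np.ones(4) / 4
# Expert 0 is best at first; then the data shifts and expert 3 becomes best.
for t in range(50):
    losses = np.array([0.0, 1.0, 1.0, 1.0]) if t < 25 else np.array([1.0, 1.0, 1.0, 0.0])
    w = multiplicative_update(w, losses)
```

Without the sharing step, expert 3's weight would have decayed exponentially during the first phase and the algorithm would be slow to recover after the shift; with it, every weight is kept at least alpha/n.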
Wednesday, January 26, 2011 - 12:30
A Framework for Feature Selection in Clustering
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters.
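A highly simplified sketch of the sparse K-means idea: alternate (i) K-means on weight-scaled features with (ii) a lasso-style (soft-threshold) update of the feature weights based on each feature's between-cluster sum of squares. The thresholding rule, iteration counts, and data are illustrative assumptions; this is not the exact criterion from the paper.

```python
import numpy as np

def soft_threshold(a, s):
    return np.maximum(a - s, 0.0)

def simple_kmeans(X, k, rng, iters=20):
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def sparse_kmeans(X, k=2, iters=5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.ones(p) / np.sqrt(p)
    for _ in range(iters):
        labels = simple_kmeans(X * np.sqrt(w), k, rng)
        # Per-feature between-cluster sum of squares: large for features
        # along which the clusters actually differ.
        bcss = np.zeros(p)
        for j in range(k):
            nj = (labels == j).sum()
            if nj:
                bcss += nj * (X[labels == j].mean(0) - X.mean(0)) ** 2
        # Lasso-type step: soft-threshold, then renormalize (threshold is illustrative).
        w = soft_threshold(bcss, np.abs(bcss).max() * 0.2)
        w /= np.linalg.norm(w) + 1e-12
    return labels, w

# Clusters differ only in the first 2 of 10 features.
rng = np.random.default_rng(2)
X = rng.normal(0, 1, (60, 10))
X[:30, :2] += 4.0
labels, w = sparse_kmeans(X)
```

The point of the alternation is exactly what the abstract states: a single criterion drives both the feature weights (which should concentrate on the two informative features here) and the resulting clusters.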
Wednesday, January 12, 2011 - 12:30
The Convex Algebraic Geometry of Linear Inverse Problems
Deducing the state or structure of a system from partial, noisy measurements is a fundamental task throughout the sciences and engineering. A commonly encountered difficulty that arises in such inverse problems is the very limited availability of data relative to the ambient dimension of the signal to be estimated. We analyze a class of ill-posed linear inverse problems in which the underlying model of interest has simple algebraic structure. Specifically, the focus of this talk is on the setting in which we are given a small number of linear measurements of a simple model, which is formed by adding a small number of elements from some atomic set. Examples of such models include previously studied cases such as sparse signals and low-rank matrices, as well as others such as low-rank tensors, binary vectors, orthogonal matrices, and matrices formed as the sum of a few permutations. Inspired by the success of the L1-norm and nuclear-norm heuristics, we propose a general convex regularization framework to recover such simple models. We discuss general recovery guarantees for our approach, and describe several example applications. Our work thus extends the catalog of objects and structures (beyond sparse vectors and low-rank matrices) that can be recovered from limited information via tractable convex programming. Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.
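The special case of the L1-norm heuristic can be demonstrated in a few lines: recover a sparse vector from far fewer linear measurements than its ambient dimension by solving the lasso with plain iterative soft-thresholding (ISTA). The dimensions, sparsity level, and regularization parameter are illustrative assumptions, and ISTA is just one generic convex solver, not the framework from the talk.

```python
import numpy as np

def ista(A, y, lam=0.05, iters=1000):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by proximal gradient steps."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - y)
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(3)
n, p = 40, 100                             # 40 measurements of a 100-dim signal
A = rng.standard_normal((n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[[5, 40, 77]] = [1.0, -2.0, 1.5]     # 3 nonzero "atoms"
y = A @ x_true
x_hat = ista(A, y)
```

Here the atomic set is the signed coordinate vectors and the atomic norm is the L1 norm; the framework in the talk generalizes the same recipe to other atomic sets (low-rank matrices via the nuclear norm, permutations, and so on).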
Tuesday, November 30, 2010 - 12:30
Learning with Permutations: Consensus Finding, Exponential Models, and Infinite Rankings
This talk is concerned with summarizing, by means of statistical models, data that express preferences. Such data is typically a set of rankings of n items by a panel of experts; the simplest summary is the "consensus ranking", or the "centroid", of the set of rankings. Such problems appear in many tasks, ranging from combining voter preferences to boosting of search engines. We study the problem in its more general form of estimating a parametric model known as the Generalized Mallows (GM) model. I will present an exact estimation algorithm, non-polynomial in theory but extremely effective in comparison with existing algorithms. From a statistical point of view, we show that the GM model is an exponential family and introduce the conjugate prior for this model class. We then introduce the infinite GM model, corresponding to "rankings" over an infinite set of items, and show that this model is both elegant and of practical significance. Finally, the talk will touch upon the subject of multimodal distributions and clustering. Joint work with Harr Chen, Alnur Ali, Bhushan Mandhani, Le Bao, Kapil Phadnis, Arthur Patterson and Jeff Bilmes.
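The two basic ingredients of consensus finding can be sketched directly: the Kendall tau distance (the number of pairwise disagreements, which underlies Mallows-type models) and a simple Borda-count consensus. The Borda count is a crude stand-in for the exact estimation algorithm of the talk, and the example votes are invented.

```python
import numpy as np

def kendall_tau(pi, sigma):
    """Number of item pairs on which two rankings disagree."""
    pos = {item: i for i, item in enumerate(sigma)}
    d = 0
    for i in range(len(pi)):
        for j in range(i + 1, len(pi)):
            if pos[pi[i]] > pos[pi[j]]:
                d += 1
    return d

def borda_consensus(rankings):
    """Order items by their average rank position across all rankings."""
    items = rankings[0]
    avg = {it: np.mean([r.index(it) for r in rankings]) for it in items}
    return sorted(items, key=avg.get)

votes = [["a", "b", "c", "d"],
         ["a", "c", "b", "d"],
         ["b", "a", "c", "d"]]
center = borda_consensus(votes)
dists = [kendall_tau(r, center) for r in votes]
```

A Mallows model then puts probability proportional to exp(-theta * kendall_tau(r, center)) on each ranking r, which is the exponential-family structure the abstract refers to.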
Tuesday, November 9, 2010 - 12:30
Relax, Compensate and then Recover: A Theory of Anytime, Approximate Inference
I will present in this talk a theory of anytime, approximate inference, which explains some of the most influential algorithms in probabilistic reasoning, yet is based on one of the older ideas in symbolic reasoning: relaxation. According to this theory, the fundamental notion of approximation is that of "relaxing" logical constraints (equalities in particular) for the purpose of decomposing a problem into smaller pieces that can be solved independently. A second fundamental notion is that of "compensation", which calls for imposing weaker and, typically, probabilistic notions of equality to compensate for the relaxed equalities. The third fundamental notion of the presented theory is that of "recovery", where some of the relaxed equalities are incrementally recovered, based on an assessment of their impact on improving the quality of approximations. I will discuss how this theory subsumes loopy belief propagation, one of the most influential algorithms in probabilistic inference, along with some of its generalizations, and comment on its relevance to lifted probabilistic inference. I will also show how this theory has formed the basis of an award-winning solver at the third UAI inference competition in 2010. I will finally discuss the relationship between this theory and current developments in symbolic reasoning, in particular how advances in knowledge compilation can impact the state of the art in approximate (and exact) probabilistic inference.
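The algorithm the theory subsumes, loopy belief propagation, can be shown on a toy model. The sketch below runs generic loopy BP on a three-node cycle of binary variables with an attractive pairwise potential and a unary bias on one node; the potentials are invented, and this is a textbook message-passing loop rather than the solver from the talk.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}   # attractive coupling
phi = {0: np.array([1.0, 3.0]), 1: np.ones(2), 2: np.ones(2)}  # unary bias on node 0

def pairwise(i, j):
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
msgs = {d: np.ones(2) for d in directed}

for _ in range(50):
    new = {}
    for i, j in directed:
        # Product of node i's unary term and all incoming messages except from j.
        pre = phi[i].copy()
        for a, b in directed:
            if b == i and a != j:
                pre *= msgs[(a, b)]
        m = pairwise(i, j).T @ pre          # sum over x_i
        new[(i, j)] = m / m.sum()
    msgs = new

beliefs = {}
for i in range(3):
    b = phi[i].copy()
    for a, bb in directed:
        if bb == i:
            b *= msgs[(a, i)]
    beliefs[i] = b / b.sum()
```

In the relax/compensate reading, the cycle's equality constraints have been relaxed and each normalized message acts as the probabilistic compensation for a relaxed equality; the bias at node 0 propagates around the loop, so every node's belief ends up favoring state 1.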
Tuesday, October 26, 2010 - 12:30
Unsupervised Inference of Genomic Structure from Multiple Functional Genomics Data Sets
Sequence census methods such as ChIP-seq have begun to produce an unprecedented amount of genome-anchored data. Numerous techniques exist to analyze the results of single assays, but genomics research has lacked an integrative method to identify patterns from multiple experiments simultaneously, while taking full advantage of the high-resolution data now available. We have developed such a method, which employs a dynamic Bayesian network to discover joint patterns across different experiment types. We have applied this method to ENCODE chromatin data for the human chronic myeloid leukemia cell line K562. In an unsupervised fashion, we have identified patterns associated with transcription initiation and elongation, as well as transcriptional repression, finding the locations of genes without using DNA sequence, mRNA or multispecies conservation data. We have also successfully performed a semi-supervised identification of enhancer regions across the genome using a curated set of seed enhancers. In both the unsupervised and semi-supervised cases, the method identifies patterns that elucidate the relationships among open chromatin, various transcription factors and histone modifications.
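The simplest instance of a dynamic Bayesian network is a hidden Markov model, and the segmentation idea can be illustrated with one: decode a noisy 1-D signal track into "low" and "high" regions via Viterbi, a toy analogue of labeling genomic positions from an assay track. All parameters and the observation sequence are illustrative assumptions; the actual method uses a richer DBN over many assay types.

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Most likely hidden state path of an HMM (log-space Viterbi)."""
    T, S = len(obs), len(init)
    logv = np.log(init * emit[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best path ending in state i at t-1, moving to j.
        scores = logv[:, None] + np.log(trans) + np.log(emit[:, obs[t]])[None, :]
        back[t] = scores.argmax(0)
        logv = scores.max(0)
    path = [int(logv.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

trans = np.array([[0.9, 0.1], [0.1, 0.9]])   # sticky states -> long segments
emit = np.array([[0.8, 0.2], [0.2, 0.8]])    # state 0 mostly emits 0; state 1 emits 1
init = np.array([0.5, 0.5])
obs = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1]
states = viterbi(obs, trans, emit, init)
```

The sticky transition matrix smooths over isolated flips in the observations, producing contiguous segments, which is the qualitative behavior one wants when annotating genomic regions.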
Tuesday, October 12, 2010 - 12:30
The Road from DBNs to Image Segmentation Is Paved with Submodularity
In this talk, I'll discuss a journey taken by my research interests leading from dynamic graphical models to image segmentation. I'll start out discussing dynamic graphical models (DGMs) and their efficient inference. Entailed in performing inference in such models is the problem of identifying a vertex cut that determines the boundary of a critical periodic graphical component. This component is used when the model is expanded to an arbitrary length, and also determines a local junction tree segment used as part of this expansion. Depending on the boundary of this component, inference costs can change by orders of magnitude. There are many ways of judging the quality of the component without performing inference itself, including the standard vertex cut criterion. Most interesting, however, is the case where the DGM may be sparse, and the quality is determined by an objective that takes this sparsity into account. It turns out that such an objective is submodular. With this in mind, and viewing the vertex cut as a standard graph cut, a new problem naturally arises: finding a minimal cut in a graph where edge sets are judged not by a sum of edge weights but by a submodular function, a problem we call cooperative cut. The remainder of the talk will provide a brief background on submodularity and cooperative cut, describe various hardness results and approximate solutions for this problem, and show how it can be applied to image segmentation. In this latter application, results show significant improvements over the standard (node submodular) graph cut approaches, in particular for images that exhibit a "shrinking bias" problem and for those images with smooth illumination gradients. Joint work with Stefanie Jegelka, with earlier contributions from Chris Bartels.
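The defining property of the submodular edge costs used in cooperative cut can be checked numerically on a toy example. Below, the cost of an edge set is the square root of its total weight (which rewards grouping edges together, unlike a plain sum), and the diminishing-returns inequality is verified directly. The weights and cost function are invented for illustration; this is not the cooperative-cut solver itself.

```python
import numpy as np

weights = {"e1": 4.0, "e2": 1.0, "e3": 2.5}

def cost(S):
    """Submodular edge-set cost: sqrt of total weight, not a sum of weights."""
    return np.sqrt(sum(weights[e] for e in S))

def marginal(e, S):
    """Marginal cost of adding edge e to the set S."""
    return cost(S | {e}) - cost(S)

# Diminishing returns: adding e1 to a bigger set costs less at the margin.
small, big = {"e2"}, {"e2", "e3"}
gain_small = marginal("e1", small)
gain_big = marginal("e1", big)
```

With a modular (additive) cost the two marginals would be equal, and minimum cut is solvable exactly by max-flow; with a genuinely submodular cost like this one, cooperative cut becomes hard and requires the approximation algorithms discussed in the talk.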