Multi-Sensor Object Recognition Using Web Data and Domain Adaptation
In this project we investigate using data such as images and 3D models freely available on the World Wide Web as labeled training data to learn classifiers for detecting objects in both outdoor and indoor scenes.
Project Contributors
Kevin Lai, Dieter Fox, Peter Henry
Main publications
Overview
Personal robotics is an exciting research frontier with a range of potential applications, including domestic housekeeping, care for the sick and the elderly, and office assistance for boosting work productivity. The ability to detect and identify objects in the environment is important if robots are to safely and effectively perform useful tasks in unstructured, dynamic environments such as our homes, offices and hospitals. One major obstacle to robust, many-class object recognition is the need for sufficient labeled training data to learn good classifiers. In this project, we investigate how to significantly reduce the need for manually labeled training data by leveraging data sets available on the World Wide Web. In contrast to much previous work on object recognition, which has focused on a single sensor modality such as vision, we investigate the use of multiple sensor modalities. Specifically, we investigate techniques for combining visual information from camera images and shape information from a depth sensor (e.g. stereo or laser rangefinder) to perform object recognition jointly in a single framework. We use images from LabelMe and 3D models from Google 3D Warehouse to train classifiers for realistic outdoor and indoor scenes that may be encountered by a robot. To deal with the different characteristics of the web data and the real robot data, we additionally use a small set of labeled data collected by the robot and perform domain adaptation.
Classification Technique
Our technique builds upon the exemplar-based distance learning method introduced by Malisiewicz and Efros in 2008 [1]. The original technique performs visual recognition. We adapt it to shape classification by using shape-based features instead of visual features, and we also introduce a technique for domain adaptation. Domain adaptation is a method for leveraging large data sets whose characteristics differ from those of the target application. It is prominent in natural language processing (NLP), where text sources from very different topic domains are often combined to help classification. In robot object recognition, a wealth of exemplars is available on the web, while labeling data collected by a robot is a tedious task. In our RSS 2009 paper, we demonstrate how to achieve good classification results using a large number of web exemplars combined with a small amount of hand-labeled data. Since then, our research focus has been on combining visual and shape information to perform object recognition. Our preliminary findings suggest that combining the two can improve classification performance over using either one alone.
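As a rough illustration of the idea, the Python sketch below learns a non-negative weight vector for a single exemplar, so that a weighted sum of per-feature distances is small for samples of the same class and large for others. It is a simplified sketch under stated assumptions, not the formulation from [1] or from our RSS 2009 paper: the hinge loss, the distance threshold of 1, the projected subgradient solver, the toy numbers, and the name `learn_exemplar_weights` are all hypothetical. Domain adaptation is reduced here to one simple assumption, namely that robot-collected samples count more in the loss than web samples.

```python
import numpy as np

def learn_exemplar_weights(dists, same_class, sample_weight, steps=500, lr=0.05):
    """Learn non-negative weights w for one exemplar so that the distance
    D(z) = w . d(z) tends to fall below 1 for same-class samples and above 1
    for all others.

    dists:         (n, k) elementary distances from the exemplar to n samples,
                   one column per feature (e.g. a shape or visual descriptor)
    same_class:    (n,) bool, True where the sample shares the exemplar's class
    sample_weight: (n,) loss weight per sample (assumed domain weighting)
    """
    dists = np.asarray(dists, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    y = np.where(same_class, 1.0, -1.0)
    w = np.zeros(dists.shape[1])
    for _ in range(steps):
        # Signed margin relative to the distance threshold of 1.
        margin = y * (1.0 - dists @ w)
        # Samples that currently incur hinge loss max(0, 1 - margin).
        active = margin < 1.0
        # Subgradient of the weighted hinge loss with respect to w.
        grad = (sample_weight[active] * y[active]) @ dists[active]
        # Gradient step, then projection onto w >= 0.
        w = np.maximum(w - lr * grad / len(y), 0.0)
    return w

# Toy data (made up): column 0 separates the classes, column 1 is mostly noise.
dists = np.array([[0.1, 0.5], [0.2, 0.1], [2.0, 0.3], [3.0, 0.2]])
same_class = np.array([True, True, False, False])
from_robot = np.array([False, True, False, True])
# Assumed weighting: robot (target-domain) samples count three times as much.
sample_weight = np.where(from_robot, 3.0, 1.0)

w = learn_exemplar_weights(dists, same_class, sample_weight)
D = dists @ w  # learned distance from the exemplar to each sample
```

In this sketch, raising the weight of the robot-collected samples makes errors on target-domain data cost more during learning, which is one simple way to let a few hand-labeled samples steer distance functions that are otherwise trained mostly on web exemplars.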
Examples
Indoor Multi-Sensor Classification
The image below shows the segmentation and classification results for three different objects: mug, book, and water bottle (top to bottom). Each column shows results when using (left to right) shape features only, visual features only, and both combined. Colors indicate the assigned object class (green: mug, brown: laptop, blue: book, red: apple, cyan: water bottle).
Outdoor Shape Classification
We investigated classifying 3D laser scans collected by a vehicle driving through an urban environment. In this work, we used only shape information. 3D models on Google 3D Warehouse served as our online source of training exemplars. The images below show an example. The image on the left is the classification attained by our technique; the one on the right is the ground truth. Cyan is ground, blue is car, purple is pedestrian, green is tree, yellow is street sign, red is building, grey is the background class, and white is unclassified (below the probabilistic classification threshold). See our RSS 2009 paper for details and quantitative experimental results.
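
The "unclassified" label can be pictured with a short Python sketch: a segment keeps its most probable class only if that probability reaches a threshold. This is a minimal sketch under assumptions, not the procedure from the RSS 2009 paper; the threshold value of 0.5, the example probabilities, and the name `label_segments` are hypothetical. The class names are taken from the color legend above.

```python
import numpy as np

CLASSES = ["ground", "car", "pedestrian", "tree",
           "street sign", "building", "background"]

def label_segments(probs, threshold=0.5):
    """Assign each segment its most probable class, or "unclassified" when
    the highest class probability is below the (assumed) threshold."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)                # index of the most probable class
    confident = probs.max(axis=1) >= threshold
    return [CLASSES[i] if ok else "unclassified"
            for i, ok in zip(best, confident)]

# Two made-up segments: one confidently a car, one ambiguous.
example = label_segments([
    [0.05, 0.80, 0.05, 0.02, 0.03, 0.03, 0.02],
    [0.20, 0.15, 0.15, 0.15, 0.10, 0.15, 0.10],
])
```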

The scene, taken from left-, forward-, and right-facing cameras on the vehicle:



References
1. T. Malisiewicz and A. Efros. Recognition by Association via Learning Per-exemplar Distances. In Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2008.