CSE 473 Autumn 1998, Copyright, S. Tanimoto, Univ. of Washington 
Introduction to Artificial Intelligence (Dec 7, 1998)

"Introduction to Image Understanding"

Motivation: Implementing Visual Intelligence

The visual world is the source of a great deal of the knowledge that people have.  The appearances of objects, people, and writing in books all communicate information.  Getting computers to understand things via visual information is a major goal of artificial intelligence.  Some neurosurgeons have estimated that 25 percent of the human brain (by volume) is dedicated to the processes of vision.  It has also been shown by perceptual psychologists that many perceptual tasks require general problem solving to take place.  Others have documented "visual thinking" and the various intelligent processes in the brain that deal with spatial and visual information.

There are many practical applications that would benefit from computer vision: automatic inspection in factories, navigation systems for autonomous robots, and automatic diagnosis of diseases from medical images such as CAT scans and microscope slides.

Preprocessing and Histograms

Computer vision systems usually involve sensors such as digital cameras.  The images that come from these devices often contain small perturbations in their pixel values.  These perturbations are called "noise."  Noise can cause difficulties for many vision algorithms, and thus it is common to filter the incoming images to reduce the amount of noise.  Most image noise is of high spatial frequency, due to its statistical independence from one pixel to the next.  This allows it to be greatly attenuated by the use of low-pass filtering.  A typical low-pass filter is the "moving average" filter.

A horizontal moving average filter scans each line and replaces each pixel value by the average of
its value and those of its n-1 neighbors to the left, where n is the "neighborhood" size, typically 3, 4 or 5.

A bidirectional moving average filter replaces the pixel value by the average of the values in
an n by n box around the pixel.
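For concreteness, here is one way the two filters might be written in Python (a sketch, assuming the NumPy library, a 2D grayscale array, and neighborhoods simply truncated at the image borders):

import numpy as np

def horizontal_moving_average(image, n=3):
    # Replace each pixel by the average of its own value and those
    # of its n-1 neighbors to the left (fewer near the left border).
    img = image.astype(float)
    out = np.empty_like(img)
    for x in range(img.shape[1]):
        left = max(0, x - (n - 1))
        out[:, x] = img[:, left:x + 1].mean(axis=1)
    return out

def box_moving_average(image, n=3):
    # Replace each pixel by the average of the values in an n by n
    # box around it (the box is clipped at the image borders).
    img = image.astype(float)
    out = np.empty_like(img)
    h, w = img.shape
    r = n // 2
    for y in range(h):
        for x in range(w):
            out[y, x] = img[max(0, y - r):y + r + 1,
                            max(0, x - r):x + r + 1].mean()
    return out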

A histogram of the pixel values in an image is a bar graph showing the relative frequency of each value in the range of allowable pixel values.  For a monochrome (i.e., grayscale) image, a typical range of values is 0 to 255, where 0 represents black and 255 represents white.
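Computing a histogram amounts to counting the occurrences of each value.  A sketch (assuming NumPy and nonnegative integer pixel values):

import numpy as np

def histogram(image, levels=256):
    # counts[v] = number of pixels whose value is v, for v in 0..levels-1.
    return np.bincount(image.ravel(), minlength=levels)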

A histogram can be used to detect "bimodality" in an image --- the property of having two prominent pixel values, separated in the range, with other values tending to be quite close to the prominent values or else occurring infrequently.  Bimodal images often consist of a figure on a background, and the bimodality can make it relatively easy to separate the figure from the background, e.g., by choosing a threshold value between the two prominent values.
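Once such a threshold t has been chosen, the separation itself is one comparison per pixel.  A sketch (assuming a NumPy array of pixel values):

def threshold(image, t):
    # Map pixels above t to 1 and the rest to 0; whether the figure
    # is the 1s or the 0s depends on which mode is lighter.
    return (image > t).astype(int)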

Connectedness and Connected Components

Suppose two pixels P1 and P2 in an image have coordinates (x1, y1) and (x2, y2).  Then P1 is a horizontal neighbor of P2 if y1=y2 and either x1=x2-1 or x2=x1-1.  Similarly P1 is a vertical neighbor of P2 if x1=x2 and either y1=y2-1 or y2=y1-1.  We say that P1 is 4-adjacent to P2 if it is either a horizontal or a vertical neighbor of P2.

Suppose there is a sequence of pixels P1, P2, ..., Pm  such that P1 is 4-adjacent to P2, P2 is 4-adjacent to P3, etc.  Then we say that P1 is 4-connected to Pm.

Suppose we have a set of pixels P1, P2, ..., Pk having the same pixel value, and
such that every pair in the set is 4-connected and no other pixel with that pixel value is 4-connected to any pixel in the set.  Then this set is called a 4-connected component of the image.

Two pixels P1 and P2 are said to be 8-adjacent if they are 4-adjacent or they meet at a corner -- i.e., x1 = x2-1 or x2 = x1-1, and y1 = y2-1 or y2 = y1-1.

Then we define 8-connected and 8-connected component in a manner similar to that for 4-connected and 4-connected component.
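These definitions translate directly into predicates.  A sketch in Python (pixels given as (x, y) coordinate pairs):

def four_adjacent(p1, p2):
    # Horizontal or vertical neighbors: coordinates differ by
    # exactly 1 in one direction and 0 in the other.
    (x1, y1), (x2, y2) = p1, p2
    return abs(x1 - x2) + abs(y1 - y2) == 1

def eight_adjacent(p1, p2):
    # 4-adjacent or meeting at a corner: each coordinate differs
    # by at most 1, and the two pixels are distinct.
    (x1, y1), (x2, y2) = p1, p2
    return p1 != p2 and abs(x1 - x2) <= 1 and abs(y1 - y2) <= 1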

One method for computing the 4- or 8-connected components of an image is to scan the image from top to bottom, left to right, looking for the first unmarked pixel.  As soon as an unmarked pixel is found, mark it and start a depth-first search from it, moving at each step only to an adjacent pixel having the same pixel value, marking each new pixel reached and entering it into the connected component for that starting pixel.  When the component is completely marked, the scan proceeds until it finds another unmarked pixel, and so on, until all pixels in the image have been marked and entered into some connected component.
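Here is a sketch of this procedure in Python (the image is given as a list of rows of pixel values; the depth-first search uses an explicit stack, and 4-adjacency is assumed):

def connected_components(image):
    h, w = len(image), len(image[0])
    label = [[None] * w for _ in range(h)]    # None marks an unvisited pixel
    count = 0
    for ys in range(h):                        # scan top to bottom, left to right
        for xs in range(w):
            if label[ys][xs] is not None:
                continue
            count += 1                         # a new component starts here
            label[ys][xs] = count
            stack = [(ys, xs)]                 # depth-first search from the seed
            while stack:
                y, x = stack.pop()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx    # the four 4-adjacent neighbors
                    if (0 <= ny < h and 0 <= nx < w
                            and label[ny][nx] is None
                            and image[ny][nx] == image[ys][xs]):
                        label[ny][nx] = count
                        stack.append((ny, nx))
    return label, count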

Edge Detection With Local Differencing Operators

Another approach to detecting the objects in an image is to first identify the pixels on the boundaries of the objects, and then to follow the boundaries to construct closed curves.  The first step here is called edge detection.  It is often accomplished by computing at every pixel location a value based on its neighbors.  Here are some typical operators.

Horizontal differences:   E(x, y) = F(x, y) - F(x-1, y).
Vertical differences: E(x, y) = F(x, y) - F(x, y-1).

Roberts cross operator:  R(x, y) = sqrt[ (F(x, y)-F(x-1,y-1))^2 + (F(x-1,y)-F(x,y-1))^2]
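These operators can be sketched as follows (assuming NumPy and an image array F indexed as F[y, x]; pixels whose required neighbors fall outside the image are left at zero):

import numpy as np

def horizontal_difference(F):
    F = F.astype(float)
    E = np.zeros_like(F)
    E[:, 1:] = F[:, 1:] - F[:, :-1]    # E(x, y) = F(x, y) - F(x-1, y)
    return E

def roberts_cross(F):
    F = F.astype(float)
    R = np.zeros_like(F)
    d1 = F[1:, 1:] - F[:-1, :-1]       # F(x, y) - F(x-1, y-1)
    d2 = F[1:, :-1] - F[:-1, 1:]       # F(x-1, y) - F(x, y-1)
    R[1:, 1:] = np.sqrt(d1 ** 2 + d2 ** 2)
    return R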

Describing Shape

Shape generally refers to the spatial distribution of the points (pixels) belonging to an object.  Shape can be described in terms of the boundary of the object and its curvature, vertices, etc.  Shape can also be described using specialized mathematical functions and structures.

One of the simplest measures of shape is the aspect ratio of the standard bounding box -- the smallest rectangle (with sides parallel to the coordinate axes) that encloses the object.  Suppose h is the height and w is the width of this box.  Then the aspect ratio is h/w.   This measure will distinguish a circle from, say, a horizontally oriented ellipse.  However, it is not "rotation invariant".  As you rotate any closed curve other than a circle, its enclosing rectangle changes, and so the aspect ratio changes.
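A sketch (the object given as a collection of (x, y) pixel coordinates):

def aspect_ratio(points):
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    w = max(xs) - min(xs) + 1    # width of the standard bounding box
    h = max(ys) - min(ys) + 1    # height of the standard bounding box
    return h / w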

The aspect ratio is an example of a scalar feature -- it is given by a single real number.

Other descriptors of shape can be more complicated.  One such structure is the "medial axis" or "skeleton" of a shape.  The skeleton of a set of points in the 2D plane is another set of points obtained by applying the "medial axis transformation".  (See pp. 648-650 of the text for details.)

Blocks World Scene Analysis with Guzman's Method

During the late 1960s, members of the artificial intelligence research community explored aspects of computer vision that could be solved using combinatorial techniques.  One of these problems is the labelling of line drawings that represent 3D scenes of polyhedra.  A. Guzman, then at the MIT Artificial Intelligence Laboratory, devised a set of vertex labelling and region linking rules that allowed a computer to determine, in many situations, the regions of an image that belong to the same polyhedron.

The vertices of the line drawing are labelled with labels from the following set:
Fork -- three segments meet at a vertex, and no angle is greater than or equal to 180 degrees.
Arrow -- three segments meet at a vertex, and there is one angle greater than 180 degrees.
Tee -- three segments meet at a vertex, and one angle equals 180 degrees.
Ell -- two segments meet at a vertex, and the angle is not equal to 180 degrees.
Multi -- four or more segments meet at a vertex, and all angles are less than 180 degrees.
Peak -- four or more segments meet at a vertex, and one angle is more than 180 degrees.

Linking rules:
At a fork, link the regions across each segment.
At an arrow, link the regions across the shank.
At a peak, link the regions across each segment except the two that make the large angle (greater than 180 degrees).
At a tee, link across the stem of the tee if and only if there is an "opposite" tee in the drawing.  An opposite tee is one whose stem is collinear with this tee's stem, and such that the bars of the tees are towards each other.

Object linking:
Any two regions that are linked to each other with two separate links are declared to be faces of the same polyhedron.
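A sketch of this grouping step in Python (the links produced by the rules above are given as a list of (region, region) pairs; union-find is used here to merge regions joined by two or more links):

from collections import Counter

def group_faces(links):
    parent = {}

    def find(r):                             # union-find with path compression
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Count the separate links joining each pair of regions.
    counts = Counter(frozenset(pair) for pair in links)
    for pair, n in counts.items():
        if n >= 2:                           # two links: faces of the same body
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    # Collect the regions belonging to each polyhedron.
    bodies = {}
    for r in {r for pair in links for r in pair}:
        bodies.setdefault(find(r), []).append(r)
    return list(bodies.values())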

Last modified: December 6, 1998

Steve Tanimoto

tanimoto@cs.washington.edu