I. Overview

The project idea started out as a simple thought: I went fishing yesterday, and I took a picture of the fish I caught. What fish is that? This is just a speicification of the CBIR problem, but we both liked fishes better than general scenery. Since we couldn't find any existing free fish database, we crawled the web and decided that we have plenty of overlapping fish images. The project thus consisted of the foollowing phases:

  1. Retrieving raw fish images
  2. Annotate images, label by latin name
  3. Pre-matching processing: background removal
  4. Shape extraction (feature descriptor)
  5. Shape matching
  6. Scoring scheme

In the proceeding sections, we'll explain each phase in more detail.

II. Fish Database

Below are the web source of the fish images we collected:

http://animals.timduru.org/dirlist/fish/
http://www.reef.org/webres/gallery.htm
http://www.petra-aq.comp.cz/gallery.htm
http://www.flmnh.ufl.edu/fish/gallery/Gallery.asp
http://www.aquahobby.com/e_marine.php
http://www.nativefish.org/Gallery/
http://www.scuba-equipment-usa.com/marine/index.html
http://fishgallery.com/photos/default.aspx
http://www.tetra-fish.com/aquarium/fishgallery.aspx
http://fins.actwin.com/gallery/
http://www.amazonia.com/gallery.html
http://www.pacificislandtravel.com/fr_polynesia/about_destin/fishgallery.html
http://badmanstropicalfish.com/your-fish/fish1.html
http://www.beaufortonline.com/almanac/fish/
http://www.scottsbt.com/fishids/fishid.htm
http://www.elmersaquarium.com/
http://www.calacademy.org/research/ichthyology/species/
http://research.myfwc.com/gallery/view_category.asp?catid=1221&subcatid=5131

Approximately 1000 images were extracted. In practice we used about 200-300 images for our testing.

 

III. Image Pre-Processing

Foreground/Background Separation

In order to do any matching, we have to first identify the fish in the image. We were hoping that this could be an automatic method using available segmentation algorithms. We tried Color Clustering, Color-and-texture clustering, and analyzed the Blobworld method. However none of these turned out to be suitable for our purpose. So finally we settled on a tool called GrabCut, which needs user input for segmentation.
  1. K-means Clustering using color
    The first segmentation we tried was k-means clustering using color information. We decompose the image into RGB color vector and use k-means clustering to find the best segmentation. Since the fish's skin is not a smooth serface, the segmentation breaks down the fish into small little parts. Also usually the fish and the background blends together, which becomes ambiguous for this algorithm. The result shows that this is not a suitable algorithm for Foreground/Background Segmentation. Also, introducing some texture to the cluster should help with the segmentation.


    Original Image

    (color only) k-means clusters using k=4

  2. Clustering using color and texture
    Since texture on fish's skin is very distinguishable, we could make use of that in our segmentation algorithm. Therefore I picked a software, JSEG, that used unsupervised color and texture segmentation to segment my fish images. (http://vision.ece.ucsb.edu/segmentation/jseg/software/) Even though the segmentation becomes much better than the color-only segmentation, the clustering still tend to cluster the fish's tails and fins together with the background. The reasons for this might be that many fish's fins and tails are transparent, and isn't very distinguishable when looking at the images. However, for our matching, we can't afford to lose the fins and tails, or include too much of the background, so this isn't our best approach.


    Original Image

    (color and texture) clusters by JSEG

  3. Blobworld
    We considered the Blobworld segmentation, but it should be similar to color-texture segmentation.

    We spent a while trying to get the segmentation software <http://elib.cs.berkeley.edu/src/blobworld/segmentation_files.tar.gz>, but it did not work because of the settings that are probably differnt from what the programmer had.
  4. GrabCut
    Finally, we decided that fish's images has a lot of charasteristic that make the automatic methods fails. Also there aren't many assumptions we can make about the background because we want to be able to accept fish that swims in the sea, which can have any kind of background. Therefore, it is best to use human knowledge to help with segmentation, by giving them a proper tool. There are many tools that we could use, for example Photoshop magic wand, etc. GrabCut is a nice image foreground/background segmentation tool that uses very little human efforts. GrabCut [Rother et. al] is developed on graph-cut based segmentation technique introduced by Boykov and Jolly at ICCV 2001. GrabCut uses edge and region informationto create an energy function. The algorithm, then, tries to minimize the energy function, which as a result, produces the best segmentation. (The original implementation from Microsoft Research works very well, and usually gets the fish segmented with only one stroke around the fish.)

    Rohit Gupta and Krishnan Ramnath from Carnegie Mellon University provided a matlab implementation of the GrabCut on their webpages at http://www.andrew.cmu.edu/user/mohitg/segmentation.htm. We modified the interface a little for easier refinement by adding the rectangle tools to specified foreground and background.


Original Image


original segmentation with just bounding box chosen



User Interaction: red specified as fish, and blue specified as background

Result after user input

 

Fish Orientation

The matching method will perform best when the fish's orientation is similar, so Fish Orientation is another step of the pre-processing of the fish images. An automatic fish orientation can be done by detecting the fish's mouth and tail and pose it correctly. However due to our limitted time, we didn't get to implement that.

Manual Method:

Most of the well-oriented fish are the ones that face to the left and lay flat on the image plane, so we chose that as our fixed orientation. With that in mind, we can manually orient the fish easily if we know where the fish's month is and where the tail is. So our approach is to ask user to click on two points--first on the mouth and then on the tail.

code file: gui_rotate.m (and batch_gui_rotate.m)

Automatic Attempt:

With the observation that most fish are lengthy, we tried to find the orientation of the fish by looking at the bound box of the fish.

This method works on a portion of the image, but still have a lot of ambiguity. So I decided that it was not good enough to be used.

code file: autorotate.m (and batch_autorotate.m)

Comparing Results

Original Image
Manual Rotate
Auto Rotate
same as original image

 

 

IV. Matching Method

The matching method implemented is based on the paper "Shape Matching and Object Recognition Using Shape Contexts" by Belongie et al.,. The process can be divided into the following steps:

(1) Sampling points

The shape of a fish is captured by a finite subset of points sampled from the external (and sometimes internal contours) on the object. Using an edge detector such as the Canny edge operator, a binary image is obtained. From this binary image, N points are sampled such that the final subset is distributed sufficiently uniform along the edges. The number of points to sample, N , affects the accuracy of the shape representation, and we shall discuss that effect in the Results & Discussion section. For most of the fish images we're using, 100 - 300 points are sufficient.

Shown below are two fish images sampled with 150 points:

OriginalSampled

(2) Shape Context

First we define the problem: given two fish images F1 and F2, we've extracted a representative point subset P and Q, respectively. For each point pi on the first shape, we want to find the "best" matching point qj. How do we define a descriptor for the shape content for each point? What is the "goodness" metric of matching one point against the other?

The descriptor proposed by Belongie et al., is shape context , a feature vector that expresses the configuration of the entire shape relative to the reference point. To reduce the complexity of the feature vector, a coarse histogram is used. For each point pi in a shape, the corresponding feature vector is computed by a histogram hi of the relative coordinates of the remaining N-1 points:

hi(k) = number of points such that it's relative coordinate is in the k-th bin

Bins are uniform in log-polar space, making the descriptor more sensitive to positions of nearby sample points than to those of points farther away.

The cost of matching (pi, qj) thus can be defined using the C2 text statistic:

Cijº C(pi, qj) = ½S [hi(k) - hj(k)]2 / (hi(k) + hj(k))

The shape context is essentially translation invariant. To achieve scale invariance, all radial distances are normalized by mean distance between all point pairs. We ignore the handling of outliers and noise, assuming that our fish images are clean. To achieve rotation invariance, however, we took a more user-interactive approach. Instead of the approach proposed by the paper, which is treating the tangent vector at each point as the positive x-axis, we let users help pinpoint the outline of the fish and orient the input image to a fixed orientation (see Image Pre-Processing section).

(3) Bipartite Graph Matching

Given the parwise cost matrix, we want a one-to-one matching such that the total matching cost is minimized. This is simply the min-cost bipartite matching problem. The Kuhn-Munkres algorithm (also known as the Hungarian method) solves matching problem in O(n3). When implementing the bipartite matching, we tested running time and optimality on three methods: the Hungarian, a near-optimal heuristic approach, and the greedy approach. Performance comparison is given in the Results & Discussion section, and an outline of the heuristic and greedy approach is described in the Appendix. We decided to use the heuristic approach since it gives a sufficiently good solution within reasonable running time.

The Belongie paper used "dummy" nodes to handle outliers. Since our fish images are assumed to be clean, dummy nodes are generally not used. However, if we are to use variable sample points (dependent on the shape size or complexity), dummy nodes may be added with a threshold set to rule out bad matches. A few hand-tuned tests indicate that a threshold of 40 performs pretty well when sample points vary between 100 and 400.

Shown below is a good vs bad matching example. The left one is matching two fish shapes that are extracted from two different images containing the same species of fish (but probably not the same fish...). So this should have a "good" matching. The red dots are the sampled shape points for the query image, and the green dots are those of the correct database match image.

Right AnswerWrong Answer
Just aligning the two shapes points to get a view before the matching...

Right AnswerWrong Answer

The blue lines indicate point-to-point corresspondence. It's obvious that the left one has a more coherent and believable matching.

(4) Modeling Transformation

Given one-to-one correspondences between two pointsets, we can estimate a plane transformation using Thin Plate Splicing. Given a set of data points, a weighted combination of thin plate splines centered about each data point gives the interpolation function that passes through the points exactly while minimizing the "bending energy" - defined as

Since we don't expect corresponding point locations to be perfect, the interpolation is relaxed by means of regularization. By adding a regularization parameter l. We set l equal to a2l0, where a is again the mean distance, and l0 = 1.

We do TPS transformation individually for x and y. In the original paper, they did a matching-TPS reiteration to recover from errorneous correspondences. We consider this as unncessary, so we run bipartite matching only once, then use the bending energy as an indicator for goodness of matching.

(5) Shape Distance

Here comes to fun part: we have the query image of a fish whose identity is unknown, and a database of annotated fish images. We compute the matching cost and bending energy for all database images. Then, what should the scoring theme be? Belongie et al., used a combined distance cost of (1) symmetric sum of shape context matching cost over best matching points + (2) sum of squared brightness differences in Gaussian windows around sample points + (3) bending energy. And they weighted the three distance costs when it was applied to different image sets.

After experimenting with a few image matches, we decided to discard this scoring theme and come up with our own. Instead of using the real costs and weigh them differently, we use a weighted ranking system. Each database image would different rankings for different features, we can then weigh them in whatever way we like without worrying about different features having different cost ranges - for example, a bipartite matching cost could range from 9000 to 50000, depending on sample point size, while bending energy is usually less than 100.

In addition to matching cost and bending energy, we substituted texture for brightness differences. We used the texture feature from Manjunath et al.,, which uses a Gabor wavelet feature vector. We used the Matlab code provided by the author. The difficulty arises however, in deciding the "texture" of a fish. Unlike a grassy plane or cloth, the shape of a fish varies and it's not entirely clear which part of it should be considered as texture. If we include the entire fish image, it would take into account too much of the silhouette outline. Instead we sampled only the middle (supposedly the torso) part of the fish, where stripes, spots, and color patterns are most distinguishable.

A few texture patch example is shown below. We're showing it in color because it's prettier, but in reality the texture patch is converted to grayscale. This makes sense since illumination changes too much over different fish images.

V.1 Results & Discussion - pre-processing

V.2 Results & Discussion - matching

We address the performance of the matching method and discuss the influence of different bipartite matching algorithm, tuning of parameters, etc.

Performance of Bipartite Matching algorithms

In general, shape & texture feature extraction take a reasonable constant amount of time, and they're computed once-and-for-all and then stored in *.mat format. The greatest dragdown of matching speed is caused by the bipartite matching.

Below is a comparison table of running time(secs) on 3 methods: Hungarian, Heuristic, and Greedy.

# points100150200300400
Hungarian1.4224.57811.40550.152143.146
Heuristic0.5791.7814.0469.67125.436
Greedy3.34416.65555.324288.089961.354

As can be observed, the heuristic method is a better choice since it runs faster than the optimal Hungarian method while still achieving a near-optimal solution.

Performance of matching: why and when it works/fails

First of all we need to define the criteria for a "correct match". If the correct corresponding fish shows up as first place, it is undoubtedly a correct answer. But what if it goes second? third? The combination of weights on the different feature rankings could also lead to mismatches, especially when the query fish has the same shape with the correct database fish but has a different texture/color pattern, which is common. For example, if one googles up for images for Betta Splendens (fighter fish), then the following images show up:

Even when one ignores the orientation, the texture pattern is still very different. One might be tempted to say that all the fighter fishes have the same outline shape. That is only partially true: the fighter fishes all have similar head and torso outline, but the tails vary greatly. To make matters worse, some of the fish tails are transparent, which will cause great trouble in gettin a clean cut. Even when the query images is near noise-free, insufficient sample points may cause the details of the tail to be ignored, and the shape context descriptor would only represent it as a huge blob of tail fin.

We'll present examples on good and bad matches following subsections.

Zebrasoma Falvescens: A Bad Match

This is our query:

We try to match it against a 36-image database, which contains the right answer:

But when we match it, ths best match appeared to be something seemingly very different. The best match turns out to be Melichthys Niger!

First we look at bipartite shape matching. The highest ranking fish is Colisa Lalia. We compare the contour of the query, Colisa Lalia, and the right answer.

The query actually looks more similar to the shape of C. Lalia - a lenghtier body with the eye area captured. M. Niger ranks No.6 in shape matching.

M. Niger ranks No.1 in TPS bending energy and No.3 in texture, while the right answer ranks 15 and 23, correspondingly. It's not surprising that the texture ranking turned out this way, since by the way we sampled the texture it includes partial outline shapes, especially when the fish is round.

This leads to the thought that we should come up with a better way of sampling texture patches. Unfortunately, we didn't have enough time....

Abudefduf Saxatilis: A Good Match

The matching of Abudefduf Saxatilis matches perfectly - without surprise, since the two input images were extremely similar. The query fish shape was:

And the best result (and correct answer) was:

The right answer ranks 1 on shape, 3 on bending, and 10 on texture. The reason for the slightly lower rank on texture can be speculated: the texture patch is actually not that similar around the torso center, since the right answer had straight vertical black stripes with yellow taint, and the query image didn't have yellow taint and had more curved stripes. Still, the overall good ranking across 3 categories yielded the correct answer.

Chaetodon Vagabundus: A good matching result with high cost

This experiment was done with lavishness: Hungarian method for optimal min-cost bipartite matching, variable sample points (100, 200, 300, or 400 points based on image size).

And the best result (and correct answer) was:

And this match was done out of a database 86 fish images, with image size up to 600x600. It was encouraging to get this result, given the fact that the input had only 100 sample points (a 206x130 image), and the query had 300 (a 650x400 image). The matching took at least 40 minutes, which is why we didn't venture future in using Hungarian method + large database.

The running time of the matching system is heavily dependent on two factors: database size and sample point size. The sample point size increases the matching time for each single matching cubically. Since the matching is done sequentially (although they could be done concurrently under parallel implementation), the running time goes up linearly. In a word,

RunTime = (size of database) x (single matching time)

VI. User Interface

We divided our user graphical interface in two parts: the image pre-processing part and the database matching part.

User Interface for Image Pre-Processing

Part I: GrabCut (AutoCutRefine.m)

This program is developed by Rohit Gupta and Krishnan Ramnath from Carnegie Mellon University, with our modification in the user interface.

Part II: Fish Orientation (gui_rotate.m)

This function is called by passing the image filename. It will show the image on the screen.

 

User Interface for Database Matching

The Fish Bank GUI can be started by typing at matlab command line
fishGUI

which will start up the GUI that looks like this:

You can load a raw image (accepted formats are BMP, PPM, and JPG) by clicking Load Image, or use Load Record to load a .mat file that already has the query image's features extracted and stored. A image Record contains its original image, a thumbnail, coordinates of sampled points, shape context descriptor (histogram), texture feature vector, and most important of one, the fish's venerable latin name. (NOTE: we understand that nobody knows the scientific name of a fish besides biologists, but common names have too many aliases and confusion).

Once you have a image/record loaded, you can match it against a compiled database by clicking Do Query!. You can select any existing database, or you can first use Add image to DB to compile your own database.

When the matching is done it will be displayed (up to top 15 matches can be displayed). You can save the matching record for future reference.

VII. Future Works

Below is a list of things we think that can be done to improve the matching:
  1. Faster bipartite matching algorithm
  2. Parallelization
  3. Automated mechanism for extracting image information
  4. Improve texture feature by sampling a proper patch
  5. Rigorous approach for deciding which and how many points to sample
  6. Fish is an interesting object (animal) for computer vision. A freely-available fish database that have images with latin and common name labeled would be good for research in this area.
  7. Better segmentation that takes into account transparency of object could be useful to do foreground/background segmentation of fish.
  8. GrabCut is a very powerful tool, but some times the result images' edges are very hard and not pleasant to look at. If it could learn to make better edges, that would be useful.

VIII. References

[1] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: image segmentation using expectationmaximization and its application to image querying. IEEE PAMI, pages 1026–38, Aug 2002.

[2] Rother, C., Kolmogorov, V., and Blake, A. Grabcut - interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23, 3 (August 2004), 309–314.

[3] Deng, Y.[Yining], Manjunath, B.S., Unsupervised Segmentation of Color-Texture Regions in Images and Video, PAMI(23), No. 8, August 2001, pp. 800-810.

[4] S. Belongie and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO.24, Apr 2002

[5] B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval of image data IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI - Special issue on Digital Libraries), vol. 18, no. 8, pp. 837-42, Aug. 1996.

[6] I.Yamazaki et al., Segmenting Point Sets