The project idea started out as a simple thought: I went fishing yesterday, and I took a picture of the fish I caught. What fish is that? This is just a speicification of the CBIR problem, but we both liked fishes better than general scenery. Since we couldn't find any existing free fish database, we crawled the web and decided that we have plenty of overlapping fish images. The project thus consisted of the foollowing phases:
In the proceeding sections, we'll explain each phase in more detail.
Below are the web source of the fish images we collected:
http://animals.timduru.org/dirlist/fish/
http://www.reef.org/webres/gallery.htm
http://www.petra-aq.comp.cz/gallery.htm
http://www.flmnh.ufl.edu/fish/gallery/Gallery.asp
http://www.aquahobby.com/e_marine.php
http://www.nativefish.org/Gallery/
http://www.scuba-equipment-usa.com/marine/index.html
http://fishgallery.com/photos/default.aspx
http://www.tetra-fish.com/aquarium/fishgallery.aspx
http://fins.actwin.com/gallery/
http://www.amazonia.com/gallery.html
http://www.pacificislandtravel.com/fr_polynesia/about_destin/fishgallery.html
http://badmanstropicalfish.com/your-fish/fish1.html
http://www.beaufortonline.com/almanac/fish/
http://www.scottsbt.com/fishids/fishid.htm
http://www.elmersaquarium.com/
http://www.calacademy.org/research/ichthyology/species/
http://research.myfwc.com/gallery/view_category.asp?catid=1221&subcatid=5131
Approximately 1000 images were extracted. In practice we used about 200-300 images for our testing.
![]() Original Image |
![]() (color only) k-means clusters using k=4 |
![]() Original Image |
![]() (color and texture) clusters by JSEG |
|
|
![]() User Interaction: red specified as fish, and blue specified as background |
![]() Result after user input |
The matching method will perform best when the fish's orientation is similar, so Fish Orientation is another step of the pre-processing of the fish images. An automatic fish orientation can be done by detecting the fish's mouth and tail and pose it correctly. However due to our limitted time, we didn't get to implement that.
Most of the well-oriented fish are the ones that face to the left and lay flat on the image plane, so we chose that as our fixed orientation. With that in mind, we can manually orient the fish easily if we know where the fish's month is and where the tail is. So our approach is to ask user to click on two points--first on the mouth and then on the tail.
code file: gui_rotate.m (and batch_gui_rotate.m)
With the observation that most fish are lengthy, we tried to find the orientation of the fish by looking at the bound box of the fish.
- If the fish is in a vertical pose, the height will bevsiginificantly larger than the width. The image should be rotated either 90 degrees either left or right.
- If the fish is in a diagonal pose, if we flip it with the angle of the diagonal axis of the bounding box, we should get signifcantly smaller area of bounding box of the fish
This method works on a portion of the image, but still have a lot of ambiguity. So I decided that it was not good enough to be used.
code file: autorotate.m (and batch_autorotate.m)
Original Image |
Manual Rotate |
Auto Rotate |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
same as original image |
![]() |
![]() |
![]() |
The matching method implemented is based on the paper "Shape Matching and Object Recognition Using Shape Contexts" by Belongie et al.,. The process can be divided into the following steps:
The shape of a fish is captured by a finite subset of points sampled from the external (and sometimes internal contours) on the object. Using an edge detector such as the Canny edge operator, a binary image is obtained. From this binary image, N points are sampled such that the final subset is distributed sufficiently uniform along the edges. The number of points to sample, N , affects the accuracy of the shape representation, and we shall discuss that effect in the Results & Discussion section. For most of the fish images we're using, 100 - 300 points are sufficient.
Shown below are two fish images sampled with 150 points:
| Original | Sampled |
![]() |
![]() |
![]() |
![]() |
First we define the problem: given two fish images F1 and F2, we've extracted a representative point subset P and Q, respectively. For each point pi on the first shape, we want to find the "best" matching point qj. How do we define a descriptor for the shape content for each point? What is the "goodness" metric of matching one point against the other?
The descriptor proposed by Belongie et al., is shape context , a feature vector that expresses the configuration of the entire shape relative to the reference point. To reduce the complexity of the feature vector, a coarse histogram is used. For each point pi in a shape, the corresponding feature vector is computed by a histogram hi of the relative coordinates of the remaining N-1 points:
Bins are uniform in log-polar space, making the descriptor more sensitive to positions of nearby sample points than to those of points farther away.
The cost of matching (pi, qj) thus can be defined using the C2 text statistic:
The shape context is essentially translation invariant. To achieve scale invariance, all radial distances are normalized by mean distance between all point pairs. We ignore the handling of outliers and noise, assuming that our fish images are clean. To achieve rotation invariance, however, we took a more user-interactive approach. Instead of the approach proposed by the paper, which is treating the tangent vector at each point as the positive x-axis, we let users help pinpoint the outline of the fish and orient the input image to a fixed orientation (see Image Pre-Processing section).
Given the parwise cost matrix, we want a one-to-one matching such that the total matching cost is minimized. This is simply the min-cost bipartite matching problem. The Kuhn-Munkres algorithm (also known as the Hungarian method) solves matching problem in O(n3). When implementing the bipartite matching, we tested running time and optimality on three methods: the Hungarian, a near-optimal heuristic approach, and the greedy approach. Performance comparison is given in the Results & Discussion section, and an outline of the heuristic and greedy approach is described in the Appendix. We decided to use the heuristic approach since it gives a sufficiently good solution within reasonable running time.
The Belongie paper used "dummy" nodes to handle outliers. Since our fish images are assumed to be clean, dummy nodes are generally not used. However, if we are to use variable sample points (dependent on the shape size or complexity), dummy nodes may be added with a threshold set to rule out bad matches. A few hand-tuned tests indicate that a threshold of 40 performs pretty well when sample points vary between 100 and 400.
Shown below is a good vs bad matching example. The left one is matching two fish shapes that are extracted from two different images containing the same species of fish (but probably not the same fish...). So this should have a "good" matching. The red dots are the sampled shape points for the query image, and the green dots are those of the correct database match image.
| Right Answer | Wrong Answer |
![]() | ![]() |
| Just aligning the two shapes points to get a view before the matching... |
| Right Answer | Wrong Answer |
![]() | ![]() |
The blue lines indicate point-to-point corresspondence. It's obvious that the left one has a more coherent and believable matching.
Given one-to-one correspondences between two pointsets, we can estimate a plane transformation using Thin Plate Splicing. Given a set of data points, a weighted combination of thin plate splines centered about each data point gives the interpolation function that passes through the points exactly while minimizing the "bending energy" - defined as

Since we don't expect corresponding point locations to be perfect, the interpolation is relaxed by means of regularization. By adding a regularization parameter l. We set l equal to a2l0, where a is again the mean distance, and l0 = 1.
We do TPS transformation individually for x and y. In the original paper, they did a matching-TPS reiteration to recover from errorneous correspondences. We consider this as unncessary, so we run bipartite matching only once, then use the bending energy as an indicator for goodness of matching.
Here comes to fun part: we have the query image of a fish whose identity is unknown, and a database of annotated fish images. We compute the matching cost and bending energy for all database images. Then, what should the scoring theme be? Belongie et al., used a combined distance cost of (1) symmetric sum of shape context matching cost over best matching points + (2) sum of squared brightness differences in Gaussian windows around sample points + (3) bending energy. And they weighted the three distance costs when it was applied to different image sets.
After experimenting with a few image matches, we decided to discard this scoring theme and come up with our own. Instead of using the real costs and weigh them differently, we use a weighted ranking system. Each database image would different rankings for different features, we can then weigh them in whatever way we like without worrying about different features having different cost ranges - for example, a bipartite matching cost could range from 9000 to 50000, depending on sample point size, while bending energy is usually less than 100.
In addition to matching cost and bending energy, we substituted texture for brightness differences. We used the texture feature from Manjunath et al.,, which uses a Gabor wavelet feature vector. We used the Matlab code provided by the author. The difficulty arises however, in deciding the "texture" of a fish. Unlike a grassy plane or cloth, the shape of a fish varies and it's not entirely clear which part of it should be considered as texture. If we include the entire fish image, it would take into account too much of the silhouette outline. Instead we sampled only the middle (supposedly the torso) part of the fish, where stripes, spots, and color patterns are most distinguishable.
A few texture patch example is shown below. We're showing it in color because it's prettier, but in reality the texture patch is converted to grayscale. This makes sense since illumination changes too much over different fish images.
![]() |
![]() |
![]() |
In general, shape & texture feature extraction take a reasonable constant amount of time, and they're computed once-and-for-all and then stored in *.mat format. The greatest dragdown of matching speed is caused by the bipartite matching.
Below is a comparison table of running time(secs) on 3 methods: Hungarian, Heuristic, and Greedy.
| # points | 100 | 150 | 200 | 300 | 400 |
| Hungarian | 1.422 | 4.578 | 11.405 | 50.152 | 143.146 |
| Heuristic | 0.579 | 1.781 | 4.046 | 9.671 | 25.436 |
| Greedy | 3.344 | 16.655 | 55.324 | 288.089 | 961.354 |
As can be observed, the heuristic method is a better choice since it runs faster than the optimal Hungarian method while still achieving a near-optimal solution.
First of all we need to define the criteria for a "correct match". If the correct corresponding fish shows up as first place, it is undoubtedly a correct answer. But what if it goes second? third? The combination of weights on the different feature rankings could also lead to mismatches, especially when the query fish has the same shape with the correct database fish but has a different texture/color pattern, which is common. For example, if one googles up for images for Betta Splendens (fighter fish), then the following images show up:

Even when one ignores the orientation, the texture pattern is still very different. One might be tempted to say that all the fighter fishes have the same outline shape. That is only partially true: the fighter fishes all have similar head and torso outline, but the tails vary greatly. To make matters worse, some of the fish tails are transparent, which will cause great trouble in gettin a clean cut. Even when the query images is near noise-free, insufficient sample points may cause the details of the tail to be ignored, and the shape context descriptor would only represent it as a huge blob of tail fin.
We'll present examples on good and bad matches following subsections.
This is our query:
We try to match it against a 36-image database, which contains the right answer:
But when we match it, ths best match appeared to be something seemingly very different. The best match turns out to be Melichthys Niger!
First we look at bipartite shape matching. The highest ranking fish is Colisa Lalia. We compare the contour of the query, Colisa Lalia, and the right answer.
The query actually looks more similar to the shape of C. Lalia - a lenghtier body with the eye area captured. M. Niger ranks No.6 in shape matching.
M. Niger ranks No.1 in TPS bending energy and No.3 in texture, while the right answer ranks 15 and 23, correspondingly. It's not surprising that the texture ranking turned out this way, since by the way we sampled the texture it includes partial outline shapes, especially when the fish is round.
This leads to the thought that we should come up with a better way of sampling texture patches. Unfortunately, we didn't have enough time....
The matching of Abudefduf Saxatilis matches perfectly - without surprise, since the two input images were extremely similar. The query fish shape was:
And the best result (and correct answer) was:
The right answer ranks 1 on shape, 3 on bending, and 10 on texture. The reason for the slightly lower rank on texture can be speculated: the texture patch is actually not that similar around the torso center, since the right answer had straight vertical black stripes with yellow taint, and the query image didn't have yellow taint and had more curved stripes. Still, the overall good ranking across 3 categories yielded the correct answer.
This experiment was done with lavishness: Hungarian method for optimal min-cost bipartite matching, variable sample points (100, 200, 300, or 400 points based on image size).
And the best result (and correct answer) was:
And this match was done out of a database 86 fish images, with image size up to 600x600. It was encouraging to get this result, given the fact that the input had only 100 sample points (a 206x130 image), and the query had 300 (a 650x400 image). The matching took at least 40 minutes, which is why we didn't venture future in using Hungarian method + large database.
The running time of the matching system is heavily dependent on two factors: database size and sample point size. The sample point size increases the matching time for each single matching cubically. Since the matching is done sequentially (although they could be done concurrently under parallel implementation), the running time goes up linearly. In a word,
Part I: GrabCut (AutoCutRefine.m)
This program is developed by Rohit Gupta and Krishnan Ramnath from Carnegie Mellon University, with our modification in the user interface.
Part II: Fish Orientation (gui_rotate.m)
This function is called by passing the image filename. It will show the image on the screen.
fishGUI
which will start up the GUI that looks like this:

You can load a raw image (accepted formats are BMP, PPM, and JPG) by clicking Load Image, or use Load Record to load a .mat file that already has the query image's features extracted and stored. A image Record contains its original image, a thumbnail, coordinates of sampled points, shape context descriptor (histogram), texture feature vector, and most important of one, the fish's venerable latin name. (NOTE: we understand that nobody knows the scientific name of a fish besides biologists, but common names have too many aliases and confusion).

Once you have a image/record loaded, you can match it against a compiled database by clicking Do Query!. You can select any existing database, or you can first use Add image to DB to compile your own database.

When the matching is done it will be displayed (up to top 15 matches can be displayed). You can save the matching record for future reference.

[1] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: image segmentation using expectationmaximization and its application to image querying. IEEE PAMI, pages 1026–38, Aug 2002.
[2] Rother, C., Kolmogorov, V., and Blake, A. Grabcut - interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23, 3 (August 2004), 309–314.
[3] Deng, Y.[Yining], Manjunath, B.S., Unsupervised Segmentation of Color-Texture Regions in Images and Video, PAMI(23), No. 8, August 2001, pp. 800-810.
[4] S. Belongie and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO.24, Apr 2002
[5] B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval of image data IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI - Special issue on Digital Libraries), vol. 18, no. 8, pp. 837-42, Aug. 1996.
[6] I.Yamazaki et al., Segmenting Point Sets