Project 4 Report: Automatic Tone Mapping By Learning User Preferences
CSE576, Spring 2008
Shulin Yang, Xiaoyu Chen, and Xing Li
June 06, 2008
1 Introduction
In this project, we developed an automatic tone mapping
system that produces images in a user-specific manner by learning the user's
preferred tone-mapping parameters for different images. After user training on a limited
set of images and some on-the-fly learning, our system is able to
provide parameter-free tone mapping. Compared with conventional tone-mapping
software, our system can save users many manual operations and much time.
Moreover, we implemented HDR construction and compression to generate an HDR
image from a set of images with different exposures. Our goal is to provide
automatic tone manipulation for both ordinary images and HDR images.
2 Related works
2.1 Tone manipulation
A great deal of work has been done on the tone
manipulation problem [1, 2, 3, 4]. Some of this work focuses on the high dynamic
range compression problem for HDR images, which we discuss in Section 2.2. To
apply tone reproduction to ordinary images, many different tone mapping
operators have been proposed over the decades. A common way to classify these
approaches is into global operators and local operators. Specifically,
global operators use a curve to map each pixel to a display value, and local
operators use information from the image patch around a pixel to adjust
its value. Generally speaking, global operators are faster, while
local (spatially variant) operators are better at preserving the local contrasts of an
image.
Although tone mapping algorithms have been developed for a long time [5]
and some of the operators are called “automatic”, most of them require
parameter tweaking to obtain a good result for a particular input image. Parameter
tweaking remains a problem for all tone mapping algorithms, and research on
the adjustment of tonal values is ongoing. Some operators have
been extended to dynamic and interactive settings [4].
2.2 HDR compression
The goal of HDR compression is to spatially vary the
mapping from scene luminance to display luminance while preserving local
contrasts. There are multiple ways to compress an HDR image [6, 7]. One common
method is to decompose an HDR image into multi-scale detail layers and a base
layer. By reducing the contrast of the base layer while boosting the fine-scale
detail layers, a reduced dynamic range image with well-preserved details is
reconstructed by combining the compressed base layer and the boosted detail layers.
To perform the multi-scale decomposition, several methods are available, based on
different smoothing filters such as the bilateral filter, anisotropic diffusion,
and weighted least squares. All of these are local, non-linear,
edge-preserving filters. The problem with global linear smoothing filters, such
as the Laplacian pyramid, is the halo artifacts produced near edges. This
problem may also exist for some non-linear filters and is still an open research
area.
Apart from the multi-scale decomposition approach described above, there is
also a gradient domain HDR compression method [8]. It starts from the observation that any
drastic change in luminance across a high dynamic range image must give rise
to large-magnitude luminance gradients at some scale. Fine details, such as
texture, on the other hand, correspond to gradients of much smaller magnitude.
Based on this observation, HDR compression can be achieved by identifying large
gradients at various scales and attenuating their magnitudes while keeping their
direction unaltered. The attenuation must be progressive, penalizing larger
gradients more heavily than smaller ones, thus compressing drastic luminance
changes, while preserving fine details. A reduced dynamic range image is then
reconstructed from the attenuated gradient field.
2.3 GIST
Research in scene understanding has traditionally treated
objects as the atoms of recognition. However, behavioral experiments on fast scene
perception suggest that we do not need to perceive the objects in a scene to
identify its semantic category; the spatial layout is more important for scene
recognition [9]. The literature in this area has proposed different approaches to
represent the gist of a scene; for example, some methods are based on the
analysis of texture [10]. In particular, a computational model
of scene recognition [11] bypasses the problem of segmentation and the
processing of individual objects or regions. The procedure is based on a very
low-dimensional representation of the scene, called the spatial envelope. The
dominant spatial structure of a scene is represented by a set of perceptual
dimensions: naturalness, roughness, openness, expansion, and
ruggedness. These dimensions can be reliably estimated using spectral and
coarsely localized information. Based on these perceptual dimensions, the
spatial envelope descriptor constructs a multidimensional space
in which similar scenes, such as streets and
highways, are projected close together.
3 Approaches
We applied and developed several tone-mapping techniques,
including detail manipulation, intensity manipulation, and color manipulation.
Based on six intuitive parameters, our system provides simple but effective ways
to perform tone mapping on an image. In parallel, we implemented HDR
construction and compression to generate an HDR image.
3.1 System design
Our system first learns a user’s preferences for
tone-mapping parameters. An initial user database is set up by having the user
complete a training session on a set of images. After that, a new image loaded by the user is
automatically tone mapped with parameters learned from the user
database. The user can choose to change the parameters manually and re-process the
image. Finally, the final tone-mapping parameters, along with the features of the
input image, can be stored to extend the user database; we call this online
learning of preferences. The details of our system design are shown
in the following diagram.
- Pre-processing: We collect a set of training images and create four new
images for each of them. Each of the four new images is generated by applying
one setting of the tone-mapping parameters of our system, and it represents a
different style of the original image.
- Preference training: During training, the user is shown an original
image and four modified images with different styles. The user is then asked to
choose a favorite among the four images. The system records in the user’s
database: 1) the features of the original training image; 2) the set of
parameters associated with the modified image that the user chose.
- Feature extraction: When the user loads a new image into the system, the
spatial features and the color & intensity features of the image are
extracted separately.
- Learning tone-mapping parameters: After extracting the features of the
new image, the system searches the database for the image that is the
nearest neighbor of the new image. The nearest neighbor is identified by
calculating the similarity between the new image and each image in the database
based on their features. The parameters that the user preferred for that
image are then applied to the new image.
- On-line learning: After an output image is generated, if the result is not
satisfactory, the user can adjust the parameters and re-process the
input image. The final parameter setting can be saved to the user database.
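The database workflow above (store features with preferred parameters, look up the nearest neighbor, extend the database on-line) can be sketched as follows. This is only an illustrative sketch: the class name `UserDatabase`, the parameter names `detail` and `r_shift`, and the placeholder similarity function are assumptions made for the example and are not taken from our implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Features = Sequence[float]
Params = Dict[str, float]


@dataclass
class UserDatabase:
    """Per-user store of (image features, preferred parameters) pairs."""
    entries: List[tuple] = field(default_factory=list)

    def add(self, features: Features, params: Params) -> None:
        # Used both by preference training and by on-line learning.
        self.entries.append((list(features), dict(params)))

    def lookup(self, features: Features,
               similarity: Callable[[Features, Features], float]) -> Params:
        # Nearest neighbor = stored image with the highest similarity score.
        if not self.entries:
            raise ValueError("database is empty; run preference training first")
        best = max(self.entries, key=lambda e: similarity(features, e[0]))
        return dict(best[1])


def negative_squared_distance(a: Features, b: Features) -> float:
    # Placeholder similarity for this demo only (not the metric of Section 3.3).
    return -sum((x - y) ** 2 for x, y in zip(a, b))


db = UserDatabase()
db.add([0.1, 0.9], {"detail": 0.5, "r_shift": 0.0})
db.add([0.8, 0.2], {"detail": -0.3, "r_shift": 0.1})

# A new image whose (toy) features are closest to the second entry.
params = db.lookup([0.75, 0.25], negative_squared_distance)

# On-line learning: the user tweaks the result and saves it back.
params["detail"] = 0.2
db.add([0.75, 0.25], params)
```

Passing the similarity function as an argument keeps the lookup independent of the particular feature metric, so the weighted Pearson similarity could be plugged in without changing the database code.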
3.2 Tone mapping techniques
Our system uses three kinds of tone mapping techniques:
detail manipulation, intensity manipulation, and color
manipulation.
3.2.1 Detail manipulation
In some images, there may be either too little or too much detail. In
our tone mapping method, we use an edge-preserving operator based on the WLS
(weighted least squares) optimization framework to extract an edge layer at a
certain scale, and then decrease or increase edge information by attenuating or
exaggerating this edge layer.
We chose the WLS-based operator because it has been shown to
extract detail at arbitrary scales while effectively avoiding halo artifacts [4].
Moreover, the WLS-based operator is robust and particularly well suited for
progressive coarsening of images and for detail extraction at various spatial
scales.
We use WLS to extract an edge layer E from the original image I: E = I -
wls(I). We can then exaggerate or attenuate the edge layer: I’ = I + E*t. When
t>0, the edge information in the image is increased, and the image
appears to contain more detail; when t<0, edge information is decreased and
the image appears smoother. The results of detail manipulation are
shown in Section “Experiments”.
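The two formulas E = I - wls(I) and I’ = I + E*t can be illustrated with a short sketch. Note the substitution: a plain box filter stands in for the WLS smoother here, purely to keep the example self-contained, so the sketch does not reproduce the edge-preserving behavior of WLS. The function names are hypothetical.

```python
import numpy as np


def box_smooth(image: np.ndarray, radius: int = 2) -> np.ndarray:
    """Stand-in smoother (NOT WLS): mean over a (2r+1) x (2r+1) window."""
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(image, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / size ** 2


def manipulate_detail(image: np.ndarray, t: float, smooth=box_smooth) -> np.ndarray:
    """Apply I' = I + E*t, where E = I - smooth(I) is the edge layer."""
    image = image.astype(float)
    edge_layer = image - smooth(image)
    return image + edge_layer * t


rng = np.random.default_rng(0)
img = rng.random((16, 16))               # toy grayscale image in [0, 1)
sharper = manipulate_detail(img, 1.0)    # t > 0: detail exaggerated
smoother = manipulate_detail(img, -1.0)  # t < 0: detail attenuated
```

With t = -1 the edge layer is removed entirely and the result equals the smoothed image; with t = 0 the image is unchanged. The output is not clipped, so a real pipeline would still need to bring values back into the display range.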
3.2.2 Intensity manipulation
Many of the problems of an image
come from its lighting conditions, for example, being too bright or too dark. Although
some of these problems cause scene information to be lost and thus cannot be
corrected by simply modifying the image, adjusting the intensity of an image can still
be effective in improving how it looks. Therefore, we provide two kinds of
intensity operations: intensity shift and intensity exaggeration.
For both of these operations, we use a function that maps the
original intensity value of a pixel to a new intensity value. The mapping
function determines how the intensity values of an image are adjusted.
For intensity shift, we use a concave or convex function to modify the intensity values. The
curvature of the function determines the extent to which the intensity values are
modified. The following graphs are examples of the two kinds of intensity shift: the
left one lightens the whole image since it is a concave function, and the
right one darkens the whole image since it is a convex function.
For intensity exaggeration (contrast), we use functions with an “S” shape to modify the
intensity. When the function has a positive “S” shape (as in the right graph),
it spreads the intensity values of an image away from the center and enlarges
their differences. When the function has the opposite “S” shape (as in the left
graph), it pulls the intensity values of an image toward the center.
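The report does not give the exact mapping functions, so the following sketch uses assumed ones with the described shapes: a power-law (gamma) curve for intensity shift and a signed power curve around the mid-point 0.5 for intensity exaggeration. The parameterization by t (exponent 2^(-t)) is also an assumption chosen so that t = 0 leaves the image unchanged.

```python
import numpy as np


def intensity_shift(x, t: float):
    """Concave curve (lighten) for t > 0, convex curve (darken) for t < 0."""
    gamma = 2.0 ** (-t)                  # assumed parameterization
    return np.clip(x, 0.0, 1.0) ** gamma


def intensity_exaggerate(x, t: float):
    """S-shaped curve around 0.5: t > 0 spreads values, t < 0 pulls them in."""
    u = 2.0 * np.clip(x, 0.0, 1.0) - 1.0  # map [0, 1] to [-1, 1]
    p = 2.0 ** (-t)                        # p < 1 is steep at the center
    return 0.5 + 0.5 * np.sign(u) * np.abs(u) ** p


x = np.linspace(0.0, 1.0, 11)            # sample intensities
lighter = intensity_shift(x, 1.0)        # x ** 0.5
darker = intensity_shift(x, -1.0)        # x ** 2
more_contrast = intensity_exaggerate(x, 1.0)
less_contrast = intensity_exaggerate(x, -1.0)
```

Both curves keep 0 and 1 fixed, and the exaggeration curve also keeps 0.5 fixed, so only the distribution of intensities between the end points changes.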
3.2.3 Color manipulation
Adjusting the colors of an image can play an
important role in making it look better. Therefore, as with intensity, the
values of the color channels of an image can be shifted or exaggerated.
Color shift enables a user to shift the color of an image along the
R, G, or B channel. Specifically, R shift increases the value of the R channel of an
image while reducing the values of the other two color channels. Similarly, G shift
increases the G channel while reducing the R and B channels, and B shift
increases the B channel while reducing the R and G channels. The formulas
for R shift are:
r’ = r (1+t)
g’ = g (1-t’)
b’ = b (1-t’)
In the above formulas, r, g, and b are the original values of the three
channels, and r’, g’, and b’ are the values after R shift. t is a parameter
representing the extent to which the value of the R channel is increased. We
define t’ = k_r*r*t/(k_g*g+k_b*b), where k_r, k_g, and k_b are the weights of
the three channels in the intensity i = k_r*r + k_g*g + k_b*b, so that the
intensity value of a pixel is not changed.
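The R shift formulas can be checked numerically with a small sketch. The channel weights k_r, k_g, k_b are not listed in this report; the values below (0.299, 0.587, 0.114) are an assumption used only for illustration, and the function names are hypothetical. The input is expected as an N x 3 array of RGB pixels.

```python
import numpy as np

# Assumed intensity weights (k_r, k_g, k_b); the report does not list them.
K = np.array([0.299, 0.587, 0.114])


def intensity(rgb: np.ndarray) -> np.ndarray:
    """i = k_r*r + k_g*g + k_b*b for an array of shape (N, 3)."""
    return rgb @ K


def r_shift(rgb: np.ndarray, t: float) -> np.ndarray:
    """r' = r(1+t), g' = g(1-t'), b' = b(1-t'), t' = k_r*r*t/(k_g*g+k_b*b)."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    k_r, k_g, k_b = K
    denom = k_g * g + k_b * b
    # t' is undefined when g = b = 0; such pixels keep g = b = 0.
    t_prime = np.divide(k_r * r * t, denom,
                        out=np.zeros_like(denom), where=denom > 0)
    return np.stack([r * (1 + t), g * (1 - t_prime), b * (1 - t_prime)], axis=1)


pixels = np.array([[0.5, 0.4, 0.3],
                   [0.2, 0.6, 0.1]])
shifted = r_shift(pixels, 0.2)
```

Substituting t’ into the intensity gives k_r*r*(1+t) + (k_g*g + k_b*b)*(1-t’) = i + k_r*r*t - k_r*r*t = i, which is why the per-pixel intensity stays the same. The sketch does not clip the result, so large t can push g’ and b’ below zero.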
Color exaggeration enables a user to enlarge the contrast among the
color channels. Specifically, the new values for the color channels r’, g’, b’
are calculated as follows:
r’ = i + (r-i) t
g’ = i + (g-i) t
b’ = i + (b-i) t
where t is a parameter representing the extent to which all color channels
are exaggerated, and i is the intensity value of the pixel. The results of color
shift and color exaggeration are also shown in Section “Experiments”.
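A minimal sketch of the color exaggeration formulas follows. As before, the intensity weights are an assumption (0.299, 0.587, 0.114) because the report does not list them, and the function name is hypothetical. The input is an N x 3 array of RGB pixels.

```python
import numpy as np

# Assumed intensity weights (they sum to 1); not specified in the report.
K = np.array([0.299, 0.587, 0.114])


def exaggerate_color(rgb: np.ndarray, t: float) -> np.ndarray:
    """c' = i + (c - i) * t for each channel c, with i the pixel intensity."""
    rgb = np.asarray(rgb, dtype=float)
    i = (rgb @ K)[:, np.newaxis]     # per-pixel intensity, shape (N, 1)
    return i + (rgb - i) * t


pixels = np.array([[0.5, 0.4, 0.3],
                   [0.2, 0.6, 0.1]])
vivid = exaggerate_color(pixels, 1.5)   # t > 1: colors pushed away from gray
gray = exaggerate_color(pixels, 0.0)    # t = 0: every channel equals i
```

Because the assumed weights sum to 1, the intensity of each pixel is unchanged for any t: the weighted sum of (c - i) over the three channels is zero. Setting t = 1 returns the original image.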
3.3 Image similarity
We measure the similarity of two images from two
aspects: 1) the overall color and intensity of the image; 2) the scene content of
the image. We use an intensity histogram and color histograms to measure image
similarity in terms of overall intensity and color. For the
scene content of the images, we use a Gist feature descriptor to represent the
images and measure image similarity using the Gist features. The Pearson
correlation coefficient is used to compare the feature vectors for
the histograms and for Gist separately, and we then use the weighted sum of the two
coefficients as the final similarity of the two images.
- Spatial features. As an effective approach to extracting spatial information
from an image, the Gist descriptor is used in our system to represent image
scene content. Specifically, we apply the spatial envelope feature
descriptor [11], which produces a vector of 960 spatial features for each
image. All the values are normalized to lie between 0 and 1.
- Color & intensity features. We represent each histogram (intensity or color)
by a vector of size ten, obtained by dividing the histogram into 10
bins and counting the pixels in each bin. We thus get a
vector of size 40 (10 for the intensity histogram, and 10 for each of the R/G/B
histograms) to represent the color/intensity histogram information of an image.
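- Pearson correlation coefficient. We use the Pearson correlation coefficient to
measure the similarity of two feature vectors. Given two vectors X={Xi} and
Y={Yi} of length n, the Pearson correlation is computed as follows:
\[ r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right) \]
where \( (X_i - \bar{X}) / s_X \), \( \bar{X} \), and \( s_X \) are the standard score,
mean and standard deviation of X (the same for Y). The output r is the average
product of the two standard scores computed from each vector.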
- Combining the histogram features with the Gist features. After computing the Pearson
correlation coefficients for both the spatial features and the color & intensity
features of two images, we use the weighted sum of the two coefficients as the
final similarity of the two images:
S = (w1*r1+w2*r2)/(w1+w2)
In our implementation, w1 and w2 are both set to 0.5.
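The feature and similarity computations above can be sketched as follows. The Gist descriptor itself is not implemented here (random 960-element vectors stand in for it), the intensity weights are assumed, pixel values are assumed to lie in [0, 1], and assigning r1 to the spatial features and r2 to the histogram features is an assumption about the naming. The Pearson coefficient is computed as the average product of standard scores.

```python
import numpy as np

K = np.array([0.299, 0.587, 0.114])  # assumed intensity weights


def histogram_features(rgb: np.ndarray, bins: int = 10) -> np.ndarray:
    """40-element vector: 10-bin histograms of intensity, R, G, and B."""
    rgb = np.asarray(rgb, dtype=float)
    channels = [np.clip(rgb @ K, 0.0, 1.0),
                rgb[..., 0], rgb[..., 1], rgb[..., 2]]
    hists = [np.histogram(c, bins=bins, range=(0.0, 1.0))[0] for c in channels]
    return np.concatenate(hists).astype(float)


def pearson(x, y) -> float:
    """Average product of the standard scores of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()    # undefined for constant vectors
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))


def similarity(gist_a, hist_a, gist_b, hist_b, w1=0.5, w2=0.5) -> float:
    """S = (w1*r1 + w2*r2) / (w1 + w2)."""
    r1 = pearson(gist_a, gist_b)     # spatial features
    r2 = pearson(hist_a, hist_b)     # color & intensity features
    return (w1 * r1 + w2 * r2) / (w1 + w2)


rng = np.random.default_rng(1)
img_a = rng.random((8, 8, 3))        # toy RGB images in [0, 1)
img_b = rng.random((8, 8, 3))
gist_a = rng.random(960)             # placeholders for real Gist vectors
gist_b = rng.random(960)
hist_a = histogram_features(img_a)
hist_b = histogram_features(img_b)
score = similarity(gist_a, hist_a, gist_b, hist_b)
```

Since each coefficient lies in [-1, 1], the combined score S also lies in [-1, 1], and an image compared with itself scores 1.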
3.4 HDR construction and compression
3.4.1 HDR construction
High dynamic range radiance maps of real scenes
can be constructed from a few photographs of the scene taken with different exposures.
The specific method we use is described in [12]. The following image acquisition
pipeline shows how scene radiance becomes pixel values. Unknown nonlinear
mappings can occur during exposure, development, scanning, digitization, and
remapping. The algorithm determines the aggregate mapping from scene
radiance L to pixel values Z from a set of differently exposed images.
After the response function of the imaging process has been recovered, the
algorithm can fuse the multiple photographs into a single, high dynamic range
radiance map whose pixel values are proportional to the true radiance values in
the scene.
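The fusion step can be sketched as below, following the weighted average of log radiance estimates used in [12]: for each pixel, ln E is the weighted mean of g(Z) - ln(Δt) over the exposures, with a hat-shaped weight that favors mid-range pixel values. The sketch assumes the response curve g is already known; the synthetic response used in the demo is hypothetical and only serves to generate test data.

```python
import numpy as np


def hat_weight(z, z_min: int = 0, z_max: int = 255) -> np.ndarray:
    """Triangle weight: highest for mid-range pixel values, 0 at the extremes."""
    z = np.asarray(z, dtype=float)
    mid = 0.5 * (z_min + z_max)
    return np.where(z <= mid, z - z_min, z_max - z)


def fuse_log_radiance(images, exposure_times, g: np.ndarray) -> np.ndarray:
    """Return ln(E) per pixel from 8-bit exposures and a response curve g."""
    num = 0.0
    den = 0.0
    for z, dt in zip(images, exposure_times):
        w = hat_weight(z)
        num = num + w * (g[z] - np.log(dt))   # g[z] = ln(exposure) for value z
        den = den + w
    return num / np.maximum(den, 1e-8)        # guard pixels with zero weight


# Synthetic demo with a hypothetical response: pixel value Z = E*dt - 1.
g = np.log(np.arange(256) + 1.0)
true_radiance = np.array([450.0, 4500.0, 90000.0])
times = [1 / 750, 1 / 180, 1 / 45]
images = [np.clip(np.round(true_radiance * dt - 1.0), 0, 255).astype(np.uint8)
          for dt in times]
log_radiance = fuse_log_radiance(images, times, g)
```

Saturated or fully dark pixels receive zero weight, so each radiance value is estimated mainly from the exposures in which it is well exposed. For color images the same fusion would be applied per channel with that channel's response curve.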
3.4.2 HDR compression
In our project, we implemented HDR
compression using WLS-based multi-scale decomposition. Specifically, we
apply a 4-level decomposition of the log-luminance channel into one coarse base
layer and three detail layers, multiply each level by a scaling factor, and
reconstruct a new log-luminance channel. There are two reasons for working in
the log domain: 1) the logarithm of the luminance is a crude approximation of
the perceived brightness; 2) gradients in the log domain correspond to ratios
(local contrasts) in the luminance domain. The processing block diagram [7] is
displayed as follows.
Our goal is to generate a rather flat image with exaggerated local contrasts.
This is achieved by compressing the base layer and boosting the fine-scale
detail layers. Specifically, the scaling factor for the base layer
needs to be set smaller than the
other scaling factors, whereas the scaling factor
for the finest-scale detail layer
needs to be the largest.
Usually the factor for the finest-scale detail layer
is set to 1 and the factor for the base layer
is set to around 0.2.
Consequently, the factors for the two intermediate detail layers
are chosen to lie between these two values.
As for the scaling factor for the color channel, it is
suggested to be set equal to [...].
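The layer scaling described above can be sketched as follows. Two assumptions apply: a box filter with growing radius stands in for the WLS smoother (so the sketch does not reproduce its edge-preserving behavior), and the factors are ordered from the finest detail layer to the base layer. The function names and radii are hypothetical, and color handling is omitted.

```python
import numpy as np


def box_smooth(image: np.ndarray, radius: int) -> np.ndarray:
    """Stand-in smoother (NOT WLS): mean over a (2r+1) x (2r+1) window."""
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(image, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / size ** 2


def decompose(log_lum: np.ndarray, radii=(1, 2, 4)):
    """Return [finest detail, middle detail, coarse detail, base]."""
    layers = []
    current = log_lum.astype(float)
    for r in radii:
        smoothed = box_smooth(current, r)
        layers.append(current - smoothed)   # detail removed at this scale
        current = smoothed
    layers.append(current)                  # what remains is the base layer
    return layers


def compress(log_lum: np.ndarray, factors=(1.0, 0.8, 0.4, 0.16)) -> np.ndarray:
    """Scale each layer and sum them into a new log-luminance channel."""
    layers = decompose(log_lum)
    return sum(w * layer for w, layer in zip(factors, layers))


# Toy log-luminance ramp spanning 8 natural-log units.
log_lum = np.tile(np.linspace(0.0, 8.0, 32), (8, 1))
compressed = compress(log_lum)
```

The layers sum back to the input exactly, so factors of all 1 return the original channel; shrinking the base factor reduces the overall range while the detail factors control how much local contrast is kept.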
3.5 GUI design
We designed a GUI to integrate most of the functions of
our system and to facilitate user interaction. The training
interface shown below is used to collect a user's preferences on pre-processed images. Given a
training image A, we pre-process it with four different settings of the tone-mapping
parameters. After the user selects a favorite processed image B, the tone-mapping
parameters used to generate image B, together with the features of image A, are
stored in the user’s database.
The functions of the training interface include:
- Guide the user through all the training images.
- Allow the user to choose a favorite processed image for each training
image.
- Store the tone-mapping parameters used to generate the selected processed
image, along with the features of the original training image, in the user’s
database.
- Exit the training process.
The main interface shown below allows a user to process a new image. After a
new image is loaded, the system will search in the user’s database to find the
most similar image by comparing the features of the new image and the features
of each image in the database. Then, the system will extract the parameters used
to process the most similar image and apply them to process the new image. The
user is allowed to adjust the automatically extracted parameters and to
re-process the image. The user can also save the new image and its tone-mapping
parameters to the database.
The functions of the main interface include:
- Load in a new image and automatically process the image based on the
user’s database.
- Change the tone-mapping parameters and re-process the image.
- Save the tone-mapping parameters along with the features of the new image
into the user’s database.
- Save the processed image.
- Exit the system.
4 Experiments
4.1 Applying individual tone manipulation techniques
The following images
show the results of detail manipulation. The left one is the original image. The
one in the middle is the result with image details attenuated, while the right
one is the result with image details exaggerated. The results of
both increasing and decreasing details look natural.
The following images show the results of color manipulation. The left one is
the original image. The one in the middle is the result of applying R shift to the
original image, while the right one is the result of exaggerating the image
color. With image intensity remaining the same, both color shift and color
exaggeration produce visually appealing output.
4.2 System evaluation
We selected five pairs of images (called
original pairs) with different scenes and colors: cottage,
flower, iceberg, sea, and waterfall. Splitting each pair, we
obtained two sets of images to evaluate our system: one set was used as training
images, while the other set was used as test images. After user training, the
features of the five training images and the tone-mapping parameters used to
process them were stored in the user database (details below). For each test
image, the system first identified the most similar image in the database.
(Ideally, the most similar image to a test image should be its partner in the
original pair.) Then, the parameters applied to the most similar image were used
to process the test image. If the system works well, we should see that
the test image is processed in the same way as its partner in the original
pair.
4.2.1 User training
For each user, our system generates a database to
record the features of a set of images and the tone-mapping parameters used to
process the images in a user-specific manner. For simplicity, we assume that
there is only one user in the following experiments.
We show below the training interface for each training image, which contains
four pre-processed images of different styles. The selected radio button (the
one with the red dot) indicates which processed image the user preferred, and the
corresponding parameters were stored in the database along with the features of
the training image.
4.2.2 Learning the most similar image
As described in Section “Image
similarity”, we used the weighted sum of two Pearson correlation coefficients,
which measures image similarity in terms of scene content as well as color and
intensity. The following table contains the similarity of each test image to each
training image. Every test image has its highest similarity
with its partner in the original pair (the diagonal entries of the table). This indicates that our
metric is effective in measuring image similarity on this set, and it allows
our system to identify the most similar image by finding the nearest
neighbor.
| Image | Cottage_train | Flower_train | Iceberg_train | Sea_train | Waterfall_train |
| Cottage_test | 0.442 | 0.375 | 0.263 | 0.331 | 0.309 |
| Flower_test | 0.418 | 0.561 | 0.044 | -0.087 | 0.264 |
| Iceberg_test | 0.326 | -0.062 | 0.558 | 0.364 | 0.149 |
| Sea_test | 0.135 | -0.243 | 0.134 | 0.365 | -0.244 |
| Waterfall_test | 0.383 | 0.354 | 0.161 | 0.172 | 0.681 |
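The nearest-neighbor claim can be checked directly from the similarity values in the table. The sketch below copies those values, picks the most similar training image for each test image, and also computes the margin between the best and second-best match; the variable names are chosen for this example only.

```python
import numpy as np

scenes = ["Cottage", "Flower", "Iceberg", "Sea", "Waterfall"]

# Rows: test images; columns: training images (values from the table above).
similarity = np.array([
    [0.442,  0.375, 0.263,  0.331,  0.309],
    [0.418,  0.561, 0.044, -0.087,  0.264],
    [0.326, -0.062, 0.558,  0.364,  0.149],
    [0.135, -0.243, 0.134,  0.365, -0.244],
    [0.383,  0.354, 0.161,  0.172,  0.681],
])

# Nearest neighbor = training image with the highest similarity in each row.
best = similarity.argmax(axis=1)
matches = [scenes[j] for j in best]

# Margin between the best and the second-best match for each test image.
ordered = np.sort(similarity, axis=1)
margins = ordered[:, -1] - ordered[:, -2]
closest_call = scenes[int(margins.argmin())]
```

Every test image is matched to its partner, and the smallest margin belongs to the cottage image (0.442 against 0.375 for the flower training image).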
4.2.3 Results for the test images
After identifying the most similar image
for a test image, the tone-mapping parameters associated with that
image were applied to the test image. We show below the results of automatic
tone mapping for each test image. For comparison, we also show the image in the
database that was identified as the most similar to the test image. (Top
left: test image; Top right: automatically tone-mapped image; Bottom left: the
most similar image identified from the database; Bottom right: the pre-processed
image that the user selected during training.)
Waterfall
Cottage
Flower
Iceberg
Sea
4.3 Online learning
We used the system to process one of our own images
(shown below on the left). The system performed automatic tone mapping based on
the database of five images (described in Section “System evaluation”), and the
resulting image is shown below in the middle. The result looked reddish and lacked
detail, so we reduced the red slightly and added more detail. The
processed image is shown below on the right. It looks much better than the
original one, and we saved the parameters used to generate it to the database
along with the features of the original image. In total, we then had six images in
the database.
We then used our system to process two more images taken at a similar place.
The results after automatic tone mapping are shown below.
The results suggest that our system identified the last image saved to the
database as the most similar image to the two new images, and that the
parameters we had just saved to the database were applied to the new images. This
illustrates the value of the online learning function of our system: although
it starts with a small database, our system can enrich the database
in a user-specific manner while it is used.
4.4 Results on more images
To generate results on more images, we first
expanded our database to ten images by performing online learning. The
four added images are shown below. (Left: original image; Right: processed
image.)
We then applied our system to some randomly picked images. Some of the results are
interesting, as shown below:
However, the results for some images are not very good:
Since the original photo was taken under very dark conditions, most of the detail
information is lost in the original image. As a result, it is hard to recover
the information simply by adjusting lighting and color globally.
For some images, even if both their spatial features and color/intensity
features are very similar, they can still have very different styles. In
such cases, applying the same
parameters to those images can still produce unnatural results.
Our similarity metric is based on the spatial features and the
color/intensity features of an image, and no object-oriented measure is
involved. For images that focus on a specific object in the
scene, measuring similarity based on Gist features may fail to
find the most similar image.
4.5 Results on HDR
4.5.1 HDR reconstruction
The following photographs were taken with
increasing exposure times: 1/750 s, 1/180 s, and 1/45 s.
In order to recover the high dynamic range radiance map, we first need to
combine them to estimate the response function for each color channel.
By applying the construction algorithm described earlier, we get the estimated
response functions as follows. Here, polynomial fitting has been employed to
suppress the effect of noise. The recovered response curve corresponds to the solid
line in each picture.
After we get the response function, we can fuse the multiple photographs into
a single high dynamic range radiance map. The dynamic range of the
reconstructed radiance is about 3000:1. If we map the high dynamic range image
linearly to the display range of 0~255 without any tone manipulation, we get a
very poor-looking picture, as follows.
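A small numeric sketch shows why the linear mapping fails for a 3000:1 range. The radiance samples below are hypothetical: nine values spaced evenly on a logarithmic scale between 1 and 3000, mapped linearly to 0~255.

```python
import numpy as np


def linear_to_display(radiance) -> np.ndarray:
    """Map radiance linearly so that min -> 0 and max -> 255."""
    radiance = np.asarray(radiance, dtype=float)
    lo, hi = radiance.min(), radiance.max()
    return np.round(255.0 * (radiance - lo) / (hi - lo)).astype(np.uint8)


# Hypothetical radiance samples, evenly spaced in log scale over 3000:1.
radiance = np.geomspace(1.0, 3000.0, 9)
display = linear_to_display(radiance)

# Number of samples that end up in the darkest 10 display levels.
dark_count = int(np.sum(display < 10))
```

The lower half of the logarithmic range (radiance up to about 55) is squeezed into fewer than ten display levels, so shadows and mid-tones collapse to near black while only the brightest regions keep visible variation.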
4.5.2 HDR compression
After constructing the HDR radiance map, we
apply the WLS-based decomposition to the HDR image. The scaling factors for the
levels are set to 1.0, 0.8, 0.4 and 0.16. The compressed image is displayed as
follows. As we can see, the details are well preserved, but the color does not
look pleasant. The reason for this problem is twofold: 1) when reconstructing
the imaging system response curve for each color channel, the scaling factors
relating relative radiance to absolute radiance for each channel are unknown. As
a result, the color balance of the radiance map may be changed. 2) During the
multi-scale decomposition, the scaling factor for the color channel
may not be chosen properly.
In this experiment, we found that it is very important to choose proper
smoothness coefficients for the WLS formulation. Otherwise, some details at
certain scales may be lost during the decomposition. Another important problem
is how to set the scaling factor for each decomposition level. Different
choices affect the overall dynamic range and the local contrasts of the resulting
image.
4.5.3 Apply automatic tone mapping to HDR
In order to refine the
resulting low dynamic range image, we feed it into our system to perform
automatic tone mapping. We then get a pleasing HDR image, as follows. This example
indicates the effectiveness of our system.
Here is another example. The exposure times of the photographs are as follows:
1/500 s, 1/125 s, 1/30 s, 1/8 s, and 1/2 s. All the results are obtained in the same manner
as described above.
The color of the compressed image also appears washed out.
After refining it with our automatic tone manipulation system, the HDR image
looks quite good.
5 Conclusion & future work
5.1 Conclusion
In summary, we designed a system for automatic image tone
mapping. The basic idea is to apply previously used tone mapping parameters to new
images that have similar color/intensity histograms and spatial features. At the
beginning, when the database contains only a small set of images, on-line
learning is important: it makes use of the same user’s earlier parameter
settings for later images. As the user database is enriched, the system becomes
more capable of producing images without the user adjusting parameters. The more the
system is used, the more powerful it becomes.
5.2 Future work
First, we can extend the system to provide more
options for color manipulation, rather than merely color shift and exaggeration
for the R/G/B channels. Second, we want to use an alternative image similarity metric
to combine spatial information and tone information, rather than merely
taking their weighted sum. Third, we can
generate multiple automatic tone-mapping outputs for a new image, instead of
only one. Finally, we want to expand the database to include a large
number of pre-processed images.
References
[1] Erik Reinhard et al. Photographic tone reproduction for digital images.
[2] Kresimir Matkovic et al. A survey of tone mapping techniques.
[3] Dani Lischinski et al. Interactive local adjustment of tonal values.
[4] Zeev Farbman et al. Edge-preserving decompositions for multi-scale tone and detail manipulation.
[5] Erik Reinhard et al. High dynamic range imaging.
[6] Fredo Durand et al. Fast bilateral filtering for the display of high dynamic range images.
[7] Jack Tumblin et al. LCIS: a boundary hierarchy for detail-preserving contrast reduction.
[8] Raanan Fattal et al. Gradient domain high dynamic range compression.
[9] Aude Oliva et al. Building the gist of a scene: the role of global image features in recognition.
[10] Laura Walker Renninger et al. When is scene identification just texture recognition?
[11] Aude Oliva et al. Modeling the shape of the scene: a holistic representation of the spatial envelope.
[12] Paul E. Debevec et al. Recovering high dynamic range radiance maps from photographs.