CSE 576 Final Project

Fishbowl VR

Peter Henry and Eric Holk

Introduction

We have created a virtual reality environment augmented with head tracking information from a commodity webcam. Our system adjusts the perspective of the 3D image on screen such that it is correct from the user's point of view. This allows for far more immersive 3D visualization experiences, and raises several possibilities for more natural interaction with rendered 3D scenes.

Related Work

This work was primarily inspired by the head tracking VR display done by Johnny Lee from CMU using the Nintendo Wii Remote. Our work extends this by requiring only hardware that is commonly available in today's laptop PCs.

Students at the University of Toronto have also improved a molecular visualization system by using face tracking to control the point of view.

Origin Instruments has an existing product Headmouse that requires a tracking dot to be worn, and tracks the head position in only two dimensions.

Free Track is another existing solution that can track a number of degrees of freedom, but it again requires the user to wear trackable hardware on their head.

To the best of our knowledge, our work is unique among face-tracking VR systems because it combines face detection with patch tracking, which enables smoother facial position measurements while still maintaining a framerate acceptable for interactive experiences.

Algorithm

Face Detection

We use the Haar classifier included with OpenCV as part of our solution. With the provided frontal-face classifier, this detector can quickly locate the largest face-like feature in the camera image. However, because it works by finding the most likely face across discrete scales, it is of limited use for determining depth, and exhibits noise and fluctuations in the z-direction. This manifests itself visually as scene objects constantly flickering towards and away from the user, even if the user is attempting to hold their head stationary. It also makes no use of the temporally sequential nature of our video frames, so it will sometimes jump to detect an unrelated object in the camera's view for a single frame before jumping back to the true face in the next frame.

It is fast and reliable enough to provide one of our position estimates, and to determine the initial region for our other head position estimator, the frame-to-frame feature tracker.

Feature Tracking

Given an initial face detection, our program finds Harris corners within a subset region of the detected face, in order to detect only points actually on the face. We detect 80 feature points, and store a 7x7 grayscale patch as the feature descriptor. In the next frame, we sample points from the same region, and attempt to find the best similarity transform to describe the motion of feature points between frames (with the origin as the center of the last detection region).
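The patch comparison can be illustrated with a small sketch. It assumes patches are stored as nested lists of grayscale values and that each previous patch is matched to its lowest-SSD patch in the current frame by exhaustive comparison; the function names and the exhaustive search are illustrative assumptions, not necessarily how our program is organized.

```python
def patch_ssd(a, b):
    """Sum of squared differences between two equally sized grayscale patches."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def best_matches(prev_patches, curr_patches, keep=40):
    """Match each previous patch to its lowest-SSD current patch.

    Returns up to `keep` tuples (prev_index, curr_index, ssd), best first.
    Assumes curr_patches is non-empty.
    """
    matches = []
    for i, p in enumerate(prev_patches):
        # Exhaustive search over the current frame's patches (an assumption).
        j, score = min(((j, patch_ssd(p, c)) for j, c in enumerate(curr_patches)),
                       key=lambda m: m[1])
        matches.append((i, j, score))
    matches.sort(key=lambda m: m[2])
    return matches[:keep]
```

Keeping only the lowest-SSD matches gives the later robust fitting step a candidate set that is already biased toward correct correspondences.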

We chose the similarity transform because it has only 4 degrees of freedom, yet represents all the transformations that an assumed planar face undergoes during program use. The matrix A has only two degrees of freedom, and corresponds to a scaled rotation matrix. As rotation is irrelevant to the determination of depth, we use the determinant of the A matrix, which equals the square of the scale factor, to find the scale change between frames. From this we can calculate the change in face depth between frames much more precisely than from the discrete scale detections of the face detector. The translation vector t provides the x and y position change between frames.
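For reference, a similarity transform is commonly written as follows. The symbols a, b, s, and θ are the usual parameterization and are introduced here for illustration; the last line additionally assumes a pinhole camera, where apparent size is inversely proportional to depth.

```latex
\begin{aligned}
\mathbf{x}' &= A\,\mathbf{x} + \mathbf{t}, \qquad
A = \begin{pmatrix} a & -b \\ b & a \end{pmatrix}
  = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \\
\det A &= a^2 + b^2 = s^2, \qquad s = \sqrt{\det A}, \\
z_{\text{new}} &= z_{\text{old}} / s .
\end{aligned}
```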

In running RANSAC, we first use only the 40 best matches as determined by patch SSD. We then run 200 RANSAC iterations on two randomly selected matches to find the similarity transform that has the most inlier matches (defined as falling within 2 pixels). We then compute the least squares similarity transformation for all inlier matches, and use this to compute our estimate of face position. This estimated position and scale is used to define the region from which feature points are sampled in the next frame.
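The procedure above can be sketched roughly as follows. This is a minimal illustration rather than our actual code: it represents 2D points as complex numbers so that a similarity transform becomes q = a*p + b (with |a| the scale), and the function names are hypothetical. The iteration count and the 2-pixel inlier threshold follow the text.

```python
import random


def fit_similarity(src, dst):
    """Least-squares similarity transform q ~ a*p + b for complex points.

    The complex number a encodes the scaled rotation (|a| is the scale)
    and b encodes the translation.
    """
    n = len(src)
    mean_p = sum(src) / n
    mean_q = sum(dst) / n
    num = sum((p - mean_p).conjugate() * (q - mean_q) for p, q in zip(src, dst))
    den = sum(abs(p - mean_p) ** 2 for p in src)
    a = num / den
    return a, mean_q - a * mean_p


def ransac_similarity(src, dst, iterations=200, threshold=2.0, rng=random):
    """Return (a, b, inlier_indices) for the best-supported transform, or None."""
    indices = range(len(src))
    best_inliers = []
    for _ in range(iterations):
        i, j = rng.sample(indices, 2)
        if src[i] == src[j]:
            continue  # degenerate pair: identical source points
        a, b = fit_similarity([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in indices
                   if abs(a * src[k] + b - dst[k]) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        return None
    # Refit on all inliers of the best hypothesis.
    a, b = fit_similarity([src[k] for k in best_inliers],
                          [dst[k] for k in best_inliers])
    return a, b, best_inliers
```

Two matches are the minimum needed to determine the four parameters of a similarity transform, which is why each hypothesis is built from a pair.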

The feature tracker provides much more precise measurements of scale changes, but is not effective at modeling fast motion, because the face can fall too far outside the feature search region defined in the previous frame. One solution to this would be to simply enlarge the region searched for features, but we then have the problem of picking up background features for which the identity transformation is the best solution. Another issue is that the small transformations cause drift over time, which can ultimately lead to the region no longer catching enough face features, and as a result, the face is lost.

To combat these issues, in our program, for every Haar classifier face detection, we measure the difference between the current feature region and the detection region. If four successive detections fall outside our difference threshold (defined as a total of 18 pixels difference among x, y, and z measurements), we reset the feature region to match the detector region. This can be observed towards the end of the first demonstration video, where the blue rectangle lags behind the green detections as I move quickly, but ultimately snaps to the center of the green rectangle and continues tracking the face. This also allows the tracked face to change automatically during execution if a new user takes over.
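The reset rule can be sketched as a small counter. The sketch assumes the "total difference" is the sum of absolute differences over x, y, and z (all in pixels) and that a detection within the threshold clears the count; the class name is hypothetical.

```python
class RegionResetMonitor:
    """Signals a reset after several successive out-of-threshold detections."""

    def __init__(self, threshold=18.0, required_misses=4):
        self.threshold = threshold
        self.required_misses = required_misses
        self.misses = 0

    def update(self, tracked, detected):
        """Compare (x, y, z) of the feature region and the detection region.

        Returns True when the feature region should snap to the detection.
        """
        # Assumption: total difference is the sum of absolute differences.
        difference = sum(abs(t - d) for t, d in zip(tracked, detected))
        if difference > self.threshold:
            self.misses += 1
        else:
            self.misses = 0  # agreement breaks the run of misses
        if self.misses >= self.required_misses:
            self.misses = 0
            return True
        return False
```

Requiring several successive disagreements keeps a single spurious detection from pulling the feature region away from the face.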

Position Estimation and Viewing Perspective

Given the x-offset, y-offset, and face height in pixels, we desire the corresponding real-world coordinates of the user's head. We estimate the field of view of the camera, and assume a detected face height of 0.15 meters. Note that both of these values could be obtained through calibration by first measuring the camera's field of view, and then having the user position their head at some prescribed distance from the screen.

Given the camera FOV and the face height in meters, we know the conversion from pixels to meters. We can then use similar triangles to produce an estimate of the user's displacement from the center of the camera in meters. We must then use the correct off-axis perspective for viewing the scene.
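As a worked sketch of the similar-triangles step, the following assumes a pinhole camera with a known vertical field of view and pixel offsets measured from the image center. The function and parameter names are illustrative; only the 0.15 meter face height comes from the text.

```python
import math

ASSUMED_FACE_HEIGHT_M = 0.15  # assumed real-world face height, from the text


def head_position(x_px, y_px, face_height_px, image_height_px, vertical_fov_deg):
    """Estimate the head position (x, y, z) in meters relative to the camera.

    x_px and y_px are the face-center offsets from the image center in pixels.
    """
    # Focal length in pixels, derived from the vertical field of view.
    focal_px = (image_height_px / 2.0) / math.tan(math.radians(vertical_fov_deg) / 2.0)
    # Similar triangles: face_height_px / focal_px == face_height_m / z.
    z = focal_px * ASSUMED_FACE_HEIGHT_M / face_height_px
    # At the face's depth, one pixel spans this many meters.
    meters_per_pixel = ASSUMED_FACE_HEIGHT_M / face_height_px
    return x_px * meters_per_pixel, y_px * meters_per_pixel, z
```

For example, with a 90 degree field of view and a 480-pixel-tall image, a face 120 pixels tall is estimated to be 0.3 meters from the camera.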

In OpenGL, the simplest method is to use the glFrustum() function to define the viewing frustum.

Our projection plane corresponds to the screen, which falls between the near-z and far-z. Objects in front of the projection plane appear to float in front of the screen. Using x-offset as an example, we see that the ratio of screen-z to x-offset equals the ratio of near-z to glFrustum() offset. When we measure the screen-z as the estimate of the head's real-world z coordinate, this allows us to define the appropriate off-axis projection that corresponds to the user's current head position.
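The ratio described above can be turned into frustum bounds as in the sketch below. It assumes the head position is expressed relative to the center of the screen, in the same units as the screen dimensions, and it only computes the six values in the order glFrustum() takes them (left, right, bottom, top, near, far); the function name and parameters are illustrative.

```python
def off_axis_frustum(head_x, head_y, head_z, screen_width, screen_height, near, far):
    """Return (left, right, bottom, top, near, far) for an off-axis frustum.

    head_x, head_y: head offset from the screen center.
    head_z: distance from the head to the screen plane (screen-z).
    """
    if head_z <= 0:
        raise ValueError("head must be in front of the screen")
    scale = near / head_z  # ratio of near-z to screen-z
    left = (-screen_width / 2.0 - head_x) * scale
    right = (screen_width / 2.0 - head_x) * scale
    bottom = (-screen_height / 2.0 - head_y) * scale
    top = (screen_height / 2.0 - head_y) * scale
    return left, right, bottom, top, near, far
```

In a typical setup the scene would also be translated by the negated head position before rendering, so that the frustum's apex sits at the viewer's eye.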

Position Filtering

Using both the face detector and feature tracker gives us two estimates for the user's head position. Both of these inputs have a certain amount of noise inherent in them, which we would like to remove as much as possible. We are fortunate that the two different trackers have largely orthogonal strengths and weaknesses, which makes it possible to devise an algorithm that combines the strengths of both. In particular, the face detector was very good at getting the x and y positions of the face correct, but it had quite a bit of noise. The noise was especially apparent in the scale, because the face detector was unable to make smooth steps between different face sizes. The feature tracker, on the other hand, provided very smooth movements due to the use of least squares and RANSAC, and it tended to track the changing positions accurately for a couple of frames. Over time, however, the feature tracker would accumulate quite a large error. Our filter was able to overcome the weaknesses of both of our position trackers.

The filtering algorithm we used basically does a weighted average on the two tracker inputs. While we only had two inputs, the algorithm easily generalizes to more, if more were available. The algorithm works by keeping a current best estimate for the state, y, and considers inputs from several position trackers, xi. At each time step, the best estimate is updated according to the following equation.

The update equation performs a weighted average between a predicted next state based on the previous best guess, and the new information from the position trackers, which is combined into x'. The position tracker inputs are combined in a weighted average, where the weights are inversely proportional to the variance measured over the last five updates. In particular, the entries of x' are calculated according to the following equation.
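One form consistent with this description is the inverse-variance weighted average below. It is a reconstruction from the surrounding definitions, with k indexing the coordinate and i the input, and should be read as an assumption rather than the exact expression used.

```latex
x'_k = \frac{\sum_i \dfrac{x_{k,i}}{\sigma^2_{k,i} + \varepsilon}}
            {\sum_i \dfrac{1}{\sigma^2_{k,i} + \varepsilon}}
```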

Here, x_{k,i} is the mean of the previous five updates from the ith input, and σ²_{k,i} represents the variance over the last five updates. The term ε in the denominator is used to avoid dividing by 0 in cases where the variance is mistakenly measured to be 0 for the last five frames. This is especially common in the z value from the face detector, due to its discrete scales. The ε term can also be viewed as a minimum uncertainty for each input.

For our implementation, we selected an α value of 0.25 and an ε of 1e-7. The ε value was chosen because the variances that we normally observed were in the ballpark of 1e-6 to 1e-5.

Our filter does not require updates from each input at every iteration. This is important because it is quite possible for the face detector to fail to find a face, while the patch tracker is able to keep tracking the face. In these cases, the filter simply ignores the sources that have not updated, and incorporates them once again when a new update is available. This further improves our robustness, as the feature tracker could almost always continue to track features even when the face detector intermittently did not detect a face.
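A single-coordinate sketch of such a filter is shown below. Several details are assumptions made for illustration: that α weights the predicted state and 1 - α the fused measurement, that the prediction adds the most recent change in the estimate, and the class and method names themselves. The five-update window, α = 0.25, and ε = 1e-7 follow the text.

```python
from collections import deque
from statistics import fmean, pvariance


class FusionFilter:
    """Fuses several noisy inputs for one coordinate (run one per axis)."""

    def __init__(self, num_sources, alpha=0.25, epsilon=1e-7, window=5):
        self.alpha = alpha
        self.epsilon = epsilon
        self.history = [deque(maxlen=window) for _ in range(num_sources)]
        self.estimate = None
        self.velocity = 0.0

    def update(self, readings):
        """readings holds one value per source, or None if it did not update."""
        num = den = 0.0
        for history, value in zip(self.history, readings):
            if value is None:
                continue  # ignore sources that have not updated
            history.append(value)
            # Weight is inversely proportional to the recent variance.
            weight = 1.0 / (pvariance(history) + self.epsilon)
            num += weight * fmean(history)
            den += weight
        if den == 0.0:
            return self.estimate  # no source reported this step
        fused = num / den
        if self.estimate is None:
            self.estimate = fused
            return fused
        # Assumption: alpha weights the prediction, (1 - alpha) the measurement.
        predicted = self.estimate + self.velocity
        new_estimate = self.alpha * predicted + (1.0 - self.alpha) * fused
        self.velocity = new_estimate - self.estimate
        self.estimate = new_estimate
        return new_estimate
```

Passing None for a source that has no new reading leaves its history untouched, so it rejoins the average with its earlier samples once it reports again.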

Results

As our evaluation is primarily qualitative, we present two short demonstration videos.

In the first video, we show the tracking and display side by side. Please patiently wait through the first few seconds of YouTube video conversion artifacts. Stuttering in the framerate is caused by the screencast capture program. The green rectangle shows the result of face detection, the blue rectangle shows the region from which feature points are sampled, and the red dots show the detected features in each frame. Note that when Peter rotates his head, the frontal face detector no longer fires, but the features continue to track from frame to frame, maintaining a correct estimation of the head position.

In the second video, Peter holds a camera underneath his chin to give the viewer an idea of the correct perspective. Notice that the bottom target appears to float over the keyboard. This video gives a better idea of the use of the system and of the smooth framerate.

Discussion

We see that the system does a generally fine job of producing a convincing scene for the user. The system is perhaps not as robust as we might like to rapid head movements, as these confuse the feature tracker and force the system to rely solely on the face detector. Also, if the face is in a position unmeasurable by the face detector (e.g., if the face is rotated, or partially out of view of the camera), the feature tracker does a passable job of continuing to track the face, but drifts over time and sporadically calculates incorrect transformations.

Because of the use of running averages in our filter, it has the potential to introduce a significant amount of lag behind the actual face position. In practice, we did not find this to be a problem. There are several mitigating factors in our favor. Most importantly, people's face movements tend to be smooth and at a moderate speed. This means the predictive term based on the average velocity is able to do a reasonably good job of predicting where the user's head will be in the next frame. Secondly, our detectors run at a reasonable framerate that seems to be roughly in the 30 frames per second range. At this rate, 5 sensor updates equate to 1/6th of a second in real time. This is a short enough time that for smooth movements, the lag does not seem noticeable.

Future Work

We would like to experiment with running parts of our algorithm on the GPU. The GPU's architecture is especially well-suited to many computer vision tasks, and could allow our software to track higher camera framerates. Using the GPU for vision tasks would also add the challenge of balancing resources between rendering the 3D scene and processing the camera images.

Other patch tracking techniques, such as Lucas-Kanade tracking where patches are tracked across more than two frames, might help the feature tracker to provide more consistent results. Similarly, specialized feature trackers for eyes and mouth might prove more robust.

The field of view of many webcams is quite limited, and new users tend to move their face out of range. A fisheye lens on top of the camera could provide a much wider field of view.

Even with the head tracking and perspective-correct rendering, our human visual systems were not completely fooled that we were seeing an actual 3D scene. We suspect a major reason for this is the lack of stereo depth cues that are normally present. It would be interesting to add support for stereo glasses to our system and see if this produces a more convincing effect.

Face tracking opens up possibilities for using your face as a control mechanism. It would be fun to use the face tracker to control the paddle in a game such as 3D Pong.

Conclusions and Lessons Learned

OpenCV is a powerful environment in which to work, providing both high-level resources (the Haar classifier) and low level resources (the matrix manipulation used for feature tracking).

The perspective illusion is most effective when closing one eye, meaning that internet demonstrations recorded with a monocular camera suggest a more convincing illusion than is experienced in person with both eyes open.

We found that the user's perception matters much more than precisely replicating real-world behavior. As long as we were able to smoothly animate the changing perspective, our brains seemed perfectly happy to tolerate the fact that the measured head position might have been slightly different from the actual head position. Our experience was that it was very intuitive to view a 3D rendering by shifting our head position to change our perspective. The illusion was close enough that we naturally started interacting with our screens as we would have with an actual physical object.

Classifiers have difficulty providing accurate depth information, because they generally work only in discrete scales. The combination of a classifier for initial detection and lower level feature tracking is a powerful way to leverage the advantages of both techniques.