Novel Backprojection Method for Monocular Head Pose Estimation
- Author: Ju Kun, Shin Bok-Suk, Klette Reinhard
- Organization: Ju Kun; Shin Bok-Suk; Klette Reinhard
- Publish: International Journal of Fuzzy Logic and Intelligent Systems Volume 13, Issue1, p50~58, 25 March 2013
Estimating a driver’s head pose is an important task in driver-assistance systems because it can provide information about where a driver is looking, thereby giving useful cues about the status of the driver (i.e., paying proper attention, fatigued, etc.). This study proposes a system for estimating the head pose using monocular images, which includes a novel use of backprojection. The system can use a single image to estimate a driver’s head pose at a particular time stamp, or an image sequence to support the analysis of a driver’s status. Using our proposed system, we compared two previous pose estimation approaches. We introduced an approach for providing ground-truth reference data using a mannequin model. Our experimental results demonstrate that the proposed system provides relatively accurate estimations of the yaw, tilt, and roll angle. The results also show that one of the pose estimation approaches (perspective-n-point, PnP) provided a consistently better estimate compared to the other (pose from orthography and scaling with iterations, POSIT) using our proposed system.
Head pose estimation , Backprojection , Monocular vision
Driver-assistance technologies are the subjects of widespread research and development in the car industry, especially computer-vision-based driver assistance . Computer-vision-based driver monitoring is aimed mainly at detecting sleepiness and distraction using a camera mounted in front of the driver (e.g., on the dashboard). By studying and analyzing the frames recorded by the camera, the system can provide feedback or warn the driver if their state is abnormal, even in non-ideal lighting conditions . Research is focused on detecting the signs of sleepiness, particularly by analyzing a driver’s eyes and their gaze. For example, long blinks shows that a driver is experiencing fatigue. Gaze detection can also determine where a driver is looking. When the eyes focus in one direction for a long period, this indicates that driver may be experiencing distraction.
However, some situations must include head pose estimation, e.g., eye detection will not work if a driver is wearing (dark) sunglasses while driving, and eyes cannot be detected on occasions when the driver is wearing a hat. The direction of the head can convey important information, so we can use head poses to understand where driver is focused and to determine a driver’s state while driving. For example, a sudden head movement while driving may indicate that the driver is looking at something special outside the car. Long-term bowing or frequent nodding can also indicate that the driver is tired or sleepy. In these cases, it is better to use head pose analysis to determine whether a driver is distracted or suffering from sleepiness. These issues are addressed in this paper. Our goal is to develop a system that can provide a robust estimation of a driver’s head pose to analyze the status of the driver. This is achieved by
combining well-established methods with a novel backprojection technique. We use two pose-estimation approaches, perspective-n-point (PnP)  and pose from orthography and scaling with iterations (POSIT) . Both pose estimation approaches are based on 2D-3D correspondences, so we also use common KLT features  to detect and track good feature pointsinside the face region. The driver’s head is represented usingan ellipsoid model, which provides a more intuitive view of the estimated head pose.
The remainder of this paper is divided into the following sections. Section 2 provides a brief description of previous headpose estimation approaches. Section 3 describes the details of our proposed system. Section 4 presents all the experiments we conducted, with a discussion of the results. Section 5 provides our conclusions and possible future work.
Feature-based approachesgenerally utilize facial features, such as the eyes, nose, or mouth, in 3D space or in the image domain to calculate the actual head direction. They learn the correlations between the positions of facial features and the head pose using training data [11,12]. Furthermore, in , the researchers detected the upper points of the eyebrows, the upper nasolabial-furrow corners, and the nasal root. They considered these features as stable facial feature points. Appearance-based approachesfocus on the entire head image instead of specific facial features before learning the pose by training. These are also known as global head-pose estimation approaches. To locate the head, they only need to locate the face, without detecting any facial features or creating a face model. After the face is located, template matching is performed to find the best matching pose and generate the pose estimation. The most popular template-matching approaches use Gabor wavelets, principal components analysis (PCA), support vector machines (SVM), or neural networks [13-15]. Model-based approachesalways need to create a 3D model to represent the human head. Pose recognition is then achieved by matching the 2D points in the head region in the image plane with their corresponding 3D points on a 3D model. The pose (rotation and translation) of the 3D model is then calculated and it has to be consistent with the pose of the human head. Various 3D models have been proposed for representing the human head. Frequently used 3D models are the active appearance model (AAM), a cylindrical head model (CHM), the ellipsoidal head model (EHM), or a simple planar model [16-18].
The proposed head-pose estimation system is able to provide the roll, yaw and tilt angle of the driver’s head from a monocular image sequence. Then, according to those angles and detected changes of them, we will be able to understand the status of the driver.
The flowchart in Figure 1 demonstrates basic procedures of the proposed head-pose estimation system. A monocular image is the input of the system for one run. Initially, face detection is applied to find the face region in the input image. The detected region is considered to be the
region of interest.
Then, the system detects the
good featureswithin this region of interest. The next step is to create a 3D model to represent the human head in order to provide an intuitive visualization when demonstrating the driver’s status, and also to have a surface for feature backprojection.
In the proposed system, because POSIT and PnP are utilized for pose estimation, both require not only the feature points in 2D coordinates on the image plane but also corresponding 3D coordinates which are located on the surface of the 3D model. Therefore, the projection points of the 2D feature points on the 3D model have to be computed beforehand.
Moreover, by camera calibration we calculate the focal length of the used camera before computing 3D projection points and implementing the POSIT pose-estimation function, and also the camera intrinsic matrix and distortion coefficients needed for implementing the PnP pose-estimation function.
Finally, the pose-estimation algorithms are used to calculate or update the affine transform of the driver’s head. Based on the rotation matrix, the system computes roll, yaw, and tilt angles of the driver’s head. Then, the created 3D model is rotated based on those three estimated rotation angles (e.g., for visualizing the head pose). As a consequence, the viewing direction of the driver can be detected.
Moreover, to analyze the status of the driver, the system has to work on a sequence of images to study changes in roll, yaw and tilt angles. Thus, a procedure for updating the coordinates is also included. It consists of detecting the new face region in the new input image and tracking previously detected good features. Then, as some of the good feature points are lost during tracking, the number of lost features is replaced by newly selected ones. Next the updated good features will be passed on to the process of feature point backprojection. At the end, the new pose is estimated according to the updated feature points.
After applying all these processes on a sequence of images, we analyze the results for providing information about the status of the driver. The remaining parts of this section describe the procedures more in detail.
The camera is assumed to be mounted on the dashboard in front of the driver. Face detection is the first step for locating the head. For initialization, we assume a frontal view on a face where yaw, tilt, and roll angles are small. In general, a driver is looking forward, thus we can expect that this initialization state occurs in a reasonable time.
The most descriptive head feature points are in the face region. Compared to this, using a profile face view or other angular face views for initialization works against robust and “meaningful” feature detection, and creates an uncertainty for the initialization of head pose angles. The Viola-Jones face detection approach is used for detecting the face region by simply using the pretrained frontal-view cascade classifier from the OpenCV library due to its robustness and high performance for the considered initialization step.
Tracking the movement of individual feature points in the face region provided for us better results (for understanding head rotation) than tracking the whole face region at once. We apply the KLT feature detector for selecting the features in the face region. Detected features are evaluated for their quality (their trackability), thus defining selected good features within the detected face region. These features are stored in descending order according to their
goodnessand then fed to the tracker later on.
There are many different 3D models that can be utilized for representing a driver’s head. Popular 3D models are the cylindrical head model, the ellipsoidal head model, a detailed active appearance model, or a planar model. For the detailed active appearance model, there are many degrees of freedom that need to be considered when initializing the model. Furthermore, its tracking performance is not regarded as being robust and precise . The cylindrical or ellipsoidal head model appear to be more preferable due to simplicity (fewer degrees of freedom, reasonable fits to a human head).
According to experimental results in , the 3D cylindrical head model can provide robust performance on yaw and roll, but is unstable for tilt. Therefore, for the proposed system, we decide to render an
ellipsoid of revolution as the model to represent the driver’s head.
As mentioned before, camera calibration is needed due to the requirement of both pose estimation algorithms, as well as the approach which is used to backproject 2D points onto the 3D ellipsoidal model surface. By calibrating the camera, we only need to obtain the
camera intrinsic matrixand the lens distortion coefficients.
In order to get rotation angles to render the ellipsoid model at the correct pose of the driver’s head, we compare the performance of PnP and POSIT pose-estimation methods. We have to provide not only coordinates of 2D image points but also their corresponding 3D coordinates on the surface of the ellipsoid. The feature detection stage provided the 2D feature points. In this step, we calculate their corresponding 3D points which are projected back onto the surface of the ellipsoid.
The concept here is to calculate the intersection of rays with an ellipsoid . A ray is defined by
Ldefines the locations of points on the ray in dependency of a non-negative real s; pis the origin of the ray, and dits direction. We assume that the camera’s focal point is the origin of projection rays. See Figure 2. Direction d is represented as ( u, v, f), where uand vare the coordinates of a 2D point in the image plane, and fis the effective focal length of the used camera. As we keep the number of selected good feature points constant, we also have the same constant number of rays from origin towards the ellipsoid.
Next, a general ellipsoid is represented as below, where
C= ( Cx, Cy, Cz) is the center point, rthe radius (e.g., scaled to be equals 1), with axes defined by k, l, and m:
However in our system, we utilize an ellipsoid of revolution with a pair of equal semi-axes (i.e.,
k= m) but a distinct third semi-axis. Thus, the equation of the ellipsoid can be modified to
By substituting Eq. (1) into Eq. (3), the intersection is defined by
simplified as a quadratic equation in the form
a, b, and cinto Eq. (9), two svalues are defined. In our case, we only need the closest intersection point (i.e., the smaller s-value) and substitute it back into Eq. (1) for calculating the coordinate of the projection point on the ellipsoidal surface.
Now we have 2D feature points and corresponding 3D coordinates on the ellipsoid surface; we are ready to estimate the pose of the driver’s head.
For this stage, we implemented two different pose estimation methods from the OpenCV library,
sovlePnPand cvPOSIT, for comparison. Referring to the OpenCV library, we need to provide some required data for both methods.
In fact, all the previous steps are the preparation stages for triggering these pose estimation methods. Once the program is initialized, the requested 2D points can be obtained from the feature detection step as well as the corresponding 3D points on the surface of the ellipsoid. These data are required by both pose-estimation methods as indicated by Table 1. Data
camera matrix, distortion coefficientsand focal lengthare obtained by camera calibration. Criteria are used to decide how many iterations need to be done for generating a sufficiently accurate pose. A result when using solvePnPis shown on the left of Figure 3, and the right image shows a result generated by cvPOSIT.
The previously described steps are all for initializing head pose estimation. In order to analyze the status of the driver, we also analyze subsequent frames for providing a sequence of head-pose estimations.
By using initial 3D model points and updated 2D points, both tested pose-estimation algorithms provide updated estimations. However, two problems need to be solved. First, when a driver’s head is rotating, tracked feature point get lost. As a result, there are insufficient feature points to be fed into the pose estimation algorithms. As a consequence, they will not be able to provide robust estimations for a driver’s head. Second, there is the goal of automatically re-initializing the system. We need to decide when and how to re-initialize, such that time-efficiency and system performance is guaranteed.
In the following sections, we describe how to update 2D coordinates for generating new estimations of the driver’s head pose, as well as how to solve those two problems mentioned above.
Updating of 2D coordinates consists of two steps: tracking of feature points and replacing lost feature points by newly detected ones.
As illustrated in Figure 1, in order to increase the performance of tracking, face detection is applied again for the new input image. However, at this time, face detection is based on frontal and profile face detection, as well as on the previously detected face region. The reason is that the face in the new image may not be the frontal view. If not, the face will be detected by using profile-detection procedure or according to the previously detected face region due to the small displacement between two consecutive frames. Once the face region is again detected or estimated, the KLT tracker will track previously detected features within this region.
However, by experience, many features will be lost due to the preset constraints. Not only due to head motion, there are various other reasons which can lead to tracking errors, such as illumination changes, feature location occlusions, and so on. Thus, in order to prevent bad pose estimation due to a lack of feature points, lost feature points are replaced by using newly detected feature points. Once the replacing process is finished, the feature list is updated. Next, the new 2D and 3D correspondences will be obtained, then new pose estimation will be calculated.
According to our design, the system has to be able to do reinitialization automatically. For example, based on the detected head pose it could be decided when to trigger the function for replacing lost feature points. However, a simple approach of replacing lost features in every
n-th frame provided even a better solution compared to more complicated approaches, such as, replacing lost features whenever the driver’s face returns back to the frontal view, presetting few specific yaw angles as the re-initialization points (such as when ever the yaw angle is
about 0°, 15°, 35°, etc.), and so on.
We take every fifth frame as a re-initialization frame to replace lost features. More in detail, contributing processes perform the following:
1. Estimate the head pose; if a frontal-view face then first (re-)initialization.
2. Calculate three head-rotation angles for the first (re-) initialization frame.
3. Trigger the function to replace lost features.
4. Estimate the head pose in the following frames, until the next re-initialization frame, according to newly updated features.
5. Calculate the head rotation angles for the current frame. The result will be referred to as that for the most recent re-initialization frame.
6. Adding newly calculated head rotation angles which refer to the most recent re-initialization frame for providing the head rotation angles which refer to the frontal view face.
We conducted extensive experiments using our proposed system with image sequences recorded in a test vehicle. However, these real-world data lacked ground-truth data for comparison. Thus, we also developed an approach for generating ground-truth data. We compared the estimated results (using the PnP and POSIT pose-estimation approaches) with the head-pose ground-truth data, which were obtained from manually marked feature points or automatically tracked feature points in the input images.
To achieve this, we used a mannequin model as our test object and recorded three sequences: one for changing the tilt angle
only, one for changing the yaw angle only, and one for changing the roll angle only. Sample input images and their manually marked feature points are shown in Figure 4.
Based on experiments using mannequin model sequences, we were able to determine the accuracy of our results when using manually marked feature points for head pose estimation or the results when running our system to automatically detect feature points, and we compared the results with the ground-truth head pose data. We calculated the standard deviations of the errors between the pose results based on manually marked feature points or the KLT-detected feature points compared with the measured mannequin head pose.
The first four rows in Table 2 show the standard deviations of the errors for the estimated angles, which were obtained using both pose estimation approaches with manually marked feature points or KLT-detected features, compared with the ground-truth data. In general, PnP provided better estimates than POSIT, while the manually marked feature points yielded
better results than automatically (here: KLT) detected feature points.
The results obtained using manually mark points were very close to the (measured) head-pose ground-truth data. Thus, they can be regarded as
substitute reference valuesfor testing the accuracy of the results when applying the proposed system to the test vehicle sequences. Both pose-estimation approaches were relatively sensitive to changes in the types of feature points. Furthermore, we found that the estimated roll angles had the most accurate values. This was because the change in the angle was not significant when the head was slanted so the majority of the feature points remained visible.
This required fewer replacements of lost feature points, and to more accurate and robust estimates. The estimated yaw angles had the largest standard deviation error. This was because the changes in yaw angle could be very large and a large yaw angle change required more replacements.
As mentioned earlier, both pose estimation approaches were sensitive to changes in the feature points. Therefore, there were bigger errors when estimating the yaw angle. The accuracy of the estimated tilt angle was between the yaw and roll angle estimates.
The last two rows in Table 2 show the standard deviation error for the KLT approach relative to the manually marked points approach. These values can be used as reference for checking the accuracy of the results when applying the pose estimation approaches to test vehicle sequences.
We applied manually marked feature points to four test vehicle sequences (including scenes of a non-occluded driver’s face, eyeglasses, and dark sunglasses) to provide a
substitute reference value. Table 3 shows the standard deviation errors for the substitute reference valuesrelative to the PnP and POSIT pose estimation results, as well as the average standard deviation of the errors.
Table 3 shows that the average errors were very similar to the errors shown in the last two rows in Table 2. This demonstrated that the use of the mannequin model sequences was appropriate. Our system can provide relatively accurate estimates of a driver’s head pose and PnP provided better results than POSIT using our proposed system.
We developed a model-based monocular head pose estimation system that utilizes a novel form of backprojection and a KLT feature tracker for selecting, tracking, and replacing feature points in a detected face region, as well as calculating their corresponding 3D coordinates on the surface of an ellipsoid model to estimate the pose of a driver’s head and to analyze the status of the driver’s head.
There is no publicly available source code for other headpose estimation methods, so our comparisons were limited to re-implementation. However, we compared two different pose estimation algorithms: PnP and POSIT. Using the results for manually marked feature points as
substitute reference values, we showed that PnP yielded better results than POSIT with our system. Furthermore, based on the experiments we conducted to obtain ground-truth data, we showed that both of these pose estimation approaches were sensitive to changes (i.e., updates) in the feature points.
It would be interesting to test the performance of other feature points (e.g., SIFT and Harris corner detector) to improve the accuracy of our estimation results. Moreover, the rapid advances in mobile devices, such as smartphones, means that a mobile device application would be a very useful extension of our system. We could obtain many advantages from mobile device applications. For example, the majority of smartphones have a built-in camera so we would simply mount it on the driver’s side of the windscreen instead of using a separate camera.
No potential conflict of interest relevant to this article was reported.
1. Klette R., Ahn J., Haeusler R., Hermann S., Huang J., Khan W., Manoharan S., Morales S., Morris J., Nicolescu R., Ren F., Schauwecker K., Yang X. 2011 “Advance in vision-based driver assistance” [in Proceedings of 2011 International Conference on Electric Technology and Civil Engineering] P.987-990
[Figure 1.] Flowchart of the proposed-model-based head pose estimation algorithm.
[Figure 2.] Ray tracing starts at the camera’s focal point, passes through 2D points in the image plane and corresponding 3D points on the ellipsoid’s surface.
[Table 1.] OpenCV pose estimation methods
[Figure 3.] (a) Result of head pose estimation using solvePnP. (b) Result of head pose estimation using cvPOSIT. PnP, perspective-n-point; POSIT, pose from orthography and scaling with iterations.
[Figure 4.] Recorded images of a mannequin model with manually marked feature points.
[Table 2.] Standard deviations of the errors
[Table 3.] PnP results