A Novel Approach to Mugshot Based Arbitrary View Face Recognition
  • CC BY-NC (Non-Commercial)
ABSTRACT
KEYWORDS
Mugshot-based face recognition, Arbitrary view face recognition, Three-dimensional face reconstruction
    I. INTRODUCTION

    Mugshot face images are widely used for identity recognition in forensic and security applications. They usually consist of frontal and profile face images, which are routinely collected by police. The frontal and profile face images provide complementary information about the face, and are thus believed to be useful for pose-robust face recognition if both are effectively utilized. However, most existing automated face recognition methods [1-3] are devised for the scenario in which only frontal face images are enrolled in the gallery. To recognize non-frontal faces, they usually adopt one of the following three strategies: (i) normalizing the non-frontal probe face to frontal pose and matching it to the enrolled frontal faces [4-6]; (ii) generating synthetic face images from the frontal faces in the gallery according to the pose of the probe face, and then comparing the probe face with these synthetic images [7, 8]; (iii) extracting pose-adaptive features directly from the gallery and probe face images for comparison [9, 10].

    When both frontal and profile face images are available in the gallery, Zhang et al. [11] exploited the profile images for arbitrary view face recognition as follows: they first reconstructed 3D face shapes from the frontal and profile face images, and then recovered facial textures by fitting the Phong reflection model based on the obtained 3D face shapes and an assumed prior on the lighting direction. They generated synthetic face images of various poses for each enrolled subject to enlarge the gallery. Given a new face image, they first estimate its head pose, and then match it with the gallery face images of the same pose. A main drawback of this method is that the facial textures in the synthetic face images are computed with an assumed reflection model, which may cause artifacts, especially when the frontal and profile face images have different shading effects. Recently, Han et al. [12] proposed to directly map the textures in frontal face images onto the reconstructed 3D face shapes based on Delaunay triangulation of the 2D facial fiducial points. Although they avoided using synthetic textures, they used the profile face images only for adjusting the depth information of the 3D face shapes, without further exploiting the texture of the profile face images.

    This paper proposes a novel approach that makes better use of mugshot face images for arbitrary view face recognition. Unlike previous approaches, we directly compare probe face images with both the frontal and profile gallery face images, assisted by 3D face shape models reconstructed from the gallery mugshot images. In this way, the approach avoids potential artifacts in synthetic face images while effectively exploiting the information in both frontal and profile face images. Experiments on two challenging public face databases, Color FERET [13] and CMU PIE [14], demonstrate its effectiveness.

    The remainder of the paper is organized as follows. Section II introduces in detail the proposed mugshot-based arbitrary view face recognition approach. Section III reports the experimental results on the Color FERET [13] and CMU PIE [14] databases along with some discussion of the results. Finally, a conclusion is drawn in Section IV.

    II. METHODS

    An overview of the proposed approach is depicted in Fig. 1. As can be seen, it includes two phases: enrollment and matching.

    During enrollment, mugshot face images are captured. Here, we assume that three face images are available for each subject, i.e., a frontal, a left profile, and a right profile face image; this is a conventional configuration in forensic mugshot databases. Based on these three face images, the 3D face shape of each subject is reconstructed using the method in [15]. Motivated by the component-based face recognition method in [16], we define on the frontal face a set of local patches (called semantic parts in this paper) around the following seven facial fiducial points: the two eyebrow centers, the two eye centers, the nose tip, the nose root, and the mouth center (see Fig. 1 and Fig. 2). When a face changes its head pose, these patches deform according to the pose angles. The deformed patches of a subject under different poses can be easily computed by transforming his/her 3D face shape. Features are extracted from the semantic parts on the frontal and profile face images and stored as enrolled templates.

    Given a new query face image to be recognized, its head pose is estimated and fiducial points are located. To compare it with a subject in the gallery, its semantic parts are segmented according to its head pose and the subject's 3D face shape. Features are then extracted from the semantic parts, and compared with the enrolled templates. Matching results with the enrolled frontal and profile templates are fused to give the final recognition result, i.e., the identity of the query face image. Next, we introduce in detail the matching steps.
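
    The enrollment and matching flow described above can be summarized in code. The following is only a structural sketch under our own naming, not the authors' implementation: the module mugshot_helpers and the functions imported from it (reconstruct_3d_shape, segment_semantic_parts, extract_features, estimate_pose, face_distance, fuse_results) are hypothetical placeholders for the components the paper describes.

```python
# Structural sketch (not the authors' code) of the enrollment/matching pipeline.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np

from mugshot_helpers import (  # hypothetical module standing in for the paper's components
    estimate_pose,
    extract_features,
    face_distance,
    fuse_results,
    reconstruct_3d_shape,
    segment_semantic_parts,
)


@dataclass
class GalleryEntry:
    subject_id: str
    shape_3d: np.ndarray                       # reconstructed 3D face shape
    frontal_templates: Dict[str, np.ndarray]   # per-part feature histograms
    profile_templates: Dict[str, np.ndarray]


def enroll(subject_id, frontal_img, left_img, right_img) -> GalleryEntry:
    """Enrollment: reconstruct the 3D shape and store per-part templates."""
    shape_3d = reconstruct_3d_shape(frontal_img, left_img, right_img)
    frontal_parts = segment_semantic_parts(frontal_img, shape_3d, pose=(0.0, 0.0))
    profile_parts = segment_semantic_parts(left_img, shape_3d, pose=(-90.0, 0.0))
    return GalleryEntry(
        subject_id,
        shape_3d,
        {k: extract_features(p) for k, p in frontal_parts.items()},
        {k: extract_features(p) for k, p in profile_parts.items()},
    )


def identify(probe_img, gallery: List[GalleryEntry]) -> str:
    """Matching: segment the probe per subject, compare, and fuse the results."""
    pose = estimate_pose(probe_img)
    frontal_d, profile_d = [], []
    for entry in gallery:
        parts = segment_semantic_parts(probe_img, entry.shape_3d, pose)
        feats = {k: extract_features(p) for k, p in parts.items()}
        frontal_d.append(face_distance(feats, entry.frontal_templates))
        profile_d.append(face_distance(feats, entry.profile_templates))
    winner = fuse_results(np.array(frontal_d), np.array(profile_d), pose)
    return gallery[winner].subject_id
```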

       2.1. Semantic Part Segmentation

    Segmenting the semantic parts (i.e., the deformed local patches around the fiducial points) is the key step of the proposed approach for handling arbitrary view face images. An essential difficulty in recognizing arbitrary view face images arises from the non-linear deformation of 2D face images when the face rotates in 3D space [10]. To compute the distance between two face images that have different pose angles, it is important to find their corresponding semantic parts, because comparison between different semantic parts is meaningless.

    In this paper, we consider the following semantic parts. For each enrolled subject, we first project its 3D face shape onto the 2D plane via orthographic projection to obtain a frontal face image. We define ten local patches with respect to the seven fiducial points on the 2D frontal face image (see Fig. 2(a)). These patches are then mapped back onto the 3D face shape to obtain the semantic parts in 3D space. Without loss of generality, we assume that the inter-ocular distance is about 90 pixels, and the sizes of the semantic parts are scaled accordingly as follows (a brief sketch of this scaling is given after the list):

    • Eyebrow: 20×50 pixels

    • Eye: 40×45 pixels

    • Mouth: 55×50 pixels

    • Nose root: 55×50 pixels

    • Nose tip: 60×50 pixels
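
    As a minimal illustration of the scaling mentioned above, the sketch below stores the listed base sizes (defined for a 90-pixel inter-ocular distance) and rescales them to a face's measured inter-ocular distance. The dictionary keys, the helper name scaled_part_sizes, and the example eye coordinates are our own illustrative assumptions.

```python
import numpy as np

# Base (height, width) patch sizes in pixels, as listed above, defined for an
# inter-ocular distance of 90 pixels.
BASE_PART_SIZES = {
    "eyebrow": (20, 50),
    "eye": (40, 45),
    "mouth": (55, 50),
    "nose_root": (55, 50),
    "nose_tip": (60, 50),
}
REFERENCE_IOD = 90.0


def scaled_part_sizes(left_eye_xy, right_eye_xy):
    """Scale the base patch sizes to a face's measured inter-ocular distance."""
    iod = float(np.linalg.norm(np.asarray(right_eye_xy) - np.asarray(left_eye_xy)))
    s = iod / REFERENCE_IOD
    return {name: (int(round(h * s)), int(round(w * s)))
            for name, (h, w) in BASE_PART_SIZES.items()}


# Example: a face with a 72-pixel inter-ocular distance gets patches ~0.8x the base size.
print(scaled_part_sizes((60, 100), (132, 100)))
```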

    In order to obtain the semantic parts on an arbitrary view 2D face image to be matched with the subject, the semantic parts defined above have to be projected onto the 2D face image. This is done by estimating a 3D-to-2D projection based on the corresponding fiducial points in the 3D face shape and the 2D face image. Let {p_i = (x_i, y_i, z_i)' | i = 1, 2, …, n} and {q_i = (u_i, v_i)' | i = 1, 2, …, n} be, respectively, the 3D coordinates of the fiducial points in the 3D face shape and their 2D coordinates on the corresponding 2D face image (n is the total number of available fiducial points). The projection can then be obtained by minimizing the following objective,

    $$\min_{s,\,R,\,t}\ \sum_{i=1}^{n} \left\| q_i - s\,[R,\ t]\begin{pmatrix} p_i \\ 1 \end{pmatrix} \right\|^{2},$$

    where the projection T = s[R, t] is a 2×4 matrix composed of three parameters, i.e., scale s, rotation R, and translation t (see Fig. 2).
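
    A minimal sketch of this fitting step follows. It estimates a general 2×4 projection matrix by ordinary least squares from the fiducial correspondences; unlike the s[R, t] parameterization above, it does not explicitly constrain R to be a rotation, which is a simplification of our own.

```python
import numpy as np


def fit_projection(points_3d: np.ndarray, points_2d: np.ndarray) -> np.ndarray:
    """Least-squares fit of a 2x4 projection T mapping homogeneous 3D fiducials
    to their 2D image locations, i.e. minimizing sum_i ||q_i - T [p_i; 1]||^2.

    points_3d: (n, 3) fiducial coordinates on the 3D face shape.
    points_2d: (n, 2) corresponding coordinates on the 2D face image.
    """
    n = points_3d.shape[0]
    P = np.hstack([points_3d, np.ones((n, 1))])   # (n, 4) homogeneous 3D points
    # Solve P @ T.T ~= Q for T.T in the least-squares sense.
    T_t, *_ = np.linalg.lstsq(P, points_2d, rcond=None)
    return T_t.T                                  # (2, 4)


def project(T: np.ndarray, points_3d: np.ndarray) -> np.ndarray:
    """Apply the estimated projection to 3D points (e.g. semantic-part vertices)."""
    n = points_3d.shape[0]
    P = np.hstack([points_3d, np.ones((n, 1))])
    return P @ T.T                                # (n, 2)
```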

       2.2. Feature Extraction and Comparison

    Extracting features from an arbitrary view 2D face image consists of two steps. First, the visible semantic parts are segmented according to the head pose of the 2D face image. For example, for frontal and nearly frontal 2D face images, all semantic parts are visible and will be used for feature extraction and comparison; for left-profile and nearly left-profile 2D face images, only the semantic parts on the left side of the face are visible, and thus only these are considered. Second, once the semantic parts have been segmented from the 2D face image, features are extracted from them. In this paper, we extract local binary pattern (LBP) features [17] for the sake of simplicity, but more elaborate features could also be used.
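
    A small sketch of the feature extraction step is given below. Since the exact LBP variant and parameters are not fully specified in the text, it uses the plain 3×3 (8-neighbour) operator with 256-bin histograms computed over small local regions of a segmented semantic part; the function names and defaults are illustrative assumptions.

```python
import numpy as np


def lbp_codes(gray: np.ndarray) -> np.ndarray:
    """Basic 3x3 (8-neighbour) LBP codes for the interior pixels of a grayscale patch."""
    g = gray.astype(np.float32)
    c = g[1:-1, 1:-1]                       # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(c.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (nb >= c).astype(np.int32) << bit   # one bit per neighbour comparison
    return codes


def part_histograms(part: np.ndarray, region: int = 5) -> np.ndarray:
    """Per-region, L1-normalized 256-bin LBP histograms of one segmented semantic part."""
    codes = lbp_codes(part)
    h, w = codes.shape
    hists = []
    for y in range(0, h - h % region, region):
        for x in range(0, w - w % region, region):
            block = codes[y:y + region, x:x + region]
            hist = np.bincount(block.ravel(), minlength=256).astype(np.float64)
            hists.append(hist / max(hist.sum(), 1.0))
    return np.stack(hists)                  # shape: (num_regions, 256)
```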

    To compute the distance between two face images, their corresponding visible semantic parts are first determined. The distances between these semantic parts are then computed. In this paper, we employ the chi-square distance between the LBP histograms of corresponding semantic parts, defined as

    $$\chi^{2}(x, \xi) = \sum_{i,j} \frac{\left(x_{i,j} - \xi_{i,j}\right)^{2}}{x_{i,j} + \xi_{i,j}},$$

    in which x and ξ are the normalized enhanced histograms of the corresponding semantic parts to be compared, and the indices i and j refer to the ith bin of the histogram corresponding to the jth local region in the semantic part. The size of each local region is 5×5 pixels. The distance between the two face images is finally defined as the weighted average of the distances between all visible corresponding semantic parts, i.e.,

    $$d = \frac{1}{N_P}\sum_{i=1}^{N_P} w_i\,\chi^{2}_{i},$$

    where χ²_i is the chi-square distance between the ith pair of corresponding visible semantic parts, N_P is the total number of semantic parts, and w_i is the weight of the ith semantic part.
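
    The following sketch implements the two definitions above: the chi-square distance between the regional histograms of a pair of corresponding parts, and the averaged, weighted distance over the visible parts shared by two faces. Equal weights are used as a placeholder, since the paper does not state the values of w_i; the names chi_square and face_distance are our own.

```python
from typing import Dict, Optional

import numpy as np


def chi_square(x: np.ndarray, xi: np.ndarray, eps: float = 1e-10) -> float:
    """Chi-square distance between the regional histogram sets of two corresponding parts.

    x, xi: arrays of shape (num_regions, num_bins); entry [j, i] is the ith bin of
    the histogram of the jth local region, matching the indices in the text.
    """
    return float(np.sum((x - xi) ** 2 / (x + xi + eps)))   # eps guards empty bins


def face_distance(probe_parts: Dict[str, np.ndarray],
                  gallery_parts: Dict[str, np.ndarray],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Average of the weighted part distances over the visible parts shared by
    the probe and the gallery template (equal weights as a placeholder default)."""
    common = sorted(set(probe_parts) & set(gallery_parts))
    if not common:
        return float("inf")
    w = weights or {name: 1.0 for name in common}
    d = [w[name] * chi_square(probe_parts[name], gallery_parts[name]) for name in common]
    return float(np.mean(d))
```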

       2.3. Fusion of Frontal and Profile Comparisons

    If the probe face image is frontal or near frontal (i.e., with yaw within [-15, 15] degrees), it is directly compared with the enrolled frontal face templates. Otherwise, it is compared with both the frontal and profile templates, and the matching results are fused.

    Let d_{f,j} and d_{p,j} denote its distances to the jth subject's frontal and profile templates, respectively (j = 1, 2, …, L, where L is the total number of enrolled subjects). Among {d_{f,1}, d_{f,2}, …, d_{f,L}}, we find the two smallest distances, denoted d_{f,(1)} and d_{f,(2)}. Similarly, the two smallest distances among {d_{p,1}, d_{p,2}, …, d_{p,L}} are denoted d_{p,(1)} and d_{p,(2)}. The confidences of the frontal and profile comparisons are then defined as S_f = d_{f,(2)} - d_{f,(1)} and S_p = d_{p,(2)} - d_{p,(1)}, respectively. The identity of the probe face is finally decided as,

    image

    where frontal_id is the identity determined solely from the matching results with the frontal templates, and profile_id is the identity determined solely from the matching results with the profile templates. Here, δ is a pre-specified confidence threshold (δ = 8 in this paper).
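
    A sketch of this fusion step is shown below. The confidence margins S_f and S_p are computed exactly as defined above; however, the final decision equation of the paper is not reproduced here, so the tie-breaking rule in the last lines (keep the frontal result when its margin reaches δ, otherwise prefer the more confident modality) is an assumption of ours, marked as such in the comments.

```python
import numpy as np

DELTA = 8.0  # the paper's confidence threshold


def fuse_identities(frontal_dists, profile_dists, yaw_deg):
    """Return the index of the identified gallery subject.

    frontal_dists, profile_dists: length-L arrays with the probe's distances to
    each subject's frontal and profile templates; yaw_deg: estimated probe yaw.
    """
    frontal_dists = np.asarray(frontal_dists, dtype=float)
    profile_dists = np.asarray(profile_dists, dtype=float)

    frontal_id = int(np.argmin(frontal_dists))
    if -15.0 <= yaw_deg <= 15.0:
        return frontal_id                    # near-frontal probes: frontal gallery only

    profile_id = int(np.argmin(profile_dists))
    # Confidence margins S_f and S_p: gap between the two smallest distances.
    s_f = float(np.diff(np.sort(frontal_dists)[:2])[0])
    s_p = float(np.diff(np.sort(profile_dists)[:2])[0])

    # Assumed decision rule (the paper's exact equation is not reproduced here):
    # keep the frontal result when it is confident enough, otherwise prefer the
    # more confident of the two modalities.
    if s_f >= DELTA:
        return frontal_id
    return frontal_id if s_f >= s_p else profile_id
```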

    III. RESULTS

    In this section, we evaluate the proposed approach on the Color FERET [13] and CMU PIE [14] databases. The Color FERET database contains face images of 768×512 pixels with multiple pose variations (see Fig. 3). We manually removed the face images whose poses are inaccurately labelled, resulting in 461 subjects used in the experiments. The CMU PIE database consists of 884 face images of 68 persons under 13 different poses. The size of these face images is 486×640 pixels. For all face images, the facial fiducial points are manually marked, and the images are then normalized to 233×187 pixels by an affine transformation based on the fiducial points.
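
    The following sketch shows one way such an affine normalization could be done: the 2×3 affine matrix is estimated by least squares from the marked fiducial points and applied with OpenCV's warp. The canonical target coordinates, the use of only three fiducials (the paper marks seven), and the function name normalize_face are illustrative assumptions, not values from the paper.

```python
import numpy as np
import cv2  # OpenCV, used here only for the image warp

# Hypothetical canonical fiducial locations (x, y) in the 233x187 normalized frame
# (height 233, width 187); the paper does not give these values.
CANONICAL = np.array([[60.0, 90.0],    # left eye centre
                      [127.0, 90.0],   # right eye centre
                      [93.5, 150.0]],  # mouth centre
                     dtype=np.float64)
OUT_H, OUT_W = 233, 187


def normalize_face(image: np.ndarray, fiducials_xy: np.ndarray) -> np.ndarray:
    """Warp a face image so its marked fiducial points move to canonical positions.

    fiducials_xy: (3, 2) array of (x, y) fiducial coordinates in the input image,
    in the same order as CANONICAL.
    """
    n = fiducials_xy.shape[0]
    # Solve for the 2x3 affine matrix A in least squares: CANONICAL ~= [xy, 1] @ A.T
    src_h = np.hstack([fiducials_xy.astype(np.float64), np.ones((n, 1))])
    A_t, *_ = np.linalg.lstsq(src_h, CANONICAL, rcond=None)
    A = A_t.T                                            # (2, 3)
    return cv2.warpAffine(image, A, (OUT_W, OUT_H))
```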

    In order to evaluate the effectiveness of comparing deformed semantic parts, we also implement a baseline method that uses the same local features as the proposed method but extracts them from regular (rather than pose-deformed) local patches [17]. In addition, a commercial off-the-shelf face matcher, VeriLook [18], is included in the comparison.

    Table 1 reports the rank-1 and rank-10 recognition accuracy of the different methods on Color FERET at various yaw angles. It can be seen that the proposed method works consistently well under varying yaw angles. When the yaw angle exceeds 60 degrees, the proposed method still achieves about 80% rank-1 recognition accuracy, whereas the accuracy of the counterpart methods drops drastically to about 60% or even below 30%.
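
    For reference, rank-k identification accuracy of the kind reported in Table 1 can be computed from a probe-by-gallery distance matrix as sketched below; the function name rank_k_accuracy and the toy numbers are illustrative only.

```python
import numpy as np


def rank_k_accuracy(dist: np.ndarray, true_idx: np.ndarray, k: int) -> float:
    """Fraction of probes whose true gallery subject is among the k smallest distances.

    dist: (num_probes, num_gallery) distance matrix.
    true_idx: (num_probes,) index of the correct gallery subject for each probe.
    """
    order = np.argsort(dist, axis=1)             # gallery indices sorted by distance
    top_k = order[:, :k]                         # k best candidates per probe
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return float(hits.mean())


# Tiny usage example with 3 probes and 4 gallery subjects.
d = np.array([[0.2, 0.9, 0.5, 0.7],
              [0.8, 0.1, 0.6, 0.3],
              [0.4, 0.5, 0.3, 0.9]])
truth = np.array([0, 1, 3])
print(rank_k_accuracy(d, truth, k=1), rank_k_accuracy(d, truth, k=10))
```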

    [TABLE 1.] Rank-1 and rank-10 recognition accuracy of different methods: using regular patches [17] (Baseline, for short), VeriLook [18], and the proposed deformed patches (Proposed, for short) on the Color FERET database [13] at various yaw angles.

    We also compare the proposed method with the method in [11], which is, to our knowledge, the only published work that reports mugshot-based face recognition accuracy on the CMU PIE database. For a fair comparison, following the protocol in [11], we choose pose c27 as the frontal face, and c22 and c34 as the left and right profile faces, respectively. The remaining 10 poses of the 68 persons are used as probe images. The results are shown in Fig. 4, which again demonstrates the effectiveness of the proposed method in recognizing face images with large pose angles. The rank-1 accuracy of the proposed method is consistently higher than that of [11] across the probe poses, and the gain is especially pronounced for large poses (i.e., c31, c14, c25, c02). For pose c31, the recognition rate of the method in [11] is 80%, whereas our method reaches 93%. The mean accuracies across the probe poses are 97.8% for our method and 95.5% for the method in [11], and the corresponding standard deviations across the probe poses are 3.32% and 6.86%, respectively.

    The superior performance of the proposed method is likely due to several factors: (i) the semantic parts are much more robust to the distortions caused by pose variation; (ii) the semantic parts provide pixel-level correspondence rather than merely matching rigid local patches around the facial fiducial points; and (iii) the proposed method enlarges the inter-class variation thanks to the subject-specific 3D face model used to segment the semantic parts. Figure 5 shows some example off-angle probe face images and the results obtained by the proposed and baseline methods.

    IV. CONCLUSION

    This paper proposed a method to better utilize mugshot databases for recognizing arbitrary view face images. It directly matches probe face images to the frontal and profile gallery face images based on their corresponding semantic parts. Compared with previous methods, it not only makes better use of both frontal and profile face images, but also avoids synthesizing novel view face images, which often contain artifacts. Evaluation results on public databases have demonstrated the effectiveness of the proposed method in recognizing largely off-angle face images. A major limitation of the proposed method is its relatively high computational cost: in our experiments, using Matlab R2012b on an Intel Core i5 2.60-GHz desktop computer with 4 GB of memory, the proposed method takes, on average, around 9 seconds to recognize a probe face against a background database of 68 subjects. In future work, we will improve the efficiency, investigate automatic facial fiducial point detection, and explore additional semantic parts and more advanced local features.

REFERENCES
  • 1. Ding C., Tao D. (2015) "A comprehensive survey on pose-invariant face recognition."
  • 2. Zhang X., Gao Y. (2009) "Face recognition across pose: A review," Pattern Recognition, Vol. 42, pp. 2876-2896.
  • 3. Bowyer K. W., Chang K., Flynn P. (2006) "A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition," Computer Vision and Image Understanding, Vol. 101, pp. 1-15.
  • 4. Chai X., Shan S., Chen X., Gao W. (2007) "Locally linear regression for pose-invariant face recognition," IEEE Transactions on Image Processing, Vol. 16, pp. 1716-1725.
  • 5. Asthana A., Marks T. K., Jones M. J., Tieu K. H., Rohith M. (2011) "Fully automatic pose-invariant face recognition via 3D pose normalization," Proc. IEEE International Conference on Computer Vision, pp. 937-944.
  • 6. Shao X., Zhou X., Cheng C., Han T. X. (2013) "3D face reconstruction and dynamic feature extraction for pose-invariant face recognition," Proc. International Symposium on Computer, Communication, Control and Automation, pp. 119-122.
  • 7. Franco A., Miao D., Maltoni D. (2008) "2D face recognition based on supervised subspace learning from 3D models," Pattern Recognition, Vol. 41, pp. 3822-3833.
  • 8. Heo J., Savvides M. (2012) "Gender and ethnicity specific generic elastic models from a single 2D image for novel 2D pose face synthesis and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, pp. 2341-2350.
  • 9. Li A., Shan S., Chen X., Gao W. (2009) "Maximizing intra-individual correlations for face recognition across pose differences," Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 605-611.
  • 10. Yi D., Lei Z., Li S. Z. (2013) "Towards pose robust face recognition," Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3539-3545.
  • 11. Zhang X., Gao Y., Leung M. K. H. (2008) "Recognizing rotated faces from frontal and side views: An approach toward effective use of mugshot databases," IEEE Transactions on Information Forensics and Security, Vol. 3, pp. 684-697.
  • 12. Han H., Jain A. K. (2012) "3D face texture modeling from uncalibrated frontal and profile images," Proc. IEEE International Conference on Biometrics: Theory, Applications and Systems, pp. 223-230.
  • 13. Phillips P. J., Moon H., Rizvi S. A., Rauss P. J. (2000) "The FERET evaluation methodology for face-recognition algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, pp. 1090-1104.
  • 14. Sim T., Baker S., Bsat M. (2002) "The CMU pose, illumination, and expression database," Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pp. 46-51.
  • 15. Li J., Long S., Zeng D., Zhao Q. (2015) "Example-based 3D face reconstruction from uncalibrated frontal and profile images," Proc. IEEE International Conference on Biometrics, pp. 193-200.
  • 16. Bonnen K., Klare B. F., Jain A. K. (2013) "Component-based representation in automated face recognition," IEEE Transactions on Information Forensics and Security, Vol. 8, pp. 239-253.
  • 17. Ahonen T., Hadid A., Pietikainen M. (2006) "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, pp. 2037-2041.
  • 18. VeriLook, Neurotechnology, http://www.neurotechnology.com
IMAGES / TABLES
  • [FIG. 1.] Overview of the proposed approach. d_{f,j} and d_{p,j} denote the distances of the query to the frontal and profile templates of the jth subject, respectively (j = 1, 2, …, L, where L is the total number of enrolled subjects).
  • [FIG. 2.] Illustration of semantic parts. (a) shows the ten semantic parts used. (b) and (c) show the deformed local patches of some semantic parts under different pose angles.
  • [FIG. 3.] Example multi-view face images of two subjects in the Color FERET database [13].
  • [TABLE 1.] Rank-1 and rank-10 recognition accuracy of different methods: using regular patches [17] (Baseline, for short), VeriLook [18], and the proposed deformed patches (Proposed, for short) on the Color FERET database [13] at various yaw angles.
  • [FIG. 4.] Rank-1 recognition accuracy of the proposed method and the method in [11] on the CMU PIE database [14].
  • [FIG. 5.] Example off-angle probe face images from the Color FERET database [13] and the results obtained by the proposed and baseline methods. The first to third columns show, respectively, the probe face images, the rank-1 gallery face images identified by the baseline method, and the rank-1 gallery face images identified by the proposed method. The table gives the rank of the true gallery face image for these probe images obtained by the two methods.