Silhouette-Edge-Based Descriptor for Human Action Representation and Recognition

Odoyo Wilfred O.; Choi Jae-Ho; Moon In-Kyu; Cho Beom-Joon

doi:10.6109/jicce.2013.11.2.124

Journal of information and communication convergence engineering

Published: Journal of Information and Communication Convergence Engineering, Volume 11, Issue 2, pp. 124-131, 30 June 2013


KEYWORD

Action recognition, Hausdorff distance, Shape descriptor, Silhouette-edge-based


I. INTRODUCTION

Human action recognition is a new but highly researched field in computer vision, owing to its wide range of applications. These include, but are not limited to, intelligent human-computer interaction and entertainment, automatic annotation of video data, surveillance, and medical diagnosis and training. Thousands of variants of activities and gestures can be inferred from a video clip or from still pictures, and describing and recognizing these actions involves nontrivial image processing methods. Most action recognition systems are application-specific and have predefined tasks to execute. Widely researched and modeled human actions with publicly available databases include walking, running, jumping, jogging, waving (one and/or two hands), bending, handshaking, boxing, kissing in movies, and many others. As in many other research works, our task in this paper is to locate, extract, describe, and recognize (classify) several predefined human actions into proper classification categories.

Recognition or categorization of these actions refers to classifying the action in a motion picture into one of many predefined classes. Action detection not only classifies the action but also identifies the spatial and temporal aspects of the action that occurred in the video. These two important tasks have been addressed through various methods, all in an effort to improve performance. The approaches have developed from silhouette-based methods [1] and pose estimation [2], which are considered traditional methods of human action recognition, to more dynamic spatiotemporal domain methods [3]. The latter methods are closely related to our work in general, as actions are considered to exhibit spatiotemporal patterns that can be modeled based on extracted local features [4], optical flow [5], and gradient-based descriptors [6].

In this paper we describe and model human actions from human silhouette edges based on spatiotemporal features extracted from the video. Information entropy measurement [7] is used to aid in the selection of important key posture frames from the input video. Shape context [8] is used in many studies as a local feature selection method to avoid redundancy among the selected frames. Summing up the finely extracted edge features forms the descriptors for the various action classes in our gallery database. Similarity measurement is performed with the rigid Hausdorff distance measure [9] to assign the query template to an appropriate group. Most of the image preprocessing procedures are outside the scope of this paper and are only mentioned. Section II explains in brief the theory of information entropy. We describe both the formation of our descriptor and the Hausdorff distance in Section III. Section IV contains experimental results, and Section V concludes the paper.

II. EXTRACTION OF KEY POSTURES

For proper classification, information from the motion pictures should be collected in a proper manner. The extracted features will influence the outcome of the categorization or classification of the task at hand. Evaluation of information entropy is a step towards extraction of the key postures that will be later applied in the formation of human action descriptors. Image sequences from the motion pictures reveal that human actions contain a few important postures and gestures that are significantly different from each other. These need to be properly extracted so that they can be used in the discrimination process. Motion analysis is performed in the original video to extract the moving object (human body). This can be achieved through methods like background subtraction, temporal differencing between frames, and optical flow [10]. For salient extraction of the features of the moving object, our paper used the background subtraction method. Successive thresholding and binarization accomplished this vital process.
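The background subtraction, thresholding, and binarization step can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists and a single static background image; the function name `binarize_foreground` and the threshold value of 30 are hypothetical, since the paper does not state the values it used.

```python
def binarize_foreground(frame, background, threshold=30):
    """Return a binary silhouette mask (1 = moving object, 0 = background).

    frame, background: 2-D lists of grayscale intensities with equal shape.
    threshold: hypothetical intensity difference above which a pixel is
    treated as foreground (an assumption, not a value from the paper).
    """
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 3 x 4 example: three bright pixels appear on a dark background.
background = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
frame = [[10, 200, 210, 10],
         [12, 190, 10, 10],
         [10, 11, 10, 9]]

mask = binarize_foreground(frame, background)
for row in mask:
    print(row)
```

Small intensity changes such as 10 to 12 stay below the threshold and are treated as background noise, which is the purpose of thresholding before binarization.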

Information entropy is a statistical measure of uncertainty; here it is used to collect all possible key postures frame by frame, based on their information measurements. We applied information entropy [7] as a criterion in designing an optimal feature selection process. The formulation states that, given a discrete set of pattern classes {v₁, v₂, ..., v_m} with corresponding probabilities {P₁, P₂, ..., P_m}, the information entropy is given by the following formula:

$$E = -\sum_{i=1}^{m} P_i \log P_i \qquad (1)$$

The entropy is computed per frame in sequential order, so that the entropy values of two frames can be compared to determine the similarity between their postures. For every frame, the information entropy is computed by first dividing the frame into blocks of w × h pixels and then counting the pixels that the human body covers in each block, as shown in Fig. 1. The probability of the human body pixels in each block is obtained by:

$$P_i = \frac{N_i}{N} \qquad (2)$$

with N_i being the number of pixels the human body occupies in block i, and N the total number of pixels in each frame. Knowing from Eq. (2) how many pixels the human body occupies per block, we can compute the total information entropy of each frame according to Eq. (1). This results in a sequence of entropy values denoted as E₁, E₂, ..., E_t, where t depends on the total number of frames in the original video sequence. E₁ represents the entropy value for the first selected frame, E₂ the second, and the rest follow in that order. The entropy values of two frames are compared by taking their difference and checking it against a threshold, T. Where the difference is larger than T, a difference in the postures is registered.

where Threshold is adaptable depending on each action class and is calculated as

E_max and E_min are the maximum and the minimum values of information entropy in the whole sequence, respectively. The idea of using entropy should now be clear. The entropy value of the second frame is subtracted from that of the first frame, and the difference is compared to the set threshold. If the difference is less than the threshold, the postures are considered similar. If it is more than the threshold, the postures are considered different and both are registered in the key frame list for the next process. In the latter case, the second frame becomes the reference frame for comparison with the next frame in the sequence.
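The per-frame entropy computation can be sketched as follows, under a few assumptions: the silhouette is a binary mask stored as nested lists, the logarithm is base 2 (the base is not stated in the text), and empty blocks are skipped because they contribute nothing to the sum. The function name `frame_entropy` is hypothetical.

```python
import math


def frame_entropy(mask, block_h, block_w):
    """Entropy of a binary silhouette mask from block-wise pixel counts.

    P_i = N_i / N, where N_i is the number of body pixels in block i and
    N is the total number of pixels in the frame; E = -sum(P_i * log2(P_i)).
    The base-2 logarithm is an assumption; blocks with N_i = 0 are skipped.
    """
    rows, cols = len(mask), len(mask[0])
    total_pixels = rows * cols
    entropy = 0.0
    for top in range(0, rows, block_h):
        for left in range(0, cols, block_w):
            # Count the body pixels that fall inside the current block.
            n_i = sum(
                mask[r][c]
                for r in range(top, min(top + block_h, rows))
                for c in range(left, min(left + block_w, cols))
            )
            if n_i > 0:
                p_i = n_i / total_pixels
                entropy -= p_i * math.log2(p_i)
    return entropy


# Toy 4 x 4 frame split into four 2 x 2 blocks with 4, 0, 0, and 1 body pixels.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0]]
print(frame_entropy(mask, block_h=2, block_w=2))  # 0.75
```

Because N is the total pixel count of the frame rather than the body pixel count, the P_i values in this sketch do not generally sum to 1; this follows the definitions of N_i and N given in the text.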

[Fig. 1.] Frame blocking for entropy computation.

[Fig. 2.] Selection by entropy from particular action video sequences.

The key posture frame list is formed using information entropy. Fig. 2 shows the flow chart for key frame selection.
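The selection loop described above can be sketched as follows. The threshold is passed in as a parameter, because the paper's adaptive threshold formula based on E_max and E_min is not reproduced here. Keeping the first frame as the initial reference, and leaving the reference unchanged when two frames are similar, are assumptions of this sketch; the name `select_key_frames` is hypothetical.

```python
def select_key_frames(entropies, threshold):
    """Return indices of key posture frames from per-frame entropy values.

    A frame is kept when its entropy differs from that of the current
    reference frame by more than `threshold`; it then becomes the new
    reference for the following comparisons.
    """
    if not entropies:
        return []
    key_frames = [0]            # assumption: the first frame starts the list
    reference = entropies[0]
    for index in range(1, len(entropies)):
        if abs(entropies[index] - reference) > threshold:
            key_frames.append(index)
            reference = entropies[index]
    return key_frames


# Toy entropy sequence E_1 ... E_5 with a hypothetical threshold of 0.1.
entropies = [0.50, 0.52, 0.70, 0.71, 0.40]
print(select_key_frames(entropies, threshold=0.1))  # [0, 2, 4]
```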

III. EDGE DESCRIPTOR

This part is the main contribution of this paper. We have chosen to represent the human body in terms of its external characteristics: the boundary. Morphological operations [11] on the edge images form clear double-boundaries, which in turn are superimposed to exhibit a good silhouette-edge shape used as a descriptor for the action classes. The boundary has been chosen fundamentally because the task we are faced with depends on the shape of the human body rather than on regional properties like color or texture.

[Fig. 3.] Framework of the approach.

The vital key postures from the entropy section are operated on further to form the salient features for an excellent description of the action classes. The exemplar images formed from the edge descriptors are stored as templates for the gallery images in our database for comparisons with the query templates presented for classification.

> A. Process Flow

1) Motion analysis is performed to extract the object (human body) in the frame sequences from the original video stream. Background subtraction is done simultaneously.

2) The information entropy measurement (Section II) is used for key frame selection.

3) Edge detection is performed and extraction of boundaries executed.

4) Finally, selected binarized boundaries are super-imposed to form a feature descriptor for each action class.

The result of the feature descriptors formed from the procedure in Fig. 3 above is shown below in Fig. 5.
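Steps 3 and 4 of the process flow can be sketched as follows. This is a simplified illustration rather than the exact procedure of the paper: the boundary is taken as the silhouette minus its erosion by a 4-connected element (a single boundary, not the double-boundary mentioned earlier), and superimposition is a pixel-wise OR. The function names `boundary` and `superimpose` are hypothetical.

```python
def boundary(mask):
    """Keep body pixels that touch the background or the image border.

    Uses the 4-neighbourhood, which is equivalent to subtracting from the
    mask its erosion by a cross-shaped structuring element.
    """
    rows, cols = len(mask), len(mask[0])
    edge = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or not mask[nr][nc]:
                    edge[r][c] = 1
                    break
    return edge


def superimpose(edge_images):
    """Pixel-wise OR of equally sized binary edge images."""
    return [
        [1 if any(pixels) else 0 for pixels in zip(*rows)]
        for rows in zip(*edge_images)
    ]


# Two toy key-posture silhouettes: a 3 x 3 square and a single corner pixel.
mask_a = [[0, 0, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 0, 0, 0]]
mask_b = [[1, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0]]

descriptor = superimpose([boundary(mask_a), boundary(mask_b)])
for row in descriptor:
    print(row)
```

The interior pixel of the square is removed by `boundary`, so the descriptor keeps only the ring of the square plus the isolated corner pixel.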

[Fig. 4.] Raw data for ‘bend’, ‘forward jump’, ‘jack’, ‘spot jump’, ‘run’, ‘skip’, ‘wave(one-hand)’, ‘wave(two-hand)’, respectively.

Snapshots of video captures of typical actions are shown from the Weizmann human action database in Fig. 4.

> B. Matching and Classification

The Hausdorff distance measure [9] is a distance measure that computes the maximum distance of a set to the nearest point in the other set. In other words, it measures the extent to which each point of a model pattern set lies near some point of the probe image set. With this value computed, we can use it to determine the similarity between the two objects/images in question. This is done by superimposing the images on one another. This is the characteristic that we need for the comparison between the action-trained descriptor models and the input image for classification.

The Hausdorff distance method has been widely used and has many advantages, two of which enhance the effectiveness of our proposed descriptor.

1) The Hausdorff distance is simple to implement and has a fast computation time. The log polar transforms of our descriptor are of the same size and superimposing makes comparison easy, even visually.

2) Our work involves stacking of frames to form a sum of the edges for description. The Hausdorff distance is relatively insensitive to small perturbations of the image such as those that occur with edge detectors and other feature extraction methods.

Hausdorff distance can be extended to cater for occlusion and help identify portions of a shape that are hidden from view but need to be identified. Occluded images are not included in our work, but Hausdorff distance could help solve that problem when it presents itself.

We formulate the Hausdorff distance method of recognition as in Eq. (5). Given two finite point sets A = {a₁, a₂, ..., a_p} and B = {b₁, b₂, ..., b_q}, the Hausdorff distance is defined as:

$$H(A, B) = \max\big(h(A, B),\, h(B, A)\big) \qquad (5)$$

where

$$h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert \qquad (6)$$

and ||·|| is some underlying distance on the points of A and B; in this case, we assume it to be the Euclidean norm.

The directed Hausdorff distance is asymmetric, which means that h(A, B) ≠ h(B, A) can hold. In comparing two images, each point on one image must be compared with all the points on the other image and the distances computed. The resulting values of h(A, B) and h(B, A) are not the same in some cases, which is the reason for the inequality above. The distance H(A, B) is the larger of the two directed distances h(A, B) and h(B, A). This argument can be written as in Eq. (7),

$$H(A, B) = \max\Big\{ \max_{a \in A} \min_{b \in B} d(a, b),\; \max_{b \in B} \min_{a \in A} d(a, b) \Big\} \qquad (7)$$

where a and b are points of sets A and B, respectively. Here d(a, b) is taken to be the Euclidean distance between a and b, for simplicity.
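The definition above translates directly into a brute-force computation over two point sets. The sketch below assumes non-empty lists of 2-D coordinates and the Euclidean distance; the function names are hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
import math


def directed_hausdorff(points_a, points_b):
    """h(A, B): largest distance from a point of A to its nearest point of B."""
    return max(
        min(math.dist(a, b) for b in points_b)
        for a in points_a
    )


def hausdorff(points_a, points_b):
    """H(A, B) = max(h(A, B), h(B, A)) with the Euclidean norm."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


# Toy point sets that show the asymmetry of the directed distance.
A = [(0, 0), (1, 0)]
B = [(0, 0), (4, 0)]
print(directed_hausdorff(A, B))  # 1.0
print(directed_hausdorff(B, A))  # 3.0
print(hausdorff(A, B))           # 3.0
```

In this example every point of A lies within distance 1 of B, but the point (4, 0) of B is at distance 3 from its nearest point in A, so the two directed distances differ and the larger one is taken.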

By computing the maximum distance between two sets of images, the Hausdorff distance measures the mismatch between two sets that are at fixed positions with respect to each other. This is true in our case, as the images that we intend to compare are fixed without translation. If translations were to be considered, Hausdorff distance obeys metric properties, as can be seen in [12,13]. This means that the function is everywhere positive and has the properties of identity, symmetry, and triangle inequality. We hold the view that a pattern formed from modeling a particular action could only resemble the same pattern executed by the same action. On the same note, the order of comparison of the formed patterns does not matter and two different patterns cannot both be similar to a third one. All these notions are in line with the three properties of identity, symmetry, and triangle inequality mentioned above.

The fourth image in each group shown in Fig. 5 represents the descriptor for that particular action and is modeled from one individual’s action. The Weizmann database comprises 9 subjects. An action of one subject is enough to produce a model descriptor that can be used to identify similar actions from the remaining 8 subjects, which are taken to be the test data. Fig. 6 shows the extracted descriptor for each of the 8 trained activities.

[Fig. 5.] Step-by-step shown for 8 different actions from Weizmann database.

[Fig. 6.] Model descriptors for ‘bend’, ‘jack’, ‘forward jump’, ‘spot jump’, ‘run’, ‘skip’, ‘wave(one-hand)’, and ‘wave(two-hand)’.

The descriptors above are stored as templates in the gallery. A new action to be recognized is modeled using the procedure explained above. Classification is done based on the distance as in part B above between the gallery images and the query template presented.
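Classification against the gallery can be sketched as a nearest-template search: the query is assigned the label of the gallery descriptor with the smallest Hausdorff distance. The tiny 3 x 3 images below are toy data, not real descriptors, and the function names `edge_points` and `classify` are hypothetical. The sketch assumes every image contains at least one edge pixel and that all images have the same size.

```python
import math


def edge_points(image):
    """Coordinates (row, col) of the non-zero pixels of a binary image."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]


def hausdorff(points_a, points_b):
    """Undirected Hausdorff distance with the Euclidean norm."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


def classify(query, gallery):
    """Return (label, distance) of the gallery template closest to the query."""
    query_points = edge_points(query)
    distances = {
        label: hausdorff(query_points, edge_points(template))
        for label, template in gallery.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Toy gallery with two templates and one query (labels are illustrative only).
gallery = {
    "bend": [[1, 0, 0],
             [1, 0, 0],
             [1, 1, 0]],
    "jack": [[1, 0, 1],
             [0, 1, 0],
             [1, 0, 1]],
}
query = [[1, 0, 0],
         [1, 0, 0],
         [1, 1, 1]]

label, distance = classify(query, gallery)
print(label, distance)  # bend 1.0
```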

IV. EXPERIMENTAL RESULTS

Apart from our own generated data for human actions, we have used the publicly available Weizmann Institute of Science human action database [14] to validate our claims. The Weizmann database contains 9 different actors performing 9 actions, and we have applied our method to most of these actions. The actions experimented on include bend, jack, jump forward on two legs, jump up in one spot, run, skip forward, one-hand wave, and two-hand wave. The video resolution is as given in the database, and the videos are recorded at 25 fps. Our own generated video dataset (OGD) contains five subjects performing eight different actions: bending, walking, jumping jacks, jumping in place, one-hand wave, two-hand wave, jumping in place wave, and running. The videos are taken at 15 frames per second. Each frame is of size 320 × 240, with 24-bit RGB images. The videos are also available in binary form, as we have applied binarization for background removal.

We have extracted the silhouettes based on the background subtraction method. The number of key frames used was limited to 20, which was enough to describe a full cycle of an event, considering the speed of the video. The performance was comparable to other methods that used different descriptors for action representation. One such method is our own [15], which used additional shape context for fine selection of key postures. Another [16] used a global histogram to describe the rectangles extracted from the human silhouette, with a claim of a 100% recognition rate [17]. Two main assumptions apply in both databases used: the viewing direction is fixed and the background is static. These restrictions contributed to the efficient extraction of the human body silhouette-edges, thereby enhancing the recognition. The Hausdorff distance performed better with the normalization of images to equal sizes. Actions with varied rotation, translation, and transformation problems were fixed for matching purposes. For example, in the case where running directions were different, the activity was trained independently for exact and fast recognition.

[Fig. 7.] The sample test data of various actions from different individuals.

[Table 1.] Hausdorff distances between different models


In Fig. 7, we extracted the descriptor for each of the various activities. The descriptor is representative of an action model that can be differentiated one from another.

Table 1 shows some of the results of dissimilarity among various action model images by using the Hausdorff distance measure. The Hausdorff distance measure compared the action models by measuring how much difference existed within an action itself and among other actions. The Hausdorff distance as a dissimilarity measure is explained in Section III-B. The experiment was done on the Weizmann human action database.

[Fig. 8.] Graph based on Hausdorff distances above. Fjump: jump forward on two legs, Pjump: jump up in one spot, Wave1: one-hand wave, Wave2: two-hand wave.

Fig. 8 is an expression of Table 1, which shows a sample experiment on eight actions compared against the rest of the models, where the zero values of the line on the graph indicate that the action was compared against itself and express absolute similarity. Just like with all distance measurement techniques, the smaller the value in between actions, the more similar the images.

[Table 2.] Hausdorff distance values of OGD


[Fig. 9.] Minmax distance comparison among actions in own generated video dataset. Fjump: jump forward on two legs, Pjump: jump up in one spot, Wave1: one-hand wave, Wave2: two-hand wave, Pjump_jack: jumpjack in one spot.

[Table 3.] The comparison of trained data between OGD and the Weizmann database using Hausdorff distance


[Fig. 10.] Overall average of six OGD-based datasets compared with 8 templates from the Weizmann database. OGD: own generated video dataset.

Table 2 indicates the Hausdorff distance value computed between each pair of actions. From these values we have tabulated a general bar graph. Fig. 9 graphs the values in Table 2, comparing each labeled action with the remaining actions. It can be further generalized to produce the graph in Fig. 10 for clarity. Table 3 is a comparison of some of the actions we took from our own recorded video data (OGD) and one of the benchmark databases, the Weizmann database. The recognition via Hausdorff distance performed fairly well with the distances, as shown in the table. The lower the value, the closer the actions are to each other. The distances between templates from the two databases were rarely zero, except for the ‘point-jump’ action. The recognition process followed the criteria explained in the proposed algorithm, from object identification in videos to template modeling. The results exhibited some closeness in particular actions, but the smallest distance was still clear. Some actions still presented a similarity challenge, as can be witnessed in the experiment among the ‘point-jump’, ‘skip’, and ‘run’ actions. The Hausdorff distance measure provided a value just significant enough to separate the actions and to let us claim the effectiveness of our proposed method. High classification performance was achieved, empirically.

A graph is shown in Fig. 10 for the cases in which clear differences existed among several actions. In the graph, it is obvious that the difference between the ‘two-hand wave’ and ‘jumping jack’ actions compared to the ‘point-jump’ action revealed the greatest dissimilarity on average. From this particular experiment, ‘run’ and ‘one-hand wave’ actions can be seen to be quite similar to each other.

V. CONCLUSION

A simple but powerful image descriptor based on silhouette boundary extraction is proposed. Fast and efficient representation was achieved through the two steps of information entropy and summed boundary shapes. We managed to model and represent well most of the actions in the Weizmann database and in our self-generated human action database. Future work includes more tests on various databases containing complex action videos with assorted backgrounds. A dynamic matching method and an improved boundary descriptor will improve the classification performance and contribute to this particular research field.

REFERENCES

1. Agarwal A., Triggs B. 2004 “3D human pose from silhouettes by relevance vector regression” [in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition] P.882-888
2. Fengjun Lv., Nevatia R. 2007 “Single view human action recognition using key pose matching and Viterbi path searching” [in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition] P.1-8
3. Dollar P., Rabaud V., Cottrell G., Belongie S. 2005 “Behavior recognition via sparse spatio-temporal features” [in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance] P.65-72
4. Ke Y., Sukthankar R. 2004 “PCA-SIFT: a more distinctive representation for local image descriptors” [in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition] P.506-513
5. Black M. J. 1999 “Explaining optical flow events with parameterized spatio-temporal models” [in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition] P.326-332
6. Dalal N., Triggs B. 2005 “Histograms of oriented gradients for human detection” [in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition] P.886-893
7. Shannon C. E. 1948 “A mathematical theory of communication” [Bell System Technical Journal] Vol.27 P.379-423
8. Belongie S., Malik J., Puzicha J. 2002 “Shape matching and object recognition using shape contexts” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.24 P.509-522
9. Huttenlocher D. P., Klanderman G. A., Rucklidge W. J. 1993 “Comparing images using the Hausdorff distance” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.15 P.850-863
10. Chen X. W., Huang T. 2003 “Facial expression recognition: a clustering-based approach” [Pattern Recognition Letters] Vol.24 P.1295-1302
11. Gonzalez R. C., Woods R. E., Eddins S. L. 2004 Digital Image Processing Using MATLAB
12. Dubuisson M. P., Jain A. K. 1994 “A modified Hausdorff distance for object matching” [in Proceedings of the 12th IAPR International Conference on Pattern Recognition] P.566-568
13. Sim D. G., Kwon O. K., Park R. H. 1999 “Object matching algorithms using robust Hausdorff distance measures” [IEEE Transactions on Image Processing] Vol.8 P.425-428
14. Blank M., Gorelick L., Shechtman E., Irani M., Basri R. 2005 “Actions as space-time shapes” [in Proceeding of the 10th IEEE Conference on Computer Vision] P.1395-1402
15. Lee G. B., Odoyo W. O., Yeom J. N., Cho B. J. 2009 “Extraction of key postures using shape contexts” [in Proceeding of the 11th IEEE Conference on Advanced Communication Technology] P.1311-1314
16. Ikizler N., Duygulu P. 2009 “Histogram of oriented rectangles: a new pose descriptor for human action recognition” [Image and Vision Computing] Vol.27 P.1515-1526
17. Odoyo W. O. 2012 “Silhouette edge-based log-polar descriptor for human action representation and recognition”
