Online Selective-Sample Learning of Hidden Markov Models for Sequence Classification

Kim
Minyoung

doi:10.5391/IJFIS.2015.15.3.145

OA학술지
International Journal of Fuzzy Logic and Intelligent Systems

Online Selective-Sample Learning of Hidden Markov Models for Sequence Classification

DOI : 10.5391/IJFIS.2015.15.3.145
Author: Kim Minyoung
Publish: International Journal of Fuzzy Logic and Intelligent Systems Volume 15, Issue3, p145~152, 25 Sep 2015

ABSTRACT

Online Selective-Sample Learning of Hidden Markov Models for Sequence Classification

KEYWORD

Machine learning , Sequence classification , Online learning , Hidden Markov models

본문

Collapse all

1. Introduction

Sequences or time-series data are parts of our everyday life in diverse forms such as videos of image frames, speech signals, financial asset prices and meteorological records, to name a few. Sequence classification involves automatically assigning class labels to these instances in a meaningful way. Owing to the increasing demand for efficient indexing, search, and organization of a huge amount of sequence data, it has recently received significant attention in machine learning, data mining, and related fields. Unlike conventional classification of non-sequential vectorial data, sequence classification entails inherent difficulty originating from potentially variable-length and non-stationary structures.

Recent sequence classification approaches broadly adopt one of the two different frameworks. In the first framework, one estimates the similarity (or distance) measure between pairs of sequences, from which any kernel machines (e.g., support vector machines) or exemplarbased classifiers (e.g., nearest neighbor) can be employed. As the sequence lengths and sampling rates can differ from instance to instance, one often resorts to warping or alignment of sequences (e.g., dynamic time warping), alternatively, incorporating a specially-tailored similarity measure (e.g., spectral or string kernels [1]).

The second framework is based on probabilistic sequence models, essentially aiming to represent an underlying generative model for sequence data. The hidden Markov model (HMM) is considered to be one of the most popular sequence models, and has previously shown broad success in various application areas including automatic speech recognition [2, 3], computer vision [4–6], and bioinformatics [7]. y-based one in several aspects: certain statistical constraints such as Markov dependency structures can be easily imposed. In addition, the model-based approaches are typically computationally less demanding for training, specifically linear time in the number of training samples, while similarity-based methods require quadratic. Thus, throughout the study we focus on HMM-based sequence classification: the details of the model are thoroughly described in Section 2.

Unlike the traditional learning setup where all the labeled training samples are stored and available to a learning algorithm (i.e., batch learning), the learning setup we deal with in this study is quite different. We consider the online selective-sample learning: it is basically a streaming data scenario where we receive an input sequence (without its class label) one at a time, and the technique is assumed to have no capability of storing the sample for future use. The learning algorithm has an option of either querying the class label or not, and this decision has to be made on the fly based solely on the current data sample (i.e., unable to look into previous samples). The model can then be updated with the sequence sample and the class label (if queried). Of course, there is a limit to the total number of queries that can be made, and therefore the learner’s main objective is to select samples for queries that are the most salient and important for learning an accurate model.

Considering the cost of obtaining class labels, which typically requires human experts endeavor, the online learning algorithms can be more favorable. Moreover, they are better suited for various applications, most notably the mobile computing environments, in which there are massive amount of data collected and observed whereas the computing platforms (e.g., mobile devices) have minimal storage and limited computing power. It is worth noting that the online selective-sample learning setup is different from the well-known active learning [8, 9] in machine learning. In the active learning, the algorithm has full access to entire data samples (thus requiring data storage capabilities), and the goal is to output a subset for which the class labels are queried. In this sense, the online selective-sample learning can be more realistic and challenging than the active learning.

Recently there have been attempts to tackle the online selective-sample learning problem. In particular, the main idea of the latest aggressive algorithms (e.g., [10, 11]) is to decide to query labels for samples that appear to be novel (compared to the previous samples) with respect to the current model. Moreover, it is desirable to ask for labels for those data that have less confident class prediction under the current model, which is also intuitively appealing. However, most approaches are limited to vectorial (non-sequential) data and binary classification setups with the underlying classification model confined to the simple linear model.

We extend the idea to the multi-class sequence data classification problem within the HMM classification model. Specifically, we deal with the negative data log-likelihood of the incoming sequence as a measure of data novelty. Furthermore, the entropy of the class conditional distribution given the incoming sequence is considered as a measure of strength/confidence in class prediction. The proposed approach is not only intuitively appealing, but also shown to yield superior performance to the baseline approaches that make queries randomly or greedily. These results have been verified for several sequence classification datasets/problems.

The paper is organized as follows: after introducing a few notations and a formal problem setup, we describe the HMM-based sequence classification model that we deal with throughout the study in Section 2. Our main approach is described in Section 3, and the empirical evaluations are demonstrated in Section 4.

   1.1 Notations and Problem Setup

We consider a K-way sequence classification problem, where we let c ∈ {1, ..., K} be the class variable and x = x₁ . . . x_T be the sequence observation (the length T can vary from instance to instance). We assume each time slice x_t ∈ ℝ^d is a realvalued d-dimensional vector.

In the online selective-sample learning setup, the learner receives a sequence one at a time, and decides whether to query its class label or not. More formally, at the i-th stage, the algorithm receives a new sequence (length T_i), and outputs the class prediction for xⁱ from the current model. It then decides whether to query the true class label c_i or not, and the learning algorithm may update the classification model using the observed data sample (either xⁱ or (c_i, xⁱ), latter only when the query is made). In general, there are two (complementary) goals for the learner: to yield an accurate class prediction model and to make as few queries as possible. One possible way of enforcing the goals, which we adopt in this paper, is to introduce the budget B, the upper bound of the number of queries to be made, and devise an algorithm yielding the smallest classification error with the budget constraint.

2. HMM-based Sequence Classification Model

We consider a probabilistic sequence classification model P(c, x) = P(c)·P(x|c), where P(c) is the class prior (modeled by the multinomial distribution over {1, . . . , K}), and P(x|c) is the c-th HMM model (with c = 1, . . . , K) that is responsible for generation of a sequence x under the class c. In this paper we use the Gaussian-emission HMM models, for which we provide formal descriptions below.

The HMM is composed of two generative components: (i) a (hidden) state sequence s = s₁ . . . s_T is generated where each hidden state s_t takes a discrete value from {1, ..., S}) conforming to the 1st-order Markov dependency, (ii) at each time slice t, the observation x_t is generated whose distribution (Gaussian) is determined by s_t. More formally, we have the following local conditional models and associated parameters:

The parameters of the HMM are denoted by

In typical cases, the state sequence s is not observed, and one has to deal with the observation likelihood which is marginalization of the full joint likelihood over all possible hidden sequences, namely P(x) = Σ_s P(s, x). This marginalization can be done very efficiently (time linear in sequence length T) using the dynamic programming [12]. Given an HMM model and the observation sequence x, the task of computing the hidden state posteriors (i.e., P(s|x)) is very important. Often referred to as probabilistic inference, it can be done efficiently using the forward/backward recursions [12]. Specifically, we denote two key quantities by: γ_t(j) := P(s_t = j|x) and ξ_t(j, l) := P(s_t−1 = j, s_t = l|x).

While the parameter learning of the HMM can typically done by the EM algorithm [13], one can directly optimize the log-likelihood log P(x) by standard gradient ascent. This is indeed exploited in our online selective-sample learning algorithm since we need to update a model with a single sample within the stochastic gradient optimization framework [14]. The gradient of the observation log-likelihood can be derived as follows:

Returning to the classification model, we deal with K HMMs and treat each one as a class conditional density P(x|c) for each class c. The class prior P(c) is modeled by a multinomial with a K-dimensional paraemter vector p where p_k = P(c = k). Overall the model has parameters , and is the parameters of the k-th HMM. For simplicity, the hidden state cardinalities (i.e., S) are assumed to be the same across all the component HMMs. We refer to this HMM-based classification model as SC-HMM. In the SC-HMM model, class prediction for an unseen test sample x = x₁ . . . x_T is done by the maximum-a-posteriori (MAP) decision: c^∗ = arg max_c P(c|x) = arg max_c P(c, x).

3. Online Selective-Sample Learning for SC-HMM

Our online selective-sample learning algorithm for the SC-HMM model is motivated from the aggressive algorithm of [11], and we briefly review their approach here. However, it should be noted that their approach is based on the simple linear model for binary classification of vectorial (non-sequential) data, hence should be differentiated from our extension.

In [11] a linear classification model y = sgn(w^⊤z) is considered where z ∈ ℝ^d is the input observation vector, w is the parameter vector of the same dimension, and sgn(·) returns the sign (±1) of its argument. The so-called score w^⊤z is real valued, where its magnitude indicates the confidence in prediction (i.e., a larger score implies that z lies farther away from the decision boundary, and vice versa). We denote by z_i the i-th sample received by the algorithm, and by δ_i a binary variable indicating whether i-th sample is queried (1) or not (0).

The model w is basically estimated by L2-regularized linear regression, which gives rise to the estimate w = A⁻¹b at stage i, where

Note that (3) can be computed/updated in an online fashion. Then the model’s score is computed as: . Whether to query y_i or not is then based on this confidence, namely we query if p is smaller than some threshold, and this decision strategy is employed in [10, 11]. In [11], it is further considered the novelty of the sample z_i, which is measured by . This is based on the 0-mean Gaussian assumption for z, and we query y_i if r is larger than certain threshold.

In their algorithm, due to the simple linear regression model and Gaussianity assumption, the query decision rule is easily derived and the update equation becomes simple. For sequence classification, the underlying model is a rather complex SC-HMM, and extension of the idea requires further endeavor. At each stage i, upon receiving a new sequence xⁱ, our online selective-sample learning algorithm performs two main steps: 1) decide whether to query the true class label c_i or not, and 2) update the model with the current sample, either (c_i, xⁱ) if queried or xⁱ if not. We provide detailed derivations for each step in the subsequent sections.

   3.1 Decision for Query

First, decision to query is made if the incoming sample meets either of two conditions with respect to the current model Θ. (Cond-1) − log P(xⁱ ; Θ) ≥ τ_NLL indicates the current data sample attains high negative log-likelihood, or equivalently, the sample appears to be novel for the current model. The threshold τ_NLL is properly chosen. (Cond-2) −H(p(c|xⁱ ; Θ)) ≤ τ_NCE implies that the entropy of the posterior class distribution for the current sample is high, meaning that the con- fidence in prediction is not very strong. Here is the entropy of the class posterior. Again we choose the threshold parameter τ_NCE adequately.

   3.2 Model Update

The second step is to update the model Θ using the current data sample. The true class label c_i is either available or not depending on the decision in the above step, and we let the binary variable δ_i record it (i.e., δ_i = 1(0) if queried (not)). To present our model update algorithm, we begin with the batch data learning formulation assuming all labeled and unlabeled data can be accessed. We then derive the stochastic gradient update rule from the batch formulation.

In the batch learning case with n data samples, the learning problem can be formulated with partially labeled data (sine we have selectively labeled samples) as follows:

Here note that we can exploit even the unlabeled samples (the second term) by maximizing the marginal log-likelihood where the two objectives are balanced by the hyperparameter λ ≥ 0.

In the stochastic gradient ascent [14], instead of computing the full gradient of (4), it only updates the model using the gradient for a single sample, say i, that is,

which can be seen as an unbiased sample for the total gradient. The constant η > 0 is a learning rate which we fix throughout the iterations (e.g., η = 0.001).

The nice thing is that (5), although derived from a batch setup, can be computed in an online fashion, and it exactly forms our model update equations. To be more specific to the SC-HMM, the gradients can be computed by (for each k = 1, . . . , K):

where I(·) is the 1/0 indicator function, and the gradient in the latter exactly follows the HMM gradient (2). For the marginal log-likelihood term, since

The initial model is chosen randomly and blindly, thus tends to select first a few samples for query, which is a desirable strategy considering that model is inaccurate with little labeled data received at initial stages. We stop the algorithm when the number of queries made reaches the budget constraint B. The proposed online selective-sample learning algorithm is summarized in Alg. 1.

4. Empirical Evaluation

In this section we evaluate the classification performance of the proposed online selective-sample learning algorithm on several real-world time-series datasets including human gait/activity recognition and facial emotion prediction. For each dataset we form online learning setups by feeding a learning algorithm one sample at each stage randomly drawn from the data pool, where we restrict the algorithm to make queries up to B times. We test with two different values of budget B: 10% and 30% of the entire data samples. Once the algorithm uses up the budget, from then on it can only exploit the incoming sequences only (without class labels) to update the model.

Our algorithm is compared with two fairly basic online learning strategies: the first is to make queries at the first B stages (greedy approach; denoted by GREEDY), and the second one makes random queries (denoted by RANDOM). The former is reasonable in the sense that the model is initially incorrect, thus trying to learn with labeled data as early as possible. The related hyperparameters (e.g., the learning rate η and the impact of the marginal log-likelihood term λ in model update) are chosen identical across competing models for fair comparison. As a reference, we also contrast with the online learner that is not restricted by the budget constraint (i.e., it can make query every stage), which perhaps provides an upper bound for the classification accuracy (denoted by QRYALL).

As a performance measure, we accumulates the test errors up to n stages. That is, the prediction error of an online algorithm is: . We provide detailed descriptions of the datasets, experimental setups, and test results in the subsequent sections.

   4.1 Human Gait Recognition

The classification task we deal with is to identify a person based on his/her gait sequence. We collect data from the speed-control human gait database [15, 16], where we focus on samples from 6 different subjects to form a K = 6-way multi-class classification setup. Each subject performs walking motion with four different speeds (0.7m/s, 1.0m/s, 1.3m/s, 1.6m/s), and the task is to predict a subject regardless of the walking speed. For the sequence representation, we take 3D motion captures of some marker points at the lower body parts (refer to [15] for details). We then select sub-sequences of lengths around 80 with random starting/ending positions.

For n = 648 sequences, we form online learning setups with two budget setups, B = 0.1n and B = 0.3n, and it is repeated 5 times to report the average test errors. The results are summarized in Table 1. The reference QRYALL that makes unlimited queries, as expected, attains the lower bound for test errors of all competing methods. For both budget setups, the proposed approach consistently outperforms the other two methods, which can be attributed to its more sophisticated decision strategy based on sample novelty and/or model’s certainty on class prediction, rather than simple greedy or random querying strategies.

[Table 1.] Accumulated test prediction errors (%) on human gait recognition data

Accumulated test prediction errors (%) on human gait recognition data

It is also interesting to see that when the budget B increases from 10% to 30%, the difference between GREEDY and RANDOM becomes small, which can be explained as follows: as we have more labeled samples, it is less critical when the labels are asked. Also, for smaller B, the differences between our method and two competing ones are more significant, which emphasizes the effectiveness of the proposed method under a more restricted budget scenario, namely that when only a few queries are allowed to be made, the careful decision strategy in our proposed approach yields outstanding performance.

   4.2 Facial Emotion Classification

Next we consider the problem of recognizing facial emotion from a video sequence comprised of facial image frames that undergo changes in facial expression. From the Cohn-Kanade facial expression video dataset [17], we collect videos of two emotions, fear and happiness. Differentiating these two emotions are recognized as a hard problem in that visually they are very similar to each other. We tackle this binary classification task.

For the sequence representation, we use certain image features extracted from images as follows. We first estimate the tight bounding boxes for faces using the standard face detector (e.g., Viola and Jones [18]), then normalize the sizes of the cropped face images. Then the Haar-like features are extracted for each frame where we follow the procedure similar to that of [19]. The last step is to reduce the dimensionality by principal component analysis. We obtain n = 139 sequences with lengths varying from 10 to 30 for about 90 different subjects, and there are 54 sequences for the fear class and 85 for happiness.

The test errors are depicted in Table 2, where the reference QRYALL again gives lower bound on the test errors. Our proposed method outperforms the competing methods significantly and consistently for all budget setups, while for B = 0.3n it attains error rate nearly close to the lower bound. The results again signify the superiority of the proposed query decision strategy.

[Table 2.] Accumulated test prediction errors (%) on facial emotion classification data

Accumulated test prediction errors (%) on facial emotion classification data

   4.3 Human Activity Recognition

Last but not least, we tackle the problem of human activity recognition from a sequence of motion localization records. We obtain the dataset from the UCI machine learning repository [20], where the original goal was to reconstruct locations and postures from the localization data recorded from wearable tags at articulation points of subjects performing actions [21]. In this experiment, we focus on the task of recognizing the activity type. For this purpose, we collect from three different subjects about 360 sequences of three activities: walking, lying, and sitting. For the sequence representation, we use the 3-dim velocity information evaluated from 3D coordinates data obtained from the left-ankle tag.

For this 3-way classification problem, we form online learning setups similarly as previous experiments. The test results are summarized in Table 3. We again have similar behavior as the former two experiments. The performance of the proposed approach is outstanding: in particular, it even attains more accurate prediction with the smaller budget constraint. In this dataset, due to the rather limited feature information, the overall prediction performance is low, and under such scenarios, the sophisticated query decision in our algorithm is shown to yield viable solutions.

[Table 3.] Accumulated test prediction errors (%) on human activity recognition data

Accumulated test prediction errors (%) on human activity recognition data

5. Conclusion

In this paper we have proposed a novel online selective-sample learning algorithm for multi-class sequence classification problems. Under the HMM-based sequence classification model, we devise reasonable criteria for decision for query based on the data novelty (likelihood of the incoming data) and the class prediction confidence (entropy of the class posterior). For several sequence classification datasets/tasks in online learning setups, we have shown that the proposed algorithm yields superior prediction accuracy to greedy or random approaches even with a limited query budget. One current issue may be how to choose the threshold parameters adequately, and we leave it as our future work to investigate a systematic way to tune those.

참고문헌

1. Leslie C., Eskin E., Noble W. S. 2002 “The spectrum kernel: A string kernel for SVM protein classification,” [Pacific Symposium on Biocomputing] Vol.7 P.566-575
2. Juang B. H., Rabiner L. R. 1985 “A probabilistic distance measure for hidden Markov models,” [AT&T Technical Journal]
3. Sha F., Saul L. K. 2007 “Large margin hidden Markov models for automatic speech recognition,” [Advances in Neural Information Processing Systems] Vol.19
4. Starner T., Pentland A. 1995 “Real-time American sign language recognition from video using hidden Markov models,” [International Symposium on Computer Vision]
5. Wilson A. D., Bobick A. F. 1999 “Parametric hidden Markov models for gesture recognition,” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.21 P.884-900
6. Alon J., Sclaroff S., Kollios G., Pavlovic V. 2003 “Discovering clusters in motion time-series data,” [Computer Vision and Pattern Recognition]
7. Durbin R., Eddy S., Krogh A., Mitchenson G. 2002 Biological Sequence Analysis
8. Dasgupta S., Kalai A. T., Monteleoni C. 2005 “Analysis of perceptron-based active learning,” [Conference on Learning Theory]
9. Beygelzimer A., Dasgupta S., Langford J. 2009 “Importance weighted active learning,” [International Conference on Machine Learning]
10. Cesa-Bianchi N., Gentile C., Zaniboni L. 2006 “Worst-case analysis of selective sampling for linear classification,” [Journal of Machine Learning Research] Vol.7 P.1205-1230
11. Crammer K. 2014 “Doubly aggressive selective sampling algorithms for classification,” [International Conference on AI & Statistics]

이미지 / 테이블

[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ Table 1. ] Accumulated test prediction errors (%) on human gait recognition data
[ Table 2. ] Accumulated test prediction errors (%) on facial emotion classification data
[ Table 3. ] Accumulated test prediction errors (%) on human activity recognition data

OA 실천선언

기관별보기

유형별보기

OA 학술지

OAK 소개

소통/참여

1. Introduction

1.1 Notations and Problem Setup

2. HMM-based Sequence Classification Model

3. Online Selective-Sample Learning for SC-HMM

3.1 Decision for Query

3.2 Model Update

4. Empirical Evaluation

4.1 Human Gait Recognition

4.2 Facial Emotion Classification

4.3 Human Activity Recognition

5. Conclusion

OAK 소개

소통/참여

상세검색

자료유형선택

발행연도

기관선택

국내 기관

국외 기관