Online SelectiveSample Learning of Hidden Markov Models for Sequence Classification
 Author: Kim Minyoung
 Publish: International Journal of Fuzzy Logic and Intelligent Systems Volume 15, Issue3, p145~152, 25 Sep 2015

ABSTRACT
We consider an online selectivesample learning problem for sequence classification, where the goal is to learn a predictive model using a stream of data samples whose class labels can be selectively queried by the algorithm. Given that there is a limit to the total number of queries permitted, the key issue is choosing the most informative and salient samples for their class labels to be queried. Recently, several aggressive selectivesample algorithms have been proposed under a linear model for static (nonsequential) binary classification. We extend the idea to hidden Markov models for multiclass sequence classification by introducing reasonable measures for the novelty and prediction confidence of the incoming sample with respect to the current model, on which the query decision is based. For several sequence classification datasets/tasks in online learning setups, we demonstrate the effectiveness of the proposed approach.

KEYWORD
Machine learning , Sequence classification , Online learning , Hidden Markov models

1. Introduction
Sequences or timeseries data are parts of our everyday life in diverse forms such as videos of image frames, speech signals, financial asset prices and meteorological records, to name a few. Sequence classification involves automatically assigning class labels to these instances in a meaningful way. Owing to the increasing demand for efficient indexing, search, and organization of a huge amount of sequence data, it has recently received significant attention in machine learning, data mining, and related fields. Unlike conventional classification of nonsequential vectorial data, sequence classification entails inherent difficulty originating from potentially variablelength and nonstationary structures.
Recent sequence classification approaches broadly adopt one of the two different frameworks. In the first framework, one estimates the similarity (or distance) measure between pairs of sequences, from which any kernel machines (e.g., support vector machines) or exemplarbased classifiers (e.g., nearest neighbor) can be employed. As the sequence lengths and sampling rates can differ from instance to instance, one often resorts to warping or alignment of sequences (e.g., dynamic time warping), alternatively, incorporating a speciallytailored similarity measure (e.g., spectral or string kernels [1]).
The second framework is based on probabilistic sequence models, essentially aiming to represent an underlying generative model for sequence data. The hidden Markov model (HMM) is considered to be one of the most popular sequence models, and has previously shown broad success in various application areas including automatic speech recognition [2, 3], computer vision [4–6], and bioinformatics [7]. ybased one in several aspects: certain statistical constraints such as Markov dependency structures can be easily imposed. In addition, the modelbased approaches are typically computationally less demanding for training, specifically linear time in the number of training samples, while similaritybased methods require quadratic. Thus, throughout the study we focus on HMMbased sequence classification: the details of the model are thoroughly described in Section 2.
Unlike the traditional learning setup where all the labeled training samples are stored and available to a learning algorithm (i.e.,
batch learning ), the learning setup we deal with in this study is quite different. We consider theonline selectivesample learning : it is basically a streaming data scenario where we receive an input sequence (without its class label) one at a time, and the technique is assumed to have no capability of storing the sample for future use. The learning algorithm has an option of either querying the class label or not, and this decision has to be made on the fly based solely on the current data sample (i.e., unable to look into previous samples). The model can then be updated with the sequence sample and the class label (if queried). Of course, there is a limit to the total number of queries that can be made, and therefore the learner’s main objective is to select samples for queries that are the most salient and important for learning an accurate model.Considering the cost of obtaining class labels, which typically requires human experts endeavor, the online learning algorithms can be more favorable. Moreover, they are better suited for various applications, most notably the mobile computing environments, in which there are massive amount of data collected and observed whereas the computing platforms (e.g., mobile devices) have minimal storage and limited computing power. It is worth noting that the online selectivesample learning setup is different from the wellknown active learning [8, 9] in machine learning. In the active learning, the algorithm has full access to entire data samples (thus requiring data storage capabilities), and the goal is to output a subset for which the class labels are queried. In this sense, the online selectivesample learning can be more realistic and challenging than the active learning.
Recently there have been attempts to tackle the online selectivesample learning problem. In particular, the main idea of the latest aggressive algorithms (e.g., [10, 11]) is to decide to query labels for samples that appear to be novel (compared to the previous samples) with respect to the current model. Moreover, it is desirable to ask for labels for those data that have less confident class prediction under the current model, which is also intuitively appealing. However, most approaches are limited to vectorial (nonsequential) data and binary classification setups with the underlying classification model confined to the simple linear model.
We extend the idea to the multiclass sequence data classification problem within the HMM classification model. Specifically, we deal with the negative data loglikelihood of the incoming sequence as a measure of data novelty. Furthermore, the entropy of the class conditional distribution given the incoming sequence is considered as a measure of strength/confidence in class prediction. The proposed approach is not only intuitively appealing, but also shown to yield superior performance to the baseline approaches that make queries randomly or greedily. These results have been verified for several sequence classification datasets/problems.
The paper is organized as follows: after introducing a few notations and a formal problem setup, we describe the HMMbased sequence classification model that we deal with throughout the study in Section 2. Our main approach is described in Section 3, and the empirical evaluations are demonstrated in Section 4.
1.1 Notations and Problem Setup
We consider a
K way sequence classification problem, where we letc ∈ {1, ...,K } be the class variable and x = x_{1} . . . x_{T} be the sequence observation (the lengthT can vary from instance to instance). We assume each time slice x_{t} ∈ ℝ^{d} is a realvaluedd dimensional vector.In the online selectivesample learning setup, the learner receives a sequence one at a time, and decides whether to query its class label or not. More formally, at the
i th stage, the algorithm receives a new sequence (lengthT_{i} ), and outputs the class prediction forx^{i} from the current model. It then decides whether to query the true class labelc_{i} or not, and the learning algorithm may update the classification model using the observed data sample (eitherx^{i} or (c_{i}, x^{i} ), latter only when the query is made). In general, there are two (complementary) goals for the learner: to yield an accurate class prediction model and to make as few queries as possible. One possible way of enforcing the goals, which we adopt in this paper, is to introduce the budgetB , the upper bound of the number of queries to be made, and devise an algorithm yielding the smallest classification error with the budget constraint.2. HMMbased Sequence Classification Model
We consider a probabilistic sequence classification model
P (c , x) =P (c )·P (xc ), whereP (c ) is the class prior (modeled by the multinomial distribution over {1, . . . ,K }), andP (xc ) is thec th HMM model (withc = 1, . . . ,K ) that is responsible for generation of a sequence x under the classc . In this paper we use the Gaussianemission HMM models, for which we provide formal descriptions below.The HMM is composed of two generative components: (i) a (hidden) state sequence s =
s _{1} . . .s_{T} is generated where each hidden states_{t} takes a discrete value from {1, ...,S }) conforming to the 1storder Markov dependency, (ii) at each time slicet , the observation x_{t} is generated whose distribution (Gaussian) is determined bys_{t} . More formally, we have the following local conditional models and associated parameters:The parameters of the HMM are denoted by
In typical cases, the state sequence s is not observed, and one has to deal with the observation likelihood which is marginalization of the full joint likelihood over all possible hidden sequences, namely
P (x) = Σ_{s}P (s, x). This marginalization can be done very efficiently (time linear in sequence lengthT ) using the dynamic programming[12] . Given an HMM model and the observation sequence x, the task of computing the hidden state posteriors (i.e.,P (sx)) is very important. Often referred to asprobabilistic inference , it can be done efficiently using the forward/backward recursions[12] . Specifically, we denote two key quantities by:γ_{t} (j ) :=P (s_{t} =j x) andξ_{t} (j ,l ) :=P (s _{t−1} =j ,s_{t} =l x).While the parameter learning of the HMM can typically done by the EM algorithm
[13] , one can directly optimize the loglikelihood logP (x) by standard gradient ascent. This is indeed exploited in our online selectivesample learning algorithm since we need to update a model with a single sample within the stochastic gradient optimization framework[14] . The gradient of the observation loglikelihood can be derived as follows:Returning to the classification model, we deal with
K HMMs and treat each one as a class conditional densityP (xc ) for each classc . The class priorP (c ) is modeled by a multinomial with aK dimensional paraemter vectorp wherep_{k} =P (c =k ). Overall the model has parameters , and is the parameters of thek th HMM. For simplicity, the hidden state cardinalities (i.e.,S ) are assumed to be the same across all the component HMMs. We refer to this HMMbased classification model as SCHMM. In the SCHMM model, class prediction for an unseen test sample x = x_{1} . . . x_{T} is done by the maximumaposteriori (MAP) decision:c ^{∗} = arg max_{c}P (c x) = arg max_{c}P (c , x).3. Online SelectiveSample Learning for SCHMM
Our online selectivesample learning algorithm for the SCHMM model is motivated from the aggressive algorithm of [11], and we briefly review their approach here. However, it should be noted that their approach is based on the simple linear model for binary classification of vectorial (nonsequential) data, hence should be differentiated from our extension.
In [11] a linear classification model
y = sgn(w^{⊤}z) is considered where z ∈ ℝ^{d} is the input observation vector, w is the parameter vector of the same dimension, and sgn(·) returns the sign (±1) of its argument. The socalled score w^{⊤}z is real valued, where its magnitude indicates the confidence in prediction (i.e., a larger score implies that z lies farther away from the decision boundary, and vice versa). We denote byz_{i} thei th sample received by the algorithm, and byδ_{i} a binary variable indicating whetheri th sample is queried (1) or not (0).The model w is basically estimated by L2regularized linear regression, which gives rise to the estimate w =
A ^{−1}b at stagei , whereNote that (3) can be computed/updated in an online fashion. Then the model’s score is computed as: . Whether to query
y_{i} or not is then based on this confidence, namely we query ifp is smaller than some threshold, and this decision strategy is employed in [10, 11]. In [11], it is further considered the novelty of the sample z_{i}, which is measured by . This is based on the 0mean Gaussian assumption for z, and we queryy_{i} ifr is larger than certain threshold.In their algorithm, due to the simple linear regression model and Gaussianity assumption, the query decision rule is easily derived and the update equation becomes simple. For sequence classification, the underlying model is a rather complex SCHMM, and extension of the idea requires further endeavor. At each stage
i , upon receiving a new sequencex^{i} , our online selectivesample learning algorithm performs two main steps: 1) decide whether to query the true class labelc_{i} or not, and 2) update the model with the current sample, either (c_{i} , x^{i}) if queried or x^{i} if not. We provide detailed derivations for each step in the subsequent sections.3.1 Decision for Query
First, decision to query is made if the incoming sample meets either of two conditions with respect to the current model Θ. (Cond1) − log
P (x^{i} ; Θ) ≥τ_{NLL} indicates the current data sample attains high negative loglikelihood, or equivalently, the sample appears to be novel for the current model. The thresholdτ_{NLL} is properly chosen. (Cond2) −H (p (c x^{i} ; Θ)) ≤τ_{NCE} implies that the entropy of the posterior class distribution for the current sample is high, meaning that the con fidence in prediction is not very strong. Here is the entropy of the class posterior. Again we choose the threshold parameterτ_{NCE} adequately.3.2 Model Update
The second step is to update the model Θ using the current data sample. The true class label
c_{i} is either available or not depending on the decision in the above step, and we let the binary variableδ_{i} record it (i.e.,δ_{i} = 1(0) if queried (not)). To present our model update algorithm, we begin with the batch data learning formulation assuming all labeled and unlabeled data can be accessed. We then derive the stochastic gradient update rule from the batch formulation.In the batch learning case with
n data samples, the learning problem can be formulated with partially labeled data (sine we have selectively labeled samples) as follows:Here note that we can exploit even the unlabeled samples (the second term) by maximizing the marginal loglikelihood where the two objectives are balanced by the hyperparameter
λ ≥ 0.In the stochastic gradient ascent
[14] , instead of computing the full gradient of (4), it only updates the model using the gradient for a single sample, sayi , that is,which can be seen as an unbiased sample for the total gradient. The constant
η > 0 is a learning rate which we fix throughout the iterations (e.g.,η = 0.001).The nice thing is that (5), although derived from a batch setup, can be computed in an online fashion, and it exactly forms our model update equations. To be more specific to the SCHMM, the gradients can be computed by (for each
k = 1, . . . ,K ):where
I (·) is the 1/0 indicator function, and the gradient in the latter exactly follows the HMM gradient (2). For the marginal loglikelihood term, sinceThe initial model is chosen randomly and blindly, thus tends to select first a few samples for query, which is a desirable strategy considering that model is inaccurate with little labeled data received at initial stages. We stop the algorithm when the number of queries made reaches the budget constraint
B . The proposed online selectivesample learning algorithm is summarized in Alg. 1.4. Empirical Evaluation
In this section we evaluate the classification performance of the proposed online selectivesample learning algorithm on several realworld timeseries datasets including human gait/activity recognition and facial emotion prediction. For each dataset we form online learning setups by feeding a learning algorithm one sample at each stage randomly drawn from the data pool, where we restrict the algorithm to make queries up to
B times. We test with two different values of budgetB : 10% and 30% of the entire data samples. Once the algorithm uses up the budget, from then on it can only exploit the incoming sequences only (without class labels) to update the model.Our algorithm is compared with two fairly basic online learning strategies: the first is to make queries at the first
B stages (greedy approach; denoted by GREEDY), and the second one makes random queries (denoted by RANDOM). The former is reasonable in the sense that the model is initially incorrect, thus trying to learn with labeled data as early as possible. The related hyperparameters (e.g., the learning rateη and the impact of the marginal loglikelihood termλ in model update) are chosen identical across competing models for fair comparison. As a reference, we also contrast with the online learner that is not restricted by the budget constraint (i.e., it can make query every stage), which perhaps provides an upper bound for the classification accuracy (denoted by QRYALL).As a performance measure, we accumulates the test errors up to
n stages. That is, the prediction error of an online algorithm is: . We provide detailed descriptions of the datasets, experimental setups, and test results in the subsequent sections.4.1 Human Gait Recognition
The classification task we deal with is to identify a person based on his/her gait sequence. We collect data from the speedcontrol human gait database
[15 ,16] , where we focus on samples from 6 different subjects to form aK = 6way multiclass classification setup. Each subject performs walking motion with four different speeds (0.7m/s, 1.0m/s, 1.3m/s, 1.6m/s), and the task is to predict a subject regardless of the walking speed. For the sequence representation, we take 3D motion captures of some marker points at the lower body parts (refer to[15] for details). We then select subsequences of lengths around 80 with random starting/ending positions.For
n = 648 sequences, we form online learning setups with two budget setups,B = 0.1n andB = 0.3n , and it is repeated 5 times to report the average test errors. The results are summarized in Table 1. The reference QRYALL that makes unlimited queries, as expected, attains the lower bound for test errors of all competing methods. For both budget setups, the proposed approach consistently outperforms the other two methods, which can be attributed to its more sophisticated decision strategy based on sample novelty and/or model’s certainty on class prediction, rather than simple greedy or random querying strategies.It is also interesting to see that when the budget
B increases from 10% to 30%, the difference between GREEDY and RANDOM becomes small, which can be explained as follows: as we have more labeled samples, it is less critical when the labels are asked. Also, for smallerB , the differences between our method and two competing ones are more significant, which emphasizes the effectiveness of the proposed method under a more restricted budget scenario, namely that when only a few queries are allowed to be made, the careful decision strategy in our proposed approach yields outstanding performance.4.2 Facial Emotion Classification
Next we consider the problem of recognizing facial emotion from a video sequence comprised of facial image frames that undergo changes in facial expression. From the CohnKanade facial expression video dataset
[17] , we collect videos of two emotions, fear and happiness. Differentiating these two emotions are recognized as a hard problem in that visually they are very similar to each other. We tackle this binary classification task.For the sequence representation, we use certain image features extracted from images as follows. We first estimate the tight bounding boxes for faces using the standard face detector (e.g., Viola and Jones
[18] ), then normalize the sizes of the cropped face images. Then the Haarlike features are extracted for each frame where we follow the procedure similar to that of[19] . The last step is to reduce the dimensionality by principal component analysis. We obtainn = 139 sequences with lengths varying from 10 to 30 for about 90 different subjects, and there are 54 sequences for the fear class and 85 for happiness.The test errors are depicted in Table 2, where the reference QRYALL again gives lower bound on the test errors. Our proposed method outperforms the competing methods significantly and consistently for all budget setups, while for
B = 0.3n it attains error rate nearly close to the lower bound. The results again signify the superiority of the proposed query decision strategy.4.3 Human Activity Recognition
Last but not least, we tackle the problem of human activity recognition from a sequence of motion localization records. We obtain the dataset from the UCI machine learning repository
[20] , where the original goal was to reconstruct locations and postures from the localization data recorded from wearable tags at articulation points of subjects performing actions[21] . In this experiment, we focus on the task of recognizing the activity type. For this purpose, we collect from three different subjects about 360 sequences of three activities:walking ,lying , andsitting . For the sequence representation, we use the 3dim velocity information evaluated from 3D coordinates data obtained from the leftankle tag.For this 3way classification problem, we form online learning setups similarly as previous experiments. The test results are summarized in Table 3. We again have similar behavior as the former two experiments. The performance of the proposed approach is outstanding: in particular, it even attains more accurate prediction with the smaller budget constraint. In this dataset, due to the rather limited feature information, the overall prediction performance is low, and under such scenarios, the sophisticated query decision in our algorithm is shown to yield viable solutions.
5. Conclusion
In this paper we have proposed a novel online selectivesample learning algorithm for multiclass sequence classification problems. Under the HMMbased sequence classification model, we devise reasonable criteria for decision for query based on the data novelty (likelihood of the incoming data) and the class prediction confidence (entropy of the class posterior). For several sequence classification datasets/tasks in online learning setups, we have shown that the proposed algorithm yields superior prediction accuracy to greedy or random approaches even with a limited query budget. One current issue may be how to choose the threshold parameters adequately, and we leave it as our future work to investigate a systematic way to tune those.

[]

[]

[]

[]

[]

[]

[]

[]

[]

[Table 1.] Accumulated test prediction errors (%) on human gait recognition data

[Table 2.] Accumulated test prediction errors (%) on facial emotion classification data

[Table 3.] Accumulated test prediction errors (%) on human activity recognition data