Goal-oriented dialogues such as appointment scheduling, call routing, and hotel reservation booking consist of sequences of goal-oriented utterances. The speakers’ intentions implied by each utterance can be represented using semantic forms called speech acts [1]. In Table 1, Utterance (3) shows that a user requests a system to search his schedule. The requesting action comprising Utterance (3) is the speech act.
As shown in Table 1, to generate correct reactions, a dialogue system should identify the speech acts indicated by users’ utterances to capture the speakers’ intentions. If a dialogue system fails to capture users’ intentions, it cannot decide whether to respond to users’ questions or to request additional information from users to achieve the task goals. It is difficult, however, to infer speech acts from surface utterances because they are context-dependent. For example, the speech act of Utterance (7) in Table 1 can be evaluated as either an “inform” or a “response” in surface analysis. Various models have been proposed over the past 25 years to resolve this ambiguity. In recent years, there has been increased interest in using statistical and machine learning approaches. In this paper, we review a representative probabilistic model for speech act classification and then compare various machine learning models to it.
[Table 1.] Example of a goal-oriented dialogue annotated with speech acts
This paper is organized as follows. In Section II, we review earlier work on speech act identification. In Section III, we review speech act classification models based on machine learning methods. In Section IV, we experimentally compare the reviewed models. Finally, in Section V, we draw conclusions.
Initial approaches to speech act identification were based on knowledge such as plan inference recipes and domain-specific knowledge [2, 3]. Since these knowledge-based models depend on costly handcrafted knowledge, they are difficult to scale up and expand to other domains. To overcome this problem, in recent years, many machine learning approaches have been proposed for speech processing. The task of identifying users’ intentions is an example of an area in which this approach has been relatively successful, as shown using various machine learning models [4-14]. Machine learning models offer a way to associate utterance features with particular categories indicating users’ intentions since the computer can efficiently analyze a large quantity of data and consider many different feature interactions. However, input features critically affect machine learning models. If the input features are uninformative or biased, the models cannot take full advantage of the particular input features that may provide valuable clues in identifying users’ intentions. Many feature extraction and selection methods have been proposed to resolve these problems. Lee et al. [5] used the structural information of discourse in speech act analysis. However, such structural information is insufficient for covering various dialogues since the authors used a restricted rule-based model, a recursive transition network, to perform the discourse structure analysis. Kim et al. [15] comparatively studied optimal feature identification for Korean speech act classification. They evaluated and compared each feature combination. Many researchers have studied feature selection methods for text categorization and speech act classification [16, 17]. Yang and Pedersen [17] present a comparative study of the feature selection methods for statistical learning in text categorization. They found information gain and use of the
III. SPEECH ACT CLASSIFICATION MODEL
A. Representative Probabilistic Model for Speech Act Classification
Given
We can rewrite Equation (1) as Equation (2) using the Bayes theorem. We exclude
Next, we simplify Equation (2) by making the following assumptions: the current speech act is only dependent on earlier speech acts, and the current utterance is dependent on its speech act. With these assumptions, we formulate the speech act classification model as a product of sentential probability and contextual probability [4, 18] as shown in Equation (3):
Equation (3) is a representative probability model (a so-called hidden Markov model, HMM), which has been the basis of many machine learning approaches to speech act classification.
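The derivation described in this subsection can be sketched in standard HMM notation as follows. The symbols are assumptions chosen for illustration and may differ from the notation of Equations (1)-(3): $U_{1,n}$ denotes the sequence of $n$ utterances, $S_{1,n}$ the corresponding speech acts, and the context is taken to be the immediately preceding speech act $S_{i-1}$, which is one common reading of "earlier speech acts". The denominator $P(U_{1,n})$ is dropped because it does not depend on the speech acts being maximized over.

```latex
\begin{align}
S^{*}_{1,n} &= \arg\max_{S_{1,n}} P(S_{1,n} \mid U_{1,n}) \\
            &= \arg\max_{S_{1,n}} \frac{P(S_{1,n})\, P(U_{1,n} \mid S_{1,n})}{P(U_{1,n})}
             = \arg\max_{S_{1,n}} P(S_{1,n})\, P(U_{1,n} \mid S_{1,n}) \\
            &\approx \arg\max_{S_{1,n}} \prod_{i=1}^{n}
               \underbrace{P(U_i \mid S_i)}_{\text{sentential}}\;
               \underbrace{P(S_i \mid S_{i-1})}_{\text{contextual}}
\end{align}
```

Under this reading, the sentential term plays the role of the HMM emission probability and the contextual term plays the role of the transition probability.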
B. Machine Learning Models Adopted in Speech Act Classification
The earlier machine learning approaches for speech act classification can be divided into three groups: rule-, margin-, and statistics-based. The main idea of the rule-based group is to automatically generate a set of ordered rules from a training corpus. Transformation-based learning (TBL) and a decision tree (DT) are often used for speech act classification [7, 19]. The central idea of TBL is to learn an ordered set of symbolic rules according to their contribution to the training corpus. A DT is a decision-making mechanism that automatically generates possible choices according to information gain. TBL and the DT offer the advantage of having human-interpretable rules that can be manually edited for performance tuning.
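To make the role of information gain in a DT concrete, the following minimal sketch computes the entropy reduction obtained by splitting a set of labeled utterances on one feature. The feature, the labels, and the function names are hypothetical toy values chosen for illustration; they are not taken from the corpus or from any particular DT implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting `labels` on `feature_values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: does the utterance end with a question mark?
ends_with_question = [True, True, False, False]
speech_acts = ["request", "request", "inform", "inform"]
gain_perfect = information_gain(ends_with_question, speech_acts)

# A feature that is unrelated to the label yields no gain.
uninformative = [True, False, True, False]
gain_none = information_gain(uninformative, speech_acts)
```

A tree learner would evaluate such a gain for every candidate feature at each node and split on the feature with the largest value.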
The main idea of the margin-based group is to identify the most effective decision boundaries that separate positive examples and negative examples in a vector space. A support vector machine (SVM) and a multilayer perceptron (MLP; a feed-forward artificial neural network model) have shown good performance in speech act classification [20, 21]. The goal of an SVM is to find the particular hyperplane that maximizes the margin of separation between a cluster of positive examples and a cluster of negative examples. An SVM transforms the given non-linear problems into linear problems by projecting input features into higher dimensions and then quickly solves them with high performance. An SVM is one of the best known binary classification models. The goal of MLP is to find the set of weight values that will cause the neural network output to match the actual target values as closely as possible. In particular, anything that can be represented as a mapping between vector spaces can be approximated to arbitrary precision by MLP (the most frequently used type). In practice, MLP is especially useful for solving mapping problems to which hard and fast rules cannot easily be applied.
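The idea of projecting input features into a higher-dimensional space can be illustrated with a small sketch. It does not train an SVM or maximize a margin; it only shows that XOR-style data, which no line separates in two dimensions, becomes linearly separable after a hypothetical feature map that appends the product of the two inputs. The feature map and the hyperplane weights are assumptions chosen by hand for this example.

```python
def lift(x1, x2):
    """Hypothetical feature map: append the product term x1 * x2."""
    return (x1, x2, x1 * x2)


def side_of_hyperplane(point, weights, bias):
    """Return +1 or -1 depending on which side of the hyperplane the point lies."""
    score = sum(w * p for w, p in zip(weights, point)) + bias
    return 1 if score > 0 else -1


# XOR-style data with inputs in {-1, +1}: the label is +1 when the signs differ.
samples = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]

# In the lifted space, a hyperplane that looks only at the third coordinate
# (weights (0, 0, -1), bias 0) separates the two classes.
weights, bias = (0.0, 0.0, -1.0), 0.0
predictions = [side_of_hyperplane(lift(x1, x2), weights, bias) for x1, x2 in samples]
```

In an actual SVM, the projection is usually performed implicitly through a kernel function, and the hyperplane is found by optimization rather than set by hand.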
The goal of the statistics-based group is to overcome the following weak points of an HMM: the observation bias problem and the label bias problem. A maximum entropy model (MEM) and conditional random fields (CRFs) are representative statistical models that are adopted in speech act classification [13, 22]. A MEM focuses on relaxing the 2 independence assumptions of the HMM mentioned in Section III-A. Due to the strong independence assumptions, the observation targets of the HMM are restricted to atomic entities such as words and parts of speech (POS). In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations [23]. In Equation (3), all terms of the right hand side are represented by conditional probabilities. We can estimate the probability of each term using Equation (4):
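As an illustration of how a MEM estimates a conditional probability from multiple overlapping features, the following sketch implements the standard log-linear form, in which the score of each label is a weighted sum of active features and the scores are normalized over all labels. Whether this matches the exact notation of Equation (4) is an assumption; the feature names, labels, and weights are hypothetical, and a trained model would estimate the weights from the corpus.

```python
import math


def maxent_probabilities(active_features, weights, labels):
    """Conditional distribution P(label | features) of a log-linear model.

    `weights` maps (feature, label) pairs to real-valued weights;
    pairs that are missing from the mapping count as 0.
    """
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in active_features)
        for label in labels
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
    normalizer = sum(exp_scores.values())
    return {label: v / normalizer for label, v in exp_scores.items()}


# Hypothetical weights mixing sentential and contextual features.
labels = ["request", "inform"]
weights = {
    ("has_question_mark", "request"): 1.5,
    ("word=schedule", "request"): 0.5,
    ("prev_act=request", "inform"): 2.0,
}
probs = maxent_probabilities(["has_question_mark", "word=schedule"], weights, labels)
```

Because features are simply summed inside the exponent, correlated or overlapping features can be added without the independence assumptions that the HMM requires.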
Now we can evaluate
In Equation (5),
Machine learning model performance is critically affected by the quality of the input features (
As shown in Table 2, input features for speech act classification are divided into two types: one pertains to the input features associated with the sentential probability in Equation (3), while the other pertains to the input features associated with the contextual probability in Equation (3). The former are generally called sentential features, while the latter are called contextual features. In many cases, sentential features are too numerous to be used as inputs to machine learning models. Therefore, methods of removing non-informative features have been required. Yang and Pedersen [17] performed a comparative study of optimal feature selection for document classification. They showed that the
[Table 2.] Optimal feature set for Korean speech act classification
dence between a feature,
In Equation (7),
In Equation (8),
A. Data Sets and Experimental Settings
We collected a Korean dialogue corpus simulated in a schedule management domain similar to appointment scheduling and alarm setting. The dialogue corpus was obtained by eliminating interjections and erroneous expressions from the original transcriptions of simulated dialogues between two speakers, to whom the dialogue task had been given in advance: one participant freely asks about his/her daily schedules, and the other participant responds to the questions or asks questions in return, using knowledge bases provided in advance. This corpus consists of 900 dialogues comprising 20,079 utterances (22.3 utterances per dialogue). Each utterance in the dialogues is manually annotated with speech acts and concept sequences. Table 3 shows part of the annotated dialogue corpus.
In Table 3, KS represents a Korean sentence and EN represents the translated English sentence that is not written in the original dialogue corpus. SP has a value of either User or System depending on the speaker. SA represents a speech act. In this paper, we define 11 domain-independent speech acts (Table 4).
The manual tagging of speech acts was performed by five graduate students with dialogue analysis knowledge and post-processed by a doctoral student for consistency. To evaluate various machine learning models, we divided the annotated dialogue corpus into the training corpus (800 dialogues) and the testing corpus (100 dialogues). We selected a representative model per machine learning group for use as comparison models: a DT in the rule-based group, an SVM in the margin-based group, and a MEM in the statistics-based group. We selected the MEM instead of CRFs in the statistics-based group because CRFs showed performance similar to the MEM despite requiring much more training time. We think that CRFs are more appropriate for batch jobs, such as POS tagging and named entity (NE) tagging, which are started after all strings have been input. The comparison models used the same input features as in Table 2. The numbers of features for the machine learning methods were determined experimentally. A total of 3,000 sentential features were selected based on the
[Table 3.] Part of the annotated dialogue corpus
[Table 4.] Speech acts and their meanings
[Table 5.] Comparison of memory requirements and processing speeds
The first experiment evaluated the memory requirements and processing speeds of the various models. Table 5 shows the results of the first experiment. The comparison models were evaluated on a personal computer with an Intel Xeon 2.00 GHz CPU, 4 GB of memory, and Red Hat Linux.
[Table 6.] Performance comparison in terms of various evaluation measures
As shown in Table 5, the DT performed worse than the other methods in both memory usage and processing time because the feature selection did not work in the DT. Because an SVM is a binary classification method, extension of the binary classification using an SVM is normally applied to
The second experiment compared the performance of the various models. Table 6 shows the model performance in terms of various evaluation measures such as the accuracy, macro precision, macro recall rate, and macro F1 measure.
In Table 6, the accuracy is the proportion of correct speech acts among those returned. The macro precision is the average, over categories, of the proportion of returned speech acts that are correct. The macro recall rate is the average, over categories, of the proportion of correct speech acts that are returned. The macro F1-measure combines the macro precision and macro recall rate with an equal weighting in the following form: F1 = (2.0 × macro precision × macro recall rate)/(macro precision + macro recall rate). As shown in Table 6, the SVM shows the best performance, which is similar to the result of Kim et al. [15], who reported that the MEM is also an efficient method for speech act classification because it has advantages in terms of hardware requirements and exhibits a performance difference of <1% compared with the SVM.
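The evaluation measures defined above can be computed as in the following minimal sketch. The macro F1 follows the formula given in the text, that is, it combines the macro-averaged precision and recall rather than averaging per-category F1 values. The gold and predicted labels are hypothetical toy data, and scoring a category with no returned or no gold items as 0 is an assumption made here to avoid division by zero.

```python
def macro_scores(gold, predicted):
    """Return accuracy, macro precision, macro recall, and macro F1."""
    categories = sorted(set(gold) | set(predicted))
    precisions, recalls = [], []
    for c in categories:
        true_pos = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        returned = sum(1 for p in predicted if p == c)
        relevant = sum(1 for g in gold if g == c)
        # Assumption: a category with no returned (or no gold) items scores 0.
        precisions.append(true_pos / returned if returned else 0.0)
        recalls.append(true_pos / relevant if relevant else 0.0)
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    macro_p = sum(precisions) / len(categories)
    macro_r = sum(recalls) / len(categories)
    denom = macro_p + macro_r
    macro_f1 = 2.0 * macro_p * macro_r / denom if denom else 0.0
    return accuracy, macro_p, macro_r, macro_f1


# Hypothetical toy labels for four utterances.
gold = ["request", "inform", "inform", "response"]
predicted = ["request", "inform", "response", "response"]
accuracy, macro_p, macro_r, macro_f1 = macro_scores(gold, predicted)
```

Because macro averaging weights every category equally, rare speech acts influence these scores as much as frequent ones, which is why they can differ noticeably from the accuracy.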
We reviewed the earlier machine learning methods for Korean speech act classification. First we reviewed a representative statistical model. Based on the statistical model, we reviewed three groups of machine learning models: a rule-based group, a margin-based group, and a statistics-based group. In the experiments with a goal-oriented dialogue corpus in a schedule management domain, we selected a single representative per group among previous models: C4.5 in the rule-based group, SVM in the margin-based group, and MEM in the statistics-based group. We then compared the representative models using various evaluation measures. The experimental results revealed that the MEM offers advantages in terms of hardware requirements while the SVM offers advantages in terms of performance.