Review of Korean Speech Act Classification: Machine Learning Methods


    To resolve ambiguities in speech act classification, various machine learning models have been proposed over the past 10 years. In this paper, we review these machine learning models and present the results of experimental comparison of three representative models, namely the decision tree, the support vector machine (SVM), and the maximum entropy model (MEM). In experiments with a goal-oriented dialogue corpus in the schedule management domain, we found that the MEM has lighter hardware requirements, whereas the SVM has better performance characteristics.


    Korean speech act classification, Machine learning method


    Goal-oriented dialogues such as appointment scheduling, call routing, and hotel reservation booking consist of sequences of goal-oriented utterances. The speakers’ intentions implied by each utterance can be represented using semantic forms called speech acts [1]. In Table 1, Utterance (3) shows that a user requests a system to search his schedule. The requesting action comprising Utterance (3) is the speech act.

    As shown in Table 1, to generate correct reactions, a dialogue system should identify the speech acts indicated by users’ utterances to capture the speakers’ intentions. If a dialogue system fails to capture users’ intentions, the system will not be able to decide whether to respond to users’ questions or to request additional information from users to achieve the task goals. It is difficult, however, to infer speech acts from surface utterances because they are context-dependent. For example, the speech act of Utterance (7) in Table 1 can be evaluated as an “inform” or a “response” in surface analysis. Various models have been proposed over the past 25 years to resolve this ambiguity. In recent years, there has been increased interest in using statistical and machine learning approaches. In this paper, we review a representative probabilistic model for speech act classification and then compare various machine learning models to it.

    This paper is organized as follows. In Section II, we review earlier work on speech act identification. In Section III, we review speech act classification models based on machine learning methods. In Section IV, we experimentally compare the reviewed models. Finally, in Section V, we draw conclusions.


    Initial approaches to speech act identification were based on knowledge such as plan inference recipes and domain-specific knowledge [2, 3]. Since these knowledge-based models depend on costly handcrafted knowledge, they are difficult to scale up and to extend to other domains. To overcome this problem, many machine learning approaches have been proposed in recent years for speech processing. The task of identifying users’ intentions is an area in which this approach has been relatively successful, as shown with various machine learning models [4-14]. Machine learning models offer a way to associate utterance features with particular categories indicating users’ intentions, since the computer can efficiently analyze a large quantity of data and consider many different feature interactions.

    However, input features critically affect machine learning models. If the input features are uninformative or biased, the models cannot take full advantage of the particular input features that may provide valuable clues for identifying users’ intentions. Many feature extraction and selection methods have been proposed to resolve these problems. Lee et al. [5] used the structural information of discourse in speech act analysis. However, such structural information is insufficient for covering various dialogues, since the authors used a restricted rule-based model such as a recursive transition network to perform the discourse structure analysis. Kim et al. [15] comparatively studied optimal feature identification for Korean speech act classification by evaluating and comparing each feature combination. Many researchers have studied feature selection methods for text categorization and speech act classification [16, 17]. Yang and Pedersen [17] presented a comparative study of feature selection methods for statistical learning in text categorization and found information gain and the $\chi^2$ statistic to be the most effective in their experiments. Kim et al. [16] reported that a neural network can partially increase precision and decrease training time when it uses a feature selection method based on the $\chi^2$ statistic.


      >  A. Representative Probabilistic Model for Speech Act Classification

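    Given $n$ utterances $U_{1,n}$ in a dialogue, let $S_{1,n}$ denote the speech acts of $U_{1,n}$. The speech act classification model can then be formally defined as follows:

    $$S(U_{1,n}) = \operatorname*{argmax}_{S_{1,n}} P(S_{1,n} \mid U_{1,n}) \qquad (1)$$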


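    We can rewrite Equation (1) as Equation (2) using Bayes’ theorem. We exclude $P(U_{1,n})$ from Equation (2) since it is constant with respect to $S_{1,n}$:

    $$S(U_{1,n}) = \operatorname*{argmax}_{S_{1,n}} P(S_{1,n})\, P(U_{1,n} \mid S_{1,n}) \qquad (2)$$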


    Next we simplify Equation (2) by making the following assumptions: the current speech act depends only on earlier speech acts, and the current utterance depends only on its speech act. With these assumptions, we formulate the speech act classification model as a product of the sentential probability $P(U_i \mid S_i)$ and the contextual probability $P(S_i \mid S_{1,i-1})$ [4, 18], as shown in Equation (3):

    $$S(U_{1,n}) = \operatorname*{argmax}_{S_{1,n}} \prod_{i=1}^{n} P(U_i \mid S_i)\, P(S_i \mid S_{1,i-1}) \qquad (3)$$


    Equation (3) is a representative probability model (a so-called hidden Markov model, HMM), which has been the basis of many machine learning approaches to speech act classification.
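    To make the role of the two probabilities concrete, the following sketch decodes the most probable speech act sequence with the Viterbi algorithm. It is only an illustration under stated assumptions: the contextual probability is truncated to the immediately preceding speech act (a first-order approximation of the history), and the speech act names, utterances, and probability values are all hypothetical rather than taken from the corpus.

```python
import math

def viterbi(utterances, speech_acts, sentential, contextual, initial):
    """Return the most probable speech act sequence for the utterances.

    sentential[(u, s)] ~ P(U_i = u | S_i = s)
    contextual[(p, s)] ~ P(S_i = s | S_{i-1} = p)  (first-order assumption)
    initial[s]         ~ P(S_1 = s)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: best log-probability of any sequence ending in s at position i
    best = [{s: log(initial[s]) + log(sentential.get((utterances[0], s), 0.0))
             for s in speech_acts}]
    back = [{}]
    for i in range(1, len(utterances)):
        best.append({})
        back.append({})
        for s in speech_acts:
            prev, score = max(
                ((p, best[i - 1][p] + log(contextual.get((p, s), 0.0)))
                 for p in speech_acts),
                key=lambda item: item[1],
            )
            best[i][s] = score + log(sentential.get((utterances[i], s), 0.0))
            back[i][s] = prev
    # Follow the back-pointers from the best final speech act.
    path = [max(speech_acts, key=lambda s: best[-1][s])]
    for i in range(len(utterances) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model: the second utterance is equally likely to be an
# "inform" or a "response" on the surface; only the context separates them.
ACTS = ["request", "inform", "response"]
UTTERANCES = ["when is the meeting", "at three"]
INITIAL = {"request": 0.5, "inform": 0.4, "response": 0.1}
SENTENTIAL = {
    ("when is the meeting", "request"): 0.6,
    ("when is the meeting", "inform"): 0.05,
    ("when is the meeting", "response"): 0.05,
    ("at three", "request"): 0.02,
    ("at three", "inform"): 0.4,
    ("at three", "response"): 0.4,
}
CONTEXTUAL = {
    ("request", "request"): 0.2, ("request", "inform"): 0.1, ("request", "response"): 0.7,
    ("inform", "request"): 0.5, ("inform", "inform"): 0.4, ("inform", "response"): 0.1,
    ("response", "request"): 0.5, ("response", "inform"): 0.4, ("response", "response"): 0.1,
}

decoded = viterbi(UTTERANCES, ACTS, SENTENTIAL, CONTEXTUAL, INITIAL)
print(decoded)
```

    In this toy setting the sentential probabilities alone cannot choose between “inform” and “response” for the second utterance; the contextual probability following a “request” resolves the tie.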

      >  B. Machine Learning Models Adopted in Speech Act Classification

    The earlier machine learning approaches for speech act classification can be divided into three groups: rule-based, margin-based, and statistics-based. The main idea of the rule-based group is to automatically generate a set of ordered rules from a training corpus. Transformation-based learning (TBL) and decision trees (DTs) are often used for speech act classification [7, 19]. The central idea of TBL is to learn an ordered set of symbolic rules according to their contribution to the training corpus. A DT is a decision-making mechanism that automatically generates possible choices according to information gain. TBL and the DT offer the advantage of producing human-interpretable rules that can be manually edited for performance tuning.
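    As a rough illustration of how a DT ranks candidate choices, the sketch below computes the information gain of splitting a set of labeled utterances on the presence of a single feature. The example utterances, features, and labels are hypothetical. Note also that C4.5 itself uses the gain ratio, a normalized variant of this quantity, so this is a simplified view rather than the toolkit's exact criterion.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting on presence of `feature`.

    examples: list of (set_of_features, speech_act) pairs.
    """
    labels = [act for _, act in examples]
    present = [act for feats, act in examples if feature in feats]
    absent = [act for feats, act in examples if feature not in feats]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

# Hypothetical training examples: (features of the utterance, speech act).
EXAMPLES = [
    ({"when", "?"}, "request"),
    ({"what", "?"}, "request"),
    ({"at", "three"}, "response"),
    ({"at", "noon"}, "response"),
]

gain_question_mark = information_gain(EXAMPLES, "?")
gain_three = information_gain(EXAMPLES, "three")
print(gain_question_mark, gain_three)
```

    Here the question mark separates the two speech acts perfectly and therefore has a higher gain than the word “three”, so a tree learner would test it first.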

    The main idea of the margin-based group is to identify the most effective decision boundaries that separate positive examples from negative examples in a vector space. A support vector machine (SVM) and a multilayer perceptron (MLP; a feed-forward artificial neural network model) have shown good performance in speech act classification [20, 21]. The goal of an SVM is to find the hyperplane that maximizes the margin of separation between a cluster of positive examples and a cluster of negative examples. An SVM transforms a given non-linear problem into a linear problem by projecting the input features into a higher-dimensional space, and then solves the problem quickly and with high performance. An SVM is one of the best-known binary classification models. The goal of an MLP is to find the set of weight values that will cause the neural network output to match the actual target values as closely as possible. In particular, anything that can be represented as a mapping between vector spaces can be approximated to arbitrary precision by an MLP, the most frequently used type of neural network. In practice, an MLP is especially useful for solving mapping problems to which hard and fast rules cannot easily be applied.
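    The notion of a margin can be illustrated with a few lines of code. The sketch below computes the geometric margin of a given hyperplane, that is, the smallest distance from any training example to the boundary, and compares two candidate hyperplanes. The two-dimensional points and the hyperplanes are hypothetical; an actual SVM would search for the maximum-margin hyperplane by solving an optimization problem rather than by comparing candidates by hand.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any example to the hyperplane w.x + b = 0.

    labels are +1 or -1; a negative result means some example is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        t * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, t in zip(points, labels)
    )

# Hypothetical, linearly separable examples.
POINTS = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, 1.0)]
LABELS = [+1, +1, -1, -1]

off_center = geometric_margin((1.0, 0.0), 0.0, POINTS, LABELS)
centered = geometric_margin((1.0, 0.0), -0.5, POINTS, LABELS)
print(off_center, centered)
```

    Both hyperplanes separate the examples, but the second one lies midway between the two clusters and has the larger margin, which is the boundary an SVM would prefer.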

    The goal of the statistics-based group is to overcome the following weak points of an HMM: the observation bias problem and the label bias problem. A maximum entropy model (MEM) and conditional random fields (CRFs) are representative statistical models that have been adopted in speech act classification [13, 22]. A MEM focuses on relaxing the two independence assumptions of the HMM mentioned in Section III-A. Due to these strong independence assumptions, the observation targets of the HMM are restricted to atomic entities such as words and parts of speech (POS). In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations [23]. In Equation (3), all terms on the right-hand side are conditional probabilities. We can estimate the probability of each term using Equation (4):

    $$P(a \mid b) = \frac{P(a, b)}{\sum_{a'} P(a', b)} \qquad (4)$$


    Now we can evaluate $P(a, b)$ using the MEM shown in Equation (5) [24]:

    $$P(a, b) = \pi \prod_{i} \alpha_i^{f_i(a, b)} \qquad (5)$$
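    The way Equations (4) and (5) work together can be sketched in a few lines. The code below assumes that Equation (5) has the usual product form, in which each active binary feature contributes its parameter as a multiplicative factor, and it normalizes over the candidate speech acts as in Equation (4). The feature functions, parameter values, and context are hypothetical; in practice the parameters would be estimated from the training corpus by a toolkit.

```python
def joint_score(a, b, features, alphas):
    """Unnormalized P(a, b): the product of alpha_i ** f_i(a, b).

    The normalization constant pi is omitted because it cancels in P(a | b).
    """
    score = 1.0
    for f, alpha in zip(features, alphas):
        score *= alpha ** f(a, b)
    return score

def conditional(a, b, candidates, features, alphas):
    """P(a | b): the joint score of a divided by the sum over all candidates."""
    total = sum(joint_score(c, b, features, alphas) for c in candidates)
    return joint_score(a, b, features, alphas) / total

# Hypothetical binary feature functions over (speech act, context).
FEATURES = [
    lambda a, b: 1 if a == "response" and b["previous_act"] == "request" else 0,
    lambda a, b: 1 if a == "inform" and "schedule" in b["words"] else 0,
]
ALPHAS = [4.0, 2.0]  # hypothetical model parameters, one per feature
ACTS = ["request", "inform", "response"]

context = {"previous_act": "request", "words": {"at", "three"}}
p_response = conditional("response", context, ACTS, FEATURES, ALPHAS)
print(round(p_response, 4))
```

    Because the features may inspect any part of the context, overlapping evidence such as the previous speech act and the words of the utterance can be combined without the independence assumptions of the HMM.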


    In Equation (5), $a$ is a speech act, depending on the term, $b$ is the context of $a$, $\pi$ is a normalization constant, and $\alpha_i$ is the model parameter corresponding to each feature function $f_i$.

    CRFs focus on resolving the problem that transition probabilities are locally normalized (the so-called label bias problem): the transitions leaving a given state compete only against each other rather than against all of the other transitions in the model [23], as shown in Equation (6):
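    For reference, the standard linear-chain CRF of Lafferty et al. [23] can be written in the notation of this section as below. This is a sketch of the general form rather than a reproduction of the equation used here; the symbols $Z$, $\lambda_k$, and $f_k$ are assumptions introduced for illustration.

```latex
P(S_{1,n} \mid U_{1,n}) =
  \frac{1}{Z(U_{1,n})}
  \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(S_{i-1}, S_i, U_{1,n}, i) \right),
\qquad
Z(U_{1,n}) = \sum_{S'_{1,n}}
  \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(S'_{i-1}, S'_i, U_{1,n}, i) \right)
```

    The normalization factor $Z(U_{1,n})$ sums over entire speech act sequences, so the scores of all transitions compete globally instead of being normalized separately at each state.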


    Machine learning model performance is critically affected by the quality of the input features (i.e., how informative the input features are). Therefore, many researchers have proposed various feature extraction methods [5, 15, 16]. Kim et al. [15] comparatively studied optimal feature identification for Korean speech act classification. Table 2 shows the set of optimal features proposed by Kim et al. [15].

    As shown in Table 2, input features for speech act classification are divided into two types: those associated with the sentential probability $P(U_i \mid S_i)$ in Equation (3), and those associated with the contextual probability $P(S_i \mid S_{1,i-1})$ in Equation (3). The former are generally called sentential features, while the latter are called contextual features. In many cases, sentential features are too numerous to be used as inputs to machine learning models. Therefore, methods of removing non-informative features are required. Yang and Pedersen [17] performed a comparative study of feature selection for document classification. They showed that the $\chi^2$ statistic and information gain are the most effective methods and that both outperform mutual information in document classification. The $\chi^2$ statistic measures the lack of independence between a feature, $f$, and a category, $S$ (i.e., a speech act), as shown in Equation (7):

    $$\chi^2(f, S) = \frac{(A + B + C + D)\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \qquad (7)$$


    In Equation (7), $A$ is the number of times that $f$ and $S$ co-occur, $B$ is the number of times that $f$ occurs without $S$, $C$ is the number of times that $S$ occurs without $f$, and $D$ is the number of times that neither $S$ nor $f$ occurs. To remove non-informative features, the maximum $\chi^2$ statistic over all feature-category pairs is calculated for each feature, as shown in Equation (8), and the top-$n$ features are selected according to these feature scores:

    $$\chi^2_{\max}(f) = \max_{k=1}^{m} \chi^2(f, S_k) \qquad (8)$$


    In Equation (8), $S_k$ is the $k$th of the $m$ speech acts.
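    The selection procedure can be sketched as follows. The code counts $A$, $B$, $C$, and $D$ exactly as defined for Equation (7), scores each feature by its maximum $\chi^2$ value over the speech acts, and keeps the top-$n$ features. The training examples are hypothetical, and breaking ties alphabetically is an arbitrary choice made only to keep the output deterministic.

```python
from collections import Counter

def chi_square(A, B, C, D):
    """Chi-square statistic of a feature-category pair from its 2x2 counts."""
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    if denominator == 0:
        return 0.0
    return (A + B + C + D) * (A * D - C * B) ** 2 / denominator

def select_features(examples, top_n):
    """Rank features by their maximum chi-square score over all speech acts.

    examples: list of (set_of_features, speech_act) pairs.
    Returns (selected_features, scores).
    """
    total = len(examples)
    act_count = Counter(act for _, act in examples)
    feature_count = Counter(f for feats, _ in examples for f in feats)
    pair_count = Counter((f, act) for feats, act in examples for f in feats)

    scores = {}
    for f in feature_count:
        best = 0.0
        for act in act_count:
            A = pair_count[(f, act)]      # f and act co-occur
            B = feature_count[f] - A      # f occurs without act
            C = act_count[act] - A        # act occurs without f
            D = total - A - B - C         # neither occurs
            best = max(best, chi_square(A, B, C, D))
        scores[f] = best
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:top_n], scores

# Hypothetical training examples: (features of the utterance, speech act).
EXAMPLES = [
    ({"when", "?"}, "request"),
    ({"what", "?"}, "request"),
    ({"at", "three"}, "response"),
    ({"at", "noon"}, "response"),
]

selected, scores = select_features(EXAMPLES, top_n=2)
print(selected)
```

    In this toy corpus the features “?” and “at” each occur in every utterance of exactly one speech act, so they receive the highest scores and are the two features retained.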


      >  A. Data Sets and Experimental Settings

    We collected a Korean dialogue corpus simulated in a schedule management domain such as appointment scheduling and alarm setting. The corpus was obtained by eliminating interjections and erroneous expressions from the original transcriptions of simulated dialogues between two speakers who had been given a dialogue task in advance: one participant freely asks about his/her daily schedule, and the other participant responds to the questions or asks questions in return, using knowledge bases provided in advance. The corpus consists of 900 dialogues and 20,079 utterances (22.3 utterances per dialogue). Each utterance in the dialogues is manually annotated with speech acts and concept sequences. Table 3 shows part of the annotated dialogue corpus.

    In Table 3, KS represents a Korean sentence, and EN represents its English translation, which is not part of the original dialogue corpus. SP has a value of either User or System depending on the speaker. SA represents a speech act. In this paper, we define 11 domain-independent speech acts (Table 4).

    The manual tagging of speech acts was performed by five graduate students with dialogue analysis knowledge and post-processed by a doctoral student for consistency. To evaluate various machine learning models, we divided the annotated dialogue corpus into a training corpus (800 dialogues) and a testing corpus (100 dialogues). We selected one representative model per machine learning group for use as comparison models: a DT in the rule-based group, an SVM in the margin-based group, and a MEM in the statistics-based group. We selected the MEM instead of CRFs in the statistics-based group because CRFs showed performance similar to that of the MEM although they required much more training time. We think that CRFs are more appropriate for batch jobs, such as POS tagging and named entity (NE) tagging, which are started after all strings have been input. The comparison models used the same input features as in Table 2. The numbers of features for the machine learning methods were determined experimentally. A total of 3,000 sentential features were selected based on the $\chi^2$ statistic in Equation (8) for each of the SVM and the MEM. Because the feature selection did not improve DT performance, the DT used all of the sentential features (10,082 features). The toolkits used for the implementations were C4.5 [25] for the DT, SVMlight [26] for the SVM, and MEMT [27] for the MEM. We set all parameters of each toolkit to their default values.

      >  B. Experimental Results

    The first experiment evaluated the memory requirements and processing speeds of the various models. Table 5 shows the results of the first experiment. The comparison models were evaluated on a personal computer with an Intel Xeon 2.00 GHz CPU, 4 GB of memory, and Red Hat Linux.

    As shown in Table 5, the DT performed worse than the other methods in both memory usage and processing time because the feature selection was not effective for the DT. Because an SVM is a binary classification method, n-ary classification is normally handled by an extension that combines binary SVMs. Therefore, the SVM requires more memory and computation time than does the MEM.
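    The extra cost of extending a binary classifier to n-ary classification can be seen in the following sketch of the one-versus-rest scheme, which trains one binary model per speech act. This is an assumption for illustration: the extension scheme actually used in the experiments is not stated, and the binary learner here is a simple linear classifier trained by subgradient descent on the regularized hinge loss, standing in for SVMlight. The feature vectors, labels, and hyperparameters are hypothetical.

```python
def train_binary(X, y, epochs=50, learning_rate=0.1, regularization=0.01):
    """Train a linear classifier on labels +1/-1 with the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink the weights (regularization term of the objective).
            w = [wi - learning_rate * regularization * wi for wi in w]
            if margin < 1:
                # The example violates the margin: move the boundary.
                w = [wi + learning_rate * t * xi for wi, xi in zip(w, x)]
                b += learning_rate * t
    return w, b

def train_one_vs_rest(X, labels):
    """Train one binary model per class: that class against all others."""
    return {
        c: train_binary(X, [1 if label == c else -1 for label in labels])
        for c in sorted(set(labels))
    }

def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

# Hypothetical feature vectors, one indicator feature per speech act.
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
LABELS = ["request", "request", "inform", "inform", "response", "response"]

models = train_one_vs_rest(X, LABELS)
predictions = [predict(models, x) for x in X]
print(len(models), predictions)
```

    With this scheme the number of weight vectors to store and the number of scores to compute per utterance both grow linearly with the number of speech acts, whereas a single MEM scores all speech acts with one set of parameters.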

    The second experiment compared the performance of the various models. Table 6 shows the model performance in terms of various evaluation measures such as the accuracy, macro precision, macro recall rate, and macro F1 measure.

    In Table 6, the accuracy is the proportion of correct speech acts among those returned. The macro precision is the average, over categories, of the proportion of returned speech acts that are correct. The macro recall rate is the average, over categories, of the proportion of correct speech acts that are returned. The macro F1-measure combines the macro precision and macro recall rate with equal weighting in the following form: F1 = (2.0 × macro precision × macro recall rate)/(macro precision + macro recall rate). As shown in Table 6, the SVM shows the best performance, which is consistent with the results of Kim et al. [15]. The MEM is also an efficient method for speech act classification because it has advantages in terms of hardware requirements and exhibits a performance difference of <1% compared with the SVM.
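    The four measures can be computed as in the following sketch, which follows the definitions above. The gold and predicted speech acts are hypothetical; a category with no returned or no correct instances is given a precision or recall of zero, which is one common convention and an assumption here.

```python
def evaluate(gold, predicted):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(gold) | set(predicted))
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    precisions, recalls = [], []
    for label in labels:
        correct = sum(g == label and p == label for g, p in zip(gold, predicted))
        returned = sum(p == label for p in predicted)
        relevant = sum(g == label for g in gold)
        precisions.append(correct / returned if returned else 0.0)
        recalls.append(correct / relevant if relevant else 0.0)

    macro_precision = sum(precisions) / len(labels)
    macro_recall = sum(recalls) / len(labels)
    if macro_precision + macro_recall == 0:
        macro_f1 = 0.0
    else:
        macro_f1 = (2.0 * macro_precision * macro_recall
                    / (macro_precision + macro_recall))
    return accuracy, macro_precision, macro_recall, macro_f1

# Hypothetical gold annotations and system outputs.
GOLD = ["request", "request", "inform", "response"]
PREDICTED = ["request", "inform", "inform", "response"]

accuracy, macro_p, macro_r, macro_f1 = evaluate(GOLD, PREDICTED)
print(accuracy, macro_p, macro_r, macro_f1)
```

    Because the macro measures average over categories rather than over utterances, an error on a rare speech act lowers them more than it lowers the accuracy.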


    We reviewed the earlier machine learning methods for Korean speech act classification. First we reviewed a representative statistical model. Based on the statistical model, we reviewed three groups of machine learning models: a rule-based group, a margin-based group, and a statistics-based group. In the experiments with a goal-oriented dialogue corpus in a schedule management domain, we selected a single representative per group among previous models: C4.5 in the rule-based group, SVM in the margin-based group, and MEM in the statistics-based group. We then compared the representative models using various evaluation measures. The experimental results revealed that the MEM offers advantages in terms of hardware requirements while the SVM offers advantages in terms of performance.

  • 1. Allen J 1995 Natural Language Understanding
  • 2. Lambert L, Carberry S 1991 June 18-21 “A tripartite plan-based model of dialogue” [Proceedings of the 29th Annual Meeting on Association for Computational Linguistics] P.47-54
  • 3. Litman D. J, Allen J. F 1987 “A plan recognition model for subdialogues in conversations” [Cognitive Science] Vol.11 P.163-200
  • 4. Nagata M, Morimoto T 1994 “First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance” [Speech Communication] Vol.15 P.193-203
  • 5. Lee J, Kim G. C, Seo J 1997 July 11 “A dialogue analysis model with statistical speech act processing for dialogue machine translation” [Proceedings of Spoken Language Translation Workshop in conjunction with ACL/EACL] P.10-15
  • 6. Reithinger N, Klesen M 1997 “Dialogue act classification using language models” [Proceedings of EuroSpeech] P.2235-2238
  • 7. Samuel K, Carberry S, Vijay-Shanker K 1998 “Dialogue act tagging with transformation-based learning” [Proceedings of the 17th International Conference on Computational Linguistics] P.1150-1156
  • 8. Stolcke A, Coccaro N, Bates R, Taylor P, Van Ess-Dykema C, Ries K, Shriberg E, Jurafsky D, Martin R, Meteer M 2000 “Dialogue act modeling for automatic tagging and recognition of conversational speech” [Computational Linguistics] Vol.26 P.339-373
  • 9. Langley C. T 2002 July “Analysis for speech translation using grammar-based parsing and automatic classification” [Proceedings of Student Research Workshop at the 40th Annual Meeting of the Association of Computational Linguistics]
  • 10. Kim H, Seo J 2003 “An efficient trigram model for speech act analysis in small training corpus” [Journal of Cognitive Science] Vol.4 P.107-120
  • 11. Webb N, Hepple M, Wilks Y 2005 July 9-10 “Dialogue act classification based on intra-utterance features” [Proceedings of the AAAI Workshop on Spoken Language Understanding]
  • 12. Choi W. S, Kim H, Seo J 2005 “An integrated dialogue analysis model for determining speech acts and discourse structures” [IEICE Transactions on Information and Systems] Vol.E88-D P.150-157
  • 13. Lee H, Kim H, Seo J 2008 “Domain action classification using a maximum entropy model in a schedule management domain” [AI Communications] Vol.21 P.221-229
  • 14. Kang S, Kim H, Seo J 2010 “A reliable multidomain model for speech act classification” [Pattern Recognition Letters] Vol.31 P.71-74
  • 15. Kim M. J, Park J. H, Kim S. B, Rim H. C, Lee D. G 2008 “A comparative study on optimal feature identification and combination for Korean dialogue act classification” [Journal of Korean Institute of Information Scientists and Engineers: Software and Applications] Vol.35 P.681-691
  • 16. Kim K, Kim H, Seo J 2004 “A neural network model with feature selection for Korean speech act classification” [International Journal of Neural Systems] Vol.14 P.407-414
  • 17. Yang Y, Pedersen J. O 1997 July “A comparative study on feature selection in text categorization” [Proceedings of the 14th International Conference on Machine Learning] P.412-420
  • 18. Charniak E 1993 Statistical Language Learning
  • 19. Lee S, Seo J 2002 “Korean speech act analysis system using hidden Markov model with decision trees” [International Journal of Computer Processing of Oriental Languages] Vol.15 P.231-243
  • 20. Surendran D, Levow G. A 2006 Sep. “Dialogue act tagging with support vector machines and hidden Markov models” [Proceedings of Interspeech/ICSLP]
  • 21. Lee H, Kim H, Seo J, King I, Wang J, Chan L.W, Wang D 2006 “Efficient domain action classification using neural networks” [Neural Information Processing. Lecture Notes in Computer Science] Vol.4223 P.150-158
  • 22. Kim D, Kim H, Seo J 2008 “A statistical prediction model of speakers’ intentions in a goal-oriented dialogue” [Journal of Korean Institute of Information Scientists and Engineers: Software and Applications] Vol.35 P.554-561
  • 23. Lafferty J. D, McCallum A, Pereira F. C. N 2001 June 28-July 1 “Conditional random fields: probabilistic models for segmenting and labeling sequence data” [Proceedings of the Eighteenth International Conference on Machine Learning] P.282-289
  • 24. Reynar J. C, Ratnaparkhi A 1997 March 31-April 3 “A maximum entropy approach to identifying sentence boundaries” [Proceedings of the 5th Conference on Applied Natural Language Processing] P.16-19
  • 25. Quinlan J. R 1993 C4.5: Programs for Machine Learning
  • 26. Joachims T “SVMLight: Support Vector Machine Version: 6.02”
  • 27. Ristad E. S 1996 Maximum entropy modeling toolkit Technical Report
  • [Table 1.] Example of a goal-oriented dialogue annotated with speech acts
  • [Table 2.] Optimal feature set for Korean speech act classification
  • [Table 3.] Part of the annotated dialogue corpus
  • [Table 4.] Speech acts and their meanings
  • [Table 5.] Comparison of memory requirements and processing speeds
  • [Table 6.] Performance comparison in terms of various evaluation measures