OA 학술지
Improved Bimodal Speech Recognition Study Based on Product Hidden Markov Model
  • cc icon
  • cc icon

Recent years have been higher demands for automatic speech recognition (ASR) systems that are able to operate robustly in an acoustically noisy environment. This paper proposes an improved product hidden markov model (HMM) used for bimodal speech recognition. A two-dimensional training model is built based on dependently trained audio-HMM and visual-HMM, reflecting the asynchronous characteristics of the audio and video streams. A weight coefficient is introduced to adjust the weight of the video and audio streams automatically according to differences in the noise environment. Experimental results show that compared with other bimodal speech recognition approaches, this approach obtains better speech recognition performance.

Feature extraction , Bimodal speech recognition , Product hidden Markov model , Weight coefficient
  • 1. Introduction

    Speech recognition technology has made great progress in recent decades, and automatic speech recognition (ASR) systems have become increasingly widespread. Since an ASR system is vulnerable to speech noise, and since almost all voice signals contain noise, ASR identification performance using only audio information cannot meet the need. Therefore, developing a robust speech recognition system in a noisy environment is an urgent problem. Developing an integration strategy for audio and visual information is one of the many challenges facing an audio-visual (bimodal) ASR system. From the point of view of perception, video information corresponding to audio information can improve a person’s understanding of a speaker’s voice. In a noisy environment or for hearing-impaired listeners, video information is a useful complement.

    Generally, the audio-visual ASR (AVSR) systems work by the following procedures. First, the acoustic and the visual signals of speech are recorded by a microphone and a camera, respectively. Then, each signal is converted into an appropriate form of compact features. Finally, the two modalities are integrated for recognition of the given speech. Integration of acoustic and visual information aims at obtaining as good recognition results as possible in noisy circumstances. It can take place either before the two information sources are processed by a recognizer early integration (EI) or after they are classified independently late integration (LI). LI has been shown to be preferable because of its better performance and robustness than EI [1], and psychological supports [2].

    Considering the asynchronous nature of speech signal and video signal, we put forward an improved product hidden Markov model (HMM) in this paper, used for implementing the bimodal voice recognition of Chinese words, which formulate an improved HMM as a multi-stream HMM. Moreover, we control the stream weights of the audio-visual HMM by the generalized Pareto distribution (GPD) algorithm [1,3], in order to adaptively optimize the audio-visual ASR. According to the corresponding relation between the weight coefficient and instantaneous signal-to-noise ratio (SNR), it can adjust the weight ratio of audio stream and video stream adaptively.

    Experimental results demonstrate that the importance of audio features is far higher than that of video features in a quiet environment, while in the presence of noise, video features make an important contribution to speech recognition [4-6].

    In Section 2, we introduce some related research works about speech recognition technology. In Section 3, we introduce the classical speech feature parameter extraction approach. In Section 4, we put forward our improved HMM, and, in Section 5, we present our weight optimization approach based on the GDP algorithm. In Section 6, we implement our improved bimodal product HMM system, and, show some experiment results between other ASR systems and the improved HMM. Section 7 is conclusion and our future work.

    2. Related Works

    In speaker recognition, features are extracted from speech signals to form feature vectors, and statistical patter n recognition methods are applied to model the distribution of the feature vectors in the feature space. Speakers are recognized by pattern matching of the statistical distribution of their feature vectors with target models. Speaker verification (SVR) is the task of deciding, upon receiving tested feature vectors, whether to accept or reject a speaker hypothesis, according to the speaker’s model. Mel-frequency cepstral coefficients (MFCC) [7] are a popular feature-extraction method for speech signal processing, and Gaussian mixture models (GMM) have become a dominant approach for statistical modeling of speech feature vectors for text-independent SVR [8].

    A recently developed method for overcoming model mismatch is to use a reverberant speech database for training target models [9]. This method was tested on an adaptive-GMM (AGMM)-based SVR system [10] with reverberant speech, with various values of reverberation time (RT). Matching of RT between training and testing data was reported to reduce the equalerror rate (EER) from 16.44% to 9.9%, on average, when using both Z-norm and T-norm score normalizations. However, the study in [9] did not investigate the effect of GMM order on SVR performance under reverberation conditions. In fact, it may be difficult to find such research studies on this effect in the literature.

    The audio and visual fusion techniques investigated in previous work include feature fusion, model fusion, or decision fusion. In feature fusion, the combined audio-visual feature vectors are obtained by the concatenation of the audio and visual features, followed by a dimensionality reduction transform [11]. The resulting observation sequences are then modeled using one HMM [12]. A model fusion system based on multi-stream HMM was proposed in [13]. The multi-stream HMM assumes that audio and video sequences are state synchronous but allows the audio and video components to have different contribution to the overall observation likelihood. However, it is well known that the acoustic features of speech are delayed from the visual features of speech, and assuming state synchronous models can be inaccurate. We proposed an audio visual bimodal that uses a product HMM. The audio visual product HMM can be seen as an extension of the multi-stream HMM that allows for audio-video state asynchrony. Decision fusion systems model independently the audio and video sequences using two HMMs, and combine the likelihood of each observation sequence based on the reliability of each modality [11].

    3. Speech Feature Parameter Extraction

       3.1 Audio Feature Extraction

    The normalized energy, MFCC and linear predictive cepstrum coefficients (LPCC) of speech describe the prosodic features, timbre features and perceived features, respectively, so they are selected as audio feature parameters in this paper.

    The MFCC computation formula is as follows:


    where N is the number of triangular filters; Emel(t,j) is the output energy for the jth filter at t time, {MFCC(t,i)}i=1,2,...,p is the corresponding MFCC parameters at t time, and P of {MFCC(t,i)}i=1,2,...,p is order.

    The LPCC computation formula is as follows:

    LPCC(t; i) =LPC(t; i)


    where LPC (t, k) is the k?thlinear prediction coefficient at time t, {LPCC(t,i)}i=1,2,...,p is the corresponding LPCC parameters at time t, and P of {LPCC(t,i)}i=1,2,...,p is order.

       3.2 Video Feature Extraction

    We select lip parameters as video features, a segment video image using a two-dimensional fast thresholding segmentation algorithm, and lip feature extraction parameters [14], as shown in Figure 1, where x is the distance from the labial center line to the edge, u0 and d0 are the heights of the upper and lower halves, respectively, of the mouth centerline, and u1,d1,u2,d2 are the respective heights of the corresponding x-third point. The original lip parameter is vt= [x, u0, d0, u1,d1,u2,d2] for the image frame at t time.

    4. Improved Product HMM

    Auditory and visual features have some synchronicity, with some asynchrony within a certain range. When people talk, mouth movement has already begun before the voice, and it takes time to close the mouth and return to the natural state after the voice, so visual information is usually ahead of auditory information by about 120 ms [15], which is close to the average duration of a phoneme.

    Therefore, asynchrony can be permitted in the auditory and visual training model. This paper proposes an improved product HMM model based on the product HMM method. For Chinese words recognition, which usually corresponds to five or six states, only one state migration is allowed between the audio and video streams as a result of the presence of asynchrony. Figure 2 shows theHMMmodel for the audio and video streams, and Figure 3 shows the topology of the corresponding product

    HMM model.

    We extract the video features and audio features during the training stage. Generally speaking, the frame rate of the phonetic features is higher than the frame rate of the video features, so we use interpolation for them after video-feature extraction in order to ensure training synchronicity in the data flow.

    A bimodal speech feature vector consists of the observation vectors of audio features and video features. According to Bayes’ theorem, the classification result for the maximum posterior probability is


    where W denotes some word and Oa and Oυdenote the vector sequence of audio and video features, respectively; if they are independent of each other, the joint output probability is P(Oa,Oυ)= P(Oa │ W)P(Oυ │ W).

    A multiple-model method needs to combine the audio and video streams in terms of formulas. For the improved product HMM method, we assume that the audio and video streams are conditionally independent, so observation vector and transition probabilities at t time are


    where aaij is the transition probability from state i to state j among the audio HMM and aυmn is the transition probability from state m to state n in the video HMM. The output probability of state ij is


    The weight coefficient ? (0 < ? < 1) reflects the different weights of the two modes, which depend on the recognition performance of each mode under the different noise conditions. Zhao et al.’s experiments [16] showed the following linear relationship.


    The weight coefficient is greater (? > 0:825) when the speech signal noise is smaller (SNR>25 dB), illustrating that audio information plays a larger role in the decision-making. The weight coefficient decreases with an increase of noise, illustrating that the proportion of video information increases gradually in the decision-making. When SNR = 5 dB, ? ? 0.5, illustrating that they have the same importance at that moment.

    5. Weight Optimization Based on the GPD Algorithm

    Using the existing training data, based on formula (7), and calculating the weight coefficient according to the GDP algorithm [17], this training algorithm defines a misclassification distance that provides the correct distance between the class information and other information. The misclassification distance is computed using a smooth loss function and is minimized.

    For a well-trained product HMM model, according to the Nbest recognized hypotheses, supposing x as the unknown word vector,


    (?) as the logarithmic likelihood values of correctly identifying the x in the model,


    (?) as the logarithmic likelihood values of the N-best candidate vector of misrecognized words, the misclassification distance is


    Where η is a smoothing parameter and N is the total candidate number. The total loss function of x after smoothing is


    The purpose of training is to minimize Lost(?) so as to minimize the error. The recursive formula of weight is


    The condition is


    is a finite positive matrix sequence. The algorithm converges as k→ ∞. The recursion stops, and the final weight is obtained when the difference of the recursive value is smaller than a given threshold.

    6. Experiment Results and Analysis

       6.1 Experiment Dataset

    We constructed a bimodal corpus and selected from seven people (five for male and two for female). The corpus contains 50 Chinese words, totaling 750 words for the seven people, including 550 words for training and the others for recognition. As needed, we added some noises of different intensity for recognition speech words. The sampling rate of the speech signal is 22.05 kHz. The quantitative value is 16 bits. The frame length of the speech frame is 28 ms. The frame shift is 14 ms, using a Hamming window as the window function. In order to ensure the synchronization of the video and audio streams after the extraction of video and audio features, we interpolated the video features and input these feature parameters into the improved product HMM, shown as in Figure 4. The video features were the lip parameter vt and the dynamic parameter vt, totaling 14 dimensions. To determine the final scheme of the audio features, we preselected three sets of features as follows:

    (1) MFCC feature: MFCC (14 dimensions) + ΔMFCC (14 dimensions) + normalized audio energy, totaling 29 dimensions;

    (2) LPCC feature: LPCC (14 dimensions) + ΔLPCC (14 dimensions) + normalized audio energy, totaling 29 dimensions;

    (3) MFCC-LPCC joint feature: MFCC (14 dimensions) + ΔMFCC (14 dimensions) + LPCC (14 dimensions) + ΔLPCC (14 dimensions) + normalized audio energy, totaling 57 dimensions.

       6.2 Experiment Results

    Under different SNR conditions, according to the recognition result (Table 1), we selected the MFCC-LPCC joint feature to train modal and recognize speech.

    The convergence performance of the GPD algorithm depends on the choice of the feature parameters. After many experiments, we set N = 2 in formula (8), α = 0:1 in formula (9), ?k = 50=k in formula (10), and convergence threshold Tc= 0.01. The recursion stopped when the recursive interpolation ▽W=Wk+1?Wk< Tc.

    Four speech recognition schemes were proposed for comparing the bimodal speech recognition performance of the different methods.

    (1) For the single-modal speech recognition of the audio, the single audio model adopts the classical left-to-right no-cross HMM modal [18]. Considering that the training objects are Chinese words, five or six states are selected and the output probability density function is a four-dimension mixed Gaussian density distribution. The audio parameters are the aforementioned 57 dimension MFCC-LPCC joint feature parameters.

    (2) Based on the EI model [19], the joint feature vector is composed of the audio feature vector and video feature vectors, which are inputted to the single HMM model for training, with the same modal architecture as in solution (1).

    (3) The multiple HMM model, mixing audio and video features on the state layer, assigning different weights to the audio and video streams according to the principle of formula (6) [5,6], requires special restrictions so as to maintain synchronization between the video and voice stream.

    (4) This solution achieves word recognition through the improved product HMM model (Figure 3) proposed here. Based on the assumption that the audio and video streams are independent of each other, it allows for a step-state deviation between them.

    To facilitate the performance comparison, the audio and video feature parameters of solutions (2) and (3) are the same as that of solution (4), and the recognition results are as shown in Figure 5.

       6.3 Result Comparison and Analysis

    In a low-noise environment, the difference in the recognition rate between the single-mode audio recognition method (solution 1) and the bimodal recognition method (solution 2-solution 4) is not large. When noise increases, the gap in recognition rate between single-mode and double-mode recognition will increase. This shows that in high-noise environments, the video information contributes more to the speech recognition rate.

    For the bimodal recognition method, the EI model-based method did not assign the weights of video and audio information dynamically, leading to the lowest recognition rate for this method among all the bimodal recognition methods. The solution 3 in [5] only considered the speech recognition of isolated words, while we consider recognizing Chinese phrases, increasing the recognition length. In contrast to our method, [5] used a neural network model, which was slower in actual speech recognition. Furthermore, [5] did not consider the asynchronous nature of the video and audio signals, but simply used the weighted fusion of the speech and video streams. Different from other methods, the method proposed in this paper considered the asynchrony, as shown in Figure 5, the speech recognition rate was slightly higher than in solution 3.

    7. Conclusion

    In this paper, with the aim of achieving effective speech recognition in noisy environments, a product HMM-based bimodal speech model allowing a one-step state offset to adapt to the asynchronous nature of the video signal and audio signal is proposed.

    According to the corresponding relation between the weight coefficient and instantaneous SNR, the model can adjust the weight ratio of the audio and video streams adaptively. We selected a 50 Chinese 2-digit word corpuses as training and identification data, in contrast to other types of programs. The result showed that our proposed model can ensure the accuracy and robustness of speech recognition in a noisy environment.

    Although we have shown effectiveness of the proposed bimodal HMM method on the Chinese 2-digit word recognition tasks, this scheme can be extended for multiword or continuous speech recognition tasks. In such cases, it would be a problem that, from the two modalities, we have unmanageably many possible word or phoneme sequence hypotheses to be considered for weighted integration. Also, more complicated interactions between the modalities can be modeled by using cross-modal associations and influences, where we still can use the proposed integration method for adaptive robustness. With these considerations, further investigation of applying the proposed system to complex tasks such as multiword or continuous speech recognition is in progress.

      >  Conflict of Interest

    No potential conflict of interest relevant to this article was reported.

  • 1. Dasarathy B. V. 1997 “Sensor fusion potential exploitation: innovative architectures and illustrative applications” [Proceedings of the IEEE] Vol.85 P.24-38 google doi
  • 2. Massaro D. W. 1987 “Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry” google
  • 3. Lee J. S., Park C. H. 2006 “Training hidden Markov models by hybrid simulated annealing for visual speech recognition” [in Proceedings of 2006 IEEE International Conference on Systems, Man and Cybernetics] P.198-202 google
  • 4. Kumatani K., Nakamura S., Shikano K. 2001 “An adaptive integration based on product HMM for audio-visual speech recognition” [in Proceedings of 2001 IEEE International Conference on Multimedia and Expo] P.813-816 google doi
  • 5. Lee J. S., Park C. H. 2008 “Robust audio-visual speech recognition based on late integration” [IEEE Transactions on Multimedia] Vol.10 P.767-779 google doi
  • 6. Dupont S., Luettin J. 2000 “Audio-visual speech modeling for continuous speech recognition” [IEEE Transactions on Multimedia] Vol.2 P.141-151 google doi
  • 7. Davis S., Mermelstein P. 1980 “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences” [IEEE Transactions on Acoustics, Speech and Signal Processing] Vol.28 P.357-366 google doi
  • 8. Reynolds D. A., Quatieri T. F., Dunn R. B. 2000 “Speaker verification using adapted Gaussian mixture models” [Digital Signal Processing] Vol.10 P.19-41 google
  • 9. Peer I., Rafaely B., Zigel Y. 2008 “Reverberation matching for speaker recognition” [in Proceedings of 2008 IEEE International Conference on Acoustics, Speech and Signal Processing] P.4829-4832 google doi
  • 10. Bimbot F., Bonastre J. F., Fredouille C., Gravier G., Magrin-Chagnolleau I., Meignier S., Merlin T., Ortega-Garcia J., Petrovska-Delacretaz D., Reynolds D. A. 2004 “A tutorial on text-independent speaker verification” [EURASIP Journal on Advances in Signal Processing] Vol.2004 P.430-451 google doi
  • 11. Neti C., Potamianos G., Luettin J., Matthews I., Glotin H., Vergyri D., Sison J., Mashari A., Zhou J. 2000 “Audio visual speech recognition,” in Final Workshop 2000 Report google
  • 12. Rabiner L. R., Juang B. H. 1993 Fundamentals of Speech Recognition google
  • 13. Luettin J., Potamianos G., Neti C. 2001 “Asynchronous stream modeling for large vocabulary audio-visual speech recognition” [in Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing] P.169-172 google doi
  • 14. Zhao H., Tang C., Yu T. 2008 “Fast thresholding segmentation for image with high noise” [in Proceedings of 2008 International Conference on Information and Automation] P.290-295 google doi
  • 15. Xie Lei, Jiang D. 2004 “Audio-visual synthesis and synchronous asynchronous experimental research for bimodal speech recognition” [Journal of Northwestern Polytechnical University] Vol.22 P.171-175 google
  • 16. Zhao H., Gu Y., Tang C. 2009 “Research of relationship between weight coefficient of product HMM and instantaneous SNR in bimodal speech recognition” [Journal of Computer Application] Vol.29 P.279-285 google
  • 17. Adjoudani A., Benot C. 1995 “On the integration of auditory and visual parameters in an HMM-based ASR” [in Proceedings NATO ASI Conference on Speechreading by Man and Machine: Models, Systems and Applications] P.461-471 google
  • 18. Rabiner L. 1989 “A tutorial on hidden Markov models and selected applications in speech recognition” [Proceedings of the IEEE] Vol.77 P.257-286 google doi
  • 19. Bregler C., Omohundro S. M. 1995 “Nonlinear manifold learning for visual speech recognition” [in Proceedings of 1995 5th International Conference on Computer Vision] P.494-499 google doi
이미지 / 테이블
  • [ Figure 1. ]  Mouth parameters.
    Mouth parameters.
  • [ Figure 2. ]  Double stream hidden Markov model model for word speech recognition (? is synchronization point).
    Double stream hidden Markov model model for word
speech recognition (? is synchronization point).
  • [ Figure 3. ]  Topological structure of the proposed product hidden Markov model.
    Topological structure of the proposed product hidden
Markov model.
  • [ Figure 4. ]  Proposed bimodal speech recognition process.
    Proposed bimodal speech recognition process.
  • [ Table 1. ]  Bimodal speech recognition rate of audio feature parameters for different SNR
    Bimodal speech recognition rate of audio feature parameters for different SNR
  • [ Figure 5. ]  Speech recognition rate comparison of four solutions for different signal-to-noise ratios (SNRs).
    Speech recognition rate comparison of four solutions for
different signal-to-noise ratios (SNRs).
(우)06579 서울시 서초구 반포대로 201(반포동)
Tel. 02-537-6389 | Fax. 02-590-0571 | 문의 : oak2014@korea.kr
Copyright(c) National Library of Korea. All rights reserved.