Text-independent Speaker Identification Using Soft Bag-of-Words Feature Representation


    We present a robust speaker identification algorithm that uses novel features based on a soft bag-of-words representation and a simple Naive Bayes classifier. The bag-of-words (BoW) based histogram feature descriptor is typically constructed by summarizing and identifying representative prototypes from low-level spectral features extracted from training data. In this paper, we define a generalization of the standard BoW. In particular, we define three types of BoW that are based on crisp voting, fuzzy memberships, and possibilistic memberships. We analyze our mapping with three common classifiers: the Naive Bayes classifier (NB), the K-nearest neighbor classifier (KNN), and support vector machines (SVM). The proposed algorithms are evaluated using large datasets that simulate medical crises. We show that the proposed soft bag-of-words feature representation approach achieves a significant improvement when compared to state-of-the-art methods.


    Speaker identification, Clustering, Bag-of-Words (BoW) feature representation, Fuzzy membership, Possibilistic membership, Naive Bayes classifier

  • 1. Introduction

    The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatric Critical Care Medicine at the University of Louisville makes extensive use of simulation in training teams of nurses, medical students, residents, and attending physicians. These simulation sessions involve trained actors simulating family members in various crisis scenarios. Sessions involve 4 to as many as 9 people and last approximately 20 minutes to one hour. They are scheduled approximately twice per week and are recorded as video data. After each session, the physician/instructor must manually review and annotate the recording and then debrief the trainees on the session. The goal is to enhance the care of children and strengthen interdisciplinary and clinician-patient interactions [1].

    The physician responsible for the simulation has recorded hundreds of sessions and has found that the manual process of review and annotation is labor intensive and that retrieving specific video segments (based on the speaker or on what was said) is not trivial. Using machine learning methods, we have developed a speaker segmentation and identification system that provides the physician with automated and efficient methods to semantically index and retrieve specific segments from large collections of simulation sessions.

    The architecture of this system is illustrated in Figure 1. It has two main components. The first one is for offline training, and the second one is for online testing. In the offline training, first audio streams are extracted from the training videos. Then, the divide-and-conquer (DAC3) based speaker segmentation method [2] is used to partition the speech sequence into homogeneous segments. Finally, a classifier is trained to discriminate between segments that correspond to different speakers. In the online testing, the input consists of an unlabeled video recording. First, the audio component is extracted and segmented. Then, each segment is labeled by the classifier. As a result, our system will identify “who spoke and when”. In this paper, we focus on developing an efficient and accurate algorithm for speaker identification.

    The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed soft bag-of-words feature representation method. Section 4 discusses our experiments. Finally, concluding remarks are given in Section 5.

    2. Related Works

       2.1 Speaker Recognition

    Speaker identification, classifying speech utterances into different speaker classes, and speaker verification, verifying a person’s claimed identity from his/her voice, are generally referred to as speaker recognition [3]. Several features and classification methods have been proposed for this task. For instance, Mel frequency cepstral coefficients (MFCC) [4], which take into account how humans perceive the difference between sounds of different frequencies, are among the most commonly used features [5, 6]. Perceptual linear predictions (PLP) [7] and linear prediction cepstral coefficients (LPCC) [8] are two other common features that rely on psychophysically based spectral transformations and linear prediction. These features may not always achieve good performance, especially in noisy environments, and many other features have been proposed. For instance, in [9], Wang proposed combining MFCC features and phase information for speaker identification. In [10], Li proposed an auditory based feature extraction algorithm by using a set of modules cochlear filter banks. In [11], Gabor filtering was applied to speech spectrum features, and a nonnegative tensor factorization method was used to extract more robust features. Other feature representations can be found in [3, 5, 12].

    Most of the above methods extract features from small overlapping windows. Thus, each speech segment can be represented by a large number of features. Moreover, since speech segments can have different durations, they will be represented by different numbers of features. To overcome this limitation, the above features are usually summarized by a small fixed number of representatives. For instance, the Gaussian mixture model (GMM), with a universal background model (UBM) for speaker adaptation [13], has been widely used for speaker recognition [5, 14]. In [14], GMM adaptation was applied to the UBM to learn each speaker. Then, log-likelihood scores with nonlinear normalization were used for speaker discrimination. In [5], the GMM mean supervector, which represents variable-size segments with a fixed dimension by concatenating all adapted Gaussian mean vectors, was combined with a support vector machine (SVM) classifier. This approach was shown to be one of the most effective methods for speaker recognition.

    An alternative approach, called possibilistic histogram features (PHF), was proposed in [1, 15]. The PHF is inspired by the “bag of words” concept used in information retrieval. It identifies a fixed set of representative prototypes, and each audio segment is mapped to the closest prototype. The relative frequency of occurrence of each prototype is used as the feature vector of each audio segment. The PHF has been used with a KNN classifier, and its performance was constrained by the vocabulary size. On one hand, a reduced vocabulary cannot represent all variations within the features. On the other hand, a larger vocabulary can improve the feature representation, but can also degrade the KNN classifier.

       2.2 Feature Representation with Bag-of-Words (BoW)

    The bag-of-words model has been widely used in various applications, such as document classification, computer vision, speech and speaker recognition, etc. In document classification, the feature is constructed based on the frequency of occurrence of each word [16]. Generally, there are two different models to represent the document. One model uses a vector of binary attributes to indicate whether a word occurs or does not occur in the document. This representation can be modeled as a multivariate Bernoulli distribution. Another model takes the number of word occurrences into account, and represents the document by a sparse histogram of words frequencies. This representation can be modeled as a multinomial model. For both models, the Naive Bayes classifier is commonly used for classification.
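    The difference between the two document models can be illustrated with a minimal sketch. The vocabulary, the example document, and the function names below are hypothetical and chosen only for illustration.

```python
from collections import Counter

def binary_vector(tokens, vocabulary):
    """Bernoulli-style representation: 1 if the word occurs, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def count_vector(tokens, vocabulary):
    """Multinomial-style representation: number of occurrences of each word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ["fever", "oxygen", "pulse", "stable"]  # hypothetical vocabulary
tokens = "pulse stable pulse oxygen".split()         # hypothetical document

print(binary_vector(tokens, vocabulary))  # [0, 1, 1, 1]
print(count_vector(tokens, vocabulary))   # [0, 1, 2, 1]
```

    The binary vector discards the fact that "pulse" occurs twice, whereas the count vector keeps it.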

    In computer vision, a bag of visual words is a vector of frequency counts of a vocabulary of local image features. It has been used mainly in image/video scenes classification and retrieval [17, 18]. In [17], a “bag of key points” method was proposed based on vector quantization of affine invariant descriptors of image patches. Two different classifiers, Naive Bayes and SVM, were applied for semantic visual categories classification. Similarly, in [18], a set of viewpoint invariant region descriptors were extracted to search and localize all the occurrences of a given query object in a video. In this approach, a visual vocabulary was built through vector quantizing the descriptors into clusters. Using the standard indexing method used in text retrieval, the term frequency-inverse document frequency (TF-IDF) was computed and the cosine similarity was used for retrieval.

    The BoW has also been used for the analysis of speech data. In [19], the high-frequency keywords (e.g. you know, um, right, etc.) were selected by computing the frequent, reflexive words and word pairs, and modeling them via word-based HMM models. Integrating this advantage of text-dependent modeling into the traditional GMM-based text-independent speaker recognition was shown to improve the performance. In [20], a bag-of-words (BoW)-style feature representation, which quantizes the observed direction of arrival (DOA) powers into discrete “word” samples, was developed to solve the speaker-clustering problem. In this approach, a time-varying probabilistic model was combined with the DOA information calculated from a microphone array to estimate the number and locations of the speakers.

       2.3 BoW Feature Representation with Naive Bayes Classifier

    Assume that we have a set of labeled speech segments X = {Xi}, C classes [S1, …, Sj, …, SC], and a vocabulary of representative words (i.e., codebook or cluster centers) V = {vt}. Let ft(Xi) denote the relative frequency of occurrence of word vt in segment Xi. To classify a new test sample, Xs, Bayes’ rule is applied and the maximum a posteriori score is used for prediction:

    $$ S^{*} = \arg\max_{j} P(S_j \mid X_s) = \arg\max_{j} \; P(S_j) \prod_{t=1}^{|V|} P(v_t \mid S_j)^{f_t(X_s)} \tag{1} $$

    In (1), P(Sj) is the a priori probability of class Sj, and the class-conditional probability P(vt|Sj) denotes the probability of word vt occurring in class Sj and can be estimated using:

    $$ P(v_t \mid S_j) = \frac{\sum_{X_i \in S_j} f_t(X_i)}{\sum_{s=1}^{|V|} \sum_{X_i \in S_j} f_s(X_i)} \tag{2} $$

    In order to avoid zero probability estimates in (2), Laplace smoothing is frequently used, and (2) can be replaced with:

    $$ P(v_t \mid S_j) = \frac{1 + \sum_{X_i \in S_j} f_t(X_i)}{|V| + \sum_{s=1}^{|V|} \sum_{X_i \in S_j} f_s(X_i)} \tag{3} $$
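    A minimal sketch of a Naive Bayes classifier over word-frequency histograms is given below. It assumes the common multinomial form with additive smoothing (alpha = 1 corresponds to Laplace smoothing) and works in the log domain for numerical stability. The function names and the toy data are hypothetical.

```python
import math

def train_nb(histograms, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class.

    histograms: list of length-K lists (word frequencies per segment)
    labels:     class label of each segment
    alpha:      additive smoothing constant (alpha=1 is Laplace smoothing)
    """
    K = len(histograms[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [h for h, y in zip(histograms, labels) if y == c]
        totals = [sum(col) for col in zip(*rows)]  # summed frequency of each word in class c
        denom = sum(totals) + alpha * K
        log_prior = math.log(len(rows) / len(histograms))
        log_word = [math.log((t + alpha) / denom) for t in totals]
        model[c] = (log_prior, log_word)
    return model

def predict_nb(model, histogram):
    """Return the class with the maximum a posteriori score."""
    def score(c):
        log_prior, log_word = model[c]
        return log_prior + sum(f * lw for f, lw in zip(histogram, log_word))
    return max(model, key=score)

# Hypothetical toy data: a 3-word codebook and two speakers.
train_h = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
train_y = ["doctor", "doctor", "nurse", "nurse"]
nb_model = train_nb(train_h, train_y)
print(predict_nb(nb_model, [0.6, 0.3, 0.1]))  # doctor
```

    Because of the smoothing term, a word that never occurs in a class still receives a small nonzero probability, so a single unseen word cannot force the class score to zero.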


    3. Soft BoW Audio Feature Representation

    In this paper, we propose a generalization of the BoW feature representation. In addition to the standard binary voting, where each sample contributes to each keyword with a binary value (1 if the keyword is the closest one to the sample and 0 otherwise), we propose a generalization that uses soft voting. We discuss the advantages and disadvantages of each voting scheme. We also show that the soft BoW representations with a Naive Bayes classifier outperform existing methods for speaker identification.

       3.1 Visual Vocabulary Construction

    Assume that each speaker i has a training set of Ni low-level features, that is, $X_i = \{x_j^i \mid j = 1, \ldots, N_i\}$, where $x_j^i \in \mathbb{R}^D$ is a D-dimensional feature vector extracted from the jth segment of the ith speaker.

    The first step consists of summarizing each Xi by a set of representative prototypes $P_i = \{p_k^i \mid k = 1, \ldots, K_i\}$. This quantization step is achieved by partitioning Xi into Ki clusters and letting $p_k^i$ be the centroid of the kth partition. Any clustering algorithm can be used for this task. In this paper, we report the results using the Fuzzy C-means (FCM) [21] algorithm. The FCM partitions the Ni samples into Ki clusters by minimizing the sum of within-cluster distances, i.e.,

    $$ J = \sum_{t=1}^{K_i} \sum_{j=1}^{N_i} \mu_{tj}^{m} \, d^2(x_j^i, p_t^i) \tag{4} $$

    In (4), d refers to the Euclidean distance between $x_j^i$ and $p_t^i$, m ∈ (1, ∞) is the fuzzifier, and U = [μtj] represents the membership of feature vector $x_j^i$ in cluster t [22] and satisfies the constraints:

    $$ \mu_{tj} \in [0, 1] \quad \text{and} \quad \sum_{t=1}^{K_i} \mu_{tj} = 1 \;\; \text{for all } j \tag{5} $$


    Each prototype, pk, is a representative of cluster ck that summarizes a group of similar speech segments. Let σk be the variance of all features xj assigned to cluster ck. After clustering, the Ki prototypes obtained by partitioning the data of each speaker i, Xi, are all combined to form a dictionary or codebook with $K = \sum_{i=1}^{N_{sp}} K_i$ words, where Nsp is the number of speakers.
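    The vocabulary construction step can be sketched as follows. This is a minimal textbook FCM written with the Python standard library only, assuming squared Euclidean distances, random initialization from the data, and a fixed number of iterations. The helper names (`fuzzy_c_means`, `build_codebook`) and the toy features are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=50, seed=0):
    """Return (centroids, memberships) with memberships[t][j] = mu_tj."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, n_clusters)]
    U = []
    for _ in range(n_iter):
        # Membership update: mu_tj is proportional to d^2(x_j, p_t)^(-1/(m-1)).
        U = [[0.0] * len(X) for _ in range(n_clusters)]
        for j, x in enumerate(X):
            w = [max(sq_dist(x, p), 1e-12) ** (-1.0 / (m - 1.0)) for p in centroids]
            total = sum(w)
            for t in range(n_clusters):
                U[t][j] = w[t] / total
        # Centroid update: weighted mean of the samples with weights mu_tj^m.
        for t in range(n_clusters):
            weights = [u ** m for u in U[t]]
            norm = sum(weights)
            centroids[t] = [
                sum(wj * x[d] for wj, x in zip(weights, X)) / norm
                for d in range(len(X[0]))
            ]
    return centroids, U

def build_codebook(features_by_speaker, k_per_speaker):
    """Cluster each speaker's features separately and concatenate the prototypes."""
    codebook = []
    for speaker in sorted(features_by_speaker):
        centroids, _ = fuzzy_c_means(features_by_speaker[speaker], k_per_speaker)
        codebook.extend(centroids)
    return codebook

# Hypothetical 2-D features for two speakers.
features = {
    "doctor": [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]],
    "nurse": [[5.0, 5.0], [5.1, 5.0], [6.0, 6.0], [6.1, 6.0]],
}
codebook = build_codebook(features, k_per_speaker=2)
print(len(codebook))  # 4 words: 2 prototypes per speaker
```

    Clustering each speaker separately guarantees that every speaker contributes the same number of words to the codebook, regardless of how much training data each speaker has.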

    Instead of using the original feature space X, the bag-of-words based histogram feature descriptor (BoW-HFD) approach maps it to a new space H characterized by the K clusters that capture the characteristics of the training data. Formally, this mapping is defined as

    $$ x_j \in X \;\mapsto\; [f_1(x_j), \ldots, f_K(x_j)] \in H \tag{6} $$


    In (6), fi(xj) ∈ [0, 1] is a measure of belongingness of feature xj to cluster i represented by prototype pi. This measure could be crisp, fuzzy, or possibilistic [22]. These different mappings are described in the following subsections.

       3.2 Crisp Mapping

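    In crisp mapping, each feature vector xj is assigned a binary membership value to each “word” i based on the distance between them. This mapping considers only the word (i.e., prototype) closest to xj and is defined as:

    $$ f_i(x_j) = \begin{cases} 1 & \text{if } d(x_j, p_i) = \min_{1 \le k \le K} d(x_j, p_k) \\ 0 & \text{otherwise} \end{cases} \tag{7} $$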


    This mapping is used in the standard BoW approach [17] and considers only the closest word. Thus, it is reasonable if xj is close to one word and far from the other words. However, if xj is close to multiple words (i.e., xj is located close to the clusters’ boundaries), then, crisp mapping will not preserve this information.
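    A minimal sketch of crisp voting is shown below. It assumes that a segment's histogram is obtained by averaging the per-frame votes, which is one natural reading of the relative-frequency descriptor; the prototypes and frames are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def crisp_votes(x, codebook):
    """Binary membership vector: 1 for the nearest word, 0 elsewhere."""
    distances = [sq_dist(x, p) for p in codebook]
    nearest = distances.index(min(distances))
    return [1.0 if i == nearest else 0.0 for i in range(len(codebook))]

def bow_histogram(frames, codebook, vote_fn):
    """Average the per-frame votes of a segment into a K-bin histogram."""
    hist = [0.0] * len(codebook)
    for x in frames:
        for i, v in enumerate(vote_fn(x, codebook)):
            hist[i] += v
    return [h / len(frames) for h in hist]

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # hypothetical prototypes
segment = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [0.45, 0.5]]
print(bow_histogram(segment, codebook, crisp_votes))  # [0.5, 0.5, 0.0]
```

    The last frame, (0.45, 0.5), lies almost halfway between the first two prototypes, yet its entire vote goes to the first one. This is the boundary information that crisp mapping discards.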

       3.3 Fuzzy Mapping

    Instead of using binary voting (as in eq. (7)), fuzzy mapping uses soft labels to allow for partial or gradual membership values. This type of labeling offers a richer representation of belongingness and can handle uncertain cases. In particular, a sample xj votes for each word i in the codebook with a membership degree such that:

    $$ f_i(x_j) \in [0, 1] \quad \text{and} \quad \sum_{i=1}^{K} f_i(x_j) = 1 \tag{8} $$


    Many clustering algorithms use this type of label to obtain a fuzzy partition. In the proposed fuzzy BoW (F-BoW) approach, we use the memberships derived within the Fuzzy C-Means (FCM) [21] algorithm, i.e.,

    $$ f_t(x_j) = \frac{(1 / D_{jt})^{1/(m-1)}}{\sum_{k=1}^{K} (1 / D_{jk})^{1/(m-1)}} \tag{9} $$


    where m ∈ (1, ∞) is a constant that controls the degree of fuzziness. In (9), Djt is the distance between feature vector xj and the “word” summarizing cluster t. To take into account the shape of the clusters, we use

    $$ D_{jt} = \sum_{k=1}^{M} \frac{(x_{jk} - p_{tk})^2}{\sigma_{tk}} \tag{10} $$

    where $\sigma_{tk}$ is the variance of feature k of cluster t and M is the dimensionality of the feature space.
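    Fuzzy voting can be sketched as follows, assuming the standard FCM membership formula applied to a variance-scaled squared distance. The prototypes, the unit variances, and the function names are hypothetical.

```python
def scaled_dist(x, prototype, variance):
    """Squared distance with each dimension divided by the cluster variance."""
    return sum((a - p) ** 2 / v for a, p, v in zip(x, prototype, variance))

def fuzzy_votes(x, codebook, variances, m=2.0):
    """FCM-style memberships of x in every word; the values sum to one."""
    eps = 1e-12  # guards against division by zero when x equals a prototype
    w = [
        (1.0 / max(scaled_dist(x, p, v), eps)) ** (1.0 / (m - 1.0))
        for p, v in zip(codebook, variances)
    ]
    total = sum(w)
    return [wi / total for wi in w]

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # hypothetical prototypes
variances = [[1.0, 1.0]] * 3                     # hypothetical unit variances
votes = fuzzy_votes([0.5, 0.5], codebook, variances)
print([round(v, 3) for v in votes])
```

    A frame that lies halfway between the first two prototypes now splits its vote almost evenly between them, while the distant third word receives a negligible share.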

       3.4 Possibilistic Mapping

    The fuzzy membership in (9) is a relative number that depends on the distance of xj to all prototypes. Thus, it does not distinguish between samples that are equally close to multiple prototypes and samples that are equally far from all prototypes.

    An alternative approach to generate soft labels is based on possibility theory [22]. Possibilistic labeling relaxes the constraint in (8) that the memberships across all words must sum to one. It assigns “typicality” values, $f_i(x_j) \in [0, 1]$, that do not consider the relative position of the point to all clusters. As a result, if xj is a noise point, then $\sum_{i=1}^{K} f_i(x_j) \ll 1$, and if xj is typical of more than one cluster, we can have $\sum_{i=1}^{K} f_i(x_j) > 1$. Many robust partitional clustering algorithms [23, 24] use this type of labeling in each iteration. In this paper, we use the membership function derived within the Possibilistic C-Means [22], i.e.,

    $$ f_t(x_j) = \frac{1}{1 + (D_{jt} / \eta_t)^{1/(m-1)}} \tag{11} $$

    In (11), $\eta_t$ is a cluster-dependent resolution/scale parameter [22] and m ∈ (1, ∞).

    Robust statistical estimators, such as M-estimators and W-estimators [25], use this type of memberships to reduce the effect of noise and outliers.
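    The contrast with fuzzy voting can be seen in a minimal sketch of possibilistic voting. It assumes the standard Possibilistic C-Means typicality function with one scale parameter per word; the prototypes, variances, and scale values are hypothetical.

```python
def possibilistic_votes(x, codebook, variances, etas, m=2.0):
    """Typicality of x with respect to each word; values need not sum to one."""
    votes = []
    for prototype, variance, eta in zip(codebook, variances, etas):
        d = sum((a - p) ** 2 / v for a, p, v in zip(x, prototype, variance))
        votes.append(1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0))))
    return votes

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # hypothetical prototypes
variances = [[1.0, 1.0]] * 3                     # hypothetical unit variances
etas = [1.0] * 3                                 # hypothetical scale parameters

between = possibilistic_votes([0.5, 0.5], codebook, variances, etas)
outlier = possibilistic_votes([20.0, 20.0], codebook, variances, etas)
print(round(sum(between), 3), round(sum(outlier), 4))
```

    The frame between the first two prototypes is typical of both, so its votes sum to more than one, whereas the outlier contributes almost nothing to any word. Fuzzy memberships would instead force the outlier's votes to sum to one.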

    4. Experimental Results and Discussion

       4.1 Data Collection

    Multiple data sets are used to validate and compare our proposed soft BoW-based audio feature representation with the Naive Bayes classifier for speaker identification. In particular, we use 15 medical simulation videos. We only use the audio information for speaker identification as it contains most of the conversation information. This is because the video resolution is low and carries no additional information (people are sitting with little movement and just talking). As shown in Table 1, each simulation has four speakers (patient, patient’s friend, doctor, and nurse). Videos are recorded in different rooms, and have different quality with different levels of background noise and frequent interruptions. The content of the conversations involves similar topics. For all experiments reported in this paper, we use k-fold cross validation with k = 5. That is, for each video, we keep 80% of the data for training and use the remaining 20% for testing. We repeat this process 5 times, testing on a different subset each time, and report the average of the 5 results.
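    The evaluation protocol can be sketched as follows, assuming that the segments of one video are shuffled once and split into five disjoint folds. The `evaluate` callback is a hypothetical placeholder for training a classifier on the training indices and returning its accuracy on the test indices.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=5):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, k=5))
print([(len(train), len(test)) for train, test in splits])  # five (8, 2) splits
```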

       4.2 Preprocessing

    First, the audio component is extracted from the video. All speech files are single-channel data sampled at 22.05 kHz. Then, since silence segments provide no information about the speakers and may actually reduce the correct speaker identification rate, each audio stream is processed to identify and remove silence segments. We use a trainable support vector machine (SVM) classifier [26] based on 3 low-level audio features (short-time energy, zero crossing rate, and spectral centroid) to discriminate between speech and nonspeech audio.
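    The three low-level features can be computed per frame as in the sketch below. It uses common textbook definitions and a naive discrete Fourier transform to stay within the standard library; the exact definitions used in [26] may differ, and the test signals are hypothetical.

```python
import cmath
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz), computed with a naive DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(frame))
        mags.append(abs(coeff))
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * mag for k, mag in enumerate(mags)) / total

sample_rate = 8000  # hypothetical rate, chosen to keep the example small
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(64)]
silence = [0.0] * 64

print(short_time_energy(tone), short_time_energy(silence))
print(round(spectral_centroid(tone, sample_rate)))
```

    A silent frame has near-zero energy, which is why energy is a natural first cue for removing silence; the other two features help separate speech from nonspeech sounds of similar loudness.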

    The remaining speech segments are decomposed into small frames using a 25 ms analysis window with 10 ms overlap. From each window, we extract MFCC [4], PLP [7], LPCC [8], and GFCC [11] features. For GFCC, instead of using tensor decomposition as proposed in [11], we simply average all Gabor filtered spectrum features along the scales and phases to reduce the computational complexity.
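    The framing step can be sketched as follows. The sketch takes the stated window and overlap literally, so that consecutive frames start 15 ms apart; if the intended configuration is a 10 ms frame shift instead, only the `overlap_ms` argument changes. The function name and the toy signal are hypothetical.

```python
def frame_signal(signal, sample_rate, window_ms=25, overlap_ms=10):
    """Split a signal into fixed-length frames that overlap by overlap_ms."""
    window = int(sample_rate * window_ms / 1000)
    overlap = int(sample_rate * overlap_ms / 1000)
    hop = window - overlap
    if window <= 0 or hop <= 0:
        raise ValueError("window must be positive and longer than the overlap")
    # Trailing samples that do not fill a whole window are dropped.
    return [signal[s:s + window] for s in range(0, len(signal) - window + 1, hop)]

signal = list(range(100))                        # hypothetical 100-sample signal
frames = frame_signal(signal, sample_rate=1000)  # 25-sample window, 15-sample hop
print(len(frames), len(frames[0]), frames[1][0])  # 6 25 15
```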

    For each extracted feature, we use the BIC algorithm [27] to identify change points within the audio stream and partition it into homogeneous segments. Each segment is then processed by our proposed algorithm to identify the speaker. Table 1 displays the number of speech segments identified by BIC for each video. The average length of all segments is short due to the frequent interruptions during the conversation. We should note here that each video is processed independently since it involves different speakers. The reported results are the average over the 15 datasets.

       4.3 Evaluation and Discussion

    First, the same low-level features used to segment the audio stream (MFCC, PLP, LPCC, and GFCC) are also used for speaker identification. Next, bag of words features (C-BoW, F-BoW, and P-BoW) are constructed for each feature as described in Section 3. The initial number of prototypes is set to 100 per speaker, i.e. Ki = 100, resulting in a codebook with K = 400 words.

    For each low-level feature, we evaluate the performance of the proposed mapping using 3 different classifiers: K-NN, Naive Bayes, and SVM [28]. K-NN has the advantage of incorporating various distance measures. Naive Bayes is a simple and efficient classifier that has proved to be effective in classification problems that use the bag-of-words feature representation [16, 17]. SVM is one of the most commonly used classifiers. For each classifier, we compare the performance of the 3 proposed feature mapping methods.

    For the K-NN classifier, we first experiment with several measures, as discussed in [29], to compute the dissimilarity between two histogram features (i.e., vectors mapped to histograms using the bag-of-words representation). In particular, we use chi-square statistics (CS), histogram intersection (HI), Jensen-Shannon divergence (JS), Kolmogorov-Smirnov distance (KS), Kullback-Leibler divergence (KL), match distance (MD), diffusion distance (DD), and cosine distance (CD). The speaker recognition accuracies, averaged over the 15 datasets, using the MFCC features with a K-NN classifier (K=7), are displayed in Table 2. As can be seen, the cosine distance has the best performance for the crisp, fuzzy, and possibilistic bag-of-words representations. Similar results are obtained for the PLP, LPCC, and GFCC features. Thus, for the remaining experiments, the cosine distance will be used within the K-NN classifier to compare it to other classifiers.
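    Three of these dissimilarity measures, together with a K-NN vote, can be sketched as below. The chi-square and histogram intersection forms are common variants that assume normalized, non-negative histograms; the exact definitions in [29] may differ. The toy histograms are hypothetical.

```python
import math
from collections import Counter

def cosine_distance(h1, h2):
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    return 1.0 - dot / (n1 * n2)

def chi_square(h1, h2):
    # Bins that are empty in both histograms are skipped to avoid dividing by zero.
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def histogram_intersection_distance(h1, h2):
    # Assumes both histograms sum to one, so the intersection lies in [0, 1].
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2))

def knn_predict(query, train_h, train_y, k=7, distance=cosine_distance):
    """Majority vote among the k training histograms closest to the query."""
    ranked = sorted(zip(train_h, train_y), key=lambda pair: distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy histograms over a 3-word codebook.
train_h = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
train_y = ["doctor", "doctor", "nurse", "nurse"]
print(knn_predict([0.6, 0.3, 0.1], train_h, train_y, k=3))  # doctor
```

    Passing a different function as `distance` is enough to repeat the comparison with another measure.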

    In Figure 2, we compare the speaker identification accuracy of the proposed soft BoW feature mappings using MFCC features with the K-NN, NB, and SVM classifiers. First, we notice that the NB classifier outperforms the K-NN and SVM classifiers for the crisp, fuzzy, and possibilistic cases. Second, on average, the soft (fuzzy and possibilistic) feature mappings outperform the crisp mapping. Similar results were obtained for the PLP, LPCC, and GFCC features.

    In a second experiment, we compare our methods to 3 existing speaker identification algorithms: GMM-UBM [14], GMM mean supervector [5] with K-NN classifier (SV-KNN) and SVM classifier (SV-SVM), and PHF [1] with KNN classifier (PHF-KNN). For the GMM-UBM-based speaker identification, the UBM is estimated using all training features, while GMM adaptation is applied to the UBM to get each training speaker model. Then, log-likelihood scores are used for the classification. For both the GMM-UBM and GMM mean supervector methods, we experiment with several values for the number of Gaussian components and set this parameter to 10. The results are reported in Figure 3. As can be seen, for all 4 features, the proposed soft feature mapping coupled with the NB classifier outperforms the state-of-the-art methods.

    5. Conclusions

    We proposed a soft feature mapping approach for speaker identification. Our approach uses the bag-of-words model to extract robust histogram descriptors from low-level spectral features. We formulated three kinds of feature mapping methods using crisp, fuzzy, and possibilistic membership functions.

    Using 15 datasets, we showed that Naive Bayes is the best classifier to use with our soft mapping. We also showed that the proposed approach outperforms commonly used methods.

    The proposed mappings provide more accurate speaker identification results. This allows physicians to analyze the simulation sessions more easily and to identify and retrieve speech segments for a given speaker more accurately.

    In our future work, we will focus on fusing multiple histograms that map different features (e.g., MFCC, PLP, LPCC, and GFCC) and on applying ensemble learning approaches to further improve the accuracy of speaker identification.

  • 1. Jiang S., Frigui H., Calhoun A. 2012 “Semantic indexing of video simulations for enhancing medical care during crises,” [11th International Conference on Machine Learning and Applications (ICMLA)] P.520-525
  • 2. Cheng S., Wang H., Fu H. 2010 “BIC-based speaker segmentation using divide-and-conquer strategies with application to speaker diarization,” [IEEE Trans. on Audio, Speech, and Language Processing] Vol.18 P.141-157
  • [Figure 1.] Overview of the proposed speaker segmentation and identification system
  • [Table 1.] Data collections used to validate the proposed speaker identification approach
  • [Table 2.] Classification rate of the K-NN classifier using the proposed soft bag of words representation of MFCC features and various distance measures
  • [Figure 2.] Performance of the crisp, fuzzy, and possibilistic BoW using MFCC features with the KNN, SVM, and NB classifiers
  • [Figure 3.] Comparison of the classification accuracy of the proposed soft BoW feature mappings using the NB classifier with GMM-UBM, GMM mean supervector with K-NN (SV-KNN) and SVM (SV-SVM), and PHF with KNN.