Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning
 DOI: 10.5626/JCSE.2013.7.2.99
 Author: Sugiyama Masashi, Liu Song, du Plessis Marthinus Christoffel, Yamanaka Masao, Yamada Makoto, Suzuki Taiji, Kanamori Takafumi
 Published: Journal of Computing Science and Engineering, Volume 7, Issue 2, pp. 99-111, 30 June 2013

ABSTRACT
Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the L²-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.

KEYWORD
Machine learning, Probability distributions, Kullback-Leibler divergence, Pearson divergence, L²-distance

I. INTRODUCTION
Let us consider the problem of approximating a divergence D between two probability distributions P and P' on ℝ^d from two sets of independent and identically distributed (i.i.d.) samples following P and P'. A divergence approximator can be used for various purposes, such as two-sample testing [1,2], change detection in time-series [3], class-prior estimation under class-balance change [4], salient object detection in images [5], and event detection from movies [6] and Twitter [7]. Furthermore, an approximator of the divergence between the joint distribution and the product of marginal distributions can be used for solving a wide range of machine learning problems [8], including independence testing [9], feature selection [10,11], feature extraction [12,13], canonical dependency analysis [14], object matching [15], independent component analysis [16], clustering [17,18], and causal direction learning [19]. For this reason, accurately approximating a divergence between two probability distributions from their samples has been one of the challenging research topics in the statistics, information theory, and machine learning communities.
A naive way to approximate the divergence from P to P', denoted by D(P‖P'), is to first obtain estimators P̂ and P̂' of the distributions P and P' separately from their samples X and X', and then compute a plug-in approximator D(P̂‖P̂'). However, this naive two-step approach violates Vapnik's principle [20]: If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.
More specifically, if we know the distributions P and P', we can immediately compute their divergence D(P‖P'). However, knowing the divergence D(P‖P') does not necessarily imply knowing the distributions P and P', because different pairs of distributions can yield the same divergence value. Thus, estimating the distributions P and P' is more general than estimating the divergence D(P‖P'). Following Vapnik's principle, direct divergence approximators that do not involve the estimation of the distributions P and P' have been developed recently [21-25].

The purpose of this article is to give an overview of the development of such direct divergence approximators. In Section II, we review the definitions of the Kullback-Leibler divergence, the Pearson divergence, the relative Pearson divergence, and the L²-distance, and discuss their pros and cons. Then, in Section III, we review direct approximators of these divergences that do not involve the estimation of probability distributions. In Section IV, we show practical uses of divergence approximators in unsupervised change-detection in time-series, semi-supervised class-prior estimation under class-balance change, salient object detection in an image, and evaluation of statistical independence between random variables. Finally, we conclude in Section V.

II. DIVERGENCE MEASURES
A function d(·, ·) is called a distance if and only if the following four conditions are satisfied:

Non-negativity: ∀x, y, d(x, y) ≥ 0
Non-degeneracy: d(x, y) = 0 ⇔ x = y
Symmetry: ∀x, y, d(x, y) = d(y, x)
Triangle inequality: ∀x, y, z, d(x, z) ≤ d(x, y) + d(y, z)

A divergence is a pseudo-distance: it still acts like a distance, but it may violate some of the above conditions. In this section, we introduce useful divergence and distance measures between probability distributions.
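For instance, the squared Euclidean distance is non-negative, non-degenerate, and symmetric, but the triangle inequality fails for it, so it is a divergence rather than a distance. A quick check (an illustrative snippet, not from the paper):

```python
# Squared Euclidean distance on the real line: non-negative, non-degenerate,
# and symmetric, but the triangle inequality fails, so it is a
# divergence (pseudo-distance) rather than a distance.
def d(x, y):
    return (x - y) ** 2

assert d(0, 1) >= 0                    # non-negativity
assert d(2, 2) == 0 and d(2, 3) != 0   # non-degeneracy
assert d(0, 1) == d(1, 0)              # symmetry
print(d(0, 2), d(0, 1) + d(1, 2))      # 4 2 -> d(0,2) > d(0,1) + d(1,2)
```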
> A. Kullback-Leibler (KL) Divergence

The most popular divergence measure in statistics and machine learning is the KL divergence [26], defined as

\mathrm{KL}(p \| p') := \int p(x) \log \frac{p(x)}{p'(x)} \, dx,

where p(x) and p'(x) are the probability density functions of P and P', respectively.

Advantages of the KL divergence are that it is compatible with maximum likelihood estimation, that it is invariant under input metric change, that its Riemannian geometric structure is well studied [27], and that it can be approximated accurately via direct density-ratio estimation [21,22,28]. However, it is not symmetric, it does not satisfy the triangle inequality, its approximation is computationally expensive due to the log function, and it is sensitive to outliers and numerically unstable, because of the strong nonlinearity of the log function and the possible unboundedness of the density-ratio function p/p' [24,29].

> B. Pearson (PE) Divergence
The PE divergence [30] is a squared-loss variant of the KL divergence, defined as

\mathrm{PE}(p \| p') := \frac{1}{2} \int p'(x) \left( \frac{p(x)}{p'(x)} - 1 \right)^2 dx.   (1)

Because both the PE and KL divergences belong to the class of Ali-Silvey-Csiszár divergences (also known as f-divergences) [31,32], they share similar theoretical properties, such as invariance under input metric change.

The PE divergence can also be accurately approximated via direct density-ratio estimation, in the same way as the KL divergence [23,28]. However, its approximator can be obtained analytically, in a computationally much more efficient manner than for the KL divergence, because the quadratic function the PE divergence adopts is compatible with least-squares estimation. Furthermore, the PE divergence tends to be more robust against outliers than the KL divergence [33]. However, other weaknesses of the KL divergence, such as asymmetry, violation of the triangle inequality, and possible unboundedness of the density-ratio function p/p', remain unsolved in the PE divergence.

> C. Relative Pearson (rPE) Divergence
To overcome the possible unboundedness of the density-ratio function p/p', the rPE divergence was recently introduced [24]. The rPE divergence is defined as

\mathrm{rPE}(p \| p') := \mathrm{PE}(p \| q_\alpha) = \frac{1}{2} \int q_\alpha(x) \left( \frac{p(x)}{q_\alpha(x)} - 1 \right)^2 dx,   (2)

where, for 0 ≤ α < 1, q_α is defined as the α-mixture of p and p':

q_\alpha(x) := \alpha p(x) + (1 - \alpha) p'(x).

When α = 0, the rPE divergence is reduced to the plain PE divergence. The quantity p/q_α is called the relative density-ratio, which is always upper-bounded by 1/α for α > 0, because

\frac{p(x)}{q_\alpha(x)} = \frac{p(x)}{\alpha p(x) + (1 - \alpha) p'(x)} \le \frac{1}{\alpha}.

Thus, it can overcome the unboundedness problem of the PE divergence, while the invariance under input metric change is still maintained.

The rPE divergence is still compatible with least-squares estimation, and it can be approximated in almost the same way as the PE divergence via direct relative density-ratio estimation [24]. Indeed, an rPE-divergence approximator can still be obtained analytically in an accurate and computationally efficient manner. However, it still violates symmetry and the triangle inequality, in the same way as the KL and PE divergences. Furthermore, the choice of α may not be straightforward in some applications.

> D. L²-Distance
The L²-distance is another standard distance measure between probability distributions, defined as

L^2(p, p') := \int \left( p(x) - p'(x) \right)^2 dx.

The L²-distance is a proper distance measure, and thus it is symmetric and satisfies the triangle inequality. Furthermore, the density difference p(x) − p'(x) is always bounded as long as each density is bounded. Therefore, the L²-distance is stable, without the need to tune any control parameter such as α in the rPE divergence.

The L²-distance is also compatible with least-squares estimation, and it can be accurately and analytically approximated in a computationally efficient and numerically stable manner via direct density-difference estimation [25]. However, the L²-distance is not invariant under input metric change, invariance being a property inherent to the ratio-based divergences.

III. DIRECT DIVERGENCE APPROXIMATION
In this section, we review recent advances in direct divergence approximation.
Suppose that we are given two sets of independent and identically distributed samples,

X := \{x_i\}_{i=1}^{n} \sim p \quad \text{and} \quad X' := \{x'_j\}_{j=1}^{n'} \sim p',

from probability distributions on ℝ^d with densities p(x) and p'(x), respectively. Our goal is to approximate a divergence from p to p' from the samples X and X'.

> A. KL Divergence Approximation
The key idea of direct KL-divergence approximation is to estimate the density ratio p/p' without estimating the densities p and p' [21]. More specifically, a density-ratio estimator is obtained by minimizing the KL divergence from p to r·p' with respect to a density-ratio model r, under the constraints that the density-ratio function is non-negative and that r·p' integrates to one:

\min_{r} \; \int p(x) \log \frac{p(x)}{r(x) \, p'(x)} \, dx \quad \text{subject to} \quad r(x) \ge 0, \quad \int r(x) \, p'(x) \, dx = 1.

Its empirical optimization problem, where an irrelevant constant is ignored and the expectations are approximated by sample averages, is given by

\max_{r} \; \frac{1}{n} \sum_{i=1}^{n} \log r(x_i) \quad \text{subject to} \quad r(x) \ge 0, \quad \frac{1}{n'} \sum_{j=1}^{n'} r(x'_j) = 1.
Let us consider the following Gaussian density-ratio model:

r(x) = \sum_{l=1}^{n} \theta_l \exp\left( -\frac{\|x - x_l\|^2}{2\sigma^2} \right),   (3)

where ‖·‖ denotes the ℓ₂-norm, and the parameter vector is defined as θ := (θ₁, ..., θ_n)ᵀ, where ᵀ denotes the transpose. In this model, the Gaussian kernels are located on the numerator samples {x_i}_{i=1}^n, because the density ratio p/p' tends to take large values in the regions where the numerator samples exist. Alternatively, Gaussian kernels may be located on both the numerator and denominator samples, but this does not seem to further improve the accuracy [21]. When n is very large, a (random) subset of the numerator samples may be chosen as Gaussian centers, which can reduce the computational cost.
For the Gaussian density-ratio model (3), the above optimization problem is expressed as

\max_{\theta \ge 0} \; \frac{1}{n} \sum_{i=1}^{n} \log\left( \sum_{l=1}^{n} \theta_l \exp\left( -\frac{\|x_i - x_l\|^2}{2\sigma^2} \right) \right) \quad \text{subject to} \quad \frac{1}{n'} \sum_{j=1}^{n'} \sum_{l=1}^{n} \theta_l \exp\left( -\frac{\|x'_j - x_l\|^2}{2\sigma^2} \right) = 1.

This is a convex optimization problem, and thus the global optimal solution can be obtained easily, e.g., by gradient-projection iterations. Furthermore, the global optimal solution tends to be sparse (i.e., many parameter values become exactly zero), which can be utilized for reducing the computational cost.

The Gaussian width σ is a tuning parameter in this algorithm, and it can be systematically optimized by cross-validation with respect to the objective function. More specifically, the numerator samples X := {x_i}_{i=1}^n are divided into T disjoint subsets {X_t}_{t=1}^T of (approximately) the same size. Then, a density-ratio estimator r̂_t is obtained using X\X_t and X' (i.e., all numerator samples without X_t, and all denominator samples), and its objective value for the hold-out numerator samples X_t is computed:

\frac{1}{|X_t|} \sum_{x \in X_t} \log \hat{r}_t(x),

where |X_t| denotes the number of elements in the set X_t. This procedure is repeated for t = 1, ..., T, and the σ value that maximizes the average of the above hold-out objective values is chosen as the best one.

Given a density-ratio estimator r̂, a KL-divergence approximator can be constructed as

\widehat{\mathrm{KL}} := \frac{1}{n} \sum_{i=1}^{n} \log \hat{r}(x_i).
A MATLAB® (MathWorks, Natick, MA, USA) implementation of the above KL-divergence approximator (called the KL importance estimation procedure; KLIEP) is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP/.
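The constrained estimation procedure above can be sketched compactly in pure Python for one-dimensional data. This is a minimal illustration of the KLIEP idea, not the reference implementation: the fixed Gaussian width, learning rate, iteration count, and function names are all illustrative choices, and cross-validation for σ is omitted.

```python
import math
import random

def kliep(x_nu, x_de, sigma=1.0, iters=200, lr=0.1):
    """Gradient-projection KLIEP sketch for 1-D samples.

    Maximizes (1/n) sum_i log r(x_i) over the Gaussian density-ratio model
    r(x) = sum_l theta_l exp(-(x - x_l)^2 / (2 sigma^2)), subject to
    theta >= 0 and (1/n') sum_j r(x'_j) = 1."""
    n, nd = len(x_nu), len(x_de)
    k = lambda a, b: math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
    K_nu = [[k(xi, c) for c in x_nu] for xi in x_nu]   # kernels at numerator points
    K_de = [[k(xj, c) for c in x_nu] for xj in x_de]   # kernels at denominator points
    b = [sum(K_de[j][l] for j in range(nd)) / nd for l in range(n)]

    def project(th):
        # Projection step: clip negative weights, then rescale so that the
        # denominator-average constraint sum_l b_l theta_l = 1 holds.
        th = [max(t, 0.0) for t in th]
        s = sum(bl * tl for bl, tl in zip(b, th))
        if s <= 0.0:
            th, s = [1.0] * n, sum(b)
        return [t / s for t in th]

    theta = project([1.0] * n)
    for _ in range(iters):
        r = [sum(K_nu[i][l] * theta[l] for l in range(n)) for i in range(n)]
        grad = [sum(K_nu[i][l] / r[i] for i in range(n)) / n for l in range(n)]
        theta = project([t + lr * g for t, g in zip(theta, grad)])

    r = [sum(K_nu[i][l] * theta[l] for l in range(n)) for i in range(n)]
    return sum(math.log(v) for v in r) / n   # KL-divergence estimate

random.seed(0)
x_p = [random.gauss(0.0, 1.0) for _ in range(50)]   # samples from p
x_q = [random.gauss(0.5, 1.0) for _ in range(60)]   # samples from p'
kl_hat = kliep(x_p, x_q)
print(kl_hat)
```

Because the objective is concave over the feasible set, simple projected gradient ascent of this kind converges to the global optimum, as noted above.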
Variations of this procedure for other density-ratio models have been developed, including the log-linear model [34], the Gaussian mixture model [35], and the mixture of probabilistic principal component analyzers [36]. Also, an unconstrained variant, which corresponds to approximately maximizing the Legendre-Fenchel lower bound of the KL divergence [37], was proposed [22]:

\max_{r} \; \frac{1}{n} \sum_{i=1}^{n} \log r(x_i) - \frac{1}{n'} \sum_{j=1}^{n'} r(x'_j) + 1.

> B. PE Divergence Approximation
The PE divergence can also be directly approximated without estimating the densities p and p', via direct estimation of the density ratio p/p' [23]. More specifically, a density-ratio estimator is obtained by minimizing the p'-weighted squared difference between a density-ratio model r and the true density-ratio function p/p':

\min_{r} \; \frac{1}{2} \int p'(x) \left( r(x) - \frac{p(x)}{p'(x)} \right)^2 dx.

Its empirical criterion, where an irrelevant constant is ignored and the expectations are approximated by sample averages, is given by

\min_{r} \; \frac{1}{2n'} \sum_{j=1}^{n'} r(x'_j)^2 - \frac{1}{n} \sum_{i=1}^{n} r(x_i).
For the Gaussian density-ratio model (3), together with the ℓ₂-regularizer (λ/2)θᵀθ, the above optimization problem is expressed as

\min_{\theta} \; \frac{1}{2} \theta^\top \hat{H} \theta - \hat{h}^\top \theta + \frac{\lambda}{2} \theta^\top \theta,   (4)

where λ ≥ 0 denotes the regularization parameter, Ĥ is the n × n matrix with the (l, l')-th element defined by

\hat{H}_{l,l'} := \frac{1}{n'} \sum_{j=1}^{n'} \exp\left( -\frac{\|x'_j - x_l\|^2}{2\sigma^2} \right) \exp\left( -\frac{\|x'_j - x_{l'}\|^2}{2\sigma^2} \right),

and ĥ is the n-dimensional vector with the l-th element defined by

\hat{h}_l := \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\frac{\|x_i - x_l\|^2}{2\sigma^2} \right).

This is a convex optimization problem, and the global optimal solution can be computed analytically as

\hat{\theta} = (\hat{H} + \lambda I)^{-1} \hat{h},

where I denotes the identity matrix.

The Gaussian width σ and the regularization parameter λ are the tuning parameters in this algorithm, and they can be systematically optimized by cross-validation with respect to the objective function, as follows. First, the numerator and denominator samples X = {x_i}_{i=1}^n and X' = {x'_j}_{j=1}^{n'} are divided into T disjoint subsets {X_t}_{t=1}^T and {X'_t}_{t=1}^T, respectively. Then, a density-ratio estimator r̂_t is obtained using X\X_t and X'\X'_t (i.e., all samples without X_t and X'_t), and its objective value for the hold-out samples X_t and X'_t is computed:

\frac{1}{2|X'_t|} \sum_{x' \in X'_t} \hat{r}_t(x')^2 - \frac{1}{|X_t|} \sum_{x \in X_t} \hat{r}_t(x).   (5)

This procedure is repeated for t = 1, ..., T, and the σ and λ values that minimize the average of the above hold-out objective values are chosen as the best ones.

By expanding the squared term in Equation (1), the PE divergence can be expressed as

\mathrm{PE}(p \| p') = \frac{1}{2} \int p(x) \, r(x) \, dx - \frac{1}{2}   (6)

 = -\frac{1}{2} \int p'(x) \, r(x)^2 \, dx + \int p(x) \, r(x) \, dx - \frac{1}{2},   (7)

where r = p/p' denotes the true density-ratio function. Note that Equation (7) can also be obtained via Legendre-Fenchel convex duality of the divergence functional [38]. Based on these expressions, PE-divergence approximators are obtained, using a density-ratio estimator r̂, as

\widehat{\mathrm{PE}} := \frac{1}{2n} \sum_{i=1}^{n} \hat{r}(x_i) - \frac{1}{2},   (8)

\widehat{\mathrm{PE}} := -\frac{1}{2n'} \sum_{j=1}^{n'} \hat{r}(x'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \hat{r}(x_i) - \frac{1}{2}.   (9)

Equation (8) is suitable for algorithmic development because it is the simplest expression, while Equation (9) is suitable for theoretical analysis because it corresponds to the negative of the objective function in Equation (4).
A MATLAB implementation of the above method (called unconstrained least-squares importance fitting; uLSIF) is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/uLSIF/.
If the ℓ₂-regularizer (λ/2)θᵀθ in Equation (4) is replaced with the ℓ₁-regularizer λ‖θ‖₁, the solution tends to be sparse [39]. Then the solution can be obtained in a computationally more efficient way [40], and furthermore, a regularization-path tracking algorithm [41] is available for efficiently computing solutions with different regularization parameter values.
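The analytic solution θ̂ = (Ĥ + λI)⁻¹ĥ and the approximator of Equation (8) can be sketched as follows for one-dimensional data. This is an illustrative sketch, not the uLSIF reference implementation: the Gaussian width and regularization parameter are fixed rather than cross-validated, and gauss_solve is a small helper written out only to keep the snippet self-contained.

```python
import math
import random

def gauss_solve(A, b):
    """Tiny Gaussian-elimination solver with partial pivoting for A x = b."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ulsif_pe(x_nu, x_de, sigma=1.0, lam=0.1):
    """uLSIF sketch for 1-D samples: theta = (H + lam I)^{-1} h, then the
    PE approximator of Equation (8): (1/(2n)) sum_i r(x_i) - 1/2."""
    n, nd = len(x_nu), len(x_de)
    k = lambda a, b: math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
    K_nu = [[k(xi, c) for c in x_nu] for xi in x_nu]   # n  x n kernel values
    K_de = [[k(xj, c) for c in x_nu] for xj in x_de]   # n' x n kernel values
    H = [[sum(row[l] * row[m] for row in K_de) / nd for m in range(n)]
         for l in range(n)]
    h = [sum(K_nu[i][l] for i in range(n)) / n for l in range(n)]
    A = [[H[l][m] + (lam if l == m else 0.0) for m in range(n)] for l in range(n)]
    theta = gauss_solve(A, h)
    r_nu = [sum(theta[l] * K_nu[i][l] for l in range(n)) for i in range(n)]
    return 0.5 * sum(r_nu) / n - 0.5

random.seed(1)
x_p = [random.gauss(0.0, 1.0) for _ in range(60)]
x_q = [random.gauss(1.0, 1.0) for _ in range(60)]
pe_diff = ulsif_pe(x_p, x_q)   # distributions differ: larger PE estimate
pe_same = ulsif_pe(x_p, x_p)   # identical samples: estimate close to zero
print(pe_diff, pe_same)
```

Note how the whole fit reduces to a single linear solve; this is the computational advantage over the log-based KL approximator discussed in Section II.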
> C. rPE Divergence Approximation
The rPE divergence can be directly estimated in the same way as the PE divergence [24]: a relative density-ratio estimator is obtained by minimizing the q_α-weighted squared difference between a density-ratio model r and the true relative density-ratio function p/q_α:

\min_{r} \; \frac{1}{2} \int q_\alpha(x) \left( r(x) - \frac{p(x)}{q_\alpha(x)} \right)^2 dx.

Its empirical criterion, where an irrelevant constant is ignored and the expectations are approximated by sample averages, is given by

\min_{r} \; \frac{\alpha}{2n} \sum_{i=1}^{n} r(x_i)^2 + \frac{1 - \alpha}{2n'} \sum_{j=1}^{n'} r(x'_j)^2 - \frac{1}{n} \sum_{i=1}^{n} r(x_i).
For the Gaussian density-ratio model (3), together with the ℓ₂-regularizer (λ/2)θᵀθ, the above optimization problem is expressed as

\min_{\theta} \; \frac{1}{2} \theta^\top \hat{H}_\alpha \theta - \hat{h}^\top \theta + \frac{\lambda}{2} \theta^\top \theta,

where ĥ is the same vector as in Equation (4), and Ĥ_α is the n × n matrix with the (l, l')-th element defined by

(\hat{H}_\alpha)_{l,l'} := \frac{\alpha}{n} \sum_{i=1}^{n} \exp\left( -\frac{\|x_i - x_l\|^2}{2\sigma^2} \right) \exp\left( -\frac{\|x_i - x_{l'}\|^2}{2\sigma^2} \right) + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} \exp\left( -\frac{\|x'_j - x_l\|^2}{2\sigma^2} \right) \exp\left( -\frac{\|x'_j - x_{l'}\|^2}{2\sigma^2} \right).

This is a convex optimization problem, and the global optimal solution can be computed analytically as

\hat{\theta} = (\hat{H}_\alpha + \lambda I)^{-1} \hat{h}.
Cross-validation for tuning the Gaussian width σ and the regularization parameter λ can be carried out in the same way as in the PE-divergence case, with Equation (5) replaced by

\frac{\alpha}{2|X_t|} \sum_{x \in X_t} \hat{r}_t(x)^2 + \frac{1 - \alpha}{2|X'_t|} \sum_{x' \in X'_t} \hat{r}_t(x')^2 - \frac{1}{|X_t|} \sum_{x \in X_t} \hat{r}_t(x).

By expanding the squared term in Equation (2), the rPE divergence can be expressed as

\mathrm{rPE}(p \| p') = \frac{1}{2} \int p(x) \, r_\alpha(x) \, dx - \frac{1}{2}

 = -\frac{\alpha}{2} \int p(x) \, r_\alpha(x)^2 \, dx - \frac{1 - \alpha}{2} \int p'(x) \, r_\alpha(x)^2 \, dx + \int p(x) \, r_\alpha(x) \, dx - \frac{1}{2},

where r_α = p/q_α denotes the true relative density-ratio. Based on these expressions, rPE-divergence approximators are given, using a relative density-ratio estimator r̂_α, as

\widehat{\mathrm{rPE}} := \frac{1}{2n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i) - \frac{1}{2},

\widehat{\mathrm{rPE}} := -\frac{\alpha}{2n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i)^2 - \frac{1 - \alpha}{2n'} \sum_{j=1}^{n'} \hat{r}_\alpha(x'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i) - \frac{1}{2}.
A MATLAB implementation of this algorithm (called relative uLSIF; RuLSIF) is available from http://sugiyama-www.cs.titech.ac.jp/~yamada/RuLSIF.html.
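The relative density-ratio fit differs from uLSIF only in the α-mixed matrix Ĥ_α, as the following one-dimensional sketch shows (illustrative hyperparameters, no cross-validation; gauss_solve is a small helper included only for self-containedness):

```python
import math
import random

def gauss_solve(A, b):
    # Tiny Gaussian-elimination solver with partial pivoting for A x = b.
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def rulsif_rpe(x_nu, x_de, alpha=0.1, sigma=1.0, lam=0.1):
    """RuLSIF sketch for 1-D samples: theta = (H_alpha + lam I)^{-1} h, then
    the simple rPE approximator (1/(2n)) sum_i r(x_i) - 1/2. The true
    relative ratio p/q_alpha is bounded by 1/alpha, which is what keeps
    this estimate numerically stable."""
    n, nd = len(x_nu), len(x_de)
    k = lambda a, b: math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
    K_nu = [[k(xi, c) for c in x_nu] for xi in x_nu]
    K_de = [[k(xj, c) for c in x_nu] for xj in x_de]
    H = [[alpha * sum(row[l] * row[m] for row in K_nu) / n
          + (1 - alpha) * sum(row[l] * row[m] for row in K_de) / nd
          + (lam if l == m else 0.0)
          for m in range(n)] for l in range(n)]
    h = [sum(K_nu[i][l] for i in range(n)) / n for l in range(n)]
    theta = gauss_solve(H, h)
    r_nu = [sum(theta[l] * K_nu[i][l] for l in range(n)) for i in range(n)]
    return 0.5 * sum(r_nu) / n - 0.5

random.seed(2)
x_p = [random.gauss(0.0, 1.0) for _ in range(50)]
x_q = [random.gauss(2.0, 1.0) for _ in range(50)]
rpe_diff = rulsif_rpe(x_p, x_q)   # distributions differ: larger estimate
rpe_same = rulsif_rpe(x_p, x_p)   # identical samples: estimate near zero
print(rpe_diff, rpe_same)
```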
> D. L²-Distance Approximation

The key idea is to directly estimate the density difference p − p', without estimating each density [25]. More specifically, a density-difference estimator is obtained by minimizing the squared difference between a density-difference model f and the true density-difference function p − p':

\min_{f} \; \int \left( f(x) - \left( p(x) - p'(x) \right) \right)^2 dx.

Its empirical criterion, where an irrelevant constant is ignored and the expectations are approximated by sample averages, is given by

\min_{f} \; \int f(x)^2 \, dx - 2 \left( \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \frac{1}{n'} \sum_{j=1}^{n'} f(x'_j) \right).
Let us consider the following Gaussian density-difference model:

f(x) = \sum_{l=1}^{n+n'} \theta_l \exp\left( -\frac{\|x - c_l\|^2}{2\sigma^2} \right),   (14)

where (c_1, ..., c_{n+n'}) := (x_1, ..., x_n, x'_1, ..., x'_{n'}) are Gaussian centers. Then the above optimization problem is expressed as

\min_{\theta} \; \theta^\top U \theta - 2 \hat{v}^\top \theta + \lambda \theta^\top \theta,

where the ℓ₂-regularizer λθᵀθ is included, U is the (n + n') × (n + n') matrix with the (l, l')-th element defined by

U_{l,l'} := \int \exp\left( -\frac{\|x - c_l\|^2}{2\sigma^2} \right) \exp\left( -\frac{\|x - c_{l'}\|^2}{2\sigma^2} \right) dx = (\pi \sigma^2)^{d/2} \exp\left( -\frac{\|c_l - c_{l'}\|^2}{4\sigma^2} \right),

d denotes the dimensionality of x, and v̂ is the (n + n')-dimensional vector with the l-th element defined by

\hat{v}_l := \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\frac{\|x_i - c_l\|^2}{2\sigma^2} \right) - \frac{1}{n'} \sum_{j=1}^{n'} \exp\left( -\frac{\|x'_j - c_l\|^2}{2\sigma^2} \right).

This is a convex optimization problem, and the global optimal solution can be computed analytically as

\hat{\theta} = (U + \lambda I)^{-1} \hat{v}.

The above optimization problem has essentially the same form as least-squares density-ratio approximation for the PE divergence, and therefore least-squares density-difference approximation can enjoy all the computational properties of least-squares density-ratio approximation.
Cross-validation for tuning the Gaussian width σ and the regularization parameter λ can be carried out as follows. First, the samples X = {x_i}_{i=1}^n and X' = {x'_j}_{j=1}^{n'} are divided into T disjoint subsets {X_t}_{t=1}^T and {X'_t}_{t=1}^T, respectively. Then, a density-difference estimator f̂_t(x) is obtained using X\X_t and X'\X'_t (i.e., all samples without X_t and X'_t), and its objective value for the hold-out samples X_t and X'_t is computed:

\int \hat{f}_t(x)^2 \, dx - 2 \left( \frac{1}{|X_t|} \sum_{x \in X_t} \hat{f}_t(x) - \frac{1}{|X'_t|} \sum_{x' \in X'_t} \hat{f}_t(x') \right).

Note that the first term can be computed analytically for the Gaussian density-difference model (14):

\int \hat{f}_t(x)^2 \, dx = \hat{\theta}_t^\top U \hat{\theta}_t,

where θ̂_t is the parameter vector learned from X\X_t and X'\X'_t.

For the equivalent expression of the L²-distance

L^2(p, p') = \int f(x) \left( p(x) - p'(x) \right) dx \quad \text{with} \quad f = p - p',

if f is replaced with a density-difference estimator f̂ and the expectations are approximated by empirical averages, the following L²-distance approximator can be obtained:

\widehat{L^2} := \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_i) - \frac{1}{n'} \sum_{j=1}^{n'} \hat{f}(x'_j) = \hat{v}^\top \hat{\theta}.   (15)

Similarly, for another expression

L^2(p, p') = \int f(x)^2 \, dx \quad \text{with} \quad f = p - p',

replacing f with a density-difference estimator f̂ gives another L²-distance approximator:

\widehat{L^2} := \int \hat{f}(x)^2 \, dx = \hat{\theta}^\top U \hat{\theta}.   (16)

Equations (15) and (16) themselves give valid approximations to L²(p, p'), but their linear combination

\widehat{L^2} := 2 \hat{v}^\top \hat{\theta} - \hat{\theta}^\top U \hat{\theta}

was shown to have a smaller bias than Equations (15) and (16).
A MATLAB implementation of the above algorithm (called least-squares density-difference; LSDD) is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDD/.
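The whole LSDD procedure, from the analytic U matrix through the reduced-bias combination, fits in a short one-dimensional sketch (illustrative hyperparameters, no cross-validation; gauss_solve is a small helper included only for self-containedness):

```python
import math
import random

def gauss_solve(A, b):
    # Tiny Gaussian-elimination solver with partial pivoting for A x = b.
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def lsdd(x1, x2, sigma=1.0, lam=0.1):
    """LSDD sketch for 1-D samples: Gaussian centers on all samples, analytic
    U_{l,l'} = sqrt(pi sigma^2) * exp(-(c_l - c_l')^2 / (4 sigma^2)) (d = 1),
    theta = (U + lam I)^{-1} v, and the reduced-bias combination
    2 v^T theta - theta^T U theta as the L2-distance estimate."""
    c = x1 + x2
    nc = len(c)
    k = lambda a, b: math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
    U = [[math.sqrt(math.pi * sigma ** 2)
          * math.exp(-(c[a] - c[b]) ** 2 / (4 * sigma ** 2))
          for b in range(nc)] for a in range(nc)]
    v = [sum(k(x, cl) for x in x1) / len(x1)
         - sum(k(x, cl) for x in x2) / len(x2) for cl in c]
    A = [[U[a][b] + (lam if a == b else 0.0) for b in range(nc)] for a in range(nc)]
    theta = gauss_solve(A, v)
    vt = sum(vi * ti for vi, ti in zip(v, theta))
    utt = sum(theta[a] * sum(U[a][b] * theta[b] for b in range(nc)) for a in range(nc))
    return 2.0 * vt - utt

random.seed(3)
xa = [random.gauss(0.0, 1.0) for _ in range(40)]
xb = [random.gauss(3.0, 1.0) for _ in range(40)]
l2_diff = lsdd(xa, xb)   # well-separated distributions: positive estimate
l2_same = lsdd(xa, xa)   # identical samples: v = 0, so the estimate is exactly 0
print(l2_diff, l2_same)
```

Because no density ratio appears anywhere, nothing in this computation can blow up even when p/p' is unbounded, which is the numerical-stability argument made in Section II.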
IV. USAGE OF DIVERGENCE APPROXIMATORS IN MACHINE LEARNING
In this section, we show applications of divergence approximators in machine learning.
> A. Class-Prior Estimation under Class-Balance Change
In real-world pattern recognition tasks, changes in class balance are often observed between the training and test phases. In such cases, naive classifier training produces significant estimation bias, because the class balance in the training dataset does not properly reflect that of the test dataset. Here, let us consider a binary pattern recognition task of classifying a pattern x into a class y ∈ {+1, −1}. The goal is to learn the class balance of a test dataset in a semi-supervised learning setup, where unlabeled test samples are provided in addition to labeled training samples [42].

The class balance in the test set can be estimated by matching a π-mixture of class-wise training input densities,

q_\pi(x) := \pi \, p_{\mathrm{train}}(x \mid y = +1) + (1 - \pi) \, p_{\mathrm{train}}(x \mid y = -1),

to the test input density p_test(x) under some divergence measure [4]. Here, π ∈ [0, 1] is a mixing coefficient that is learned so as to minimize the divergence (Fig. 1).

We use four UCI benchmark datasets (http://archive.ics.uci.edu/ml/) for numerical experiments, where we randomly choose 10 labeled training samples from each class and 50 unlabeled test samples, following the true class-prior π* = 0.1, 0.2, ..., 0.9.
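The π-matching idea can be sketched as follows. For brevity, this sketch matches kernel density estimates of the mixture and the test input density under the L²-distance on a numerical grid (i.e., a naive plug-in variant; the paper advocates direct LSDD-based matching instead), and all names, bandwidths, and grid choices are illustrative:

```python
import math
import random

def kde(data, x, sigma=0.5):
    # Gaussian kernel density estimate at point x (bandwidth is illustrative).
    z = len(data) * sigma * math.sqrt(2 * math.pi)
    return sum(math.exp(-(x - d) ** 2 / (2 * sigma ** 2)) for d in data) / z

def estimate_prior(x_pos, x_neg, x_test):
    """Choose pi so that pi * p(x|y=+1) + (1 - pi) * p(x|y=-1) matches the
    test input density in L2 distance (grid integration, grid search)."""
    xs = [i / 10 for i in range(-60, 101)]          # evaluation grid, step 0.1
    p_pos = [kde(x_pos, x) for x in xs]
    p_neg = [kde(x_neg, x) for x in xs]
    p_tst = [kde(x_test, x) for x in xs]

    def l2(pi):
        return sum((pi * a + (1 - pi) * b - t) ** 2
                   for a, b, t in zip(p_pos, p_neg, p_tst)) * 0.1

    return min((i / 20 for i in range(21)), key=l2)  # grid search over pi

random.seed(4)
x_pos = [random.gauss(0.0, 1.0) for _ in range(100)]    # labeled class +1 inputs
x_neg = [random.gauss(4.0, 1.0) for _ in range(100)]    # labeled class -1 inputs
true_pi = 0.3
x_test = [random.gauss(0.0, 1.0) if random.random() < true_pi
          else random.gauss(4.0, 1.0) for _ in range(300)]
pi_hat = estimate_prior(x_pos, x_neg, x_test)
print(pi_hat)
```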
The graphs on the left-hand side of Fig. 2 plot the mean and standard error of the squared difference between the true and estimated class balances π. These graphs show that LSDD tends to provide better class-balance estimates than the two-step procedure of first estimating the probability densities by kernel density estimation (KDE) and then learning π.
The graphs on the right-hand side of Fig. 2 plot the test misclassification error obtained by a weighted ℓ₂-regularized kernel least-squares classifier [43] with weighted cross-validation [44]. The results show that the LSDD-based method provides lower classification errors, which would be brought about by good estimates of the test class balances.

> B. Change-Detection in Time-Series
The goal is to discover abrupt property changes behind time-series data. Let y(t) ∈ ℝ^m be an m-dimensional time-series sample at time t, and let

Y(t) := \left[ y(t)^\top, y(t+1)^\top, \ldots, y(t+k-1)^\top \right]^\top \in \mathbb{R}^{km}

be a subsequence of the time series at time t with length k. Instead of a single point y(t), the subsequence Y(t) is treated as a sample here, because time-dependent information can be naturally incorporated by this trick [3]. Let

\mathcal{Y}(t) := \{ Y(t), Y(t+1), \ldots, Y(t+r-1) \}

be a set of r retrospective subsequence samples starting at time t. Then, a divergence between the probability distributions of 𝒴(t) and 𝒴(t + r) may be used as the plausibility of a change point (Fig. 3).

In Fig. 4, we illustrate the results of unsupervised change detection on the Information Processing Society of Japan Special Interest Group of Spoken Language Processing (IPSJ SIG-SLP) Corpora and Environments for Noisy Speech Recognition (CENSREC) dataset (http://research.nii.ac.jp/src/en/CENSREC-1-C.html), which records human voice in noisy environments such as a restaurant, and the Human Activity Sensing Consortium (HASC) challenge 2011 dataset (http://hasc.jp/hc2011/), which provides human activity information collected by portable three-axis accelerometers. These graphs show that the KL-based method is excessively sensitive to noise, and thus change points are not clearly detected. On the other hand, the L²-based method indicates the existence of change points more clearly. It was also demonstrated that divergence-based change-detection methods are useful in event detection from movies [6] and Twitter [7].
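The sliding-window scheme above can be sketched as follows. To keep the example short, each subsequence is summarized by its mean, and the divergence between retrospective windows is a closed-form plug-in L²-distance between Gaussian kernel density estimates, a crude stand-in for the direct estimators of Section III; all parameters here are illustrative:

```python
import math
import random

def l2_kde(w1, w2, sigma=0.5):
    """Closed-form plug-in L2 distance between Gaussian KDEs of two windows.
    For Gaussian mixtures the cross term integrates analytically:
    int N(x; u, s^2) N(x; v, s^2) dx = exp(-(u - v)^2 / (4 s^2)) / sqrt(4 pi s^2)."""
    def cross(a, b):
        z = 1.0 / (len(a) * len(b) * math.sqrt(4 * math.pi * sigma ** 2))
        return z * sum(math.exp(-(u - v) ** 2 / (4 * sigma ** 2))
                       for u in a for v in b)
    return cross(w1, w1) - 2.0 * cross(w1, w2) + cross(w2, w2)

def change_scores(series, k=5, r=20):
    """Sliding-window change score: compare the r subsequences before and
    after each time point. Each length-k subsequence Y(t) is summarized by
    its mean so that the 1-D estimator above applies (an illustrative
    simplification of the full subsequence embedding)."""
    Y = [sum(series[t:t + k]) / k for t in range(len(series) - k + 1)]
    return [l2_kde(Y[t:t + r], Y[t + r:t + 2 * r]) for t in range(len(Y) - 2 * r)]

random.seed(5)
series = ([random.gauss(0.0, 0.5) for _ in range(100)] +
          [random.gauss(3.0, 0.5) for _ in range(100)])   # mean shift at t = 100
scores = change_scores(series)
change_at = scores.index(max(scores)) + 20   # add r to recover the boundary time
print(change_at)
```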
> C. Salient Object Detection in an Image
The goal is to find salient objects in an image. This can be achieved by computing a divergence between the probability distributions of image features (such as brightness, edges, and colors) in a center window and its surroundings [5]. This divergence computation is swept over the entire image with changing scales (Fig. 5).

The object detection results on the MSRA salient object database [45], obtained by the rPE divergence with α = 0.1, are shown in Fig. 6, where pixels in the grayscale saliency maps take brighter color if the estimated divergence value is large. The results show that visually salient objects can be successfully detected by the divergence-based approach.
> D. Measuring Statistical Independence
The goal is to measure how strongly two random variables U and V are statistically dependent, from paired samples {(u_i, v_i)}_{i=1}^n drawn independently from the joint distribution with density p_{U,V}(u, v). Let us consider a divergence between the joint density p_{U,V} and the product of the marginal densities p_U · p_V. This actually serves as a measure of statistical independence, because U and V are independent if and only if the divergence is zero (i.e., p_{U,V} = p_U · p_V), and the dependence between U and V is stronger if the divergence is larger.

Such a dependence measure can be approximated in the same way as ordinary divergences, by using the two datasets formed as

X := \{(u_i, v_i)\}_{i=1}^{n} \quad \text{and} \quad X' := \{(u_i, v_j)\}_{i,j=1}^{n}.
The dependence measure based on the KL divergence is called mutual information (MI) [46]:

\mathrm{MI}(U, V) := \iint p_{U,V}(u, v) \log \frac{p_{U,V}(u, v)}{p_U(u) \, p_V(v)} \, du \, dv.

MI plays a central role in information theory [47].
On the other hand, its PE-divergence variant is called the squared-loss mutual information (SMI):

\mathrm{SMI}(U, V) := \frac{1}{2} \iint p_U(u) \, p_V(v) \left( \frac{p_{U,V}(u, v)}{p_U(u) \, p_V(v)} - 1 \right)^2 du \, dv.

SMI is useful for solving various machine learning tasks [8], including independence testing [9], feature selection [10,11], feature extraction [12,13], canonical dependency analysis [14], object matching [15], independent component analysis [16], clustering [17,18], and causal direction estimation [19].
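For discrete variables with a finite alphabet, SMI can be computed exactly from empirical frequencies, which makes the definition concrete (the paper's estimators instead operate directly on continuous samples; this snippet is only an illustration of the quantity itself):

```python
from collections import Counter
from itertools import product

def smi_discrete(pairs):
    """Exact SMI for discrete variables from empirical frequencies:
    SMI = (1/2) sum_{u,v} pU(u) pV(v) (pUV(u,v) / (pU(u) pV(v)) - 1)^2."""
    n = len(pairs)
    p_uv = Counter(pairs)
    p_u = Counter(u for u, _ in pairs)
    p_v = Counter(v for _, v in pairs)
    smi = 0.0
    for u, v in product(p_u, p_v):
        pu, pv = p_u[u] / n, p_v[v] / n
        puv = p_uv.get((u, v), 0) / n
        smi += 0.5 * pu * pv * (puv / (pu * pv) - 1.0) ** 2
    return smi

# V is a copy of U (maximal dependence) versus a balanced independent design.
dep = [(u, u) for u in (0, 1) for _ in range(50)]
ind = [(u, v) for u in (0, 1) for v in (0, 1) for _ in range(25)]
print(smi_discrete(dep), smi_discrete(ind))   # 0.5 0.0
```

As the definition requires, SMI vanishes exactly when the empirical joint factorizes into the product of the marginals.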
An L²-distance variant of the dependence measure is called the quadratic mutual information (QMI) [48]:

\mathrm{QMI}(U, V) := \iint \left( p_{U,V}(u, v) - p_U(u) \, p_V(v) \right)^2 du \, dv.

QMI is also a useful dependence measure in practice [49].
V. CONCLUSIONS
In this article, we reviewed recent advances in direct divergence approximation. Direct divergence approximators theoretically achieve optimal convergence rates in both parametric and non-parametric cases, and experimentally compare favorably with their naive density-estimation counterparts [21-25].

However, direct divergence approximators still suffer from the curse of dimensionality. A possible cure for this problem is to combine them with dimensionality reduction, based on the hope that the two probability distributions share some commonality [51,52]. Further investigating this line would be a promising future direction.

[Fig. 1.] Schematic of class-prior estimation under class-balance change.

[Fig. 2.] (Left) Squared error of class-prior estimation. (Right) Misclassification error by a weighted ℓ₂-regularized kernel least-squares classifier with weighted cross-validation. (a) Australian, (b) diabetes, (c) German, and (d) statlog-heart datasets. LSDD: least-squares density difference, KDE: kernel density estimation.

[Fig. 3.] Schematic of change-point detection in time-series.

[Fig. 4.] Results of change-point detection. Original time-series data are plotted in the top graphs, and change scores obtained by KLIEP (Kullback-Leibler importance estimation procedure; Section III-A) and LSDD (least-squares density difference; Section III-D) are plotted in the bottom graphs. (a) CENSREC (Corpora and Environments for Noisy Speech Recognition) dataset, (b) HASC (Human Activity Sensing Consortium) dataset.

[Fig. 5.] Schematic of salient object detection in an image.

[Fig. 6.] Results of salient object detection in an image: (upper) original images and (lower) obtained saliency maps; brighter color means more salient.