Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning

  • cc icon
  • ABSTRACT

    Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive twostep approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the L2-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.


  • KEYWORD

    Machine learning , Probability distributions , Kullback-Leibler divergence , Pearson divergence , L2- distance

  • I. INTRODUCTION

    Let us consider the problem of approximating a divergence D between two probability distributions P and P' on ?d from two sets of independent and identically distributed samples

    image

    following P and P'.

    A divergence approximator can be used for various purposes, such as two-sample testing [1,2], change detection in time-series [3], class-prior estimation under classbalance change [4], salient object detection in images [5], and event detection from movies [6] and Twitter [7]. Furthermore, an approximator of the divergence between the joint distribution and the product of marginal distributions can be used for solving a wide range of machine learning problems [8], including independence testing [9], feature selection [10,11], feature extraction [12,13], canonical dependency analysis [14], object matching [15], independent component analysis [16], clustering [17,18], and causal direction learning [19]. For this reason, accurately approximating a divergence between two probability distributions from their samples has been one of the challenging research topics in the statistics, information theory, and machine learning communities.

    A naive way to approximate the divergence from P to P', denoted by D(P||P'), is to first obtain estimators

    image

    and

    image

    of the distributions P and P' separately from their samples X and X', and then compute a plug-in approximator

    image

    However, this naive two-step approach violates Vapnik’s principle [20]:

    If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.

    More specifically, if we know the distributions P and P', we can immediately know their divergence D(P||P'). However, knowing the divergence D(P||P') does not necessarily imply knowing the distributions P and P', because different pairs of distributions can yield the same divergence value. Thus, estimating the distributions P and P' is more general than estimating the divergence D(P||P'). Following Vapnik’s principle, direct divergence approximators

    image

    that do not involve the estimation of distributions P and P' have been developed recently [21-25].

    The purpose of this article is to give an overview of the development of such direct divergence approximators. In Section II, we review the definitions of the Kullback- Leibler divergence, the Pearson divergence, the relative Pearson divergence, and the L2-distance, and discuss their pros and cons. Then, in Section III, we review direct approximators of these divergences that do not involve the estimation of probability distributions. In Section IV, we show practical usage of divergence approximators in unsupervised change-detection in time-series, semi-supervised class-prior estimation under class-balance change, salient object detection in an image, and evaluation of statistical independence between random variables. Finally, we conclude in Section V.

    II. DIVERGENCE MEASURES

    A function d(·, ·) is called a distance if and only if the following four conditions are satisfied:

    Non-negativity: ∀x, y, d(x, y) ≥ 0

    Non-degeneracy: d(x, y) = 0 ⇔ x = y

    Symmetry: ∀x, y, d(x, y) = d(y, x)

    Triangle inequality: ∀x, y, z d(x, z) ≤ d(x, y) + d(y, z)

    A divergence is a pseudo-distance that still acts like a distance, but it may violate some of the above conditions. In this section, we introduce useful divergence and distance measures between probability distributions.

      >  A. Kullback-Leibler (KL) Divergence

    The most popular divergence measure in statistics and machine learning is the KL divergence [26], defined as

    image

    where p(x) and p'(x) are probability density functions of P and P', respectively.

    Advantages of the KL divergence are that it is compatible with maximum likelihood estimation, that it is invariant under input metric change, that its Riemannian geometric structure is well studied [27], and that it can be approximated accurately via direct density-ratio estimation [21,22,28]. However, it is not symmetric, it does not satisfy the triangle inequality, its approximation is computationally expensive due to the log function, and it is sensitive to outliers and numerically unstable, because of the strong non-linearity of the log function, and possible unboundedness of the density-ratio function p/p' [24,29].

      >  B. Pearson (PE) Divergence

    The PE divergence [30] is a squared-loss variant of the KL divergence defined as

    image

    Because both the PE and KL divergences belong to the class of Ali-Silvey-Csiszar divergences (which is also known as f-divergences) [31,32], they share similar theoretical properties such as the invariance under input metric change.

    The PE divergence can also be accurately approxi-mated via direct density-ratio estimation, in the same way as the KL divergence [23,28]. However, its approximator can be obtained analytically in a computationally much more efficient manner than the KL divergence, because the quadratic function the PE divergence adopts is compatible with least-squares estimation. Furthermore, the PE divergence tends to be more robust against outliers than the KL divergence [33]. However, other weaknesses of the KL divergence, such as asymmetry, violation of the triangle inequality, and possible unboundedness of the density-ratio function p/p', remain unsolved in the PE divergence.

      >  C. Relative Pearson (rPE) Divergence

    To overcome the possible unboundedness of the density- ratio function p/p', the rPE divergence was recently introduced [24]. The rPE divergence is defined as

    image

    where, for 0 ≤ α < 1, qα is defined as the α-mixture of p and p':

    image

    When α = 0, the rPE divergence is reduced to the plain PE divergence. The quantity p/qα is called the relative density-ratio, which is always upper-bounded by 1/α for α > 0, because

    image

    Thus, it can overcome the unboundedness problem of the PE divergence, while the invariance under input metric change is still maintained.

    The rPE divergence is still compatible with least-squares estimation, and it can be approximated in almost the same way as the PE divergence via direct relative density- ratio estimation [24]. Indeed, an rPE-divergence approximator can still be obtained analytically in an accurate and computationally efficient manner. However, it still violates symmetry and the triangle inequality, in the same way as the KL and PE divergence. Furthermore, the choice of α may not be straightforward, in some applications.

      >  D. L2-Distance

    The L2-distance is another standard distance measure between probability distributions, defined as

    image

    The L2-distance is a proper distance measure, and thus it is symmetric and satisfies the triangle inequality. Furthermore, the density difference p(x)p'(x) is always bounded as long as each density is bounded. Therefore, the L2-distance is stable without the need of tuning any control parameter such as α in the rPE divergence.

    The L2-distance is also compatible with least-squares estimation, and it can be accurately and analytically approximated in a computationally efficient and numerically stable manner via direct density-difference estimation [25]. However, the L2-distance is not invariant under input metric change, which is a unique property inherent to ratio-based divergences.

    III. DIRECT DIVERGENCE APPROXIMATION

    In this section, we review recent advances in direct divergence approximation.

    Suppose that we are given two sets of independent and identically distributed samples

    image

    from probability distributions on ?d, with densities p(x) and p'(x), respectively:

    image

    Our goal is to approximate a divergence between from p to p' from samples X and X'.

      >  A. KL Divergence Approximation

    The key idea of direct KL divergence approximation is to estimate the density ratio p/p' without estimating the densities p and p' [21]. More specifically, a density-ratio estimator is obtained by minimizing the KL divergence from p to r·p' with respect to a density-ratio model r under the constraints that the density-ratio function is non-negative and r·p' is integrated to one:

    image

    Its empirical optimization problem, where an irrelevant constant is ignored and the expectations are approximated by the sample averages, is given by

    image

    Let us consider the following Gaussian density-ratio model:

    image

    where ||·|| denotes the l2-norm. We define the vector of parameters

    image

    as

    image

    where, T denotes the transpose. In this model, the Gaussian kernels are located on numerator samples

    image

    , because the density ratio p/p' tends to take large values in the regions where the numerator samples

    image

    exist. Alternatively, Gaussian kernels may be located on both numerator and denominator samples, but this seems not to further improve the accuracy [21]. When n is very large, a (random) subset of numerator samples

    image

    may be chosen as Gaussian centers, which can reduce the computational cost.

    For the Gaussian density-ratio model (3), the above optimization problem is expressed as

    image

    This is a convex optimization problem, and thus the global optimal solution can be obtained easily, e.g., by gradient-projection iterations. Furthermore, the global optimal solution tends to be sparse (i.e., many parameter values become exactly zero), which can be utilized for reducing the computational cost.

    The Gaussian width σ is a tuning parameter in this algorithm, and it can be systematically optimized by cross-validation with respect to the objective function. More specifically, the numerator samples X :=

    image

    are divided into T disjoint subsets

    image

    of (approximately) the same size. Then, a density-ratio estimator

    image

    is obtained using X\Xt and

    image

    (i.e., all numerator samples without Xt and all denominator samples), and its objective value for the hold-out numerator samples Xt is computed:

    image

    where |Xt| denotes the number of elements in the set Xt. This procedure is repeated for t = 1,...,T, and the σ value that maximizes the average of the above hold-out objective values is chosen as the best one.

    Given a density-ratio estimator

    image

    , a KL-divergence approximator

    image

    can be constructed as

    image

    A MATLABR® (MathWorks, Natick, MA, USA) implementation of the above KL divergence approximator (called the KL importance estimation procedure; KLIEP) is available from

    “http://sugiyama-www.cs.titech.ac.jp/ ~sugi/software/KLIEP/”.

    Variations of this procedure for other density-ratio models have been developed including the log-linear model [34], the Gaussian mixture model [35], and the mixture of probabilistic principal component analyzers [36]. Also, an unconstrained variant, which corresponds to approximately maximizing the Legendre-Fenchel lower bound of the KL divergence [37], was proposed [22]:

    image

      >  B. PE Divergence Approximation

    The PE divergence can also be directly approximated without estimating the densities p and p' via direct estimation of the density ratio p/p' [23]. More specifically, a density-ratio estimator is obtained by minimizing the p'- weighted squared difference between a density-ratio model r and the true density-ratio function p/p':

    image

    Its empirical criterion, where an irrelevant constant is ignored, and the expectations are approximated by the sample averages, is given by

    image

    For the Gaussian density-ratio model (3) together with the L2-regularizer, the above optimization problem is expressed as

    image

    where λ ≥ 0 denotes the regularization parameter, and

    image

    is the n×n matrix with the (l,l')-th element defined by

    image

    and

    image

    is the n-dimensional vector with the l-th element defined by

    image

    This is a convex optimization problem, and the global optimal solution can be computed analytically as

    image

    where I denotes the identity matrix.

    The Gaussian width σ and the regularization parameter λ are the tuning parameters in this algorithm, and they can be systematically optimized by cross-validation, with respect to the objective function, as follows: First, the numerator and denominator samples X =

    image

    and

    image

    are divided into T disjoint subsets

    image

    and

    image

    respectively. Then, a density-ratio estimator

    image

    is obtained using X\Xt and X'\X't (i.e., all samples without Xt and X't ), and its objective value for the holdout samples Xt and X't is computed:

    image

    This procedure is repeated for t = 1,...,T, and the σ and λ values that maximize the average of the above hold-out objective values are chosen as the best ones.

    By expanding the squared term

    image

    in Equation (1), the PE divergence can be expressed as

    image
    image

    Note that Equation (7) can also be obtained via Legendre- Fenchel convex duality of the divergence functional [38]. Based on these expressions, PE divergence approximators are obtained using a density-ratio estimator

    image

    as

    image
    image

    Equation (8) is suitable for algorithmic development because this would be the simplest expression, while Equation (9) is suitable for theoretical analysis because this corresponds to the negative of the objective function in Equation (4).

    A MATLAB implementation of the above method (called unconstrained least-squares importance fitting; uLSIF) is available from

    “http://sugiyama-www.cs.titech.ac.jp/ ~sugi/software/uLSIF/”.

    If the L2-regularizer

    image

    in Equation (4) is replaced with the L1-regularizer

    image

    the solution tends to be sparse [39]. Then the solution can be obtained in a computationally more efficient way [40], and furthermore, a regularization path tracking algorithm [41] is available for efficiently computing solutions with different regularization parameter values.

      >  C. rPE Divergence Approximation

    The rPE divergence can be directly estimated in the same way as the PE divergence [24]:

    image

    Its empirical criterion, where an irrelevant constant is ignored and the expectations are approximated by sample averages, is given by

    image

    For the Gaussian density-ratio model (3), together with the L2-regularizer, the above optimization problem is expressed as

    image

    where

    image

    is the n×n matrix, with the (l, l')-th element defined by

    image

    This is a convex optimization problem, and the global optimal solution can be computed analytically as

    image

    Cross-validation for tuning the Gaussian width σ and the regularization parameter λ can be carried out in the same way as the PE-divergence case, with Equation (5) replaced by

    image

    By expanding the squared term

    image

    in Equation (2), the rPE divergence can be expressed as

    image
    image

    Based on these expressions, rPE divergence approximators are given, using the relative density-ratio estimator

    image

    as

    image
    image

    A MATLAB implementation of this algorithm (called relative uLSIF; RuLSIF) is available from

    “http://sugiyama-www.cs.titech.ac.jp/ ~yamada/RuLSIF.html”.

      >  D. L2-Distance Approximation

    The key idea is to directly estimate the density difference p ? p' without estimating each density [25]. More specifically, a density-difference estimator is obtained by minimizing the squared difference between a density-difference model f and the true density-difference function pp':

    image

    Its empirical criterion, where an irrelevant constant is ignored and the expectation is approximated by the sample average, is given by

    image

    Let us consider the following Gaussian density-difference model:

    image

    where

    image

    are Gaussian centers. Then the above optimization problem is expressed as

    image

    where the L2-regularizer

    image

    is included, U is the (n + n') × (n + n') matrix, with the (l, l')-th element defined by

    image

    d denotes the dimensionality of x, and

    image

    is the (n + n')- dimensional vector, with the l-th element defined by

    image

    This is a convex optimization problem, and the global optimal solution can be computed analytically as

    image

    The above optimization problem is essentially the same form as least-squares density-ratio approximation for the PE divergence, and therefore least-squares density-difference approximation can enjoy all the computational properties of least-squares density-ratio approximation.

    Cross-validation for tuning the Gaussian width σ and the regularization parameter λ can be carried out as follows: First, the numerator and denominator samples X =

    image

    and

    image

    are divided into T disjoint subsets

    image

    and

    image

    respectively. Then a density-difference estimator ft(x) is obtained using X\Xt and X'\X't (i.e., all samples without Xt and X't ), and its objective value for the hold-out samples Xt and X'tis computed:

    image

    Note that the first term can be computed analytically for the Gaussian density-difference model (14):

    image

    where

    image

    is the parameter vector learned from X\Xt and X'\X't.

    For an equivalent expression of the L2-distance,

    image

    if f is replaced with a density-difference estimator

    image

    , and the expectations are approximated by empirical averages, the following L2-distance approximator can be obtained:

    image

    Similarly, for another expression

    image

    replacing f with a density-difference estimator

    image

    gives another L2-distance approximator:

    image

    Equations (15) and (16) themselves give valid approximations to L2(p,p'), but their linear combination

    image

    was shown to have a smaller bias than Equations (15) and (16).

    A MATLAB implementation of the above algorithm (called least-squares density difference; LSDD) is available from

    “http://sugiyama-www.cs.titech.ac.jp/ ~sugi/software/LSDD/”.

    IV. USAGE OF DIVERGENCE APPROXIMATORS IN MACHINE LEARNING

    In this section, we show applications of divergence approximators in machine learning.

      >  A. Class-Prior Estimation under Class-Balance Change

    In real-world pattern recognition tasks, changes in class balance are often observed between the training and test phases. In such cases, naive classifier training produces significant estimation bias, because the class balance in the training dataset does not properly reflect that in the test dataset. Here, let us consider a binary pattern recognition task of classifying pattern x to class y ∈ {+1, -1}. The goal is to learn the class balance in a test dataset under a semi-supervised learning setup, where unlabeled test samples are provided, in addition to labeled training samples [42].

    The class balance in the test set can be estimated by matching a π-mixture of class-wise training input densities,

    image

    to the test input density ptest(x) under some divergence measure [4]. Here, π ∈ [0, 1] is a mixing coefficient to be learned, to minimize the divergence (Fig. 1).

    We use four UCI benchmark datasets (http://archive. ics.uci.edu/ml/) for numerical experiments, where we randomly choose 10 labeled training samples from each class, and 50 unlabeled test samples, following true classprior

    π* = 0.1, 0.2, ... , 0.9.

    The graphs on the left-hand side of Fig. 2 plot the mean and standard error of the squared difference between true and estimated class-balances π. These graphs show that LSDD tends to provide better class-balance estimates than the two-step procedure of first estimating probability densities by kernel density estimation (KDE) and then learning π.

    The graphs on the right-hand side of Fig. 2 plot the test misclassification error obtained by a weighted L2-regularized kernel least-squares classifier [43], with weighted cross-validation [44]. The results show that the LSDDbased method provides lower classification errors, which would be brought by good estimates of test class-balances.

      >  B. Change-Detection in Time-Series

    The goal is to discover abrupt property changes behind time-series data. Let y(t) ∈ ?m be an m-dimensional timeseries sample at time t, and let

    image

    be a subsequence of time series at time t with length k. Instead of a single point y(t), the subsequence Y(t) is treated as a sample here, because time-dependent information can be naturally incorporated by this trick [3]. Let

    image

    be a set of r retrospective subsequence samples starting at time t. Then a divergence between the probability distributions of Y(t) and Y(t + r) may be used as the plausibility of change points (Fig. 3).

    In Fig. 4, we illustrate results of unsupervised change detection for the Information Processing Society of Japan Special Interest Group of Spoken Language Processing (IPSJ SIG-SLP) Corpora and Environments for Noisy Speech Recognition (CENSREC) dataset (http://research. nii.ac.jp/src/en/CENSREC-1-C.html) that records human voice in noisy environments, such as a restaurant, and the Human Activity Sensing Consortium (HASC) challenge 2011 dataset (http://hasc.jp/hc2011/), that provides human activity information collected by portable three-axis accelerometers. These graphs show that the KL-based method is excessively sensitive to noise, and thus change points are not clearly detected. On the other hand, the L2- based method more clearly indicates the existence of change points.

    It was also demonstrated that divergence-based changedetection methods are useful in event detection from movies [6], and Twitter [7].

      >  C. Salient Object Detection in an Image

    The goal is to find salient objects in an image. This can be achieved by computing a divergence between the prob-

    ability distributions of image features (such as brightness, edges, and colors) in the center window and its surroundings [5]. This divergence computation is swept over the entire image, with changing scales (Fig. 5).

    The object detection results on the MSRA salient object database [45] by the rPE divergence with α = 0.1 are described in Fig. 6, where pixels in gray-scale saliency maps take brighter color if the estimated divergence value

    is large. The results show that visually salient objects can be successfully detected by the divergence-based

    approach.

      >  D. Measuring Statistical Independence

    The goal is to measure how strongly two random variables U and V are statistically dependent, from paired samples

    image

    drawn independently from the joint distribution, with density pU,V(u,v). Let us consider a divergence between the joint density pU,V, and the product of marginal densities pU · pV. This actually serves as a measure of statistical independence, because U and V are independent if and only if the divergence is zero (i.e., pU,V = pU · pV), and the dependence between U and V is stronger, if the divergence is larger.

    Such a dependence measure can be approximated in the same way as ordinary divergences by using the two datasets formed as

    image

    The dependence measure based on the KL divergence is called mutual information (MI) [46]:

    image

    MI plays a central role in information theory [47].

    On the other hand, its PE-divergence variant is called the squared-loss mutual information (SMI):

    image

    SMI is useful for solving various machine learning tasks [8], including independence testing [9], feature selection [10,11], feature extraction [12,13], canonical dependency analysis [14], object matching [15], independent component analysis [16], clustering [17,18], and causal direction estimation [19].

    An L2-distance variant of the dependence measure is called quadratic mutual information (QMI) [48]:

    image

    QMI is also a useful dependence measure in practice [49].

    V. CONCLUSIONS

    In this article, we reviewed recent advances in direct divergence approximation. Direct divergence approximators theoretically achieve optimal convergence rates, both in parametric and non-parametric cases, and experimentally compare favorably with the naive density-estimation counterparts [21-25].

    However, direct divergence approximators still suffer from the curse of dimensionality. A possible cure for this problem is to combine them with dimensionality reduction, based on the hope that two probability distributions share some commonality [51,52]. Further investigating this line would be a promising future direction.

  • 1. Sugiyama M., Suzuki T., Itoh Y., Kanamori T., Kimura M. 2011 “Least-squares two-sample test” [Neural Networks] Vol.24 P.735-751 google doi
  • 2. Kanamori T., Suzuki T., Sugiyama M. 2012 “f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models” [IEEE Transactions on Information Theory] Vol.58 P.708-720 google doi
  • 3. Kawahara Y., Sugiyama M. 2012 “Sequential change-point detection based on direct density-ratio estimation” [Statistical Analysis and Data Mining] Vol.5 P.114-127 google doi
  • 4. du Plessis M. C., Sugiyama M. 2012 “Semi-supervised learning of class balance under class-prior change by distribution matching” [in Proceedings of the 29th International Conference on Machine Learning] P.823-830 google
  • 5. Yamanaka M., Matsugu M., Sugiyama M. 2013 “Salient object detection based on direct density ratio estimation” google
  • 6. Yamanaka M., Matsugu M., Sugiyama M. 2013 “Detection of activities and events without explicit categorization” google
  • 7. Liu S., Yamada M., Collier N., Sugiyama M. 2013 “Changepoint detection in time-series data by relative density-ratio estimation” [Neural Networks] Vol.43 P.72-83 google doi
  • 8. Sugiyama M. 2013 “Machine learning with squared-loss mutual information” [Entropy] Vol.15 P.80-112 google doi
  • 9. Sugiyama M., Suzuki T. 2011 “Least-squares independence test” [IEICE Transactions on Information and Systems] Vol.94 P.1333-1336 google doi
  • 10. Suzuki T., Sugiyama M., Kanamori T., Sese J. 2009 “Mutual information estimation reveals global associations between stimuli and biological processes” [BMC Bioinformatics] Vol.10 P.S52 google doi
  • 11. Jitkrittum W., Hachiya H., Sugiyama M. 2013 “Feature selection via L1-penalized squared-loss mutual information” google
  • 12. Suzuki T., Sugiyama M. 2013 “Sufficient dimension reduction via squared-loss mutual information estimation” [Neural Computation] Vol.25 P.725-758 google doi
  • 13. Yamada M., Niu G., Takagi J., Sugiyama M. 2011 “Computationally efficient sufficient dimension reduction via squaredloss mutual information” [JMLR Workshop and Conference Proceedings] Vol.20 P.247-262 google
  • 14. Karasuyama M. 2012 “Canonical dependency analysis based on squared-loss mutual information” [Neural Networks] Vol.34 P.46-55 google doi
  • 15. Yamada M., Sugiyama M. 2011 “Cross-domain object matching with model selection” [JMLR Workshop and Conference Proceedings] Vol.15 P.807-815 google
  • 16. Suzuki T., Sugiyama M. 2011 “Least-squares independent component analysis” [Neural Computation] Vol.23 P.284-301 google doi
  • 17. Sugiyama M., Yamada M., Kimura M., Hachiya H. 2011 “On information-maximization clustering: tuning parameter selection and analytic solution” [in Proceedings of the 28th International Conference on Machine Learning] P.65-72 google
  • 18. Kimura M., Sugiyama M. 2011 “Dependence maximization clustering with least-squares mutual information” [Journal of Advanced Computational Intelligence and Intelligent InforInformatics] Vol.15 P.800-805 google
  • 19. Yamada M., Sugiyama M. 2010 “Dependence minimizing regression with model selection for nonlinear causal inference under non-Gaussian noise” [in Proceedings of the 24th AAAI Conference on Artificial Intelligence] P.643-648 google
  • 20. Vapnik V. N. 1998 Statistical Learning Theory google
  • 21. Sugiyama M., Suzuki T., Nakajima S., Kashima H., von Bunau P., Kawanabe M. 2008 “Direct importance estimation for covariate shift adaptation” [Annals of the Institute of Statistical Mathematics] Vol.60 P.699-746 google doi
  • 22. Nguyen X., Wainwright M. J., Jordan M. I. 2010 “Estimating divergence functionals and the likelihood ratio by convex risk minimization” [IEEE Transactions on Information Theory] Vol.56 P.5847-5861 google doi
  • 23. Kanamori T., Hido S., Sugiyama M. 2009 “A least-squares approach to direct importance estimation” [Journal of Machine Learning Research] Vol.10 P.1391-1445 google
  • 24. Yamada M., Suzuki T., Kanamori T., Hachiya H., Sugiyama M. 2013 “Relative density-ratio estimation for robust distribution comparison” [Neural Computation] Vol.25 P.1324-1370 google doi
  • 25. Sugiyama M., Suzuki T., Kanamori T., du Plessis M. C., Liu S., Takeuchi I. 2013 “Density difference estimation” google
  • 26. Kullback S., Leibler R. A. 1951 “On information and sufficiency” [The Annals of Mathematical Statistics] Vol.22 P.79-86 google
  • 27. Amari S., Nagaoka H. 2000 Methods of Information Geometry google
  • 28. Sugiyama M., Suzuki T., Kanamori T. 2012 Density Ratio Estimation in Machine Learning google
  • 29. Cortes C., Mansour Y., Mohri M. 2010 “Learning bounds for importance weighting,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta, Eds., P.442-450 google
  • 30. Pearson K. 1900 “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling” [Philosophical Magazine] Vol.50 P.157-175 google doi
  • 31. Ali S. M., Silvey S. D. 1966 “A general class of coefficients of divergence of one distribution from another” [Journal of the Royal Statistical Society] Vol.28 P.131-142 google
  • 32. Csiszar I. 1967 “Information-type measures of difference of probability distributions and indirect observation” [Studia Scientiarum Mathematicarum Hungarica] Vol.2 P.299-318 google
  • 33. Sugiyama M., Suzuki T., Kanamori T. 2012 “Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation” [Annals of the Institute of Statistical Mathematics] Vol.64 P.1009-1044 google doi
  • 34. Tsuboi Y., Kashima H., Hido S., Bickel S., Sugiyama M. 2009 “Direct density ratio estimation for large-scale covariate shift adaptation” [Information and Media Technologies] Vol.4 P.529-546 google
  • 35. Yamada M., Sugiyama M. 2009 “Direct importance estimation with Gaussian mixture models” [IEICE Transactions on Information and Systems] Vol.92 P.2159-2162 google doi
  • 36. Yamada M., Sugiyama M., Wichern G., Simm J. 2010 “Direct importance estimation with a mixture of probabilistic principal component analyzers” [IEICE Transactions on Information and Systems] Vol.93 P.2846-2849 google doi
  • 37. Keziou A. 2003 “Dual representation of φ-divergences and applications” [Comptes Rendus Mathematique] Vol.336 P.857-862 google doi
  • 38. Rockafellar R. T. 1970 Convex Analysis. google
  • 39. Tibshirani R. 1996 “Regression shrinkage and subset selection with the lasso” [Journal of the Royal Statistical Society B] Vol.58 P.267-288 google
  • 40. Tomioka R., Suzuki T., Sugiyama M. 2011 “Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation” [Journal of Machine Learning Research] Vol.12 P.1537-1586 google
  • 41. Efron B., Hastie T., Johnstone I., Tibshirani R. 2004 “Least angle regression” [The Annals of Statistics] Vol.32 P.407-499 google doi
  • 42. Chapelle O., Scholkopf B., Zien A. 2006 Semi-Supervised Learning. google
  • 43. Rifkin R., Yeo G., Poggio T. 2003 “Regularized least-squares classification,” in Advances in Learning Theory: Methods, Models and Applications, J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. P.131-154 google
  • 44. Sugiyama M., Krauledat M., Muller K. R. 2007 “Covariate shift adaptation by importance weighted cross validation” [Journal of Machine Learning Research] Vol.8 P.985-1005 google
  • 45. Liu T., Yuan Z., Sun J., Wang J., Zheng N., Tang X., Shum H. Y. 2011 “Learning to detect a salient object” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.33 P.353-367 google doi
  • 46. Shannon C. E. 1948 “A mathematical theory of communication” [Bell Systems Technical Journal] Vol.27 P.379-423 google
  • 47. Cover T. M., Thomas J. A. 2006 Elements of Information Theory google
  • 48. Torkkola K. 2003 “Feature extraction by non-parametric mutual information maximization” [Journal of Machine Learning Research] Vol.3 P.1415-1438 google
  • 49. Sainui J., Sugiyama M. 2013 “Direct approximation of quadratic mutual information and its application to dependencemaximization clustering” google
  • 50. Sugiyama M., Kawanabe M., Chui P. L. 2010 “Dimensionality reduction for density ratio estimation in high-dimensional spaces” [Neural Networks] Vol.23 P.44-59 google doi
  • 51. Sugiyama M., Yamada M., von Bunau P., Suzuki T., Kanamori T., Kawanabe M. 2011 “Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search” [Neural Networks] Vol.24 P.183-198 google doi
  • 52. Yamada M., Sugiyama M. 2011 “Direct density-ratio estimation with dimensionality reduction via hetero-distributional subspace analysis” [in Proceedings of the 25th AAAI Conference on Artificial Intelligence] P.549-554 google
  • [Fig. 1.] Schematic of class-prior estimation under class balance change.
    Schematic of class-prior estimation under class balance change.
  • [Fig. 2.] (Left) Squared error of class-prior estimation. (Right) Misclassification error by a weighted L2-regularized kernel least-squares classifier with weighted cross-validation. (a) Australian, (b) diabetes, (c) German, and (d) statlogheart datasets. LSDD: least-squares density difference, KDE: kernel density estimation.
    (Left) Squared error of class-prior estimation. (Right) Misclassification error by a weighted L2-regularized kernel least-squares classifier with weighted cross-validation. (a) Australian, (b) diabetes, (c) German, and (d) statlogheart datasets. LSDD: least-squares density difference, KDE: kernel density estimation.
  • [Fig. 3.] Schematic of change-point detection in timeseries.
    Schematic of change-point detection in timeseries.
  • [Fig. 4.] Results of change-point detection. Original time-series data is plotted in the top graphs, and change scores obtained by KLIEP (Kullback-Leibler importance estimation procedure; Section III-A) and LSDD (least-squares density difference; Section III-D) are plotted in the bottom graphs. (a) CENSREC (Corpora and Environments for Noisy Speech Recognition) dataset, (b) HASC (Human Activity Sensing Consortium) dataset.
    Results of change-point detection. Original time-series data is plotted in the top graphs, and change scores obtained by KLIEP (Kullback-Leibler importance estimation procedure; Section III-A) and LSDD (least-squares density difference; Section III-D) are plotted in the bottom graphs. (a) CENSREC (Corpora and Environments for Noisy Speech Recognition) dataset, (b) HASC (Human Activity Sensing Consortium) dataset.
  • [Fig. 5.] Schematic of salient object detection in an image.
    Schematic of salient object detection in an image.
  • [Fig. 6.] Results of salient object detection in an image: (upper) original images and (lower) obtained saliency maps-brighter color means more salient.
    Results of salient object detection in an image: (upper) original images and (lower) obtained saliency maps-brighter color means more salient.