Robust Fuzzy Varying Coefficient Regression Analysis with Crisp Inputs and Gaussian Fuzzy Output

  • ABSTRACT

    This study presents a fuzzy varying coefficient regression model that is fitted after outliers have been removed, in order to improve the feasibility and effectiveness of the fuzzy regression model. The objective of our methodology is to allow the fuzzy regression coefficients to vary with a covariate while avoiding the impact of data contaminated by outliers. The fuzzy regression coefficients are represented by Gaussian fuzzy numbers. We also formulate a suitable goodness-of-fit index to evaluate the performance of the proposed methodology. A numerical example demonstrates its effectiveness.


  • KEYWORDS

    Gaussian fuzzy number, Goodness of fit, Outlier, Fuzzy varying coefficient regression

  • Ⅰ. INTRODUCTION

    As an important statistical analysis tool, the regression analysis model is often utilized to describe the statistical functional relationship between a response variable and a set of explanatory variables, so that the response variable can be predicted accordingly. The traditional regression model is anchored on binary logic. The sampling data used in traditional regression analysis has some strict assumptions: every observation is independent of others, the sampling data has a certain probability distribution, and so on. In actual practice, however, the description of the observations is often vague, and the data are often influenced by subjective judgment, or described in linguistic terms.

    In recent years, fuzzy logic has become more widely used in statistical analysis. In their pioneering work, Tanaka et al. [1] employed fuzzy input data to establish a fuzzy regression analysis model. Since then, the fuzzy regression model and its applications have attracted considerable attention in many fields, such as engineering, economics, management science, and environmental science. In the traditional regression model, the deviation between the experimental data and the model is attributed to observation error, whereas the fuzzy regression model attributes it to the fuzziness of the system structure and of the regression parameters.

    In terms of predictive capability, the fuzzy linear regression model is inferior to the traditional linear regression model, whereas their comparative descriptive performance depends on various factors associated with the data set and proper specificity of the model, especially for large samples [2]. However, the relative performance of fuzzy linear regression improves as the size of the data set diminishes and the aptness of the regression model deteriorates. Fuzzy linear regression may therefore be used as an alternative to traditional linear regression for estimating regression parameters when the data are insufficient. Existing fuzzy regression models are mainly based on triangular or trapezoidal fuzzy numbers. However, because the relationships among the variables in a system are often nonlinear and complex, the estimation performance of these fuzzy regression models still needs to be improved.

    Following Tanaka et al. [1], many scholars have proposed approaches to solving the fuzzy regression model. These fall into two main kinds. 1) The interval regression method: minimize the total spread of the fuzzy parameters, subject to the constraint that the membership of the estimate is not less than a predefined value [3-5]. This method essentially transforms a fuzzy regression problem into an optimization problem. 2) The fuzzy least squares method: minimize the total squared distance between the observed and estimated values of the response variable, and derive equations similar to the normal equations of traditional regression analysis to obtain the fuzzy parameters [6-11].

    Besides these two kinds of approaches, there are other ways to solve a fuzzy regression model. For example, the Monte Carlo method can be applied to the fuzzy regression model to obtain the optimal solution within a predetermined error bound [12, 13]. As a classification technique proposed by Vapnik [14], the support vector machine (SVM) has been successful in solving pattern recognition and function estimation problems. Hong and Hwang [15] studied the convex optimization problem of a multi-fuzzy linear regression model via SVM. Hao and Chiang [16] applied fuzzy set theory to SVM, setting the parameters, such as the components of the weight vector and the bias term, to be fuzzy numbers. By using different kernel functions, their method can achieve automatic accuracy control in the fuzzy regression analysis task. By assigning fuzzy membership values to data samples, Khemchandani et al. [17] proposed an approach to fuzzy support vector regression for financial time series forecasting. Wu and Law [18] proposed a fuzzy SVM with the ability to penalize Gaussian noise in triangular fuzzy number space. Lin and Pai [19] employed support vector regression to calculate fuzzy upper and lower bounds, and formulated a fuzzy SVM model for forecasting the indices of business cycles.

    Previous fuzzy regression models have difficulty dealing with input data that vary with a covariate. Shen et al. [20] proposed a fuzzy varying coefficient model, in which the fuzzy coefficients are allowed to vary with a covariate. The fuzzy varying coefficient regression model is an extension of the fuzzy linear regression model, and it improves the feasibility and adaptability of the fuzzy linear model. Their approach includes the following two steps: 1) After the fuzzy varying coefficient regression model is given, a proper kernel function is selected based on the distance of fuzzy numbers, and the cross-validation method is used to determine the smoothing parameter. 2) According to the definition of the distance between fuzzy numbers, the objective function is determined, and the estimate of the response variable is obtained by the least squares method.

    From the viewpoint of robust analysis, the least squares method above is not robust enough: when the data contain individual abnormal observations, called outliers, the least squares estimate is unreliable, and in the worst case the conclusion may be incorrect. Therefore, Watada and Yabuuchi [21] suggested removing the irregular data or outliers before constructing the fuzzy regression model. Combined with robust analysis, the fuzzy regression model is freed from the influence of outliers. In this article, we integrate the fuzzy varying coefficient regression model with robust analysis to improve the feasibility and effectiveness of the fuzzy regression model.

    The rest of the paper is organized as follows. Section Ⅱ presents the fuzzy varying coefficient regression model and the estimation of its fuzzy regression coefficients. Section Ⅲ describes the robust analysis and introduces the notion of goodness of fit (GOF) of Gaussian fuzzy numbers for evaluating the model. In Section Ⅳ, a numerical example is used to demonstrate the effectiveness of the proposed methodology. Finally, the conclusions of this work are summarized in Section Ⅴ.

    Ⅱ. FUZZY VARYING COEFFICIENT REGRESSION MODEL

      >  A. Distance of Gaussian Fuzzy Numbers

    Definition 1. A fuzzy number Ã is called a Gaussian fuzzy number, denoted by Ã = (a, σ), if its membership function can be formulated as

    where a and σ are the center and spread of Ã, respectively.

    Let Ã = (a, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; if a = b and σ = τ, then Ã is equal to B̃, denoted as Ã = B̃. According to Zadeh's extension principle [22], Gaussian fuzzy numbers admit the following linear operations.

    Proposition 1. Let Ã = (a, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; then

    1) Ã + B̃ = (a + b, σ + τ)

    2) kÃ = (ka, kσ), ∀k ∈ R, k > 0
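
    The two operations of Proposition 1 can be illustrated with a minimal sketch. The class name `GaussianFuzzyNumber`, its field names, and the `scale` method are illustrative choices, not names used in the paper; only addition and multiplication by a positive scalar are implemented, since the proposition covers nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianFuzzyNumber:
    """A Gaussian fuzzy number (center, spread); names are illustrative."""
    center: float
    spread: float

    def __add__(self, other: "GaussianFuzzyNumber") -> "GaussianFuzzyNumber":
        # Proposition 1, item 1: centers add and spreads add.
        return GaussianFuzzyNumber(self.center + other.center,
                                   self.spread + other.spread)

    def scale(self, k: float) -> "GaussianFuzzyNumber":
        # Proposition 1, item 2 is stated only for k > 0.
        if k <= 0:
            raise ValueError("Proposition 1 covers only k > 0")
        return GaussianFuzzyNumber(k * self.center, k * self.spread)


A = GaussianFuzzyNumber(1.0, 0.5)
B = GaussianFuzzyNumber(2.0, 0.25)
print(A + B)         # center 3.0, spread 0.75
print(A.scale(2.0))  # center 2.0, spread 1.0
```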

    In order to characterize the degree of nearness of the Gaussian fuzzy numbers Ã and B̃, we use the following distance, proposed by Xu [23]:

    Here, f(λ) is a monotonically increasing function on the interval [0, 1], with f(0) = 0 and ∫_0^1 f(λ) dλ = 1/2. The degree of nearness and overlap between the λ-level sets Aλ and Bλ, denoted by d(Aλ, Bλ), is weighted by f(λ) to emphasize the contribution of the higher values of λ to the distance between Ã and B̃.

    In this article, we set f(λ) = 2λ^3 instead of f(λ) = λ, in order to place more weight on the higher values of λ; this leads to the following result.
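
    The effect of the weight f(λ) = 2λ^3 can be checked numerically. The sketch below rests on three assumptions that are not confirmed by the text: the membership function is taken to be exp(−((x − a)/σ)^2), the squared distance between level sets is taken to be the sum of the squared differences of their lower and upper endpoints, and the overall distance is taken to be the integral of f(λ) d^2(Aλ, Bλ) over [0, 1]. Under these assumptions the integral reduces to (a − b)^2 + (σ − τ)^2/4; a differently normalized membership function would change the constant on the spread term. All function names are hypothetical.

```python
import numpy as np


def level_set(a, sigma, lam):
    """Endpoints of the lambda-level set, ASSUMING membership exp(-((x-a)/sigma)^2)."""
    half_width = sigma * np.sqrt(-np.log(lam))
    return a - half_width, a + half_width


def squared_distance_numeric(A, B, n=200_000):
    """Midpoint-rule approximation of the weighted squared distance (assumed form)."""
    lam = (np.arange(n) + 0.5) / n          # midpoints of n equal cells in (0, 1)
    a_lo, a_hi = level_set(A[0], A[1], lam)
    b_lo, b_hi = level_set(B[0], B[1], lam)
    d2 = (a_lo - b_lo) ** 2 + (a_hi - b_hi) ** 2   # assumed interval distance
    weight = 2.0 * lam ** 3                        # f(lambda) = 2 * lambda^3
    return float(np.sum(weight * d2) / n)


def squared_distance_closed(A, B):
    """Closed form implied by the assumptions above."""
    return (A[0] - B[0]) ** 2 + 0.25 * (A[1] - B[1]) ** 2


A, B = (1.0, 0.5), (2.0, 1.5)
print(squared_distance_numeric(A, B), squared_distance_closed(A, B))
```

    Whatever the exact constant, the distance separates into a term in the centers and a term in the spreads, which is what allows the least-squares problem of the next subsection to be split into two independent problems.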

      >  B. Fuzzy Varying Coefficient Regression Model

    In order to describe the dynamic relationship between the response variable and a set of explanatory variables, Shen et al. [20] proposed a fuzzy varying coefficient regression model with its estimation. The procedure includes three steps as follows.

    Step 1: Construct the fuzzy varying coefficient regression model.

    Step 2: Determine the optimal value of the smoothing parameter by using the fuzzy cross-validation method.

    Step 3: According to the distance of fuzzy numbers, determine the objective function and obtain the estimate of the response variable by the restricted least squares method.

    The fuzzy varying coefficient regression model proposed by Shen et al. [20] is formulated as follows:

    where Xj, j = 1, 2, ..., m, and t are explanatory variables (input variables) expressed as crisp numbers; Ãj(t) = (aj(t), σj(t)), j = 1, 2, ..., m, are Gaussian fuzzy numbers varying with the variable t; and aj(t) and σj(t) are the center and spread of Ãj(t), respectively. According to the linear operations of Gaussian fuzzy numbers, the response variable is also a Gaussian fuzzy number, denoted by Ỹ = (Y, S), where Y and S are its center and spread. Generally, we take X1 ≡ 1 so that the model includes a fuzzy varying intercept.

    Suppose that we have n observations of the response variable Ỹ and of the explanatory variables Xj, j = 1, 2, ..., m, and t, where (xi1, xi2, ..., xim; ti) is the i-th observation of the explanatory variables, given as crisp numbers, and the Gaussian fuzzy number Ỹi = (yi, si) is the i-th observation of the response variable, i = 1, 2, ..., n.

    Therefore, the sample form of the model Eq. (7) is

    Now we estimate the fuzzy regression coefficients Ãj(t0), j = 1, 2, ..., m, at any given t = t0. On the basis of the distance proposed above and the principle of kernel smoothing in statistics, the restricted weighted least-squares problem is formulated as follows:

    where Kh(t) = K(t/h)/h, with K(·) being a Gaussian kernel function and h being the smoothing parameter determined by the fuzzy cross-validation procedure.

    The restricted weighted least-squares problem above is equivalent to minimizing two formulas as follows:

    and

    In order to minimize Eq. (10), we set to zero the partial derivatives of g1 with respect to ak as follows:

    The equations above are equivalent to the following equations:

    Let X = (xij)n×m, Y = (y1, y2, ..., yn)^T, S = (s1, s2, ..., sn)^T, W(t0) = diag(Kh(t1 − t0), Kh(t2 − t0), ..., Kh(tn − t0)), a(t0) = (a1(t0), a2(t0), ..., am(t0))^T, and σ(t0) = (σ1(t0), σ2(t0), ..., σm(t0))^T.

    Assuming that the inverse of X^T W(t0) X exists for any t0, the estimate of a(t0) is obtained as follows:

    $$\hat{a}(t_0) = \left(X^{T} W(t_0) X\right)^{-1} X^{T} W(t_0)\, Y$$

    Similarly, the estimate of σ(t0) without the positivity restriction is obtained as follows:

    $$\hat{\sigma}(t_0) = \left(X^{T} W(t_0) X\right)^{-1} X^{T} W(t_0)\, S$$

    Considering the positive restriction of the spread, Shen et al. [20] suggested that we utilize the method proposed by D'Urso et al. [24].
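
    The unrestricted estimates of the centers and spreads at a point t0 can be sketched as a kernel-weighted least-squares solve, assuming the usual closed form (X^T W(t0) X)^(-1) X^T W(t0) Y for the centers and the same expression with S for the spreads. The function names are hypothetical, the Gaussian kernel is assumed to be the standard normal density, and the positivity restriction on the spreads handled by the method of D'Urso et al. [24] is deliberately not implemented.

```python
import numpy as np


def gaussian_kernel_weights(t, t0, h):
    """K_h(t_i - t0) with K assumed to be the standard normal density."""
    u = (t - t0) / h
    return np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)


def local_estimate(X, y, s, t, t0, h):
    """Unrestricted kernel-weighted least-squares estimates at t0.

    X: (n, m) crisp inputs, y: (n,) observed centers, s: (n,) observed spreads,
    t: (n,) covariate values. Returns (a_hat, sigma_hat), each of shape (m,).
    """
    w = gaussian_kernel_weights(t, t0, h)
    XtW = X.T * w                      # scales column i of X^T by w_i, i.e. X^T W
    gram = XtW @ X                     # X^T W X, assumed invertible
    a_hat = np.linalg.solve(gram, XtW @ y)
    sigma_hat = np.linalg.solve(gram, XtW @ s)   # may be negative: no restriction
    return a_hat, sigma_hat


# Synthetic check: constant coefficients are recovered for any t0 and h.
t = np.arange(10.0)
X = np.column_stack([np.ones(10), np.sin(t)])
y = X @ np.array([1.0, 2.0])
s = X @ np.array([0.3, 0.1])
a_hat, sigma_hat = local_estimate(X, y, s, t, t0=4.0, h=2.0)
print(a_hat, sigma_hat)
```

    Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; the result is the same estimate whenever X^T W(t0) X is invertible.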

    Let Xi = (xi1, xi2, ..., xim)^T. Performing the estimation procedure above at t0 = t1, t2, ..., tn, respectively, we obtain the following fitted values of the center and spread of Ỹi:

      >  C. Determination of the Smoothing Parameter

    The role of the smoothing parameter h is to adjust the degree of smoothness of the estimates of the center and spread of the fuzzy regression coefficients. The fuzzy cross-validation procedure can be used to select the optimal value of the smoothing parameter. Suppose the number of observations is n. For each i = 1, 2, ..., n, remove the i-th observation Ỹi = (yi, si), and compute the estimates of the center and spread of the fuzzy coefficients Ãj(t) (j = 1, 2, ..., m) at t = ti from the remaining n − 1 observations, according to the procedure described above. With these estimates of the centers and spreads of the fuzzy coefficients under h, the predicted values of the center and spread of the fuzzy response at ti can be obtained by the following formulations, respectively:

    The cross-validation (CV) score is formulated as follows:

    Then, we select as the optimal value of the smoothing parameter the value h0 that minimizes the CV score.
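
    The leave-one-out selection of h can be sketched as a grid search. This is an illustration under stated assumptions, not the paper's exact procedure: the CV score is assumed to be the sum over i of the squared distance between the i-th observation and its leave-one-out prediction, with the squared distance taken as (center difference)^2 + (spread difference)^2/4; the candidate grid is supplied by the caller; and the spread estimates are left unrestricted. The function names are hypothetical.

```python
import numpy as np


def loo_cv_score(X, y, s, t, h):
    """Leave-one-out CV score for bandwidth h (assumed squared-distance form)."""
    n = len(y)
    score = 0.0
    for i in range(n):
        keep = np.arange(n) != i                       # drop the i-th observation
        # Kernel weights centered at t_i; the 1/(sqrt(2*pi)*h) factor cancels.
        w = np.exp(-0.5 * ((t[keep] - t[i]) / h) ** 2)
        XtW = X[keep].T * w
        gram = XtW @ X[keep]
        a_hat = np.linalg.solve(gram, XtW @ y[keep])
        sigma_hat = np.linalg.solve(gram, XtW @ s[keep])
        y_pred = X[i] @ a_hat                          # predicted center at t_i
        s_pred = X[i] @ sigma_hat                      # predicted spread at t_i
        score += (y[i] - y_pred) ** 2 + 0.25 * (s[i] - s_pred) ** 2
    return score


def select_bandwidth(X, y, s, t, candidates):
    """Return the candidate h with the smallest CV score, plus all scores."""
    scores = [loo_cv_score(X, y, s, t, h) for h in candidates]
    return candidates[int(np.argmin(scores))], scores


rng = np.random.default_rng(0)
n = 30
t = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0])      # constant coefficients: every h fits exactly
s = X @ np.array([0.3, 0.1])
h_best, scores = select_bandwidth(X, y, s, t, [0.2, 0.5, 1.0])
print(h_best, scores)
```

    A very small h relative to the spacing of the t values can make the weighted Gram matrix numerically singular, so the candidate grid should start well above that spacing.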

    There are two main advantages to using the fuzzy varying coefficient regression model: 1) Because the fuzzy regression coefficients may vary with another explanatory variable, the flexibility and adaptability of the fuzzy regression model are enhanced. 2) The model can deal with data that vary with a time variable, and establish a dynamic relationship between a response variable and a set of explanatory variables.

    Ⅲ. ROBUST ANALYSIS AND GOF OF GAUSSIAN FUZZY NUMBERS

      >  A. Robust Analysis

    From the robustness point of view, the least squares method is not robust: when the data contain individual abnormal values, or outliers, the least squares estimate is unreliable and may even lead to a wrong conclusion. Therefore, diagnostic checks should be performed on the data before applying the least squares method.

    Definition 2. An observation that has a bigger residual value than the others is called an outlier.

    The least square estimator is sensitive to outliers. Therefore, the estimation results are directly affected by each observation, and the data should be analyzed in detail. Sometimes, even a single observation may dramatically influence the value of the parameter estimates, and omitting this observation from the data may lead to totally different results. To handle the outlier problem, Hung and Yang [25] proposed an omission approach for Tanaka's linear programming method. This approach has the capability to examine the behavior of value changes in the objective function of fuzzy regression models, when observations are omitted. On the basis of the Least Median Squares-Weighted Least Squares estimation procedure, D'Urso et al. [24] proposed a robust fuzzy linear regression model to deal with data contaminated by outliers. To overcome the higher order or interactive terms, and the influence of outliers existing in the manufacturing process data, Chan et al. [26] integrated genetic programming with the fuzzy regression model.

    Next, we describe the M-estimation methods introduced by Huber [27], which are widely used for robust regression. M-estimation can be regarded as a generalization of maximum-likelihood estimation. The general M-estimator minimizes the following objective function:

    where, the function ρ gives the contribution of each residual to the objective function.

    By setting to zero the partial derivative of υ with respect to âj, we have m equations as follows:

    where ψ(z) = ρ'(z) is the derivative of ρ. The standardized residuals are defined as zi = ri/d, where ri = yi − ŷi is the i-th residual and d is a robust estimate of scale. Under normality, the expected value of d is the standard error of estimate in the population.
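
    M-estimation is commonly computed by iteratively reweighted least squares, which the following sketch illustrates. It is not a reproduction of Matlab's robustfit: the Huber ρ function, the tuning constant c = 1.345, the scale estimate d = MAD/0.6745, and the function name `huber_irls` are all choices made here for illustration.

```python
import numpy as np


def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimate of regression coefficients by IRLS (illustrative sketch)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary LS start
    for _ in range(n_iter):
        r = y - X @ beta
        d = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale (MAD)
        if d == 0.0:                                      # exact fit for most points
            break
        z = np.abs(r) / d                                 # |standardized residuals|
        # Huber weights psi(z)/z: 1 inside [-c, c], c/|z| outside.
        w = np.minimum(1.0, c / np.maximum(z, 1e-12))
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, y - X @ beta


# One gross outlier at index 15 on an otherwise exact line y = 1 + 2x.
x = np.arange(20.0)
X = np.column_stack([np.ones(20), x])
y = 1.0 + 2.0 * x
y[15] += 100.0
beta, resid = huber_irls(X, y)
print(beta, int(np.argmax(np.abs(resid))))
```

    Observations whose final residuals are far larger than the rest are the candidates for removal before the fuzzy model is fitted.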

    In this article, we will integrate robustness analysis with the fuzzy varying coefficient regression model. The main procedure is as follows:

    Step 1: Use the robustfit function of the Matlab toolbox to perform robust analysis, and remove the outliers according to the result.

    Step 2: Construct a fuzzy varying coefficient regression model, and use the fuzzy cross-validation method to determine the optimal value of the smoothing parameter.

    Step 3: Determine the objective function according to the distance defined in Eq. (6), and solve the least squares problem to obtain the estimates of the response variable.

    Step 4: Evaluate our approach by comparing it with Shen et al.'s approach [20] according to AGOF, which is introduced in Subsection Ⅲ-B.

      >  B. GOF of Gaussian Fuzzy Numbers

    In order to evaluate the fit performance of our approach, we next introduce an index named GOF for describing the nearness and overlap of two Gaussian fuzzy numbers. For two Gaussian fuzzy numbers Ã = (a, σ) and B̃ = (b, τ), the GOF, denoted by GOF(Ã, B̃), should be a strictly monotonically decreasing function of |b − a| and be equal to 1 if and only if Ã = B̃.

    We take GOF(Ã, B̃) to be the proportion of the shaded area to the larger of the two areas under the curves of the membership functions Ã(x) and B̃(x). Therefore, we define GOF(Ã, B̃) as follows.

    Definition 3. Let Ã = (a, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; the GOF for Ã and B̃ is defined as

    where, ϕ(·) is the standard normal distribution function.
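
    The area-ratio idea behind the GOF can be approximated numerically. The sketch below assumes that the shaded area is the region lying under both membership curves and that the membership function is exp(−((x − a)/σ)^2); neither assumption is confirmed by the text, and the function names are hypothetical. It computes the ratio by a Riemann sum rather than through the standard normal distribution function.

```python
import numpy as np


def membership(x, a, sigma):
    """ASSUMED Gaussian membership function exp(-((x - a) / sigma)^2)."""
    return np.exp(-((x - a) / sigma) ** 2)


def gof_area_ratio(A, B, n=200_001):
    """Overlap area divided by the larger of the two areas (numerical sketch)."""
    (a, sigma), (b, tau) = A, B
    lo = min(a - 8.0 * sigma, b - 8.0 * tau)   # tails beyond 8 spreads are negligible
    hi = max(a + 8.0 * sigma, b + 8.0 * tau)
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    mu_a = membership(x, a, sigma)
    mu_b = membership(x, b, tau)
    overlap = np.sum(np.minimum(mu_a, mu_b)) * dx   # area under both curves
    area_a = np.sum(mu_a) * dx
    area_b = np.sum(mu_b) * dx
    return float(overlap / max(area_a, area_b))


g_same = gof_area_ratio((0.0, 1.0), (0.0, 1.0))
g_half = gof_area_ratio((0.0, 1.0), (0.5, 1.0))
g_one = gof_area_ratio((0.0, 1.0), (1.0, 1.0))
print(g_same, g_half, g_one)
```

    With equal spreads the ratio decreases as the centers move apart and equals 1 only when the two fuzzy numbers coincide, which matches the two properties required of the GOF above.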

    For simplicity of calculation, we define another GOF for Ã and B̃, used in Section Ⅳ, as follows:

    Definition 4. Let Ã = (a, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; the GOF for Ã and B̃ is defined as

    In Section Ⅳ, for the purpose of evaluating the performance of our approach, we will use AGOF (average of all GOF for the observations with their estimates) to compare our approach with the approach proposed by Shen et al. [20].

    Ⅳ. AN ILLUSTRATED EXAMPLE

    Gross domestic product (GDP) is the market value of all officially recognized final goods and services produced within a country in a given period. In economics, GDP is the sum of consumption, investment, government spending, and net exports. In this article, we take the dominant explanatory variables determining GDP to be the total amount of urban and rural savings deposits, the total financial expenditure, fixed assets investment, and the sum of imports and exports, which are denoted by X1, X2, X3, and X4, respectively. We remove the outliers by using the robustfit function in Matlab, and then construct the fuzzy varying coefficient regression model for fitting the response (GDP).

    The dataset shown in Table 1 consists of 29 observations of the total amount of urban and rural savings deposit, the total financial expenditure, fixed assets investment, sum of imports and exports, and GDP of China from 1981 to 2009, in which the GDP is assumed to be a Gaussian fuzzy number.

    First, we utilize the robustfit function in Matlab to analyze the robustness of this dataset; the diagram of residual errors is given in Fig. 1.

    The residual of the 27th observation is 3.54 × 10^4, which is much larger than that of the other observations, so we regard it as an outlier and remove it. We then construct the fuzzy varying coefficient regression model without the 27th observation, and estimate the parameters.

    Next, we set the variable t to be the time order of each year and consider the following fuzzy varying coefficient model:

    With the selected optimal value of h = 2.16 obtained by the fuzzy cross-validation procedure, we calibrate model Eq. (25) with the restricted weighted least-squares procedure.

    For comparison, the fitted values of Shen et al.'s method and of our method are shown in the third and fifth columns of Table 2, respectively, with their GOF values shown in the fourth and sixth columns. The AGOF of Shen et al.'s method is 0.8943, while the AGOF of our method is 0.9934. The larger AGOF of our method indicates that its fit performance is better than that of the approach proposed by Shen et al. [20].

    Ⅴ. CONCLUSION

    The aim of our study is to propose a methodology for dealing with the dynamic fuzzy functional relationship between the response variable and a set of explanatory variables, while simultaneously avoiding the impact of outliers in the dataset. Our methodology can improve the feasibility and effectiveness of the fuzzy regression model. Specifically, the contribution of this work can be summarized in two points. First, it provides a dynamic model for data varying with a covariate, especially for sampling data that approximately follow a Gaussian distribution. Second, with the help of robust analysis, the outliers are removed before the model is fitted, so that the proposed model is free from the influence of irregular data. From the robustness standpoint, a suitable index for characterizing the outliers needs to be studied in future work to improve the robustness of the fuzzy regression model.

  • 1. Tanaka H, Uejima S, Asai K 1982 “Linear regression analysis with fuzzy model,” [IEEE Transactions on Systems, Man and Cybernetics] Vol.12 P.903-907
  • 2. Kim K. J, Moskowitz H, Koksalan M 1996 “Fuzzy versus statistical linear regression,” [European Journal of Operational Research] Vol.92 P.417-434
  • 3. Chen Y, Tang J, Fung R. Y. K, Ren Z 2004 “Fuzzy regression-based mathematical programming model for quality function deployment,” [International Journal of Production Research] Vol.42 P.1009-1027
  • 4. Tanaka H 1987 “Fuzzy data analysis by possibilistic linear models,” [Fuzzy Sets and Systems] Vol.24 P.363-375
  • 5. Tanaka H, Watada J 1988 “Possibilistic linear systems and their application to the linear regression model,” [Fuzzy Sets and Systems] Vol.27 P.275-289
  • 6. Diamond P 1988 “Fuzzy least squares,” [Information Sciences] Vol.46 P.141-157
  • 7. D'Urso P, Santoro A 2006 “Goodness of fit and variable selection in the fuzzy multiple linear regression,” [Fuzzy Sets and Systems] Vol.157 P.2627-2647
  • 8. Ferraro M. B, Coppi R, Gonzalez Rodriguez G, Colubi A 2010 “A linear regression model for imprecise response,” [International Journal of Approximate Reasoning] Vol.51 P.759-770
  • 9. Taheri S. M, Kelkinnama M 2012 “Fuzzy linear regression based on least absolutes deviations,” [Iranian Journal of Fuzzy Systems] Vol.9 P.121-140
  • 10. Waterman M. S 1974 “A restricted least squares problem,” [Technometrics] Vol.16 P.135-136
  • 11. Wu H. C 2011 “The construction of fuzzy least squares estimators in fuzzy linear regression models,” [Expert Systems with Applications] Vol.38 P.13632-13640
  • 12. Abdalla A, Buckley J. J 2007 “Monte Carlo methods in fuzzy linear regression I,” [Soft Computing] Vol.11 P.991-996
  • 13. Abdalla A, Buckley J. J 2007 “Monte Carlo methods in fuzzy linear regression II,” [Soft Computing] Vol.12 P.463-468
  • 14. Vapnik V. N 1995 The Nature of Statistical Learning Theory
  • 15. Hong D. H, Hwang C 2003 “Support vector fuzzy regression machines,” [Fuzzy Sets and Systems] Vol.138 P.271-281
  • 16. Hao P. Y, Chiang J. H 2008 “Fuzzy regression analysis by support vector learning approach,” [IEEE Transactions on Fuzzy Systems] Vol.16 P.428-441
  • 17. Khemchandani R, Chandra S 2009 “Regularized least squares fuzzy support vector regression for financial time series forecasting,” [Expert Systems with Applications] Vol.36 P.132-138
  • 18. Wu Q, Law R 2010 “Fuzzy support vector regression machine with penalizing Gaussian noises on triangular fuzzy number space,” [Expert Systems with Applications] Vol.37 P.7788-7795
  • 19. Lin K. P, Pai P. F 2010 “A fuzzy support vector regression model for business cycle predictions,” [Expert Systems with Applications] Vol.37 P.5430-5435
  • 20. Shen S. L, Mei C. L, Cui J. L 2010 “A fuzzy varying coefficient model and its estimation,” [Computers and Mathematics with Applications] Vol.60 P.1696-1705
  • 21. Watada J, Yabuuchi Y 1994 “Fuzzy robust regression analysis,” [Proceeding of the 3rd IEEE International Conference on Fuzzy Systems] P.1370-1376
  • 22. Zadeh L. A 1965 “Fuzzy sets,” [Information and Control] Vol.8 P.338-353
  • 23. Xu R. N 1991 “A linear regression model in fuzzy environment,” [Advance in Modelling Simulation] Vol.27 P.31-40
  • 24. D'Urso P, Massari R, Santoro A 2011 “Robust fuzzy regression analysis,” [Information Sciences] Vol.181 P.4154-4174
  • 25. Hung W. L, Yang M. S 2006 “An omission approach for detecting outliers in fuzzy regression models,” [Fuzzy Sets and Systems] Vol.157 P.3109-3122
  • 26. Chan K. Y, Kwong C. K, Fogarty T. C 2010 “Modeling manufacturing processes using a genetic programming-based fuzzy regression with detection of outliers,” [Information Sciences] Vol.180 P.506-518
  • 27. Huber P. J 1964 “Robust estimation of a location parameter,” [Annals of Mathematical Statistics] Vol.35 P.73-101
  • [Table 1.] Observations of gross domestic product (GDP) from 1981 to 2009 (unit: CNY 10,000)
  • [Fig. 1.] Scatter diagram of robust analysis residual.
  • [Table 2.] Comparison between the observations and estimates of GDP (unit: CNY 10,000)