Robust Fuzzy Varying Coefficient Regression Analysis with Crisp Inputs and Gaussian Fuzzy Output

Yang Zhihui; Yin Yunqiang; Chen Yizeng

doi:10.5626/JCSE.2013.7.4.263

Journal of Computing Science and Engineering

Published in: Journal of Computing Science and Engineering, Volume 7, Issue 4, pp. 263-271, 30 Dec 2013

KEYWORD

Gaussian fuzzy number, Goodness of fit, Outlier, Fuzzy varying coefficient regression


Ⅰ. INTRODUCTION

As an important statistical analysis tool, the regression model is often used to describe the statistical relationship between a response variable and a set of explanatory variables, so that the response variable can be predicted accordingly. The traditional regression model is anchored on binary logic, and it imposes strict assumptions on the sample data: every observation is independent of the others, the data follow a certain probability distribution, and so on. In practice, however, observations are often described vaguely, influenced by subjective judgment, or expressed in linguistic terms.

In recent years, fuzzy logic has become more widely used in statistical analysis. In their pioneering work, Tanaka et al. [1] employed fuzzy input data to establish a fuzzy regression analysis model. Since then, the fuzzy regression model and its applications have attracted considerable attention in many fields, such as engineering, economics, management science, and environmental science. In the traditional regression model, the deviation between the experimental data and the model is interpreted as observation error, whereas the fuzzy regression model attributes this deviation to the fuzziness of the system structure and of the regression parameters.

The fuzzy linear regression model is inferior to the traditional linear regression model in terms of predictive capability, whereas their comparative descriptive performance depends on various factors associated with the data set and proper specificity of the model, especially for large samples [2]. However, the relative performance of fuzzy linear regression improves as the size of the data set diminishes and the aptness of the regression model deteriorates. Fuzzy linear regression may therefore be used as an alternative to traditional linear regression for estimating regression parameters when the data are insufficient. Existing fuzzy regression models are mainly based on triangular or trapezoidal fuzzy numbers. However, because the relationships among the variables in a system are often nonlinear and complex, the estimation performance of these fuzzy regression models needs to be further improved.

Following Tanaka et al. [1], many scholars have proposed approaches to solving the fuzzy regression model. Most of them fall into two kinds. 1) The interval regression method: minimize the total spread of the fuzzy parameters, subject to the constraint that the membership of the estimate is not less than a predefined value [3-5]. This method essentially transforms a fuzzy regression problem into an optimization problem. 2) The fuzzy least squares method: minimize the total squared distance between the observed and estimated values of the response variable, and derive equations similar to the normal equations of traditional regression analysis to obtain the fuzzy parameters [6-11].

Besides the two kinds of approaches mentioned above, there are other approaches to solving a fuzzy regression model. For example, the Monte Carlo method can be applied to the fuzzy regression model to obtain the optimal solution within a predetermined error bound [12, 13]. The support vector machine (SVM), a classification technique proposed by Vapnik [14], has been successful in solving pattern recognition and function estimation problems. Hong and Hwang [15] studied the convex optimization problem of a multi-fuzzy linear regression model via SVM. Hao and Chiang [16] applied fuzzy set theory to SVM, setting parameters such as the components of the weight vector and the bias term to be fuzzy numbers. By using different kernel functions, their method can achieve automatic accuracy control in the fuzzy regression analysis task. By assigning fuzzy membership values to data samples, Khemchandani et al. [17] proposed an approach to fuzzy support vector regression for financial time series forecasting. Wu and Law [18] proposed a new fuzzy SVM with the ability to penalize Gaussian noise in triangular fuzzy number space. Lin and Pai [19] employed support vector regression to calculate fuzzy upper and lower bounds, and thereby formulated a fuzzy SVM model for forecasting the indices of business cycles.

Previous fuzzy regression models have difficulty dealing with input data that vary with a covariate. Shen et al. [20] proposed a fuzzy varying coefficient model, in which the fuzzy coefficients are allowed to vary with a covariate. The fuzzy varying coefficient regression model is an extension of the fuzzy linear regression model, and it improves the feasibility and adaptability of the fuzzy linear model. Their procedure includes the following two steps: 1) After the fuzzy varying coefficient regression model is specified, a proper kernel function is selected based on the distance of fuzzy numbers, and the cross-validation method is used to determine the smoothing parameter. 2) According to the definition of the distance between fuzzy numbers, the objective function is determined, and the estimate of the response variable is obtained by the least squares method.

From the viewpoint of robust analysis, the least squares method is not robust enough: when the data contain abnormal observations, called outliers, the least squares estimate is unreliable, and in the worst case the conclusions may be incorrect. Therefore, Watada and Yabuuchi [21] suggested removing irregular data or outliers before constructing the fuzzy regression model. Combined with robust analysis, the fuzzy regression model can be freed from the influence of outliers. In this article, we integrate the fuzzy varying coefficient regression model with robust analysis to improve the feasibility and effectiveness of the fuzzy regression model.

The rest of the paper is organized as follows. Section Ⅱ presents the fuzzy varying coefficient regression model and the estimation of its fuzzy regression coefficients. Section Ⅲ introduces the notion of goodness of fit (GOF) for evaluating the model, together with the distance of fuzzy numbers, and employs them in robust analysis. In Section Ⅳ, a numerical example is used to demonstrate the effectiveness of the proposed methodology. Finally, the conclusions of this work are summarized in Section Ⅴ.

Ⅱ. FUZZY VARYING COEFFICIENT REGRESSION MODEL

  > A. Distance of Gaussian Fuzzy Numbers

Definition 1. A fuzzy number Ã is called a Gaussian fuzzy number, denoted by Ã = (α, σ) if its membership function can be formulated as

[]

where α and σ are the center and spread of Ã, respectively.

Let Ã = (α, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; if α = b and σ = τ, then Ã is equal to B̃, denoted as Ã = B̃. According to Zadeh's extension principle [22], Gaussian fuzzy numbers have the following linear operations.

Proposition 1. Let Ã = (α, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; then

1) Ã + B̃ = (α + b, σ + τ)

2) kÃ = (kα, kσ), ∀k ∈ R, k > 0
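The two operations in Proposition 1 can be sketched in a few lines of Python. The class name `GaussianFuzzy` and its field names are hypothetical and chosen only for illustration; the sketch stores a Gaussian fuzzy number as a (center, spread) pair and implements nothing beyond the two rules stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianFuzzy:
    """Gaussian fuzzy number stored as (center, spread); hypothetical helper."""
    center: float
    spread: float

    def __add__(self, other: "GaussianFuzzy") -> "GaussianFuzzy":
        # Proposition 1, item 1: centers add and spreads add.
        return GaussianFuzzy(self.center + other.center,
                             self.spread + other.spread)

    def scale(self, k: float) -> "GaussianFuzzy":
        # Proposition 1, item 2, which is stated only for k > 0.
        if k <= 0:
            raise ValueError("Proposition 1 covers k > 0 only")
        return GaussianFuzzy(k * self.center, k * self.spread)


a = GaussianFuzzy(2.0, 0.5)
b = GaussianFuzzy(1.0, 0.25)
total = a + b          # center 3.0, spread 0.75
scaled = a.scale(3.0)  # center 6.0, spread 1.5
```

Because both rules act on the center and the spread separately, a linear combination of Gaussian fuzzy coefficients with positive crisp multipliers is again a Gaussian fuzzy number, which is what the regression model in Subsection Ⅱ-B relies on.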

In order to characterize the degree of nearness of the Gaussian fuzzy numbers Ã and B̃, we use the following distance, proposed by Xu [23]:

[]

[]

[]

[]

f(λ) is a monotonically increasing function on the interval [0,1], with f(0) = 0 and ∫₀¹ f(λ) dλ = 1/2. The degree of nearness and overlap between the λ-level sets A_λ and B_λ, denoted by d(A_λ, B_λ), is weighted by f(λ) to emphasize the contribution of the higher values of λ to the distance between Ã and B̃.

In this article, we set f(λ) = 2λ³ instead of f(λ) = λ, to further emphasize the contribution of the higher values of λ; this leads to the following result.

[]
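As a hedged numerical check of how such a weighted distance behaves, the sketch below integrates f(λ) = 2λ³ against a level-set distance. Two things are assumptions, not taken from the text: the membership function is taken to be exp(−((x − α)/σ)²), and the level-set distance d² is taken to be the sum of squared differences of the interval endpoints. Under exactly these assumptions the integral has the closed form (α − b)² + (σ − τ)²/4; a different membership convention would change the constant in front of the spread term.

```python
import numpy as np


def level_set(center, spread, lam):
    # Lambda-level set of the ASSUMED membership exp(-((x - center) / spread)**2).
    half_width = spread * np.sqrt(-np.log(lam))
    return center - half_width, center + half_width


def weighted_sq_distance(A, B, n_grid=200_000):
    # Midpoint rule for the integral over (0, 1) of f(lam) * d^2(A_lam, B_lam),
    # with f(lam) = 2 * lam**3 as chosen in the text.
    lam = (np.arange(n_grid) + 0.5) / n_grid
    a_lo, a_hi = level_set(A[0], A[1], lam)
    b_lo, b_hi = level_set(B[0], B[1], lam)
    # ASSUMED level-set distance: squared endpoint differences.
    d2 = (a_lo - b_lo) ** 2 + (a_hi - b_hi) ** 2
    return float(np.sum(2.0 * lam ** 3 * d2) / n_grid)


def closed_form(A, B):
    # Closed form under the same assumptions: (a - b)^2 + (sigma - tau)^2 / 4.
    return (A[0] - B[0]) ** 2 + 0.25 * (A[1] - B[1]) ** 2


A, B = (2.0, 0.5), (1.0, 1.5)   # made-up (center, spread) pairs
numeric = weighted_sq_distance(A, B)
exact = closed_form(A, B)
```

The point of the check is structural: the squared distance separates into a center term and a spread term, which is why the weighted least-squares problem in Subsection Ⅱ-B splits into one problem for the centers and one for the spreads.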

  > B. Fuzzy Varying Coefficient Regression Model

In order to describe the dynamic relationship between the response variable and a set of explanatory variables, Shen et al. [20] proposed a fuzzy varying coefficient regression model with its estimation. The procedure includes three steps as follows.

Step 1: Construct the fuzzy varying coefficient regression model.

Step 2: Determine the optimal value of the smoothing parameter by using the fuzzy cross-validation method.

Step 3: According to the distance of fuzzy numbers, determine the objective function and obtain the estimate of the response variable by the restricted least squares method.

The fuzzy varying coefficient regression model proposed by Shen et al. [20] is formulated as follows:

$$\tilde{Y} = \tilde{A}_1(t) X_1 + \tilde{A}_2(t) X_2 + \cdots + \tilde{A}_m(t) X_m \tag{7}$$

where X_j, j = 1, 2, ..., m, and t are explanatory variables (input variables) expressed by crisp numbers; Ã_j(t) = (α_j(t), σ_j(t)), j = 1, 2, ..., m, are Gaussian fuzzy numbers varying with the variable t; and α_j(t) and σ_j(t) are the center and spread of Ã_j(t), respectively. According to the linear operations of Gaussian fuzzy numbers, the response variable Ỹ is also a Gaussian fuzzy number, denoted by Ỹ = (Y, S), where Y = Σ_{j=1}^{m} α_j(t) X_j and S = Σ_{j=1}^{m} σ_j(t) X_j. Generally, we take X₁ ≡ 1, so that the model includes a fuzzy varying intercept.

Suppose that we have experimental data on the response variable Ỹ and on the explanatory variables X_j, j = 1, 2, ..., m, and t: (x_i1, x_i2, ..., x_im; t_i; (y_i, s_i)), where (x_i1, x_i2, ..., x_im; t_i) is the i-th observation of the explanatory variables, given as crisp numbers, and the Gaussian fuzzy number ỹ_i = (y_i, s_i) is the i-th observation of the response variable, i = 1, 2, ..., n.

Therefore, the sample form of the model Eq. (7) is

$$\tilde{y}_i = \sum_{j=1}^{m} \tilde{A}_j(t_i)\, x_{ij}, \quad i = 1, 2, \ldots, n \tag{8}$$

Now we estimate the fuzzy regression coefficients Ã_j(t₀), j = 1, 2, ..., m, for any given t = t₀. On the basis of the distance proposed above and the principle of kernel smoothing in statistics, the restricted weighted least-squares problem is formulated as follows:

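$$\min \sum_{i=1}^{n} D^2\Big(\tilde{y}_i,\ \sum_{j=1}^{m} \tilde{A}_j(t_0)\, x_{ij}\Big) K_h(t_i - t_0) \quad \text{s.t.}\ \ \sigma_j(t_0) \ge 0,\ j = 1, 2, \ldots, m \tag{9}$$

where D(·,·) denotes the distance between Gaussian fuzzy numbers given in Eq. (6), K_h(t) = K(t/h)/h, with K(t) = exp(−t²/2)/√(2π) being a Gaussian kernel function, and h being the smoothing parameter determined by the fuzzy cross-validation procedure.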

The restricted weighted least-squares problem above is equivalent to minimizing two formulas as follows:

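$$g_1 = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} \alpha_j(t_0)\, x_{ij} \Big)^2 K_h(t_i - t_0) \tag{10}$$

and

$$g_2 = \sum_{i=1}^{n} \Big( s_i - \sum_{j=1}^{m} \sigma_j(t_0)\, x_{ij} \Big)^2 K_h(t_i - t_0), \quad \sigma_j(t_0) \ge 0,\ j = 1, 2, \ldots, m \tag{11}$$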

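In order to minimize Eq. (10), we set the partial derivatives of g₁ with respect to α_k(t₀), k = 1, 2, ..., m, to zero:

$$\frac{\partial g_1}{\partial \alpha_k(t_0)} = -2 \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} \alpha_j(t_0)\, x_{ij} \Big) x_{ik}\, K_h(t_i - t_0) = 0, \quad k = 1, 2, \ldots, m \tag{12}$$

The equations above are equivalent to the following equations:

$$\sum_{j=1}^{m} \alpha_j(t_0) \sum_{i=1}^{n} x_{ij}\, x_{ik}\, K_h(t_i - t_0) = \sum_{i=1}^{n} y_i\, x_{ik}\, K_h(t_i - t_0), \quad k = 1, 2, \ldots, m \tag{13}$$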

Let X = (x_ij)_{n×m}, Y = (y₁, y₂, ..., y_n)^T, S = (s₁, s₂, ..., s_n)^T, W(t₀) = diag(K_h(t₁ − t₀), K_h(t₂ − t₀), ..., K_h(t_n − t₀)), α(t₀) = (α₁(t₀), α₂(t₀), ..., α_m(t₀))^T, and σ(t₀) = (σ₁(t₀), σ₂(t₀), ..., σ_m(t₀))^T.

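Assuming that the inverse of X^T W(t₀)X exists for any t₀, the estimate of α(t₀) is obtained as follows:

$$\hat{\alpha}(t_0) = \big( X^T W(t_0) X \big)^{-1} X^T W(t_0)\, Y \tag{14}$$

Similarly, the estimate of σ(t₀), without the positivity restriction, is obtained as follows:

$$\hat{\sigma}(t_0) = \big( X^T W(t_0) X \big)^{-1} X^T W(t_0)\, S \tag{15}$$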

Considering the positive restriction of the spread, Shen et al. [20] suggested that we utilize the method proposed by D'Urso et al. [24].

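Let X_i = (x_i1, x_i2, ..., x_im)^T. Performing the estimation procedure above at t₀ = t₁, t₂, ..., t_n, respectively, we obtain the following fitted values of the center and spread of Ỹ:

$$\hat{y}_i = X_i^T \hat{\alpha}(t_i), \qquad \hat{s}_i = X_i^T \hat{\sigma}(t_i), \qquad i = 1, 2, \ldots, n$$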
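The estimation step can be sketched with NumPy as a kernel-weighted least-squares solve at each t₀. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the data are synthetic, and negative spread estimates are simply clipped at zero as a crude stand-in for the positivity restriction (this is not the method of D'Urso et al. [24]).

```python
import numpy as np


def gaussian_kernel_weights(t, t0, h):
    # K_h(t_i - t0) = K((t_i - t0) / h) / h with a standard Gaussian kernel K.
    u = (t - t0) / h
    return np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)


def local_wls(X, target, t, t0, h):
    # Solves (X^T W X) beta = X^T W target without forming the inverse.
    w = gaussian_kernel_weights(t, t0, h)
    XtW = X.T * w                      # same as X.T @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ target)


def fit_centers_and_spreads(X, y, s, t, h):
    # Fitted center and spread of the response at every observed t_i.
    y_hat = np.empty(len(t))
    s_hat = np.empty(len(t))
    for i, t0 in enumerate(t):
        alpha = local_wls(X, y, t, t0, h)
        # Clipping at zero is an ASSUMED simplification of the restriction.
        sigma = np.clip(local_wls(X, s, t, t0, h), 0.0, None)
        y_hat[i] = X[i] @ alpha
        s_hat[i] = X[i] @ sigma
    return y_hat, s_hat


# Synthetic data, made up for illustration: constant true coefficients.
t = np.arange(1.0, 11.0)
X = np.column_stack([np.ones(10), np.linspace(1.0, 4.0, 10) ** 2])
y = X @ np.array([1.0, 2.0])     # centers of the fuzzy responses
s = X @ np.array([0.1, 0.05])    # spreads of the fuzzy responses
y_hat, s_hat = fit_centers_and_spreads(X, y, s, t, h=2.0)
```

Using `np.linalg.solve` on the weighted normal equations avoids an explicit matrix inverse; when the true coefficients do not vary with t, as in this toy data, the local fit reproduces the centers and spreads up to rounding for any bandwidth.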

  > C. Determination of the Smoothing Parameter

The role of the smoothing parameter h is to adjust the degree of smoothness of the estimates of the centers and spreads of the fuzzy regression coefficients. The fuzzy cross-validation procedure can be used to select the optimal value of the smoothing parameter. Suppose the number of observations is n. For each i = 1, 2, ..., n, remove the i-th observation ỹ_i = (y_i, s_i), and compute the estimates of the centers and spreads of the fuzzy coefficients Ã_j(t) (j = 1, 2, ..., m) at t = t_i, according to the procedure described above. Let α̂_(−i)(t_i) and σ̂_(−i)(t_i) be the resulting vectors of estimated centers and spreads of the fuzzy coefficients under h; then the predicted values of the center and spread of the fuzzy response Ỹ at t_i can be obtained by the following formulations, respectively:

$$\hat{y}_{(-i)}(h) = X_i^T \hat{\alpha}_{(-i)}(t_i), \qquad \hat{s}_{(-i)}(h) = X_i^T \hat{\sigma}_{(-i)}(t_i)$$

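The cross-validation (CV) score is formulated as follows:

$$CV(h) = \sum_{i=1}^{n} D^2\Big( \tilde{y}_i,\ \big( \hat{y}_{(-i)}(h),\ \hat{s}_{(-i)}(h) \big) \Big)$$

where D(·,·) denotes the distance between Gaussian fuzzy numbers given in Eq. (6), and (ŷ_(−i)(h), ŝ_(−i)(h)) is the fuzzy response predicted at t_i with the i-th observation removed.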

Then, we select h₀ as the optimal value of the smoothing parameter, such that

$$CV(h_0) = \min_{h > 0} CV(h)$$
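The leave-one-out selection of h can be sketched as a grid search. The function names, the candidate grid, and the toy data below are all hypothetical. The score adds the squared center error and a weighted squared spread error; the weight `spread_weight=0.25` is an assumption about the constant in the distance and should be replaced by whatever Eq. (6) prescribes.

```python
import numpy as np


def local_wls(X, target, t, t0, h):
    # Kernel-weighted least squares at t0 with a Gaussian kernel.
    u = (t - t0) / h
    w = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ target)


def cv_score(X, y, s, t, h, spread_weight=0.25):
    # Leave-one-out score; spread_weight is an ASSUMED distance constant.
    score = 0.0
    for i in range(len(t)):
        keep = np.arange(len(t)) != i          # drop the i-th observation
        alpha = local_wls(X[keep], y[keep], t[keep], t[i], h)
        sigma = local_wls(X[keep], s[keep], t[keep], t[i], h)
        score += (y[i] - X[i] @ alpha) ** 2
        score += spread_weight * (s[i] - X[i] @ sigma) ** 2
    return float(score)


def select_bandwidth(X, y, s, t, grid):
    # Returns the grid value with the smallest CV score, plus all scores.
    scores = [cv_score(X, y, s, t, h) for h in grid]
    return grid[int(np.argmin(scores))], scores


# Toy data, made up for illustration: intercept-only model whose center
# drifts linearly with t, so a very large bandwidth oversmooths.
t = np.arange(20.0)
X = np.ones((20, 1))
y = t.copy()
s = np.ones(20)
h0, scores = select_bandwidth(X, y, s, t, grid=[0.5, 100.0])
```

In practice the grid would be much finer than the two values used here, and a bandwidth that is too small relative to the spacing of the t_i can make the weighted normal matrix numerically singular.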

There are two main advantages to using the fuzzy varying coefficient regression model: 1) Because the fuzzy regression coefficients may vary with another explanatory variable, the flexibility and adaptability of the fuzzy regression model are enhanced. 2) The model can deal with data that vary with a time variable, and establish a dynamic relationship between a response variable and a set of explanatory variables.

Ⅲ. ROBUST ANALYSIS AND GOF OF GAUSSIAN FUZZY NUMBERS

  > A. Robust Analysis

The least squares method is not robust: when the data contain an abnormal value or outliers, the least squares estimate is unreliable and may even lead to a wrong conclusion. Therefore, diagnostic checks should be performed on the data before applying the least squares method.

Definition 2. An observation whose residual is markedly larger in magnitude than those of the other observations is called an outlier.

The least squares estimator is sensitive to outliers. Because the estimation results are directly affected by each observation, the data should be analyzed in detail. Sometimes even a single observation may dramatically influence the parameter estimates, and omitting this observation from the data may lead to totally different results. To handle the outlier problem, Hung and Yang [25] proposed an omission approach for Tanaka's linear programming method. This approach examines how the value of the objective function of a fuzzy regression model changes when observations are omitted. On the basis of the Least Median Squares-Weighted Least Squares estimation procedure, D'Urso et al. [24] proposed a robust fuzzy linear regression model to deal with data contaminated by outliers. To handle higher-order or interaction terms and the influence of outliers in manufacturing process data, Chan et al. [26] integrated genetic programming with the fuzzy regression model.

Next, we describe the M estimation method introduced by Huber [27], which is widely used for robust regression. M estimation can be regarded as a generalization of maximum-likelihood estimation. The general M estimator minimizes the following objective function

[]

where the function ρ gives the contribution of each residual to the objective function.

By setting to zero the partial derivative of υ with respect to â_j, we have m equations as follows:

[]

where ψ(z) = ρ'(z) is the derivative of ρ. The standardized residuals may be defined as z_i = r_i/d, where r_i = y_i − Σ_{j=1}^{m} â_j x_ij is the i-th residual and d is a robust estimate of scale. Under normality, the expected value of d is the standard error of estimate in the population.
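One common way to solve M estimating equations of this kind is iteratively reweighted least squares (IRLS). The sketch below uses Huber's ψ with tuning constant 1.345 and the scale estimate d = median|r_i|/0.6745; these choices, the function name, and the data are assumptions made for illustration. Matlab's robustfit uses a different default weight function (bisquare), so its residuals will not match this sketch exactly.

```python
import numpy as np


def huber_irls(X, y, k=1.345, n_iter=50):
    # Start from ordinary least squares, then reweight repeatedly.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust scale from the median absolute residual (guarded against 0).
        d = max(np.median(np.abs(r)) / 0.6745, 1e-12)
        z = r / d
        # Huber weights psi(z) / z: 1 inside [-k, k], k / |z| outside.
        w = np.minimum(1.0, k / np.maximum(np.abs(z), 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta, y - X @ beta


# Made-up data: a straight line with small deterministic noise and one
# gross error planted at index 10.
x = np.arange(20.0)
X = np.column_stack([np.ones(20), x])
y = 2.0 + 3.0 * x + 0.1 * np.sin(x)
y[10] += 100.0
beta, resid = huber_irls(X, y)
suspect = int(np.argmax(np.abs(resid)))   # index of the largest residual
```

Because the planted observation receives a very small weight, the fitted line stays close to the bulk of the data and the residual at that index stands out, which is the diagnostic used in Section Ⅳ to flag an outlier.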

In this article, we will integrate robustness analysis with the fuzzy varying coefficient regression model. The main procedure is as follows:

Step 1: Use the robustfit function in Matlab to perform robust analysis, and remove the outliers according to the result of the robustness analysis.

Step 2: Construct a fuzzy varying coefficient regression model, and use the fuzzy cross-validation method to determine the optimal value of the smoothing parameter.

Step 3: Determine the objective function according to the definition of distance in Eq. (6), and solve the least squares problem to estimate the parameters of the response variable.

Step 4: Evaluate our approach by comparing it with Shen et al.'s approach [20] according to AGOF, which is introduced in Subsection Ⅲ-B.

  > B. GOF of Gaussian Fuzzy Numbers

In order to evaluate the fit performance of our approach, we next introduce an index, the GOF, that describes the nearness and overlap of two Gaussian fuzzy numbers. For two Gaussian fuzzy numbers Ã = (a, σ) and B̃ = (b, τ), the GOF, denoted by GOF(Ã, B̃), should be a strictly monotonically decreasing function of |b − a| and be equal to 1 if and only if Ã = B̃.

We take GOF(Ã, B̃) to be the ratio of the shaded area, where the regions under the membership functions Ã(x) and B̃(x) overlap, to the larger of the areas under the two curves. Therefore, we define GOF(Ã, B̃) as follows.

Definition 3. Let Ã = (a, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; the GOF for Ã and B̃ is defined as

[]

where, ϕ(·) is the standard normal distribution function.

For simplicity of calculation, we define another GOF for Ã and B̃, used in Section Ⅳ, as follows:

Definition 4. Let Ã = (a, σ) and B̃ = (b, τ) be Gaussian fuzzy numbers; the GOF for Ã and B̃ is defined as

[]

In Section Ⅳ, for the purpose of evaluating the performance of our approach, we will use AGOF (average of all GOF for the observations with their estimates) to compare our approach with the approach proposed by Shen et al. [20].
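The area-ratio idea behind the GOF, and its average over observations, can be sketched numerically. This follows only the verbal description given before Definition 3 (overlap area divided by the larger of the two areas); it is not the simplified formula of Definition 4. The membership function exp(−((x − a)/σ)²), the function names, and the example pairs are assumptions made for illustration.

```python
import numpy as np


def membership(x, center, spread):
    # ASSUMED Gaussian membership function.
    return np.exp(-(((x - center) / spread) ** 2))


def gof_overlap(A, B, n_grid=20001):
    # Ratio of the overlap area to the larger of the two areas, evaluated
    # by a Riemann sum on a grid covering both membership curves.
    lo = min(A[0] - 6 * A[1], B[0] - 6 * B[1])
    hi = max(A[0] + 6 * A[1], B[0] + 6 * B[1])
    x = np.linspace(lo, hi, n_grid)
    dx = x[1] - x[0]
    mu_a = membership(x, *A)
    mu_b = membership(x, *B)
    overlap = np.sum(np.minimum(mu_a, mu_b)) * dx
    return float(overlap / max(np.sum(mu_a) * dx, np.sum(mu_b) * dx))


def agof(observed, fitted):
    # Average GOF over pairs of observed and fitted fuzzy numbers.
    return float(np.mean([gof_overlap(o, f) for o, f in zip(observed, fitted)]))


# Made-up (center, spread) pairs.
same = gof_overlap((1.0, 0.5), (1.0, 0.5))
near = gof_overlap((0.0, 1.0), (1.0, 1.0))
far = gof_overlap((0.0, 1.0), (2.0, 1.0))
```

The three values illustrate the two properties required of a GOF: identical fuzzy numbers score 1, and the score drops as the centers move apart.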

Ⅳ. AN ILLUSTRATED EXAMPLE

Gross domestic product (GDP) is the market value of all officially recognized final goods and services produced within a country in a given period. In economics, GDP is the sum of consumption, investment, government spending, and net exports. In this article, we take the dominant explanatory variables determining GDP to be the total amount of urban and rural savings deposits, the total financial expenditure, fixed assets investment, and the sum of imports and exports, which are denoted by X₁, X₂, X₃, X₄, respectively. Combining the fuzzy varying coefficient regression model with robustness analysis, we remove the outliers by using the robustfit function in Matlab, and construct the fuzzy varying coefficient regression model for fitting the response (GDP).

The dataset shown in Table 1 consists of 29 observations of the total amount of urban and rural savings deposit, the total financial expenditure, fixed assets investment, sum of imports and exports, and GDP of China from 1981 to 2009, in which the GDP is assumed to be a Gaussian fuzzy number.

[Table 1.] Observations of gross domestic product (GDP) from 1981 to 2009 (unit: CNY 10,000)


First, we utilize the robustfit function in Matlab to analyze the robustness of this dataset; the diagram of residual errors is given in Fig. 1.

[Fig. 1.] Scatter diagram of robust analysis residual.

The residual of the 27th observation is 3.54 × 10⁴, which is much larger than those of the other observations, so we regard it as an outlier and remove it. We then construct the fuzzy varying coefficient regression model from the remaining observations and estimate its parameters.

We set the variable t to be the time order of each year and consider the following fuzzy varying coefficient model:

[]

With the optimal value h = 2.16 selected by the fuzzy cross-validation procedure, we calibrate the model in Eq. (25) with the restricted weighted least-squares procedure.

[Table 2.] Comparison between the observations and estimates of GDP (unit: CNY 10,000)


For comparison, the fitted values of Shen et al.'s method and of our method are shown in the third and fifth columns of Table 2, respectively, with their GOF values shown in the fourth and sixth columns, respectively. The AGOF of Shen et al.'s method is 0.8943, while the AGOF of our method is 0.9934. The larger AGOF of our method indicates that our approach fits this dataset better than the approach proposed by Shen et al. [20].

Ⅴ. CONCLUSION

The aim of our study is to propose a methodology for dealing with dynamic fuzzy functional relationships between a response variable and a set of explanatory variables, while simultaneously avoiding the impact of outliers in the dataset. Our methodology can improve the feasibility and effectiveness of the fuzzy regression model. Specifically, the contribution of this work can be summarized in two points. First, it provides a dynamic model for data that vary with a covariate, especially for sample data that approximately follow a Gaussian distribution. Second, with the help of robust analysis, outliers are removed before fitting, so that the proposed model is not affected by irregular data. From the robustness standpoint, a suitable index for characterizing outliers needs to be studied in future work to further improve the robustness of the fuzzy regression model.

REFERENCES

1. Tanaka H, Uejima S, Asai K 1982 “Linear regression analysis with fuzzy model,” [IEEE Transactions on Systems, Man and Cybernetics] Vol.12 P.903-907
2. Kim K. J, Moskowitz H, Koksalan M 1996 “Fuzzy versus statistical linear regression,” [European Journal of Operational Research] Vol.92 P.417-434
3. Chen Y, Tang J, Fung R. Y. K, Ren Z 2004 “Fuzzy regression-based mathematical programming model for quality function deployment,” [International Journal of Production Research] Vol.42 P.1009-1027
4. Tanaka H 1987 “Fuzzy data analysis by possibilistic linear models,” [Fuzzy Sets and Systems] Vol.24 P.363-375
5. Tanaka H, Watada J 1988 “Possibilistic linear systems and their application to the linear regression model,” [Fuzzy Sets and Systems] Vol.27 P.275-289
6. Diamond P 1988 “Fuzzy least squares,” [Information Sciences] Vol.46 P.141-157
7. D'Urso P, Santoro A 2006 “Goodness of fit and variable selection in the fuzzy multiple linear regression,” [Fuzzy Sets and Systems] Vol.157 P.2627-2647
8. Ferraro M. B, Coppi R, Gonzalez Rodriguez G, Colubi A 2010 “A linear regression model for imprecise response,” [International Journal of Approximate Reasoning] Vol.51 P.759-770
9. Taheri S. M, Kelkinnama M 2012 “Fuzzy linear regression based on least absolutes deviations,” [Iranian Journal of Fuzzy Systems] Vol.9 P.121-140
10. Waterman M. S 1974 “A restricted least squares problem,” [Technometrics] Vol.16 P.135-136
11. Wu H. C 2011 “The construction of fuzzy least squares estimators in fuzzy linear regression models,” [Expert Systems with Applications] Vol.38 P.13632-13640
12. Abdalla A, Buckley J. J 2007 “Monte Carlo methods in fuzzy linear regression I,” [Soft Computing] Vol.11 P.991-996
13. Abdalla A, Buckley J. J 2007 “Monte Carlo methods in fuzzy linear regression II,” [Soft Computing] Vol.12 P.463-468
14. Vapnik V. N 1995 The Nature of Statistical Learning Theory
15. Hong D. H, Hwang C 2003 “Support vector fuzzy regression machines,” [Fuzzy Sets and Systems] Vol.138 P.271-281
16. Hao P. Y, Chiang J. H 2008 “Fuzzy regression analysis by support vector learning approach,” [IEEE Transactions on Fuzzy Systems] Vol.16 P.428-441
17. Khemchandani R, Chandra S 2009 “Regularized least squares fuzzy support vector regression for financial time series forecasting,” [Expert Systems with Applications] Vol.36 P.132-138
18. Wu Q, Law R 2010 “Fuzzy support vector regression machine with penalizing Gaussian noises on triangular fuzzy number space,” [Expert Systems with Applications] Vol.37 P.7788-7795
19. Lin K. P, Pai P. F 2010 “A fuzzy support vector regression model for business cycle predictions,” [Expert Systems with Applications] Vol.37 P.5430-5435
20. Shen S. L, Mei C. L, Cui J. L 2010 “A fuzzy varying coefficient model and its estimation,” [Computers and Mathematics with Applications] Vol.60 P.1696-1705
21. Watada J, Yabuuchi Y 1994 “Fuzzy robust regression analysis,” [Proceedings of the 3rd IEEE International Conference on Fuzzy Systems] P.1370-1376
22. Zadeh L. A 1965 “Fuzzy sets,” [Information and Control] Vol.8 P.338-353
23. Xu R. N 1991 “A linear regression model in fuzzy environment,” [Advance in Modelling Simulation] Vol.27 P.31-40
24. D'Urso P, Massari R, Santoro A 2011 “Robust fuzzy regression analysis,” [Information Sciences] Vol.181 P.4154-4174
25. Hung W. L, Yang M. S 2006 “An omission approach for detecting outliers in fuzzy regression models,” [Fuzzy Sets and Systems] Vol.157 P.3109-3122
26. Chan K. Y, Kwong C. K, Fogarty T. C 2010 “Modeling manufacturing processes using a genetic programming-based fuzzy regression with detection of outliers,” [Information Sciences] Vol.180 P.506-518
27. Huber P. J 1964 “Robust estimation of a location parameter,” [Annals of Mathematical Statistics] Vol.35 P.73-101
