Robust Fuzzy Varying Coefficient Regression Analysis with Crisp Inputs and Gaussian Fuzzy Output
- Author: Yang Zhihui, Yin Yunqiang, Chen Yizeng
- Organization: Yang Zhihui; Yin Yunqiang; Chen Yizeng
- Publish: Journal of Computing Science and Engineering, Volume 7, Issue 4, pp. 263-271, 30 Dec 2013
This study presents a fuzzy varying coefficient regression model after deleting the outliers to improve the feasibility and effectiveness of the fuzzy regression model. The objective of our methodology is to allow the fuzzy regression coefficients to vary with a covariate, and simultaneously avoid the impact of data contaminated by outliers. In this paper, fuzzy regression coefficients are represented by Gaussian fuzzy numbers. We also formulate suitable goodness of fit to evaluate the performance of the proposed methodology. An example is given to demonstrate the effectiveness of our methodology.
Gaussian fuzzy number, Goodness of fit, Outlier, Fuzzy varying coefficient regression
As an important statistical analysis tool, the regression analysis model is often utilized to describe the statistical functional relationship between a response variable and a set of explanatory variables, so that the response variable can be predicted accordingly. The traditional regression model is anchored on binary logic. The sampling data used in traditional regression analysis has some strict assumptions: every observation is independent of others, the sampling data has a certain probability distribution, and so on. In actual practice, however, the description of the observations is often vague, and the data are often influenced by subjective judgment, or described in linguistic terms.
In recent years, fuzzy logic has become more widely used in statistical analysis. In their pioneering work, Tanaka et al.  employed fuzzy input data to establish a fuzzy regression analysis model. From then on the fuzzy regression model and its applications have attracted considerable attention from many fields, such as engineering, economics, management science, and environmental science. In the traditional regression model, the deviation between the experimental data and the model is interpreted as arising from the error of observation, but the fuzzy regression model views this kind of error as fuzziness of the structure in the system and the regression parameters.
The fuzzy linear regression model is inferior to the traditional linear regression model in terms of predictive capability, whereas their comparative descriptive performance depends on various factors associated with the data set and proper specificity of the model, especially for large sample data. However, the relative performance of fuzzy linear regression improves as the size of the data set diminishes and the aptness of the regression model deteriorates. Fuzzy linear regression may therefore be used as an alternative to traditional linear regression for estimating regression parameters when the data are insufficient. Existing fuzzy regression models are mainly based on triangular or trapezoidal fuzzy numbers. However, owing to the nonlinear and complex nature of the relationships among the variables in a system, the estimation performance of these fuzzy regression models needs to be further improved.
Following Tanaka et al., many scholars have proposed approaches to solving the fuzzy regression model. These fall into two main kinds. 1) The interval regression method: minimize the total spread of the fuzzy parameters, with the constraint that the membership of the estimate is not less than a predefined value [3-5]. This method essentially transforms a fuzzy regression problem into an optimization problem. 2) The fuzzy least squares method: minimize the total squared distance between the observed and estimated values of the response variable, and derive equations similar to the normal equations of traditional regression analysis to obtain the fuzzy parameters [6-11].
Besides the two kinds of approaches mentioned above, there are other approaches to solving a fuzzy regression model. For example, the Monte Carlo method can be applied to the fuzzy regression model to obtain the optimal solution within a predetermined error bound [12, 13]. The support vector machine (SVM), a classification technique proposed by Vapnik, has been successful in solving pattern recognition and function estimation problems. Hong and Hwang studied the convex optimization problem of a multi-fuzzy linear regression model via SVM. Hao and Chiang applied fuzzy set theory to SVM, setting parameters such as the components of the weight vector and the bias term to be fuzzy numbers. By using different kernel functions, their method can achieve automatic accuracy control in the fuzzy regression analysis task. By assigning fuzzy membership values to data samples, Khemchandani et al. proposed an approach to fuzzy support vector regression for financial time series forecasting. Wu and Law proposed a new fuzzy SVM with the ability to penalize Gaussian noise in triangular fuzzy number space. Lin and Pai employed support vector regression to calculate fuzzy upper and lower bounds, and thereby formulated a fuzzy SVM model for forecasting the indices of business cycles.
Previous fuzzy regression models have difficulty dealing with input data that vary with a covariate. Shen et al. proposed a fuzzy varying coefficient model, in which the fuzzy coefficients are allowed to vary with a covariate. The fuzzy varying coefficient regression model is an extension of the fuzzy linear regression model, and their method can improve the feasibility and adaptability of the fuzzy linear model. The procedure of their approach includes the following two steps: 1) After the fuzzy varying coefficient regression model is given, a proper kernel function is selected based on the distance between fuzzy numbers, and the cross-validation method is utilized to determine the smoothing parameter. 2) According to the definition of the distance between fuzzy numbers, the objective function is determined, and the estimate of the response variable is obtained by the least squares method.
From the viewpoint of robust analysis, the least squares method above is not robust enough; that is, when the data contain individual abnormal observations, called outliers, the least squares estimate will not be reliable, and in the worst case the conclusion may be incorrect. Therefore, Watada and Yabuuchi suggested removing the irregular data or outliers before constructing the fuzzy regression model. Combined with robust analysis, the fuzzy regression model will be free from the influence of outliers. In this article, we integrate the fuzzy varying coefficient regression model with robust analysis to improve the feasibility and effectiveness of the fuzzy regression model.
The rest of the paper is organized as follows. Section Ⅱ illustrates the fuzzy varying coefficient regression model and the estimation of its fuzzy regression coefficients. Section Ⅲ introduces the notion of goodness of fit (GOF) for evaluating the model and the distance between fuzzy numbers, and employs them for robust analysis. In Section Ⅳ, a numerical example is used to demonstrate the effectiveness of the proposed methodology. Finally, the conclusions of this work are summarized in Section Ⅴ.
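Definition 1. A fuzzy number $\tilde{A}$ is called a Gaussian fuzzy number, denoted by $\tilde{A} = (a, \sigma)$, if its membership function can be formulated as
where $a$ and $\sigma$ are the center and spread of $\tilde{A}$, respectively.
Let $\tilde{A} = (a, \sigma)$ and $\tilde{B} = (b, \tau)$ be Gaussian fuzzy numbers; if $a = b$ and $\sigma = \tau$, then $\tilde{A}$ is equal to $\tilde{B}$, denoted as $\tilde{A} = \tilde{B}$. According to Zadeh's extension principle, the Gaussian fuzzy number has the following linear operations.
Proposition 1. Let $\tilde{A} = (a, \sigma)$ and $\tilde{B} = (b, \tau)$ be Gaussian fuzzy numbers; then
$\tilde{A} + \tilde{B} = (a + b, \sigma + \tau)$,
$k\tilde{A} = (ka, k\sigma)$, $\forall k \in R$, $k > 0$.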
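The two operations in Proposition 1 can be illustrated with a small value type. This is a minimal sketch: the class name `GaussianFuzzy` and its method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianFuzzy:
    """Gaussian fuzzy number (center, spread); the class name is illustrative only."""
    center: float
    spread: float

    def __add__(self, other):
        # Proposition 1: centers add and spreads add.
        return GaussianFuzzy(self.center + other.center,
                             self.spread + other.spread)

    def scale(self, k):
        # Proposition 1 is stated for k > 0 only, so other values are rejected.
        if k <= 0:
            raise ValueError("scaling is only defined here for k > 0")
        return GaussianFuzzy(k * self.center, k * self.spread)


total = GaussianFuzzy(1.0, 0.5) + GaussianFuzzy(2.0, 0.25)
doubled = GaussianFuzzy(1.0, 0.5).scale(2.0)
```

Restricting `scale` to positive factors mirrors the proposition, which keeps the spread positive.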
In order to characterize the nearness of the Gaussian fuzzy numbers $\tilde{A}$ and $\tilde{B}$, we use the following distance, proposed by Xu:
where $f(\lambda)$ is a monotonically increasing function on the interval $[0, 1]$, with $f(0) = 0$ and $\int_0^1 f(\lambda)\,d\lambda = 1/2$. The degree of nearness and overlap between the $\lambda$-level sets $A_\lambda$ and $B_\lambda$, denoted by $d(A_\lambda, B_\lambda)$, is weighted by $f(\lambda)$ to emphasize the contribution of the higher values of $\lambda$ to the distance between $\tilde{A}$ and $\tilde{B}$.
In this article, we set $f(\lambda) = 2\lambda^3$, instead of $f(\lambda) = \lambda$, to emphasize the contribution of the higher values of $\lambda$; thus, we can draw the conclusion as follows.
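The effect of the weight function can be checked numerically. The sketch below rests on two assumptions that are not stated in the text: the membership function is taken to be $\exp(-((x - a)/\sigma)^2)$, and $d^2(A_\lambda, B_\lambda)$ is taken to be the sum of squared differences of the lower and upper endpoints of the level sets. Under these assumptions the weighted distance reduces to $(a - b)^2 + (\sigma - \tau)^2/4$ for $f(\lambda) = 2\lambda^3$ and to $(a - b)^2 + (\sigma - \tau)^2/2$ for $f(\lambda) = \lambda$.

```python
import math


def level_set(a, sigma, lam):
    """Endpoints of the lambda-level set, ASSUMING membership exp(-((x - a) / sigma)**2)."""
    half_width = sigma * math.sqrt(-math.log(lam))
    return a - half_width, a + half_width


def squared_distance(A, B, f=lambda lam: 2.0 * lam ** 3, steps=20000):
    """Midpoint-rule approximation of the integral of f(lam) * d^2(A_lam, B_lam) over [0, 1].

    d^2 is ASSUMED to be the sum of squared endpoint differences.
    """
    (a, sigma), (b, tau) = A, B
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) / steps          # midpoints avoid lam = 0, where log diverges
        lo_a, hi_a = level_set(a, sigma, lam)
        lo_b, hi_b = level_set(b, tau, lam)
        total += f(lam) * ((lo_a - lo_b) ** 2 + (hi_a - hi_b) ** 2)
    return total / steps


A, B = (3.0, 1.0), (1.0, 2.0)
d2_cubic = squared_distance(A, B)                      # f(lam) = 2 * lam**3
d2_linear = squared_distance(A, B, f=lambda lam: lam)  # f(lam) = lam
```

In both cases the squared distance separates into a center term and a spread term, and the cubic weight gives the spread difference less influence than the linear weight does.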
In order to describe the dynamic relationship between the response variable and a set of explanatory variables, Shen et al.  proposed a fuzzy varying coefficient regression model with its estimation. The procedure includes three steps as follows.
Step 1: Construct the fuzzy varying coefficient regression model.
Step 2: Determine the optimal value of the smoothing parameter by using the fuzzy cross-validation method.
Step 3: According to the distance between fuzzy numbers, determine the objective function and obtain the estimate of the response variable by the restricted least squares method.
The fuzzy varying coefficient regression model proposed by Shen et al.  is formulated as follows:
where $X_j$, $j = 1, 2, \ldots, m$, and $t$ are explanatory variables (input variables) expressed by crisp numbers; $\tilde{A}_j(t) = (a_j(t), \sigma_j(t))$, $j = 1, 2, \ldots, m$, are Gaussian fuzzy numbers varying with the variable $t$; and $a_j(t)$ and $\sigma_j(t)$ are the center and spread of $\tilde{A}_j(t)$, respectively. According to the linear operations of Gaussian fuzzy numbers, the response variable $\tilde{Y}$ is also a Gaussian fuzzy number, denoted by $\tilde{Y} = (Y, S)$, where $Y = \sum_{j=1}^{m} a_j(t) X_j$ and $S = \sum_{j=1}^{m} \sigma_j(t) X_j$. Generally, we take $X_1 \equiv 1$ to make the model include a fuzzy varying intercept.
Suppose that we have an experimental data set of the response variable $\tilde{Y}$ and the explanatory variables $X_j$, $j = 1, 2, \ldots, m$: $(x_{i1}, x_{i2}, \ldots, x_{im}, t_i; \tilde{y}_i)$, where $(x_{i1}, x_{i2}, \ldots, x_{im}, t_i)$ is the $i$-th observation of the explanatory variables, presented by crisp numbers, and the Gaussian fuzzy number $\tilde{y}_i = (y_i, s_i)$ is the $i$-th observation of the response variable, $i = 1, 2, \ldots, n$.
Therefore, the sample form of the model Eq. (7) is
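A reconstruction consistent with the definitions above is sketched below. It is derived from the surrounding text rather than copied from Eq. (7), and it assumes nonnegative inputs $x_{ij}$ so that, by Proposition 1, the spreads combine linearly.

```latex
\begin{aligned}
\tilde{Y} &= \tilde{A}_1(t)\,X_1 + \tilde{A}_2(t)\,X_2 + \cdots + \tilde{A}_m(t)\,X_m,
\qquad
Y = \sum_{j=1}^{m} a_j(t)\,X_j,\quad S = \sum_{j=1}^{m} \sigma_j(t)\,X_j,\\
\tilde{y}_i &= \tilde{A}_1(t_i)\,x_{i1} + \tilde{A}_2(t_i)\,x_{i2} + \cdots + \tilde{A}_m(t_i)\,x_{im},
\qquad i = 1, 2, \ldots, n.
\end{aligned}
```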
Now we should estimate the fuzzy regression coefficients $\tilde{A}_j(t_0)$, $j = 1, 2, \ldots, m$, for any $t = t_0$. On the basis of the distance proposed above and the principle of kernel smoothing in statistics, the restricted weighted least-squares problem is formulated as follows:
where $K_h(t) = K(t/h)/h$, with $K(\cdot)$ being a Gaussian kernel function and $h$ being the smoothing parameter determined by the fuzzy cross-validation procedure.
The restricted weighted least-squares problem above is equivalent to minimizing two formulas as follows:
In order to minimize Eq. (10), we set to zero the partial derivatives of $g_1$ with respect to $a_k$, as follows:
The equations above are equivalent to the following equations:
where $X = (x_{ij})_{n \times m}$, $Y = (y_1, y_2, \ldots, y_n)^T$, $S = (s_1, s_2, \ldots, s_n)^T$, $W(t_0) = \mathrm{diag}(K_h(t_1 - t_0), K_h(t_2 - t_0), \ldots, K_h(t_n - t_0))$, $a(t_0) = (a_1(t_0), a_2(t_0), \ldots, a_m(t_0))^T$, and $\sigma(t_0) = (\sigma_1(t_0), \sigma_2(t_0), \ldots, \sigma_m(t_0))^T$.
Assuming that the inverse of $X^T W(t_0) X$ always exists for any $t_0$, the estimate of $a(t_0)$ is obtained as follows:
Similarly, the estimate of $\sigma(t_0)$ is obtained, without the positivity restriction, as follows:
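With the matrix notation above, both minimizations are ordinary weighted least-squares problems, so the derivation can be sketched as follows. This is a reconstruction from the surrounding text, not a copy of the numbered equations; in particular, writing $g_1$ and $g_2$ as kernel-weighted sums of squares is an assumption, chosen because it is consistent with the definitions of $X$, $Y$, $S$, and $W(t_0)$.

```latex
\begin{aligned}
g_1 &= \sum_{i=1}^{n} K_h(t_i - t_0)\Bigl(y_i - \sum_{j=1}^{m} a_j(t_0)\,x_{ij}\Bigr)^{2},
\qquad
g_2 = \sum_{i=1}^{n} K_h(t_i - t_0)\Bigl(s_i - \sum_{j=1}^{m} \sigma_j(t_0)\,x_{ij}\Bigr)^{2},\\
\frac{\partial g_1}{\partial a_k}
 &= -2\sum_{i=1}^{n} K_h(t_i - t_0)\,x_{ik}\Bigl(y_i - \sum_{j=1}^{m} a_j(t_0)\,x_{ij}\Bigr) = 0,
 \quad k = 1, \ldots, m
 \;\Longleftrightarrow\;
 X^{T}W(t_0)X\,a(t_0) = X^{T}W(t_0)\,Y,\\
\hat{a}(t_0) &= \bigl(X^{T}W(t_0)X\bigr)^{-1}X^{T}W(t_0)\,Y,
\qquad
\hat{\sigma}(t_0) = \bigl(X^{T}W(t_0)X\bigr)^{-1}X^{T}W(t_0)\,S.
\end{aligned}
```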
Letting $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T$ and performing the estimation procedure above at $t_0 = t_1, t_2, \ldots, t_n$ respectively, we can obtain the following fitted values of the center and spread of $\tilde{Y}$:
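The estimation procedure can be sketched in a few lines of plain Python. The sketch assumes that the estimates take the standard weighted least-squares form $(X^T W(t_0) X)^{-1} X^T W(t_0) Y$ (and likewise with $S$), and that the fitted center and spread at $t_i$ are $X_i^T \hat{a}(t_i)$ and $X_i^T \hat{\sigma}(t_i)$. All function names are hypothetical.

```python
import math


def gaussian_kernel_h(u, h):
    """K_h(u) = K(u / h) / h with K taken as the standard Gaussian density."""
    return math.exp(-0.5 * (u / h) ** 2) / (math.sqrt(2.0 * math.pi) * h)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def local_wls(X, target, t, t0, h):
    """Return (X^T W X)^{-1} X^T W target with W = diag(K_h(t_i - t0))."""
    m = len(X[0])
    w = [gaussian_kernel_h(ti - t0, h) for ti in t]
    XtWX = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X)) for k in range(m)]
            for j in range(m)]
    XtWy = [sum(wi * xi[j] * yi for wi, xi, yi in zip(w, X, target))
            for j in range(m)]
    return solve(XtWX, XtWy)


def fitted_values(X, y, s, t, h):
    """Fitted centers and spreads at every t_i, estimated locally at t0 = t_i."""
    centers, spreads = [], []
    for xi, ti in zip(X, t):
        a_hat = local_wls(X, y, t, ti, h)
        sigma_hat = local_wls(X, s, t, ti, h)
        centers.append(sum(a * x for a, x in zip(a_hat, xi)))
        spreads.append(sum(sg * x for sg, x in zip(sigma_hat, xi)))
    return centers, spreads


# Toy data with constant coefficients; the first column is the intercept X_1 = 1.
xs = [0.5, 1.0, 2.0, 1.5, 3.0, 2.5]
X = [[1.0, x] for x in xs]
t = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 + 3.0 * x for x in xs]
s = [0.5 + 0.1 * x for x in xs]
centers, spreads = fitted_values(X, y, s, t, h=2.0)
```

When the true coefficients do not vary with $t$, as in this toy data, the local fits reproduce the observations for any bandwidth; the kernel only matters once the coefficients change over $t$. The normalizing constant of the kernel cancels in the estimate, so only its shape is relevant.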
The role of the smoothing parameter $h$ is to adjust the degree of smoothness of the estimates of the center and spread of the fuzzy regression coefficients. The fuzzy cross-validation procedure can be used to select the optimal value of the smoothing parameter. Suppose the number of data is $n$. For each $i = 1, 2, \ldots, n$, remove the $i$-th observation $\tilde{y}_i = (y_i, s_i)$, and compute the estimates of the center and spread of the fuzzy coefficients $\tilde{A}_j(t)$ ($j = 1, 2, \ldots, m$) at $t = t_i$, according to the procedure described above. Using the resulting estimates of the centers and spreads of the fuzzy coefficients under $h$, the predicted values of the center and spread of the fuzzy response $\tilde{Y}$ at $t_i$ can be obtained by the following formulations, respectively:
The cross-validation (CV) score is formulated as follows:
Then, we select $h_0$ as the optimal value of the smoothing parameter, such that $\mathrm{CV}(h_0) = \min_{h > 0} \mathrm{CV}(h)$.
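The leave-one-out selection of $h$ can be sketched for the simplest instance of the model, $m = 1$ with $X_1 \equiv 1$, where the local weighted least-squares estimate reduces to a kernel-weighted mean. The score below sums a squared distance between each observation and its leave-one-out prediction; the weight 0.25 on the spread term is an assumption (it depends on the exact form of the distance), and the function names are hypothetical.

```python
import math


def loo_prediction(values, t, i, h):
    """Kernel-weighted mean of all observations except the i-th, evaluated at t[i].

    The kernel's normalizing constant is omitted because it cancels in the ratio.
    """
    num = den = 0.0
    for k, (v, tk) in enumerate(zip(values, t)):
        if k == i:
            continue
        w = math.exp(-0.5 * ((tk - t[i]) / h) ** 2)
        num += w * v
        den += w
    return num / den


def cv_score(y, s, t, h, spread_weight=0.25):
    """Sum over i of the squared distance between observation i and its prediction.

    spread_weight is an ASSUMED constant multiplying the spread term.
    """
    score = 0.0
    for i in range(len(t)):
        y_pred = loo_prediction(y, t, i, h)
        s_pred = loo_prediction(s, t, i, h)
        score += (y[i] - y_pred) ** 2 + spread_weight * (s[i] - s_pred) ** 2
    return score


def select_bandwidth(y, s, t, candidates):
    """Return the candidate h with the smallest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(y, s, t, h))


t = [float(i) for i in range(10)]
y = [ti ** 2 for ti in t]      # centers with a clear trend in t
s = [1.0] * len(t)             # constant spreads
best_h = select_bandwidth(y, s, t, [0.5, 5.0])
```

On data with a strong trend in $t$, a large bandwidth averages over too many distant observations and is penalized by the score, so the smaller candidate is selected.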
There are two main advantages to using the fuzzy varying coefficient regression model: 1) Because the fuzzy regression coefficients may vary with another explanatory variable, the flexibility and adaptability of the fuzzy regression model are enhanced. 2) The model can deal with data that vary with a time variable, and establish a dynamic relationship between a response variable and a set of explanatory variables.
From the robustness point of view, the least squares method is not robust; that is, when the data contain an individual abnormal value or outliers, the least squares estimate will not be reliable, and a wrong conclusion may even be drawn. Therefore, diagnostic checks should be done on the data before applying the least squares method.
An observation whose residual is much larger than those of the other observations is called an outlier.
The least squares estimator is sensitive to outliers: the estimation results are directly affected by each observation, so the data should be analyzed in detail. Sometimes, even a single observation may dramatically influence the values of the parameter estimates, and omitting this observation from the data may lead to totally different results. To handle the outlier problem, Hung and Yang proposed an omission approach for Tanaka's linear programming method. This approach can examine how the value of the objective function of a fuzzy regression model changes when observations are omitted. On the basis of the Least Median Squares-Weighted Least Squares estimation procedure, D'Urso et al. proposed a robust fuzzy linear regression model to deal with data contaminated by outliers. To handle higher-order or interaction terms and the influence of outliers in manufacturing process data, Chan et al. integrated genetic programming with the fuzzy regression model.
Next, we review the M-estimation method proposed by Huber, which is widely used for robust regression. M-estimation can be regarded as a generalization of maximum-likelihood estimation. The general M-estimator minimizes the following objective function
where the function $\rho$ gives the contribution of each residual to the objective function.
By setting to zero the partial derivatives of $\upsilon$ with respect to $\hat{a}_j$, we have $m$ equations as follows:
where $\psi(z) = \rho'(z)$ is the derivative of $\rho$. The standardized residuals may be defined as $z_i = r_i/d$, where $r_i = y_i - \hat{y}_i$ and $d$ is a robust estimate of scale. Under normality, the expected value of $d$ is the standard error of estimate in the population.
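M-estimates are commonly computed by iteratively reweighted least squares, where each observation receives the weight $\psi(z_i)/z_i$. The sketch below does this for a straight-line fit, using Huber's $\rho$ with the conventional tuning constant 1.345 and the median absolute deviation divided by 0.6745 as the scale $d$. These choices and all function names are assumptions made for illustration; they are not necessarily the settings used by the robustfit function.

```python
import statistics


def huber_weight(z, c=1.345):
    """IRLS weight psi(z) / z for Huber's rho: 1 inside [-c, c], c / |z| outside."""
    return 1.0 if abs(z) <= c else c / abs(z)


def weighted_line_fit(x, y, w):
    """Closed-form weighted least squares for y = intercept + slope * x."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ym - slope * xm, slope


def huber_line_fit(x, y, c=1.345, iterations=50):
    """Return (intercept, slope, residuals) of a Huber M-estimate fitted by IRLS."""
    w = [1.0] * len(x)                      # the first pass is ordinary least squares
    for _ in range(iterations):
        intercept, slope = weighted_line_fit(x, y, w)
        r = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        med = statistics.median(r)
        d = statistics.median([abs(ri - med) for ri in r]) / 0.6745  # robust scale
        if d == 0.0:                        # exact fit on the bulk of the data
            break
        w = [huber_weight(ri / d, c) for ri in r]
    return intercept, slope, r


x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]
y[7] += 30.0                                # one contaminated observation
intercept, slope, residuals = huber_line_fit(x, y)
suspect = max(range(len(x)), key=lambda i: abs(residuals[i]))
```

Because the contaminated observation is progressively downweighted, its residual stands out against the others, which is the kind of evidence used to flag an observation for removal.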
In this article, we will integrate robustness analysis with the fuzzy varying coefficient regression model. The main procedure is as follows:
Step 1: Utilize the robustfit function of the Matlab toolbox to perform robust analysis, and remove the outliers according to the result of the robustness analysis.
Step 2: Construct a fuzzy varying coefficient regression model and use the fuzzy cross-validation method to determine the optimal value of the smoothing parameter.
Step 3: Determine the objective function according to the definition of distance Eq. (6), and solve the least squares problem for the parameter estimation of the response variable.
Step 4: Evaluate our approach by comparing it with Shen et al.'s approach according to AGOF, which will be introduced in Subsection Ⅲ-B.
In order to evaluate the fit performance of our approach, we next introduce an index named GOF for describing the nearness and overlap of two Gaussian fuzzy numbers. For two Gaussian fuzzy numbers $\tilde{A} = (a, \sigma)$ and $\tilde{B} = (b, \tau)$, the GOF for them, denoted by $\mathrm{GOF}(\tilde{A}, \tilde{B})$, should be a strictly monotonically decreasing function with respect to the value of $|b - a|$, and be equal to 1 if and only if $\tilde{A} = \tilde{B}$.
We take the proportion of the shadow area to the maximum of the areas under the curves of the membership functions $\tilde{A}(x)$ and $\tilde{B}(x)$ to be $\mathrm{GOF}(\tilde{A}, \tilde{B})$. Therefore, we define $\mathrm{GOF}(\tilde{A}, \tilde{B})$ as follows. Definition 3. Let $\tilde{A} = (a, \sigma)$ and $\tilde{B} = (b, \tau)$ be Gaussian fuzzy numbers; the GOF for $\tilde{A}$ and $\tilde{B}$ is defined as
where $\phi(\cdot)$ is the standard normal distribution function.
For simplicity of calculation, we define another GOF for $\tilde{A}$ and $\tilde{B}$, used in Section Ⅳ, as follows: Definition 4. Let $\tilde{A} = (a, \sigma)$ and $\tilde{B} = (b, \tau)$ be Gaussian fuzzy numbers; the GOF for $\tilde{A}$ and $\tilde{B}$ is defined as
In Section Ⅳ, for the purpose of evaluating the performance of our approach, we will use AGOF (the average of all GOF values for the observations with their estimates) to compare our approach with the approach proposed by Shen et al.
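The area-based idea behind GOF and its average can be illustrated numerically. This sketch is not Definition 3 or Definition 4: it assumes the membership function $\exp(-((x - a)/\sigma)^2)$, interprets the shadow area as the area under the pointwise minimum of the two membership curves, and evaluates the ratio by midpoint quadrature. The function names are hypothetical.

```python
import math


def membership(x, a, sigma):
    """ASSUMED Gaussian membership function exp(-((x - a) / sigma)**2)."""
    return math.exp(-((x - a) / sigma) ** 2)


def area_gof(A, B, steps=20000, reach=8.0):
    """Overlap area divided by the larger of the two areas (midpoint rule)."""
    (a, sigma), (b, tau) = A, B
    lo = min(a - reach * sigma, b - reach * tau)
    hi = max(a + reach * sigma, b + reach * tau)
    dx = (hi - lo) / steps
    overlap = area_a = area_b = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        fa, fb = membership(x, a, sigma), membership(x, b, tau)
        overlap += min(fa, fb) * dx      # ASSUMED meaning of the shadow area
        area_a += fa * dx
        area_b += fb * dx
    return overlap / max(area_a, area_b)


def agof(observed, fitted):
    """Average GOF over paired observations and estimates."""
    scores = [area_gof(o, f) for o, f in zip(observed, fitted)]
    return sum(scores) / len(scores)


same = area_gof((0.0, 1.0), (0.0, 1.0))
near = area_gof((0.0, 1.0), (1.0, 1.0))
far = area_gof((0.0, 1.0), (2.0, 1.0))
wider = area_gof((0.0, 1.0), (0.0, 2.0))
average = agof([(0.0, 1.0), (0.0, 1.0)], [(0.0, 1.0), (0.0, 2.0)])
```

Under these assumptions the index equals 1 for identical fuzzy numbers, decreases as $|b - a|$ grows, and equals the ratio of the spreads when the centers coincide, which matches the qualitative requirements stated for GOF.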
Gross domestic product (GDP) is the market value of all officially recognized final goods and services produced within a country in a given period. In economics, GDP is the sum of consumption, investment, government spending, and net exports. In this article, we assume that the dominant explanatory variables determining GDP are the total amount of urban and rural savings deposits, the total financial expenditure, fixed assets investment, and the sum of imports and exports, which are denoted by $X_1, X_2, X_3, X_4$, respectively. Using the fuzzy varying coefficient regression model combined with robustness analysis, we remove the outliers by utilizing the robustfit function in Matlab, and construct the fuzzy varying coefficient regression model for fitting the response (GDP).
The dataset shown in Table 1 consists of 29 observations of the total amount of urban and rural savings deposit, the total financial expenditure, fixed assets investment, sum of imports and exports, and GDP of China from 1981 to 2009, in which the GDP is assumed to be a Gaussian fuzzy number.
First, we utilize the robustfit function in Matlab to analyze the robustness of this dataset; the diagram of residual errors is given in Fig. 1.
The residual of the 27th observation is $3.54 \times 10^4$, which is much larger than those of the other observations, so we regard it as an outlier and remove it. We then construct the fuzzy varying coefficient regression model after removing the 27th observation, and estimate the parameters.
Next, we set the variable $t$ to be the time order of each year and consider the following fuzzy varying coefficient model:
With the optimal value $h = 2.16$ obtained by the fuzzy cross-validation procedure, we calibrate model Eq. (25) with the restricted weighted least-squares procedure.
For comparison, the fitted values of Shen et al.'s method and our method are shown in the third and fifth columns of Table 2, respectively, with their GOF values shown in the fourth and sixth columns, respectively. The AGOF of Shen et al.'s method is 0.8943, while the AGOF of our method is 0.9934. The larger AGOF of our method demonstrates that the fit performance of our approach is better than that of the approach proposed by Shen et al.
The aim of our study is to propose a methodology for dealing with the dynamic fuzzy functional relationship between the response variable and a set of explanatory variables, while simultaneously avoiding the impact of outliers in the dataset. Our methodology can improve the feasibility and effectiveness of the fuzzy regression model. Specifically, the contribution of this work can be summarized in two points. First, it provides a dynamic model for data that vary with a covariate, especially for sampling data that approximately follow a Gaussian distribution. Second, with the help of robust analysis, the proposed model is protected from the influence of irregular data or outliers, because the outliers are removed before the model is fitted. From the robustness standpoint, a suitable index for characterizing outliers needs to be studied in future work to further improve the robustness of the fuzzy regression model.
[Table 1.] Observations of gross domestic product (GDP) from 1981 to 2009 (unit: CNY 10,000)
[Fig. 1.] Scatter diagram of robust analysis residual.
[Table 2.] Comparison between the observations and estimates of GDP (unit: CNY 10,000)