A New Estimation Model for Wireless Sensor Networks Based on the Spatial-Temporal Correlation Analysis
 Author: Ren Xiaojun, Sug HyonTai, Lee HoonJae
 Publish: Journal of Information and Communication Convergence Engineering, Volume 13, Issue 2, pp. 105-112, 30 June 2015

ABSTRACT
The estimation of missing sensor values is an important problem in sensor network applications, but the existing approaches are limited in application scope and estimation accuracy. Therefore, in this paper, we propose a new estimation model based on a spatial-temporal correlation analysis (STCAM). STCAM makes full use of spatial and temporal correlations: it recognizes whether the sensor parameters have a spatial correlation or a temporal correlation, and whether the missing sensor data are continuous. According to the recognition results, STCAM chooses the most suitable algorithm from among the linear interpolation algorithm of temporal correlation analysis (TCALI), the multiple regression algorithm of temporal correlation analysis (TCAMR), spatial correlation analysis (SCA), and spatial-temporal correlation analysis (STCA) to estimate the missing sensor data. STCAM was evaluated on the Intel lab dataset and a traffic dataset, and the simulation results show that STCAM has good estimation accuracy.

KEYWORD
Data mining, DESM, Missing sensor data, STCAM

I. INTRODUCTION
With the rapid development of wireless communication, microelectronics, and embedded computing technologies, sensor networks are widely used in certain fields, such as the military, environment, and medicine. Therefore, nowadays, these networks have become a popular topic of research. In a wireless sensor network, sensors always communicate with the server and other sensors (e.g., for sending data or accepting data). However, in the process of communication, we can expect the transmitted sensor data to get lost or corrupted for many reasons, such as bad weather conditions, the sensor node’s communication ability, wireless signal strength, power outage at the sensor node, or a relatively high bit error rate of the wireless radio transmissions as compared with wired communications. In general, we can requery data or discard data, but requerying data is a naive alternative as it may induce a long wait or quicken the power exhaustion of the node, and most importantly, it does not guarantee having the original reading available. Discarding data is also a bad choice as it may lead to the loss of some interesting data. Therefore, it is essential to develop a technique for estimating missing data.
Data mining can produce knowledge from the existing data, and this knowledge can be used for estimating the missing sensor data. However, the existing missing sensor data estimation approaches do not achieve good results (as discussed in the following section). Therefore, in this paper, we propose a new estimation model based on a spatial-temporal correlation analysis (STCAM). This model can discover intrinsic relationships among sensors and then incorporate these relationships and the spatial-temporal relationship into data estimation. Finally, STCAM is tested with data from the Intel lab deployment and from a traffic monitoring sensor application.
II. RELATED WORKS
In fact, the topic of missing data estimation belongs to the field of statistics, and many researchers have studied it by using methods such as mean substitution, linear regression, Bayesian estimation, expectation maximization, k-nearest neighbor, and neural networks [1,2]. However, because of the characteristics of a wireless sensor network, these techniques cannot provide a good estimation of the missing sensor data. To solve the problem of missing sensor data, many techniques have been proposed.
To avoid the problem of missing sensor data, many researchers have redesigned the sensor network architecture. NASA/JPL [3] is one of the most famous architectures. In NASA/JPL, if one sensor fails, its neighboring sensors compensate for the lost data by increasing their sampling rates. This implies that there must be a tight collaboration among sensors for a sensor to know that its neighboring sensor has failed. This increases the power consumption of every sensor even during its normal operation. Further, this approach does not address how sampling rates should be adjusted in order to guarantee good QoS. It is also possible that when some neighboring sensors fail, no sampling adjustment can potentially compensate for the missing values.
Some researchers have used association rule mining to estimate the missing sensor data. Halatchev and Gruenwald [4] proposed the WARM algorithm. In this algorithm, if sensor node $a$ fails, WARM finds its neighbor sensor node $b$ and uses $b$’s data to estimate $a$’s missing data. WARM makes use of the sliding window concept, where only the latest $w$ rounds of data reports are stored and used for estimation. However, the algorithm has one limitation: it disregards the temporal aspect, since it views all data as equally important. Gruenwald et al. [5] proposed an improved algorithm called FARM, which uses association rule mining to discover intrinsic relationships among sensors and incorporates them into the data estimation while taking data freshness into consideration. However, WARM and FARM can only be used in the case of discrete data; because most sensor data are numeric, WARM and FARM cannot be used widely.
III. ESTIMATION MODEL BASED ON SPATIAL-TEMPORAL CORRELATION ANALYSIS
In fact, there are two types of missing sensor data, namely single missing data elements and continuous missing data; therefore, STCAM must provide different solutions for different types of missing data according to the sensor’s spatial-temporal correlation. Before presenting STCAM, we discuss the problem description, the temporal correlation algorithm (TCA), the spatial correlation algorithm (SCA), and the spatial-temporal correlation algorithm (STCA).
> A. Problem Description
STCAM uses a time series to represent the data collected by a sensor node $a_k$. The time series form is as follows:
$$a_k = \{(T_1, V_{k1}), (T_2, V_{k2}), \ldots, (T_n, V_{kn})\}$$
where $T_1, \ldots, T_n \in \mathbb{R}$ denote the sampling times and $V_{k1}, \ldots, V_{kn} \in \mathbb{R}$ represent the sampling values of sensor node $a_k$ at times $T_1, \ldots, T_n$. Assuming that $V_{ki}$ denotes the missing sensor data and $\hat{V}_{ki}$ represents the estimated sensor data at time $T_i$, we can reduce the problem of the estimation of the missing sensor data to the calculation of the smallest value of
$$\left| V_{ki} - \hat{V}_{ki} \right|$$
> B. Temporal Correlation Algorithm
In some applications, monitored parameters such as temperature, humidity, and light intensity have a tight temporal correlation. Therefore, we can use temporal correlations to build the TCA model. Below, we introduce two algorithms, namely the linear interpolation algorithm (TCALI) and the multiple regression algorithm (TCAMR).
1) TCALI Algorithm
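As a hedged illustration of the linear interpolation that TCALI relies on, the following Python sketch estimates one missing reading from the two nearest available samples. The function name `tcali_estimate` and its parameter names are hypothetical and do not come from the paper; the readings are the sensor V temperatures quoted in the Fig. 1 discussion of this subsection.

```python
def tcali_estimate(t_u, v_u, t_v, v_v, t_i):
    """Linearly interpolate the reading at t_i from samples at t_u < t_i < t_v."""
    if not t_u < t_i < t_v:
        raise ValueError("t_i must lie strictly between t_u and t_v")
    slope = (v_v - v_u) / (t_v - t_u)
    return v_u + slope * (t_i - t_u)


# Single missing reading: the neighbors are 24 minutes away on each side.
single = tcali_estimate(1152, 33.10, 1200, 32.50, 1176)

# Long gap: the nearest surviving samples are far from the missing time.
gap = tcali_estimate(984, 29.80, 1296, 30.30, 1176)

print(round(single, 2), round(gap, 2))
```

With close neighbors the estimate (32.80) stays about 0.1 away from the recorded value of 32.90, whereas bridging the long gap yields roughly 30.11, which is why interpolation is reserved for single missing elements.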
The linear interpolation algorithm is a method of curve fitting using linear polynomials, and it is highly efficient. TCALI can be expressed by the following formula [6]:
$$\hat{V}_{ki} = V_{ku} + \frac{V_{kv} - V_{ku}}{T_v - T_u}\,(T_i - T_u)$$
where $T_u$ and $T_v$ denote the two nearest time points to $T_i$, and $T_u < T_i < T_v$; $\hat{V}_{ki}$ denotes the estimated sensor data at time $T_i$, and $V_{ku}$ and $V_{kv}$ represent the sampling data at times $T_u$ and $T_v$, respectively.
For a single missing data element, the TCALI algorithm gives a good estimate, but if the missing sensor data are continuous, the accuracy of the TCALI algorithm decreases, as shown in Fig. 1. Sensor V measures the temperature every 24 minutes. Assuming that $T_{1176}$ ($V_{1176} = 32.90$) is missing and that $T_{1152}$ ($V_{1152} = 33.10$) and $T_{1200}$ ($V_{1200} = 32.50$) are the two nearest time points to $T_{1176}$, we find that $\hat{V}_{1176} = 32.80$ is close to $V_{1176}$. However, assuming that the data between $T_{1008}$ and $T_{1272}$ are missing, we find that $T_{984}$ ($V_{984} = 29.80$) and $T_{1296}$ ($V_{1296} = 30.30$) are the two nearest time points to $T_{1176}$; therefore, $\hat{V}_{1176} \approx 30.11$, and the error $|V_{1176} - \hat{V}_{1176}| \approx 2.79$ is very large. Hence, the TCALI algorithm is only used for estimating single missing data elements.
2) TCAMR Algorithm
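From the above section, we see that TCALI has good accuracy for single missing sensor data elements in TCA, but for continuous missing sensor data, TCALI cannot provide good estimates. Therefore, in this section, we introduce the multiple regression algorithm (TCAMR) to estimate the continuous missing sensor data of the TCA model. Assuming that $V_{ki}$ denotes the missing data of sensor node $a_k$ at time $T_i$, the problem of estimating $\hat{V}_{ki}$ can be solved by using the following multiple regression formula:
$$\hat{V}_{ki} = \beta_0 + \beta_1 V_{k(i-1)} + \beta_2 V_{k(i-2)} + \cdots + \beta_m V_{k(i-m)} \tag{2}$$
where $\{\beta_0, \beta_1, \beta_2, \ldots, \beta_m\}$ denote regression coefficients, which represent the contribution level for $\hat{V}_{ki}$.
To estimate $\hat{V}_{ki}$, we should use the training dataset to estimate the value of $\{\beta_0, \beta_1, \beta_2, \ldots, \beta_m\}$. Assume that the training dataset is $\{V_{ki}, V_{k(i+1)}, V_{k(i+2)}, \ldots, V_{kj}\}$, $j > i + 2m + 1$ (here $i$ and $j$ delimit the training window). To estimate $\{\beta_0, \beta_1, \beta_2, \ldots, \beta_m\}$, we should build $h$ linear equations ($h > m + 1$), one for each training time $t_1, \ldots, t_h$, that can be expressed as follows:
$$V_{kt} = \beta_0 + \beta_1 V_{k(t-1)} + \cdots + \beta_m V_{k(t-m)}, \quad t = t_1, \ldots, t_h \tag{3}$$
Let
$$Y = \begin{pmatrix} V_{kt_1} \\ \vdots \\ V_{kt_h} \end{pmatrix}, \quad X = \begin{pmatrix} 1 & V_{k(t_1-1)} & \cdots & V_{k(t_1-m)} \\ \vdots & \vdots & & \vdots \\ 1 & V_{k(t_h-1)} & \cdots & V_{k(t_h-m)} \end{pmatrix}, \quad \beta = (\beta_0, \beta_1, \ldots, \beta_m)^{T}$$
Therefore, Eq. (3) can be rewritten as follows:
$$Y = X\beta$$
The coefficient $\beta$ can be estimated by using the least-squares approach [7], which can be expressed as follows:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$
After we calculate the value of coefficient $\beta$, we can use Eq. (2) to estimate the continuous missing sensor data of TCA.
> C. SCA Algorithm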
For continuous missing data in parameters with a loose temporal correlation, the TCA algorithm cannot provide a good estimate of the missing data. However, the SCA algorithm can discover the spatial relationship between the sensor nodes and use the discovered spatial knowledge to estimate the missing data.
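The spatial idea can be sketched as an ordinary least-squares regression of a node's readings on its neighbors' readings taken at the same times. The pure-Python sketch below is an illustration under assumptions, not the authors' Java implementation: the function names are hypothetical, the normal equations are solved by Gaussian elimination, and the toy readings are invented (the targets are generated from known coefficients so that the fit can be checked).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(rows, targets):
    """Return [b0, b1, ..., bm] minimizing the squared error of b0 + sum(bj * xj)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the intercept
    width = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[p] * d[q] for d in design) for q in range(width)]
           for p in range(width)]
    xty = [sum(d[p] * y for d, y in zip(design, targets)) for p in range(width)]
    return solve_linear_system(xtx, xty)


def sca_estimate(beta, neighbor_values):
    """Predict the node's reading from its neighbors' simultaneous readings."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], neighbor_values))


# Hypothetical training data: each row holds two neighbors' readings at one time.
neighbor_rows = [(20.0, 30.0), (22.0, 29.0), (24.0, 33.0), (21.0, 35.0), (23.0, 31.0)]
# Toy targets generated as 1 + 0.5 * x1 + 0.25 * x2, so the fit can be checked.
node_readings = [1.0 + 0.5 * x1 + 0.25 * x2 for x1, x2 in neighbor_rows]

beta = fit_least_squares(neighbor_rows, node_readings)
estimate = sca_estimate(beta, (25.0, 32.0))
print([round(b, 6) for b in beta], round(estimate, 6))
```

Five training rows for two neighbors satisfy the requirement of at least m + 1 groups of readings; with fewer rows, or with collinear neighbors, the normal equations become singular and the sketch raises an error.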
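Assuming that $V_{ki}$ denotes the missing data of sensor node $a_k$ at time $T_i$ and $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ represent the neighbors of $a_k$, we let $\{V_{1i}, V_{2i}, \ldots, V_{mi}\}$ represent the data values of $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ at time $T_i$. The problem of estimating $V_{ki}$ can be solved from $\{V_{1i}, V_{2i}, \ldots, V_{mi}\}$ by using multiple regression as follows:
$$\hat{V}_{ki} = \beta_0 + \beta_1 V_{1i} + \beta_2 V_{2i} + \cdots + \beta_m V_{mi}$$
where $\{\beta_0, \beta_1, \beta_2, \ldots, \beta_m\}$ denote regression coefficients, which represent the contribution level for $\hat{V}_{ki}$.
To calculate $\hat{V}_{ki}$, SCA needs a dataset to estimate the value of $\{\beta_0, \beta_1, \beta_2, \ldots, \beta_m\}$. According to the solution rules of linear equations, the dataset must contain at least $(m + 1)$ groups of $\{V_1, V_2, \ldots, V_m\}$. Note that when $h > m + 1$, the linear equations can be expressed as follows:
$$V_{kt} = \beta_0 + \beta_1 V_{1t} + \beta_2 V_{2t} + \cdots + \beta_m V_{mt}, \quad t = 1, \ldots, h$$
Let
$$Y = \begin{pmatrix} V_{k1} \\ \vdots \\ V_{kh} \end{pmatrix}, \quad X = \begin{pmatrix} 1 & V_{11} & \cdots & V_{m1} \\ \vdots & \vdots & & \vdots \\ 1 & V_{1h} & \cdots & V_{mh} \end{pmatrix}, \quad \beta = (\beta_0, \beta_1, \ldots, \beta_m)^{T}$$
Therefore, these equations can be represented by using matrix algebra as follows:
$$Y = X\beta$$
Hence, we can calculate the value of coefficient $\beta$ as follows:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$
> D. STCA Algorithm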
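The TCA algorithm is used when the data have a tight temporal correlation. The SCA algorithm is used when the data have a tight spatial correlation. However, when the temporal or spatial correlation is unknown, TCA or SCA may not give a good estimate. To solve this problem, the STCA algorithm is proposed. This algorithm takes into account the weight of the temporal and spatial correlations; therefore, the STCA algorithm can be represented as follows:
$$\hat{V}_{ki} = W_s \hat{V}^{S}_{ki} + W_T \hat{V}^{T}_{ki}$$
where $W_s$ and $W_T$ denote the weights of $\hat{V}^{S}_{ki}$ and $\hat{V}^{T}_{ki}$, respectively; $\hat{V}^{S}_{ki}$ represents the result of the SCA algorithm, and $\hat{V}^{T}_{ki}$ denotes the result of the TCA algorithm.
To obtain the optimum value of $W_s$ and $W_T$, STCA calculates the residual sum of squares (RSS) as follows:
$$RSS = \sum_{t=1}^{h} \left( W_s \varepsilon^{S}_{t} + W_T \varepsilon^{T}_{t} \right)^2$$
where $\varepsilon^{S}_{t}$ and $\varepsilon^{T}_{t}$ denote the estimation errors of $\hat{V}^{S}$ and $\hat{V}^{T}$, respectively, and $h$ represents the number of selected datasets.
Let
$$W = \begin{pmatrix} W_s \\ W_T \end{pmatrix}, \quad R = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad E = \begin{pmatrix} \sum_{t} (\varepsilon^{S}_{t})^2 & \sum_{t} \varepsilon^{S}_{t}\varepsilon^{T}_{t} \\ \sum_{t} \varepsilon^{S}_{t}\varepsilon^{T}_{t} & \sum_{t} (\varepsilon^{T}_{t})^2 \end{pmatrix}$$
Therefore, the question of obtaining the optimum value of $W_s$ and $W_T$ becomes a quadratic programming problem:
$$\min_{W} \; RSS = W^{T} E W \quad \text{subject to} \quad R^{T} W = 1$$
Therefore, we can use the least-squares approach [7] to obtain an optimal solution as follows:
$$W^{*} = \frac{E^{-1} R}{R^{T} E^{-1} R}$$
where $E^{-1}$ denotes the inverse of $E$.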
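The weight selection described in this subsection can be sketched for the two-estimator case as follows. This is an illustration under a stated assumption: the two weights are constrained to sum to one, which turns the RSS minimization into the classic forecast-combination problem with a closed-form solution. The function name `stca_weights` and the error values are hypothetical.

```python
def stca_weights(spatial_errors, temporal_errors):
    """Weights (w_s, w_t) with w_s + w_t = 1 minimizing sum((w_s*e_s + w_t*e_t)**2)."""
    e_ss = sum(e * e for e in spatial_errors)
    e_tt = sum(e * e for e in temporal_errors)
    e_st = sum(s * t for s, t in zip(spatial_errors, temporal_errors))
    denom = e_ss + e_tt - 2.0 * e_st  # equals sum((e_s - e_t)**2)
    if denom == 0.0:
        return 0.5, 0.5  # identical error series: every split gives the same RSS
    w_s = (e_tt - e_st) / denom
    return w_s, 1.0 - w_s


# Hypothetical estimation errors of SCA and TCA on four training samples.
spatial_errors = [0.2, -0.1, 0.3, -0.2]
temporal_errors = [-0.4, 0.5, -0.3, 0.6]

w_s, w_t = stca_weights(spatial_errors, temporal_errors)
combined_rss = sum((w_s * s + w_t * t) ** 2
                   for s, t in zip(spatial_errors, temporal_errors))
print(round(w_s, 4), round(w_t, 4), round(combined_rss, 4))
```

Because the sketch imposes no non-negativity constraint, a weight can fall outside [0, 1] when the two error series are strongly positively correlated; a deployment might clip the weights, at the cost of a slightly larger RSS.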
> E. Correlation Analysis Algorithm
We use Pearson’s product-moment correlation coefficient $\rho$ to measure the correlation of the output variable and the input variable. The value of $\rho$ is between +1 and −1 (inclusive), where 1 denotes a total positive correlation, 0 represents no correlation, and −1 indicates a total negative correlation. The formula for $\rho$ is as follows:
$$\rho = \frac{E\left[(x - \mu_x)(y - \mu_y)\right]}{\sigma_x \sigma_y} \tag{12}$$
where $y$ denotes the output variable; $x$ represents the input variable; $\mu_x$ and $\mu_y$ denote the mean of $x$ and $y$, respectively; and $\sigma_x$ and $\sigma_y$ indicate the standard deviation of $x$ and $y$, respectively.
Further, $0.5 < |\rho| \le 1$ is regarded as a high correlation, $0.3 < |\rho| \le 0.5$ as a medium correlation, and $0.0 < |\rho| \le 0.3$ as a low correlation.
1) Temporal Correlation Analysis
If we want to find whether the sampling data of a sensor have a temporal relationship, we should choose a training dataset for the analysis. Assuming that $\{V_{ki}, V_{k(i+1)}, V_{k(i+2)}, \ldots, V_{k(j-1)}, V_{kj}\}$ is the training dataset of sensor node $a_k$, we use $V_{kj}, V_{k(j-1)}, \ldots, V_{k(j-h)}$ to denote the sampling values of $a_k$ at time $T_h$. Thus, we obtain the subdatasets at $T_{(h-1)}, T_{(h-2)}, T_{(h-3)}, \ldots, T_1$, as shown in Table 1.
Therefore, we can use Eq. (12) to analyze the relationship between the dataset of $T_h$ and the dataset of another time ($T_{(h-1)}, T_{(h-2)}, \ldots, T_1$). Now, we can define the temporal correlation as follows:
Definition 1: In a training dataset, if the subdataset of $T_h$ is highly relevant to one or more subdatasets of another time ($0.5 < |\rho| \le 1$), the dataset of the sensor node has a high temporal correlation. If the subdataset of $T_h$ is only moderately relevant to one or more subdatasets of another time ($0.3 < |\rho| \le 0.5$), the dataset of the sensor node has a medium temporal correlation.
2) Spatial Correlation Analysis
If we want to determine whether the sampling data of a sensor have a spatial relationship, we should also choose a training dataset for the analysis. Assuming that $a_{(k+1)}, a_{(k+2)}, \ldots, a_{(k+i)}$ are the nearest nodes to $a_k$, we obtain the values listed in Table 2.
Definition 2: In a training dataset, if the subdataset of $a_k$ is highly relevant to one or more other sensor node subdatasets ($0.5 < |\rho| \le 1$), the dataset of the sensor node has a high spatial correlation. If the subdataset of $a_k$ is only moderately relevant to one or more other sensor node subdatasets ($0.3 < |\rho| \le 0.5$), the dataset of the sensor node has a medium spatial correlation.
> F. Process of STCAM Decision
STCAM uses the following four algorithms: TCALI, TCAMR, SCA, and STCA. If a sensor node has a tight temporal correlation, STCAM uses the TCALI algorithm for a single missing data element and the TCAMR algorithm for continuous missing data. Further, if a sensor node has a high spatial correlation, STCAM uses SCA to estimate the missing data. Otherwise, STCAM chooses the STCA algorithm. The process of STCAM decision making is shown in Fig. 2. From this figure, we can conclude the applicable conditions of the four algorithms.
TCALI: The training dataset has a high temporal correlation, and the type of missing sensor data is single; or the training dataset has a medium temporal correlation and a low spatial correlation, and the type of missing sensor data is single.
TCAMR: The training dataset has a high temporal correlation, and the type of missing sensor data is continuous; or the training dataset has a medium temporal correlation and a low spatial correlation, and the type of missing sensor data is continuous.
SCA: The training dataset has a medium temporal correlation and a high spatial correlation; a low temporal correlation and a high spatial correlation; or a low temporal correlation and a medium spatial correlation.
STCA: The training dataset has a medium temporal correlation and a medium spatial correlation.
If the training dataset has a low spatial and temporal correlation, there is no matching algorithm for the estimation.
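The correlation bands and the selection rules above can be sketched together in a few lines of Python. All names here (`pearson`, `level`, `choose_algorithm`) are hypothetical, and two choices are assumptions rather than statements from the paper: a constant series is given a correlation of 0, and a low temporal correlation combined with a high spatial correlation selects SCA, following the traffic experiment in Section IV, while low/low remains unmatched.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant series carries no correlation signal
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sx * sy)


def level(rho):
    """Map |rho| to the three bands used in the text."""
    r = abs(rho)
    if r > 0.5:
        return "high"
    if r > 0.3:
        return "medium"
    return "low"


def choose_algorithm(temporal, spatial, continuous):
    """Return the algorithm name, or None when no rule matches."""
    if temporal == "high" or (temporal == "medium" and spatial == "low"):
        return "TCAMR" if continuous else "TCALI"
    if spatial == "high" or (temporal == "low" and spatial == "medium"):
        return "SCA"
    if temporal == "medium" and spatial == "medium":
        return "STCA"
    return None  # low temporal and low spatial correlation


# Hypothetical temperature-like series; correlate it with its own lag-1 copy.
series = [20.0, 20.5, 21.1, 21.8, 22.0, 22.4, 23.1, 23.5]
temporal_level = level(pearson(series[1:], series[:-1]))
print(temporal_level, choose_algorithm(temporal_level, "low", continuous=False))
```

Keeping the band classification separate from the rule table makes it easy to change the 0.3 and 0.5 thresholds without touching the selection logic.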
IV. SIMULATION EXPERIMENTS
The estimation model proposed in this paper is simulated using Java and evaluated over the Intel lab dataset [8] and a traffic dataset of a city in China. The Intel lab dataset is a trace of readings from 54 sensor nodes deployed in the Intel Research Berkeley lab. These sensor nodes collected the light, humidity, temperature, and voltage readings once every 30 seconds. The traffic dataset is a trace of readings from 596 sensor nodes that are deployed on different roads.
To evaluate the accuracy and performance of STCAM, we choose DESM [9] for a comparison. The DESM algorithm is also an estimation approach based on the spatial-temporal correlation. It estimates the missing value of sensor node $a_k$ at time $T_i$ as a weighted combination of $V_{k(i-1)}$, the value of $a_k$ at time $T_{(i-1)}$, and $V_{zi}$, the value of $a_z$ at time $T_i$, where $\beta$ denotes the weight of $V_{zi}$ and DESM chooses $a_z$ as the nearest node to $a_k$ (for a detailed description of DESM, refer to [9]).
To evaluate the four abovementioned algorithms, we need to choose different datasets for testing.
1) Comparison between TCALI/TCAMR and DESM
By analyzing the temporal correlation, we know that the temperature dataset has a high temporal correlation; therefore, we use the temperature dataset of sensor 23 to test the accuracy and performance of TCALI/TCAMR. First, we assume that the 121st, 131st, 141st, ..., 311th data elements are missing; under this condition, STCAM chooses the TCALI algorithm to estimate the missing sensor data. The experiment results are presented in Fig. 3 and Table 3.
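The single-missing-element setup described above can be sketched as follows. The paper does not name its error metric in this section, so the mean absolute error used here is an assumption, as are the function names and the synthetic readings; the real experiment uses the temperature trace of sensor 23.

```python
import math


def single_gap_positions(first=121, last=311, step=10):
    """1-based positions of the isolated missing readings: 121, 131, ..., 311."""
    return list(range(first, last + 1, step))


def interpolate_isolated(values, position):
    """Estimate one missing reading from its two immediate neighbors.

    With evenly spaced samples, linear interpolation reduces to their average.
    """
    index = position - 1  # convert to a 0-based list index
    return (values[index - 1] + values[index + 1]) / 2.0


def mean_absolute_error(values, positions):
    """Average |true - estimated| over the positions treated as missing."""
    errors = [abs(values[p - 1] - interpolate_isolated(values, p))
              for p in positions]
    return sum(errors) / len(errors)


# Synthetic, slowly varying readings standing in for a temperature trace.
readings = [20.0 + 0.01 * n + 0.5 * math.sin(n / 25.0) for n in range(400)]

positions = single_gap_positions()
mae = mean_absolute_error(readings, positions)
print(len(positions), mae)
```

Because each masked position keeps both immediate neighbors, every gap is a single missing element, which is the case in which STCAM selects TCALI.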
We assume that data 121–140 are missing; therefore, under this condition, STCAM chooses the TCAMR algorithm to estimate the missing sensor data. The experiment results are presented in Fig. 4 and Table 4.
According to Fig. 4, the accuracy of TCAMR decreases with an increase in the amount of continuous missing sensor data. Therefore, there is a threshold. If the amount of missing sensor data is less than the threshold, TCAMR exhibits good accuracy, and if the amount of missing sensor data is more than the threshold, TCAMR is not suitable for estimating the missing sensor data with a high temporal correlation. According to Table 4, the performance of TCAMR is significantly higher than that of DESM.
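One plausible reason for this loss of accuracy can be sketched under an assumption the paper does not state explicitly: that TCAMR regresses each value on the m preceding samples and feeds every estimate back in as an input for the next missing sample. The function name `tcamr_fill`, the coefficients, and the readings below are hypothetical toy values chosen so the arithmetic is easy to follow.

```python
def tcamr_fill(history, beta, gap_length):
    """Fill gap_length missing samples that follow history.

    beta = [b0, b1, ..., bm]; b1 multiplies the most recent sample.
    """
    m = len(beta) - 1
    if m < 1 or len(history) < m:
        raise ValueError("need m >= 1 and at least m known samples before the gap")
    window = list(history[-m:])
    estimates = []
    for _ in range(gap_length):
        lags = window[::-1]  # most recent sample first
        value = beta[0] + sum(b * v for b, v in zip(beta[1:], lags))
        estimates.append(value)
        window = window[1:] + [value]  # the estimate becomes an input
    return estimates


# Toy coefficients that extrapolate the last trend: v_t = 2*v_(t-1) - v_(t-2).
beta = [0.0, 2.0, -1.0]
known = [20.0, 20.5]
estimates = tcamr_fill(known, beta, gap_length=3)

# Hypothetical true readings whose rise flattens out inside the gap.
truth = [20.9, 21.2, 21.4]
errors = [abs(e - t) for e, t in zip(estimates, truth)]
print(estimates, [round(e, 2) for e in errors])
```

In this toy run the error grows with every step into the gap, because each estimate inherits the error of the estimates it was computed from; this is consistent with the threshold behavior described above.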
2) Comparison between STCA and DESM
By analyzing the temporal and spatial correlation, we know that the humidity dataset has a medium temporal and spatial correlation. Under this condition, STCAM chooses the STCA algorithm to estimate the missing sensor data. We choose the humidity dataset of sensor 11 to test the accuracy and performance of STCA. Assuming that data 81–105 are missing, we obtain the experimental results shown in Fig. 5 and Table 5.
According to Fig. 5 and Table 5, STCA is more accurate than DESM, but its performance is lower than that of DESM.
3) Comparison between SCA and DESM
By analyzing the temporal and spatial correlation, we find that the traffic dataset has a low temporal correlation and a high spatial correlation. Therefore, under this condition, STCAM chooses the SCA algorithm to estimate the missing sensor data. We suppose that data 121–144 of sensor $a_6$ are missing. The experimental results are shown in Fig. 6 and Table 6.
According to Fig. 6 and Table 6, SCA is more accurate than DESM, but its performance is lower than that of DESM.
V. CONCLUSION
In this paper, we propose a data estimation technique called STCAM, which can discover the correlation of the training dataset; depending on this correlation and the type of missing sensor data, STCAM chooses the most suitable algorithm from TCALI, TCAMR, SCA, and STCA to estimate the missing sensor data. From the simulation results, we conclude that STCAM exhibits good accuracy for the missing sensor data, but it has a relatively low computational efficiency. Therefore, STCAM can only be deployed at the sink node or in the central server. Moreover, the simulation shows that the accuracy of TCAMR decreases with an increase in the amount of continuous missing sensor data, which may influence the total accuracy of STCAM; this paper does not provide an effective solution for this issue. Therefore, in the future, we will conduct further research to fill this gap.

[Fig. 1.] Temperature data collected by sensor V for one day.

[Table 1.] Dataset at time $T_h, T_{(h-1)}, T_{(h-2)}, \ldots, T_1$

[Table 2.] Dataset of $a_k, a_{(k+1)}, a_{(k+2)}, \ldots, a_{(k+i)}$ at different times

[Fig. 2.] Process of STCAM decision. STCAM: a model based on spatial-temporal correlation analysis, TCA: temporal correlation analysis, TCAMR: multiple regression algorithm of TCA, TCALI: linear interpolation algorithm of TCA, SCA: spatial correlation analysis, STCA: spatial-temporal correlation analysis.

[Fig. 3.] Comparison of experimental results of TCALI and DESM. TCALI: linear interpolation algorithm of temporal correlation analysis, DESM: data estimation using statistical model.

[Table 3.] Performance comparison of TCALI and DESM

[Fig. 4.] Comparison of experimental results of TCAMR and DESM. TCAMR: multiple regression algorithm of temporal correlation analysis, DESM: data estimation using statistical model.

[Table 4.] Performance comparison of TCAMR and DESM

[Fig. 5.] Comparison of experimental results of STCA and DESM. STCA: spatial-temporal correlation analysis, DESM: data estimation using statistical model.

[Table 5.] Performance comparison of STCA and DESM

[Fig. 6.] Comparison of experimental results of SCA and DESM. SCA: spatial correlation analysis, DESM: data estimation using statistical model.

[Table 6.] Performance comparison of SCA and DESM