A New Estimation Model for Wireless Sensor Networks Based on the Spatial-Temporal Correlation Analysis

  • ABSTRACT

    The estimation of missing sensor values is an important problem in sensor network applications, but existing approaches are limited in both application scope and estimation accuracy. Therefore, in this paper, we propose a new estimation model based on spatial-temporal correlation analysis (STCAM). STCAM makes full use of spatial and temporal correlations: it recognizes whether the sensor parameters have a spatial correlation or a temporal correlation, and whether the missing sensor data are continuous. According to the recognition results, STCAM chooses the most suitable algorithm from among the linear interpolation algorithm of temporal correlation analysis (TCA-LI), the multiple regression algorithm of temporal correlation analysis (TCA-MR), spatial correlation analysis (SCA), and spatial-temporal correlation analysis (STCA) to estimate the missing sensor data. STCAM was evaluated on the Intel lab dataset and a traffic dataset, and the simulation results show that STCAM has good estimation accuracy.


  • KEYWORD

    Data mining, DESM, Missing sensor data, STCAM

  • I. INTRODUCTION

    With the rapid development of wireless communication, microelectronics, and embedded computing technologies, sensor networks are widely used in certain fields, such as the military, environment, and medicine. Therefore, nowadays, these networks have become a popular topic of research. In a wireless sensor network, sensors always communicate with the server and other sensors (e.g., for sending data or accepting data). However, in the process of communication, we can expect the transmitted sensor data to get lost or corrupted for many reasons, such as bad weather conditions, the sensor node’s communication ability, wireless signal strength, power outage at the sensor node, or a relatively high bit error rate of the wireless radio transmissions as compared with wired communications. In general, we can re-query data or discard data, but re-querying data is a naive alternative as it may induce a long wait or quicken the power exhaustion of the node, and most importantly, it does not guarantee having the original reading available. Discarding data is also a bad choice as it may lead to the loss of some interesting data. Therefore, it is essential to develop a technique for estimating missing data.

    Data mining can produce knowledge from the existing data, and this knowledge can be used for estimating the missing sensor data. However, the existing missing sensor data estimation approaches do not achieve good results (as discussed in the following section). Therefore, in this paper, we propose a new estimation model based on a spatial-temporal correlation analysis (STCAM). This model can discover intrinsic relationships among sensors and then incorporate the intrinsic relationships and the spatial-temporal relationship into data estimation. Finally, STCAM is tested with the Intel lab dataset and with data from a traffic monitoring sensor application.

    II. RELATED WORKS

    In fact, the topic of missing data estimation belongs to the field of statistics, and many researchers have conducted a considerable amount of research on this topic by using methods such as mean substitution, linear regression, Bayesian estimation, expectation maximization, k-nearest neighbor, and neural networks [1,2]. However, because of the characteristics of a wireless sensor network, these techniques cannot provide a good estimation of the missing sensor data. To solve the problem of missing sensor data, many techniques have been proposed.

    To avoid the problem of missing sensor data, many researchers have redesigned the sensor network architecture. NASA/JPL [3] is one of the most famous architectures. In NASA/JPL, if one sensor fails, its neighboring sensors compensate for the lost data by increasing their sampling rates. This implies that there must be a tight collaboration among sensors for a sensor to know that its neighboring sensor has failed. This increases the power consumption of every sensor even during its normal operation. Further, this approach does not address how sampling rates should be adjusted in order to guarantee good QoS. It is also possible that when some neighboring sensors fail, no sampling adjustment can potentially compensate for the missing values.

    Some of the researchers used association rule mining to estimate the missing sensor data. Halatchev and Gruenwald [4] proposed the WARM algorithm. In this algorithm, if sensor node a fails, WARM will find its neighbor sensor node b and use b’s data to estimate a’s missing data. WARM makes use of the sliding window concept, where only the latest w rounds of data reports are stored and used for estimation. However, the algorithm has one limitation, which is its disregard of the temporal aspect since it views all data as equally important. Gruenwald et al. [5] proposed an improved algorithm called FARM, which uses association rule mining to discover intrinsic relationships among sensors and incorporates them into the data estimation while taking data freshness into consideration. However, WARM and FARM can only be used in the case of discrete data; because most of the sensor data are numeric, WARM and FARM cannot be used widely.

    III. ESTIMATION MODEL BASED ON SPATIAL-TEMPORAL CORRELATION ANALYSIS

    There are two types of missing sensor data, namely single missing data elements and continuous missing data; therefore, STCAM must be able to provide different solutions for different types of missing data according to the sensor’s spatial-temporal correlation. Before presenting STCAM, we discuss the problem description, the temporal correlation algorithm (TCA), the spatial correlation algorithm (SCA), and the spatial-temporal correlation algorithm (STCA).

      >  A. Problem Description

    STCAM uses a temporal series form to represent the collected data of a sensor node ak. The temporal series form is as follows:

    image

    where T1, ..., Tn ∈ R denote the sampling times and Vk1, ..., Vkn ∈ R represent the sampling values of sensor node ak at times T1, ..., Tn. Assuming that Vki denotes the missing sensor data and $\hat{V}_{ki}$ represents the estimated sensor data at time Ti, we can reduce the problem of estimating the missing sensor data to minimizing the estimation error |Vki − $\hat{V}_{ki}$|.

      >  B. Temporal Correlation Algorithm

    In some applications, monitored parameters such as temperature, humidity, and light intensity have a tight temporal correlation. Therefore, we can use temporal correlations to build the TCA model. In this section, we introduce two algorithms, namely the linear interpolation algorithm (TCA-LI) and the multiple regression algorithm (TCA-MR).

    1) TCA-LI Algorithm

    The linear interpolation algorithm is a method of curve fitting using linear polynomials, which have a high efficiency. In this section, TCA-LI can be expressed by the following formula [6]:

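    $$\hat{V}_{ki} = V_{ku} + \frac{V_{kv} - V_{ku}}{T_v - T_u}\,(T_i - T_u) \qquad (1)$$

    where Tu and Tv denote the two nearest time points to Ti that have sampled values, with Tu < Ti < Tv; $\hat{V}_{ki}$ denotes the estimated sensor data at time Ti; and Vku and Vkv represent the sampling data at times Tu and Tv, respectively.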

    For a single missing data element, the TCA-LI algorithm gives a good estimate, but if the missing sensor data are continuous, its accuracy decreases, as shown in Fig. 1. Sensor V measures the temperature every 24 minutes. Assume that the sample at T1176 (V1176 = 32.90) is missing and that T1152 (V1152 = 33.10) and T1200 (V1200 = 32.50) are the two nearest time points to T1176; linear interpolation then gives $\hat{V}_{1176}$ = 32.80, which is close to V1176. However, if all the data between T1008 and T1272 are missing, T984 (V984 = 29.80) and T1296 (V1296 = 30.30) are the two nearest time points to T1176; therefore, $\hat{V}_{1176}$ ≈ 30.11, and the error |V1176 − $\hat{V}_{1176}$| ≈ 2.79 is very large. Hence, the TCA-LI algorithm is only used for estimating single missing data elements.
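
    The two cases above can be checked numerically. The following is a minimal sketch of linear interpolation between the two nearest known samples; the function name `tca_li` and its argument names are illustrative assumptions, and only the sample values quoted in the text are used.

```python
def tca_li(t_u, v_u, t_v, v_v, t_i):
    """Linearly interpolate the value at t_i from the known samples at t_u and t_v."""
    if not t_u < t_i < t_v:
        raise ValueError("t_i must lie strictly between t_u and t_v")
    return v_u + (v_v - v_u) * (t_i - t_u) / (t_v - t_u)


true_value = 32.90  # V1176, the reading assumed to be missing

# Single missing element: the nearest known samples are T1152 and T1200.
single = tca_li(1152, 33.10, 1200, 32.50, 1176)

# Continuous gap (T1008..T1272 missing): the nearest known samples are T984 and T1296.
gap = tca_li(984, 29.80, 1296, 30.30, 1176)

print(f"single gap: estimate={single:.2f}, error={abs(true_value - single):.2f}")
print(f"long gap:   estimate={gap:.2f}, error={abs(true_value - gap):.2f}")
```

    The estimate for the single missing element stays within about 0.1 of the true reading, whereas the estimate across the long gap is off by almost 2.8, which illustrates why interpolation is reserved for single missing elements.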

    2) TCA-MR Algorithm

    From the above section, we see that TCA-LI has good accuracy for single missing sensor data elements in TCA, but for continuous missing sensor data, TCA-LI cannot provide good estimation data. Therefore, in this section, we introduce the multiple regression algorithm (TCA-MR) to estimate the continuous missing sensor data of the TCA model. Assuming that Vki denotes the missing data of sensor node ak at time Ti, the problem of estimating $\hat{V}_{ki}$ can be solved by using the following multiple regression formula:

    image

    where {β0, β1, β2, ..., βm} denote regression coefficients, which represent the contribution level for $\hat{V}_{ki}$.

    To estimate $\hat{V}_{ki}$, we should use the training dataset to estimate the values of {β0, β1, β2, ..., βm}. Assume that the training dataset is {Vki, Vk(i+1), Vk(i+2), ..., Vkj}, with j > i + 2m + 1. To estimate {β0, β1, β2, ..., βm}, we should build h linear equations (h > m + 1) that can be expressed as follows:

    image

    Let

    image

    Therefore, Eq. (3) can be rewritten as follows:

    image

    The coefficient β can be estimated by using the least-squares approach [7], which can be expressed as follows:

    image

    After we calculate the value of coefficient β, we can use Equation (2) to estimate the continuous missing sensor data of TCA.
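
    For reference, the standard least-squares result behind this step can be written out as follows. The symbols here are assumptions, since the original matrix definitions are not reproduced: X is taken to be the h × (m + 1) matrix of regressors with a leading column of ones, Y the h-vector of observed values, and β the coefficient vector.

```latex
\hat{\beta}
  = \arg\min_{\beta} \lVert Y - X\beta \rVert^{2}
\quad\Longrightarrow\quad
X^{\top} X \,\hat{\beta} = X^{\top} Y
\quad\Longrightarrow\quad
\hat{\beta} = \left( X^{\top} X \right)^{-1} X^{\top} Y
```

    The last step requires X^T X to be invertible, that is, at least m + 1 linearly independent training rows, which matches the requirement h > m + 1 stated above.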

      >  C. SCA Algorithm

    For continuous missing data and loose temporal correlation parameters, the TCA algorithm cannot provide a good estimation value for the missing data. However, the SCA algorithm can discover the spatial relationship between the sensor nodes and use the discovered spatial knowledge to estimate the missing data.

    Assuming that Vki denotes the missing data of sensor node ak at time Ti and {α1,α2......αm} represent the neighbors of ak, we find that {V1i,V2i......Vmi} represent the data values of {α1,α2......αm} at time Ti. The problem of estimating Vki can be solved by {V1i,V2i......Vmi} using the multiple regression as follows:

    $$\hat{V}_{ki} = \beta_0 + \beta_1 V_{1i} + \beta_2 V_{2i} + \cdots + \beta_m V_{mi}$$

    where {β0, β1, β2, ..., βm} denote regression coefficients, which represent the contribution level for $\hat{V}_{ki}$.

    To calculate $\hat{V}_{ki}$, SCA needs a training dataset to estimate the values of {β0, β1, β2, ..., βm}. According to the solution rules of linear equations, the dataset must contain at least (m + 1) groups of {V1, V2, ..., Vm}. When h > m + 1, the linear equations can be expressed as follows:

    image

    Let

    image

    Therefore, the linear equations above can be represented by using matrix algebra as follows:

    image

    Hence, we can calculate the value of coefficient β as follows:

    image
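
    The SCA fitting step can be sketched in code. This is an illustrative implementation under stated assumptions, not the authors' Java code: it solves the normal equations with a small Gaussian-elimination routine, and the function names (`solve_linear`, `fit_sca`, `estimate_sca`) and the training values are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: need more independent training rows")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_sca(neighbor_rows, target_values):
    """Least-squares fit of target = b0 + b1*n1 + ... + bm*nm via the normal equations."""
    X = [[1.0] + list(row) for row in neighbor_rows]  # leading 1 for the intercept b0
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, target_values)) for i in range(p)]
    return solve_linear(xtx, xty)


def estimate_sca(beta, neighbor_values):
    """Estimate the missing reading from the neighbours' readings at the same time."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], neighbor_values))


# Hypothetical training data: readings of two neighbours over five rounds, with the
# target sensor generated as 1 + 0.5*n1 + 0.25*n2 so the fit can be checked.
train_neighbors = [(20.0, 22.0), (21.0, 21.5), (19.5, 23.0), (22.0, 20.0), (20.5, 24.0)]
train_target = [1 + 0.5 * n1 + 0.25 * n2 for n1, n2 in train_neighbors]

beta = fit_sca(train_neighbors, train_target)
estimate = estimate_sca(beta, (21.0, 22.0))
print(beta, estimate)
```

    With m = 2 neighbours, at least three independent training rows are needed; five are used here so that the system is overdetermined, as in the h > m + 1 case described above.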

      >  D. STCA Algorithm

    The TCA algorithm is used when the data have a tight temporal correlation, in particular for single missing data elements. The SCA algorithm is used when the data have a tight spatial correlation. However, when the temporal or spatial correlation is unknown, TCA or SCA may not give a good estimation value. To solve this problem, the STCA algorithm is proposed. This algorithm takes into account the weight of the temporal and spatial correlations; therefore, the STCA algorithm can be represented as follows:

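    $$\hat{V}_{ki} = W_S\,\hat{V}^{S}_{ki} + W_T\,\hat{V}^{T}_{ki}$$

    where Ws and WT denote the weights of $\hat{V}^{S}_{ki}$ and $\hat{V}^{T}_{ki}$, respectively; $\hat{V}^{S}_{ki}$ represents the result of the SCA algorithm, and $\hat{V}^{T}_{ki}$ denotes the result of the TCA algorithm.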

    To obtain the optimum value of Ws and WT, STCA calculates the residual sum of squares (RSS) as follows:

    image

    where the two error terms denote the estimation errors of the SCA result and the TCA result, respectively, and h represents the number of selected datasets.

    Let

    image

    Therefore, the problem of finding the optimum values of Ws and WT becomes a quadratic programming problem:

    image

    Therefore, we can use the least-squares approach [7] to obtain an optimal solution as follows:

    image

    where

    image
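
    As an illustration of the weighting step, the sketch below fits the two weights by ordinary least squares on a set of training points for which the true values are known. It is an assumption-laden sketch: the source formulates a quadratic programming problem whose exact constraints are not recoverable here, so this version is unconstrained, and the function name `fit_stca_weights` and the sample values are hypothetical.

```python
def fit_stca_weights(sca_estimates, tca_estimates, observed):
    """Return (w_s, w_t) minimising sum((v - w_s*s - w_t*t)**2) over the training points."""
    ss = sum(s * s for s in sca_estimates)
    tt = sum(t * t for t in tca_estimates)
    st = sum(s * t for s, t in zip(sca_estimates, tca_estimates))
    sv = sum(s * v for s, v in zip(sca_estimates, observed))
    tv = sum(t * v for t, v in zip(tca_estimates, observed))
    det = ss * tt - st * st
    if abs(det) < 1e-12:
        raise ValueError("SCA and TCA estimates are collinear; weights are not unique")
    # Cramer's rule on the 2x2 normal equations.
    w_s = (sv * tt - tv * st) / det
    w_t = (tv * ss - sv * st) / det
    return w_s, w_t


# Hypothetical training points: the observed values are built as 0.6*SCA + 0.4*TCA,
# so the fit should recover those weights.
sca = [10.0, 12.0, 11.0, 13.0]
tca = [11.0, 11.0, 12.0, 12.0]
obs = [0.6 * s + 0.4 * t for s, t in zip(sca, tca)]

w_s, w_t = fit_stca_weights(sca, tca, obs)
combined = w_s * 12.5 + w_t * 11.5  # STCA estimate for new SCA/TCA results
print(w_s, w_t, combined)
```

    If the weights are required to sum to one or to be non-negative, a constrained solver would be needed instead of the closed form used here.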

      >  E. Correlation Analysis Algorithm

    We use Pearson’s product-moment correlation coefficient to measure the correlation of the output variable and the input variable. The value of ρ is between +1 and −1 (inclusive), where 1 denotes a total positive correlation, 0 represents no correlation, and −1 indicates a total negative correlation. The formula for ρ is as follows:

    $$\rho_{x,y} = \frac{E\left[(x - \mu_x)(y - \mu_y)\right]}{\sigma_x\,\sigma_y} \qquad (12)$$

    where y denotes the output variable; x represents the input variable; μx and μy denote the mean of x and y, respectively; and σx and σy indicate the standard deviation of x and y, respectively.

    Further, 0.5 < |ρ| ≤ 1 is regarded as a high correlation, 0.3 < |ρ| ≤ 0.5 as a medium correlation, and 0.0 < |ρ| ≤ 0.3 as a low correlation.
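
    The correlation measure and the three correlation levels can be sketched as follows. The function names `pearson` and `correlation_level` are illustrative assumptions; note also that the thresholds above leave |ρ| = 0 unclassified, and this sketch treats it as low.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


def correlation_level(rho):
    """Map a coefficient to the high / medium / low levels used in the text."""
    r = abs(rho)
    if r > 0.5:
        return "high"
    if r > 0.3:
        return "medium"
    return "low"  # assumption: |rho| = 0 is also reported as low


rho = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(rho, correlation_level(rho))
```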

    1) Temporal Correlation Analysis

    If we want to find whether the sampling data of a sensor have a temporal relationship, we should choose a training dataset for the analysis. Assuming that {Vki, Vk(i+1), Vk(i+2), ..., Vk(j-1), Vkj} is the training dataset of sensor node ak, we use Vkj, Vk(j-1), ..., Vk(j-h) to denote the sampling values of ak at time Th. Thus, we obtain the sub-datasets at T(h-1), T(h-2), T(h-3), ..., T1, as shown in Table 1.

    Therefore, we can use Eq. (12) to analyze the relationship of the dataset of Th and the dataset of another time (T(h-1),T(h-2),...,T1). Now, we can define the temporal correlation as follows:

    Definition 1: In a training dataset, if the sub-dataset of Th is highly relevant to one or more sub-datasets of another time (0.5 < |ρ| ≤ 1), the dataset of the sensor node has a high temporal correlation. If the sub-dataset of Th is only moderately relevant to one or more sub-datasets of another time (0.3 < |ρ| ≤ 0.5), the dataset of the sensor node has a medium temporal correlation.

    2) Spatial Correlation Analysis

    If we want to determine whether the sampling data of a sensor have a spatial relationship, we should also choose a training dataset for the analysis. Assuming that a(k+1),a(k+2),...,a(k+i) are the nearest nodes from ak, we obtain the values listed in Table 2.

    Definition 2: In a training dataset, if the sub-dataset of ak is highly relevant to one or more other sensor node sub-datasets (0.5 < |ρ| ≤ 1), the dataset of the sensor node has a high spatial correlation. If the sub-dataset of ak is only moderately relevant to one or more other sensor node sub-datasets (0.3 < |ρ| ≤ 0.5), the dataset of the sensor node has a medium spatial correlation.

      >  F. Process of STCAM Decision

    STCAM uses the following four algorithms: TCA-LI, TCA-MR, SCA, and STCA. If a sensor node has a high temporal correlation, STCAM uses the TCA-LI algorithm for single missing data elements and the TCA-MR algorithm for continuous missing data. If a sensor node has a high spatial correlation but not a high temporal correlation, STCAM uses SCA to estimate the missing data. When both correlations are medium, STCAM chooses the STCA algorithm. The process of STCAM decision making is shown in Fig. 2. From this figure, we can conclude the applicable conditions of the four algorithms.

    TCA-LI: The training dataset has a high temporal correlation and the type of missing sensor data is single; or the training dataset has a medium temporal correlation and a low spatial correlation, and the type of missing sensor data is single.

    TCA-MR: The training dataset has a high temporal correlation and the type of missing sensor data is continuous; or the training dataset has a medium temporal correlation and a low spatial correlation, and the type of missing sensor data is continuous.

    SCA: The training dataset has a medium temporal correlation and a high spatial correlation; a low temporal correlation and a high spatial correlation; or a low temporal correlation and a medium spatial correlation.

    STCA: The training dataset has a medium temporal correlation and a medium spatial correlation.

    If the training dataset has a low spatial and temporal correlation, there is no matching algorithm for the estimation.
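
    The applicable conditions can be summarized as a small decision function. This is a sketch of one reading of the rules, with a hypothetical function name `choose_algorithm`; in particular, it assumes that a low temporal correlation combined with a high spatial correlation selects SCA, as in the traffic dataset experiment of Section IV.

```python
LEVELS = ("high", "medium", "low")


def choose_algorithm(temporal, spatial, continuous):
    """Pick an estimator from the correlation levels and the type of missing data.

    temporal, spatial: one of "high", "medium", "low".
    continuous: True if the missing data are continuous, False for a single element.
    Returns the algorithm name, or None when no algorithm matches.
    """
    if temporal not in LEVELS or spatial not in LEVELS:
        raise ValueError("correlation levels must be 'high', 'medium', or 'low'")
    # Temporal estimators: high temporal, or medium temporal with low spatial.
    if temporal == "high" or (temporal == "medium" and spatial == "low"):
        return "TCA-MR" if continuous else "TCA-LI"
    # Spatial estimator. Assumption: low temporal with high spatial maps to SCA.
    if spatial == "high" or (temporal == "low" and spatial == "medium"):
        return "SCA"
    if temporal == "medium" and spatial == "medium":
        return "STCA"
    return None  # low temporal and low spatial: no matching algorithm


print(choose_algorithm("high", "low", continuous=False))
print(choose_algorithm("medium", "medium", continuous=True))
print(choose_algorithm("low", "low", continuous=True))
```

    Keeping the rules in one function makes it easy to check that every combination of levels is covered exactly once.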

    IV. SIMULATION EXPERIMENTS

    The estimation model proposed in this paper is simulated using Java and evaluated over the Intel lab dataset [8] and a traffic dataset of a city in China. The Intel lab dataset is a trace of readings from 54 sensor nodes deployed in the Intel Research Berkeley lab. These sensor nodes collected the light, humidity, temperature, and voltage readings once every 30 seconds. The traffic dataset is a trace of readings from 596 sensor nodes that are deployed on different roads.

    To evaluate the accuracy and performance of STCAM, we choose DESM [9] for a comparison. The DESM algorithm is also an estimation approach based on the spatial-temporal correlation, and the result formula is as follows:

    image

    where Vk(i-1) denotes the value of sensor node ak at (i -1) time, Vzi represents the value of az at time i, β denotes the weight of Vzi, and DESM chooses az as the nearest node from ak (for a detailed description of DESM, refer to [9]).

    To evaluate the four abovementioned algorithms, we need to choose different datasets for testing.

       1) Comparison between TCA-LI/TCA-MR and DESM

    By analyzing the temporal correlation, we know that the temperature dataset has a high temporal correlation; therefore, we use the temperature dataset of sensor 23 to test the accuracy and performance of TCA-LI/TCA-MR. First, we assume that the 121st, 131st, 141st, ..., 311th data elements are missing; under this condition, STCAM chooses the TCA-LI algorithm to estimate the missing sensor data. The experiment results are presented in Fig. 3 and Table 3.

    We assume that data 121–140 are missing; therefore, under this condition, STCAM chooses the TCA-MR algorithm to estimate the missing sensor data. The experiment results are presented in Fig. 4 and Table 4.

    According to Fig. 4, the accuracy of TCA-MR decreases with an increase in the amount of continuous missing sensor data. Therefore, there is a threshold. If the amount of missing sensor data is less than the threshold, TCA-MR exhibits good accuracy, and if the amount of missing sensor data is more than the threshold, TCA-MR is not suitable for estimating the missing sensor data with a high temporal correlation. According to Table 4, the performance of TCA-MR is significantly higher than that of DESM.

       2) Comparison between STCA and DESM

    By analyzing the temporal and spatial correlation, we know that the humidity dataset has a medium temporal and spatial correlation. Under this condition, STCAM chooses the STCA algorithm to estimate the missing sensor data. We choose the humidity dataset of sensor 11 to test the accuracy and performance of STCA. Assuming that data 81–105 are missing, we obtain the experimental results shown in Fig. 5 and Table 5.

    According to Fig. 5 and Table 5, STCA exhibits better accuracy than DESM, but its performance is lower than that of DESM.

       3) Comparison between SCA and DESM

    By analyzing the temporal and spatial correlation, we find that the traffic dataset has a low temporal correlation and a high spatial correlation. Therefore, under this condition, STCAM chooses the SCA algorithm to estimate the missing sensor data. We suppose that data 121–144 of sensor a6 are missing. The experimental results are shown in Fig. 6 and Table 6.

    According to Fig. 6 and Table 6, SCA exhibits better accuracy than DESM, but its performance is lower than that of DESM.

    V. CONCLUSION

    In this paper, we propose a data estimation technique called STCAM, which can discover the correlation of the training dataset. Depending on this correlation and the type of missing sensor data, STCAM chooses the most suitable algorithm from TCA-LI, TCA-MR, SCA, and STCA to estimate the missing sensor data. From the simulation results, we conclude that STCAM exhibits good accuracy for the missing sensor data, but in terms of performance, STCAM has a relatively low computational efficiency. Therefore, STCAM can only be deployed at the sink node or in the central server. Moreover, the simulation shows that the accuracy of TCA-MR decreases with an increase in the amount of continuous missing sensor data, which may influence the total accuracy of STCAM; this paper does not provide an effective solution for this issue. Therefore, in the future, we will conduct further research to fill the gap.

  • 1. Pan L., Gao H., Gao H., Liu Y. 2014 “A spatial correlation based adaptive missing data estimation algorithm in wireless sensor networks,” [International Journal of Wireless Information Networks] Vol.21 P.280-289
  • 2. Niu K., Zhao F., Qiao X. 2013 “A missing data imputation algorithm in wireless sensor network based on minimized similarity distortion,” [in Proceedings of the 6th International Symposium on Computational Intelligence and Design (ISCID)] P.235-238
  • 3. Ramakrishnan S. 2003 “Sensing the world,” [Jasubhai Digital Media] Vol.10 P.26-28
  • 4. Halatchev M., Gruenwald L. 2005 “Estimating missing values in related sensor data streams,” [in Proceedings of the 11th International Conference on Management of Data (COMAD)] P.83-94
  • 5. Gruenwald L., Chok H., Aboukhamis M. 2007 “Using data mining to estimate missing sensor data,” [in Proceedings of 7th IEEE International Conference on Data Mining Workshops] P.207-212
  • 6. Yarman B. S., Kilinc A., Aksen A. 2004 “Immitance data modelling via linear interpolation techniques: a classical circuit theory approach,” [International Journal of Circuit Theory and Applications] Vol.32 P.537-563
  • 7. Kanamori T., Hido S., Sugiyama M. 2009 “A least-squares approach to direct importance estimation,” [Journal of Machine Learning Research] Vol.10 P.1391-1445
  • 8. Madden S. Intel lab data [Internet]
  • 9. Li Y., Ai C., Deshmukh W. P., Wu Y. 2008 “Data estimation in sensor networks using physical and statistical methodologies,” [in Proceedings of 28th International Conference on Distributed Computing Systems (ICDCS'08)] P.538-545
  • [Fig. 1.] Temperature data collected by sensor V for one day.
  • [Table 1.] Dataset at time Th, T(h-1),T(h-2),..., T1
  • [Table 2.] Dataset of ak, a(k+1),a(k+2),...,a(k+i) at different times
  • [Fig. 2.] Process of STCAM decision. STCAM: a model based on spatial-temporal correlation analysis, TCA: temporal correlation analysis, TCA-MR: multiple regression algorithm of TCA, TCA-LI: linear interpolation algorithm of TCA, SCA: spatial correlation analysis, STCA: spatial-temporal correlation analysis.
  • [Fig. 3.] Comparison of experimental results of TCA-LI and DESM. TCA-LI: linear interpolation algorithm of temporal correlation analysis, DESM: data estimation using statistical model.
  • [Table 3.] Performance comparison of TCA-LI and DESM
  • [Fig. 4.] Comparison of experimental results of TCA-MR and DESM. TCA-MR: multiple regression algorithm of temporal correlation analysis, DESM: data estimation using statistical model.
  • [Table 4.] Performance comparison of TCA-MR and DESM
  • [Fig. 5.] Comparison of experimental results of STCA and DESM. STCA: spatial-temporal correlation analysis, DESM: data estimation using statistical model.
  • [Table 5.] Performance comparison of STCA and DESM
  • [Fig. 6.] Comparison of experimental results of SCA and DESM. SCA: spatial correlation analysis, DESM: data estimation using statistical model.
  • [Table 6.] Performance comparison of SCA and DESM