An Exponential Smoothing Adaptive Failure Detector in the Dual Model of Heartbeat and Interaction
 DOI : 10.5626/JCSE.2014.8.1.17
 Author: Yang Zhiyong, Li Chunlin, Liu Yanpei, Liu Yunchang, Xu Lijun
 Organization: Yang Zhiyong; Li Chunlin; Liu Yanpei; Liu Yunchang; Xu Lijun
 Publish: Journal of Computing Science and Engineering Volume 8, Issue1, p17~24, 30 March 2014

ABSTRACT
In this paper, we propose a new implementation of a failure detector. The implementation uses a dual model of heartbeat and interaction. First, the heartbeat model is adopted to shorten the detection time, if the detection process does not receive the heartbeat message in the expected time. The interaction model is then used to check the process further. The expected time is calculated using the exponential smoothing method. Exponential smoothing can be used to estimate the next arrival time not only in the random data, but also in the data of linear trends. It is proven that the new detector in the paper can eventually be a perfect detector.

KEYWORD
Failure detector , Exponential smoothing , Eventually perfect

I. INTRODUCTION
A failure detector is a key building block for faulttolerant distributed system, which provide a mechanism to collect information of process failure. Chandra and Toueg [1] introduced the concept of an unreliable failure detector and many faulttolerance algorithms have been proposed based on unreliable failure detectors.
> The fixed timeout Δ_{to} is set as a constant value in a conventional failure detector and the monitoring process begins to suspect the monitored process if it does not receive a heartbeat message after time Δ_{to}. The disadvantage is that it is difficult to determine an appropriate Δ_{to}. If Δ_{to} is short, it is easy to make a wrong inference; if Δ_{to} is long, it results in a long time expense of detection time.
Recently, some adaptive failure detectors have been presented [26]. Most of them are based on a heartbeat strategy and modify the timeout value dynamically according to the network conditions. Chen et al. [7] estimate (hereinafter, Chen’s estimation) the next heartbeat arrival time by computing the average transmission time and a fixed safety margin is added. Bertier et al. [8] combine Chen’s estimation and a dynamical safety margin based on Jacobson’s [9] estimation of the roundtrip time.
In addition, some studies [1012] introduce Omega failure detectors (type of failure detectors) which judge process failure not according to the check point, but according to an output value φ (0 ≤ φ ≤1) in order to present the reliability of the process.
The next part of this section will introduce the failure detector properties and the quality of service of the failure detector.
> A. Failure Detector Properties
Failure detectors are characterized by two properties: completeness and accuracy. Two kinds of completeness and four kinds of accuracy are defined [1, 13].
Completeness characterizes the failure detector's capability to suspect every incorrect process permanently. Two kinds of completeness are defined.
1) Strong completeness: Eventually, every process that crashes is permanently suspected by every correct process. 2) Weak completeness: Eventually, every process that crashes is permanently suspected by some correct process.
Accuracy is the characterization of the failure detector’s capability not to suspect correct processes. Four kinds of accuracy are defined.
1) Strong accuracy: No process is suspected before it crashes. 2) Eventual strong accuracy: There is a time after which correct processes are not suspected by any correct process. 3) Weak accuracy: Some correct process is never suspected. 4) Eventual weak accuracy: There is a time after which some correct process is never suspected by any correct process.
Eight classes of failure detectors are yielded by combining the two kinds of completeness and four kinds of accuracy, named as perfect,
P ; eventually perfect, ◇P ; strong,S ; eventually strong, ◇S ;Q ; ◇Q ; weak,W ; eventually weak, ◇W .Generally speaking, a good failure detector should be a
perfect oreventually perfect detector. In this paper we design an eventually perfect detector. The detector meets the requirement of strong completeness and eventual strong accuracy.> B. Quality of Service of Failure Detectors
Some metrics have been proposed to specify the quality of service of a failure detector [14]. The main metrics are as follows:
● Detection time: The time from when the process crashes to the time when it is permanently suspected. ● Mistake recurrence time: The time between two consecutive mistakes. ● Mistake duration: The duration time for the failure detector to correct a mistake.
The rest of this paper is organized as follows: Section II describes our failure detection model; Section III presents the new algorithm in our failure detector; Section IV proves that it is an eventually perfect failure detector; Section V presents the performance evaluation by a series of experiments; and Section VI consists of the conclusion of our research.
Ⅱ. THE MODEL OF FAILURE DETECTION
> A. The Heartbeat Model
The heartbeat model [15] is used in most distributed systems. Every process
p periodically sends an “I am alive” heartbeat message to the processq . The period is the heartbeat interval Δ_{i} .If
q does not receive a heartbeat message fromp after a timeout delay Δ_{to},p is added to the list of suspected processes. Ifq receives the heartbeat message fromp later, thenq removesp from its list of suspected processes, as shown in Fig. 1.The heartbeat interval Δ
_{i} :Δ_{i} is the time between two emissions of the “I am alive” heartbeat message. The timeout delay Δ_{to} :Δ_{to} is the time between the last reception of the “I am alive” message fromp and the time whereq starts suspectingp . The transmission delay Δ_{tr} :Δ_{tr} is the time between the emission of the heartbeat message and the reception of the heartbeat message.> B. The Interaction Model
The process
q monitors every processp by sending an “Are you alive?” message top periodically. Oncep receives the message fromq , it replies with an “I am alive” message toq . Ifq does not receive the message fromp after a timeout delay Δ_{to},p is added to the list of suspected processes. Ifq receives the heartbeat message fromp later, then q removesp from its list of suspected processes, as shown in Fig. 2.> C. The Dual Model of Heartbeat and Interaction
> The heartbeat failure detector sends half as many messages as the interaction detector for the same detection quality, so the heartbeat failure detector is used in most of the distributed systems. However, it often mistakes actual failure statuses due to loss of packets or long delays in a complicated network. The interaction failure detector needs more detection time and sends more messages than the heartbeat detector, but it can be implemented as a requirement, and the detection result is independent of the message.
In this paper, we will use a dual model of heartbeat and interaction. The dual model is organized in two steps. First, the heartbeat model is used to detect the process failure. If a process
p is suspected, then the interaction model is used to check the processp further. The detailed procedure is as shown in Fig. 3.Ⅲ. AN EXPONENTIAL SMOOTHING ADAPTIVE FAILURE DETECTOR
Chen’s estimation is a classical adaptive failure detector algorithm. It estimates the arrival time of the next heartbeat message according to the historical time sequences of the heartbeat messages. In addition, a constant safety margin is added to raise the detector’s accuracy.
The process
q receives n heartbeat messages denotedm_{1}, m_{2}, m_{3} , ...,m_{n} . The receipt time of the messages are presented asA_{1}, A_{2}, A_{3} , ...,A_{n} and Δ_{i} is the heartbeat interval. Then, the estimation arrival time of the next message can be expressed as follows:The timeout checking point is
where is ∂ the constant safety margin.
Chen’s failure detection algorithm can estimate the arrival time of the next heartbeat message dynamically. However, it uses a constant safety margin, and the suitable constant ∂ is difficult to determine in a condition with a complex network. In addition, the algorithm uses the average transmission delay of the most recent
k messages to estimate the next transmission time. And it is only suitable for the fluctuation sequence with small fluctuations. If the delay sequences increase or decrease in a linear manner, Chen’s estimation does not correspond to the real delay.An improved estimation algorithm based on exponential smoothing is proposed to overcome the shortcomings of the base algorithm. Exponential smoothing [16] is a technique that can be applied to a time data sequence. It makes forecasts according to a series of historical data.
The delay data sequence is represented by {
x_{i} },x_{i} =A_{i} −Δ_{t}*i and the estimation of the exponential smoothing algorithm is written as . The simplest form of exponential smoothing is given by the formulaswhere α is the smoothing factor, and 0 < α < 1.
The above simple exponential smoothing does not perform well when there is a trend in the data. In such situations, double exponential smoothing is used to estimate the following data in a sequence with a linear trend. Double exponential smoothing works as follows:
where
a_{i} is the estimated level at timei andb_{i} is the estimated trend at timei .Then,
Another improvement in the algorithm is that the dynamic safety margin is calculated using the mean square deviation. As shown in Fig. 1, if the two adjacent heartbeat messages have the same transmission delay, then,
A _{i+1} =A _{i}+Δ_{t} , so A_{i+1} – (A_{i}+Δ_{t}) can represent the fluctuation in the adjacent transmission delay, and it may be positive or negative.Finally, we have
Algorithm 1 is the exponential smoothing adaptive failure detector (ESA_FD) algorithm.
Ⅳ. PROOF
Our failure detection algorithm can meet the requirements requirements of class ◇
P . In other words, it has the properties of strong completeness and eventual strong accuracy.Theorem 1. Strong completeness.Every crashed process
p is permanently suspected by every correct process∃t0 :∀t ≥ t0, ∀q∈correct(t),∀q∈crashed p∈suspectq(t).
Before providing proof, we defined several parameters:
Δmax: The unknown bound time threshold on the message transmission between processes p and q.mk: The kth message that process p sends to q.tsk : The time when process p sends mk to q.trk : The time when process q receives mk.tc : The time of a crash of process p.
When the process p crashes at
t_{c} ,p stops sending “I am alive” messages. Thents_{k} ≤t_{c} , so there is an upper boundt_{c} + Δ_{max} after the time when processq can’t receive any more messages fromp .The next thing to be proven is that τ_{i+1} exists as an upper bound.
where
x _{max} is the maximum value in {x_{i} }.So there is an upper bound (
i + 1) * Δ_{i}+ *x _{max} + 3Δ_{max} in the worst condition after which processq starts suspecting processp if it does not receive any messages fromp .Theorem 2. Eventual strong accuracy.Theorem 2 states that there is an upper bound time after which the correct processes are not suspected by any correct process.
∃ _{bound}, ∀t t ≥t _{bound}, ∀q ,p ∈correct (t ),p ∉suspect_{q} (t )However, since the maximum transmission delay is Δ_{max}, the dual model of heartbeat and interaction is used in this paper. In the worst network conditions, the checkpoint isτ_{i}+2 Δ_{max}, so a correct process will not be suspected by any correct process.
Ⅴ. EXPERIMENT
Two computers are used to build the experimental platform in a local area network. One machine runs a program to send heartbeat messages (like process
p ), and the other machine runs the monitored program to receive heartbeat messages (like processq ). We simulate complex network conditions by uploading files at a random time, and neither machines fail during the experiments.In the paper, we compare ESA_FD to conventional FD and Chen’s FD. The conventional FD sets a fixed timeout Δ
_{to} . If the monitor process does not receive the heartbeat message after Δ_{to} , it begins to suspect the monitored process. A good Δ_{to} is very important to the FD. If Δto is short, it’s easy to make a wrong suspicion. Otherwise, it results in a long time expense in detection time, but this is hard to determine under complex network conditions.Chen’s FD estimates the arrival time of the next heartbeat message dynamically and it uses the average transmission delay of the most recent
k messages to compute the next transmission time. To improve the accuracy, an additional safety margin ∂ is added. However, ∂ is a constant value in Chen’s FD.Experiment 1. Transmission delay contrast.We count thirty transmission times of the heartbeat messages and the heartbeat interval is set to 2 seconds. We compute the transmission delay according to the estimation of the arrival time by using Chen’s FD and ESA_FD. The result is shown in Fig. 5.
Since the average delay is used to estimate the arrival delay in Chen’s FD, Chen’s FD is only suitable for a series of random data and there is a big difference between the real transmission delay and Chen’s estimation when the data has a linear trend. Whereas the estimation in ESA_FD confirms the real transmission delay very well, exponential smoothing uses a weight value to emphasize the recent data and ESA_FD is suitable for the data not only for random but also for linear trends.
Experiment 2. The average mistake rate contrast.Two groups of tests are involved in computing the average mistake rate. The heartbeat interval of the first group is set to 2 seconds and the other is set to 5 seconds. The fixed timeout in the conventional FD is set to 1.8 seconds and the safety margin in Chen’s FD is set to 0.5 seconds. The result is shown in Fig. 6.
Since the conventional FD uses a fixed timeout to detect a failure, and it is difficult to choose the fixed value to fit complex network conditions, so the fixed timeout of 2 seconds may not be the most suitable value since it results in a high mistake rate of about 10%. Chen’s FD is better than the conventional FD, but it is not suitable for data with a linear trend. Therefore, the safety margin is an important adjustment for the result, and in the same manner as the conventional FD, it is hard to determine the value of the safety margin. The ESA_FD in this paper has a very low mistake rate providing accurate estimation to the transmission delay and dual detection starts while a process timeouts first. This further raises the accuracy of detection.
Experiment 3. The average detection time contrast.The parameters are set as in Experiment 2. In the experiment, we count the detection time and the result is shown in Fig. 7.
EFA_FD uses dual detection while a process is suspected, so it takes more time than Chen’s FD. However, since the estimation to the transmission is more precise, the amount of time for dual detection is shorter, and the difference between the Chen’s FD and ESA_FD is very small. So EFA_FD raises the detection accuracy by increasing detection time by a small margin.
Ⅵ. CONCLUSION
ESA_FD is an improved adaptive failure detector, which is more suitable for complex network conditions than Chen’s FD. On the one hand, ESA_FD uses exponential smoothing to predict the next data and the prediction through exponential smoothing is more accurate than Chen’s prediction, which is calculated with the average value, because Chen’s FD is only suited for random data with little fluctuation. Hence, ESA_FD can more accurately predict the series of data not only for random, but also in linear trends in applications where the weight value is of exponential smoothing. On the other hand, ESA_FD calculates the safety margin with the mean square deviation and it varies dynamically according the historical transmission delay. ESA_FD meets the requirements of strong completeness and eventually strong accuracy and the experimental results show that ESA_FD can increase the accuracy substantially by increasing the detection time by a small amount.

[Fig. 1.] The heartbeat model.

[Fig. 2.] The interaction model.

[Fig. 3.] The dual model of heartbeat and interaction.

[Fig. 5.] The transmission delay contrast. ESA_FD: exponential smoothing adaptive failure detector.

[Fig. 6.] The average mistake rate contrast. ESA_FD: exponential smoothing adaptive failure detector.

[Fig. 7.] The average detection time contrast. ESA_FD: exponential smoothing adaptive failure detector.