An Exponential Smoothing Adaptive Failure Detector in the Dual Model of Heartbeat and Interaction

Yang Zhiyong; Li Chunlin; Liu Yanpei; Liu Yunchang; Xu Lijun

doi:10.5626/JCSE.2014.8.1.17

OA학술지
Journal of Computing Science and Engineering

An Exponential Smoothing Adaptive Failure Detector in the Dual Model of Heartbeat and Interaction

DOI : 10.5626/JCSE.2014.8.1.17
Author: Yang Zhiyong, Li Chunlin, Liu Yanpei, Liu Yunchang, Xu Lijun
Organization: Yang Zhiyong; Li Chunlin; Liu Yanpei; Liu Yunchang; Xu Lijun
Publish: Journal of Computing Science and Engineering Volume 8, Issue1, p17~24, 30 March 2014

ABSTRACT

An Exponential Smoothing Adaptive Failure Detector in the Dual Model of Heartbeat and Interaction

KEYWORD

Failure detector , Exponential smoothing , Eventually perfect

본문

Collapse all

I. INTRODUCTION

A failure detector is a key building block for fault-tolerant distributed system, which provide a mechanism to collect information of process failure. Chandra and Toueg [1] introduced the concept of an unreliable failure detector and many fault-tolerance algorithms have been proposed based on unreliable failure detectors.

  > The fixed timeout Δ_to is set as a constant value in a conventional failure detector and the monitoring process begins to suspect the monitored process if it does not receive a heartbeat message after time Δ_to. The disadvantage is that it is difficult to determine an appropriate Δ_to. If Δ_to is short, it is easy to make a wrong inference; if Δ_to is long, it results in a long time expense of detection time.

Recently, some adaptive failure detectors have been presented [2-6]. Most of them are based on a heartbeat strategy and modify the timeout value dynamically according to the network conditions. Chen et al. [7] estimate (hereinafter, Chen’s estimation) the next heartbeat arrival time by computing the average transmission time and a fixed safety margin is added. Bertier et al. [8] combine Chen’s estimation and a dynamical safety margin based on Jacobson’s [9] estimation of the round-trip time.

In addition, some studies [10-12] introduce Omega failure detectors (type of failure detectors) which judge process failure not according to the check point, but according to an output value φ (0 ≤ φ ≤1) in order to present the reliability of the process.

The next part of this section will introduce the failure detector properties and the quality of service of the failure detector.

  > A. Failure Detector Properties

Failure detectors are characterized by two properties: completeness and accuracy. Two kinds of completeness and four kinds of accuracy are defined [1, 13].

Completeness characterizes the failure detector's capability to suspect every incorrect process permanently. Two kinds of completeness are defined.

1) Strong completeness: Eventually, every process that crashes is permanently suspected by every correct process. 2) Weak completeness: Eventually, every process that crashes is permanently suspected by some correct process.

Accuracy is the characterization of the failure detector’s capability not to suspect correct processes. Four kinds of accuracy are defined.

1) Strong accuracy: No process is suspected before it crashes. 2) Eventual strong accuracy: There is a time after which correct processes are not suspected by any correct process. 3) Weak accuracy: Some correct process is never suspected. 4) Eventual weak accuracy: There is a time after which some correct process is never suspected by any correct process.

Eight classes of failure detectors are yielded by combining the two kinds of completeness and four kinds of accuracy, named as perfect, P; eventually perfect, ◇P; strong, S; eventually strong, ◇S; Q; ◇Q; weak, W; eventually weak, ◇W.

Generally speaking, a good failure detector should be a perfect or eventually perfect detector. In this paper we design an eventually perfect detector. The detector meets the requirement of strong completeness and eventual strong accuracy.

  > B. Quality of Service of Failure Detectors

Some metrics have been proposed to specify the quality of service of a failure detector [14]. The main metrics are as follows:

● Detection time: The time from when the process crashes to the time when it is permanently suspected. ● Mistake recurrence time: The time between two consecutive mistakes. ● Mistake duration: The duration time for the failure detector to correct a mistake.

The rest of this paper is organized as follows: Section II describes our failure detection model; Section III presents the new algorithm in our failure detector; Section IV proves that it is an eventually perfect failure detector; Section V presents the performance evaluation by a series of experiments; and Section VI consists of the conclusion of our research.

Ⅱ. THE MODEL OF FAILURE DETECTION

  > A. The Heartbeat Model

The heartbeat model [15] is used in most distributed systems. Every process p periodically sends an “I am alive” heartbeat message to the process q. The period is the heartbeat interval Δ_i.

If q does not receive a heartbeat message from p after a timeout delay Δ_to, p is added to the list of suspected processes. If q receives the heartbeat message from p later, then q removes p from its list of suspected processes, as shown in Fig. 1.

[Fig. 1.] The heartbeat model.

The heartbeat interval Δ_i:Δ_i is the time between two emissions of the “I am alive” heartbeat message. The timeout delay Δ_to:Δ_to is the time between the last reception of the “I am alive” message from p and the time where q starts suspecting p. The transmission delay Δ_tr:Δ_tr is the time between the emission of the heartbeat message and the reception of the heartbeat message.

  > B. The Interaction Model

The process q monitors every process p by sending an “Are you alive?” message to p periodically. Once p receives the message from q, it replies with an “I am alive” message to q. If q does not receive the message from p after a timeout delay Δ_to, p is added to the list of suspected processes. If q receives the heartbeat message from p later, then q removes p from its list of suspected processes, as shown in Fig. 2.

[Fig. 2.] The interaction model.

  > C. The Dual Model of Heartbeat and Interaction

  > The heartbeat failure detector sends half as many messages as the interaction detector for the same detection quality, so the heartbeat failure detector is used in most of the distributed systems. However, it often mistakes actual failure statuses due to loss of packets or long delays in a complicated network. The interaction failure detector needs more detection time and sends more messages than the heartbeat detector, but it can be implemented as a requirement, and the detection result is independent of the message.

In this paper, we will use a dual model of heartbeat and interaction. The dual model is organized in two steps. First, the heartbeat model is used to detect the process failure. If a process p is suspected, then the interaction model is used to check the process p further. The detailed procedure is as shown in Fig. 3.

[Fig. 3.] The dual model of heartbeat and interaction.

Ⅲ. AN EXPONENTIAL SMOOTHING ADAPTIVE FAILURE DETECTOR

Chen’s estimation is a classical adaptive failure detector algorithm. It estimates the arrival time of the next heartbeat message according to the historical time sequences of the heartbeat messages. In addition, a constant safety margin is added to raise the detector’s accuracy.

The process q receives n heartbeat messages denoted m₁, m₂, m₃, ..., m_n. The receipt time of the messages are presented as A₁, A₂, A₃, ..., A_n and Δ_i is the heartbeat interval. Then, the estimation arrival time of the next message can be expressed as follows:

[]

The timeout checking point is

where is ∂ the constant safety margin.

[]

Chen’s failure detection algorithm can estimate the arrival time of the next heartbeat message dynamically. However, it uses a constant safety margin, and the suitable constant ∂ is difficult to determine in a condition with a complex network. In addition, the algorithm uses the average transmission delay of the most recent k messages to estimate the next transmission time. And it is only suitable for the fluctuation sequence with small fluctuations. If the delay sequences increase or decrease in a linear manner, Chen’s estimation does not correspond to the real delay.

An improved estimation algorithm based on exponential smoothing is proposed to overcome the shortcomings of the base algorithm. Exponential smoothing [16] is a technique that can be applied to a time data sequence. It makes forecasts according to a series of historical data.

The delay data sequence is represented by {x_i}, x_i = A_i − Δ_t*i and the estimation of the exponential smoothing algorithm is written as . The simplest form of exponential smoothing is given by the formulas

[]

where α is the smoothing factor, and 0 < α < 1.

The above simple exponential smoothing does not perform well when there is a trend in the data. In such situations, double exponential smoothing is used to estimate the following data in a sequence with a linear trend. Double exponential smoothing works as follows:

[]

where a_i is the estimated level at time i and b_i is the estimated trend at time i.

Then,

[]

[]

Another improvement in the algorithm is that the dynamic safety margin is calculated using the mean square deviation. As shown in Fig. 1, if the two adjacent heartbeat messages have the same transmission delay, then, A_i+1 = A_i+Δ_t , so A_i+1 – (A_i+Δ_t) can represent the fluctuation in the adjacent transmission delay, and it may be positive or negative.

Finally, we have

[]

[]

Algorithm 1 is the exponential smoothing adaptive failure detector (ESA_FD) algorithm.

Ⅳ. PROOF

Our failure detection algorithm can meet the requirements requirements of class ◇ P. In other words, it has the properties of strong completeness and eventual strong accuracy.

Theorem 1. Strong completeness.

Every crashed process p is permanently suspected by every correct process

∃t0 :∀t ≥ t0, ∀q∈correct(t),∀q∈crashed p∈suspectq(t).

Before providing proof, we defined several parameters:

Δmax: The unknown bound time threshold on the message transmission between processes p and q.mk: The kth message that process p sends to q.tsk : The time when process p sends mk to q.trk : The time when process q receives mk.tc : The time of a crash of process p.

When the process p crashes at t_c, p stops sending “I am alive” messages. Then ts_k ≤ t_c, so there is an upper bound t_c + Δ_max after the time when process q can’t receive any more messages from p.

The next thing to be proven is that τ_i+1 exists as an upper bound.

where x_max is the maximum value in {x_i}.

So there is an upper bound (i + 1) * Δ_i+ * x_max + 3Δ_max in the worst condition after which process q starts suspecting process p if it does not receive any messages from p.

Theorem 2. Eventual strong accuracy.

Theorem 2 states that there is an upper bound time after which the correct processes are not suspected by any correct process.

∃t_bound, ∀t≥t_bound, ∀q,

p∈correct(t), p∉suspect_q(t)

However, since the maximum transmission delay is Δ_max, the dual model of heartbeat and interaction is used in this paper. In the worst network conditions, the checkpoint isτ_i+2 Δ_max, so a correct process will not be suspected by any correct process.

Ⅴ. EXPERIMENT

Two computers are used to build the experimental platform in a local area network. One machine runs a program to send heartbeat messages (like process p), and the other machine runs the monitored program to receive heartbeat messages (like process q). We simulate complex network conditions by uploading files at a random time, and neither machines fail during the experiments.

In the paper, we compare ESA_FD to conventional FD and Chen’s FD. The conventional FD sets a fixed timeout Δ_to. If the monitor process does not receive the heartbeat message after Δ_to, it begins to suspect the monitored process. A good Δ_to is very important to the FD. If Δto is short, it’s easy to make a wrong suspicion. Otherwise, it results in a long time expense in detection time, but this is hard to determine under complex network conditions.

Chen’s FD estimates the arrival time of the next heartbeat message dynamically and it uses the average transmission delay of the most recent k messages to compute the next transmission time. To improve the accuracy, an additional safety margin ∂ is added. However, ∂ is a constant value in Chen’s FD.

Experiment 1. Transmission delay contrast.

We count thirty transmission times of the heartbeat messages and the heartbeat interval is set to 2 seconds. We compute the transmission delay according to the estimation of the arrival time by using Chen’s FD and ESA_FD. The result is shown in Fig. 5.

[Fig. 5.] The transmission delay contrast. ESA_FD: exponential smoothing adaptive failure detector.

Since the average delay is used to estimate the arrival delay in Chen’s FD, Chen’s FD is only suitable for a series of random data and there is a big difference between the real transmission delay and Chen’s estimation when the data has a linear trend. Whereas the estimation in ESA_FD confirms the real transmission delay very well, exponential smoothing uses a weight value to emphasize the recent data and ESA_FD is suitable for the data not only for random but also for linear trends.

Experiment 2. The average mistake rate contrast.

Two groups of tests are involved in computing the average mistake rate. The heartbeat interval of the first group is set to 2 seconds and the other is set to 5 seconds. The fixed timeout in the conventional FD is set to 1.8 seconds and the safety margin in Chen’s FD is set to 0.5 seconds. The result is shown in Fig. 6.

[Fig. 6.] The average mistake rate contrast. ESA_FD: exponential smoothing adaptive failure detector.

Since the conventional FD uses a fixed timeout to detect a failure, and it is difficult to choose the fixed value to fit complex network conditions, so the fixed timeout of 2 seconds may not be the most suitable value since it results in a high mistake rate of about 10%. Chen’s FD is better than the conventional FD, but it is not suitable for data with a linear trend. Therefore, the safety margin is an important adjustment for the result, and in the same manner as the conventional FD, it is hard to determine the value of the safety margin. The ESA_FD in this paper has a very low mistake rate providing accurate estimation to the transmission delay and dual detection starts while a process timeouts first. This further raises the accuracy of detection.

Experiment 3. The average detection time contrast.

The parameters are set as in Experiment 2. In the experiment, we count the detection time and the result is shown in Fig. 7.

[Fig. 7.] The average detection time contrast. ESA_FD: exponential smoothing adaptive failure detector.

EFA_FD uses dual detection while a process is suspected, so it takes more time than Chen’s FD. However, since the estimation to the transmission is more precise, the amount of time for dual detection is shorter, and the difference between the Chen’s FD and ESA_FD is very small. So EFA_FD raises the detection accuracy by increasing detection time by a small margin.

Ⅵ. CONCLUSION

ESA_FD is an improved adaptive failure detector, which is more suitable for complex network conditions than Chen’s FD. On the one hand, ESA_FD uses exponential smoothing to predict the next data and the prediction through exponential smoothing is more accurate than Chen’s prediction, which is calculated with the average value, because Chen’s FD is only suited for random data with little fluctuation. Hence, ESA_FD can more accurately predict the series of data not only for random, but also in linear trends in applications where the weight value is of exponential smoothing. On the other hand, ESA_FD calculates the safety margin with the mean square deviation and it varies dynamically according the historical transmission delay. ESA_FD meets the requirements of strong completeness and eventually strong accuracy and the experimental results show that ESA_FD can increase the accuracy substantially by increasing the detection time by a small amount.

참고문헌

1. Chandra T. D., Toueg S. 1996 Unreliable failure detectors for reliable distributed systems [Journal of the ACM] Vol.43 P.225-267
2. Bansal S., Sharma S., Trivedi I. 2012 Adaptive staircase multiple failure detector for parallel and distributed image processing [in Proceedings of the 1st International Conference on Recent Advances in Information Technology] P.91-94
3. Tian D., Mao T., Xie J. 2009 “Adaptive failure detection algorithm for grid systems,” in Fuzzy Information and Engineering
4. Ma T., Hillston J., Anderson S. 2010 On the quality of service of crash-recovery failure detectors [IEEE Transactions on Dependable and Secure Computing] Vol.7 P.271-283
5. Wu C., Wu K., Feng L., Tian D. 2010 NN-SA based dynamic failure detector for services composition in distributed environment [in Advanced Data Mining and Applications] P.443-450
6. Luan L., Liu S. F., Zhang X. J. 2008 Research and improvement of failure detector algorithm based on fresh point [Journal of Jilin University (Science Edition)] Vol.46 P.681-686
7. Chen W., Toueg S., Aguilera M. K. 2002 On the quality of service of failure detectors [IEEE Transactions on Computers] Vol.51 P.561-580
8. Bertier M., Marin O., Sens P. 2002 Implementation and performance evaluation of an adaptable failure detector [in Proceedings of the International Conference on Dependable Systems and Network] P.354-363
9. Jacobson V. 1988 Congestion avoidance and control [ACM SIGCOMM Computer Communication Review] Vol.18 P.314-329
10. Martin C., Larrea M., Jimenez E. 2007 On the implementation of the Omega failure detector in the crash-recovery failure model [in Proceedings of the 2nd International Conference on Availability, Reliability and Security] P.975-982
11. Shi L., Hou Y. S. 2010 Adaptive failure detection model based on message delay prediction [Journal of Computer Applications] Vol.30 P.1312-1315
12. Hayashibara N., Defago X., Yared R., Katayama T. 2004 The ？ accrual failure detector [in Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems] P.66-78
13. Liu B., Yang S., Shi L., Ding X., Zhang Q. 2011 Modeling of failure detector based on message delay prediction mechanism [Journal of Software] Vol.6 P.1821-1828
14. Bertier M., Marin O., Sens P. 2003 Performance analysis of a hierarchical failure detector [in Proceedings of the International Conference on Dependable Systems and Networks] P.635-644
15. Hayashibara N., Takizawa M. 2007 Performance analysis of the interrogation-based failure detector [in Proceedings of the 27th International Conference on Distributed Computing Systems Workshops] P.34
16. Exponential smoothing

이미지 / 테이블

[ Fig. 1. ] The heartbeat model.
[ Fig. 2. ] The interaction model.
[ Fig. 3. ] The dual model of heartbeat and interaction.
[ Fig. 5. ] The transmission delay contrast. ESA_FD: exponential smoothing adaptive failure detector.
[ Fig. 6. ] The average mistake rate contrast. ESA_FD: exponential smoothing adaptive failure detector.
[ Fig. 7. ] The average detection time contrast. ESA_FD: exponential smoothing adaptive failure detector.