Load Shedding for Temporal Queries over Data Streams

Al-Kateb Mohammed; Lee Byung Suk

doi:10.5626/JCSE.2011.5.4.294

OA학술지
Journal of Computing Science and Engineering

Load Shedding for Temporal Queries over Data Streams

DOI : 10.5626/JCSE.2011.5.4.294
Author: Al-Kateb Mohammed, Lee Byung Suk
Organization: Al-Kateb Mohammed; Lee Byung Suk
Publish: Journal of Computing Science and Engineering Volume 5, Issue4, p294~304, 30 Dec 2011

ABSTRACT

Load Shedding for Temporal Queries over Data Streams

KEYWORD

Algorithms , Load shedding , Data streams , Temporal query processing

본문

Collapse all

I. INTRODUCTION

Continuous queries [1] are standing queries, which typically run indefinitely. These queries are central to applications deal-ing with continuous and unbounded data streams [2]. Many of these applications handle data whose values may change over time. For this class of application, enhancing continuous queries with temporal functions and predicates [3] enriches the seman-tics of queries and, consequently, enables users to define tempo-ral expressions in their queries. In this paper we refer to this class of queries as continuous temporal queries.

Unlike traditional continuous queries that typically return only the values of specified attributes, a continuous temporal query returns the valid time interval of each value as well the attribute values themselves [4]. The selection predicates in this type of query includes a temporal predicate for the time inter-vals associated with the attribute values satisfying the value predicate.

To guarantee the correctness of the result of a temporal query, adjacent or overlapping timestamps of value-equivalent tuples should be merged prior to the evaluation of the temporal function or predicate specified in the query. This merge process is referred to as temporal coalescing [5]. This coalescing is essential for temporal query processing because queries evalu-ated on uncoalesced data may generate incorrect answers [5, 6]. The same is true for continuous temporal queries over data streams [7].

For stream applications, it is not uncommon to have insuffi-cient system resources to process or keep all tuples arriving from the input data stream in memory [2]. In these situations, discarding a fraction of tuples, called load shedding [8], often becomes necessary to resolve the problem. With load shedding in place, however, the system cannot produce an exact result, thus ensuring the accuracy of an approximate query result is an important issue [9].

In this paper, we study the problem of load shedding from a data stream in the presence of temporal coalescing, with a par-ticular focus on the case of insufficient memory to keep all tuples in the window specified in a continuous temporal selec-tion query. The presence of coalescing as an integral step in temporal query processing poses two major challenges for the load shedding mechanism.

The first challenge is that the accuracy metrics used in the existing load shedding methods (e.g., the number of tuples in the approximate result as in [10-16] or the relative deviation between estimated and actual aggregate values as in [17, 18]) are not suitable for continuous temporal queries. Since a tempo-ral query returns both attribute values and their valid time inter-vals, the accuracy metric should take both into account. The second challenge is a direct consequence of the first challenge, that is, the existing load shedding algorithms (whether random load shedding as in [17] or semantic load shedding as in [11]) are therefore inadequate for temporal queries.

To address the first challenge, we introduce a new accuracy metric that combines a value similarity metric and an interval similarity metric. The proposed metric is not only intuitive and semantically correct but also has theoretical backing. Specifi-cally, the value similarity metric is based on the Jaccard coeffi-cient [19], and the interval similarity metric is based on PQI interval orders [20]. The Jaccard coefficient is a common statis-tic used for comparing the similarity between two sets. PQI interval orders provide a similarity metric for comparing inter-vals (mapped from elements).

In response to the second challenge, we present a new “coa-lescence-aware” algorithm that keeps the effects of coalescing in mind when deciding which tuples to drop. Making an optimal load-shedding decision requires holding all tuples in memory but, because the available memory is limited, we propose a greedy algorithm that takes a quadratic running time and con-sumes linear memory space. It employs a greedy strategy that combines two greedy objectives, one targeting the value simi-larity and the other targeting the interval similarity. More spe-cifically, it tries to minimize the number of extraneous tuples and the length of coalesced intervals lost in the approximate result as a consequence of dropping tuples.

A performance study compares the proposed coalescence-aware algorithm to a conventional random load shedding algo-rithm. (Note that other existing load-shedding algorithms are not comparable, since they do not work with coalescing.) The main purpose of the performance study is to compare the two mechanisms in terms of the achieved accuracy. The experiment results show that the accuracy improvement of our proposed mechanism over the random mechanism approaches an order of magnitude as the coalescing probability increases and the avail-able memory size decreases. We have also examined the effects of the two greedy objectives on the performance of the pro-posed coalescence-aware algorithm and confirmed that the objectives are well balanced in their influence on the accuracy achieved by the greedy algorithm.

The main contributions of this paper can be summarized as follows.

■ It introduces the new problem of load shedding for continuous temporal queries over a windowed data stream.

■ It identifies coalescing as an important barrier to overcome, and proposes a new accuracy metric and a new algorithm for load shedding while taking coalescing effects into consider-ation.

■ It presents a thorough performance study on the proposed coa-lescence-aware algorithm and two partially coalescence-aware algorithms (each reflecting one of the two greedy objectives).

The rest of this paper is organized as follows. Section II pro-vides a relevant background and introduces the model of contin-uous temporal queries assumed in this paper. Section III presents the proposed accuracy metric and load shedding algo-rithm. Section IV presents the experiments and discusses the results. Section V reviews related work. Section VI concludes this paper and suggests future work.

II. CONTINUOUS TEMPORAL QUERIES OVER DATA STREAMS

We characterize continuous temporal queries by their support for temporal functions and predicates. These queries can be seen as a hybrid of traditional continuous queries over data streams [1] and temporal queries over temporal databases [4]. In this section, we first summarize the models of these two base query types and then introduce and motivate the model of the hybrid query type.

  > A. Continuous Queries Over Data Streams

A data stream is an unbounded sequence of tuples, and typi-cal queries on data streams run continuously for an indefinitely long time period [1, 2]. We assume that each tuple arriving from a data stream is timestamped. Additionally, we assume that all tuples arrive in an increasing order of their timestamp and that tuples in a window are maintained in an increasing order of their timestamps.

In many cases, only tuples bounded by a window on a data stream are of interest at any given time [2]. A window may be tuple-based or time -- based depending on whether its size is the number of tuples or the window time span. The solutions pre-sented in this paper --the proposed accuracy metric and the load shedding algorithm- work well with either window model.

An example continuous query is shown below, considering a wireless sensor network in which sensors are mounted on weather boards to collect timestamped topology information along with humidity, temperature, etc. [21]

Example 1. Continuous Query

Assume a query that monitors humidity values in the past 15 minutes and continuously outputs, at every minute, humidity values and their associated timestamp values if it finds that the humidity exceeds 75%. This query can be expressed as follows using the Continuous Query Language (CQL) syntax [22].

SELECT S.humidity, S.timestamp

FROM Stream S RANGE 15 MINUTE SLIDES 1 MINUTE

WHERE S.humidity > 75;

In this query syntax, RANGE specifies the window size and SLIDES specifies the sliding window interval.

  > B. Temporal Queries Over Temporal Databases

Time is a ubiquitous aspect in almost all real-world phenom-ena that many real-world applications deal with data whose val-ues may change over time. All these applications thus have intrinsic needs to model time, and it has motivated the database community to undertake an extensive study of temporal data management for relational databases [4], semi-structured data-bases (e.g., XML [23]), and data streams [24].

Temporal databases make it possible for the user to store the history of data and to retrieve this history through temporal que-ries. The importance of such capabilities has become evident with significant commercial interests ensuring that major DBMS products [25, 26] are currently supporting temporal data management as core features.

Two time dimensions are of interest in the handling of tem-poral data -- valid time interval and transaction time interval [4]. The former is the time interval during which the stored fact is true in reality. The latter is the time interval during which the fact is present in the database. A temporal database can support the former only (called a valid-time database), the latter only (called a transaction-time database), or both (called a bitempo-ral database). In this paper we consider a valid-time database.

In addition, temporal databases can support either attribute or tuple timestamping depending on whether the time interval is associated with individual attributes or with the entire tuple of all attributes [4]. In this paper we consider tuple timestamping.

Temporal databases can be queried using Temporal SQL (TSQL) [27], which is essentially SQL that has been extended with temporal predicates and functions. The correct evaluation of temporal predicates and functions demands temporal tuples to be coalesced first. As already mentioned, coalescing is a fun-damental operation in any temporal data model, and is essential to temporal query processing since queries evaluated on uncoa-lesced data may generate incorrect answers [5, 6] (see the exam-ple below).

Example 2. Importance of Coalescing

Consider the temporal table Employee in Table 1, where for simplicity only Andy's employment history records are shown. The second tuple reflects a change of his salary and the third tuple reflects a change of his department.

Suppose the manager wants to know which employees worked for the R&D Department for more than six consecutive years. The query can be expressed as follows in TSQL [27].

SELECT E.NAME, VALID (E)

FROM Employee E (Name, Department) as E

WHERE CAST (VALID (E) AS INTERVAL YEAR) > 6;

In this syntax, E(Name, Department) specifies coalescing over Name and Department, VALID(E) returns the valid time interval of tuples in the result, and the CAST clause converts the valid time interval to the specified granularity, i.e., year. The query returns a pairs of values: Name and the interval during which the values of the coalescing attributes stay the same. With coalescing, the first two tuples in Table 1 (both of Andy in R&D)

[Table 1.] Andy’s employment records

Andy’s employment records

are merged because their valid time intervals are adjacent to each other. The merged time interval, [2001, 2008], is greater than six years, and so Andy is in the query result. Without coa-lescing, in contrast, Andy cannot be in the query result because the time interval of neither the first nor the second tuple alone is greater than six years.

  > C. Continuous Temporal Queries Over Data Streams

As already mentioned, we see the continuous temporal stream query model as a hybrid of the continuous stream query model and the temporal database query model. What this entails in the temporal representation of tuples is that, when a new tuple, s_i, is added to a window, we model the valid time interval of its preceding tuple, s_i-1, as (t_i-1, t_i), where t_i-1 and t_i are the timestamps of s_i-1 and s_i, respectively. Moreover, inherited from temporal database queries, continuous temporal stream queries output not only the values of the coalesced attributes but also their valid time intervals. As mentioned in Section II-B, the cor-rect evaluation of temporal predicates and functions in a contin-uous temporal query requires that value-equivalent tuples with adjacent time intervals should be merged before a query can be evaluated for them.

In this paper, we consider continuous temporal selection que-ries (see Example 3 below) as the query type. A temporal selec-tion query is adequate enough for our purpose of studying the effect of load shedding. Besides, we believe queries of this type are useful in a wide range of stream applications.

Example 3. Temporal Continuous Selection Query

In the application assumed in Example 1, consider another query for monitoring the stream of sensor readings across arriving temporal tuples. Every minute, the query processor outputs the humidity value and the associated time interval if it finds that the value has been the same for 3 consecutive minutes or longer. This query can be expressed as follows using the syn-tax borrowed from CQL [22] and TSQL [27].

SELECT S.humidity, VALID (S)

FROM Stream (humidity) as S RANGE 15 MINUTE SLIDES 1 MINUTE

WHERE CAST (VALID (S) AS INTERVAL MINUTE) >= 3;

III. COALESCENCE-AWARE LOAD SHEDDING

As mentioned in Section I, making load shedding coales-cence-aware requires a new accuracy metric and a new load-shedding algorithm because the existing ones do not work with coalescing. In this section, we propose a new accuracy metric in Section III-B and a new algorithm in Section III-C.

  > A. Preliminaries

Definition 3.1 Coalesced tuple. A coalesced tuple CT_i is defined as a pair of an attribute value and a time interval of the coalesced tuple (Fig. 1).

Definition 3.2 Exact query result. An exact result of a contin-uous temporal query, denoted as ε, is defined as a sequence of exact coalesced tuples ？CT₁, CT₂, ...CT_n？ (Fig. 2).

Definition 3.3 Approximate query result. An approximate result of a continuous temporal query, denoted as A, is defined as a sequence ？A₁, A₂, .. .A_n？ where A_i is the set of approximate coalesced tuples CT_ij resulting from the exact coalesced tuple

[Fig. 1.] An example of coalesced tuples.

[Fig. 2.] An example of an exact query result.

[Fig. 3.] An example of an approximate query result.

CT_i due to load shedding (Fig. 3).

Property l. Bounding Property on Approximate Coalesced Tuples

As a result of load shedding, an exact coalesced tuple CT_i reduces to zero or more approximate coalesced tuples in A_i ∈ A such that the attribute values of all the approximate coalesced tuples are the same as that of CT_i and the time interval of every approximate coalesced tuple is contained in that of CT_i.

Proof: This property can be proved through a case analysis. As a result of load shedding, one and only one of the following four cases can happen to CT_i.

■ Case 1: CTi is retained without any alteration. This case hap-pens when none of the tuples coalesced in CTi is dropped.

■ Case 2: An exact coalesced tuple disappears in its entirety. This case occurs when a tuple not coalesced with its adjacent tuples in the exact result is dropped.

■ Case 3: An exact coalesced tuple is decreased in its interval. This case occurs when a tuple which is either the first or the last of the tuples coalesced in the exact result is dropped.

■ Case 4: An exact coalesced tuple is split into two approximate coalesced tuples of shorter intervals. This case occurs when a tuple in the middle of the tuples coalesced in the exact result is dropped.

It is obvious in all four cases that the bounding property holds.

  > B. Accuracy Metric

Since the result of a continuous temporal query consists of both the values of coalesced attributes and their associated valid time intervals (recall Definitions 3.2 and 3.3), an accuracy met-ric for this class of queries should take both into account. In this regard, we propose an accuracy metric based on the concept of the Jaccard coefficient [19] and the concept of PQI interval orders [20].

The Jaccard coefficient, J (D_x, D_y), measures the similarity between two data sets, D_x and D_y, and is defined as the ratio between the cardinality of their intersection and the cardinality of their union.

We use the Jaccard coefficients to compare the coalesced attribute values between the exact and the approximate query result, that is, between ε ≡ ？CT₁, ..., CT_n？ and A ≡ ？A₁, ..., A_n？. Note that there may be more than one coalesced tuple in A_i for each CT_i in ε (recall Definitions 3.2, 3.3, and Property 1). We thus need to apply Equation 1 to multisets. That is,

where

and ？ are the multiset intersection and union opera-tors, respectively (Let f(D₁, x) and f(D₂, x) be the multiplicity of an element x in the multisets D₁ and D₂, respectively. Then, f(D₁

D₂, x) and f(D₁ ？ D₂, x) are defined as min{f(D₁, x), f(D₂, x)}, and max{f(D₁, x), f(D₂, x)}, respectively). Note |E_i| equals 1 since it is a singleton set. The numerator term min(1, |A_i|) equals 0 only if the exact coalesced tuple disappears as a result of load shedding (case 2 in the proof of Property 1) but other-wise will always equals 1.

Equation 2 considers only the coalesced attribute values, and so it should be modified to consider the interval similarity factor as well. For this purpose, we adopt PQI interval orders used to compare intervals [20, 28]. In this scheme, given two intervals x and y, the similarity between them is defined as the length of their interval intersection (？) divided by the length of their interval union (？).

(See Appendix for more information about PQI interval orders.)

We use the PQI interval orders to compare an exact coa-lesced time interval with the set of approximate coalesced time intervals derived from it. Let I_i and I_ij be the coalesced intervals of CT_i and CT_ij, respectively. We know from Property 1 that I_ij is a subinterval of I_i for every j. Hence, the interval similarity for a given exact coalesced tuple is computed as follows.

Equation 4 essentially computes the interval reduction ratio of the exact coalesced tuple CT_i due to load shedding. Its value equals 1 only if CT_i is retained intact as a result of load shedding (the case 1 in the proof of Property 1), but otherwise is always less than 1. In the proposed accuracy metric, we use this ratio to scale the intersection term of CT_i in Equation 2.

Note that min(1, |A_i|) equals either 0 or 1 and that if min(1, |A_i|) = 0 (i.e., the case 2) then PQI (CT_i, A_i) = 0. Hence, Equa-tion 5 can be further simplified as follows.

This accuracy metric has the nice property of limiting the computed accuracy to a maximum of 1.0 (i.e., 100%) and allowing for intuitive interpretations as illustrated in the follow-ing example.

Example 4. Proposed Accuracy Metric.

As shown in Fig. 4, consider an exact result (ε) ？ resulting from coalescing five actual tuples ？ and five possible approxi-mate results (A₁, A₂, A₃, A₄, A₅) ？ each resulting from drop-ping one of the five actual tuple. Using Equation 6, the accuracy of each approximate result is calculated as follows.

[Fig. 4.] An example of load shedding.

We see that the resulting accuracy is the lowest in A₃, where the third tuple, <83, [7, 10)> is dropped. In this case, it intro-duces a temporal gap between the second and the fourth tuples (which are coalesced with the third tuple in the exact result). This causes an extra attribute value (hence we get a 2 in the denominator) and a missing part of the exact temporal interval (hence we get 6/9 in the numerator) in the approximate result. On the other hand, the highest accuracy is achieved in either A₂, where the second tuple is dropped, or in A₄ where the fourth tuple is dropped. The reason is that dropping either tuple causes only the missing part of the exact temporal interval (as in A₃) and neither has an extra attribute value (as in A₃) nor is miss-ing the entire interval (as in A₁ and A₅).

  > C. Coalescence-Aware Load Shedding Algorithm

The problem of coalescence-aware load shedding can be stated formally as follows.

Problem definition: Given the memory M (of size |M| tuples) and a continuous temporal query which specifies coalescing tuples in a window WS (of size |WS| tuples) over a data stream S, we discard |WS| ？ |M| tuples from the |WS| tuples with the objective of maximizing the accuracy of the query output subject to the constraint that |M| < |WS|.

Finding an optimal solution would require holding all |W_S| tuples in memory to decide on the optimal load shedding deci-sion, but the available memory can hold only |M| tuples (as |M| < |W_S|). Even if it were possible (which it is not), it would be computationally intractable to find from

possible sub-sets an optimal subset of tuples to discard. This problem is harder than the NP-hard nonlinear binary integer programming problem, as there is no well-formed functional form for com-puting the accuracy for a given set of binary integer assign-ments (e.g., 1 for retaining a tuple and 0 for discarding a tuple). So, we propose a greedy algorithm, which takes a quadratic run-ning time in the worst case and consumes a linear amount of storage space.

The proposed algorithm decides which tuple to drop from memory upon the arrival of each new tuple from the input data stream. The algorithm uses a greedy strategy based on two objectives for achieving value similarity (Equation 2) and inter-val similarity (Equation 4).

Objective 1. Dropping a tuple may introduce an extraneous tuple in an approximate result if it causes a temporal gap between tuples that are coalesced in the exact result. Let δ⁺_sj denote the number of such extraneous tuples resulting from dropping a tuple s_j. Note that δ⁺_sj is 1 in the case 4 in the proof of Property 1 and is 0 in other cases. Evidently, we want to mini-mize the number of extraneous tuples, and, thus, we set the first objective to drop a tuple with the smallest δ⁺_sj (i.e., = 0) first. A tie is broken by random choice.

Objective 2. Dropping a tuple may also cause either all or part of the time interval of an exact coalesced tuple to be miss-ing in the approximate result. Let ρ^？_sj be the ratio of the length of the time interval missing due to the dropping of s_j over the length of the time interval of the exact coalesced tuple. Note that ρ^？_sj is 0 in the case 1 in the proof of Property 1, is 1 in the case 2, and is bounded between 0 and 1 in other cases. Then, the second objective is to drop a tuple with the smallest ρ^？_sj first in order to reduce these missing intervals. Like Objective 1, a tie is broken by random choice.

There are different ways to combine δ⁺_sj and ρ^？_sj in the objec-tive function of the greedy algorithm. The sum of them is used here (It can be a weighted sum if we want to give a bias in favor of either objective.). We believe summation is better than a product, as the resulting value is always greater than 0. The product gives 0 when δ⁺_sj or ρ^？_sj is 0.

For example, Table 2 shows the values of δ⁺_sj and ρ^？_sj for each

[Table 2.] δ+sj and ρ?sj for tuples in Example 4.

δ+sj and ρ?sj for tuples in Example 4.

of the five actual tuples in Example 4. It shows that the tuples that have a minimum value of δ⁺_sj + ρ^？_sj are s₂ and s₄. The algo-rithm thus decides to drop either s₂ or s₄. Recall the conclusion in Example 4 that the highest accuracy is achieved in either A₂, where s₂ is dropped, or in A₄ where s₄ is dropped. This demon-strates how the accuracy metric introduced in Section III-B is reflected in our proposed load shedding algorithm.

Algorithm 1 outlines the steps of load shedding using the combined objective function. Upon the arrival of a new tuple s_i (Line 1), if the memory M is not full yet, s_i is added to W_S (Line 4), but otherwise a tuple s_m which gives a minimum value of δ⁺_sj + ρ^？_sj among all s_j ∈ M∪{s_i} is found (Lines 7-14) and dropped from the window (Line 16).

[Algorithm 1.] Coalescing-aware load shedding (CALS)

Coalescing-aware load shedding (CALS)

It is easy to see that the running-time complexity is O(|M|²), since the algorithm scans the |M|+1 tuples in M∪{s_i} [15] lin-early and, for each tuple, the computation of δ⁺_sj and ρ^？_sj takes O(|M|) time in the worst case scenario. The complexity is lower in practice, as computing δ⁺_sj and ρ^？_sj involves only those tuples coalesced to form the same exact tuple. The storage space com-plexity is O(|M|), as it suffices to have enough memory to hold the |M|+1 tuples and two numbers current_min and m.

In the experiments presented in Section IV, we use two addi-tional algorithms, called coalescence-aware load shedding (CALS)-V and CALS-I. CALS-V is the CALS algorithm reduced to consider only the value similarity (hence using δ⁺_sj only), and CALS-I is that considering only the interval similarity (hence using ρ^？_sj only).

IV. PERFORMANCE STUDY

We conducted experiments to study the performance of the proposed load shedding algorithm. The main objective of the experiments was to observe the effectiveness of the proposed CALS algorithm. Additionally, we compare CALS with two partially coalescence-aware algorithms, CALS-V and CALS-I, to observe how each of the two complementary greedy objec-tives affects the performance individually. We use the random load shedding (RAND) algorithm as the baseline approach (Under this random load shedding mechanism, tuples are dis-carded randomly such that each tuple has the same probability of being discarded from (or, equivalently, retained in) mem-ory.). As anticipated, the experiment results show that CALS achieved the highest accuracy, RAND achieved the lowest, and CALS-V and CALS-I are in between.

The experiments were conducted using Matlab (Release R2010b) on a 64-bit machine with 4.00 GB internal main mem-ory. We use the Matlab array/matrix data type to implement the window data structure.

  > A. Experiment Setup

1) Control Parameters: We identified two key parameters influencing the accuracy ofs load shedding algorithms signifi-cantly: memory ratio and coalescing probability. The coalesc-ing probability is the expected probability that a tuple in the input stream is coalesced with its preceding tuple upon arrival. The memory ratio is the ratio of available memory size over the specified window size, that is, the number of tuples that can be actually stored in the memory divided by the number of tuples that are supposed to be stored in the memory. In our experi-ments we set the window size to 500 tuples and vary the ratio from 50% to 90%. A natural expectation with these two param-eters is that the benefit of coalescence-awareness will be more visible in the resulting accuracy when the coalescing probability is higher or the memory ratio is lower. In addition, we consid-ered the threshold for time interval selection, which in our experiments is the lower bound on the selection interval speci-fied in the query (see the selection query in Example 3). The experiment results, however, show that the influence of this parameter value on the accuracy is insignificant.

2) Data Sets: We use five synthetic data sets simulating streams with the coalescing probabilities 10%, 30%, 50%, 70%, and 90%, respectively. The timestamp value, t_i, of each tuple in the data sets is selected randomly from within the next 10 incre-ments of time (i.e., within [t_i, t_i + 10)). (The particular length of the time unit and the size of the increment, whether constant or varying, are irrelevant for these experiments.) The real data sets contain weather measurement data collected from sensors deployed throughout the Intel Berkeley Research Lab to gather timestamped topology information along with humidity, tem-perature, light intensity, and voltage values. In order to make cases with different coalescing probabilities, we performed coa-lescing on different coalescing attributes such as humidity, volt-age, and light intensity. The resulting coalescing probabilities are approximately 29% (for humidity), 64% (for voltage), and 86% (for light intensity).

  > B. Experiment Results

For each algorithm, we measured the achieved accuracy as an average of the accuracies measured using each data set. The results are presented in two stages in this section. The first stage focuses on comparing CALS with RAND to see the effect of coalescence-awareness on the accuracy. The second stage focuses on examining the individual effects of the two partially coalescence-aware greedy objectives.

1) The Effect of Coalescence-Awareness: Figs. 5 and 6 show the accuracies achieved by CALS and RAND using the syn-thetic and real data sets, respectively. Both graphs show that CALS outperforms RAND in the entire range of coalescing probabilities and memory ratios. Furthermore, it clearly shows that the performance advantage of coalescence-awareness increases as the coalescing probability increases and as the available memory size decreases, approaching an order of mag-nitude as they approach the largest coalescing probability (i.e., 90%) and the lowest memory ratio (i.e., 50%) used in the exper-iments. In addition, as the figures show, there is little difference in the accuracy for varying the interval selection threshold

[Fig. 5.] Accuracies achieved by coalescence-aware load shedding (CALS) and random load shedding (RAND) on synthetic data sets.

[Fig. 6.] Accuracies achieved by coalescence-aware load shedding (CALS) and random load shedding (RAND) on real data sets.

value. This indicates that, while the load shedding produces more tuples of shorter coalesced intervals, the relative differ-ence between the sets of tuples selected in the case of exact and approximate query results does not change much for different threshold values.

2) The Effects of the Two Greedy Bbjectives: Figs. 7 and 8 show the accuracies achieved by all four algorithms (i.e., CALS, CALS-V, CALS-I, RAND) for varying coalescing prob-abilities and memory ratios, respectively. As expected, the accu-racies of CALS-V and CALS-I are bounded between the accuracy of CALS (upper bound) and the accuracy of RAND (lower bound).

In addition, we see that their performance is very close to each other regardless of the memory ratio. This is evident from the fact that the memory size determines the number of tuples to be dropped, and it has equitable effects on the proposed accu-

[Fig. 7.] Accuracies achieved by coalescence-aware load shedding (CALS) CALS-V CALS-I and random load shedding (RAND) on synthetic data sets.

[Fig. 8.] Accuracies achieved by coalescence-aware load shedding (CALS) CALS-V CALS-I and random load shedding (RAND) on real data set.

racy under both greedy objectives.

We also see that the performance of CALS-V changes little as the coalescing probability increases while CALS-I starts degrading as it increases beyond around 50%. Our reasoning is that CALS-V aims at deciding which tuples to retain that would otherwise cause time gaps (hence extraneous tuples) and so has little dependency on the coalescing probability itself, while in CALS-I it is more likely that tuples influencing the accuracy more (even if they have the shortest time interval) are dropped as the coalescing probability increases.

V. RELATED WORK

There are two separate aspects to the related works we look at: load shedding from a data stream and temporal coalescing over data streams.

  > A. Load Shedding from a Data Stream

The most common accuracy metric assumed in the research literature for load shedding over data streams is the size of approximate query result [10-16]. A few other existing studies, specific to aggregation queries, assume the accuracy metric of the relative error between approximate and actual aggregate values [17, 18]. For both of these accuracy metrics, the accom-panying algorithms are generally categorized as random load shedding (i.e., drop tuples randomly, as in [17, 29]) or semantic load shedding (i.e., drop tuples based on their relative impor-tance, as in [11]).

For maximizing the size of an approximate result (known as the max-subset), Das et al. [11] proposed two heuristics for dropping tuples from window-join input streams. The first heu-ristic sets the priority of a tuple being retained in memory based on the probability that a counterpart joining tuple arrives from the other data stream. The second heuristic favors a tuple based on its own remaining lifetime. Kang et al. [14] proposed a cost-based load shedding mechanism for evaluating window stream joins by adjusting the size of the sliding windows to maximize the number of produced tuples. Ayad and Naughton [10] pro-posed techniques for optimizing the execution of conjunctive continuous queries under limited computational resources by placing random drop boxes throughout the query plan to maxi-mize the plan throughput. Xie et al. [16] presented a stochastic load shedding mechanism for maximizing the number of result tuples in stream joins by exploiting the statistical properties of the input streams. Gedik et al. [12] proposed a load shedding approach for stream joins that adapts to stream rate and time correlation changes in order to increase the number of output tuples produced by a join query. Then, Gedik et al. [13] further addressed the problem of multiple-way joins over windowed data streams and proposed a window segmentation approach for maximizing the query output rate. Tatbul and Zdonik [15] pro-posed a load shedding operator that models a data stream as a sequence of logical windows and either drops or retains the entire window depending on the definition and characteristics of the window operator. For minimizing the relative error for aggregation queries, Babcock et al. [17] proposed a random sampling techniques for optimum placement of load shedding operators in aggregation query plans, measuring the resulting accuracy as the relative deviation between estimated and actual aggregate values. Law and Zaniolo [18] proposed a Bayesian load shedding approach for aggregation stream queries with the objective of obtaining an accurate estimate of the aggregation answer with minimal time and space overheads.

In contrast to all this existing work, we address a rather novel query model for data streams under which a query result is com-posed of both attribute values and the valid time interval of each value. Therefore, all the existing load shedding accuracy met-rics and algorithms are not applicable to our research problem.

  > B. Temporal Coalescing Over Data Streams

Barga et al. [30] proposed to employ temporal coalescing in view of complex event processing over data streams, so that two events are represented as a single event if their valid-time intervals overlap. Zaniolo [31] examined how temporal coalesc-ing can be expressed for a sequence of events using OLAP functions and Kleene-closure constructs. Kramer and Seeger [32] introduced coalescing as a physical operator for compact-ing the representation of a data stream by merging tuples with identical values and consecutive timestamps into a single tuple. Recently, we studied coalescing for a windowed temporal query processing over data streams and addressed the problems of updating a window’s extent and optimally selecting between eager and lazy coalescing for concurrent temporal queries [7].

While all this existing work is pertinent to temporal coalesc-ing over data streams, none of them addresses it from the view-point of memory being insufficient to retain all tuples. To the best of our knowledge, our work is unique in this regard.

VI. CONCLUSION

This paper addressed the problem of load shedding from a window for continuous temporal queries over data streams when memory is limited. The key challenges come from the fact that tuples need to be coalesced for temporal query process-ing and that this fact makes existing load-shedding algorithms and the accuracy metric used by them inapplicable. Thus, we proposed a new accuracy metric and a new algorithm that can handle coalescing challenges well. The accuracy metric com-bines a value similarity metric based on Jaccard coefficients and an interval similarity metric based on PQI interval orders. The algorithm takes a greedy approach in which our greedy strategy combines two objectives, each objective aims to maximize one of the two types of similarity. The algorithm takes polynomial running time and linear memory space, and the accuracy it achieves is far higher than that achieved by the baseline random load-shedding algorithm. Also, the two objectives combined into the greedy strategy make about equal contributions to the achieved accuracy.

For future work, there are several extensions possible. First, this paper focuses on the limited-memory case, and so the lim-ited-CPU time case is another problem we could look into. Sec-ond, this paper considers the temporal selection query, and other query types like aggregation or join queries are possibilities for further work. For this, the accuracy metric and the greedy strat-egy may need to be adapted to the query type. Third, this paper considers queries retrieving both the attribute values and their valid time intervals. Considering special cases, such as a snap-shot query (retrieving only attribute values) and the valid-time query (retrieving only valid time intervals) [27], may offer sim-pler and more efficient solutions. This too could be an interest-ing study.

  > Appendix. PQI Interval Orders

The PQI interval orders are used to compare intervals [28]. In this scheme, each element, x, of a given data set, D, is repre-sented by an interval through a lower bound function, L(x), and an upper bound function, U(x), such that ∀x ∈ D : L(x) < U(x). They provide three types of binary relations for interval com-parison: strict preference, weak preference (or hesitation), and indifference.

The strict preference relation (denoted as P) models the case in which the interval of one element precedes the interval of the other element (i.e., interval intersection of the two elements is empty) (Fig. 9a). Formally, given two elements x and y, x is said to be strictly preferred over y if L(x) > U(y). The weak prefer-ence relation (denoted as Q) is models the case in which the interval of one element overlaps the interval of the other ele-ment (i.e., non-empty intersection) (Fig. 9b). Formally, given two elements x and y, x is said to be weakly preferred over y if U(x) > U(y) > L(x) > L(y). The indifference relation (denoted as I) models the case in which the interval of one element contains the entire interval of the other element (Fig. 9c and 9d). For-mally, given two elements x and y, x is said to be indifferent from (i.e., similar to) y if U(x) > U(y) > L(y) > L(x) or U(y) > U(x) > L(x) > L(y).

The degree of overlap between two intervals can be used as a measure of the similarity between them [20]. Formally, given two elements x and y, the similarity between x and y can be defined as follows.

[Fig. 9.] Example of PQI relations between two intervals x and y.

This equation is equivalent to Equation 3 in Section III-B.

참고문헌

1. Babu S, Widom J 2001 “Continuous queries over data streams” [SIGMOD Record] Vol.30 P.109-120
2. Babcock B, Babu S, Datar M, Motwani R, Widom J June 3-5 “Models and issues in data stream systems” [Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems] P.1-16
3. Allen J 1983 “Maintaining knowledge about temporal intervals” [Communications of the ACM] Vol.26 P.832-843
4. Tansel A. U 1993 Temporal Databases: Theory Design and Imple-mentation
5. Bohlen M. H, Snodgrass R. T, Soo M. D September 3-6 “Coalescing in temporal databases” [Proceedings of the 22th International Con-ference on Very Large Data Bases] P.180-191
6. Dyreson C. E June 9-12 “Temporal coalescing with now granularity and incomplete information” [ACM SIGMOD International Confer-ence on Management of Data] P.169-180
7. Al-Kateb M, Kunta S. S, Lee B. S 2011 “Temporal coalescing on window extents over data streams” [IEICE Transactions on Information and Systems] Vol.E94-D P.489-503
8. Tatbul N, Liu L 2009 “Load shedding” Encyclopedia of Database Systems P.1632-1636
9. Motwani R, Widom J, Arasu A, Babcock B, Babu S, Datar M, Manku G, Olston C, Rosenstein J, Varma R January 5-8 “Query processing approximation and resource management in a data stream management system” [First Biennial Conference on Inno-vative Data Systems Research]
10. Ayad A. M, Naughton J. F June 13-18 “Static optimization of conjunc-tive queries with sliding windows over infinite streams” [Pro-ceedings of the ACM SIGMOD International Conference on Management of Data] P.419-430
11. Das A, Gehrke J, Riedewald M June 9-12 “Approximate join pro-cessing over data streams” [ACM SIGMOD International Confer-ence on Management of Data] P.40-51
12. Gedik B, Wu K. L, Yu P. S, Liu L October 31-November 5 “Adaptive load shed-ding for windowed stream joins” [Proceedings of the 14th ACM International Conference on Information and Knowledge Man-agement] P.171-178
13. Gedik B, Wu K. L, Yu P. S, Liu L April 15-20 “A load shedding framework and optimizations for M-way windowed stream joins” [Proceedings of the 23rd International Conference on Data Engineering] P.536-545
14. Kang J, Naughton J. F, Viglas S. D March 5-8 “Evaluating window joins over unbounded streams” [Proceedings of the 19th Interna-tional Conference on Data Engineering] P.341-352
15. Tatbul N, Zdonik S 2006 “Window-aware load shedding for aggregation queries over data streams” [Proceedings of the 32nd International Conference on Very Large Data Bases] P.799-810
16. Xie J, Yang J, Chen Y June 14-16 “On joining and caching stochastic streams” [ACM SIGMOD International Conference on Manage-ment of Data] P.359-370
17. Babcock B, Datar M, Motwani R March 30-April 2 “Load shedding for aggregation queries over data streams” [Proceedings of the 20th International Conference on Data Engineering] P.350-361
18. Law Y. N, Zaniolo C 2008 “Improving the accuracy of continu-ous aggregates and mining queries on data streams under load shedding” [International Journal of Business Intelligence and Data Mining] Vol.3 P.99-117
19. Jaccard P 1901 “Etude comparative de la distribution orale dans une portion des alpes et des jura” [Bulletin del la Socit Vaudoise des Sciences Naturelles] Vol.37 P.241-272
20. Ozturk M, Tsoukias A, Prade H, Subrahmanian V 2007 “Valued hesitation in intervals com-parison” Lecture Notes in Computer Science Vol. 4772: Scal-able Uncertainty Management (First International Conference SUM 2007 WashingtonDC USA October 10-12 2007. Pro-ceedings) P.157-170
21. “Intel Lab Data”
22. Arasu A, Babu S, Widom J 2006 “The CQL continuous query language: semantic foundations and query execution” [VLDB Journal] Vol.15 P.121-142
23. Wang F, Zaniolo C, Zhou X June 23-25 “Temporal XML? SQL strikes back!” [Proceedings of the 12th International Symposium on Temporal Representation and Reasoning] P.47-55
24. Srivastava U, Widom J June 14-16 “Flexible time management in data stream systems” [Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Sys-tems] P.263-274
25. “An Oracle White Paper Sep 2009: Oracle Database 11g Work-space Manager Overview”
26. “Teradata”
27. Snodgrass R. T, Ahn I, Ariav G, Batory D, Clifford J, Dyreson C. E, Elmasrik R, Grandik F, Jensen C. S, Kafer W, Kline N, Kulkarni K, Leung T. Y. C, Lorentzos N, Roddick J. F, Segev A, Soo M. D, Sripada S. M 1994 “TSQL2 language specification” [ACM SIGMOD Record] Vol.23 P.65-86
28. Tsoukias A, Vincke P 2003 “A characterization of PQI interval orders” [Discrete Applied Mathematics] Vol.127 P.387-397
29. Mozafari B, Zaniolo C March 1-6 “Optimal load shedding with aggre-gates and mining queries” [Proceedings of the 26th IEEE Inter-national Conference on Data Engineering] P.76-88
30. Barga R. S, Goldstein J, Ali M, Hong M January 7-10 “Consistent streaming through time: a vision for event stream processing” [Third Biennial Conference on Innovative Data Systems Research] P.363-374
31. Zaniolo C July 23-25 “Event-oriented data models and temporal queries in transaction-time databases” [Proceedings of the 16th Interna-tional Symposium on Temporal Representation and Reasoning] P.47-53
32. Kramer J, Seeger B 2005 “A temporal foundation for continuous queries over data streams” [Proceedings of the 11th Interna-tional Conference on Management of Data] P.70-82

OAK XML 통계

이미지 / 테이블

[ Table 1. ] Andy’s employment records
[ Fig. 1. ] An example of coalesced tuples.
[ Fig. 2. ] An example of an exact query result.
[ Fig. 3. ] An example of an approximate query result.
[ Fig. 4. ] An example of load shedding.
[ Table 2. ] δ+sj and ρ?sj for tuples in Example 4.
[ Algorithm 1. ] Coalescing-aware load shedding (CALS)
[ Fig. 5. ] Accuracies achieved by coalescence-aware load shedding (CALS) and random load shedding (RAND) on synthetic data sets.
[ Fig. 6. ] Accuracies achieved by coalescence-aware load shedding (CALS) and random load shedding (RAND) on real data sets.
[ Fig. 7. ] Accuracies achieved by coalescence-aware load shedding (CALS) CALS-V CALS-I and random load shedding (RAND) on synthetic data sets.
[ Fig. 8. ] Accuracies achieved by coalescence-aware load shedding (CALS) CALS-V CALS-I and random load shedding (RAND) on real data set.
[ Fig. 9. ] Example of PQI relations between two intervals x and y.