Load Shedding for Temporal Queries over Data Streams
 Authors: AlKateb Mohammed, Lee Byung Suk
 Published: Journal of Computing Science and Engineering, Volume 5, Issue 4, pp. 294-304, 30 Dec 2011

I. INTRODUCTION
Continuous queries [1] are standing queries, which typically run indefinitely. These queries are central to applications dealing with continuous and unbounded data streams [2]. Many of these applications handle data whose values may change over time. For this class of applications, enhancing continuous queries with temporal functions and predicates [3] enriches the semantics of queries and, consequently, enables users to define temporal expressions in their queries. In this paper we refer to this class of queries as continuous temporal queries.

Unlike traditional continuous queries, which typically return only the values of specified attributes, a continuous temporal query returns the valid time interval of each value as well as the attribute values themselves [4]. The selection predicates in this type of query include a temporal predicate on the time intervals associated with the attribute values satisfying the value predicate. To guarantee the correctness of the result of a temporal query, adjacent or overlapping timestamps of value-equivalent tuples should be merged prior to the evaluation of the temporal function or predicate specified in the query. This merge process is referred to as temporal coalescing [5]. Coalescing is essential for temporal query processing because queries evaluated on uncoalesced data may generate incorrect answers [5, 6]. The same is true for continuous temporal queries over data streams [7].

For stream applications, it is not uncommon to have insufficient system resources to process or keep in memory all tuples arriving from the input data stream [2]. In these situations, discarding a fraction of the tuples, called load shedding [8], often becomes necessary. With load shedding in place, however, the system cannot produce an exact result, so ensuring the accuracy of an approximate query result is an important issue [9].
In this paper, we study the problem of load shedding from a data stream in the presence of temporal coalescing, with a particular focus on the case of insufficient memory to keep all tuples in the window specified in a continuous temporal selection query. The presence of coalescing as an integral step in temporal query processing poses two major challenges for the load shedding mechanism.
The first challenge is that the accuracy metrics used in existing load shedding methods (e.g., the number of tuples in the approximate result as in [10-16], or the relative deviation between estimated and actual aggregate values as in [17, 18]) are not suitable for continuous temporal queries. Since a temporal query returns both attribute values and their valid time intervals, the accuracy metric should take both into account. The second challenge is a direct consequence of the first: the existing load shedding algorithms (whether random load shedding as in [17] or semantic load shedding as in [11]) are therefore inadequate for temporal queries.
To address the first challenge, we introduce a new accuracy metric that combines a value similarity metric and an interval similarity metric. The proposed metric is not only intuitive and semantically sound but also has theoretical backing. Specifically, the value similarity metric is based on the Jaccard coefficient [19], and the interval similarity metric is based on PQI interval orders [20]. The Jaccard coefficient is a common statistic for comparing the similarity between two sets. PQI interval orders provide a similarity metric for comparing intervals (mapped from elements).

In response to the second challenge, we present a new "coalescence-aware" algorithm that keeps the effects of coalescing in mind when deciding which tuples to drop. Making an optimal load-shedding decision would require holding all tuples in memory but, because the available memory is limited, we propose a greedy algorithm that takes quadratic running time and consumes linear memory space. It employs a greedy strategy that combines two greedy objectives, one targeting the value similarity and the other targeting the interval similarity. More specifically, it tries to minimize the number of extraneous tuples and the length of coalesced intervals lost in the approximate result as a consequence of dropping tuples.

A performance study compares the proposed coalescence-aware algorithm to a conventional random load shedding algorithm. (Other existing load-shedding algorithms are not comparable, since they do not work with coalescing.) The main purpose of the performance study is to compare the two mechanisms in terms of the achieved accuracy. The experimental results show that the accuracy improvement of our proposed mechanism over the random mechanism approaches an order of magnitude as the coalescing probability increases and the available memory size decreases. We have also examined the effects of the two greedy objectives on the performance of the proposed coalescence-aware algorithm and confirmed that the objectives are well balanced in their influence on the accuracy achieved by the greedy algorithm.
The main contributions of this paper can be summarized as follows.
■ It introduces the new problem of load shedding for continuous temporal queries over a windowed data stream.
■ It identifies coalescing as an important barrier to overcome, and proposes a new accuracy metric and a new load shedding algorithm that take coalescing effects into consideration.
■ It presents a thorough performance study of the proposed coalescence-aware algorithm and two partially coalescence-aware algorithms (each reflecting one of the two greedy objectives).
The rest of this paper is organized as follows. Section II provides relevant background and introduces the model of continuous temporal queries assumed in this paper. Section III presents the proposed accuracy metric and load shedding algorithm. Section IV presents the experiments and discusses the results. Section V reviews related work. Section VI concludes this paper and suggests future work.
II. CONTINUOUS TEMPORAL QUERIES OVER DATA STREAMS
We characterize continuous temporal queries by their support for temporal functions and predicates. These queries can be seen as a hybrid of traditional continuous queries over data streams [1] and temporal queries over temporal databases [4]. In this section, we first summarize the models of these two base query types and then introduce and motivate the model of the hybrid query type.
> A. Continuous Queries Over Data Streams
A data stream is an unbounded sequence of tuples, and typical queries on data streams run continuously for an indefinitely long time period [1, 2]. We assume that each tuple arriving from a data stream is timestamped. Additionally, we assume that all tuples arrive in increasing order of their timestamps and that tuples in a window are maintained in that order.
In many cases, only tuples bounded by a window on a data stream are of interest at any given time [2]. A window may be tuple-based or time-based, depending on whether its size is specified as a number of tuples or as a time span. The solutions presented in this paper (the proposed accuracy metric and the load shedding algorithm) work with either window model.
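The two window models differ only in their eviction policy. The following Python sketch is illustrative (the function names and the (timestamp, value) tuple layout are our assumptions, not from the paper); it assumes tuples arrive in timestamp order, as stated above.

```python
from collections import deque

def evict_tuple_based(window: deque, max_tuples: int) -> None:
    """Tuple-based window: keep only the most recent max_tuples tuples."""
    while len(window) > max_tuples:
        window.popleft()

def evict_time_based(window: deque, now: int, span: int) -> None:
    """Time-based window: keep only tuples within the last `span` time units."""
    while window and window[0][0] <= now - span:
        window.popleft()

# Each tuple is (timestamp, value); tuples are stored in timestamp order.
w = deque([(1, 70), (2, 72), (5, 76), (9, 74)])
evict_time_based(w, now=10, span=6)   # drops tuples with timestamp <= 4
print(list(w))                        # [(5, 76), (9, 74)]
```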
An example continuous query is shown below, set in a wireless sensor network in which sensors mounted on weather boards collect timestamped topology information along with humidity, temperature, and other measurements [21].
Example 1. Continuous Query
Assume a query that monitors humidity values over the past 15 minutes and continuously outputs, every minute, humidity values and their associated timestamps whenever it finds that the humidity exceeds 75%. This query can be expressed as follows using the Continuous Query Language (CQL) syntax [22].

SELECT S.humidity, S.timestamp
FROM Stream S RANGE 15 MINUTE SLIDES 1 MINUTE
WHERE S.humidity > 75;

In this query syntax, RANGE specifies the window size and SLIDES specifies the sliding interval of the window.

> B. Temporal Queries Over Temporal Databases
Time is a ubiquitous aspect of almost all real-world phenomena, and many real-world applications deal with data whose values may change over time. Such applications thus have an intrinsic need to model time, which has motivated the database community to undertake an extensive study of temporal data management for relational databases [4], semi-structured databases (e.g., XML [23]), and data streams [24].
Temporal databases make it possible for the user to store the history of data and to retrieve this history through temporal queries. The importance of such capabilities has become evident as major DBMS products [25, 26], driven by significant commercial interest, now support temporal data management as a core feature.
Two time dimensions are of interest in the handling of temporal data: the valid time interval and the transaction time interval [4]. The former is the time interval during which the stored fact is true in reality; the latter is the time interval during which the fact is present in the database. A temporal database can support the former only (called a valid-time database), the latter only (called a transaction-time database), or both (called a bitemporal database). In this paper we consider a valid-time database.
In addition, temporal databases can support either attribute or tuple timestamping depending on whether the time interval is associated with individual attributes or with the entire tuple of all attributes [4]. In this paper we consider tuple timestamping.
Temporal databases can be queried using Temporal SQL (TSQL) [27], which is essentially SQL extended with temporal predicates and functions. The correct evaluation of temporal predicates and functions requires that temporal tuples be coalesced first. As already mentioned, coalescing is a fundamental operation in any temporal data model, and is essential to temporal query processing since queries evaluated on uncoalesced data may generate incorrect answers [5, 6] (see the example below).
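As a concrete sketch of this merge process, the following Python function coalesces value-equivalent tuples with adjacent or overlapping intervals in one pass over time-sorted input. The function name, the half-open year intervals, and the split point 2005 are illustrative assumptions; the paper's Example 2 gives only the merged interval [2001, 2008].

```python
def coalesce(tuples):
    """Merge value-equivalent tuples whose intervals are adjacent or overlap.

    Each input tuple is (value, start, end) with a half-open interval
    [start, end); tuples are assumed sorted by start time.
    """
    result = []
    for value, start, end in tuples:
        if result and result[-1][0] == value and result[-1][2] >= start:
            prev = result[-1]
            result[-1] = (value, prev[1], max(prev[2], end))  # extend interval
        else:
            result.append((value, start, end))
    return result

# Hypothetical records in the spirit of Example 2: the two R&D intervals
# are adjacent, so they merge into [2001, 2008).
history = [("Andy/R&D", 2001, 2005), ("Andy/R&D", 2005, 2008), ("Andy/HR", 2008, 2010)]
print(coalesce(history))
# [('Andy/R&D', 2001, 2008), ('Andy/HR', 2008, 2010)]
```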
Example 2. Importance of Coalescing
Consider the temporal table Employee in Table 1, where for simplicity only Andy's employment history records are shown. The second tuple reflects a change of his salary and the third reflects a change of his department. Suppose the manager wants to know which employees worked for the R&D department for more than six consecutive years. The query can be expressed as follows in TSQL [27].

SELECT E.Name, VALID (E)
FROM Employee (Name, Department) AS E
WHERE CAST (VALID (E) AS INTERVAL YEAR) > 6;

In this syntax, Employee(Name, Department) specifies coalescing over Name and Department, VALID(E) returns the valid time interval of tuples in the result, and the CAST clause converts the valid time interval to the specified granularity, i.e., year. The query returns pairs of values: Name and the interval during which the values of the coalescing attributes stay the same. With coalescing, the first two tuples in Table 1 (both of Andy in R&D) are merged because their valid time intervals are adjacent. The merged time interval, [2001, 2008], is greater than six years, so Andy appears in the query result. Without coalescing, in contrast, Andy cannot be in the query result because the time interval of neither the first nor the second tuple alone is greater than six years.

> C. Continuous Temporal Queries Over Data Streams
As already mentioned, we see the continuous temporal stream query model as a hybrid of the continuous stream query model and the temporal database query model. What this entails in the temporal representation of tuples is that, when a new tuple s_i is added to a window, we model the valid time interval of its preceding tuple s_{i-1} as (t_{i-1}, t_i), where t_{i-1} and t_i are the timestamps of s_{i-1} and s_i, respectively. Moreover, inherited from temporal database queries, continuous temporal stream queries output not only the values of the coalesced attributes but also their valid time intervals. As mentioned in Section II-B, the correct evaluation of temporal predicates and functions in a continuous temporal query requires that value-equivalent tuples with adjacent time intervals be merged before the query is evaluated on them.

In this paper, we consider continuous temporal selection queries (see Example 3 below) as the query type. A temporal selection query is adequate for our purpose of studying the effect of load shedding. Besides, we believe queries of this type are useful in a wide range of stream applications.

Example 3. Temporal Continuous Selection Query
In the application assumed in Example 1, consider another query for monitoring the stream of sensor readings across arriving temporal tuples. Every minute, the query processor outputs the humidity value and the associated time interval whenever it finds that the value has remained the same for 3 consecutive minutes or longer. This query can be expressed as follows using syntax borrowed from CQL [22] and TSQL [27].

SELECT S.humidity, VALID (S)
FROM Stream (humidity) AS S RANGE 15 MINUTE SLIDES 1 MINUTE
WHERE CAST (VALID (S) AS INTERVAL MINUTE) >= 3;
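The semantics of this query can be sketched end to end in Python under the stream model above: each arriving tuple closes its predecessor's interval, value-equivalent adjacent tuples are coalesced, and coalesced tuples whose interval meets the threshold are output. The helper names and the sample readings are hypothetical, and intervals are treated as half-open.

```python
def closed_intervals(stream):
    """When s_i arrives, assign s_{i-1} the valid interval [t_{i-1}, t_i)."""
    out = []
    for i in range(1, len(stream)):
        (t_prev, v_prev), (t_next, _) = stream[i - 1], stream[i]
        out.append((v_prev, t_prev, t_next))
    return out

def coalesce(tuples):
    """Merge value-equivalent tuples with adjacent or overlapping intervals."""
    result = []
    for value, start, end in tuples:
        if result and result[-1][0] == value and result[-1][2] >= start:
            result[-1] = (value, result[-1][1], max(result[-1][2], end))
        else:
            result.append((value, start, end))
    return result

def temporal_selection(stream, min_minutes):
    """Example 3: values whose coalesced interval spans >= min_minutes."""
    return [(v, s, e) for v, s, e in coalesce(closed_intervals(stream))
            if e - s >= min_minutes]

# (timestamp-in-minutes, humidity) readings; hypothetical values.
readings = [(0, 80), (1, 80), (2, 80), (3, 76), (4, 76), (5, 79)]
print(temporal_selection(readings, 3))   # [(80, 0, 3)]
```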
III. COALESCENCE-AWARE LOAD SHEDDING
As mentioned in Section I, making load shedding coalescence-aware requires a new accuracy metric and a new load-shedding algorithm, because the existing ones do not work with coalescing. In this section, we propose a new accuracy metric (Section III-B) and a new algorithm (Section III-C).
> A. Preliminaries
Definition 3.1 Coalesced tuple. A coalesced tuple CT_i is defined as a pair of an attribute value and a time interval of the coalesced tuple (Fig. 1).

Definition 3.2 Exact query result. An exact result of a continuous temporal query, denoted as ε, is defined as a sequence of exact coalesced tuples ⟨CT_1, CT_2, ..., CT_n⟩ (Fig. 2).

Definition 3.3 Approximate query result. An approximate result of a continuous temporal query, denoted as A, is defined as a sequence ⟨A_1, A_2, ..., A_n⟩, where A_i is the set of approximate coalesced tuples CT_{ij} resulting from the exact coalesced tuple CT_i due to load shedding (Fig. 3).

Property 1. Bounding Property on Approximate Coalesced Tuples

As a result of load shedding, an exact coalesced tuple CT_i reduces to zero or more approximate coalesced tuples in A_i ∈ A such that the attribute values of all the approximate coalesced tuples are the same as that of CT_i and the time interval of every approximate coalesced tuple is contained in that of CT_i.

Proof: This property can be proved through a case analysis. As a result of load shedding, one and only one of the following four cases can happen to CT_i.

■ Case 1: CT_i is retained without any alteration. This case occurs when none of the tuples coalesced in CT_i is dropped.
■ Case 2: An exact coalesced tuple disappears in its entirety. This case occurs when a tuple not coalesced with its adjacent tuples in the exact result is dropped.
■ Case 3: An exact coalesced tuple is decreased in its interval. This case occurs when a tuple which is either the first or the last of the tuples coalesced in the exact result is dropped.
■ Case 4: An exact coalesced tuple is split into two approximate coalesced tuples of shorter intervals. This case occurs when a tuple in the middle of the tuples coalesced in the exact result is dropped.
It is obvious in all four cases that the bounding property holds.
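Cases 2 through 4 can be reproduced with a small coalescing sketch in Python. The tuple values and intervals below are hypothetical; the coalesce helper merges value-equivalent tuples with adjacent intervals, and dropping an actual tuple before coalescing yields the case named in each comment.

```python
def coalesce(tuples):
    """Merge value-equivalent (value, start, end) tuples with adjacent intervals."""
    result = []
    for value, start, end in tuples:
        if result and result[-1][0] == value and result[-1][2] >= start:
            result[-1] = (value, result[-1][1], max(result[-1][2], end))
        else:
            result.append((value, start, end))
    return result

# Five actual tuples; the middle three coalesce into one exact tuple.
actual = [(70, 0, 2), (83, 2, 4), (83, 4, 6), (83, 6, 8), (91, 8, 9)]
exact = coalesce(actual)          # [(70, 0, 2), (83, 2, 8), (91, 8, 9)]

drop_first_83 = coalesce(actual[:1] + actual[2:])   # Case 3: interval shrinks
drop_middle_83 = coalesce(actual[:2] + actual[3:])  # Case 4: split in two
drop_91 = coalesce(actual[:4])                      # Case 2: tuple disappears
print(drop_first_83)   # [(70, 0, 2), (83, 4, 8), (91, 8, 9)]
print(drop_middle_83)  # [(70, 0, 2), (83, 2, 4), (83, 6, 8), (91, 8, 9)]
```

In every case the surviving approximate tuples keep the value 83 and a subinterval of [2, 8), as the bounding property requires.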
> B. Accuracy Metric
Since the result of a continuous temporal query consists of both the values of coalesced attributes and their associated valid time intervals (recall Definitions 3.2 and 3.3), an accuracy metric for this class of queries should take both into account. In this regard, we propose an accuracy metric based on the concepts of the Jaccard coefficient [19] and PQI interval orders [20].

The Jaccard coefficient, J(D_x, D_y), measures the similarity between two data sets D_x and D_y, and is defined as the ratio between the cardinality of their intersection and the cardinality of their union:

J(D_x, D_y) = |D_x ∩ D_y| / |D_x ∪ D_y|.   (1)

We use the Jaccard coefficient to compare the coalesced attribute values between the exact and the approximate query result, that is, between ε ≡ ⟨CT_1, ..., CT_n⟩ and A ≡ ⟨A_1, ..., A_n⟩. Note that there may be more than one coalesced tuple in A_i for each CT_i in ε (recall Definitions 3.2, 3.3, and Property 1). We thus need to apply Equation 1 to multisets. That is,

J(ε, A) = Σ_i min(1, |A_i|) / Σ_i max(1, |A_i|),   (2)

where the multiset intersection and union are defined as follows: let f(D_1, x) and f(D_2, x) be the multiplicities of an element x in the multisets D_1 and D_2, respectively; then the multiplicities in the intersection and union are f(D_1 ∩ D_2, x) = min{f(D_1, x), f(D_2, x)} and f(D_1 ∪ D_2, x) = max{f(D_1, x), f(D_2, x)}. Note that each exact coalesced tuple CT_i contributes multiplicity 1, since it is a singleton. The numerator term min(1, |A_i|) equals 0 only if the exact coalesced tuple disappears as a result of load shedding (Case 2 in the proof of Property 1); otherwise it always equals 1.

Equation 2 considers only the coalesced attribute values, so it should be modified to consider the interval similarity factor as well. For this purpose, we adopt PQI interval orders, which are used to compare intervals [20, 28]. In this scheme, given two intervals x and y, the similarity between them is defined as the length of their interval intersection divided by the length of their interval union:

PQI(x, y) = |x ∩ y| / |x ∪ y|.   (3)

(See the Appendix for more information about PQI interval orders.)

We use PQI interval orders to compare an exact coalesced time interval with the set of approximate coalesced time intervals derived from it. Let I_i and I_{ij} be the coalesced intervals of CT_i and CT_{ij}, respectively. We know from Property 1 that I_{ij} is a subinterval of I_i for every j. Hence, the interval similarity for a given exact coalesced tuple is computed as follows:

PQI(CT_i, A_i) = Σ_j |I_{ij}| / |I_i|.   (4)

Equation 4 essentially computes the interval reduction ratio of the exact coalesced tuple CT_i due to load shedding. Its value equals 1 only if CT_i is retained intact as a result of load shedding (Case 1 in the proof of Property 1), but otherwise is always less than 1. In the proposed accuracy metric, we use this ratio to scale the intersection term of CT_i in Equation 2:

Accuracy(ε, A) = Σ_i min(1, |A_i|) · PQI(CT_i, A_i) / Σ_i max(1, |A_i|).   (5)

Note that min(1, |A_i|) equals either 0 or 1 and that if min(1, |A_i|) = 0 (i.e., Case 2) then PQI(CT_i, A_i) = 0. Hence, Equation 5 can be further simplified:

Accuracy(ε, A) = Σ_i PQI(CT_i, A_i) / Σ_i max(1, |A_i|).   (6)

This accuracy metric has the nice property of limiting the computed accuracy to a maximum of 1.0 (i.e., 100%), and it allows for intuitive interpretations, as illustrated in Example 4 below.
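The metric can be sketched in a few lines of Python. The data layout here is a hypothetical representation, not the paper's: each exact coalesced tuple is a (value, (start, end)) pair, and approx[i] lists the approximate coalesced tuples derived from exact[i] (possibly an empty list, which corresponds to Case 2).

```python
def interval_similarity(exact_iv, approx_ivs):
    """Interval reduction ratio: total retained length over exact length."""
    exact_len = exact_iv[1] - exact_iv[0]
    kept = sum(e - s for s, e in approx_ivs)
    return kept / exact_len

def accuracy(exact, approx):
    """Proposed metric: sum of per-tuple interval ratios over the
    multiset-union size max(1, |A_i|) summed across exact tuples."""
    numer = sum(interval_similarity(iv, [a[1] for a in a_i])
                for (_, iv), a_i in zip(exact, approx))
    denom = sum(max(1, len(a_i)) for a_i in approx)
    return numer / denom

# One exact coalesced tuple <83, [1, 10)>; load shedding split it into
# <83, [1, 4)> and <83, [7, 10)>, keeping 6 of 9 time units (Case 4).
exact = [(83, (1, 10))]
approx = [[(83, (1, 4)), (83, (7, 10))]]
print(accuracy(exact, approx))   # (6/9) / 2 = 0.333...
```

The split produces the 2 in the denominator (two approximate tuples) and the 6/9 in the numerator (lost middle interval), the same two effects discussed for the low-accuracy case in Example 4.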
Example 4. Proposed Accuracy Metric.
As shown in Fig. 4, consider an exact result ε, resulting from coalescing five actual tuples, and five possible approximate results A_1, A_2, A_3, A_4, A_5, each resulting from dropping one of the five actual tuples. Using Equation 6, the accuracy of each approximate result can be calculated. We see that the resulting accuracy is lowest in A_3, where the third tuple, <83, [7, 10)>, is dropped. In this case, dropping the tuple introduces a temporal gap between the second and the fourth tuples (which are coalesced with the third tuple in the exact result). This causes an extra attribute value (hence the 2 in the denominator) and a missing part of the exact temporal interval (hence the 6/9 in the numerator) in the approximate result. On the other hand, the highest accuracy is achieved in either A_2, where the second tuple is dropped, or A_4, where the fourth tuple is dropped. The reason is that dropping either tuple causes only a missing part of the exact temporal interval (as in A_3); it neither introduces an extra attribute value (as in A_3) nor loses an entire interval (as in A_1 and A_5).

> C. Coalescence-Aware Load Shedding Algorithm
The problem of coalescenceaware load shedding can be stated formally as follows.
Problem definition: Given a memory M (of size |M| tuples) and a continuous temporal query that specifies coalescing tuples in a window W_S (of size |W_S| tuples) over a data stream S, discard |W_S| − |M| tuples from the |W_S| tuples with the objective of maximizing the accuracy of the query output, subject to the constraint that |M| < |W_S|.
Finding an optimal solution would require holding all |W_S| tuples in memory to decide on the optimal load shedding decision, but the available memory can hold only |M| tuples (as |M| < |W_S|). Even if it were possible (which it is not), it would be computationally intractable to find an optimal subset of tuples to discard from the C(|W_S|, |M|) possible subsets. This problem is harder than the NP-hard nonlinear binary integer programming problem, as there is no well-formed functional form for computing the accuracy for a given set of binary integer assignments (e.g., 1 for retaining a tuple and 0 for discarding a tuple). We therefore propose a greedy algorithm, which takes quadratic running time in the worst case and consumes a linear amount of storage space.
The proposed algorithm decides which tuple to drop from memory upon the arrival of each new tuple from the input data stream. The algorithm uses a greedy strategy based on two objectives for achieving value similarity (Equation 2) and interval similarity (Equation 4).
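A minimal Python sketch of this per-arrival greedy decision is shown below. It is ours, not the paper's Algorithm 1: tuples are assumed to be (value, start, end) triples sorted by time with half-open intervals, delta_plus counts the extraneous tuples a drop would introduce (a split), rho_minus is the ratio of the dropped length to the exact coalesced interval's length, and ties are broken by first occurrence rather than randomly as in the paper.

```python
def objective(tuples, j):
    """Combined greedy objective delta_plus + rho_minus for dropping tuples[j].

    Tuples coalesce when value-equivalent and adjacent (prev end == next start).
    """
    value, start, end = tuples[j]
    prev_t = tuples[j - 1] if j > 0 else None
    next_t = tuples[j + 1] if j + 1 < len(tuples) else None
    merges_prev = prev_t is not None and prev_t[0] == value and prev_t[2] == start
    merges_next = next_t is not None and next_t[0] == value and next_t[1] == end
    # delta_plus: dropping a middle tuple splits its coalesced tuple in two.
    delta_plus = 1 if (merges_prev and merges_next) else 0
    # rho_minus: dropped length over the exact coalesced interval's length.
    lo, hi = j, j
    while lo > 0 and tuples[lo - 1][0] == value and tuples[lo - 1][2] == tuples[lo][1]:
        lo -= 1
    while hi + 1 < len(tuples) and tuples[hi + 1][0] == value and tuples[hi + 1][1] == tuples[hi][2]:
        hi += 1
    rho_minus = (end - start) / (tuples[hi][2] - tuples[lo][1])
    return delta_plus + rho_minus

def shed_one(window, new_tuple):
    """Admit new_tuple into a full window, dropping the objective's minimizer."""
    candidates = window + [new_tuple]
    m = min(range(len(candidates)), key=lambda j: objective(candidates, j))
    return candidates[:m] + candidates[m + 1:]

window = [(83, 1, 3), (83, 3, 5), (83, 5, 7), (83, 7, 9)]
print(shed_one(window, (83, 9, 10)))  # drops the short edge tuple (83, 9, 10)
```

Dropping an edge tuple of a coalesced run never splits it (delta_plus = 0), so the shortest edge tuple wins here, consistent with the preference for A_2 or A_4 in Example 4.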
Objective 1. Dropping a tuple may introduce an extraneous tuple in an approximate result if it causes a temporal gap between tuples that are coalesced in the exact result. Let δ^{+}_{s_j} denote the number of such extraneous tuples resulting from dropping a tuple s_j. Note that δ^{+}_{s_j} is 1 in Case 4 in the proof of Property 1 and is 0 in the other cases. Evidently, we want to minimize the number of extraneous tuples, and thus we set the first objective to drop a tuple with the smallest δ^{+}_{s_j} (i.e., = 0) first. A tie is broken by random choice.

Objective 2. Dropping a tuple may also cause either all or part of the time interval of an exact coalesced tuple to be missing in the approximate result. Let ρ^{−}_{s_j} be the ratio of the length of the time interval missing due to the dropping of s_j over the length of the time interval of the exact coalesced tuple. Note that ρ^{−}_{s_j} is 0 in Case 1 in the proof of Property 1, is 1 in Case 2, and is bounded between 0 and 1 in the other cases. Then, the second objective is to drop a tuple with the smallest ρ^{−}_{s_j} first in order to reduce these missing intervals. Like Objective 1, a tie is broken by random choice.

There are different ways to combine δ^{+}_{s_j} and ρ^{−}_{s_j} in the objective function of the greedy algorithm. We use their sum (it can be a weighted sum if we want to bias in favor of either objective). We believe summation is better than a product, as the resulting value is always greater than 0, whereas the product gives 0 whenever δ^{+}_{s_j} or ρ^{−}_{s_j} is 0.

For example, Table 2 shows the values of
δ^{+}_{s_j} and ρ^{−}_{s_j} for each of the five actual tuples in Example 4. It shows that the tuples with the minimum value of δ^{+}_{s_j} + ρ^{−}_{s_j} are s_2 and s_4. The algorithm thus decides to drop either s_2 or s_4. Recall the conclusion in Example 4 that the highest accuracy is achieved in either A_2, where s_2 is dropped, or A_4, where s_4 is dropped. This demonstrates how the accuracy metric introduced in Section III-B is reflected in our proposed load shedding algorithm.

Algorithm 1 outlines the steps of load shedding using the combined objective function. Upon the arrival of a new tuple
s_i (Line 1), if the memory M is not yet full, s_i is added to W_S (Line 4); otherwise, a tuple s_m that gives a minimum value of δ^{+}_{s_j} + ρ^{−}_{s_j} among all s_j ∈ M ∪ {s_i} is found (Lines 7-14) and dropped from the window (Line 16).

It is easy to see that the running-time complexity is O(|M|^2), since the algorithm scans the |M| + 1 tuples in M ∪ {s_i} linearly and, for each tuple, the computation of δ^{+}_{s_j} and ρ^{−}_{s_j} takes O(|M|) time in the worst case. The complexity is lower in practice, as computing δ^{+}_{s_j} and ρ^{−}_{s_j} involves only those tuples coalesced to form the same exact tuple. The storage space complexity is O(|M|), as it suffices to have enough memory to hold the |M| + 1 tuples and two numbers, current_min and m.

In the experiments presented in Section IV, we use two additional algorithms, called CALS-V and CALS-I (where CALS stands for coalescence-aware load shedding). CALS-V is the CALS algorithm reduced to consider only the value similarity (hence using δ^{+}_{s_j} only), and CALS-I is the one considering only the interval similarity (hence using ρ^{−}_{s_j} only).

IV. PERFORMANCE STUDY
We conducted experiments to study the performance of the proposed load shedding algorithm. The main objective of the experiments was to observe the effectiveness of the proposed CALS algorithm. Additionally, we compare CALS with the two partially coalescence-aware algorithms, CALS-V and CALS-I, to observe how each of the two complementary greedy objectives affects the performance individually. We use the random load shedding (RAND) algorithm as the baseline approach (under this mechanism, tuples are discarded randomly such that each tuple has the same probability of being discarded from, or equivalently retained in, memory). As anticipated, the experimental results show that CALS achieved the highest accuracy, RAND the lowest, and CALS-V and CALS-I in between.
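The equal-retention-probability property of the RAND baseline matches classic reservoir sampling. A sketch follows; the function name and parameters are ours, not the paper's.

```python
import random

def rand_load_shedding(stream, memory_size, seed=None):
    """Random load shedding baseline: every tuple in the stream is retained
    with the same probability memory_size/len(stream) (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, tup in enumerate(stream):
        if i < memory_size:
            reservoir.append(tup)            # fill memory first
        else:
            j = rng.randrange(i + 1)         # uniform over tuples seen so far
            if j < memory_size:
                reservoir[j] = tup           # evict a random resident tuple
    return reservoir

kept = rand_load_shedding(range(500), memory_size=250, seed=42)
print(len(kept))   # 250
```

Note that reservoir replacement does not preserve arrival order, so a windowed implementation would keep the retained tuples sorted by timestamp.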
The experiments were conducted using Matlab (Release R2010b) on a 64-bit machine with 4.00 GB of main memory. We used the Matlab array/matrix data type to implement the window data structure.
> A. Experiment Setup
1) Control Parameters: We identified two key parameters that significantly influence the accuracy of load shedding algorithms: the memory ratio and the coalescing probability. The coalescing probability is the expected probability that a tuple in the input stream is coalesced with its preceding tuple upon arrival. The memory ratio is the ratio of the available memory size to the specified window size, that is, the number of tuples that can actually be stored in memory divided by the number of tuples that are supposed to be stored in memory. In our experiments we set the window size to 500 tuples and vary the ratio from 50% to 90%. A natural expectation for these two parameters is that the benefit of coalescence-awareness will be more visible in the resulting accuracy when the coalescing probability is higher or the memory ratio is lower. In addition, we considered the threshold for time interval selection, which in our experiments is the lower bound on the selection interval specified in the query (see the selection query in Example 3). The experimental results, however, show that the influence of this parameter on the accuracy is insignificant.

2) Data Sets: We use five synthetic data sets simulating streams with coalescing probabilities of 10%, 30%, 50%, 70%, and 90%, respectively. The timestamp value t_i of each tuple in the data sets is selected randomly from within the next 10 increments of time (i.e., within [t_i, t_i + 10)). (The particular length of the time unit and the size of the increment, whether constant or varying, are irrelevant for these experiments.) The real data sets contain weather measurement data collected from sensors deployed throughout the Intel Berkeley Research Lab to gather timestamped topology information along with humidity, temperature, light intensity, and voltage values. In order to create cases with different coalescing probabilities, we performed coalescing on different coalescing attributes, namely humidity, voltage, and light intensity. The resulting coalescing probabilities are approximately 29% (for humidity), 64% (for voltage), and 86% (for light intensity).

> B. Experiment Results
For each algorithm, we measured the achieved accuracy as an average of the accuracies measured using each data set. The results are presented in two stages in this section. The first stage focuses on comparing CALS with RAND to see the effect of coalescenceawareness on the accuracy. The second stage focuses on examining the individual effects of the two partially coalescenceaware greedy objectives.
1) The Effect of Coalescence-Awareness: Figs. 5 and 6 show the accuracies achieved by CALS and RAND using the synthetic and real data sets, respectively. Both graphs show that CALS outperforms RAND over the entire range of coalescing probabilities and memory ratios. Furthermore, they clearly show that the performance advantage of coalescence-awareness increases as the coalescing probability increases and as the available memory size decreases, approaching an order of magnitude at the largest coalescing probability (90%) and the lowest memory ratio (50%) used in the experiments. In addition, as the figures show, there is little difference in the accuracy for varying interval selection threshold values. This indicates that, while load shedding produces more tuples with shorter coalesced intervals, the relative difference between the sets of tuples selected in the exact and approximate query results does not change much across threshold values.
2) The Effects of the Two Greedy Objectives: Figs. 7 and 8 show the accuracies achieved by all four algorithms (CALS, CALS-V, CALS-I, and RAND) for varying coalescing probabilities and memory ratios, respectively. As expected, the accuracies of CALS-V and CALS-I are bounded between the accuracy of CALS (upper bound) and that of RAND (lower bound). In addition, we see that their performance is very close to each other regardless of the memory ratio. This follows from the fact that the memory size determines the number of tuples to be dropped, which has equitable effects on the proposed accuracy under both greedy objectives.
We also see that the performance of CALS-V changes little as the coalescing probability increases, while CALS-I starts degrading as it increases beyond around 50%. Our reasoning is that CALS-V aims at retaining tuples whose removal would cause time gaps (hence extraneous tuples) and so has little dependency on the coalescing probability itself, whereas with CALS-I it becomes more likely, as the coalescing probability increases, that tuples with greater influence on the accuracy (even if they have the shortest time intervals) are dropped.
V. RELATED WORK
We look at two separate strands of related work: load shedding from a data stream and temporal coalescing over data streams.
> A. Load Shedding from a Data Stream
The most common accuracy metric assumed in the research literature for load shedding over data streams is the size of the approximate query result [10-16]. A few other existing studies, specific to aggregation queries, assume the accuracy metric of the relative error between approximate and actual aggregate values [17, 18]. For both of these accuracy metrics, the accompanying algorithms are generally categorized as random load shedding (i.e., drop tuples randomly, as in [17, 29]) or semantic load shedding (i.e., drop tuples based on their relative importance, as in [11]).
For maximizing the size of an approximate result (known as the max-subset problem), Das et al. [11] proposed two heuristics for dropping tuples from window-join input streams. The first heuristic sets the priority of a tuple being retained in memory based on the probability that a counterpart joining tuple arrives from the other data stream. The second heuristic favors a tuple based on its own remaining lifetime. Kang et al. [14] proposed a cost-based load shedding mechanism for evaluating window stream joins by adjusting the sizes of the sliding windows to maximize the number of produced tuples. Ayad and Naughton [10] proposed techniques for optimizing the execution of conjunctive continuous queries under limited computational resources by placing random drop boxes throughout the query plan to maximize the plan throughput. Xie et al. [16] presented a stochastic load shedding mechanism for maximizing the number of result tuples in stream joins by exploiting the statistical properties of the input streams. Gedik et al. [12] proposed a load shedding approach for stream joins that adapts to stream rate and time correlation changes in order to increase the number of output tuples produced by a join query. Gedik et al. [13] further addressed the problem of multiple-way joins over windowed data streams and proposed a window segmentation approach for maximizing the query output rate. Tatbul and Zdonik [15] proposed a load shedding operator that models a data stream as a sequence of logical windows and either drops or retains an entire window depending on the definition and characteristics of the window operator. For minimizing the relative error of aggregation queries, Babcock et al. [17] proposed a random sampling technique for optimal placement of load shedding operators in aggregation query plans, measuring the resulting accuracy as the relative deviation between estimated and actual aggregate values.
Law and Zaniolo [18] proposed a Bayesian load shedding approach for aggregation stream queries with the objective of obtaining an accurate estimate of the aggregation answer with minimal time and space overheads.
In contrast to all this existing work, we address a rather novel query model for data streams under which a query result is composed of both attribute values and the valid time interval of each value. Therefore, none of the existing load shedding accuracy metrics and algorithms is applicable to our research problem.
B. Temporal Coalescing Over Data Streams
Barga et al. [30] proposed employing temporal coalescing for complex event processing over data streams, so that two events are represented as a single event if their valid-time intervals overlap. Zaniolo [31] examined how temporal coalescing can be expressed for a sequence of events using OLAP functions and Kleene-closure constructs. Krämer and Seeger [32] introduced coalescing as a physical operator that compacts the representation of a data stream by merging tuples with identical values and consecutive timestamps into a single tuple. Recently, we studied coalescing for windowed temporal query processing over data streams and addressed the problems of updating a window's extent and of optimally selecting between eager and lazy coalescing for concurrent temporal queries [7].
While all this existing work is pertinent to temporal coalescing over data streams, none of it addresses the problem from the viewpoint of memory being insufficient to retain all tuples. To the best of our knowledge, our work is unique in this regard.
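To make the coalescing operation concrete, the following is a minimal sketch (not from the paper) of merging value-equivalent tuples whose valid-time intervals are adjacent or overlapping. It assumes half-open integer intervals `[start, end)` and a simple `Tuple` record; the paper's actual operator works incrementally over a stream window rather than over a finished list.

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    value: str   # attribute value
    start: int   # valid-time interval start (inclusive)
    end: int     # valid-time interval end (exclusive)

def coalesce(tuples):
    """Merge value-equivalent tuples whose valid-time intervals
    are adjacent or overlapping into single tuples."""
    out = []
    # Sorting by (value, start) groups value-equivalent tuples and
    # orders their intervals, so one linear pass suffices.
    for t in sorted(tuples, key=lambda t: (t.value, t.start)):
        if out and out[-1].value == t.value and t.start <= out[-1].end:
            # Adjacent (start == end) or overlapping: extend the interval.
            out[-1].end = max(out[-1].end, t.end)
        else:
            out.append(Tuple(t.value, t.start, t.end))
    return out
```

For example, tuples ("A", 1, 3) and ("A", 3, 5) coalesce into ("A", 1, 5), while ("A", 7, 9) stays separate because of the gap in its valid time.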
VI. CONCLUSION
This paper addressed the problem of load shedding from a window for continuous temporal queries over data streams when memory is limited. The key challenge stems from the fact that tuples need to be coalesced for temporal query processing, which renders existing load-shedding algorithms, and the accuracy metric they use, inapplicable. We therefore proposed a new accuracy metric and a new algorithm that handle the challenges of coalescing well. The accuracy metric combines a value similarity metric based on the Jaccard coefficient and an interval similarity metric based on PQI interval orders. The algorithm takes a greedy approach whose strategy combines two objectives, each aiming to maximize one of the two types of similarity. The algorithm runs in polynomial time and linear memory space, and the accuracy it achieves is far higher than that achieved by the baseline random load-shedding algorithm. Moreover, the two objectives combined in the greedy strategy make roughly equal contributions to the achieved accuracy.
For future work, several extensions are possible. First, this paper focuses on the limited-memory case, so the limited-CPU-time case is another problem worth investigating. Second, this paper considers temporal selection queries; other query types, such as aggregation and join queries, are possibilities for further work, for which the accuracy metric and the greedy strategy may need to be adapted to the query type. Third, this paper considers queries retrieving both the attribute values and their valid time intervals. Special cases, such as a snapshot query (retrieving only attribute values) and a valid-time query (retrieving only valid time intervals) [27], may admit simpler and more efficient solutions; this too could be an interesting study.
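The value-similarity half of the proposed accuracy metric is based on the Jaccard coefficient, which compares the set of values in the approximate result against the set in the exact result. A minimal sketch (the exact formulation in the paper may weight or combine it differently):

```python
def jaccard(a, b):
    """Jaccard coefficient of two value sets: |A ∩ B| / |A ∪ B|.
    Two identical sets score 1.0; disjoint sets score 0.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty results are identical
    return len(a & b) / len(a | b)

# Hypothetical example: values in an exact vs. an approximate query result.
exact = {"Engineer", "Manager", "Director"}
approx = {"Engineer", "Manager"}
print(jaccard(exact, approx))  # 2 shared values out of 3 distinct -> 2/3
```

Under this metric, shedding a tuple only lowers the score if it removes the last tuple carrying some distinct value from the result, which is one reason a coalescing-aware shedding strategy can outperform random dropping.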
Appendix. PQI Interval Orders
The PQI interval orders are used to compare intervals [28]. In this scheme, each element x of a given data set D is represented by an interval through a lower bound function L(x) and an upper bound function U(x), such that ∀x ∈ D: L(x) < U(x). The scheme provides three types of binary relations for interval comparison: strict preference, weak preference (or hesitation), and indifference.
The strict preference relation (denoted P) models the case in which the interval of one element precedes the interval of the other element, i.e., the intersection of the two intervals is empty (Fig. 9a). Formally, given two elements x and y, x is said to be strictly preferred over y if L(x) > U(y). The weak preference relation (denoted Q) models the case in which the interval of one element overlaps the interval of the other element, i.e., the intersection is non-empty (Fig. 9b). Formally, x is said to be weakly preferred over y if U(x) > U(y) > L(x) > L(y). The indifference relation (denoted I) models the case in which the interval of one element contains the entire interval of the other element (Figs. 9c and 9d). Formally, x is said to be indifferent from (i.e., similar to) y if U(x) > U(y) > L(y) > L(x) or U(y) > U(x) > L(x) > L(y).
The degree of overlap between two intervals can be used as a measure of the similarity between them [20]. Formally, given two elements x and y, the similarity between x and y can be defined as follows. This equation is equivalent to Equation 3 in Section III-B.
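The three PQI relations and an overlap-based interval similarity can be sketched as follows. This is an illustrative implementation, not the paper's code; in particular, the similarity is formalized here as intersection length over union length, which is one common way to measure the degree of overlap and is assumed to match the intent of Equation 3.

```python
def pqi_relation(Lx, Ux, Ly, Uy):
    """Classify the PQI relation between intervals x = [L(x), U(x)]
    and y = [L(y), U(y)]: strict preference (P), weak preference (Q),
    or indifference (I)."""
    if Lx > Uy:
        return "x P y"  # x lies strictly after y: empty intersection
    if Ly > Ux:
        return "y P x"
    if Ux > Uy > Lx > Ly:
        return "x Q y"  # partial overlap, x shifted later than y
    if Uy > Ux > Ly > Lx:
        return "y Q x"
    return "x I y"      # one interval contains the other (or bounds tie)

def interval_similarity(Lx, Ux, Ly, Uy):
    """Degree of overlap: length of intersection over length of union."""
    intersection = max(0, min(Ux, Uy) - max(Lx, Ly))
    union = max(Ux, Uy) - min(Lx, Ly)
    return intersection / union if union > 0 else 0.0
```

For instance, intervals [5, 8] and [1, 3] are in strict preference (no overlap, similarity 0), while [1, 5] and [3, 7] are in weak preference with similarity 2/6 = 1/3.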

[Table 1.] Andy’s employment records

[Fig. 1.] An example of coalesced tuples.

[Fig. 2.] An example of an exact query result.

[Fig. 3.] An example of an approximate query result.

[Fig. 4.] An example of load shedding.

[Table 2.] δ+sj and ρ?sj for tuples in Example 4.

[Algorithm 1.] Coalescing-aware load shedding (CALS)

[Fig. 5.] Accuracies achieved by coalescing-aware load shedding (CALS) and random load shedding (RAND) on synthetic data sets.

[Fig. 6.] Accuracies achieved by coalescing-aware load shedding (CALS) and random load shedding (RAND) on real data sets.

[Fig. 7.] Accuracies achieved by coalescing-aware load shedding (CALS), CALS-V, CALS-I, and random load shedding (RAND) on synthetic data sets.

[Fig. 8.] Accuracies achieved by coalescing-aware load shedding (CALS), CALS-V, CALS-I, and random load shedding (RAND) on the real data set.

[Fig. 9.] Example of PQI relations between two intervals x and y.