Fast Result Enumeration for Keyword Queries on XML Data
 Authors: Zhou Junfeng, Chen Ziyang, Tang Xian, Bao Zhifeng, Ling Tok Wang
 Published: Journal of Computing Science and Engineering, Volume 6, Issue 2, pp. 127-140, 30 June 2012

ABSTRACT
In this paper, we focus on the efficient construction of tightest matched subtree (TMSubtree) results for keyword queries on extensible markup language (XML) data, based on smallest lowest common ancestor (SLCA) semantics. Here, “matched” means that all nodes in a returned subtree satisfy the constraint that the set of distinct keywords of the subtree rooted at each node is not subsumed by that of any of its sibling nodes, while “tightest” means that no two subtrees rooted at two sibling nodes can contain the same set of keywords. Assume that d is the depth of a given TMSubtree and m is the number of keywords of a given query Q. We proved that if d ≤ m, a matched subtree result has at most 2m! nodes; otherwise, the size of a matched subtree result is bounded by (d − m + 2)·m!. Based on this theoretical result, we propose a pipelined algorithm to construct TMSubtree results without rescanning all node labels. Experiments verify the benefits of our algorithm in aiding keyword search over XML data.

KEYWORD
XML, Keyword search, Result enumeration

I. INTRODUCTION
Over the past few years, keyword search on extensible markup language (XML) data has been a hot research issue, along with the steady increase of XML-based applications [1-12]. Just as effectiveness is important to keyword search, efficiency is also a key factor in its success.
Typically, an XML document is modeled as a node-labeled tree T. For a given keyword query Q, each result t is a subtree of T containing each keyword of Q at least once, where the root node of t should satisfy a certain semantics, such as smallest lowest common ancestor (SLCA) [4], exclusive lowest common ancestor (ELCA) [2, 3, 9], valuable lowest common ancestor (VLCA) [11], or meaningful lowest common ancestor (MLCA) [5]. Based on the set of qualified root nodes, there are three kinds of subtree results: 1) complete subtree (CSubtree), which is a subtree rooted at a node v that is excerpted from the original XML tree without pruning any information [2, 4]; 2) path subtree (PSubtree), which is a subtree that consists of the paths from v to all of its descendants that each contain at least one input keyword [13]; and 3) matched subtree (MSubtree), which is a subtree rooted at v satisfying the constraints of monotonicity and consistency [7, 8]. Let S_{v′} = {k_1, k_2, ..., k_{m′}} be the set of distinct keywords contained in the subtree rooted at a node v′, S_{v′} ⊆ Q. Intuitively, a subtree t_v is an MSubtree if and only if for any descendant node v′ of v, there does not exist a sibling node u′ of v′ such that S_{v′} ⊂ S_{u′}, which we call the constraint of “keywords subsumption”. Obviously, for a given CSubtree, the corresponding PSubtree can be got by removing from the CSubtree all nodes that do not contain any keyword of the given query in their subtrees, and, according to [8], the MSubtree can be got by removing from the PSubtree all of the nodes that do not satisfy the constraint of keywords subsumption.
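The keywords-subsumption constraint, together with the “tightest” requirement stated in the abstract, can be sketched for a single set of sibling nodes as follows. This is an illustrative sketch, not the paper's implementation; the node labels and keyword sets below are examples only:

```python
def tighten(siblings):
    """Apply the keywords-subsumption constraint to one set of sibling
    nodes, plus the "tightest" rule: drop every node whose keyword set is
    strictly subsumed by a sibling's set, and keep one node per distinct set."""
    kept, seen = [], set()
    for v, s in siblings.items():
        if any(s < t for t in siblings.values()):   # strictly subsumed sibling
            continue
        if s in seen:                               # duplicate keyword set
            continue
        seen.add(s)
        kept.append(v)
    return kept

# illustrative sibling nodes with their keyword sets S_v
siblings = {
    "1.1": frozenset({"CS"}),
    "1.2": frozenset({"Tom", "DASFAA", "XML"}),
    "1.3": frozenset({"DASFAA"}),   # S_1.3 subset of S_1.2: pruned
    "1.4": frozenset({"CS"}),       # same set as S_1.1: only one kept
}
```

Here node 1.3 is pruned because its keyword set is strictly subsumed by its sibling's, and only one of the two siblings with identical keyword sets survives.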
EXAMPLE 1. To find for the “CS” laboratory all papers that are written by “Tom” and published in “DASFAA” about “XML” from the XML document D in Fig. 1, we may submit a query Q = {CS, Tom, DASFAA, XML} to complete this task. Obviously, the qualified SLCA node is the root node with Dewey [10] label “1”. Therefore, the CSubtree for Q is D itself, the PSubtree is R_1, while the MSubtree is R_2.

From Example 1 we know that a CSubtree, e.g., D, may be incomprehensible for users, since it could be as large as the document itself, while a PSubtree could make users feel frustrated, since it may contain too much irrelevant information; e.g., although each leaf node of R_1 directly contains at least one keyword of Q, the three papers with Dewey labels “1.2.3, 1.3.2, 1.3.3” have nothing to do with “XML”. In fact, from Fig. 1 we can easily see that for the “CS” laboratory, the paper written by “Tom” and published in “DASFAA” about “XML” is the node with Dewey label “1.2.2”. According to Fig. 1, the keyword sets for nodes 1.1, 1.2, and 1.3 are S_1.1 = {CS}, S_1.2 = {Tom, DASFAA, XML}, and S_1.3 = {DASFAA}, respectively. According to the constraint of keywords subsumption, all nodes in the subtree rooted at node 1.3 should be pruned, since S_1.3 ⊂ S_1.2. Similarly, the keyword sets for nodes 1.2.1, 1.2.2, and 1.2.3 are S_1.2.1 = {Tom}, S_1.2.2 = {Tom, DASFAA, XML}, and S_1.2.3 = {Tom, DASFAA}, respectively. Since S_1.2.1 ⊂ S_1.2.2 and S_1.2.3 ⊂ S_1.2.2, all nodes in the subtrees rooted at nodes 1.2.1 and 1.2.3 should be removed. After that, we get the MSubtree, i.e., R_2, which contains all necessary information after removing nodes that do not satisfy the constraint of keywords subsumption, and is more self-explanatory and compact.

However, most existing methods [3, 4, 6, 9, 14] that address efficiency focus on computing qualified root nodes, such as SLCA or ELCA nodes, as efficiently as possible. In fact, constructing subtree results is not a trivial task. Existing methods [7, 8] need to firstly scan all node labels to compute qualified SLCA/ELCA results, and then rescan all node labels to construct the initial subtrees. After that, they need to buffer these subtrees in memory and apply the constraint of keywords subsumption on each node of these subtrees, to prune nodes with keyword sets subsumed by those of their sibling nodes, which is inefficient in time and space.

EXAMPLE 2. Take the MaxMatch algorithm [8] for example. It firstly gets the set of SLCA nodes based on either the indexed lookup (IL) [4], incremental multiway SLCA (IMS) [6], or hash search (HS) [12] algorithm, then computes all matched subtree results. We ran the MaxMatch algorithm for queries Q1 = {school, gender, education, takano, province} and Q2 = {incategory, text, bidder, data} on the XMark (http://monetdb.cwi.nl/xml) dataset of 582 MB. We found that no matter which one of IL, IMS, and HS is adopted, the time of SLCA computation is less than 7% and 1% of the overall running time for Q1 and Q2, respectively, as shown in Table 1.

From Example 2 we know that, given a set of SLCA nodes, the operation of computing matched subtree results dominates the overall performance of the MaxMatch algorithm, and thus deserves more attention. As illustrated by [7], an MSubtree could still contain redundant information; e.g., the four conference nodes, i.e., 1.2.2.3, 1.2.3.3, 1.3.2.3, and 1.3.3.3, of
D in Fig. 1 are the same as each other according to their content, and for keyword query {CS, conference}, returning only one of them is enough. However, the MSubtree result contains all of these conference nodes, because all of them satisfy the constraint of keywords subsumption.

In this paper, we focus on constructing tightest matched subtree (TMSubtree) results according to SLCA semantics. Intuitively, a TMSubtree is an MSubtree after removing redundant information, and it can be generated from the corresponding PSubtree by removing all nodes that do not satisfy the constraint of keywords subsumption, and keeping just one node for each set of sibling nodes that have the same keyword set. Assume that d is the depth of a given TMSubtree and m is the number of keywords of the given query Q. We proved that if d ≤ m, then a TMSubtree has at most 2m! nodes; otherwise, the number of nodes of a TMSubtree is bounded by (d − m + 2)·m!. Based on this theoretical result, we propose a pipelined algorithm to compute TMSubtrees without rescanning all node labels. Our algorithm sequentially processes all node labels in document order, and immediately outputs each TMSubtree once it is found. Compared with the MaxMatch algorithm [8], our method reduces the space complexity from O(d·Σ_{i=1}^{m}|L_i|) to O(d·max{2m!, (d − m + 2)·m!}), where L_i is the inverted Dewey label list of keyword k_i.

The rest of the paper is organized as follows. In Section II, we introduce background knowledge and discuss related work. In Section III, we give an in-depth analysis of the MaxMatch algorithm [8], then define the tightest matched subtree (TMSubtree) and discuss its properties, and finally present our algorithm for computing all TMSubtree results. In Section IV, we present the experimental results, and in Section V, we conclude our paper.
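The node-count bounds just stated can be checked numerically with a level-by-level count (the per-level formula is derived in Section III); this is an illustrative sketch and the function name is ours:

```python
from math import factorial

def max_tmsubtree_nodes(m, d):
    """Upper bound on TMSubtree size for m keywords and depth d:
    level i (i <= m) holds m!/(m-i+1)! nodes; every level deeper
    than m repeats the m! nodes of level m."""
    total = sum(factorial(m) // factorial(m - i + 1)
                for i in range(1, min(d, m) + 1))
    if d > m:
        total += (d - m) * factorial(m)
    return total
```

For m = 3 this yields 4, 10, and 16 nodes at depths 2, 3, and 4, each within the claimed bounds 2·3! = 12 and (4 − 3 + 2)·3! = 18.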
† An earlier version of this paper has been published at DASFAA 2012.
II. BACKGROUND AND RELATED WORK
We model an XML document as a node-labeled ordered tree, where nodes represent elements or attributes, and edges represent direct nesting relationships between nodes in the tree. Fig. 1 is a sample XML document. We say a node v directly contains a keyword k if k appears in the element or attribute name of v, or in the text value of v.

A Dewey label of a node v is the concatenation of its parent's label and its local order. The last component is the local order of v among its siblings, whereas the sequence of components before the last one is called the parent label. In Fig. 1, the Dewey label of each node is marked as a sequence of components separated by “.”. For a Dewey label A: a_1.a_2...a_n, we denote the number of components of A as |A|, and the i-th component as A[i]. As each Dewey label [10] consists of a sequence of components representing the path from the document root to the node it represents, the Dewey labeling scheme is a natural choice of state-of-the-art algorithms [2, 4, 6, 15, 16] for keyword query processing on XML data. The positional relationships between two nodes include document order (≺_d), equivalence (=), ancestor-descendant (AD, ≺_a), parent-child (PC, ≺_p), ancestor-or-self (⪯_a), and sibling relationships. u ≺_d v means that u is located before v in document order, u ≺_a v means that u is an ancestor node of v, and u ≺_p v denotes that u is the parent node of v. If u and v represent the same node, we have u = v, and both u ⪯_d v and u ⪯_a v hold. In the following discussion, we do not differentiate between a node and its label, if without ambiguity. For a given query
Q = {k_1, k_2, ..., k_m} and an XML document D, we use L_i to denote the inverted Dewey label list of k_i, of which all labels are sorted in document order. Let LCA(v_1, v_2, ..., v_m) be the lowest common ancestor (LCA) of nodes v_1, v_2, ..., v_m; the LCAs of Q on D are defined as LCA(Q) = {v | v = LCA(v_1, v_2, ..., v_m), v_i ∈ L_i (1 ≤ i ≤ m)}; e.g., the LCAs of Q = {XML, Tom} on D in Fig. 1 are nodes 1.2 and 1.2.2.

In the past few years, researchers have proposed many LCA-based semantics [1, 2, 4, 5, 11], among which SLCA [4, 6] is one of the most widely adopted. Compared with LCA, SLCA defines a subset of
LCA(Q), in which no LCA is the ancestor of any other LCA; formally, SLCASet = SLCA(Q) = {v | v ∈ LCA(Q) and ∄v′ ∈ LCA(Q) such that v ≺_a v′}. In Fig. 1, although 1.2 and 1.2.2 are LCAs of Q = {XML, Tom}, only 1.2.2 is an SLCA node for Q, because 1.2 is an ancestor of 1.2.2.

Based on the set of matched SLCA nodes, there are three kinds of subtree results: 1) CSubtree [2, 4]; 2) PSubtree [13]; and 3) MSubtree, which is a subtree rooted at v satisfying the constraints of monotonicity and consistency [7, 8], which can be further interpreted in terms of changes to the data and the query, respectively. Data monotonicity means that if we add a new node to the data, the number of query results should be (non-strictly) monotonically increasing. Query monotonicity means that if we add a keyword to the query, then the number of query results should be (non-strictly) monotonically decreasing. Data consistency means that after a data insertion, each additional subtree that becomes (part of) a query result should contain the newly inserted node. Query consistency means that if we add a new keyword to the query, then each additional subtree that becomes (part of) a query result should contain at least one match to this keyword. [8] has proved that if all nodes of a subtree t_v satisfy the constraint of “keywords subsumption”, then t_v must satisfy the constraints of monotonicity and consistency, that is, t_v is an MSubtree. According to Example 1, we know that compared with CSubtrees and PSubtrees, MSubtrees contain all necessary information after removing nodes that do not satisfy the constraint of keywords subsumption, and are more self-explanatory and compact.

To construct MSubtrees, the existing method [8] needs to firstly scan all node labels to compute qualified SLCA nodes, then rescan all node labels to construct the initial subtrees. After that, it needs to buffer these subtrees in memory and apply the constraint of keywords subsumption on each node of these subtrees to prune nodes with keyword sets subsumed by those of their sibling nodes, which is inefficient in time and space. Considering that an MSubtree may still contain redundant information (discussed in Section I), in this paper we focus on efficiently constructing TMSubtree results based on SLCA semantics. Intuitively, a TMSubtree is an MSubtree after removing redundant information. Constructing TMSubtree results based on ELCA semantics [7] is similar, and is therefore omitted for reasons of limited space.
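The Dewey-label relationships and the LCA/SLCA definitions above operate directly on label tuples. A minimal sketch (the function names and encodings are ours, not the paper's implementation):

```python
def parse(label):
    """'1.2.2' -> (1, 2, 2); tuple comparison then matches document order."""
    return tuple(int(c) for c in label.split("."))

def is_ancestor(u, v):
    """u is an ancestor of v iff u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u

def is_parent(u, v):
    """Parent-child: ancestor with a length difference of exactly one."""
    return is_ancestor(u, v) and len(v) == len(u) + 1

def lca(*nodes):
    """LCA of Dewey labels = their longest common prefix."""
    first, rest = nodes[0], nodes[1:]
    n = 0
    while n < len(first) and all(len(v) > n and v[n] == first[n] for v in rest):
        n += 1
    return first[:n]

def slca_from_lcas(lcas):
    """SLCA(Q): drop every LCA that is an ancestor of another LCA."""
    return sorted(v for v in lcas
                  if not any(is_ancestor(v, u) for u in lcas))
```

On the Fig. 1 example, filtering the LCA set {1.2, 1.2.2} keeps only 1.2.2, since 1.2 is an ancestor of 1.2.2.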
III. RESULT ENUMERATION
> A. Insight into the MaxMatch Algorithm
The MaxMatch algorithm [8] returns MSubtree results that are rooted at SLCA nodes and satisfy the constraint of “keywords subsumption”. For a given query Q = {k_1, ..., k_m} and an XML document D, suppose that L_1 (L_m) is the inverted Dewey label list of the least (most) frequent keyword of Q, and d is the depth of D. As shown in Algorithm 1, the MaxMatch algorithm works in three steps to produce all MSubtree results.

Step 1 (line 1): MaxMatch finds from the m inverted Dewey label lists the set of SLCA nodes, i.e., SLCASet, by calling the IL algorithm [4]. The cost of this step is O(m·d·|L_1|·log|L_m|). In this step, all Dewey labels are processed once. Note that any algorithm for SLCA computation can be used in this step.

Step 2 (line 2): MaxMatch calls function groupMatches to construct the set of groups, i.e., groupSet. As shown in groupMatches, it needs to firstly merge the m lists into a single list, then sequentially rescan all labels and insert each one into a certain group (if possible); in this step, all Dewey labels are processed twice to construct the set of groups.

Step 3 (lines 3-5): For each group g, MaxMatch firstly constructs the PSubtree, then traverses it to prune redundant information. The overall cost of Step 3 is O(min{|D|, d·Σ_{i=1}^{m}|L_i|}·2^m), where 2^m is the cost of checking whether the set of distinct keywords of a node v is subsumed by that of its sibling nodes; if not, then v is a node that satisfies the constraint of keywords subsumption.

Therefore, the time complexity of Algorithm 1 is O(max{min{|D|, d·Σ_{i=1}^{m}|L_i|}·2^m, m·d·|L_1|·log|L_m|}). Moreover, as the MaxMatch algorithm needs to buffer all groups in memory before Step 3, its space complexity is O(d·Σ_{i=1}^{m}|L_i|).

> B. The Tightest Matched Subtree
DEFINITION 1. (TMSubtree) For an XML tree D and a keyword query Q, let S_v ⊆ Q be the set of distinct keywords that appear in the subtree rooted at v, S_v ≠ Φ. A subtree t is a TMSubtree iff t's root node is an SLCA node, each node v of t satisfies that for each sibling node v′ of v, S_v ⊄ S_{v′}, and for each set of sibling nodes {v_1, v_2, ..., v_n} satisfying S_{v_1} = S_{v_2} = ... = S_{v_n}, only one of them is kept for presentation.

Intuitively, a TMSubtree is an MSubtree with redundant information removed. It can be generated from the corresponding PSubtree by removing all nodes that do not satisfy the constraint of keywords subsumption, and keeping just one node for each set of sibling nodes that have the same keyword set.
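The sibling conditions of Definition 1 can be checked recursively on a toy tree; this sketch is ours (the tree encoding is a simple tuple nesting, and the SLCA-root condition is assumed to be checked separately):

```python
def keyword_set(tree):
    """tree = (keywords directly at this node, [child subtrees]);
    returns S_v, the distinct keywords in the subtree rooted here."""
    kws, children = tree
    s = set(kws)
    for c in children:
        s |= keyword_set(c)
    return frozenset(s)

def satisfies_def1(tree):
    """Check Definition 1's sibling conditions at every node: no child's
    keyword set may be subsumed by, or equal to, a sibling's set."""
    _, children = tree
    sets = [keyword_set(c) for c in children]
    for i, s in enumerate(sets):
        for j, t in enumerate(sets):
            if i != j and s <= t:   # subsumed or duplicated: violates Def. 1
                return False
    return all(satisfies_def1(c) for c in children)

# toy trees; node contents are illustrative only
ok_tree = ((), [(("CS",), []), (("Tom", "DASFAA", "XML"), [])])
bad_tree = ((), [(("DASFAA",), []), (("Tom", "DASFAA", "XML"), [])])
```

The second tree fails because one child's keyword set {DASFAA} is subsumed by its sibling's set.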
DEFINITION 2. (Maximum TMSubtree) Let t be a TMSubtree of Q, v a node of t, S_v ⊆ Q the set of distinct keywords that appear in the subtree rooted at v, and v.level the level of v in t. We say t is a maximum TMSubtree if it satisfies the following conditions: 1) v has |S_v| child nodes v_1, v_2, ..., v_{|S_v|}; 2) if |S_v| ≥ 2 and v.level < d, then for any two child nodes v_i, v_j (1 ≤ i ≠ j ≤ |S_v|), |S_{v_i}| = |S_{v_j}| = |S_v| − 1 and |S_{v_i} ∩ S_{v_j}| = |S_v| − 2; 3) if |S_v| = 1 and v.level < d, then v has one child node v_1 and S_v = S_{v_1}.

LEMMA 1. Given a maximum TMSubtree t, t is not a TMSubtree anymore after inserting any keyword node into t without increasing t's depth.

Proof. Suppose, to the contrary, that we can insert a node v_c into t as a child node of some non-leaf node v of t such that t remains a TMSubtree, where v_c satisfies S_{v_c} ⊆ S_v. Obviously, there are four kinds of relationships between S_{v_c} and S_v:

1) |S_{v_c}| = |S_v| = 1. In this case, v has one child node that contains the same keyword as v. Obviously, v_c cannot be inserted into t, according to Definition 1.

2) S_{v_c} = S_v and |S_v| ≥ 2. Since v has |S_v| child nodes and each one contains |S_v| − 1 keywords, for each child node v_i (1 ≤ i ≤ |S_v|) of v, we have S_{v_i} ⊂ S_v = S_{v_c}. According to Definition 1, all existing child nodes of v would have to be removed if v_c were inserted into t; thus v_c cannot be inserted into t as a child node of v in this case.

3) |S_{v_c}| = |S_v| − 1 and |S_v| ≥ 2. According to condition 2, the existing child nodes of v contain all possible combinations of |S_v| − 1 keywords in S_v; thus there must exist a child node v_{c_i} of v such that S_{v_c} = S_{v_{c_i}}, which contradicts Definition 1; thus v_c cannot be inserted into t in this case.

4) |S_{v_c}| < |S_v| − 1 and |S_v| ≥ 2. According to condition 2, the existing child nodes of v contain all possible combinations of |S_v| − 1 keywords in S_v; thus there must be a child node v_{c_i} of v such that S_{v_c} ⊂ S_{v_{c_i}}, which also contradicts Definition 1; thus v_c cannot be inserted into t in this case.

In summary, if t is a maximum TMSubtree, no other keyword node can be inserted into t such that t is still a TMSubtree, without increasing t's depth. □

THEOREM 1. Given a keyword query Q = {k_1, k_2, ..., k_m} and one of its TMSubtrees t of depth d, if d ≤ m, then t has at most 2m! nodes; otherwise, the number of nodes of t is bounded by (d − m + 2)·m!.

Proof. Assume that t is a maximum TMSubtree. Obviously, the number of nodes at the 1st level of t is N(1) = 1. For the 2nd level, since S_{t.root} = Q, t.root has |S_{t.root}| = m child nodes, of which each one contains |S_{t.root}| − 1 distinct keywords. Thus we have Formula 1 to compute the number of nodes at the i-th level (i ≤ m):

N(i) = m·(m − 1)···(m − i + 2) = m!/(m − i + 1)!   (1)

If d ≤ m, the total number of nodes in t is Σ_{i=1}^{d} N(i) = Σ_{i=1}^{d} m!/(m − i + 1)! ≤ Σ_{k=1}^{m} m!/k! < 2m!.

If d > m, each node at the m-th level of t contains only one keyword, and all levels greater than m contain the same number of nodes as the m-th level. According to Formula 1, we know that N(m) = m!, thus the total number of nodes in t is Σ_{i=1}^{m} N(i) + (d − m)·m! < 2m! + (d − m)·m! = (d − m + 2)·m!.

Therefore, if d ≤ m, a TMSubtree t has at most 2m! nodes; otherwise, the number of nodes of t is bounded by (d − m + 2)·m!. □

EXAMPLE 3. Given a keyword query Q = {k_1, k_2, k_3}, Fig. 2 shows three subtree results; according to Definition 1, they are all TMSubtrees. Obviously, with their depths fixed, no other node with any combination of k_1 to k_3 can be inserted into these TMSubtree results, according to Lemma 1; that is, they are all maximum TMSubtrees. The TMSubtree in Fig. 2a has 4 nodes. Since its depth, 2, is less than the number of keywords, i.e., 3, it satisfies Theorem 1, since 4 < 2 × 3! = 12. The TMSubtree in Fig. 2b is another maximum TMSubtree with 10 nodes, and also satisfies Theorem 1, since 10 < 2 × 3! = 12. Fig. 2c is also a maximum TMSubtree with 16 nodes. Since the depth of this TMSubtree, 4, is greater than the number of keywords of Q, it still satisfies Theorem 1, since 16 < (4 − 3 + 2) × 3! = 18.

> C. The Algorithm
Compared with the MaxMatch algorithm, which produces all subtree results in three steps, the basic idea of our method is to construct all subtree results directly while processing all Dewey labels. The benefits of our method lie in two aspects: 1) the amount of data buffered in memory is largely reduced; and 2) each Dewey label is visited only once. The first benefit comes from Theorem 1, which guarantees that our method does not need to buffer huge volumes of data in memory as [8] does; the second benefit comes from the design of our algorithm.
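The "visit each label once" scan amounts to a k-way merge over per-list cursors; a minimal illustrative sketch (the function name and list encoding are ours) of the cursor/minimum-selection interplay described next:

```python
import heapq

def merged_labels(lists):
    """Yield (label, keyword_index) pairs from the m sorted inverted Dewey
    label lists in global document order, visiting each label exactly once."""
    # one cursor per list, kept in a min-heap: (current label, list, position)
    heap = [(labs[0], i, 0) for i, labs in enumerate(lists) if labs]
    heapq.heapify(heap)
    while heap:
        label, i, j = heapq.heappop(heap)      # choose the minimum label
        yield label, i
        if j + 1 < len(lists[i]):              # advance this list's cursor
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
```

For example, merging [[(1,1), (1,3)], [(1,2)]] yields the three labels interleaved in document order.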
In our algorithm, for a given keyword query Q = {k_1, k_2, ..., k_m}, each keyword k_i corresponds to a list L_i of Dewey labels sorted in document order, and L_i is associated with a cursor C_i pointing to some Dewey label of L_i. C_i can move to the Dewey label (if any) next to it by using advance(C_i). Initially, each cursor C_i points to the first Dewey label of L_i.

As shown in Algorithm 2, our method sequentially scans all Dewey labels in document order. The main procedure is very simple: in each iteration, among all labels that have not been visited yet, it firstly chooses the currently minimum Dewey label by calling the selectMinLabel function (line 3), then processes it by calling the pushStack procedure (line 4), and finally moves C_k forward to the next element in L_k (line 5). After all Dewey labels are processed, our algorithm pops all in-stack elements (lines 7-9), then outputs the last TMSubtree result, to terminate the processing (line 10). During the processing, our algorithm uses a stack
S to temporarily maintain all components of a Dewey label, where each stack element e denotes a component of a Dewey label, which corresponds to a node in the XML tree. e is associated with two variables: the first one is a binary bitstring indicating which keywords are contained in the subtree rooted at e; the second is a set of pointers to its child nodes, which is used to maintain intermediate subtrees.

The innovation of our method lies in that it immediately outputs each TMSubtree result t_v upon finding that v is a qualified SLCA node, which makes it more efficient in time and space. Specifically, in each iteration (lines 2 to 6 of Algorithm 2), our method selects the currently minimum Dewey label C_k in line 3, then pushes all components of C_k into S in line 4. The pushStack procedure firstly pops from S all stack elements that are not part of the common prefix of C_k and the label represented by the current stack elements, then pushes into S all of C_k's components that are not in the stack. To pop an element from S, the pushStack procedure calls the popStack procedure. The popStack procedure is a little more tricky. It firstly checks whether the popped element v is an LCA node. If v is not an LCA node, popStack firstly transfers the value of v's bitstring to its parent node in S (line 14), then inserts the subtree t_v into t_{top(S)} in line 15. In lines 16 to 22, popStack deletes all possible redundant subtrees, by checking the subsumption relationship between the keyword set of v and those of its sibling nodes. If v is an LCA node (line 3) and located after the previous LCA node u in document order (line 4), then u is an SLCA node if it is not an ancestor of v (line 5), and popStack directly outputs the TMSubtree result rooted at u in line 6. The subtree rooted at u is then deleted (line 8), and u is set to point to v in line 9. If v is an LCA node but located before u, then v is not an SLCA node, and we directly delete the subtree rooted at v (line 11).

EXAMPLE 4. Consider the XML document D in Fig. 1 and query Q = {Mike, DASFAA, DB}. The inverted Dewey label lists for the keywords of Q are shown in Fig. 3b. The status of processing these labels is shown in Fig. 3a1-a14. In this example, we use “001” (“010” or “100”) to indicate that “Mike” (“DASFAA” or “DB”) is contained in a subtree rooted at some node. After 1.2.2.1 is pushed into the stack, the status is shown in Fig. 3a1, where the bitstring of the top element of S is “001”, indicating that node 1.2.2.1 contains “Mike”. The second pushed label is 1.2.2.3, and the status is shown in Fig. 3a2. Note that after an element is popped out of the stack, its bitstring is transferred to its parent in the stack. The next two labels are processed similarly. Before 1.3.1 is pushed into the stack, we can see from Fig. 3a5 that the bitstring of the top element in S is “111”, which means that node 1.2 contains all keywords. After 1.3.1 is pushed into the stack, the subtree rooted at 1.2 is temporarily buffered in memory. As shown in Fig. 3a10, before 1.3.3.1 is pushed into the stack, the last component of 1.3.2 is popped out of the stack. Since the bitstring of the current top element in S is “111” (Fig. 3a10), we know that 1.3.2 is an LCA node. According to line 5 of the popStack procedure, we know that the previous LCA, i.e., 1.2, is an SLCA node, thus we output the matched subtree result rooted at 1.2. After that, the subtree rooted at 1.3.2 is temporarily buffered in memory. When the last component of 1.3 is popped out of S (Fig. 3a14), according to line 3 of the popStack procedure, we know that 1.3 is an LCA node. According to line 4 of the popStack procedure, the previous LCA node, i.e., 1.3.2, is located after 1.3 in document order, thus we know immediately that 1.3 is not an SLCA node, and we delete the subtree rooted at 1.3 in line 11 of the popStack procedure.
Finally, we output the TMSubtree rooted at 1.3.2 in line 10 of Algorithm 2. Therefore for
Q, the two TMSubtree results are rooted at 1.2 and 1.3.2, respectively.

As shown in Algorithm 2, for a given keyword query Q = {k_1, k_2, ..., k_m} and an XML document D of depth d, our method just needs to sequentially scan all labels in the m inverted label lists once; therefore, the overall I/O cost of Algorithm 2 is O(Σ_{i=1}^{m}|L_i|).

Now we analyze the time complexity of our algorithm. Since our algorithm needs to process all components of each involved Dewey label of the given keyword query Q = {k_1, k_2, ..., k_m}, the total number of components processed in our method is bounded by d·Σ_{i=1}^{m}|L_i|, and the cost of processing these components is O(d·Σ_{i=1}^{m}|L_i|). During processing, each one of these components will be inserted into a subtree and deleted from the same subtree just once, and the cost of both inserting and deleting a component is O(1). When inserting a subtree into another subtree, the operation of checking the subsumption relationship between the keyword sets of two sibling nodes will be executed at most m times, according to Definition 2. Therefore, the overall time complexity is O(m·d·Σ_{i=1}^{m}|L_i|).

Since our method is executed in a pipelined way, at any time it just needs to maintain at most
d subtrees, where each one is no larger than a maximum TMSubtree. According to Theorem 1, each matched subtree result contains at most max{2m!, (d − m + 2)·m!} nodes. Therefore, the space complexity of our method is O(d·max{2m!, (d − m + 2)·m!}). Since d and m are very small in practice, the size of the subtrees buffered in memory is very small.

Note that to output the names of the nodes in a TMSubtree result, existing methods either store all path information in advance, suffering from huge storage space [1], or use extended Dewey labels [17], paying an additional cost to compute the name of each node according to predefined rules. In contrast, our method
maintains a hash mapping between each path ID and the corresponding path information; the total number of index entries is the number of nodes in the dataguide index [18] of the XML tree, which is very small in practice. To derive the name of each node on a path, we maintain in each Dewey label a path ID after the last component; thus we can get the name of each node on a path in constant time.
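A much-simplified sketch of the one-pass, stack-based scan at the heart of Algorithm 2, reduced to SLCA detection only (subtree maintenance, subsumption pruning, and the immediate-output rule are omitted; the encoding and names are ours): labels are merged in document order, each popped element's bitstring is OR-ed into its parent, nodes whose bitstring becomes full are recorded as LCA candidates, and only the smallest ones are kept.

```python
def slca_pipeline(dewey_lists):
    """One-pass SLCA computation over m inverted Dewey label lists
    (each a sorted list of int tuples), via a component stack."""
    m = len(dewey_lists)
    full = (1 << m) - 1
    # merge all labels in document order, tagging each with its keyword bit
    tagged = sorted((lab, 1 << i)
                    for i, labs in enumerate(dewey_lists) for lab in labs)
    stack = []          # stack elements: [component, keyword bitstring]
    candidates = []     # nodes whose subtree covered all keywords when popped

    def pop():
        comp, mask = stack.pop()
        if mask == full:                      # popped node is an LCA node
            candidates.append(tuple(e[0] for e in stack) + (comp,))
        if stack:
            stack[-1][1] |= mask              # transfer bitstring to parent

    for label, bit in tagged:
        # length of the common prefix of label and the current stack path
        common = 0
        for comp, elem in zip(label, stack):
            if comp != elem[0]:
                break
            common += 1
        while len(stack) > common:            # pop non-common-prefix elements
            pop()
        for comp in label[len(stack):]:       # push the remaining components
            stack.append([comp, 0])
        stack[-1][1] |= bit
    while stack:
        pop()
    # smallest LCAs only: drop candidates that have a candidate descendant
    return sorted(v for v in candidates
                  if not any(u != v and u[:len(v)] == v for u in candidates))
```

On a toy document where keyword 1 occurs at 1.1.1 and 1.2.1, and keyword 2 at 1.1.2 and 1.3, both node 1 and node 1.1 cover all keywords, and only the smaller one, 1.1, is returned.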
IV. EXPERIMENTAL EVALUATION
> A. Experimental Setup
Our experiments were conducted on a PC with a Pentium(R) Dual-Core E7500 2.93 GHz CPU, 2 GB of memory, a 500 GB IDE hard disk, and Windows XP Professional as the operating system.
The algorithms used for comparison include the MaxMatch algorithm [8] (used here to output TMSubtrees) and our mergeMatching algorithm. MaxMatch was implemented based on IL [4], IMS [6], and HS [12], to test the impact of different SLCA computation algorithms on the overall performance; the variants are denoted as MaxMatch-IL, MaxMatch-IMS, and MaxMatch-HS, respectively. All of these algorithms and our mergeMatching algorithm were implemented in Microsoft VC++. All results are average times, obtained by executing each algorithm 100 times on a hot cache.
> B. Datasets and Queries
We use the XMark dataset for our experiments, because it possesses a complex schema, which can test the features of different algorithms in a more comprehensive way. The size of the dataset is 582 MB; it contains 8.35 million nodes, and the maximum depth and average depth of the XML tree are 12 and 5.5, respectively.
We selected 30 keywords, which are classified into three categories according to the length of their inverted Dewey label lists (the |L_i| line in Table 2): 1) low frequency (100-1,000); 2) medium frequency (10,000-40,000); and 3) high frequency (300,000-600,000).

Based on these keywords, we generated four groups of queries, as shown in Table 3: 1) four queries (Q1 to Q4) with 2, 3, 4, and 5 keywords of low frequency; 2) four queries (Q5 to Q8) of medium frequency; 3) four queries (Q9 to Q12) of high frequency; and 4) 20 queries (Q13 to Q32) with keywords of random frequency.
> C. Evaluation Metrics
The metrics used for evaluating these algorithms include: 1) the number of buffered nodes, which is used to compare the space cost of these algorithms, 2) running time on SLCA and subtree result computation, and 3) scalability.
For a given query, we define the
result selectivity as the size of the results over the size of the shortest inverted list. The second-to-last column of Table 3 gives the result selectivity of each query.

> D. Performance Comparison and Analysis
1) Comparison of the Number of Buffered Nodes
To compare the space cost of the different algorithms, we collected statistics on the number of buffered nodes for the MaxMatch and mergeMatching algorithms. Table 4 shows a comparison of the number of buffered nodes for each query, from which we can see that the number of buffered nodes for mergeMatching is much smaller than that of MaxMatch. The reason is that MaxMatch needs to firstly construct all path subtrees and buffer them in memory, while mergeMatching just needs to buffer at most d TMSubtrees in memory, where d is the depth of the given XML tree.

2) Comparison of SLCA Computation
To get all matched subtree results, the first critical step is computing all qualified SLCA nodes. Fig. 4 shows the comparison of different algorithms on SLCA computation.
From Fig. 4 we have the following observations: 1) mergeMatching is beaten by HS, IL, and IMS for almost all queries. This is because mergeMatching sequentially processes all involved Dewey labels, so its performance is dominated by the number of involved Dewey labels, while IL and IMS are more flexible in utilizing the positional relationships to prune useless keyword nodes; HS only sequentially processes each Dewey label of the shortest inverted Dewey label list L_1, and its performance is dominated by the length of L_1. 2) HS is more efficient than IL and IMS when there exists a huge difference between the lengths of the involved Dewey label lists, such as for Q13, Q15, and Q17 to Q32, while it is beaten by IMS when the result selectivity is low, such as for Q11 and Q12. This is because HS needs to process all Dewey labels of the shortest list, while IMS can wisely skip many useless nodes. 3) IL can beat IMS for some queries, while it can also be beaten by IMS for other queries, especially when the result selectivity is low, such as for Q6, Q11, and Q12. This is because the IL algorithm computes the SLCA results by processing two lists at a time, from the shortest to the longest, while the IMS algorithm computes each potential SLCA by taking one node from each Dewey label list in each iteration. IMS could be the best choice when the nodes of the lists are not uniformly distributed; IL could perform best when all nodes are uniformly distributed and there exists a huge difference in the lengths of the lists.

3) Impacts of the SLCA Computation on the Overall Performance
To further investigate the impact of SLCA computation on the overall performance, we implemented the MaxMatch algorithm on top of IL, IMS, and HS, denoted as MaxMatchIL, MaxMatchIMS, and MaxMatchHS, respectively. As shown in Table 5, the three algorithms always consume similar running time, and in most cases MaxMatchHS is slightly better than the other two. The reason is that they all share the same procedure for constructing matched subtree results, which usually takes much more time than computing SLCA nodes. Therefore, even though HS outperforms IL and IMS in most cases, the overall performance of the three algorithms is similar.
Further, from Fig. 4 and Table 5 we know that no matter how efficient HS, IL, and IMS are, and no matter how much one of them outperforms the others, their performance benefits on SLCA computation vanish when compared with the time spent constructing matched subtree results by the MaxMatch algorithm.
4) Comparison of Constructing Subtree Results
Fig. 5 shows the comparison between MaxMatch and mergeMatching on constructing all matched subtree results. Although mergeMatching is not as efficient as HS, IL, and IMS on SLCA computation, it is much more efficient than MaxMatch on constructing matched subtree results, such as for Q1 to Q12, Q17, Q18, Q20 to Q24, Q26, and Q28 to Q32, and the time saved in the second phase far exceeds the time lost in the first phase. The reason is that mergeMatching processes each Dewey label only once, while MaxMatch needs to first scan all Dewey labels once to compute the SLCA results, then scan them again to construct a set of initial groups according to the different SLCA nodes, then process them once more to construct all path subtree results, and finally traverse all nodes to prune the useless ones.
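The pruning step that both algorithms ultimately perform, discarding a sibling subtree whose keyword set is subsumed by another sibling's, can be sketched as follows. This is an illustrative reading of the "matched" and "tightest" conditions from the abstract; the tie-breaking rule for siblings with identical keyword sets (we keep the first) is our assumption, not something the paper prescribes.

```python
def prune_siblings(children):
    """children: list of (node, keyword_set) pairs for sibling subtrees.

    'Matched': drop a sibling whose keyword set is strictly subsumed by
    another sibling's set.  'Tightest': no two kept siblings may carry
    the same keyword set (we keep the first of any duplicates -- an
    assumption for illustration).
    """
    kept = []
    for node, ks in children:
        # skip if subsumed by, or a duplicate of, an already-kept sibling
        if any(ks <= other for _, other in kept):
            continue
        # a new, larger set may in turn subsume previously kept siblings
        kept = [(n, o) for n, o in kept if not o < ks]
        kept.append((node, ks))
    return kept

siblings = [("a", {"k1"}), ("b", {"k1", "k2"}), ("c", {"k1", "k2"})]
print(prune_siblings(siblings))   # -> [('b', {'k1', 'k2'})]
```

Performing this test eagerly, as each sibling group is completed, is what lets mergeMatching avoid the final whole-tree traversal that MaxMatch requires.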
5) Comparison of the Overall Performance
Fig. 6 shows the comparison of the overall running time between MaxMatchHS and mergeMatching, from which we know that in most cases mergeMatching performs much better than MaxMatchHS. The reason is that, compared with MaxMatch, mergeMatching does not need to rescan all involved Dewey labels, and does not delay pruning useless nodes until after all path subtree results have been constructed. Instead, it prunes useless nodes while constructing each matched subtree result, and thus can quickly reduce the number of tests of the keyword-subsumption relationship between sibling nodes.
Besides, we show in Fig. 7 the scalability when executing Q7 on XMark datasets of different sizes (from 116 MB to 1,745 MB, about 15×). The query time of both MaxMatchHS and our mergeMatching algorithm grows sublinearly with the data size, and mergeMatching consistently saves about 78% of the running time compared with MaxMatchHS. For other queries we have similar results, which are omitted due to space limitations.
V. CONCLUSIONS
Considering that a TMSubtree is more self-explanatory and compact than a CSubtree or a PSubtree, but existing methods for subtree result computation need to rescan all Dewey labels, we focused in this paper on the efficient construction of TMSubtree results for keyword queries on XML data based on SLCA semantics. We first proved an upper bound on the size of a given TMSubtree: it has at most 2·m! nodes if d ≤ m; otherwise, its size is bounded by (d − m + 2)·m!, where d is the depth of the TMSubtree and m is the number of keywords of the given query Q. We then proposed a pipelined algorithm to accelerate the computation of TMSubtree results, which only needs to sequentially scan all Dewey labels once, without buffering huge volumes of intermediate results. Because the space complexity of our method is O(d·max{2·m!, (d − m + 2)·m!}), and in practice d and m are very small, the size of the buffered subtrees is very small. The experimental results in Section IV verify the benefits of our algorithm in aiding keyword search over XML data.
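Written out, the size bound from the conclusion (with d the TMSubtree depth and m the number of query keywords) is:

```latex
\[
  |T| \;\le\;
  \begin{cases}
    2\, m! & \text{if } d \le m,\\[2pt]
    (d - m + 2)\, m! & \text{if } d > m,
  \end{cases}
  \qquad \text{e.g. } m = 3,\ d = 5:\ (5 - 3 + 2)\cdot 3! = 24 \text{ nodes.}
\]
```

The numeric example is ours, chosen only to show how small the bound stays for the depths and keyword counts typical in practice.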

[Table 1.] Comparison of the running time for smallest lowest common ancestor (SLCA) computation and subtree construction

[Fig. 1.] A sample extensible markup language document D.

[Table 6.] Algorithm 1 MaxMatch

[Fig. 2.] Illustration of three possible TMSubtrees for keyword query Q ={k1, k2, k3}.

[Table 7.] Algorithm 2 mergeMatching(Q)

[Fig. 3.] Running status for Q = {Mike, DASFAA, DB}.

[Table 2.] Statistics of keywords used in our experiment

[Table 3.] Queries on XMark dataset

[Table 4.] Comparison of the number of buffered nodes for each query

[Fig. 4.] Comparison of running time on smallest lowest common ancestor (SLCA) computation. HS: hash search, IL: indexed lookup, IMS: incremental multiway SLCA.

[Table 5.] Comparison of the running time for the MaxMatch algorithm based on different methods for smallest lowest common ancestor computation (ms)

[Fig. 5.] Comparison of running time on constructing all subtree results.

[Fig. 6.] Comparison of the overall running time for different methods. HS: hash search.

[Fig. 7.] Comparison of scalability on extensible markup language documents of different sizes for Q7. HS: hash search.