Online Clustering Algorithms for SemanticRich Network Trajectories
 Author: Roh GookPil, Hwang Seungwon
 Organization: Roh GookPil; Hwang Seungwon
 Publish: Journal of Computing Science and Engineering Volume 5, Issue4, p346~353, 30 Dec 2011

ABSTRACT
With the advent of ubiquitous computing, a massive amount of trajectory data has been published and shared in many websites. This type of computing also provides motivation for online mining of trajectory data, to fit userspecific preferences or context (e.g., time of the day). While many trajectory clustering algorithms have been proposed, they have typically focused on offline mining and do not consider the restrictions of the underlying road network and selection conditions representing user contexts. In clear contrast, we study an efficient clustering algorithm for Boolean + Clustering queries using a prematerialized and summarized data structure. Our experimental results demonstrate the efficiency and effectiveness of our proposed method using reallife trajectory data.

KEYWORD
Online clustering , Semanticrich trajectory

I. INTRODUCTION
With the advent of ubiquitous computing devices equipped with global positioning system (GPS), radio frequency identification (RFID) or sensors, we can easily record object movements or trajectories. Recently, much research work [14] has aimed to evaluate spatiotemporal queries, such as range and nearest neighbor queries, on a trajectory database from the given query point.
While most existing work focuses on supporting trajectories that record only the locations over time, the actual trajectories collected from recent locationbased services often have greater semantic richness. For example, Flickr (flickr.com) allows users to geotag their photos and annotate them with comments and tags. This enables users to describe their trips as a trajectory of photos and information, for example, Fig. 1 shows a geotagged photo of the opera house on the map with annotated information such as title, username or timestamp. Such additional information enables us to bridge the semantic gap between the location and the actual activity at the location, by distinguishing between different activities that occurred at the same location, or the same activities at different locations, as Fig. 2 illustrates. Recent work [5], distinguishing activities from locations using user comments, has the same goal as ours: supporting semanticrich locationbased service.
The scale of such semanticrich trajectories, e.g., millions of geotagged images added monthly to Flickr, naturally motivates us to support advanced data analysis (e.g., clustering), e.g., recommending interesting travel itineraries [6].
Our goal is to enable online clustering of semanticrich trajectories, motivated by the following example scenario.
Example 1 . While existing forms of travel recommendation, e.g., travel guides or portals, focus on offline computation rendering the same itinerary for all users, travel recommendation on ubiquitous devices naturally motivate forms of online recommendation that can adapt to changes in userspecific context. For instance, the recommendation should vary according to the temporal context, such as time of the day or season of the year.We thus propose to study online trajectory clustering algorithms, combining the following semantic context criteria:
Soft criteria: Users may embed various semantic information, e.g., semantic geographic information, in the similarity measure.
Hard criteria: Users may consider only trajectories satisfying some Boolean semantic criteria, such as time = evening or user = Tom.
In this paper, we focus on combining the two semantic context criteria with clustering, which can be easily provided by the user at the query time. However, existing clustering algorithms cannot address this problem due to the following challenges. First, most existing work focuses on offline unconstrained location trajectories, such as hurricanes or animal trajectories. In contrast, in many locationbased services, trajectory mining should reflect the proximity with respect to the underlying constraint, e.g., road networks or subway routes, because two nearby locations can be very hard to reach if no roads connect the two locations. We study online clustering, which supports soft semantic criteria using a similarity measure based on the underlying geographic information such as road networks.
Secondly, to support hard semantic criteria, we need to support Boolean + Clustering queries efficiently. While existing clustering algorithms can be adopted after Boolean conditions are evaluated, such naive adoption incurs a significantly high computational cost, which provides motivation for an integrated algorithm capable of evaluating both.
To this end, we design a summarized data structure where precomputed initial clusters of trajectories are stored. This can expedite online computation in order to merge these initial clusters. We store statistical information on selection attributes to reflect Boolean selections in the merging process.
This paper addresses the above challenges, which is an extension of our preliminary work [7] that only tackled the first challenge. We discuss our extension of this preliminary work in Section II. The main contributions of this paper are as follows:
We are motivated to study the problem of supporting Boolean + Clustering queries for semanticrich trajectories.
We design a summarized structure for prematerializing initial trajectory clusters.
We propose an efficient clustering algorithm built on the proposed summarized structure.
We extensively validate the effectiveness and the efficiency of our proposed framework using reallife trajectory data.
The rest of the paper is organized as follows. Section II presents the prior research work. In Section III, we present the formal definition of our problem. Section IV studies distance measures. Section V describes our clustering algorithms. Section VI reports our experimental results. Section VII concludes this paper.
II. RELATED WORK
Lately, clustering trajectories have been studied actively, but most work assumes trajectories without road network constraints, as in hurricane tracking or animal movement data. For trajectories without constraints, trajectories are typically represented by either a set or sequence of locations (sampling points), for which similarity metrics have been proposed in [813] and [1417], respectively.
In a clear contrast, this work focuses on trajectories that are constrained by road networks, for which a similarity metric has only recently been proposed in [7]. In many reallife applications, moving objects are often constrained by obstacles such as buildings, trees, or roads. Roh and Hwang [7] studied a similarity metric that reflects such constraints in trajectory analysis.
Once similarity metrics are defined, existing clustering algorithms can be applied to group trajectories into clusters. Specifically, recent work [18] proposes to adopt the expectation maximization (EM) algorithm and densitybased clustering algorithm [12] for the task of clustering trajectories. However, these algorithms are not suitable for our proposed problem of online clustering with Boolean selection conditions. Gaffney and Smyth [18] requires slow convergence, which prohibits online responses, while [12] requires to prematerialize an index structure for online computation, which will be ineffective when there exist Boolean selection conditions.
Our preliminary work [7] distinguishes itself from the existing work by designing a distance function for networkconstrained location trajectories and developing an online algorithm. Based on this preliminary work, we extend our algorithms to support semanticrich trajectories, archiving not only location changes, but also semantic information such as images, text, user ID and tags.
In particular, our study aims to support Boolean selection of semantic features in this paper. For efficient support, we propose to minimize online computational requirements, as similarly adopted for numerical data in [19]. However, adopting [19] for our proposed problem is not desirable, as this work assumes two objects with similar selection attribute values are also deemed to be similar according to the similarity metric. This assumption does not hold in our problem, where similarity and selection attributes are allowed to be independent, e.g., similarity of image features and selection of timestamps.
The most relevant work [5] to this paper distinguishes activities from locations using comments added by users on the trajectory. This work has the same goals as ours; we support Boolean conditions or texts or tags, such as
tag =dining . A key distinction of our work from the existing work is to provide a systematic framework for online clustering with arbitrary Boolean semantic conditions.III. PROBLEM DEFINITION: BOOLEAN + CLUSTERING
In this section, we formally define our problem. We begin by modeling a networkconstrained trajectory. We model a road network as a graph, where an edge is a road segment and a vertex is an intersection of adjacent segments, formally stated as follows:
Definition 3.1 Road network. A road network (RN) is defined as a graphG _{RN} = (V ,E ), whereV is a set of intersections of the road network, and E is a set of road segmentsR_{i} ∈E such thatRi = (ri,s, ri,e),
where
r_{i,s} ,r_{i,e} ∈V and there exists a road betweenr_{i,s} andr_{i,e} .The trajectories on the road network can thus be represented by a set of connected road segments, or edges in the abovementioned graph. Methods of matching GPS logs to road segments have been studied in existing map matching literatures (e.g., [20]).
Definition 3.2 Networkconstrained trajectory. Given a road networkG_{RN} = (V ,E ), a trajectory of length l is defined asTRi = [ti1 , ..., til],
where
t^{i}_{j} ∈E , 1 ≤j ≤l andt^{i}_{j} andt^{i}_{j+1} are connected.We formally define our problem of finding a set of disjoint clusters C for trajectories that satisfy the given Boolean condition set B. The Boolean condition set can be a complex Boolean condition such as conjunctions and disjunctions of subconditions.
Definition 3.3 Problem statement. Given input trajectories T that satisfy Boolean selection condition set B, our goal is to find C = {c_{q} , ...,c _{C}}, maximizing the sumof intercluster distances, such thatwhere the intercluster distance
D_{inter} is defined by the distance between representative trajectories of two clusters. Note that our target problem does not require users to specify a fixed number of clusters.Definition 3.4 Intercluster distance. Given two clusters, c i and c j, the distance between them is defined bywhere rt(c_{i}) is the representative trajectory of a cluster c_{i} , i.e.,
and D(·,·) is the trajectory distance, which we will define in the next section.
IV. DISTANCE MEASURE
In this section, we discuss a distance measure between networkconstrained trajectories. To define a desirable distance measure, we have to consider the following two requirements:
Accuracy: The distance measure should accurately reflect the spatial proximity from the viewpoint of the road network.
Efficiency: The distance measure should be a metric for efficient computation of clustering, as we will elaborate in the details in Section V.
While distance measures satisfying the above requirements for trajectories without road network constraints have been actively studied [1215], the distance measure for networkconstrained trajectories has only recently been proposed in our preliminary work.
> A. Accuracy Requirement
To address the first requirement that spatial proximity must be accurately represented, our distance measure adopts the intuitive notion of the Hausdorff distance to quantify the distance of two trajectories
TR_{i} andTR_{j} as the longest road network path from each road segment to its closest road segment in the other trajectory. More formally,Definition 4.1 Trajectory distance [7]. Given two trajectoriesTR_{i} = [t^{i} _{1} , ...,t^{i}_{n} ] andTR_{j} = [t^{j} _{1} , ...,t^{j}_{m} ], the distance between them is defined as follows:where
(
TR_{i} ,TR_{j} ) is the oneway trajectory distance fromTR_{i} toTR_{j} , defined as:Meanwhile, the road segment distance d (·,·) in the previous definition is defined as follows:
Definition 4.2 Road segment distance [7].where
(
R_{i} ,R_{j} ) is a oneway road segment distance fromR_{i} toR_{j} defined as:where
 a,b ？ is the length of the shortest path between a and b.> B. Efficiency Requirement
To address the second requirement of efficient computation, we show that our distance measure is a metric.
Theorem 4.3 . Trajectory distance in Definition 4.1 is a metric.Proof. Given three trajectories
TR_{i} ,TR_{j} , andTR_{k} , Equation 3 and 4 satisfy the following properties.The symmetric property follows from 3, the reflexive property follows from 1 and 2, and the triangle in equality follows from 4.
As the symmetric and reflexive properties are obvious from the definition, we only present the proof of triangle in equality. First we consider εneighborhood road segments of a trajectory, denoted by
N_{ε} (TR ), which are road segments within εdistance from any road segment in TR.With this notion, the proof of triangle inequality is as follows. For convenience, we shortly denote
(
TR_{i} ,TR_{j} ) bySimilarly,
From Equation 7 and 8, the following inequality holds.
V. SEMCLUSTER ALGORITHM
In this section, we present a clustering algorithm called SemCluster, which supports Boolean + Clustering queries for semanticrich network trajectories. As NNCluster [7] is an online algorithm, it can be naively adopted for Boolean + Clustering queries, by evaluating Boolean selection first and performing NNCluster only on the qualifying results. However, such naive adoption incurs a high cost, especially when the selectivity of Boolean conditions is high.
To address this challenge, the basic concept of SemCluster involves minimizing the distance computations at runtime by prematerializing initial clusters into a summarized data structure. Based on this concept, we devise SemCluster with two key components？BuildDS and MergeDS. The BuildDS module materializes the summarized data structure from trajectories in an offline process. The MergeDS module performs Boolean + Clustering queries with the prematerialized summarized data structure.
As we exploit the same property, sharing common nearest neighbors, of NNCluster when we identify initial clusters, we first give a brief review on identifying initial clusters in Section VA. We then present the BuildDS module in Section VB and the MergeDS module in Section VC.
> A. Review: Identifying Initial Clusters
We briefly review our method of identifying initial clusters of trajectories. We group the trajectories into initial clusters that are likely to belong to the same cluster. To decide which trajectories to put into the same initial cluster, we need to identify a set of trajectories sharing the common nearest neighbor. This problem can be abstracted as a graph problem, by denoting each trajectory as a node connected to another node that is the nearest neighbor. Given this graph, the problem of building the initial clusters can be reduced to listing all maximal cliques, which is known as an NPcomplete problem [21].
Our goal is to approximate this process by relaxing the notion of initial clusters to those sharing many knearest neighbors. To effectively construct initial clusters, we compare the knearest neighbors of each trajectory with those of another trajectory. The two trajectories are then merged into a cluster if the ratio between the common nearest neighbors is greater than a userdefined threshold value. Interested readers may find the details on identifying initial clusters and setting the threshold value in [7].
> B. BuildDS
This section describes our method of prematerializing a summarized data structure from the trajectories in the database. The summarized data structure consists of two components: 1) initial cluster information and 2) the distance relationship between initial clusters.
Once initial clusters are identified, we store the information of the initial clusters, such as the representative trajectory of the initial cluster and the number of members. We compute the representative trajectory of each initial cluster using Equation 2. We choose a hash table to store the information of the initial clusters for fast retrieval. More specifically, for each member of an initial cluster, the keyvalue pair of (tid, (refid, s)) is inserted into a hash table, where tid is an identifier of a trajectory, refid is an identifier of a representative trajectory, and s is the number of trajectories that belong to the initial cluster represented by refid. For example, for the initial clusterc_{1} = {t_{3} , t_{5} , t_{6} }, we store keyvalue pairs such as (3, (5, 3)), (5, (5, 3)), and (6, (5, 3)) in the hash table, when t_{5} is the representative trajectory of c_{1}.
In addition, we also compute and store the distance relationship between all pairs of representative trajectories using Equation 1, which will reduce the number of distance computations at runtime. If there are n initial clusters, we build an n × n distance matrix DM, where DM_{ij} contains intercluster distances between c_{i} and c_{j}.
> C. MergeDS
SemCluster invokes MergeDS when users issue Boolean + Clustering queries. This section presents the method by which MergeDS exploits the above prematerialized summarized data structure to merge initial clusters into meaningful groups. MergeDS consists of two components: 1) finding initial clusters satisfying given Boolean conditions and 2) merging them and maximizing the sum of intercluster distances.
Given a Boolean + Clustering query, we first search the database for trajectories satisfying the Boolean conditions. Then, we look up the hash table of the summarized data structure to identify the initial clusters of trajectories satisfying the Boolean conditions. Each identifier of a trajectory is used as a key. Using a hash table, we can effectively identify the initial clusters of trajectories that satisfy the Boolean conditions. Moreover, we can accelerate this step by linking leaf nodes of an index structure, which is built for a selection attribute, with the bins of the hash table.
For the initial clusters identified in the previous step, we find the disjoint clusters maximizing the intercluster distance D_{inter}, which is known to be NPhard [22]. We develop a greedy approach, which continues merging a pair of clusters in ascending order of intercluster distance until a stop condition is satisfied. Note that the intercluster distance has been previously computed and stored in the distance matrix of the summarized data structure. Therefore, SemCluster does not require distance computations for all pairs of representatives but instead refers to the distance matrix for determining the order of merging clusters.
In each merging step, e.g., c_{i} and c_{j}, SemCluster elects a trajectory as a new representative trajectory of merged cluster c_{i}∪ c_{j}. Between two representatives of c_{i} and c_{j}, the representative from the larger cluster is promoted to be a new representative of c_{i}∪c_{j}. To this end, we store the number of member trajectories in each merging step.
As a stop condition, we use the DB index [23], which is a quality measure validating the clustering results. Given a set of clusters C = {
c_{1} , ...,c_{n} }, the DB index is defined as:where
d_{ij} is is the distance between cluster c_{i} and cluster c_{j}, and s_{i} is the variance of cluster c_{i}. The DB index approaches 0 as the clusters get more compact and wellseparated. We note that the computation of the DB index does not require any additional distance computation between trajectories, which incurs high costs. We store the value of the DB index at each merging step. Then, we report the clustering result that minimizes the DB index.
VI. EXPERIMENTAL EVALUATION
We report the experimental results conducted to verify the effectiveness and efficiency of SemCluster for semanticrich trajectories.
> A. Experimental Settings
For semanticrich trajectories, we collected geotagged images from the Flickr website using Flickr API. There were 1, 505, 768 geotagged images, which were taken between 1st January 2001 and 31st December 2009 in Australia. The collected images were first grouped according to their user ID, and then the images of each user were sorted in order of time taken. We then split the image sequence of each user into disjoint subsequences by discretizing the 24hour period into 12 range groups. As a result, 314, 000 semanticrich trajectories are stored in the database.
All experiments were conducted on a machine with a Pentium4 (3.2 GHz) CPU and 1 GBytes of main memory, running a version of the Linux operating system. We implemented our proposed algorithms using the C programming language. All trajectories were stored in an Oracle Berkeley DB (Version 4.6). The Btree index was built on the database for efficient retrieval of Boolean selections.
> B. Efficiency
In this section, we validate the efficiency of SemCluster in comparison with NNCluster [7] using the response time as a metric, by varying the selectivity of the Boolean conditions. Note that in contrast to NNCluster, SemCluster does not require
any distance computations at runtime. Therefore, the performance gap between NNCluster and SemCluster will increase, if an expensive semanticaware distance measure is used (e.g., computing the image similarity).
As shown in Fig. 3, SemCluster outperforms NNCluster in all experimental settings. The speedup of SemCluster increases as the selectivity grows. We achieve this speedup by precomputing the trajectory distance computation when building the summarized data structure, and only handling references to intercluster distance in the distance matrix at runtime.
> C. Quality of the Clustering Result
In this section, we validate the quality of the clustering results produced by SemCluster. More specifically, we visualize the clustering results for Boolean selections of timestamps or user ID.
A) Selection of Temporal Attributes : Scenarios for combining clustering with selection conditions of timestamps include finding clusters for different times of day or different seasons of the year. In this section, we demonstrate the clustering results for trajectories in the morning (S1), during lunchtime (S2) and in the evening (S3). Fig. 4a and 4b contrast results for S1 and S3？ Note the cluster identified for S3 (evening) shows night view spots near the coastline, while the results for S1 (morning) includes more locations in downtown.B) Selection of User ID : Selection conditions can be added to user ID to identify userspecific trends. We observe userspecific trends in the clustering result of semanticrich trajectories. We use Flickr user ID as a selection attribute.Fig. 5 illustrates trajectories satisfying the following Boolean condition:
S4. : user ID = 2905XXXX@N02
S5. : user ID = 2906XXXX@N04.
We also present the clustering result in Fig. 5 for each Boolean condition. For the purposes of presentation, we plot only the representative trajectory of clusters and we do not visualize small clusters with sizes less than 15% of the largest one. As
shown in Fig. 5, the clustering results for S4 and S5 show the userspecific trends in semanticrich trajectories.
VII. CONCLUSION
In this paper, we studied a method to integrate a Boolean selection with clustering for mining semanticrich trajectories. Specifically, we designed a new framework prematerializing a summarized data structure for the whole dataset, and proposed a greedy clustering algorithm building on this structure. We validated the effectiveness and the efficiency of our results using reallife location trajectories and semanticrich trajectories collected from the Flickr website. Our proposed framework enables us to capture temporal characteristics of clusters and outperforms the baseline algorithm proposed in our preliminary work. For future work, we are investigating the efficient maintenance of the summarized data structure, in case of frequent updates in the trajectory database.

[Fig. 1.] Geotagged image example on Flickr.

[Fig. 2.] Semantic gaps between location and activity.

[Fig. 3.] Response time vs. selectivity.

[Fig. 4.] Daily trend of semanticrich trajectories.

[Fig. 5.] Clustering results of semanticrich trajectory for Boolean conditions S4 and S5.