Tracking of 2D or 3D Irregular Movement by a Family of Unscented Kalman Filters

  • ABSTRACT

    This paper reports on the design of an object tracker that utilizes a family of unscented Kalman filters, one for each tracked object. This is a more efficient design than having one unscented Kalman filter for the family of all moving objects. The performance of the designed and implemented filter is demonstrated by using simulated movements, and also for object movements in 2D and 3D space.


  • KEYWORD

    Unscented Kalman filter, Detection-by-tracking, 2D/3D tracking

  • I. INTRODUCTION

    Object tracking is an important task for many applications, such as for robot navigation, surveillance, automotive safety, and video content indexing. Based on trajectories obtained through tracking, some advanced behaviour analysis can be applied. For instance, a pedestrian’s trajectory can be analysed to warn a driver if the trajectories of the vehicle and of the pedestrian are potentially intersecting.

    For multiple object tracking, tracking-by-detection methods are the most popular algorithms. A detector is used in each image frame to obtain candidate objects. Then, with a data-association procedure, all the candidates are matched to the existing trajectories as known up to the previous frame. Any unmatched candidate starts a new trajectory. Since there is no perfect detector that detects all objects without any false positives and false negatives, sometimes objects are missed (i.e., they appear in the image but are not detected), or background windows are incorrectly detected as being objects. Such false-positive or false-negative detections increase the difficulty of tracking.

    Occlusion by other objects or the background is one of the main reasons for detection to fail and it also increases the difficulty of tracking (e.g., identity switch). Some algorithms [1,2] propose tracking objects in the 2D image plane. The occlusion problem is handled either using part detectors and tracking detected body parts, or adopting instance-specific classifiers to improve performance of data assignment. However, tracking in the 2D image plane increases the ambiguity of data association. A tall person nearby, and a small person far away, for example, may appear very close to each other in the image, and, possibly, in some frames the tall person occludes the small person. But they are actually several meters away from each other. Thus, often, and also in this paper, stereo information is adopted to improve the tracking performance [3-5], and multiple pedestrians are tracked in 3D coordinates.

    Tracking objects with irregular movements in 3D space is a challenging task due to the totally unknown speed and direction. In this paper, the application of an unscented Kalman filter (UKF), which can also handle nonlinear (in fact, fully irregular) trajectories in 3D space, is demonstrated. For the original paper on UKF see [6]. Similar work is proposed in [5]. However, instead of modelling the motion of the vehicle and the pedestrians separately, we straightforwardly model the relative motion between them, and no ground plane is assumed, so that objects moving with 6 degrees of freedom can be tracked properly. Different types of models are tested and compared in both simulation and real sequences.

    II. RELATED WORK

    Multiple object tracking has attracted a great deal of attention recently in computer vision research. Today, an update of the review [7] from 2006 should also include work such as in [2-4,8-13].

    Kalman filters (KF) have been extensively adopted to deal with tracking tasks. A KF is a recursive Bayesian filter that first uses motion information to predict the possible position, and then fuses the observation (detection) with the predicted position. A linear KF is used for tracking (e.g., [7]) when movement is such that linear models may be used for approximation. Obviously, a linear model is not suitable for most cases. The extended Kalman filter (EKF) was designed [14] for handling a nonlinear model by linearizing functions using the Taylor expansion. For example, an EKF has been used for simultaneous localization and mapping [15], and for pedestrian tracking [16]. A particle filter was used to handle the task in [17]. Performance similar to an EKF is reported in [3].

    The UKF can handle a nonlinear model by using the unscented transform to estimate the first and second order moments of sigma points, which represent the distribution of a predicted state and predicted observations, and it appears that the UKF does this better than the EKF [18]. Thus, in this paper, a UKF is used for tracking multiple, irregularly moving objects in 3D space, which is a highly nonlinear problem.

    III. UNSCENTED KALMAN FILTER

    The unscented transform (UT) is the core component that enables the UKF to handle nonlinear models. Let L be the dimensionality of the system state xt-1|t-1 at time t-1. If the system noise (process noise Q and measurement noise R) is not additive noise, the state is augmented before UT. In our case, random acceleration is introduced as process noise; thus, the state, augmented with a process noise vector, is denoted by

    image

    and called the augmented vector for short. The dimension of the augmented vector depends on the process model, which is illustrated in Section IV. Let xt|t-1 denote the predicted state at time t when passing xt-1|t-1 through process function f. Let yt|t-1 be the predicted observation at time t when passing xt|t-1 through observation function h.

    The UT works by sampling 2L+1 sigma vectors χi(a) in the augmented state space (following [6]), forming a matrix χ(a). The covariance matrix in augmented state space is denoted by P(a). Let

    image

    be the state covariance matrix (i.e., describing dependencies between components of a state χ ). Formally,

    image

    where λ is a positive real, used as a scaling parameter. These sigma vectors can be passed through a nonlinear function (e.g., f or h) one by one, thus defining transformed (i.e., new) sigma vectors such as

    image

    Means xt|t-1 or yt|t-1 and covariance matrices

    image

    are obtained as follows; take h for example:

    image

    with constant weights Wi(.). Details are given in [6].
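The sampling and re-estimation steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses the basic form of the unscented transform with a single scaling parameter `lam` (the λ above) and the same weights for mean and covariance, whereas [6] describes a scaled variant with further parameters. The function names `cholesky`, `sigma_points`, and `unscented_transform` are hypothetical.

```python
import math

def cholesky(P):
    """Lower-triangular S with S * S^T = P (P symmetric positive definite)."""
    n = len(P)
    S = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = P[r][c] - sum(S[r][k] * S[c][k] for k in range(c))
            S[r][c] = math.sqrt(acc) if r == c else acc / S[c][c]
    return S

def sigma_points(mean, P, lam):
    """Return 2L+1 sigma vectors and their weights (basic, unscaled form)."""
    L = len(mean)
    S = cholesky([[(L + lam) * P[r][c] for c in range(L)] for r in range(L)])
    points = [list(mean)]
    for k in range(L):
        col = [S[r][k] for r in range(L)]  # k-th column of the matrix square root
        points.append([m + s for m, s in zip(mean, col)])
        points.append([m - s for m, s in zip(mean, col)])
    weights = [lam / (L + lam)] + [0.5 / (L + lam)] * (2 * L)
    return points, weights

def unscented_transform(mean, P, func, lam=1.0):
    """Propagate (mean, P) through func; return transformed mean and covariance."""
    points, weights = sigma_points(mean, P, lam)
    mapped = [func(p) for p in points]  # sigma vectors are passed through one by one
    M = len(mapped[0])
    new_mean = [sum(w * y[i] for w, y in zip(weights, mapped)) for i in range(M)]
    new_cov = [[sum(w * (y[r] - new_mean[r]) * (y[c] - new_mean[c])
                    for w, y in zip(weights, mapped))
                for c in range(M)] for r in range(M)]
    return new_mean, new_cov
```

For a linear function the transform reproduces the exact mean and covariance, which gives a simple sanity check before plugging in a nonlinear f or h.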

    The UKF is illustrated as follows. At first we initialize the state x=x0 and state covariances P(xx) = P0(xx). For the augmented vectors, let

    image

    where Q denotes the process-noise covariance matrix. Details about Q are given in Section IV. For t ∈ (1, …, ∞), we calculate sigma vectors as follows:

    image
    image

    The process update is as follows:

    image

    We update the sigma vectors using

    image

    and update the measurement covariance matrix as follows:

    image

    where R is the assumed measurement noise covariance, depending on the observation model selected. Details are given in Section IV.

    Altogether, the UKF is defined by

    image
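For orientation, the measurement update of a UKF is commonly written in the form below (see [6]). This is the textbook formulation and not a transcription of the equation above. The symbols K_t (Kalman gain), y_t (the actual observation at time t), and P^{(xy)} (the cross-covariance between predicted state and predicted observation) are notational assumptions; P^{(yy)} is taken to be the measurement covariance that already includes R.

```latex
\begin{aligned}
K_t &= P^{(xy)}_{t|t-1}\,\bigl(P^{(yy)}_{t|t-1}\bigr)^{-1},\\
x_{t|t} &= x_{t|t-1} + K_t\,\bigl(y_t - y_{t|t-1}\bigr),\\
P^{(xx)}_{t|t} &= P^{(xx)}_{t|t-1} - K_t\,P^{(yy)}_{t|t-1}\,K_t^{\top}.
\end{aligned}
```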

    IV. MULTIPLE OBJECT TRACKING

    Following tracking-by-detection methods, which are popular for solving multiple-object tracking tasks, a detector is applied in each frame to generate object candidates. One UKF is adopted for tracking each object separately; thus, a group of detected pedestrians defines a family of UKFs to be processed simultaneously. The predicted state of a UKF is used for data association; when an observation (of the tracked object) is available in the current frame, we update the predicted state by using the corresponding UKF.

      >  A. Detection

    Detection-by-tracking methods rely on evaluating rectangular regions of interest, and we call them object boxes if positively identified as containing an object of interest. For pedestrian tracking, we adopt the popular histogram of oriented gradients (HOG) feature method and a support vector machine (SVM) classifier, originally introduced in [19]. HOG features describe the human profile by an oriented gradient histogram. An SVM classifier is able to handle high-dimensional and nonlinear features (such as HOG features). It projects sample features into a high-dimensional space, and then finds a hyperplane to separate two classes. Instead of using a sliding window, regions of interest (i.e., inputs to the classifier) are selected by analysing calculated stereo information (depth and disparity maps), as proposed in [20].

    Fig. 1 shows several detection results for a pedestrian sequence. Dots (cyan) denote the centres of boxes that are recognized as pedestrians, and the red rectangles denote the final detection results. As can be seen in the results, the object boxes may contain background, be shifted relative to the object, or miss pedestrians altogether.

    For the detection of Drosophila larvae (an example of 2D movement), thresholds and connected components are adopted to obtain one object box for each larva. Several larva detection results are shown in Fig. 2. As the scene is controlled, the detection results are more reliable than those for the pedestrian sequence. However, no depth information is available here.

      >  B. UKF-based Object Tracking

    As there is an unknown number of objects in a scene, the state dimensionality would expand significantly if we tracked all pedestrians in one UKF; in that case, the speed of tracking would drop dramatically whenever the scene is crowded with many detected objects. Thus, we use one UKF for each detected object.
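The growth in state dimensionality can be made concrete with a small count. The sketch below assumes 9 augmented dimensions per object (the 6-dimensional state plus a 3-dimensional acceleration noise of the 3D models described below) and compares one joint filter with a family of per-object filters; the function name `sigma_workload` and the chosen measures are illustrative assumptions, not taken from the paper.

```python
def sigma_workload(n_objects, dims_per_object=9):
    """Compare one joint UKF with one UKF per object.

    Returns two dicts with the total number of sigma-vector components
    ((2L+1) vectors of length L) and the number of covariance entries (L * L).
    """
    L_joint = n_objects * dims_per_object
    joint = {
        "sigma_components": (2 * L_joint + 1) * L_joint,
        "covariance_entries": L_joint ** 2,
    }
    per_object = {
        "sigma_components": n_objects * (2 * dims_per_object + 1) * dims_per_object,
        "covariance_entries": n_objects * dims_per_object ** 2,
    }
    return joint, per_object
```

Under these assumptions, 10 objects give 16290 sigma-vector components and 8100 covariance entries for the joint filter, against 1710 and 810 for ten separate filters; the matrix square root needed for the sigma vectors grows even faster with L.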

    Choosing a proper model is important. In this subsection we offer three models for possible selection: 3DVT means that 3D position (world coordinates) with velocity is observed, 3DT means that 3D position without velocity is observed, and 2DT means that 2D position (image coordinates) without velocity is observed. These models are compared in Section V.

    1) The Two 3D Models

    In the 3DVT model, the object is tracked in 3D world coordinates. Its 3D position (x, y, z) is the first part of the state. We also include the velocity (vx, vy, vz). Thus, a state x=(x, y, z, vx, vy, vz)T is 6-dimensional.

    a) Process model: We assume constant velocity between adjacent frames, with Gaussian distributed noise acceleration na ~ N(0, Σna). The diagonal elements in Σna are set to be equal and denoted by

    image

    Thus,

    image

    where Δt is the time interval between subsequent frames.
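As a rough illustration of a constant-velocity process function driven by random acceleration, consider the sketch below. It does not reproduce the authors' equation; it assumes the common discretization in which one sample of the acceleration noise na acts over a full frame interval, and the function name `process_3d` is hypothetical.

```python
def process_3d(state, accel, dt):
    """Constant-velocity step for a state (x, y, z, vx, vy, vz).

    accel is one sample (ax, ay, az) of the acceleration noise n_a;
    dt is the time interval between subsequent frames.
    """
    pos, vel = state[:3], state[3:]
    new_pos = [p + v * dt + 0.5 * a * dt * dt for p, v, a in zip(pos, vel, accel)]
    new_vel = [v + a * dt for v, a in zip(vel, accel)]
    return new_pos + new_vel
```

With a 6-dimensional state and a 3-dimensional acceleration noise, the augmented vector would have L = 9 components, that is, 2L+1 = 19 sigma vectors per tracked object.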

    b) Observation model: An observation consists of the position (i0, j0) (say, the centroid of the detected object box in the left camera), disparity d of the detected object, and velocity (vox, voy, voz) in 3D coordinates. The usual pinhole camera projection model is used to map 3D points onto the image plane,

    image

    where f denotes focal length, and b denotes the length of the baseline between two rectified stereo cameras. In this case,

    image

    For the disparity d we select the mode in the disparity map in a fixed (e.g., 20 × 20) neighbourhood around the centroid (i0, j0) of the detected object box. 3D scene flow (vox, voy, voz) can be obtained by combining optic flow and stereo information [21].

    As it is difficult to obtain high-quality scene flow as required for 3DVT, 3DT simplifies the 3DVT model by excluding the scene flow from the observation, and has the same process model as 3DVT. In this case,

    image
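The two 3D observation functions can be sketched as follows. The sketch assumes camera coordinates with the z-axis along the optical axis, image coordinates measured relative to the principal point, and a rectified stereo pair, so that i0 = f·x/z, j0 = f·y/z, and d = f·b/z; the axis convention and the function names `observe_3dvt` and `observe_3dt` are assumptions.

```python
def observe_3dvt(state, f, b):
    """Map a state (x, y, z, vx, vy, vz) to (i0, j0, d, vox, voy, voz).

    f is the focal length (in pixels) and b the stereo baseline length.
    """
    x, y, z, vx, vy, vz = state
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    i0 = f * x / z  # image coordinate along the x-axis
    j0 = f * y / z  # image coordinate along the y-axis
    d = f * b / z   # disparity for a rectified stereo pair
    return [i0, j0, d, vx, vy, vz]

def observe_3dt(state, f, b):
    """3DT variant: same projection, but without the velocity components."""
    return observe_3dvt(state, f, b)[:3]
```

Because the depth z appears in the denominator, both functions are nonlinear in the state, which is where the unscented transform is needed.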

    2) The 2D Model

    If only monocular recording is available, the object is tracked in the 2D image plane only. The state x=(i,j,vi,vj)T consists of position (i, j) and velocity (vi, vj).

    a) Process model: The process model is the same as for the 3D models. We assume a constant velocity between subsequent frames with a Gaussian noise distribution for acceleration na:

    image

    b) Observation model: An observation consists of the central position (i0, j0) of an object box only, i0 = i and j0 = j, resulting in R = diag(σ2nmp, σ2nmp) for this case.

      >  C. Data Association

    As each object is tracked independently, data association by matching candidates to existing trajectories becomes important. If no match is found then we decide to initialize a new tracker.

    Since object movements are continuous, the estimated velocity in the UKF can be used as a cue to localize the search area in order to find the matching object. For each trajectory, the possible location (i.e., (xp, yp, zp) for 3D, and (ip, jp) for 2D) of the object in the current frame is predicted by the process model used in the UKF. This location is used as a reference for searching potentially matching candidates in the current frame. Currently we simply match candidates based on the shortest Euclidean distance and a given threshold T.

    One candidate might be matched with several trajectories if the Euclidean distance is below T. Trajectories compete for the candidates, and in the end, the closest one wins. If a candidate is not matched to any trajectory, a new tracker is initialized. If a trajectory does not win any of the candidates, the tracker is propagated with the given prediction, and the new state is the predicted state, without being updated by an observation (because it is not available).
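One way to realize this competition is a greedy nearest-neighbour assignment, sketched below. It is only one possible reading of the rule described above: all trajectory-candidate pairs closer than T are collected, and the closest pairs are served first. The function name `associate` and the data layout are hypothetical.

```python
import math

def associate(predictions, candidates, T):
    """Greedy nearest-neighbour matching of candidates to predicted positions.

    predictions: dict track_id -> predicted position (tuple of 2 or 3 floats)
    candidates:  list of detected positions in the current frame
    Returns (matches, unmatched_tracks, unmatched_candidates).
    """
    pairs = []
    for tid, pred in predictions.items():
        for c, cand in enumerate(candidates):
            dist = math.dist(pred, cand)
            if dist < T:
                pairs.append((dist, tid, c))
    pairs.sort(key=lambda p: p[0])  # closest pairs are served first

    matches, used = {}, set()
    for dist, tid, c in pairs:
        if tid not in matches and c not in used:
            matches[tid] = c
            used.add(c)

    # Tracks without a candidate are propagated with their prediction only.
    unmatched_tracks = [tid for tid in predictions if tid not in matches]
    # Candidates without a track start a new tracker.
    unmatched_candidates = [c for c in range(len(candidates)) if c not in used]
    return matches, unmatched_tracks, unmatched_candidates
```

The same function covers the 2D and 3D cases, since only the length of the position tuples changes.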

    No object appearance description is used here for assigning an object to a trajectory. In general, the inclusion of appearance representation (e.g., a colour histogram, or an instance-specific shape model) improves the performance. However, this is out of the scope of this paper, where we are focusing on the combination of different data association methods.

    V. EXPERIMENTS

    In this section, first, our three models (3DVT, 3DT, and 2DT) are tested in a simulated environment with different parameter sets. Second, our multiple-object tracking method is tested on real video sequences where (3D example) pedestrians are walking in inner-city scenes, or (2D example) larvae are moving on a flat culture dish.

      >  A. Simulated Tracking

    The three models defined in Section IV are tested in a simulation environment in OpenGL (SGI, Fremont, CA, USA). A cube is moving on a circular path around a 3D point with constant speed, as shown in Figs. 3 and 4. Acceleration noise na with different covariance (e.g., σ2na = 0.0001, 0.01, 1), and measurement noise nm with different covariance (e.g., σ2nmp = 10, 50, 100, σ2nmv = 50, 100, 150), are used to test and compare the three models’ performance. The simulation environment is different for 2D and 3D models, where for 2D, positions are integral pixel coordinates in the image plane, but for 3D, position coordinates are reals. The radius of the circle in the 3D models is 10, while in the 2D model it is 50. In both environments, measurements are degraded by noise before being sent to the UKF.
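A comparable test input can be generated without OpenGL, as in the sketch below. The plane of the circle (XZ), its centre, the angular step per frame, and the choice to add Gaussian noise directly to the 3D position are all assumptions made for illustration; the function name `simulate_circle` is hypothetical.

```python
import math
import random

def simulate_circle(n_frames, radius=10.0, centre=(0.0, 0.0, 20.0),
                    step=math.radians(2.0), var_pos=10.0, seed=0):
    """Return (true_positions, noisy_measurements) for a circular 3D path."""
    rng = random.Random(seed)
    sigma = math.sqrt(var_pos)  # standard deviation from the noise variance
    truth, measured = [], []
    for t in range(n_frames):
        angle = t * step  # constant angular speed
        p = (centre[0] + radius * math.cos(angle),
             centre[1],
             centre[2] + radius * math.sin(angle))
        truth.append(p)
        # Degrade the measurement by noise before it is sent to the filter.
        measured.append(tuple(c + rng.gauss(0.0, sigma) for c in p))
    return truth, measured
```

Fixing the seed makes runs repeatable, so that different noise settings can be compared on the same true path.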

    Fig. 3 demonstrates the effect of σ2na, having fixed σ2nmp and σ2nmv. Experiments show that larger σ2na values result in more unstable trajectories. A large σ2na means that the process model produces a predicted state that is fluctuating with large magnitudes. Results show that σ2na = 0.0001 is a reasonable choice for 3D models. For the 2D model, a smaller σ2na yields smooth estimation, but the shifts are significant.

    A larger σ2na value (e.g., σ2na = 1 in the 2D experiment) produces estimates that are closer to the true positions, but the fluctuations are significant. In general, 3DT and 3DVT converge better than 2DT, while 3DT and 3DVT show a similar performance. 3D models use stereo information rather than just a single image as for the 2D model, which indicates that stereo information can help to improve the tracking performance. As the measured 3D position is noisy, the measured velocity is even noisier; this appears to be the main reason for the observation that the inclusion of velocity does not improve the performance.

    Fig. 4 shows results for our models for different covariance values σ2nmp and σ2nmv of measurement noise. Significantly increasing measurement noise (i.e., higher uncertainty of observations) reduces the performance only slightly. This demonstrates that, to some degree, the UKF is a robust tracker, which is not vulnerable to detection uncertainties. As before, 3DT and 3DVT converge better than 2DT, while 3DT and 3DVT show a similar performance.

      >  B. Multiple Object Tracking in Real Data

    In this section we report on the performance of UKF-supported tracking for multiple larvae using the 2DT model, and for multiple pedestrians in traffic scenes using the 3DT model. The larvae and pedestrian sequences are recorded at 30 and 15 frames per second, respectively.

    Results for larvae tracking are shown in Fig. 5. As the velocity in the model is initialized by (0,0), the UKF estimation is “slower” than the real speed of the larvae in the first 30 frames. The speed of convergence can be improved by increasing σ2na, but it should be noted that the larger the σ2na value is, the larger the magnitude of fluctuation. The estimated trajectories follow the moving larvae effectively, mainly because all of the larvae are properly detected in all of the frames. However, such a complete detection cannot be expected for pedestrian sequences. Next, we test the UKF for such “noisy” detection results as pedestrian sequences.

    The results for pedestrian tracking are shown in Fig. 6. Objects are missing or shifting from time to time due to the cluttered background (e.g., the car in the traffic scene detected as a pedestrian), illumination variations leaving some pedestrians undetected, or internal variations between objects (i.e., unstable detections). Our experiments verified that UKF predictions are able to follow irregularly moving pedestrians when detection fails for a few frames, and can even correct unstable detections.

    The second frame in Fig. 6 shows that the undetected pedestrian is predicted correctly in the white object box and is successfully matched to a detected position in the third frame. The last frame in Fig. 6 demonstrates that displaced detections are corrected by the UKF. Using only the defined distance rule for data assignment appears to be insufficient, especially for the given detection results. A small threshold may lead to a mismatch (i.e., the detection fails to satisfy the rule), and a large threshold may lead to an identity switch (i.e., a pedestrian is matched to another pedestrian).

    VI. CONCLUSIONS

    Assigning one UKF to each detected (moving) object simplifies the design and implementation of UKF prediction of 2D or 3D motion. Experiments demonstrate the robustness of the chosen approach. This tracker only generates short-term tracks when detection is not reliable; long-term tracking should be possible by also introducing dynamic programming.

    For evaluating the performance in real-world (either 2D or 3D) applications, more extensive tests need to be undertaken, especially for the design and evaluation of quantitative performance measures. For example, the measures discussed in [22] for evaluating visual odometry techniques might also be of relevance for the tracking case.

  • 1. Wu B, Nevatia R 2007 “Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors” [International Journal of Computer Vision] Vol.75 P.247-266
  • 2. Breitenstein M. D, Reichlin F, Leibe B, Koller-Meier E, van Gool L 2011 “Online multiperson tracking-by-detection from a single, uncalibrated camera” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.33 P.1820-1833
  • 3. Ess A, Schindler K, Leibe B, van Gool L 2010 “Object detection and tracking for autonomous navigation in dynamic environments” [International Journal of Robotics Research] Vol.29 P.1707-1725
  • 4. Leibe B, Schindler K, Cornelis N, van Gool L 2008 “Coupled object detection and tracking from static cameras and moving vehicles” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.30 P.1683-1698
  • 5. Meuter M, Iurgel U, Park S. B, Kummert A 2008 “The unscented Kalman filter for pedestrian tracking from a moving host” [Proceedings of IEEE Intelligent Vehicles Symposium] P.37-42
  • 6. Wan E. A, van der Merwe R 2000 “The unscented Kalman filter for nonlinear estimation” [Proceedings of IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium] P.153-158
  • 7. Yilmaz A, Javed O, Shah M 2006 “Object tracking: a survey” [ACM Computing Surveys] Vol.38
  • 8. Ess A, Leibe B, Schindler K, van Gool L 2009 “Robust multiperson tracking from a mobile platform” [IEEE Transactions on Pattern Analysis and Machine Intelligence] Vol.31 P.1831-1846
  • 9. Leven W. F, Lanterman A. D 2009 “Unscented Kalman filters for multiple target tracking with symmetric measurement equations” [IEEE Transactions on Automatic Control] Vol.54 P.370-375
  • 10. Mitzel D, Horbert E, Ess A, Leibe B 2010 “Multi-person tracking with sparse detection and continuous segmentation” Computer Vision ECCV 2010 [Lecture Notes in Computer Science] Vol.6311 P.397-410
  • 11. Mitzel D, Sudowe P, Leibe B 2011 “Real-time multi-person tracking with time-constrained detection” [Proceedings of the British Machine Vision Conference] P.104.1-104.11
  • 12. Mitzel D, Leibe B 2011 “Real-time multi-person tracking with detector assisted structure propagation” [Proceedings of Computational Methods for the Innovative Design of Electrical Devices] P.974-981
  • 13. Shaikh M. M, Wook B, Lee C, Kim T, Lee T, Kim K, Cho D 2011 “Mobile robot vision tracking system using unscented Kalman filter” [Proceedings of IEEE/SICE International Symposium on System Integration] P.1214-1219
  • 14. Welch G, Bishop G 1995 “An introduction to the Kalman filter”
  • 15. Huang S, Dissanayake G 2006 “Convergence analysis for extended Kalman filter based SLAM” [Proceedings of IEEE International Conference on Robotics and Automation] P.412-417
  • 16. Yazdi H. S, Hosseini S. E 2008 “Pedestrian tracking using single camera with new extended Kalman filter” [International Journal of Intelligent Computing and Cybernetics] Vol.1 P.379-397
  • 17. Cai Y, de Freitas N, Little J. J 2006 “Robust visual tracking for multiple targets” [Proceedings of the 9th European Conference on Computer Vision] P.107-118
  • 18. Hartmann G 2012 “Unscented Kalman filter sensor fusion for monocular camera localization”
  • 19. Dalal N, Triggs B 2005 “Histograms of oriented gradients for human detection” [Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition] P.886-893
  • 20. Gavrila D. M, Munder S 2007 “Multi-cue pedestrian detection and tracking from a moving vehicle” [International Journal of Computer Vision] Vol.73 P.41-59
  • 21. Wedel A, Brox T, Vaudrey T, Rabe C, Franke U, Cremers D 2011 “Stereoscopic scene flow computation for 3D motion understanding” [International Journal of Computer Vision] Vol.95 P.29-51
  • 22. Jiang R, Klette R, Wang S 2011 “Statistical modeling of long-range drift in visual odometry” [Computer Vision ACCV 2010 Workshops, Lecture Notes in Computer Science] P.214-224
  • [Fig. 1.] The depth map on top uses a colour code for calculated distances; depth values are only shown at pixels where the mode filter accepts the given value. The lower images show detected (coloured) object boxes.
  • [Fig. 2.] Larvae detection results shown by (cyan) object boxes.
  • [Fig. 3.] Simulation results for variations in the variance of acceleration noise. From left to right, σ2na = 0.0001, 0.01, or 1, with fixed values σ2nmp = 50 and σ2nmv = 100. From top to bottom, the tracking model is 3DVT, 3DT and 2DT, respectively.
  • [Fig. 4.] Simulation results for variable variance of measurement noise. From left to right, σ2nmp = 10, 50, or 100, σ2nmv = 50, 100, 150, respectively, with fixed σ2na = 0.0001 for 3D models, σ2na = 1 for 2D models. From top to bottom, the tracking models are 3DVT, 3DT and 2DT, respectively.
  • [Fig. 5.] 2D Tracking results of larva sequences. From top to bottom: tracking results in Frames 26, 46, and 166 of one sequence. The red lines show the detected track, and the white lines show the unscented Kalman filter-predicted track. The blue lines represent estimated trajectories. The left column is the original intensity image overlaid with the estimated trajectories.
  • [Fig. 6.] (a) Object boxes (red), unscented Kalman filter (UKF) predictions (white), and UKF estimations (blue) are overlaid on the original intensity image. The black box is an invalid detection excluded by a disparity check. (b) coloured blocks denote the object boxes (red), UKF predictions (white), and UKF estimations (blue) in the XZ-plane in real-world coordinates.