Novel Parallel Approach for SIFT Algorithm Implementation
 Author: Su Le Tran, Lee JongSoo
 Organization: Su Le Tran; Lee JongSoo
 Publish: Journal of information and communication convergence engineering Volume 11, Issue4, p298~306, 31 Dec 2013

ABSTRACT
The scale invariant feature transform (SIFT) is an effective algorithm used in object recognition, panorama stitching, and image matching. However, due to its complexity, realtime processing is difficult to achieve with current software approaches. The increasing availability of parallel computers makes parallelizing these tasks an attractive approach. This paper proposes a novel parallel approach for SIFT algorithm implementation using a block filtering technique in a Gaussian convolution process on the SIMD Pixel Processor. This implementation fully exposes the available parallelism of the SIFT algorithm process and exploits the processing and input/output capabilities of the processor, which results in a system that can perform realtime image and video compression. We apply this implementation to images and measure the effectiveness of such an approach. Experimental simulation results indicate that the proposed method is capable of realtime applications, and the result of our parallel approach is outstanding in terms of the processing performance.

KEYWORD
Data parallel architecture , Parallel processing , SIFT , SIMD

I. INTRODUCTION
Nowadays, image matching is used for solving many problems in the field of computer vision, such as object or scene recognition, threedimensional (3D) structure from multiple images, stereo correspondence, and motion tracking. In recent years, an approach has been proposed to generate a set of salient image features. This approach is called the scale invariant feature transform (SIFT); it transforms image data into scaleinvariant coordinates with respect to local features. SIFT is a complex algorithm, and when used in multimedia applications, it is necessary to find a scheme to implement the algorithm in realtime.
In the previous approaches, the SIFT algorithm was implemented in both software and hardware. Due to the complexity of the SIFT algorithm, the software implementation cannot be applied in realtime applications. Therefore, recent research has mainly focused on the implementation and acceleration of the SIFT algorithm in hardware. Some approaches use the graphic processing unit (GPU) for the SIFT algorithm implementation. Sinha et al. [1] applied SIFT on a GPU; however, because of the hardware and OpenGL limitations, some data transfers were required between the GPU and the central processing unit (CPU), resulting in an increase in the data transmission time. As a result, their implementation can process a 640 × 480 video at the speed of 10 frames per second (FPS). Another approach is the SIFT implementation on multicore systems proposed by Zhang et al. [2]. This approach takes advantage of the computing power of multicore processors like dualsocket and quadcore systems. In their experiments, Zhang et al. [2] observed that the processing speed achieves an average of 45 FPS for a 640 × 480 video. However, more finegrainlevel parallelization must be conducted to reduce the load imbalance and achieve maximum performance with this approach in future largescale multicore systems.
The single instruction multiple data (SIMD) pixel processor array, which has not previously been proposed for SIFT algorithm implementation purposes, is used in our implementation to address the major drawbacks of previous research. To adapt the SIFT algorithm implementation to the data parallel architecture, we propose a parallelization SIFT algorithm approach called the PSIFT algorithm. In the PSIFT implementation, block filtering [3] is used for calculating the Gaussian filter to improve the performance. The SIMD pixel processor can exploit the benefits of integrating optoelectronic devices into a highperformance digital processing system and fully exposes the available parallelism calculation of the PSIFT algorithm. We have evaluated the impact of our parallel approach based on the processing performance. In the experiment detail section, we report that the processing speed is 326 FPS in the case of a 640 × 480 video. Compared with the existing hardwarebased SIFT implementations [1,2,4], our approach achieves a speed that is around 7 to 30 times faster. Further, because of the finegrain characteristic of the SIMD pixel processor, we show that the load imbalance of our implementation is reduced.
The rest of this paper is organized as follows: In Section II, we briefly describe the related research. In Section III, a novel PSIFT that uses block filtering is proposed. Section IV discusses the SIMD pixel processor system and the lowmemory SIMD architecture. Section V describes in detail how the proposed algorithm has been adapted to fully exploit the unique capability of the SIMD processor system. Strong proposals for the performance of the system and the experimental results are presented in Section VI. Section VII concludes this paper with a presentation of the simulation results.
II. RELATED WORK
SIFT is perhaps the most popular invariant local feature detector at present and has been applied successfully on many occasions, such as object recognition, image match and registration, structure from motion and 3D reconstruction, visual simultaneous localization and mapping and image panorama.
Nevertheless, the complexity of the SIFT algorithm results in a very high time consumption. Because of the high popularity of SIFT, it is no surprise that several variants and extensions of SIFT have been proposed. For example, Ke and Sukthankar [5] proposed the PCASIFT that applies principal component analysis (PCA) to the normalized gradient patch. The Gradient location and orientation histogram (GLOH) [6] changes the SIFT’s location grid and uses PCA to reduce the size of SIFT. The primary focus of these extensions is to gain improved performance.
An approach called HarrisSIFT [7] removes many indistinctive candidates before generating descriptors, thus saving excessive computations. HarrisSIFT reduces both the feature number and the database size, or in other words, cuts down the feature matching time. However, as provided, the results of the HarrisSIFT method are not sufficiently fast to be implemented in real time.
In the fast approximated SIFT, Grabner et al. [8] proposed a modified SIFT method for recognition purposes. The authors sped up the SIFT computation by using approximations (mainly employing integral images) in both the difference of Gaussian (DoG) detector and the SIFT descriptors. Their method can reduce the SIFT computation time by a factor of eight as compared to the binaries SIFT proposed by Lowe [9]. However, the loss in matching performance is a major drawback of this approach.
In the above approaches, the authors attempted to improve the performance of the SIFT algorithm by the modification of SIFT. However, for realtime implementation purposes, some hardwarebased approaches have been proposed in recent years. Sinha et al. [1] presented GPUSIFT, a GPUbased implementation for the SIFT feature extraction algorithm. The GPUSIFT has been implemented using the OpenGL graphics library and the Cg shading language and tested on modern graphics hardware platforms. Their implementation can process a 640 × 480 video at 10 FPS. The weakness of this approach can be attributed to the hardware and OpenGL limitations. Some data transfers are required between the GPU and the CPU and thus take time for data transition. Another approach that also used a GPU was developed by Heymann et al. [4]. In this research, the authors showed how a feature extraction algorithm can be adapted to make use of modern graphics hardware. Their experiment used NVIDIA Quadro FX 3400 GPU with 256 MB video RAM. As a result, the SIFT algorithm can be applied to image sequences with 640 × 480 pixels at 20 FPS. Further, recently, with the development of embedded technology, a new implementation of SIFT on a multicore processor has been proposed by Zhang et al. [2]. In their approach, the SIFT algorithm is implemented on multicore processors like dual socket and quadcore systems. The obtained result is an average of 45 FPS for a 640 × 480 video.
III. PSIFT ALGORITHM USING BLOCK FILTERING
The SIFT algorithm [10] combines a scale invariant region detector and a descriptor based on the gradient distribution in the detected regions. The descriptor is presented by a 3D histogram of gradient locations and orientations. These descriptors (local features) are very distinctive and invariant for image scaling or rotation. The flowchart of the SIFT algorithm is shown in Fig. 1. In this flowchart, there are four major steps [9]: DoG space building, keypoint detection and localization, orientation assignment, and the keypoint descriptor. DoG space building: scalespace extrema detection stage of the filtering attempts to equate different viewpoints that are the projections of a specific 3D object point. They are then examined in further detail. The identification of candidate locations can be efficiently achieved using the continuous “scale space” function, which is based on the Gaussian function. The scale space is defined by the following function:
SIFT is one such technique that locates the scalespace extrema from the Gaussian image differences
D(x, y, σ) given byIn Eqs. (1) and (2), * denotes the convolution operator,
G(x, y, σ) represents a variablescale Gaussian kernel,I(x, y) refers to the input image, andk is used for increasing or decreasing the scale.The construction of the DoG space is illustrated in Fig. 2. The initial image is repeatedly convolved with Gaussians to generate the set of scalespace images. The adjacent Gaussian images are subtracted to produce the DoG images as shown on the right of the figure.
The Gaussian image is downsampled by a factor of 2 after each octave, and the process repeated. The convolved images are grouped by octave, and the number of DoG per octave is fixed. Keypoint detection and localization: To detect the local maxima or minima of
D(x, y, σ) , each point is compared with its 8 neighbors on the same scale, and its 9 neighbors on the vertical and horizontal scales. If this value is larger than all 26 neighbors, it is the maximum; if less, it is the minimum.The keypoint localization stage attempts to eliminate these points from the list of keypoints that have low contrast (and are therefore sensitive to noise) or have poorly localized edges. This is achieved by calculating the Laplacian value for each keypoint found in the previous stage. The location of the extrema, z, can be expressed as follows:
If the function value at z is below a threshold value, then this point is discarded. This removes the extrema that have a low contrast.
Edge extrema that have large principle curvatures but small curvatures in the perpendicular direction are eliminated. Using the 2 × 2 Hessian matrix
H computed at the location and scale of the keypoint, we can compute principle curvatures that are proportional to the eigenvalue ofH .The elimination criteria can be constructed as follows:
where
r =α/β , andα denotes the eigenvalue with a relatively large magnitude andβ represents the eigenvalue with a smaller magnitude.If this inequality is true, the keypoint is rejected.
The orientation assignment stage in Fig. 1 aims to assign a consistent orientation to the keypoints based on the local image properties. The keypoint descriptor is represented relative to this orientation because it is invariant to the rotational movements of the keypoints. The approach taken to find an orientation uses the keypoint scale to select the Gaussian smoothed image
L . The gradient magnitudem(x, y) and the orientationθ(x, y) can be computed as follows:Then, we form an orientation histogram from the gradient orientations of the sample points, and locate the highest peak in the histogram. We use this peak and any other local peak within 80% of the height of this peak to create a keypoint with the orientation
θ(x,y) . Some points will be assigned multiple orientations. Then, we fit a parabola to the three histogram values closest to each peak to interpolate the peak’s position.The local gradient data, used above, are also used for creating keypoint descriptors. The gradient information is rotated to line up with the orientation of the keypoint and then weighted by a Gaussian kernel with a variance of the keypoint scale multiplied by 1.5. These data are then used for creating a set of histograms over a window centered on the keypoint.
Keypoint descriptors typically use a set of 16 histograms, which are aligned in a 4 × 4 grid, each with 8 orientation bins, one for each of the main compass directions and one for each of the midpoints of these directions. These result in a feature vector containing 128 elements.
These resulting vectors are known as SIFT keys and are used in the nearestneighbors approach to identify possible objects in an image. Collections of keys that agree on a possible model are identified. When three or more keys agree on the model parameters, the model is evident in the image.
We processed images using the C++ code developed by Lowe [9] and then measured the performance. The obtained results (Table 1) show that most of the calculations are implemented in the first stage: DoG space building. In this step, many calculations are repeatedly processed. Based on this characteristic, we propose a parallel approach for the SIFT algorithm called PSIFT. The proposed algorithm can take advantage of parallelism. The PSIFT will be described in detail in subsection IIIB. In subsection IIIA, we present a new method to calculate the Gaussian filter called block filtering.
> A. Block Filtering Technique
In the SIFT algorithm, the Gaussian filter is used many times during the SIFT feature determination. For calculation time reduction purposes, we propose a new method to implement the Gaussian convolution. In the proposed method, we use a 4 × 4 block to implement the Gaussian filter.
For example, we divide the image into subblocks. The size of these blocks is 4 × 4 pixels. Then, we apply a half kernel to the block; in this case, it is an array with four elements. In Fig. 3, we explain the “half kernel” definition. Because of the symmetric characteristic of the Gaussian kernel, we divide the Gaussian kernel into four “half kernels.”
Fig. 4 demonstrates how these blocks are applied to the image. After loading each 4 × 4 block of the input image. These “half kernels” are applied in four directions (left, right, top, and bottom) to the 4 × 4 block.
The proposed Gaussian filter implementation using the block filtering technique includes the following four steps:
1. Load 4 × 4 pixels of image.
2. Load kernel values.
3. Compute convolution (left, right, top, and bottom). Left, Right: reverse multiplication Top: Matrix transpose Bottom: Matrix transpose and then reverse multiplication
4. By repeating all image data, we will obtain the output image.
Our previous research [10] indicated that by using the block filtering for a Gaussian filter, we can reduce the calculation time by half.
> B. PSIFT Algorithm
As evident from the flowchart of the original SIFT algorithm (Fig. 5), each keypoint will be assigned an orientation and generate a descriptor just after detecting the keypoints for one scale. In this case, if we only detect a few keypoints in the later iterations of the octaves and the scales, there may be very few keypoints for the next computational step. This implies that the load imbalance will occur in the “Assign Orientation” and “Extract Descriptor” steps for each keypoint.
Moreover, the number of keypoints detected from each scale is decreased gradually when the image is downsampled. Because of this, load balance can be a serious issue in the later stages.
Further, the original SIFT algorithm is implemented in a straightforward manner. In this case, the speed of SIFT is quite poor. In the straight implementation, SIFT feature detection is repeated for each image in the scale space. Therefore, we have to repeat the same operation many times. To take advantage of the parallelization characteristic, we propose a parallel approach to implement the SIFT algorithm.
In our approach, we propose a parallel SIFT algorithm (PSIFT) to adapt to the SIMD processor architecture. The flowchart of PSIFT is illustrated in Fig. 5. The first step is to resize the input image into many images in the scale space. In our implementation, the number of images in the scale space is changed from 2 to 4. Then, we consider all images in the scale space and apply the SIFT feature detection process simultaneously.
IV. SIMD PROCESSOR ARRAY ARCHITECTURE
The SIMD pixel processor system [11] exploits the benefits of integrating optoelectronic devices into a highperformance digital processing system. In this system, an array of thinfilm detectors is integrated on top of and electrically interfaced with digital SIMD processing elements. The general architecture of a SIMD system is depicted in Fig. 6. The program is stored in the array control unit, and each instruction is broadcast to every node of the system in a lockstep fashion (i.e., via a single instruction stream). Each node, in turn, executes the received instructions on its local data (multiple data stream), while exchanging data with other nodes through the interconnection network.
Each SIMD processor node is interconnected to its four neighbors through a mesh network closed as a torus. Thus, the opposite rows (or columns) of the mesh are connected to each other, enabling more powerful communication schemes than those available with a standard NorthEast WestSouth (NEWS) network. The microarchitecture of an SIMD processing element (PE) is shown in Fig. 7, along with the interconnection network. The 16bit data path includes an addersubtractor, barrel shifter, and multiplyaccumulator unit. Each PE also includes 64 words of local memory.
Further, each SIMD processor node interfaces with a small array of thin film detectors, which is a subset of the focal plane array. The instruction set architecture allows a single node to address up to 16 × 16 arrays of detectors. Each processor incorporates eightbit sigmadelta analog to digital converters to convert light intensities, incident on the detectors, into digital values. The SAMPLE instruction simultaneously samples all detectors values and makes them available for further processing. The SIMD execution model allows the entire image projected on many nodes to be sampled in a single cycle.
This monolithic integration is the key feature of the SIMD pixel processor system, providing extremely compact, highframerate focal plane processing.
V. SIMD PROCESSOR ARRAY ARCHITECTURE
To execute the PSIFT algorithm, we consider all pixels of the images. By using the specified SIMD array, we distribute all pixels into all PEs in which every PE owns 16 pixels. Assume n is the total number of pixels. As a result, the number of PEs involved in the computation is n/16. By dividing the pixels among n/16 processors, every PE carries out the computation only on the local memory containing the 16 owned pixels along with their membership values as well as center values. Then, the PSIFT algorithm is implemented on n/16 processors in which some new equations are required for every PE. This enhances the performance of the PSIFT algorithm implementation.
Fig. 8 shows how the PSIFT algorithm is implemented on the SIMD parallel architecture. We divided the algorithm into the following six steps:
1. Detect the input image: Distribute the pixels to all processors.
2. Resize image: Resize the input image to images in the scale space. Each pixel of the images in the scale space is also distributed to the processors.
3. Construct DoG space: Compute the Gaussian convolution, and construct the DoG space as described in Section III. Each pixel of the images in the DoG space is stored in the processors.
4. Detect the keypoints from the images in the DoG space.
5. For each keypoint, we assign the orientation according to Eqs. (6) and (7).
Finally, we define the descriptor of a keypoint. Each keypoint descriptor is represented by a vector with 128 elements.
VI. PERFORMANCE EVALUATION
To evaluate the performance of the proposed algorithm, we use a cycleaccurate SIMD simulator. We developed the PSIFT algorithm in the respective assembly languages for the SIMD processor array. In this study, the image size of 256 × 256 pixels is used. For a fixed 256 × 256 pixel system, because each PE contains 4 × 4 pixels, 4,096 PEs are used. We summarize the parameters of the system configuration in Table 2.
The metrics of execution time and sustained throughput of each case form the basis of the study comparison, defined in (8) and (9):
Execution time
Sustained throughput
where
C denotes the cycle count,f_{k} represents the clock frequency,O_{exec} refers to the number of executed operations,U denotes the system utilization, andN_{PE} represents the number of processing elements.Fig. 9 shows the detected SIFT features in the case of a Lena image. As the number of scales increases, the detected SIFT features become more precise.
In this experiment, we use three different images (1, 2, and 3) as presented in Fig. 10. The number of octaves is also changed to 2, 3, and 4 in order to evaluate the complexity of the SIFT algorithm.
Table 3 summarizes the execution parameters for each image in the 4,096PE system. Scalar instructions control the processor array. Vector instructions, performed on the processor array, execute the algorithm in parallel. System Utilization is calculated as the average number of active processing elements. The algorithm operates with a System Utilization of 54% on average, resulting in a high sustained throughput. Overall, our parallel implementation supports sufficient realtime performance (1.03 ms) and provides efficient processing for the SIFT algorithm.
Table 4 shows the distribution of vector instructions for the parallel algorithm. Each bar divides the instructions into the arithmetic logic unit (ALU), memory (MEM), communication (COMM), PE activity control unit (MASK), and image loading (PIXEL). The ALU and MEM instructions are computation cycles, while the COMM and MASK instructions are necessary for data distribution and synchronization of the SIMD processor array. The results indicate that the proposed algorithm is dominated by ALU, MEM, and MASK operations.
In comparison with previous approaches, we implemented the experiment for a 640 × 480 pixel image. The number of octaves is 4. The calculated execution time after the implementation is 3.07 ms. Therefore, our proposed implementation can process a sequence of 640 × 480 images at
1000/3.07 ？ 326 (FPS). Fig. 11 shows the comparison between our parallel implementation with the GPUSIFT, optimization GPU, and the SIFT implementation based on a multicore processor. The results indicate that our proposed method can reduce the calculation time by around 32 times as compared to the GPUSIFT method, 16 times as compared to the optimization GPU method and 7 times as compared to the multicorebased implementation. These results demonstrate that the proposed parallel approach supports fast processing and provides reliable and efficient processing for SIFT implementation.VII. CONCLUSION
Recent advances in a wide range of applications of the SIFT algorithm in the field of computer vision require an increase in computational throughput and efficiency. These increased demands have become an important challenge in implementing the SIFT algorithm in realtime applications. In this paper, a parallel implementation of the PSIFT algorithm using block filtering based on a SIMD pixel processor was presented. By using the SIMD pixel processor system, we fully exploited the available parallelism of the PSIFT algorithm, primarily in the DoG space construction step. The obtained average processing speed was 326 FPS for images with 640 × 480 pixels. In comparison with the previous implementation [8,11], the proposed method was 30 times faster than the GPU and 7 times faster than the multicore implementation. In this research, we simulated the SIFT algorithm in the SIMD pixel processor. From the obtained performance result, we concluded that an actual implementation on hardware promises a good solution to implement the SIFT algorithm in realtime applications. The monolithic design and SIMD operation node allowed the FPS rate to be sustained at variable image sizes. The bandwidth bottleneck between the detector array and parallel processors did not exist even when the image size was increased. The experiment results also indicated that the proposed parallel approach provided a highthroughput, lowmemory implementation and reduced the load imbalance.
where
R _{0} (ρ ) represents the target rate (b/s/Hz) for a given SNRand
P_{e} (ρ ) denotes the error probability assuming the maximum likelihood decoding (To simplify notation,R _0 (ρ ) will be written asR _0 if droppingρ does not cause any confusion). Here,W represents the bandwidth. For the sake of simplicity, the notation ？ is used for representing the relation in (4b): particularly,Pe(ρ) ？ ρd
is identical to (4b), and ？ is referred to as the exponential equality.
The optimal DMT curve represents the maximum diversity gain for a given multiplexing gain
r and is given byd ^{*}(r ). It was shown in [10] that the outage probabilitysatisfies
Pout(R0,ρ) ？ ρd*(r),
where
I denotes the maximum average mutual information between the input and the output, and the error probabilityP_{e} (ρ ) of an optimal DMTachieving scheme also satisfiesP_{e} (ρ ) ？ρ ^{d*(r)} if the block length is sufficiently large.B. Achievability
In this subsection, we show that the simple instantaneous DF relay scheme achieves an optimal DMT curve. An upper bound on the DMT based on an instantaneous AF relay is also derived for the sake of comparison. We start from the following lemma:
Lemma 1. Let F(x;k) denote the cumulative distribution function of a chisquared random variable x with k degrees of freedom. Then, it follows thatand
The proof of this lemma is presented in
[19] . From Lemma 1, it can be easily concluded thatF (x ;2)=O (x ) andF (x ;4)=O (x ^{2}) for smallx ―f (x )=O (g (x )) means that positive constantsM andm exist such thatf (x )≤Mg (x ) for allx >m . Now, we are ready to derive the achievable DMT curve for underwater acoustic systems by using the DF relay protocol.

[Fig. 1.] The scale invariant feature transform (SIFT) algorithm flowchart. DoG: difference of Gaussian.

[Fig. 2.] Difference of Gaussian space construction.

[Table 1.] The distribution of computation time (measured in seconds) for each stage in the SIFT algorithm

[Fig. 3.] An illustration of the half kernel. The Gaussian kernel on the left side is divided into four “half kernels”, which are shown on the right side.

[Fig. 4.] 4 × 4 block filtering.

[Fig. 5.] Flowchart of the parallelization scale invariant feature transform (PSIFT).

[Fig. 6.] Organization of single instruction multiple data (SIMD) parallel architecture. ACU: array control unit.

[Fig. 7.] A block diagram of an single instruction multiple data (SIMD) processor array.

[Fig. 8.] Distribution of image data points to each PE node (N) in which PEs hold 4 × 4 pixels and all PEs work in parallel with a torus interconnection network. PE: processing element,

[Table 2.] System parameters

[Fig. 9.] Features detected by scale invariant feature transform algorithm with a changing number of scales. (a) 2 octaves, (b) 3 octaves, and (c) 4 octaves.

[Fig. 10.] Test images.

[Fig. 11.] Performance comparison between our proposed implementation and the other implementations. FPS: frames per second, GPU: graphic processing unit, SIFT: scale invariant feature transform, SIMD: single instruction multiple data.

[Table 3.] Algorithm performance on a 4,096PE system running at 150 MHz

[Table 4.] The distribution (measured in %) of vector instructions for the algorithm