Optimized and Portable FPGABased Systolic Cell Architecture for Smith？WatermanBased DNA Sequence Alignment
 Author: Shah Hurmat Ali, Hasan Laiq, Koo Insoo
 Publish: Journal of information and communication convergence engineering Volume 14, Issue1, p26~34, 31 March 2016

ABSTRACT
The alignment of DNA sequences is one of the important processes in the field of bioinformatics. The Smith–Waterman algorithm (SWA) performs optimally for aligning sequences but is computationally expensive. Field programmable gate array (FPGA) performs the best on parameters such as cost, speedup, and ease of reconfigurability to implement SWA. The performance of FPGAbased SWA is dependent on efficient cellbasic implementationunit design. In this paper, we present an optimized systolic cell design while avoiding oversimplification, very largescale integration (VLSI)level design, and direct mapping of iterative equations such as previous cell designs. The proposed design makes efficient use of hardware resources and provides portability as the proposed design is not based on gatelevel details. Our cell design implementing a linear gap penalty resulted in a performance improvement of 32× over a GPP platform and surpassed the hardware utilization of another implementation by a factor of 4.23.

KEYWORD
Bioinformatics , DNA sequence alignment , FPGA architecture , Smith？Waterman algorithm , Systolic cell

I. INTRODUCTION
Bioinformatics is the convergence of diverse and often divergent fields such as biology, mathematics, statistics, computer science, and computer engineering. Bioinformatics seeks to find answers to the allrelevant and allimportant question of the origin of life and the processes of life. Bioinformatics takes a different perspective on life depending on the scale or scope on which life is considered. From the perspectives varying from molecular to organism to evolution, one process lies at the heart of it all: sequence alignment.
Sequence alignment is concerned with finding the relatedness of one sequence to another sequence. Relatedness indicates whether the sequences are homologous and whether they share a common domain and motifs or not. Finding relationships among biological sequences helps significantly in knowing concepts such as how proteins are related to each other in an organism and to other organisms with which this organism shares an evolutionary history, thus providing significant insight into the meaning and function of life. Moreover, sequence alignment has diverse applications such as drug engineering, disease diagnostics, detection of autoimmune disease, genetic engineering of plants and animals, determination of the structure of proteins, evolution tracing, and determination of protein functions [1]. Sequence alignment compares two sequences by calculating the distance between the sequences. This distance represents the minimal cost at which one sequence can be converted into another. A score is assigned to each distance computed, and from this distance matrix, the degree of similarity or correlation between the sequences is inferred. For computing the abovementioned distance, two elementary operations are used, namely substitution and insertion/deletion (indel). Substitution signifies that one base in the sequence is replaced by another, while indel represents the insertion/deletion of a base into/from the sequence.
DNA is composed of four bases, namely adenine, cytosine, thymine, and guanine. When aligning two sequences, these bases are represented by the letters A, C, T, and G, respectively. Strings of these letters form a DNA sequence. Further, the basic alignment tool is the dot matrix. In this type of alignment, one sequence is arranged rowwise, while the other is arranged columnwise. A dot is placed in the cell that corresponds to a match of the letters of the row and the column.
As can be seen, DNA sequence alignment can be a very tedious and resourceintensive operation because DNA sequences can be millions of characters long and aligning them by conventional methods is not possible. For the alignment process to be meaningful, it needs to be accelerated. Such acceleration can be carried by using either software or hardware. Software acceleration employs heuristic methods and hence, is not optimal. For optimal acceleration, exact methods that compare each character against every character of the other sequence are required. Exact methods are very time consuming and resource intensive. To achieve optimality, they need to be carried out on dedicated hardware.
The Smith–Waterman algorithm (SWA) is an exact sequence matching algorithm that performs optimally as it can return local alignments and thus, provide considerably accurate information about the similarity between sequences that are aligned against each other [2]. SWA is a dynamic programmingbased algorithm that decomposes the alignment problem into subproblems and iteratively solves the given problem [3]. SWA has the space and time complexity of
O (MN ), whereM andN denote the lengths of the sequences to be aligned. The space and time complexity is the best among other alignment methods and provides accurate and local alignment [3].SWA is a resourceintensive operation and thus, needs dedicated hardware for efficient and fast operation. Different hardware implementation candidates can be found, such as FPGA, GPU, and cell BE platforms. In [4], a comparison between the FPGA, GPU, and cell BE reconfigurable platforms for aligning biological sequences on the basis of speed, energy consumption, and developmental costs has been presented. Further, it has reported that FPGA outperforms the other two platforms in terms of performance per watt and performance per dollar spent. The advantage of FPGA in highperformance computing is that, if it is used as an accelerator, the computing density can be increased significantly as FPGA benefits from the increased clock speed as opposed to general microprocessors that have reached their maximum clock speed [5].
Because of the low cost and speedrelated advantages, all recent hardwarebased SWA implementations are carried out on FPGAs [6]. The performance of an implementation on FPGA depends upon the cell design that does the calculation for SWA. This work focuses on designing an efficient systolic cell for implementing SWA on FPGA. The main objective of this work is to design a systolic cell that uses relatively few hardware resources but at the same time, is not designed on a lowlevel device/platform and does not depend on the specific details of the lowlevel device/platform. Therefore, the design can be implemented on any other FPGA device or platform. Our implementation has a tradeoff between complexity and performance. Moreover, it avoids both oversimplification and wastage of hardware resources by avoiding a direct mapping of the iterative equations to the hardware. This work should be considered in a broader framework developed by [7] and focuses particularly on designing a cell that optimally fits the system design of [7].
The performance issues associated with the transfer of the required data to FPGA and back has been studied by [8]. GPU platforms are explored in the same capacity as FPGA. The authors in [9] studied the wellknown seedandextend algorithm to accelerate sequence alignment. Meaningful alignment often requires the alignment of short DNA sequences. The mechanisms and specifics of the alignment of short DNA sequences are sometimes divergent with the goals of the alignment of large sequences. The authors in [10] developed a platform that through GPU could accelerate the alignment of short sequences. Other architectures are also explored for accelerating the alignment of DNA sequences. In [11], a single instruction multiple data (SIMD)based architecture is implemented on an applicationspecific instructionset processor (ASIP) for aligning DNA sequences.
The rest of this paper is organized as follows: Section II discusses the SWA in detail. Section III presents the principles on which the proposed cell design is based. Section IV describes the proposed cell design. Section V presents the results of this study, and Section VI concludes the paper.
II. SMITH？WATERMAN ALGORITHM
Needleman–Wunch [12] presented an efficient algorithm(NW algorithm) for aligning two sequences. The method was exact and thus, produced optimal results. However, it performed global alignments. Global alignments are good when the sequences to be aligned are closely related as endtoend alignment highlights the similarity across their lengths but they fail to perform optimally when the sequences to be aligned do not have large homology. Smith and Waterman presented a variant of the NW algorithm in 1984 and called it the Smith–Waterman algorithm (SWA) [3]. SWA finds the local alignment and works well for sequences that are distantly related as such sequences do not manifest the overall similarity but have regions that have concentrated local similarities.
Local alignments are carried out by making small changes to the equations that perform global alignments. SWA like the NW algorithm consists of three stages for performing alignment, namely initialization, matrix fill, and trace back. The initialization and matrix fill stages resemble those of the NW algorithm, while the trace back stage differs from that of the NW algorithm. These three stages can be described as follows:
Initialization : Assume that we have two sequencesx andy of lengthi andj , respectively, are to be aligned. The initialization step is different for the NW algorithm and SWA in that the top row and the leftmost column are initialized to 0 in the SWA, whereas in the NW algorithm, the upperrightmost element (0,0) is set to zero and the top row and the leftmost column are initialized with the cost of gap penalties of lengthi andj .Matrix fill : In the matrix fill stage, two characters are compared to each other and a score is assigned to the comparison on the basis of (1), whereF (i, j ) denotes the element that is computed presently,s (xi, yj ) represents the similarity score that is selected on the basis of the similarity between the two characters, whiled indicates the gap penalty.(1) uses a linear gap penalty scheme as every gap is penalized in the same way. A more biologically relevant method for handling gaps is the affine gap model that penalizes opening a gap more than extending an already opened gap. Such a scheme is given in (2).
In (3) and (4),
o denotes the gap opening penalty, whilee represents the gap extension penalty. The matricesF (i, j ) andE (i, j ) contain the trace of opening and extending a gap, respectively.F (i, j ) stores the cost for opening a gap and extending a gap on sequencex , whileE (i, j ) stores the cost for opening and extending a gap on sequence y.Trace back : In SWA, trace back starts from the element that has the maximum score. Moving in the direction of the cell that has the highest value, 0 is encountered and the alignment is stopped. Indels are created; i.e., “―” is inserted in the alignment for the vertical or horizontal selection of the cell instead of the diagonal movement. In the affine gap model, three separate matrices need to be stored, whereas in trace back, all the matrices are searched to find the highest value.III. SYSTEM MODEL
Before presenting the system model, we will evaluate the performance of the SWA used in the proposed design.
> A. Performance Evaluation of SWA
SWA performs optimally on many performance indicators such as accuracy, parallelism, and space and time complexity [13]. SWA is considered the most optimal because it performs the alignment of every individual character in one sequence against every character in the other sequence. This results in the massive space complexity of
O (MN ), whereM andN denote the sizes of the two sequences, respectively. However, SWA also computes the cell’s value independent of the reset of the matrix and dependent only upon the three previous cells and the similarity score.Both finegrained and coarsegrained parallelism can be exploited in SWA. Finegrained parallelism works on making the required data available for a cell and then computing the values of these cells in parallel; the values of the three previous cells are available for this computation. Coarsegrained parallelism is achieved by dividing the alignment problem into subproblems and solving the subproblems in parallel, thus reducing the overall time complexity.
In calculations that are not parallel, computing the entire matrix takes
M ×N cycles, which increases tremendously with an increase in the sequence length. From Fig. 1, we can see that diagonal elements in the matrix can be computed in parallel; therefore, the time complexity can be reduced to the number of diagonals in a matrix. There existM +N −1 diagonals in an array; therefore, two sequences having lengths of M and N can be aligned inM +N −1 cycles.> B. Systolic Cell Design for SWA
This section presents an efficient PE design for aligning two sequences by using a linear systolic array (LSA) on FPGA. The systolic array was introduced by Kung and Leiserson [15]. Systolic arrays have proven to be significantly efficient in parallelizing computing designs. These arrays have been used for performing the complex function of multiplying two long sequences: the sequences are arranged in rows and columns, and two elements from the sequences are multiplied by the PE in a systolic array. The idea of systolic arrays was used by Lipton and Lopresti [16] in 1985 for implementing the edit distance algorithm, a standard global alignment algorithm for DNA sequences. In 1987, on the basis of the initial results, which showed an improvement in speed in the order of hundreds to thousands of times, Lopresti developed the first fullfledged system for aligning nucleic sequences. This system was called PNAC. Further, BISP, BioScan, BSys, Splash/Splash 2, and Kestrel employ systolic arrays for aligning two sequences.
When SWA is mapped to a systolic array, each PE computes one element of the alignment matrix at a time, and an entire column is computed during the time span of the entire operation. Further a PE array is mapped to the antidiagonal lines of the alignment matrix. A PE that is composed of elements that perform simple operations such as addition, subtraction, and comparison makes the design very fast. The focus is always on making the PE simpler, faster, and thus, more efficient. The overall system design is adopted from our previous work as presented in [7]; the corresponding block diagram is illustrated in Fig. 1.
To achieve the goal of designing an efficient PE, it is imperative to reduce the area that a PE uses. Besides being slow, oversized PEs waste significant hardware resources. This reduces the available resources that can be used for realizing a large number of PEs and hence, reducing the size of the LSA. The size of the LSA controls the degree of parallelism and thus, the speedup achieved. Oversized PE also adversely affects the clock frequency through which the LSA operates. As the number of logic units increases in a PE, the clock will need more time to traverse to the end, subsequently affecting the rate at which the computation of the PE can be completed.
Previous design methods of PE can be broadly categorized into three categories: The first one oversimplifies the design of PE. The second approach employs a direct implementation of an iterative equation to the LSA. The third approach attempts to carry out optimizations at the VLSI level by integrating two PEs. All these approaches have problems of their own. The first method calculates only the edit distance between two sequences and thus, results in nonused hardware resources. The second results in an overuse of hardware resources, while the third results in compatibility issues as such a design cannot be migrated to other platforms.
Therefore, we need to come up with a design that avoids all of the abovementioned pitfalls. A design which is not overlysimplified and does not overuse resources, and which is also compatible with other platforms is needed. The next section looks at such a design for aligning two sequences by using SWA with linear gap penalties.
IV. CELL DESIGN
This section presents the proposed design for SWA. First, the cell design for SWA with a linear gap penalty is presented, and then, the cell design for SWA with an affine gap penalty is discussed.
> A. PE Design for SWA with Linear Gap Penalty
A portable design that judiciously uses hardware is possible if the iterative equations of SWA can be simplified and captured in standard hardware description languages (HDLs) such as Verilog. The major bottleneck with the recursive equation of SWA is that the data precision of the computed elements is very high. For the alignment of DNA sequences, a data precision of 16 bits is used. The logic units that a PE uses are very basic, such as adders and comparators, but the size of these units increases the hardware resources consumed by an individual PE. Therefore, the size of the logic units needs to be reduced to achieve an overall improvement in the hardware utilization. If Lopresti’s observation is realized in the field of PE design, significant savings in hardware usage can be achieved.
In 1985, Lipton and Lopresti [16] observed that the values of the top element
b and the left elementc are restricted to a certain distance from the corner element, i.e.,a + 1 ora – 1. Therefore, the iterative equations of SWA can be modified to accommodate this fact as in (5):In (5),
S andT denotes the two sequences to be matched, andi represents the index of both the sequences. In (5), the intermediate values that are used for calculating the finalF (i, j ) can be derived from the corner element. These new values differ from the corner values by +1 or –1 or by 0. Therefore, two bits can be used for representing them. The calculations of the difference vectors are expressed in (6) and (7);The values as computed in (6) and (7) are used as intermediate values, and thus, the adders and comparators involved compare two inputs of two bits each instead of two inputs of 16 bits each. The output
D (i, j ) is added toF (i  1,j  1)_{15:4} in order to obtainF (i, j ). Fig. 2 shows the basic PE design for the SWA linear gap penalties. The sequence comparison unit compares two characters, one from each sequence, and generates a similarity score on the basis of this comparison. The sequence Ns is shifted by one, and the current Ns is buffered by one cycle as it is used in the next PE after one clock cycle. The difference bits from the left element and the diagonal elements are separated in the first clock cycle. A subtraction unit subtracts the diagonal element from the left element to obtain the difference vector. This difference vector is used in the rest of the calculations. The remaining 12 bits from the diagonal element are added to the resultant difference vector in order to obtain the value of the present cell. The final difference vector is buffered for one clock cycle, and this buffered value is used as the top element in the next clock cycle, as in LSA. Then, in the next clock cycle, the lower value in the column is computed by the PE. The value of the cell is compared with Max_in, i.e., the global maximum, to find the global maximum. This global maximum is then compared with the current maximum value of the cell, which is stored in a buffer, and the greater of the two is stored. This value is also forwarded to the next PE as the global maximum. This optimized cell uses twobit units for computing the iterative equations of SWA. The last adder that adds the bits from the diagonal element to the difference vector is a 16bit adder, and the comparators that compute the global maximum and the maximum value of the cell are 16 bit wide.> B. PE Design for SWA with Affine Gap Penalty
Affine gap models are used for aptly representing the biological fact of the costliness of opening a gap than that of extending it. An affine gap model penalizes opening a gap more than extending an already opened one. (2)–(4) present the Smith–Waterman iterative equations for affine gap models. Fig. 3 presents the block diagram for an affine gap penalty cell.
Two new units are added to calculate
E (i, j ) andF (i, j ).E (i, j ) andF (i, j ) amortize the costs of opening new gaps on the first sequence and the second sequence, respectively. Difference bits are taken from the left element and are subtracted from the value of the gap opening penalty, i.e.,o , while fromE (i  1,j ), the previous PE gap extension penalty e is subtracted; then, both these values are compared to find the difference vector forE (i, j ). The value ofE (i, j ) is forwarded to the next PE. The value ofF (i, j ) need not be taken as an input from the previous PE; instead, it is generated in the current PE by itself. The current value ofF (i, j ) is buffered as the value of the present difference vector is delayed by one clock cycle in order to be used in the next cycle for generatingF (i, j ). From these values, o and e are subtracted to generate the difference vectors ofF (i, j ). The maximum values ofE (i, j ) andF (i, j ) are compared with the maximum of 0 and the difference vector of the left element to find the current difference vector. To this difference vector, the remaining bits of the diagonal element are added to get the present value of the cell. The calculation of the maximum value is the same as that of the linear gap penalty cell.V. RESULTS AND DISCUSSION
In order to reduce the cost incurred by acceleration through FPGA, we implemented the design on lowcost Spartan3 FPGA, a product of Xilinx. Spartan3 is a lowpriced device of Xilinx but is sufficient for implementing a design of low–midlevel complexity. It does not contain many logic slices to realize a highperformance design. It does not allow the implementation of PEs in the order of hundreds. The implementation of only a limited number of PEs is allowed. This affects the overall speedup achieved. The speedup achieved is at a cost incurred such as energy consumption and hardware resources used, thus, a balance has to be maintained between speedup achieved and the cost incurred.
The most often used performance metric for computational biology applications is the number of cells updated per second (CUPS), and it can be calculated as expressed in(8):
The device that we used is Spartan 3 XC3S250E. We could instantiate an array of PEs of size 35 on the said device. The operating frequency shown by the postplace and route simulation model was 50 MHz. The proposed cell design implemented in the framework results in 1.75 GCUPS for the affine gap penalty cell and 2.75 for the linear gap penalty cell. This is a 20× and 32× performance improvement over the GPP implementation described in [4]. Table 1 presents the performance improvement.
The authors of [17] built highly customizable PEs for the alignment of biological sequences on FPGA. They could implement an array of PEs of size 13 on a Spartan 3 1500 chip. Table 2 presents the efficiency of the proposed design in terms of the hardware utilization. The proposed design more efficiently utilizes hardware as it can realize more PEs by using limited hardware resources in contrast to the implementation described in [5] that realized fewer PEs by using a lot of the hardware resources available.
The number of PEs realized can be increased if our memory utilization logic is not implemented and an UART that directly sends the output is used. Performance can be improved by using a device that has a higher clock speed and has more hardware resources available.
Implementations such as the implementation described in [18] achieve very high performance but at a very high cost; further, the design cannot be implemented on any other platform because it incorporates the VLSIlevel details as in contrast, the proposed system achieves comparable performance at a very low cost.
VI. CONCLUSION
In this paper, we presented an efficient and lowhardwareresourceconsuming systolic cell design for implementing SWA on FPGA. Cell design is the most important part of designing an efficient and fast implementation of SWA on FPGA because the actual matching takes place here. Previous designs either oversimplified the cell design for implementing the iterative equations or avoided optimization altogether and directly mapped the iterative equations to the hardware. This resulted in either inflexibility or the wastage of hardware resources.
The proposed design aimed at a tradeoff between flexibility and performance. Equivalent equations of the iterative equations that require fewer iterations for the implementation were used instead of the original iterative equations. The results showed performance improvement over a GPP implementation in the order of 32× for the linear gap penalty scheme and of 20× for the affine gap penalty scheme. Moreover, the results were compared with those of a design that was implemented on a device that had more hardware resources than our Spartan XC3S250. Further, we showed that the proposed design surpassed the hardware resource utilization of the other design by a factor of 4.29 in the case of the linear gap penalty scheme.

[]

[]

[]

[]

[Fig. 1.] Block diagram of the overall system design.

[]

[]

[]

[Fig. 2.] Cell for the linear gap penalty.

[Fig. 3.] Cell for affine gap penalty.

[]

[Table 1.] Performance improvement in terms of CUPS

[Table 2.] Hardware utilization improvement