Design Methodologies for Reliable Clock Networks
 Author: Joo Deokjin, Kang Minseok, Kim Taewhan
 Organization: Joo Deokjin; Kang Minseok; Kim Taewhan
 Publish: Journal of Computing Science and Engineering Volume 6, Issue4, p257~266, 30 Dec 2012

ABSTRACT
This paper overviews clock design problems related to the circuit reliability in deep submicron design technology. The topics include the clock polarity assignment problem for reducing peak power/ground noise, clock mesh network design problem for tolerating clock delay variation, electromagnetic interference aware clock optimization problem, adjustable delay buffer allocation and assignment problem to support multiple voltage mode designs, and the state encoding problem for reducing peak current in sequential elements. The last topic belongs to finite state machine (FSM) design and is not directly related to the clock design, but it can be viewed that reducing noise at the sequential elements driven by clock signal is contained in the spectrum of reliable circuit design from the clock source down to sequential elements.

KEYWORD
Clock design , System reliability , Systemonchip , Embedded systems

I. INTRODUCTION
In a synchronous digital system, the clock signal is used to define a time reference for the movement of data in the system. The clock distribution network distributes the clock signal(s) from a clock source to all sequential elements which require it. Thus, the clock function is vital to the operation of the synchronous system. Nowadays, much more attention has been paid to clock related design than ever before. This is because as the clock frequency increases over a 1 GHz with low supply voltage, a small noise on the clock signal causes a transient function error or even a drastic system failure (the noise comes from many factors such as current charge/discharge variation and temperature variation). One of the most important clock design issues is analyzing the clock signals’ behavior, and mitigating the adverse impact of clock noise on circuit reliability.
This work overviews a number of important clock network optimization problems and the proposed techniques with regard to the circuit reliability in deep submicron design technology. The following subsections cover 1) the clock polarity assignment problem for reducing peak current noise on the clock tree, 2) the clock mesh network design problem for tolerating the clock skew variation, 3) the electromagnetic interference (EMI) aware clock optimization problem, 4) the adjustable delay buffer (ADB) allocation and assignment problem that is useful in the multiple voltage modes design environment and finally 5) the finite state machine (FSM) state encoding problem for reducing peak current in the sequential elements in FSM. Note that the last topic is not directly related to the clock design issue, but it can be regarded that reducing noise at the activation of sequential elements driven by a clock signal is contained within the spectrum of achieving reliable circuits from the clock source down to sequential elements.
II. CLOCK TREE SYNTHESIS OVERVIEW
A typical clock tree synthesis (CTS) process consists of three phases: abstract tree topology generation, clock tree routing, and buffer insertion.
Fig. 1 shows the flow of the methods of means and medians (MMM) algorithm [1]. It takes the clock sink locations as input and generates an abstract clock tree topology by recursively partitioning the sinks into two subregions until there are at most two sinks on each subregion. The partitions are performed either in the x direction or y direction, dividing the number of sinks equally using the concept of median. The partitions are then merged back in reverse order, forming an abstract tree topology where the merging points are determined using the concept of mean (i.e., center of mass) of the sink locations on the subregions to be merged. The more accurate determination of the location of merging points (i.e., the internal nodes of tree) is performed in the wire routing step.
With the abstract topology obtained, the actual wire routing from the clock source toward the sinks is accomplished by other algorithms. One of the most notable algorithms is the deferred merge embedding (DME) algorithm [2,3], which ensures zero clock skew while mini
mizing the total clock wire length (the clock skew refers to the delay difference between the earliest arrival time and latest arrival time to sinks). The DME algorithm runs in two stages; one is a bottomup process from sinks and the next is a top down process from the root of clock tree.
Fig. 2a and b show the bottomup stage in which the algorithm builds
merging segments or merging regions , if a bounded clock skew constraint is employed. Each point in the merging segment or region satisfies the clock skew constraint for the clock subtree rooted at the point, thus the points in the merging segment or region are the candidates for the wire branching location of the corresponding internal node in the abstract tree topology. Fig. 2c and d illustrate the topdown stage. The topdown stage finalizes the exact positions of branching locations. While all the points in merging segments or regions satisfy the clock skew constraint, depending on the point selection, the wire length may change and it is the topdown stage that ensures the minimal length of total wires.The final step of CTS is to insert buffers into the clock tree. The clock slewrate, which is the sharpness of the clock signal voltage rise and fall, must be controlled in modern chip design. With large clock trees, the wire capacitance is too large to be driven by one large buffer placed at the root of the clock tree. Hence, buffers need to be inserted into the clock tree to meet both the slew and skew constraints. Many works have combined the first
two phases with the buffer insertion [46].
Recently, some works attempted to integrate circuit simulation into the CTS process while trading off between the accuracy of the delay model and the quality of CTS [7]. On the other hand, instead of sacrificing simulation time with CTS quality, some approaches have proposed the adaptation of the independent delay model. This is accomplished by carefully constructing clock trees to have a symmetrical structure that guarantees zero clock skew throughout the clock tree. Shih and Chang [8] met near the zero skew (<2.1 ps) goal while achieving orders of magnitude runtime speedup.
III. RELIABILITY AWARE CLOCK DESIGN TECHNIQUES
> A. Clock Polarity Assignment
In synchronous circuits, clock trees and its clocked loads are the major sources of onchip noise (power/ ground voltage fluctuations) since they switch simultaneously near the rising and/or falling edge of the clock signal and draw a significant amount of current from the power/ground rail. To accommodate this simultaneous switching, Vittal et al. [9] and Vuillod et al. [10] took advantage of clock skew to scatter the clock signal arrival times along the time domain.
Thereafter, Nieh et al. [11] introduced additional flexibility to this scattering, by firstly proposing the polarity assignment technique. They demonstrated that mixing buffers and inverters for clock buffering elements can disperse noise over rising and falling edges of the clock as shown in Fig. 3. The approach assigned half of the buffering elements to a negative polarity and the other half positive by replacing one of the two buffers that are connected to the clock source with an inverter. However,
although the total peak current is significantly reduced, since power/ground noise is a local effect, the problem remained largely unsolved. Samanta et al. [12] mixed buffers and inverters throughout the clock tree structure so that for each local region about half of the buffering elements have positive polarity and the other half negative polarity. This approach, however, while greatly reducing noise, is likely to introduce large clock skew. According to more recent data [13], mixing buffers and inverters for nonleaf elements resulted in an average clock skew of 592 ps. Chen et al. [14] observed that noise from leaf buffering elements dominates that from nonleaf buffering elements. This phenomenon is more prominent when the clock tree is not a binary tree where one nonleaf buffering element has more than two leaf buffering elements attached. Hence, they proposed to assign polarity to leaf buffering elements only and assigned half of the leaf nodes to negative polarity, with the objective of minimizing clock skew. However, this approach only heuristically solved the noise reduction problem in that it does not consider the buffer load which affects the peak value of the noise current. Jang et al. [15] proposed and solved a combined problem of buffer sizing and polarity assignment with the proof of the nondeterministic polynomial time (NP)completeness of the polarity assignment problem. It retrieves all assignment combinations under the clock skew constraint and selects the one with the lowest noise. Very recently, Joo and Kim [16] proposed a finegrained polarity assignment to overcome the two limitations of the prior works, which are the unawareness of the signal delay (i.e., arrival time) differences to the leaf buffering elements and the ignorance of the effect of the current fluctuation of nonleaf buffering elements on the total peak current waveform.
Fig. 4 illustrates the difference between nonleaf blind and nonleaf aware polarity assignments. In Fig. 4a, the black dotted line which is the current waveform from leaf nodes is well balanced, indicating that the noise is minimized for the leaf nodes. However, the solid line with blue color which is the total noise waveform of both the leaf nodes and the nonleaf nodes is not balanced. The combined noise causes a higher peak current value than the original expectation that comes from the consideration of leaf nodes only. Fig. 4b shows an improved result in which the nonleaf noise is considered. At leaf nodes, more current noise is migrated to the falling edge of the clock signal, yielding a greater margin for the nonleaf node noise. The results of Fig. 4 suggest that polarity assignment algorithms need to consider the effect of the noise caused by nonleaf nodes on the leaf nodes, even when only assigning polarity to the leaf nodes only.
Fig. 5 compares the polarity assignment results by Jang and Kim’s algorithm [15] and Joo and Kim’s finegrained approach [16]. Joo and Kim’s modeling captured more details of the noise waveform, including smaller bumps at the opposite clock edge of the larger peak noise,
arrival time differences, and nonleaf element noise. Thus, more of the buffering elements were assigned as inverters, showing better balanced noise levels at the rising and falling edge of the clock.
In addition, there are a number of works which have considered other design factors to improve the quality of polarity assignment. Kang and Kim [17] considered a statistical delay variation to increase yield during polarity assignment. Ryu and Kim [18] took a different approach by assigning polarity to leaf nodes first and then routing the clock tree. Lu and Taskin [19] exploited exclusiveor
(XOR) gates, instead of inverters and buffers, so that polarity may be controlled at runtime.
> B. Clock Mesh Network Design
As complementary metaloxidesemiconductor (CMOS) process scaling continues under deep submicron technology while clock frequency increases, the variation effect on clock skew becomes significant. It is known that the delay variation on the interconnect can cause up to 25% variation in clock skew [20], implying the necessity of controlling the variation effect during the clock network design. The meshbased clock distribution network, as depicted in Fig. 6, is one of the solutions that can effectively mitigate the clock skew variation since multiple paths from the clock source to a sink are able to compensate the different clock arrival times.
However, the multiple paths are likely to generate short circuit currents between mesh buffers which have different clock arrival times. Moreover, the meshbased clock network requires much more wires than the treebased clock network in order to support multiple clock signal paths. Consequently, most of related works have focused on reducing the amount of power consumption and wire resource while maintaining clock skew variation tolerance. Venkataraman et al. [21] decided the location and size of buffers first, and then reduced mesh wires; they solved the mesh buffer placement and sizing problem by formulating it into a weighted setcover problem. Precisely, for each node (i.e., buffer) in the clock mesh they defined a so called covering region, which refers to the set of nodes that the node can drive. Subsequently, they locate a minimal set of covering regions that can cover all nodes in the clock mesh. They employed a greedy algorithm which iteratively picks a covering region that covers the largest number of uncovered nodes. The mesh reduction problem was then solved by formulating it into the Steiner network problem [22]; they iteratively removed wire segments while maintaining a certain level of redundancy, i.e., every clock sink should have at least
k closest clock buffers within a distance ofL_{max} . After mesh reduction is performed, readjustment of the buffer size is attempted as a postprocessing. On the other hand, Rajaram and Pan [23] suggested an initial mesh planning; they expressed the total wirelength and worstcase clock skew as functions of mesh size, by which they created an initial mesh size under the worstcase clock skew constraint. In addition, they improved the cost function used in the greedy algorithm in [21] by considering a potential effect of buffer insertion on mesh and lowpass filter characteristics of an RC mesh. They also proposed a network sensitivitybased mesh reduction algorithm in which the delay sensitivity of each sink is calculated in terms of the width of the wire segment. By using this delay sensitivity, they removed the wire segments that cause little effect on clock skew.To demonstrate the variation tolerance of the clock mesh, we compare the clock mesh with the clock tree in terms of the clock skew and resource usage. We implemented the clock mesh with a given mesh size and inserted buffers by using the algorithm of buffer placement and sizing in [21]. Then, the clock trees are generated by applying the algorithm in [3,6]. We tested clock meshes and clock trees on ISCAS89 benchmark circuits. We used 12 buffer sizes with maximumcapacitance ranging from 60 to 500 fF under a slew constraint of 100 ps. The technology parameters and transistor models were based on 45 nm predictive technology [24]. We consider 4 variation parameters: power supply voltage, buffer channel length, wire width, and sink capacitance. All variation parameters are varied with 5% standard deviation around its nominal value. We also consider the spatial correlation by the principal component analysis [25]. The columns in Table 1 are benchmark circuit name, given mesh size (‘Tree’ for the result of clock tree), mean and standard deviation of the clock skew (measured by HSPICE MonteCarlo simulations), total wirelength (WL), buffering area (BA), and power consumption (PWR). The last two rows in Table 1 are the average ratio of clock trees and clock meshes. Table 1 indicates that the clock mesh can reduce 50% of worstcase clock skew (
μ + 3σ) on average with the increase of 42% of total WL, 29% of BA, and 37% of PWR.> C. Electromagnetic Inference Aware Clock Design
Increasing the requirement for high performance circuits and the enforcement of strict governmental regulation leads to a new design solution for EMI. Electromagnetic radiation from an electronic circuit can be categorized into two forms [26]:
differentialmode radiation andcommon mode radiation . Differential mode radiation occurs in the current loop, which is related to onchip wiring, the package and enclosure, and board wiring among other things. In contrast, commonmode radiation is the result of voltage level difference. Thus, commonmode radiation is closely related to the signals on data and clock wires.EMIaware clock optimization is a clock network synthesis method with the objective of reducing EM emis
emission by controlling clock parameters. The key clock parameters are slew rate and clock skew. Slew rate is the rise/fall time in the clock signal which typically has a trapezoidal waveform. With Spectral analysis, the clock signal can be represented in the sum of series regarding sine and cosine functions. The amplitude of high frequency functions critically depends on the speed of slew rate [27]. In other words, a clock signal with a fast slew rate has larger amplitude in high frequencies compared to those with a slower slew rate. Previous works demonstrate that in order to reduce total EMI it is very effective to decrease the slew rate of the clock signal [2729]. Pandini et al. [27] proposed a method to decrease the slew rate by manually removing buffers and inverters with higher driving strength from the target library. Hu and Guthaus [29] proposed a method of weakening buffer driving strength by setting the constraint of minimum slew rate during clock tree synthesis. They proposed an incremental dynamic programming algorithm to determine buffer sizing and positioning for a clock tree, considering metrics such as skew and power used in [27]. EM emission is also primarily generated by simultaneous switching noise (SSN). In the clock network, sink buffers and connected flipflops toggle almost at the same time, creating a large current pulse on the power network, and fast current variation increases di/dt, causing a high SSN. Note that peak current can be controlled by spreading the activity of buffers and flipflops within a time period, but it causes a clock skew variation. Vuillod et al. [10] proposed a solution to reduce peak current by relaxing the clock skew constraint. Pandini et al. [27] performed a theoretical analysis on power/ground noise by describing noise using a periodic triangular pulse, and demonstrated that increasing skew bound is effective in reducing EM emission.
> D. Adjustable Delay Buffer Allocation
Traditional clock optimization techniques are mainly based on the environment in which all elements in a circuit use one fixed operating voltage. However, a recent technological trend requires multiple supply voltages to allow the voltage level applied to a circuit to be dynamically changed. In this multivoltage mode design, some part of the circuit could operate at high voltage when the associated module is required to complete its processing quickly and at low voltage when timing requirement can be relaxed and reducing power is more important. When the supplied voltage switches from high to low (or low to high), the delay of all logic elements including buffers in the clock network on the part of the chip also varies. One serious problem in this multivoltage mode design is the clock skew variation on the clock network.
To tackle the clock skew variation problem, Su et al. [30] proposed a methodology of dealing with clock skew optimization in which they proposed using ADB, which is a specially designed buffer so that its delay can be controlled dynamically. The general structure of ADB is shown in Fig. 7. It is composed of a normal buffer, internal capacitor bank, and a capacitor bank controller. The delay of ADB could be adjusted by activating the internal capacitor bank. That is, the adjusted delay of ADB is determined by the amount of activated capacitors in the capacitor bank that is controlled by the capacitor bank controller and its control input.
Fig. 8 demonstrates how ADBs can be used to fix clock skew violations. In Fig. 8a, an initial clock tree with clock signal arrival times for each clock sink in two power modes is given. With the clock skew bound of 10, there is one clock skew violation between FF_{1} and FF_{6} in power mode 1. By replacing buffer B_{1} with an ADB and inserting an additional delay of 3, this violation can be fixed. For power mode 2, there are two violations, one between FF_{3} and FF_{6} and the other between FF_{2} and FF_{6}. By allocating ADBs at B_{1} and B_{3} and assigning their delay values with 4 and 1, respectively, the clock skew violation is removed. The resulting ADB allocation is shown in Fig. 8b.
The problem to be solved in using ADBs in a multivoltage mode design is to minimize the cost of ADBs to be used since an ADB has more transistors than a normal buffer due to the internal capacitor bank and controller. The approach in [30] focused on the power mode with the worst violation of the clock skew constraint and resolved the worst clock skew by adding ADBs in a greedy man
ner. The approach was then repeated until there was no clock skew violation for every power mode. While the approach can efficiently find an ADB allocation with no clock skew violation, an optimal ADB allocation is not guaranteed due to the inherent limitation induced by the iterative heuristic. Later on, Lim and Kim [32] improved the work in [30] by proposing a linear time optimal algorithm for the ADB allocation and delay value assignment. Initially, they replaced all buffers in the clock tree with ADBs and then removed unnecessary ADBs through a comprehensive analysis of the timing relation between ADBs. They performed the analysis in a bottomup fashion in the clock tree and showed that the ADB allocation is a polynomialtime optimal algorithm, bounded by
O(N·K) whereN andK are the number of buffers in the clock tree and the number of power modes, respectively.Recently, Lim et al. [32] addressed several practical design issues related to the nonideal characteristics of ADBs. First, nonideal ADBs have nonuniform discrete delay steps which harm the optimality of ADB allocation in [32]. Second, ADBs have a nonzero base delay that is larger than that of typical buffers even when all the capacitors in the bank are deactivated. Third, the output slew increases as more of the capacitors in the bank become active. Last, further ADB area reduction is possible without affecting the number of ADBs currently allocated. They addressed these issues by proposing heuristic iterative refinement algorithms and proposing a new modified design of the ADB circuit.
> E. Noise Aware FSM State Encoding
Since FSM is a synchronous component controlled by the clock signal of a circuit, it also draws the peak current on the power line (
V_{DD} ) and ground line (V_{ss} ) of the circuit when state transition occurs. Thus, it is also important to reduce the number of FSM flipflops in the state register that switch simultaneously. Peak current in state encoding can be explained using the example given in Fig. 9, which shows state transition graphs (STGs) of a FSM. The maximum switching on the left FSM (STG_{1}) occurs when state changes fromS _{0} toS _{4} (three 0to1 transitions). Meanwhile, the maximum switching on the right FSM (STG_{2}) occurs when state changes fromS _{3} toS _{0} (two 1to0 transitions). This example demonstrates that proper state encoding may reduce the circuit noise.Huang et al. [33] start from the encoding result produced by the scheme in [34] which targets reducing the total power (i.e., total switching activity) of an input STG. The idea of reducing peak switching among state transi
tions is to decide whether the value of each bit position in the previously encoded registers is to be complemented or not. For example, if the solution selects to complement bit positions 0 and 3, then all the values in bit positions 0 and 3 in state encoding are complemented. They formulated the problem of determining bit complementing/ uncomplementing into an integer linear programming (ILP), targeted towards minimizing the larger of the maximum numbers of bit positions that switch from 0 to 1 among state transitions and the maximum number of bit positions that switch from 1 to 0.
For
n state FSM which has anm bit code length, the approach explores design space as large as 2^{m}. Lee and Kim [35] improved Huang’s work by considering the tradeoff between power and noise, in which they minimized peak switching, followed by minimizing total switching (i.e., minimizing total dynamic power consumption). They formulated the peak switching minimization problem into a satisfiability (SAT) problem with pseudoboolean (PB) expressions, and solved it by using a PBSolver [36]. Then, they iteratively updated the PB expression to extract a state transition which results in the highest switching, and running PBSolver with the additional conjunctive normal form (CNF) constraints. On the other hand, Gu et al. [37], similar to the work in [33], started from an encoded FSM with minimal total switching, but they identified a set of transitions (called working setS ) that cause peak current and applied both statereplication and stateencoding toS . Statereplication replicates a state to assign another code to each replicated state with the proper generation of state transitions while stateencoding reencodes the given code of a state to another unused code. With an input FSM with minimum total switching, by applying statereplication and stateencoding in a combined manner iteratively, they reduced the peak switching.IV. CONCLUSION
As the process technology scales down, the variation or sensitivity of noise and delay in the clock network becomes worse and worse, which in fact causes a drastic impact on the reliability of the circuit. This paper reviewed several important clock related design problems and the existing techniques that are essential to the highly reliable circuit design. In addition to the introduced design optimization and synthesis issues, there exist other issues that are also important to tolerate or mitigate circuit noise. Those examples are 3D clock design considering through silicon via (TSV) design variation and circuit reliability issues under the powergated or clockgated design environment. We hope that this survey helps to find or assess practically useful solutions to the important clock optimization problems that arise in the designs supporting diverse platforms, applications or environments (e.g., [38,39]).

28. Hardin K. B., Fessler J. T., Bush D. R. 1994 [Proceedings of the IEEE International Symposium on Electromagnetic Compatibility] P.227231

[Fig. 1.] The flow of the methods of means and medians algorithm [1]. (a) The location of all input clock sinks is fed as input. (b, c) Sinks are equally, using the concept of median of sink locations, partitioned into subregions (arbitrarily in the x or ydirection) until at most two sinks are left on each subregion. (d) The partitions are merged back in reverse order, using the concept of mean (or center of mass) of the sink locations on the two subregions to be merged to produce an abstract tree topology.

[Fig. 2.] The process of the deferred merge embedding algorithm. (a, b) Merging segments, which are the candidates of points of clock tree branching locations, are constructed in a bottomup manner. (c, d) Exact branching locations are identified and wire length minimization is performed in a topdown manner.

[Fig. 3.] Current profiles for a buffer and an inverter. (a) Buffers draw larger current from the power rail at the rising edge of the clock and discharges the current to the ground rail at the falling edge. (b) For the inverter, the opposite phenomenon occurs.

[Fig. 4.] (a) Current waveforms by nonleaf nodes' noise unaware optimal polarity assignment to leaf nodes. (b) Current waveforms resulting from nonleaf nodes' noise aware optimal polarity assignment to leaf nodes. Dark dotted line is the current waveform from leaf nodes only while blue solid line shows the total current from all clock nodes.

[Fig. 5.] Current waveforms for design s35932. (a) Result of polarity assignment without considering the noise effect by nonleaf nodes in Jang and Kim [15]. (b) Result of polarity assignment with the consideration of the noise effect by nonleaf nodes in Joo and Kim [16].

[Fig. 6.] Meshbased clock distribution network.

[Table 1.] Comparison of clock tree and clock mesh in terms of clock skew, wire length (WL), buffer area (BA), power consumption (PWL)

[Fig. 7.] Structure of adjustable delay buffer [31].

[Fig. 8.] An example clock tree with two power modes and the information of clock signal arrival times at clock sinks. (a) Under clock skew bound of 10, there exist clock skew violations. (b) Two adjustable delay buffers are allocated at B1 and B3 to resolve the violations.

[Fig. 9.] Two state transition graphs (STGs) of finite state machine.