Recent research indicates that transient errors will increasingly become a critical concern in microprocessor design. As embedded processors are widely used in reliability-critical or noisy environments, it is necessary to develop cost-effective fault-tolerant techniques to protect processors against transient errors. The register file is one of the critical components that can significantly affect microprocessor system reliability, since registers are typically accessed very frequently, and transient errors in registers can easily propagate to functional units or the memory system, leading to silent data corruption (SDC) or a system crash. This paper investigates the impact of register file soft errors on system reliability and develops cost-effective techniques to improve register file immunity to soft errors. We propose the register vulnerability factor (RVF) concept to characterize the probability that register transient errors can escape the register file and thus potentially affect system reliability, together with an approach to compute the RVF based on register access patterns. We also propose two compiler-directed techniques and a hybrid approach to improve register file reliability cost-effectively by lowering the RVF value. Our experiments indicate that, on average, the RVF can be reduced to 9.1% and 9.5% by hyperblock-based instruction re-scheduling and reliability-oriented register assignment, respectively, which can potentially lower the reliability cost significantly without sacrificing register value integrity.
Recent research efforts indicate that microprocessors will become increasingly susceptible to transient errors (also called soft errors) due to shrinking feature size, lower supply voltage, higher frequency and higher density. Unlike hard errors that can be detected in the testing phase, transient errors occur at operation time, which can lead to
Soft error rate (SER) is typically described in failure in time (FIT). One FIT represents one error in a billion hours. Soft errors can be divided into two categories: undetected or detected. Undetected errors are also called SDC. The detected errors can be either recoverable or unrecoverable. The latter is referred to as
While in theory one would like to eliminate all soft errors so that the system is entirely error free, in practice industry typically sets soft-error-rate budgets for its products based on target market requirements. For instance, IBM targets 114 SDC FIT, 4,566 system-kill DUE FIT and 11,415 processor-kill DUE FIT for Power4 processors [2]. Therefore, designers should develop or choose the most cost-effective mechanisms that meet the pre-defined reliability goal in terms of SDC FIT and DUE FIT, so as to minimize reliability cost.
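To give a feel for what such FIT budgets mean, the following sketch converts a FIT rate into a mean time to failure. It assumes a constant failure rate, so that the mean time to failure is simply the reciprocal of the rate; the function names are illustrative and not part of the original work.

```python
HOURS_PER_BILLION = 1e9      # one FIT = one failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365    # assumption: 365-day year


def fit_to_mttf_hours(fit):
    """Mean time to failure in hours implied by a FIT rate (constant-rate assumption)."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def fit_to_mttf_years(fit):
    """Same quantity expressed in years."""
    return fit_to_mttf_hours(fit) / HOURS_PER_YEAR


# FIT budgets quoted in the text for Power4 processors
budgets = {"SDC": 114, "system-kill DUE": 4566, "processor-kill DUE": 11415}
for name, fit in budgets.items():
    print(f"{name}: {fit} FIT -> about {fit_to_mttf_years(fit):.0f} years MTTF")
```

Under this assumption the three budgets correspond to roughly 1,000, 25 and 10 years between failures, respectively.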
With the widespread use of load/store architectures, modern microprocessors often employ register files with a large number of registers and multiple ports, which unfortunately are susceptible to soft errors. Moreover, since registers are accessed very frequently, soft errors occurring in the register file can easily propagate to the functional units or the memory hierarchy, leading to severe system reliability problems. Previous work has already shown that soft errors in register files can lead to a large number of system failures [3]. Some processors use error detection and correction schemes in the register files to enhance register file immunity to soft errors. For instance, the IBM G5 utilizes an ECC-based scheme [4] to protect the registers. While the ECC scheme can detect double-bit errors and correct single-bit errors, it cannot correct double-bit errors. In addition, the ECC scheme is costly in terms of performance and energy consumption. Tremblay and Tamir [5] show that a simple ECC operation can incur three times the delay of a simple arithmetic logic unit (ALU) operation. Although ECC computation and verification can be performed in the background, the energy consumption cannot be hidden. Recent work indicates that the energy consumption of ECC is approximately an order of magnitude larger than that of a register access [6]. Therefore, ECC protection is a very expensive mechanism for registers, especially for embedded processors with cost constraints. Compared to ECC, a less expensive technique to enhance register file immunity is parity checking. However, the reliability improvement from parity is limited, because parity-based schemes cannot correct any errors and cannot detect errors that flip an even number of bits. Therefore, it is important to develop cost-effective techniques to enhance register file reliability without significantly affecting cost, performance and energy consumption, especially for embedded processors.
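The limitation of parity noted above can be illustrated with a short sketch. It models a single even-parity bit over a register value and shows that a one-bit flip changes the parity while a two-bit flip does not. The helper names and the example value are hypothetical.

```python
def parity(word):
    """Parity of an integer word: 1 if the number of set bits is odd, else 0."""
    return bin(word).count("1") % 2


def flip_bits(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    for p in positions:
        word ^= 1 << p
    return word


def parity_detects(original, corrupted):
    """A parity check flags an error only when the parity of the value changes."""
    return parity(original) != parity(corrupted)


value = 0b10110010                   # hypothetical register contents
single = flip_bits(value, [3])       # single-bit upset
double = flip_bits(value, [3, 6])    # double-bit upset

print(parity_detects(value, single))  # single-bit error changes the parity
print(parity_detects(value, double))  # double-bit error leaves the parity unchanged
```

Even when parity does flag the single-bit upset, it provides no information about which bit flipped, so the value cannot be corrected without another copy.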
To protect the register file against transient errors cost-effectively, the first step is to understand the impact of register soft errors on system reliability. Estimation based on the raw register SER is too conservative, since not all register soft errors can affect system reliability. Overestimating the register reliability problem can lead to over-protection that unnecessarily increases reliability cost. Similarly, underestimating the register reliability problem may result in under-protection, which makes the processors unreliable. In this paper, we study the register file susceptibility to soft errors by defining a new metric, the register vulnerability factor (RVF). The RVF characterizes the probability that register transient errors can escape the register file and thus potentially affect system reliability.
Based on the register access patterns and the assumption that soft errors are uniformly distributed, we develop an approach to compute the RVF quantitatively, which can be used to estimate the reliability requirement of register files accurately and so avoid over-protection or under-protection. Built upon the RVF concept, we propose two compiler-guided techniques to increase register reliability: instruction re-scheduling, and reliability-oriented register assignment with a partially ECC-protected register file. Our experiments indicate that, on average, hyperblock-based instruction re-scheduling can reduce the RVF to 9.1% and the reliability-oriented register assignment with partial ECC protection can reduce the RVF to less than 10%. Moreover, we propose a hybrid approach that integrates these two techniques to reduce the RVF further. Our experimental results show that the hybrid approach can reduce the average RVF to 6.1% with only four out of 64 registers covered by ECC, leading to a substantial improvement of register reliability against soft errors without significant impact on cost or performance. The remainder of this paper is organized as follows. Section II introduces the concept of the register vulnerability factor. Section III presents two compiler-guided techniques to improve register file reliability against transient errors by reducing the register vulnerability factor. Section IV explains the evaluation methodology. The experimental results are given in Section V.
Section VI discusses related work. Section VII concludes the paper.
II. REGISTER VULNERABILITY FACTOR
Register files are more resilient to transient errors than are conventional memory cells. However, as technology scales, the charge-retaining capabilities of CMOS devices decrease, and more clock edges can occur during a given period. Because the window of vulnerability of a flip-flop lies around its clock edges, it becomes more susceptible to soft errors at increased frequencies [7]. While it is important to protect the register file against soft errors early in the design cycle, one should be cautious not to overestimate this problem, which can lead to expensive and excessive protection. Design based on the raw SER of latches will overestimate the register reliability problem, since not all soft errors occurring in the register file can lead to visible system faults. For instance, a soft error that occurs between two write operations to the same register is automatically corrected by the latter write operation. Therefore, designers must accurately measure the probability that register soft errors can affect other system components and thus lead to erroneous final output. Mukherjee et al. [1] proposed the concept of the architectural vulnerability factor (AVF). The AVF is defined as the probability that a fault in a processor structure will lead to a visible error in the final program output. In general, the AVF provides designers with an accurate estimate of the soft error rate for various hardware components to make cost/reliability trade-offs. While the concept of AVF can also be applied to the register file, it fails to exploit the fact that soft errors in the register file can be automatically overwritten by new values written to the register file. If a value with soft errors is overwritten before it is read, it will have no impact on the system output.
To measure register file susceptibility to soft errors accurately and quantitatively, we define the RVF to be the probability that a soft error in registers can be propagated to other system components (i.e., functional units, memory). The RVF concentrates on the probability of soft error propagation to other hardware
elements, in contrast to the AVF concept [1] that focuses on the effect of soft error propagation. Even if a soft error occurring in the register file is consumed by an instruction, it may still not affect the final output, since this instruction may be mis-speculated. Such effects are captured by the AVF [1]. Thus, this paper focuses on examining the RVF. The RVF and the AVF can be combined to select the most cost-effective techniques to increase the register file reliability against soft errors.
Since processors employ only a limited number of architectural registers while programs typically use a large number of values, multiple values can be stored in the same register as long as their lifetimes do not overlap. In general, a value is first written to a register, then it is read one or more times, and finally another value is written to the same register, which ends the lifetime of the old value and begins the lifetime of the new value. As depicted in Fig. 1, we can divide the accesses to register files into four different patterns (or intervals), namely, the write-read (W-R), read-read (R-R), read-write (R-W) and write-write (W-W) patterns (note that the read/write mentioned in this paper refers to the corresponding operations on register values, including but not limited to the load/store instructions, which operate on the data from the memory hierarchy). Among these four patterns, the register file is only susceptible to soft errors during the W-R and R-R intervals. In contrast, soft errors occurring during the R-W and W-W intervals are overwritten by the latter write operations, and hence will not affect other system components. It is widely accepted that fault-inducing particle strikes are randomly and uniformly distributed [1]. Therefore, the probability that a soft error in registers can be propagated to other system components can be computed as the average fraction of time during which the register values are exposed to the susceptible intervals (i.e., W-R and R-R), as described in Equation (1). In this Equation,
The RVF indicates the probability that register soft errors can spread to other hardware elements and thus affect the system output. The higher the RVF, the lower the register file reliability, and hence more expensive techniques are needed to fight soft errors. Measuring the RVF is not only useful to understand the reliability requirement of register files more accurately to avoid both over-protection or under-protection,
Based on the concept of RVF, we define a new metric called the register file reliability factor (RFRF), which is the product of 1) the RVF, 2) the raw SER per latch, and 3) the number of latches per register file. As shown in Equation (2), $\mathrm{RFRF} = \mathrm{RVF} \times SER_{latch} \times N$, where $N$ denotes the number of latches per register file and $SER_{latch}$ represents the raw soft error rate per latch, which varies with technology. Therefore, we can estimate the reliability of the register file against transient errors more accurately by incorporating the RVF.
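The following is a minimal sketch of how the RVF and the RFRF could be computed from a register access trace. It assumes that the RVF is the vulnerable time (W-R plus R-R intervals) summed over all registers and divided by the number of registers times the total execution cycles, and that time before the first access and after the last access of a register is not vulnerable. The trace format, the function names and the SER value are hypothetical.

```python
def vulnerable_cycles(accesses):
    """Sum the W-R and R-R interval lengths for one register.

    accesses: list of (cycle, kind) tuples sorted by cycle, kind is "R" or "W".
    An interval is vulnerable exactly when it ends with a read.
    """
    total = 0
    for (t0, _), (t1, kind1) in zip(accesses, accesses[1:]):
        if kind1 == "R":          # W-R or R-R: the value is consumed later
            total += t1 - t0
        # R-W and W-W intervals are overwritten and contribute nothing
    return total


def rvf(trace, num_registers, total_cycles):
    """Fraction of register-cycles spent in vulnerable intervals (assumed normalization)."""
    exposed = sum(vulnerable_cycles(a) for a in trace.values())
    return exposed / (num_registers * total_cycles)


def rfrf(rvf_value, ser_latch, num_latches):
    """Register file reliability factor: RVF x raw SER per latch x number of latches."""
    return rvf_value * ser_latch * num_latches


trace = {
    "r1": [(0, "W"), (2, "R"), (5, "R"), (9, "W")],  # W-R 2, R-R 3, R-W 4
    "r2": [(1, "W"), (4, "W"), (6, "R")],            # W-W 3, W-R 2
}
example_rvf = rvf(trace, num_registers=2, total_cycles=10)
# hypothetical raw SER of 0.001 FIT per latch and 2048 latches
example_rfrf = rfrf(example_rvf, ser_latch=0.001, num_latches=2048)
print(example_rvf, example_rfrf)
```

In this toy trace, 7 of the 20 register-cycles fall in W-R or R-R intervals, so the effective error rate of the register file is scaled down to 35% of the raw rate.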
III. TECHNIQUES TO REDUCE REGISTER VULNERABILITY FACTOR
There are a number of research efforts in the literature on improving the reliability of various system components of processors. These include techniques to address soft errors for main memory [8, 9], cache [10, 11], and the datapath [4, 12, 13]. However, very little work has been done to explore the impact of soft errors on register files. Memik et al. [14] proposed a scheme to replicate register values into the physical registers to increase the register file reliability. However, such a technique cannot be applied to processors without physical registers, such as very long instruction word (VLIW) architectures, which are increasingly used in embedded systems. This paper, in comparison, proposes two compiler-guided techniques to improve the register file immunity to soft errors that can be applied to a wide variety of embedded processors. Based on the RVF concept, the first technique aims to enhance register file reliability by re-scheduling the register read/write operations to reduce the RVF value without impacting performance. The second technique assumes that a fraction of the register file employs the ECC code, and modifies the register allocator to protect the registers that are most susceptible to soft errors based on the RVF profiling results. Built upon these two techniques, we propose a hybrid scheme that can reduce the RVF further to improve the register file immunity to transient errors.
A. Re-schedule Instructions to Reduce RVF
RVF can be reduced by delaying the write operations as late as possible and scheduling the read operations as early as possible, since registers are only susceptible to transient errors during the W-R and R-R intervals. Thus, the W-R and R-R intervals are shortened, while the R-W interval is lengthened, both of which can lead to a smaller RVF value and hence higher register file
reliability. The movement of the register read or write operations, however, is subject to the data dependences between different operations. We propose to re-schedule the read/write operations by exploiting the scheduling slack, so as not to affect performance. Fig. 2 sketches the instruction re-scheduling algorithm.
This algorithm takes a region of code to schedule (
The potential gain is the difference in RVF between the original schedule and the schedule after exploiting the slack, which is performed in the function
The complexity of this algorithm is $O(n^2)$, where $n$ is the number of instructions in the region. This complexity is of the same order as that of some other widely used optimization phases of compilation, such as instruction scheduling.
Therefore, the latency of the instruction re-scheduling can be tolerated by the compiler to generate better code by considering both register reliability and performance.
Fig. 3 shows an example of instruction re-scheduling, where instruction I3 is dependent on I1 and I2, and I5 is dependent on I3 and I4. Given sufficient resources, I1, I2 and I4 can be scheduled at the first cycle, I3 can be scheduled at the second cycle and I5 is scheduled at the third cycle, as shown in Fig. 3b.
As can be seen, I4 has one cycle of slack, since it can be scheduled in the second cycle without increasing the critical path delay. Since I4 writes to register R6 and I5 reads register R6, we can re-schedule I4 to be executed in the second clock cycle, as shown in Fig. 3c. While R6 is susceptible to soft errors during two clock cycles in schedule (b), its susceptible interval is reduced to one clock cycle in schedule (c). Consequently, the RVF of R6 is reduced. By exploiting the scheduling slack to move register write operations as late as possible and register read operations as early as possible, the RVF can potentially be lowered without compromising performance. The advantage of this approach is that it is purely software-based, so it can increase the register file reliability with no additional hardware cost. However, the effectiveness of the approach depends on the flexibility to move the read/write operations in the scheduled code regions, which is constrained by data dependences and the critical path latency. To enhance the compiler’s capability to reorder instructions, we also make use of the superblock scheduling [15] and hyperblock scheduling [16] algorithms to form larger blocks, in which the compiler has more flexibility to re-arrange and optimize the register access patterns to minimize the RVF value.
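The slack computation behind this example can be sketched as follows. This is only an illustration under simplifying assumptions (unit-latency instructions, unlimited resources, instructions listed in topological order), not the algorithm of Fig. 2; the function names are hypothetical.

```python
def asap(deps, order):
    """Earliest cycle (1-based) of each instruction, given its predecessors."""
    start = {}
    for ins in order:  # order must be topological
        start[ins] = 1 + max((start[p] for p in deps.get(ins, [])), default=0)
    return start


def alap(deps, order, length):
    """Latest cycle of each instruction that keeps the schedule within `length` cycles."""
    users = {ins: [] for ins in order}
    for ins, preds in deps.items():
        for p in preds:
            users[p].append(ins)
    latest = {}
    for ins in reversed(order):
        # an instruction must finish one cycle before its earliest consumer
        latest[ins] = min((latest[u] for u in users[ins]), default=length + 1) - 1
    return latest


# Dependences of the example: I3 needs I1 and I2; I5 needs I3 and I4
deps = {"I3": ["I1", "I2"], "I5": ["I3", "I4"]}
order = ["I1", "I2", "I3", "I4", "I5"]

early = asap(deps, order)
length = max(early.values())          # critical path length in cycles
late = alap(deps, order, length)
slack = {i: late[i] - early[i] for i in order}

# W-R exposure of R6: cycles between the write by I4 and the read by I5
exposure_b = early["I5"] - early["I4"]   # I4 scheduled as early as possible
exposure_c = early["I5"] - late["I4"]    # I4 delayed by its slack
print(slack, exposure_b, exposure_c)
```

Only I4 has non-zero slack here, and delaying it halves the exposure of R6. The sketch ignores the fact that delaying an instruction also delays the reads of its own source registers, which is why the potential gain of each move has to be evaluated before it is applied.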
B. Reliability-oriented Register Assignment with Partial ECC Protection
In contrast to the first technique, which is purely software-based, the second scheme assumes that a certain number of registers employ the ECC code, which can detect double-bit errors and correct single-bit errors. The ECC code is sufficient to protect the register file against soft errors in most cases, since most soft errors are one-bit errors. Therefore, we assume a single-bit soft error model in this paper. We assume that only a small fraction of the register file is covered by ECC, because ECC is costly, especially for embedded processors. To minimize the RVF of a partially ECC-protected register file, we propose to modify the conventional register allocation algorithm so that it distinguishes the registers with ECC from the normal registers without ECC. We develop a profiling-based approach to direct the register allocation. Specifically, based on the RVF profiling for each register, the compiler selects the registers with the highest RVF values. If these registers are not protected by ECC, the compiler then re-assigns the registers, so that the registers with ECC always have the highest RVF values. Since the most susceptible register values are now covered by ECC (i.e., the registers with ECC are not susceptible to soft errors during any access intervals), the overall reliability of the register file can be improved substantially.
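The profiling-directed assignment can be illustrated with a small sketch. It assumes that profiling yields the number of vulnerable cycles per register and that, under the single-bit error model, values held in ECC registers contribute no exposure. The function names and the profile numbers are hypothetical.

```python
def select_for_ecc(exposure, num_ecc):
    """Pick the registers with the highest profiled exposure for the ECC registers.

    exposure: dict mapping register name -> vulnerable cycles from profiling
    num_ecc:  number of ECC-protected registers available
    """
    ranked = sorted(exposure, key=exposure.get, reverse=True)
    return set(ranked[:num_ecc])


def residual_rvf(exposure, protected, total_cycles):
    """RVF after the values in `protected` have been moved to ECC registers."""
    unprotected = sum(v for r, v in exposure.items() if r not in protected)
    return unprotected / (len(exposure) * total_cycles)


# hypothetical profile: vulnerable cycles per register over a 100-cycle run
profile = {"r0": 80, "r1": 5, "r2": 40, "r3": 15}
total_cycles = 100

base = residual_rvf(profile, set(), total_cycles)
protected = select_for_ecc(profile, num_ecc=1)
improved = residual_rvf(profile, protected, total_cycles)
print(base, protected, improved)
```

Because exposure is typically concentrated in a few registers, protecting only the top-ranked ones removes a disproportionate share of the vulnerable time; in this toy profile, one ECC register out of four lowers the RVF from 0.35 to 0.15.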
Based on these two techniques, we propose a hybrid scheme that combines the re-scheduling and the reliability-oriented register assignment. In the hybrid scheme, the compiler first performs the instruction re-scheduling to minimize the RVF based on hyperblocks and then re-allocates registers based on the profiling information and the number of registers covered by ECC. Compared with the pure software-based approach, such a hybrid scheme can improve the reliability further by exploiting the small number of registers that are protected by ECC. Likewise, the cost of the partially ECC-protected register file can be reduced by first applying the software-based instruction re-scheduling to lower the RVF value as much as possible.
We evaluate the register file reliability in a VLIW processor, since the VLIW architecture is increasingly used in embedded computing. We implement the proposed RVF-based techniques in the Trimaran framework [17], which consists of both an advanced compiler and a VLIW simulator. A program flows through the frontend compiler IMPACT, the backend compiler Elcor, and the cycle-level VLIW processor simulator. IMPACT applies optimization level 4 (O4), which includes machine-independent classical optimizations and transformations, to the source program, whereas Elcor is responsible for machine-dependent optimizations, including instruction scheduling and register allocation. The VLIW configuration used in our experiments has four IALUs (integer ALUs), two FPALUs (floating-point ALUs), one LD/ST (load/store) unit and one branch unit. The register file consists of 64 general-purpose registers. Table 1 shows the default cache parameters. We assume each instruction word contains eight operations in the simulated VLIW processor. The basic block-scheduling algorithm is used as the default algorithm. We select ten benchmarks from Mediabench [18] for the evaluation.
[Table 1.] Default parameters used in our simulations
A. Register Vulnerability Factor Results
Fig. 4 shows the RVF for different benchmarks. As can be seen, except for mpeg2enc, the RVF values of all other benchmarks are less than 20% and some RVF values are even less than 5%.
Such low RVF values indicate that the majority of soft errors occurring in the register file are automatically overwritten by the write operations, and hence have no impact on other system components or the system output. Thus, the reliability cost can potentially be reduced by choosing less expensive (and often less powerful) techniques to protect the register file, while still meeting the pre-defined reliability goal. These results also show that the register vulnerability factor is dependent on the application behaviour. Different applications access the register file in different patterns, leading to varied RVF values. Therefore, for embedded processors, which typically run a fixed set of applications, one can evaluate the register access patterns early in the design cycle to derive the RVF value, based on which the most cost-effective technique can be selected to protect the register file against soft errors.
B. Effect of Instruction Re-scheduling
Since a small RVF value implies high reliability, Table 2 lists the RVF values obtained by re-scheduling the register write operations as late as possible and the register read operations as early as possible, based on the scheduling slack. The second column in Table 2 gives the RVF values of the original schedule, which uses the list-scheduling algorithm [19]. The RVF values after instruction re-scheduling decrease for all benchmarks compared to the base scheme. These results clearly indicate that the compiler can optimize the register access patterns to improve the register file immunity to soft errors. Nevertheless, we also find that the amount of RVF reduction is small, since the instruction reordering is limited to small basic blocks.
Fig. 5 shows the RVF values of instruction re-scheduling based on superblocks [15] and hyperblocks [16]. Since the superblocks and hyperblocks are much larger than the basic blocks, the compiler has more flexibility to move instructions without increasing the critical path delay. Thus, we observe that the RVF values of some benchmarks are reduced substantially. For instance, the RVF of djpeg decreases from 20% to 3.5% and 3.3%, respectively, for the superblock-based and hyperblock-based instruction re-scheduling approaches. On average, the superblock-based and hyperblock-based instruction re-scheduling achieve RVF values as low as 10.9% and 9.1%, respectively, which translates into improved register file reliability and reduced reliability cost. We also find that for some benchmarks the RVF values become larger, because superblock scheduling and hyperblock scheduling also change the total execution cycles compared to basic block scheduling.
Register vulnerability factor values of instruction re-scheduling compared to the base scheme.
C. Effect of Reliability-oriented Register Assignment
Commercial microprocessors, such as the IBM G5 [4], have employed ECC to protect the register file against soft errors. Although it is too costly to add ECC to each register for embedded processors, it is attractive to employ ECC to protect a limited number of registers that store the most critical data, since reliability is also critical to many embedded applications and not all registers are accessed uniformly. Table 3 lists the RVF values of the reliability-oriented register assignment when varying the number of registers protected by ECC. The profiling-based register assignment is effective in reducing RVF values. On average, the RVF is reduced to 9.5% with only four out of 64 registers protected by ECC. The RVF value can be lowered further with more registers covered by ECC. For instance, with eight and sixteen registers protected by ECC, the average RVF value is reduced to 6.5% and 3.2%, respectively. The cost also increases as more registers are covered by ECC. Consequently, designers need to trade off cost and reliability to meet design goals.
To evaluate the effectiveness of the proposed reliability-oriented register assignment scheme, we also experiment with reducing the total number of general-purpose registers, so that each register is likely to be accessed more frequently. Tables 4 and 5 give the RVF values with 0, 2, 4, 8 and 16 registers covered by ECC for register files with 32 and 16 registers, respectively. The base RVF value is increased, since each register is accessed more frequently when there are fewer registers. The RVF values can still be reduced effectively by allocating the reliable registers with ECC to cover the most susceptible intervals. For instance, with four out of 32 registers protected by ECC, the average RVF value is as low as 10%. However, protecting four registers with ECC for a register file with 16 registers can only reduce the RVF to 13.7% on average. Nevertheless, the average RVF for a small register file is reduced more significantly by protecting more registers with ECC. For instance, with eight out of 64 registers covered by ECC, the average RVF is 6.5%, while with eight out of 32 or 16 registers protected by ECC, the average RVF is 5.8% and 5.9%, respectively. This is because, for a smaller register file, a larger portion of the registers is covered by ECC, leading to higher reliability. If all 16 registers are covered by ECC, the RVF value becomes zero under the single-bit error model, indicating high register reliability. Therefore, the reliability-oriented register assignment is quite effective in improving register file immunity to soft errors for varying numbers of registers.
Register vulnerability factor values of register assignment with 0, 2, 4, 8 and 16 registers protected by ECC. There are 64 registers.
Register vulnerability factor values of register assignment with 0, 2, 4, 8 and 16 registers protected by ECC. There are 32 registers.
Register vulnerability factor values of register assignment with 0, 2, 4, 8 and 16 registers protected by ECC. There are 16 registers.
D. Effect of the Hybrid Scheme
Tables 6-8 list the RVF values of the hybrid scheme for a register file with 64, 32, or 16 registers, respectively. The hybrid scheme is more effective at reducing the RVF for different benchmarks than either the re-scheduling or the register re-assignment approach alone. The RVF value is as low as 6.1%, on average, with only four out of 64 registers protected by ECC. Protecting four registers with ECC reduces the RVF to 5.0% and 4.7% for a register file with 32 or 16 registers, respectively, indicating a great improvement in register file immunity against soft errors.
Register vulnerability factor values of the hybrid scheme with 0, 2, 4, 8 and 16 registers protected by ECC. There are 64 registers.
Register vulnerability factor values of the hybrid scheme with 0, 2, 4, 8 and 16 registers protected by ECC. There are 32 registers.
Register vulnerability factor values of the hybrid scheme with 0, 2, 4, 8 and 16 registers protected by ECC. There are 16 registers.
Transient errors caused by external particle strikes have traditionally been a concern for systems that operate in highly noisy environments. With the scaling of technology, they have increasingly become a challenge for microprocessors ranging from high-end servers to embedded processors used for reliability-critical applications. Kim and Somani [20] conducted fault injection experiments on the RTL model of picoJava-II to understand the microprocessor vulnerability to soft errors. They found large variations across different hardware blocks. Wang et al. [21] studied the soft error sensitivity of a modern microprocessor, similar to the Alpha 21264, through fault injection on an RTL model. They reported that less than 15% of single-bit errors in the processor state result in software-visible errors. Mukherjee et al. [1] proposed an approach to measure AVF based on a performance model. They reported that the AVFs of the instruction queue and execution units are 28% and 9%, respectively, for an Itanium2-like IA64 processor.
Biswas et al. [22] extended the lifetime analysis technique to examine the architectural vulnerability factors for address-based structures. All these prior research efforts have motivated us to study the sensitivity of register files to transient errors more accurately. In contrast to previous work, this paper focuses on studying the reliability of register files, which are not address-based but can significantly affect overall system reliability if unprotected. We develop a method to compute the probability of register soft error propagation accurately by exploiting the fact that register soft errors can be overwritten by register write operations, which is not captured by previous models. In this paper, we also propose several novel techniques to improve register file reliability without significant hardware cost or performance degradation.
A number of techniques in the literature improve hardware reliability against transient errors. However, most of the research effort focuses on protecting main memory [8, 9], cache [8, 9, 23], datapath [4, 12, 13, 24], or multicore [25]. Currently, parity and ECC are the most widely-used mechanisms to protect the storage units, but come at the cost of area, energy and design time. Particularly, if the ECC computation is on the critical path, it may affect performance. Rajaram et al. analyzed the soft error rate for a variety of flip-flops with regard to register reliability against transient errors [7].
Memik et al. [14] proposed a scheme to replicate register values in the physical registers to increase register file reliability. While this approach can utilize the available physical registers to enhance reliability, it can only be used for superscalar processors, where additional physical registers are employed to support dynamic register renaming. In contrast, VLIW processors rely on compilers to manage the registers and typically do not have additional physical registers. Therefore, the approach proposed in [14] cannot be applied to VLIW-like processors. This paper, in comparison, proposes two compiler-guided techniques to improve the register file immunity to soft errors that can be applied to a wide variety of processors. Lee and Shrivastava [26] proposed a compiler-microarchitecture hybrid approach to enhance the energy efficiency of soft error protection for register files. In contrast, this paper focuses on improving the reliability of register files.
As technology scales, lower supply voltage, higher density and higher frequency will make microprocessors more vulnerable to transient errors. To enhance processor reliability cost-effectively, the first step is to understand the vulnerability of different hardware components to soft errors accurately. This is of particular importance for embedded systems with cost constraints. While existing work mainly focuses on examining the impact of soft errors on main memory [8, 9], cache [10, 11] or datapath [4, 12, 13], this paper explores the register file reliability against soft errors, since registers are susceptible to transient errors and are accessed very frequently. In this paper, we propose the concept of RVF to characterize the probability that register soft errors can be propagated to other system components and thus affect the final output. We also propose an approach to compute the RVF based on register access patterns. The reliability of register files can therefore be estimated using both the RVF and the raw soft error rate of latches.
We develop two compiler-guided techniques built upon the concept of RVF to improve register file reliability by decreasing the RVF value, because a smaller RVF value indicates that the register file is less susceptible to soft errors and thus is more reliable. The first technique is a pure software-based approach that exploits the scheduling slack to move the register write operations as late as possible and the register read operations as early as possible, without increasing critical path latency. Our experiments demonstrate that the instruction re-scheduling based on hyperblocks [16] can reduce the RVF to 9.1% on average.
The second technique targets register files that are partially protected by ECC. The proposed reliability-oriented register assignment improves register file immunity to soft errors by protecting the most susceptible intervals, based on the profiling information. We also propose a hybrid scheme built upon both these techniques to further reduce RVF. Experiments show the hybrid scheme can reduce average RVF to 6.1% with only four out of 64 registers covered by ECC. Thus, register file reliability is improved substantially without significantly influencing cost. Moreover, all the techniques proposed in this paper can enhance register file immunity to soft errors without compromising performance.