Exploiting Standard Deviation of CPI to Evaluate Architectural Time Predictability
DOI: 10.5626/JCSE.2014.8.1.34
Authors: Zhang Wei, Ding Yiqiang
Published: Journal of Computing Science and Engineering, Volume 8, Issue 1, pp. 34-42, 30 March 2014

ABSTRACT
Time predictability of computing is critical for hard real-time and safety-critical systems. However, there is currently no metric available to quantitatively evaluate time predictability, a capability crucial to the design of time-predictable processors. This paper first proposes the concept of architectural time predictability, which separates the time variation due to hardware architectural/microarchitectural design from that due to software. We then propose a new metric, the standard deviation of clock cycles per instruction (CPI), to measure architectural time predictability. Our experiments confirm that the standard deviation of CPI is an effective metric for evaluating and comparing the architectural time predictability of different processors.

KEYWORD
Performance, Reliability, Hard real-time systems, WCET analysis, Time predictability

I. INTRODUCTION
Processor architectural design has traditionally focused on improving average-case performance or energy efficiency. However, for real-time systems, time predictability is the most important design consideration. Unfortunately, many performance-enhancing architectural features of today's microprocessors, such as caches and branch prediction, are detrimental to time predictability [1, 2]. As a result, estimating the worst-case execution time (WCET) on modern microprocessors, which is crucial for hard real-time and safety-critical systems, is very difficult, if not impossible [1, 2]. On the other hand, a time-predictable processor with low performance may be ineffective or even useless. To achieve both time predictability and performance, researchers have proposed time-predictable designs for microprocessors [2] with the goal of achieving time predictability (or WCET analyzability) while minimizing the impact on average-case performance.
In the last two decades, researchers have proposed several time-predictable processor designs for real-time systems. Colnaric and Halang [3] proposed a simple asymmetrical multiprocessor architecture without dynamic architectural features, such as pipelines and caches, for hard real-time applications. Delvai et al. [4] designed a scalable processor for embedded real-time applications that employed a simple 3-stage pipeline without cache memory, and Edwards and Lee [5] proposed a precision timed (PRET) machine. Yamasaki et al. [6] studied instructions-per-cycle (IPC) control and prioritization in multithreaded processors. Paolieri et al. [7] examined time-predictable multicore architectures to support WCET analyzability. Finally, Schoeberl [8] proposed a time-predictable Java processor.
However, to the best of our knowledge, none of the prior work has quantitatively evaluated time predictability. While (average-case) performance can be easily assessed, most prior work on time-predictable design either simply reported the worst-case performance through measurement or analysis, or merely explained qualitatively that the design was time-predictable because undesirable architectural features had been removed. This is because there is no well-defined metric to evaluate the time predictability of microprocessors, and this has fundamentally limited the advance of time-predictable architectural design. The lack of a metric not only prevents designers from quantitatively comparing different time-predictable designs in order to select the better one, but also makes it impossible to quantitatively analyze the trade-off between time predictability and performance, since these two goals often conflict with each other.
To address this problem, this paper proposes using the standard deviation of cycles per instruction (CPI) to measure the time predictability of different architectures. Such a metric can quantitatively indicate the variation in execution time across different architectural/microarchitectural designs, and thus provide useful insights for engineers making design trade-offs between performance and time predictability.
II. ARCHITECTURAL TIME PREDICTABILITY
The execution time of a real-time task (or, more generally, a program) can be calculated by using Eq. (1):

T = N × CPI × τ    (1)

where T is the execution time, N is the number of instructions, CPI represents the clock cycles per instruction, and τ stands for the clock cycle time. The computer architecture community has used this equation to evaluate the average-case performance of microprocessors for decades, and we can also use it to explain the variation of execution time. In this equation, τ is fixed for each instruction and is a predictable quantity once the processor is designed and implemented. In contrast, both N and CPI can vary significantly, making it hard to achieve time predictability. Specifically, the number of instructions (N) is affected by both the software and the instruction set architecture (ISA). Given an ISA, the number of instructions is mainly determined by the software and its inputs, i.e., by which paths are executed at runtime. The CPI, on the other hand, is affected by the architectural/microarchitectural design. In a pipelined processor, different types of instructions usually take different numbers of clock cycles, and even instructions of the same type may take different numbers of clock cycles. For example, the latency of a load instruction depends on whether or not it hits in the cache.

Since our goal is to design time-predictable processors (to support running real-time software), we should separate the time variation caused by software from that caused by hardware. For example, the inputs that are given, or the program paths that are executed, should not be a concern for hardware design. Prior work [2] seems to treat the time predictability of software and hardware as a single problem, making it overly complicated, if not impossible, to define a useful metric that specifically guides hardware design for time predictability. In this paper, we propose a concept called architectural time predictability (ATP), defined below.
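As a minimal numeric sketch of Eq. (1) (the instruction count, cycle time, and CPI values below are hypothetical, not taken from the paper), fixing N and τ makes CPI the only hardware-induced source of execution-time variation:

```python
# Execution time model T = N * CPI * tau (Eq. (1)).
N = 1_000_000          # instructions executed (determined by software/ISA)
tau = 1e-9             # clock cycle time in seconds (fixed by the design)

# Hypothetical average CPIs for the same instruction stream on two runs,
# e.g., one run with mostly cache hits, one with many cache misses:
cpi_run1 = 1.2
cpi_run2 = 3.5

t1 = N * cpi_run1 * tau
t2 = N * cpi_run2 * tau
# With N and tau held fixed, all variation in T comes from CPI,
# which is why this paper measures the variation of CPI itself.
print(t1, t2)
```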
Definition 1 (Architectural time predictability). Given a number of instructions, architectural time predictability indicates the degree to which the architecture under study can provide predictable execution time.

In the above definition, the number of instructions of the ISA is assumed to be known, which we believe is a reasonable assumption for separating the time predictability of software from that of hardware. Once the ISA is defined, the number of instructions executed is mostly determined by software characteristics, including the algorithm design, inputs, and compilation. Moreover, the number of instructions can be obtained accurately. For hard real-time systems, the worst-case number of instructions executed is needed to compute the WCET. It can be calculated by state-of-the-art timing analysis techniques, such as the implicit path enumeration technique (IPET) [9], although the worst-case inputs may not be known without exhaustive measurements, which are prohibitive.
III. STANDARD DEVIATION OF CPI
We propose to use the standard deviation of CPI to quantitatively evaluate architectural time predictability. The average CPI has long been used by the computer architecture community as a key measure of microarchitectural effectiveness. However, as an aggregate metric, it cannot indicate the variation of clock cycles across different instructions. As mentioned above, different instructions usually vary in the number of clock cycles they take on modern processors, which is the fundamental reason why it is so difficult to predict microarchitectural timing during WCET analysis [1]. Therefore, we use the standard deviation of CPI, which clearly indicates the amount of variation in microarchitectural timing, to measure architectural time predictability. In the ideal scenario, the architecture is 100% time-predictable (i.e., the standard deviation of CPI is 0), and it becomes trivial to compute the WCET. When we apply Eq. (1) to this ideal case, both
CPI and τ are known, and the worst-case number of instructions is independent of the hardware design and can be calculated by today's timing analysis techniques, such as IPET [9].

In modern processors, different types of instructions usually have different latencies. For example, an integer addition instruction is usually faster than a load instruction. Thus the standard deviation of CPI should be computed separately for each type of instruction, rather than over all instructions together. For instance, in a time-predictable processor, it is desirable for all loads to take the same number of clock cycles. However, it would be unwise, and unnecessary, to require all loads to take the same number of clock cycles as additions and subtractions. In our evaluation, we therefore classify the standard deviation for three types of instructions: loads, stores, and other regular instructions.
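The per-type metric can be sketched as follows; the trace below is a hypothetical list of (instruction type, cycles) pairs for illustration only, not data from the paper's experiments:

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical per-instruction cycle trace: (instruction type, cycles taken).
trace = [
    ("load", 2), ("load", 2), ("load", 40),   # one load misses the cache
    ("store", 3), ("store", 3),
    ("other", 1), ("other", 1), ("other", 1),
]

# Group cycle counts by instruction type, then take the (population)
# standard deviation of CPI within each group, as Section III proposes.
by_type = defaultdict(list)
for itype, cycles in trace:
    by_type[itype].append(cycles)

std_cpi = {itype: pstdev(c) for itype, c in by_type.items()}
# A perfectly time-predictable instruction type shows a deviation of 0.
print(std_cpi)
```

Here stores and other regular instructions are perfectly predictable (deviation 0), while the single cache-missing load makes the load type's deviation large.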
IV. EVALUATION METHODOLOGY
We simulate a superscalar processor with different architectural features, using SimpleScalar, a cycle-accurate simulator [10], to quantitatively evaluate architectural time predictability based on the standard deviation of CPI. We select six real-time benchmarks (Table 1) from the Mälardalen WCET benchmark suite [11] for our experiments. Two of them are single-path; the other four have multiple paths, and each of them is executed with various inputs. We select three representative inputs to represent the best-case, normal-case, and worst-case behavior observed (though these may not be the theoretical best/worst cases).
The basic configuration of the simulated processor is listed in Table 2, including the parameters of the pipelines, the caches, and the memory.
> A. Architectures Evaluated
Caches and pipelines are two important architectural features that boost performance and are widely employed in modern processors. Due to space limitations, our experiments in this paper focus on evaluating the architectural time predictability of caches and pipelines. More specifically, we quantitatively compare the architectural time predictability of four architectures with the features given in Table 2.
With caches, with pipelines: a processor with an L1 instruction cache, an L1 data cache, an L2 cache, and pipelines.
With caches, without pipelines: a processor with an L1 instruction cache, an L1 data cache, and an L2 cache, but no pipeline.
Without caches, with pipelines: a processor with no caches but with pipelines.
Without caches, without pipelines: a processor with neither caches nor a pipeline.
V. EXPERIMENTAL RESULTS
> A. Variation of CPI
Our first experiment quantifies the variation of CPI for different benchmarks on a regular processor with caches and pipelines. Fig. 1 shows the variation of CPI across all instructions of the benchmarks insertsort and qsortexam, both with the worst-case inputs. In the rest of the paper, all experimental results are based on the worst-case inputs, whose timing behavior is particularly important for hard real-time systems. The black line in each figure shows the mean CPI over all instructions of the benchmark. As we can see, for both benchmarks the CPIs vary significantly, indicating poor time predictability of the default architecture (i.e., with caches and pipelines).

> B. Time-Predictability Results
Our second experiment uses the standard deviation of CPI to quantitatively study the effect of caches on the architectural time predictability of processors with pipelines; the results are given in Table 3. As mentioned above, we classify the CPI by instruction type: loads, stores, and other regular instructions. "All" in Table 3 represents all instructions, regardless of type. As Table 3 shows, the average standard deviations of CPI for load instructions, store instructions, and other regular instructions are 41.16, 16.58, and 29.79, respectively, with caches, and decrease to 0, 0.02, and 0.4, respectively, without caches. This quantitatively demonstrates the time unpredictability caused by cache memories, since the latency of a memory access depends on whether it hits and in which level of the cache hierarchy. By disabling all caches, load instructions become perfectly predictable (i.e., the standard deviation of CPI is 0), as there are no longer cache hits or misses. Store instructions and other regular instructions also become much more time-predictable without caches. However, their standard deviation of CPI is not exactly 0. This is due to the time variation caused by the pipelines, including delays due to control and data hazards as well as various queuing delays. For example, an instruction may need to stay in the instruction window longer because it has to wait for a preceding instruction to commit.
It should be noted that the standard deviation of CPI over all instructions does not reveal architectural time predictability accurately, as mentioned earlier. As can be seen in Table 3, the average standard deviation of CPI over all instructions with caches is actually smaller than that without caches. This is because the majority of memory accesses with caches result in cache hits, making the latencies of most loads/stores close to those of other regular instructions and thus lowering the overall standard deviation. Without caches, by contrast, loads/stores are guaranteed to miss and hence have much longer latencies than other regular instructions, resulting in a higher overall standard deviation.
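This effect can be reproduced with a small numeric sketch (the cycle counts below are hypothetical, chosen only to mimic the pattern described above): with caches, most loads hit and look like regular instructions, so the all-instruction deviation stays low; without caches, every load takes a uniform long latency, so the per-load deviation drops to zero while the all-instruction deviation grows.

```python
from statistics import pstdev

# Hypothetical cycle counts: 4 loads mixed with 12 regular instructions.
regular = [1] * 12

with_caches = [2, 2, 2, 40]      # three cache hits, one expensive miss
no_caches   = [40, 40, 40, 40]   # every load misses: uniform long latency

# Per-type deviation: loads become perfectly predictable without caches...
print(pstdev(with_caches), pstdev(no_caches))   # second value is 0

# ...yet the aggregate over ALL instructions moves the other way, because
# the load/regular latency gap widens the overall spread without caches.
all_with = pstdev(regular + with_caches)
all_without = pstdev(regular + no_caches)
print(all_with < all_without)
```

This is exactly why the aggregate "All" column is misleading and the metric must be taken per instruction type.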
Table 4 compares the mean and the standard deviation of CPI for non-pipelined processors with and without caches. A similar trend can be observed: disabling caches leads to better time predictability. Moreover, we observe that the standard deviations of CPI of both stores and other regular instructions become 0 without pipelines, indicating that disabling pipelines can further improve time predictability.
Although both caches and pipelines are detrimental to time predictability, thanks to the quantitative nature of our evaluation we can observe that caches compromise architectural time predictability much more than pipelines do. The standard deviation of CPI for each instruction type differs far more between the architectures with and without caches than between the architectures with and without pipelines (Tables 3 and 4). This may also quantitatively explain why the timing analysis of caches is generally more complex and harder to perform than that of pipelines [1].
> C. Performance Results
Although architectural time predictability is degraded by caches and pipelines, average-case performance improves with both. As shown in Table 5, every benchmark performs best on the architecture with both caches and pipelines and worst on the architecture with neither. The former is high-performance but not time-predictable, whereas the latter is time-predictable but not high-performance. Among the four architectures, the one with pipelines but without caches seems to provide high time predictability with adequate performance, better than the one with neither caches nor pipelines. By comparison, the architecture with caches but without pipelines can generally provide good performance, but its time predictability is inadequate. Architectural innovations are therefore needed to retain high-performance features that do not harm time predictability and to design new features that are both time-predictable and performance-enhancing.
VI. RELATED WORK
Recently, the real-time and embedded computing community has shown interest in defining a metric of time predictability, given the importance of studying it quantitatively. Thiele and Wilhelm [2] defined time predictability in terms of pessimistic WCET and best-case execution time (BCET) analyses, and Grund [12] defined time predictability as the relation between BCET and WCET and argued that it should be an inherent system property. Grund et al. [13] then proposed a template for predictability definitions and refined the definition into state-induced time predictability (SIP) and input-induced time predictability. Kirner and Puschner [14] formalized a universal definition of time predictability by combining the WCET analyzability and stability of the system. However, in all the above work, except for that of Grund [12] and Grund et al. [13], the calculation of time predictability still depends on the computation of the WCET. Since WCET estimation is usually pessimistic and there is no standard way to compute it (though different methods to derive the WCET, such as abstract interpretation and static cache simulation, do exist), any time-predictability metric relying on WCET analysis is likely to be inaccurate and would be difficult to standardize in practice.
Moreover, in all the above work, again except for that of Grund [12] and Grund et al. [13], the definition of time predictability does not separate the time variation caused by software from that caused by hardware, making it overly complicated to derive a time-predictability metric that can effectively guide architectural design. While Grund [12] and Grund et al. [13] proposed SIP to separate the timing uncertainty of hardware from that of software, the metric they proposed to evaluate SIP requires exhaustively finding the maximum and minimum execution times over all different states, which may not be computationally feasible; moreover, no quantitative results are given in their studies. In contrast, this paper proposes a metric that efficiently assesses architectural time predictability, together with a quantitative evaluation on statically scheduled processors with different architectural features.
Recently, Ding and Zhang [15] proposed a quantitative architectural time-predictability factor (ATF). Compared to ATF, which can only quantify the ATP of the whole processor, the standard deviation of CPI can provide the ATP of individual microarchitectural components, such as caches and pipelines. By using the standard deviation of CPI for different types of instructions, we can quantitatively understand the impact of each microarchitectural component on ATP, a capability that ATF does not offer. The standard deviation of CPI can therefore provide useful insights for engineers designing and implementing new architectures with better time predictability.
VII. CONCLUSION
In this paper, we present the concept of architectural time predictability, which separates the time variation due to hardware design from that due to software design. We then propose a new metric, the standard deviation of clock cycles per instruction (CPI), to evaluate architectural time predictability. Our preliminary evaluation demonstrates that the proposed metric can quantitatively measure the time predictability of processors with different architectural features, such as caches and pipelines.
The standard deviation of CPI can be used to quantitatively guide the design of time-predictable processors. In future work, we plan to use it to evaluate the time predictability of other architectural features, such as branch prediction and multicore designs. Moreover, we intend to use this new metric to quantitatively trade off time predictability against performance, energy, and cost for real-time systems.

[Table 1.] Real-time benchmarks used in our experiments

[Table 2.] Basic configuration of the simulated processor

[Fig. 1.] The variation of cycles per instruction (CPI) of all instructions for the benchmarks (a) insertsort and (b) qsortexam, both with the worst-case input.

[Table 3.] The mean and standard deviation of cycles per instruction for all benchmarks on architectures with pipelines, with and without caches

[Table 4.] The mean and standard deviation of cycles per instruction for all benchmarks on architectures without pipelines, with and without caches

[Table 5.] The total execution cycles of all the benchmarks on all four architectures studied