With Moore's law supplying billions of transistors on-chip, embedded systems are undergoing a paradigm shift from single core to multicore to exploit this high transistor density for high performance. This paradigm shift has led to the emergence of diverse multicore embedded systems in a plethora of application domains (e.g., high-performance computing, dependable computing, mobile computing). Many modern embedded systems integrate multiple cores (whether homogeneous or heterogeneous) on-chip to satisfy computing demand while maintaining design constraints (e.g., energy, power, performance). For example, a 3G mobile handset's signal processing requires 35–40 giga operations per second (GOPS). Considering the limited energy of a mobile handset battery, these performance levels must be met with a power dissipation budget of approximately 1 W, which translates to a performance efficiency of 25 mW/GOPS or 25 pJ/operation for the 3G receiver [143]. These demanding and competing power-performance requirements make modern embedded system design challenging.
Increasing customer expectations/demands for embedded system functionality has led to an exponential increase in design complexity. While industry focuses on increasing the number of on-chip processor cores to meet customer performance demands, embedded system designers face the new challenge of optimal layout of these processor cores along with the memory subsystem (caches and main memory (MM)) to satisfy power, area, and stringent real-time constraints. The short time-to-market (time from product conception to market release) of embedded systems further exacerbates design challenges. Architectural modeling of embedded systems helps in reducing the time-to-market by enabling fast application-to-device mapping since identifying an appropriate architecture for a set of target applications significantly reduces the design time of an embedded system. To ensure timely completion of an embedded system's design with sufficient confidence in theproduct's market release, design engineers must make tradeoffs between the abstraction level of the system's architecture model and the attainable accuracy.
Modern multicore embedded systems allow processor cores to share hardware structures, such as last-level caches (LLCs) (e.g., level two (L2) or level three (L3) cache), memory controllers, and interconnection networks [144]. Since the LLC's configuration (e.g., size, line size, associativity) and the layout of the processor cores (on-chip location) have a significant impact on a multicore embedded system's performance and energy, our work focuses on the performance and energy characterization of embedded architectures based on different LLC configurations and layout of the processor cores. Although there is a general consensus on using private level one (L1) instruction (L1-I) and data (L1-D) caches in embedded systems, there has been no dominant architectural paradigm for private or shared LLCs. Since many embedded systems contain an L2 cache as the LLC, we focus on the L2 cache; however, our study can easily be extended for L3 caches and beyond as LLCs.
Since multicore benchmark simulation requires significant simulation time and resources, a lightweight modeling technique for multicore architecture evaluation is crucial [145]. Furthermore, simulation-driven architectural evaluation is based on specific benchmarks and consequently only provides performance information for programs similar to the benchmarks. A well-devised modeling technique can model diverse workloads, thus enabling performance evaluation for workloads with any computing requirements. Previous work presents various multicore system models; however, these models become increasingly complex with varying degrees of cache sharing [146]. Many of the previous models assumed that sharing among processor cores occurred at either the main memory level or the processor cores, all shared the same cache hierarchy; however, multicore embedded systems can have an L2 cache shared by a subset of cores (e.g., Intel's six-core Dunnington processor has L2 caches shared by two processor cores). We leverage for the first time, to the best of our knowledge, the queueing network theory as an alternative approach for modeling multicore embedded systems for performance analysis (though queueing network models have been studied in the context of traditional computer systems [135]). Our queueing network model approach allows modeling the layout of processor cores (processor cores can be either homogeneous or heterogeneous) with caches of different capacities and configurations at different cache levels. Our modeling technique only requires a high-level workload characterization of an application (i.e., whether the application is processor-bound (requiring high processing resources), memory-bound (requiring a large number of memory accesses), or mixed).
Our main contributions in this chapter are as follows:
We point out that although queueing theory has been used in the literature for performance analysis of multidisk and pipelined systems [135, 147, 148], we for the first time, to the best of our knowledge, apply queueing theory-based modeling and performance analysis techniques to multicore embedded systems. Furthermore, we for the first time develop a methodology to synthesize workloads/benchmarks on our queueing theoretic multicore models based on probabilities that are assigned according to workload characteristics (e.g., processor-bound, memory-bound, or mixed) and cache miss rates. We verify our queueing theoretic modeling approach by running SPLASH-2 multithreaded benchmarks on the SuperESCalar (SESC) simulator. Results reveal that our queueing theoretic model qualitatively evaluates multicore architectures accurately with an average difference of 5.6% as compared to the architectures' evaluations from the SESC simulator. The SESC simulation results validate our queueing theoretic modeling approach as a quick and inexpensive architectural evaluation method.
Our queueing theoretic approach can be leveraged for early design space pruning by eliminating infeasible architectures in very early design stages, which reduces the number of lengthy architectural evaluations when running targeted benchmarks in later design stages. Specifically, our approach focuses on the qualitativecomparison of architectures in the early design stage and not the quantitative comparison of architectures for different benchmarks. Our model is designed to operate using synthetic workloads that a designer can categorize for an expected behavior, such as process- or memory-bound workloads, along with an estimate of the expected cache miss rates. The synthetic workloads preclude the need to obtain benchmark-specific statistics from an architecture-level simulator. Furthermore, the cache miss rates are estimates, and thus are not required to be the exact miss rates for any specific benchmark. Our discussion in Section 5.3.2 regarding statistics obtained from an architecture-level simulator only explains how a real workload can be represented with our queueing theoretic model and is not required for synthetic workloads.
Our investigation of performance and energy for different cache miss rates and workloads is significant because cache miss rates and workloads can significantly impact the performance and energy of an embedded architecture. Furthermore, cache miss rates also give an indication of the degree of cache contention between different threads' working sets. Our performance, power, and performance per watt results indicate that multicore embedded architectures that leverage shared LLCs are scalable and provide the best LLC performance per watt. However, shared LLC architectures may introduce main memory response time and throughput bottlenecks for high cache miss rates. The architectures that leverage a hybrid of private and shared LLCs are scalable and alleviate main memory bottlenecks at the expense of reduced performance per watt. The architectures with private LLCs exhibit less scalability but do not introduce main memory bottlenecks at the expense of reduced performance per watt.
The remainder of this chapter is organized as follows. Section 5.1 gives a review of related work. Section 5.2 illustrates queueing network modeling of multicore embedded architecture. Validation of queueing network models is presented in Section 5.3. Section 5.4 provides insights obtained from our queueing theoretic models regarding performance, performance per watt, and performance per unit area for different multicore embedded architectures. Finally, Section 5.5 concludes the chapter.
Several previous works investigated analytical modeling techniques for performance evaluation. Sorin et al. [149] developed an analytical model for evaluating shared memory systems with processors that aggressively exploited instruction-level parallelism (ILP). The application's ILP and interaction with the memory subsystem were characterized by the model's input parameters. Ïpek et al. [150] used predictive modeling based onartificial neural networks to efficiently explore architectural design spaces. Simulation of sampled points in the design space helped the model to generate functions that described the relationship between design parameters, which were then used to estimate performance for other points in the design space. Chandra et al. [151] investigated performance models that predicted the impact of cache sharing on coscheduled threads for chip multiprocessors (CMPs). The authors observed that cache contention could significantly increase a thread's cache miss rate depending on the coscheduled threads' memory requirements. Chen and Aamodt [152] proposed an analytical model for predicting the impact of cache contention on cache miss rates for multiprogrammed workloads running on multithreaded manycore architectures.
Queueing theory has been used in the literature for the performance analysis of computer networks and other computer systems. Samari and Schneider [153] used queueing theory for the analysis of distributed computer networks. The authors proposed a correction factor in the analytical model and compared these results with the analytical model without this correction factor [154] and with simulation results to verify the correctness of the proposed model. Mainkar and Trivedi [155] used queueing theory-based models for performance evaluation of computer systems with a central processing unit (CPU) and disk drives. Willick and Eager [156] used queueing theory to model packet-switched multistage interconnection networks. The analytical model permitted general interconnection network topologies (including arbitrary switch sizes) and arbitrary memory reference patterns. Our work differs from the previous work on queueing theory-based models for computer systems in that our work applies queueing theory for performance evaluation of multicore embedded systems with different cache subsystems, which have not been investigated using queueing theory. Furthermore, our work introduces a novel way of representing workloads with different computing requirements probabilistically in a queueing theory-based model.
Characterization of workloads for analytical models has been explored in the literature. Nussbaum and Smith [157] developed an analytical performance model for superscalar processors based on statistical simulation. The model depended on detailed simulation to gather statistical information that was used to generate a synthetic instruction trace that was then fed as input to the model along with cache miss rate and branch prediction statistics. Karkhanis and Smith [158] developed an analytical performance model for superscalar processors. The model used trace-derived data dependence information, data and instruction cache miss rates, and branch miss-prediction rates as input. Wunderlich et al. [159] proposed a framework for statistical sampling that enabled sampling of a minimal subset of a benchmark's instruction execution stream to estimate the performance of the complete benchmark. Our work differs from previous work on workload characterization for analytical models in that we characterize workloads probabilistically, which provides an alternative to statistical sampling.
Previous work presents evaluation and modeling techniques for multicore embedded architectures for different applications with varying workload characteristics. Savage and Zubair [146] proposed a unified memory hierarchy model for multicore architectures that captured varying degrees of cache sharing at different cache levels. However, the model was only applicable to straight-line computations that could be represented by directed acyclic graphs (DAGs) (e.g., matrix multiplication, fast Fourier transform (FFT)). Our queueing theoretic models are virtually applicable to any type of workload with any computing requirements. Fedorova et al. [144] studied contention-aware task scheduling for multicore architectures with shared resources (caches, memory controllers, and interconnection networks). The authors modeled the contention-aware task scheduler and investigated the scheduler's impact on the application performance for multicore architectures. In contrast to these prior works, our queueing theoretic models permit a wide range of scheduling disciplines based on workload requirements (e.g., first-come-first-served (FCFS), priority, round robin (RR)).
Some previous works investigated performance and energy aspects for multicore systems. Kumar et al. [160] studied power, throughput, and response time metrics for heterogeneous CMPs. The authors observed that heterogeneous CMPs could improve energy per instruction by 4–6 and throughput by 63% over an equivalent-area homogeneous CMP because of closer adaptation to the resource requirements of different application phases. The authors used a multicore simulator for performance analysis; however, our queueing theoretic models can be used as a quick and inexpensive alternative for investigating performance aspects of heterogeneous CMPs. Sabry et al. [161] investigated performance, energy, and area tradeoffs for private and shared L2 caches for multicore embedded systems. The authors proposed a SystemC-based platform that could model private, shared, and hybrid L2 cache architectures. A hybrid L2 cache architecture contains several private L2 caches, each containing only private data, and a unified shared L2 cache that stores only shared data. However, the SystemC-based model required integration with other simulators, such as MPARM, to obtain performance results. Our queueing theoretic models do not require integration with other multicore simulators to obtain performance results and serve as an independent comparative performance analysis approach for multicore architecture evaluation.
Benítez et al. [162] proposed an adaptive L2 cache architecture that adapted to fit the code and data during runtime using partial cache array shutdown. The adaptive cache could be configured in four modes that prioritized instructions per cycle (IPC), processor power dissipation, processor energy consumption, or processor product. Experiments revealed that CMPs with 2 MB of private adaptive L2 cache provided 14.2%, 44.3%, 18.1%, and 29.4% improvements in IPC, power dissipation, energy consumption, and , respectively, over a 4 MB shared L2 cache. Our work does not consider adaptive private caches but compares private, shared, and hybrid caches on an equal area basis to provide a fair comparison between different LLCs.
There exists work in the literature related to memory subsystem latency and throughput analysis. Ruggiero [163] investigated cache latency (at all cache levels), memory latency, cache bandwidth/throughput (at all cache levels), and memory bandwidth for multicore embedded systems using LMBench, an open-source benchmark suite, on Intel processors. Our models enable measurement of cache latency and throughput of modeled architectures in an early design phase when fabricated architectures are not available.
Although there exist previous works on performance evaluation, our work is novel because we for the first time develop queueing network models for various multicore embedded architectures. Some previous works present benchmark-driven evaluation for specific embedded architectures; however, multicore embedded architecture evaluation considering different workload characteristics, cache configurations (private, shared, or hybrid), and miss rates with comparative analysis has not been addressed.
Although queueing networks are widely used in computer networks, the interpretation of queueing network terminology for multicore embedded architectures is a prerequisite for queueing theory-based modeling. Furthermore, because of the diversity of queueing network models, the specific approach taken to model multicore embedded architectures is the most critical aspect of queueing theory-based modeling. Finally, the queueing theory-based model relies on some assumptions under which the model is a valid representation of the multicore embedded architectures. This section discusses the queueing network terminology in the context of multicore embedded architectures, our modeling approach, and the underlying simplifying assumptions.
A queueing network consists of service centers (e.g., processor core, L1-I cache, L1-D cache, L2 cache, and main memory (MM)) and customers (e.g., jobs/tasks). A service center consists of one or more queues to hold jobs waiting for service. We use the term jobs instead of tasks (decomposed workload resulting from parallelizing a job) to be consistent with general queueing network terminology. Our modeling approach is broadly applicable to multiprogrammed workloads where multiple jobs run on the multicore embedded architecture as well as for parallelized applications/jobs that run different tasks on the multicore architectures. Arriving jobs enter the service center's queue, and a scheduling/queueingdiscipline (e.g., FCFS, priority, round-robin (RR), processor sharing (PS)) selects the next job to be served when a service center becomes idle. The queueing discipline is preemptive if an arriving higher priority job can suspend the service/execution of a lower priority job, otherwise the queueing discipline is nonpreemptive. FCFS is a nonpreemptive queueing discipline that serves the waiting jobs in the order in which the jobs enter the queue. Priority-based queueing disciplines can be preemptive or nonpreemptive and serve the jobs based on an assigned job priority. In the RR queueing discipline, a job receives a service time quantum (slot). If the job does not complete during the service time quantum, the job is placed at the end of the queue to resume during a subsequent service time quantum. In the PS queueing discipline, all jobs at a service center are serviced simultaneously (and hence there is no queue) with the service center's speed equally divided across all of the jobs. After being serviced, a job either moves to another service center or leaves the network.
A queueing network is open if jobs arrive from an external source, spend time in the network, and then depart. A queueing network is closed if there is no external source and no departures (i.e., a fixed number of jobs circulate indefinitely among the service centers). A queueing network is a single-chain queueing network if all jobs possess the same characteristics (e.g., arrival rates, required service rates, and routing probabilities for various service centers) and are serviced by the same service centers in the same order. If different jobs can belong to different chains, the network is a multichain queueing network. An important class of queueing networks is product-form where the joint probability of the queue sizes in the network is a product of the probabilities for the individual service center's queue sizes.
The queueing network performance metrics include response time, throughput, and utilization. The response time is the amount of time a job spends at the service center including the queueing delay (the amount of time a job waits in the queue) and the service time. The service time of a job depends on the amount of work (e.g., number of instructions) needed by that job. The throughput is defined as the number of jobs served per unit of time. In our multicore embedded architecture context, throughput measures the number of instructions/data (bits) processed by the architectural element (processor, cache, MM) per second. Utilization measures the fraction of time that a service center (processor, cache, MM) is busy. Little's law governs the relationship between the number of jobs in the queueing network and response time (i.e., where denotes the average arrival rate of jobs admitted to the queueing network [164]).
We consider the closed product-form queueing network for modeling multicore embedded architectures because the closed product-form queueing network enables unequivocal modeling of workloads. A typical embedded system executes a fixed number of jobs (e.g., a mobile phone has only a few applications to run, such as instant messaging, audio coding/decoding, calculator, and graphics interface). We point out that additional applications can be added/updated in an embedded system (e.g., a smartphone) over time; however, these additional applications can be represented as synthetic workloads in our queueing theoretic model. Furthermore, closed product-form queueing networks assume that a job leaving the network is replaced instantaneously by a statistically identical new job [135]. Table 5.1 describes the multicore embedded architectures that we evaluate in this chapter. We focus on embedded architectures ranging from 2 (2P) to 4 (4P) processor cores to reflect current architectures [165]; however, our model is applicable to any number of cores. Our modeled embedded architectures contain processor cores, L1-I and L1-D private caches, L2 caches (private or shared), and MM (embedded systems are typically equipped with DRAM/NAND/NOR Flash memory [166, 167]).
Table 5.1 Multicore embedded architectures with varying processor cores and cache configurations
Architecture | Description |
2P-2L1ID-2L2-1M | Multicore embedded architecture with two processor cores, |
private L1-I/D caches, private L2 caches, and a shared M | |
2P-2L1ID-1L2-1M | Multicore embedded architecture with two processor cores, |
private L1-I/D caches, a shared L2 cache, and a shared M | |
4P-4L1ID-4L2-1M | Multicore embedded architecture with four processor cores, |
private L1-I/D caches, private L2 caches, and a shared M | |
4P-4L1ID-1L2-1M | Multicore embedded architecture with four processor cores, |
private L1-I/D caches, a shared L2 cache, and a shared M | |
4P-4L1ID-2L2-1M | Multicore embedded architecture with four processor cores, |
private L1-I/D caches, 2 shared L2 caches, and a shared M |
(P denotes the processor core, M the main memory, and the integer constants in front of P, L1ID (L1 instruction and data cache), L2, and M denotes the number of these architectural components in the embedded architecture)
We consider a closed product-form queueing network with service centers where each service center has a service rate . Let be the probability of a job leaving service center and entering another service center . The relative visit count to service center is
The performance metrics (e.g., throughput, response time) for a closed product-form queueing network can be calculated using a mean value analysis (MVA) iterative algorithm [168]. The basis of MVA is a theorem stating that when a job arrives at a service center in a closed network with jobs, the distribution of the number of jobs already queued is the same as the steady-state distribution of jobs in the queue [169]. Solving Eq. (5.1) using MVA recursively gives the following performance metric values: the mean response time at service center ( denotes the number of jobs in the network), the mean network response time , the mean queueing network throughput , the mean throughput of jobs at service center , and the mean queue length at service center when there are jobs in the network. The initial recursive conditions are such that . The values for these performance metrics can be calculated for jobs based on the computed values for jobs as [135]:
To explain our modeling approach for multicore embedded architectures, we describe a sample queueing model for the 2P-2L1ID-2L2-1M architecture in detail (other architecture models follow a similar explanation). Figure 5.1 depicts the queueing network model for 2P-2L1ID-2L2-1M. The task scheduler schedules the tasks/jobs on the two processor cores and . We assume that the task scheduler is contention-aware and schedules tasks with minimal or no contention on cores sharing LLCs [144]. The queueing network consists of two chains: chain 1 corresponds to processor core and chain 2 corresponds to processor core . The jobs serviced by either reenter with probability ( in the subscript denotes chain 1) or enter the L1-I cache with probability or the L1-D cache with probability . The job arrival probabilities into the service centers (processor core, L1-I, L1-D, L2, or MM) depend on the workload characteristics (i.e., processor-bound, memory-bound, or mixed). The data from the L1-I cache and the L1-D cache returns to with probabilities and , respectively, after L1-I and L1-D cache hits. The requests from the L1-I cache and the L1-D cache are directed to the L2 cache with probabilities and , respectively, after L1-I and L1-D cache misses. The probability of requests entering or the L2 cache from the L1-I and L1-D caches depends on the miss rates of the L1-I and L1-D caches. After an L2 cache hit, the requested data is transferred to with probability or enters MM with probability after an L2 cache miss. The requests from MM always return to with probability . The queueing network chain and the path for chain 2 corresponding to follow the same pattern as chain 1 corresponding to . For example, requests from the L2 cache in chain 2 either return to with probability after an L2 cache hit or enter MM with probability after an L2 cache miss ( in the subscript denotes chain 2).
To further elaborate on our modeling approach, Fig. 5.2 depicts the queueing model for 2P-2L1ID-1L2-1M, which is similar to the model for 2P-2L1ID-2L2-1M (Fig. 5.1), except that this queueing model contains a shared L2 cache (L2s denotes the shared L2 cache in Fig. 5.2) for the two processor cores and instead of private L2 caches. The queueing network consists of two chains: chain 1 corresponds to processor core and chain 2 corresponds to processor core . The requests from the L1-I cache and the L1-D cache for chain 1 go to the shared L2 cache (L2s) with probability and , respectively, on L1-I and L1-D cache misses. The requested data is transferred from the L2 cache to with probability on an L2 cache hit whereas the data request goes to MM with probability on an L2 cache miss. The requests from the L1-I cache and the L1-D cache for chain 2 go to the shared L2 cache (L2s) with probability and , respectively, on L1-I and L1-D cache misses. The requested data from the L2 cache is transferred to with probability on an L2 cache hit, whereas the data request goes to MM with probability on an L2 cache miss.
The assignment of the probabilities in our queueing network models to represent a synthetic workload (or to emulate a real workload) on a multicore architecture is critical to our modeling approach and the fidelity of our evaluations. Our models can be leveraged for studying workloads based on an overall workload behavior, where processor-to-processor probability and processor-to-memory probability remain uniform throughout the workload, or for a more detailed study of workloads with different phases, where a different and would be assigned for each phase. These probabilities can be determined based on processor and memory statistics for a given workload using one of the two methods: actual statistics for real workloads can be gathered using a functional simulator if these statistics are not available from prior research; or synthetic workloads can be leveraged wherein the designer would assign these statistics based on the expected workload behavior. For our models, can be estimated as
where and denote the number of processor operations and total operations (processor and memory) in a benchmark, respectively. Processor operations refer to arithmetic and logic unit (ALU) micro-operations, such as add and subtract, and memory operations refer to micro-operations involving memory accesses, such as load and store. For floating point benchmarks, can be assumed to be equal to the number of floating point operations. can be estimated as
where denotes the number of memory operations, which is the sum of the total read and total write operations. can be estimated as
The probability of requests going from the processor core in chain to the L1-I and L1-D caches can be estimated from the L1-I and L1-D cache access ratios (defined later), which requires the number of L1-I and L1-D accesses. The number of L1-I accesses can be calculated as (assuming there is no self-modifying code):
The number of L1-D accesses can be calculated as
Total number of L1 cache accesses can be calculated as
The L1-I access ratio can be calculated as
The L1-D access ratio can be calculated as
The probability of requests going from the processor core in chain to the L1-I cache can be calculated as
Similarly, the probability of requests going from the processor core in chain to the L1-D cache can be calculated as
The probabilities of requests going from the L1-I and L1-D caches to the L2 cache and from the L2 cache to MM can be calculated directly from the cache miss rate statistics and are omitted for brevity.
We exemplify the assignment of probabilities for our queueing network models using the 2P-2L1ID-2L2-1M multicore architecture for memory-bound workloads given the following statistics: ; assuming that the L1-I, L1-D, and L2 cache miss rates are 25%, 50%, and 30%, respectively; and L1-I and L1-D access ratios are 0.8 and 0.2, respectively. The probabilities are set as: , , , , , , , , , (different probabilities can be assigned for processor-bound or mixed workloads).
Our queueing theoretic models determine performance metrics of component-level architectural elements (e.g., processor cores, L1-I, L1-D); however, embedded system designers are often also interested in system-wide performance metrics. For example, system-wide response time of an architecture is an important metric for real-time embedded applications. Our queueing theoretic models enable calculations of system-wide performance metrics. Based on our queueing theoretic models, we can calculate the system-wide response time of a multicore embedded architecture as
where denotes the total number of processor cores in the multicore embedded architecture. , , , , and denote the response times for processor core , L1-I, L1-D, and L2 corresponding to chain and MM, respectively. denotes the probability of requests looping back from processor core to processor core in the queueing network chain (the total number of chains in the queueing network is equal to ). and denote the probability of requests going from processor core in chain to the L1-I cache and the L1-D cache, respectively. and denote the probability of requests going from the L1-I cache and the L1-D cache in chain to the L2 cache, respectively. denotes the probability of requests going from the L2 cache in chain to MM. We point out that Eq. (17) is an extension of Eq. (5.3) for a multichain queueing network and also simplifies the calculation by using probabilities between architectural elements directly instead of relative visit counts as in Eq. (5.3). Since processor cores in a multicore embedded architecture operate in parallel, the effective response time of a multicore architecture is the maximum response time out of all of the chains in the queueing network operating in parallel, as given in Eq. (17). In the context of queueing networks for parallel computing, the overall response time is determined by the slowest chain since the other chains must wait idle for the slowest chain to complete (load balancing deals with making the processor cores' response times close to each other to minimize the idle waiting). The response time of a single chain in the queueing network is the sum of the processing times of the architectural elements in a chain (e.g., processor core, L1-I, L1-D, L2, and MM) multiplied by the associated service probabilities of these architectural elements. System-wide throughput can be given similar to Eq. (17).
Our queueing network modeling provides a faster alternative for performance evaluation of multicore architectures as compared to running complete benchmarks on multicore simulators (and/or trace simulators) though at the expense of accuracy. Our queueing network models require simulating only a subset of the benchmark's instructions (specified implicitly by the service rates of the architectural components, such as processor cores and caches) that are necessary to reach a steady state/equilibrium in the queueing network with the workload behavioral characteristics captured by the processor-to-processor and processor-to-memoryprobabilities (as shown in Fig. 5.1). Since the service centers (processors, caches, memory) have been modeled explicitly in our queueing theoretic models, the effect of these service centers on the overall system performance is captured by our model [156].
Our queueing theoretic models make some simplifying assumptions, which do not affect the general applicability of our approach. Our queueing network models assume cycle-level assignments of tokens (service time slices) for a given workload/job such that in each cycle, the tokens receive service from a particular service center with a given probability. For example, a job leaving the processor core either returns to the processor core's queue to wait for another time slice or goes to either the L1-I or L1-D cache for an instruction or data fetch, respectively [135]. Completed jobs are replaced immediately by a statistically identical job, an assumption for closed product-form queueing networks, which holds true for embedded systems [135]. Our queueing network modeling approach can be extended to instruction-level tokens if desired (although not required) for a given workload where an instruction can go to multiple service centers simultaneously (e.g., an instruction going to the L1-I and L1-D caches simultaneously to obtain the instruction opcodes and data, respectively). For these cases, a compound service center can be added to our queueing network model (e.g., L1-I + L1-D) that represents multiple service centers. The probability of a job going to a compound service center would be the sum of the probabilities of the individual service centers represented by that compound service center.
Our models are well suited for multiprogrammed workloads that are common in embedded systems; however, our approach is also applicable to multithreaded workloads with some simplifying assumptions. We assume that sizes of the LLCs are large enough to hold the working sets of the executing threads, which alleviate performance problems, such as suboptimal throughput arising due to cache thrashing and thread starvation [151]. We assume that appropriate inter-thread cache partitioning schemes alleviate performance issues due to cache sharing and is not the focus of our model. Furthermore, since cache contention impacts cache miss rates, our model allows specification of appropriate cache miss rates for investigating the impact of workloads with cache contention. For example, workloads that are likely to cause cache contention can be assigned higher cache miss rates in our models to incorporate the cache contention effects. Additionally, shared LLCs and MM in our queueing network model account for the contention of resources between processor cores and/or executing threads [135].
Although our current models do not explicitly model critical sections and coherence in multithreaded programs as stated in our modelingassumptions, our models can capture critical sections in a workload. We point out that a critical section is a piece of code that accesses a shared resource (e.g., a data structure) that must not be concurrently accessed by more than one thread of execution. Since critical sections are effectively serialized, the response time of the workload containing critical sections will increase depending on the number of critical sections and the number of instructions in each critical section. Hence, additional time for executing critical sections can be calculated by the number of critical sections and the number of instructions in each critical section and then added to the response time of the workload. The coherence is implicitly modeled in our queueing theory models since coherence issues will cause coherence misses and these misses can be incorporated into our cache miss rate specifications for a workload as our models enable specification of any cache miss rates for our synthesized workloads.
We note that even though some of these assumptions may violate practical scenarios, such violations would not significantly impact the insights obtained from our queueing theoretic models because our models measure performance trends and focus on the relative performance of architectures for different benchmarks rather than the absolute performance.
In this section, we validate our queueing network models to assure that our models' results conform with expected queueing theoretic results. We further validate our queueing network models with a multicore simulator running multithreaded benchmarks. We also calculate the speedups attained by our queueing network models as compared to a multicore simulator for architectural performance evaluation.
We analyzed our queueing network models for different cache miss rates and workloads and find that the model's simulation results conform with expected queueing theoretical results. For example, Fig. 5.3 depicts the response time for mixed workloads (, ) for 2P-2L1ID-1L2-1M as the number of jobs/tasks varies. The figure shows that as increases, the response time for the processor core, L1-I, L1-D, L2, and MM increases for all of the cache miss rates. We point out that cache miss rates could increase as increases due to inter-task address conflicts and increasing cache pressure (increased number of working sets in the cache), but we assume that the cache sizes are sufficiently large enough so that capacity misses remain the same for the considered number of jobs. We present the average response time individually for the processor cores and for the L1-I, L1-D, andL2 caches. For smaller L1-I, L1-D, and L2 cache miss rates, the response time of the processor core increases drastically as increases because most of the time jobs are serviced by the processor core, whereas for larger L1-I, L1-D, and L2 cache miss rates, the response time of the MM increases drastically because of a large number of MM accesses. These results along with our other observed results conform with the expected queueing theoretical results and validate our queueing network models for multicore architectures.
We further validate our queueing theoretic approach for modeling multicore architectures using multithreaded benchmarks executing on a multicore simulator. We choose kernels/applications from the SPLASH-2 benchmark suite, which represent a range of computations in the scientific, engineering, and graphics domains. Our selected kernels/applications from the SPLASH-2 benchmark suite include FFT, LU decomposition, Radix, Raytrace, and Water-spatial [170]. We briefly describe our selected kernels/applications from the SPLASH-2 benchmark suite:
Fast Fourier transform: The FFT kernel is a complex 1D algorithm for FFT calculation, which is optimized to minimize interprocess communication. The kernel's time and memory requirement growth rates are and , respectively.
LU decomposition: The LU kernel factors a dense matrix into the product of a lower triangular and an upper triangular matrix using a blocking algorithm that exploits temporal locality on submatrix elements. We obtain results using the noncontiguous version of LU in the SPLASH-2 suite as the noncontiguous version exhibits better performance on CMPs [171], which is the focus of our study. The kernel's time and memory requirement growth rates are and , respectively.
Radix: The integer Radix sort kernel implements an iterative noncomparison-based sorting algorithm for integers. The kernel's time and memory requirement growth rates are both .
Raytrace: The Raytrace application renders a 3D scene using ray tracing. The kernel's time and memory requirement growth rates for this application is unpredictable [145].
Water-spatial: This application evaluates forces that occur over time in a system of water molecules. The application's time and memory requirement growth rates are both .
Tables 5.2 and 5.3 depict the queueing network model probabilities for SPLASH-2 benchmarks for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively (Section 5.2.2). These probabilities are obtained using Eqs. (5.7)–(16) where the statistics required by these equations are acquired from an architecture-level simulator (SESC in this case) for an accurate representation of the benchmarks in our queueing network model. From our probabilistic characterization of the workloads, FFT, LU, Radix, and Water-spatial can be classified as processor-bound workloads and Raytrace can be classified as a mixed workload. We reemphasize that this statistics gathering is only required for accurate representation of real benchmarks and is not required for synthetic workloads.
Table 5.2 Queueing network model probabilities for SPLASH-2 benchmarks for 2P-2L1ID-2L2-1M
Benchmark | Chain | |||||||||
FFT | 0.68 | 0.26 | 0.06 | 0.99 | 0.947 | 0.00962 | 0.053 | 0.97 | 0.0258 | |
0.7 | 0.249 | 0.051 | 0.9996 | 0.837 | 0.000414 | 0.163 | 0.956 | 0.044 | ||
LU | 0.68 | 0.2598 | 0.0602 | 0.99979 | 0.852 | 0.00021 | 0.148 | 0.9858 | 0.0142 | |
0.76 | 0.2016 | 0.0384 | 0.9999 | 0.87 | 0.0000314 | 0.13 | 0.988 | 0.012 | ||
Radix | 0.781 | 0.184 | 0.035 | 0.9999 | 0.932 | 0.000051 | 0.068 | 0.894 | 0.106 | |
0.78 | 0.1848 | 0.0352 | 0.99998 | 0.932 | 0.0000151 | 0.068 | 0.894 | 0.106 | ||
Raytrace | 0.58 | 0.3158 | 0.1042 | 0.9788 | 0.9382 | 0.0212 | 0.0618 | 0.961 | 0.039 | |
0.584 | 0.312 | 0.104 | 0.9862 | 0.9146 | 0.0138 | 0.0854 | 0.9463 | 0.0537 | ||
Water-spatial | 0.68 | 0.2544 | 0.0656 | 0.99717 | 0.982 | 0.00283 | 0.018 | 0.9956 | 0.00442 | |
0.68 | 0.2547 | 0.0653 | 0.9991 | 0.9814 | 0.000882 | 0.0186 | 0.9969 | 0.00307 |
Table 5.3 Queueing network model probabilities for SPLASH-2 benchmarks for 2P-2L1ID-1L2-1M
Benchmark | Chain | |||||||||
FFT | 0.68 | 0.26 | 0.06 | 0.9905 | 0.947 | 0.0095 | 0.053 | 0.982 | 0.018 | |
0.7 | 0.249 | 0.051 | 0.9996 | 0.84 | 0.0004 | 0.16 | 0.982 | 0.018 | ||
LU | 0.68 | 0.2624 | 0.0576 | 0.9998 | 0.9655 | 0.000166 | 0.0345 | 0.9987 | 0.0013 | |
0.76 | 0.202 | 0.036 | 0.9999 | 0.86 | 0.000011 | 0.14 | 0.9987 | 0.0013 | ||
Radix | 0.781 | 0.184 | 0.035 | 0.99996 | 0.9331 | 0.000036 | 0.0669 | 0.888 | 0.112 | |
0.78 | 0.1848 | 0.0352 | 0.999998 | 0.9335 | 0.0000022 | 0.0665 | 0.888 | 0.112 | ||
Raytrace | 0.582 | 0.314 | 0.1037 | 0.9791 | 0.752 | 0.0209 | 0.248 | 0.9881 | 0.0119 | |
0.584 | 0.312 | 0.104 | 0.9867 | 0.9285 | 0.0133 | 0.0715 | 0.9881 | 0.0119 | ||
Water-spatial | 0.679 | 0.256 | 0.0655 | 0.9972 | 0.9821 | 0.00283 | 0.0179 | 0.99819 | 0.00181 | |
0.679 | 0.2558 | 0.0652 | 0.99915 | 0.9814 | 0.000854 | 0.0186 | 0.99819 | 0.00181 |
We simulate the architectures in Table 5.1 using SESC [172]. To accurately capture our modeled architectures with SESC, our queueing theoretic models use the same processor and cache parameters (e.g., processor operating frequency, cache sizes, and associativity) for the architectures as specified in the SESC configuration files. We consider single-issue processors with five pipeline stages and a 45 nm process technology. The execution times for the benchmarks on SESC are calculated from the number of cycles required to execute those benchmarks (i.e., execution time = (number of cycles) (cycle time)). For example, FFT requires 964,057 cycles to execute on 4P-4L1ID-4L2-1M at 16.8 MHz (59.524 ns), which gives an execution time of 57.38 ms.
To verify that the insights obtained from our queueing theoretic models regarding architectural evaluation are the same as those obtained from a multicore simulator, we compare the execution time results for the FFT, LU (noncontiguous), Raytrace, Radix (an example for memory-bound workloads), and Water-spatial (an example for mixed workloads) benchmarks on SESC and our queueing theoretic models (the benchmarks are represented probabilistically for our queueing theoretic models (as shown in Tables 5.2 and 5.3 for the dual-core architectures). System-wide response time is obtained from our queueing theoretic models for these benchmarks using Eq. (17). We compare the execution time trends for the FFT, LU, Raytrace, Water-spatial, and Radix benchmarks on SESC and the modeled benchmarks for our queueing theoretic models using the SHARPE modeling tool/simulator [135].
Table 5.4 summarizes the performance (execution time) results on SESC for the FFT, LU (noncontiguous), Radix, Raytrace, and Water-spatial benchmarks for both the two-core and four-core processor architectures. The results from SESC and our queueing theoretic models provide similar insights and show that the multicore architectures with shared LLCs provide better performance than the architectures with private and hybrid LLCs for these benchmarks. Furthermore, architectures with hybrid LLCs exhibit superior performance than the architectures with private LLCs. Since our queueing theoretic models provide relative performance measures for different architectures and benchmarks by simulating a minimum number of the benchmarks' representative instructions, the results show a difference in the absolute execution times obtained from SESC and our queueing theoretic models. Furthermore, execution time of the benchmarks on SESC depends on the input sizes for the benchmarks and varies for different input sizes but normally retains similar trends across different architectures. Our queuing theoretic models capture the performance trends, which are important for the relative comparison of different architectures, and these trends match the performance trends obtained from SESC.
Table 5.4 Execution time comparison of the SPLASH-2 benchmarks on SESC for multicore architectures
Architecture | FFT (ms) | LU (ms) | Radix (s) | Raytrace (s) | Water-spatial (s) |
2P-2L1ID-2L2-1M | 65.06 | 518.52 | 4.24 | 26.32 | 24.47 |
2P-2L1ID-1L2-1M | 60.23 | 480.85 | 4.16 | 23.5 | 24.22 |
4P-4L1ID-4L2-1M | 57.38 | 362.13 | 2.24 | 17.84 | 13.06 |
4P-4L1ID-1L2-1M | 52.77 | 336.54 | 2.19 | 15.91 | 14.08 |
4P-4L1ID-2L2-1M | 52.81 | 337.3 | 2.38 | 16.52 | 14.16 |
In a high-level qualitative comparison of architectures, designers are typically interested in relative performance measures for various benchmarks on the evaluated architectures. To verify that our queueing network models can capture the differences in execution times for various benchmarks, Table 5.5 summarizes the evaluation results comparing different dual-core architectures (2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M) on SESC and our queueing theoretic model. Results indicate that our queueing theoretic model qualitatively evaluates the two architectures accurately with an average difference of 5.6% as compared to the SESC simulator evaluation of the architectures. We observed similar architectural evaluation trends for four-core architectures and omit these details for brevity.
Table 5.5 Dual-core architecture evaluation () on SESC and (our queueing theoretic model) based on the SPLASH-2 benchmarks
Evaluation | FFT | LU | Radix | Raytrace | Water-spatial |
SESC | 1.08 | 1.08 | 0.935 | 1.166 | 1.01 |
QT | 1.16 | 1.13 | 0.99 | 1.07 | 1.02 |
% Difference | 7.4 | 4.63 | 5.88 | 8.97 | 0.99 |
and denote the time to execute a benchmark on architecture = 2P-2L1ID-2L2-1M and = 2P-2L1ID-1L2-1M, respectively. For each benchmark, architectural evaluation results are reported as (): time ratio of executing the benchmark on the two architectures ( and )
The SESC simulation results on multithreaded benchmarks verify the results and insights obtained from our queueing theoretic models for mixed and processor-bound workloads (Section 5.4.3). From the probabilistic characterization of the SPLASH-2 benchmarks, we ascertain that it is difficult to find benchmarks in a benchmark suite that cover the entire range of processor-to-processor and processor-to-memory probabilities (e.g., ranging from 0.05 for some benchmarks to 0.95 for others). The lack of computationally diverse benchmarks in terms of processor and memory requirements in a benchmark suite makes our queueing theoretic modeling approach an attractive solution for rigorous architectural evaluation because our modeling approach enables architectural evaluation via synthetic benchmarks with virtually any computing requirements characterized probabilistically.
To verify that our queueing theoretic modeling approach provides a quick architectural evaluation as compared to executing benchmarks on a multicore simulator, we compare the execution time required to evaluate multicore embedded architectures on SESC to our queueing theoretic model's execution time. The execution times for the SPLASH-2 benchmarks on SESC were measured using the Linux time
command on an Intel Xeon E5430 processor running at 2.66 GHz. The execution times for our queueing theoretic models were measured using the Linux time
command on an AMD Opteron 246 processor running at 2 GHz, and these results were scaled to 2.66 GHz to provide a fair comparison. The SHARPE execution time was measured on the AMD Opteron processor due to site-specific server installations [173]. The results reveal that our queueing theoretic models require 1.92 and 8.33 ms on average for multicore embedded architectures with two and four processor cores, respectively. We note that the execution times of our queueing theoretic models do not include the time taken to obtain processor, memory, and cache statistics (either from any prior work in the literature or running benchmarks on a multicore simulator or a functional simulator if these statistics are not available in any prior work) as the time to gather these statistics can vary for different benchmarks and depend on the existing work in the literature and available functional simulators for different benchmarks. Table 5.6 depicts the average execution time for the SPLASH-2 benchmarks on SESC for multicore embedded architectures with two and four processor cores and the ratio of the SESC execution times compared to our queueing theoretic models' execution times. Results reveal that our queuing theoretic models can provide architectural evaluation results 482,995 faster as compared to executing benchmarks on SESC. Therefore, our queueing theoretic modeling approach can be used for quick architectural evaluation for multicore embedded systems for virtually any set of workloads.
Table 5.6 Execution time and speedup comparison of our queueing theoretic models versus SESC. denotes the execution time required for simulating an -core architecture using where ( denotes our queueing theoretic model)
Benchmark | (s) | {(s)} | ||
FFT | 1.7 | 885 | 1.4 | 168 |
LU | 21 | 10,938 | 26.1 | 3,133 |
Radix | 72.1 | 37,552 | 77.3 | 9,280 |
Raytrace | 772.7 | 402,448 | 780.3 | 93,673 |
Water-spatial | 927.35 | 482,995 | 998.4 | 119,856 |
In this section, we present insights obtained from our queueing theoretic models regarding performance, performance per watt, and performance per unit area for the five different multicore embedded architectures depicted in Table 5.1. Furthermore, we verify the insights/trends for various workloads with different computational requirements that cannot be captured by a subset of benchmarks in a benchmark's suite and corroborate these results with our presented trends (for brevity, we present a subset of the results; however, our analysis and derived conclusions are based on our complete set of experimental results).
We consider the ARM7TDMI processor core, which is a 32-bit low-power processor with 32-bit instruction and data bus widths [174, 175]. We consider the following cache parameters [176]: cache sizes of 8, 8, and 64 KB for the L1-I, L1-D, and L2 caches, respectively; associativities of direct-mapped, two-way, and two-way for the L1-I, L1-D, and L2 caches, respectively; and block (line) sizes of 64, 16, and 64 B for the L1-I, L1-D, and L2 caches, respectively. We assume a 32 MB MM for all architectures, which is typical for mobile embedded systems (e.g., Sharp Zaurus SL-5600 personal digital assistant (PDA)) [177]. To provide a fair comparison between architectures, we ensure that the total L2 cache size for the shared L2 cache architectures and the private L2 cache architectures is equal. Furthermore, the shared bus bandwidth for the shared LLC (L2 in our experiments) architectures is times the bandwidth of the private LLC architectures, where is the number of cores sharing an LLC cache.
We implement our queueing network models of the multicore embedded architectures using the SHARPE modeling tool/simulator [135]. Figure 5.4 depicts the flow chart for our queueing network model setup in SHARPE. To set up our queueing network model simulation in SHARPE, we first specify the probabilities for a synthesized workload for a given multicore architecture (Section 5.2). For example, for 2P-2L1ID-2L2-1M for memory-bound workloads (processor-to-processor probability , processor-to-memory probability ) assuming that the L1-I, L1-D, and L2 cache miss rates are 25%, 50%, and 30%, respectively, and L1-I and L1-D access ratios are 0.8 and 0.2, respectively, the probabilities are set as: , , , , , , , , , (different probabilities can be assigned for processor-bound or mixed workloads). The code snippet for assigning these probabilities in SHARPE for the two chains is as follows:
bind
pr1P1P1 0.1
pr1P1L1I 0.72
pr1L1IP1 0.75
pr1L1IL2 0.25
pr1P1L1D 0.18
pr1L1DP1 0.5
pr1L1DL2 0.5
pr1L2P1 0.7
pr1L2M 0.3
pr2P2P2 0.1
pr2P2L1I 0.72
pr2L1IP2 0.75
pr2L1IL2 0.25
pr2P2L1D 0.18
pr2L1DP2 0.5
pr2L1DL2 0.5
pr2L2P2 0.7
pr2L2M 0.7
We then calculate the service rates for the service centers used in our multicore queueing models. We assume that the processor core delivers 15 MIPS at 16.8 MHz [174] (cycle time = 1/() = 59.524 ns), which for 32-bit instructions corresponds to a service rate of 480 Mbps. We assume L1-I, L1-D, and L2 caches and MM access latencies of 2, 2, 10, and 100 cycles, respectively [174, 178]. With an L1-I cache line size of 64 B, an access latency of 2 cycles, and a 32-bit (4 B) bus, transferring 64 B requires 64/4 = 16 cycles, which results in a total L1-I time (cycles) = access time + transfer time = 2 + 16 = 18 cycles, with a corresponding L1-I service rate = (64 8)/() = 477.86 Mbps. With an L1-D cache line size of 16 B, the transfer time = 16/4 = 4 cycles, and the total L1-D time = 2 + 4 = 6 cycles, with a corresponding L1-D service rate = (16 8)/() = 358.4 Mbps. With an L2 cache line size of 64 B, the transfer time = 64/4 = 16 cycles, which gives the total L2 time = 10 + 16 = 26 cycles, with a corresponding L2 service rate = (64 8)/() = 330.83 Mbps. With an MM line size of 64 B, the transfer time = 64/4 = 16 cycles, which gives a total MM time = 100 + 16 = 116 cycles, with a corresponding service rate = (64 8)/() = 74.15 Mbps. We assume that each individual job/task requires processing 1 Mb of instruction and data, which is implicit in our queueing models via service rate specifications (ensures steady-state/equilibrium behavior of the queueing network for our simulated workloads). The code snippet for assigning the service rates for the architectural elements (processor cores, caches, and MM) in Mbps in SHARPE is as follows:
P1rate 480
P2rate 480
L1Irate 477.86
L1Drate 358.4
L2rate 330.83
// Main memory service rate for chain 1
sMrate1 74.15
// Main memory service rate for chain 2
sMrate2 74.15
After service rate assignments for the architectural elements, we create a multichain product-form queueing network that outlines the architectural elements in each chain along with the associated transition probabilities between the architectural elements. We specify the architectural elements in each chain of the multichain product-form queueing network with associated transition probabilities between the architectural elements. The queueing network performs calculations for a given number of jobs NumJobs
. The code snippet for creating a multichain product-form queueing network in SHARPE is as follows:
mpfqn embarch1(NumJobs)
chain 1
P1 1L1I pr1P1L1I
1L1I 1L2 pr1L1IL2
1L1I P1 pr1L1IP1
P1 1L1D pr1P1L1D
1L1D 1L2 pr1L1DL2
1L1D P1 pr1L1DP1
1L2 P1 pr1L2P1
1L2 M pr1L2M
M P1 1
end
chain 2
P2 2L1I pr2P2L1I
2L1I 2L2 pr2L1IL2
2L1I P2 pr2L1IP2
P2 2L1D pr2P2L1D
2L1D 2L2 pr2L1DL2
2L1D P2 pr2L1DP2
2L2 P2 pr2L2P2
2L2 M pr2L2M
M P2 1
end
end
Next, we specify appropriate scheduling disciplines for the architectural elements (Section 5.2), such as FCFS scheduling for the processor core, L1-I, L1-D, and L2, and PS for MM. The scheduling discipline is specified for the architectural elements along with the associated service rates for these elements. The code snippet for specifying the scheduling disciplines for the architectural elements in SHARPE is as follows:
P1 fcfs P1rate
end
P2 fcfs P2rate
end
1L1I fcfs L1Irate
end
1L1D fcfs L1Drate
end
2L1I fcfs L1Irate
end
2L1D fcfs L1Drate
end
1L2 fcfs L2rate
end
2L2 fcfs L2rate
end
M ps
1 sMrate1
2 sMrate2
end
We specify the number of jobs for each chain in the queueing network depending on the workloads (parallelized tasks in a workload or multiprogrammed workloads). To simulate a particular (single) benchmark in SHARPE, we set the number of jobs equal to 1 in our queueing network models. For a balanced workload, the number of jobs should be divided equally between the two chains. The code snippet for specifying the number of jobs for each chain in SHARPE is as follows:
1 NumJobs/2
2 NumJobs/2
end
Finally, we compute statistics (e.g., queue length, throughput, utilization, response time) from the queueing network models for a given number of jobs for the architectural elements. The SHARPE code snippet below calculates these statistics for the processor core in chain 1. The code loops through the number of jobs varying from 5 to 20, incrementing by five jobs in each iteration.
loop NumJobs,5,20,5
expr mqlength(embarch1,P1;NumJobs)
expr mtput(embarch1,P1;NumJobs)
expr mutil(embarch1,P1;NumJobs)
expr mrtime(embarch1,P1;NumJobs)
end
In this section, we present results describing the effects of different L1-I, L1-D, and L2 cache miss rates on the architecture response time and throughput performance metrics for mixed, processor-bound, and memory-bound workloads. Considering the effects of different cache miss rates is an important aspect of performance evaluation of multicore embedded architectures with shared resources because cache miss rates give an indication whether the threads (corresponding to tasks) are likely to experience cache contention. Threads with higher LLC miss rates are more likely to have large working sets since each miss results in the allocation of a new cache line. These working sets may suffer from contention because threads may repeatedly evict the other threads' data (i.e., cache thrashing) [144]. We obtained results for cache miss rates of 0.0001, 0.05, and 0.2 up to 0.5, 0.7, and 0.7 for the L1-I, L1-D, and L2 caches, respectively. These cache miss rate ranges represent typical multicore embedded systems for a wide diversity of workloads [179–181].
Figure 5.5 depicts the effects of cache miss rate on the response time for mixed workloads (, ) for 2P-2L1ID-2L2-1M as the number of jobs varies. We observe that MM response time increases by 314% as miss rates for the L1-I, L1-D, and L2 caches increase from 0.0001, 0.05, and 0.2, to 0.5, 0.7, and 0.7, respectively, when . This response time increase is explained by the fact that as the cache miss rate increases, the number of accesses to MM increases, which increases the queue length and MM utilization, which causes an increase in the MM response time. The L1-I and L1-D response times decrease by 14% and 18%, respectively, as the miss rates for the L1-I and L1-D caches increase from 0.0001 and 0.05 to 0.5 and 0.7, respectively, when . This decrease in response time occurs because increased miss rates decrease the L1-I and L1-D queue lengths and utilizations. The L2 cache response time increases by 12% as the miss rates for the L1-I and L1-D caches increase from 0.0001 and 0.05 to 0.5 and 0.7, respectively, when (even though the L2 cache miss rate also increases from 0.2 to 0.7, but increased L1-I and L1-D miss rates effectively increase the number of L2 cache references, which increases the L2 cache queue length and utilization and thus L2 cache response time).
We observed that for mixed workloads (, ), the response times for the processor core, L1-I, L1-D, and MM for 2P-2L1ID-1L2-1M are very close to the response times for 2P-2L1ID-2L2-1M; however, the L2 response time presents interesting differences. The L2 response time for 2P-2L1ID-1L2-1M is 22.3% less than the L2 response time for 2P-2L1ID-2L2-1M when the L1-I, L1-D, and L2 cache miss rates are 0.0001, 0.05, and 0.2, respectively, and (similar percentage differences were observed for other values of ), whereas the L2 response time for 2P-2L1ID-1L2-1M is only 6.5% less than the L2 response time when the L1-I, L1-D, and L2 cache miss rates are 0.5, 0.7, and 0.7, respectively. This result shows that the shared L2 cache (of comparable area as the sum of the private L2 caches) performs better than the private L2 caches in terms of response time for small cache miss rates; however, the performance improvement decreases as the cache miss rate increases. Similar trends were observed for processor-bound (, ) and memory-bound workloads (, ).
For mixed workloads, the response times for the processor core, L1-I, L1-D, and MM for 4P-4L1ID-1L2-1M are 1.2, 1, 1.1, and 2.4 greater than the corresponding architectural elements for 4P-4L1ID-4L2-1M, whereas the L2 response time for 4P-4L1ID-1L2-1M is 1.1 less than the L2 response time for 4P-4L1ID-4L2-1M when the L1-I, L1-D, and L2 cache miss rates are 0.5, 0.7, and 0.7, respectively, and . This observation, in conjunction with our other experiments' results, reveals that the architectures with private LLCs provide improved response time for processor cores and L1 caches as compared to the architectures with shared LLCs; however, the response time of the LLC alone can be slightly better for architectures with shared LLCs because of the larger effective size for each core. The results also indicate that the MM response time could become a bottleneck for architectures with shared LLCs, especially when the cache miss rates become high. Another interesting observation is that shared LLCs could lead to increased response time for processor cores as compared to the private LLCs because of stalling or idle waiting of processor cores for bottlenecks caused by MM. Similar trends were observed for processor- and memory-bound workloads.
For mixed workloads, the L2 response time for 4P-4L1ID-2L2-1M is 1.2 less than 4P-4L1ID-4L2-1M and 1.1 greater than 4P-4L1ID-1L2-1M when the L1-I, L1-D, and L2 cache miss rates are 0.0001, 0.05, and 0.2, respectively, and . MM response time for 4P-4L1ID-2L2-1M is 2.3 less than 4P-4L1ID-1L2-1M, whereas MM response time for 4P-4L1ID-2L2-1M and 4P-4L1ID-4L2-1M is the same when the L1-I, L1-D, and L2 cache miss rates are 0.5, 0.7, and 0.7, respectively, and . The response times for the processor core and L1-I/D are comparable for the three architectures (4P-4L1ID-4L2-1M, 4P-4L1ID-2L2-1M, and 4P-4L1ID-1L2-1M). These results, in conjunction with our other experiments' results, show that having LLCs shared by fewer cores (e.g., the L2 cache shared by two cores in our considered architecture) do not introduce MM as a response time bottleneck, whereas the MMbecomes the bottleneck as more cores share the LLCs, especially for large cache miss rates. Similar trends were observed for processor- and memory-bound workloads.
We observe the effects of cache miss rates on throughput for various multicore embedded architectures. For mixed workloads, the throughput for the processor core, L1-I, L1-D, and MM for 2P-2L1ID-1L2-1M is very close to the throughput for 2P-2L1ID-2L2-1M; however, L2 throughput for 2P-2L1ID-1L2-1M is 100% greater on average than the L2 throughput for 2P-2L1ID-2L2-1M for different miss rates for the L1-I, L1-D, and L2 and . However, the combined throughput of the two private L2 caches in 2P-2L1ID-2L2-1M is comparable to the L2 throughput for 2P-2L1ID-1L2-1M. This shows that the shared and private L2 caches provide comparable net throughputs for the two architectures. The throughput for the processor core, L1-I, L1-D, and L2 for 4P-4L1ID-4L2-1M is 2.1 less on average than the corresponding 2P-2L1ID-2L2-1M architectural elements, whereas the throughput for the processor core, L1-I, L1-D, and L2 for the two architectures is the same when the miss rates for the L1-I, L1-D, and L2 caches are 0.5, 0.7, and 0.7, respectively, and . This indicates that the throughput for the individual architectural elements (except MM since it is shared for both the architectures) decreases for the architecture with more cores since the workload remains the same. The throughput for the processor core, L1-I, L1-D, L2, and MM for 4P-4L1ID-2L2-1M is 1.5, 1.5, 1.5, 2.5, and 1.3 less than the throughput for the corresponding 4P-4L1ID-1L2-1M architectural elements when the miss rates for the L1-I, L1-D, and L2 caches are 0.0001, 0.05, and 0.2, respectively, and . These observations reveal that changing the L2 cache from private to shared can also impact the throughput for other architectural elements because of the interactions between these elements.
We evaluate the effects of cache miss rates on throughput for processor-bound workloads (, ) for 2P-2L1ID-2L2-1M as varies. Results reveal that there is no appreciable increase in processor core throughput as increases from 5 to 20 because the processors continue to operate at utilization close to 1 when the L1-I, L1-D, and L2 cache miss rates are 0.3, 0.3, and 0.3, respectively (similar trends were observed for other cache miss rates). The MM throughput increases by 4.67% ( greater than the mixed workloads) as increases from 5 to 20 when L1-I, L1-D, and L2 cache miss rates are 0.5, 0.7, and 0.7, respectively. In this case, the MM percentage throughput increase is greater for processor-bound workloads as compared to mixed workloads because the MM is underutilized for processor-bound workloads (e.g., utilization of 0.519 for processor-bound workloads as compared to the utilization of 0.985 for mixed workloads when ). However, the MM absolute throughput for processor-bound workloads is less than the mixed workloads (e.g., an MM throughput of 38.5 Mbps for processor-bound workloads as compared to an MM throughput of 73 Mbps for mixed workloads when ). For processor-bound workloads, the throughput for the processor core, L1-I, L1-D, and MM for 2P-2L1ID-1L2-1M is similar to the throughput for2P-2L1ID-2L2-1M; however, the L2 throughput for 2P-2L1ID-1L2-1M is 100% greater than the L2 throughput for 2P-2L1ID-2L2-1M for all cache miss rates on average and . Similar trends were observed for memory-bound and mixed workloads for architectures with two or four cores with private and shared LLCs (these throughput trends would continue as the number of cores increases).
In this section, we present results describing the effects of different workloads on the response time and throughput performance metrics when the L1-I, L1-D, and L2 cache miss rates are held constant. We discuss the effects of varying the computing requirements of these workloads. The computing requirement of a workload signifies the workload's demand for processor resources, which depends on the percentage of arithmetic, logic, and control instructions in the workload relative to the load and store instructions. The computing requirements of the workloads are captured by and in our models.
The results show that the response time for the processor core, L1-I, and L1-D for 2P-2L1ID-1L2-1M is very close (within 7%) to 2P-2L1ID-2L2-1M as the computing requirements of the processor-bound workload vary. However, 2P-2L1ID-1L2-1M provides a 21.5% improvement in L2 response time and a 12.3% improvement in MM response time as compared to 2P-2L1ID-2L2-1M when and a 23.6% improvement in L2 response time and a 1.4% improvement in MM response time when and . 4P-4L1ID-2L2-1M provides a 22.3% improvement in L2 response time and a 13% improvement in MM response time over 4P-4L1ID-4L2-1M when and . 4P-4L1ID-2L2-1M provides a 22.3% improvement in L2 response time and a 3% improvement in MM response time as compared to 4P-4L1ID-4L2-1M when because a larger results in fewer MM references. 4P-4L1ID-1L2-1M provides a 7.4% improvement in L2 response time with a 5.2% degradation in MM response time as compared to 4P-4L1ID-2L2-1M when and . 4P-4L1ID-1L2-1M provides a 12.4% improvement in L2 response time with no degradation in MM response time over 4P-4L1ID-2L2-1M when and . These results indicate that shared LLCs provide more improvement in L2 response time as compared to hybrid and private LLCs for more compute-intensive processor-bound workloads. However, the hybrid LLCs provide better MM response time than shared LLCs for more compute-intensive processor-bound workloads. These results suggest that hybrid LLCs may be more suitable than shared LLCs in terms of scalability and overall response time for comparatively less compute-intensive processor-bound workloads.
Response time and throughput results reveal that memory subsystems for an architecture with private, shared, or hybrid LLCs have a profound impact on the response time of the architecture and throughputs for the L2 and MM with relatively little impact on throughput for the processor cores, L1-I, andL1-D.
Figure 5.6 depicts the effects of varying computing requirements for processor-bound workloads on response time for 2P-2L1ID-2L2-1M as varies where the L1-I, L1-D, and L2 cache miss rates are 0.01, 0.13, and 0.3, respectively. The figure depicts that as increases, the response time for the processor core, L1-I, L1-D, L2, and MM increases for all values of and . The figure shows that as increases, the response time of the processor increases, whereas the response times of L1-I, L1-D, L2, and MM show negligible effects due to the processor-bound nature of the workloads. For example, the processor response time increases by 19.8% as increases from 0.7 to 0.95 when . The response times of L1-I, L1-D, L2, and MM decrease by 10.8%, 14.2%, 2.2%, and 15.2%, respectively, as increases from 0.7 to 0.95 when because an increase in results in a decrease in memory requests, which decreases the response time for the caches and MM.
For memory-bound workloads, 2P-2L1ID-1L2-1M provides a 16.7% improvement in L2 response time and a 31.5% improvement in MM response time as compared to 2P-2L1ID-2L2-1M when and . 2P-2L1ID-1L2-1M provides an 18.2% improvement in L2 response time and a 25.8% improvement in MM response time over 2P-2L1ID-2L2-1M when and . 4P-4L1ID-2L2-1M provides a 19.8% improvement in L2 response time and a 20.2% improvement in MM response time on average over 4P-4L1ID-4L2-1M for both and and . 4P-4L1ID-1L2-1M provides a 2.4% improvement in L2 response time with a 15% degradation in MM response time as compared to 4P-4L1ID-2L2-1M when and . 4P-4L1ID-1L2-1M provides no improvement in L2 response time, with an 11.5% degradation in MM response time as compared to 4P-4L1ID-2L2-1M when and . These results indicate that shared LLCs provide a larger improvement in L2 and MM response time as compared to private LLCs for memory-bound workloads. Furthermore, hybrid LLCs are more amenable in terms of response time as compared to shared and private LLCs for memory-bound workloads. Similar trends were observed for mixed workloads for architectures with two or four cores containing private, shared, or hybrid LLCs.
We observe the effects of varying computing requirements for processor-bound workloads on throughput for 2P-2L1ID-2L2-1M as varies. As increases, the throughputs for the processor core, L1-I, L1-D, L2, and MM increase for all values of and . Furthermore, as increases, the throughput of the processor core increases, whereas the throughputs of L1-I, L1-D, L2, and MM decrease because of relatively fewer memory requests. For memory-bound workloads, L1-I and L1-D throughputs for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M are comparable; however, 2P-2L1ID-1L2-1M improves the L2 throughput by 106.5% and 111% (due to larger combined L2 cache), whereas the MM throughput decreases by 126% and 121.2% when is 0.7 and 0.95, respectively. For memory-bound workloads, 2P-2L1ID-1L2-1M provides a 5.3% and 3.4% improvement in processor core throughput over 2P-2L1ID-2L2-1M when and , respectively, and . For processor-bound workloads, the processor core throughputs for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M are comparable. Similar trends were observed for the architectures with four cores containing private, shared, or hybrid LLCs since the processor cores operate close to saturation (at high utilization) for processor-bound workloads, and memory stalls due to memory subsystem response time have a negligible effect on the processor core performance as memory accesses are completely overlapped with computation.
In this section, we compute performance per watt and performance per unit area for the multicore embedded architectures using our queueing theoretic models. The performance per unitarea is an important metric for embedded systems where the entire system is constrained to a limited space; however, performance per unit area is less important for desktop and supercomputing. Our performance per watt and performance per unit area computations assist in relative comparisons between different multicore embedded architectures. For these computations, we first need to calculate area and worst-case (peak) power consumption for different multicore embedded architectures, which we obtain using CACTI 6.5 [182], International Technology Roadmap for Semiconductors (ITRS) specifications [183], and datasheets for multicore embedded architectures.
Tables 5.7 and 5.8 show the area and peak power consumption for the processor cores, L1-I, L1-D, L2, and MM for two-core and four-core embedded architectures, respectively, assuming a 45 nm process. The core areas are calculated using Moore's law and the ITRS specifications [183] (i.e., the chip area required for the same number of transistors reduces by approximately 1/2 every technology node (process) generation). For example, ARM7TDMI core area is 0.26 mm at 130 nm process [184], the core area at 45 nm process (after three technology node generations, i.e., 130, 90, 65, and 45 nm) is approximately (1/2) mm.
Table 5.7 Area and power consumption of architectural elements for two-core embedded architectures
Element | 2P-2L1ID-2L2-1M | 2P-2L1ID-1L2-1M | ||
Area | Power | Area | Power | |
Core | 0.065 | 2.016 | 0.065 | 2.016 |
L1-I | 0.11 | 135.44 | 0.11 | 135.44 |
L1-D | 0.0998 | 79.76 | 0.0998 | 79.76 |
L2 | 0.578 | 307.68 | 0.5075 | 253.283 |
MM | 34.22 | 3174.12 | 34.22 | 3174.12 |
Table 5.8 Area and power consumption of architectural elements for four-core embedded architectures
Element | 4P-4L1ID-4L2-1M | 4P-4L1ID-1L2-1M | 4P-4L1ID-2L2-1M | |||
Area | Power | Area | Power | Area | Power | |
Core | 0.13 | 4.032 | 0.13 | 4.032 | 0.13 | 4.032 |
L1-I | 0.2212 | 270.88 | 0.2212 | 270.88 | 0.2212 | 270.88 |
L1-D | 0.1996 | 159.52 | 0.1996 | 159.52 | 0.1996 | 159.52 |
L2 | 1.1556 | 615.36 | 0.9366 | 354.04 | 1.015 | 506.8 |
MM | 34.22 | 3174.12 | 34.22 | 3174.12 | 34.22 | 3174.12 |
To illustrate our area and power calculation procedure that can be combined with the results obtained from our queuing theoretic models to obtain performance per unit area and performance per watt, we provide example area and power calculations for 2P-2L1ID-2L2-1M. The area calculations for 2P-2L1ID-2L2-1M are as follows: total processor core area = 2 0.0325 = 0.065 mm; total L1-I cache area = 2 (0.281878 0.19619) = 2 0.0553 = 0.11 mm (CACTI provides cache height width for individual caches (e.g., 0.281878 0.19619 for the L1-I cache)); total L1-D cache area = 2 (0.209723 0.23785) = 2 0.0499 = 0.0998 mm; total L2 cache area = 2 (0.45166 0.639594) = 2 0.2889 = 0.578 mm; and total MM area = 8.38777 4.08034 = 34.22 mm. The power consumption of the caches and MM is the sum of the dynamic power and the leakage power. Since CACTI gives dynamic energy and access time, dynamic power is calculated as the ratio of the dynamic energy to the access time. For example, for the L1-I cache, the dynamic power = 0.020362 (nJ)/0.358448 (ns) = 56.8 mW and the leakage power = 10.9229 mW, which gives the L1-I power consumption = 56.8 + 10.9229 = 67.72 mW. For the MM dynamic power calculation, we calculate the average dynamic energy per read and write access = (1.27955 + 1.26155)/2 = 1.27055 mJ, which gives the dynamic power = 1.27055 (nJ)/5.45309 (ns) = 233 mW. The total power consumption for the MM = 233 + 2941.12 = 3174.12 mW (2941.12 mW is the leakage power for the MM). The power consumption calculations for 2P-2L1ID-2L2-1M are as follows: total processor core power consumption = 2 1.008 = 2.016 mW; total L1-I cache power consumption = 2 67.72 = 135.44 mW; total L1-D cache power consumption = 2 39.88 = 79.76 mW; and total L2 cache power consumption = 2 153.84 =307.68 mW.
The area and power results for multicore embedded architectures show that the MM consumes the most area and power consumption followed by L2, L1-I, L1-D, and the processor core. We observe that the shared L2 caches for 2P-2L1ID-1L2-1M and 4P-4L1ID-1L2-1M require 14% and 24% less area and consume 21.5% and 74% less power as compared to the private L2 caches for 2P-2L1ID-2L2-1M and 4P-4L1ID-4L2-1M, respectively. The hybrid L2 caches for 4P-4L1ID-2L2-1M require 14% less area and consume 21.4% less power as compared to the private L2 caches for 4P-4L1ID-4L2-1M, whereas the shared L2 cache for 4P-4L1ID-1L2-1M requires 8.7% less area and consumes 43% less power as compared to the hybrid L2 caches for 4P-4L1ID-2L2-1M. These results indicate that the power efficiency of shared LLCs improves as the number of coresincreases.
Table 5.9 shows the area and peak power consumption for different multicore embedded architectures. Table 5.9 does not include the MM area and power consumption, which allows the results to isolate the area and peak power consumption of the processor cores and caches. This MM isolation from Table 5.9 enables deeper insights and a fair comparison for the embedded architectures since we assume an off-chip MM that has the same size and characteristics for all evaluated architectures. To illustrate the area and power calculations for multicore embedded architectures, we provide area and power consumption calculations for 2P-2L1ID-2L2-1M as an example. We point out that these area and power consumption calculations use constituent area and power consumption calculations for the architectural elements in a multicore embedded architecture. For 2P-2L1ID-2L2-1M, the total cache area = 0.11 + 0.0998 + 0.578 = 0.7878 mm, which gives an overall area (excluding the MM) = 0.065 + 0.7878 = 0.8528 mm (0.065 mm is the area for the processor cores as calculated earlier). For 2P-2L1ID-2L2-1M, the total cache power consumption = 135.44 + 79.76 + 307.68 = 522.88 mW, which gives an overall power consumption (excluding the MM) = 2.016 + 522.88 = 524.896 mW (2.016 mW is the power consumption for the processor cores).
Table 5.9 Area and power consumption for multicore architectures
Architecture | Area | Power |
2P-2L1ID-2L2-1M | 0.8528 | 524.896 |
2P-2L1ID-1L2-1M | 0.7823 | 470.5 |
4P-4L1ID-4L2-1M | 1.7064 | 1049.79 |
4P-4L1ID-1L2-1M | 1.4874 | 788.472 |
4P-4L1ID-2L2-1M | 1.5658 | 941.232 |
The overall area and power consumption results for different multicore embedded architectures (Table 5.9) show that 2P-2L1ID-2L2-1M requires 8.3% more on-chip area and consumes 10.4% more power as compared to 2P-2L1ID-1L2-1M. 4P-4L1ID-4L2-1M requires 8.2% and 12.8% more on-chip area and consumes 10.3% and 24.9% more power as compared to 4P-4L1ID-2L2-1M and 4P-4L1ID-1L2-1M, respectively. These results reveal that the architectures with shared LLCs become more area and power efficient as compared to the architectures with private or hybrid LLCs as the number of cores in the architecture increases.
We discuss performance per watt and performance per unit area results for multicore embedded architectures assuming 64-bit floating point operations. We observe that the performance per watt and performance per unit area delivered by the processor cores and the L1-I and L1-D caches for these architectures are very close (within 7%); however, the L2 cache presents interesting results. Although the MM performance per watt for these architectures also differs, this difference does not provide meaningful insights for the following two reasons: (1) the MM is typically off-chip and the performance per watt is more critical for on-chip architectural components than for the off-chip components; and (2) if more requests are satisfied by the LLC, then fewer requests are deferred to the MM, which decreases the MM throughput and hence the performance per watt. Therefore, we mainly focus on the performance per watt and performance per unit area calculations for the LLCs for our studied architectures.
We calculate performance per watt results for memory-bound workloads when the L1-I, L1-D, and L2 cache miss rates are 0.01, 0.13, and 0.3, respectively. The performance per watt values for the L2 caches are 2.42 and 3.1 MFLOPS/W, and the performance per watt for the MM is 0.164 and 0.074 MFLOPS/W for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, when and . Our performance per watt calculations for 2P-2L1ID-2L2-1M incorporate the aggregate throughput for the L2 cache, which is the sum of the throughputs for the two private L2 caches in 2P-2L1ID-2L2-1M. The performance per watt for the L2 caches drops to 2.02 and 2.53 MFLOPS/W, whereas the performance per watt for the MM drops to 0.137 and 0.06 MFLOPS/W for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, when and . The performance per watt values for the L2 caches are 2.21, 4.77, and 3.08 MFLOPS/W for 4P-4L1ID-4L2-1M, 4P-4L1ID-1L2-1M, and 4P-4L1ID-2L2-1M, respectively, when and . We observe similar trends for mixed workloads and processor-bound workloads but with comparatively lower performance per watt for the LLC caches because these workloads have comparatively lower as compared to memory-bound workloads. The performance per watt for caches drops as decreases because fewer requests are directed to the LLC caches for a low , which decreases the throughput and hence the performance per watt. These results indicate that architectures with shared LLCs provide the highest LLC performance per watt followed by architectures with hybrid LLCs and private LLCs. The difference in performance per watt forthese multicore architectures is mainly due to the difference in the LLC power consumption as there is a relatively small difference in the throughput delivered by these architectures for the same workloads with identical cache miss rates.
Based on our experimental results for different cache miss rates and workloads, we determine the peak performance per watt for the LLCs for our studied multicore embedded architectures. The peak performance per watt values for the L2 caches are 11.8 and 14.3 MFLOPS/W for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, when the L1-I, L1-D, and L2 cache miss rates are all equal to 0.3, , and . The peak performance per watt values for the L2 caches are 7.6, 9.2, and 13.8 MFLOPS/W for 4P-4L1ID-4L2-1M, 4P-4L1ID-2L2-1M, and 4P-4L1ID-1L2-1M, respectively, when the L1-I, L1-D, and L2 cache miss rates are all equal to 0.2, , and . Results reveal that these architectures deliver peak LLC performance per watt for workloads with mid-range cache miss rates (e.g., miss rates of 0.2 or 0.3) because at higher cache miss rates, a larger number of requests are directed toward the LLCs, which causes the LLCs' utilization to be close to one, which results in an increased response time and decreased throughput.
To investigate the effects of cache miss rates on the performance per watt of the LLCs, we calculate the performance per watt for memory-bound workloads () at high cache miss rates: L1-I = 0.5, L1-D = 0.7, and L2 = 0.7, and when . The performance per watt values for the L2 caches are 5.4 and 6.55 MFLOPS/W for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, whereas the performance per watt values are 2.7, 3.28, and 4.69 MFLOPS/W for 4P-4L1ID-4L2-1M, 4P-4L1ID-2L2-1M, and 4P-4L1ID-2L2-1M, respectively. Results reveal that at high cache miss rates, the performance per watt of the LLCs increases because relatively more requests are directed to the LLCs at higher cache miss rates than lower cache miss rates, which increases the throughput, and hence the performance per watt.
We calculate performance per unit area results for memory-bound workloads when the L1-I, L1-D, and L2 cache miss rates are 0.01, 0.13, and 0.3, respectively. The performance per unit area values for the L2 caches are 1.29 and 1.54 MFLOPS/mm, and the performance per unit area values for the MM are 15.25 and 6.9 KFLOPS/mm for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, when and . The performance per unit area values for the L2 caches drop to 1.08 and 1.26 MFLOPS/mm, whereas the performance per unit area values for the MM drop to 12.68 and 5.6 KFLOPS/mm for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, when and . The performance per unit area values for the L2 caches are 1.18, 1.8, and 1.54 MFLOPS/mm for 4P-4L1ID-4L2-1M, 4P-4L1ID-1L2-1M, and 4P-4L1ID-2L2-1M, respectively, when and . We observe similar trends for performance per unit area for mixed and processor-bound workloads as for the performance per watt trends explained earlier. These results indicate that architectures with shared LLCs provide the highest LLC performance per unit area followed by architectures with hybrid LLCs and private LLCs. The difference in performance per unit area for these multicore architectures is mainly dueto the difference in the LLC throughput as we ensure that the total LLC area occupied by a multicore embedded architecture with a given number of cores remains close enough (a minor difference in the occupied area of the LLCs occurs for different multicore embedded architectures due to practical implementation and fabrication constraints as determined by CACTI) to provide a fair comparison.
Based on our queuing theoretic models' results and area calculations, we determine the peak performance per unit area for the LLCs for our studied multicore embedded architectures. The peak performance per unit area values for the L2 caches are 6.27 and 7.14 MFLOPS/mm for 2P-2L1ID-2L2-1M and 2P-2L1ID-1L2-1M, respectively, when the L1-I, L1-D, and L2 cache miss rates are all equal to 0.3, , and . The peak performance per unit area values for the L2 caches are 4.04, 4.6, and 5.22 MFLOPS/mm for 4P-4L1ID-4L2-1M, 4P-4L1ID-2L2-1M, and 4P-4L1ID-1L2-1M, respectively, when the L1-I, L1-D, and L2 cache miss rates are all equal to 0.2, , and . Other results for peak performance per unit area reveal similar trends as for the peak performance per watt trends and therefore omit these discussions for brevity.
In this chapter, we developed closed product-form queueing network models for performance evaluation of multicore embedded architectures for different workload characteristics. The simulation results for the SPLASH-2 benchmarks executing on the SESC simulator (an architecture-level cycle-accurate simulator) verified the architectural evaluation insights obtained from our queueing theoretic models. Results revealed that our queueing theoretic model qualitatively evaluated multicore architectures accurately with an average difference of 5.6% as compared to the architectures' evaluations from the SESC simulator. The performance evaluation results indicated that the architectures with shared LLCs provided better cache response time and MFLOPS/W than the private LLCs for all cache miss rates especially as the number of cores increases. The results also revealed the disadvantage of shared LLCs indicating that the shared LLCs are more likely to cause a main memory response time bottleneck for larger cache miss rates as compared to the private LLCs. The memory bottleneck caused by shared LLCs may lead to increased response time for processor cores because of stalling or idle waiting. However, results indicated that the main memory bottleneck created by shared LLCs can be mitigated by using a hybrid of private and shared LLCs (i.e., sharing LLCs by a fewer number of cores) though hybrid LLCs consume more power than the shared LLCs and deliver comparatively less MFLOPS/W. The performance per watt and performance per unit area results for the multicore embedded architectures revealed that the multicore architectures with shared LLCs become more area and power efficient as compared to the architectures with private LLCs as the number of processor cores in the architectureincreases.