5. RICHER CONSIDERATIONS
Figure 5.1: CSOPT uses a search tree; based on [Jeong and Dubois, 2006].
higher cost, or it might displace several low-cost blocks that together are more costly than the
single high-cost block. Thus, the best replacement decision depends on the behavior of multiple
competing lines in the future, resulting in a large search space of solutions.
Unfortunately, CSOPT's branch-and-bound approach is exponential in the number of
cache accesses, whereas MIN is linear in the number of cache accesses. Jeong and Dubois [2006]
propose heuristics to prune the search tree and to reduce the search space. For a few scientific
workloads, these heuristics are shown to make the computation of CSOPT tractable—in the
sense that an offline analysis can complete—but in general, they do not reduce the worst-case
complexity.
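To make the branch-and-bound idea concrete, the following is a minimal Python sketch of a cost-sensitive optimal search over a trace. It is hypothetical code, not the authors' implementation: it assumes a fully associative cache and a trace of (address, miss_cost) pairs, and it omits the pruning heuristics of Jeong and Dubois [2006]. The contrast with MIN is visible in the final branch: on a miss in a full cache, the search must try every candidate victim, which is the source of the exponential worst case.

```python
def csopt_cost(trace, capacity):
    """Return the minimum total miss cost achievable for the given trace."""
    best = [float("inf")]

    def search(i, cache, cost_so_far):
        if cost_so_far >= best[0]:        # bound: prune branches that cannot win
            return
        if i == len(trace):
            best[0] = cost_so_far         # a complete solution; record its cost
            return
        addr, miss_cost = trace[i]
        if addr in cache:                 # hit: no cost, no decision to make
            search(i + 1, cache, cost_so_far)
        elif len(cache) < capacity:       # miss with a free line: just insert
            search(i + 1, cache | {addr}, cost_so_far + miss_cost)
        else:
            # Miss in a full cache: unlike MIN, branch on every candidate
            # victim, since evicting the cheapest-to-refetch block now may
            # not minimize total cost over the rest of the trace.
            for victim in cache:
                search(i + 1, (cache - {victim}) | {addr},
                       cost_so_far + miss_cost)

    search(0, frozenset(), 0)
    return best[0]
```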
Figure 5.2 shows the relative cost savings of CSOPT over MIN (y-axis) for the Barnes-Hut
benchmark from the SPLASH benchmark suite [Jeong and Dubois, 2006]. Here cache misses
are assumed to have just two possible costs, either high cost or low cost. The x-axis represents
different proportions of high-cost accesses, and the lines represent different cost ratios between
high-cost and low-cost accesses. We see that the cost savings of CSOPT over MIN increase
with higher cost ratios, and they are most pronounced when high-cost accesses comprise
20–30% of the total number of accesses. This trend makes sense because the benefit of CSOPT
over MIN grows when the difference in cost is high and when the percentage of low-cost misses
is high, because in these cases MIN is more likely to make the wrong choice.
Figure 5.2: Cost savings over MIN with random cost assignments [Jeong and Dubois, 2006].
(Barnes; x-axis: high-cost access fraction, 0–0.9; y-axis: relative cost savings, 0–100%; one curve
per cost ratio r = 2, 4, 8, 16, 32, and Inf.)
Of course, like MIN, CSOPT is impractical since it requires knowledge of future accesses.
Therefore, practical solutions for cost-aware cache replacement rely on intuitive heuristics to
(1) identify high-cost cache accesses and (2) preferentially cache high-cost loads over low-cost
loads.
5.1.1 MEMORY-LEVEL PARALLELISM (MLP)
Qureshi et al. [2006] were the first to propose an MLP-aware cache replacement policy. Their
key insight is that isolated misses (low MLP) are more costly for performance than parallel
misses (high MLP) because an isolated miss impedes all dependent instructions behind it and
leads to a long-latency processor stall. By contrast, parallel misses are not as expensive because
the processor can hide their latency by issuing multiple loads in parallel. Thus, an MLP-aware
cache replacement policy can improve performance by reducing the number of performance-
critical isolated misses.
Qureshi et al. illustrate this phenomenon with an example (see Figure 5.3), where P1,
P2, P3, and P4 can be serviced in parallel, and where S1, S2, and S3 are isolated misses. The
top part of Figure 5.3 shows that Belady's solution results in only four misses, but it also results
in four stalls because S1, S2, and S3 all miss and cause the processor to stall. By contrast,
the MLP-aware policy in the bottom part of the figure never evicts S1, S2, and S3, so even
though this policy results in six misses, it stalls the processor only twice, resulting in overall
better performance.
MLP-Aware Linear Replacement Policy There are two components to Qureshi et al. [2006]'s
MLP-aware Linear (LIN) policy. The first is an algorithm that predicts the MLP-based cost of
a future miss. The second is a replacement policy that takes into account both recency and
MLP-based cost.
The MLP-based cost of a cache miss is defined to be the number of cycles that the miss
spends waiting to be serviced, which can be easily tracked using Miss Status Handling Registers.
For parallel misses, the cycle count is divided equally among all concurrent misses. Thus, a higher
MLP-based cost implies that the line is likely to result in an isolated miss, so it is more desirable
to keep it in the cache. Furthermore, Qureshi et al. note that the MLP-based cost repeats for
consecutive misses, so the last MLP-based cost of a miss is a good indicator of its MLP-based
cost for future misses.
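As a concrete illustration, the following Python sketch shows one way such MLP-based cost accounting could be implemented over an MSHR-like structure; the class and method names are hypothetical, not from Qureshi et al. [2006]. Each cycle is divided equally among the currently outstanding misses, so an isolated miss accrues cost four times faster than four fully overlapped misses.

```python
class MSHRCostTracker:
    """Hypothetical per-miss MLP cost accounting over outstanding misses."""

    def __init__(self):
        self.outstanding = {}                  # block address -> accrued MLP cost

    def allocate(self, addr):
        self.outstanding[addr] = 0.0           # a new miss begins accruing cost

    def tick(self):
        n = len(self.outstanding)
        for addr in self.outstanding:          # divide this cycle equally among
            self.outstanding[addr] += 1.0 / n  # all concurrent misses

    def retire(self, addr):
        # The accrued value is this miss's MLP-based cost; the last observed
        # cost is then used to predict the block's cost for future misses.
        return self.outstanding.pop(addr)
```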
Figure 5.3: MLP-aware caching can outperform OPT; based on [Qureshi et al., 2006]. (a) An
access pattern: a miss to an S block (S1, S2, S3) results in an isolated miss, while a miss to a
P block (P1, P2, P3, P4) can be serviced in parallel with misses to other P blocks. (b) Execution
timeline for one iteration with Belady's OPT replacement (4 misses and 4 stalls per iteration).
(c) Execution timeline for one iteration with MLP-aware replacement (6 misses and 2 stalls
per iteration).
The LIN replacement policy linearly combines recency and MLP-based cost and evicts
the line with the lowest aggregate cost. In particular, if R(i) is the recency value of a block i
(the highest value denotes the MRU position and the lowest value denotes the LRU position),
and cost_q(i) is its quantized MLP-based cost, then the LIN policy will evict the line with the
lowest value of R(i) + λ · cost_q(i), where λ controls the importance of cost_q(i) in choosing
the victim. A high value of λ indicates that LIN is likely to retain recent blocks with a high
MLP-based cost, and a low value of λ indicates that LIN is likely to put more emphasis on
recency. Qureshi et al. set λ to 4.
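In code, victim selection under LIN reduces to a single argmin over the blocks in a set. The sketch below assumes hypothetical per-block fields recency and cost_q holding the quantities defined above.

```python
LAMBDA = 4  # the weight Qureshi et al. [2006] assign to the quantized cost

def lin_victim(blocks):
    """Pick the LIN victim: the block minimizing R(i) + LAMBDA * cost_q(i).

    `blocks` is an iterable of objects with hypothetical fields
    .recency (lowest value = LRU position) and .cost_q (quantized MLP cost).
    """
    return min(blocks, key=lambda b: b.recency + LAMBDA * b.cost_q)
```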
Finally, because LIN does not always outperform LRU, Qureshi et al. use Set Dueling
(see Section 3.3.2) to periodically choose between the LIN and LRU policies. Arunkumar and
Wu [2014] provide an alternate solution by incorporating post-eviction reuse information into
the cost metric itself. In particular, their ReMAP policy defines the overall cost of a line as a
linear combination of recency, DRAM latency, and post-eviction reuse distance, where the
post-eviction reuse distance is computed using a Bloom filter for evicted lines.
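For reference, a minimal sketch of how Set Dueling (Section 3.3.2) might arbitrate here: a saturating counter is updated on misses in a few leader sets dedicated to each policy, and the remaining follower sets adopt whichever policy the counter currently favors. The constants and leader-set mapping below are illustrative, not those of Qureshi et al.

```python
PSEL_MAX = 1023          # 10-bit saturating policy-selection counter
psel = PSEL_MAX // 2     # start undecided

def is_lin_leader(set_index):
    return set_index % 64 == 0    # e.g., every 64th set always uses LIN

def is_lru_leader(set_index):
    return set_index % 64 == 1    # and the next one always uses LRU

def on_miss(set_index):
    """A miss in a leader set is evidence against that leader's policy."""
    global psel
    if is_lin_leader(set_index):
        psel = min(PSEL_MAX, psel + 1)
    elif is_lru_leader(set_index):
        psel = max(0, psel - 1)

def policy_for(set_index):
    if is_lin_leader(set_index):
        return "LIN"
    if is_lru_leader(set_index):
        return "LRU"
    # Followers use the policy whose leader sets are missing less often.
    return "LRU" if psel > PSEL_MAX // 2 else "LIN"
```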
Locality-Aware Cost-Sensitive Cache Replacement Policy Recent work builds on Qureshi
et al.’s work by (1) defining new cost metrics, and (2) defining new replacement strategies. For ex-
ample, the Locality-Aware Cost-Sensitive Cache Replacement Algorithm (LACS) [Kharbutli
and Sheikh, 2014] estimates the cost of a cache block by counting the number of instructions is-
sued during a block’s LLC miss. Intuitively, this definition of cost reflects the processor’s ability
to hide miss penalty, which is similar to MLP. Cache blocks are classified as either low-cost or