5. RICHER CONSIDERATIONS
Figure 5.1: CSOPT uses a search tree; based on [Jeong and Dubois, 2006].
higher cost, or it might displace several low-cost blocks that together are more costly than the
single high-cost block. Thus, the best replacement decision depends on the behavior of multiple
competing lines in the future, resulting in a large search space of solutions.
Unfortunately, CSOPT's branch-and-bound approach is exponential in the number of
cache accesses, whereas MIN is linear in the number of cache accesses. Jeong and Dubois [2006]
propose heuristics to prune the search tree and to reduce the search space. For a few scientific
workloads, these heuristics are shown to make the computation of CSOPT tractable—in the
sense that an offline analysis can complete—but in general, they do not reduce the worst-case
complexity.
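To make the branch-and-bound idea concrete, the following is a minimal Python sketch of a cost-sensitive optimal search over a trace. It is hypothetical code, not the authors' implementation: it assumes a fully associative cache and a trace of (address, miss_cost) pairs, and it omits the pruning heuristics of Jeong and Dubois [2006]. The contrast with MIN is visible in the final branch: on a miss in a full cache, the search must try every candidate victim, which is the source of the exponential worst case.

```python
def csopt_cost(trace, capacity):
    """Return the minimum total miss cost achievable for the given trace."""
    best = [float("inf")]

    def search(i, cache, cost_so_far):
        if cost_so_far >= best[0]:        # bound: prune branches that cannot win
            return
        if i == len(trace):
            best[0] = cost_so_far         # a complete solution; record its cost
            return
        addr, miss_cost = trace[i]
        if addr in cache:                 # hit: no cost, no decision to make
            search(i + 1, cache, cost_so_far)
        elif len(cache) < capacity:       # miss with a free line: just insert
            search(i + 1, cache | {addr}, cost_so_far + miss_cost)
        else:
            # Miss in a full cache: unlike MIN, branch on every candidate
            # victim, since evicting the cheapest-to-refetch block now may
            # not minimize total cost over the rest of the trace.
            for victim in cache:
                search(i + 1, (cache - {victim}) | {addr},
                       cost_so_far + miss_cost)

    search(0, frozenset(), 0)
    return best[0]
```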
Figure 5.2 shows the relative cost savings of CSOPT over MIN (y-axis) for the Barnes-Hut
benchmark from the SPLASH benchmark suite [Jeong and Dubois, 2006]. Here cache misses
are assumed to have just two possible costs, either high cost or low cost. The x-axis represents
different proportions of high-cost accesses, and the lines represent different cost ratios between
high-cost and low-cost accesses. We see that the cost savings of CSOPT over MIN increase
with higher cost ratios, and they are most pronounced when high-cost accesses comprise
20–30% of the total number of accesses. This trend makes sense because the benefit of CSOPT
over MIN grows when the difference in cost is high and when the percentage of low-cost misses
is high, because in these cases MIN is more likely to make the wrong choice.
Figure 5.2: Cost savings over MIN with random cost assignments [Jeong and Dubois, 2006].
(Barnes; x-axis: high-cost access fraction, 0–0.9; y-axis: relative cost savings, 0–100%; one curve
per cost ratio r = 2, 4, 8, 16, 32, and Inf.)
Of course, like MIN, CSOPT is impractical since it requires knowledge of future accesses.
Therefore, practical solutions for cost-aware cache replacement rely on intuitive heuristics to
(1) identify high-cost cache accesses and (2) preferentially cache high-cost loads over low-cost
loads.
5.1.1 MEMORY-LEVEL PARALLELISM (MLP)
Qureshi et al. [2006] were the first to propose an MLP-aware cache replacement policy. Their
key insight is that isolated misses (low MLP) are more costly for performance than parallel
misses (high MLP) because an isolated miss impedes all dependent instructions behind it and
leads to a long-latency processor stall. By contrast, parallel misses are not as expensive because
the processor can hide their latency by issuing multiple loads in parallel. Thus, an MLP-aware
cache replacement policy can improve performance by reducing the number of performance-
critical isolated misses.
Qureshi et al. illustrate this phenomenon with an example (see Figure 5.3), where P1,
P2, P3, and P4 can be serviced in parallel, and where S1, S2, and S3 are isolated misses. The
top part of Figure 5.3 shows that Belady's solution results in only four misses, but it also results
in four stalls because S1, S2, and S3 all miss and cause the processor to stall. By contrast,
the MLP-aware policy in the bottom part of the figure never evicts S1, S2, and S3, so even
though this policy results in six misses, it stalls the processor only twice, resulting in overall
better performance.
MLP-Aware Linear Replacement Policy There are two components to Qureshi et al. [2006]'s
MLP-aware Linear (LIN) policy. The first is an algorithm that predicts the MLP-based cost of
a future miss. The second is a replacement policy that takes into account both recency and
MLP-based cost.
The MLP-based cost of a cache miss is defined to be the number of cycles that the miss
spends waiting to be serviced, which can be easily tracked using Miss Status Handling Registers.
For parallel misses, the cycle count is divided equally among all concurrent misses. Thus, a higher
MLP-based cost implies that the line is likely to result in an isolated miss, so it is more desirable
to keep it in the cache. Furthermore, Qureshi et al. note that the MLP-based cost repeats for
consecutive misses, so the last MLP-based cost of a miss is a good indicator of its MLP-based
cost for future misses.
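As a concrete illustration, the following Python sketch shows one way such MLP-based cost accounting could be implemented over an MSHR-like structure; the class and method names are hypothetical, not from Qureshi et al. [2006]. Each cycle is divided equally among the currently outstanding misses, so an isolated miss accrues cost four times faster than four fully overlapped misses.

```python
class MSHRCostTracker:
    """Hypothetical per-miss MLP cost accounting over outstanding misses."""

    def __init__(self):
        self.outstanding = {}                  # block address -> accrued MLP cost

    def allocate(self, addr):
        self.outstanding[addr] = 0.0           # a new miss begins accruing cost

    def tick(self):
        n = len(self.outstanding)
        for addr in self.outstanding:          # divide this cycle equally among
            self.outstanding[addr] += 1.0 / n  # all concurrent misses

    def retire(self, addr):
        # The accrued value is this miss's MLP-based cost; the last observed
        # cost is then used to predict the block's cost for future misses.
        return self.outstanding.pop(addr)
```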
Figure 5.3: MLP-aware caching can outperform OPT; based on [Qureshi et al., 2006]. (a) An
access pattern: a miss to an S block (S1, S2, S3) results in an isolated miss, while a miss to a
P block (P1, P2, P3, P4) can be serviced in parallel with misses to other P blocks. (b) Execution
timeline for one iteration with Belady's OPT replacement (4 misses and 4 stalls per iteration).
(c) Execution timeline for one iteration with MLP-aware replacement (6 misses and 2 stalls
per iteration).
The LIN replacement policy linearly combines recency and MLP-based cost and evicts
the line with the lowest aggregate cost. In particular, if R(i) is the recency value of a block i
(the highest value denotes the MRU position and the lowest value denotes the LRU position),
and cost_q(i) is its quantized MLP-based cost, then the LIN policy will evict the line with the
lowest value of R(i) + λ · cost_q(i), where λ controls the importance of cost_q(i) in choosing
the victim. A high value of λ indicates that LIN is likely to retain recent blocks with a high
MLP-based cost, and a low value of λ indicates that LIN is likely to put more emphasis on
recency. Qureshi et al. set λ to 4.
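In code, victim selection under LIN reduces to a single argmin over the blocks in a set. The sketch below assumes hypothetical per-block fields recency and cost_q holding the quantities defined above.

```python
LAMBDA = 4  # the weight Qureshi et al. [2006] assign to the quantized cost

def lin_victim(blocks):
    """Pick the LIN victim: the block minimizing R(i) + LAMBDA * cost_q(i).

    `blocks` is an iterable of objects with hypothetical fields
    .recency (lowest value = LRU position) and .cost_q (quantized MLP cost).
    """
    return min(blocks, key=lambda b: b.recency + LAMBDA * b.cost_q)
```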
Finally, because LIN does not always outperform LRU, Qureshi et al. use Set Dueling
(see Section 3.3.2) to periodically choose between the LIN and LRU policies. Arunkumar and
Wu [2014] provide an alternate solution by incorporating post-eviction reuse information into
the cost metric itself. In particular, their ReMAP policy defines the overall cost of a line as a
linear combination of recency, DRAM latency, and post-eviction reuse distance, where the
post-eviction reuse distance is computed using a Bloom filter for evicted lines.
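For reference, a minimal sketch of how Set Dueling (Section 3.3.2) might arbitrate here: a saturating counter is updated on misses in a few leader sets dedicated to each policy, and the remaining follower sets adopt whichever policy the counter currently favors. The constants and leader-set mapping below are illustrative, not those of Qureshi et al.

```python
PSEL_MAX = 1023          # 10-bit saturating policy-selection counter
psel = PSEL_MAX // 2     # start undecided

def is_lin_leader(set_index):
    return set_index % 64 == 0    # e.g., every 64th set always uses LIN

def is_lru_leader(set_index):
    return set_index % 64 == 1    # and the next one always uses LRU

def on_miss(set_index):
    """A miss in a leader set is evidence against that leader's policy."""
    global psel
    if is_lin_leader(set_index):
        psel = min(PSEL_MAX, psel + 1)
    elif is_lru_leader(set_index):
        psel = max(0, psel - 1)

def policy_for(set_index):
    if is_lin_leader(set_index):
        return "LIN"
    if is_lru_leader(set_index):
        return "LRU"
    # Followers use the policy whose leader sets are missing less often.
    return "LRU" if psel > PSEL_MAX // 2 else "LIN"
```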
Locality-Aware Cost-Sensitive Cache Replacement Policy Recent work builds on Qureshi
et al.’s work by (1) defining new cost metrics, and (2) defining new replacement strategies. For ex-
ample, the Locality-Aware Cost-Sensitive Cache Replacement Algorithm (LACS) [Kharbutli
and Sheikh, 2014] estimates the cost of a cache block by counting the number of instructions is-
sued during a block’s LLC miss. Intuitively, this definition of cost reflects the processor’s ability
to hide miss penalty, which is similar to MLP. Cache blocks are classified as either low-cost or