Caches
Vincent H. Berk
Wednesday October 29th, 2008
Reading for Friday: Sections C.1 – C.3
Article for Friday: Jouppi
Reading for Monday: Sections C.4 – C.7
Who Cares about the Memory Hierarchy?
• So far, have discussed only the processor
– CPU Cost/Performance, ISA, Pipelined Execution, ILP

[Figure: processor vs. DRAM performance, 1980–2000, log scale from 1 to 1000, illustrating the growing CPU-DRAM gap]
• 1980: no cache in microprocessors
• 1995: 2-level cache, 60% of transistors on the Alpha 21164
• 2002: IBM experimenting with main memory on the die
The Motivation for Caches
[Diagram: the processor connected to a memory system consisting of a cache backed by main memory]
• Motivation:
– Large memories (DRAM) are slow
– Small memories (SRAM) are fast
• Make the average access time small by servicing most accesses
from a small, fast memory
• Reduce the bandwidth required of the large memory
Principle of Locality of Reference
• Programs do not access their data or code all at once or with equal
probability
– Rule of thumb: Program spends 90% of its execution time in
only 10% of the code
• Programs access a small portion of the address space at any one time
• Programs tend to reuse data and instructions that they have recently
used
• Implication of locality: Can predict with reasonable accuracy what
instructions and data a program will use in the near future based on
its accesses in the recent past
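
As a concrete illustration (a minimal C sketch, not from the lecture), the two traversals below touch the same data but differ sharply in spatial locality:

#define N 1024

/* Row-major traversal: consecutive iterations touch adjacent addresses
 * (spatial locality), and sum/i/j are reused constantly (temporal locality). */
double sum_rows(double a[N][N]) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];          /* stride-1 access: cache-friendly */
    return sum;
}

/* Column-major traversal of the same array: each access jumps N*8 bytes,
 * defeating spatial locality and typically missing far more often. */
double sum_cols(double a[N][N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];          /* stride-N access: cache-hostile */
    return sum;
}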
Memory System
[Diagram — Illusion: the processor sees a single large, fast memory. Reality: the processor is backed by a hierarchy of several memories.]
General Principles
• Locality
– Temporal Locality: referenced again soon
– Spatial Locality: nearby items referenced soon
• Locality + "smaller HW is faster" => memory hierarchy
– Levels: each smaller, faster, more expensive/byte than the level below
– Inclusive: data found in top also found in lower levels
• Definitions
– Upper is closer to processor
– Block: minimum address-aligned unit that fits in the cache
– Address = Block frame address + block offset address
– Hit time: time to access upper level, including hit determination
Cache Measures
• Hit rate: fraction of accesses found in that level
– So high that we usually talk about the miss rate
– Miss-rate fallacy: the miss rate alone does not determine average memory performance; the miss penalty matters just as much
• Average memory-access time (AMAT) =
Hit time + Miss rate × Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including the time to copy the block in and restart the CPU
– Access time: time to reach the lower level = ƒ(lower-level latency)
– Transfer time: time to transfer the block = ƒ(bandwidth between upper & lower levels, block size)
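
As a quick sketch of the AMAT formula in C (the numbers below are hypothetical, not from the lecture):

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty, everything in clock cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Hypothetical cache: 1-cycle hit, 5% miss rate, 100-cycle miss penalty. */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));   /* prints 6.0 */
    return 0;
}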
Block Size vs. Cache Measures
• Increasing block size generally increases the miss penalty
[Figure: three sketches vs. block size — miss penalty (access time + transfer time) rises with block size; miss rate first falls, then rises; together these give an average memory access time curve with a minimum at an intermediate block size]
Key Points of Memory Hierarchy
• Need methods to give illusion of large, fast memory
• Programs exhibit both temporal locality and spatial locality
– Keep more recently accessed data closer to the processor
– Keep multiple contiguous words together in memory blocks
• Use smaller, faster memory close to processor – hits are processed
quickly; misses require access to larger, slower memory
• If hit rate is high, memory hierarchy has access time close to that
of highest (fastest) level and size equal to that of lowest (largest)
level
Implications for CPU
• Fast hit check since every memory access needs this check
– Hit is the common case
• Unpredictable memory access time
– 10s of clock cycles: wait
– 1000s of clock cycles (operating-system territory):
» Interrupt, switch to another task, and do something else
» Lightweight alternative: multithreaded execution
Four Memory Hierarchy Questions
• Q1: Where can a block be placed in the upper level?
(Block placement)
• Q2: How is a block found if it is in the upper level?
(Block identification)
• Q3: Which block should be replaced on a miss?
(Block replacement)
• Q4: What happens on a write?
(Write strategy)
Q1: Where can a block be placed in the cache?
• Block 12 placed in an 8-block cache (see the sketch below):
– Fully associative: block 12 can go anywhere
– Direct mapped: block 12 can go only into block frame 4 (12 mod 8)
– 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4)
– Set-associative mapping = block number modulo the number of sets

[Diagram: the three organizations side by side — cache block frames 0–7, grouped into sets 0–3 for the 2-way case — with memory block frame addresses 0–31]
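
The placements above are easy to check in code; a minimal C sketch, assuming the same 8-block cache and block address 12:

#include <stdio.h>

int main(void) {
    int block  = 12;                /* memory block address */
    int frames = 8;                 /* cache size in block frames */

    /* Direct mapped: exactly one candidate frame. */
    printf("direct mapped:         frame %d\n", block % frames);      /* 4 */

    /* 2-way set associative: 8 frames / 2 ways = 4 sets. */
    printf("2-way set associative: set %d\n", block % (frames / 2));  /* 0 */

    /* Fully associative: any of the 8 frames. */
    printf("fully associative:     any of %d frames\n", frames);
    return 0;
}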
Direct Mapped Cache
• Each memory location is mapped to exactly one location in the cache
• Cache location assigned based on address of word in memory
• Mapping: (address of block) mod (# of blocks in cache)
[Diagram: an 8-entry direct-mapped cache with indices 000–111; 5-bit memory addresses 00000, 00100, 01000, 01100, 10000, 10100, 11000, 11100 map into the cache by the low-order bits of the block address]
Associative Caches
• Fully Associative: block can go anywhere in the cache
• N-way Set Associative: block can go in one of N locations in the set
Q2: How is a block found if it is in the cache?
• Tag on each block
– No need to check index or block offset
• Increasing associativity shrinks index, expands tag
[Address breakdown: Block address = Tag | Index, followed by the Block offset. Fully associative: no index. Direct mapped: large index.]
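
A sketch of the breakdown in C, assuming 32-bit addresses and, for concreteness, a direct-mapped cache with 256 blocks of 32 bytes, i.e. 5 offset bits and 8 index bits (these parameters are assumptions, matching the 21064 example on a later slide):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5   /* 32-byte blocks */
#define INDEX_BITS  8   /* 256 blocks, direct mapped */

int main(void) {
    uint32_t addr   = 0x12345678;   /* example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}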
Examples
• 512-byte cache, 4-way set associative, 16-byte blocks, byte
addressable
• 8-KB cache, 2-way set associative, 32-byte blocks, byte addressable
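
A worked sketch, assuming 32-bit addresses (the slide leaves the address width open):
– 512-byte, 4-way, 16-byte blocks: block offset = log2(16) = 4 bits; number of sets = 512 / (16 × 4) = 8, so index = 3 bits; tag = 32 − 4 − 3 = 25 bits.
– 8-KB, 2-way, 32-byte blocks: block offset = log2(32) = 5 bits; number of sets = 8192 / (32 × 2) = 128, so index = 7 bits; tag = 32 − 5 − 7 = 20 bits.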
Q3: Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
– Random (large associativities)
– LRU (smaller associativities)
– FIFO (large associativities)
Data-cache misses per 1000 instructions:

                 2-way                    4-way
Size       LRU    Random   FIFO     LRU    Random   FIFO
16 KB      114.1  117.3    115.5    111.7  115.1    113.3
64 KB      103.4  104.3    103.9    102.4  102.3    103.1
256 KB      92.2   92.1     92.5     92.1   92.1     92.5
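
As a sketch of how a simulator might implement the LRU policy compared above (hypothetical code, not from the lecture): each way keeps a timestamp, refreshed on every hit, and the victim is the way with the oldest timestamp.

#include <stdint.h>

#define WAYS 4

struct way { uint32_t tag; int valid; uint64_t last_used; };

/* Pick the victim way in one set under LRU: any invalid way first,
 * otherwise the way with the smallest (oldest) timestamp. */
int lru_victim(const struct way set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                                /* free slot wins */
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    }
    return victim;
}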
Q4: What Happens on a Write?
• Write through: the information is written both to the block in the
cache and to the block in the lower-level memory.
• Write back: The information is written only to the block in the cache.
The modified cache block is written to main memory only when it is
replaced.
– Is block clean or dirty?
• Pros and cons of each:
– WT: read misses never cause writes to the lower level (no dirty blocks to evict on replacement)
– WB: repeated writes to a block reach memory only once, when the block is replaced
• WT is always combined with write buffers so that we don't wait for the lower-level memory
• WB can also use a write buffer for evicted dirty blocks, with read misses given precedence over buffered writes
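
A sketch of the two write policies in simulator-style C (the block type and the write_lower_level helper are assumptions for illustration, not lecture code):

#include <stdint.h>

struct block { uint32_t tag; int valid, dirty; uint8_t data[32]; };

/* Stand-in for sending a block toward lower-level memory; in hardware
 * this would go through the write buffer so the CPU need not wait. */
static void write_lower_level(const struct block *b) { (void)b; /* ... */ }

/* Write through: update the cache block and the lower level on every write. */
void write_through(struct block *b, int offset, uint8_t value) {
    b->data[offset] = value;
    write_lower_level(b);
}

/* Write back: update only the cache block and mark it dirty; the lower
 * level sees the new data only when the block is replaced. */
void write_back(struct block *b, int offset, uint8_t value) {
    b->data[offset] = value;
    b->dirty = 1;
}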
Example: 21064 Data Cache
• 8-KB direct-mapped cache, 32-byte blocks; Index = 8 bits: 256 blocks = 8192/(32 × 1)

[Diagram: the CPU address is split into a 21-bit tag, an 8-bit index, and a 5-bit block offset. (1) The address arrives from the CPU; (2) the index selects one of the 256 blocks, reading its valid bit <1>, tag <21>, and data <256>; (3) the stored tag is compared (=?) against the address tag; (4) on a hit, a 4:1 MUX selects the requested word from the 256-bit block. A write buffer sits between the cache and lower-level memory.]
2-way Set Associative,
Address to Select Word
[Diagram: the CPU address is split into a 22-bit tag, a 7-bit index, and a 5-bit block offset. Two sets of address tags and data RAM are read in parallel: the index selects one block in each way, both stored tags are compared (=?) with the address tag, and a 2:1 MUX selects the 64-bit data <64> from the matching way. A write buffer sits between the cache and lower-level memory.]
Structural Hazard:
Instruction and Data?
Misses per 1000 instructions (mix: 74% instructions, 26% data):

Size      Instruction cache   Data cache   Unified cache
8 KB        8.16                44.0          63.0
16 KB       3.82                40.9          51.0
32 KB       1.36                38.4          43.3
64 KB       0.61                36.9          39.4
128 KB      0.30                35.3          36.2
256 KB      0.02                32.6          32.9
Cache Performance

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
(CPU execution clock cycles includes hit time)

Memory-stall clock cycles
= Read-stall cycles + Write-stall cycles
= (Memory accesses / Program) × Miss rate × Miss penalty
= (Instructions / Program) × (Misses / Instruction) × Miss penalty
Cache Performance (continued)

CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Misses per instruction = Memory accesses per instruction × Miss rate

CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time

These formulas are conceptual only. Modern out-of-order processors hide much of this latency through parallelism.
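
A worked sketch with made-up numbers: IC = 10^9, CPI_execution = 1.0, 1.5 memory accesses per instruction, a 2% miss rate, a 100-cycle miss penalty, and a 1-ns clock. Then the memory-stall CPI is 1.5 × 0.02 × 100 = 3.0, total CPI = 1.0 + 3.0 = 4.0, and CPU time = 10^9 × 4.0 × 1 ns = 4 s — memory stalls account for three-quarters of the execution time.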
Summary of Cache Basics
• Associativity
• Block size (cache line size)
• Write Back/Write Through, write buffers, dirty bits
• AMAT as a basic performance measure
• Larger block size decreases miss rate but can increase miss penalty
• Can increase bandwidth of main memory to transfer cache blocks
more efficiently
• The memory system can have a significant impact on program execution time; memory stalls can be over 100 cycles
• Faster processors => memory stalls become more costly