Design and Analysis of a Robust
Pipelined Memory System
Hao Wang†, Haiquan (Chuck) Zhao*,
Bill Lin†, and Jun (Jim) Xu*
†University of California, San Diego
*Georgia Institute of Technology
Infocom 2010, San Diego
Memory Wall
• Modern Internet routers need to manage large
amounts of packet- and flow-level data at line rates
• e.g., need to maintain per-flow records during a
monitoring period, but
– Core routers have millions of flows, translating to
100’s of megabytes of storage
– 40 Gb/s OC-768 link, new packet can arrive every 8 ns
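The 8 ns figure is just the transmission time of a minimum-size packet at line rate; a quick sanity check (assuming 40-byte minimum packets, an illustrative worst case):

```python
# Worst-case packet inter-arrival time on a 40 Gb/s OC-768 link,
# assuming 40-byte minimum-size packets (an illustrative worst case).
LINK_RATE_BPS = 40e9
MIN_PACKET_BITS = 40 * 8

interarrival_ns = MIN_PACKET_BITS / LINK_RATE_BPS * 1e9
print(f"{interarrival_ns:.0f} ns")  # → 8 ns
```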
Memory Wall
• SRAM/DRAM dilemma
• SRAM: access latency typically 5-15 ns (fast enough for the 8 ns line rate)
• But SRAM capacity is substantially inadequate in many cases: the largest devices are typically about 4 MB (far less than the 100's of MB needed)
Memory Wall
• DRAM provides inexpensive bulk storage
• But random access latency is typically 50-100 ns (much slower than the 8 ns needed for a 40 Gb/s line rate)
• Conventional wisdom is that DRAMs
are not fast enough to keep up with
ever-increasing line rates
Memory Design Wish List
• Line rate memory bandwidth (like SRAM)
• Inexpensive bulk storage (like DRAM)
• Predictable performance
• Robustness to adversarial access patterns
Main Observation
• Modern DRAMs can be fast and
cheap!
– Graphics, video games, and HDTV
– At commodity pricing, just
$0.01/MB currently, $20 for 2GB!
Example: Rambus XDR Memory
• 16 internal banks
Memory Interleaving
• Performance achieved through memory interleaving
– e.g. suppose we have B = 6 DRAM banks and access
pattern is sequential
[Diagram: sequential requests 1, 2, 3, ... striped across the B = 6 banks; bank 1 serves requests 1, 7, 13, ..., bank 2 serves 2, 8, 14, ..., and so on up to bank 6 serving 6, 12, 18, ....]
– Effective memory bandwidth B times faster
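The round-robin striping above amounts to a modulo mapping from address to bank; a minimal sketch (B = 6 as in the example, function name is illustrative):

```python
B = 6  # number of DRAM banks, as in the example

def bank(addr: int) -> int:
    """Round-robin interleaving: consecutive addresses hit consecutive banks."""
    return addr % B

# Sequential requests spread evenly across the banks, so each bank serves
# only every 6th request and effective bandwidth is B times a single bank's.
print([bank(a) for a in (1, 7, 13)])   # → [1, 1, 1]  (same bank, 6 apart)
print([bank(a) for a in range(1, 7)])  # → [1, 2, 3, 4, 5, 0]
```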
Memory Interleaving
• But, suppose access pattern is as follows:
[Diagram: the access pattern repeatedly targets addresses 1, 7, 13, 19, 25, ..., which all fall on the same bank; that bank's queue grows while the other banks sit idle.]
• Memory bandwidth degrades to worst-case DRAM
latency
Memory Interleaving
• One solution is to apply pseudo-randomization of
memory locations
[Diagram: the same banks with memory locations pseudo-randomly permuted across them.]
Adversarial Access Patterns
• However, memory bandwidth can still degrade to
worst-case DRAM latency even with randomization:
1. Lookups to the same global variable will trigger accesses to the same memory bank
2. An attacker can flood the link with packets carrying the same TCP/IP header, triggering updates to the same memory location, and hence the same memory bank, regardless of the randomization function.
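The second point holds because any fixed permutation, however random-looking, sends equal addresses to the same bank. A minimal illustration (the salted hash here is a stand-in, not the paper's permutation):

```python
import hashlib

def permuted_bank(addr: int, B: int = 32, salt: bytes = b"secret") -> int:
    """Pseudo-random address-to-bank map (illustrative stand-in)."""
    digest = hashlib.sha256(salt + addr.to_bytes(8, "big")).digest()
    return digest[0] % B

# However well the permutation scrambles addresses, a flood of packets
# carrying the same header (hence the same address) lands on a single bank:
hits = {permuted_bank(0xDEADBEEF) for _ in range(1000)}
print(len(hits))  # → 1
```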
Outline
• Problem and Background
→Proposed Design
• Theoretical Analysis
• Evaluation
Pipelined Memory Abstraction
Emulates SRAM with Fixed Delay
[Figure: two timing diagrams of the same sequence of reads and writes to addresses a, b, c. The ideal SRAM (top) returns read data in the same cycle the read is issued; the SRAM emulation (bottom) returns the data of a read issued at cycle t at exactly cycle t + D, so outputs appear at cycles D, D+1, D+2, ....]
Implications of Emulation
• Fixed pipeline delay: If a read operation is issued at time t to an emulated SRAM, the data is available from the memory controller at exactly t + D (instead of in the same cycle).
• Coherency: The read operations output the same
results as an ideal SRAM system.
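This emulation contract can be modeled as a delay line in front of an ideal backing store. A minimal sketch (class and method names are assumptions; the real design uses DRAM banks and a reservation table, not a dictionary):

```python
from collections import deque

class FixedDelayMemory:
    """Emulated SRAM: a read issued at cycle t is answered at exactly t + D.
    The dict backing store stands in for the DRAM banks of the real design."""

    def __init__(self, D: int):
        self.D = D
        self.mem = {}        # address -> value
        self.pipe = deque()  # (ready_cycle, value), in issue order

    def issue(self, cycle: int, op: str, addr: int, data=None):
        if op == "W":
            self.mem[addr] = data  # writes take effect in issue order
        else:
            self.pipe.append((cycle + self.D, self.mem.get(addr)))

    def output(self, cycle: int):
        """Return the value leaving the pipeline this cycle, if any."""
        if self.pipe and self.pipe[0][0] == cycle:
            return self.pipe.popleft()[1]
        return None

m = FixedDelayMemory(D=3)
m.issue(0, "W", 5, "x")
m.issue(1, "R", 5)
print(m.output(4))  # → x
```

Reads come back in issue order with a constant latency, which is exactly the coherency plus fixed-delay behavior the slide describes.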
Proposed Solution: Basic Idea
• Keep an SRAM reservation table of the memory operations and data from the last C cycles
• Avoid issuing a new DRAM operation for memory references to the same location within C cycles
Details of Memory Architecture
[Figure: memory architecture. Input operations (op, addr, data) pass through a random address permutation into B per-bank request buffers feeding the DRAM banks. A reservation table covering the last C cycles of operations, together with MRI and MRW tables (CAMs), tracks recent reads and writes; read results leave on the data-out port.]
Merging of Operations
• Requests arrive from right to left (the newer operation is written on the left).
1. READ + WRITE → WRITE (the read copies its data from the write)
2. WRITE + WRITE → WRITE (the 2nd write overwrites the 1st write)
3. READ + READ → READ (the 2nd read copies data from the 1st read)
4. WRITE + READ → WRITE + READ (both operations must still go to DRAM)
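The four merge rules can be sketched as a small function (encoding operations as "R"/"W" strings is an assumption; here the first argument arrived earlier in time):

```python
def merge(first: str, second: str) -> list:
    """Merge two operations on the same address within C cycles.
    `first` arrived earlier; returns the ops that must still reach DRAM."""
    if first == "W" and second == "W":
        return ["W"]   # second write overwrites the first
    if first == "R" and second == "R":
        return ["R"]   # second read copies data from the first read
    if first == "W" and second == "R":
        return ["W"]   # the read copies its data from the write
    return ["R", "W"]  # read then write: the read still needs DRAM data

print(merge("W", "R"))  # → ['W']
print(merge("R", "W"))  # → ['R', 'W']
```

Only the last case forces two DRAM operations, which is what bounds how fast an attacker can load any one bank.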
Proposed Solution
• Rigorously prove that, with merging, the worst-case delay for a memory operation is bounded by some fixed D w.h.p.
• Provide a pipelined memory abstraction in which an operation issued at time t completes at exactly time t + D (instead of in the same cycle).
• The reservation table, with C > D, also implements the pipeline delay, as well as serving as a "cache".
Outline
• Problem and Background
• Proposed Design
→Theoretical Analysis
• Evaluation
Robustness
• At most one write operation in a request buffer every
C cycles to a particular memory address.
• At most one read operation in a request buffer every
C cycles to a particular memory address.
• At most one read operation followed by one write
operation in a request buffer every C cycles to a
particular address.
Theoretical Analysis
• Worst case analysis
• Convex ordering
• Large deviation theory
• Prove: with a cache of size C, the best an attacker can do
is to send repetitive requests every C+1 cycles.
Bound on Overflow Probability
• Want to bound the probability that a request buffer
overflows in n cycles
Pr[overflow] ≤ Σ_{0 ≤ s < t ≤ n} Pr[D_{s,t}]
D_{s,t} := { ω : X_{s,t}(ω) ≥ μτ + K }
• X_{s,t} is the number of updates to a bank during cycles [s, t], τ = t - s, and K is the length of a request queue.
• For the total overflow probability bound, multiply by B.
Chernoff Inequality
Pr[D_{s,t}] ≤ Pr[X ≥ K + μτ]
           = Pr[e^{θX} ≥ e^{θ(K + μτ)}]
           ≤ E[e^{θX}] / e^{θ(K + μτ)}
• Since this is true for all θ > 0,
Pr[D_{s,t}] ≤ min_{θ > 0} E[e^{θX}] / e^{θ(K + μτ)}
• We want to find the update sequence that maximizes E[e^{θX}].
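To see how the minimization over θ behaves, here is a numerical sketch that assumes, purely for illustration, X ~ Binomial(τ, p) so that E[e^{θX}] = (1 - p + p·e^θ)^τ; the paper instead characterizes the worst-case X via convex ordering:

```python
import math

def chernoff_bound(tau: int, p: float, mu: float, K: int) -> float:
    """min over theta > 0 of E[e^{theta X}] / e^{theta (K + mu*tau)}
    for X ~ Binomial(tau, p) -- an illustrative distribution only."""
    target = K + mu * tau
    best_log = 0.0  # theta -> 0 gives the trivial bound 1
    theta = 0.01
    while theta <= 10.0:
        # log E[e^{theta X}] = tau * log(1 - p + p*e^theta) for a binomial
        log_bound = tau * math.log(1 - p + p * math.exp(theta)) - theta * target
        best_log = min(best_log, log_bound)
        theta += 0.01
    return math.exp(best_log)

# Parameters echoing the evaluation section: 1/B chance of hitting a given
# bank, service rate mu = 1/10, window tau = C = 8000 cycles.
print(chernoff_bound(tau=8000, p=1 / 32, mu=1 / 10, K=150))
```

With these (assumed) parameters the bound is vanishingly small, and it shrinks further as K grows, matching the qualitative shape of the evaluation curves.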
Worst Case Request Patterns
T = ⌈τ/C⌉
q1, q2, and r split the μτ updates so that 2T·q1 + (2T - 1)·q2 + r = μτ, with
q2 = ⌊(μτ - 2T·q1) / (2T - 1)⌋ and r = μτ - 2T·q1 - (2T - 1)·q2
• q1 + q2 + 1 requests go to distinct counters a_1, ..., a_{q1+q2+1}
• q1 of these counters are requested 2T times each
• q2 counters are requested 2T - 1 times each
• 1 counter is requested r times
Outline
• Problem and Background
• Proposed Design
• Theoretical Analysis
→Evaluation
Evaluation
• Overflow probability for 16 million addresses,
µ=1/10, and B=32.
[Plot: overflow probability bound (log scale, 10^0 down to 10^-14) versus queue length K from 80 to 180, one curve per cache size C = 6000, 7000, 8000, 9000; larger C yields a smaller bound. Annotation: SRAM 156 KB, CAM 24 KB.]
Evaluation
• Overflow probability for 16 million addresses,
µ=1/10, and C=8000.
[Plot: overflow probability bound (log scale, 10^0 down to 10^-30) versus request buffer size K from 80 to 180, one curve per number of banks B = 32, 34, 36, 38; larger B yields a smaller bound.]
Conclusion
• Proposed a robust memory architecture that provides the throughput of SRAM with the density of DRAM.
• Unlike conventional caching, which has unpredictable hit/miss performance, our design guarantees w.h.p. a pipelined memory abstraction that supports a new memory operation every cycle with a fixed pipeline delay.
• Used convex ordering and large deviation theory to rigorously prove robustness under adversarial accesses.
Thank You