Design and Analysis of a Robust
Pipelined Memory System
Hao Wang†, Haiquan (Chuck) Zhao*,
Bill Lin†, and Jun (Jim) Xu*
†University of California, San Diego
*Georgia Institute of Technology
Infocom 2010, San Diego
Memory Wall
• Modern Internet routers need to manage large
amounts of packet- and flow-level data at line rates
• e.g., need to maintain per-flow records during a
monitoring period, but
– Core routers have millions of flows, translating to
100’s of megabytes of storage
– 40 Gb/s OC-768 link, new packet can arrive every 8 ns
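The 8 ns figure is just the transmission time of a minimum-size packet at line rate; a quick sanity check (assuming 40-byte minimum packets, an illustrative worst case):

```python
# Worst-case packet inter-arrival time on a 40 Gb/s OC-768 link,
# assuming 40-byte minimum-size packets (an illustrative worst case).
LINK_RATE_BPS = 40e9
MIN_PACKET_BITS = 40 * 8

interarrival_ns = MIN_PACKET_BITS / LINK_RATE_BPS * 1e9
print(f"{interarrival_ns:.0f} ns")  # → 8 ns
```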
Memory Wall
• SRAM/DRAM dilemma
• SRAM: access latency typically 5-15 ns (fast enough for the 8 ns line rate)
• But SRAM capacity is substantially inadequate in many cases: the largest devices are typically about 4 MB (far less than the 100's of MB needed)
Memory Wall
• DRAM provides inexpensive bulk storage
• But random access latency is typically 50-100 ns (much slower than the 8 ns needed for a 40 Gb/s line rate)
• Conventional wisdom is that DRAMs
are not fast enough to keep up with
ever-increasing line rates
Memory Design Wish List
• Line rate memory bandwidth (like SRAM)
• Inexpensive bulk storage (like DRAM)
• Predictable performance
• Robustness to adversarial access patterns
Main Observation
• Modern DRAMs can be fast and
cheap!
– Graphics, video games, and HDTV
– At commodity pricing, just
$0.01/MB currently, $20 for 2GB!
Example: Rambus XDR Memory
• 16 internal banks
Memory Interleaving
• Performance achieved through memory interleaving
– e.g. suppose we have B = 6 DRAM banks and access
pattern is sequential
[Diagram: sequential requests 1, 2, 3, ... striped across the B = 6 banks; bank 1 serves requests 1, 7, 13, ..., bank 2 serves 2, 8, 14, ..., and so on up to bank 6 serving 6, 12, 18, ....]
– Effective memory bandwidth B times faster
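The round-robin striping above amounts to a modulo mapping from address to bank; a minimal sketch (B = 6 as in the example, function name is illustrative):

```python
B = 6  # number of DRAM banks, as in the example

def bank(addr: int) -> int:
    """Round-robin interleaving: consecutive addresses hit consecutive banks."""
    return addr % B

# Sequential requests spread evenly across the banks, so each bank serves
# only every 6th request and effective bandwidth is B times a single bank's.
print([bank(a) for a in (1, 7, 13)])   # → [1, 1, 1]  (same bank, 6 apart)
print([bank(a) for a in range(1, 7)])  # → [1, 2, 3, 4, 5, 0]
```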
Memory Interleaving
• But, suppose access pattern is as follows:
[Diagram: the access pattern repeatedly targets addresses 1, 7, 13, 19, 25, ..., which all fall on the same bank; that bank's queue grows while the other banks sit idle.]
• Memory bandwidth degrades to worst-case DRAM
latency
Memory Interleaving
• One solution is to apply pseudo-randomization of
memory locations
[Diagram: the same banks with memory locations pseudo-randomly permuted across them.]
Adversarial Access Patterns
• However, memory bandwidth can still degrade to
worst-case DRAM latency even with randomization:
1. Lookups to the same global variable will trigger accesses to the same memory bank
2. An attacker can flood the link with packets carrying the same TCP/IP header, triggering updates to the same memory location, and hence the same memory bank, regardless of the randomization function.
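The second point holds because any fixed permutation, however random-looking, sends equal addresses to the same bank. A minimal illustration (the salted hash here is a stand-in, not the paper's permutation):

```python
import hashlib

def permuted_bank(addr: int, B: int = 32, salt: bytes = b"secret") -> int:
    """Pseudo-random address-to-bank map (illustrative stand-in)."""
    digest = hashlib.sha256(salt + addr.to_bytes(8, "big")).digest()
    return digest[0] % B

# However well the permutation scrambles addresses, a flood of packets
# carrying the same header (hence the same address) lands on a single bank:
hits = {permuted_bank(0xDEADBEEF) for _ in range(1000)}
print(len(hits))  # → 1
```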
Outline
• Problem and Background
→Proposed Design
• Theoretical Analysis
• Evaluation
Pipelined Memory Abstraction
Emulates SRAM with Fixed Delay
[Figure: two timing diagrams of the same sequence of reads and writes to addresses a, b, c. The ideal SRAM (top) returns read data in the same cycle the read is issued; the SRAM emulation (bottom) returns the data of a read issued at cycle t at exactly cycle t + D, so outputs appear at cycles D, D+1, D+2, ....]
Implications of Emulation
• Fixed pipeline delay: If a read operation is issued at time t to an emulated SRAM, the data is available from the memory controller at exactly t + D (instead of in the same cycle).
• Coherency: The read operations output the same
results as an ideal SRAM system.
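This emulation contract can be modeled as a delay line in front of an ideal backing store. A minimal sketch (class and method names are assumptions; the real design uses DRAM banks and a reservation table, not a dictionary):

```python
from collections import deque

class FixedDelayMemory:
    """Emulated SRAM: a read issued at cycle t is answered at exactly t + D.
    The dict backing store stands in for the DRAM banks of the real design."""

    def __init__(self, D: int):
        self.D = D
        self.mem = {}        # address -> value
        self.pipe = deque()  # (ready_cycle, value), in issue order

    def issue(self, cycle: int, op: str, addr: int, data=None):
        if op == "W":
            self.mem[addr] = data  # writes take effect in issue order
        else:
            self.pipe.append((cycle + self.D, self.mem.get(addr)))

    def output(self, cycle: int):
        """Return the value leaving the pipeline this cycle, if any."""
        if self.pipe and self.pipe[0][0] == cycle:
            return self.pipe.popleft()[1]
        return None

m = FixedDelayMemory(D=3)
m.issue(0, "W", 5, "x")
m.issue(1, "R", 5)
print(m.output(4))  # → x
```

Reads come back in issue order with a constant latency, which is exactly the coherency plus fixed-delay behavior the slide describes.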
Proposed Solution: Basic Idea
• Keep an SRAM reservation table of the memory operations and data from the last C cycles
• Avoid issuing a new DRAM operation for memory references to the same location within C cycles
Details of Memory Architecture
[Figure: memory architecture. Input operations (op, addr, data) pass through a random address permutation into B per-bank request buffers feeding the DRAM banks. A reservation table covering the last C cycles of operations, together with MRI and MRW tables (CAMs), tracks recent reads and writes; read results leave on the data-out port.]
Merging of Operations
• Requests arrive from right to left (the newer operation is written on the left).
1. READ + WRITE → WRITE (the read copies its data from the write)
2. WRITE + WRITE → WRITE (the 2nd write overwrites the 1st write)
3. READ + READ → READ (the 2nd read copies data from the 1st read)
4. WRITE + READ → WRITE + READ (both operations must still go to DRAM)
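The four merge rules can be sketched as a small function (encoding operations as "R"/"W" strings is an assumption; here the first argument arrived earlier in time):

```python
def merge(first: str, second: str) -> list:
    """Merge two operations on the same address within C cycles.
    `first` arrived earlier; returns the ops that must still reach DRAM."""
    if first == "W" and second == "W":
        return ["W"]   # second write overwrites the first
    if first == "R" and second == "R":
        return ["R"]   # second read copies data from the first read
    if first == "W" and second == "R":
        return ["W"]   # the read copies its data from the write
    return ["R", "W"]  # read then write: the read still needs DRAM data

print(merge("W", "R"))  # → ['W']
print(merge("R", "W"))  # → ['R', 'W']
```

Only the last case forces two DRAM operations, which is what bounds how fast an attacker can load any one bank.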
Proposed Solution
• Rigorously prove that, with merging, the worst-case delay for a memory operation is bounded by some fixed D w.h.p.
• Provide a pipelined memory abstraction in which an operation issued at time t completes at exactly time t + D (instead of in the same cycle).
• The reservation table, with C > D, also implements the pipeline delay, as well as serving as a "cache".
Outline
• Problem and Background
• Proposed Design
→Theoretical Analysis
• Evaluation
Robustness
• At most one write operation in a request buffer every
C cycles to a particular memory address.
• At most one read operation in a request buffer every
C cycles to a particular memory address.
• At most one read operation followed by one write
operation in a request buffer every C cycles to a
particular address.
Theoretical Analysis
• Worst case analysis
• Convex ordering
• Large deviation theory
• Prove: with a cache of size C, the best an attacker can do
is to send repetitive requests every C+1 cycles.
Bound on Overflow Probability
• Want to bound the probability that a request buffer
overflows in n cycles
Pr[overflow] ≤ Σ_{0 ≤ s < t ≤ n} Pr[D_{s,t}]
D_{s,t} := { ω : X_{s,t}(ω) ≥ μτ + K }
• X_{s,t} is the number of updates to a bank during cycles [s, t], τ = t - s, and K is the length of a request queue.
• For the total overflow probability bound, multiply by B.
Chernoff Inequality
Pr[D_{s,t}] ≤ Pr[X ≥ K + μτ]
           = Pr[e^{θX} ≥ e^{θ(K + μτ)}]
           ≤ E[e^{θX}] / e^{θ(K + μτ)}
• Since this is true for all θ > 0,
Pr[D_{s,t}] ≤ min_{θ > 0} E[e^{θX}] / e^{θ(K + μτ)}
• We want to find the update sequence that maximizes E[e^{θX}].
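To see how the minimization over θ behaves, here is a numerical sketch that assumes, purely for illustration, X ~ Binomial(τ, p) so that E[e^{θX}] = (1 - p + p·e^θ)^τ; the paper instead characterizes the worst-case X via convex ordering:

```python
import math

def chernoff_bound(tau: int, p: float, mu: float, K: int) -> float:
    """min over theta > 0 of E[e^{theta X}] / e^{theta (K + mu*tau)}
    for X ~ Binomial(tau, p) -- an illustrative distribution only."""
    target = K + mu * tau
    best_log = 0.0  # theta -> 0 gives the trivial bound 1
    theta = 0.01
    while theta <= 10.0:
        # log E[e^{theta X}] = tau * log(1 - p + p*e^theta) for a binomial
        log_bound = tau * math.log(1 - p + p * math.exp(theta)) - theta * target
        best_log = min(best_log, log_bound)
        theta += 0.01
    return math.exp(best_log)

# Parameters echoing the evaluation section: 1/B chance of hitting a given
# bank, service rate mu = 1/10, window tau = C = 8000 cycles.
print(chernoff_bound(tau=8000, p=1 / 32, mu=1 / 10, K=150))
```

With these (assumed) parameters the bound is vanishingly small, and it shrinks further as K grows, matching the qualitative shape of the evaluation curves.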
Worst Case Request Patterns
T = ⌈τ/C⌉
q1, q2, and r split the μτ updates so that 2T·q1 + (2T - 1)·q2 + r = μτ, with
q2 = ⌊(μτ - 2T·q1) / (2T - 1)⌋ and r = μτ - 2T·q1 - (2T - 1)·q2
• q1 + q2 + 1 requests go to distinct counters a_1, ..., a_{q1+q2+1}
• q1 of these counters are requested 2T times each
• q2 counters are requested 2T - 1 times each
• 1 counter is requested r times
Outline
• Problem and Background
• Proposed Design
• Theoretical Analysis
→Evaluation
Evaluation
• Overflow probability for 16 million addresses,
µ=1/10, and B=32.
[Plot: overflow probability bound (log scale, 10^0 down to 10^-14) versus queue length K from 80 to 180, one curve per cache size C = 6000, 7000, 8000, 9000; larger C yields a smaller bound. Annotation: SRAM 156 KB, CAM 24 KB.]
Evaluation
• Overflow probability for 16 million addresses,
µ=1/10, and C=8000.
[Plot: overflow probability bound (log scale, 10^0 down to 10^-30) versus request buffer size K from 80 to 180, one curve per number of banks B = 32, 34, 36, 38; larger B yields a smaller bound.]
Conclusion
• Proposed a robust memory architecture that provides the throughput of SRAM with the density of DRAM.
• Unlike conventional caching, which has unpredictable hit/miss performance, our design guarantees w.h.p. a pipelined memory abstraction that supports a new memory operation every cycle with a fixed pipeline delay.
• Used convex ordering and large deviation theory to rigorously prove robustness under adversarial accesses.
Thank You