On Power and Multi-Processors
Finishing up power issues and how those
issues have led us to multi-core processors.
Introduce multi-processor systems.
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
Stuff upcoming soon
• 3/24: HW4 due
• 3/26: MS2 due
• 3/28: MS2 meetings (sign up posted on 3/24)
• 4/2: Quiz
Power from last time
• Power is perhaps the performance limiter
– Can’t remove enough heat to keep performance
increasing
• Even for things with plugs, energy is an issue
– $$$$ for energy, $$$$ to remove heat.
• For things without plugs, energy is huge
– Cost of batteries
– Time between charging
What we did last time (1/5)
• Power is important
– It has become a (perhaps the) limiting factor in
performance
When last we met (2/5)
• Power is important to computer architects.
– It limits performance.
• With cooling techniques we can increase our power
budget (and thus our performance).
– But these techniques get very expensive very quickly, both to build and to operate.
– An active cooling system costs $ to build, and running it costs power (and thus $).
When we last met (3/5)
• Energy is important to computer architects
– Energy is what we really pay for (10¢ per kWh)
– Energy is what limits batteries (usually listed as mAh)
• AA batteries tend to have 1000-2000 mAh or (assuming 1.5V) 1.5–3.0 Wh*
• iPad battery is rated at about 25Wh.
– Some devices limited by energy
they can scavenge.
http://iprojectideas.blogspot.com/2010/06/human-power-harvesting.html
* Watch those units. Assuming it takes 5Wh of energy to charge a 3Wh battery,
how much does it cost for the energy to charge that battery? What about an iPad?
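Working the footnote: 5 Wh is 0.005 kWh, so at the 10¢/kWh figure above one charge costs about 0.005 × 10¢ ≈ 0.05¢. For the iPad, assuming the same charging overhead ratio (25 Wh battery, so perhaps ~42 Wh from the wall; that ratio is an assumption), a full charge still costs only about 0.4¢.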
When we last met (4/5): What uses power in a chip?
• Power vs. Energy
• Dynamic (Capacitive) Power Dissipation
[Figure: CMOS inverter; current I charges and discharges the load capacitance CL as the input VIN switches the output VOUT]
– Data dependent – a function of switching activity
Capacitive Power Dissipation
Power ≈ ½ · C · V² · A · f
– Capacitance C: function of wire length, transistor size
– Supply voltage V: has been dropping with successive fab generations
– Activity factor A: how often, on average, do wires switch?
– Clock frequency f: increasing…
(A worked numeric example follows.)
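To get a feel for the magnitudes, here is a minimal sketch in C that plugs made-up but plausible numbers into the formula (the C, V, A, and f values below are assumptions for illustration, not values from the lecture):

#include <stdio.h>

/* Worked example of dynamic power: P ~ 1/2 * C * V^2 * A * f.
 * All four parameter values are assumed for illustration only. */
int main(void) {
    double C = 10e-9; /* 10 nF of total switched capacitance (assumed)   */
    double V = 1.0;   /* 1.0 V supply (assumed)                          */
    double A = 0.1;   /* 10% of capacitance switches per cycle (assumed) */
    double f = 2e9;   /* 2 GHz clock (assumed)                           */
    double P = 0.5 * C * V * V * A * f;
    printf("Dynamic power: %.2f W\n", P); /* prints 1.00 W */
    return 0;
}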
Static Power: Leakage Currents
[Figure: CMOS inverter; subthreshold current ISub leaks through the nominally-off transistor and gate current Igate leaks through the gate oxide into load CL]
• Subthreshold currents grow exponentially with increases in
temperature and decreases in threshold voltage:
ISub ∝ k · e^(−q·VT / (a·ka·T))   (VT = threshold voltage, T = temperature)
• But threshold voltage scaling is key to circuit performance!
• Gate leakage primarily dependent on gate oxide thickness and biases
• Both types of leakage heavily dependent on stacking and input pattern
When last we met (5/5)
• How do performance and power relate?*
– Power is approximately proportional to V²f.
– Performance is approximately proportional to f.
– The required voltage is approximately proportional to the desired frequency.**
• If you accept all of these assumptions, we get that an X
increase in performance would require an X³ increase in power.
– This is a really important fact.
*These are all pretty rough rules of thumb. Consider the second one and discuss its shortcomings.
**This one in particular tends to hold only over fairly small (10-20%?) changes in V.
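To make the cube concrete (a worked example under the same rough assumptions): a 10% performance increase needs roughly 10% more frequency and 10% more voltage, so power grows by about 1.1 × 1.1² = 1.1³ ≈ 1.33x. A 2x performance target would cost roughly 2³ = 8x the power, though that is far outside the small-change range where the voltage assumption holds.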
What we didn’t quite do last time
• For things without plugs, energy is huge
– Cost of batteries
– Time between charging
• (somewhat extreme) Example:
– iPad has ~42Wh (11,560mAh) which is huge
– Still run down a battery in hours
– Costs $99 for a new one (2 years of use or so)
© Brehob 2014 -- Portions © Brooks, Wenisch
E vs. EDP vs. ED²P
• Currently have a processor design:
– 80W, 1 BIPS, 1.5V, 1GHz
• Want to reduce power, willing to lose some performance
• Cache Optimization:
– IPC decreases by 10%, reduces power by 20% =>
Final Processor: 900 MIPS, 64W
– Relative E = MIPS/W (higher is better) = 14/12.5 = 1.125x
• Energy is better, but is this a “better” processor?
Not necessarily
• 80W, 1 BIPS, 1.5V, 1GHz
• Cache Optimization:
– IPC decreases by 10%, reduces power by 20% =>
Final Processor: 900 MIPS, 64W
– Relative E = MIPS/W (higher is better) = 14/12.5 = 1.125x
– Relative EDP = MIPS²/W = 1.01x
– Relative ED²P = MIPS³/W = .911x
• What if we just adjust frequency/voltage on the processor?
– How to reduce power by 20%?
– P = CV²f ∝ V³ (since f ∝ V) => drop voltage (and also frequency) by 7% =>
.93 × .93 × .93 ≈ .8x
• So for equal power (64W):
– Cache Optimization = 900 MIPS
– Simple Voltage/Frequency Scaling = 930 MIPS
(A short program checking these numbers follows.)
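The metric arithmetic above is easy to check. A minimal C sketch reproducing the slide’s numbers (both design points are taken from the slide):

#include <stdio.h>

/* Checks the slide's E / EDP / ED^2P arithmetic (higher is better
 * for all three as defined here). Design points are from the slide. */
int main(void) {
    double mips0 = 1000, w0 = 80; /* baseline: 1 BIPS, 80 W          */
    double mips1 = 900,  w1 = 64; /* cache opt: -10% IPC, -20% power */

    double relE    = (mips1 / w1) / (mips0 / w0);
    double relEDP  = (mips1 * mips1 / w1) / (mips0 * mips0 / w0);
    double relED2P = (mips1 * mips1 * mips1 / w1)
                   / (mips0 * mips0 * mips0 / w0);

    printf("Relative E     = %.3f\n", relE);    /* 1.125  */
    printf("Relative EDP   = %.3f\n", relEDP);  /* ~1.013 */
    printf("Relative ED^2P = %.3f\n", relED2P); /* ~0.911 */
    return 0;
}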
So?
Performance & Power
• What we’ve concluded is that if we want to increase
performance by a factor of X, we might be looking at a factor
of X³ power!
• But if you are paying attention, that’s just for voltage scaling!
• What about other techniques?
Other techniques?
• Well, we could try to improve the amount of ILP we
take advantage of.
– That probably involves making a “wider” processor
(more superscalar)
– What are the costs associated with doing that?
How much bigger do things get?
– What do we expect the performance gains to be?
• How about circuit techniques?
– Historically the “threshold voltage” has dropped as
circuits get smaller, so power drops.
– This has (mostly) stopped being true.
– And it’s actually what got us in trouble to begin with!
So we are hosed…
• I mean, if voltage scaling doesn’t work,
circuit shrinking doesn’t help (much), and
ILP techniques don’t clearly work…
• What’s left?
• How about we drop performance to 80% of
what it was and have 2 processors?
– How much power does that take? (worked numbers follow this list)
– How much performance could we get?
– Pros/Cons?
• What if I wanted 8 processors?
– How much performance drop needed per processor?
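Working those numbers with the X³ rule of thumb (same rough assumptions as before): one core at 80% performance needs about 0.8³ ≈ 0.51x the power, so two such cores together burn about 1.02x the original power while offering up to 1.6x the throughput, if the workload parallelizes. For 8 processors in the original power budget, each core gets 1/8 of the power and thus runs at about (1/8)^(1/3) = 0.5x performance, for up to 4x total throughput.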
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Multiprocessors
Keeping it all working together
EECS 470
Why multi-processors?
• Multi-processors have been around for a long time.
– Originally used to get best performance for certain highly-parallel tasks.
– But as noted, we now use them to get solid performance per unit of
energy.
• So that’s it?
– Not so much. We need to make it possible/reasonable/easy to use them.
– Nothing comes for free: if we take a task and break it up so it runs on a
number of processors, there is going to be a price (a worked example follows).
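One standard way to quantify that price is Amdahl’s law (a classic result, not derived on these slides): if a fraction p of the work parallelizes perfectly across n processors, speedup = 1 / ((1 − p) + p/n). A minimal sketch, with assumed values of p:

#include <stdio.h>

/* Amdahl's law: speedup from running the parallel fraction p of a task
 * on n processors. Standard formula; the p values below are assumed. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("p=0.90, n=2: %.2fx\n", amdahl(0.90, 2)); /* ~1.82x */
    printf("p=0.90, n=8: %.2fx\n", amdahl(0.90, 8)); /* ~4.71x */
    printf("p=0.50, n=8: %.2fx\n", amdahl(0.50, 8)); /* ~1.78x */
    return 0;
}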
Thread-Level Parallelism
struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id,amt;
if (accts[id].bal >= amt)
{
accts[id].bal -= amt;
spew_cash();
}
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6: ...
• Thread-level parallelism (TLP)
– Collection of asynchronous tasks: not started and stopped together
– Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
accts is shared, can’t register allocate even if it were scalar
id and amt are private variables, register allocated to r1, r2
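The hazard in this example is a classic read-modify-write race: both ATM threads can load the same balance before either stores it back. Below is a minimal runnable sketch of the same scenario using POSIX threads; the account value and thread setup are assumptions for illustration:

#include <pthread.h>
#include <stdio.h>

struct acct_t { int bal; };
struct acct_t acct = { 500 };   /* shared: both threads see one copy */

/* Each thread is one ATM withdrawing $100, like the slide's code. */
void *withdraw(void *arg) {
    int amt = 100;
    if (acct.bal >= amt) {      /* ld 0(r3),r4 ; blt r4,r2,6       */
        acct.bal -= amt;        /* sub + st: NOT atomic -> race    */
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, withdraw, NULL);
    pthread_create(&t1, NULL, withdraw, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Expect 300; an unlucky interleaving can leave 400. */
    printf("final balance: %d\n", acct.bal);
    return 0;
}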
Shared-Memory Multiprocessors
• Shared memory
– Multiple execution contexts sharing a single address space
– Multiple programs (MIMD)
– Or more frequently: multiple copies of one program (SPMD)
– Implicit (automatic) communication via loads and stores
[Figure: four processors P1–P4 connected to a single Memory System]
What’s the other option?
• Basically the only other option is “message passing”
– We communicate via explicit messages.
– So instead of just changing a variable, we’d need to call a function to
pass a specific message.
• Message passing systems are easy to build and pretty efficient.
– But harder to code (a contrasting sketch follows below).
• Shared memory programming is basically the same as multi-
threaded programming on one processor.
– And (many) programmers already know how to do that.
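For contrast, here is a hedged sketch of the withdrawal in message-passing style using POSIX pipes: the parent process is the sole owner of the balance, and the child “ATM” must send an explicit request message rather than storing to a shared variable (the message layout is invented for illustration):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

struct msg { int id, amt, ok; };

int main(void) {
    int req[2], rep[2];
    pipe(req); pipe(rep);

    if (fork() == 0) {               /* child: the ATM                  */
        struct msg m = { 241, 100, 0 };
        write(req[1], &m, sizeof m); /* explicit send...                */
        read(rep[0], &m, sizeof m);  /* ...and explicit receive         */
        printf("ATM: withdrawal %s\n", m.ok ? "ok" : "denied");
        _exit(0);
    }

    int bal = 500;                   /* parent: sole owner of balance   */
    struct msg m;
    read(req[0], &m, sizeof m);      /* receive request                 */
    m.ok = (bal >= m.amt);
    if (m.ok) bal -= m.amt;          /* updates are serialized: no race */
    write(rep[1], &m, sizeof m);     /* send reply                      */
    wait(NULL);
    return 0;
}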
So Why Shared Memory?
Pluses:
+ For applications, looks like a multitasking uniprocessor
+ For OS, only evolutionary extensions required
+ Easy to do communication without OS being involved
+ Software can worry about correctness first, then performance
Minuses:
– Proper synchronization is complex
– Communication is implicit, so harder to optimize
– Hardware designers must implement it
Result:
– Traditionally bus-based Symmetric Multiprocessors (SMPs), and now
CMPs, are the most successful parallel machines ever
– And the first with multi-billion-dollar markets
Shared-Memory Multiprocessors
• There are lots of ways to connect processors together
[Figure: processors P1–P4, each with its own cache, attached via interfaces (with memories M1–M4) to an Interconnection Network]
Paired vs. Separate Processor/Memory?
• Separate processor/memory
+ Uniform memory access (UMA): equal latency to all memory
+ Simple software, doesn’t matter where you put data
– Lower peak performance
(Bus-based UMAs common: symmetric multi-processors (SMP))
• Paired processor/memory
– Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance: assuming proper data placement
[Figure: top, CPU($) modules sharing separate memories over a bus (UMA); bottom, CPU($)/Mem/Router nodes with memory paired to each processor (NUMA)]
Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn’t scale beyond ~16 processors
+ Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple “hops” to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, four CPU($)/Mem/Router nodes on a shared bus; right, four CPU($)/Mem/Router nodes connected point-to-point in a ring]
Organizing Point-To-Point Networks
• Network topology: organization of network
– Tradeoff between performance (connectivity, latency, bandwidth) and cost
• Router chips
– Networks that require separate router chips are indirect
– Networks that use processor/memory/router packages are direct
+ Fewer components, “Glueless MP”
• Point-to-point network examples
– Indirect tree (left)
– Direct mesh or ring (right)
[Figure: left, an indirect tree of router chips R with CPU($)/Mem leaves; right, a direct ring of CPU($)/Mem/Router nodes]
Implementation #1: Snooping Bus MP
[Figure: four CPU($) modules and two memories on a shared bus]
• Two basic implementations
• Bus-based systems
– Typically small: 2–8 (maybe 16) processors
– Typically processors split from memories (UMA)
– Sometimes multiple processors on single chip (CMP)
– Symmetric multiprocessors (SMPs)
– Common, I use one every day
Implementation #2: Scalable MP
[Figure: four CPU($)/Mem/Router nodes connected point-to-point]
• General point-to-point network-based systems
– Typically processor/memory/router blocks (NUMA)
– Glueless MP: no need for additional “glue” chips
– Can be arbitrarily large: 1000’s of processors
• Massively parallel processors (MPPs)
– In reality only government (DoD) has MPPs…
– Companies have much smaller systems: 32–64 processors
• Scalable multi-processors
Issues for Shared Memory Systems
• Two in particular:
– Cache coherence
– Memory consistency model
• Closely related to each other
An Example Execution
Processor 0:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
Processor 1:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
• Two $100 withdrawals from account #241 at two ATMs
– Each transaction maps to a thread on a different processor
– Track accts[241].bal (address is in r3)
[Figure: CPU0 and CPU1 sharing one Mem]
No-Cache, No-Problem
• Scenario I: processors have no caches
– No problem
Tracking accts[241].bal in memory (initially 500) as the two withdrawals run back-to-back:
– Processor 0: ld reads 500; st writes 400 => Mem = 400
– Processor 1: ld reads 400; st writes 300 => Mem = 300
Both withdrawals are reflected in the final balance of 300.
Cache Incoherence
• Scenario II: processors have write-back caches
– Potentially 3 copies of accts[241].bal: memory, P0$, P1$
– Can get incoherent (inconsistent)
Each processor runs the same withdrawal code as before. Tracking the three copies (V = valid/clean, D = dirty; Mem initially 500):

Step      P0$     P1$     Mem
P0: ld    V:500   --      500
P0: st    D:400   --      500
P1: ld    D:400   V:500   500   (P1 misses and reads stale 500 from memory!)
P1: st    D:400   D:400   500   (two dirty copies; one withdrawal is lost)
Hardware Cache Coherence
[Figure: a coherence controller (CC) sits between the bus and the cache (D$ tags and D$ data), alongside the CPU]
• Coherence controller:
– Examines bus traffic (addresses and data)
– Executes coherence protocol
– Decides what to do with the local copy when you see different
things happening on the bus
Snooping Cache-Coherence Protocols
• Bus provides serialization point
• Each cache controller “snoops” all bus transactions and takes
action to ensure coherence:
– invalidate, update, or supply value
– depends on the state of the block and the protocol
Snooping Design Choices
• Controller updates the state of blocks in response to processor
events (ld/st) and snoop events (observed bus transactions),
and generates bus transactions
– Often have duplicate cache tags
[Figure: a cache of State/Tag/Data entries, with the processor issuing ld/st on one side and snooped bus transactions arriving on the other]
• Snoopy protocol:
– set of states
– state-transition diagram
– actions
• Basic choices:
– write-through vs. write-back
– invalidate vs. update
The Simple Invalidate Snooping Protocol
• Write-through, no-write-allocate cache
• Actions: PrRd, PrWr, BusRd, BusWr
State machine (processor or snoop event / resulting bus transaction):
– Invalid => Valid on PrRd / BusRd (read miss fetches the block)
– Invalid stays Invalid on PrWr / BusWr (no-write-allocate: the write goes straight to the bus)
– Valid stays Valid on PrRd / -- and on PrWr / BusWr (write-through)
– Valid => Invalid on an observed BusWr (another cache wrote the block)
(A code sketch of this state machine follows.)
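A minimal sketch of that two-state machine in C (the event encoding and the bus_send stub are my assumptions, not from the slides):

#include <stdio.h>

/* Sketch of the two-state (Valid/Invalid) snooping protocol for a
 * write-through, no-write-allocate cache. Event names follow the slide;
 * the encoding and the bus_send stub are assumed for illustration. */
typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR, SNOOP_BUS_WR } vi_event_t;

static void bus_send(const char *xact) { printf("bus: %s\n", xact); }

vi_state_t vi_next(vi_state_t s, vi_event_t e) {
    switch (e) {
    case PR_RD:
        if (s == INVALID) bus_send("BusRd"); /* read miss: fetch block */
        return VALID;                        /* Valid: PrRd / --       */
    case PR_WR:
        bus_send("BusWr");                   /* write-through, always  */
        return s;                            /* no-write-allocate:     */
                                             /* Invalid stays Invalid  */
    case SNOOP_BUS_WR:
        return INVALID;                      /* someone else wrote it  */
    }
    return s;
}

int main(void) {
    vi_state_t s = INVALID;
    s = vi_next(s, PR_RD);        /* -> VALID,  bus: BusRd */
    s = vi_next(s, PR_WR);        /* -> VALID,  bus: BusWr */
    s = vi_next(s, SNOOP_BUS_WR); /* -> INVALID            */
    printf("final state: %s\n", s == VALID ? "Valid" : "Invalid");
    return 0;
}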
Example time
• Actions: PrRd, PrWr, BusRd, BusWr
In-class exercise: trace the protocol. The original slide interleaves the following accesses to block A between Processor 1 and Processor 2, with blank columns for each processor’s cache state and the bus transaction at each step:
Read A, Read A, Read A, Write A, Read A, Write A, Write A
Columns to fill in: Processor 1 transaction | P1 cache state | Bus | Processor 2 transaction | P2 cache state
More Generally: MOESI
[Sweazey & Smith, ISCA 1986]
• M - Modified (dirty)
• O - Owned (dirty but shared) (WHY?)
• E - Exclusive (clean, unshared): only copy, not dirty
• S - Shared
• I - Invalid
• Variants: MSI, MESI, MOSI, MOESI
[Figure: Venn-style diagram; validity spans M, O, E, S; exclusiveness spans M, E; ownership spans M, O]
MESI example
• Actions:
– PrRd, PrWr
– BRL: Bus Read Line (BusRd)
– BWL: Bus Write Line (BusWr)
– BRIL: Bus Read and Invalidate Line
– BIL: Bus Invalidate Line
In-class exercise: as before, trace the interleaved accesses to block A (split between Processor 1 and Processor 2 on the original slide), filling in each processor’s cache state and the bus transaction at each step:
Read A, Read A, Read A, Write A, Read A, Write A, Write A
MESI states, for reference:
• M - Modified (dirty)
• E - Exclusive (clean, unshared): only copy, not dirty
• S - Shared
• I - Invalid
(A code sketch of the MESI transitions follows.)
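As a study aid, here is a hedged sketch of the core MESI transitions in C. The event encoding and the other_sharers flag are simplifications of mine; a real controller also handles write-backs and supplying data to other caches, which are elided here:

/* Sketch of core MESI transitions for one cache block, from the point
 * of view of one cache controller. Assumed encoding, for illustration. */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_t;
typedef enum { MESI_PR_RD, MESI_PR_WR,
               MESI_SNOOP_BRL, MESI_SNOOP_BRIL } mesi_event_t;

mesi_t mesi_next(mesi_t s, mesi_event_t e, int other_sharers) {
    switch (e) {
    case MESI_PR_RD:
        if (s == MESI_I)                            /* miss: issue BRL,  */
            return other_sharers ? MESI_S : MESI_E; /* load as S or E    */
        return s;                                   /* hit: unchanged    */
    case MESI_PR_WR:
        return MESI_M;                          /* from S/I, first issue */
                                                /* BRIL/BIL to kill other*/
                                                /* copies (xact elided)  */
    case MESI_SNOOP_BRL:                        /* another cache reads   */
        return (s == MESI_I) ? MESI_I : MESI_S; /* M/E/S downgrade to S  */
    case MESI_SNOOP_BRIL:                       /* another cache writes, */
        return MESI_I;                          /* so invalidate copy    */
    }
    return s;
}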