On Power and Multi-Processors
Finishing up power issues and how those
issues have led us to multi-core processors.
Introduce multi-processor systems.
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
Stuff upcoming soon
• 3/24: HW4 due
• 3/26: MS2 due
• 3/28: MS2 meetings (sign up posted on 3/24)
• 4/2: Quiz
Power from last time
• Power is perhaps the performance limiter
– Can’t remove enough heat to keep performance
increasing
• Even for things with plugs, energy is an issue
– $$$$ for energy, $$$$ to remove heat.
• For things without plugs, energy is huge
– Cost of batteries
– Time between charging
What we did last time (1/5)
• Power is important
– It has become a (perhaps the) limiting factor in
performance
When last we met (2/5)
• Power is important to computer architects.
– It limits performance.
• With cooling techniques we can increase our power
budget (and thus our performance).
– But these techniques get very expensive very quickly, both to build and to operate.
– An active cooling system costs $ to build, and running it costs power (and thus $).
When we last met (3/5)
• Energy is important to computer architects
– Energy is what we really pay for (10¢ per kWh)
– Energy is what limits batteries (usually listed as mAh)
• AA batteries tend to have 1000-2000 mAh or (assuming 1.5V) 1.5–3.0 Wh*
• iPad battery is rated at about 25Wh.
– Some devices limited by energy
they can scavenge.
http://iprojectideas.blogspot.com/2010/06/human-power-harvesting.html
* Watch those units. Assuming it takes 5Wh of energy to charge a 3Wh battery,
how much does it cost for the energy to charge that battery? What about an iPad?
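Working the footnote: 5 Wh is 0.005 kWh, so at the 10¢/kWh figure above one charge costs about 0.005 × 10¢ ≈ 0.05¢. For the iPad, assuming the same charging overhead ratio (25 Wh battery, so perhaps ~42 Wh from the wall; that ratio is an assumption), a full charge still costs only about 0.4¢.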
When we last met (4/5): What uses power in a chip?
• Power vs. Energy
• Dynamic (Capacitive) Power Dissipation
[Figure: CMOS inverter; current I charges and discharges the load capacitance CL as the input VIN switches the output VOUT]
– Data dependent – a function of switching activity
Capacitive Power Dissipation
Power ≈ ½ · C · V² · A · f
– Capacitance C: function of wire length, transistor size
– Supply voltage V: has been dropping with successive fab generations
– Activity factor A: how often, on average, do wires switch?
– Clock frequency f: increasing…
(A worked numeric example follows.)
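To get a feel for the magnitudes, here is a minimal sketch in C that plugs made-up but plausible numbers into the formula (the C, V, A, and f values below are assumptions for illustration, not values from the lecture):

#include <stdio.h>

/* Worked example of dynamic power: P ~ 1/2 * C * V^2 * A * f.
 * All four parameter values are assumed for illustration only. */
int main(void) {
    double C = 10e-9; /* 10 nF of total switched capacitance (assumed)   */
    double V = 1.0;   /* 1.0 V supply (assumed)                          */
    double A = 0.1;   /* 10% of capacitance switches per cycle (assumed) */
    double f = 2e9;   /* 2 GHz clock (assumed)                           */
    double P = 0.5 * C * V * V * A * f;
    printf("Dynamic power: %.2f W\n", P); /* prints 1.00 W */
    return 0;
}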
Static Power: Leakage Currents
[Figure: CMOS inverter; subthreshold current ISub leaks through the nominally-off transistor and gate current Igate leaks through the gate oxide into load CL]
• Subthreshold currents grow exponentially with increases in
temperature and decreases in threshold voltage:
ISub ∝ k · e^(−q·VT / (a·ka·T))   (VT = threshold voltage, T = temperature)
• But threshold voltage scaling is key to circuit performance!
• Gate leakage primarily dependent on gate oxide thickness and biases
• Both types of leakage heavily dependent on stacking and input pattern
When last we met (5/5)
• How do performance and power relate?*
– Power is approximately proportional to V²f.
– Performance is approximately proportional to f.
– The required voltage is approximately proportional to the desired frequency.**
• If you accept all of these assumptions, we get that an X
increase in performance would require an X³ increase in power.
– This is a really important fact.
*These are all pretty rough rules of thumb. Consider the second one and discuss its shortcomings.
**This one in particular tends to hold only over fairly small (10-20%?) changes in V.
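To make the cube concrete (a worked example under the same rough assumptions): a 10% performance increase needs roughly 10% more frequency and 10% more voltage, so power grows by about 1.1 × 1.1² = 1.1³ ≈ 1.33x. A 2x performance target would cost roughly 2³ = 8x the power, though that is far outside the small-change range where the voltage assumption holds.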
What we didn’t quite do last time
• For things without plugs, energy is huge
– Cost of batteries
– Time between charging
• (somewhat extreme) Example:
– iPad has ~42Wh (11,560mAh) which is huge
– Still run down a battery in hours
– Costs $99 for a new one (2 years of use or so)
© Brehob 2014 -- Portions © Brooks, Wenisch
E vs. EDP vs. ED²P
• Currently have a processor design:
– 80W, 1 BIPS, 1.5V, 1GHz
• Want to reduce power, willing to lose some performance
• Cache Optimization:
– IPC decreases by 10%, reduces power by 20% =>
Final Processor: 900 MIPS, 64W
– Relative E = MIPS/W (higher is better) = 14/12.5 = 1.125x
• Energy is better, but is this a “better” processor?
Not necessarily
• 80W, 1 BIPS, 1.5V, 1GHz
• Cache Optimization:
– IPC decreases by 10%, reduces power by 20% =>
Final Processor: 900 MIPS, 64W
– Relative E = MIPS/W (higher is better) = 14/12.5 = 1.125x
– Relative EDP = MIPS²/W = 1.01x
– Relative ED²P = MIPS³/W = .911x
• What if we just adjust frequency/voltage on the processor?
– How to reduce power by 20%?
– P = CV²f ∝ V³ (since f ∝ V) => drop voltage (and also frequency) by 7% =>
.93 × .93 × .93 ≈ .8x
• So for equal power (64W):
– Cache Optimization = 900 MIPS
– Simple Voltage/Frequency Scaling = 930 MIPS
(A short program checking these numbers follows.)
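The metric arithmetic above is easy to check. A minimal C sketch reproducing the slide’s numbers (both design points are taken from the slide):

#include <stdio.h>

/* Checks the slide's E / EDP / ED^2P arithmetic (higher is better
 * for all three as defined here). Design points are from the slide. */
int main(void) {
    double mips0 = 1000, w0 = 80; /* baseline: 1 BIPS, 80 W          */
    double mips1 = 900,  w1 = 64; /* cache opt: -10% IPC, -20% power */

    double relE    = (mips1 / w1) / (mips0 / w0);
    double relEDP  = (mips1 * mips1 / w1) / (mips0 * mips0 / w0);
    double relED2P = (mips1 * mips1 * mips1 / w1)
                   / (mips0 * mips0 * mips0 / w0);

    printf("Relative E     = %.3f\n", relE);    /* 1.125  */
    printf("Relative EDP   = %.3f\n", relEDP);  /* ~1.013 */
    printf("Relative ED^2P = %.3f\n", relED2P); /* ~0.911 */
    return 0;
}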
So?
Performance & Power
• What we’ve concluded is that if we want to increase
performance by a factor of X, we might be looking at a factor
of X³ power!
• But if you are paying attention, that’s just for voltage scaling!
• What about other techniques?
Other techniques?
• Well, we could try to improve the amount of ILP we
take advantage of.
– That probably involves making a “wider” processor
(more superscalar)
– What are the costs associated with doing that?
How much bigger do things get?
– What do we expect the performance gains to be?
• How about circuit techniques?
– Historically the “threshold voltage” has dropped as
circuits get smaller, so power drops.
– This has (mostly) stopped being true.
– And it’s actually what got us in trouble to begin with!
So we are hosed…
• I mean, if voltage scaling doesn’t work,
circuit shrinking doesn’t help (much), and
ILP techniques don’t clearly work…
• What’s left?
• How about we drop performance to 80% of
what it was and have 2 processors?
– How much power does that take? (worked numbers follow this list)
– How much performance could we get?
– Pros/Cons?
• What if I wanted 8 processors?
– How much performance drop needed per processor?
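Working those numbers with the X³ rule of thumb (same rough assumptions as before): one core at 80% performance needs about 0.8³ ≈ 0.51x the power, so two such cores together burn about 1.02x the original power while offering up to 1.6x the throughput, if the workload parallelizes. For 8 processors in the original power budget, each core gets 1/8 of the power and thus runs at about (1/8)^(1/3) = 0.5x performance, for up to 4x total throughput.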
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Multiprocessors
Keeping it all working together
EECS 470
Why multi-processors?
• Multi-processors have been around for a long time.
– Originally used to get best performance for certain highly-parallel tasks.
– But as noted, we now use them to get solid performance per unit of
energy.
• So that’s it?
– Not so much. We need to make it possible/reasonable/easy to use them.
– Nothing comes for free: if we take a task and break it up so it runs on a
number of processors, there is going to be a price (a worked example follows).
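One standard way to quantify that price is Amdahl’s law (a classic result, not derived on these slides): if a fraction p of the work parallelizes perfectly across n processors, speedup = 1 / ((1 − p) + p/n). A minimal sketch, with assumed values of p:

#include <stdio.h>

/* Amdahl's law: speedup from running the parallel fraction p of a task
 * on n processors. Standard formula; the p values below are assumed. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("p=0.90, n=2: %.2fx\n", amdahl(0.90, 2)); /* ~1.82x */
    printf("p=0.90, n=8: %.2fx\n", amdahl(0.90, 8)); /* ~4.71x */
    printf("p=0.50, n=8: %.2fx\n", amdahl(0.50, 8)); /* ~1.78x */
    return 0;
}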
Thread-Level Parallelism
struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id,amt;
if (accts[id].bal >= amt)
{
accts[id].bal -= amt;
spew_cash();
}
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6: ...
• Thread-level parallelism (TLP)
– Collection of asynchronous tasks: not started and stopped together
– Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
accts is shared, can’t register allocate even if it were scalar
id and amt are private variables, register allocated to r1, r2
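The hazard in this example is a classic read-modify-write race: both ATM threads can load the same balance before either stores it back. Below is a minimal runnable sketch of the same scenario using POSIX threads; the account value and thread setup are assumptions for illustration:

#include <pthread.h>
#include <stdio.h>

struct acct_t { int bal; };
struct acct_t acct = { 500 };   /* shared: both threads see one copy */

/* Each thread is one ATM withdrawing $100, like the slide's code. */
void *withdraw(void *arg) {
    int amt = 100;
    if (acct.bal >= amt) {      /* ld 0(r3),r4 ; blt r4,r2,6       */
        acct.bal -= amt;        /* sub + st: NOT atomic -> race    */
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, withdraw, NULL);
    pthread_create(&t1, NULL, withdraw, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Expect 300; an unlucky interleaving can leave 400. */
    printf("final balance: %d\n", acct.bal);
    return 0;
}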
Shared-Memory Multiprocessors
• Shared memory
– Multiple execution contexts sharing a single address space
– Multiple programs (MIMD)
– Or more frequently: multiple copies of one program (SPMD)
– Implicit (automatic) communication via loads and stores
[Figure: four processors P1–P4 connected to a single Memory System]
What’s the other option?
• Basically the only other option is “message passing”
– We communicate via explicit messages.
– So instead of just changing a variable, we’d need to call a function to
pass a specific message.
• Message passing systems are easy to build and pretty efficient.
– But harder to code (a contrasting sketch follows below).
• Shared memory programming is basically the same as multi-
threaded programming on one processor.
– And (many) programmers already know how to do that.
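For contrast, here is a hedged sketch of the withdrawal in message-passing style using POSIX pipes: the parent process is the sole owner of the balance, and the child “ATM” must send an explicit request message rather than storing to a shared variable (the message layout is invented for illustration):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

struct msg { int id, amt, ok; };

int main(void) {
    int req[2], rep[2];
    pipe(req); pipe(rep);

    if (fork() == 0) {               /* child: the ATM                  */
        struct msg m = { 241, 100, 0 };
        write(req[1], &m, sizeof m); /* explicit send...                */
        read(rep[0], &m, sizeof m);  /* ...and explicit receive         */
        printf("ATM: withdrawal %s\n", m.ok ? "ok" : "denied");
        _exit(0);
    }

    int bal = 500;                   /* parent: sole owner of balance   */
    struct msg m;
    read(req[0], &m, sizeof m);      /* receive request                 */
    m.ok = (bal >= m.amt);
    if (m.ok) bal -= m.amt;          /* updates are serialized: no race */
    write(rep[1], &m, sizeof m);     /* send reply                      */
    wait(NULL);
    return 0;
}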
So Why Shared Memory?
Pluses:
+ For applications, looks like a multitasking uniprocessor
+ For OS, only evolutionary extensions required
+ Easy to do communication without OS being involved
+ Software can worry about correctness first, then performance
Minuses:
– Proper synchronization is complex
– Communication is implicit, so harder to optimize
– Hardware designers must implement it
Result:
– Traditionally bus-based Symmetric Multiprocessors (SMPs), and now
CMPs, are the most successful parallel machines ever
– And the first with multi-billion-dollar markets
Shared-Memory Multiprocessors
• There are lots of ways to connect processors together
[Figure: processors P1–P4, each with its own cache, attached via interfaces (with memories M1–M4) to an Interconnection Network]
Paired vs. Separate Processor/Memory?
• Separate processor/memory
+ Uniform memory access (UMA): equal latency to all memory
+ Simple software, doesn’t matter where you put data
– Lower peak performance
(Bus-based UMAs common: symmetric multi-processors (SMP))
• Paired processor/memory
– Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance: assuming proper data placement
[Figure: top, CPU($) modules sharing separate memories over a bus (UMA); bottom, CPU($)/Mem/Router nodes with memory paired to each processor (NUMA)]
Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn’t scale beyond ~16 processors
+ Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple “hops” to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, four CPU($)/Mem/Router nodes on a shared bus; right, four CPU($)/Mem/Router nodes connected point-to-point in a ring]
Organizing Point-To-Point Networks
• Network topology: organization of network
– Tradeoff between performance (connectivity, latency, bandwidth) and cost
• Router chips
– Networks that require separate router chips are indirect
– Networks that use processor/memory/router packages are direct
+ Fewer components, “Glueless MP”
• Point-to-point network examples
– Indirect tree (left)
– Direct mesh or ring (right)
[Figure: left, an indirect tree of router chips R with CPU($)/Mem leaves; right, a direct ring of CPU($)/Mem/Router nodes]
Implementation #1: Snooping Bus MP
[Figure: four CPU($) modules and two memories on a shared bus]
• Two basic implementations
• Bus-based systems
– Typically small: 2–8 (maybe 16) processors
– Typically processors split from memories (UMA)
– Sometimes multiple processors on single chip (CMP)
– Symmetric multiprocessors (SMPs)
– Common, I use one every day
Implementation #2: Scalable MP
[Figure: four CPU($)/Mem/Router nodes connected point-to-point]
• General point-to-point network-based systems
– Typically processor/memory/router blocks (NUMA)
– Glueless MP: no need for additional “glue” chips
– Can be arbitrarily large: 1000’s of processors
• Massively parallel processors (MPPs)
– In reality only government (DoD) has MPPs…
– Companies have much smaller systems: 32–64 processors
• Scalable multi-processors
Issues for Shared Memory Systems
• Two in particular:
– Cache coherence
– Memory consistency model
• Closely related to each other
An Example Execution
Processor 0:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
Processor 1:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
• Two $100 withdrawals from account #241 at two ATMs
– Each transaction maps to a thread on a different processor
– Track accts[241].bal (address is in r3)
[Figure: CPU0 and CPU1 sharing one Mem]
No-Cache, No-Problem
• Scenario I: processors have no caches
– No problem
Tracking accts[241].bal in memory (initially 500) as the two withdrawals run back-to-back:
– Processor 0: ld reads 500; st writes 400 => Mem = 400
– Processor 1: ld reads 400; st writes 300 => Mem = 300
Both withdrawals are reflected in the final balance of 300.
Cache Incoherence
• Scenario II: processors have write-back caches
– Potentially 3 copies of accts[241].bal: memory, P0$, P1$
– Can get incoherent (inconsistent)
Each processor runs the same withdrawal code as before. Tracking the three copies (V = valid/clean, D = dirty; Mem initially 500):

Step      P0$     P1$     Mem
P0: ld    V:500   --      500
P0: st    D:400   --      500
P1: ld    D:400   V:500   500   (P1 misses and reads stale 500 from memory!)
P1: st    D:400   D:400   500   (two dirty copies; one withdrawal is lost)
Hardware Cache Coherence
[Figure: a coherence controller (CC) sits between the bus and the cache (D$ tags and D$ data), alongside the CPU]
• Coherence controller:
– Examines bus traffic (addresses and data)
– Executes coherence protocol
– Decides what to do with the local copy when you see different
things happening on the bus
Snooping Cache-Coherence Protocols
• Bus provides serialization point
• Each cache controller “snoops” all bus transactions and takes
action to ensure coherence:
– invalidate, update, or supply value
– depends on the state of the block and the protocol
Snooping Design Choices
• Controller updates the state of blocks in response to processor
events (ld/st) and snoop events (observed bus transactions),
and generates bus transactions
– Often have duplicate cache tags
[Figure: a cache of State/Tag/Data entries, with the processor issuing ld/st on one side and snooped bus transactions arriving on the other]
• Snoopy protocol:
– set of states
– state-transition diagram
– actions
• Basic choices:
– write-through vs. write-back
– invalidate vs. update
The Simple Invalidate Snooping Protocol
• Write-through, no-write-allocate cache
• Actions: PrRd, PrWr, BusRd, BusWr
State machine (processor or snoop event / resulting bus transaction):
– Invalid => Valid on PrRd / BusRd (read miss fetches the block)
– Invalid stays Invalid on PrWr / BusWr (no-write-allocate: the write goes straight to the bus)
– Valid stays Valid on PrRd / -- and on PrWr / BusWr (write-through)
– Valid => Invalid on an observed BusWr (another cache wrote the block)
(A code sketch of this state machine follows.)
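A minimal sketch of that two-state machine in C (the event encoding and the bus_send stub are my assumptions, not from the slides):

#include <stdio.h>

/* Sketch of the two-state (Valid/Invalid) snooping protocol for a
 * write-through, no-write-allocate cache. Event names follow the slide;
 * the encoding and the bus_send stub are assumed for illustration. */
typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR, SNOOP_BUS_WR } vi_event_t;

static void bus_send(const char *xact) { printf("bus: %s\n", xact); }

vi_state_t vi_next(vi_state_t s, vi_event_t e) {
    switch (e) {
    case PR_RD:
        if (s == INVALID) bus_send("BusRd"); /* read miss: fetch block */
        return VALID;                        /* Valid: PrRd / --       */
    case PR_WR:
        bus_send("BusWr");                   /* write-through, always  */
        return s;                            /* no-write-allocate:     */
                                             /* Invalid stays Invalid  */
    case SNOOP_BUS_WR:
        return INVALID;                      /* someone else wrote it  */
    }
    return s;
}

int main(void) {
    vi_state_t s = INVALID;
    s = vi_next(s, PR_RD);        /* -> VALID,  bus: BusRd */
    s = vi_next(s, PR_WR);        /* -> VALID,  bus: BusWr */
    s = vi_next(s, SNOOP_BUS_WR); /* -> INVALID            */
    printf("final state: %s\n", s == VALID ? "Valid" : "Invalid");
    return 0;
}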
Example time
• Actions: PrRd, PrWr, BusRd, BusWr
In-class exercise: trace the protocol. The original slide interleaves the following accesses to block A between Processor 1 and Processor 2, with blank columns for each processor’s cache state and the bus transaction at each step:
Read A, Read A, Read A, Write A, Read A, Write A, Write A
Columns to fill in: Processor 1 transaction | P1 cache state | Bus | Processor 2 transaction | P2 cache state
More Generally: MOESI
[Sweazey & Smith, ISCA 1986]
• M - Modified (dirty)
• O - Owned (dirty but shared) (WHY?)
• E - Exclusive (clean, unshared): only copy, not dirty
• S - Shared
• I - Invalid
• Variants: MSI, MESI, MOSI, MOESI
[Figure: Venn-style diagram; validity spans M, O, E, S; exclusiveness spans M, E; ownership spans M, O]
MESI example
• Actions:
– PrRd, PrWr
– BRL: Bus Read Line (BusRd)
– BWL: Bus Write Line (BusWr)
– BRIL: Bus Read and Invalidate Line
– BIL: Bus Invalidate Line
In-class exercise: as before, trace the interleaved accesses to block A (split between Processor 1 and Processor 2 on the original slide), filling in each processor’s cache state and the bus transaction at each step:
Read A, Read A, Read A, Write A, Read A, Write A, Write A
MESI states, for reference:
• M - Modified (dirty)
• E - Exclusive (clean, unshared): only copy, not dirty
• S - Shared
• I - Invalid
(A code sketch of the MESI transitions follows.)
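As a study aid, here is a hedged sketch of the core MESI transitions in C. The event encoding and the other_sharers flag are simplifications of mine; a real controller also handles write-backs and supplying data to other caches, which are elided here:

/* Sketch of core MESI transitions for one cache block, from the point
 * of view of one cache controller. Assumed encoding, for illustration. */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_t;
typedef enum { MESI_PR_RD, MESI_PR_WR,
               MESI_SNOOP_BRL, MESI_SNOOP_BRIL } mesi_event_t;

mesi_t mesi_next(mesi_t s, mesi_event_t e, int other_sharers) {
    switch (e) {
    case MESI_PR_RD:
        if (s == MESI_I)                            /* miss: issue BRL,  */
            return other_sharers ? MESI_S : MESI_E; /* load as S or E    */
        return s;                                   /* hit: unchanged    */
    case MESI_PR_WR:
        return MESI_M;                          /* from S/I, first issue */
                                                /* BRIL/BIL to kill other*/
                                                /* copies (xact elided)  */
    case MESI_SNOOP_BRL:                        /* another cache reads   */
        return (s == MESI_I) ? MESI_I : MESI_S; /* M/E/S downgrade to S  */
    case MESI_SNOOP_BRIL:                       /* another cache writes, */
        return MESI_I;                          /* so invalidate copy    */
    }
    return s;
}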