Computer Structure 2014 – Advanced Topics
Lihu Rappoport and Adi Yoaz

Intel® Core™ μArch

Tick/Tock Development Model
• Nehalem – 45nm – TOCK: new Intel® μarch, 1st Generation Intel® Core™
• Westmere – 32nm – TICK: Intel® μarch (Nehalem), 1st Generation Intel® Core™
• Sandy Bridge – 32nm – TOCK: new Intel® μarch, 2nd Generation Intel® Core™
• Ivy Bridge – 22nm – TICK: Intel® μarch (Sandy Bridge), 3rd Generation Intel® Core™
• Haswell – 22nm – TOCK: new Intel® μarch, 4th Generation Intel® Core™

Haswell CPU
• 22nm process technology
• Haswell builds upon innovations in the 2nd and 3rd Generation Intel® Core™ i3/i5/i7 processors (Sandy Bridge and Ivy Bridge)
(Foil taken from IDF 2012)

5th Generation Intel® Core™ Processor
• 14nm process technology
• Broadwell micro-architecture

Haswell Core at a Glance
• Next generation branch prediction
– Improves performance and saves wasted work
• Improved front-end
– Initiate TLB and cache misses speculatively
– Handle cache misses in parallel to hide latency
– Leverages improved branch prediction
• Deeper buffers
– Extract more instruction parallelism
– More resources when running a single thread
• More execution units, shorter latencies
– Power down when not in use
• More load/store bandwidth
– Better prefetching, better cache-line-split latency & throughput, double L2 bandwidth
• New modes save power without losing performance
• No pipeline growth
– Same branch misprediction latency
– Same L1/L2 cache latency
(Foil taken from IDF 2012)

Branch Prediction Unit
• Predicts branch targets
– Direct calls and jumps – target provided by a Target Array
– Indirect calls and jumps – predicted either as having a fixed target or as having targets that vary based on the execution path
– Returns – predicted by a 16-entry Return Stack Buffer (RSB)
• For conditional branches – predicts whether taken or not
• The BPU makes predictions for 32 bytes at a time
– Twice the width of the fetch engine
– Enables taken branches to be predicted with no penalty
(From the Optimization Manual / IDF 2012)

Instruction TLB
• Initiate TLB and cache misses speculatively
– Handle cache misses in parallel to hide latency
– Leverages improved branch prediction
(From the Optimization Manual / IDF 2012)

Instruction Fetch and Pre-decode
• 32KB 8-way I-Cache
– Fetches aligned 16 bytes per cycle
– Typical programs average ~4 bytes per instruction
– A misaligned target into the line or a taken branch out of the line reduces the effective number of instruction bytes fetched
– In typical integer code there is a taken branch every ~10 instructions, which translates into a partial fetch every 3–4 cycles
• The Pre-Decode Unit
– Determines the length of each instruction
– Decodes all prefixes associated with instructions
– Marks various properties of instructions for the decoders, for example "is branch"
(From the Optimization Manual)

Instruction Pre-decode and IQ
• The pre-decode unit writes 6 instructions/cycle into the Instruction Queue (IQ)
– If a fetch line contains more than 6 instructions, it continues to pre-decode 6 instructions/cycle until all instructions in the fetch line are written into the IQ
– The subsequent fetch line starts pre-decode only after the current fetch completes
– A fetch line of 7 instructions thus takes 2 cycles to pre-decode – an average of 3.5 inst/cycle (still higher than the IPC of most apps)
• Length-changing prefixes (LCPs) change the instruction length
– Each LCP within the fetch line takes 3 cycles to pre-decode
• The Instruction Queue is 18 instructions deep
(From the Optimization Manual)

Instruction Decode and μop Queue
• 4 decoders, which decode instructions into μops
– The 1st decoder can decode all instructions of up to 4 μops
– The other 3 decoders handle common single-μop instructions
• Instructions with more than 4 μops generate their μops from the MSROM
• Up to 4 μops/cycle are delivered into the μop Queue
– Buffers 56 μops (28 μops/thread when running 2 threads)
– Decouples the front end and the out-of-order engine
– Helps hide bubbles introduced between the various μop sources in the front end
(From the Optimization Manual)

Macro-Fusion
• The IQ sends up to 5 instructions/cycle to the decoders
• Two adjacent instructions are merged into a single μop
– A macro-fused instruction executes with a single dispatch – reduces latency and frees execution resources
– Increased decode, rename and retire bandwidth
– Power savings from representing more work in fewer bits
• The 1st instruction of the pair modifies the flags: CMP, TEST, ADD, SUB, AND, INC, DEC
• The 2nd instruction of the pair is a conditional branch
• These pairs are common in many types of applications
(From the Optimization Manual)
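To make the fusible pair concrete, here is a small illustration (not part of the original foils): with typical optimizing compilation, the test at the bottom of a counted loop becomes an adjacent compare-and-conditional-branch of exactly the kind the decoders can macro-fuse.

```c
#include <stddef.h>

/* With typical -O2 code generation, the bottom of this loop becomes a pair like
 *     cmp  rax, rcx     ; flag-setting instruction
 *     jne  .Lloop       ; conditional branch
 * i.e. the CMP + Jcc pattern the decoders can merge into one macro-fused uop,
 * saving decode, rename and retire bandwidth. */
long sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i != n; i++)   /* i != n -> CMP + JNE at the loop bottom */
        s += a[i];
    return s;
}
```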
Stack Pointer Tracker
• PUSH, POP, CALL and RET implicitly update ESP
– Adding or subtracting the offset would require a dedicated μop
– The Stack Pointer Tracker performs the implicit ESP updates instead, by tracking an offset Δ relative to ESP
• Example (Δ starts at 0):
– PUSH EAX: instead of "ESP = ESP − 4; STORE [ESP], EAX", issue only "STORE [ESP−4], EAX" and track Δ = Δ − 4 (Δ = −4)
– PUSH EBX: instead of "ESP = ESP − 4; STORE [ESP], EBX", issue only "STORE [ESP−8], EBX" and track Δ = Δ − 4 (Δ = −8)
– INC ESP: the real ESP is needed, so sync first – "ESP = SUB ESP, 8" (Δ = 0), then "ESP = ADD ESP, 1"
• Provides the following benefits
– Improves decode bandwidth: PUSH, POP and RET become single-μop instructions
– Conserves rename, execution and retire resources
– Removes dependencies on ESP – stack operations can execute in parallel
– Saves power
(From the Optimization Manual and ITJ)

Micro-Fusion
• Fuses multiple μops from the same instruction into a single μop
– An instruction which decodes into a single micro-fused μop can be handled by all decoders
– Improves the instruction bandwidth delivered from decode to retirement and saves power
– A micro-fused μop is dispatched multiple times in the OOO engine, as it would be if it were not micro-fused
• Micro-fused instructions
– Stores, which are comprised of 2 μops – store-address and store-data – fused into a single μop
– Load + op instructions, e.g., FADD DOUBLE PTR [RDI+RSI*8]
– Load + jump, e.g., JMP [RDI+200]
(From the Optimization Manual)

Decoded μop-Cache
• Caches the μops coming out of the decoders
– Up to 1.5K μops (32 sets × 8 ways × 6 μops/way)
– The next time the same instructions are fetched, their μops are taken from the μop-Cache
– ~80% hit rate for most applications
– Included in the IC and iTLB; flushed on a context switch
• Higher bandwidth and lower latency
– More cycles sustaining 4 instructions/cycle
– In each cycle, provides μops for instructions mapped to 32 bytes
– Able to 'stitch' across taken branches in the control flow
(From the Optimization Manual)

Decoded μop-Cache (cont.)
• The Decoded μop-Cache lets the normal front end sleep
– Decode one time instead of many times
• Branch misprediction penalty reduced
– The correct path is also the most efficient path
• Save power while increasing performance
(Foil taken from IDF 2011)

Loop Stream Detector (LSD)
• The LSD detects small loops that fit in the μop queue
– The μop queue streams the loop, allowing the front-end to sleep
– Until a branch misprediction inevitably ends it
• Loops qualify for LSD replay if all of the following conditions are met
– Up to 28 μops, with up to 8 taken branches and up to 8 32-byte chunks
– All μops are also resident in the μop-Cache
– No CALL or RET
– No mismatched stack operations (e.g., more PUSHes than POPs)
(From the Optimization Manual)
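As an illustration (not from the original foils), a hot loop of just a few μops, with no calls or returns and balanced stack usage, is the kind of loop the LSD can replay from the μop queue:

```c
/* A tiny hot loop: a handful of uops per iteration, no CALL/RET and no
 * unbalanced PUSH/POP, so it comfortably fits the ~28-uop limit and can
 * be replayed from the uop queue while the fetch/decode pipeline sleeps. */
unsigned byte_checksum(const unsigned char *p, unsigned len)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++)
        sum += p[i];          /* load + add + loop overhead */
    return sum;
}
```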
The Renamer
• Moves up to 4 μops/cycle from the μop queue to the out-of-order engine
– Renames architectural sources and destinations of the μops to micro-architectural sources and destinations
– Allocates resources to the μops, e.g., load or store buffers
– Binds each μop to an appropriate dispatch port
– Up to 2 branches each cycle
• Some μops are executed to completion during rename, effectively costing no execution bandwidth
– A subset of register-to-register MOV and FXCHG
– Zero-Idioms
– NOP
(From the Optimization Manual)

Dependency Breaking Idioms
• Zero-Idiom – an instruction that zeroes a register
– Regardless of the input data, the output data is always 0
– E.g., XOR REG, REG and SUB REG, REG
– No μop dependency on its sources
• Zero-Idioms are detected and removed by the Renamer
– They do not consume execution resources and have zero execution latency
– They remove partial-register dependencies and thereby improve instruction parallelism
• Ones-Idiom – an instruction that sets a register to "all 1s"
– Regardless of the input data, the output is always "all 1s", e.g., CMPEQ XMM1, XMM1
– No μop dependency on its sources, as with the zero idiom, so it can execute as soon as it finds a free execution port
– As opposed to the Zero-Idiom, however, the Ones-Idiom μop must execute
(From the Optimization Manual)
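A small illustration in C (not from the original foils) of what these idioms look like at the source level, using SSE2 intrinsics:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Zero-idiom: XORing a register with itself always produces 0, so the
 * renamer can complete it without dispatching it to an execution unit
 * and without creating a dependency on the "source" register. */
__m128i make_zero(__m128i x)
{
    return _mm_xor_si128(x, x);      /* typically compiles to PXOR xmm, xmm */
}

/* Ones-idiom: compare-equal of a register with itself always yields all 1s.
 * It also has no dependency on its source, but unlike the zero-idiom the
 * uop still has to execute on a port. */
__m128i make_all_ones(__m128i x)
{
    return _mm_cmpeq_epi32(x, x);    /* typically compiles to PCMPEQD xmm, xmm */
}
```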
Out-of-Order Cluster
• The Scheduler queues μops until all source operands are ready
– Schedules and dispatches ready μops to the available execution units in as close to a first-in-first-out (FIFO) order as possible
• Physical Register File (PRF)
– Used instead of a centralized Retirement Register File
– A single copy of every datum, with no movement after calculation
– Allows a significant increase in buffer sizes – the dataflow window is ~33% larger
• Retirement
– Retires μops in order and handles faults and exceptions
(Foil taken from IDF 2011)

Intel® Advanced Vector Extensions
• Vectors are a natural data-type for many apps
– Extends the SSE FP instruction set to 256-bit operand size
– Extends all 16 XMM registers (128 bits) to 256-bit YMM registers
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New operations to enhance vectorization
– Broadcasts, masked load & store
• Wide vectors + non-destructive sources: more work with fewer instructions
• Extending the existing state is area- and power-efficient
(Foil taken from IDF 2011)
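A brief illustration with AVX intrinsics (not from the original foils) of a 256-bit operation and the three-operand, non-destructive form it compiles to:

```c
#include <immintrin.h>   /* AVX intrinsics (compile with -mavx) */

/* Adds 8 single-precision floats at once. The addition compiles to the
 * non-destructive three-operand form, e.g. VADDPS ymm0, ymm1, ymm2,
 * so neither source register is overwritten and no extra copy is needed. */
void add8(const float *a, const float *b, float *c)
{
    __m256 va = _mm256_loadu_ps(a);                  /* 256-bit load */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));      /* 256-bit add + store */
}
```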
Execution Cluster
• The scheduler sees a matrix: 3 "ports" to 3 "stacks" of execution units
– General-purpose integer
– SIMD (vector) integer
– SIMD floating point
• Challenge: double the output of one of these stacks in a manner that is invisible to the others
[Diagram: Ports 0, 1 and 5, each dispatching to the GPR stack (ALU, JMP, DIV), the SIMD INT stack (VI ADD, VI MUL, VI Shuffle, Blend) and the SIMD FP stack (FP ADD, FP MUL, FP Shuffle, FP Boolean, Blend)]
(Foil taken from IDF 2011)

Execution Cluster – 256-bit Vectors
• Solution: repurpose existing data paths for dual use
– SIMD integer and legacy SIMD FP use the legacy 128-bit stack style
– Intel® AVX utilizes both 128-bit execution stacks
• Double FLOPs: 256-bit Multiply + 256-bit ADD + 256-bit Load per clock
• A "cool" implementation of Intel AVX – double your FLOPs with great energy efficiency
(Foil taken from IDF 2011)

Haswell AVX2: FMA & Peak FLOPS
• 2 new FMA (Fused Multiply-Add) units provide 2× peak FLOPs/cycle
– Nehalem, SSE (128 bits): 8 SP FLOPs/cycle, 4 DP FLOPs/cycle
– Sandy Bridge, AVX (256 bits): 16 SP FLOPs/cycle, 8 DP FLOPs/cycle
– Haswell, AVX2 & FMA: 32 SP FLOPs/cycle, 16 DP FLOPs/cycle
• 2× cache bandwidth to feed the wide vector units
– 32-byte load/store for the L1, 2× L2 bandwidth
• 5-cycle FMA latency, the same as an FP multiply
– Latency (clks), prior generation vs. Haswell: MulPS/PD 5 vs. 5; AddPS/PD 3 vs. 3; Mul+Add vs. FMA 8 vs. 5 – a 1.6× ratio
[Chart: peak cache bytes/clock vs. peak FLOPs/clock, from Banias and Merom up to Sandy Bridge and Haswell]
(Foil taken from IDF 2012)
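A short FMA illustration (not from the original foils), showing the fused multiply-add the table refers to:

```c
#include <immintrin.h>   /* AVX2 / FMA intrinsics (compile with -mavx2 -mfma) */

/* y[i] = a*x[i] + y[i]. Each _mm256_fmadd_ps is a single 5-cycle FMA uop
 * doing 8 multiplies and 8 adds, instead of a dependent multiply (5 cycles)
 * followed by an add (3 cycles). With two FMA ports the core can start two
 * such uops per cycle: 2 x 8 = 16 SP FMAs = 32 FLOPs per clock at peak. */
void saxpy(float a, const float *x, float *y, int n)
{
    __m256 va = _mm256_set1_ps(a);                            /* broadcast a */
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy)); /* a*x + y */
    }
    for (; i < n; i++)                                        /* scalar tail */
        y[i] = a * x[i] + y[i];
}
```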
Haswell Execution Units
• 2× FMA
– Doubles peak FLOPs
– Two FP multiply units also benefit legacy code
• 4th integer ALU (on Port 6)
– Great for integer workloads
– Frees Ports 0 & 1 for vector work
• New branch unit (on Port 6)
– Reduces Port 0 conflicts
– A 2nd execution unit for branch-heavy code
• New AGU for stores (Port 7)
– Leaves Ports 2 & 3 open for loads
[Diagram: a unified reservation station dispatching to 8 ports – Port 0: Integer ALU & Shift, FMA/FP Multiply, Vector Int Multiply, Vector Logicals, Vector Shifts, Divide, Branch; Port 1: Integer ALU & LEA, FMA/FP Multiply, FP Add, Vector Int ALU, Vector Logicals; Ports 2 & 3: Load & Store Address; Port 4: Store Data; Port 5: Integer ALU & LEA, Vector Shuffle, Vector Int ALU, Vector Logicals; Port 6: Integer ALU & Shift, Branch; Port 7: Store Address]
(Foil taken from IDF 2012)

Memory Cluster
• The L1 D$ handles two 256-bit loads plus one 256-bit store per cycle (32×3 bytes/cycle)
• The ML$ (L2) can service one cache line (64 bytes) each cycle
• 72 load buffers keep load μops from allocation till retirement
– Re-dispatch blocked loads
• 42 store buffers keep store μops from allocation until the store value is written to the L1 D$ – or written to the line fill buffers, for non-temporal stores
(From the Optimization Manual)

L1 Data Cache
• Handles two 256-bit loads plus one 256-bit store per cycle
• Maintains requests which cannot be completed
– Cache misses
– Unaligned accesses that split across cache lines
– Data not ready to be forwarded from a preceding store
– Loads blocked due to cache line replacement
• Handles up to 10 outstanding cache misses in the line fill buffers (LFBs)
– Continues to service incoming stores and loads
• The L1 D$ is a write-back, write-allocate cache
– Stores that hit in the L1 D$ do not update the L2/LLC/memory
– Stores that miss the L1 D$ allocate a cache line
(From the Optimization Manual)

Stores
• Stores to memory are executed in two phases
• At EXE
– Fill a store buffer with the linear + physical address and with the data
– Once the store address and data are known, forward the store data to load operations that need it
• After the store retires – the completion phase
– First, the line must be in the L1 D$ in E or M MESI state; otherwise, fetch it using a Read-For-Ownership request, from the L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → memory
– Read the data from the store buffer and write it to the L1 D$ in M state – done at retirement to preserve the order of memory writes
– Release the store buffer entry taken by the store
– This affects performance only if the store buffer becomes full (allocation stalls); loads needing the store data get it when the store is executed
(From the Optimization Manual)

Store Forwarding
• Forward data directly from a store to a load that needs it
– Instead of having the store write the data to the DCU and then the load read the data from the DCU
• Store-to-load forwarding conditions
– The store is the last store to that address prior to the load (requires the addresses of all stores older than the load to be known)
– The store contains all of the data being loaded
– The load is from a write-back memory type
– Neither the load nor the store is a non-temporal access
– The load is not a 4- or 8-byte load that crosses an 8-byte boundary, relative to the preceding 16- or 32-byte store
– The load does not cross a 16-byte boundary of a 32-byte store
(From the Optimization Manual)
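An illustrative sketch (not from the original foils) contrasting a load that meets the forwarding conditions with one that violates them; whether the wide copy really becomes a single 32-byte store depends on the compiler, so treat that as an assumption:

```c
#include <stdint.h>
#include <string.h>

/* Fast case: the 8-byte load reads exactly the bytes written by the most
 * recent 8-byte store to the same address, so the data can be forwarded
 * straight from the store buffer. */
static inline uint64_t roundtrip_fast(uint64_t v, uint64_t *slot)
{
    *slot = v;        /* 8-byte store */
    return *slot;     /* 8-byte load of the same address: forwarded */
}

/* Slow case (assuming the 32-byte memcpy compiles to one wide store):
 * the 8-byte load starts at offset 4, crossing an 8-byte boundary of the
 * stored data, which violates the forwarding rules above, so the load
 * typically waits until the store data reaches the L1 D$. */
static inline uint64_t roundtrip_slow(const uint8_t src[32], uint8_t buf[32])
{
    uint64_t x;
    memcpy(buf, src, 32);       /* candidate 32-byte (AVX) store */
    memcpy(&x, buf + 4, 8);     /* 8-byte load crossing an 8-byte boundary */
    return x;
}
```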
Memory Disambiguation
• A load may depend on a preceding store
– A load is blocked until all preceding store addresses are known
• Predict which loads do not depend on any previous store
– Forward data from a store or from the L1 D$ even when not all previous store addresses are known
– The prediction is verified, and if a conflict is detected, the load and all succeeding instructions are re-fetched
– The hardware always assumes a dependency between a load and an earlier store that have the same address bits 0:11
• The following loads are not disambiguated
– Loads that cross the 16-byte boundary
– 32-byte Intel AVX loads that are not 32-byte aligned
(From the Optimization Manual)
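To illustrate the bits-0:11 rule (an illustration, not from the original foils): two addresses exactly 4 KB apart share their low 12 address bits, so a load is conservatively treated as dependent on such a store even though the accesses never overlap – the well-known "4K aliasing" effect.

```c
/* The store to *p and the load from *q never touch the same memory, but
 * q = p + 1024 ints = p + 4096 bytes, so both addresses have identical
 * bits 0:11. The disambiguation logic therefore assumes the load depends
 * on the store, delaying it as if they truly aliased. Buffers allocated a
 * multiple of 4 KB apart can hit this pattern in tight copy loops. */
int false_dependency(int *p)
{
    int *q = p + 1024;   /* 4096 bytes away: same address bits 0:11 */
    *p = 1;              /* store */
    return *q;           /* load: falsely predicted to depend on the store */
}
```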
Data Prefetching
• Two hardware prefetchers load data into the L1 D$
• DCU prefetcher (the streaming prefetcher)
– Triggered by an ascending access to very recently loaded data
– Fetches the next line, assuming a streaming load
• Instruction-pointer (IP) based stride prefetcher
– Tracks individual load instructions, detecting a regular stride
– Prefetch address = current address + stride
– Detects strides of up to 2K bytes, both forward and backward
(From the Optimization Manual)
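A small illustration (not from the original foils) of an access pattern the IP-based stride prefetcher can cover:

```c
#include <stddef.h>

/* Walking one column of a row-major matrix issues the same load
 * instruction with a constant byte stride of cols * sizeof(float).
 * As long as that stride is at most 2 KB (cols <= 512 here), the
 * IP-based prefetcher can learn it and prefetch current address + stride;
 * larger strides fall outside its detection range. */
float column_sum(const float *matrix, size_t rows, size_t cols, size_t col)
{
    float s = 0.0f;
    for (size_t r = 0; r < rows; r++)
        s += matrix[r * cols + col];   /* constant-stride load */
    return s;
}
```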
Core Cache Size/Latency/Bandwidth (Nehalem | Sandy Bridge | Haswell)
• L1 Instruction Cache: 32K, 4-way | 32K, 8-way | 32K, 8-way
• L1 Data Cache: 32K, 8-way | 32K, 8-way | 32K, 8-way
– Fastest load-to-use: 4 cycles | 4 cycles | 4 cycles
– Load bandwidth: 16 bytes/cycle | 32 bytes/cycle (banked) | 64 bytes/cycle
– Store bandwidth: 16 bytes/cycle | 16 bytes/cycle | 32 bytes/cycle
• L2 Unified Cache: 256K, 8-way | 256K, 8-way | 256K, 8-way
– Fastest load-to-use: 10 cycles | 11 cycles | 11 cycles
– Bandwidth to L1: 32 bytes/cycle | 32 bytes/cycle | 64 bytes/cycle
• L1 Instruction TLB: 4K: 128, 4-way; 2M/4M: 7/thread | 4K: 128, 4-way; 2M/4M: 8/thread | 4K: 128, 4-way; 2M/4M: 8/thread
• L1 Data TLB: 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: fractured | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way
• L2 Unified TLB: 4K: 512, 4-way | 4K: 512, 4-way | 4K+2M shared: 1024, 8-way
• All caches use 64-byte lines
(Foil taken from IDF 2012)

Buffer Sizes – extract more parallelism in every generation (Nehalem | Sandy Bridge | Haswell)
• Out-of-order window: 128 | 168 | 192
• In-flight loads: 48 | 64 | 72
• In-flight stores: 32 | 36 | 42
• Scheduler entries: 36 | 54 | 60
• Integer register file: N/A | 160 | 168
• FP register file: N/A | 144 | 168
• Allocation queue: 28/thread | 28/thread | 56
(Foil taken from IDF 2012)

Haswell Core μArch
[Diagram: the full core pipeline – 32KB L1 I-Cache with pre-decode and the Instruction Queue, branch prediction unit, 4 decoders, μop-Cache, LSD and μop Queue; in-order allocate/rename/retire with load, store and reorder buffers; the OOO scheduler dispatching to Ports 0–7; memory control, fill buffers, 32KB L1 D$ and 256KB ML$]
(From the Optimization Manual)

Hyper-Threading Technology

Thread-Level Parallelism
• Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
• Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
• Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services
• Increasing single-thread performance becomes harder, and is less and less power efficient
• Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
(From the Optimization Manual)

Multi-Threading Schemes
• Multi-threading: a single processor executes multiple threads
• Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
• Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses
• A deficiency of both time-slice MT and switch-on-event MT
– They do not cover branch mispredictions and long dependencies
• Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously, without switching
– Makes the most effective use of processor resources – maximizes performance vs. transistor count and power
(From the Optimization Manual)

Hyper-Threading Technology
• Two logical processors (LPs) in each physical processor
– Sharing most execution/cache resources using SMT
– Look like two processors to SW (OS and apps)
• Each logical processor executes a software thread
– The threads execute simultaneously on one physical processor
• Each LP maintains its own architectural state
– A complete set of architectural registers: general-purpose registers, control registers, machine state registers, debug registers
– Instruction pointers
• Each LP has its own interrupt controller
– Interrupts sent to a specific LP are handled only by it
(From the Optimization Manual)

Two Important Goals
• When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either
– Partitioning buffering queues and limiting the number of entries each thread can use
– Duplicating buffering queues
• A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active
(From the Optimization Manual)

Single-Task and Multi-Task Modes
• MT-mode (multi-task mode)
– Two active threads, with some resources partitioned as described earlier
• ST-mode (single-task mode)
– ST0 / ST1 – only thread 0 / 1 is active
– Resources that are partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources
• Moving the processor between modes
[Diagram: ST0 ↔ MT ↔ ST1 transitions – a thread executing HALT moves MT to ST0/ST1, an interrupt to the halted thread moves back to MT; when both threads execute HALT the processor enters a low-power state]
(From the Optimization Manual)

Thread Optimization
The OS should implement two optimizations:
• Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called "idle loop" can consume significant execution resources that could otherwise be used by the other active logical processor
• On a multi-processor system
– The OS views logical processors similarly to physical processors, but can still differentiate, and prefers to schedule a thread on a new physical processor rather than on the 2nd logical processor of the same physical processor
– Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible
(From the Optimization Manual)
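A sketch (not from the original foils) of this scheduling policy using the Linux pthread affinity API; the CPU numbering used here is an assumption for illustration, since the real mapping of logical CPUs to physical cores must be read from the platform's topology information (e.g., /sys/devices/system/cpu/):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one logical CPU. */
static int pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

/* Spread worker threads over distinct physical cores before doubling up
 * on HT siblings. Assumption for this sketch: logical CPUs 0..ncores-1
 * belong to different physical cores (a common, but not universal,
 * Linux enumeration). */
static void spread_threads(pthread_t *threads, int nthreads, int ncores)
{
    for (int i = 0; i < nthreads; i++)
        pin_to_cpu(threads[i], i % ncores);
}
```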
Physical Resource Sharing Schemes
• Replicated resources – each thread has its own resource
• Partitioned resources – the resource is partitioned in MT mode and combined in ST mode
• Competitively-shared resources – both threads compete for the entire resource
• HT-unaware resources – both threads use the resource
• Alternating resources – alternate between the active threads; if one thread is idle, the active thread uses the resource continuously
(From the Optimization Manual)

Front End Resource Sharing
• The instruction pointer is replicated
• Branch prediction resources are either duplicated or shared
– E.g., the return stack buffer is duplicated
• iTLB
– The large-page iTLB is replicated
– The small-page iTLB is partitioned
• The I$ is competitively-shared
• The decoder logic is shared and its use is alternated
– If only one thread needs the decode logic, it gets the full decode bandwidth
• The μop-Cache is partitioned
• The μop Queue is partitioned: 56 μops in ST mode, 28 μops/thread in MT mode
(From the Optimization Manual)

Back End Resource Sharing
• Out-of-order execution
– Register renaming tables are replicated
– The reservation stations are competitively-shared
– Execution units are HT-unaware
– The re-order buffers are partitioned
– Retirement logic alternates between threads
• Memory
– The load buffers and store buffers are partitioned
– The DTLB and STLB are competitively-shared
– The cache hierarchy and fill buffers are competitively-shared
(From the Optimization Manual)

Power Management

Processor Power Components
• The power consumed by a processor consists of
– Dynamic power: the power for toggling transistors and wires from 0→1 or 1→0
– αCV²f, where α – activity, C – capacitance, V – voltage, f – frequency
– Leakage power: the leakage of transistors under voltage
– A function of Z – total size of all transistors, V – voltage, t – temperature
• Peak power must not exceed the thermal constraints
– Power generates heat, and heat must be dissipated to keep transistors within the allowed temperature
– Peak power determines peak frequency (and thus peak performance)
– It also affects form factor, cooling solution cost, and acoustic noise
• Average power
– Determines battery life (for mobile devices), the electricity bill and the air-conditioning bill
– Average power = total energy / total time, including low-activity and idle time (~90% idle time for client)

Performance per Watt
• In small-form-factor devices the thermal budget limits performance
– Old target: get maximum performance
– New target: get maximum performance within a given power envelope – performance per Watt
• Increasing f also requires increasing V (~linearly)
– Dynamic power = αCV²f ≈ Kf³
– X% more performance costs ~3X% more power
– A power-efficient feature is better than 1:3 performance:power; otherwise it is better to just increase frequency (and voltage)
• Vmin is the minimal operating voltage
– Once at Vmin, reducing frequency no longer reduces voltage
– At this point a feature is power-efficient only if it is 1:1 performance:power
• Active energy efficiency tradeoff
– Energy_active = Power_active × Time_active ∝ Power_active / Perf_active
– Energy-efficient feature: 1:1 performance:power

Platform Power
• Processor average power is less than 10% of the platform power:
– Display (panel + inverter) 33%, CPU 10%, power supply 10%, MCH 9%, GFX 8%, HDD 8%, misc. 8%, CLK 5%, ICH 3%, LAN 2%, fan 2%, DVD 2%

Managing Power
• Typical CPU usage varies over time
– Bursts of high utilization and long idle periods (~90% of the time in client)
• Optimize power and energy consumption
– High power when high performance is needed
– Low power at low activity or idle
• Enhanced Intel SpeedStep® Technology
– Multiple voltage/frequency operating points
– The OS changes the frequency to meet performance needs and minimize power
– Referred to as processor performance states = P-states
• The OS notifies the CPU when no tasks are ready for execution
– The CPU then enters a sleep state, called a C-state
– Using the MWAIT instruction, with the C-state level as an argument
– A tradeoff between power and latency: deeper sleep → more power savings → longer to wake

P-states
• Operating frequencies are called P-states = performance states
– P0 is the highest frequency; P1, P2, P3, … are lower frequencies
– Pn is the minimum-Vcc point = the energy-efficient point
• DVFS = Dynamic Voltage and Frequency Scaling
– Power = CV²f and f = KV, hence Power ∝ f³
– Program execution time ∝ 1/f
– E = P×t, hence E ∝ f²
– Pn is the most energy-efficient point
• Going up/down the cubic power curve
– Frequency is expensive to achieve: a small frequency reduction yields a large power saving
[Chart: power vs. frequency along the P0, P1, P2, …, Pn curve]
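A worked numeric sketch (illustrative values only) of the Power ∝ f³ and E ∝ f² relations above:

```c
#include <stdio.h>

/* DVFS relations from the P-states slide: with V scaling ~linearly with f,
 * Power ~ K*f^3, and for a fixed amount of work time ~ 1/f, so
 * Energy = Power * time ~ f^2. K and the 3.4 GHz baseline are arbitrary
 * illustration values, not measurements. */
int main(void)
{
    const double K  = 0.5;    /* watts per GHz^3 (arbitrary scale factor) */
    const double f0 = 3.4;    /* baseline frequency in GHz (example) */

    for (double scale = 1.00; scale > 0.69; scale -= 0.10) {
        double f = f0 * scale;
        double power      = K * f * f * f;          /* ~ f^3 */
        double rel_energy = (f * f) / (f0 * f0);    /* ~ f^2, relative to P0 */
        printf("f = %.2f GHz   power = %5.1f W   relative energy = %.2f\n",
               f, power, rel_energy);
    }
    /* A 10% frequency reduction cuts power by ~27% (0.9^3 = 0.73) and energy
     * per unit of work by ~19% (0.9^2 = 0.81), which is why the lowest-voltage
     * point Pn is the most energy-efficient operating point. */
    return 0;
}
```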
C-States
• C0: the CPU active state
• C1 – Halt state
– Stop the core pipeline and most core clocks
– No instructions are executed
– The caches still respond to external snoops
• C3
– Stop the remaining core clocks
– Flush the internal core caches
• C6
– The processor saves its architectural state
– Turn off the power gate, eliminating leakage – core power goes to ~0
[Diagrams: active core power broken into local clocks and logic, clock distribution, and leakage, shrinking from C0 to C6]

Putting It All Together
• The CPU runs at maximum power and frequency (C0/P0) and periodically enters C1
• Going into an idle period, it gradually enters deeper C-states (C1, C2, C3, C4), controlled by the OS
• Tracking the CPU utilization history, the OS identifies low activity and switches the CPU to a lower P-state (C0/P1)
• The CPU enters idle again; the OS further lowers the P-state, e.g., DVD playback runs at the lowest P-state (C0/P2)
[Charts: power [W] over time, showing the C-state and P-state transitions]

Voltage and Frequency Domains
• Two independent variable power planes
– CPU cores, ring and LLC (VCC Core): a shared frequency for all IA32 cores and the ring; embedded power gates – each core can be turned off individually; cache power gating – portions of the cache, or all of it, are turned off in deeper sleep states
– Graphics processor (VCC Graphics): an independent frequency; can be varied or turned off when not active
• A fixed, programmable power plane for the System Agent (VCC SA)
– Optimizes SA power consumption
– System-on-chip functionality and PCU logic
• Periphery (VCC Periphery): DDR, PCIe, Display
[Diagram: per-core gated VCC Core planes, an ungated core plane, VCC Graphics, VCC SA and VCC Periphery]

Turbo Mode
• P1 is the guaranteed frequency
– CPU and GFX under simultaneous heavy load at worst-case conditions
– Actual power has a high dynamic range
• P0 is the maximum possible frequency – the Turbo frequency
– The P1-to-P0 range is significant (GHz)
– Exploited by single-threaded or lightly loaded applications and by GFX<->CPU balancing
• The OS treats P0 as any other P-state, requesting it when it needs more performance
– The P1-to-P0 range is fully HW controlled
– Frequency transitions are handled completely in HW
– The PCU keeps the silicon within the existing operating limits
– Systems are designed to the same specs, with or without Turbo Mode
• Pn is the energy-efficient state
– Frequencies lower than Pn are controlled by the thermal (T-) states and throttling, down to LFM
[Diagram: frequency ladder – P0 ("1C Turbo", HW control) above P1 … Pn (OS-visible states, OS control), with T-state & throttle and LFM below]

Turbo Mode and Power Gating
• Power gating: zero power for inactive cores
• Turbo Mode: for a lightly threaded workload, use the thermal budget of the inactive cores to increase the frequency of the active cores
• Even with all cores active, if the running workloads consume less than TDP, frequency is increased within the thermal headroom
[Bar charts: per-core frequency without Turbo vs. with Turbo, for lightly threaded and fully active workloads]

Thermal Capacitance
• Classic model: steady-state thermal resistance – the design guide for steady state
• New model: steady-state thermal resistance AND dynamic thermal capacitance – a more realistic response to power changes
• Temperature rises as energy is delivered to the thermal solution; the thermal solution response is calculated in real time
[Charts: temperature over time for the classic model response vs. the more realistic response]
(Foil taken from IDF 2011)

Intel® Turbo Boost Technology 2.0
• After idle periods the system accumulates an "energy budget" and can accommodate high power/performance for a few seconds (C0/P0 Turbo)
• In steady-state conditions the power stabilizes at TDP
• The accumulated energy budget is used to enhance the user experience – thermal budget is built up during idle or low-power periods
[Chart: power over time – a Turbo burst above "TDP" after sleep/low power, then settling at TDP]
(Foil taken from IDF 2011)

Core and Graphics Power Budgeting
• Cores and Graphics are integrated on the same die, with separate voltage/frequency controls and tight HW control
• The full package power specification is available for sharing
• The power budget can shift between the Cores and Graphics
[Chart: core power vs. graphics power – the sum of maximum powers for short periods, the realistic concurrent maximum power, and the total package power; Sandy Bridge vs. next-generation Turbo, for heavy CPU and heavy graphics workloads vs. their specifications]
(Foil taken from IDF 2011)