Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computer Architecture (“MAMAS”, 234267) Spring 2015 Lecturer: Yoav Etsion Reception: Mon 15:00, Fishbach 306-8 TAs: Tomer Gurevich, Franck Sala, Andrey Zhitnikov Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, Adi Yoaz and Dan Tsafrir 1 Computer Architecture 2015 – Introduction General Info  Grade    25% Exercise (mandatory) תקף 75% Final exam Textbook  “Computer Architecture: A Quantitative Approach” (4th Edition) by: Patterson & Hennessy  Other course information   2 Course web site: http://webcourse.cs.technion.ac.il/234267/Spring2013 Lectures will be upload to the web a day before the class Computer Architecture 2015 – Introduction Assignments  Five mandatory assignments during the semester   NO CHEATING!     Copying any part of the assignments from another student or using a reference from a previous year Letting someone else copy from you Posting code/solutions to a shared forum  We can track the code back to the posters!  3 Suspected parties will be sent to a Technion trial Typical outcome: course disqualified, which means your graduation will be postponed by at least one semester Possible examples of cheating (a few of many):   Only programming assignments Be honest and work by yourselves! Computer Architecture 2015 – Introduction The Computer System 4 Computer Architecture 2015 – Introduction Classical System Diagram Cache More to the “north” = closer to the CPU = faster CPU CPU BUS North Bridge External Graphics Card DDR2 or DDR3 Channel 1 PCI express 2.0 IOMMU On-board Memory Graphics controller Serial Port Parallel Port IO Controller 5 DDR2 or DDR3 Channel 2 PCI express ×1 South Bridge Floppy Drive Mem BUS keybrd USB controller mouse SATA controller DVD Drive Hard Disk PCI Sound Card speakers Lan Adap LAN Computer Architecture 2015 – Introduction Intel Nehalem Core i3 i5 i7 For high-end i-Series chips, Northbridge functionality moved onto processor (=> made faster) 45 to 32 nm 6 Computer Architecture 2015 – Introduction Intel Sandy Bridge Core i3 i5 i7 32 to 22 nm 7 The trend continues Computer Architecture 2015 – Introduction Computer and System Architecture   The physical design of the computer system is commonly known as Computer Architecture The definition of computer architecture: Prof. Christos Kozyrakis, Stanford Computer theorists invent algorithms that solve important problems and analyze their asymptotic behavior (e.g. O(NlogN) or O(N3)). Computer architects set the constant factors for these algorithms… 8 Computer Architecture 2015 – Introduction Computer and System Architecture   Axiom: The ultimate goal is performance (ops/sec) Constraints:    Practical goal: Highest performance in light of constraints Today, the most common goal is Performance/Watt  9 Power, Number of transistors, Bandwidth, … e.g. ops/Joule Computer Architecture 2015 – Introduction Course Focus  Start from CPU (=processor)      Move on to Memory Hierarchy    Caching and Main memory Virtual Memory Finish with latency tolerance and parallelism  10 Instruction set, performance Pipeline, hazards Branch prediction Out-of-order execution Hardware multithreading Computer Architecture 2015 – Introduction The Processor 11 Computer Architecture 2015 – Introduction Architecture vs. Microarchitecture  Architecture: = The processor features as seen by its user = Interface   Microarchitecture: = Manner by which the processor is implemented = Implementation details   Caches size and structure, number of execution units, … Note: different processors with different u-archs can support the same arch   12 Instruction set, number of registers, addressing modes,… Intel Pentium-IV vs. Intel Core2 Duo ARMv9 implemented by Qualcomm, TI, Samsung Computer Architecture 2015 – Introduction Why Should We Care?  Abstractions enhance productivity, so:    Same goes for arch   Just details for a programmer of a high-level language Abstractions only work so long as what’s below works  13 If we know the arch (=interface), Why should we care about the u-arch (=internals)? The taxi story: http://vimeo.com/11478146 (4:50-6:00) Computer Architecture 2015 – Introduction Recent Processor Trends Source: http://www.scidacreview.org/0904/html/multicore.html 14 Computer Architecture 2015 – Introduction Well-Known Moore’s Law Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm 15 Computer Architecture 2015 – Introduction 16 Computer Architecture 2015 – Introduction The Story in a Nutshell Transistors (1000s) clock speed (MHz) power (W) Instructions/cycle (ILP) 17 Computer Architecture 2015 – Introduction Took the Industry by Surprise 18 Computer Architecture 2015 – Introduction Dire Implications: Performance 19 Computer Architecture 2015 – Introduction Dire Implications: Programmers 22 Computer Architecture 2015 – Introduction Supercomputing: “Top 500 list” 23 Computer Architecture 2015 – Introduction Dire Implications: Supercomputing 24 Computer Architecture 2015 – Introduction Processor Performance 25 Computer Architecture 2015 – Introduction Metrics: IC, CPI, IPC  CPUs work according to a clock signal    Instruction Count (IC)   Clock cycle: measured in nanoseconds (10-9 of a second) Clock frequency = 1/|clock cycle|: in GHz (109 cycles/sec) Total number of instructions executed in the program Cycles Per Instruction (CPI)  Average #cycles per Instruction (in a given program) CPI =  26 #cycles required to execute the program IC IPC (= 1/CPI) : Instructions per cycles. Can be > 1; see the “story in a nutshell slide” Computer Architecture 2015 – Introduction Minimizing Execution Time  CPU Time - time required to execute a program CPU Time = IC  CPI  clock cycle  Our goal: minimize CPU Time (any of above components)  Minimize clock cycle: increase GHz (processor design)  Minimize CPI: u-arch (e.g.: more execution units)  Minimize IC: arch + u-arch (e.g.: SSETM) SSE = streaming SIMD extension (Intel) 27 Computer Architecture 2015 – Introduction Alternative Way to Calculate CPI   ICi = #times instruction of type-i is executed in program n IC = #instruction executed in program = IC   ICi i 1   Fi = relative frequency of type-i instruction = ICi/IC CPIi = #cycles to execute type-i instruction   e.g.: CPIadd = 1, CPImul = 3 n #cycles required to execute the program: # cyc   CPI i  ICi i 1  CPI: n # cyc CPI   IC 28  CPI  IC i 1 i IC i n ICi n   CPIi    CPIi  Fi IC i 1 i 1 Computer Architecture 2015 – Introduction Performance Evaluation: How?  No simple answer  Performance depends on    Mathematical analysis    Typically impossible Systems is too complex to model accurately So how do we evaluate systems?  29 Application Input Empirical analysis Computer Architecture 2015 – Introduction Benchmarks  Use benchmarks & measure how long it takes   Use real applications (=> no absolute answers) Preferably standardized benchmarks (+input), e.g.,  SPEC INT: integer apps • Compression, C complier, Perl, text-processing, …     Sometimes you see FLOPS (“peak” or “sustained”)  30 SPEC FP: floating point apps (mostly scientific) TPC benchmarks: measure transaction throughput (DB) SPEC JBB: models wholesale company (Java server, DB) Supercomputers (top500 list), against LINPACK Computer Architecture 2015 – Introduction Evaluating Performance  Use a performance simulator to evaluate the performance of a new feature / algorithm    Models the uarch to a great detail Run 100’s of representative applications Produce the performance s-curve   Sort the applications according to the IPC increase Baseline (0%) is the processor without the new feature 3% Bad S-curve 2% 6% Positive outliers Good S-curve Positive outliers 4% 1% 0% 2% -1% -2% Negative outliers -3% 0% Small negative outliers -2% -4% 31 Computer Architecture 2015 – Introduction Amdahl’s Law   Suppose we accelerate the computation such that  P = portion of computation we make faster  S = speedup experienced by the portion we improved For example  If an improvement can speedup 40% of the computation => P = 0.4   32 If the improvement makes the portion run twice as fast => S = 2 Then overall speedup = 1 (1  P)  P S Computer Architecture 2015 – Introduction Amdahl’s Law - Example  FP operations improved to run 2x faster  S = 2, but…  P = only affects 10% of the program  Speedup: 1 1    1.053 0.1 0.95 (1  P)  P (1  0.1)  S 2  Conclusion  33 1 Better to make common case fast… Computer Architecture 2015 – Introduction Amdahl’s Law – Parallelism  When parallelizing a program  P = proportion of program that can be made parallel  1 - P = inherently serial   N = number of processing elements (say, cores) Speedup: 1 (1  P)  P  N Serial component imposes a hard limit   1 1   lim  N   (1  P) (1  P)  P  N  34 Computer Architecture 2015 – Introduction Instruction Set Design software The ISA is what the user & compiler see instruction set hardware 35 The HW implements the ISA Computer Architecture 2015 – Introduction Considerations in ISA Design  Instruction size  Long instructions take more time to fetch from memory  Longer instructions require a larger memory • Important for small (embedded) devices  Number of instructions (IC)   36 Reduce IC => reduce runtime (at a given CPI & frequency) Virtues of instructions simplicity  Simpler HW allows for: higher frequency & lower power  Optimization can be applied better to simpler code  Cheaper HW Computer Architecture 2015 – Introduction Basing Design Decisions on Workload Immediate argument’s size in bits (histogram) 30% Int. Avg. FP Avg. 20% 10% 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0% Immediate data bits   37 1% of data values > 16-bits Having 16 bits is likely good enough Computer Architecture 2015 – Introduction CISC Processors  CISC - Complex Instruction Set Computer   Example: x86 The idea: a high level machine language • When people programmed in assembly, CISC supposedly easier  Characteristic   Many instruction types, with a many addressing modes Some of the instructions are complex • Execute complex tasks • Require many cycles  ALU operations directly on memory (e.g., arr[j] = arr[i]+n) • Registers not used (and, accordingly, only a few registers exist)  Variable length instructions • common instructions get short codes  save code length 38 Computer Architecture 2015 – Introduction But it Turns Out… Rank instruction % of total executed 1 load 22% 2 conditional branch 20% 3 compare 16% 4 store 12% 5 add 8% 6 and 6% 7 sub 5% 8 move register-register 4% 9 call 1% 10 return 1% Total 96% Simple instructions dominate instruction frequency 39 Computer Architecture 2015 – Introduction CISC Drawbacks  Complex instructions and complex addressing modes  complicates the processor  slows down the simple, common instructions  contradicts Make The Common Case Fast  Compilers don’t use complex instructions / indexing methods  Variable length instructions are real pain in the neck   Difficult to decode few instructions in parallel • As long as instruction is not decoded, its length is unknown  It is unknown where the instruction ends  It is unknown where the next instruction starts An instruction may be longer than a cache line • Or even longer longer than a page (in theory) 40 Computer Architecture 2015 – Introduction RISC Processors  RISC - Reduced Instruction Set Computer   The idea: simple instructions enable fast hardware Characteristic    A small instruction set, with only a few instructions formats A few indexing methods Load/Store machine: operate only on registers • Memory is accessed using Load and Store instructions only • Many orthogonal registers • Three address machine: Add dst, src1, src2   41 Fixed length instructions Examples: ARMTM MIPSTM, SparcTM, AlphaTM, PowerTM Computer Architecture 2015 – Introduction RISC Processors (Cont.)  Simple arch => simple u-arch      Compiler can be smarter    Better pipeline usage Better register allocation Existing RISC processor are not “pure” RISC  42 Smaller => faster Easier to design & validate (=> cheaper to manufacture) Shorten time-to-market More general-purpose registers (=> less memory refs) Various complex operations added along the way Computer Architecture 2015 – Introduction Compilers and ISA  Ease of compilation  Orthogonality: • no special registers • few special cases • all operand modes available with any data type or instruction type  Regularity: • no overloading for the meanings of instruction fields  streamlined • resource needs easily determined  Register assignment is critical too  43 Easier if lots of registers Computer Architecture 2015 – Introduction CISC or RISC?  ARM (RISC) dominates the personal market   x86 (CISC) dominates the PC and server market    Programs are more compact (fewer instructions) Dynamic optimizations Pro RISC:  44 CISC instructions are broken down to RISC micro-ops Pro CISC:   Not necessarily because it is CISC… Intel processors have a RISC arch   Not necessarily because it is RISC… Simpler and faster uarch Computer Architecture 2015 – Introduction ISA Extensions  Extend arch to accelerate exec of specific apps  Example: SSETM – Streaming SIMD Extensions     128-bit packed (vector) / scalar single precision FP (4×32) Introduced on Pentium® III on ’99 8 new 128 bit registers (XMM0 – XMM7) Accelerates graphics, video, scientific calculations, … 128-bits x3 x2 x1 x0 y0 + 45 y3 y2 y1 x3+y3 x2+y2 x1+y1 x0+y0 Computer Architecture 2015 – Introduction BACKUP 46 Computer Architecture 2015 – Introduction Compatibility  Backward compatibility (HW responsibility)  When buying new hardware, it can run existing software: • i5 can run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 268 BTW:  Forward compatibility (SW responsibility)    Architecture-independent SW   47 For example: MS Word 2003 can open MS Word 2010 doc Commonly supports one or two generations behind Run SW on top of VM that does JIT (just in time compiler): JVM for Java and CLR for .NET Interpreted languages: Perl, Python Computer Architecture 2015 – Introduction