Foundation to Computer Architecture

CHAPTER 3: A TOP-LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION

Computer Components: Top-Level View

Instruction Cycle
• Two steps:
– Fetch
– Execute

Fetch Cycle
• Program Counter (PC) holds the address of the next instruction to fetch
• Processor fetches the instruction from the memory location pointed to by the PC
• PC is incremented – unless told otherwise
• Instruction is loaded into the Instruction Register (IR)
• Processor interprets (decodes) the instruction and performs the required actions

Execute Cycle
• Processor-memory
– Data transfer between CPU and main memory
• Processor-I/O
– Data transfer between CPU and I/O module
• Data processing
– Some arithmetic or logical operation on data
• Control
– Alteration of the sequence of operations, e.g. jump
• Combination of the above

Characteristics of a Hypothetical Machine / Example of Program Execution
• 0001 : Load AC from memory
• 0101 : Add to AC from memory
• 0010 : Store AC to memory
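To make the fetch-decode-execute loop concrete, here is a minimal C sketch of this hypothetical machine: 16-bit words, a 4-bit opcode in the high bits, a 12-bit address in the low bits, and the three opcodes above. The program layout and data values are chosen for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical machine from the slides: 16-bit words, 4-bit opcode,
   12-bit address, single accumulator (AC). */
#define OP(w)   ((w) >> 12)       /* high 4 bits: opcode  */
#define ADDR(w) ((w) & 0x0FFFu)   /* low 12 bits: address */

int main(void) {
    static uint16_t mem[4096];
    unsigned pc = 0x300, ir, ac = 0;   /* program counter, IR, accumulator */

    /* Illustrative program: AC = mem[940] + mem[941]; mem[941] = AC */
    mem[0x300] = 0x1940;   /* 0001: load AC from 940   */
    mem[0x301] = 0x5941;   /* 0101: add mem[941] to AC */
    mem[0x302] = 0x2941;   /* 0010: store AC to 941    */
    mem[0x940] = 3;
    mem[0x941] = 2;

    for (int i = 0; i < 3; i++) {
        ir = mem[pc++];                       /* fetch, then increment PC */
        switch (OP(ir)) {                     /* decode and execute       */
        case 0x1: ac = mem[ADDR(ir)];                    break;
        case 0x5: ac = (ac + mem[ADDR(ir)]) & 0xFFFFu;   break;
        case 0x2: mem[ADDR(ir)] = (uint16_t)ac;          break;
        }
        printf("PC=%03X IR=%04X AC=%04X\n", pc, ir, ac);
    }
    return 0;
}
```

Each loop iteration performs one full instruction cycle: the fetch step reads mem[PC] into IR and increments the PC, and the execute step dispatches on the opcode.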
Instruction Cycle State Diagram

Interrupts
• Mechanism by which other modules (e.g. I/O) may interrupt the normal sequence of processing
• Program
– e.g. overflow, division by zero
• Timer
– Generated by internal processor timer
– Used in pre-emptive multi-tasking
• I/O
– From I/O controller
• Hardware failure
– e.g. memory parity error

Classes of Interrupts

Program Flow Control

Interrupt Cycle
• Added to the instruction cycle
• Processor checks for an interrupt
– Indicated by an interrupt signal
• If no interrupt, fetch the next instruction
• If an interrupt is pending:
– Suspend execution of the current program
– Save context
– Set PC to the start address of the interrupt handler routine
– Process the interrupt
– Restore context and continue the interrupted program

Transfer of Control via Interrupts

Instruction Cycle with Interrupts

Program Timing: Short I/O Wait (Typical Case)

Program Timing: Long I/O Wait

Instruction Cycle (with Interrupts) State Diagram

Multiple Interrupts
• Disable interrupts
– Processor ignores further interrupts whilst processing one interrupt
– Interrupts remain pending and are checked after the first interrupt has been processed
– Interrupts are handled in sequence as they occur
• Define priorities
– Low-priority interrupts can be interrupted by higher-priority interrupts
– When the higher-priority interrupt has been processed, the processor returns to the previous interrupt

Multiple Interrupts - Sequential

Multiple Interrupts - Nested

Time Sequence of Multiple Interrupts

Computer Memory System

CHAPTER 4: CACHE MEMORY

Characteristics
• Location
• Capacity
• Unit of transfer
• Access method
• Performance
• Physical type
• Physical characteristics
• Organisation

Location
• CPU
• Internal
• External

Capacity
• Word size
– The natural unit of organisation
• Number of words
– or bytes

Unit of Transfer
• Internal
– Usually governed by data bus width
• External
– Usually a block, which is much larger than a word
• Addressable unit
– Smallest location which can be uniquely addressed
– Word internally
– Cluster on disk (e.g. Microsoft file systems)

Access Methods (1)
• Sequential
– Start at the beginning and read through in order
– Access time depends on location of data and previous location
– e.g. tape
• Direct
– Individual blocks have a unique address
– Access is by jumping to the vicinity plus a sequential search
– Access time depends on location and previous location
– e.g. disk

Access Methods (2)
• Random
– Individual addresses identify locations exactly
– Access time is independent of location or previous access
– e.g. RAM
• Associative
– Data is located by a comparison with the contents of a portion of the store
– Access time is independent of location or previous access
– e.g. cache

Memory Hierarchy
• Registers
– In CPU
• Internal or main memory
– May include one or more levels of cache
– "RAM"
• External memory
– Backing store

Memory Hierarchy - Diagram

Performance
• Access time
– Time between presenting the address and getting the valid data
• Memory cycle time
– Time may be required for the memory to "recover" before the next access
– Cycle time is access + recovery
• Transfer rate
– Rate at which data can be moved

Going down the hierarchy:
1. Decreasing cost per bit
2. Increasing capacity
3. Increasing access time
4. Decreasing frequency of access of the memory by the processor

Physical Types
• Semiconductor
– RAM
• Magnetic
– Disk & tape
• Optical
– CD & DVD
• Others
– Bubble
– Hologram

Physical Characteristics
• Decay
• Volatility
• Erasable
• Power consumption

Organisation
• Physical arrangement of bits into words
• Not always obvious
• e.g. interleaved

The Bottom Line
• How much?
– Capacity
• How fast?
– Time is money
• How expensive?

Mapping Function (running example)
• Cache of 64 kBytes
• Cache block of 4 bytes
– i.e. cache is 16k (2^14) lines of 4 bytes
• 16 MBytes main memory
• 24-bit address
– (2^24 = 16M)

Direct Mapping
• Each block of main memory maps to only one cache line
– i.e. if a block is in cache, it must be in one specific place
• Address is in two parts
• Least significant w bits identify a unique word
• Most significant s bits specify one memory block
• The MSBs are split into a cache line field r and a tag of s-r bits (most significant)

Direct Mapping Address Structure

Tag (s-r) = 8 bits | Line or slot (r) = 14 bits | Word (w) = 2 bits

• 24-bit address
• 2-bit word identifier (4-byte block)
• 22-bit block identifier
– 8-bit tag (= 22-14)
– 14-bit slot or line
• No two blocks mapping to the same line have the same tag field
• Check contents of cache by finding the line and checking the tag

Direct Mapping from Cache to Main Memory

Direct Mapping Cache Line Table

Cache line | Main memory blocks held
0          | 0, m, 2m, 3m … 2^s - m
1          | 1, m+1, 2m+1 … 2^s - m + 1
…          | …
m-1        | m-1, 2m-1, 3m-1 … 2^s - 1

Direct Mapping Cache Organization

Hit and Miss

Direct Mapping Example

Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits
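As a sketch of how the 8/14/2 split above is applied, the following C fragment decomposes a 24-bit address into tag, line, and word fields; the sample address is invented for illustration.

```c
#include <stdio.h>

/* Field extraction for the direct-mapped running example:
   24-bit address = 8-bit tag | 14-bit line | 2-bit word. */
#define WORD_BITS 2
#define LINE_BITS 14

int main(void) {
    unsigned addr = 0x16339C;   /* invented sample address */
    unsigned word = addr & ((1u << WORD_BITS) - 1);
    unsigned line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    unsigned tag  = addr >> (WORD_BITS + LINE_BITS);

    /* 0x16339C -> tag 16, line 0CE7, word 0: the block can live only
       in line 0CE7, where its presence is confirmed by tag 16. */
    printf("tag=%02X line=%04X word=%X\n", tag, line, word);
    return 0;
}
```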
Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for a given block
– If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high

Victim Cache
• Lowers the miss penalty
• Remembers what was discarded
– Already fetched
– Can be used again with little penalty
• Fully associative
• 4 to 16 cache lines
• Sits between the direct-mapped L1 cache and the next memory level

Associative Mapping
• A main memory block can load into any line of cache
• Memory address is interpreted as tag and word
• Tag uniquely identifies a block of memory
• Every line's tag is examined for a match
• Cache searching gets expensive

Associative Mapping from Cache to Main Memory

Fully Associative Cache Organization

Associative Mapping Example

Associative Mapping Address Structure

Tag = 22 bits | Word = 2 bits

• 22-bit tag stored with each 32-bit block of data
• Compare tag field with tag entry in cache to check for a hit
• Least significant 2 bits of the address identify which byte is required from the 32-bit data block
• e.g.
– Address FFFFFC → Tag 3FFFFF, Data 24682468, Cache line 3FFF

Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits

Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
– e.g. block B can be in any line of set i
• e.g. 2 lines per set
– 2-way associative mapping
– A given block can be in one of 2 lines in only one set

Set Associative Mapping Example
• 13-bit set number
• Set number = block number in main memory modulo 2^13
• Addresses 000000, 008000, 010000, 018000 … map to the same set

Mapping From Main Memory to Cache: v-Associative

Mapping From Main Memory to Cache: k-Way Associative

K-Way Set Associative Cache Organization

Set Associative Mapping Address Structure

Tag = 9 bits | Set = 13 bits | Word = 2 bits

• Use the set field to determine which cache set to look in
• Compare the tag field to see if we have a hit
• e.g.
– Address 1FF 7FFC → Tag 1FF, Data 12345678, Set number 1FFF
– Address 001 7FFC → Tag 001, Data 11223344, Set number 1FFF

Two-Way Set Associative Mapping Example

Set Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = kv = k × 2^d
• Size of tag = (s - d) bits
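A minimal 2-way set-associative lookup in C, using the 9/13/2 split above. The addresses are the two from the slide example (written as full 24-bit values), and the fill policy is deliberately simplistic for the sketch.

```c
#include <stdio.h>
#include <stdbool.h>

/* 2-way set-associative lookup for the example above:
   24-bit address = 9-bit tag | 13-bit set | 2-bit word. */
#define SETS (1u << 13)
#define WAYS 2

struct line { bool valid; unsigned tag; };
static struct line cache[SETS][WAYS];

/* Returns true on a hit; on a miss installs the block in the first
   invalid way (a trivial fill policy, chosen for brevity). */
static bool lookup(unsigned addr) {
    unsigned set = (addr >> 2) & (SETS - 1);   /* bits 2..14  */
    unsigned tag = addr >> 15;                 /* bits 15..23 */
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;
    for (int w = 0; w < WAYS; w++)
        if (!cache[set][w].valid) {
            cache[set][w].valid = true;
            cache[set][w].tag   = tag;
            break;
        }
    return false;
}

int main(void) {
    /* 0xFFFFFC ("1FF 7FFC") and 0x00FFFC ("001 7FFC") share set 1FFF
       but differ in tag, so a 2-way set holds both without conflict:
       the output is miss, miss, hit, hit. */
    unsigned addrs[] = { 0xFFFFFC, 0x00FFFC, 0xFFFFFC, 0x00FFFC };
    for (int i = 0; i < 4; i++)
        printf("%06X -> %s\n", addrs[i], lookup(addrs[i]) ? "hit" : "miss");
    return 0;
}
```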
Central Processing Unit (CPU) Basics

CHAPTER 14: PROCESSOR STRUCTURE AND FUNCTION

Pipelining
• Fetch instruction
• Decode instruction
• Calculate operands (i.e. effective addresses)
• Fetch operands
• Execute instruction
• Write result
• Overlap these operations

Two-Stage Instruction Pipeline

Timing Diagram for Instruction Pipeline Operation

The Effect of a Conditional Branch on Instruction Pipeline Operation

Six-Stage Instruction Pipeline

Alternative Pipeline Depiction

Pipeline Hazards
• Pipeline, or some portion of the pipeline, must stall
• Also called a pipeline bubble

Types of hazards
• Data hazards
• Structural (resource) hazards
• Control (branching) hazards

Data Hazards
• Read after write (RAW) – a true dependency
• Write after read (WAR) – an anti-dependency
• Write after write (WAW) – an output dependency
• Consider two instructions i1 and i2, with i1 occurring before i2 in program order

Read After Write (RAW)
• i2 tries to read a source before i1 writes to it
• A RAW data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved
– i1. R2 <= R1 + R3
– i2. R4 <= R2 + R3

Write After Read (WAR)
• i2 tries to write a destination before it is read by i1
• A WAR data hazard represents a problem with concurrent execution
– i1. R4 <= R1 + R5
– i2. R5 <= R1 + R2

Write After Write (WAW)
• i2 tries to write an operand before it is written by i1
• A WAW data hazard may occur in a concurrent execution environment
– i1. R2 <= R4 + R7
– i2. R2 <= R1 + R3

Data Hazards
• Conflict in access of an operand location
• Two instructions to be executed in sequence
• Both access a particular memory or register operand
• If executed in strict sequence, no problem occurs
• If in a pipeline, the operand value could be updated so as to produce a different result from strict sequential execution
• E.g. x86 machine instruction sequence:
– ADD EAX, EBX /* EAX = EAX + EBX */
– SUB ECX, EAX /* ECX = ECX - EAX */
• ADD does not update EAX until the end of stage 5, at clock cycle 5
• SUB needs the value at the beginning of its stage 2, at clock cycle 4
• Pipeline must stall for two clock cycles
• Without special hardware and specific avoidance algorithms, this results in inefficient pipeline usage

Data Hazard Diagram
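To make the two-cycle stall concrete, here is a toy C model of the ADD/SUB example. It assumes a five-stage pipeline (fetch, decode, fetch operands, execute, write) in which operands are read in stage 3 and the result written in stage 5 becomes usable the following cycle; that stage mapping is an assumption chosen to be consistent with the cycle numbers quoted above, not the textbook's exact figure.

```c
#include <stdio.h>

/* Toy RAW-stall model for the ADD/SUB sequence above. */
int main(void) {
    int fo_stage = 3, wo_stage = 5;    /* operand fetch / write stages */
    int add_issue = 1, sub_issue = 2;  /* cycle each enters the pipe   */

    int eax_ready = add_issue + wo_stage;      /* usable from cycle 6  */
    int sub_reads = sub_issue + fo_stage - 1;  /* scheduled at cycle 4 */

    int stall = eax_ready > sub_reads ? eax_ready - sub_reads : 0;
    printf("SUB stalls %d cycles (reads cycle %d, EAX ready cycle %d)\n",
           stall, sub_reads, eax_ready);       /* prints a 2-cycle stall */
    return 0;
}
```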
Resource Hazards
• Two (or more) instructions in the pipeline need the same resource
• Executed in serial rather than parallel for part of the pipeline
• Also called a structural hazard
• E.g. assume a simplified five-stage pipeline
– Each stage takes one clock cycle
– Ideal case: a new instruction enters the pipeline each clock cycle
– Assume main memory has a single port, and that instruction fetches and data reads/writes are performed one at a time (ignore the cache)
– An operand read or write then cannot be performed in parallel with an instruction fetch
– The fetch-instruction stage must idle for one cycle before fetching I3
• Another example: multiple instructions ready to enter the execute phase, with a single ALU
• One solution: increase the available resources
– Multiple main memory ports
– Multiple ALUs

Resource Hazard Diagram

Control Hazard
• Also known as a branch hazard
• Pipeline makes the wrong decision on a branch prediction
• Brings instructions into the pipeline that must subsequently be discarded
• Dealing with branches (these normally affect the fetch stage):
– Multiple streams
– Pre-fetch branch target
– Loop buffer
– Branch prediction
– Delayed branching

Multiple Streams
• Have two pipelines
• Pre-fetch each branch into a separate pipeline
• Use the appropriate pipeline
• Leads to bus & register contention
• Multiple branches lead to further pipelines being needed

Prefetch Branch Target
• Target of branch is pre-fetched in addition to instructions following the branch
• Keep the target until the branch is executed
• Used by IBM 360/91

Loop Buffer
• Very fast memory
• Maintained by the fetch stage of the pipeline
• Check the buffer before fetching from memory
• Very good for small loops or jumps
• cf. cache
• Used by CRAY-1

Loop Buffer Diagram

Branch Prediction (1)
• Predict never taken
– Assume that the jump will not happen
– Always fetch the next instruction
– 68020 & VAX 11/780
– VAX will not pre-fetch after a branch if a page fault would result (O/S v CPU design)
• Predict always taken
– Assume that the jump will happen
– Always fetch the target instruction

Branch Prediction (2)
• Predict by opcode
– Some instructions are more likely to result in a jump than others
– Can get up to 75% success
• Taken/not-taken switch
– Based on previous history
– Good for loops
– Refined by two-level or correlation-based branch history
• Correlation-based
– In loop-closing branches, history is a good predictor
– In more complex structures, branch direction correlates with that of related branches
– Use recent branch history as well

Branch Prediction (3)
• Delayed branch
– Do not take the jump until you have to
– Rearrange instructions

Branch Prediction Flowchart

Branch Prediction State Diagram
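A minimal C sketch of the taken/not-taken switch behind the state diagram above: a 2-bit saturating counter per branch, where states 0-1 predict not taken and states 2-3 predict taken. The initial state and the outcome stream are invented for illustration.

```c
#include <stdio.h>
#include <stdbool.h>

static int state = 2;                  /* start weakly taken (a choice) */

static bool predict(void) { return state >= 2; }

static void update(bool taken) {
    if (taken  && state < 3) state++;  /* saturate at strongly taken     */
    if (!taken && state > 0) state--;  /* saturate at strongly not taken */
}

int main(void) {
    /* Invented outcomes: a loop branch taken 4 times, one exit, re-entry.
       Note the single not-taken outcome does not flip the prediction,
       which is why this scheme works well for loops. */
    bool outcomes[] = { true, true, true, true, false, true };
    int correct = 0, n = 6;
    for (int i = 0; i < n; i++) {
        bool p = predict();
        correct += (p == outcomes[i]);
        update(outcomes[i]);
    }
    printf("correct predictions: %d/%d\n", correct, n);   /* 5/6 */
    return 0;
}
```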
Processor Internals

CHAPTER 16: INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS

What is Superscalar?
• Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
• Equally applicable to RISC & CISC
• In practice usually RISC

Why Superscalar?
• Most operations are on scalar quantities (see RISC notes)
• Improve these operations to get an overall improvement

General Superscalar Organization

Superpipelined
• Many pipeline stages need less than half a clock cycle
• Doubling the internal clock speed gets two tasks done per external clock cycle
• Superscalar, by contrast, allows parallel fetch and execute

Superscalar v Superpipeline

Limitations
• Instruction-level parallelism
• Compiler-based optimisation
• Hardware techniques
• Limited by:
– True data dependency
– Procedural dependency
– Resource conflicts
– Output dependency
– Anti-dependency

True Data Dependency
• ADD r1, r2 (r1 := r1 + r2)
• MOVE r3, r1 (r3 := r1)
• Can fetch and decode the second instruction in parallel with the first
• Can NOT execute the second instruction until the first is finished

Procedural Dependency
• Cannot execute instructions after a branch in parallel with instructions before the branch
• Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
• This prevents simultaneous fetches

Resource Conflict
• Two or more instructions require access to the same resource at the same time
– e.g. two arithmetic instructions
• Can duplicate resources
– e.g. have two arithmetic units

Effect of Dependencies

Processor Internals

CHAPTER 20: CONTROL UNIT OPERATION

Hardwired and Micro-programmed Control Unit Implementation

Hardwired Implementation (1)
• Control unit inputs
• Flags and control bus
– Each bit means something
• Instruction register
– Op-code causes different control signals for each different instruction
– Unique logic for each op-code
– Decoder takes encoded input and produces a single output
– n binary inputs and 2^n outputs

Hardwired Implementation (2)
• Clock
– Repetitive sequence of pulses
– Useful for measuring the duration of micro-ops
– Must be long enough to allow signal propagation
– Different control signals are needed at different times within the instruction cycle
– Need a counter with different control signals for t1, t2 etc.

Control Unit with Decoded Inputs

PART 6: (1/2) Enhancing CPU Performance

CHAPTER 21: MICROPROGRAMMED CONTROL

Control Unit Organization

Micro-programmed Control
• Use sequences of instructions (see earlier notes) to control complex operations
• Called micro-programming or firmware

Implementation (1)
• All the control unit does is generate a set of control signals
• Each control signal is on or off
• Represent each control signal by a bit
• Have a control word for each micro-operation
• Have a sequence of control words for each machine code instruction
• Add an address to specify the next micro-instruction, depending on conditions

Implementation (2)
• Today's large microprocessors
– Many instructions and associated register-level hardware
– Many control points to be manipulated
• This results in control memory that
– Contains a large number of words, corresponding to the number of instructions to be executed
– Has a wide word width, due to the large number of control points to be manipulated

Micro-program Word Length
• Based on 3 factors
– Maximum number of simultaneous micro-operations supported
– The way control information is represented or encoded
– The way in which the next micro-instruction address is specified

Micro-instruction Types
• Each micro-instruction specifies a single (or a few) micro-operations to be performed
– (vertical micro-programming)
• Each micro-instruction specifies many different micro-operations to be performed in parallel
– (horizontal micro-programming)
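A sketch of what a horizontal micro-instruction could look like in C: one bit per control signal plus a next-address field. The signal names, bit positions, and the three-step fetch sequence are invented examples under the scheme described above, not any particular machine's control store.

```c
#include <stdio.h>
#include <stdint.h>

/* Invented control signals, one bit per control point (horizontal). */
#define SIG_PC_TO_MAR  (1u << 0)   /* gate PC onto memory address reg */
#define SIG_MEM_READ   (1u << 1)   /* assert memory read              */
#define SIG_MDR_TO_IR  (1u << 2)   /* gate memory data reg into IR    */
#define SIG_INC_PC     (1u << 3)   /* increment program counter       */

struct microinstruction {
    uint32_t signals;   /* control word: which signals are on         */
    uint16_t next;      /* address of the next micro-instruction      */
};

/* An instruction fetch as three control words in control memory. */
static const struct microinstruction fetch[] = {
    { SIG_PC_TO_MAR,             1 },   /* t1: MAR <- PC              */
    { SIG_MEM_READ | SIG_INC_PC, 2 },   /* t2: read; PC <- PC + 1     */
    { SIG_MDR_TO_IR,             0 },   /* t3: IR <- MDR              */
};

int main(void) {
    for (int t = 0; t < 3; t++)
        printf("t%d: signals=%04X next=%u\n",
               t + 1, (unsigned)fetch[t].signals, fetch[t].next);
    return 0;
}
```

A vertical format would instead encode each step as a small opcode to be decoded into these signals, trading word width for an extra decoding step.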
Enhancing CPU Performance

CHAPTER 17: PARALLEL PROCESSING

Multiple Processor Organization
• Single instruction, single data stream – SISD
• Single instruction, multiple data stream – SIMD
• Multiple instruction, single data stream – MISD
• Multiple instruction, multiple data stream – MIMD

Single Instruction, Single Data Stream - SISD
• Single processor
• Single instruction stream
• Data stored in a single memory
• Uni-processor

Single Instruction, Multiple Data Stream - SIMD
• A single machine instruction controls simultaneous execution
• A number of processing elements operate on a lockstep basis
• Each processing element has an associated data memory
• Each instruction is executed on a different set of data by the different processors
• Vector and array processors

Multiple Instruction, Single Data Stream - MISD
• A sequence of data is transmitted to a set of processors
• Each processor executes a different instruction sequence
• Never been implemented

Multiple Instruction, Multiple Data Stream - MIMD
• Set of processors
• Simultaneously execute different instruction sequences
• Different sets of data
• SMPs, clusters and NUMA systems

Taxonomy of Parallel Processor Architectures

MIMD - Overview
• General-purpose processors
• Each can process all instructions necessary
• Further classified by method of processor communication

Tightly Coupled - SMP
• Processors share memory
• Communicate via that shared memory
• Symmetric Multiprocessor (SMP)
– Share single memory or pool
– Shared bus to access memory
– Memory access time to a given area of memory is approximately the same for each processor

Tightly Coupled - NUMA
• Non-uniform memory access
• Access times to different regions of memory may differ

Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs
• Interconnected to form a cluster
• Communication via fixed path or network connections

Parallel Organizations - SISD

Parallel Organizations - SIMD

Parallel Organizations - MISD

Parallel Organizations - MIMD Shared Memory

Parallel Organizations - MIMD Distributed Memory

CPU Externals

CHAPTER 7: INPUT/OUTPUT

Input/Output Problems
• Wide variety of peripherals
– Delivering different amounts of data
– At different speeds
– In different formats
• All slower than CPU and RAM
• Need I/O modules

Generic Model of I/O Module

External Devices
• Human readable
– Screen, printer, keyboard
• Machine readable
– Monitoring and control
• Communication
– Modem
– Network Interface Card (NIC)

External Device Block Diagram

I/O Module Function
• Control & timing
• CPU communication
• Device communication
• Data buffering
• Error detection

I/O Steps
• CPU checks I/O module device status
• I/O module returns status
• If ready, CPU requests data transfer
• I/O module gets data from device
• I/O module transfers data to CPU
• Variations for output, DMA, etc.

I/O Module Diagram

Input Output Techniques
• Programmed
• Interrupt driven
• Direct Memory Access (DMA)

Three Techniques for Input of a Block of Data

Programmed I/O
• CPU has direct control over I/O
– Sensing status
– Read/write commands
– Transferring data
• CPU waits for I/O module to complete the operation
• Wastes CPU time

I/O Commands
• CPU issues address
– Identifies module (& device if >1 per module)
• CPU issues command
– Control: telling the module what to do
• e.g. spin up disk
– Test: check status
• e.g. power? error?
– Read/Write
• Module transfers data via buffer from/to device

I/O Mapping
• Memory-mapped I/O
– Devices and memory share an address space
– I/O looks just like memory read/write
– No special commands for I/O
• Large selection of memory access commands available
• Isolated I/O
– Separate address spaces
– Need I/O or memory select lines
– Special commands for I/O
• Limited set

Memory Mapped and Isolated I/O

Interrupt Driven I/O
• Overcomes CPU waiting
• No repeated CPU checking of the device
• I/O module interrupts when ready

Simple Interrupt Processing

Changes in Memory and Registers for an Interrupt

Multiple Interrupts
• Each interrupt line has a priority
• Higher-priority lines can interrupt lower-priority lines
• If bus mastering, only the current master can interrupt

Example - PC Bus
• 80x86 has one interrupt line
• 8086-based systems use one 8259A interrupt controller
• 8259A has 8 interrupt lines

82C59A Interrupt Controller

Direct Memory Access
• Interrupt-driven and programmed I/O require active CPU intervention
– Transfer rate is limited
– CPU is tied up
• DMA is the answer

DMA Function
• Additional module (hardware) on the bus
• DMA controller takes over from the CPU for I/O

Typical DMA Module Diagram

DMA Operation
• CPU tells the DMA controller:
– Read/write
– Device address
– Starting address of the memory block for the data
– Amount of data to be transferred
• CPU carries on with other work
• DMA controller deals with the transfer
• DMA controller sends an interrupt when finished

DMA and Interrupt Breakpoints During an Instruction Cycle

DMA Configurations (1)
• Single bus, detached DMA controller
• Each transfer uses the bus twice
– I/O to DMA, then DMA to memory
• CPU is suspended twice

DMA Configurations (2)
• Single bus, integrated DMA controller
• Controller may support >1 device
• Each transfer uses the bus once
– DMA to memory
• CPU is suspended once

DMA Configurations (3)
• Separate I/O bus
• Bus supports all DMA-enabled devices
• Each transfer uses the bus once
– DMA to memory
• CPU is suspended once
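The "CPU tells the DMA controller" steps above amount to writing a handful of device registers and then continuing with other work. Here is a C sketch against a hypothetical memory-mapped DMA controller: every register address, name, and control bit below is invented for illustration, not a real device's programming model.

```c
#include <stdint.h>

/* HYPOTHETICAL memory-mapped DMA controller registers. */
#define DMA_BASE      0xFFFF0000u
#define DMA_CTRL      (*(volatile uint32_t *)(DMA_BASE + 0x0)) /* control   */
#define DMA_DEV       (*(volatile uint32_t *)(DMA_BASE + 0x4)) /* device id */
#define DMA_MEM_ADDR  (*(volatile uint32_t *)(DMA_BASE + 0x8)) /* buffer    */
#define DMA_COUNT     (*(volatile uint32_t *)(DMA_BASE + 0xC)) /* bytes     */

#define CTRL_READ    (1u << 0)   /* direction: device -> memory */
#define CTRL_START   (1u << 1)   /* begin the transfer          */
#define CTRL_IRQ_EN  (1u << 2)   /* interrupt on completion     */

void dma_read_block(uint32_t device, uint32_t buffer, uint32_t nbytes) {
    DMA_DEV      = device;   /* which device to transfer from    */
    DMA_MEM_ADDR = buffer;   /* starting address of memory block */
    DMA_COUNT    = nbytes;   /* amount of data to be transferred */
    DMA_CTRL     = CTRL_READ | CTRL_IRQ_EN | CTRL_START;
    /* The CPU now carries on with other work; the controller raises
       an interrupt when the transfer is finished. */
}
```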