Foundation to Computer Architecture
CHAPTER 3: A TOP-LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION
Computer Components: Top-Level View
Instruction Cycle
• Two steps:
– Fetch
– Execute
Fetch Cycle
• Program Counter (PC) holds the address of the next instruction to fetch
• Processor fetches the instruction from the memory location pointed to by the PC
• PC is incremented
– Unless told otherwise (e.g. by a jump)
• Instruction is loaded into the Instruction Register (IR)
• Processor interprets (decodes) the instruction and performs the required actions, as sketched below
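The fetch cycle fits in a few lines of Python; the memory contents, 16-bit word width, and 4-bit opcode split here are illustrative assumptions, not from the slides:

    # Fetch cycle sketch: PC points at the next instruction (illustrative).
    memory = {0x300: 0x1940, 0x301: 0x5941}   # assumed 16-bit instruction words

    pc = 0x300                 # Program Counter
    ir = memory[pc]            # fetch into the Instruction Register
    pc += 1                    # increment PC (a jump could override this later)
    opcode = ir >> 12          # decode: top 4 bits name the operation
    print(hex(ir), bin(opcode))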
Execute Cycle
• Processor-memory
– data transfer between CPU and main memory
• Processor I/O
– Data transfer between CPU and I/O module
• Data processing
– Some arithmetic or logical operation on data
• Control
– Alteration of sequence of operations
– e.g. jump
• Combination of above
Characteristics of a Hypothetical Machine

Example of Program Execution
0001 : Load AC from memory
0101 : Add to AC from memory
0010 : Store AC to memory
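A short Python run of this hypothetical machine; the 4-bit opcode + 12-bit address format and the addresses 940/941 follow the familiar textbook illustration and should be read as assumptions:

    # Hypothetical machine: 16-bit words = 4-bit opcode + 12-bit address.
    memory = {
        0x300: 0x1940,   # 0001: Load AC from memory[940]
        0x301: 0x5941,   # 0101: Add memory[941] to AC
        0x302: 0x2941,   # 0010: Store AC to memory[941]
        0x940: 0x0003, 0x941: 0x0002,
    }
    pc, ac = 0x300, 0
    for _ in range(3):                    # one instruction cycle per instruction
        ir = memory[pc]; pc += 1          # fetch, then increment PC
        op, addr = ir >> 12, ir & 0xFFF   # decode
        if   op == 0b0001: ac = memory[addr]      # Load AC from memory
        elif op == 0b0101: ac += memory[addr]     # Add to AC from memory
        elif op == 0b0010: memory[addr] = ac      # Store AC to memory
    print(hex(memory[0x941]))             # 0x5 = 3 + 2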
Instruction Cycle State Diagram
Interrupts
• Mechanism by which other modules (e.g. I/O) may
interrupt normal sequence of processing
• Program
– e.g. overflow, division by zero
• Timer
– Generated by internal processor timer
– Used in pre-emptive multi-tasking
• I/O
– from I/O controller
• Hardware failure
– e.g. memory parity error
Classes of Interrupts
Program Flow Control
Interrupt Cycle
• Added to instruction cycle
• Processor checks for interrupt
– Indicated by an interrupt signal
• If no interrupt, fetch next instruction
• If interrupt pending:
– Suspend execution of current program
– Save context
– Set PC to start address of interrupt handler routine
– Process interrupt
– Restore context and continue interrupted program
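A minimal Python sketch of the interrupt check bolted onto the instruction cycle; the handler address and the PC-only "context" are illustrative assumptions:

    # Instruction cycle with interrupt check (illustrative sketch).
    HANDLER = 0x050                   # assumed start address of the handler
    saved, pending = [], ["io"]       # saved contexts; pending interrupt signals

    def fetch_execute(pc):            # stand-in for a real fetch/decode/execute
        return pc + 1

    def cycle(pc):
        pc = fetch_execute(pc)        # normal fetch and execute
        if pending:                   # check for an interrupt signal
            pending.pop(0)
            saved.append(pc)          # save context (here, just the PC)
            pc = HANDLER              # set PC to the handler's start address
        return pc                     # the handler later restores saved.pop()

    print(hex(cycle(0x300)))          # 0x50: control transferred to the handler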
Transfer of Control via Interrupts
Instruction Cycle with Interrupts
Program Timing: Short I/O Wait (Typical Case)
Program Timing: Long I/O Wait
Instruction Cycle (with Interrupts) State Diagram
Multiple Interrupts
• Disable interrupts
– Processor will ignore further interrupts whilst
processing one interrupt
– Interrupts remain pending and are checked after first
interrupt has been processed
– Interrupts handled in sequence as they occur
• Define priorities
– Low priority interrupts can be interrupted by higher
priority interrupts
– When higher priority interrupt has been processed,
processor returns to previous interrupt
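The priority scheme can be sketched as below; the numeric levels and the "pending" return value are illustrative assumptions:

    # Nested interrupts by priority (illustrative sketch).
    current_level, saved = 0, []       # 0 = ordinary program running

    def on_interrupt(level):
        global current_level
        if level <= current_level:     # equal/lower priority: stays pending
            return "pending"
        saved.append(current_level)    # interrupt the lower-priority work
        current_level = level          # ... handler for this level runs here ...
        current_level = saved.pop()    # then return to the previous interrupt
        return "handled"

    print(on_interrupt(2), on_interrupt(0))   # handled pending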
Multiple Interrupts – Sequential
Multiple Interrupts – Nested
Time Sequence of Multiple Interrupts
Computer Memory System
CHAPTER 4: CACHE MEMORY
Characteristics
• Location
• Capacity
• Unit of transfer
• Access method
• Performance
• Physical type
• Physical characteristics
• Organisation
Location
• CPU
• Internal
• External
Capacity
• Word size
– The natural unit of organisation
• Number of words
– or Bytes
Unit of Transfer
• Internal
– Usually governed by data bus width
• External
– Usually a block which is much larger than a word
• Addressable unit
– Smallest location which can be uniquely addressed
– Word internally
– Cluster on disk (e.g. Microsoft file systems)
Access Methods (1)
• Sequential
– Start at the beginning and read through in order
– Access time depends on location of data and previous
location
– e.g. tape
• Direct
– Individual blocks have unique address
– Access is by jumping to vicinity plus sequential search
– Access time depends on location and previous
location
– e.g. disk
Access Methods (2)
• Random
– Individual addresses identify locations exactly
– Access time is independent of location or previous
access
– e.g. RAM
• Associative
– Data is located by a comparison with contents of a
portion of the store
– Access time is independent of location or previous
access
– e.g. cache
Memory Hierarchy
• Registers
– In CPU
• Internal or Main memory
– May include one or more levels of cache
– “RAM”
• External memory
– Backing store
Memory Hierarchy - Diagram
Performance
• Access time
– Time between presenting the address and getting
the valid data
• Memory Cycle time
– Time may be required for the memory to
“recover” before next access
– Cycle time is access + recovery
• Transfer Rate
– Rate at which data can be moved
Going down the memory hierarchy, the following hold:
1. Decreasing cost per bit
2. Increasing capacity
3. Increasing access time
4. Decreasing frequency of access of the memory by the processor
Physical Types
• Semiconductor
– RAM
• Magnetic
– Disk & Tape
• Optical
– CD & DVD
• Others
– Bubble
– Hologram
Physical Characteristics
• Decay
• Volatility
• Erasable
• Power consumption
Organisation
• Physical arrangement of bits into words
• Not always obvious
• e.g. interleaved
The Bottom Line
• How much?
– Capacity
• How fast?
– Time is money
• How expensive?
Mapping Function
• Cache of 64 kBytes
• Cache block of 4 bytes
– i.e. cache is 16k (2^14) lines of 4 bytes
• 16 MBytes main memory
• 24 bit address
– (2^24 = 16M)
Direct Mapping
• Each block of main memory maps to only one
cache line
– i.e. if a block is in cache, it must be in one specific
place
• Address is in two parts
• Least significant w bits identify a unique word
• Most significant s bits specify one memory block
• The MSBs are split into a cache line field r and a tag of s - r bits (most significant)
Direct Mapping Address Structure

Tag (s - r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits

• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
– 8 bit tag (= 22 - 14)
– 14 bit slot or line
• No two blocks in the same line have the same Tag field
• Check contents of cache by finding line and checking Tag
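A short Python sketch of this address split, with the field widths above; the sample address mirrors the usual worked example:

    # Direct mapping: 24-bit address = tag(8) | line(14) | word(2).
    def split_direct(addr):
        word = addr & 0x3               # low 2 bits: byte within the 4-byte block
        line = (addr >> 2) & 0x3FFF     # next 14 bits: cache line number
        tag  = addr >> 16               # top 8 bits: tag stored with the line
        return tag, line, word

    print([hex(f) for f in split_direct(0x16339C)])   # ['0x16', '0xce7', '0x0']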
Direct Mapping from Cache to Main Memory
Direct Mapping Cache Line Table

Cache line | Main memory blocks held
0          | 0, m, 2m, 3m, ..., 2^s - m
1          | 1, m+1, 2m+1, ..., 2^s - m + 1
...        | ...
m-1        | m-1, 2m-1, 3m-1, ..., 2^s - 1
Direct Mapping Cache Organization
Hit and Miss
Direct Mapping Example
Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits
Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for given block
– If a program accesses 2 blocks that map to the
same line repeatedly, cache misses are very high
Victim Cache
• Lower miss penalty
• Remember what was discarded
– Already fetched
– Use again with little penalty
• Fully associative
• 4 to 16 cache lines
• Between direct mapped L1 cache and next
memory level
Associative Mapping
• A main memory block can load into any line of
cache
• Memory address is interpreted as tag and
word
• Tag uniquely identifies block of memory
• Every line’s tag is examined for a match
• Cache searching gets expensive
Associative Mapping from Cache to Main Memory
Fully Associative Cache Organization
Associative Mapping Example
Associative Mapping Address Structure

Tag: 22 bits | Word: 2 bits

• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to check for hit
• Least significant 2 bits of address identify which byte is required from the 32 bit data block
• e.g.

Address | Tag    | Data     | Cache line
FFFFFC  | 3FFFFF | 24682468 | 3FFF
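The same split for associative mapping, in Python; it reproduces the FFFFFC row above:

    # Associative mapping: 24-bit address = tag(22) | word(2).
    def split_assoc(addr):
        return addr >> 2, addr & 0x3     # (tag, byte within block)

    tag, word = split_assoc(0xFFFFFC)
    print(hex(tag), word)                # 0x3fffff 0 -- tag 3FFFFF as in the table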
Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits
Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
– e.g. Block B can be in any line of set i
• e.g. 2 lines per set
– 2 way associative mapping
– A given block can be in one of 2 lines in only one
set
Set Associative Mapping Example
• 13 bit set number
• Block number in main memory is modulo 2^13
• Addresses 000000, 008000, 010000, 018000 ... map to the same set
Mapping From Main Memory to Cache: v Associative
Mapping From Main Memory to Cache: k-way Associative
K-Way Set Associative Cache Organization
Set Associative Mapping Address Structure

Tag: 9 bits | Set: 13 bits | Word: 2 bits

• Use set field to determine cache set to look in
• Compare tag field to see if we have a hit
• e.g.

Address  | Tag | Data     | Set number
1FF 7FFC | 1FF | 12345678 | 1FFF
001 7FFC | 001 | 11223344 | 1FFF
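And the three-way set-associative split; 0xFFFFFC and 0x00FFFC are the two table rows written as single 24-bit addresses:

    # Set associative: 24-bit address = tag(9) | set(13) | word(2).
    def split_set(addr):
        word = addr & 0x3
        set_ = (addr >> 2) & 0x1FFF      # 13-bit set number
        tag  = addr >> 15                # 9-bit tag
        return tag, set_, word

    print([hex(f) for f in split_set(0xFFFFFC)])   # ['0x1ff', '0x1fff', '0x0']
    print([hex(f) for f in split_set(0x00FFFC)])   # ['0x1', '0x1fff', '0x0']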
Two Way Set Associative Mapping Example
Set Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = kv = k * 2^d
• Size of tag = (s - d) bits
Central Processing Unit (CPU) Basics
CHAPTER 14: PROCESSOR STRUCTURE AND FUNCTION
Pipelining
• Fetch instruction
• Decode instruction
• Calculate operands (i.e. EAs)
• Fetch operands
• Execute instruction
• Write result
• Overlap these operations
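A toy Python loop makes the overlap visible by printing which instruction occupies each of the six stages per clock cycle (purely illustrative):

    # Pipeline timing: instruction i occupies stage s at cycle i + s.
    STAGES, N = ["FI", "DI", "CO", "FO", "EI", "WO"], 4   # 4 instructions

    for cycle in range(N + len(STAGES) - 1):
        row = ["I%d" % (cycle - s + 1) if 0 <= cycle - s < N else "--"
               for s in range(len(STAGES))]
        print("cycle %2d:" % (cycle + 1), "  ".join(row))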
Two Stage Instruction Pipeline
Timing Diagram for Instruction Pipeline Operation
The Effect of a Conditional Branch on Instruction Pipeline Operation
Six Stage Instruction Pipeline
Alternative Pipeline Depiction
Pipeline Hazards
• Pipeline, or some portion of pipeline, must
stall.
• Also called pipeline bubble.
Types of Hazards
• Data hazards
• Structural hazards (resource)
• Control hazards (branching hazards)

Data Hazards
• Read after write (RAW) – a true dependency
• Write after read (WAR) – an anti-dependency
• Write after write (WAW) – an output dependency
Consider two instructions i1 and i2, with i1 occurring
before i2 in program order.
Read After Write (RAW)
i2 tries to read a source before i1 writes to it.
A read after write (RAW) data hazard refers to a
situation where an instruction refers to a result that
has not yet been calculated or retrieved
i1. R2 <= R1 + R3
i2. R4 <= R2 + R3
Write After Read (WAR)
i2 tries to write a destination before it is read by i1.
A write after read (WAR) data hazard represents a problem with concurrent execution.
i1. R4 <= R1 + R5
i2. R5 <= R1 + R2
Write After Write (WAW)
i2 tries to write an operand before it is written by i1.
A write after write (WAW) data hazard may occur in a
concurrent execution environment.
i1. R2 <= R4 + R7
i2. R2 <= R1 + R3
Data Hazards
• Conflict in access of an operand location
• Two instructions to be executed in sequence
• Both access a particular memory or register operand
• If executed in strict sequence, no problem occurs
• If in a pipeline, the operand value could be updated so as to produce a different result from strict sequential execution
• E.g. x86 machine instruction sequence:
ADD EAX, EBX /* EAX = EAX + EBX */
SUB ECX, EAX /* ECX = ECX - EAX */
• ADD does not update EAX until the end of stage 5, at clock cycle 5
• SUB needs the value at the beginning of its stage 2, at clock cycle 4
• Pipeline must stall for two clock cycles (see the sketch below)
• Without special hardware and specific avoidance algorithms, this results in inefficient pipeline usage
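A small Python sketch that detects this RAW hazard and the resulting stall, under the slide's timing assumptions (result written at the end of stage 5, operand needed at the start of stage 2, one instruction issued per cycle):

    # RAW stall estimate for two adjacent instructions (illustrative).
    WRITE_STAGE, READ_STAGE = 5, 2          # from the slide's pipeline timing

    def raw_stall(writes, reads):
        """Stall cycles if instruction 2 reads a register instruction 1 writes."""
        if not (set(writes) & set(reads)):
            return 0                         # no shared register: no RAW hazard
        # i1 finishes writing at cycle WRITE_STAGE; i2, issued one cycle later,
        # wants the operand at cycle 1 + READ_STAGE.
        return WRITE_STAGE - (1 + READ_STAGE)

    print(raw_stall(writes=["EAX"], reads=["ECX", "EAX"]))  # 2, as on the slide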
Data Hazard Diagram
Resource Hazards
• Two (or more) instructions in pipeline need same resource
• Executed in serial rather than parallel for part of pipeline
• Also called structural hazard
• E.g. assume a simplified five-stage pipeline in which each stage takes one clock cycle
– Ideal case is a new instruction entering the pipeline each clock cycle
– Assume main memory has a single port
– Assume instruction fetches and data reads/writes are performed one at a time
– Ignore the cache
– An operand read or write cannot then be performed in parallel with an instruction fetch
– Fetch instruction stage must idle for one cycle while fetching I3
• E.g. multiple instructions ready to enter the execute instruction phase
– Single ALU
• One solution: increase available resources
– Multiple main memory ports
– Multiple ALUs
Resource Hazard Diagram
Control Hazard
• Also known as branch hazard
• Pipeline makes wrong decision on branch prediction
• Brings instructions into pipeline that must
subsequently be discarded
• Dealing with branches:
– Multiple streams
– Pre-fetch branch target
– Loop buffer
– Branch prediction
– Delayed branching
• The stall normally occurs in the fetch stage
Multiple Streams
• Have two pipelines
• Pre-fetch each branch into a separate pipeline
• Use appropriate pipeline
• Leads to bus & register contention
• Multiple branches lead to further pipelines
being needed
Prefetch Branch Target
• Target of branch is pre-fetched in addition to
instructions following branch
• Keep target until branch is executed
• Used by IBM 360/91
Loop Buffer
• Very fast memory
• Maintained by fetch stage of pipeline
• Check buffer before fetching from memory
• Very good for small loops or jumps
• c.f. cache
• Used by CRAY-1
Loop Buffer Diagram
Branch Prediction (1)
• Predict never taken
– Assume that jump will not happen
– Always fetch next instruction
– 68020 & VAX 11/780
– VAX will not pre-fetch after branch if a page fault
would result (O/S v CPU design)
• Predict always taken
– Assume that jump will happen
– Always fetch target instruction
Branch Prediction (2)
• Predict by opcode
– Some instructions are more likely to result in a jump than others
– Can get up to 75% success
• Taken/not taken switch
– Based on previous history
– Good for loops
– Refined by two-level or correlation-based branch history
• Correlation-based
– In loop-closing branches, history is a good predictor
– In more complex structures, branch direction correlates with that of related branches
– Use recent branch history as well
Branch Prediction (3)
• Delayed Branch
– Do not take jump until you have to
– Rearrange instructions
Branch Prediction Flowchart
Branch Prediction State Diagram
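This state diagram is conventionally drawn as a 2-bit saturating counter; a minimal Python sketch (the numeric state encoding is an assumption):

    # 2-bit saturating-counter branch predictor (the usual 4-state diagram).
    state = 3                      # 0,1 = predict not taken; 2,3 = predict taken

    def predict():
        return state >= 2          # True means "predict taken"

    def update(taken):
        global state               # move one step toward the actual outcome
        state = min(3, state + 1) if taken else max(0, state - 1)

    for outcome in [True, True, False, True]:   # a loop-like branch history
        print(predict(), outcome); update(outcome)

Two consecutive mispredictions are needed to flip the prediction, which is why this scheme works well for loop-closing branches.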
Processor Internals
CHAPTER 16: INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS
What is Superscalar?
• Common instructions (arithmetic, load/store,
conditional branch) can be initiated and
executed independently
• Equally applicable to RISC & CISC
• In practice usually RISC
Why Superscalar?
• Most operations are on scalar quantities (see
RISC notes)
• Improve these operations to get an overall
improvement
General Superscalar Organization
Superpipelined
• Many pipeline stages need less than half a
clock cycle
• Double internal clock speed gets two tasks per
external clock cycle
• Superscalar allows parallel fetch and execute
Superscalar v Superpipeline
Limitations
• Instruction level parallelism
• Compiler based optimisation
• Hardware techniques
• Limited by:
– True data dependency
– Procedural dependency
– Resource conflicts
– Output dependency
– Anti-dependency
True Data Dependency
• ADD r1, r2 (r1 := r1+r2;)
• MOVE r3,r1 (r3 := r1;)
• Can fetch and decode second instruction in
parallel with first
• Can NOT execute second instruction until first
is finished
Procedural Dependency
• Can not execute instructions after a branch in
parallel with instructions before a branch
• Also, if instruction length is not fixed,
instructions have to be decoded to find out
how many fetches are needed
• This prevents simultaneous fetches
Resource Conflict
• Two or more instructions requiring access to
the same resource at the same time
– e.g. two arithmetic instructions
• Can duplicate resources
– e.g. have two arithmetic units
Effect of Dependencies
Processor Internals
CHAPTER 20: CONTROL UNIT OPERATION
Hardwired and Micro-programmed Control Unit Implementation
Hardwired Implementation (1)
• Control unit inputs
• Flags and control bus
– Each bit means something
• Instruction register
– Op-code causes different control signals for each
different instruction
– Unique logic for each op-code
– Decoder takes encoded input and produces single
output
– n binary inputs and 2^n outputs
Hardwired Implementation (2)
• Clock
– Repetitive sequence of pulses
– Useful for measuring duration of micro-ops
– Must be long enough to allow signal propagation
– Different control signals at different times within
instruction cycle
– Need a counter with different control signals for
t1, t2 etc.
Control Unit with Decoded Inputs
PART 6 (1/2): Enhancing CPU Performance
CHAPTER 21: MICROPROGRAMMED CONTROL
Control Unit Organization
Micro-programmed Control
• Use sequences of instructions (see earlier
notes) to control complex operations
• Called micro-programming or firmware
Implementation (1)
• All the control unit does is generate a set of
control signals
• Each control signal is on or off
• Represent each control signal by a bit
• Have a control word for each micro-operation
• Have a sequence of control words for each
machine code instruction
• Add an address to specify the next micro-instruction, depending on conditions
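A tiny Python sketch of this idea: each control word is a bit vector of control signals plus a next-address field. The signal names, widths, and addresses are invented for illustration:

    # Microinstruction = control-signal bits + next-address field (illustrative).
    SIGNALS = ["MAR<-PC", "READ", "IR<-MBR", "PC+1", "ALU_ADD", "AC<-ALU"]

    def microword(active, next_addr):
        bits = sum(1 << SIGNALS.index(s) for s in active)  # one bit per signal
        return (bits, next_addr)

    # A two-word fetch sequence: drive signals, then branch to the decode routine.
    control_memory = [
        microword(["MAR<-PC", "READ"], next_addr=1),
        microword(["IR<-MBR", "PC+1"], next_addr=16),  # 16 = assumed decode entry
    ]
    print(control_memory)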
Implementation (2)
• Today’s large microprocessor
– Many instructions and associated register-level
hardware
– Many control points to be manipulated
• This results in control memory that
– Contains a large number of words
• corresponding to the number of instructions to be executed
– Has a wide word width
• Due to the large number of control points to be manipulated
Micro-program Word Length
• Based on 3 factors
– Maximum number of simultaneous micro-operations supported
– The way control information is represented or
encoded
– The way in which the next micro-instruction
address is specified
Micro-instruction Types
• Each micro-instruction specifies single (or few)
micro-operations to be performed
– (vertical micro-programming)
• Each micro-instruction specifies many
different micro-operations to be performed in
parallel
– (horizontal micro-programming)
Enhancing CPU Performance
CHAPTER 17: PARALLEL PROCESSING
Multiple Processor Organization
• Single instruction, single data stream – SISD
• Single instruction, multiple data stream – SIMD
• Multiple instruction, single data stream – MISD
• Multiple instruction, multiple data stream – MIMD
Single Instruction, Single Data Stream - SISD
• Single processor
• Single instruction stream
• Data stored in single memory
• Uni-processor
Single Instruction, Multiple Data Stream - SIMD
• Single machine instruction
• Controls simultaneous execution
• Number of processing elements
• Lockstep basis
• Each processing element has associated data memory
• Each instruction executed on different set of data by different processors
• Vector and array processors
Multiple Instruction, Single Data Stream - MISD
• Sequence of data
• Transmitted to set of processors
• Each processor executes different instruction
sequence
• Never been implemented
Multiple Instruction, Multiple Data Stream - MIMD
• Set of processors
• Simultaneously execute different instruction
sequences
• Different sets of data
• SMPs, clusters and NUMA systems
Taxonomy of Parallel Processor Architectures
MIMD - Overview
• General purpose processors
• Each can process all instructions necessary
• Further classified by method of processor
communication
Tightly Coupled - SMP
• Processors share memory
• Communicate via that shared memory
• Symmetric Multiprocessor (SMP)
– Share single memory or pool
– Shared bus to access memory
– Memory access time to given area of memory is
approximately the same for each processor
Tightly Coupled - NUMA
• Nonuniform memory access
• Access times to different regions of memory may
differ
Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs
• Interconnected to form a cluster
• Communication via fixed path or network connections
Parallel Organizations - SISD
Parallel Organizations - SIMD
Parallel Organizations - MISD
Parallel Organizations - MIMD Shared Memory
Parallel Organizations - MIMD Distributed Memory
CPU Externals
CHAPTER 7: INPUT/OUTPUT
Input/Output Problems
• Wide variety of peripherals
– Delivering different amounts of data
– At different speeds
– In different formats
• All slower than CPU and RAM
• Need I/O modules
Generic Model of I/O Module
External Devices
• Human readable
– Screen, printer, keyboard
• Machine readable
– Monitoring and control
• Communication
– Modem
– Network Interface Card (NIC)
External Device Block Diagram
I/O Module Function
• Control & timing
• CPU communication
• Device communication
• Data buffering
• Error detection
I/O Steps
• CPU checks I/O module device status
• I/O module returns status
• If ready, CPU requests data transfer
• I/O module gets data from device
• I/O module transfers data to CPU
• Variations for output, DMA, etc.
I/O Module Diagram
Input Output Techniques
• Programmed
• Interrupt driven
• Direct Memory Access (DMA)
Three Techniques for Input of a Block of Data
Programmed I/O
• CPU has direct control over I/O
– Sensing status
– Read/write commands
– Transferring data
• CPU waits for I/O module to complete
operation
• Wastes CPU time
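A minimal polling loop shows why programmed I/O wastes CPU time; the status/data register addresses and the READY bit are illustrative assumptions:

    # Programmed I/O: CPU busy-waits on the device status (illustrative).
    STATUS, DATA, READY = 0xF0, 0xF1, 0x01      # assumed register map
    io_space = {STATUS: 0x01, DATA: 0x42}       # fake device for the sketch

    def read_byte():
        while not (io_space[STATUS] & READY):   # sense status: wastes CPU cycles
            pass                                # ...until the module is ready
        return io_space[DATA]                   # then transfer the data

    print(hex(read_byte()))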
I/O Commands
• CPU issues address
– Identifies module (& device if >1 per module)
• CPU issues command
– Control - telling module what to do
• e.g. spin up disk
– Test - check status
• e.g. power? Error?
– Read/Write
• Module transfers data via buffer from/to device
I/O Mapping
• Memory mapped I/O
– Devices and memory share an address space
– I/O looks just like memory read/write
– No special commands for I/O
• Large selection of memory access commands available
• Isolated I/O
– Separate address spaces
– Need I/O or memory select lines
– Special commands for I/O
• Limited set
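A sketch contrasting the two mappings with a toy bus decoder; the address ranges and register address are invented for illustration:

    # Memory-mapped I/O: one address space, decoded by range (illustrative).
    ram = [0] * 0x1000
    device_reg = {0x3F8: 0x42}              # assumed device register address

    def bus_read(addr):
        if addr in device_reg:              # I/O looks just like a memory read
            return device_reg[addr]
        return ram[addr]

    # Isolated I/O would instead use a separate operation (its own select line):
    def io_read(port):                      # special I/O command, small port space
        return device_reg[port]

    print(bus_read(0x3F8), io_read(0x3F8))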
Memory Mapped and Isolated I/O
Interrupt Driven I/O
• Overcomes CPU waiting
• No repeated CPU checking of device
• I/O module interrupts when ready
Simple Interrupt Processing
Changes in Memory and Registers for an Interrupt
Multiple Interrupts
• Each interrupt line has a priority
• Higher priority lines can interrupt lower
priority lines
• If bus mastering only current master can
interrupt
Example - PC Bus
• 80x86 has one interrupt line
• 8086 based systems use one 8259A interrupt
controller
• 8259A has 8 interrupt lines
82C59A Interrupt Controller
Direct Memory Access
• Interrupt driven and programmed I/O require
active CPU intervention
– Transfer rate is limited
– CPU is tied up
• DMA is the answer
DMA Function
• Additional Module (hardware) on bus
• DMA controller takes over from CPU for I/O
Typical DMA Module Diagram
DMA Operation
• CPU tells DMA controller:
– Read/write
– Device address
– Starting address of memory block for data
– Amount of data to be transferred
• CPU carries on with other work
• DMA controller deals with transfer
• DMA controller sends interrupt when finished
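A compact sketch of this handshake; the parameter names and the interrupt callback are illustrative assumptions, and the first two arguments are shown only to mirror the list above:

    # DMA handoff: CPU programs the controller, then continues (illustrative).
    memory = [0] * 64
    device_data = [7, 8, 9]

    def dma_transfer(write, dev_addr, start, count, on_done):
        # The DMA controller, not the CPU, moves each word...
        for i in range(count):
            memory[start + i] = device_data[i]   # device -> memory (a "read")
        on_done()                                # ...then interrupts when finished

    dma_transfer(write=False, dev_addr=1, start=16, count=3,
                 on_done=lambda: print("DMA complete interrupt"))
    print(memory[16:19])   # [7, 8, 9], moved while the CPU was free to work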
DMA and Interrupt Breakpoints During an Instruction Cycle
DMA Configurations (1)
• Single Bus, Detached DMA controller
• Each transfer uses bus twice
– I/O to DMA then DMA to memory
• CPU is suspended twice
DMA Configurations (2)
• Single Bus, Integrated DMA controller
• Controller may support >1 device
• Each transfer uses bus once
– DMA to memory
• CPU is suspended once
DMA Configurations (3)
• Separate I/O Bus
• Bus supports all DMA enabled devices
• Each transfer uses bus once
– DMA to memory
• CPU is suspended once