Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Embedded Processor Architecture
5kk73
RISC
Instruction Set
Implementation Alternatives
== using MIPS as example ==
TU/e 5kk73
Henk Corporaal
2013
Topics
MIPS ISA: Instruction Set Architecture
MIPS single cycle implementation
MIPS multi-cycle implementation
MIPS pipelined implementation
Pipeline hazards
Recap of RISC principles
Other architectures
Based on the book: ch2-4 (4th ed)
Many slides; I'll go quick and
skip some
H.Corporaal EmbProcArch 5kk73
2
Main Types of Instructions
Arithmetic
Memory access instructions
Integer
Floating Point
Load & Store
Control flow
Jump
Conditional Branch
Call & Return
H.Corporaal EmbProcArch 5kk73
3
MIPS arithmetic
Most instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code:
A = B + C
MIPS code: add $s0, $s1, $s2
($s0, $s1 and $s2 are associated with variables by
compiler)
H.Corporaal EmbProcArch 5kk73
4
MIPS arithmetic
C code:
A = B + C + D;
E = F - A;
MIPS code: add $t0, $s1, $s2
add $s0, $t0, $s3
sub $s4, $s5, $s0
Operands must be registers, only 32 registers provided
Design Principle: smaller is faster.
Why?
H.Corporaal EmbProcArch 5kk73
5
Registers vs. Memory
Arithmetic instruction operands must be registers,
— only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables ?
CPU
Memory
register file
IO
H.Corporaal EmbProcArch 5kk73
6
Register allocation
Compiler tries to keep as many variables in registers as
possible
Some variables can not be allocated
large arrays (too few registers)
aliased variables (variables accessible through pointers in C)
dynamic allocated variables
 heap
 stack
Compiler may run out of registers => spilling
H.Corporaal EmbProcArch 5kk73
7
Memory Organization
Viewed as a large, single-dimension array, with an
address
A memory address is an index into the array
"Byte addressing" means that successive addresses are
one byte apart
0
8 bits of data
1
8 bits of data
2
8 bits of data
3
8 bits of data
4
8 bits of data
5
8 bits of data
6
8 bits of data
...
H.Corporaal EmbProcArch 5kk73
8
Memory Organization
Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
0
32 bits of data
4
32 bits of data
8
32 bits of data
...
12
32 bits of data
Registers hold 32 bits of data
232 bytes with byte addresses from 0 to 232-1
230 words with byte addresses 0, 4, 8, ... 232-4
H.Corporaal EmbProcArch 5kk73
9
Memory layout: Alignment
31
address
0
4
23
15
7
0
this word is aligned; the others are not!
8
12
16
20
24
Words are aligned
What are the least 2 significant bits of a word
address?
H.Corporaal EmbProcArch 5kk73
10
Memory hierarchy
H.Corporaal EmbProcArch 5kk73
11
High end Intel i7 (Ivy Bridge) die
L1+L2 L1+L2 L1+L2 L1+L2
•1.4 billion transistors
•160 mm2
•3.5 GHz
•775kk73
W / 22 nm
H.Corporaal EmbProcArch
•64KB L1 per core
•256 KB L2 per core
•8MB L3 shared
12
Instructions: load and store
Example:
C code:
A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
add $t0, $s2, $t0
sw $t0, 32($s3)
Store word operation has no destination (reg) operand
Remember arithmetic operands are registers, not
memory!
H.Corporaal EmbProcArch 5kk73
13
Let's translate some C-code
Can we figure out the code?
swap(int v[], int k);
{ int temp;
temp = v[k]
v[k] = v[k+1];
v[k+1] = temp;
}
swap:
muli
add
lw
lw
sw
sw
jr
$2 ,
$2 ,
$15,
$16,
$16,
$15,
$31
$5, 4
$4, $2
0($2)
4($2)
0($2)
4($2)
Explanation:
index k : $5
base address of v: $4
address of v[k] is $4 + 4.$5
H.Corporaal EmbProcArch 5kk73
14
Machine Language
Instructions, like registers and words of data, are also 32 bits long
Example: add $t0, $s1, $s2
Registers have numbers: $t0=9, $s1=17, $s2=18
Instruction Format:
op
000000
6 bits
rs
rt
10001
10010
5 bits
5 bits
rd
01001
5 bits
shamt
00000
5 bits
funct
100000
6 bits
Can you guess what the field names stand for?
H.Corporaal EmbProcArch 5kk73
15
Machine Language
Consider the load-word and store-word instructions,
Introduce a new type of instruction format
What would the regularity principle have us do?
New principle: Good design demands a compromise
I-type for data transfer instructions
other format was R-type for register
Example: lw $t0, 32($s2)
H.Corporaal EmbProcArch 5kk73
35
18
9
op
rs
rt
32
16 bit number
16
Stored Program Concept
memory
OS
Program 1
CPU
code
global data
stack
heap
unused
Program 2
unused
H.Corporaal EmbProcArch 5kk73
17
Control flow
Decision making instructions
alter the control flow,
i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
beq $t0, $t1, Label
Example:
if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label:
....
H.Corporaal EmbProcArch 5kk73
18
Control flow
MIPS unconditional branch instructions:
j label
Example:
if (i!=j)
h=i+j;
else
h=i-j;
beq $s4, $s5, Lab1
add $s3, $s4, $s5
j Lab2
Lab1:sub $s3, $s4, $s5
Lab2:...
Can you build a simple for loop?
H.Corporaal EmbProcArch 5kk73
19
So far:
Instruction
Meaning
add $s1,$s2,$s3
sub $s1,$s2,$s3
lw $s1,100($s2)
sw $s1,100($s2)
bne $s4,$s5,L
beq $s4,$s5,L
j Label
$s1 = $s2 + $s3
$s1 = $s2 – $s3
$s1 = Memory[$s2+100]
Memory[$s2+100] = $s1
Next instr. is at Label if $s4 ° $s5
Next instr. is at Label if $s4 = $s5
Next instr. is at Label
Formats:
R
op
rs
rt
rd
I
op
rs
rt
16 bit address
J
op
H.Corporaal EmbProcArch 5kk73
shamt
funct
26 bit address
20
Control Flow
We have: beq, bne, what about Branch-if-less-than?
New instruction:
meaning:
if
slt $t0, $s1, $s2
$s1 < $s2 then
$t0 = 1
else
$t0 = 0
Can use this instruction to build "blt $s1, $s2, Label"
— can now build general control structures
Note that the assembler needs a register to do this,
— use conventions for registers
H.Corporaal EmbProcArch 5kk73
21
MIPS compiler/assembler Conventions
Name Register number
Usage
$zero
0
the constant value 0
$v0-$v1
2-3
values for results and expression evaluation
$a0-$a3
4-7
arguments
$t0-$t7
8-15
temporaries
$s0-$s7
16-23
saved (by callee)
$t8-$t9
24-25
more temporaries
$gp
28
global pointer
$sp
29
stack pointer
$fp
30
frame pointer
$ra
31
return address
H.Corporaal EmbProcArch 5kk73
22
Constants
Small constants are used quite frequently (50% of operands)
e.g., A = A + 5;
B = B + 1;
C = C - 18;
Solutions? Why not?
put 'typical constants' in memory and load them
create hard-wired registers (like $zero) for constants like 0, 1, 2, …
or …….
MIPS Instructions:
addi
slti
andi
ori
H.Corporaal EmbProcArch 5kk73
$29,
$8,
$29,
$29,
$29,
$18,
$29,
$29,
4
10
6
4
3
23
How about larger constants?
We'd like to be able to load a 32 bit constant into a register
Must use two instructions; new "load upper immediate"
instruction
lui $t0, 1010101010101010 filled with zeros
1010101010101010
0000000000000000
Then must get the lower order bits right, i.e.,
ori $t0, $t0, 1010101010101010
ori
1010101010101010
0000000000000000
0000000000000000
1010101010101010
1010101010101010
1010101010101010
H.Corporaal EmbProcArch 5kk73
24
Assembly Language vs. Machine Language
Assembly provides convenient symbolic representation
much easier than writing down numbers
e.g., destination first
Machine language is the underlying reality
e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'
e.g., “move $t0, $t1” exists only in Assembly
would be implemented using “add $t0,$t1,$zero”
Another pseudo instr: blt $t1, $t2, label
When considering performance you should count real instructions
H.Corporaal EmbProcArch 5kk73
25
Addresses in Branches and Jumps
Instructions:
bne $t4,$t5,Label
beq $t4,$t5,Label
j Label
Next instruction is at Label if $t4  $t5
Next instruction is at Label if $t4 = $t5
Next instruction is at Label
Formats:
I
op
J
op
rs
rt
16 bit address
26 bit address
Addresses are not 32 bits
— How do we handle this with load and store instructions?
H.Corporaal EmbProcArch 5kk73
26
What's the next address?
Instructions:
Next instruction is at Label if $t4  $t5
Next instruction is at Label if $t4 = $t5
bne $t4,$t5,Label
beq $t4,$t5,Label
Formats:
I
op
rs
rt
16 bit address
Could specify a register (like lw and sw) and add it to address
use Instruction Address Register (PC = program counter)
most branches are local (principle of locality)
Jump instructions just use high order bits of PC
address boundaries of 256 MB
H.Corporaal EmbProcArch 5kk73
27
To summarize:
add
MIPS assembly language
Example
Meaning
add $s1, $s2, $s3
$s1 = $s2 + $s3
Three operands; data in registers
subtract
sub $s1, $s2, $s3
$s1 = $s2 - $s3
Three operands; data in registers
addi $s1, $s2, 100
lw $s1, 100($s2)
sw $s1, 100($s2)
lb $s1, 100($s2)
sb $s1, 100($s2)
lui $s1, 100
$s1 = $s2 + 100
Used to add constants
$s1 = Memory[$s2 + 100]Word from memory to register
Memory[$s2 + 100] = $s1 Word from register to memory
$s1 = Memory[$s2 + 100]Byte from memory to register
Memory[$s2 + 100] = $s1 Byte from register to memory
Loads constant in upper 16 bits
$s1 = 100 * 2 16
beq
$s1, $s2, 25
if ($s1 == $s2) go to
PC + 4 + 100
Equal test; PC-relative branch
branch on not equal bne
$s1, $s2, 25
if ($s1 != $s2) go to
PC + 4 + 100
Not equal test; PC-relative
$s1, $s2, $s3
if ($s2 < $s3) $s1 = 1;
else $s1 = 0
Compare less than; for beq, bne
Category
Arithmetic
Instruction
add immediate
load w ord
store w ord
Data transfer load byte
store byte
load upper
immediate
branch on equal
Conditional
branch
Unconditional jump
set on less than
slt
set less than
immediate
slti
jump
jump register
jump and link
j
jr
jal
H.Corporaal EmbProcArch 5kk73
$s1, $s2, 100 if ($s2 < 100) $s1 = 1;
Comments
Compare less than constant
else $s1 = 0
2500
$ra
2500
go to 10000
Jump to target address
go to $ra
For sw itch, procedure return
$ra = PC + 4; go to 10000 For procedure call
28
MIPS (3+2) addressing modes overview
1. Immediate addressing
op
rs
rt
Immediate
2. Register addressing
op
rs
rt
rd
...
funct
Registers
Register
3. Base addressing
op
rs
rt
Memory
Address
+
Register
Byte
Halfword
Word
4. PC-relative addressing
op
rs
rt
Memory
Address
PC
+
Word
5. Pseudodirect addressing
op
Address
PC
H.Corporaal EmbProcArch 5kk73
Memory
Word
29
MIPS Datapath
Building a datapath
A single cycle processor datapath
all instruction actions in one (long) cycle
A multi-cycle processor datapath
support a subset of the MIPS-I instruction-set
each instructions takes multiple (shorter) cycles
For details see book (ch 5, 3rd ed. Or
ch 4 in 4th ed.):
H.Corporaal EmbProcArch 5kk73
30
Datapath and Control
FSM
or
Microprogramming
Registers &
Memories
Multiplexors
Buses
ALUs
Control
H.Corporaal EmbProcArch 5kk73
Datapath
31
The Processor: Datapath & Control
Simplified MIPS implementation to contain only:
lw, sw
add, sub, and, or, slt
beq, j
Generic Implementation:
memory-reference instructions:
arithmetic-logical instructions:
control flow instructions:
use the program counter (PC) to supply instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
Why?
memory-reference?
 arithmetic?
 control flow?
H.Corporaal EmbProcArch 5kk73
32
More Implementation Details
Abstract / Simplified View:
Data
PC
Address
Instruction
memory
Instruction
Register #
Registers
Register #
ALU
Address
Data
memory
Register #
Data
Two types of functional units:
elements that operate on data values (combinational)
elements that contain state (sequential)
H.Corporaal EmbProcArch 5kk73
33
State Elements
Unclocked vs. Clocked
Clocks used in synchronous logic
when should an element that contains state be updated?
falling edge
cycle time
rising edge
H.Corporaal EmbProcArch 5kk73
34
An unclocked state element
The set-reset (SR) latch
output depends on present inputs and also on past inputs
R
Q
Q
S
Truth table:
H.Corporaal EmbProcArch 5kk73
R
0
0
1
1
S
0
1
0
1
Q
Q
1
0
?
state change
35
Latches and Flip-flops
Output is equal to the stored value inside the element
(don't need to ask for permission to look at the value)
Change of state (value) is based on the clock
Latches: whenever the inputs change, and the clock is asserted
Flip-flop: state changes only on a clock edge
(edge-triggered methodology)
A clocking methodology defines when signals can be read and written
— wouldn't want to read a signal at the same time it was being written
H.Corporaal EmbProcArch 5kk73
36
D-latch (level-sensitive)
Two inputs:
the data value to be stored (D)
the clock signal (C) indicating when to read & store data (D)
Two outputs:
the value of the internal state (Q) and it's complement
C
D
Q
C
_
Q
D
H.Corporaal EmbProcArch 5kk73
Q
37
D flip-flop (edge-triggered)
Output changes only on the clock edge
D
D
C
D
latch
Q
D
Q
D
latch _
C
Q
Q
_
Q
C
D
C
Q
H.Corporaal EmbProcArch 5kk73
38
Our Implementation
An edge triggered methodology
Typical execution:
read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements
State
element
1
Combinational logic
State
element
2
Clockcycle
H.Corporaal EmbProcArch 5kk73
39
Register File
3-ported: one write, two read ports
Read reg. #1
Read
data 1
Read reg.#2
Read
data 2
Write reg.#
Write
data
Write
H.Corporaal EmbProcArch 5kk73
40
Register file: read ports
• Register file built using D flip-flops
Read register
number 1
Register 0
Register 1
M
Register n – 1
u
x
Read data 1
Register n
Read register
number 2
M
u
Read data 2
x
Implementation of the read ports
H.Corporaal EmbProcArch 5kk73
41
Register file: write port
Note: we still use the real clock to determine when to
write
W r ite
0
1
R e g is te r n um b e r
n -to -1
C
R e g iste r 0
D
C
d e co d e r
n – 1
R e g iste r 1
D
n
C
R e g is te r n – 1
D
C
R e g iste r n
R e gister d ata
H.Corporaal EmbProcArch 5kk73
D
42
Building the Datapath
Use multiplexors to stitch them together
PCSrc
M
u
x
Add
Add ALU
result
4
Shift
left 2
Registers
PC
Read
address
Instruction
Instruction
memory
Read
register 1
Read
Read
data 1
register 2
Write
register
Write
data
RegWrite
16
H.Corporaal EmbProcArch 5kk73
ALUSrc
Read
data 2
Sign
extend
M
u
x
3 ALU operation
Zero
ALU ALU
result
MemWrite
MemtoReg
Address
Read
data
Data
Write memory
data
M
u
x
32
MemRead
43
Our Simple Control Structure
All of the logic is combinational
We wait for everything to settle down, and the right thing
to be done
ALU might not produce “right answer” right away
we use write signals along with clock to determine when to write
Cycle time determined by length of the longest path
S tate
elem ent
1
Com binational logic
State
elem ent
2
Clock cycle
We are ignoring some details like setup and hold times !
H.Corporaal EmbProcArch 5kk73
44
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example:
add $8, $17, $18
000000
op
Instruction Format:
10001
rs
10010
rt
01000
rd
00000
shamt
100000
funct
ALU's operation based on instruction type and function code
H.Corporaal EmbProcArch 5kk73
45
Control: 2 level implementation
31
6
Control 2
26
instruction register
Opcode
bit
2
ALUop
00: lw, sw
01: beq
10: add, sub, and, or, slt
Funct.
Control 1
5
6
3
ALUcontrol 000: and
001: or
010: add
110: sub
111: set on less than
ALU
0
H.Corporaal EmbProcArch 5kk73
46
Datapath with Control
0
M
u
x
Add ALU
result
Add
4
Instruction[31–26]
PC
Read
register 1
Instruction[20–16]
Instruction
[31–0]
Instruction
memory
Instruction[15–11]
Shift
left 2
RegDst
Branch
MemRead
MemtoReg
Control ALUOp
MemWrite
ALUSrc
RegWrite
Instruction[25–21]
Read
address
0
M
u
x
1
1
Read
data1
Read
register 2
Registers Read
Write
data2
register
0
M
u
x
1
Write
data
Zero
ALU ALU
result
Address
Write
data
Instruction[15–0]
16
Sign
extend
Read
data
Data
memory
1
M
u
x
0
32
ALU
control
Instruction[5–0]
H.Corporaal EmbProcArch 5kk73
47
ALU Control1
What should the ALU do with this instruction
example: lw $1, 100($2)
35
2
1
op
rs
rt
16 bit offset
ALU control input
000
001
010
110
111
100
AND
OR
add
subtract
set-on-less-than
Why is the code for subtract 110 and not 011?
H.Corporaal EmbProcArch 5kk73
48
ALU Control1
Must describe hardware to compute 3-bit ALU control input
given instruction type
00 = lw, sw
01 = beq,
10 = arithmetic
function code for arithmetic
ALU Operation class,
computed from instruction type
Describe it using a truth table (can turn into gates):
inputs
ALUOp
ALUOp1 ALUOp0
0
0
X
1
1
X
1
X
1
X
1
X
1
X
H.Corporaal EmbProcArch 5kk73
F5
X
X
X
X
X
X
X
Funct field
F4 F3 F2 F1
X X X X
X X X X
X 0 0 0
X 0 0 1
X 0 1 0
X 0 1 0
X 1 0 1
outputs
Operation
F0
X
X
0
0
0
1
0
010
110
010
110
000
001
111
49
ALU Control1
Simple combinational logic (truth tables)
ALUOp
ALU control block
ALUOp0
ALUOp1
F3
F2
F (5– 0)
Operation2
Operation1
Operation
F1
Operation0
F0
H.Corporaal EmbProcArch 5kk73
50
Deriving Control2 signals
Input
6-bits
9 control (output) signals
Memto- Reg Mem Mem
Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0
R-format
1
0
0
1
0
0
0
1
0
lw
0
1
1
1
1
0
0
0
0
sw
X
1
X
0
0
1
0
0
0
beq
X
0
X
0
0
0
1
0
1
Determine these control signals directly from the opcodes:
R-format: 0
lw:
35
sw:
43
beq:
4
H.Corporaal EmbProcArch 5kk73
51
Control 2
Inputs
Op5
Op4
Op3
PLA example
implementation
Op2
Op1
Op0
Outputs
R-format
Iw
sw
beq
RegDst
ALUSrc
MemtoReg
RegWrite
MemRead
MemWrite
Branch
ALUOp1
ALUOpO
H.Corporaal EmbProcArch 5kk73
52
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
memory (2ns), ALU and adders (2ns), register file access (1ns)
PCSrc
Add
ALU
Add result
4
RegWrite
Instruction [25– 21]
PC
Read
address
Instruction
[31– 0]
Instruction
memory
Instruction [20– 16]
1
M
u
Instruction [15– 11] x
0
RegDst
Instruction [15– 0]
Read
register 1
Read
register 2
Read
data 1
Read
Write
data 2
register
Write
Registers
data
16
Sign 32
extend
1
M
u
x
0
Shift
left 2
MemWrite
ALUSrc
1
M
u
x
0
Zero
ALU ALU
result
MemtoReg
Address
Write
data
ALU
control
Read
data
Data
memory
1
M
u
x
0
MemRead
Instruction [5– 0]
ALUOp
H.Corporaal EmbProcArch 5kk73
53
Single Cycle Implementation
Memory (2ns), ALU & adders (2ns), reg. file access (1ns)
Fixed length clock: longest instruction is the ‘lw’ which requires 8 ns
Variable clock length (not realistic, just as exercise):
R-instr:
Load:
Store:
Branch:
Jump:
6 ns
8 ns
7 ns
5 ns
2 ns
Average depends on instruction mix
H.Corporaal EmbProcArch 5kk73
54
Where we are headed
Single Cycle Problems:
what if we had a more complicated instruction like floating point?
wasteful of area: NO Sharing of Hardware resources
One Solution:
use a “smaller” cycle time
have different instructions take different numbers of cycles
a “multicycle” datapath:
Instruction
register
PC
Address
ALU
Registers
Memory
data
register
MDR
H.Corporaal EmbProcArch 5kk73
A
Register #
Instruction
Memory
or data
Data
Data
IR
ALUOut
Register #
B
Register #
55
Multicycle Approach
We will be reusing functional units
Add registers after every major functional unit
Our control signals will not be determined solely by
instruction
ALU used to compute address and to increment PC
Memory used for instruction and data
e.g., what should the ALU do for a “subtract” instruction?
We’ll use a finite state machine (FSM) or microcode for
control
H.Corporaal EmbProcArch 5kk73
56
Review: finite state machines
Finite state machines:
a set of states and
next state function (determined by current state and the input)
output function (determined by current state and possibly input)
Current state
Next-state
function
Next
state
Clock
Inputs
Output
function
Outputs
We’ll use a Moore machine (output based only on current state)
H.Corporaal EmbProcArch 5kk73
57
Multicycle Approach
Break up the instructions into steps, each step takes a
cycle
At the end of a cycle
balance the amount of work to be done
restrict each cycle to use only one major functional unit
store values for use in later cycles (easiest thing to do)
introduce additional “internal” registers
Notice: we distinguish
processor state: programmer visible registers
internal state: programmer invisible registers (like IR, MDR, A, B,
and ALUout)
H.Corporaal EmbProcArch 5kk73
58
Multicycle Approach
PC
0
M
u
x
1
Address
Memory
Instruction
[25–21]
Read
register 1
Instruction
[20–16]
Read
Read
data1
register 2
Registers
Write
Read
register data2
MemData
Write
data
Instruction
[15–0] Instruction
[15–11]
Instruction
register
Instruction
[15–0]
Memory
data
register
H.Corporaal EmbProcArch 5kk73
0
M
u
x
1
A
B
0
M
u
x
1
Sign
extend
32
Zero
ALU ALU
result
ALUOut
0
4
Write
data
16
0
M
u
x
1
1M
u
2x
3
Shift
left 2
59
Multicycle Approach
Note that previous picture does not include:
branch support
jump support
Control lines and logic
Tclock > max (ALU delay, Memory access, Regfile access)
See book for complete picture
H.Corporaal EmbProcArch 5kk73
60
Five Execution Steps
Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation, or Branch
Completion
Memory Access or R-type instruction completion
Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
H.Corporaal EmbProcArch 5kk73
61
Step 1: Instruction Fetch
Use PC to get instruction and put it in the Instruction
Register
Increment the PC by 4 and put the result back in the PC
Can be described succinctly using RTL "Register-Transfer
Language"
IR = Memory[PC];
PC = PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
H.Corporaal EmbProcArch 5kk73
62
Step 2: Instruction Decode and
Register Fetch
Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a
branch
Previous two actions are done optimistically!!
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC+(sign-extend(IR[15-0])<< 2);
We aren't setting any control lines based on the instruction
type
(we are busy "decoding" it in our control logic)
H.Corporaal EmbProcArch 5kk73
63
Step 3 (instruction dependent)
ALU is performing one of four functions, based on instruction type
Memory Reference:
ALUOut = A + sign-extend(IR[15-0]);
R-type:
ALUOut = A op B;
Branch:
if (A==B) PC = ALUOut;
Jump:
PC = PC[31-28] || (IR[25-0]<<2)
H.Corporaal EmbProcArch 5kk73
64
Step 4 (R-type or Memory-access)
Loads and stores access memory
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;
R-type instructions finish
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle
on the edge
H.Corporaal EmbProcArch 5kk73
65
Write-back step
Memory read completion step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
H.Corporaal EmbProcArch 5kk73
66
Summary execution steps
Steps taken to execute any instruction class
Step name
Instruction fetch
Action for R-type
instructions
Instruction
decode/register fetch
Action for memory-reference
Action for
instructions
branches
IR = Memory[PC]
PC = PC + 4
A = Reg [IR[25-21]]
B = Reg [IR[20-16]]
ALUOut = PC + (sign-extend (IR[15-0]) << 2)
Execution, address
computation, branch/
jump completion
ALUOut = A op B
ALUOut = A + sign-extend
(IR[15-0])
Memory access or R-type
completion
Reg [IR[15-11]] =
ALUOut
Load: MDR = Memory[ALUOut]
or
Store: Memory [ALUOut] = B
Memory read completion
H.Corporaal EmbProcArch 5kk73
if (A ==B) then
PC = ALUOut
Action for
jumps
PC = PC [31-28] II
(IR[25-0]<<2)
Load: Reg[IR[20-16]] = MDR
67
Simple Questions
How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, L1
add $t5, $t2, $t3
sw $t5, 8($t3)
L1: ...
#assume not taken
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 takes place?
H.Corporaal EmbProcArch 5kk73
68
Implementing the Control
Value of control signals is dependent upon:
Use the information we have accumulated to specify a finite
state machine (FSM)
what instruction is being executed
which step is being performed
specify the finite state machine graphically, or
use microprogramming
Implementation can be derived from specification
H.Corporaal EmbProcArch 5kk73
69
0
Start
How many
state bits will
we need?
Memory address
computation
6
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
Memory
access
5
MemRead
IorD = 1
Write-back step
4
RegDst = 0
RegWrite
MemtoReg = 1
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
Branch
completion
8
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
Memory
access
3
1
Execution
2
(Op = 'LW')
MemRead
ALUSrcA = 0
IorD = 0
IRWrite
ALUSrcB = 01
ALUOp = 00
PCWrite
PCSource = 00
MemWrite
IorD = 1
RegDst = 1
RegWrite
MemtoReg = 0
Jump
completion
9
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCWriteCond
PCSource = 01
R-type completion
7
(Op = 'J')
Graphical Specification
of FSM
Instruction decode/
register fetch
Instruction fetch
PCWrite
PCSource = 10
Finite State Machine for Control
PCWrite
PCWriteCond
IorD
MemRead
Implementation:
MemWrite
IRWrite
Control logic
MemtoReg
PCSource
ALUOp
Outputs
ALUSrcB
ALUSrcA
RegWrite
RegDst
NS3
NS2
NS1
NS0
Instruction register
opcode field
H.Corporaal EmbProcArch 5kk73
S0
S1
S2
S3
Op0
Op1
Op2
Op3
Op4
Op5
Inputs
State register
71
Op4
opcode
PLA
Implementation
Op5
(see book)
Op3
Op2
Op1
Op0
S3
current
state
S2
S1
S0
If I picked a
horizontal or
vertical line could
you explain it ?
What type of
FSM is used?
Mealy or Moore?
datapath control
PCWrite
PCWriteCond
IorD
MemRead
MemWrite
IRWrite
MemtoReg
PCSource1
PCSource0
ALUOp1
ALUOp0
ALUSrcB1
ALUSrcB0
ALUSrcA
RegWrite
RegDst
NS3
NS2
NS1
NS0
next
state
H.Corporaal EmbProcArch 5kk73
72
Pipelined implementation
Pipelining
Pipelined datapath
Pipelined control
Hazards:
Structural
Data
Control
Exceptions
Scheduling
For details see the book (chapter 6):
H.Corporaal EmbProcArch 5kk73
73
Pipelining
Improve performance by increasing instruction throughput
P rog ram
e x ec ution
T im e
o rd er
(in in structio ns )
lw $ 1, 10 0 ($0 )
2
Ins truction
R eg
fe tch
lw $ 2, 20 0 ($0 )
4
6
8
A LU
Data
access
10
12
14
16
18
R eg
Instruction
R eg
fe tch
8 ns
lw $ 3, 30 0 ($0 )
D ata
acc ess
A LU
R eg
Instruction
fe tch
8 ns
...
8 ns
P rog ram
e x ec utio n
Tim e
o rd er
(in in struc tio n s)
2
lw $1 , 1 00 ($ 0)
Instruction
fetch
lw $2 , 2 00 ($ 0)
2 ns
lw $3 , 3 00 ($ 0)
4
Reg
Instruction
fetch
2 ns
6
ALU
Reg
Instruction
fetch
2 ns
H.Corporaal EmbProcArch 5kk73
8
Da ta
access
ALU
Reg
2 ns
10
14
12
Reg
Data
access
Reg
ALU
D a ta
access
2 ns
2 ns
Reg
2 ns
74
Pipelining
Ideal speedup = number of stages
Do we achieve this?
H.Corporaal EmbProcArch 5kk73
75
Pipelining
What makes it easy
What makes it hard?
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We’ll build a simple pipeline and look at these issues
We’ll talk about modern processors and what really makes it
hard:
exception handling
trying to improve performance with out-of-order execution, etc.
H.Corporaal EmbProcArch 5kk73
76
Basic idea: start from single cycle impl.
What do we need to add to actually split the datapath into stages?
IF: Instruction fetch
ID: Instruction decode/
register file read
EX: Execute/
address calculation
MEM: Memory access
WB: Write back
0
M
u
x
1
Add
Add
Add result
4
Shift
left 2
PC
Read
register 1
Address
Instruction
Instruction
memory
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
H.Corporaal EmbProcArch 5kk73
Sign
extend
Read
data
1
M
u
x
0
32
77
Pipelined Datapath
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
Add result
4
PC
Address
Instruction
memory
Instruction
Shift
left 2
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
H.Corporaal EmbProcArch 5kk73
Sign
extend
Read
data
1
M
u
x
0
32
78
Corrected Datapath
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
4
Add
Add
result
PC
Address
Instruction
memory
I nstr ucti on
Shift
left 2
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
H.Corporaal EmbProcArch 5kk73
Sign
extend
Read
data
1
M
u
x
0
32
79
Graphically Representing Pipelines
Time (in clock cycles)
Program
execution
order
(in instructions)
CC 1
CC 2
CC 3
CC 4
CC 5
IM
Reg
ALU
DM
Reg
ALU
DM
lw $10, 20($1)
sub $11, $2, $3
IM
Reg
CC 6
Reg
Can help with answering questions like:
how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
H.Corporaal EmbProcArch 5kk73
80
Pipeline Control
PCSrc
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
4
Add
result
Branch
Shift
left 2
PC
Address
Instruction
memory
Instruction
RegWrite
Read
register 1
MemWrite
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
ALUSrc
Zero
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Data
memory
Write
Read
data
1
M
u
x
0
data
Instruction
16
[15– 0]
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
0
M
u
x
1
ALU
control
MemRead
ALUOp
RegDst
H.Corporaal EmbProcArch 5kk73
81
Pipeline control
We have 5 stages. What needs to be controlled in each
stage?
Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution
Memory Stage
Write Back
How would control be handled in an automobile plant?
a fancy control center telling everyone what to do?
should we use a finite state machine?
H.Corporaal EmbProcArch 5kk73
82
Pipeline Control
Instruction
R-format
lw
sw
beq
Execution/Address
Calculation stage control
lines
Reg
ALU
ALU
ALU
Dst
Op1
Op0
Src
1
1
0
0
0
0
0
1
X
0
0
1
X
0
1
0
Memory access stage
control lines
Branc Mem
Mem
h
Read Write
0
0
0
0
1
0
0
0
1
1
0
0
Write-back
stage control
lines
Reg
Mem
write to Reg
1
0
1
1
0
X
0
X
(compare single cycle control!)
Pass control signals along
just like the data:
WB
Instruction
H.Corporaal EmbProcArch 5kk73
IF/ID
Control
M
WB
EX
M
WB
ID/EX
EX/MEM
MEM/WB
83
Datapath with Control
PCSrc
ID/EX
0
M
u
x
1
WB
Control
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
Add
ALUSrc
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Instruction
memory
Branch
Shift
left 2
MemWrite
Address
Instruction
PC
Add
Add result
RegWrite
4
Address
Data
memory
Read
data
Write
data
Instruction 16
[15– 0]
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
ALU
control
0
M
u
x
1
1
M
u
x
0
MemRead
ALUOp
RegDst
H.Corporaal EmbProcArch 5kk73
84
H.Corporaal EmbProcArch 5kk73
85
Hazards: problems due to pipelining
Hazard types:
Structural
Data
same resource is needed multiple times in the same cycle
data dependencies limit pipelining
Control
next executed instruction may not be the next specified
instruction
H.Corporaal EmbProcArch 5kk73
86
Structural hazards
Examples:
Two accesses to a single ported memory
Two operations need the same function unit
at the same time
Two operations need the same function unit
in successive cycles, but the unit is not pipelined
Solutions:
stalling
add more hardware
H.Corporaal EmbProcArch 5kk73
87
Structural hazards on MIPS
Q: Do we have structural hazards on our simple MIPS
pipeline?
time
IF
H.Corporaal EmbProcArch 5kk73
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
88
Data hazards
Data dependencies:
(read-after-write)
(write-after-write)
(write-after-read)
Hardware solution:
RaW
WaW
WaR
Forwarding / Bypassing
Detection logic
Stalling
Software solution: Scheduling
H.Corporaal EmbProcArch 5kk73
89
Data dependences
Three types: RaW, WaR and WaW
add r1, r2, 5
sub r4, r1, r3
; r1 := r2+5
; RaW of r1
add r1, r2, 5
sub r2, r4, 1
; WaR of r2
add r1, r2, 5
sub r1, r1, 1
; WaW of r1
st
ld
; M[r2+5] := r1
; RaW if 5+r2 = 0+r4
r1, 5(r2)
r5, 0(r4)
WaW and WaR do not occur in simple pipelines, but they limit
scheduling freedom!
Problems for your compiler and your Laptop processors!
 use register renaming to solve this!
H.Corporaal EmbProcArch 5kk73
90
RaW on MIPS pipeline
T im e (in clock cycle s)
V alue of
re giste r $ 2 :
CC 1
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
10
10
10
1 0 /– 2 0
– 20
– 20
– 20
– 20
IM
Reg
DM
Reg
P ro gra m
e xe cution
orde r
(in instru ctio ns)
sub $ 2 , $ 1 , $ 3
a nd $ 1 2 , $ 2 , $ 5
or $ 1 3 , $ 6 , $ 2
a dd $ 1 4 , $ 2 , $ 2
sw $ 1 5, 1 0 0 ($ 2 )
H.Corporaal EmbProcArch 5kk73
IM
DM
R eg
IM
DM
R eg
IM
R eg
DM
Reg
IM
R eg
R eg
R eg
DM
Reg
91
Forwarding
Use temporary results, don’t wait for them to be written
register file forwarding to handle read/write to same register
ALU forwarding
T ime (in clock cycles)
CC 1
V alue of re gister $ 2 : 10
V a lue of E X/M EM : X
V a lue of M EM /W B : X
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
X
X
10
X
X
10
– 20
X
1 0/– 20
X
– 20
– 20
X
X
– 20
X
X
– 20
X
X
– 20
X
X
DM
R eg
Program
e xe cution orde r
(in instructions)
sub $ 2, $1 , $ 3
What if this
$2 was $13?
a nd $ 12 , $ 2, $5
or $ 13 , $ 6, $ 2
a dd $ 14 , $ 2, $2
sw $ 15 , 1 00 ($2 )
H.Corporaal EmbProcArch 5kk73
IM
Reg
IM
R eg
IM
DM
R eg
IM
R eg
DM
DM
R eg
IM
Reg
R eg
Reg
DM
Reg
92
Forwarding hardware
ALU forwarding circuitry principle:
from register file
ALU
to register file
from register file
Note: there are two options
• buf - ALU – bypass – mux - buf
• buf - bypass – mux – ALU - buf
H.Corporaal EmbProcArch 5kk73
93
Forwarding
ID/EX
WB
Control
PC
Instruction
memory
Instruction
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
M
u
x
Registers
ForwardA
ALU
Data
memory
M
u
x
IF/ID.RegisterRs
Rs
IF/ID.RegisterRt
Rt
IF/ID.RegisterRt
Rt
IF/ID.RegisterRd
Rd
ForwardB
M
u
x
EX/MEM.RegisterRd
Forwarding
unit
H.Corporaal EmbProcArch 5kk73
M
u
x
MEM/WB.RegisterRd
94
Forwarding check
Check for matching register-ids:
For each source-id of operation in the EX-stage check if
there is a matching pending dest-id
Example:
if (EX/MEM.RegWrite) 
(EX/MEM.RegisterRd  0) 
(EX/MEM.RegisterRd = ID/EX.RegisterRs)
then ForwardA = 10
Q. How many comparators do we need?
H.Corporaal EmbProcArch 5kk73
95
Can't always forward
Load word can still cause a hazard:
an instruction tries to read register r following a load to the same r
Need a hazard detection unit to “stall” the load instruction
Time (in clock cycles)
Program
CC 1
execution
order
(in instructions)
lw $2, 20 ($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
H.Corporaal EmbProcArch 5kk73
IM
CC 2
CC 3
Reg
IM
CC 4
CC 5
DM
Reg
Reg
IM
DM
Reg
IM
CC 6
CC 8
CC 9
Reg
DM
Reg
IM
CC 7
Reg
DM
Reg
Reg
DM
Reg
96
Stalling
We can stall the pipeline by keeping an instruction in the same stage
Program
Time(inclockcycles)
execution
CC1
CC2
order
(ininstructions)
lw$2, 20($1)
and$4, $2, $5
or $8, $2, $6
IM
CC3
Reg
IM
CC4
CC5
DM
Reg
Reg
Reg
IM
IM
CC6
CC7
DM
Reg
Reg
DM
CC8
CC9
CC10
Reg
bubble
add$9, $4, $2
In
CC4 the ALU is not used,
slt $1, $6, $7
Reg, and IM are redone
H.Corporaal EmbProcArch 5kk73
IM
DM
Reg
IM
Reg
Reg
DM
Reg
97
Hazard Detection Unit
ID/EX.MemRead
Hazard
detection
unit
ID/EX
IF/IDWrite
WB
Control
0
M
u
x
PC
Instruction
memory
Instruction
PCWrite
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
M
u
x
Registers
ALU
Data
memory
M
u
x
M
u
x
IF/ID.RegisterRs
IF/ID.RegisterRt
H.Corporaal EmbProcArch 5kk73
IF/ID.RegisterRt
Rt
IF/ID.RegisterRd
Rd
ID/EX.RegisterRt
Rs
Rt
M
u
x
EX/MEM.RegisterRd
Forwarding
unit
MEM/WB.RegisterRd
98
Software only solution?
Have compiler guarantee that no hazards occur
Example: where do we insert the “NOPs” ?
sub
and
or
add
sw
$2,
$12,
$13,
$14,
$13,
$1, $3
$2, $5
$6, $2
$2, $2
100($2)
sub
nop
nop
and
or
Problem: this really slows us down! add
nop
sw
H.Corporaal EmbProcArch 5kk73
$2,
$1, $3
$12, $2, $5
$13, $6, $2
$14, $2, $2
$13, 100($2)
99
Control hazards
Control operations may change the sequential flow of
instructions
branch
jump
call (jump and link)
return
(exception/interrupt and rti / return from interrupt)
H.Corporaal EmbProcArch 5kk73
100
Control hazard: Branch
Branch actions:
Compute new address
Determine condition
Perform the actual branch (if taken): PC := new address
H.Corporaal EmbProcArch 5kk73
101
Branch example
Progra m
Time (in clock cycle s)
execu tion
CC 1
CC 2
IM
Reg
CC 3
CC 4
CC 5
DM
R eg
CC 6
CC 7
CC 8
CC 9
order
(in instructions)
40 beq $1, $3, 7
44 an d $1 2, $2 , $ 5
48 or $13, $6 , $2
52 ad d $1 4, $2 , $ 2
72 lw $4 , 50($ 7)
H.Corporaal EmbProcArch 5kk73
IM
R eg
IM
DM
R eg
IM
R eg
DM
R eg
IM
R eg
DM
R eg
Reg
DM
R eg
102
Branching
Squash pipeline:
When we decide to branch, other instructions are in the
pipeline!
We are predicting “branch not taken”
need to add hardware for flushing instructions if we are wrong
H.Corporaal EmbProcArch 5kk73
103
Branch with predict not taken
Clock cycles
Branch L
IF
Predict
not taken
L:
H.Corporaal EmbProcArch 5kk73
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
104
Branch speedup
Earlier address computation
Earlier condition calculation
Put both in the ID pipeline stage
adder
comparator
Clock cycles
Branch L
Predict
not taken
L:
H.Corporaal EmbProcArch 5kk73
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
105
Improved branching / flushing IF/ID
IF.Flush
Hazard
detection
unit
ID/EX
M
u
x
WB
Control
0
M
u
x
IF/ID
4
M
WB
EX
M
MEM/WB
WB
Shift
left 2
Registers
PC
EX/MEM
=
M
u
x
Instruction
memory
ALU
M
u
x
Data
memory
M
u
x
Sign
extend
M
u
x
Forwarding
unit
H.Corporaal EmbProcArch 5kk73
106
Exception support
Types of exceptions:
Overflow
I/O device request
Operating system call
Undefined instruction
Hardware malfunction
Page fault
Precise exception:
finish previous instructions (which are still in the pipeline)
flush excepting and following instructions, redo them after
handling the exception(s)
H.Corporaal EmbProcArch 5kk73
107
Exceptions
Changes needed for handling overflow exception of an
operation in EX stage (see book for details) :
Extend PC input mux with extra entry with fixed address
Add EPC register recording the ID/EX stage PC
this is the address of the next instruction !
Cause register recording exception type
E.g., in case of overflow exception insert 3 bubbles;
flush the following stages:
IF/ID stage
ID/EX stage
EX/MEM stage
H.Corporaal EmbProcArch 5kk73
108
Scheduling, why?
Let’s look at the execution time:
Texecution = Ncycles x Tcycle
= Ninstructions x CPI x Tcycle
Scheduling may reduce Texecution
Reduce CPI (cycles per instruction)
 early scheduling of long latency operations
 avoid pipeline stalls due to structural, data and control hazards
allow Nissue > 1 and therefore CPI < 1
Reduce Ninstructions
 compact many operations into each instruction (VLIW)
H.Corporaal EmbProcArch 5kk73
109
Scheduling data hazards:
example 1
Try and avoid RaW stalls (in this case load interlocks)!
E.g., reorder these instructions:
lw
lw
sw
sw
$t0,
$t2,
$t2,
$t0,
H.Corporaal EmbProcArch 5kk73
0($t1)
4($t1)
0($t1)
4($t1)
lw
lw
sw
sw
$t0,
$t2,
$t0,
$t2,
0($t1)
4($t1)
4($t1)
0($t1)
110
Scheduling data hazards
example 2
Avoiding RaW stalls:
Reordering instructions for
following program
(by you or the compiler)
Unscheduled code:
Lw R1,b
Lw R2,c
Add R3,R1,R2
interlock
Sw a,R3
Lw R1,e
Lw R2,f
Sub R4,R1,R2
interlock
Sw d,R4
Code:
a = b + c
d = e - f
H.Corporaal EmbProcArch 5kk73
Scheduled code:
Lw R1,b
Lw R2,c
Lw R5,e
extra reg. needed!
Add R3,R1,R2
Lw R2,f
Sw a,R3
Sub R4,R5,R2
Sw d,R4
111
Scheduling control hazards
Texecution = Ninstructions x CPI x Tcycle
CPI = CPIideal + fbranch x Pbranch
Pbranch = Ndelayslots x miss_rate
Modern processors tend to have large branch penalty,
Pbranch, due to:
many pipeline stages
multi-issue
Note that penalties have larger effect when CPIideal is
low
H.Corporaal EmbProcArch 5kk73
112
Scheduling control hazards
What can we do about control hazards and CPI
penalty?
Keep penalty Pbranch low:
Early computation of new PC
Early determination of condition
Visible branch delay slots filled by compiler (MIPS)
Branch prediction
Reduce control dependencies (control height
reduction) [Schlansker and Kathail, Micro’95]
Remove branches: if-conversion
Conditional instructions: CMOVE, cond skip next
Guarding all instructions: TriMedia
H.Corporaal EmbProcArch 5kk73
113
Branch delay slot
Add a branch delay slot:
the next instruction after a branch is always executed
rely on compiler to “fill” the slot with something useful
Is this a good idea?
let's look how it works
H.Corporaal EmbProcArch 5kk73
114
Branch delay slot scheduling
Q. What to put in the delay slot?
op 1
beq r1,r2, L
.............
'fall-through'
op 2
.............
branch target
L: op 3
.............
H.Corporaal EmbProcArch 5kk73
115
Summary
Modern processors are (deeply) pipelined, to reduce
Tcycle and aim at CPI = 1
Hazards increase CPI
Several software and hardware measure to avoid or
reduce hazards are taken
Not discussed, but important developments:
Multi-issue further reduces CPI
Branch prediction to avoid high branch penalties
Dynamic scheduling
In all cases: a scheduling compiler needed
H.Corporaal EmbProcArch 5kk73
116
Recap of MIPS
RISC architecture
Register space
Addressing
Instruction format
Pipelining
H.Corporaal EmbProcArch 5kk73
117
Why RISC? Keep it simple
RISC characteristics:
Reduced number of instructions
Limited addressing modes
Large register set
uniform (no distinction between e.g. address and data registers)
Limited number of instruction sizes (preferably one)
load-store architecture
enables pipelining
know directly where the following instruction starts
Limited number of instruction formats
Memory alignment restrictions
......
Based on quantitative analysis
" the famous MIPS one percent rule": don't even think about it
when its not used more than one percent
H.Corporaal EmbProcArch 5kk73
118
Register space
Name
Register
Number
$zero
0
$at
1
$v0 - $v1
2-3
$a0 - $a3
4-7
$t0 - $t7
8-15
$s0 - $s7
16-23
$t8 - $t9
24-25
$gp
28
$sp
29
$fp
30
$ra
31
H.Corporaal EmbProcArch 5kk73
32 integer (and 32 floating point)
registers of 32-bit
Usage
Preserve
on call?
constant 0 (hardware)
n.a.
reserved for assembler
n.a.
returned values
no
arguments
yes
temporaries
no
saved values
yes
temporaries
no
global pointer
yes
stack pointer
yes
frame pointer
yes
return addr (hardware)
yes
119
Addressing
1. Immediate addressing
op
rs
rt
Immediate
2. Register addressing
op
rs
rt
rd
...
funct
Registers
Register
3. Base addressing
op
rs
rt
Memory
Address
+
Register
Byte
Halfword
Word
4. PC-relative addressing
op
rs
rt
Memory
Address
PC
+
Word
5. Pseudodirect addressing
op
Address
PC
H.Corporaal EmbProcArch 5kk73
Memory
Word
120
Instruction format
R
op
rs
rt
rd
I
op
rs
rt
16 bit address
J
op
Example instructions
Instruction
add $s1,$s2,$s3
addi $s2,$s3,4
lw $s1,100($s2)
bne $s4,$s5,L
j Label
H.Corporaal EmbProcArch 5kk73
shamt
funct
26 bit address
Meaning
$s1 = $s2 + $s3
$s2 = $s3 + 4
$s1 = Memory[$s2+100]
if $s4<>$s5 goto L
goto Label
121
Pipelining
All integer instructions fit into the following pipeline
time
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
H.Corporaal EmbProcArch 5kk73
WB
122
Other architecture styles
Accumulator architecture
Stack
three operands, all in registers
loads and stores are the only instructions accessing memory (i.e.
with a memory (indirect) addressing mode
Register-Memory
zero operand: all operands implicit (on TOS)
Register (load store)
one operand (in register or memory), accumulator almost always
implicitly used
two operands, one in memory
Memory-Memory
three operands, may be all in memory
(there are more varieties / combinations)
H.Corporaal EmbProcArch 5kk73
123
Accumulator architecture
Accumulator
latch
ALU
registers
address
Memory
latch
Example code: a = b+c;
load b;
add
c;
store a;
H.Corporaal EmbProcArch 5kk73
// accumulator is implicit operand
124
Stack architecture
latch
latch
top of
stack
ALU
latch
Example code: a = b+c;
push b;
push b
push c;
b
add;
stack:
pop a;
H.Corporaal EmbProcArch 5kk73
Memory
stack pt
push c
add
c
b
b+c
pop a
125
Other architecture styles
Let's look at the code for C = A + B
Stack
Architecture
Accumulator
Architecture
RegisterMemory
MemoryMemory
Register
(load-store)
Push A
Load A
Load r1,A
Add C,B,A
Load r1,A
Push B
Add
Add
Add
Store C
Pop
C
B
r1,B
Store C,r1
Load r2,B
Add
r3,r1,r2
Store C,r3
Q: What are the advantages / disadvantages of load-store (RISC) architecture?
H.Corporaal EmbProcArch 5kk73
126