Lecture 6: Embedded Processors
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
© 2007 Elsevier
Topics
• Embedded microprocessor market.
• Categories of CPUs.
• RISC, DSP, and multimedia processors.
• CPU mechanisms.
Demand for Embedded Processors
• Embedded processors account for:
  – Over 97% of total processors sold
  – Over 60% of total sales from processors
• Sales are expected to increase by roughly 15% each year.
Flynn’s taxonomy of processors
• Single-instruction, single-data (SISD)
• Single-instruction, multiple-data (SIMD)
• Multiple-instruction, multiple-data (MIMD)
• Multiple-instruction, single-data (MISD)
• What is an example of each?
• Which would you expect to see in embedded systems?
Other axes of comparison
• RISC vs. CISC: instruction set style.
• Instruction issue width.
• Static vs. dynamic scheduling for multiple-issue machines.
• Scalar vs. vector processing.
• Single-threaded vs. multithreaded.
• A single CPU can fit into multiple categories.
Embedded vs. general-purpose processors
• Embedded processors may be customized for a category of applications.
  – Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  – Code size.
  – Energy efficiency.
  – Memory system performance.
  – Predictability.
Embedded RISC processors
• RISC processors often have simple, highly pipelinable instructions.
• Pipelines of embedded RISC processors have grown deeper over time:
  – ARM7 has a 3-stage pipeline.
  – ARM9 has a 5-stage pipeline.
  – ARM11 has an 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]
RISC processor families
• ARM:
  – ARM7 has in-order execution and no memory management or branch prediction.
  – ARM9 and ARM11 add memory management, branch prediction, and (in the ARM11) out-of-order completion.
• MIPS:
  – MIPS32 4K has a 5-stage pipeline.
  – The 4KE family adds DSP extensions.
  – The 4KS is designed for security.
• PowerPC:
  – The PowerPC 400 series includes several embedded processors.
  – Motorola and IBM offer superscalar versions of the PowerPC.
Embedded DSP Processors
• Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, discrete cosine transforms. For example, an FIR filter computes

  y[k] = \sum_{n=0}^{N} b_n x[k-n]

• DSP processors feature:
  – Deterministic execution times
  – Fast multiply-accumulate instructions
  – Multiple data accesses per cycle
  – Specialized addressing modes
  – Efficient support for loops and interrupts
  – Efficient processing of “streaming” data
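A minimal C sketch of the multiply-accumulate pattern these features target, implementing the FIR filter above (assumes 16-bit samples and a 32-bit accumulator; function and variable names are illustrative, not from the slides):

    #include <stdint.h>

    /* y[k] = sum_{n=0}^{N} b[n] * x[k-n]; x points at sample k, with older
     * samples at x[-1], x[-2], ...  A DSP maps the loop body onto a
     * single-cycle MAC instruction and uses its address generators to make
     * the two data accesses per cycle that the loop needs. */
    int32_t fir(const int16_t *x, const int16_t *b, int N)
    {
        int32_t acc = 0;                              /* accumulator */
        for (int n = 0; n <= N; n++)
            acc += (int32_t)b[n] * (int32_t)x[-n];    /* multiply-accumulate */
        return acc;
    }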
Example: TI C55x/C54x DSPs
• 40-bit arithmetic (32-bit values + 8 guard bits).
• Barrel shifter.
• 17 x 17 multiplier.
• Two address generators.
• Lots of special-purpose registers and addressing modes.
• Coprocessors for compute-intensive functions, including pixel interpolation, motion estimation, and DCT/IDCT computations.
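To see what the 8 guard bits buy, here is a hedged C sketch that emulates a 40-bit accumulator with a 64-bit integer (illustrative C, not the C55x instruction set; names are made up). Roughly 2^8 full-scale 32-bit products can be summed before 40 bits overflow, so saturation can be deferred to the end of the loop:

    #include <stdint.h>

    int32_t dot_product(const int16_t *a, const int16_t *b, int n)
    {
        int64_t acc = 0;                          /* stands in for a 40-bit accumulator */
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i]; /* 17x17-style signed multiply */

        /* Saturate the wide result back to 32 bits at the end. */
        if (acc > INT32_MAX) acc = INT32_MAX;
        if (acc < INT32_MIN) acc = INT32_MIN;
        return (int32_t)acc;
    }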
TI C55x microarchitecture
Parallelism extraction
• Static:
  – Use the compiler to analyze the program.
  – Simpler CPU.
  – Can’t depend on data values.
  – Example: VLIW.
• Dynamic:
  – Use hardware to identify opportunities.
  – More complex CPU.
  – Can make use of data values.
  – Example: superscalar.
VLIW architectures
• Each very long instruction word (VLIW) performs multiple operations in parallel:
  [Branch | Memory | Memory | Arithmetic | Logic | Vector]
• Needs a good compiler that understands the architecture.
• Allows deterministic execution times.
• Code growth can be reduced by allowing:
  – Operations within an instruction to be performed sequentially
  – A given field to specify different types of operations
  [Seq | Branch/Mem | Mem/Arith | Arith/Logic | Vector]
Simple VLIW architecture
• A large register file feeds multiple function units.
[Figure: register file feeding an E-box with two ALUs, two load/store units, and an additional FU; example instruction word: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP]
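The example instruction word above packs four useful operations plus a NOP. A hedged C illustration of the kind of independent, straight-line code a VLIW compiler could schedule into that single long word (variable and function names are made up):

    /* No data dependences among the four operations, so the compiler can
     * place each one in its own slot of one VLIW instruction and fill the
     * unused fifth slot with a NOP. */
    void bundle_example(int *foo, int *baz, int r2, int r3, int r5, int r6, int r8)
    {
        int r1 = r2 + r3;   /* Add r1,r2,r3  -> ALU slot 1        */
        int r4 = r5 - r6;   /* Sub r4,r5,r6  -> ALU slot 2        */
        int r7 = *foo;      /* Ld  r7,foo    -> load/store slot 1 */
        *baz   = r8;        /* St  r8,baz    -> load/store slot 2 */
        (void)r1; (void)r4; (void)r7;        /* fifth slot: NOP   */
    }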
Clustered VLIW architecture
• Register file and function units are divided into clusters.
• What are the advantages/disadvantages of having clusters in VLIW architectures?
[Figure: two clusters, each with its own register file and execution units, connected by a cluster bus.]
TI C62x/C67x DSPs
• VLIW with up to 8 instructions/cycle.
• 32 32-bit registers.
• Function units:
  – Two multipliers.
  – Six ALUs.
• All instructions execute conditionally.
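A hedged C-level sketch of what conditional (predicated) execution enables: the compiler can if-convert short branches into operations that always issue but only take effect when their predicate is true, keeping the VLIW pipeline full (illustrative code, not C6x-specific):

    /* Both functions compute the same masked sum.  With full predication,
     * the branchy form can be compiled into the branch-free form, where the
     * conditional add becomes a single predicated instruction. */
    int masked_sum_branchy(const int *x, const unsigned char *mask, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            if (mask[i]) sum += x[i];      /* data-dependent branch */
        return sum;
    }

    int masked_sum_predicated(const int *x, const unsigned char *mask, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += mask[i] ? x[i] : 0;     /* maps to a predicated add */
        return sum;
    }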
TI C6x data operations
• 8/16/32-bit arithmetic.
• 40-bit operations.
• Bit-manipulation operations.
• C67x processors add floating-point arithmetic.
C6x block diagram
[Block diagram: 512K bits of data RAM and 512K bits of program RAM/cache; an execute stage with two data paths, each with its own register file; DMA, serial ports, timers, JTAG, PLL, and external bus interface.]
Texas Instruments C62x
[Figure from N. Seshan, “High VelociTI processing [Texas Instruments VLIW DSP architecture]”, IEEE Signal Processing Magazine, vol. 15, no. 2, pp. 86-101, 117, 1998.]
Emerging DSP Architectures
• Parallelism at multiple levels:
  – Multiple processors: system-on-a-chip designs
  – Multiple simultaneous tasks: multithreaded processors
  – Multiple operations per instruction: Very Long Instruction Word (VLIW) architectures
  – Multiple instructions per cycle
  – Single Instruction Multiple Data (SIMD) instructions
• Architecture/compiler pairs improve performance and help manage application complexity.
Superscalar processors
• Instructions are dynamically scheduled.
  – Dependencies are checked at run time in hardware.
• Used to some extent in embedded processors.
  – The embedded Pentium is two-issue, in-order.
  – Some PowerPCs are superscalar.
• What advantages/disadvantages do VLIW processors have compared to superscalar?
SIMD and subword parallelism
• Many special-purpose SIMD machines.
  – All processors perform the same operation on different data.
• Subword parallelism is widely used for video.
  – The ALU is divided into subwords for independent operations on small operands.
• Vector processing is another form of SIMD processing.
  – These terms are often used interchangeably.
SIMD Instructions
• Recent multimedia processors commonly support Single Instruction Multiple Data (SIMD) instructions.
• The same operation is performed on multiple data operands using a single instruction:
  [Figure: packed operands A3 A2 A1 A0 and B3 B2 B1 B0 are added lane-wise to produce A3+B3, A2+B2, A1+B1, A0+B0.]
• Exploits the low precision and high data parallelism of multimedia applications.
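A minimal C sketch of the idea, emulating a four-lane 16-bit packed add with scalar code (not any particular processor’s SIMD intrinsics; names are illustrative):

    #include <stdint.h>

    /* Add four packed 16-bit lanes held in one 64-bit word, lane by lane.
     * A SIMD/subword ALU performs all four adds with a single instruction;
     * the loop here just makes the per-lane behavior explicit. */
    uint64_t add_4x16(uint64_t a, uint64_t b)
    {
        uint64_t result = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t ai  = (uint16_t)(a >> (16 * lane));
            uint16_t bi  = (uint16_t)(b >> (16 * lane));
            uint16_t sum = (uint16_t)(ai + bi);        /* wraps within the lane */
            result |= (uint64_t)sum << (16 * lane);
        }
        return result;
    }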
Operand characteristics in MediaBench
Dynamic behavior of loops in MediaBench
• The loops of media applications are in many cases not very deep.
• Path ratio = (instructions executed per iteration) / (total number of loop instructions).
• What does the path ratio reveal?
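A small worked example (numbers are illustrative, not from MediaBench): if a loop body contains 50 static instructions but a typical iteration executes only 30 of them, the path ratio is 30/50 = 0.6. A low path ratio reveals heavy conditional control flow inside the loop body, which limits how profitably the loop can be unrolled, software-pipelined, or mapped onto wide VLIW/SIMD resources.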
TriMedia TM-1 characteristics
• Characteristics:
  – Floating-point support
  – Subword parallelism support
  – VLIW
  – Additional custom operations
Trimedia TM-1
[Block diagram: VLIW CPU with memory interface, video in/out, audio in/out, I2C, serial, timers, VLD coprocessor, image coprocessor, and PCI interface.]
TM-1 VLIW CPU
[Figure: register file and read/write crossbar feeding function units FU1 ... FU27, which are shared among five issue slots.]
Multithreading
• Low-level parallelism mechanism.
• Interleaved multithreading (IMT) alternately fetches instructions from separate threads.
  – Often used with VLIW and vector processors.
• Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
  – Often used with superscalar processors.
• What advantages/disadvantages does IMT have relative to SMT?
Dynamic voltage scaling (DVS)
• Power scales with V^2, while performance scales roughly as V.
• Reduce the operating voltage and add parallel operating units to make up for the lower clock speed.
• DVS doesn’t work well in processors with high leakage power.
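A sketch of the underlying relationship, using the standard CMOS dynamic-power model (general background, not specific to any processor in these slides):

    P_{dyn} = \alpha C V^2 f,   with   f roughly proportional to V

Halving V therefore roughly halves f and cuts dynamic power by about 8x; two parallel units running at the lower voltage restore the original throughput at roughly one quarter of the original power, which is the trade the second bullet describes.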
Dynamic voltage and frequency scaling (DVFS)
• Scale both the voltage and the clock frequency.
• Control algorithms can be used to match performance to the application and reduce power.
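A minimal sketch of one such control policy, choosing the lowest frequency that still meets a deadline (a generic illustration, not a policy from these slides; the frequency table and names are made up):

    /* Pick the slowest available clock that finishes the remaining work in
     * time; a lower frequency permits a lower voltage, and dynamic power
     * falls roughly with V^2 * f. */
    static const double freq_mhz[] = { 150.0, 300.0, 600.0 };

    double choose_freq(double cycles_remaining, double seconds_to_deadline)
    {
        for (int i = 0; i < 3; i++) {
            double cycles_available = freq_mhz[i] * 1e6 * seconds_to_deadline;
            if (cycles_available >= cycles_remaining)
                return freq_mhz[i];      /* slowest setting that meets the deadline */
        }
        return freq_mhz[2];              /* otherwise run as fast as possible */
    }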
Razor architecture
• Razor runs the clock faster than the worst case allows.
• Uses a specialized latch to detect timing errors.
• Recovers only on errors, gaining average-case performance.