Download Powerpoint slides - Lecture 3 - User Web Areas at the University of

Development in hardware – Why?  Option: array of custom processing nodes     Step 1: analyze the application and extract the component tasks Step 2: design the custom processors Step 3: program the FPGA Step 4: assign the tasks to the processors and set up the connection network ← Multi-cellular organization ← ??? ← Growth (cellular division) Development in hardware – Why?  Step 2: as a function of the tasks, design one (or more) custom processors. × IN ×+ ÷≠ FFT + DCT OUT Cellular differentiation  Cells adapt their physical structure to fit the “application”  Can circuits/processors do the same?   Physically? No Logically? Yes, but…  Can they do it easily (dare we say, automatically)? Cellular differentiation  Needed: adaptable cellular architecture That is, a processor architecture that is       Customizable Compact Powerful Easy to design and modify Amenable to evolution and learning Possible solution: MOVE architectures The MOVE paradigm  One single instruction : move  Data displacements trigger operations  Architecture based around data ≠ operation centric  Regular structure : functional units + data network  Scalable and modular architecture Example: Sum of two values Conventional architecture: add R1, R2, R3; MOVE architecture: move O(Fxxx), I1(Fsum) move O(Fyyy), I2(Fsum) move O(Fsum), I(Fzzz) Cellular differentiation  Main features:  Conventional fetch/decode mechanism – compatible with bio-inspired mechanisms  No pipeline: computation carried out in specialized functional units (FU)  Communication carried out in specialized communication units (CU)  Only one instruction that MOVEs data to and from the CUs and FUs (dataflow architecture) Cellular differentiation  Main advantages:  Can be easily customized by introducing applicationspecific functional and communication units.  Perfectly fits the requirements of systolic arrays (arbitrarily complex communication patterns). 
- The introduction of custom components does not affect the assembler language, the code structure, the fetch and decode units, or the transport bus.

Example – Automatic Synthesis
[Figure: phenotype layer (application-specific parallel functions), mapping layer, genotype layer (developmental algorithm, genetic code)]

Example – Automatic Synthesis
[Figure: phenotype, mapping and genotype layers; totipotent cell]

Example – Automatic Synthesis
[Figure: programmable logic; totipotent cell]

Example – Automatic Synthesis
[Figure: programmable logic; cellular array]

Implementation - The BioWall

Development in hardware – Why?
Option: array of custom processing nodes
- Step 1: analyze the application and extract the component tasks
- Step 2: design the custom processors
- Step 3: program the FPGA
- Step 4: assign the tasks to the processors and set up the connection network
[Figure: the steps annotated with biological analogues: multi-cellular organization, ???, cell specialization, growth (cellular division)]

Cell design and specialization
[Figure: phenotype layer, application code (parallel)]
Within a MOVE framework, the specialization (differentiation) of a cell corresponds to the selection of the functional and communication units that can most efficiently implement the desired application.

FU extraction
- Extracting the optimal FUs from the code is a complex problem!

FU extraction
- How about taking a quick peek at biology?
- Idea: let us use evolution!
- In fact, this approach is much closer to biology than simply evolving code: in nature, the hardware (the cell) and the software (the genome) have evolved together!

FU extraction
- Idea: let us use evolution!
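In a MOVE framework, specializing a cell means choosing which functional units it contains, while the program remains a plain list of `move` instructions. The toy model below illustrates this with the sum-of-two-values example from the MOVE slides. It is a minimal sketch, not the lecture's implementation: the `FunctionalUnit` class, the `run` helper, the initial operand values, and the rule that writing the last input port (`I2`, or `I` for a single-input unit) triggers the operation are all assumptions made for illustration; only the port notation `O(...)`, `I1(...)`, `I2(...)`, `I(...)` comes from the slides.

```python
import operator


class FunctionalUnit:
    """Hypothetical model of a MOVE functional unit with ports I1, I2/I and O."""

    def __init__(self, op=None, value=None):
        self.op = op        # binary operation, or None for a storage-like unit
        self.i1 = None      # latched first operand (port I1)
        self.out = value    # result port (O)

    def write(self, port, value):
        if port == "I1":
            self.i1 = value  # operand port: only stores the value
        elif port in ("I2", "I"):
            # Assumption: moving data into the last input port triggers the unit.
            self.out = value if self.op is None else self.op(self.i1, value)
        else:
            raise ValueError(f"unknown port {port!r}")


def run(program, units):
    """Execute a list of moves given as (source unit, destination port, destination unit)."""
    for src, port, dst in program:
        units[dst].write(port, units[src].out)


units = {
    "Fxxx": FunctionalUnit(value=3),          # hypothetical operand source
    "Fyyy": FunctionalUnit(value=4),          # hypothetical operand source
    "Fsum": FunctionalUnit(op=operator.add),  # adder FU
    "Fzzz": FunctionalUnit(),                 # destination unit
}
program = [
    ("Fxxx", "I1", "Fsum"),  # move O(Fxxx), I1(Fsum)
    ("Fyyy", "I2", "Fsum"),  # move O(Fyyy), I2(Fsum): triggers the addition
    ("Fsum", "I", "Fzzz"),   # move O(Fsum), I(Fzzz)
]
run(program, units)
print(units["Fzzz"].out)  # 7
```

Specializing the cell with an extra unit, for example `units["Fmul"] = FunctionalUnit(op=operator.mul)`, requires no change to the move format or to `run`, in line with the claim that custom components do not affect the assembler language or the transport bus.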
FU extraction  First step: profiling the code (standard compilation technique) FU extraction  Second step: transform into tree (standard compilation technique)  Third step: represent as 1-D genome  Fourth step: run the GA (with some fancy optimizations) Fitness evaluation s = size of the new processor t = execution time of the program on the new processor α = execution time of the program on a minimal processor β = hardware area to implement the minimal processor (which has, by definition, a fitness of 1) hwLimit = maximum hardware allowed to implement the new processor Note: • Relative fitness function • When out of allowed hardware range, logarithmic decrease • The hardware investment has to be small enough to be retained Determining hardware size  How can the size of the new FU estimated (the β parameter of the fitness) ?  The idea:  Determine the size of each basic building block (+, -AND, …)   What to do with assignments or loops ? Compute how many of them are used for a new FU  The characterization has to be done for every target platform. Determining hardware execution time  Use the same idea used for size :    Compute the time needed for each elementary function Take targeted clock period as a basis When time estimated > clock period, add 1 to the total time  small jumps in the fitness landscape Pattern-matching optimization  How to find reusable FUs ?   The GA behaves a bit like random mutations  difficult to find reusability this way Helps the GA a bit : search the whole tree each time a new HW block is defined to replace similar pieces of code Non-optimal block pruning  “Cleaning” phase made at each step  Removes HW blocks that are non- optimal from the fitness point-of-view  To see if a block is useful, compute the fitness with and without this block implemented in HW. If the software solution has a better fitness, the block is non-optimal and can be removed. 
FU extraction - Interface
[Figure: screenshot of the FU extraction tool interface (labels: DOMAIN, STANDARD)]

FU extraction - Results
- Example (functions from the FACT factorization algorithm):
  - Hardware increase (estimated): 10% (fixed)
  - Speedup (estimated): 2.27 (227%)
- Other results:
  - All were obtained in a few seconds.
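The reported figures are ratios against the minimal processor. Assuming speedup is defined as α/t and hardware increase as (s − β)/β, using the symbols from the fitness slide (the slides do not state these definitions explicitly), the relation can be written as two small helpers. The execution times and areas in the example are hypothetical values chosen only to reproduce a 10% hardware increase and a 2.27 speedup.

```python
def speedup(alpha, t):
    """Execution time on the minimal processor divided by time on the new processor."""
    return alpha / t


def hardware_increase(s, beta):
    """Extra area of the new processor, relative to the minimal processor."""
    return (s - beta) / beta


# Hypothetical values: times in cycles, areas in arbitrary units
alpha, t = 227_000, 100_000
beta, s = 1_000, 1_100
print(f"speedup: {speedup(alpha, t):.2f}")                     # speedup: 2.27
print(f"hardware increase: {hardware_increase(s, beta):.0%}")  # hardware increase: 10%
print(f"relative run time: {t / alpha:.0%}")                   # relative run time: 44%
```

Read this way, a 2.27 speedup means the program runs in roughly 44% of its time on the minimal processor, bought with 10% more hardware.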