Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Assemblers, linkers, and loaders • You need to know – A little more about how an assembler works • Standard “two pass” assembler generating code • “Lexical Analysis” Assemblers, linkers, and loaders – “lexical analysis is the process of converting a sequence of characters into a sequence of tokens, i.e. meaningful character strings” » In an assembler, you are reading text and identifying instruction mnemonics, directives, octal numbers, … » Compilers also start with lexical analysis – but have a much richer vocabulary of “meaningful character strings” Mostly an introduction to how an assembler works – A few more details of linking and loading • Quite a lot already covered indirectly when considering .a archives and .so libraries in last section 1 2 Assembler Absolute assembler • That assembler added to the PDP‐11 simulator was an “absolute assembler” • You used a simple assembler in those PDP‐11 exercises Magic? ; Program to copy and determine length of string .origin 1000 001000 012701 start: mov #msga, r1 001002 001024 001004 012702 mov #msgb, r2 001006 001076 001010 005000 clr r0 001012 112122 l1: movb (r1)+, (r2)+ 001014 001402 beq done 001016 005200 inc r0 001020 000774 br l1 001022 000000 done: halt msga: .string "A string whose length is to be determined" 001024 020101 001026 072163 … 001074 000144 msgb: .string "Different string that should get overwritten" 001076 064504 001100 063146 … 001150 067145 001152 000000 .end start Source text of program – Which means that it works with explicit addresses for where code and variables go. – Working with explicit addresses makes things simpler – but is very limiting • It really is only appropriate when 1. 2. Coding for a simple machine with no operating system (like the PDP‐11 in the simulator) The entire program exists as a single source file – no use of libraries • Will start by looking at how an absolute assembler works – Subsequently look at a “relocatable assembler” that allows for use of libraries and OS facilities. “Assembled” code 3 What did the “absolute assembler” do? How is the assembler performing magic working? • Read a text file • It starts with a “symbol table” that has entries with details of instruction mnemonics – Text! • Things like – Comments – Assembler directives (e.g. the origin directive that set the start address for the subsequent group of instructions or data definitions) – Mnemonic instruction names – mov, beq, … – Combinations of register names, r1, and addressing mode tags – r1+, (r1), @(r1), … – Names invented for variables, label names on instructions • But as seen in the input they are all simply character sequences • Perform magic! • Output a representation of the bit patterns, representing instructions or initialized data variables, that are to be placed in specified memory locations 5 nabg 4 – mov • “2 address instruction” • Bit pattern 0001 ssssss dddddd – a 4‐bit op‐code, 6‐bits for addressing mode of source address, 6‐bits for addressing mode of destination • May use one or two following words in memory to hold address data – br • “instruction with branch offset bits” • Bit pattern 00000001 oooooooo – an 8‐bit op‐code, 8‐bits for an offset address • Will require a name of a label as next input element – inc • “1 address instruction” • Bit pattern 0000101001– an 10‐bit op‐code, 6‐bits for addressing mode of destination • Will require addressing data as next element 6 1 How is the assembler performing magic working? How is the assembler performing magic working? • It has a set of built in processing rules • It has a set of built in processing rules 1. Discard comments 2. Use origin directive to set a “location counter” whose value determines the address where the next instruction or data element is to go 3. Process other directives • E.g .string – Convert quoted string into a sequence of bytes (rembering to swap byte order!), and add 1 or 2 nul bytes; – Work out how many bytes used, and update the location counter 4. Recognize instructions and check addressing modes • • • • .blkw – Consume next element (should be an octal constant defining number of words to be left empty) – Update location counter E.g. clr r0 – – clr ‐ a one address instruction r0 – mode and register information; it’s register mode on register r0 This is a correctly formed instruction that with its data will occupy one‐word of memory, update location counter by 2 Could (but will not at this stage) generate the bit pattern – – – clr ‐ 10‐bit instruction format 0000101000 r0 – register mode 000, register r0 000 So instruction word would be 0000101000000000 7 Instructions referencing labels How is the assembler performing magic working? • Assembly code has labels on instructions and data elements • Instructions reference these labels • Bit pattern generated for the instruction will depend on address of referenced label • It has a set of built in processing rules 4. Recognize instructions and check addressing modes • E.g. beq done – – – 8 beq – an instruction that will need an 8‐bit offset ‐ the amount to branch Value of offset will be difference in location value of “done” and current location counter Ooops! » Don’t know where done is! » Haven’t met this symbol! – beq done – • need the offset, the number of bytes between current location counter and the still unknown address associated with label done – mov #msgb, r2 • #msgb – immediate addressing, next word following the mov instruction is supposed to contain the address associated with label msgb – but that is still unknown! 9 Assembler must use 2 passes 10 Assembler must use 2 passes • First pass • Second pass – Basically simple checking of input – Generate the bit patterns that represent the code and data • Are all instructions and addressing modes appropriate • Keep track of location counter – So have to carefully consider addressing modes of each instruction along with space allocation by directives like .string. – Recognize labels and add them to symbol table start: mov #msg1, r1 • Add entry: symbol type = “label”, name = “start”, value=1000 l1: movb (r1)+, (r2)+ • Add entry: symbol type = “label”, name = “l1”, value=1012 11 nabg 12 2 Line‐by‐line processing Magic: step‐by‐step • The example program • Of course, assembly language code is much simpler than code in a high level language • Each line of code can be processed in isolation – don’t have to maintain any context (apart from location counter) – (Think about a high level language with if … then … elsif … then … elsif … then … else … constructs that span multiple lines – need to remember that have to match up if and else etc) ; Program to copy and determine length of string .origin 1000 start: mov #msga, r1 mov #msgb, r2 clr r0 l1: movb (r1)+, (r2)+ beq done inc r0 br l1 done: halt msga: .string "A string whose length is to be determined" msgb: .string "Different string that should get overwritten" .end start 13 14 Magic: step‐by‐step : Pass 1, step 1 ; Program to copy and determine length of string start: mov #msga, r1 • First non‐white space character is “;” start: – this is a comment; – Discard entire line .origin 1000 • This is an “origin” assembler directive – Consume next input token which must be an octal number (not requiring a leading 0) – Set a variable “location” to this value, this “location” variable defines the address where the next instruction or data element will be placed Processing rules: condition (matched lexical element) and action Magic: step‐by‐step : Pass 1, step 2a • This is a label, it will get referenced elsewhere in program; it’s associated with current address location i.e. 1000 – Create an entry in symbol table • Symbol: Type=label, Name=start, Value=1000 15 16 Magic: step‐by‐step : Pass 1, step 3 Magic: step‐by‐step : Pass 1, step 2b start: mov #msga, r1 start: mov mov #msgb, r2 mov • mov is a symbol, • – Look it up in symbol table mov is a symbol, – Symbol: – – – – Type=instruction with 2 addresses, – Name=mov, – Value=0001 (4‐bit) – – Instruction symbol, that’s ok; check the two addresses start: mov #msga, Type=instruction with 2 addresses, Name=mov, Value=0001 (4‐bit) Instruction symbol, that’s ok; check the two addresses mov #msgb, • #msga, that would be “immediate addressing mode” – there will be an address placed in the next word, remember to update the location counter appropriately • r1, that would be “register addressing” • Location counter will now be 1004 17 nabg Look it up in symbol table • • Symbol: • #msgb, that would be “immediate addressing mode” – there will be an address placed in the next word, remember to update the location counter appropriately • r2, that would be “register addressing” • Location counter will now be 1010 18 3 Magic: step‐by‐step : Pass 1, step 4 clr r0 l1: movb (r1)+, (r2)+ l1: clr • clr is a symbol, • This is a label, it will get referenced elsewhere in program; it’s associated with current address location i.e. 1012 – Look it up in symbol table • Magic: step‐by‐step : Pass 1, step 5a Symbol: – Type=instruction with 1 address, – Name=clr, – Value=0000101000 (10‐bit) – Instruction symbol, that’s ok; check the one addresses clr r0 – Create an entry in symbol table • r0, that would be “register addressing” • Location counter will now be 1012 • Symbol: Type=label, Name=l1, Value=1012 19 20 Magic: step‐by‐step : Pass 1, step 6 Magic: step‐by‐step : Pass 1, step 5b l1: movb (r1)+, (r2)+ l1: movb beq done beq • movb is a symbol, • beq is a symbol, – Look it up in symbol table – Look it up in symbol table • • Symbol: Symbol: – Type=instruction with offset, – Name=beq, – Value=00000011(8‐bit) – Type=instruction with 2 addresses, – Name=movb, – Value=1001 (4‐bit) – Instruction symbol, that’s ok; check the one addresses – Instruction symbol, that’s ok; check the two addresses l1: movb (r1)+, (r2)+ beq done • done hopefully we will find this label before we have to generate code, but it looks ok for now • Location counter will now be 1016 • (r1)+, that would be “auto increment mode” • (r2)+ that would be “auto increment mode” • Location counter will now be 1014 21 Magic: step‐by‐step : Pass 1, step 7 22 Magic: step‐by‐step : Pass 1, step 8 inc r0 br l1 inc • br • inc is a symbol, – Look it up in symbol table • br is a symbol, – Look it up in symbol table • Symbol: – Type=instruction with 1 address, – Name=inc, – Value=0000101010 (10‐bit) – Type=instruction with offset, – Name=br, – Value=00000001(8‐bit) – Instruction symbol, that’s ok; check the one addresses – Instruction symbol, that’s ok; check the one addresses inc r0 br l1 • r0, that would be “register addressing” • Location counter will now be 1020 • l1 hopefully we will find this label before we have to generate code, but it looks ok for now (actually, we have already found it, but makes no difference) • Location counter will now be 1022 23 nabg Symbol: 24 4 Magic: step‐by‐step : Pass 1, step 9a Magic: step‐by‐step : Pass 1, step 9b done: halt done: done: halt done: halt • This is a label, it will get referenced elsewhere in program; it’s associated with current address location i.e. 1022 • halt is a symbol, – Look it up in symbol table • Symbol: – Type=instruction with 0 addresses, – Name=halt, – Value=0000000000000000 (16‐bit) – Instruction symbol, that’s ok; no addresses to check – Create an entry in symbol table • Location counter will now be 1024 • Symbol: Type=label, Name=done, Value=1022 25 26 Magic: step‐by‐step : Pass 1, step 10a Magic: step‐by‐step : Pass 1, step 10b msga: .string "A string whose length is to be determined" msga: msga: .string "A string whose length is to be determined" msga: .string • This is a label, it will get referenced elsewhere in program; it’s associated with current address location i.e. 1024 • String directive – Create an entry in symbol table – Require next element to be a quoted string – Determine length of string • If odd, add 1 to have null byte at end • If even add 2 (to have null word at end) – Update location counter, now 1076 • Symbol: Type=label, Name=msga, Value=1024 27 28 Magic: step‐by‐step : Pass 1, step 11a Magic: step‐by‐step : Pass 1, step 11b msgb: .string "Different string that should get overwritten" msgb: msgb: .string "Different string that should get overwritten" msgb: .string • This is a label, it will get referenced elsewhere in program; it’s associated with current address location i.e. 1076 • String directive – Create an entry in symbol table – Require next element to be a quoted string – Determine length of string • If odd, add 1 to have null byte at end • If even add 2 (to have null word at end) • Symbol: Type=label, Name=msgb, Value=1076 29 nabg 30 5 Magic: step‐by‐step : Pass 2, step 1 Magic: step‐by‐step : Pass 1, step 12 .end start ; Program to copy and determine length of string • End directive • 1st pass complete • First non‐white space character is “;” – Some limited syntax checking done, all those input statements looked ok – this is a comment; – Discard entire line .origin 1000 • This is an “origin” assembler directive – Consume next input token which must be an octal number (not requiring a leading 0) – Set a variable “location” to this value, this “location” variable defines the address where the next instruction or data element will be placed • (Would have stopped on an error such as an instruction requiring one address being given two address elements) – Labels all defined and added to symbol table. 31 32 Magic: step‐by‐step : Pass 1, step 2a Magic: step‐by‐step : Pass 2, step 2b start: mov #msga, r1 start: mov #msga, r1 start: mov start: • mov is a symbol, – • This is a label, skip over Look it up in symbol table • Symbol: – – – Type=instruction with 2 addresses, Name=mov, Value=0001 (4‐bit) – Start composing instruction word 0001ssssssdddddd start: mov #msga, • • • • • #msga, that would be “immediate addressing mode” – that was mode 2 register 7 so ssssss becomes 010111 r1, that would be “register addressing”, mode 0 register 1 so dddddd becomes 0000001 Complete instruction word is now 0001010111000001, output into location 1000; msga Look up value in symbol table for label named msga – it’s 1024; so output 1024 into location 1002 Location counter will now be 1004 33 34 Magic: step‐by‐step : Pass 2, step 6 Magic: step‐by‐step : Pass 2, steps 3, 4, 5 mov #msgb, r2 beq done beq • Similar to last line • clr r0 • Symbol: – Type=instruction with offset, – Name=beq, – Value=00000011(8‐bit) – 0000101000 dddddd • Work out address mode and register, mode 0, register 1 – Instruction symbol, that’s ok; check the one addresses – 0000101000 000001 beq done • Store instruction word in next address • done get value from symbol table, work out signed 8‐bit offset, combine with op‐code; store instruction in next address • Location counter will now be 1016 l1: movb (r1)+, (r2)+ Similar 35 nabg beq is a symbol, – Look it up in symbol table • Look up clr in symbol table – 36 6 Magic: step‐by‐step : Pass 2, step 7 Magic: step‐by‐step : Pass 2, step 8 br l1 inc r0 br inc • • inc is a symbol, • br is a symbol, – Look it up in symbol table – Look it up in symbol table • Symbol: Symbol: – Type=instruction with offset, – Name=br, – Value=00000001(8‐bit) – Type=instruction with 1 address, – Name=inc, – Value=0000101010 (10‐bit) – Instruction symbol, that’s ok; check the one addresses – Instruction symbol, that’s ok; check the one addresses br l1 inc r0 • l1 another of these offset instructions, now can get address of l1 from symbol table so can work out how many bytes to jump back, this is signed offset • Compose instruction word and store • Location counter will now be 1022 • r0, that would be “register addressing” • Compose instruction word and save in next address • Location counter will now be 1020 37 Magic: step‐by‐step : Pass 2, step 9 38 Magic: step‐by‐step : Pass 2, step 10 msga: .string "A string whose length is to be determined" msga: done: halt done: • Ignore the label • Process the string by copying bytes from string into successive memory locations, remembering to swap the byte order before storing them! Keep updating location counter as go along, and remember to zero out the nul byte (odd length strings) or nul word (even length strings) • Ignore the label done: halt • Look up 0‐address halt instruction in symbol table and output appropriate bits 39 Magic: step‐by‐step : Pass 2, step 11 40 Magic: step‐by‐step : Pass 2, step 12 msgb: .string "Different string that should get overwritten" .end start • Similar processing to last string • End directive • 2nd pass complete – Code all generated – Record start address. 41 nabg 42 7 Writing an assembler Pseudo‐code : 1 • Most of the code of an assembler program has to be crafted, – it is inevitably very specific to the machine and its instruction set and the assembler directives that provide features such as string encoding – PDP‐11 issues function assemble() { initializeSymbolTable(); // Loads predefined symbols try { pass1(); } catch (errormessage) { /* error report and terminate */ … } showSymbols(); // Print symbol table – useful reference for programmer debugging code try { pass2(); } catch (errormessage) {/* error report and terminate */ … } • Lots of different instruction formats and addressing modes – – – – – Branch instructions (8‐bit op‐code, 8‐bit offset) “sob” (looping) instruction – a unique format applying only to this instruction “register/address” instructions (e.g. MUL) Single address instructions Double address instructions } • Odd directives like that .string that had to re‐order bytes 43 44 Pseudo‐code : 2 Example input • Pass‐1, processing code line by line function pass1() { stop = false; phase = 1; initializeInput(); getNextLine(); while (!stop) { processElements1(); getNextLine(); } } loop: cmp r3,count ; while(r3<count) … • Has: – Label, “loop” – Instruction cmp • Need to check 2 address elements to determine whether following memory locations needed for address data – cmp r3,r4 » That would be a single 2‐byte instruction – cmp r3,count » It’s going to need 4‐bytes, 2 for the instruction and 2 for the address of “count” – cmp count,limit » That would require 6‐bytes! – Here it’s r3,count; so 4‐bytes – Comment 45 46 Pseudo‐code : 3 Pseudo‐code : 4 : handling a label function processElements1() { function p1_process_label(anobj) { var element = getNextElement(); badAddress(); // Trying to define a label – first check that have a current location defined! // Add the label to the symbol table var name = anobj.value; var ndx = findSymbol(name); // Better not exist already! Process the loop: label if (ndx >= 0) { throw "Redefining label " + name; } definition, putting an entry var obj = new Object(); into symbol table. obj.type = "label"; obj.name = name; obj.value = currentaddress; addSymbol(obj); // Now process rest of line ‐ // meaningful things are some directives (e.g. .string), instructions, or nothing // how about recursive call processElements1(); if (element === null) return; if (element.type === "comment") { return; } else if (element.type === "constant") { p1_process_constant(element); } else if (element.type === "directive") { p1_process_directive(element); } else if (element.type === "label") { p1_process_label(element); } else if (element.type === "number") { p1_process_comma_values(); } else if (element.type === "name") { p1_process_comma_values(); } else if (element.type === "instruction") { p1_process_instruction(element); } } } loop: cmp r3,count ; while(r3<count) … getNextElement() will return a “label” element consuming “loop:” from input 47 nabg Recursion: call to getNextElement() in processElements1() will find cmp – an element that is “instruction” type; resulting in call to { p1_process_instruction(element); 48 8 Pseudo‐code : 5a : handling an instruction Pseudo‐code : 5b if (addrtype === "special") { // deal with instructions like rti, rts, trap, … that have special formats and which // may require further analysis in pass 1 p1_handle_special(elem); } if (addrtype === "sob") { // here treat as a simple two byte instruction, no processing // in pass 2 will need to check register and offset return; } // Remaining instructions must have address data // Several possibilities like one‐address, register and address, two address // consume all input that could be part of address specification function p1_process_instruction(elem) { badAddress(); var name = elem.value; var ndx = findSymbol(name); var instruction = getSymbol(ndx); var addrtype = instruction.addrtype; // Instructions are two or more bytes ‐ it is the "more" that is hard work updateCurrentAddress(2); if ((addrtype === "0addr") || (addrtype === "offset")) { // Easy ‐ it was simply a two byte instruction, so have completed dealing with it // Not checking any offset address part in this pass. return; } loop: cmp r3,count ; while(r3<count) … … cmp will be recognized as a 2‐address instruction var addrpart = grabAddressPart(); if (addrpart === "") { throw "Missing address part of instruction on line " + linenum; } 49 loop: cmp r3,count ; while(r3<count) … The substring r3,count will be taken as the addrpart Pseudo‐code : 5c Pseudo‐code : 5d loop: cmp r3,count • Got to check those address parts in r3,count to determine whether any extra words needed for address data // And a safety check, input line should now be empty or a comment var elem2 = getNextElement(); if (!emptyOrComment(elem2)) { throw "Extraneous data after instruction on line " + linenum; } if (addrtype === "regaddr") p1_process_regaddr_instruction(addrpart); else if (addrtype === "addrreg") p1_process_addrreg_instruction(addrpart); else if (addrtype === "1addr") p1_process_1addr_instruction(addrpart); else function p1_process_2addr_instruction(addrpart) { // Should have two address parts, separated by comma var parts = addrpart.split(","); // Process each individually p1_process_1addr_instruction(parts[0]); p1_process_1addr_instruction(parts[1]); if (addrtype === "2addr") p1_process_2addr_instruction(addrpart); else { throw “Unrecognized address type " + addrtype; } } } loop: cmp r3,count ; while(r3<count) … The substring ; while(r3<count) … will be recognized as a comment 51 Pseudo‐code : 5d : Examining a single address part 52 Lots of special cases • Series of tests for all possibilities allowed by assembly language syntax, including: – Simple register, e.g. r0: no extra memory words – Register indirect, e.g. (r1): no extra memory words – Register indirect (!) @r1: no extra memory words – Auto‐increment (r1)+: no extra memory words – Immediate #msga: one extra memory word – Indexed data(r0): one extra memory word –… 53 nabg 50 • With all the different addressing modes to consider the code, for both Pass‐1 and Pass‐2, have to have lots of special cases to correctly determine locations and later generate code. • Custom coding unavoidable. 54 9 getNextElement • In contrast to rest of assembler with its highly customized code, the code for something like getNextElement() is relatively easy to standardize. • What it has to do: Lexical Analysis getNextElement() – Consume characters from input line, packing them into some other string variable – Recognize what type of element they are from the characters read Used when processing lines (in both Pass 1 and Pass 2) Returns “label” | “instruction” | “comment” | “directive” … • Possibly identified by 1st character • Possibly identified by last character • … 55 56 Recognising a comment Recognising a directive • • • • • Skip any leading white space characters • Find “;” – know that dealing with comment • Consume all characters until end‐of‐line Skip any leading white space characters Find “.” – know that possibly dealing with directive Consume all letters until first non‐letter Check that the substring built matches a name in a table of directives “a‐z” Whitespace “not \n” Whitespace Start Start Handling comment ; Return “ERROR” Any Handling possible directive . Not found Whitespace \n Check in list of directives Return comment Return “directive” Found 57 Recognizing a name as label‐declaration | instruction | label Recognizing an octal number Whitespace Error 0‐7 58 Whitespace Start 0‐7 any (building number) Start (building name) a‐z Whitespace Return label‐ declaration Listed as instruction Return “instruction” 59 Not present in table Whitespace : Return number nabg Error a‐z0‐9_ any Lookup in symbol table Listed as known label Return name 60 10 Finite state machines (almost) Finite state machine • Those element recognizers are variants on a “finite state machine” • Wikipedia says: – “A finite‐state machine (FSM) or finite‐state automaton, or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition.” – A computing device that can be in one of a number of states which makes transitions between states based on inputs – Initial state: • Input is space – stay in same state • Input is 0‐7 – transition to state where try to recognize an octal number • Input is . – transition to state where try to recognize a directive • Input is ; ‐ transition to state where try to recognize a comment • Input is a‐z – transition to state where try to recognize label‐ declaration, or instruction, or label name • Input is anything else – oops an error They aren’t true finite state machines as some of the states involve complex processing – such as checking whether a name is in the symbol table – to make decision on next state. A true finite‐state‐machine works entirely on inputs determining state changes. 61 62 Or, for the more mathematically inclined, … Creating the getNextElement() function • Typically, someone writing an assembler program would have sketched out the complete state chart and used this to guide coding Not \n Consuming comment ; Return comment \n 0‐7 Whitespace Building a number whitespace Return number 0‐7 other Start . a‐z 63 Creating the getNextElement() function • A high‐level pseudo‐code outline could then be composed (using conditionals and “gotos” that would easily convert to “cmp” and “br” or “jmp” instructions in the assembly language implementation) Start: ch = getNextCharacter(); if(isWhiteSpace(ch))goto Start; if(ch in 0-7) goto Number; if(ch is “;”) goto Comment; if(ch is “.”) goto Directive; if(ch in a-z) goto Name; goto Error; Comment: ch = getNextCharacter(); if(ch not “\n”) goto Comment; return Comment-symbol; Number: ch = getNextCharacter(); if(ch in 0-7) goto Number; if(isWhiteSpace(ch)) return Number-symbol; goto Error; 65 (Assembler programs were generally written in assembly language) nabg a‐z Return error other Possible directive Not found whitespace Label‐declaration | instuction | name Check in list of directives etc 64 Return directive Auto‐generated lexical scanner? • Code for such a “lexical scanner” is highly stereotyped. • It is relatively easy to write a program that takes as input a set of definitions of the expected input elements (octal‐ number, comment, directive, …) and generates the scanner code that recognizes elements. • Lexical scanner generators were one of the first automated programming systems created – – We briefly look at one (flex – the Linux version of lex from Unix (~1975)) in the next section on compiled languages 66 11 Lexical Scanner in an assembler • Because lexical scanners can have a nice mathematical basis – a finite state machine – they often get a disproportionate amount of attention in “Computer SCIENCE” presentations on assemblers • The real hard work is in the custom code that deals with bizarre instruction formats and instructions that take 2‐, 4‐, 6‐, 8‐ or more bytes Generated binary code 67 68 Assemblers and Loaders Assembler outputs generated binary • The assembler that has been added to Schmidt’s PDP‐11 simulator places the generated bit‐patterns directly into the integer array that the simulator uses to represent the memory of the computer. • In a real system, a binary output “file” (well, in the old days, a binary paper tape) would be generated containing address information and bit patterns that would later be used by the loader program that places the program in memory Assembly – pass 1 “Create Symbol Table” Assembler program Your assembly language program source tape Start Loop Lft 0200 0210 0216 Symbol table generated and saved in memory This area of memory typically used for data by current program Real loader program – can be overwritten Rim boot loader – hopefully never overwritten 69 Assemblers and Loaders Assemblers and Loaders Assembly – pass 2 “Generate code” Load your program 010000110101010 001101011101010 010101001010101 011101010101010 Assembler program Your assembly language program source tape (read for 2nd time) Start Loop Lft 0200 0210 0216 70 Computer Memory Binary tape with bit patterns for the instructions making up your program Binary tape with bit patterns for the instructions making up your program Bit patterns for your instructions placed in memory This area of memory typically used for data by current program Real loader program – can be overwritten Real loader program – can be overwritten Rim boot loader – hopefully never overwritten Rim boot loader – hopefully never overwritten 71 nabg 72 12 No fixed address • The .origin directive, as used in the code for our assembly language exercises, fixes the memory locations where the code and data elements will be located. • But on a computer with a real (multi‐tasking) operating system, the location for the program isn’t determined until execution requested. • So, the generated binary files cannot include actual addresses – and the bit patterns for some of the instructions must remain incomplete until load time. Relocatable code and linking with libraries 73 That example program again Loading the relocatable file • Suppose the start address wasn’t known, what could an assembler fill in? Absolute code (Note how branch instructions that use offsets won’t require any more work at load time – an advantage of ‘position independent code’) 74 012701 ?????? 012702 ?????? 050000 112122 Relocatable code 001402 005200 000774 000000 020101 072163 … msga=start+24; insert at start+2 msgb=start+76; insert at start+6 75 • If a file were created with that relocatable code represented in some manner, a “loader” could 012701 – Pick a start address where code to go • E.g. 2400 – Copy the binary data into that location and successive memory words – Process the relocation data: • msga must have value 2424 – Insert 2424 in 2402 • msgb must have value 2476 – Insert 2476 in 2406 – Code could then be executed Relocatable binary file Changes to the assembler • Files with generated “relocatable” code will have such a structure – Segment with the code – • 0‐valued bytes filling in places where addresses have to be filled in by the loader – Segment with relocation data – • ~names of labels (labels on instructions as in loops, branches, subroutine calls; labels (names) of static variables) • Their values as offsets from a start address that only the loader will know • A set of values identifying the places where the address must be written when known (usually, variables and subroutines are referenced at multiple places in the code – so it’s a set of places); again these locations would be defined as offsets from the start. 77 nabg 000000 → 0002424 012702 000000 → 0002476 050000 112122 Relocatable code 001402 005200 000774 000000 020101 072163 … msga=start+24; insert at start+2 msgb=start+76; insert at start+6 76 • Pass‐1 – It no longer has actual addresses to work with – Now, its “location” variable is not an address, it is a byte offset from the start of the code. • Pass‐2 – Label: • Don’t discard (as did previously); instead put an entry in the relocation data that is being generated separately from the code (and which will be appended to the binary code that is written to a file) – Record in relocation data: label‐name, value offset from start – Any address (address modes such as indexed – data(r1); immediate addressing #msg) • Don’t attempt to insert an address, leave a zero word • Add an entry in relocation data – Record in relocation data : label‐referenced, value offset from start – At end of pass‐2, append all relocation data to binary data as written to file 78 13 That example program again But what about using libraries Pass‐2 Output code to file • Building a relocatable code file (No .origin directive) Pass‐1 Construct symbol table working with offsets start @ offset 0 l1 @ offset 12 done @ offset 22 msga @ offset 24 msgb @ offset 76 (values in branch instructions can be determined from differences in offsets of current location and referenced location) Append necessary relocation data 012701 ?????? 012702 ?????? 050000 Relocatable code 112122 001402 005200 000774 000000 020101 072163 … msga=start+24; insert at start+2 msgb=start+76; 79 insert at start+6 • Assume that we have a slightly more sophisticated environment on our PDP‐11 (it’s got the OS that you are going to write as an Xmas holiday exercise) – There is a minimal OS kernel that provides ‘supervisor calls’ for things like reading or writing a byte – There are a bunch of standard functions written in assembler (which utilize the OS’s supervisor calls where necessary) • Readline – called with address of program buffer where input to be placed (use read‐svc) • Writeline – called with address of message to be printed (use write‐svc) • Itoa – called with integer value and address of buffer where ascii representation to go • Atoi – called with address of buffer with ascii string, returns integer • … 80 Your code using the libraries Simple way • The simplest way to use the library functions was to append the source of the libraries onto your program text! • The assembler would see both your code and the library functions, and calls to library functions would be handled just the same as calls to subroutines defined in your code. • Assembler could build either an absolute binary file (if permitted and origin directives set) or a relocatable file which would be sorted out by the loader • So, you are writing your program using the libraries – start: call writeline msg1 call readline buff1 call atoi buff1 … • How is the assembler to deal with those calls to the library functions? – All subroutines from library would appear at offsets from the start of program, loader could determine their addresses and fill these address into code as needed. • Obviously, not the most desirable approach. – (It was used in the 1950s) 81 82 But if the source code is not appended … Declaring “externals” • In pass‐1, the assembler wouldn’t be troubled by the fact that “readline” and “writeline” were not defined, it would easily determine that “call writeline” would occupy two words – one for the instruction, one for the address of the writeline subroutine – which was all it needed to do. • Pass 2 would be troubled! • Programmer had to identify the names of library functions so that assembler could process them correctly (rather than see them as errors) • Assembly program source code would start with a set of “extern” declarations – It looks up the name “writeline” and finds that this label has never been defined; probably a programmer’s error, terminate. extern extern extern extern readline writeline atoi itoa • Assembler would simply add these names to its symbol table as being labels for subroutines at unknown offsets 83 nabg 84 14 Resulting relocatable file – will need to link with library binaries The library code • Your code would be assembled to give a relocatable file as before • The relocatable files with the library code would have been created in much the same by assembling the library source. • There is one small refinement needed – – Extra entries in the relocation table • • • • – The library code provides those subroutines like readline, and atoi – msga – start+40, used at start+2, count – start+70, used at start+6, start+22, start+30 writeline – unknown, used at start+10 … • these names will appear as labels in the source of the library code • These names are referenced as externals in your code and need to be handled by the linker – Of course the library source code will have other labels – • You must then “link” this relocatable file with relocatable files that contain the library code in order to get a relocatable executable with all the code that the load can then place in memory. • labels on instructions referenced in branches or jumps, • entry points to private auxiliary subroutines, • names for static variables and constants belonging to the library – These will need to be in the relocation data for the library code – But we don’t want these (private implementation only) names getting confused with names of things in your program! • Everybody (including you and the author of the library code) uses loop:, l1:, l2:, done:, temp:, … 85 More elaborate relocation data 86 Using libraries : the “code” • Your code • The information stored in the “relocation data” section of a binary file generated for a library (and actually for your code as well) needs to distinguish which names are “local” to that file and which are “global” and are to be used when linking multiple files – Names that are “local” to files can appear in multiple files – the linker will not get them confused – Names that are “global” must be unique (so don’t use the name of a subroutine in the library for one of your subroutines) • Assembly language systems had various ways of allowing a programmer to identify which names were to be “global” – One approach – declare them as “entry” points extern writeline extern readline … start: call init … loop: call writeline msg1 … … init:mov count,r0 … … msg1: .string “…” count: 400 … • Library file entry writeline entry readline … writeline: mov (sp),r0 … loop: cmp (r0),endline … … … endline: .string “\n” … 87 Using libraries : the files with relocatable binary Yours: Library: Code 004767 000000 … … Relocation data Code 012600 … … Relocation data local1 (a.k.a. “count”) @offset 40 used @offset 10, @offset 16 local2 (a.k.a. “msga”) @offset 66 used @offset 2 writeline @unknown-extern used @offset 2 … Linker builds a “relocatable executable” • The linker will build a new file by appending the code sections, yours and that of each library you use, and constructs a new relocation table local1 (a.k.a. “loop”) @offset 160 used @offset 10, @offset 16 local2 (a.k.a. “endline”) @offset 364 used @offset 2 writeline entry@offset 0 not referenced in this code readline entry@offset 204 not referenced in this code – Still need a relocation table, because we still don’t know the “start” address where the program is going to be loaded – The linker works out new offsets relocation symbols for all appended files • Say your code and data take up 640 bytes • Library with writeline placed immediately after your code – so offset for writeline is now 640 rather than 0 … 89 nabg 88 90 15 Loader again Relocatable executable Code 004767 000000 … … 012600 …. Relocation data local1 (a.k.a. “count”) @offset 40 used @offset 10, @offset 16 local2 (a.k.a. “msga”) @offset 66 used @offset 2 writeline @offset 640 used @offset 2 Yours I/O library local1b (a.k.a. used @offset local2b(a.k.a. used @offset … “loop”) @offset 1020 650, @offset 656 “endline”) @offset 1224 642 • Loader can work as before with combined “relocatable executable” – Takes actual start address and uses relocation data entries to resolve final address – Insert these finalized addresses into actual code where needed Both are included in the one binary file. 91 92 Mixing code and data? • The approach described so far would have been used back in the 1970s • One consequence – – Code and data are intermixed One more refinement • Consider your assembly language code – – – – You had your code You might have some memory locations for integer constants You would have had fixed message strings You would have had memory locations for variables such as counts • The library would be similar – Code, then maybe space allocated for counts or for buffer arrays • When simply appended together to form a “relocatable executable” you end up with – Code, data, code, data, … 93 Bad idea! More elaborate relocatable files • Programmers have learnt by bitter experience that code and data should be kept separately – Then if you do overrun an array, you don’t store data over code – data that might later be interpreted as an instruction sequence! • As explained in the earlier section on operating systems, now expect to have different “segments” – “text” segment for the code – “static” segment for the kinds of data elements that we’ve been considering – (and, of course, stack and heap segments) 95 nabg 94 • Relocatable binary files for code and libraries and relocatable executable files now have more complex structures with separate sections for code and data • Assembler creates these when writing the file – Code – output to code section along with relocation table for “entries” and local labels for loops etc – Data – all the variables, strings etc kept aside and then written to a “data” segment in the relocatable file – followed by the relocation data for these elements • Linker combines files when creating relocatable executable – First combines all the code to get code section and its relocation data – Then combines all the data to get static data section and its relocation data 96 16 ELF ELF • ELF ‐ Executable and Linkable Format – Structure of relocatable code files and relocatable executables is pretty much standardised for Unix/Linux environments: Linux, Solaris, NetBSD, MINIX, ..., also Android and others – .o files from C/C++ compilations .a files with library code for static linking and linked executables all use “ELF” files • The structure of an ELF file is a more sophisticated variant of the structure just suggested – http://www.skyfree.org/linux/references/ELF_Format.pdf – http://wiki.osdev.org/ELF – https://refspecs.linuxbase.org/elf/gabi4+/ch4.intro.html – – “code” and relocation information – “static data” and associated relocation information 97 nabg • You may learn a little more about the structure and use of ELF files in CSCI212 • Lots of references on the web: (I’ve never needed that detailed a view) 98 17