CSE 378 Homework #2 Additional Information

Essential Information

The intent of this assignment is to help you understand the relationship between programs you might write in C (or similar languages) and the executable file, and more particularly the tool chain required to go from the C source to an executing program (which we call a process, by the way). There are two parts of this that are often confusing: the role of type declarations / type checking, and the roles of the linker and loader. As the assignment is about the second, I'll discuss that first, and then provide some optional reading about type declarations.

To understand what the purpose of the linker, it's important to think of a program as being composed of source in multiple files. The code in any particular file uses a set of symbols (e.g., function names). Those names can be either local (defined in that file) or external (defined in some other file). For instance, in this homework main() has a call to sub(), which is an external symbol. When compiling main.c, the compiler sees the call to sub(), but not sub() itself (because the C compiler looks at only one source file at a time, something a little different from what Java does). When it encounters the call to sub(), it knows it has to generate assembler code that will cause a branch to the machine instructions implementing sub(), but (a) it doesn't need to know what those instructions are (because all it has to do is generate a branch to whatever they are), and (b) it can't know where they will be loaded in memory during execution. Because of (a), it doesn't even bother to look for the definition of sub() - it doesn't need it. (It does need its type, though, which is why there are .h files.)

So, when compiling main(), the compiler generate an assembler instruction that includes the symbol sub, deferring the entire issue of resolving symbols (to their values - in this case, the address of the instructions implementing sub()) to later phases of the tool chain.

The assembler works in basically the same way as the compiler, looking at only a single source file at a time. When it sees the symbol sub, and can't come up with a value for it because it is an external symbol, it annotates main.o to indicate that the value of sub is needed by the jal instruction (indicating that instruction by its offset into the text segment of the .o file being produced). That is, the assembler further defers resolving sub to the linker.

As well as annotating the .o with the list of external symbol values that it needs, the assembler also annotates it with the values of symbol values that it defines. (Actually, these annotations are for symbols that have been declared to be global, meaning referencable from other files. Non-global symbols cannot be referenced from other files, and so the .o should not have annotations for them.) For instance, main.o would have an annotation giving the value of the symbol main, as an offset into the text segment of the .o. sub.o would have an annotation giving the value of the symbol sub, also as an offset into its text segment.

Unlike the two previous tools, the linker looks at all files at once: its input is a set of .o files (and perhaps some libraries, which you can think of as multiple .o's essentially zip'ed together into a single file). The linker collects the anotations indicating the definition of symbol values in all the .o's, and then runs through the annotations indicating which symbol values are needed by which instructions, filling in the ones it can. This process (of filling in fields of the partially encoded instructions emitted by the assembler) is called patching.

So, that's the job of the linker. And that's one of the two reasons why the full 32-bit encoding of an assembler instruction may not be knowable until some step that follows assembly. The other reason is related to what the loader does.

The main job of the loader is to find an unused chunk of main memory big enough to hold the executable, and then to read the .exe file into that memory. This means that the .exe may be loaded at different addresses each time it is run. Because some MIPS instructions have encodings that depend on absolute addresses (that is, exactly which memory locations are allocated to the execution of the program), some instruction patching may have to take place at this step as well: this is the earliest point at which absolute addresses are known.

So, the summary is that the assembler may not have enough information to fully encode some instructions at assembly time, either because the instruction involves an external symbol or because the instruction's 32-bit encoding requires knowledge of an absolute address. The linker can resolve external symbols, but can't determine absolute addresses. The loader can resolve absolute addresses.

(Note: Cebollita performs all these functions quite lazily (some might say sloppily), deferring full encoding of instructions until later than is absolutely necessary in some cases. For answering the assignment, you should assume that patching is something to be avoided, and that each tool in the chain fully encodes every instruction it is possible for it to fully encode, deferring only those for which it has no choice.)

Optional Information: Type Checking

Type checking is a concept implemented entirely by the compiler. It may not have leapt out at you, but there really aren't types in assembler. Even though
int i;
in a C program is compiled into something like
i: .word 0
in the assembler program, all the latter does is ask to reserve 32-bits of memory, and incidentally initalize it to 32 0-bits. You can't tell if those 32-bits are an integer or not - as you know, they're really just 32 bits, and each instruction that manipulates them independently implies what type it wants to consider them as.

The C program has types so that it can help the programmer, by looking for uses of variables that don't make sense, and so are likely to be bugs. For instance, given the declaration above, it's more likely that the statement

i = "this is a test";
is a mistake than what the programmer really meant. (It turns out that C is so weakly typed that that statement will compile, but the odds are it isn't immediately apparent to you what it means -- that is, it is most likely a mistake.)

Given that a C program successfully compiles, meaning that an assembler program for it has been generated by the compiler, there's really no reason to worry about type checking any longer. The assembler program is guaranteed to compute what the C program did (assuming the compiler doesn't have bugs), and since the C program made sense (passed type checking), so does the assembler program.

In any case, type checking is meaningful as an idea only in the language/compiler, and disappears once you reach assembler.