CSE 351, section 4: Procedure call

Michael Ratanapintha

Memory layout and usage of the stack

As Dr. Zahorjan just barely touched on yesterday, the memory address space of a single process consists of several regions. From lowest addresses to highest, these are:

the text segment (instructions), starting at address 0
constant data (literals)
statically allocated data (globals and static locals)
the heap (explicitly new'd objects) - starts just above statically allocated data, grows to higher addresses
the stack (local variables, activation records) - starts from the top of the address space (address 2^N-1 for word length N), grows to lower addresses

Here is a vaguely pictorial view of the memory layout:

0x0000_0000 |--------------|
            | Instructions |
     |      |--------------|
     |      |   Literals   |
     |      |--------------|
     |      |   Globals    |
     |      |--------------|
     |      |              |
     |      |    Heap      |
     |      |              |
     |      |--------------|
     |      |      |       |
     |      |      |       |
     |      |      |       |
     |      |      V       |
     |      |              |
     |      |              |
     |      |              |
     |      |      ^       |
     |      |      |       |
     |      |      |       |
     |      |      |       |
     |      |--------------|
     |      |              |
     |      |    Stack     |
     V      |              |
            |              |
0xFFFF_FFFF |--------------|

The activation record (AR)

Dr. Zahorjan said yesterday that the stack contains local variables for each function call. I asked what was significant about local variables. We established that every procedure call must have separate space in memory for each local variable, because

they cannot (easily) be referenced from source code outside the procedure's definition, and
because different calls to the same procedure do not share the same copy of each local variable, not even recursive calls.

I noted that this was fairly elementary stuff, what you learn in the first few weeks of CSE 142, but that you might not have thought of it explicitly before.

Building on this, we identified several data that must be kept for each call of a C function or Java method:

Local variables, as per above
Temporary values - values that are computed once and then thrown away, such as the values of expressions like x+1 , are naturally private to the procedure call that computes them.
Argument values - if a procedure modifies its arguments, the results are not reflected in the calling procedure's versions of the arguments. Hence the two versions must be separate copies. (An apparent exception: Java object variables are references to objects, not the actual object data, so calling a method or changing a field of the object does affect the caller's version of the object, which it shares with your procedure. However, if you assign to the reference variable, then that doesn't affect the caller. You should think of Java as passing object references by value - i.e. making copies of references which are passed to the called procedure.)
Return address - the called procedure, aka the callee, needs to determine the location in memory of the instruction to jump back to on function return.
Return value - this is like a local variable, but it is also copied back into a well-known, shared location so the caller can access it once the callee has returned.

All these data form the activation record, or AR, for the procedure call. Every active procedure call must have its own AR somewhere in memory.

Implementing ARs with stack frames

In C and Java, each program (actually each thread) has a chain of active procedure calls, only one of which is actually executing code at the moment - the others are suspended waiting for a callee to return. The procedures must logically return in the opposite order from which they were called.

I said that this is naturally modeled by a stack, where the running procedure's AR is on the top, its caller's AR is next from the top, and so on. The "stack" in memory is intended to replicate this logical structure - each procedure's AR is represented by a stack frame, with the running procedure call's AR at the top of the stack. Though stack frames implement ARs, they are not the same - as we'll see in the factorial example, an x86 procedure's argument values are in its caller's stack frame, even though they belong to the AR of the callee.

We maintain two pointers into the stack: the stack pointer or SP, which points to the last memory word pushed on the stack (that is, just below the top of the stack), and the frame pointer or FP, which points to the first word pushed in the current call's stack frame. The FP and SP bound the running procedure's stack frame - by using addresses relative to the FP or SP, each procedure's code can address any data on the stack frame without having to encode absolute, constant addresses.

When we call a function, we "push" a new frame onto the stack by subtracting from SP, saving FP onto the stack, and then updating FP to point to the bottom word of the new frame (just above the old SP, which pointed to the top word of the caller frame). Returning from a function requires us to "pop" the callee's frame by adding to SP, and then restoring the FP to the saved value stored on the stack.

Caller- vs. callee-saved registers

Not all of a procedure call's AR is on the stack. As you saw with the Z86, processors have a small set of registers, very fast but very small memories. Many data must be stored in registers or fixed memory locations, including the program counter (PC), the condition codes, and the FP and SP. In addition, several processors, including the Z86, only allow ordinary computation to be done on registers; while the full x86 partially supports computation in memory, register-register operations are also common.

On most processors, all procedure calls share the same register set. Hence, the register values used by a procedure call are also part of the AR; when a procedure call is made, registers shared by the caller and callee must be saved onto the stack and restored back when the callee returns. The caller and callee divide this responsibility by adopting a split-register-set convention:

Caller-saved registers must be saved into the stack by the caller prior to the procedure call, if the caller wants to use their values again. The callee is free to use caller-saved registers without saving them first.
Callee-saved registers are saved into the stack by the callee. The callee must save callee-saved registers before reusing them, so the caller can assume their values do not change.

The calling convention

When one procedure calls another, both caller and callee must agree on several things - where the callee's arguments and return value are placed, the ways the caller and callee are allowed to use each other's stack frames, which registers are caller-saved and callee-saved, and so on. While the processor provides some support for these matters, in general they must be decided as a matter of programming convention. This calling convention must be agreed between the processor architects, the OS authors, the compiler authors, and the authors of previously compiled libraries. If there is disagreement, your program might fail to work because one designer assumed something was safe while another designer decided that such an action would be wrong or unsafe.

x86 procedure calling convention

The x86 provides several instructions that are specialized to procedure call and stack manipulation, but even so there are many calling conventions on the x86. The one below is the cdecl convention used by gcc on Linux.

Register usage conventions

On the x86, the stack pointer and frame pointer are stored in registers. The SP register is called %esp (extended stack pointer), while the FP register is called %ebp (extended base pointer). The "extended" comes from the fact that the x86 used to have a 16-bit address space, and was extended to have a 32-bit one.

Both %esp and %ebp are callee-saved - because the callee is always going to overwrite them, making them callee-saved reduces the code size by not having to generate code to restore them at every call. However, you don't need to explicitly save %esp; as shown below, the stack frame is laid out so that the top of the previous stack frame is always a fixed offset from %ebp, so you can compute the old %esp instead of storing it.

The 6 other x86 integer registers also have special names, rather than numbers. 3 of them are caller-saved:

%eax
%edx
%ecx

The other 3 are callee-saved:

%ebx
%esi
%edi

All of these are used for computation. %eax is also used to store the return value when a procedure returns; the other registers have special purposes only in some relatively rare cases (such as multiplication and division instructions).

Stack layout

On the x86, each call's argument values and return address are stored on the stack. A callee's arguments are stored at the top of its caller's stack frame, with the first argument at the top, the second argument just below, and so on.

Even though it too is supplied by the caller, the return address for a call is actually stored at the bottom of the callee's frame, just above the first argument to the callee. This odd split has mostly to do with the behavior of the special x86 instructions for procedure call. The second-from-bottom word of a callee's stack frame is generally the saved FP of its caller.

This word description gives rise to the following diagram of the stack frame layout, as used by the GCC C compiler on x86 Linux [1]:

direction   |---------------------|                   ^
of stack    | Space for arguments | <== %esp (SP)     |
growth      | to called routines  |                   |
     ^      |---------------------|                   |
     |      |  Local variables,   |                   |
     |      |  temporary values   |                   |
     |      |---------------------|                   | Current
     |      | Other callee-saved  |                   | (callee's)
     |      |      registers      |                   | stack
     |      |---------------------|                   | frame
     |      |   Saved %ebp (FP)   | <== %ebp (FP)     |
     |      |---------------------|                   |
     |      |   Return address    |                   |
     |      |---------------------|----------------------------
     |      |  Argument 1 passed  |                   |
     |      |      to callee      |                   |
     |      |---------------------|                   |
     |      |  Argument 2 passed  |                   |
     |      |      to callee      |                   |
     |      |---------------------|                   |
     |      |   More arguments    |                   |
     |      |        .            |                   |	Previous
     |      |        .            |                   | (caller's)
     |      |        .            |                   | frame
     |      |        .            |                   |
     |      |        .            |                   |
     |      |    Argument N       |                   |
     |      |---------------------|                   |
     |      |    Caller-saved     |                   |
     |      |     registers       |                   |
     |      |---------------------|                   |
     |      |   Rest of stack     |                   |
     |      |         |           |                   |
     |      |         |           |    	       	      |
     |      |         |           |                   |
     |      |         V           |                   |
     |      |---------------------|                   V

Some things to note about this stack layout:

The actual order in which arguments appear on the stack is immaterial, so long as there is a simple rule that allows every argument to be accessed at a fixed, predictable offset from the FP. While the cdecl calling convention uses the above order, others use the opposite order.
We consider the return address (RA) to be part of the callee's stack frame, because the callee is responsible for popping RA off the stack before returning. However, the FP actually points to the saved FP, which is above RA on the stack. This is due to the semantics of the LEAVE instruction, which is what x86 typically uses to restore SP and FP before returning to the caller. See below.
The calling convention doesn't actually dictate where in the stack frame each saved registers is actually saved, nor where each local variable and temporary value is in the frame - these are local to a single call, so nobody has to agree with anybody else on this. The above depiction is a particular version of gcc on a particular system.
Older x86 C compilers would store argument values on the stack by pushing them, one argument at a time, onto the stack, pushing the rightmost argument first (so the leftmost one is on top, as required (only) by the calling convention). gcc usually allocates the whole stack frame at once, including extra space in the stack frame to store argument values if the procedure expects to call another. The arguments are still stored with the leftmost one on top, but this is generally done by writing directly into addresses at fixed displacements from the SP, not by an implicit "push" that both decreases SP and writes at the location pointed to by SP.

Summary of the calling sequence

In brief, this is how gcc on Linux for x86 implements procedure call:

First, the caller makes the actual call:
1. First, the caller pushes caller-saved register values onto the stack. These values are not used by the callee, so they can be pushed in any order.
2. The caller pushes argument values onto the stack, rightmost first (in accordance with the convention that the leftmost argument is on top).
3. Using the CALL instruction, the caller pushes the return address on the stack and then jumps to the callee.
Then, before executing its "actual" code, the callee executes its prologue to set up its stack frame:
1. The callee pushes the old frame pointer from %ebp onto the stack.
2. The callee then updates the FP %ebp to point to where the old FP is located, establishing the bottom of the stack frame.
3. The callee then allocates the rest of its stack frame by subtracting the frame size from the SP.
The callee does its work, saving its return value if any to %eax.
When done, the callee executes its epilogue to restore the previous SP and FP and return to the caller:
1. Using LEAVE, the callee pops off its stack frame up to and including the saved FP, while also restoring the saved FP into %ebp .
2. Then, using RET, the callee pops the return address off the stack and jumps back to it.
Finally, the caller does some cleanup work before resuming its own code:
1. If needed, the caller copies the return value %eax to memory or another register.
2. The caller pops the arguments off the stack.
3. When each caller-saved register's value is needed again, that register's saved value is loaded back from the stack.

We will see an example of most of these steps in the factorial example below.

Calling sequence example: Factorial

To study the calling sequence more closely, we'll use the following recursive factorial routine, which defines the factorial of all non-positive integers to be 1:

### Compiled from factorial_test.c , using
###   gcc -Wall -Werror -Wextra -S factorial_test.c
### This hand-annotated version of the resulting assembly code comes from
### factorial_test.annotated.s .
### 
### int factorial (int n);
### n ==> 8(%ebp) - (%ebp) is old %ebp, 4(%ebp) is return address, so
###                 8(%ebp) is first and only argument
### Return value ==> %eax (the register)
### Temp. value t1 ==> %eax
### Temp. value t2 ==> -12(%ebp) (location on stack)
### N.B. %eax is overloaded as both return value and temp. value
factorial:
	## [prologue]
	pushl	%ebp		# Push old frame pointer %ebp on stack
	movl	%esp, %ebp	# %ebp := %esp (currently pointing to old %ebp)
	subl	$40, %esp	# Allocate 40-byte stack frame (no push/pop for us)

	## if (n <= 0) {
	##   return (1);
	## }
	cmpl	$0, 8(%ebp)	# Compute n-0 (== n) - don't store result,
				# but set condition codes
	jg	fact_positive	# if (n > 0) goto fact_positive
	movl	$1, %eax	# result := 1
	jmp	fact_end	# goto fact_end

fact_positive:	
	## else {
	##   int recursive_result = factorial(n-1);
	##   return (n * recursive_result);
	## }
	movl	8(%ebp), %eax	# t1 (%eax) := n
	subl	$1, %eax	# t1 -= 1
	movl	%eax, (%esp)	# copy t1 to last word in our stack frame
	call	factorial	# push return addr on stack, goto recursive call
	movl	%eax, -12(%ebp) # t2 (-12(%ebp)) := recursive call's result
	movl	8(%ebp), %eax	# result := n
	imull	-12(%ebp), %eax	# result *= t2 (truncate upper word of product)

fact_end:	
	## [epilogue]
	leave			# pop stack frame (%esp := %ebp), pop and restore old %ebp
	ret			# pop return address, return there (to caller)
	.size	factorial, .-factorial

Call

Let's consider the sequence of instructions executed when factorial(n) makes a recursive call on factorial(n-1). The call itself is implemented in the two-instruction sequence in the fact_positive code block:

	movl	%eax, (%esp)	# copy t1 to last word in our stack frame
	call	factorial	# push return addr on stack, goto recursive call

Previously, we computed the argument to the recursive call, n-1, and stored it into register %eax. Now, we push this argument onto the stack right at the top of our stack frame. (We don't need to decrease %esp to push the value, because we have already allocated a stack frame big enough to store it.) Ordinarily, we'd also store the values of the caller-saved registers as well, %eax, %edx, and %ecx, but this is not necessary in this case - we don't use %edx or %ecx, and we never need to use the current value of %eax again (notice that the source code annotation doesn't use n-1 again).

With the recursive call argument loaded onto the stack, we use CALL to make the recursive call. CALL does three things:

It decreases %esp by 1 word length (4 bytes), pushing space for an additional word on the stack.
It writes the return address (the address immediately following the CALL) onto the top of the stack frame, which is the same word we just allocated.
Finally, it jumps to the address specified in the instruction, which in this case is the beginning of the same function (factorial) we just left.

Prologue

Before it can do any work, the factorial procedure needs to set up its stack frame. As discussed above, the first step in doing this is to push the old frame pointer on the stack, so we can safely update the frame pointer to point to the new frame. This is done with the instruction

	pushl	%ebp		# Push old frame pointer %ebp on stack

Like CALL, the PUSHL instruction does multiple things:

It pushes space for a new word on the stack, decreasing %esp by 1 word length.
It writes the register specified in the instruction (here %ebp) into the newly pushed stack word.

After the old FP is saved, the SP now points to the old FP. We want to establish that point as the bottom of our stack frame, so we set FP equal to SP:

	movl	%esp, %ebp	# %ebp := %esp (currently pointing to old %ebp)

Unlike the Z86's four types of move operation, the x86 has only one move instruction that accepts various types of register, memory, and immediate operands.

Finally, we push space for the rest of the stack frame onto the stack, by decreasing the SP by the frame size:

	subl	$40, %esp	# Allocate 40-byte stack frame (no push/pop for us)

Alternatively, we can just use PUSHL to allocate stack space when we need it, and its inverse instruction POPL to free up unused stack space. This is not the approach taken by gcc, however.

Epilogue

Once factorial has finished its work and written its result into the %eax return value register, it is ready to pop off its stack frame and return to its caller. This is done with a pair of instructions. The first is

	leave			# pop stack frame (%esp := %ebp), pop and restore old %ebp

LEAVE is a complex instruction, perhaps more so than even CALL. It does three things:

It pops off all but the bottom two words of the stack frame, by setting %esp equal to %ebp, so that only the saved %ebp and the return address remain.
It reads the value at the memory address 0(%esp), which is the saved %ebp, and stores it into the register %ebp.
It pops the saved %ebp off the stack, by increasing %esp by a word length.

Once LEAVE is done, the only word in the stack frame is the return address; %esp contains the address of the return address in memory.

The last instruction in factorial is

	ret			# pop return address, return there (to caller)

RET is, roughly, the opposite of CALL:

It reads the return address from the top of the stack frame into a hidden register.
It pops the return address off the stack by increasing %esp by a word.
Finally, it jumps to the return address by setting the program counter (PC) equal to the RA as stored in the hidden register.

Post-call cleanup

Now that we have returned from the recursive call factorial(n-1), our call factorial(n) needs to restore the register set and stack frame before resuming execution. Because factorial(n) doesn't save any caller-saved registers before recursively calling itself, it doesn't need to restore those registers' values. It does need to save the recursive result somewhere, as it will overwrite the %eax register where the recursive result is currently stored. For this, it simply flushes %eax into the local variable section of its stack frame:

	movl	%eax, -12(%ebp) # t2 (-12(%ebp)) := recursive call's result

Aside: How do I get the assembly code for my program?

use gcc -S to compile a single .c file and get a .s assembly language file instead of a machine-code .o object file or executable program
use objdump -d to disassemble an executable program or object file
use gdb's disassemble command to get the disassembly for a single routine (especially useful with /m flag to get source code (only if compiled with -g))

References:

Based on: Michael L. Scott, Programming Language Pragmatics (3rd ed.). Morgan Kaufmann, 2009. Figure 8.11 (CD supplement, page 177).