CSE 341 -- Storage Management

It is useful to understand how storage is managed in different programming languages and for different kinds of data. Three important cases are:


Static Storage Allocation

Static storage allocation is appropriate when the storage requirements are known at compile time. For a compiled, linked language, the compiler can include the specific memory address for the variable or constant in the code it generates. (This may be adjusted by an offset at link time.)

Examples:


Stack-Based Storage Allocation

Stack-based storage allocation is appropriate when the storage requirements are not known at compile time, but the requests obey a last-in, last-out discipline.

Examples:

Stack-based allocation is normally used in C, Ada, Algol, and Pascal for local variables in a procedure and for procedure call information. It allows for recursive procedures, and also allocates data only when the procedure or function has been called -- but is reasonably efficient at the same time.

Typically a pointer to the base of the current stack frame is held in a register, say R0. A reference to a local scalar variable can be compiled as a load of the contents of R0 plus a fixed offset. Note that this relies on the data having known size. To compile a reference to a dynamically sized object, e.g. an array, use indirection. The stack contains an array descriptor, of fixed size, at a known offset from the base of the current stack frame. The descriptor then contains the actual address of the array, in addition to bounds information.

References to non-local variables can be handled by several techniques -- the most common is using static links. This is beyond the scope of what we'll cover in 341 this year -- see the CSE 505 lecture notes on the implementation of block-structured languages if you are interested, or take CSE 401. Most variable references are either to local variables or global variables, so often compilers will handle global variable references more efficiently than references to arbitrary non-local variables.

There are two important limitations to pure stack-based storage allocation. First, the only way to return data from a procedure or function is to copy it -- if you try to simply return a reference to it, the storage for the date will have vanished after the procedure or function returns. This isn't an issue for scalar data (integers, floats, booleans), but is an issue for large objects such as arrays. For this reason you can't for example return an array from a function in Algol-60. Second, you can't return a procedure or function as a value, or assign a procedure or function to a global variable (procedures or function values aren't first class citizens). The reason for this restriction is that procedure-valued variables need to be closures -- that is, they need to include a reference to the procedure body along with a pointer to the environment of definition of the procedure. With a stack-based storage allocation scheme, if we tried to say assign a procedure to a global variable, environment of definition may have disappeared when we try to call the procedure.


Heap-Based Storage Allocation

The most flexible allocation scheme is heap-based allocation. Here, storage can be allocated and deallocated dynamically at arbitrary times during program execution. This will be more expensive than either static or stack-based allocation. Heap-based allocation is used ubiquitously in languages such as Lisp and Smalltalk.

Issue: when is storage allocated and deallocated?

Allocation is easy. In C, malloc (a standard library function) allocates fresh storage. In Lisp, a new cons cell is allocated when the cons function is called, array storage can be allocated using make-array, and so forth. In Smalltalk, new storage is allocated when someone sends the new message to a class.

Deallocation is harder. There are two approaches: programmer-controlled and automatic. In languages such as C, the programmer is in charge of deciding when heap storage can be freed (in C using the free function). This is efficient but can lead to problems if the programmer makes a mistake -- either storage is not freed even though it is no longer needed (memory leak), or is freed but referred to later (dangling pointer). Dangling pointers can lead to type insecurities or other errors -- one can refer to storage that has been re-allocated for another purpose with data of a different type.

Lisp and Smalltalk, as well as various other languages, use automatic storage management. There is no explicit deallocate function; rather, storage is automatically reclaimed some time after it is no longer accessible. The most common technique is garbage collection; its simplest form is mark and sweep. In for example Lisp, when the system runs out of storage, first mark all cons cells as unreferenced. Then start with known references (global variables, local variables in the stack, etc) and mark the cells it references as to be kept. Follow the car and cdr pointers, marking those cells as to be kept, then the cells they refer to, and so forth. (Stop when you encounter a cons cell that is already marked to prevent infinite looping.) After the process is completed, sweep through memory, linking all the unreferenced cells into the free list.

There has been much research on garbage collection, and a variety of more efficient algorithms have been developed. (For example, the average lifetime of objects is bimodal -- either objects have very short lifetimes or long lifetimes -- and efficient algorithms can take advantage of this fact.)

An alternative to garbage collection is reference counting. Here, the implementation keeps track of how many references there are to an object, and reclaims it when the count drops to zero. However, reference counting won't reclaim circular structures.

Complications: need to distinguish between storage with pointers and storage with other kinds of data (e.g. floats), objects of different sizes, small integer coding, etc. A few languages have an automatically-invoked finalize method or function that is invoked just before an object is reclaimed. Smalltalk has weak arrays. If the only reference to an object is from a weak array, the object will be garbage-collected and the corresponding array element set to nil. As an example of using a weak array, suppose you wanted to keep performance statistics on some objects in a simulation, and kept these objects in an array. The act of keeping them in an ordinary array would prevent them from being garbage collected, even though they were no longer referenced from anywhere else.

How are activation records (stack frames) allocated in languages such as Smalltalk? Since blocks hold onto their environment of creation and can be bound to global or instance variables, in general activation records can't be allocated on a stack. In addition, activation records are themselves full-fledged objects -- this capability is used for example by the debugger, and can be used to create new kinds of control regimes (e.g. coroutines). However, most activation records can be allocated on a stack (and don't need to be real objects) -- this is much faster than making an object for every stack frame and storing it in the heap. So efficient implementations allocate activations on the stack and only make them into objects when needed.