CSE503: Software Engineering
Lecture 21  (February 26, 1999)

David Notkin

v     Lackwit (and LCLint intro, if time)

Ø      Examples of approaches that check partial specifications and that exploit type information

v     Lackwit (O’Callahan & Jackson)

Ø      Code-oriented tool that exploits type inference

Ø      Answers queries about C programs

§        e.g., “locate all potential assignments to this field”

Ø      Accounts for aliasing, calls through function pointers, type casts

Ø      Efficient

§        e.g., answers queries about a Linux kernel (157KLOC) in under 10 minutes on a PC

v     Placement

Ø      Lexical tools are very general, but are often imprecise because they have no knowledge of the underlying programming language

Ø      Syntactic tools have some knowledge of the language, are harder to implement, but can give more precise answers

Ø      Semantic tools have deeper knowledge of the language, but generally don’t scale, don’t work on real languages and are hard to implement

Ø      The goal of Lackwit, thus, is to use some semantic basis, in a scalable way, on a real language (C)

v     It is a static tool

Ø      Can work on incomplete programs

Ø      Make assumptions about missing code, or supply stubs

v     Sample queries

Ø      Which integer variables contain file handles?

Ø      Can pointer foo in function bar be passed to free()?  If so, what paths in the call graph are involved?

Ø      Field f of variable v has an incorrect value; where in the source might it have changed?

Ø      Which functions modify the cur_veh field of map_manager_global?

v     Lackwit analysis

Ø      Approximate (may return false positives)

Ø      Conservative (will not return false negatives) under some conditions

§        C’s type system has holes

§        Lackwit makes assumptions similar to those made by programmers (e.g., “no out-of-bounds memory accesses”)

§        Lackwit is unsound only for programs that don’t satisfy these assumptions

v     Query commonalities

Ø      There are a huge number of names for storage locations

§        local and global variables; procedure parameters; for records, etc., the sub-components

Ø      Values flow from location to location, which can be associated with many different names

Ø      Archetypal query: Which other names identify locations to which a value could flow to or from a location with this given name?

Ø      Answers can be given textually or graphically

v     An example query

Ø      Query about the cur_veh field of map_manager_global

Ø      Shaded ovals are functions extracting fields from the global

Ø      Unshaded ovals pass pointers to the structure but don’t manipulate it

Ø      Edges between ovals are calls

Ø      Rectangles are globals

Ø      Edges to rectangles are variable accesses

Ø      Claim

§        This graph shows which functions would have to be checked when changing the invariants of the current vehicle object

§        Requires semantics, since many of the relationships are induced by aliasing over pointers

v     Underlying technique

Ø      Use type inference, allowing type information to be exploited to reduce information about values flowing to locations (and thus names)

Ø      But what to do in programming languages without rich type systems?

Ø      Trivial example

§        DollarAmt getSalary(EmployeeNum e)

Ø      Relatively standard declaration

Ø      Allows us to determine that there is no way for the value of e to flow to the result of the function

§        That is, e can (and surely would) affect what DollarAmt gets returned, but e itself (nor any computation that is based on e) will be able to flow to the return value

Ø      Because they have different types

Ø      Consider an alternative

§        int getSalary(int e)

Ø      Another, perhaps more common, way to declare the same function

Ø      This doesn’t allow the direct inference that e’s value doesn’t flow to the function return

§        Because they have the same type

v     But maybe one could analyze the program to determine this anyway; Lackwit does so by using a type inference mechanism for precision

v     Lackwit’s type system ignores the C type declarations

Ø      Computes new types in a richer type system

Ø      Incomplete type information

Ø      Cover example from the paper

Ø      void* return1st(void* x, void* y) {
    return x; }

Ø      (a refb, b) ®f a refb

Ø      The type variable a indicates that the type of the contents of the pointer x is unconstrained

Ø      But it must be the same as the type of the contents of the return value

Ø      Increases the set of queries that Lackwit can answer with precision

v     Polymorphism

char* ptr1;
struct timeval* ptr2;
char** ptr3;

return1st(ptr1,ptr2);
return1st(ptr2,ptr3);


Ø      Both calls match the previous function declaration

Ø           This is solved (basically) by giving return1st a richer type and instantiating it at every call site

§             (c refb, d) ®d c refb

§        (e refa, f) ®c e refa

v     Type stuff

Ø      Modified form of Hindley-Milner algorithm “W”

Ø      Efforts made to handle

Ø      Mutable types

Ø      Recursive types

Ø      Null pointers

Ø      Uninitialized data

Ø      Type casts

Ø      Declaration order

v     Morphin case study

Ø      Robot control program of about 17KLOC

Ø      Vehicle object contains two queue objects

Ø      Client was investigating combining these two queues into one

Ø      Queried each queue object to discover operations performed and their contexts

Ø      The two graphs each contained 171 nodes

Ø      But each graph had only five nodes highlighted as “accessor” nodes

Ø      Example

§        These five matches helped identify code to be changed

§        grep would have returned false matches and missed matches when parameters were passed to functions

§        Context-sensitivity needed to distinguish the two queue objects because both are passed as arguments to the same queue functions

v     Recap

Ø      Helps find relationships among variables in a C program

Ø      Exploits type inference to understand values flowing to locations and thus names

Ø      Approximate, although safe under many (most?) conditions

Ø      Reasonably efficient

Ø      Although I didn’t show the numbers, they are now better than reported in the ICSE paper