Project Description

Navigation


Database Project Home index.htm

Project Description descript/descript.htm

Database structure
dbstructure/
dbstructure.htm
.

Source files
src/src.htm.

Management of the project management/
management.htm

MySQL database system mysql/mysql.htm

Appendices are in appendix/appendix.htm.


Gabriel's Home

Kalon's Home

Patrick's Home

Phu's Home

CSE444 Spring 99 homepage

Daikon Research -
Discovering Program Invariants

A group of researchers at the University of Washington led by Michael Ernst has built a system called Daikon for dynamically detecting program invariants.

Program invariants are properties true at a program point such as:

     "x = 0 (mod 4)"      or      "x > abs(y)"      or      "x = 3*y -7"

These are the kinds of things that you might put in an assert statement. The technique is to run the program and to examine the values it computes, looking for properties and relationships over those values.

Daikon instruments a program by installing probe statements throughout the code at certain program points, such as the entry to and exit from procedures. The probe statements write values to a data trace file during program execution.

Later the invariant detection system reads the trace files into a database of some sort, and queries the database while looking for invariants. Users can also query the database asking for a counterexample to an invariant that was not reported by the invariant detector.

The system currently (before using the mysql database and C functions provided by this Database Project) reads the data traces into memory and queries it there, so the system is unnecessarily slow and tends to run out of memory (since everything is kept in main memory).

It would be preferable to use a real database. A data trace file contains one declaration for each instrumented program point (indicating the name of the program point and the names and types of the variables visible there), then an arbitrary number of data values for each program point. For instance, it might state that program point "p" was executed 1000 times, and that on 300 of those occasions, the value of variable x was 22 and the value of y was 5; on 200 occasions, the value of x was 100 and y was 5; on 200 more occasions, x was 1 and y was 1; and so forth. This database would contain information about multiple program points, each with an arbitrary (but predetermined) number of variables and an arbitrary number of executions. Two essential queries are asking for a subset of the values at a program point (for instance, above if I asked about y, it would be 5 on 500 occasions, 1 on 200 occasions, and so forth) and asking whether a particular property (such as "x <= 3^y") is true (and if not, supplying counterexamples).

http://www.cs.washington.edu/homes/mernst/invariants-dist/
contains additional information on the research.

Database Project

Goals of the Project

Primary Goals

Two main features are vitally important. These are:

1. Efficient memory management.
Currently the test cases are being stored in upwards of 200MB files. It is the client's wish that this be reduced so that larger test cases can be run, and so that the search space is not as large. Thus any significant reduction of size of storage (by at least some constant factor) is desired. Because of the sheer size of the data there may be limitations as to what platforms and what software are available to use.

2. No loss of information.
There must be no loss of any information that is currently kept or planned to be used in the immediate future. Any information that is currently kept by the inefficient dictionary system must be able to be either directly accessed or calculated from our storage space.

Back to top

Secondary Goals

Some of the secondary goals have changed slightly over the course of the project, and some have been dropped entirely. Timestamping became a higher priority, as it makes our tables have a unique key for the values, making life significantly easier. Using the database for queries, however, has been de-emphasized, as the client has concerns that too many things would have to be rewritten on his end to accommodate this feature. Here are our revised secondary goals:

1. Timestamping.
Currently there is no way to make any temporal relationship tests. The timestamp value is a number that indicates temporal sequence, not actual time.

2. User interface.
The client has expressed no interest in a nice, easy user interface. However, most databases have some kind of nice user interface, as this makes debugging easier, it makes the client's life easier, and it incorporates abstraction and encapsulation of data better. Depending on the platform, a clean user interface can be very easy or very hard, and thus will be platform independent.

3. Platform.
The client's main platforms used are Sun systems. However, the client has expressed interest in other platforms, and does not mind what platform eventually gets used, so long as it is a platform that the program already runs on. A Sparcstation running UNIX (SunOS) was used for development.

Our choice of using MySQL as the database management system lends itself to easily portable code and has proven to be a good choice.

Back to top

Finished Project overview

The Database project consists of three modules in its current form.

  • The first module translates the data that comes out of the dynamic invariant program and converts it to a form that our database can understand.
  • The second module is the database itself; a separate instance of the database is created for every tracefile. (Each tracefile may contain several runs of the client's program.)
  • The final module is the interface to the database. It handles returning information from the database in the form that the client expects.

 Back to top

ER Diagram/Schema

The ER diagram is somewhat erroneous, as the current schema does not easily port to a "good" ER diagram. There are not many relationships that are necessary to get back -- for the most part, the information retrieval is handled via queries that do not involve joins, and thus the relationships between tables are not very important.

Our schema plan is as follows:
For every program point we keep track of its name, the type of point (begin, end, loop), the variable names of the function, the variable types of the function, and comparable variables. For each uniquely identified program point (by name), we create a separate table. That table's name is dependent on the program point's name, and has entries for each instance of a program point's calls. Because these entries vary from function to function (one function can have any number of parameters), we have to create a new table for every function/program point. Each of these rows is uniquely identified by a timestamp based on its relative distance from the beginning of the data trace file.

Note that this schema does not accept multi-dimensional arrays, which are currently not handled by the Daikon data generation software.


Status of the Project 6/6/99 - Testing and Debugging

The revised version of the source code is in directory revisedver3. There are comments at the beginning of each function.

A bug was discovered during a runthrough Friday. The bug arises from the fact that we currently use long ints for all kinds of number types (even characters will be converted to integers to save space). However, the address type is unsigned long int. So, the strtol (string to long) gives the same output for all the address types - a wrong result.

The solution is to discover the actual types of the variables (not what is stored in the result set, but in the vtypes columns of the ProgramPoints table), and convert the strings to appropriate types. In this way, we guarantee that all types in the .dtrace file will be taken off, including int, int[], char, char[], address (unsigned long), and even types added in future expansions, including float, float[], ...

The current versions of idquery and idinterface rely on the type stored in the result set to convert the string to long int or long int array. The conversion takes place in the format_tuple() function. A specific solution for fixing this is having nameconv() return both names and type of a given variable name. This information will then be passed into the format_tuple() for the conversion process.

Phu spent a little time to experimenting on an improved version, but as of June 6 has not come up with a full solution code. Improved code in work is in the ver3 directory, and the files are named *1.h *1.c. (The "1" does not mean version 1, it is merely a way of assigning unique filenames.)

Back to top


Webmaster: Patrick Snyder
Last modified: 6/8/99
This file: descript.htm

Back to top