Navigation
Database Project Home
index.htm
Project Description
descript/descript.htm
Database structure
dbstructure/
dbstructure.htm.
Source files
src/src.htm.
Management of the project
management/
management.htm
MySQL database system
mysql/mysql.htm
Appendices are in
appendix/appendix.htm.
Gabriel's Home
Kalon's Home
Patrick's Home
Phu's Home
CSE444 Spring 99 homepage
|
Daikon Research -
Discovering Program Invariants
A group of researchers at the
University of Washington
led by
Michael Ernst
has built
a system called Daikon for dynamically detecting program invariants.
Program
invariants are properties true at a program point such as:
"x = 0 (mod 4)" or
"x > abs(y)" or
"x = 3*y -7"
These are the kinds of things
that you might put in an assert statement.
The technique is to run
the program and to examine the values it computes, looking for
properties and relationships over those values.
Daikon instruments a program by installing probe statements throughout the code
at certain program points, such as the entry to and exit from procedures.
The probe statements write values to a data trace file during program
execution.
Later the invariant detection system reads the trace files into a database
of some sort,
and queries the database while looking for invariants.
Users can
also query the database asking for a counterexample to an invariant
that was not reported by the invariant detector.
The system
currently (before using the mysql database and C functions
provided by this Database Project)
reads the data traces into memory and queries it there, so
the system is unnecessarily slow and tends to run out of memory
(since everything is kept in main memory).
It would be preferable to
use a real database. A data trace file contains one declaration for
each instrumented program point (indicating the name of the program
point and the names and types of the variables visible there), then
an arbitrary number of data values for each program point. For
instance, it might state that program point "p" was executed 1000
times, and that on 300 of those occasions, the value of variable x
was 22 and the value of y was 5; on 200 occasions, the value of x
was 100 and y was 5; on 200 more occasions, x was 1 and y was 1; and
so forth. This database would contain information about multiple
program points, each with an arbitrary (but predetermined) number of
variables and an arbitrary number of executions. Two essential
queries are asking for a subset of the values at a program point
(for instance, above if I asked about y, it would be 5 on 500
occasions, 1 on 200 occasions, and so forth) and asking whether a
particular property (such as "x <= 3^y") is true (and if not,
supplying counterexamples).
http://www.cs.washington.edu/homes/mernst/invariants-dist/
contains additional information on the research.
Database Project
Primary Goals
Two main features are vitally important. These are:
1. Efficient memory management.
Currently the test cases are being stored in upwards of 200MB files.
It is the client's wish that this be reduced so that larger test cases can be run,
and so that the search space is not as large.
Thus any significant reduction of size of storage (by at least some constant factor)
is desired. Because of the sheer size of the data there may be limitations as to
what platforms and what software are available to use.
2. No loss of information.
There must be no loss of any information that
is currently kept or planned to be used in the immediate future.
Any information that is currently kept by the inefficient dictionary system
must be able to be either directly accessed or calculated from our storage space.
Back to top
|
Secondary Goals
Some of the secondary goals have
changed slightly over
the course of the project, and some have been dropped entirely.
Timestamping became a higher priority, as it makes our tables
have a unique key for the values, making life significantly easier.
Using the database for queries, however, has been de-emphasized,
as the client has concerns that too many things would have to be
rewritten on his end to accommodate this feature.
Here are our revised secondary goals:
1. Timestamping.
Currently there is no way
to make any temporal relationship tests.
The timestamp value is a number that indicates temporal
sequence, not actual time.
2. User interface.
The client has expressed no interest
in a nice, easy user interface.
However, most databases have some
kind of nice user interface, as this makes debugging easier, it
makes the client's life easier, and it incorporates abstraction and
encapsulation of data better. Depending on the platform, a clean
user interface can be very easy or very hard, and thus will be
platform independent.
3. Platform.
The client's main platforms used are Sun systems.
However, the client has expressed interest in other platforms, and does
not mind what platform eventually gets used, so long as it is a platform that
the program already runs on.
A Sparcstation running UNIX (SunOS) was used for development.
Our
choice of using MySQL as the database management system
lends itself to easily portable code and has proven
to be a good choice.
Back to top
|
The Database project consists of three modules in its
current form.
-
The first module translates the data that comes out of
the dynamic invariant program and converts it to a form that our
database can understand.
-
The second module is the database itself; a
separate instance of the database is created for every
tracefile. (Each tracefile may contain several runs of
the client's program.)
-
The final module is the interface to the
database. It handles returning information from the database in
the form that the client expects.
Back to top
|
|
The
ER diagram
is somewhat erroneous, as
the current schema does not easily port to a "good" ER diagram.
There are not many relationships that are necessary to get back --
for the most part, the information retrieval is handled
via queries that do not involve joins, and
thus the relationships between tables are not very important.
Our schema plan is as follows:
For every program point we keep
track of its name, the type of point (begin, end, loop), the
variable names of the function, the variable types of the function,
and comparable variables. For each uniquely identified program point
(by name), we create a separate table. That table's name is
dependent on the program point's name, and has entries for each
instance of a program point's calls. Because these entries vary from
function to function (one function can have any number of
parameters), we have to create a new table for every
function/program point. Each of these rows is uniquely identified by
a timestamp based on its relative distance
from the beginning of the data trace file.
Note that this schema does not accept multi-dimensional arrays,
which are currently not handled by the Daikon data generation
software.
Status of the Project 6/6/99 - Testing and Debugging
The revised version of the source code is in
directory revisedver3. There are comments
at the beginning of each function.
A bug was discovered during a runthrough Friday.
The bug arises from the fact that we currently use
long ints for all kinds of number types
(even characters will be converted to integers to save space).
However, the address type is unsigned long int.
So, the strtol (string to long) gives the same output for all the address
types - a wrong result.
The solution is to discover the actual types of the variables (not what is
stored in the result set, but in the vtypes columns of the ProgramPoints
table), and convert the strings to appropriate types.
In this way, we
guarantee that all types in the .dtrace file will be taken off, including
int, int[], char, char[], address (unsigned long), and even types
added in future expansions, including float, float[], ...
The current versions of idquery and idinterface rely on the type stored
in the result set to convert the string to long int or long int array.
The conversion takes place in the format_tuple() function.
A specific solution
for fixing this is having nameconv() return both names and type of a given
variable name.
This information will then be passed into the
format_tuple() for the conversion process.
Phu spent a little time to experimenting on an improved version,
but as of June 6 has not come up with a full solution code.
Improved code in work is in the ver3 directory, and the files
are named *1.h *1.c.
(The "1" does not mean version 1, it is merely a way of assigning
unique filenames.)
Back to top
|
Webmaster:
|
Patrick Snyder
|
Last modified:
|
6/8/99
|
This file:
|
descript.htm
|
Back to top
|
|