Sample Research Projects
Here are some sample research topics related to the Cecil/Diesel
and Vortex/Whirlwind projects, for undergraduate
senior honors theses, graduate quals
projects, and graduate PhD research
directions. RA funding for graduate students is available for most
of these projects.
Note that this listing is only partial, and it does not
include project ideas outside the scope of the Cecil/Diesel and
Vortex/Whirlwind projects. Please talk to us for more information and
additional ideas.
PhD-Level Research Topics
- Efficient, Expressive Rhodium-based Optimizations: The
Rhodium project has successfully demonstrated that a wide range of
optimizations can be expressed and proved correct automatically.
However, the current Rhodium system includes only a proof-of-concept
interpreter for executing Rhodium optimizations inside the compiler,
which is much too slow for practical use. We have several cool ideas
for automatically turning separate declarative Rhodium rules into
efficient monolithic optimizations, based on ideas from partial
evaluation and data-structure specialization. We also have a bunch of
ideas for expanding the scope of what kinds of program analyses and
transformations can be handled well via a special declarative
implementation language. Here's a link to a recent proposal about a
bunch of Rhodium-related research: proposal,
bib
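The flavor of declarative optimization rules can be sketched in a few lines of Python (the `Rule`-style guard functions, the `rewrite` engine, and the constant-folding example are illustrative inventions, not Rhodium syntax):

```python
# A toy declarative rewriter: each rule is a guard plus a replacement,
# and a generic engine applies all rules bottom-up over an expression
# tree.  Expressions are tuples: ("+", left, right) or ("const", n).

def const_fold(expr):
    # Rule: (+ (const a) (const b))  ==>  (const a+b)
    if expr[0] == "+" and expr[1][0] == "const" and expr[2][0] == "const":
        return ("const", expr[1][1] + expr[2][1])
    return None

RULES = [const_fold]

def rewrite(expr):
    """Apply rules bottom-up until no rule fires."""
    if expr[0] == "+":
        expr = ("+", rewrite(expr[1]), rewrite(expr[2]))
    for rule in RULES:
        result = rule(expr)
        if result is not None:
            return rewrite(result)
    return expr

folded = rewrite(("+", ("const", 1), ("+", ("const", 2), ("const", 3))))
```

An interpreter like this repeatedly re-tests every rule's pattern, which is exactly the slowness problem mentioned above; the research direction is to partially evaluate the rule set against the engine, fusing the separate rules into one efficient monolithic pass.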
- Staged Compilation: A major research effort which we'd like
to undertake using Whirlwind is to support systematically staging
optimizations across multiple compilation stages (separate compile
time, library link time, application link time, load time, and
run-time), in order to blend the advantages of static and dynamic
compilation without suffering their limitations. We have previously worked on two kinds of
staged compilation, one as part of a C compiler and one for
optimizations written in a stylized subset of ML. New work could
profitably exploit the Rhodium language for optimizations, and it
could explore a much more practical trade-off between compiler size
and compilation speed, particularly useful for static compilation
stages. Here's a link to an older description of possible staged
compilation research: proposal,
bib
- Interoperability Across Language Models: A major obstacle
to the widespread use of high-level languages, and also to the use of
multiple domain-specific languages in implementing components of a
single program, is the difficulty of interoperation between
languages with different data and execution models. High-level
languages often have a C foreign-function interface, but this is a
"least common denominator" that
only supports the simplest sorts of procedure calls and parameter
passing, and it's difficult for low-level languages to call high-level
languages or for two high-level languages to interoperate. There are a
number of interesting and important projects here: supporting
efficient, cheap sharing of data structures across
languages, in spite of differences in data layout and garbage
collection strategies, and in supporting more expressive forms of
control transfer across language boundaries, such as exceptions,
returning multiple results, backtracking, threads, and the like.
Here's a link to a recent proposal about interoperability research: proposal,
bib
- Advanced Software Units: We have long been exploring ideas
for expressive, flexible software units, including multiple and
predicate dispatching, powerful multiple inheritance, modules and
modular typechecking, and most recently parameterized modules. An
interesting alternative to explicitly parameterized modules (and in
fact explicitly parameterized types and functions) is to introduce
virtual, nested modules. Virtual nested modules can easily
capture patterns of intra- and inter-module name references, and allow
overriding modules to rebind those names, obviating the need for the
factory pattern. Previous researchers have explored virtual nested
classes in the context of Java, Beta, and Scala, but no one has
explored how they might work for modules containing open classes and
multiply dispatched generic functions.
- New Concurrency Models: Since new hardware performance
improvements are likely to come from increased parallelism rather than
increased single-chip performance, it is increasingly important for
software to be (or be compiled to be) concurrent. But writing
concurrent programs is notoriously hard. We would like to work
towards high-level language support for writing concurrent programs
that are reliable and correct, with most of the burden for high
performance being pushed onto the language implementation. Atomic
blocks seem to be a good step forward towards this goal. Are there
other steps that can be taken? For example, it might be more reliable
(i.e., less error-prone) to assume all code was in one giant atomic
block by default, and have programmers insert boundaries between
atomic blocks only after careful consideration (rather than the
reverse, which is the current approach). Similarly, data structures
that are potentially shared might benefit from being marked
differently, to help the programmer reason about the possible effects
of concurrent activities. Are there other interesting interactions
between high-level language features and concurrency? For example,
predicate dispatching and other advanced pattern-matching-style
function calling mechanisms might need to be handled carefully in the
face of concurrent updates to the parts of arguments being
pattern-matched upon. What implementation strategies can make these
high-level semantics efficient?
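The "atomic by default" idea above can be sketched in Python (the names `atomic_thread` and `boundary` are invented for this illustration; a single global lock stands in for a real implementation):

```python
# A sketch of "atomic by default": every thread runs holding a global
# lock, and the programmer inserts explicit boundary() calls -- the
# only points where other threads may interleave.  This inverts the
# usual model, where code is interleavable unless explicitly locked.
import threading

_global = threading.Lock()

def atomic_thread(body):
    """Run body() as a thread that is atomic except at boundaries."""
    def run():
        with _global:
            body()
    t = threading.Thread(target=run)
    t.start()
    return t

def boundary():
    """An explicit point where other threads may run."""
    _global.release()
    _global.acquire()

counter = []

def worker():
    counter.append(len(counter))   # atomic w.r.t. other workers
    boundary()                     # interleaving allowed only here
    counter.append(len(counter))

threads = [atomic_thread(worker) for _ in range(4)]
for t in threads:
    t.join()
```

Because every append happens while the lock is held, the list ends up as 0..7 regardless of how the threads interleave at the explicit boundaries; a real design would of course need far finer-grained atomicity than one global lock.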
- Representation Optimizations: In high-level languages,
instance variables often implicitly contain pointers to other data
structures. This is less time & space efficient than what is
possible in lower-level languages like C++, where component objects
can be represented in-line w/o the extra level of indirection. On the
other hand, the semantics in the high-level language is much simpler;
in C++, replacing Foo* f; with Foo f; isn't a simple storage
optimization but rather alters the meaning of assignment, sharing,
etc. This difference is a major distinction between high-level and
low-level languages, both from the standpoint of cleanliness of
semantics and efficiency of implementation. Perhaps we can get the
best of both worlds: can the compiler optimize the program to avoid
extra levels of indirection, either automatically or directed by
programmer hints, to improve performance without affecting the
semantics of the program? When is this possible? What runtime data
structures are necessary to support this w/o changing the semantics of
references to the instance variable? How much performance improvement
can be obtained this way? Some previous work has been done in the
context of ML, by exploiting the static type information. Can we
accomplish the same thing in an OO language? A dynamically-typed
language, using compiler analyses?
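The semantic difference at stake can be made concrete with a small Python stand-in for the high-level-language reference semantics (the class names are illustrative only):

```python
# Why inlining a component object is not a mere storage optimization:
# with a reference field, updates through an alias are visible; a
# naively inlined copy would silently break that sharing.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class Rect:
    def __init__(self, origin):
        self.origin = origin        # reference semantics: shares the Point

p = Point(0, 0)
r = Rect(p)
p.x = 5                             # update through the alias...
shared = r.origin.x                 # ...is visible inside the Rect

class RectInline:
    def __init__(self, origin):
        # "Inlined" representation: copy the fields in, no indirection.
        self.origin_x, self.origin_y = origin.x, origin.y

q = Point(0, 0)
ri = RectInline(q)
q.x = 5                             # the inlined copy does not see this
inlined = ri.origin_x
```

A semantics-preserving representation optimization may therefore inline the component only when the compiler (or a programmer hint) can establish that the component is never mutated through another reference.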
- Advanced Optimization Frameworks: Vortex and Whirlwind
include frameworks allowing dataflow optimizations over the CFG to be
written as simple flow functions, which are then run by an engine that
computes intraprocedural dataflow solutions and can automatically
compose multiple analyses and transformations. Vortex moreover can
automatically derive context-sensitive interprocedural analysis from
these analyses, and Whirlwind allows intraprocedural analyses over the
DFG as well. Several areas for future work remain: what other kinds
of program representations can be usefully exploited, e.g. ASTs,
control dependence graphs, etc.? How can more efficient and scalable
interprocedural analyses be derived systematically, e.g. those
exploiting BDDs or other tricks?
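The style of framework described above can be sketched as a tiny worklist engine, where an analysis is just a flow function plus a merge (an illustrative sketch, not the Vortex/Whirlwind API):

```python
# A minimal intraprocedural dataflow engine: the engine iterates flow
# functions over a CFG to a fixpoint; composing analyses amounts to
# running their flow functions over a tuple of facts.

def solve(cfg, init, flow, merge):
    """cfg maps node -> successor list; returns the fact at each node."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    facts = {n: init for n in cfg}
    work = list(cfg)                       # seed with every node
    while work:
        n = work.pop()
        incoming = [flow(p, facts[p]) for p in preds[n]]
        new = merge(incoming) if incoming else facts[n]
        if new != facts[n]:
            facts[n] = new
            work.extend(cfg[n])            # revisit successors on change

    return facts

# Example: constant propagation of one variable over a diamond CFG.
cfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
assigns = {"a": 1}                         # node -> constant assigned to x

def flow(node, fact):
    return assigns.get(node, fact)         # an assignment kills the old fact

def merge(incoming):
    vals = {v for v in incoming if v is not None}   # None = not yet known
    if not vals:
        return None
    return vals.pop() if len(vals) == 1 else "top"  # "top" = not constant

result = solve(cfg, None, flow, merge)
```

Here both branches agree, so the merge point still sees the constant; the research questions above are about deriving context-sensitive interprocedural and DFG-based versions of such engines systematically.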
- Optimization with Separate Compilation: Vortex exploits
whole-program information for best optimization. There exists a
mechanism for compiling libraries separately, but it disables all
optimization of the separately-compiled library, since the compiler
doesn't know how the library will be extended. There's an opportunity
to work on a better
solution, one that will combine good optimization of libraries with
separate compilation. One idea is to introduce an internal interface
to libraries which expresses those assumptions the optimizer made
about the outside world and also those assumptions the outside world
can make about the library. Given this information, libraries and
clients can be compiled & optimized separately. The trick is to
determine how to construct these optimizer-level interfaces
(automatically?) and how to recover if an assumption (e.g. that a
class has no subclasses in the client or that an overriding method
definition isn't introduced) is violated (disallow? patch client with
modified code? recompile/specialize the library with different/more
general assumptions?). A good solution to this problem would make
whole-program optimization a lot more practical and accepted. This
work complements the work on staged compilation described above.
- Profiling High-Level Languages: A hard but really important
problem in high-level languages is the ability (or lack thereof) to
get a good picture of the performance of a program. Profiling tools
tend to break down in the presence of higher-order functions (as in
Lisp, ML, Smalltalk, Cecil, and Diesel), and some optimizations like
inlining can really change the structure of the program. How can
profiling tools like gprof be adapted to report useful information for
high-level languages? More importantly, how can tools help to find
performance bottlenecks in complex data-structure-based programs?
E.g., that the hash function for a hash table isn't getting good
spread, or that some list is getting too long, or that some crucial
optimization in the inner loop of the program isn't being performed,
or ....
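The hash-spread example above suggests the kind of diagnostic such a tool could surface; here is a standalone sketch (the `bucket_spread` helper is hypothetical, not an existing tool):

```python
# Measure how evenly a hash function spreads keys across buckets --
# a crude version of the report a data-structure profiler could give.
from collections import Counter

def bucket_spread(keys, hash_fn, n_buckets):
    """Return (largest bucket size, ideal bucket size)."""
    counts = Counter(hash_fn(k) % n_buckets for k in keys)
    ideal = len(keys) / n_buckets
    return max(counts.values()), ideal

# A bad hash that ignores most of the key collapses into few buckets:
bad_max, ideal = bucket_spread(range(1000), lambda k: k % 4, 64)
good_max, _ = bucket_spread(range(1000), hash, 64)
```

With 1000 keys and 64 buckets the ideal bucket holds about 16 entries; the bad hash piles 250 keys into each of only four buckets, which is exactly the pathology a profiler for high-level programs ought to point out.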
- Regional Typechecking: Previous work on MultiJava and
Relaxed MultiJava (RMJ) has explored two opposite extremes for modular
typechecking in the face of open classes and multimethods: reject
modules that do not meet modular typing restrictions, and allow such
modules but postpone all checking to program link time. A
generalization would allow arbitrary units of software, including
in-between granularities such as package-level linking, to have
modular checks enforced and possibly discharged. E.g. programmers
could wait until a package is complete before verifying that all
modular checks are satisfied, at which point no additional checking
would be required of any clients of the package.
- "Tower of Worlds" Programming Environments: Smalltalk has an
excellent programming environment, which can be easily modified by
Smalltalk programmers while it is running. This is great for rapid
evolution and customization of the environment, but since edits
directly affect the code being run, including code for the environment
itself, any bugs introduced into core library code can bring the
environment crashing to a halt. An alternative is to introduce a
conceptually unbounded "tower" of previous stable versions of the
environment. Whenever the programmer wishes to "go meta" and edit the
environment itself, a new, immutable environment is conceptually
created to use in editing the existing environment. This new
environment might be a copy of the current environment, or it might be
the previous "stable" version of the environment. That environment
itself can be edited, too, generating an unbounded tower of
meta-environments. Copy-on-write ideas could be used to
avoid actually making any copies of environments except for code
that's changed during editing. Such an idea would solve Smalltalk's
problem while retaining its advantages. It also would be a better way
to implement the Cecil and Diesel interpreter library methods, which
suffer from a similar problem as Smalltalk and also must be compiled
without optimization to allow for user extension and editing.
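The copy-on-write layering can be sketched with Python's `ChainMap` (a simplified analogy in which an "environment" is just a name-to-code mapping):

```python
# A copy-on-write "tower" of environments: going meta pushes a fresh
# writable layer; edits shadow, but never destroy, the stable version
# below, which stays available if the edits turn out to be buggy.
from collections import ChainMap

stable = {"print": "print-v1", "edit": "edit-v1"}  # the running environment
meta = ChainMap({}, stable)       # "go meta": new layer, stable untouched

meta["edit"] = "edit-v2-buggy"    # the edit lands only in the new layer
buggy = meta["edit"]              # the meta level sees the edit
preserved = stable["edit"]        # the stable level is intact for rollback

meta2 = ChainMap({}, meta)        # go meta again: the tower is unbounded
```

Only the changed bindings occupy new storage, which is the copy-on-write property that makes an unbounded tower affordable.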
- Efficient Multi-Method Dispatching: Dynamic dispatching is
a costly operation in OO languages. The Vortex and Whirlwind compilers
do a number of analyses to try to eliminate dynamic dispatches, but
where this is not possible, runtime method lookup techniques must take
over. There are a number of ways of implementing dynamic dispatching,
ranging from runtime-filling hash tables, to runtime generated machine
code implementing a linear search over a small number of cases, to
statically-determined indexed table layouts. Since Cecil and Diesel
support multi-methods, i.e. methods that require dispatching on
multiple argument types, the possibilities for implementing
multiply-dispatched messages become even more interesting. We have
some algorithms for integrating these various implementation
strategies, and it would be cool to implement them in the context of
Whirlwind. An open question is how to support modular compilation of
such dispatchers.
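One of the strategies above, a linear search over cases, can be sketched in Python (the `MultiMethod` class is an illustration of the general idea, not the Vortex/Whirlwind algorithms):

```python
# A sketch of a multi-method dispatcher: methods are keyed by a tuple
# of argument classes, and lookup walks each argument's inheritance
# chain, trying combinations from most to least specific.
from itertools import product

class MultiMethod:
    def __init__(self):
        self.cases = {}                    # (class, class, ...) -> function

    def register(self, *types):
        def add(fn):
            self.cases[types] = fn
            return fn
        return add

    def __call__(self, *args):
        # Search combinations of each argument's classes (MRO order).
        for key in product(*(type(a).__mro__ for a in args)):
            if key in self.cases:
                return self.cases[key](*args)
        raise TypeError("no applicable method")

class Shape: pass
class Circle(Shape): pass
class Square(Shape): pass

intersect = MultiMethod()

@intersect.register(Circle, Circle)
def _(a, b): return "circle/circle"

@intersect.register(Shape, Shape)
def _(a, b): return "generic"

r1 = intersect(Circle(), Circle())   # most specific case applies
r2 = intersect(Circle(), Square())   # falls back to the generic case
```

Note the simplification: this search privileges specificity of the leftmost argument; real multi-method semantics, and the table-based and code-generation strategies above, handle specificity and ambiguity much more carefully.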
- Dynamic Profiling in Whirlwind: Vortex included support for
gathering dynamic execution profiles of common receiver classes, and
then exploiting these profiles to guide optimizations. It would be
cool to develop an updated version of these ideas, integrating other
kinds of value- and execution-count profiling, and implement it in the
context of Whirlwind. Another interesting aspect would be to develop
support for call-stack-based profiling for Whirlwind's C-- back-end,
extending the kind of profiling that gprof provides for C and hiding
runtime-system effects that cause gprof to be ineffective on
Whirlwind's C output.
- Interoperability between Diesel, Java, C++, OCaml, ...:
Interoperability among high-level languages is a major research topic
(described above). As a first cut, it would be interesting to develop
interoperability support between Diesel and Java using JNI. We
might even be able to add C++, OCaml, and other languages to the mix.
This would be a "hand" implementation of ideas that we would then hope
to learn from, generalize, and automate as much as possible.
- Mixing Statically and Dynamically Typed Code:
Cecil and Diesel allow static type declarations to be omitted, or
equivalently replaced with a type dynamic. Computations on
values of type dynamic are type-checked dynamically, but what
happens if a dynamic value is passed into statically typed code?
Currently, there is no check; the value is simply assumed to be of the
right static type. In addition, run-time type-checking is applied
even to statically typed code. This ensures dynamic type safety, but
imposes performance costs on statically typed code. It would be cool
to allow the compiler (and the programmer) to trust static type
declarations, and instead impose run-time checks whenever a
dynamically typed value is passed to statically typed code. This
isn't sufficient, as a single value may have statically and
dynamically typed aliases, and so we also need to ensure that any
operations done on the dynamically typed alias (e.g. storing into an
array element) don't violate any assumptions of the statically typed
aliases. Figuring out how to do all this, so that overhead is imposed
only in dynamically typed code, would be very cool, and pragmatically
very useful as a way to blend the advantages of statically and
dynamically typed programming languages.
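The boundary-check idea can be sketched in Python (the `cast` helper and the function names are invented for illustration; as the text notes, a real design must also police aliases, which this sketch ignores):

```python
# Checking only at the static/dynamic boundary: statically typed code
# trusts its declarations and runs with no per-operation checks; a
# cast is inserted wherever a dynamically typed value flows in.

def cast(value, expected):
    """Runtime check inserted at the dynamic-to-static boundary."""
    if not isinstance(value, expected):
        raise TypeError(f"expected {expected.__name__}, "
                        f"got {type(value).__name__}")
    return value

def typed_square(n):                   # "statically typed": n assumed int
    return n * n                       # no checks inside this code

def call_from_dynamic_code(v):         # v has type dynamic
    return typed_square(cast(v, int))  # one check, at the boundary

ok = call_from_dynamic_code(7)         # passes the boundary check
try:
    call_from_dynamic_code("7")        # fails at the boundary, not inside
    failed = False
except TypeError:
    failed = True
```

The payoff is that all checking overhead lands in dynamically typed code and at its borders, leaving fully statically typed code check-free.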
- Dynamic Compilation in Whirlwind: Vortex includes special
hand-written routines that dynamically generate some simple dispatcher
functions for performance. But there are many examples where
dynamically generated code, or dynamically modifiable code, could be
useful, including setting debugging breakpoints and watchpoints, and
modifying existing code. It would be cool to develop systematic
support for dynamically generated code, e.g. in the context of
Whirlwind's C-- back-end. (This might require running the C--
compiler inside the Whirlwind run-time system, which could require a
solution to the Diesel-OCaml language interoperability problem
described above.)
- Accurate Garbage Collection in Whirlwind: Whirlwind's
run-time system supports garbage collection using the Boehm-Weiser
conservative garbage collector. Because it is conservative, this
collector is less efficient than an accurate garbage collector could
be. We have an accurate garbage collector library, but it requires
the compiler to generate information describing the layout of data
structures, the stack, and registers. It would be cool to modify
Whirlwind's C-- back-end to provide this information. Open questions
include whether Whirlwind's intermediate language and compiler passes
would need to be modified to track things like run-time array sizes
and other data layout issues.
- Efficient Source-Level Debugging: Source-level debugging is
invaluable, but it is hard to support efficiently in the presence of
complex optimizations such as inlining. Currently, Vortex generates
support in Cecil and Diesel programs to do this, but it has a fairly
substantial runtime cost, even when debugging isn't needed by the
user. An alternative, developed in some previous research papers, is
to have the compiler produce tables that describe where values are and
what optimizations have been performed, for use by the debugger. This
work could be in the context of either Vortex or Whirlwind.
Whirlwind's C-- back-end would be particularly convenient for this
extension.
- Inheritance of Constructor-Like Code in Diesel: In Diesel,
objects are created through a single atomic creation expression that
takes initial values of the object's fields as arguments. E.g.,
"new point{x := 5, y := 6}". This allows immutable fields
and objects to be defined, and ensures that an object isn't accessed
before it's been initialized. However, it now becomes hard to inherit
initialization code, as you can with C++ and Java constructors. We
have some ideas for how to extend object construction and
initialization that might allow inheritance of initialization code
without losing support for immutable objects.
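By way of analogy only, Python's frozen dataclasses can illustrate one point in this design space: atomic creation of immutable objects, with initialization logic inherited through an ordinary method rather than a constructor body (this is not the Diesel design, just one possible shape for it):

```python
# Atomic creation of immutable objects, plus inheritable
# initialization code factored into a classmethod.
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: int
    y: int

    @classmethod
    def on_x_axis(cls, x, **extra):
        # Inheritable initialization logic: subclasses reuse it as-is,
        # while creation itself stays a single atomic step.
        return cls(x=x, y=0, **extra)

@dataclass(frozen=True)
class ColorPoint(Point):
    color: str

p = Point.on_x_axis(5)                      # Point(x=5, y=0)
cp = ColorPoint.on_x_axis(5, color="red")   # inherits the init logic
```

Fields stay immutable after the single creation step, yet the subclass did not have to duplicate the superclass's initialization code.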
- DieselDoc: Javadoc is a great tool for extracting very
usable Web-based documentation from Java source files. But such a
tool assumes that all relevant information about a class is textually
nested within the class. In Cecil, Diesel, and other languages with
open classes and/or multimethods, how should a browser show the
relevant operations of a given class? In addition, it becomes
interesting to view the software under other groupings, e.g. by
module, or by function. How should such a browsing tool look? How
should it be implemented? Daria Craciunoiu wrote some initial code to
translate top-level Diesel declarations into XML, and then generate
dynamic HTML pages by reading the XML database, but much interesting
work remains to complete the implementation and design and evaluate
the best user interface. You can try out the current prototype here.
- Front-End Tools: It would be a cool project to develop a
modern implementation of a scanner and parser generator, for use by
tools written in Diesel. Diesel's language features can be exploited
to have the generator automatically produce the token and AST classes,
while still allowing easy customization by clients.
chambers@cs.washington.edu