I'm graduating in summer 2012 and am looking for academic and industrial research positions. Please take a look at my research statement, teaching statement, and cv if you're interested in hiring me.
| E-mail: | |
| Phone: | (206) 543-5129 |
| Office: | CSE 366 |
| Address: |
Joe Devietti Computer Science & Engineering, University of Washington Box 352350 Seattle, WA 98195-2350 |
| Directions: |
Campus Map of the Paul Allen Center By Car By Public Transit |
My CV is available here; below is my current list of publications.
Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.
Conference PapersTo resolve this trade-off, we propose RADISH, a hybrid hardware-software race detector that is always-on and fully precise. In RADISH, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that RADISH is correct by proving equivalence to a conventional happens-before race detector.
Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. RADISH can furthermore leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core RADISH processor using PARSEC benchmarks shows runtime overheads from negligible to 2x. Furthermore, RADISH outperforms the leading software-only race detector by 2x-37x.
This paper introduces RCDC, a deterministic multiprocessor architecture that takes advantage of relaxed memory orderings to provide high-performance deterministic execution with low hardware complexity. RCDC has two key innovations: a hybrid HW/SW approach to enforcing determinism; and a new deterministic execution strategy that leverages data-race-free-based memory models (e.g., the models for Java and C++) to improve performance and scalability without sacrificing determinism, even in the presence of races. In our hybrid HW/SW approach, the only hardware mechanisms required are software-controlled store buffering and support for precise instruction counting; we do not require speculation. A runtime system uses these mechanisms to enforce determinism for arbitrary programs.
We evaluate RCDC using PARSEC benchmarks and show that relaxing memory ordering leads to performance and scalability close to nondeterministic execution without requiring any form of speculation. We also compare our new execution strategy to one based on TSO (total-store-ordering) and show that some applications benefit significantly from the extra relaxation. We also evaluate a software-only implementation of our new deterministic execution strategy.
A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually.
Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce memory ordering at a coarse grain. This provides what we call implicit atomicity, as the atomic blocks are not derived from explicit program annotations. In this paper, we make the fundamental observation that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. We then propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves. Atom-Aid is also able to report where atomicity violations might exist in the code, providing resilience and debuggability. We evaluate Atom-Aid using buggy code from applications including Apache, MySQL, and XMMS, showing that Atom-Aid virtually eliminates the manifestation of atomicity violations.
Inspired by the promise of these software-only approaches, this paper proposes a hardware bounded pointer architectural primitive that supports cooperative hardware/software enforcement of spatial memory safety for C programs. This bounded pointer is a new hardware primitive datatype for pointers that leaves the standard C pointer representation intact, but augments it with bounds information maintained separately and invisibly by the hardware. The bounds are initialized by the software, and they are then propagated and enforced transparently by the hardware, which automatically checks a pointer’s bounds before it is dereferenced. One mode of use requires instrumenting only malloc, which enables enforcement of per-allocation spatial safety for heap-allocated objects for existing binaries. When combined with simple intra-procedural compiler instrumentation, hardware bounded pointers enable a low-overhead approach for enforcing complete spatial memory safety in unmodified C programs.
This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose ONETM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system.
The permissions-only cache efficiently maintains the coherence permissions—but not data—for blocks read or written transactionally that have been evicted from the processor’s caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. ONETM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In ONETM-Serialized, an overflowed transaction simply stalls all other threads in the application. In ONETM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining ONETM’s core simplifying assumption.
MELD unifies these approaches. We explain the requirements for soundly integrating a deterministic language into a deterministic execution system, and describe a simple qualifier-based type checker that ensures isolation for code written in a deterministic language. We also extend MELD to incorporate nondeterministic operations without compromising the determinism of the rest of the program. Our experiments with benchmarks from the SPLASH2 and PARSEC suites show that a small number of annotations can accelerate the performance of deterministic versions of these programs by 2-6x.
Ultimately, we find that there are competitive alternatives to determinism for debugging and replicating multithreaded programs; that determinism has high, though unproven, potential to improve testing; and that determinism has distinct security benefits in eliminating some covert timing channels. Furthermore, determinism is a unified solution for all four applications: this confers a distinct advantage over point solutions that do not compose well with one another.
In this paper we make the case for fully deterministic shared memory multiprocessing. The idea is to make memory interleaving fully deterministic, in contrast to past approaches of simply replaying an execution based on a memory interleaving log. This causes the execution of a parallel program to be only function of its inputs, making a parallel program effectively behave like a sequential program. We show that determinism can be provided with reasonable performance cost and we also discuss the benefits. Finally, we propose and evaluate a range of implementation approaches.
Before grad school, I was an undergraduate and part-time master's student at the University of Pennsylvania where I worked with Milo Martin, Steve Zdancewic, E. Lewis and Colin Blundell on transactional memory and architectural support for security. In my copious free time, I enjoy bicycle commuting, kayaking, and dreaming about getting a tablet.
I love to write code, and to write about code. I try to blog about as many of my coding projects as I can; here are some I've worked on recently: