Joe Devietti UW CSE grad student
I'm a fifth-year grad student in the Sampa group in the Computer Science and Engineering department at the University of Washington. I work with my advisors Luis Ceze and Dan Grossman on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models. I'm grateful to Intel Corporation for their support of my final year of PhD study with an Intel PhD Fellowship.

I'm graduating in summer 2012 and am looking for academic and industrial research positions. Please take a look at my research statement, teaching statement, and cv if you're interested in hiring me.

Contact
E-mail: my email address
Phone: (206) 543-5129
Office: CSE 366
Address: Joe Devietti
Computer Science & Engineering, University of Washington
Box 352350
Seattle, WA 98195-2350
Directions: Campus Map of the Paul Allen Center
By Car
By Public Transit
Research

My CV is available here; below is my current list of publications.

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

Conference Papers
  • RADISH: Always-On Sound and Complete Race Detection in Software and Hardware
    International Symposium on Computer Architecture (to appear at ISCA '12), June 2012
    Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on race detection could provide.

    To resolve this trade-off, we propose RADISH, a hybrid hardware-software race detector that is always-on and fully precise. In RADISH, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that RADISH is correct by proving equivalence to a conventional happens-before race detector.

    Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. RADISH can furthermore leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core RADISH processor using PARSEC benchmarks shows runtime overheads from negligible to 2x. Furthermore, RADISH outperforms the leading software-only race detector by 2x-37x.

  • RCDC: A Relaxed-Consistency Deterministic Computer
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '11), March 2011
    Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that enforce determinism in current hardware. Such work has incidentally imposed strong memory-ordering properties. Historically, memory ordering has been relaxed in favor of higher performance in shared memory multiprocessors and, interestingly, determinism exacerbates the cost of strong memory ordering. Consequently, we argue that relaxed memory ordering is vital to achieving faster deterministic execution.

    This paper introduces RCDC, a deterministic multiprocessor architecture that takes advantage of relaxed memory orderings to provide high-performance deterministic execution with low hardware complexity. RCDC has two key innovations: a hybrid HW/SW approach to enforcing determinism; and a new deterministic execution strategy that leverages data-race-free-based memory models (e.g., the models for Java and C++) to improve performance and scalability without sacrificing determinism, even in the presence of races. In our hybrid HW/SW approach, the only hardware mechanisms required are software-controlled store buffering and support for precise instruction counting; we do not require speculation. A runtime system uses these mechanisms to enforce determinism for arbitrary programs.

    We evaluate RCDC using PARSEC benchmarks and show that relaxing memory ordering leads to performance and scalability close to nondeterministic execution without requiring any form of speculation. We also compare our new execution strategy to one based on TSO (total-store-ordering) and show that some applications benefit significantly from the extra relaxation. We also evaluate a software-only implementation of our new deterministic execution strategy.

  • CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '10), March 2010
    [abstract][paper][slides: key pdf][code]
    The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically.

    A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually.

    Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.

  • DMP: Deterministic Shared Memory Multiprocessing
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '09), March 2009
    Selected for IEEE Micro Top Picks '09 [article]
    Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
  • Atom-Aid: Surviving and Detecting Atomicity Violations
    International Symposium on Computer Architecture (ISCA '08), June 2008
    [abstract][paper][slides: key ppt]
    Selected for IEEE Micro Top Picks '08 [article]
    Writing shared-memory parallel programs is error-prone. Among the concurrency errors that programmers often face are atomicity violations, which are especially challenging. They happen when programmers make incorrect assumptions about atomicity and fail to enclose memory accesses that should occur atomically inside the same critical section. If these accesses happen to be interleaved with conflicting accesses from different threads, the program might behave incorrectly.

    Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce memory ordering at a coarse grain. This provides what we call implicit atomicity, as the atomic blocks are not derived from explicit program annotations. In this paper, we make the fundamental observation that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. We then propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves. Atom-Aid is also able to report where atomicity violations might exist in the code, providing resilience and debuggability. We evaluate Atom-Aid using buggy code from applications including Apache, MySQL, and XMMS, showing that Atom-Aid virtually eliminates the manifestation of atomicity violations.

  • HardBound: Architectural Support for Spatial Safety of the C Programming Language
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '08), March 2008
    [abstract][paper][slides: pdf pptx][video]
    The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. C’s unchecked pointer arithmetic and array indexing allow simple programming mistakes to lead to erroneous executions, silent data corruption, and security vulnerabilities. Many prior proposals have tackled enforcing spatial safety in C programs by checking pointer and array accesses. However, existing software-only proposals have significant drawbacks that may prevent wide adoption, including: unacceptably high runtime overheads, lack of completeness, incompatible pointer representations, or need for non-trivial changes to existing C source code and compiler infrastructure.

    Inspired by the promise of these software-only approaches, this paper proposes a hardware bounded pointer architectural primitive that supports cooperative hardware/software enforcement of spatial memory safety for C programs. This bounded pointer is a new hardware primitive datatype for pointers that leaves the standard C pointer representation intact, but augments it with bounds information maintained separately and invisibly by the hardware. The bounds are initialized by the software, and they are then propagated and enforced transparently by the hardware, which automatically checks a pointer’s bounds before it is dereferenced. One mode of use requires instrumenting only malloc, which enables enforcement of per-allocation spatial safety for heap-allocated objects for existing binaries. When combined with simple intra-procedural compiler instrumentation, hardware bounded pointers enable a low-overhead approach for enforcing complete spatial memory safety in unmodified C programs.

  • Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory
    International Symposium on Computer Architecture (ISCA '07), June 2007
    [abstract][paper][slides: pdf ppt]
    Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a generalpurpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex.

    This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose ONETM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system.

    The permissions-only cache efficiently maintains the coherence permissions—but not data—for blocks read or written transactionally that have been evicted from the processor’s caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. ONETM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In ONETM-Serialized, an overflowed transaction simply stalls all other threads in the application. In ONETM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining ONETM’s core simplifying assumption.

Workshop Papers
  • The Case For Merging Execution- and Language-level Determinism with MELD
    Workshop on Determinism and Correctness in Parallel Programming (WoDet '12), held in conjunction with ASPLOS '12, March 2012
    [abstract][paper][slides: key pdf]
    Nondeterminism is a key contributor to the difficulty of parallel programming. Many research projects have shown how to provide deterministic parallelism, but with unfortunate trade-offs. Deterministic execution enforces determinism for arbitrary programs but with significant runtime cost, while deterministic languages enforce determinism statically (without runtime overhead) but only for fork-join programs expressible in their static type systems.

    MELD unifies these approaches. We explain the requirements for soundly integrating a deterministic language into a deterministic execution system, and describe a simple qualifier-based type checker that ensures isolation for code written in a deterministic language. We also extend MELD to incorporate nondeterministic operations without compromising the determinism of the rest of the program. Our experiments with benchmarks from the SPLASH2 and PARSEC suites show that a small number of annotations can accelerate the performance of deterministic versions of these programs by 2-6x.

  • The Deterministic Execution Hammer: How Well Does it Actually Pound Nails?
    Workshop on Determinism and Correctness in Parallel Programming (WoDet '11), held in conjunction with ASPLOS '11, March 2011
    [abstract][paper][slides: key pdf]
    This paper takes a critical look at the benefits provided by state-of-the-art deterministic execution techniques. Specifically, we look at four applications of deterministic execution: debugging, fault-tolerant replication, testing, and security. For each application, we discuss what an ideal system would provide, and then look at how deterministic systems compare to the ideal. Further, we discuss alternative approaches, not involving determinism, and we judge whether or not these alternatives are more suitable. Along the way, we identify open questions and suggest future work.

    Ultimately, we find that there are competitive alternatives to determinism for debugging and replicating multithreaded programs; that determinism has high, though unproven, potential to improve testing; and that determinism has distinct security benefits in eliminating some covert timing channels. Furthermore, determinism is a unified solution for all four applications: this confers a distinct advantage over point solutions that do not compose well with one another.

  • Lock Prediction
    USENIX Workshop on Hot Topics in Parallelism (HotPar '10), accepted for poster session, June 2010
    This paper makes the case for studying and exploiting the predictability of lock operations in multithreaded programs. We discuss several uses of lock prediction and identify the two key events we believe are useful to predict: (1) which thread will be the next one to acquire a given lock and (2) which lock a given thread will acquire next. We describe multiple prediction algorithms and show that while their accuracy varies widely, they are likely accurate enough for many uses.
  • The Case for System Support for Concurrency Exceptions
    USENIX Workshop on Hot Topics in Parallelism (HotPar '09), March 2009
    In this position paper we argue that concurrency errors should be fail-stop. We want to put concurrency errors in the same category as division-by-zero, segmentation fault in unmanaged languages and cast exceptions in managed languages. This would make nondeterminism in multithreaded execution be much more manageable. Concurrency exceptions would improve the debugging process during development and make crashes due to concurrency errors that happen in the field be more descriptive. Our goal in this paper is to justify our position, propose a general approach to concurrency exceptions and discuss system requirements and implications. Specifically, we discuss the semantics of concurrency exceptions at the language level, their implications in the compiler and run-time systems, how they should be delivered and, finally, how they are enabled by efficient architecture support.
  • Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!
    Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP '08), held in conjunction with ISCA '08, June 2008
    [abstract][paper][slides: odp pdf ppt]
    The root of all (or most) evil in programming shared memory multiprocessors is that execution is not deterministic. Bugs are hard to find and non-repeatable. This makes debugging a nightmare and gives little assurance in the testing - there is no way to know how the program will behave in a different environment. Testing and debugging is already difficult with a single thread on uniprocessors. Pervasive parallel programs and chip multiprocessors will make this difficulty worse, and more widespread throughout our industry.

    In this paper we make the case for fully deterministic shared memory multiprocessing. The idea is to make memory interleaving fully deterministic, in contrast to past approaches of simply replaying an execution based on a memory interleaving log. This causes the execution of a parallel program to be only function of its inputs, making a parallel program effectively behave like a sequential program. We show that determinism can be provided with reasonable performance cost and we also discuss the benefits. Finally, we propose and evaluate a range of implementation approaches.

Technical Reports
  • Code-Centric Communication Graphs for Shared-Memory Multithreaded Programs
    Technical Report UW-CSE-09-05-02, May 2009
    With more and more complex pieces of software using explicit multithreading, tools to extract the structure of such software are increasingly important. We present a novel tool that builds graphs describing how threads in shared-memory parallel programs communicate. Compared to prior work, our communication graphs are code-centric: nodes represent units of code (e.g., functions) and edges represent inter-thread shared-memory communication via these units of code. Our approach is dynamic, using actual executions to build graphs, and exploits binary-code instrumentation to work for large, real-world applications. The graphs are useful for understanding software structure and computing program properties, such as the effect of nondeterministic thread-scheduling on the communication pattern.
About me

Before grad school, I was an undergraduate and part-time master's student at the University of Pennsylvania where I worked with Milo Martin, Steve Zdancewic, E. Lewis and Colin Blundell on transactional memory and architectural support for security. In my copious free time, I enjoy bicycle commuting, kayaking, and dreaming about getting a tablet.

Code

I love to write code, and to write about code. I try to blog about as many of my coding projects as I can; here are some I've worked on recently: