---------------------------------------------------
Return-Path: ddion@june
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id PAA19439 for ; Sun, 2 Feb 1997 15:57:01 -0800
Received: (ddion@localhost) by june.cs.washington.edu (8.8.5+CS/7.2ju) id PAA01644; Sun, 2 Feb 1997 15:57:01 -0800
From: ddion@june (David Dion)
Message-Id: <199702022357.PAA01644@june.cs.washington.edu>
Subject: Treadmarks 552-reading summary
To: bershad@cs
Date: Sun, 2 Feb 1997 15:57:00 -0800 (PST)
Cc: ddion@june (David Dion)
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Treadmarks provides distributed shared memory (DSM) for a network of workstations. A major advantage of DSM over message-passing parallel programming is that it is easier for the programmer. Unfortunately, traditional DSM is slow due to expensive sequential consistency protocols. Treadmarks adopts a relaxed memory model to reduce the cost of consistency. The idea behind Treadmarks is that since data races are prevented using synchronization, changes to shared memory need to be propagated only at synchronization points. Propagation is enforced with an algorithm based on the happens-before relationship. Before a process can proceed after acquiring a lock, all shared memory accesses which "happened before" the acquire must be reflected to the acquiring processor. In the Treadmarks implementation, data is propagated from the previous holder of the lock to the current holder at the time the current holder acquires the lock. Consequently, changes can be piggy-backed on the "lock grant" message. Traditional DSM systems allow only a single writer. That is, before a process can write to a page of shared memory, it must have sole access to that page.
While this is a simple model with a straightforward implementation, it suffers from large communication requirements for invalidation and from false sharing. Treadmarks allows multiple writers to concurrently update the same page of shared memory. When a shared page is written, a local copy of the page is created to save its original contents. When a synchronization point is reached, the modified page is compared to the saved copy and a diff is created. These diffs are used to propagate changes to other processes. In a correct program with proper synchronization, it is impossible for diffs to overlap. Performance of TreadMarks applications is tied to the communication-to-computation ratio. When this ratio is high, overhead dominates execution time and speedup suffers. When the ratio is low, speedup can be near linear.
---------------------------------------------------
Return-Path: matthai@franklin.cs.washington.edu
Received: from franklin.cs.washington.edu (franklin.cs.washington.edu [128.95.2.103]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id TAA03753 for ; Mon, 3 Feb 1997 19:18:32 -0800
Received: from localhost (localhost [127.0.0.1]) by franklin.cs.washington.edu (8.8.3+CSE/7.2ws+) with SMTP id TAA06475 for ; Mon, 3 Feb 1997 19:18:32 -0800 (PST)
Message-Id: <199702040318.TAA06475@franklin.cs.washington.edu>
X-Mailer: exmh version 1.5.3 12/28/94
To: bershad@franklin.cs.washington.edu
Subject: Treadmarks 552-reading summary
Reply-to: Matthai Philipose
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 03 Feb 1997 19:18:31 PST
From: Matthai Philipose

Treadmarks is a user-level library that, with support from the virtual memory hardware, implements distributed shared memory (DSM) in software. The goal of the system seems to be to enable high-performance parallel applications to run on networks of workstations (NOWs), which are characterized by fast processors and high-bandwidth, high-latency communication.
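The twin-and-diff scheme described in the first summary above can be sketched in a few lines. This is a simplified model in Python, not TreadMarks code; the page size and the (offset, byte) diff representation are illustrative assumptions.

```python
PAGE_SIZE = 16  # illustrative; a real system uses the VM page size (e.g., 4096)

def make_twin(page):
    """On the first write to a page, save a copy of its original contents."""
    return bytes(page)

def make_diff(twin, page):
    """At a synchronization point, record (offset, new_byte) for changed bytes."""
    return [(i, page[i]) for i in range(PAGE_SIZE) if page[i] != twin[i]]

def apply_diff(page, diff):
    """A remote process applies the diff to its own copy of the page."""
    for offset, value in diff:
        page[offset] = value

# One writer modifies a page; only the changed bytes travel over the network.
page = bytearray(PAGE_SIZE)
twin = make_twin(page)
page[3] = 7
page[9] = 42
diff = make_diff(twin, page)

remote = bytearray(PAGE_SIZE)
apply_diff(remote, diff)
assert remote == page
```

The point is that the diff's size is proportional to what actually changed, not to the page size, which is why sparsely written pages are cheap to propagate.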
The paper identifies two main challenges associated with implementing DSM in such a setting: providing an intuitive but high-performance consistency model, and avoiding false sharing in the face of caching for low-latency access. They point out that the most "natural" consistency model, sequential consistency, where all processors see the same ordering of accesses, is prohibitively expensive in the NOW setting, because the latency of maintaining consistency must be paid at every write to memory. They observe that in conventional parallel programs, access to shared and possibly conflict-generating memory is explicitly limited to one processor at a time by the programmer. It is therefore only necessary to specify the view of memory that all processors see when this processor exits the synchronized region. This consistency model is commonly called the release consistency model. Their key insight is that release consistency may be implemented either in an eager manner (invalidations are generated when the lock is released) or in a lazy manner (only the copy of the process re-acquiring the lock is invalidated). Since only the latter semantics are strictly required by the conventional lock acquire/release programming model, the lazy scheme avoids unnecessary broadcasts. The solution for false sharing is similar to the idea of sub-blocking in distributed-system caches. When a piece of DSM is written, keep track only of the incremental changes made, and broadcast only those at the next barrier check-in. Of course, this solution is a win only in the case where a page is written sparsely, which is presumably the common case. They evaluate their system on two "large" programs (it is unclear how large these programs are, but they are certainly long-running), and achieve the holy grail of parallel-system builders: a linear speedup.
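The eager-versus-lazy distinction above is, at bottom, about message counts. A toy model (the numbers are illustrative, not from the paper):

```python
def eager_release_messages(n_procs):
    """Eager release consistency: on a lock release, every other processor
    that may cache the data must be notified."""
    return n_procs - 1  # one invalidation/update per other processor

def lazy_release_messages():
    """Lazy release consistency: only the next acquirer of the lock learns
    of the changes, piggy-backed on the lock-grant message."""
    return 1

# With 8 processors, one release costs 7 messages eagerly but 1 lazily.
assert eager_release_messages(8) == 7
assert lazy_release_messages() == 1
```

This is the O(n)-to-O(1) reduction per release that makes the lazy scheme attractive on a high-latency network.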
I wasn't really sure what to make of the results; specifically, I did not get a feel for how well the system would deal with other programs (e.g. the SPLASH benchmarks). I was also disappointed to see the effect of building this system at user level: there were veiled references to the fact that communication is expensive in NOWs, but many of the reasons they gave for this (kernel interrupts, context switches, and lots of network interface code) could well be due to their user-level implementation.
---------------------------------------------------
Return-Path: mernst@nishin.cs.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id VAA04267 for ; Mon, 3 Feb 1997 21:01:06 -0800
Received: from nishin.cs.washington.edu (nishin.cs.washington.edu [128.95.4.39]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with ESMTP id VAA20047 for ; Mon, 3 Feb 1997 21:01:06 -0800
Received: (from mernst@localhost) by nishin.cs.washington.edu (8.8.4/8.8.2) id VAA10579; Mon, 3 Feb 1997 21:01:05 -0800
Date: Mon, 3 Feb 1997 21:01:05 -0800
Message-Id: <199702040501.VAA10579@nishin.cs.washington.edu>
From: Michael Ernst
To: bershad@nishin.cs.washington.edu
Subject: Treadmarks 552-reading summary

TreadMarks: Shared Memory Computing on Networks of Workstations by Amza, Cox, Dwarkadas, Keleher, Lu, Rajamony, Yu, and Zwaenepoel

TreadMarks implements a shared memory abstraction for networks of workstations. The shared memory model is accepted as simpler to program, but less efficient, than message passing. The programming model is fairly standard; note that conversion of a program to TreadMarks is not automatic but requires the programmer to decide what is shared and what isn't, for separate versions of malloc, free, etc. exist for local and global operations. (Nevertheless, the leap to TreadMarks is not as great as to a different model.) TreadMarks also supplies barriers and locks.
The contribution of TreadMarks is the use of lazy release consistency. Sequential consistency gives results that could have resulted from running all processes on a single machine. Eager release consistency relaxes that model to communicate only when locks are released -- there is no need to update copies on remote machines which cannot access the data, but when a machine releases a lock, any other processor could acquire it, so they should all have a consistent view of the data. Lazy release consistency communicates the new values only to the next machine which acquires the lock, reducing communication from O(n) to O(1), where n is the number of processors. After all, only the processor that can access the data need have the latest copy -- stale copies that can't be referenced are fine. This is a good idea and should outperform other consistency models, but the paper's performance numbers omit small runs and provide no comparisons with other consistency models, which I would have liked to see. One potential downside is the requirement that the programs communicate only via locks, which TreadMarks can monitor.
---------------------------------------------------
Return-Path: yasushi@silk
Received: from silk.cs.washington.edu (silk.cs.washington.edu [128.95.2.238]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id VAA04339 for ; Mon, 3 Feb 1997 21:14:39 -0800
Received: (yasushi@localhost) by silk.cs.washington.edu (8.7.2/7.2ws+) id VAA18491; Mon, 3 Feb 1997 21:14:38 -0800 (PST)
Date: Mon, 3 Feb 1997 21:14:38 -0800 (PST)
From: yasushi@silk
Message-Id: <199702040514.VAA18491@silk.cs.washington.edu>
To: bershad@silk
Subject: Treadmarks 552-reading summary

Treadmarks is a distributed shared memory system implemented entirely as a user-level library. There are two novel aspects to this system. One is the lazy release consistency protocol, and the other is the multiple-writer protocol.
Lazy release consistency is an improvement over eager release consistency in the sense that it requires fewer message exchanges. This is a win in Treadmarks because the system is single-threaded, and message sends and receives consume a significant number of CPU cycles. I'm not sure this is a win if message processing can overlap with the main computation (e.g., as in DASH). The multiple-writer protocol allows multiple writers on a single page. The system assumes that locks are properly held, so no two writers write to the same location at the same time. Using this assumption, Treadmarks extracts the updated portion of the page by taking a diff against the original page contents, and merges the diffs from multiple writers. This approach mixes well with the release consistency model, where proper synchronization of memory accesses is required. The paper spends many pages describing what the benchmark programs do. I didn't understand them. I wonder why they didn't use more standard stuff (e.g., SPLASH-2).
---------------------------------------------------
Return-Path: rgrimm@cs.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id WAA05272 for ; Mon, 3 Feb 1997 22:03:58 -0800
Received: from [128.95.8.127] (h127.dyn.cs.washington.edu [128.95.8.127]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with SMTP id WAA23617 for ; Mon, 3 Feb 1997 22:03:43 -0800
X-Sender: rgrimm@june.cs.washington.edu
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 3 Feb 1997 22:01:17 -0800
To: bershad@cs
From: rgrimm@cs.washington.edu (Robert Grimm)
Subject: Treadmarks 552-reading summary

Amza et al.
describe a software-only distributed shared memory (DSM) system called TreadMarks. It features a simple yet powerful interface that requires explicit allocation of shared memory (as opposed to local memory acquired through the standard library routines) and provides synchronization primitives in the form of mutex locks and barriers. The implementation relies on a lazy release consistency model to avoid the high message costs of sequential consistency (which easily leads to thrashing of highly contested pages between several nodes) and of eager release consistency (which, while an improvement over sequential consistency, still leads to too much message traffic). To further reduce message overhead, TreadMarks uses a multiple-writer protocol, which allows several nodes to write concurrently to a single page of memory and uses shadow pages (called twins) in combination with modification records (called diffs) to efficiently communicate changes between the nodes.

The evaluation of the basic TreadMarks operations shows reasonable performance, and two applications that were converted to use TreadMarks (mostly) show good to almost linear speedup. TreadMarks provides users with a clean and simple distributed shared memory model and interface, allows sharing at arbitrary granularity (not limited by the page size), is implemented as a user-level library (and does not require modifications to the kernel), runs on a wide variety of hardware platforms available today (as opposed to hardware DSM), and shows good performance for the studied applications. At the same time, the evaluation shows that application performance is highly dependent on the communication-to-computation ratio, favoring problems that partition well across nodes and require relatively few message exchanges per second (on the order of a few hundred).
This suggests that TreadMarks is best used for certain classes of applications, namely traditional supercomputing applications, which is further underlined by the SIMD-style programming model (e.g., the use of barriers). It seems questionable whether other classes of applications, such as large databases, are suited to this programming model, and if so, whether they would see any reasonable performance.
---------------------------------------------------
Return-Path: sparekh@cs.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id WAA05786 for ; Mon, 3 Feb 1997 22:33:48 -0800
Received: from yoda (h250.dyn.cs.washington.edu [128.95.8.250]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with ESMTP id WAA25447; Mon, 3 Feb 1997 22:33:45 -0800
Message-Id: <199702040633.WAA25447@june.cs.washington.edu>
From: "Sujay Parekh"
To:
Cc:
Subject: Treadmarks 552-reading summary
Date: Mon, 3 Feb 1997 22:37:37 -0800
X-MSMail-Priority: Normal
X-Priority: 3
X-Mailer: Microsoft Internet Mail 4.70.1155
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

The TreadMarks system provides a shared-memory abstraction to programmers of distributed applications. DSM is a more convenient abstraction than the usual message passing, because it frees the programmer from worrying constantly about the when, with whom, and what of explicit communication. However, there are several issues in designing a DSM system, mainly relating to the semantics of the abstraction. The TreadMarks paper focuses on their model of memory consistency and their unique implementation technique. A fundamental assumption is that all shared access and synchronization is done via the TreadMarks API. The consistency model is the semantics for how the programmer can expect the memory system to behave.
The traditional sequential consistency model is often too strict and leads to too much communication overhead. TreadMarks adopts a relaxed consistency model called release consistency. It is based on a partial order, called happens-before, defined on synchronization operations and shared memory accesses. Their implementation of this model is lazy release consistency. Lazy release consistency reduces communication since no data invalidation is sent until the potentially invalid data is accessed on a remote node. While this is a significantly more complicated model than eager release consistency, the communication saved seems to justify the effort. TreadMarks uses traditional virtual-memory protection mechanisms to control access to shared memory. Hence, it can run at user level, without needing special kernel modifications. In most systems, however, this is page-level control, which is often too coarse for fine-grained data. TreadMarks addresses this problem with a multiple-writer protocol. This is a really elegant solution that not only reduces false sharing, but also reduces communication overhead when multiple nodes are write-sharing the same page of data. In general, there is evidence that message passing leads to more efficient applications, but the programmer has to manage the communication explicitly. TreadMarks is a step towards more efficient DSM systems. In the related work, they mention other potentially more beneficial techniques, like entry consistency and object-based DSM, which could make DSM more viable.
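The multiple-writer protocol mentioned above depends on diffs from concurrent writers never overlapping, which correct locking guarantees; merging them at a synchronization point is then trivial. A sketch, with the (offset, byte) diff format an illustrative assumption:

```python
def merge_diffs(page, diffs):
    """Apply diffs from several concurrent writers to one page.
    Correct synchronization guarantees the diffs touch disjoint offsets."""
    seen = set()
    for diff in diffs:
        for offset, value in diff:
            assert offset not in seen, "overlapping diffs imply a data race"
            seen.add(offset)
            page[offset] = value
    return page

# Two writers updated different parts of the same page; neither had to wait
# for exclusive ownership, so the false-sharing ping-pong of a single-writer
# protocol never happens.
page = bytearray(8)
writer_a = [(0, 1), (1, 2)]   # diff from processor A
writer_b = [(6, 9), (7, 10)]  # diff from processor B
merge_diffs(page, [writer_a, writer_b])
assert bytes(page) == bytes([1, 2, 0, 0, 0, 0, 9, 10])
```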
---------------------------------------------------
Return-Path: tian@wally
Received: from wally.cs.washington.edu (wally.cs.washington.edu [128.95.2.122]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id XAA06237 for ; Mon, 3 Feb 1997 23:05:17 -0800
Received: (tian@localhost) by wally.cs.washington.edu (8.8.3+CSE/7.2ws+) id XAA09186; Mon, 3 Feb 1997 23:05:16 -0800 (PST)
Date: Mon, 3 Feb 1997 23:05:16 -0800 (PST)
From: Tian Lim
To: Brian Bershad
Subject: Treadmarks 552-reading summary
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Treadmarks is a user-level, software DSM system. It provides a separate API for shared memory allocation and for locks and barriers on shared data. It is understood that a sequential consistency model is prohibitively expensive, so Treadmarks builds on the "eager" release protocol of Munin by making it lazy: instead of broadcasting changes when locks are released, only the new holder of the lock receives the changes. As a result, less bandwidth is used. Because they assume only lock holders will access the data, they can avoid broadcasting invalidate messages. Treadmarks deals with false sharing the same way Munin does: a multiple-writer protocol that allows more than one writer to mutate a page, and sends page diffs instead of whole pages during updates. I am curious as to why they did not attempt something like the sharing annotations in Munin to customize the consistency model. Is the lazy release protocol sufficiently powerful that it exceeds the performance gains possible by having the programmer annotate expected data access patterns? In addition, the performance graphs showing almost-linear speedup do not reveal the memory usage model. It is not clear to me that these applications share at a fine grain; i.e., the communication-to-computation ratio they speak of may not be high. Since this ratio is application dependent, it is unclear how generally DSM can be applied.
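The sensitivity to the communication-to-computation ratio raised in several of the summaries above can be made concrete with a back-of-the-envelope model. All numbers here are illustrative assumptions, not measurements from the paper:

```python
def speedup(n_procs, compute_secs, msgs_per_sec, cost_per_msg):
    """Ideal parallel time plus a fixed per-message software overhead."""
    parallel = compute_secs / n_procs
    overhead = parallel * msgs_per_sec * cost_per_msg
    return compute_secs / (parallel + overhead)

# A coarse-grained application: ~100 messages/sec at ~1 ms each loses little.
coarse = speedup(8, 100.0, 100, 0.001)
# A fine-grained application: ~5000 messages/sec swamps the computation.
fine = speedup(8, 100.0, 5000, 0.001)
assert coarse > 7   # near-linear speedup on 8 processors
assert fine < 2     # overhead-dominated
```

This matches the paper's observation that applications exchanging on the order of a few hundred messages per second fare well, while finer-grained sharing erodes the speedup.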
The authors also argue that the programming model is simpler than message passing, but from their discussion, application performance is highly sensitive to how the computation is divided, both from an algorithmic point of view (e.g., they had to use load balancing in the linkage problem) and from a data access point of view (finer-grain sharing increases communication overhead). I am curious to know whether this is any easier than figuring out when to send messages to whom.
---------------------------------------------------
Return-Path: echris@merganser
Received: from merganser.cs.washington.edu (merganser.cs.washington.edu [128.95.2.192]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id IAA00696 for ; Tue, 4 Feb 1997 08:22:00 -0800
Received: (echris@localhost) by merganser.cs.washington.edu (8.8.3+CSE/7.2ws+) id IAA00699; Tue, 4 Feb 1997 08:21:59 -0800 (PST)
Date: Tue, 4 Feb 1997 08:21:59 -0800 (PST)
Message-Id: <199702041621.IAA00699@merganser.cs.washington.edu>
From: E Christopher Lewis
To: bershad@cs
Subject: Treadmarks 552-reading summary
Reply-To: echris@cs.washington.edu

This paper summarizes yet another DSM implementation. TreadMarks may be distinguished from other work in the following ways: the target platform is NOWs, it is software-based, the implementation requires no kernel mods (achieving portability?), it distinguishes shared and non-shared memory, and it uses lazy release consistency and a multiple-writer protocol. The paper's value is as a discussion of the above details (in particular, consistency models) in a single system. I will now take issue with the motivation for DSM that appears in the abstract and introduction. The authors argue that DSM can play an important role in incremental parallelism, for most sequential data structures can remain unchanged after parallelization (the usual stuff). DSM is dangerous. Though it provides a convenient mechanism for communication, it is often sold as a way to ignore locality.
Locality is so important in a NOW that algorithm changes are often necessary to achieve good performance. The incremental dream does not allow for this reality.
---------------------------------------------------
Return-Path: govindk@shasta.ee.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id IAA00900 for ; Tue, 4 Feb 1997 08:36:15 -0800
Received: from shasta.ee.washington.edu (shasta.ee.washington.edu [128.95.28.11]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with SMTP id IAA21264 for ; Tue, 4 Feb 1997 08:36:15 -0800
Received: from andes.ee.washington.edu by shasta.ee.washington.edu (4.1/SMI-4.1) id AA01855; Tue, 4 Feb 97 08:37:27 GMT
Received: from andes by andes.ee.washington.edu (SMI-8.6/SMI-SVR4) id IAA03779; Tue, 4 Feb 1997 08:37:00 -0800
Sender: govindk@shasta.ee.washington.edu
Message-Id: <32F765AC.C31@shasta.ee.washington.edu>
Date: Tue, 04 Feb 1997 08:37:00 -0800
From: Govindarajan K
X-Mailer: Mozilla 3.01Gold (X11; I; SunOS 5.5 sun4u)
Mime-Version: 1.0
To: bershad@cs
Subject: Treadmarks 552-reading summary
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

TreadMarks is a distributed shared memory system operating at the user level. The authors, using primitives for locking critical sections, introduce the concept of lazy release consistency to minimize the communication overhead of maintaining consistency in the DSM system. The authors use TreadMarks to implement consistency in a NOW. The traditional method of consistency maintenance is the sequential consistency model. This is a simple model wherein consistency is based on a happened-before paradigm. But the problem is that it is too communication intensive. It also suffers from false sharing (which occurs when two or more unrelated data objects are located in the same page and are written concurrently by separate processors).
The lazy release algorithm releases information about the modifications to a page only to the process that next procures the lock, thereby reducing the communication overhead. The catch is that the releasing processor cannot forget about the modifications made to the page, since another processor may later need this information. The page updates are conveyed in an intelligent fashion: the difference between the modified page and its twin, created in user space, is run-length encoded and transmitted to the next process that needs the page. Note, however, that the TreadMarks paradigm assumes there are no data races among the processes. The authors have compared their approach with previously existing work and shown significant improvement.
---------------------------------------------------
Return-Path: sungeun@wormwood
Received: from wormwood.cs.washington.edu (wormwood.cs.washington.edu [128.95.2.107]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id IAA00991 for ; Tue, 4 Feb 1997 08:45:06 -0800
Received: (sungeun@localhost) by wormwood.cs.washington.edu (8.8.3+CSE/7.2ws+) id IAA00714; Tue, 4 Feb 1997 08:45:05 -0800 (PST)
Date: Tue, 4 Feb 1997 08:45:05 -0800 (PST)
Message-Id: <199702041645.IAA00714@wormwood.cs.washington.edu>
From: Sung-Eun Choi
To: bershad@cs
Subject: Treadmarks 552-reading summary
Reply-To: sungeun@cs.washington.edu

TreadMarks is a distributed shared memory system for networks of workstations. The system leverages virtual memory protection to detect accesses to shared data. Consequently, no hardware was built and no kernels were hacked. Two performance issues arise when implementing such a system. The first is providing shared data consistency; the second, avoiding false sharing. The "natural" model for providing consistency is to implement sequential consistency, i.e., there is a total ordering on all memory accesses.
Though relatively simple to implement, sequential consistency generally results in a large volume of mostly unnecessary network traffic. By relaxing the consistency requirements to be sensitive to synchronization, network traffic is greatly reduced and performance improved. False sharing is a by-product of using the VM protection mechanisms to detect accesses. In a naive system, when different processors access different shared variables incidentally located on the same page, much unnecessary updating of the data is generated. The TreadMarks system allows multiple writers per page and then merges the updates when consistency must be realized. The system provides a shared memory abstraction to the programmer via two mechanisms: shared memory allocation and synchronization. I cannot comment on whether or not I think this is a sufficient API, for I do not believe that shared memory is an appropriate programming model for parallel applications. While they claim that shared memory "facilitates the transition from sequential to parallel programs", unless they mean poorly performing parallel programs, I do not believe this statement. Memory allocation and synchronization ignore locality issues, which are key in distributed parallel applications. Moreover, this "transition" that they speak of requires identifying the "possible sources of parallelism" in the sequential program, a not-so-easy task for which many have developed tools.
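One of the summaries above notes that TreadMarks run-length encodes the difference between a modified page and its twin before transmission. A toy sketch of that idea; the exact wire format here is an assumption, encoding each dirty region as an (offset, bytes) run:

```python
def rle_diff(twin, page):
    """Encode contiguous changed regions as (offset, bytes) runs;
    unchanged bytes cost nothing on the wire."""
    runs, i, n = [], 0, len(page)
    while i < n:
        if page[i] != twin[i]:
            start = i
            while i < n and page[i] != twin[i]:
                i += 1
            runs.append((start, bytes(page[start:i])))
        else:
            i += 1
    return runs

def apply_runs(page, runs):
    """Replay the encoded runs onto another copy of the page."""
    for offset, data in runs:
        page[offset:offset + len(data)] = data

# One small dirty region on a mostly clean page encodes as a single run.
twin = bytes(16)
page = bytearray(16)
page[4:7] = b"\x01\x02\x03"
runs = rle_diff(twin, page)
assert runs == [(4, b"\x01\x02\x03")]

copy = bytearray(16)
apply_runs(copy, runs)
assert copy == page
```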