---------------------------------------------------
Return-Path: ddion@june
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id PAA19439 for ; Sun, 2 Feb 1997 15:57:01 -0800
Received: (ddion@localhost) by june.cs.washington.edu (8.8.5+CS/7.2ju) id PAA01644; Sun, 2 Feb 1997 15:57:01 -0800
From: ddion@june (David Dion)
Message-Id: <199702022357.PAA01644@june.cs.washington.edu>
Subject: Treadmarks 552-reading summary
To: bershad@cs
Date: Sun, 2 Feb 1997 15:57:00 -0800 (PST)
Cc: ddion@june (David Dion)
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Treadmarks provides distributed shared memory (DSM) for a network of workstations. A major advantage of DSM over message-passing parallel programming is that it is easier for the programmer. Unfortunately, traditional DSM is slow due to expensive sequential consistency protocols. Treadmarks adopts a relaxed memory model to reduce the cost of consistency. The idea behind Treadmarks is that since data races are prevented using synchronization, changes to shared memory need to be propagated only at synchronization points. Propagation is enforced with an algorithm based on the happens-before relationship. Before a process can proceed after acquiring a lock, all shared memory accesses which "happened before" the acquire must be reflected to the acquiring processor. In the Treadmarks implementation, data is propagated from the previous holder of the lock to the current holder at the time the current holder acquires the lock. Consequently, changes can be piggy-backed on the "lock grant" message. Traditional DSM systems allow only a single writer. That is, before a process can write to a page of shared memory, it must have sole access to that page.
While this is a simple model with a straightforward implementation, it suffers from large communication requirements for invalidation and from false sharing. Treadmarks allows multiple writers to concurrently update the same page of shared memory. When a shared page is written, a local copy of the page is created to save its original contents. When a synchronization point is reached, the modified page is compared to the saved copy and a diff is created. These diffs are used to propagate changes to other processes. In a correct program with proper synchronization, it is impossible for diffs to overlap. Performance of TreadMarks applications is tied to the communication-to-computation ratio. When this ratio is high, overhead dominates execution time and speedup suffers. When the ratio is low, speedup can be near linear.
---------------------------------------------------
Return-Path: matthai@franklin.cs.washington.edu
Received: from franklin.cs.washington.edu (franklin.cs.washington.edu [128.95.2.103]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id TAA03753 for ; Mon, 3 Feb 1997 19:18:32 -0800
Received: from localhost (localhost [127.0.0.1]) by franklin.cs.washington.edu (8.8.3+CSE/7.2ws+) with SMTP id TAA06475 for ; Mon, 3 Feb 1997 19:18:32 -0800 (PST)
Message-Id: <199702040318.TAA06475@franklin.cs.washington.edu>
X-Mailer: exmh version 1.5.3 12/28/94
To: bershad@franklin.cs.washington.edu
Subject: Treadmarks 552-reading summary
Reply-to: Matthai Philipose
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 03 Feb 1997 19:18:31 PST
From: Matthai Philipose

Treadmarks is a user-level library that, with support from the virtual memory hardware, implements distributed shared memory (DSM) in software. The goal of the system seems to be to enable high-performance parallel applications to run on networks of workstations (NOWs), which are characterized by fast processors and high-bandwidth, high-latency communication.
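The twin-and-diff scheme described in the first summary above can be sketched in a few lines. This is a simplified model in Python, not TreadMarks code; the page size and the (offset, byte) diff representation are illustrative assumptions.

```python
PAGE_SIZE = 16  # illustrative; a real system uses the VM page size (e.g., 4096)

def make_twin(page):
    """On the first write to a page, save a copy of its original contents."""
    return bytes(page)

def make_diff(twin, page):
    """At a synchronization point, record (offset, new_byte) for changed bytes."""
    return [(i, page[i]) for i in range(PAGE_SIZE) if page[i] != twin[i]]

def apply_diff(page, diff):
    """A remote process applies the diff to its own copy of the page."""
    for offset, value in diff:
        page[offset] = value

# One writer modifies a page; only the changed bytes travel over the network.
page = bytearray(PAGE_SIZE)
twin = make_twin(page)
page[3] = 7
page[9] = 42
diff = make_diff(twin, page)

remote = bytearray(PAGE_SIZE)
apply_diff(remote, diff)
assert remote == page
```

The point is that the diff's size is proportional to what actually changed, not to the page size, which is why sparsely written pages are cheap to propagate.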
The paper identifies two main challenges associated with implementing DSM in such a setting: providing an intuitive but high-performance consistency model, and avoiding false sharing in the face of caching for low-latency access. They point out that the most "natural" consistency model, sequential consistency, where all processors see the same ordering of accesses, is prohibitively expensive in the NOW setting, because the latency of maintaining consistency must be paid at every write to memory. They observe that in conventional parallel programs, access to shared and possibly conflict-generating memory is explicitly limited to one processor at a time by the programmer. It is therefore only necessary to specify the view of memory that all processors see when this processor exits the synchronized region. This consistency model is commonly called the release consistency model. Their key insight is that release consistency may be implemented either in an eager manner (invalidations are generated when the lock is released) or in a lazy manner (only the copy of the process re-acquiring the lock is invalidated). Since only the latter semantics are strictly required by the conventional lock acquire/release programming model, the lazy scheme avoids unnecessary broadcasts. The solution for false sharing is similar to the idea of sub-blocking in distributed-system caches. When a piece of DSM is written, keep track only of the incremental changes made, and broadcast only those at the next barrier check-in. Of course, this solution is a win only in the case where a page is written sparsely, which is presumably the common case. They evaluate their system on two "large" programs (it is unclear how large these programs are, but they are certainly long-running), and achieve the holy grail of parallel-system builders: a linear speedup.
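The eager-versus-lazy distinction above is, at bottom, about message counts. A toy model (the numbers are illustrative, not from the paper):

```python
def eager_release_messages(n_procs):
    """Eager release consistency: on a lock release, every other processor
    that may cache the data must be notified."""
    return n_procs - 1  # one invalidation/update per other processor

def lazy_release_messages():
    """Lazy release consistency: only the next acquirer of the lock learns
    of the changes, piggy-backed on the lock-grant message."""
    return 1

# With 8 processors, one release costs 7 messages eagerly but 1 lazily.
assert eager_release_messages(8) == 7
assert lazy_release_messages() == 1
```

This is the O(n)-to-O(1) reduction per release that makes the lazy scheme attractive on a high-latency network.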
I wasn't really sure what to make of the results; specifically, I did not get a feel for how well the system would deal with other programs (e.g. the SPLASH benchmarks). I was also disappointed to see the effect of building this system at user level: there were veiled references to the fact that communication is expensive in NOWs, but many of the reasons they gave for this (kernel interrupts, context switches, and lots of network interface code) could well be due to their user-level implementation.
---------------------------------------------------
Return-Path: mernst@nishin.cs.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id VAA04267 for ; Mon, 3 Feb 1997 21:01:06 -0800
Received: from nishin.cs.washington.edu (nishin.cs.washington.edu [128.95.4.39]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with ESMTP id VAA20047 for ; Mon, 3 Feb 1997 21:01:06 -0800
Received: (from mernst@localhost) by nishin.cs.washington.edu (8.8.4/8.8.2) id VAA10579; Mon, 3 Feb 1997 21:01:05 -0800
Date: Mon, 3 Feb 1997 21:01:05 -0800
Message-Id: <199702040501.VAA10579@nishin.cs.washington.edu>
From: Michael Ernst
To: bershad@nishin.cs.washington.edu
Subject: Treadmarks 552-reading summary

TreadMarks: Shared Memory Computing on Networks of Workstations by Amza, Cox, Dwarkadas, Keleher, Lu, Rajamony, Yu, and Zwaenepoel

TreadMarks implements a shared memory abstraction for networks of workstations. The shared memory model is accepted as simpler to program, but less efficient, than message passing. The programming model is fairly standard; note that conversion of a program to TreadMarks is not automatic but requires the programmer to decide what is shared and what isn't, for separate versions of malloc, free, etc. exist for local and global operations. (Nevertheless, the leap to TreadMarks is not as great as to a different model.) TreadMarks also supplies barriers and locks.
The contribution of TreadMarks is the use of lazy release consistency. Sequential consistency gives results that could have resulted from running all processes on a single machine. Eager release consistency relaxes that model to communicate only when locks are released -- there is no need to update copies on remote machines which cannot access the data, but when a machine releases a lock, any other processor could acquire it, so they should all have a consistent view of the data. Lazy release consistency communicates the new values only to the next machine which acquires the lock, reducing communication from O(n) to O(1), where n is the number of processors. After all, only the processor that can access the data need have the latest copy -- stale copies that can't be referenced are fine. This is a good idea and should outperform other consistency models, but the paper's performance numbers omit small runs and provide no comparisons with other consistency models, which I would have liked to see. One potential downside is the requirement that the programs communicate only via locks, which TreadMarks can monitor.
---------------------------------------------------
Return-Path: yasushi@silk
Received: from silk.cs.washington.edu (silk.cs.washington.edu [128.95.2.238]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id VAA04339 for ; Mon, 3 Feb 1997 21:14:39 -0800
Received: (yasushi@localhost) by silk.cs.washington.edu (8.7.2/7.2ws+) id VAA18491; Mon, 3 Feb 1997 21:14:38 -0800 (PST)
Date: Mon, 3 Feb 1997 21:14:38 -0800 (PST)
From: yasushi@silk
Message-Id: <199702040514.VAA18491@silk.cs.washington.edu>
To: bershad@silk
Subject: Treadmarks 552-reading summary

Treadmarks is a distributed shared memory system implemented entirely as a user-level library. There are two novel aspects to this system. One is the lazy release consistency protocol, and the other is the multiple-writer protocol.
Lazy release consistency is an improvement over eager release consistency in the sense that it requires fewer message exchanges. This is a win in Treadmarks because the system is single-threaded, and message sends and receives consume a significant number of CPU cycles. I'm not sure this is a win if message processing can overlap with the main computation (e.g., as in DASH). The multiple-writer protocol allows multiple writers on a single page. The system assumes that locks are properly held, so no two writers write to the same location at the same time. Using this assumption, Treadmarks extracts the updated portion of the page by taking a diff against the original page contents, and merges the diffs from multiple writers. This approach mixes well with the release consistency model, where proper synchronization of memory accesses is required. The paper spends many pages describing what the benchmark programs do. I didn't understand them. I wonder why they didn't use more standard stuff (e.g., SPLASH-2).
---------------------------------------------------
Return-Path: rgrimm@cs.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id WAA05272 for ; Mon, 3 Feb 1997 22:03:58 -0800
Received: from [128.95.8.127] (h127.dyn.cs.washington.edu [128.95.8.127]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with SMTP id WAA23617 for ; Mon, 3 Feb 1997 22:03:43 -0800
X-Sender: rgrimm@june.cs.washington.edu
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 3 Feb 1997 22:01:17 -0800
To: bershad@cs
From: rgrimm@cs.washington.edu (Robert Grimm)
Subject: Treadmarks 552-reading summary

Amza et al.
describe a software-only distributed shared memory (DSM) system called TreadMarks. It features a simple yet powerful interface that requires explicit allocation of shared memory (as opposed to local memory acquired through the standard library routines) and provides synchronization primitives in the form of mutex locks and barriers. The implementation relies on a lazy release consistency model to avoid the high message costs of sequential consistency (which easily leads to thrashing of highly contested pages between several nodes) and of eager release consistency (which, while an improvement over sequential consistency, still leads to too much message traffic). To further reduce message overhead, TreadMarks uses a multiple-writer protocol, which allows several nodes to write concurrently to a single page of memory and uses shadow pages (called twins) in combination with modification records (called diffs) to efficiently communicate changes between the nodes.

The evaluation of the basic TreadMarks operations shows reasonable performance, and two applications that were converted to use TreadMarks (mostly) show good to almost linear speedup. TreadMarks provides users with a clean and simple distributed shared memory model and interface, allows sharing at arbitrary granularity (not limited by the page size), is implemented as a user-level library (and does not require modifications to the kernel), runs on a wide variety of hardware platforms available today (as opposed to hardware DSM), and shows good performance for the studied applications. At the same time, the evaluation shows that application performance is highly dependent on the communication-to-computation ratio, favoring problems that partition well across nodes and require relatively few message exchanges per second (on the order of a few hundred).
This suggests that TreadMarks is best used for certain classes of applications, namely traditional supercomputing applications, which is further underlined by the SIMD-style programming model (e.g., the use of barriers). It seems questionable whether other classes of applications, such as large databases, are suited to this programming model, and if so, whether they would see any reasonable performance.
---------------------------------------------------
Return-Path: sparekh@cs.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id WAA05786 for ; Mon, 3 Feb 1997 22:33:48 -0800
Received: from yoda (h250.dyn.cs.washington.edu [128.95.8.250]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with ESMTP id WAA25447; Mon, 3 Feb 1997 22:33:45 -0800
Message-Id: <199702040633.WAA25447@june.cs.washington.edu>
From: "Sujay Parekh"
To:
Cc:
Subject: Treadmarks 552-reading summary
Date: Mon, 3 Feb 1997 22:37:37 -0800
X-MSMail-Priority: Normal
X-Priority: 3
X-Mailer: Microsoft Internet Mail 4.70.1155
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

The TreadMarks system provides a shared-memory abstraction to programmers of distributed applications. DSM is a more convenient abstraction than the usual message passing, because it frees the programmer from worrying constantly about the when, with whom, and what of explicit communication. However, there are several issues in designing a DSM system, mainly relating to the semantics of the abstraction. The TreadMarks paper focuses on their model of memory consistency and their unique implementation technique. A fundamental assumption is that all shared access and synchronization is done via the TreadMarks API. The consistency model is the semantics for how the programmer can expect the memory system to behave.
The traditional sequential consistency model is often too strict and leads to too much communication overhead. TreadMarks adopts a relaxed consistency model called release consistency. It is based on a partial order, called happens-before, defined on synchronization operations and shared memory accesses. Their implementation of this model is lazy release consistency. Lazy release consistency reduces communication since no data invalidation is sent until the potentially invalid data is accessed on a remote node. While this is a significantly more complicated model than eager release consistency, the communication saved seems to justify the effort. TreadMarks uses traditional virtual-memory protection mechanisms to control access to shared memory. Hence, it can run at user level, without needing special kernel modifications. In most systems, however, this is page-level control, which is often too coarse for fine-grained data. TreadMarks addresses this problem with a multiple-writer protocol. This is a really elegant solution that not only reduces false sharing, but also reduces communication overhead when multiple nodes are write-sharing the same page of data. In general, there is evidence that message passing leads to more efficient applications, but the programmer has to manage the communication explicitly. TreadMarks is a step towards more efficient DSM systems. In the related work, they mention other potentially more beneficial techniques, like entry consistency and object-based DSM, which could make DSM more viable.
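The multiple-writer protocol mentioned above depends on diffs from concurrent writers never overlapping, which correct locking guarantees; merging them at a synchronization point is then trivial. A sketch, with the (offset, byte) diff format an illustrative assumption:

```python
def merge_diffs(page, diffs):
    """Apply diffs from several concurrent writers to one page.
    Correct synchronization guarantees the diffs touch disjoint offsets."""
    seen = set()
    for diff in diffs:
        for offset, value in diff:
            assert offset not in seen, "overlapping diffs imply a data race"
            seen.add(offset)
            page[offset] = value
    return page

# Two writers updated different parts of the same page; neither had to wait
# for exclusive ownership, so the false-sharing ping-pong of a single-writer
# protocol never happens.
page = bytearray(8)
writer_a = [(0, 1), (1, 2)]   # diff from processor A
writer_b = [(6, 9), (7, 10)]  # diff from processor B
merge_diffs(page, [writer_a, writer_b])
assert bytes(page) == bytes([1, 2, 0, 0, 0, 0, 9, 10])
```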
---------------------------------------------------
Return-Path: tian@wally
Received: from wally.cs.washington.edu (wally.cs.washington.edu [128.95.2.122]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id XAA06237 for ; Mon, 3 Feb 1997 23:05:17 -0800
Received: (tian@localhost) by wally.cs.washington.edu (8.8.3+CSE/7.2ws+) id XAA09186; Mon, 3 Feb 1997 23:05:16 -0800 (PST)
Date: Mon, 3 Feb 1997 23:05:16 -0800 (PST)
From: Tian Lim
To: Brian Bershad
Subject: Treadmarks 552-reading summary
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Treadmarks is a user-level, software DSM system. It provides a separate API for shared memory allocation and for locks and barriers on shared data. It is understood that a sequential consistency model is prohibitively expensive, so Treadmarks builds on the "eager" release protocol of Munin by making it lazy: instead of broadcasting changes when locks are released, only the new holder of the lock receives the changes. As a result, less bandwidth is used. Because they assume only lock holders will access the data, they can avoid broadcasting invalidate messages. Treadmarks deals with false sharing the same way Munin does: a multiple-writer protocol that allows more than one writer to mutate a page, and sends page diffs instead of whole pages during updates. I am curious as to why they did not attempt something like the sharing annotations in Munin to customize the consistency model. Is the lazy release protocol sufficiently powerful that it exceeds the performance gains possible by having the programmer annotate expected data access patterns? In addition, the performance graphs showing almost-linear speedup do not reveal the memory usage model. It is not clear to me that these applications share at a fine grain; i.e., the communication-to-computation ratio they speak of may not be high. Since this ratio is application dependent, it is unclear how generally DSM can be applied.
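The sensitivity to the communication-to-computation ratio raised in several of the summaries above can be made concrete with a back-of-the-envelope model. All numbers here are illustrative assumptions, not measurements from the paper:

```python
def speedup(n_procs, compute_secs, msgs_per_sec, cost_per_msg):
    """Ideal parallel time plus a fixed per-message software overhead."""
    parallel = compute_secs / n_procs
    overhead = parallel * msgs_per_sec * cost_per_msg
    return compute_secs / (parallel + overhead)

# A coarse-grained application: ~100 messages/sec at ~1 ms each loses little.
coarse = speedup(8, 100.0, 100, 0.001)
# A fine-grained application: ~5000 messages/sec swamps the computation.
fine = speedup(8, 100.0, 5000, 0.001)
assert coarse > 7   # near-linear speedup on 8 processors
assert fine < 2     # overhead-dominated
```

This matches the paper's observation that applications exchanging on the order of a few hundred messages per second fare well, while finer-grained sharing erodes the speedup.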
The authors also argue that the programming model is simpler than message passing, but from their discussion, application performance is highly sensitive to how the computation is divided, both from an algorithmic point of view (e.g., they had to use load balancing in the linkage problem) and from a data access point of view (finer-grain sharing increases communication overhead). I am curious to know whether this is any easier than figuring out when to send messages to whom.
---------------------------------------------------
Return-Path: echris@merganser
Received: from merganser.cs.washington.edu (merganser.cs.washington.edu [128.95.2.192]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id IAA00696 for ; Tue, 4 Feb 1997 08:22:00 -0800
Received: (echris@localhost) by merganser.cs.washington.edu (8.8.3+CSE/7.2ws+) id IAA00699; Tue, 4 Feb 1997 08:21:59 -0800 (PST)
Date: Tue, 4 Feb 1997 08:21:59 -0800 (PST)
Message-Id: <199702041621.IAA00699@merganser.cs.washington.edu>
From: E Christopher Lewis
To: bershad@cs
Subject: Treadmarks 552-reading summary
Reply-To: echris@cs.washington.edu

This paper summarizes yet another DSM implementation. TreadMarks may be distinguished from other work in the following ways: the target platform is NOWs, it is software-based, the implementation requires no kernel mods (achieving portability?), it distinguishes shared and non-shared memory, and it uses lazy release consistency and a multiple-writer protocol. The paper's value is as a discussion of the above details (in particular, consistency models) in a single system. I will now take issue with the motivation for DSM that appears in the abstract and introduction. The authors argue that DSM can play an important role in incremental parallelism, for most sequential data structures can remain unchanged after parallelization (the usual stuff). DSM is dangerous. Though it provides a convenient mechanism for communication, it is often sold as a way to ignore locality.
Locality is so important in a NOW that algorithm changes are often necessary to achieve good performance. The incremental dream does not allow for this reality.
---------------------------------------------------
Return-Path: govindk@shasta.ee.washington.edu
Received: from june.cs.washington.edu (june.cs.washington.edu [128.95.2.4]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id IAA00900 for ; Tue, 4 Feb 1997 08:36:15 -0800
Received: from shasta.ee.washington.edu (shasta.ee.washington.edu [128.95.28.11]) by june.cs.washington.edu (8.8.5+CS/7.2ju) with SMTP id IAA21264 for ; Tue, 4 Feb 1997 08:36:15 -0800
Received: from andes.ee.washington.edu by shasta.ee.washington.edu (4.1/SMI-4.1) id AA01855; Tue, 4 Feb 97 08:37:27 GMT
Received: from andes by andes.ee.washington.edu (SMI-8.6/SMI-SVR4) id IAA03779; Tue, 4 Feb 1997 08:37:00 -0800
Sender: govindk@shasta.ee.washington.edu
Message-Id: <32F765AC.C31@shasta.ee.washington.edu>
Date: Tue, 04 Feb 1997 08:37:00 -0800
From: Govindarajan K
X-Mailer: Mozilla 3.01Gold (X11; I; SunOS 5.5 sun4u)
Mime-Version: 1.0
To: bershad@cs
Subject: Treadmarks 552-reading summary
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

TreadMarks is a distributed shared memory system operating at the user level. The authors, using primitives for locking critical sections, introduce the concept of lazy release consistency to minimize the communication overhead of maintaining consistency in the DSM system. The authors use TreadMarks to implement consistency in a NOW. The traditional method of consistency maintenance is the sequential consistency model. This is a simple model wherein consistency is based on a happened-before paradigm. But the problem is that it is too communication intensive. It also suffers from false sharing (which occurs when two or more unrelated data objects are located in the same page and are written concurrently by separate processors).
The lazy release algorithm releases information about the modifications to a page only to the process that next procures the lock, thereby reducing the communication overhead. The catch is that the releasing processor cannot forget about the modifications made to the page, since another processor may later need this information. The page updates are conveyed in an intelligent fashion: the difference between the modified page and its twin, created in user space, is run-length encoded and transmitted to the next process that needs the page. Note, however, that the TreadMarks paradigm assumes there are no data races among the processes. The authors have compared their approach with previously existing work and shown significant improvement.
---------------------------------------------------
Return-Path: sungeun@wormwood
Received: from wormwood.cs.washington.edu (wormwood.cs.washington.edu [128.95.2.107]) by whistler.cs.washington.edu (8.8.3/7.2ws+) with ESMTP id IAA00991 for ; Tue, 4 Feb 1997 08:45:06 -0800
Received: (sungeun@localhost) by wormwood.cs.washington.edu (8.8.3+CSE/7.2ws+) id IAA00714; Tue, 4 Feb 1997 08:45:05 -0800 (PST)
Date: Tue, 4 Feb 1997 08:45:05 -0800 (PST)
Message-Id: <199702041645.IAA00714@wormwood.cs.washington.edu>
From: Sung-Eun Choi
To: bershad@cs
Subject: Treadmarks 552-reading summary
Reply-To: sungeun@cs.washington.edu

TreadMarks is a distributed shared memory system for networks of workstations. The system leverages virtual memory protection to detect accesses to shared data. Consequently, no hardware was built and no kernels were hacked. Two performance issues arise when implementing such a system. The first is providing shared data consistency; the second, avoiding false sharing. The "natural" model for providing consistency is to implement sequential consistency, i.e., there is a total ordering on all memory accesses.
Though relatively simple to implement, sequential consistency generally results in a large volume of mostly unnecessary network traffic. By relaxing the consistency requirements to be sensitive to synchronization, network traffic is greatly reduced and performance improved. False sharing is a by-product of using the VM protection mechanisms to detect accesses. In a naive system, when different processors access different shared variables incidentally located on the same page, much unnecessary updating of the data is generated. The TreadMarks system allows multiple writers per page and then merges the updates when consistency must be realized. The system provides a shared memory abstraction to the programmer via two mechanisms: shared memory allocation and synchronization. I cannot comment on whether or not I think this is a sufficient API, for I do not believe that shared memory is an appropriate programming model for parallel applications. While they claim that shared memory "facilitates the transition from sequential to parallel programs", unless they mean poorly performing parallel programs, I do not believe this statement. Memory allocation and synchronization ignore locality issues, which are key in distributed parallel applications. Moreover, this "transition" that they speak of requires identifying the "possible sources of parallelism" in the sequential program, a not-so-easy task for which many have developed tools.
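One of the summaries above notes that TreadMarks run-length encodes the difference between a modified page and its twin before transmission. A toy sketch of that idea; the exact wire format here is an assumption, encoding each dirty region as an (offset, bytes) run:

```python
def rle_diff(twin, page):
    """Encode contiguous changed regions as (offset, bytes) runs;
    unchanged bytes cost nothing on the wire."""
    runs, i, n = [], 0, len(page)
    while i < n:
        if page[i] != twin[i]:
            start = i
            while i < n and page[i] != twin[i]:
                i += 1
            runs.append((start, bytes(page[start:i])))
        else:
            i += 1
    return runs

def apply_runs(page, runs):
    """Replay the encoded runs onto another copy of the page."""
    for offset, data in runs:
        page[offset:offset + len(data)] = data

# One small dirty region on a mostly clean page encodes as a single run.
twin = bytes(16)
page = bytearray(16)
page[4:7] = b"\x01\x02\x03"
runs = rle_diff(twin, page)
assert runs == [(4, b"\x01\x02\x03")]

copy = bytearray(16)
apply_runs(copy, runs)
assert copy == page
```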