

## **Basic Correctness Properties**

- Deadlock-free -- no cyclic buffer dependencies
- Livelock-free -- controllers never preemptively steal resources from each other indefinitely without any of them completing
- Starvation-free -- no process fails to make progress while others do
- Cache Coherence
- Possibly Sequential Consistency

 

## Basic Assumptions of Design

- Single Level Cache
- Transactions on bus atomic
- Cache can stall process to perform multiaction updates -- makes actions look atomic

## Cache Tags and Controller

- Standard bus operations from cache controller
  - Assert request
  - Wait for bus grant
  - Drive address and command
  - Wait for command to be accepted
  - Transfer data



## Reporting Snoop Results

- The snoopers must come to some "decision" about bus transactions so memory can know if it's supposed to deliver data ... when and how?
- Fixed delay counted in clock cycles
  - Snoopers check their tag set -- could be locked because processor is updating
  - Add ability to extend
  - Fixed delay may not be conservative but it works
  - Pentium Pro, HP and Sun processors use this

© Copyright, Lawrence Snyder, 1999

# Reporting Snoop Results, II

- Variable delay -- memory assumes the caches will deliver until all caches have said they won't
  - Allows variable amount of time for snooper to reply, say because it is locked out by processor
  - SGI Challenge uses variable delay, but with speculative access
- Memory can keep a bit per block indicating whether the block is dirty in a cache
  - Doesn't need snoopers, but uses memory






#### Atomicity


- Even with an atomic bus, the protocol requires multiple operations by multiple controllers, and multiple requests can be outstanding at once
- Example: P1 wants to perform a BusRdX but cannot get access; meanwhile P2 is performing a BusRd on data P1 has modified

Consider two caches simultaneously issuing a write to the same block, which both hold shared

- P1 promotes S-->M and issues an upgrade
- P2 does too, but wins arbitration
- P1 downgrades M-->I, but its upgrade request is still outstanding
- P1's request is revised to be a BusRdX
- Therefore, a controller must snoop against its own outstanding requests
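
The upgrade race can be traced with a small model. This is a minimal sketch, assuming a simplified MSI state machine for a single block; the `Controller` class, its `pending` field, and the `snoop` and `granted` methods are hypothetical names introduced for illustration, not part of any real controller design.

```python
class Controller:
    """Toy MSI cache controller for one block (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.state = "S"        # both caches start with the block shared
        self.pending = None     # bus request waiting for arbitration

    def processor_write(self):
        # A write hit on a shared block needs ownership: queue an upgrade.
        if self.state == "S":
            self.pending = "BusUpgr"

    def snoop(self, request):
        # Another controller's upgrade or read-exclusive invalidates our copy.
        if request in ("BusUpgr", "BusRdX"):
            self.state = "I"
            # Snoop against our own outstanding request: an upgrade is no
            # longer valid because we lost the data, so ask for the block.
            if self.pending == "BusUpgr":
                self.pending = "BusRdX"

    def granted(self):
        # Our request wins arbitration and appears on the bus.
        request, self.pending = self.pending, None
        self.state = "M"
        return request


p1, p2 = Controller("P1"), Controller("P2")
p1.processor_write()
p2.processor_write()

# P2 wins arbitration; P1 snoops the upgrade while its own is still queued.
p1.snoop(p2.granted())
revised = p1.pending

# P1 eventually wins the bus; P2 snoops and gives up the block.
p2.snoop(p1.granted())
final_states = (p1.state, p2.state)
```

Without the check inside `snoop`, P1 would later place a stale upgrade on the bus for a block it no longer holds.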



## Serialization

- To speed writes, it may seem smart to let the processor go while the snooper is getting exclusive access to block and possibly filling it
- But other writes might be asserted in this interval trashing coherence if write to same block, or SC if write to any block -- vulnerability
- To be conservative, a processor has to be stalled until the BusRdX is complete and the write is visible to other processors


## But There Is An Optimization

- Relax "completed" to "committed"
- Then, it is sufficient to be asserting exclusive ownership for writes on the bus, since all caches will see that event, the serializing event
- This is sufficient for
  - Coherence
  - Sequential Consistency
- Notice that write-backs are really separate and need not be ordered

#### Fetch Deadlock, Write Livelock

- Fetch deadlock: two controllers each have data to service the other's request, but neither does so until its own request is fulfilled
  - Fix: Service incoming requests while waiting for yours
- Write livelock: in an invalidation protocol, consider all processors trying to write to one location ... by the time a processor has the block in its cache and is ready to write, it has been invalidated by some other processor
  - Fix: Let a processor write if it is granted exclusive ownership

## Implementing Test&Set

- Two operations: read and write
- Should the lock be cacheable?
  - Yes, get locality and spin in cache
  - No, get faster response
- To get atomicity, lock down the bus between the read and write components
- Sweeter solution: Read the value exclusively, but don't yield exclusivity until the write is done
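
The "sweeter solution" can be imitated in software. The following is a minimal sketch, assuming a `threading.Lock` stands in for exclusive ownership of the cache block; `LockWord`, `acquire`, `release`, and `shared` are hypothetical names, and real hardware does this in the cache controller rather than in code.

```python
import threading
import time


class LockWord:
    """A memory word supporting an atomic test&set (software model)."""

    def __init__(self):
        self.value = 0
        # Stands in for exclusive ownership of the cache block.
        self._exclusive = threading.Lock()

    def test_and_set(self):
        # Hold exclusivity across both the read and the write components.
        with self._exclusive:
            old = self.value      # read component
            self.value = 1        # write component
            return old

    def clear(self):
        with self._exclusive:
            self.value = 0


def acquire(lock):
    # Spin until test&set returns 0, meaning this caller set the lock.
    while lock.test_and_set() == 1:
        time.sleep(0)             # yield so the holder can run


def release(lock):
    lock.clear()


lock = LockWord()
shared = {"count": 0}


def worker(iterations):
    for _ in range(iterations):
        acquire(lock)
        shared["count"] += 1      # critical section
        release(lock)


threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the read and the write happen under one period of exclusivity, no two callers can both observe 0, which is the property the bus lock-down provides at higher cost.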

# Shared Memory Without A Bus

- The bus is a centralized point where writes and reads can be serialized
- How are coherency and sequential consistency achieved without a bus?
- It is possible to broadcast, but this is both expensive and potentially very complicated
- Directory-based cache coherence is the solution




## Preliminaries

- The machines being considered belong to the distributed shared memory (DSM) class
- The subclass is CC-NUMA: cache coherent, non-uniform memory access
- On an access fault by the processor ...
  - Find out information about the state of the cache block in other machines
  - Determine exact location of copies, if necessary
  - Communicate with other controllers to implement the protocol


### Terminology

- Home node, node whose main memory has block allocated
- Dirty node, node with modified value
- Owner, node holding valid copy, usually the home or dirty node
- Exclusive node, holds only valid cached copy
- Requesting node, (local) node asking for block
- Locally allocated / remotely allocated

## Sample Directory Scheme

- Local node has access fault
- Sends request to home node for directory info
  - Read -- directory tells which node has valid data
    - Data is requested
  - Write -- directory tells nodes with copies
    - Invalidation or update requests are sent
- Acknowledgments are returned
- Processor waits for all ACKs for completion

Notice that many transactions can be "in the air" at once, possibly leading to races






#### A Closer Look I (Write)

- On a write access fault at P<sub>x</sub>, the local directory controller determines whether the block is locally or remotely allocated; if remote, it finds the home
- Controller sends request to home node for the block
- Home controller looks up directory entry of the block
  - Dirty bit OFF, the home has a clean copy
    - Home node sends data to P<sub>x</sub> with presence vector
    - Home controller clears directory, sets x<sup>th</sup> bit ON and sets dirty bit ON
    - P<sub>x</sub> controller sends invalidation requests to all nodes listed in the presence vector
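
The clean-copy case can be sketched in a few lines. This is an illustrative model only, assuming one presence bit per node and a single dirty bit; `DirectoryEntry` and `handle_write_fault` are hypothetical names, and the dirty case is deliberately left out.

```python
class DirectoryEntry:
    """Bit-vector directory entry for one block at its home node."""

    def __init__(self, num_nodes):
        self.presence = [False] * num_nodes   # one bit per node
        self.dirty = False


def handle_write_fault(entry, requester):
    """Home-side handling of a write fault when the dirty bit is OFF.

    Returns the list of nodes the requester must invalidate.
    """
    if entry.dirty:
        raise NotImplementedError("dirty case is not covered by this sketch")
    # Presence vector sent with the data: every sharer except the requester.
    sharers = [n for n, bit in enumerate(entry.presence)
               if bit and n != requester]
    # Clear the directory, then record the requester as the dirty owner.
    entry.presence = [False] * len(entry.presence)
    entry.presence[requester] = True
    entry.dirty = True
    return sharers


entry = DirectoryEntry(num_nodes=8)
for node in (1, 3, 6):
    entry.presence[node] = True

to_invalidate = handle_write_fault(entry, requester=3)
acks_expected = len(to_invalidate)
```

The requester would then wait for `acks_expected` acknowledgments before treating the write as complete.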



### Alternative Directory Schemes

- The "bit vector directory" is storage-costly
- Consider improvements to Mblk\*P cost
- Increase block size, cluster processors
- Just keep list of processor IDs of sharers
  - Need overflow scheme
  - Five slots suffice
- Link the shared items together
  - Home keeps the head of list
  - List is doubly-linked
  - New sharer adds self to head of list
  - Obvious protocol suffices, but watch for races
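
The list-of-sharers idea with an overflow scheme might look like the sketch below. It assumes broadcast on overflow, which is only one of several possible overflow schemes; `LimitedPointerEntry` and its methods are hypothetical names chosen for this example.

```python
class LimitedPointerEntry:
    """Directory entry holding up to `slots` sharer IDs (hypothetical design)."""

    def __init__(self, slots=5):
        self.slots = slots
        self.sharers = []
        self.overflow = False    # set once more sharers exist than slots

    def add_sharer(self, node):
        if self.overflow or node in self.sharers:
            return
        if len(self.sharers) < self.slots:
            self.sharers.append(node)
        else:
            # Overflow scheme assumed here: give up precision and broadcast.
            self.overflow = True
            self.sharers = []

    def invalidation_targets(self, writer, num_nodes):
        if self.overflow:
            return [n for n in range(num_nodes) if n != writer]
        return [n for n in self.sharers if n != writer]


entry = LimitedPointerEntry()
for node in (2, 4, 7):
    entry.add_sharer(node)
precise = entry.invalidation_targets(writer=4, num_nodes=16)

for node in (8, 9, 10):          # the sixth sharer overflows the five slots
    entry.add_sharer(node)
broadcast = entry.invalidation_targets(writer=4, num_nodes=16)
```

While the sharers fit in the slots, invalidations go only to the listed nodes; after overflow, every other node is targeted.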


#### Assessment

- An obvious difference between directory and bus solutions is that, for directories, the number of invalidate requests scales with the number of processors that are sharing
- Directories take memory --
  - 1 bit per block per processor + c
  - If a block is B bytes (8B bits), 8B processors imply 100% overhead to store the directory
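
The storage arithmetic can be checked directly. A minimal sketch, assuming the constant c is passed as `extra_bits` and defaults to zero; the function name and the sample block size are illustrative choices, not figures from the measurements.

```python
def directory_overhead(block_bytes, processors, extra_bits=0):
    """Directory bits per block divided by data bits per block."""
    data_bits = 8 * block_bytes
    directory_bits = processors + extra_bits   # 1 presence bit per processor + c
    return directory_bits / data_bits


# A 32-byte block holds 256 data bits, so 256 processors cost 100%.
full = directory_overhead(block_bytes=32, processors=256)
# The same block shared among 64 processors costs 25%.
quarter = directory_overhead(block_bytes=32, processors=64)
```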


## Performance Data

To see how much sharing takes place and how many invalidations must be sent, experiments were run

- Summarizing the data
  - Usually, there are few sharers
  - The mode is 1 other process sharing, ~60
  - The "tail" of the distribution stretches out for some applications
- Remote activity increases with the number of processors
- Larger block sizes increase traffic, 32 is good





### Higher Level Optimization


- Organizing nodes as SMPs with one coherent memory and one directory controller can improve performance since one processor might fetch data that the next processor wants ... it is already present
- The main liability is that the controller resource, and probably its channel into the network, are shared


# Serialization

- The bus defines the ordering on writes in SMPs
- For directory systems, memory (home) does
- If home always had the value, FIFO would work
  - Consider a block in modified state and two nodes requesting exclusive access in an invalidation protocol. The requests reach home in one order, but they could reach the owner in a different order. Which order prevails?
- Fix: Add a "busy state" indicating a transaction is in flight


## Four Solutions To Ensure Serialization

- Buffer at home -- keep request at home, service in order ... lower concurrency, overflow
- Buffer at requesters with linked list
- NACK and retry -- when directory is busy, just "return to sender"
- Forward to dirty node -- serialize at home for clean, serialize at dirty node otherwise
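
NACK and retry is the simplest of the four to model. The sketch below assumes a single block, a single busy bit at the home, and a transaction that always finishes before the next retry arrives; `Home`, `request_exclusive`, and `complete` are hypothetical names for illustration.

```python
from collections import deque


class Home:
    """Home directory for one block with a busy state (illustrative)."""

    def __init__(self):
        self.busy = False
        self.owner = None
        self.order = []          # order in which requests were serialized

    def request_exclusive(self, node):
        if self.busy:
            return "NACK"        # transaction in flight: return to sender
        self.busy = True
        self.order.append(node)
        return "ACCEPT"

    def complete(self, node):
        # Ownership transfer finished; the directory is stable again.
        self.owner = node
        self.busy = False


home = Home()
waiting = deque(["A", "B"])      # both want exclusive access
in_flight = None
nacks = 0

while waiting or in_flight:
    if waiting:
        node = waiting.popleft()
        if home.request_exclusive(node) == "ACCEPT":
            in_flight = node
            continue
        nacks += 1
        waiting.append(node)     # requester retries later
    # Model the in-flight transaction finishing before the next retry.
    if in_flight:
        home.complete(in_flight)
        in_flight = None
```

The home alone decides the order, here A then B, and the cost is the extra traffic from each rejected request.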

## Origin 2000

- Intellectual descendant of Stanford DASH
- Two processors per node
- Caches use MESI protocol
- Directory has 7 states
  - Stable: unowned, shared, exclusive (clean/dirty in cache)
  - Busy states: Processor not ready to handle new requests to that block: read, readex, uncached
  - Poison: used for other purposes
- Directory uses extended bit-vector
- HUB is the interface

## Origin-2000 Directory

- The "approximate" bit-vector solution
  - Two processors / node
  - Scaling beyond 64 processors necessary
- Three interpretations are possible
  - Exclusive state: bits are processor address
  - Two sizes -- 16-bit and 64-bit vectors
  - Coarse vector -- P/64 nodes are grouped
  - The last two schemes are dynamically selected in large configurations
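
The coarse-vector grouping can be illustrated as follows. This is a sketch under stated assumptions: nodes are grouped contiguously and the group size is the node count divided by the vector width, rounded up. The helper names and the 256-node configuration are hypothetical and are not taken from the Origin 2000 documentation.

```python
def coarse_bit(node, num_nodes, vector_bits=64):
    """Index of the coarse-vector bit that covers `node`."""
    group_size = -(-num_nodes // vector_bits)   # ceiling division
    return node // group_size


def nodes_for_bit(bit, num_nodes, vector_bits=64):
    """All nodes that must be invalidated when `bit` is set."""
    group_size = -(-num_nodes // vector_bits)
    start = bit * group_size
    return list(range(start, min(start + group_size, num_nodes)))


# With 256 nodes and a 64-bit vector, each bit covers 4 nodes.
bit = coarse_bit(node=130, num_nodes=256)
group = nodes_for_bit(bit, num_nodes=256)
```

The trade-off is precision: a single sharer in a group causes invalidations to be sent to every node in that group.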


## Specific Choices

- Generally the Origin 2000 follows the protocols discussed with minor variations and optimizations
- The specifics are interesting because they emphasize two points:
  - The basic ideas discussed really apply
  - Many simplifying assumptions must be revisited to get a system built and deployed
