Today: caching, coherence, and consistency

Caching is a fundamental idea in distributed systems [and elsewhere]
- keep a duplicate copy of data somewhere faster
- challenge: how do we keep the cached copy consistent with the master?
  - ...ideally so that a user couldn't tell the cache was even there? (performance aside)
  - what does that even mean?

Why do we want caching?
- reduce load on a bottleneck service
- better latency
- higher-level picture: move data to where we want to compute instead of sending the computation to the data (RPC)

Motivating example: Web service with two-tier architecture
- front-end receives requests from clients, generates webpages
- backend database stores all the state needed
- this (stateless front-end servers) is a common design pattern
  - don't need to worry about front-end failures
  - all data is stored in the DB, so just need to get durability/consistency right there

Problem: all lookups have to go to the database
- high latency and potential bandwidth limit

Solution: add a cache on the same machine as the front-end
- store results from the database in memory
- why did this help?
  - don't need to access the network on every request
  - hopefully there's locality in the access pattern, so caching a small piece of the DB satisfies most requests
  - the cache is probably faster (latency & throughput capacity) because it's storing data in memory instead of on disk

[aside: lots of other things you might want to cache]
- file data in NFS
- DNS
- view information in lab 2
- ...

Cache details:
- What do we do with writes?
  - update cache first, then update database
    - synchronously (write-through): safe but probably slow
    - asynchronously (write-back): faster but could lose data if the cache crashes
- What if the cache runs out of space?
  - throw data away (e.g., LRU replacement), don't worry about it
- Does this cache behave the way we'd like it to?
  - Can you tell that the cache is there?
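The write-through/write-back distinction can be sketched with a toy single-entry cache in front of a "database" (here just an array; all names are illustrative, not from any real system):

```c
// Hypothetical single-entry cache sitting in front of a backend database.
typedef struct { int key, value, valid, dirty; } cache_t;

int db[16];  // stands in for the backend database, all zeros initially

// Write-through: update the cache AND the database synchronously.
// Safe (the DB always has the latest value) but pays a DB round trip per write.
void write_through(cache_t *c, int key, int value) {
    c->key = key; c->value = value; c->valid = 1; c->dirty = 0;
    db[key] = value;
}

// Write-back: update only the cache and mark the entry dirty; the database
// is updated later, on eviction. Faster, but a cache crash loses the write.
void write_back(cache_t *c, int key, int value) {
    c->key = key; c->value = value; c->valid = 1; c->dirty = 1;
}

// Eviction must flush dirty data to the database before discarding it.
void evict(cache_t *c) {
    if (c->valid && c->dirty) db[c->key] = c->value;
    c->valid = c->dirty = 0;
}
```

The window between `write_back` and `evict` is exactly where data can be lost if the cache crashes.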
Coherence & consistency:
- strict coherence: the value returned by a read operation is always the same as the value most recently written to that object
  - sometimes this gets called consistency, unfortunately
- we'll define coherence: properties about the behavior exhibited by multiple reads/writes to the *same* address
- and consistency: properties about the behavior exhibited by multiple reads/writes to *different* addresses
- unfortunately, this all gets muddled together, including in the papers we read. hopefully it's clear from context, but not always...

Revisiting the single-client cache:
- Are there coherence problems?
- (perhaps surprisingly) no
  - as long as all writes go to the cache first and all reads check there first, we always see the latest write

Now let's make it harder...
- we want to have more front-end servers to scale up the system
- each front-end server has its own cache
- suppose we just use the same protocol as before
- does this still provide coherence? what goes wrong?
  - A writes a new value; B reads its old cached value -- and potentially never gets updated!
  - A and B both write simultaneously; B's value hits the DB last so it overwrites A's, but A and B each have their own write in their cache
  - even worse: unpredictable behavior, because of evictions from the caches

How could we fix this?

Idea: send invalidations (or updates) to the caches as well as the database
- do we update the other caches first or the DB?
- either way, we're still in trouble:
  - if we update the DB first:
    - A updates the DB, B reads from its cache before the invalidation arrives
  - if we update the other caches first:
    - A and B concurrently write the object; A updates the DB first, but B's update reaches the caches first

Idea: lock the bus while we send invalidations
- when A writes X:
  - A notifies all caches and the DB not to allow access to X, waits for acks
  - A updates the DB, then updates the caches, waits for acks
  - A releases the lock on X
- Does this work?
  - Yes.
  No concurrent accesses, and after the update, all caches have the same value as the DB
- Is it efficient?
  - Nope!
  - Maybe not so bad on physical hardware where locking the bus has low-level meaning... but even modern processors don't do this

Better idea: exclusive ownership
- basic idea: at most one cache can have dirty data at any point
- keep track of states: invalid (no cached data), exclusive (can be dirty), shared (know no one has a dirty copy)
  - X has exclusive access => no one else has a shared copy
- state machines
- How do we transition to exclusive state?
  - send an RPC to everyone else, wait for responses
  - if they were in the shared state, they go to invalid
  - if they were in the exclusive state: they write back their changes, then go to invalid
  - either way, they record the new node as the exclusive owner
  - can't acquire shared unless no node is in exclusive
- Does this work?

Observation: strict coherence comes with a performance cost

What if we wanted something cheaper?
- Maybe it's ok for us to see a value as long as it hadn't been replaced more than 10 seconds ago?
- Maybe we're ok with any old value as long as it's not from before our last update?
- You can define an infinite number of possibilities...

Back to coherence vs consistency
- Coherence: properties about behavior exhibited by accesses to a single object
- Consistency: properties about behavior exhibited by multiple accesses to different objects

Example: Suppose we have 3 nodes, and they are sending results to a single backend store. Then we can make it more complex by adding sharding to the store. And caches to the nodes.

Note: this is written as C code, but these could be PUTs and GETs in Lab 2 in place of loads/stores on a processor.
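The invalid/shared/exclusive states above form a small per-cache state machine. A minimal sketch of the transition function (the event names are made up for illustration; a local write is assumed to have already invalidated everyone else, and the variant where an exclusive holder drops all the way to invalid on a remote read, as in Ivy, would differ in one case):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } cstate;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } cevent;

// Next state of one cache's entry for an object. "Remote" events are the
// RPCs another node sends when it acquires shared or exclusive access.
cstate next_state(cstate s, cevent e) {
    switch (e) {
    case LOCAL_READ:   return s == INVALID ? SHARED : s;    // fetch a copy on miss
    case LOCAL_WRITE:  return EXCLUSIVE;                     // others invalidated first
    case REMOTE_READ:  return s == EXCLUSIVE ? SHARED : s;   // write back dirty data, demote
    case REMOTE_WRITE: return INVALID;                       // write back if dirty, drop copy
    }
    return s;
}
```

The invariant to check against the notes: no state reachable here lets two caches hold EXCLUSIVE at once, because every remote acquisition of exclusive access drives everyone else to INVALID.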
node0:
  v0 = f0();
  done0 = true;

node1:
  while (done0 == false)
    ;
  v1 = f1(v0);
  done1 = true;

node2:
  while (done1 == false)
    ;
  v2 = f2(v0, v1);

Intuitive intent: node2 should execute f2() with the results from node0 and node1
- waiting for node1 implies waiting for node0

Problem A:
- Suppose every operation is done in order: you wait for each operation to complete before moving to the next one. And the data is all stored on the same server.
  - Then we know v0 is written if done0 is written.
- What if we want to speed things up? After all, we can issue the RPC to write done0 before waiting for the v0 RPC to complete.
  - We're still ok if the RPCs are processed one at a time, in the order sent (e.g., with client-specific sequence #'s).
  - This has a name: events occur in "processor order"
- Now suppose v0 and done0 are stored on different shards. Where a value is stored shouldn't make any difference, right?
  - Still OK if node1 sees the writes in the order they were issued.
  - But suppose we don't wait for each store to complete before moving to the next operation. Then node1 *might* observe done0 being true *before* v0 is initialized.
  - We can prevent this problem by slowing everything down to a crawl: issue one write, and wait for it to complete before issuing the next write.
- What if we have caches? The copy of the data might be out of date -- the update might not have reached the cache yet. So if a node reads that copy, it will see the old value.
  - And there might be a cached copy of v0 but not done0 -- node1 might then see done0 as true (it's up to date) even when v0 is not up to date.

Problem B: CPU2 may see CPU1's writes before CPU0's writes
- i.e., CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes
- Example: suppose we try to keep caches up to date by sending the new data to every node? Does that help?
  - No: the order of arrival might differ on the different nodes.
  - Rather, we need to apply the writes in the same order everywhere.
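Problem A can be simulated directly: treat the two shards as two variables and deliver node0's writes in the wrong order (the names here are illustrative, and the "network" is just the order in which we call the delivery functions):

```c
// The two shards, each holding one of node0's variables.
int shard_v0 = 0;     // result of f0(), initially unset (0)
int shard_done0 = 0;  // flag, initially false

// node0 issued its writes in program order (v0 first, then done0),
// but the network delivers them to the two shards in the opposite order:
void deliver_write_done0(void) { shard_done0 = 1; }
void deliver_write_v0(void)    { shard_v0 = 42; }

// node1's logic: once done0 looks true, go ahead and read v0.
int node1_read_v0(void) {
    return shard_done0 ? shard_v0 : -1;  // -1 means "still waiting"
}
```

If `deliver_write_done0` runs before `deliver_write_v0`, node1 sees done0 true while v0 is still the uninitialized 0 -- exactly the anomaly the notes describe.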
Behavior of this example depends on the memory model:
- weakly consistent
- eventually consistent
- serializable
- linearizable

Strongest model: linearizability
- a memory system is linearizable iff every processor sees updates in the same order that they actually happened in real time
  - i.e., sees the result of the most recent write that finished before its read started
- captures the fact that operations take some amount of time
  - each operation "actually" took place at some unknown point on the timeline between when it started and finished

P1: W(x)1
P2:     R(x)0  R(x)1
  linearizable!

P1: W(x)1
P3:     W(x)2
P2:            R(x)2  R(x)2
  also linearizable

P1: W(x)1
P3:     W(x)2
P2:            R(x)1  R(x)1
  not linearizable. how could this happen? caching at P2
  implementation challenge: even though there's no explicit communication between P3 and P2, P2 needs to see P3's value

Slightly less strong model: sequential consistency (serializability)
- as though all operations from all processors were executed in some sequential order
- operations by each individual processor appear in that sequence in program order (i.e., in the order they were executed on that processor)
- no "real time" constraint as in linearizability

P1: W(x)1
P3:     W(x)2
P2:            R(x)2  R(x)2
  linearizable, so it's serializable

P1: W(x)1
P3:     W(x)2
P2:            R(x)1  R(x)1
  not linearizable, but serializable: W(x)1 R(x)1 R(x)1 W(x)2 is a valid order

How to implement sequential consistency?
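The "valid order" claim above can be checked mechanically. A minimal sketch: given a candidate total order of operations on a single register x (initially 0), the order is a legal serialization if every read returns the most recently written value. (This assumes the candidate order already respects each processor's program order; the struct and function names are made up here.)

```c
typedef struct { char op; int val; } op_t;  // op is 'W' or 'R'

// Returns 1 if `order` is a legal sequential execution for a single
// register initialized to 0: each read must see the latest preceding write.
int valid_serialization(const op_t *order, int n) {
    int x = 0;
    for (int i = 0; i < n; i++) {
        if (order[i].op == 'W') x = order[i].val;
        else if (order[i].val != x) return 0;  // read saw the wrong value
    }
    return 1;
}
```

For the example above, the order W(x)1 R(x)1 R(x)1 W(x)2 passes this check, while any order forced to put W(x)2 before the two R(x)1 reads fails it.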
Requirement 1: Program order requirement
- each process must ensure that its previous memory operation completes before starting the next memory operation in program order
- needs to get an ack back from memory/caches
- cache-based systems: a write must generate invalidate messages for all cached copies
  - the write is complete only once the invalidates are acked

Requirement 2: Write atomicity
- writes to the same location must be serialized, i.e., there is one definite order of the writes, and they are made visible in that same order to all processors
- the value of a write can't be returned by any read until the write completes: all invalidates acked

Causal consistency:
- a read returns a causally consistent version of the data
  - after receiving a message M from a node, reads will return all updates that node made prior to sending M
  - if write(X) happens-before read(X), the read will see the effects of that write
  - and this cascades: happens-before is transitive

Is this weaker than sequential consistency?
- Yes!
- Don't need to decide on an order for causally unrelated writes
- Can build a system that does not coordinate on causally unrelated writes
  - In particular, if two nodes do not communicate with each other (say, in case of a partition), we can still ensure causal consistency
  - strongest level where this is true
  - relevant to disconnected operation and weak-consistency storage systems
  - we'll look at some examples of systems that provide this later

P1: W(x)1
P3:     W(y)2
P2:            R(y)2  R(x)0
  yes, causally consistent -- also sequentially consistent
  no causal connection between P1's write and the others

P1: W(x)1
P3:     R(x)1  W(y)2
P2:                   R(y)2  R(x)0
  no, not causally consistent -- P3 saw P1's W(x) before writing y, but P2 doesn't

Weaker consistency:
- weak consistency: anything goes
- eventual consistency:
  - if all writes stop, eventually the system will converge to a consistent state, and reads will return the same value
  - in the meantime, anything goes
- eventual consistency is popular: Redis, Cassandra, MongoDB, etc.
  - why?
  performance

----------

Ivy DSM

Distributed shared memory
- build a runtime environment where many machines share memory
- make a distributed system look like a giant uniprocessor
- why? simplicity
  - don't worry about who you're communicating with
  - don't have to explicitly send messages or worry about message timing -- the coherence system will take care of it
  - could potentially even use an existing non-distributed system
- cheaper than a giant multiprocessor
- this idea goes in and out of style

Approach:
- use h/w virtual memory and protection to make DSM transparent
- recall: the h/w MMU installs page-granularity mappings from virtual addresses to physical
  - including read and write permissions
  - a permission violation leads to a trap to the OS
- here, exploit this to fetch pages remotely / run the cache coherence protocol

Uses exactly the protocol we saw before
- on a read to an invalid page: trap to Ivy
  - Ivy asks the manager for read access
  - if someone has the page in exclusive mode, have them relinquish it and send the page
  - otherwise, send a copy from the manager
  - add the node to a list of readers
  - node installs a read-only mapping
- on a write to an invalid page: trap to Ivy
  - ask the manager for write access
  - the manager invalidates all cached copies [and waits for acks]
  - sends the latest copy
  - node installs a read/write mapping

Granularity of coherence
- h/w systems: usually one cache line, ~64 bytes
- here: one page (4KB)
- why the difference?
  - kinda has to be at the VM system's granularity
  - software cache coherence is more expensive, so amortize the work over larger pages
- what could go wrong?
  - false sharing leads to ping-ponging
  - a real problem even on hardware DSM today
  - particularly bad case: synchronization

What semantics does Ivy's memory model provide?
- coherency of individual variables
- what about consistency?
- sequential consistency: it satisfies the two conditions

Design options: [Table 1]
- we just talked about "invalidation"/"centralized manager" = "okay"
- "write broadcast" means broadcast on every write, don't try to track states
- "fixed" would mean each page is owned by a particular processor and we just send reads/writes to it as RPCs
- Distributed manager: partition via hash(page number) -> responsible node
  - no real difference, just better load balancing
- Dynamic distributed manager: have the exclusive owner be the manager
  - other nodes just forward requests on to whoever they think the manager is

What about performance?
- speedup curve
- What would we hope for? N nodes => N * single-node performance
- Why wouldn't it be linear?
  - the algorithm is not fully parallel
  - the DSM system introduces overhead
- What's up with sort?

DSM impact

Grappa

----------

Consensus:
- fundamental problem: getting a group of nodes to agree on a value
- lots of applications: replicated state machines, atomic broadcast, leader election, failure detection...
  - including lab 3
- next week: Paxos; this week, a bit of introduction

Consensus problem:
- multiple processes, each starting with an input value
- processes run some consensus protocol, then output the chosen value once it's complete
- safety:
  - consistency: all non-faulty processes choose the same value
  - validity: the value they chose was proposed by one of the processes (this just rules out vacuous solutions like "always choose 0")
- liveness:
  - termination: eventually all non-faulty processes output a value

Assumptions about the world:
- asynchronous network
  - messages can be delayed indefinitely
  - but messages that are sent repeatedly will eventually be received
- even just one process can crash

FLP result: No deterministic consensus protocol is both safe and live in an asynchronous network where at most one process can crash (!)

Warning: handwaving imminent!
Most handwavy intuition:
- suppose A never receives a message from B (e.g., despite repeated messages)
- is B crashed, or just slow?
- Should A wait?
  - if yes: it might wait forever!
  - if no: maybe B was just slow, and will come to a different decision

More formal model:
- consider executions of a distributed system, i.e., which messages get delivered in which order (and which are delayed)
- bivalent state: a state where the network can still affect which value the processes choose

Proof sketch:
- there are bivalent starting conditions when there can be failures
  - e.g., half the nodes propose 0 and half propose 1
  - suppose half fail immediately -- which value should be chosen depends on which half it was
  - note that this was cheating: FLP holds even when there's only *one* failure -- but the same sort of argument applies
- for any bivalent state, there's some path that leads to another bivalent state
  - intuition: suppose there's some message m that makes the system univalent -- the deciding point for the algorithm. what if we delay it?
  - in fact, eventually we can reach some point where delivering m keeps the system bivalent!
- we can repeat this indefinitely

So what:
- this says the problem is unsolvable in theory
- we still need consensus algorithms!
- change any of the assumptions about the model to avoid the result:
  - ok to be safe but not guarantee termination
  - guarantee termination with high probability
  - bound message delivery time
  - loosely synchronized clocks
  - failure detectors (even weak ones)
  - etc.
- Paxos is safe but doesn't guarantee termination

Why stick with the asynchronous model anyway?
- we could come up with some bound on message delivery that the system essentially never violates, e.g., 1 hour
  - so unlikely to be violated that we might as well not support it -- e.g., if it were less likely than a cosmic ray corrupting our memory, which we don't support either
  - we could come up with an algorithm that uses that fact and is both safe and live
- but that algorithm would probably hard-code the bound somehow, which isn't great
- asynchronous algorithms like Paxos are self-tuning and inherently avoid tail-latency problems