Today: caching, coherence, and consistency

Caching is a fundamental idea in distributed systems [and elsewhere]
- keep a duplicate copy of data somewhere faster
- challenge: how do we keep the cached copy consistent with the master?
  - ...ideally so that a user couldn't tell the cache was even there? (performance aside)
  - what does that even mean?

Why do we want caching?
- reduce load on a bottleneck service
- better latency
- higher-level picture: move data to where we want to compute instead of sending the computation to the data (RPC)

Motivating example: Web service with two-tier architecture
- front-end receives requests from clients, generates webpages
- backend database stores all the state needed
- this (stateless front-end servers) is a common design pattern
  - don't need to worry about front-end failures
  - all data is stored in the DB, so just need to get durability/consistency right there

Problem: all lookups have to go to the database
- high latency and potential bandwidth limit

Solution: add a cache on the same machine as the front-end
- store results from the database in memory
- why did this help?
  - don't need to access the network on every request
  - hopefully there's locality in the access pattern, so caching a small piece of the DB satisfies most requests
  - the cache is probably faster (latency & throughput capacity) because it's storing data in memory instead of on disk

[aside: lots of other things you might want to cache]
- file data in NFS
- DNS
- view information in lab 2
- ...

Cache details:
- What do we do with writes?
  - update cache first, then update database
    - synchronously (write-through): safe but probably slow
    - asynchronously (write-back): faster but could lose data if the cache crashes
- What if the cache runs out of space?
  - throw data away (e.g., LRU replacement), don't worry about it
- Does this cache behave the way we'd like it to?
  - Can you tell that the cache is there?
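The write-through/write-back distinction can be sketched with a toy single-entry cache in front of a "database" (here just an array; all names are illustrative, not from any real system):

```c
// Hypothetical single-entry cache sitting in front of a backend database.
typedef struct { int key, value, valid, dirty; } cache_t;

int db[16];  // stands in for the backend database, all zeros initially

// Write-through: update the cache AND the database synchronously.
// Safe (the DB always has the latest value) but pays a DB round trip per write.
void write_through(cache_t *c, int key, int value) {
    c->key = key; c->value = value; c->valid = 1; c->dirty = 0;
    db[key] = value;
}

// Write-back: update only the cache and mark the entry dirty; the database
// is updated later, on eviction. Faster, but a cache crash loses the write.
void write_back(cache_t *c, int key, int value) {
    c->key = key; c->value = value; c->valid = 1; c->dirty = 1;
}

// Eviction must flush dirty data to the database before discarding it.
void evict(cache_t *c) {
    if (c->valid && c->dirty) db[c->key] = c->value;
    c->valid = c->dirty = 0;
}
```

The window between `write_back` and `evict` is exactly where data can be lost if the cache crashes.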
Coherence & consistency:
- strict coherence: the value returned by a read operation is always the same as the value most recently written to that object
  - sometimes this gets called consistency, unfortunately
- we'll define coherence: properties about the behavior exhibited by multiple reads/writes to the *same* address
- and consistency: properties about the behavior exhibited by multiple reads/writes to *different* addresses
- unfortunately, this all gets muddled together, including in the papers we read. hopefully it's clear from context, but not always...

Revisiting the single-client cache:
- Are there coherence problems?
- (perhaps surprisingly) no
  - as long as all writes go to the cache first and all reads check there first, we always see the latest write

Now let's make it harder...
- we want to have more front-end servers to scale up the system
- each front-end server has its own cache
- suppose we just use the same protocol as before
- does this still provide coherence? what goes wrong?
  - A writes a new value; B reads its old cached value -- and potentially never gets updated!
  - A and B both write simultaneously; B's value hits the DB last so it overwrites A's, but A and B each have their own write in their cache
  - even worse: unpredictable behavior, because of evictions from the caches

How could we fix this?

Idea: send invalidations (or updates) to the caches as well as the database
- do we update the other caches first or the DB?
- either way, we're still in trouble:
  - if we update the DB first:
    - A updates the DB, B reads from its cache before the invalidation arrives
  - if we update the other caches first:
    - A and B concurrently write the object; A updates the DB first, but B's update reaches the caches first

Idea: lock the bus while we send invalidations
- when A writes X:
  - A notifies all caches and the DB not to allow access to X, waits for acks
  - A updates the DB, then updates the caches, waits for acks
  - A releases the lock on X
- Does this work?
  - Yes.
  No concurrent accesses, and after the update, all caches have the same value as the DB
- Is it efficient?
  - Nope!
  - Maybe not so bad on physical hardware where locking the bus has low-level meaning... but even modern processors don't do this

Better idea: exclusive ownership
- basic idea: at most one cache can have dirty data at any point
- keep track of states: invalid (no cached data), exclusive (can be dirty), shared (know no one has a dirty copy)
  - X has exclusive access => no one else has a shared copy
- state machines
- How do we transition to exclusive state?
  - send an RPC to everyone else, wait for responses
  - if they were in the shared state, they go to invalid
  - if they were in the exclusive state: they write back their changes, then go to invalid
  - either way, they record the new node as the exclusive owner
  - can't acquire shared unless no node is in exclusive
- Does this work?

Observation: strict coherence comes with a performance cost

What if we wanted something cheaper?
- Maybe it's ok for us to see a value as long as it hadn't been replaced more than 10 seconds ago?
- Maybe we're ok with any old value as long as it's not from before our last update?
- You can define an infinite number of possibilities...

Back to coherence vs consistency
- Coherence: properties about behavior exhibited by accesses to a single object
- Consistency: properties about behavior exhibited by multiple accesses to different objects

Example: Suppose we have 3 nodes, and they are sending results to a single backend store. Then we can make it more complex by adding sharding to the store. And caches to the nodes.

Note: this is written as C code, but these could be PUTs and GETs in Lab 2 in place of loads/stores on a processor.
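The invalid/shared/exclusive states above form a small per-cache state machine. A minimal sketch of the transition function (the event names are made up for illustration; a local write is assumed to have already invalidated everyone else, and the variant where an exclusive holder drops all the way to invalid on a remote read, as in Ivy, would differ in one case):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } cstate;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } cevent;

// Next state of one cache's entry for an object. "Remote" events are the
// RPCs another node sends when it acquires shared or exclusive access.
cstate next_state(cstate s, cevent e) {
    switch (e) {
    case LOCAL_READ:   return s == INVALID ? SHARED : s;    // fetch a copy on miss
    case LOCAL_WRITE:  return EXCLUSIVE;                     // others invalidated first
    case REMOTE_READ:  return s == EXCLUSIVE ? SHARED : s;   // write back dirty data, demote
    case REMOTE_WRITE: return INVALID;                       // write back if dirty, drop copy
    }
    return s;
}
```

The invariant to check against the notes: no state reachable here lets two caches hold EXCLUSIVE at once, because every remote acquisition of exclusive access drives everyone else to INVALID.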
node0:
  v0 = f0();
  done0 = true;

node1:
  while (done0 == false)
    ;
  v1 = f1(v0);
  done1 = true;

node2:
  while (done1 == false)
    ;
  v2 = f2(v0, v1);

Intuitive intent: node2 should execute f2() with the results from node0 and node1
- waiting for node1 implies waiting for node0

Problem A:
- Suppose every operation is done in order: you wait for each operation to complete before moving to the next one. And the data is all stored on the same server.
  - Then we know v0 is written if done0 is written.
- What if we want to speed things up? After all, we can issue the RPC to write done0 before waiting for the v0 RPC to complete.
  - We're still ok if the RPCs are processed one at a time, in the order sent (e.g., with client-specific sequence #'s).
  - This has a name: events occur in "processor order"
- Now suppose v0 and done0 are stored on different shards. Where a value is stored shouldn't make any difference, right?
  - Still OK if node1 sees the writes in the order they were issued.
  - But suppose we don't wait for each store to complete before moving to the next operation. Then node1 *might* observe done0 being true *before* v0 is initialized.
  - We can prevent this problem by slowing everything down to a crawl: issue one write, and wait for it to complete before issuing the next write.
- What if we have caches? The copy of the data might be out of date -- the update might not have reached the cache yet. So if a node reads that copy, it will see the old value.
  - And there might be a cached copy of v0 but not done0 -- node1 might then see done0 as true (it's up to date) even when v0 is not up to date.

Problem B: CPU2 may see CPU1's writes before CPU0's writes
- i.e., CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes
- Example: suppose we try to keep caches up to date by sending the new data to every node? Does that help?
  - No: the order of arrival might differ on the different nodes.
  - Rather, we need to apply the writes in the same order everywhere.
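Problem A can be simulated directly: treat the two shards as two variables and deliver node0's writes in the wrong order (the names here are illustrative, and the "network" is just the order in which we call the delivery functions):

```c
// The two shards, each holding one of node0's variables.
int shard_v0 = 0;     // result of f0(), initially unset (0)
int shard_done0 = 0;  // flag, initially false

// node0 issued its writes in program order (v0 first, then done0),
// but the network delivers them to the two shards in the opposite order:
void deliver_write_done0(void) { shard_done0 = 1; }
void deliver_write_v0(void)    { shard_v0 = 42; }

// node1's logic: once done0 looks true, go ahead and read v0.
int node1_read_v0(void) {
    return shard_done0 ? shard_v0 : -1;  // -1 means "still waiting"
}
```

If `deliver_write_done0` runs before `deliver_write_v0`, node1 sees done0 true while v0 is still the uninitialized 0 -- exactly the anomaly the notes describe.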
Behavior of this example depends on the memory model:
- weakly consistent
- eventually consistent
- serializable
- linearizable

Strongest model: linearizability
- a memory system is linearizable iff every processor sees updates in the same order that they actually happened in real time
  - i.e., sees the result of the most recent write that finished before its read started
- captures the fact that operations take some amount of time
  - each operation "actually" took place at some unknown point on the timeline between when it started and finished

P1: W(x)1
P2:     R(x)0  R(x)1
  linearizable!

P1: W(x)1
P3:     W(x)2
P2:            R(x)2  R(x)2
  also linearizable

P1: W(x)1
P3:     W(x)2
P2:            R(x)1  R(x)1
  not linearizable. how could this happen? caching at P2
  implementation challenge: even though there's no explicit communication between P3 and P2, P2 needs to see P3's value

Slightly less strong model: sequential consistency (serializability)
- as though all operations from all processors were executed in some sequential order
- operations by each individual processor appear in that sequence in program order (i.e., in the order they were executed on that processor)
- no "real time" constraint as in linearizability

P1: W(x)1
P3:     W(x)2
P2:            R(x)2  R(x)2
  linearizable, so it's serializable

P1: W(x)1
P3:     W(x)2
P2:            R(x)1  R(x)1
  not linearizable, but serializable: W(x)1 R(x)1 R(x)1 W(x)2 is a valid order

How to implement sequential consistency?
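The "valid order" claim above can be checked mechanically. A minimal sketch: given a candidate total order of operations on a single register x (initially 0), the order is a legal serialization if every read returns the most recently written value. (This assumes the candidate order already respects each processor's program order; the struct and function names are made up here.)

```c
typedef struct { char op; int val; } op_t;  // op is 'W' or 'R'

// Returns 1 if `order` is a legal sequential execution for a single
// register initialized to 0: each read must see the latest preceding write.
int valid_serialization(const op_t *order, int n) {
    int x = 0;
    for (int i = 0; i < n; i++) {
        if (order[i].op == 'W') x = order[i].val;
        else if (order[i].val != x) return 0;  // read saw the wrong value
    }
    return 1;
}
```

For the example above, the order W(x)1 R(x)1 R(x)1 W(x)2 passes this check, while any order forced to put W(x)2 before the two R(x)1 reads fails it.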
Requirement 1: Program order requirement
- each process must ensure that its previous memory operation completes before starting the next memory operation in program order
- needs to get an ack back from memory/caches
- cache-based systems: a write must generate invalidate messages for all cached copies
  - the write is complete only once the invalidates are acked

Requirement 2: Write atomicity
- writes to the same location must be serialized, i.e., there is one definite order of the writes, and they are made visible in that same order to all processors
- the value of a write can't be returned by any read until the write completes: all invalidates acked

Causal consistency:
- a read returns a causally consistent version of the data
  - after receiving a message M from a node, reads will return all updates that node made prior to sending M
  - if write(X) happens-before read(X), the read will see the effects of that write
  - and this cascades: happens-before is transitive

Is this weaker than sequential consistency?
- Yes!
- Don't need to decide on an order for causally unrelated writes
- Can build a system that does not coordinate on causally unrelated writes
  - In particular, if two nodes do not communicate with each other (say, in case of a partition), we can still ensure causal consistency
  - strongest level where this is true
  - relevant to disconnected operation and weak-consistency storage systems
  - we'll look at some examples of systems that provide this later

P1: W(x)1
P3:     W(y)2
P2:            R(y)2  R(x)0
  yes, causally consistent -- also sequentially consistent
  no causal connection between P1's write and the others

P1: W(x)1
P3:     R(x)1  W(y)2
P2:                   R(y)2  R(x)0
  no, not causally consistent -- P3 saw P1's W(x) before writing y, but P2 doesn't

Weaker consistency:
- weak consistency: anything goes
- eventual consistency:
  - if all writes stop, eventually the system will converge to a consistent state, and reads will return the same value
  - in the meantime, anything goes
- eventual consistency is popular: Redis, Cassandra, MongoDB, etc.
  - why?
  performance

----------

Ivy DSM

Distributed shared memory
- build a runtime environment where many machines share memory
- make a distributed system look like a giant uniprocessor
- why? simplicity
  - don't worry about who you're communicating with
  - don't have to explicitly send messages or worry about message timing -- the coherence system will take care of it
  - could potentially even use an existing non-distributed system
- cheaper than a giant multiprocessor
- this idea goes in and out of style

Approach:
- use h/w virtual memory and protection to make DSM transparent
- recall: the h/w MMU installs page-granularity mappings from virtual addresses to physical
  - including read and write permissions
  - a permission violation leads to a trap to the OS
- here, exploit this to fetch pages remotely / run the cache coherence protocol

Uses exactly the protocol we saw before
- on a read to an invalid page: trap to Ivy
  - Ivy asks the manager for read access
  - if someone has the page in exclusive mode, have them relinquish it and send the page
  - otherwise, send a copy from the manager
  - add the node to a list of readers
  - node installs a read-only mapping
- on a write to an invalid page: trap to Ivy
  - ask the manager for write access
  - the manager invalidates all cached copies [and waits for acks]
  - sends the latest copy
  - node installs a read/write mapping

Granularity of coherence
- h/w systems: usually one cache line, ~64 bytes
- here: one page (4KB)
- why the difference?
  - kinda has to be at the VM system's granularity
  - software cache coherence is more expensive, so amortize the work over larger pages
- what could go wrong?
  - false sharing leads to ping-ponging
  - a real problem even on hardware DSM today
  - particularly bad case: synchronization

What semantics does Ivy's memory model provide?
- coherency of individual variables
- what about consistency?
- sequential consistency: it satisfies the two conditions

Design options: [Table 1]
- we just talked about "invalidation"/"centralized manager" = "okay"
- "write broadcast" means broadcast on every write, don't try to track states
- "fixed" would mean each page is owned by a particular processor and we just send reads/writes to it as RPCs
- Distributed manager: partition via hash(page number) -> responsible node
  - no real difference, just better load balancing
- Dynamic distributed manager: have the exclusive owner be the manager
  - other nodes just forward requests on to whoever they think the manager is

What about performance?
- speedup curve
- What would we hope for? N nodes => N * single-node performance
- Why wouldn't it be linear?
  - the algorithm is not fully parallel
  - the DSM system introduces overhead
- What's up with sort?

DSM impact

Grappa

----------

Consensus:
- fundamental problem: getting a group of nodes to agree on a value
- lots of applications: replicated state machines, atomic broadcast, leader election, failure detection...
  - including lab 3
- next week: Paxos; this week, a bit of introduction

Consensus problem:
- multiple processes, each starting with an input value
- processes run some consensus protocol, then output the chosen value once it's complete
- safety:
  - consistency: all non-faulty processes choose the same value
  - validity: the value they chose was proposed by one of the processes (this just rules out vacuous solutions like "always choose 0")
- liveness:
  - termination: eventually all non-faulty processes output a value

Assumptions about the world:
- asynchronous network
  - messages can be delayed indefinitely
  - but messages that are sent repeatedly will eventually be received
- even just one process can crash

FLP result: No deterministic consensus protocol is both safe and live in an asynchronous network where at most one process can crash (!)

Warning: handwaving imminent!
Most handwavy intuition:
- suppose A never receives a message from B (e.g., despite repeated messages)
- is B crashed, or just slow?
- Should A wait?
  - if yes: it might wait forever!
  - if no: maybe B was just slow, and will come to a different decision

More formal model:
- consider executions of a distributed system, i.e., which messages get delivered in which order (and which are delayed)
- bivalent state: a state where the network can still affect which value the processes choose

Proof sketch:
- there are bivalent starting conditions when there can be failures
  - e.g., half the nodes propose 0 and half propose 1
  - suppose half fail immediately -- which value should be chosen depends on which half it was
  - note that this was cheating: FLP holds even when there's only *one* failure -- but the same sort of argument applies
- for any bivalent state, there's some path that leads to another bivalent state
  - intuition: suppose there's some message m that makes the system univalent -- the deciding point for the algorithm. what if we delay it?
  - in fact, eventually we can reach some point where delivering m keeps the system bivalent!
- we can repeat this indefinitely

So what:
- this says the problem is unsolvable in theory
- we still need consensus algorithms!
- change any of the assumptions about the model to avoid the result:
  - ok to be safe but not guarantee termination
  - guarantee termination with high probability
  - bound message delivery time
  - loosely synchronized clocks
  - failure detectors (even weak ones)
  - etc.
- Paxos is safe but doesn't guarantee termination

Why stick with the asynchronous model anyway?
- we could come up with some bound on message delivery that the system essentially never violates, e.g., 1 hour
  - so unlikely to be violated that we might as well not support it -- e.g., if it were less likely than a cosmic ray corrupting our memory, which we don't support either
  - we could come up with an algorithm that uses that fact and is both safe and live
- but that algorithm would probably hard-code the bound somehow, which isn't great
- asynchronous algorithms like Paxos are self-tuning and inherently avoid tail-latency problems