
The Design of Causally Consistent Databases

  • Writer: Rishaab
  • Jul 18, 2023
  • 9 min read

Updated: Jul 30, 2023

Today's distributed systems are undoubtedly complex and varied, and they require different data consistency guarantees. Among these, linearizability and strict serializability provide the strongest guarantees and are also the most intuitive and easy to understand. Unfortunately, they are also the most complex to design for today's massively scalable, ultra-low-latency data platforms. While several prominent databases provide such strong consistency guarantees, their applications remain limited to niche domains. Most applications instead opt for relaxed consistency models, and the database and storage industries have done a phenomenal job of catering to this need by providing a variety of consistency guarantees. In this article, we will learn about one such relatively strong consistency guarantee called Causal Consistency.




What is Causality?

Let's try to understand causality with a classic chat application example.

Let's say Bill wants to invite friends to his birthday party. He posts a message to a group chat, and the conversation goes as follows,


Bill: Hey everyone! I'm throwing a birthday party at my place this Saturday at 7 pm. You're invited 🎂🥳

Warren: Sounds like a plan, Bill! I'll be there for sure. 🎁


Here, these two messages are causally related, i.e. the response from Warren is causally related to Bill's invitation. In other words, the first message from Bill is the cause of the second message from Warren. Warren's message wouldn't even exist if the birthday invitation hadn't been sent in the first place. Hence, they are causally related, as one causes the other. Causality is sometimes also referred to as the happened-before relationship. In our example, Bill's message happened before Warren's message.




So far so good, what is Causal Consistency then?

To understand this, imagine what would happen if someone reading the above group conversation saw the message from Warren before the message from Bill. Would the conversation be meaningful? Certainly not. Our chat application should ensure that the messages maintain their causal order, no matter who is reading the conversation or from where. This guarantee of ordering causally related events is exactly what a causally consistent model ensures.


A causal consistency model does not provide any ordering guarantees on events that are not causally related.


Consider the following events: [e1, e2, e3]. Say e1 and e3 are causally related, with e1 happening before e3. A causally consistent system should then enforce ordering only between e1 and e3; it is free to execute e2 in any order. This gives such a system a lot of scope to promote concurrency. The following are the valid orderings under the causal consistency model:

[e1, e2, e3],
[e1, e3, e2],
[e2, e1, e3]


Causal consistency is more prominent in distributed systems than in standalone systems. Take, for instance, the diagram below of a geo-distributed database storing the conversation from our previous example of Bill and Warren. In this scenario, we have two running database instances, one in Region-1 and the other in Region-2. With causal consistency enabled, two different clients reading data from the two regions would read the conversation in the same order, i.e. Bill's message followed by Warren's message. Since the write at t0 happened before the write at t1, such a causally consistent system also provides monotonic writes.

Geo distributed causally consistent database.


Now take another case, shown below, of a geo-distributed database that does not provide causal consistency: the client reading from Region-1 observes Bill's message first, followed by Warren's, but the client in Region-2 observes the messages the other way around.

Geo distributed database without causal consistency.


One might argue that building the causal relationship is seemingly less arduous; after all, all the application needs to do is enforce a certain order on all messages. While this is possible, enforcing a strict ordering on every message, regardless of causality, means we are offering linearizability- or sequential-consistency-like guarantees. These are more restrictive guarantees and, as such, provide less scope for concurrency. Instead, we would like to build a system that understands the causal relationship between events and only enforces an order for causally related events.




So, how do we infer a causal relationship?

We would require some notion of a dependency graph to build the happened-before relationship. Intuitively, a happened-before relationship can be pictured as events plotted on a timeline. One way to model this timeline is with the notion of monotonically increasing time. Unfortunately, physical computer clocks cannot be used for this purpose: they are sensitive to clock skew and clock drift, and as such they don't guarantee monotonically increasing time. Thankfully, logical clocks are an attractive option for solving this problem because of their forward-progressing nature and their immunity to clock skew and drift.


Two well-known logical clocks are Lamport clocks and Vector clocks. Vector clocks address a limitation of Lamport clocks: they can differentiate between events that are concurrent and events that are causally related, whereas Lamport clocks cannot assertively identify concurrency between events. However, one major shortcoming of Vector clocks is that they grow linearly in space and time with the number of nodes in a distributed system, so scaling them beyond a certain point can be a challenge; this is where Lamport clocks shine.
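
To make the distinction concrete, here is a small illustrative sketch, not from the article, of how comparing two vector clocks distinguishes causally ordered events from concurrent ones; the compare() function and Order enum are hypothetical names,

#include <cstdint>
#include <vector>

// Illustrative sketch: one counter per node; a vector clock a happened
// before b only if a is component-wise <= b and differs somewhere.
enum class Order { Before, After, Equal, Concurrent };

// Assumes both clocks have one entry per node and the same size.
Order compare(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    bool aLess = false, bLess = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] < b[i]) aLess = true;
        if (b[i] < a[i]) bLess = true;
    }
    if (aLess && bLess) return Order::Concurrent;  // neither dominates
    if (aLess) return Order::Before;               // a happened before b
    if (bLess) return Order::After;                // b happened before a
    return Order::Equal;                           // identical clocks
}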




How do we represent a Logical Clock?

As mentioned above, a logical clock is a monotonically increasing clock, and how we build this notion of forward-progressing time depends very much on the implementation. That said, a physical clock can always be used as a logical clock provided it doesn't suffer from anomalies like clock skew and clock drift. Providing such guarantees with physical clocks is a very challenging problem; however, databases like Google Spanner use specialized physical clocks to solve various timing-related issues. Spanner uses a specialized physical clock infrastructure called TrueTime, which is built on atomic clocks and GPS technology. TrueTime still suffers from clock skew, but the skew is bounded above by 7 ms. Unlike Google Spanner, which is a managed cloud database, most databases operate on customer premises in an unmanaged capacity; therefore, building a causally consistent solution with a physical clock is not feasible for them.


We could use something like a monotonically increasing counter to represent the logical clock, but this would not be very practical in database systems. Database systems are complex, and we need some notion of date and time to make sense of what's going on inside them. Think of debugging an issue by reading the log files: how would you debug a causal-consistency-related issue when all you see in the log files are some random-looking counters? Instead, what we need is a clock that combines physical clocks and monotonically increasing counters, so that we have a clock that can be rendered in date-and-time format while retaining the forward-progressing nature of counters. Such clocks are often referred to as Hybrid Logical Clocks (HLC).




What does the HLC look like?

There are several ways to model an HLC. One possibility is to represent it as a 64-bit unsigned integer, such that the most significant 32 bits represent the physical time and the least significant 32 bits represent the logical time (a counter), like so,

class HybridLogicalClock {
public:
    // Returns the unix-epoch timestamp.
    uint64_t getPhysicalClock() {
        return _hybridLogicalClock >> 32;
    }
    
    // Returns a counter representing the logical clock.
    uint64_t getLogicalClock() {
        return _hybridLogicalClock & 0xFFFFFFFF;
    }
    ...
private:
    uint64_t _hybridLogicalClock = 0;
};          

Here, the most significant 32 bits represent the physical clock and the least significant 32 bits represent the logical clock.


Next, we need to ensure that our clock is monotonically increasing. As such, we require some specialized logic to update the physical and logical components of our HLC. The logic would roughly look like so,

// Note: std::max requires <algorithm>.
void HybridLogicalClock::updateClock(uint64_t providedHlc) {
    uint64_t providedPhysicalClock = providedHlc >> 32;
    uint64_t providedLogicalClock = providedHlc & 0xFFFFFFFF;

    if (providedPhysicalClock > getPhysicalClock()) {
        // The provided physical clock is ahead: adopt the provided HLC wholesale.
        _hybridLogicalClock = providedHlc;
    } else if (providedPhysicalClock == getPhysicalClock()) {
        // Same physical time: keep it and take the larger logical counter.
        _hybridLogicalClock =
            (providedPhysicalClock << 32) |
            (std::max(providedLogicalClock, getLogicalClock()) & 0xFFFFFFFF);
    } else {
        // Our physical clock is ahead: keep it and bump our counter.
        _hybridLogicalClock =
            (getPhysicalClock() << 32) |
            ((getLogicalClock() + 1) & 0xFFFFFFFF);
    }
}

We update our HLC to providedHlc if the physical component of the provided clock is greater than ours. If the physical components are equal, we keep ours and take the maximum of the two logical components; if ours is greater, we keep our physical component and bump our logical counter. If we follow this logic on every update, we can guarantee that our HLC never falls behind.
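
For completeness, the clock also needs to advance on local events such as new writes. Below is a minimal sketch of what that could look like; the tick() method and the now() helper, assumed to return the current unix-epoch time, are hypothetical additions to the class above,

// Hypothetical sketch: advance the HLC for a local event (e.g., a new write).
// now() is assumed to return the current unix-epoch time in seconds; it is
// not part of the class shown earlier.
uint64_t HybridLogicalClock::tick() {
    uint64_t wallClock = now();
    if (wallClock > getPhysicalClock()) {
        // The wall clock moved ahead: adopt it and reset the counter.
        _hybridLogicalClock = wallClock << 32;
    } else {
        // The wall clock is behind or equal (skew, or the same tick):
        // keep the physical part and bump the counter instead.
        _hybridLogicalClock =
            (getPhysicalClock() << 32) |
            ((getLogicalClock() + 1) & 0xFFFFFFFF);
    }
    return _hybridLogicalClock;
}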




Plumbing Everything Together

Until now, we have discussed the HLC as a primitive for ensuring causal consistency; now we will see how it is tied into a database to ensure causal consistency.


We assume that our database orders each write with a timestamp. This is a fair assumption because virtually all databases associate their writes with a timestamp in one way or another; consider the write-ahead log of a database, where every write is typically timestamped with some monotonically increasing time. We will use the HLC to timestamp each write in our database.


Now consider a primary-secondary architecture for our database, in which all writes are performed by the primary but reads can be served by both the primary and the secondary. Assume that the secondary is always catching up with the primary, and hence reading from the secondary may return stale data. You can think of this model as a replica set in MongoDB, if you are familiar with MongoDB deployment models.


Next, we require causal consistency to support the following use case:

Consider the diagram below with a primary and a secondary. Here, a client first writes id1=1 at timestamp t1 (an HLC) to the primary and then attempts to read the value of id1 from the secondary. In this case, the client will not be able to read its own write of id1 because the secondary is lagging behind and only has entries up to timestamp t0. Remember that causality is about the happened-before relation, i.e. if you write a value and then attempt to read it later, you should be able to read it. This guarantee is also called read-your-own-writes. In this scenario, the guarantee is violated, which breaks causal consistency. We would want the secondary to return the value of id1 written by the client.

A database that doesn't satisfy read-your-own-writes.


To enable read-your-own-writes, the secondary must wait before returning a response to the client until it has caught up with the primary, i.e. until the record with timestamp t1 has been replicated. Once the secondary knows that it has the record the client requested, it can return it. Databases like MongoDB use this mechanism.


Now the question is, how would the secondary know whether it has the requested data or not?

The answer is simple: the client should send its HLC timestamp to the secondary. The secondary compares its own latest HLC timestamp with the client's. If the secondary's HLC timestamp is greater than or equal to the client-provided timestamp, then it knows for sure that it has caught up with the primary at least up to that point, so it can look up the requested data and return the result to the client. Until then, it must wait for replication to catch up.
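
One way to sketch this wait on the secondary, assuming hypothetical names (CausalWaiter, onApplied, waitUntil) rather than any real database's API,

#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical sketch of the secondary's wait logic; the names below are
// illustrative, not a real database API.
class CausalWaiter {
public:
    // Called by the replication thread after applying a batch, with the
    // HLC timestamp of the last record applied.
    void onApplied(uint64_t appliedHlc) {
        std::lock_guard<std::mutex> lk(_mu);
        _latestAppliedHlc = std::max(_latestAppliedHlc, appliedHlc);
        _cv.notify_all();
    }

    // Called on the read path: block until the secondary has caught up
    // to the timestamp the client observed on its last write.
    void waitUntil(uint64_t clientHlc) {
        std::unique_lock<std::mutex> lk(_mu);
        _cv.wait(lk, [&] { return _latestAppliedHlc >= clientHlc; });
    }

private:
    std::mutex _mu;
    std::condition_variable _cv;
    uint64_t _latestAppliedHlc = 0;
};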


The revised diagram now looks like this. The client sends the write request with timestamp t0. The server responds with its latest timestamp, t1. The client then uses that same timestamp t1 to read back its own write. The secondary waits for replication to catch up and then returns the result to the client.

A database that satisfies read-your-own-writes.


If we observe carefully, piggybacking this timestamp on each request for a particular client is what represents the dependency graph. As long as the database returns its most recent timestamp and the client keeps sending that timestamp back to the database, we satisfy our requirement of causal consistency. Note that this propagation of timestamps should be abstracted away from the client; the database drivers or the communication protocol should handle it internally.
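
On the driver side, the session state could be as small as a running maximum. A hypothetical sketch, not any particular driver's API,

#include <algorithm>
#include <cstdint>

// Hypothetical sketch of driver-side causal session state. The driver
// attaches the highest HLC it has seen to every outgoing request and
// folds in the HLC piggybacked on every response.
class CausalSession {
public:
    // Timestamp to attach to the next read or write request.
    uint64_t timestampForRequest() const { return _maxSeenHlc; }

    // Fold in the timestamp returned with a response.
    void observe(uint64_t responseHlc) {
        _maxSeenHlc = std::max(_maxSeenHlc, responseHlc);
    }

private:
    uint64_t _maxSeenHlc = 0;  // highest HLC observed in this session
};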



The last thing we need to address is how to propagate the timestamp in a geo-distributed database. The idea is essentially the same as above. Every time a database instance sends an internal message to another instance in a different geo-location, it attaches its own latest HLC timestamp. The receiving instance then updates its timestamp accordingly, using the same HLC update algorithm we discussed earlier. As long as we stick to this protocol, timestamps will continue to advance on every database instance regardless of its location. One thing to note: now that time always moves forward even in a geo-distributed database, we have a notion of a Global Logical Timestamp, i.e. the maximum of the HLC timestamps across all database instances. We pass this Global Logical Timestamp to a client when it connects to our geo-distributed database, and the client can then use it as the starting point for causal reads and writes.
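
In code, the receiving side of this protocol is just the updateClock() logic from earlier applied to every inter-region message; a hypothetical sketch,

#include <cstdint>

// Hypothetical sketch of inter-region timestamp propagation. Every internal
// message carries the sender's latest HLC; the receiver merges it with the
// updateClock() logic shown earlier, so clocks never fall behind.
struct InternalMessage {
    uint64_t senderHlc;  // sender's latest HLC, piggybacked on the message
};

void onInternalMessage(HybridLogicalClock& clock, const InternalMessage& msg) {
    clock.updateClock(msg.senderHlc);
}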




Wrapping up

In reality, implementing causal consistency in a database is a challenging task. This article aimed to provide a design perspective on such a system, and we covered various concepts and designs along the way. We started by understanding the fundamentals of causality and causal consistency. Then we discussed ways to detect causal relationships and settled on Hybrid Logical Clocks. After that, we applied these concepts to a hypothetical database system and saw how everything fits together. Finally, we solved the causality problem for a geo-distributed database and arrived at the concept of a Global Logical Timestamp.


This is the end of the article; I hope you found it useful. Thank you for your time :-)
