About this talk
With the increasing popularity of distributed data storage, the spectrum of choice between consistency and availability becomes more relevant. We'll take a look at different points on the C/A spectrum to best fit specific needs. When network partitions are rare, this becomes a latency/consistency choice; then we'll look at PACELC theorem to better describe this relationship. Finally, we'll look at new and novel ways of pushing distributed ACID transactions to their limits.
Hello everyone, I'm Charlie. And I'm going to be talking about real-ish time shared state. So a look at latency versus consistency in distributed data stores. So first I'm gonna talk a little bit about CAP, and I can never know how to pronounce this one, pay-sa-leck, maybe, theories. Then we'll have a look at some useful consistency choices on the scale of latency versus consistency, and then we're gonna have a look at some new systems that are pushing CAP to the limit of what should be possible. Right, so first of all, I'm building some kind of service, or making an app, or whatever. I need a database. Fantastic, here's my database. Single node, great. Customers are happy, it's consistent and available. Brilliant. Oh, no, it blew up. We lost all of the data, people are unhappy, and I went bust. So let's do version 2.0. It's distributed this time. So we've got three nodes. They're all connected over the network, and great, we're happy. If one blows up, it shouldn't matter too much, hopefully. Well, this actually introduces a load of weird new choices that we have to get around to make our users happy. This can be summed up with something called CAP theory. So basically, in a distributed system, you should generally pick two of these rules. So the first one's consistency, which basically means that the state, if you access state anywhere on any of the nodes, it should be consistent. Availability in the event of a network partition. So that means that if a node is cut off from the rest, can you still access state from that, even if it is a little bit out of date or wrong? And then the third one is tolerance of network partitions. So in the case that the network does go down, or a node is cut off, can it recover? Can you add nodes? Can you take them away? That kind of stuff. So generally, pick two but it's not a hard and fast rule. Some crappier databases have one or none. So, be careful. Um, great. So network partitions, a real thing. So a really interesting story is this woman, she's like a 75-year-old Albanian woman, she was digging in her garden one day, and she hit some copper pipes, or so she thought. So she like, great, I'll rip them out and I can sell the copper. Turned out, that was the only Internet inline to Albania, and she cut the internet off to about three million people for five hours. So network partitions are real and very common place. So, let's go ahead a tick that tolerance of network partitions box. You're gonna need it, as you've all probably found out some point in your careers. Networks are pretty flaky things. Great, so that leaves us with two choices, consistency and availability. I will say this is a bit of a simplistic view. This is more of a spectrum in reality, but we'll keep it as three ticky boxes for now. Cool, so we've had that network partition. Oh, no, our user's unhappy. So at this point, we need to choose if this node is gonna be consistency or available. If it's consistent, it might return an error, or if it was the master, in some strange cases, it may return something, but probably not. Or is it available? Are we happy that it's just gonna return some stale or wrong state, and the user doesn't care, hopefully. But most of the time, hopefully, you won't be having network partitions. This is kind of a rare case, hopefully, most of the time, 99% of the time, you're not gonna be expecting network partitions. Fantastic. So effectively, if we don't have a network partition, we should be able to tick both the consistency and availability boxes, hopefully. Well, yes and no, there's an added problem of latency. So, even if your nodes are connected, there may be some kind of latency involved in a read or write operation. So say I write to this node, it needs to inform the other nodes that a write operation is taking place, and this is gonna involve some latency, obviously. So let's reframe the question. Is it availability or a consistency choice? Well not anymore, there's no network partition. It's more of a consistency versus latency choice, and this is a lot more than a binary choice, so we'll also turn into a slider, rather than ticky, boxes. Great, so latency versus consistency, or availability versus consistency, so this leads us nicely on to the PACELC theory. If someone knows the correct pronunciation of that, stick your hand up and let me know. If there's a network partition, then generally you choose between availability and consistency, else, if there's no network partition, you get a choice between latency and consistency. Great, why is this useful? Well, basically, most of the time, hopefully, you're not in the network partition, so you can kind of tune this latency/consistency question around what your users are going to want to see, and what you expect your application to actually be doing. So I'll go through a kind of spectrum of choices on this scale, and we can have a look at why they might be useful, or not, as the case may be. So first on the scale is kind of very low on the consistency side, but very good for low latency, is something called eventual consistency. So this is an idea of, you're going to see some of the writes. They're probably not going to be in order. You won't see the most recent writes, but you'll always see something. You can go to any node and it will give you some state. Might be stale, might be wrong, doesn't matter. And the idea is that also if you stop writing, eventually all nodes will catch up to be consistent, but again, there might not any guarantees on this. A really good example of this is DNS. So if you update a DNS record, it will update the top level domain server, and that will trickle down the DNS system. Great, so next on the scale is something called a consistent prefix. This is a little bit more consistent, but still quite low latency. This is an idea of eventual consistency but with ordering. So you will see a consistent prefix, meaning, what you do see will be consistently ordered, but you might not see all of the most recent writes. You don't really know. And then the idea also with this, is generally, it's not time bound or version bound. So you can't say I want a consistent prefix within this length of time. It will just give you what it has and you have to hope for the best. So a good example of this might be a routes timeline on a map. So maybe you don't care that the map is in realtime or anything, you don't care if there's a big delay, but you do want the nodes on the map to be in order to give you a sense of a route. Okay, so the next one is session based. This is actually a pretty useful one. The idea here is that consistency is scoped to a client's session. Monotonic read and writes, that's a fancy way of saying that if you read your next read is guaranteed to be at the same level or further. And if you write, again, it's guaranteed to be further than the last write. And then, also, read your writes. You should always be able to see your own writes and writes follows reads means that writes block reads. So the idea is that if I was say, making a notes-taking app, it's only gonna have one user connecting to the database. I'm happy that I can write to that database as a single user, and I'll see all of my writes and see all of my reads and they're in order. If someone else connects to that node or another node, maybe they'll see a different view. So good for single-user applications kind of thing. Right, next on the scale is continuous consistency. This is an idea of you give a consistency requirement that needs to be fulfilled. So this could be a time bound. Maybe I want the most consistent view, but don't worry about the last 30 seconds, or I want the most consistent view, but the last K versions of this doesn't really matter if they're out of order or wrong. A really popular version of this is something called bounded staleness, Where again, you just say, I'm willing for the value to be stale within this many versions, and great, it will give you that value. It's kind of quite far off the scale of consistency but again, you have that kind of consistency unit, where you're willing to have an error. A good example of this is reporting sports scores. You might not worry about, it might not have to be update to the second, but you want it to be in order, and you want it to say, okay, I want the football score within two minutes of actually happening. So finally on the scale, we have strong consistency. This is pretty straightforward. You're gonna see all the writes in order, and you'll see every write, but on the flip side of this, to keep that state consistent, all the nodes have to agree on it, basically. So there's gonna be a high amount of latency with this. A good example of what this might be used for is online transaction processing. Like, basically financial transactions and stuff. You want to know that all of the transactions are in order, and you don't want to read false states, you want the most up-to-date state, and it has to be in order. Great. So some interesting examples that you can go and use today. Lot of people probably use mongoDB at some point or another. You will generally, if you do a request, it will go to a master node, and that will have the correct information on it, but you can also state that you don't really care about this, go to the closest node, lowest latency node, and it will give you a more available version of that state, but it could be wrong, it might not have replicated with the master yet. Cockroach DB, this has just gone 1.0. Really interesting use case. Basically the idea is that you can add and remove nodes and it'll scale your data automatically for you, and it should be really resilient, hence, being a cockroach, can't kill them. And it's strongly consistent. So that's all it does. So useful, if that's what you want. Azure Cosmos DB. This is a Microsoft product. Really cool, because it actually offers all five of the states we just saw. Each request, you can state the amount of consistency you want and then it will give you this, in a sense that if I say I want an eventually consistency state, it will be really quick. If I want it to be completely consistent, it's gonna take a little while longer. And then, the final one is Cloud Spanner, which I'm gonna go a little bit more detail, because it's particularly interesting. So Cloud Spanner kind of says that we've got a magic way of ticking all three of the CAP boxes, and it was created by the guy who actually created the CAP theory, Eric Brewer, so hopefully, he should know a little bit about it. The idea here is that if you can ensure that there's never a network partition, you don't have to worry about it, so it's always available, and always consistent. How can they ensure this? Well, it is Google, so they have lots of cash. So basically, the first thing they've gone and done is they have network redundancy and they have their own private fibre network that covers the globe, so they can ensure that if you snip one of these fibre wires, there's always gonna be another two. It shouldn't have any effect, hopefully. Not sure if they test that. Who knows? Each node in a cluster of database nodes is highly available. So in this case, they use something called two-phase commit, which is where each node will ask all of the other nodes, or a majority of the other nodes, whether they're willing to accept this state change and then it will do a commit, once it's got a majority of them coming back as saying yes. And then each of the clusters, each of the nodes, sorry, is itself a highly available Paxos Cluster, which means that if one of the nodes in a node goes down, then it doesn't matter, it should still carry on. So each node is highly available, and then the cluster of nodes is highly available. It uses something called TrueTime, which is quite revolutionary. They use a series of atomic clocks and GPS receivers to get really precise view of time. This makes it really useful for something called, multi versioning concurrency control. So that's an idea of, if you know the time precisely, and you have a timestamp on a set of data, and then you know the timestamp of a read, then you can just give them that version of the data. If a write happens during a read, or just after a read, it doesn't matter. You're giving them that version, and that's what they're gonna get. So that means, effectively for reads, you don't have to do round trip to ask the other nodes if it's consistent. You can just use that snapshot in time, and because they've got such precise time, it's really quick and really precise, whereas most computers, time is a little bit imprecise, and it obviously, has variable drift problems and stuff like that. And then the other thing is Google's SRE team literally wrote the book and created the term SRE, so they should probably know what they're doing. Yeah, so we'll tick the box. And their, kind of, addendum to this is that this combination is possible in practise if you control the whole network, which is rare over the wide area, even then, it requires significant redundancy of network paths, architectural planning to manage correlated failures, and very careful operation, especially for upgrades. So you can probably do this, but you need Google, Microsoft, Facebook money, so good luck. So conclusions are, it's great to think about CAP theory, but it is for when you're in a network partition, so hopefully, most of the time you won't be, I don't know. But the times you're not think of it more as a latency over consistency issue, and then, yeah, if you've got lots of money, use Cloud Spanner. Great, thank you very much for listening. Any questions? - [Participant] I know that Cloud Spanner, actually, just came out of Beta two months ago. - Yep Have you actually used it? - No, it's I think, like three grand, minimum to start using it. So I'd love to, but, no. But there's some really interesting papers. They released a load of really massively technical papers, if you're that way inclined, but they also released this paper called Spanner, CAP Theory, and TrueTime, which gives a nice overview. I'd suggest everyone go and have a look at that 'cause that kind of gives this in more understandable terms for someone like me. And it's really interesting how they're using these clocks and the highly redundant, wide area network stuff is fascinating. But, yeah, as I said, you need lots of cash to do it.