148: CAP Theorem: Choose Two, Or Is It One?

Take Up Code - A podcast by Take Up Code: build your own computer games, apps, and robotics with podcasts and live classes

Categories:

Consistency, Availability, and Partition Tolerance are three aspects of distributed computing. And a fourth not included in the theorem is Latency. The original CAP theorem said that you could only be guaranteed to get two aspects between the three options, consistency, availability, and partition tolerance. But here’s where it went wrong. You see, consistency and availability are up to you to decide how to handle. And they don’t even have to be all of one and zero of the other. They can be combined. But a network failure is outside of your ability to control. Sure CAP says you can design around this too. But CAP implies that you could choose to design a system that’s consistent and available. Those are two of the original three choices. This ignores the fact that networks will fail. Modern distributed systems can no longer ignore partition tolerance. As designs scale out more by using more computers across increasing distances, network disruptions become more frequent. Because partition tolerance is becoming a necessary choice, you’re left with the choice of how you want to deal with the situation when you don’t have a complete system. If you choose consistency over availability, then when a request for some information is made you can either wait for the network to come back or return an error. Or maybe you wait for a while and then return an error. Either way, the computer making the request won’t wait forever and will timeout eventually on its own. Consistency says that you need to make sure you have the most up-to-date information before replying. Even if you have some information and could reply with what you have, consistency says you double-check first. And since you can’t do that, then an error is best. Now if you choose availability, then you reply with the information you have. Even if there might be more up-to-date information that you just can’t get to right at that moment. This means that the information should be available. You can’t just make up a response. If you reply with an answer, then it should at least have a chance to be the right answer. If a computer requests some information that a server simply does not have and can’t get to at all because of the network outage, then choosing availability means that you should at least return right away with an error. Listen to the full episode, or read the full transcript below, to hear more including how latency fits into all this. Transcript Many explanations simplify this by saying that you need to choose two of the first three aspects. But that’s not quite right. I’ll explain each of these aspects and then how they’re related. First of all, consistency. If you know about databases, then this has a different meaning. I’ll explain databases in a future series of episodes. For now, consistency means that if you write some information, then you should read back the same information. If you instead get back old information, then this is called stale information. Sort of like how bread tastes better when it’s fresh. Availability also has a slightly different meaning than what you might expect. It doesn’t mean that a system is working fully. And it doesn’t even mean that you can always get a reply at all. It just means that if you do happen to connect to a working server computer, then you should get back some reply even if it’s not the latest information. Partition tolerance means that a distributed system can continue to work when there are network problems that isolate some server computers from others. Let’s say you have computers participating in a distributed system located in the United States, Canada, and Australia. If a hurricane causes all the united States computers to be disconnected from both Canada and Australia but otherwise they remain running, then a system is said to be partition tolerant if customers can continue to use the system. The