Almost everything we do is time-dependent. Every mutation splits the temporal state of the program in two: one before the change and one after. For a simple system, we can precisely determine any state from the initial state and the lines of code. It evolves in a deterministic, predictable way.
But many programs aren’t that simple. In a concurrent system, there is no single timeline. We have multiple actions that can happen in any order, producing a fractured spread of new timelines. Concurrent systems describe everything from threads sharing memory to independent actors to changes in our real world. And concurrent systems are notoriously hard to design correctly. There are simply too many possible behaviors to reason through.
So we’ll reason with TLA+ instead. We’ve already done some basic nondeterminism with either and with. In this chapter, we introduce the idea of processes and labels, which give us the structure we need to spec out and test generalized concurrent code.
Labels
Before we talk about concurrency, we need to cover labels. The last time we used labels was back in Chapter 2 with the wire example. That’s because we don’t need labels for single-process applications, which we’ve been writing so far. We need labels to accurately describe concurrent systems.
Labels determine the grain of atomicity of your spec. TLC executes everything in a label in a single step, or action. Then it checks all invariants and looks for the next label to execute (action to take). Just as TLC checks all possible behaviors on every either and with, it also checks all possible behaviors on the set of next possible labels. In other words, if you have a concurrent system, TLC will test all available next actions for a possible error.
When translating PlusCal into TLA+, we get an extra pc (“program counter”) variable that tracks which label we're currently on. If pc = "A" then the next state will consist of everything under the A label. We can jump to a label in the same process with goto NameOfLabel. Since specifications are smaller than programs, goto is a lot easier to reason about in PlusCal than in a programming language, and it's often quite useful.
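As an invented illustration (the process name, variable, and label here are not from any spec in this chapter), here is how labels, pc, and goto fit together:

```tla
\* Sketch only: "Count" is a hypothetical label.
process counter = "counter"
variables x = 0;
begin
  Count:
    x := x + 1;
    if x < 3 then
      goto Count; \* next step: pc = "Count", so Count runs again
    end if;
  \* falling off the end jumps to the implicit "Done" label
end process;
```

Each execution of the Count step runs everything under the label atomically; the goto just sets pc back to "Count" for the next step.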
Tip
PlusCal automatically defines a “Done” label at the end of every process. You cannot use “Done” as a label of your own, but you can jump to it with goto.
You must have a label at the beginning of each process and before every while.
You may not place a label inside a macro or a with statement.
You must place a label after every goto.
If you use an either or an if and any possible branch has a label inside it, you must place a label after the end of the control structure.
You may not assign to the same variable twice in a label.
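These rules are easier to see in an invented skeleton (none of these names come from the chapter’s specs):

```tla
process example = "example"
variables x = 0, y = 0;
begin
  Init:              \* required: a label at the start of the process
    x := 1;
    y := 2;          \* fine: two *different* variables in one label
  Loop:              \* required: a label before every while
    while x < 10 do
      x := x + 1;
      Double:        \* required: x was already assigned in this step
        x := x * 2;
    end while;
end process;
```

Without the Double label, the two assignments to x would fall under one step, which the translator rejects.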
With that, we’re ready to talk concurrency.
Processes
The most important thing about this system is that it is concurrent: there’s no enforced order to when either process runs. The writer could write a dozen messages before the reader reads six, and then the writer could add only one more before the reader reads the rest. We model this with the process keyword. Each process must be assigned a value; in this case, strings. Unlike with single-process algorithms, all processes must explicitly use (and begin with) labels.
TLC is able to choose any process to run. It executes one label in that process, checks the invariants, and then chooses a process to run next. Note that pc is no longer a single value. It’s now a function mapping each process to the label that process will execute next.
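The shape of such a spec (a simplified reconstruction for illustration, not the chapter’s exact listing) is roughly:

```tla
\* Sketch; assumes the module EXTENDS Sequences.
variables queue = <<>>;

process writer = "writer"
begin Write:
  while TRUE do
    queue := Append(queue, "msg");
  end while;
end process;

process reader = "reader"
variables current_message = "none";
begin Read:
  while TRUE do
    current_message := Head(queue);
    queue := Tail(queue);
  end while;
end process;
```

Here pc is a function where, initially, pc["writer"] = "Write" and pc["reader"] = "Read".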
The reader also has a local variable: current_message is inaccessible to the writer process or to anything in a define block. However, a macro can use it if it is called from inside the process. Like global variables, local variables can also be defined with \in, in which case TLC will check all of the possible starting states.
Run this with MaxQueueSize <- 2 and INVARIANT BoundedQueue. You should see it immediately fail. TLC starts by immediately running the reader process, which tries to Head an empty queue. Since Head is undefined for empty sequences, the spec fails. The problem is that we have no way of forcing the reader to wait until there’s something in the queue. Let’s fix that.
Await
Here, I put the await after the append to queue. This has slightly different behavior: the step can’t happen until the await is true with the updated queue. If taking the action would end with queue being above the maximum size, the await disables the action entirely. This can be a little confusing when you first encounter it, so I recommend always placing your awaits at the beginning of the step unless you have a good reason not to.
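Following that advice, a writer step with the await at the beginning looks like this sketch (the message value is illustrative):

```tla
Write:
  while TRUE do
    await Len(queue) < MaxQueueSize; \* block until there is room
    queue := Append(queue, "msg");
  end while;
```

The Write action is simply disabled until the queue has room, so TLC never explores a step that overfills it.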
If you run this, it should pass (9 states).
Deadlocks
await prevents a process from continuing until its conditions are met. What happens when all of our processes are prevented from continuing?
First, since both of the processes write to the queue, we pull the add logic into a macro named add_to_queue. To simulate the reader process failing, we use a common PlusCal pattern I call possibly: an either with two branches, one of which does nothing (skip). In the other, we need to use a new label. This is because we’ve already modified both current_message and queue in the Read action. Since you cannot assign to the same variable twice in the same step, we add the NotifyFailure label. Since one of the branches of the either has a label in it, we’d need to put a new label after end either if we wanted to write more in the process. However, the end of the either is the end of the while block and the end of the while block is the end of the process, so we don't need another label.
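Putting those pieces together, the reader might look roughly like this (a hedged reconstruction of the listing this passage describes; the writer would call the same macro):

```tla
macro add_to_queue(val) begin
  await Len(queue) < MaxQueueSize;
  queue := Append(queue, val);
end macro;

process reader = "reader"
variables current_message = "none";
begin Read:
  while TRUE do
    await queue /= <<>>;
    current_message := Head(queue);
    queue := Tail(queue);
    either
      skip;              \* the read succeeds
    or
      NotifyFailure:     \* new step: queue was already assigned above
        add_to_queue("fail");
    end either;
  end while;
end process;
```

The NotifyFailure step ends the while body, so control loops straight back to the Read label with no extra label needed.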
Try running this. You should see a new error: Deadlock Reached. A deadlock is when none of the processes in your spec can take any actions. Usually this is because of an await statement, but it can also happen with with x \in S if S is the empty set. Deadlocks are usually bad. If you’re writing a spec where a deadlock isn’t bad, you can disable the check in the model, under Model Overview > What to Check? > Deadlock.
Process Sets
We made two changes here. The first is that instead of assigning reader to a value, we’re saying it’s in the set {"r1", "r2"}. TLC will create two copies of reader: one for each element, and assign each of them its own set of local variables. During the model checking, at every step TLC can advance “writer” or “r1” or “r2”. Second, to distinguish the two readers in the message queue, we call add_to_queue with self instead of “fail”. If a process has multiple copies, such as “r1” and “r2”, self is whatever value that given copy is assigned to.
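Those two changes, sketched against the earlier reader (the body is otherwise unchanged):

```tla
process reader \in {"r1", "r2"}
variables current_message = "none";
begin Read:
  while TRUE do
    await queue /= <<>>;
    current_message := Head(queue);
    queue := Tail(queue);
    either
      skip;
    or
      NotifyFailure:
        add_to_queue(self);  \* "r1" in one copy, "r2" in the other
    end either;
  end while;
end process;
```

Each copy gets its own current_message, and self lets a copy identify itself in the queue.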
Note
All process names across all processes must be comparable. Since the value for writer is a string, the value for reader can be either a set of strings or a set of model values.
If we run this, we should still see a deadlock. While multiple readers may reduce the chance of a deadlock, they do not eliminate it entirely, and TLC will still catch the error.
Procedures
If you run this, you should see the same expected deadlock. A procedure has the same syntax as a macro, except that it may contain labels. You can also define local variables for a procedure, in the same manner as for processes. Unlike process variables, though, procedure locals can only be initialized with an expression (=), not as an element of a set (\in). We exit the procedure with return, which does not return any value to the calling process. It simply ends the procedure.
In order to call a procedure in a process, we have to prefix it with call. A called procedure must be immediately followed by a label, the end of an enclosing block, a goto, or a return.
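As a sketch, converting add_to_queue from a macro to a procedure might look like this (the Add and Next labels are invented names):

```tla
procedure add_to_queue(val = "")
begin
  Add:
    await Len(queue) < MaxQueueSize;
    queue := Append(queue, val);
    return;  \* ends the procedure; no value is returned
end procedure;

\* Inside a process body:
Read:
  call add_to_queue(self);
Next:  \* the call is immediately followed by a label
  skip;
```

Because the procedure has its own label, the append happens in a separate step from the call itself.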
Procedures must be defined after macros and before processes. A good rule of thumb to remember this is that procedures can use macros but macros cannot use procedures, so procedures must follow macros. Similarly, processes can call procedures and macros, but procedures cannot use processes.
Tip
When using process sets that use procedures or macros, you can still use self inside of the procedure or macro. It will refer to the value of the calling process.
Example
We can use processes to model anything concurrent, not just algorithms. One common use case is to use processes to model time periods, where some external activity happens every so often. For this example, we’ll have several clients consume some shared resource that periodically renews. This is a generic implementation and can represent clients calling a rate-limited API, loggers cutting a tree farm, scheduling CPU time, etc.
We have two constants: one that represents the total possible resources in the system, and one that represents the maximum a given actor can consume per tick. The actor will continuously consume from the global pool of resources, eventually depleting them all. We want to make it so that it never consumes more resources than are possible (depleting to zero is fine).
Whenever time runs, it resets resources_left back to the cap. Now the actor can never deadlock, and our spec passes (22 states).
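The time process can be sketched like this (the exact shape is my reconstruction; resource_cap and resources_left follow the names used in the surrounding text):

```tla
process time = "time"
begin Tick:
  while TRUE do
    resources_left := resource_cap;  \* refresh the pool every tick
  end while;
end process;
```

Since TLC can run Tick between any two actor steps, the model covers renewals happening at every possible moment.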
time remains the same. The constant Actors will have to be a set of strings or a set of model values. In cases like this, a set of model values is usually preferable to a set of strings. Since resources_needed is local to each actor, the actors don’t all need to have the same value. Try Actors <- [model value] {a1, a2} and rerun. You should see that it still passes (69 states).
Since UseResources is under the same label as the await, we can only step into it if there are enough resources available. Once we do, though, the while loop keeps running until we’ve consumed our fill. Since we’re destructively updating resources_needed, we need to reset it at the end of the loop. This also lets us update it to a different value than it started with: the actor may first need one unit of resource, then two, then one again, etc.
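One possible shape for the actor described here (a reconstruction; MaxConsumerReq is an assumed name for the per-tick maximum constant):

```tla
process actor \in Actors
variables resources_needed \in 1..MaxConsumerReq;
begin WaitForResources:
  while TRUE do
    await resources_left >= resources_needed;
    UseResources:
      while resources_needed > 0 do
        resources_left := resources_left - 1;
        resources_needed := resources_needed - 1;
      end while;
      with x \in 1..MaxConsumerReq do
        resources_needed := x;  \* may need a different amount next round
      end with;
  end while;
end process;
```

Note that the inner while consumes one unit per step, which is exactly what lets another process interleave halfway through.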
If we run this, we now violate ResourceInvariant again. One actor can start consuming, but halfway through another actor can deplete the rest of the pool, at which point the first actor breaks the invariant.
We’ll try two fixes for this: one that won’t succeed and one that will. Our first fix will be to only let each actor complete once before refreshing. The cap is currently 6, there are currently two actors, and the most each can consume per complete iteration is 2. 6 >= 2 * 2, so limiting them should work, right?
When we run this, we once again violate ResourceInvariant. TLC picks resource_cap = 1, and as 1 < 2 * 2 our “fix” no longer works. This is why it’s important to look at larger state spaces.
SYMMETRY SETS
The downside of larger state spaces is that “larger” becomes intractable much too quickly. Try rerunning the model without ResourceInvariant as a checked invariant. That change balloons our search space from 389 to 2,085 states. If you add a third actor, you now have 24,485 states!
This is where symmetry sets can make a big difference. If you convert Actors to a symmetry set, the model should pass with only 6,040 states – less than a quarter of the original state space. Using symmetry sets is often a good first-pass optimization for your slower specs.
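In the Toolbox this is a checkbox on the model value set, but it corresponds to a regular TLC feature: the Permutations operator from the TLC module, declared as a SYMMETRY in the model configuration. A sketch (the definition name Symmetry is my own):

```tla
\* In the spec (requires EXTENDS TLC):
Symmetry == Permutations(Actors)

\* In the model's .cfg file:
SYMMETRY Symmetry
```

TLC then treats states that differ only by swapping a1 and a2 as the same state, which is why the count drops so sharply.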
1. TLC sets resource_cap to 1.
2. a1 reserves 1 resource and enters UseResources.
3. Before a1 does anything, Tick happens, resetting reserved to 0.
4. a2 reserves 1 resource and enters UseResources.
5. Both a1 and a2 resolve UseResources, bringing resources_left to -1.
Instead of having reserved reset on every tick, let’s have the actors gradually mark when they’ve consumed resources and no longer need capacity reserved.
This passes (1,588 states).
Summary
In this chapter, we introduced concurrent specifications and how we can model them in PlusCal. We observed that concurrent specifications run into a wide range of exciting new problems, such as race conditions and deadlocks, and we saw how to model effects with processes.
Modeling concurrency is one of the best-known use cases for TLA+. We programmers are very good at reasoning about deterministic code and very bad at reasoning about concurrent systems, but the risks and dangers of the latter are so much higher. As we saw in our example, specification can be a powerful tool for safely managing concurrency.
In the next chapter, we will learn how to write temporal properties: invariants that apply to entire behaviors at once.