TLA+ to ILP Part 1 — Reducing problem

This is the first part of the TLA+ series. For your convenience you can find other parts using the links below:
Part 1 — TLA+ to ILP Part 1 — Reducing problem
Part 2 — TLA+ to ILP Part 2 — Performance check

Today we are going to reduce a TLA+ algorithm to ILP (Integer Linear Programming) and see if it works. Let’s begin.

We take the Model Checking example by Tomasz Masternak from Exactly Once. It’s a short piece of code presenting possible race conditions and issues with a distributed transaction consisting of sending a message to the output queue and updating the message in a database. Read the code first and understand what’s going on there, as I won’t be covering it here.

Now, what we want to do is to model the same problem in ILP and see if we can get correct results. Let’s decompose the TLA+ model:

  • We have an input queue with some messages
  • We have an output queue with sent messages and their transaction id
  • We have a database storing some transaction id
  • Each message may be processed multiple times in case of failures
  • The business transaction is represented as incrementing a transaction counter
  • The message is marked as processed

We need to represent the model in ILP. We are going to store an array of bits holding the messages. I’ll present this in a rough-and-dirty way without much abstraction, but in production code you would probably encapsulate all the nasty stuff.

The code:

A lot in here so let’s go line by line.

First, we initialize the solver in lines 1-13. Just a regular CPLEX setup.

Next, we store some hyperparameters: how many messages there are (line 15) and how many times each message can be processed (twice, line 16). We then calculate the number of steps the solver should emulate, which is exactly the number of messages times the number of processing attempts per message.
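To make this concrete, here is a tiny Python sketch of that arithmetic; the names and values are my own, the actual code is C# and lives at the referenced lines:

```python
# Hypothetical hyperparameters mirroring the description above
# (the real values sit in the original C# code, lines 15-16).
MESSAGE_COUNT = 2   # how many messages
MAX_ATTEMPTS = 2    # how many times each message may be processed

# One step is one attempt at processing one message, so the solver
# must emulate exactly MESSAGE_COUNT * MAX_ATTEMPTS steps.
STEP_COUNT = MESSAGE_COUNT * MAX_ATTEMPTS  # 4
```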

Next, we’re going to store things as bit flags, so let’s define a helper function which calculates how many bits we need to store a bounded integer (line 19). It’s just a logarithm, nothing fancy (I’m running this on an old .NET version, so I don’t have the Math.Log2 method yet…). We then calculate how many bits we need for the retries counter and the transaction counter (lines 21-22).
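A possible shape of such a helper, sketched in Python (the original is C#; the name and signature here are my own, and integer shifts stand in for the missing Math.Log2):

```python
def bits_needed(max_value):
    """How many bits are required to store any integer in [0, max_value].

    Equivalent to floor(log2(max_value)) + 1, but computed with integer
    shifts so it doesn't depend on a Log2 method or floating point.
    """
    bits = 1
    while (1 << bits) <= max_value:
        bits += 1
    return bits
```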

Now, each record is going to hold the following:

  • Whether it is currently in the input queue
  • Whether it was processed
  • Whether it’s stored in the output queue
  • Whether it’s in database
  • Retries counter
  • Transaction counter stored in the output queue
  • Transaction counter stored in the database

This is in lines 21-37.
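The bullet list above can be summarized as a per-record bit budget. A minimal Python sketch, assuming hypothetical field names of my own (the real layout is in the C# code):

```python
# Hypothetical field layout for one record, mirroring the bullets above.
# Each flag takes one bit; the counters take bits_needed(...) bits each.
IN_INPUT_QUEUE = 0   # bit 0: message still in the input queue
PROCESSED = 1        # bit 1: message was processed
IN_OUTPUT_QUEUE = 2  # bit 2: message stored in the output queue
IN_DATABASE = 3      # bit 3: message reflected in the database

def record_width(retry_bits, tx_bits):
    """Total bits per record: 4 flags, the retries counter, and two
    transaction counters (the output-queue copy and the database copy)."""
    return 4 + retry_bits + 2 * tx_bits
```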

Lovely, now the actual bit mask in lines 40-50. We initialize the flags so that initially all messages are in the input queue and everything else is zeroed out.

Next, we define some helper methods to convert bits to an integer, an integer to bits, and so on. This is in lines 52-85. We also define a collection of debug variables to make fixing issues easier (line 91) – this doesn’t affect the model checking at all.
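For intuition, here is what such conversion helpers typically look like, sketched in Python (little-endian bit order is my assumption; the original helpers are C#):

```python
def int_to_bits(value, width):
    """Little-endian list of the lowest `width` bits of a non-negative int."""
    return [(value >> i) & 1 for i in range(width)]

def bits_to_int(bits):
    """Inverse of int_to_bits: reassemble the integer from its bit list."""
    return sum(bit << i for i, bit in enumerate(bits))
```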

Okay, let’s now begin with the actual calculation. Let’s ignore failmessage for now and move on to line 117. We iterate through the steps and start by defining a variable indicating whether we broke out of the loop – it corresponds to this line in TLA+. The variable will be used in all operations. In ILP we can’t just break the iteration – we need to update all variables because we don’t know the values yet, so we use this flag to effectively do nothing.
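The pattern behind that flag can be sketched as follows; in the actual ILP this arithmetic identity is expressed as linear constraints, but evaluating it as plain Python (names are my own) shows the intent:

```python
def guarded_update(running, new_value, old_value):
    """What every assignment looks like once control flow is gone:
    if `running` is 1, take the new value; otherwise keep the old one
    (i.e. emulate a no-op after a 'break')."""
    return running * new_value + (1 - running) * old_value
```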

First, we implement the Receive part. We calculate how many messages there are in the queue (lines 123-126) and whether we have a message to process (lines 128-130). Next, we define a free variable indicating which message from the queue we start processing (lines 132-135). This is an important part: we don’t process messages in any particular order, so we are free to model race conditions, caches, network delays, or whatever else. We just get some message from the queue.
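One plausible way to constrain such free selection variables (a 0/1 indicator per message), sketched as a feasibility check in Python rather than actual solver constraints:

```python
def selection_feasible(in_queue, select):
    """Checks the constraints a solver would enforce on the selection:
    the number of picked messages equals the has-a-message flag (so at
    most one pick), and a picked message must still be in the input queue."""
    has_message = 1 if sum(in_queue) > 0 else 0
    return sum(select) == has_message and all(
        s <= q for s, q in zip(select, in_queue))
```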

Now, we copy the bits of the message to a helper array to keep things easier and faster. We use the Array.Get operation, which is slow (quadratic time complexity), so we do it just once, in lines 138-145.
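The quadratic cost comes from forming each extracted bit as a sum over all messages. A Python sketch of that arithmetic (my own naming; the real version builds solver expressions):

```python
def select_record_bits(select, records):
    """Extract the chosen record's bits: bit_j = sum_i select_i * rec_i[j].
    One product per (message, bit) pair - hence the quadratic cost of
    the Array.Get-style indexing mentioned above."""
    width = len(records[0])
    return [sum(s * rec[j] for s, rec in zip(select, records))
            for j in range(width)]
```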

However, since we don’t actually know how many messages there are, it may happen that we select a message which was already processed. We make sure that if we selected a message, then that message must still have been in the queue (lines 148-151). If there are no more messages to process, we’ll need to emulate no-op instructions later on.

Okay, now we move on to the business-processing part. It’s trivial, as it’s just a counter calculation at this point (line 157). However, later on we’ll need to emulate an assignment which in TLA+ copies values, so we’ll need to copy them as well.

It’s high time for some actual processing. We start with sending the message to the output queue in line 161. We begin by optionally failing the process: we create a free variable in line 161 which can be set only if we’re still processing (which is true at this point, as nothing has failed yet). Then we move on to the failing procedure (we call it in line 163).

The failing procedure starts by extracting the failure counter (line 97) and then checks whether this message has already failed too many times (lines 98-100). That’s true if and only if it was processed exactly once before and it failed this time. Now we need to start performing conditional operations.
If the message failed too many times, then we need to remove it from the input queue (line 103); otherwise we don’t touch the bit at all (we rewrite its current value). We do the same for the processing flag – if the message failed too many times, we mark it as processed; otherwise we leave the flag alone.
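These conditional operations boil down to products of binary variables, which an ILP can’t state directly. A standard linearization of z = a·b for binary a and b uses three inequalities; the brute-force check below (plain Python, my own naming) confirms they admit exactly the assignments where z equals the product:

```python
from itertools import product

def and_linearization_holds(a, b, z):
    """Standard ILP linearization of z = a * b for binary a, b:
    z <= a, z <= b, z >= a + b - 1."""
    return z <= a and z <= b and z >= a + b - 1

# Enumerate all binary assignments satisfying the inequalities.
consistent = [(a, b, z) for a, b, z in product((0, 1), repeat=3)
              if and_linearization_holds(a, b, z)]
```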

Now, we calculate the new failure count and convert it to bits (lines 108-109). We then store these bits in the record (lines 110-112).

We now carry on with processing. We update the flag indicating whether we failed and stopped the iteration (line 164). Next, we try putting the message in the output queue (line 167), and we use conditionals again – we send the message if and only if we’re still processing the loop (meaning the message hasn’t failed too many times and hasn’t failed now). We update the transaction counter in the same manner (lines 168-171). Notice that we calculate the transaction counter and create new variables for it – that’s because this line effectively copies the value, so we need to create a copy as well.

Next, we perform similar processing for updating the database (lines 174-185).

Finally, we acknowledge the message. We again check for failure (lines 189-192) and then update the flags – remove the message from the input queue and mark it as processed.

Since we modified the variables in place, we need to store them back in the bitset. We do that in lines 200-206. And we’re done.

That’s it for representing the code. Now we need to model the invariant. We start with the NoGhost invariant, which says that either the message is not in the output queue or it’s stored in the database. So we take each message one by one and check whether it’s in the output queue (line 217) and not in the database (line 219) – in that case the message is a ghost. We finally add the constraint that this must be true, so there must be some ghost message (line 222). If the ILP solver finds an evaluation and shows a solution, it means that the invariant was broken and the business code is wrong. If ILP fails to find a solution, it means that the invariant holds.
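The negated-invariant trick can be sketched like this in Python (function names are my own; the real version adds these as solver constraints):

```python
def ghost_messages(in_output, in_db):
    """A message is a ghost when it sits in the output queue but its
    transaction never made it to the database."""
    return [o * (1 - d) for o, d in zip(in_output, in_db)]

def no_ghost_broken(in_output, in_db):
    """The condition handed to the solver: at least one ghost exists.
    A feasible solution is then a counterexample to NoGhost."""
    return sum(ghost_messages(in_output, in_db)) >= 1
```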

Now, we need some code to actually retrieve the history in a friendly manner (lines 228-262). And we’re done. Output:

How do we read that? First, it took 0.02 seconds to find that the invariant is broken. We can see that the first loop iteration was processing a message (so there was still something in the input queue), chose the first message from the queue, and then failed when putting it into the output queue. Step 2 tried doing the same and succeeded, but later failed when updating the message in the database. Finally, we can see that the message is not in the input queue, was processed, and is in the output queue but not in the database. We can also see the failure count set to one, and that the transaction id was stored correctly in the output queue.

Next time we’re going to experiment with other invariants, and later on we’ll run some performance tests.