This is the third part of the Distributed Designs series. For your convenience you can find other parts in the table of contents in Part 1 — Outbox without idempotency nor synchronous commit

Last time we saw how to use transactional outbox pattern with multiple databases and no synchronous commit. We also learned how to take the lock on the database end to make sure that we can scale out the Relay. However, what if we use Multi-version Concurrency Control (MVCC) or can’t enforce pessimistic locking in general? Can we implement the lock in that case?

Solution

We need to implement leases. The idea is:

Create a table with leases
Take the lease if it’s not taken yet
Process rows
Release the lease

The only tricky part is how to release the lease automatically in case of a crash. However, this we can do with timestamps. The idea is:

CREATE TABLE leases(name VARCHAR(MAX), lease_time DATETIME);
INSERT INTO leases(name, lease_time) VALUES ('lease_name', NULL);

1 2	CREATE TABLE leases(name VARCHAR(MAX), lease_time DATETIME); INSERT INTO leases(name, lease_time) VALUES ('lease_name', NULL);

Now, we want to take the lease if and only if it’s not taken yet. Let’s do this:

while(true){
   beginTransaction(REPEATABLE READ);
   lease_name, existingTimestamp = SELECT name, lease_date FROM leases WHERE lease_date < NOW() AND name = 'lease_name';
   if(lease_name IS NOT NULL) {
      UPDATE leases SET lease_date = NOW() + 1 minute WHERE name = 'lease_name';
      try{
          COMMIT; // First commit
      }catch{
          // Someone updated the lease before us. Let's try again
          // The transaction is now rolled back
          continue;
      }
      
      // beginTransaction(REPEATABLE READ); Process messages and set their status properly; COMMIT;

      beginTransaction(REPEATABLE READ);
      UPDATE leases SET lease_date = NOW() WHERE name = 'lease_name';
      COMMIT;
   }else {
      // We didn't find the lease, so it's not available
      // Let's roll back and back off
      ROLLBACK;
      sleep(5);
      continue;
   }
}

while(true){

beginTransaction(REPEATABLE READ);

lease_name, existingTimestamp = SELECT name, lease_date FROM leases WHERE lease_date < NOW() AND name = 'lease_name';

if(lease_name IS NOT NULL) {

UPDATE leases SET lease_date = NOW() + 1 minute WHERE name = 'lease_name';

try{

COMMIT; // First commit

}catch{

// Someone updated the lease before us. Let's try again

// The transaction is now rolled back

continue;

}

// beginTransaction(REPEATABLE READ); Process messages and set their status properly; COMMIT;

beginTransaction(REPEATABLE READ);

UPDATE leases SET lease_date = NOW() WHERE name = 'lease_name';

COMMIT;

}else {

// We didn't find the lease, so it's not available

// Let's roll back and back off

ROLLBACK;

sleep(5);

continue;

}

We take the lease row. If it’s missing, then it means that the lease is not available (someone else holds it). We wait and try again.

If the lease is available, then we try to take it. We try updating the row and book it for one minute. We then try to commit. If this fails, then it means that someone else modified the lease. We need to back off and try again.

However, if we succeed to take the lease, then we can get the messages from the outbox table. We don’t need to do anything special now because nobody else will fiddle with the outbox. We need to finish in one minute, and then we can release the lease.

Why does it work? What isolation level should I use?

Let’s stop for a moment and analyze why it even works and what isolation level we should use. We tried updating the lease with the following:

UPDATE leases SET lease_date = NOW() + 1 minute WHERE name = 'lease_name';

1	UPDATE leases SET lease_date = NOW() + 1 minute WHERE name = 'lease_name';

What we should in fact do is this:

UPDATE leases SET lease_date = NOW() + 1 minute WHERE name = 'lease_name' AND lease_date = existingTimestamp;

1	UPDATE leases SET lease_date = NOW() + 1 minute WHERE name = 'lease_name' AND lease_date = existingTimestamp;

If we run this query with REPEATABLE READ, then the database engine simply can’t let the UPDATE to complete if someone else modified the row. That’s because the UPDATE needs to read the row again (to check the filtering criteria). Since we run on REPEATABLE READ, then the second read must return exactly the same data. Therefore, either nobody else modified the row, or the UPDATE fails and the transaction is rolled back.

Therefore, after the first commit we can be sure that we have the lease. I can’t really imagine a database that would claim that it meets the ACID requirements and would still let this commit to succeed just to break the data later on. Such a database would be a very interesting (not necessarily useful) case. Obviously, with distributed databases we may get some different isolation levels, some the actual optimizations may be different and break this mechanism. However, according to the SQL standard, this solution should work, as it simply implements the Compare-and-swap operation on the database level.

In short, this is your checklist:

Use REPEATABLE READ
If you can enforce eager locks in the database, just use them without and leases and rely on the locks maintained by the database engine
If you can’t enforce eager locks (so the database engine uses optimistic locking), then implement the lease protocol defined above

Now you can scale your Relay to multiple instances and still get the correct results.

Summary

We implemented a generic lock for MVCC. You can use it for the transactional outbox pattern or for whatever else.