Async Wandering Part 15 — How async in C# tricks you and how to order async continuations

This is the fifteenth part of the Async Wandering series. For your convenience you can find other parts in the table of contents in Part 1 – Why creating Form from WinForms in unit tests breaks async?

You have probably heard that async is all about not blocking the operating system level thread. This is the fundamental principle of asynchronous programming. If you block the operating system level thread, you lose all the benefits of asynchronous code.

You also need to keep in mind how to write your C# code. You have probably heard that you should keep async all the way up. This is rather easy to do because the compiler takes care of that. What’s slightly harder to remember is to keep ConfigureAwait(false) all the way down. The compiler won’t help you with that, and if you forget, you may run into some nasty deadlocks, especially if you use some unusual SynchronizationContext.

Last but not least, you probably know that asynchronous code is only useful if your code is IO-bound. You have probably heard that many times. However, what might be very surprising is that C# actually does a lot to make your application work even if your code is CPU-bound and you still use async. This is very misleading and may let you believe that you know async, whereas you only know async in C#. Let’s see why.

There is plenty of no threads!

One of the best articles about async in C# is titled There Is No Thread. Stephen Cleary shows that it’s all about continuations and juggling the lambdas to run your code when some IO-bound operation finishes. I even used this title in my Internals of Async talk in which I explain all the internals of synchronization contexts, continuations, and the machinery behind the scenes.

However, it’s only a figure of speech. At the very end of the day, we need to have some thread to run the code. Depending on your synchronization context, there may be some dedicated thread to run the continuations (like in desktop or Blazor applications), or we can use threads from the thread pool. If you think carefully about the asynchronous code, you should notice that this is the place where C# either bites you hard (and causes many deadlocks) or saves your application even if you are doing something very wrong. How? Because C# uses many threads.

By default, C# uses the thread pool to run continuations. The thread pool runs some not-so-trivial algorithm to spawn new threads when there is plenty of work to be done. This is not part of the asynchronous programming paradigm per se. This is just an implementation detail of C#’s asynchronous code which heavily impacts how your applications scale. Other languages don’t do it in the same way and what works well in C# may fail badly somewhere else. For instance, Python’s asyncio uses just one thread even though Python supports multithreading. While this is just an implementation detail, it has tremendous performance implications. Let’s see why.

One thread can kill you

Let’s take a typical message processing flow. We take a message from the service bus, refresh the lease periodically, and process the message in the background. Let’s say that our flow is IO-bound and we use async to benefit from non-blocking thread instead of spawning multiple threads. Let’s simulate the system. You can find the whole code in this gist.
We start with a message that will store when we received it, when we refreshed the lease for the last time, the identifier of the message, and the final status (if we lost the message or finished successfully):
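The original listing is not reproduced here. As a rough stand-in, a message type with the four pieces of data described above could look like the sketch below. All names (`Message`, `MessageStatus`, the property names) are hypothetical and may differ from the gist.

```csharp
using System;

// Hypothetical names, chosen for illustration only.
public enum MessageStatus { InProgress, Finished, Lost }

public sealed class Message
{
    // Identifier of the message.
    public Guid Id { get; } = Guid.NewGuid();

    // When we took the message from the bus.
    public DateTime ReceivedAt { get; set; }

    // When we refreshed the lease for the last time.
    public DateTime LastLeaseRefresh { get; set; }

    // Final outcome: still running, finished successfully, or lost.
    public MessageStatus Status { get; set; } = MessageStatus.InProgress;
}
```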

Now, we want to configure timings in our application. We can specify how long it takes to receive the message from the bus, how many operations we need to perform on each message, and how long they all take:

Finally, we have some statistics showing how we did:

Let’s now see the scaffolding code:

We have the Simulate method that runs the magic. It starts by initializing the timings and setting up some monitoring thread to print statistics every three seconds.

When it comes to the timings: we will run 20 loops for each message. In each loop’s iteration, we will do some CPU-bound operation (taking 100 milliseconds), and then some IO-bound operation (taking 1000 milliseconds). We can see that the CPU operation is 10 times shorter than the IO-bound one.
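To make the arithmetic explicit, here is a small sketch of such a timing configuration. The class and property names are assumptions made for this example, not the ones from the gist. The computed value is the best case for a single message, when nothing else competes for the thread.

```csharp
using System;

// Hypothetical settings type; the defaults mirror the values quoted in the text.
public sealed class Timings
{
    public int Iterations { get; set; } = 20;
    public TimeSpan CpuOperation { get; set; } = TimeSpan.FromMilliseconds(100);
    public TimeSpan IoOperation { get; set; } = TimeSpan.FromMilliseconds(1000);

    // Lower bound for one message: every iteration pays for the CPU part and the IO part.
    public TimeSpan BestCasePerMessage =>
        TimeSpan.FromTicks(Iterations * (CpuOperation + IoOperation).Ticks);
}
```

With the defaults this gives 20 × 1100 ms = 22 seconds per message. Raising `CpuOperation` to one second gives 20 × 2000 ms = 40 seconds.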

Finally, we have the heart of our system:

We receive the message, keep refreshing the lease, and process the message. Some error-handling code is omitted for brevity.

Receiving the message is rather straightforward – we check if we have more messages in the queue, then take one, otherwise we simply return:

Keeping a lease is also clear – we wait for some time, then refresh the lease and check if we made it on time:
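A minimal sketch of such a lease loop follows, assuming a hypothetical signature where the caller passes a completion check, the refresh interval, and the lease timeout. It is not the code from the gist, only an illustration of the "wait, refresh, check if we made it on time" idea.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class Lease
{
    // Returns true if every refresh happened in time, false if the lease was lost.
    // All parameter names are assumptions made for this example.
    public static async Task<bool> KeepLease(
        Func<bool> isFinished, TimeSpan refreshEvery, TimeSpan leaseTimeout)
    {
        var sinceLastRefresh = Stopwatch.StartNew();
        while (!isFinished())
        {
            // The thread is released here. How quickly we get it back
            // depends on what else is queued to run.
            await Task.Delay(refreshEvery);

            // If the continuation ran too late, the lease has already expired.
            if (sinceLastRefresh.Elapsed > leaseTimeout) return false;

            // "Refresh" the lease by restarting the clock.
            sinceLastRefresh.Restart();
        }
        return true;
    }
}
```

The important detail is that the check happens after the await. If the continuation is delayed by other work, the elapsed time grows even though the delay itself was short.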

Finally, the heart of our message processing. We simply run a loop and do the work:

Notice that we block the thread for the CPU-bound operation and use await for the IO-bound one.
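The difference between the two kinds of waiting can be sketched as below. This assumes `Thread.Sleep` as a stand-in for the CPU-bound work and `Task.Delay` as a stand-in for the IO. The method name and signature are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Worker
{
    // Runs the processing loop and returns the number of completed iterations.
    public static async Task<int> ProcessMessage(int iterations, TimeSpan cpu, TimeSpan io)
    {
        var completed = 0;
        for (var i = 0; i < iterations; i++)
        {
            // Simulated CPU-bound work: the thread is held and can do nothing else.
            Thread.Sleep(cpu);

            // Simulated IO-bound work: the thread is released until the delay elapses.
            await Task.Delay(io);

            completed++;
        }
        return completed;
    }
}
```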

We also have this logging method that prints the timestamp, thread ID, and the message:

Let’s run the code, let it go for a while, and then see what happens:

We can see that all messages were processed successfully in around 50 seconds. Processing a message was taking around 22 seconds, which makes perfect sense since we had 20 iterations taking around 1100 milliseconds each. No failures, all was good.

Let’s now increase the CPU-bound operation time to 1 second (to match the IO-bound part). This is what happens:

This time it took nearly 2 minutes to process all the messages. Each message is now taking around 40 seconds. Still, all worked.

Let’s now talk about threads. You can see that the examples use multiple threads to handle the messages. In the second execution, there were around 60 active messages at one time, so this created many threads (we can see that at least 50 threads were created based on the log above). Our application scales well and we can’t complain. Seems like async is doing a really good job!

However, what would happen if we moved this code to some other asynchronous platform? For instance, to Python’s asyncio that uses only a single thread? We can emulate that in C# by running the code above in a WinForms context that forces continuations to go through one thread. Let’s change the CPU-bound operation duration back to the original 100 milliseconds and run this from the WinForms app now:
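If you don’t have WinForms at hand, a similar effect can be approximated with a tiny custom SynchronizationContext that pumps all continuations on the calling thread. This is not what the experiment here uses. It is a sketch of the well-known "async pump" pattern, and `SingleThreadContext` and `Demo` are names invented for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleThreadContext : SynchronizationContext
{
    private readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();

    // Continuations captured by await end up here, possibly from other threads.
    public override void Post(SendOrPostCallback d, object? state) => queue.Add(() => d(state));

    // Runs the async entry point and pumps every continuation on the calling thread.
    public static void Run(Func<Task> main)
    {
        var previous = Current;
        var context = new SingleThreadContext();
        SetSynchronizationContext(context);
        try
        {
            var task = main();
            // Stop pumping once the top-level task is done.
            task.ContinueWith(_ => context.queue.CompleteAdding(), TaskScheduler.Default);
            foreach (var work in context.queue.GetConsumingEnumerable()) work();
            task.GetAwaiter().GetResult(); // rethrow failures, if any
        }
        finally { SetSynchronizationContext(previous); }
    }
}

public static class Demo
{
    // Collects the IDs of all threads that ran any part of the async method.
    public static HashSet<int> ThreadsUsed()
    {
        var threads = new HashSet<int>();
        SingleThreadContext.Run(async () =>
        {
            for (var i = 0; i < 5; i++)
            {
                threads.Add(Environment.CurrentManagedThreadId);
                await Task.Delay(10);
            }
            threads.Add(Environment.CurrentManagedThreadId);
        });
        return threads;
    }
}
```

Calling `Demo.ThreadsUsed()` should report exactly one thread ID, because every continuation is posted back to the pump instead of being picked up by the thread pool.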

It wasn’t that bad and we can see that we indeed ran on a single thread. First, notice that now it took nearly 4 minutes to complete. That’s understandable as we now run things on a single thread. Also, notice that each message was taking around 30-40 seconds to complete. That is much longer than before. This is because messages compete for the CPU time and we don’t have any parallelism. It’s also worth noting that we lost 3 messages. That’s not that bad. The system overscaled just a bit and couldn’t deal with the load, but then stabilized and finished processing.

Let’s now increase the CPU-bound duration to 1 second and try again:

And here things start to collapse. It took us 8 minutes to process all messages, each of them was taking around 1 minute, and we failed to process 90 out of 100. We lost 90% of all of the messages. Our system became unreliable just because we increased the CPU-bound part of the message processing. But why exactly did it break the application? What happened?

You don’t control the priority of continuations

Our application runs three distinct operations in total:

  • Take the message from the queue
  • Refresh the lease of the message
  • Do some processing of the message

Every single time we await a task, we release the thread and let it do something else. Once the IO-bound operation finishes, its continuation is scheduled to run on the same thread. However, the order of continuations doesn’t reflect the importance of what we should do.

In order to keep the system stable, we need to refresh the leases. Therefore, if there is any continuation that wants to refresh the lease (the continuation in KeepLease method), it should run before everything else.

Once we don’t have any continuations for refreshing the leases, we should run continuations for message processing. Obviously, if some KeepLease continuation gets scheduled, it should preempt other continuations.

Finally, when we have no continuations for refreshing the leases or processing the messages, we should run the continuation for getting a new message from the queue. In other words, we receive a new message only when we have some idle CPU time that we can use to process something more.

Unfortunately, async in C# doesn’t let you easily prioritize the continuations. However, this is not a problem most of the time because C# uses multiple threads! Once a continuation is free to run, the thread pool will grow to run the continuation earlier if possible. This is not part of the async programming paradigm and you can’t take it for granted. When we run things on a single thread, continuations have no priorities, and message processing continuations may stop lease refreshing continuations from running. Even worse, we may run a continuation that receives a new message from the bus even though we are already overloaded.

Depending on the nature of your platform (be it C# with different synchronization context, Python with single-threaded asyncio, or JavaScript with one and only one thread), you may get different results. Your application may scale well or may fail badly.

Let’s fix it

We can fix this issue in many ways. Conceptually, we need three different queues: the first one represents the lease refreshments, the second is for message processing, and the third is for getting a new message from the bus. We would then have one processor that would check each of the queues in order and execute the operations accordingly. Unfortunately, rewriting the application from the async paradigm to a consumer with multiple queues is not straightforward.
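To make the three-queue idea concrete, here is a deliberately simplified, synchronous sketch. `WorkKind` and `QueueProcessor` are hypothetical names. A real consumer would also have to deal with IO completion, cancellation, and errors, which is exactly what makes the rewrite non-trivial.

```csharp
using System;
using System.Collections.Generic;

// Lower value means higher priority.
public enum WorkKind { RefreshLease = 0, ProcessMessage = 1, ReceiveMessage = 2 }

public sealed class QueueProcessor
{
    // One FIFO queue per kind of work, stored in priority order.
    private readonly Queue<Action>[] queues =
        { new Queue<Action>(), new Queue<Action>(), new Queue<Action>() };

    public void Enqueue(WorkKind kind, Action work) => queues[(int)kind].Enqueue(work);

    // Runs one item from the most important non-empty queue.
    // Returns false when there is nothing left to do.
    public bool RunNext()
    {
        foreach (var queue in queues)
        {
            if (queue.Count > 0)
            {
                queue.Dequeue()();
                return true;
            }
        }
        return false;
    }

    public void RunAll()
    {
        while (RunNext()) { }
    }
}
```

Because `RunNext` rescans the queues from the top every time, a lease refresh enqueued while a message is being processed runs before any remaining processing or receiving work.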

Instead, we can reorder the continuations. The trick is to introduce a priority for each continuation. We do the following:

  1. We store a desired priority of continuations running on a thread
  2. When a continuation wants to run, it checks if the desired priority is equal to the priority of the continuation
  3. If it’s equal, then the continuation resets the desired priority to some invalid value and continues
  4. Otherwise, the continuation bumps the priority if possible and lets other continuations run

The protocol is based on the following idea: some continuation sets the desired priority to be at least the priority of the continuation and then lets other continuations run. If there are other continuations of lower priority, they will simply release the CPU and reschedule themselves. If there are continuations of some higher priority, they will bump the desired priority. And if there are no other continuations, then the original continuation will finally get the CPU, run the code, and reset the priority, so other continuations can run the same dance over and over. Here is the code:
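The original listing is not reproduced here, so what follows is only a sketch of the four steps above, not necessarily the code from the full snippet. It assumes that all continuations run on one thread, which is why the shared field needs no locking. `Prioritizer`, `SingleThreadContext`, and `PriorityDemo` are names invented for this example, and the small pump is included only so that the sketch runs on its own.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class Prioritizer
{
    private const int None = -1;        // the "invalid" value from step 3
    private static int desired = None;  // step 1; touched only from the single pump thread

    // Higher number means more important.
    public static async Task Prioritize(int priority)
    {
        while (true)
        {
            // Steps 2 and 3: it is our turn, so reset the desired priority and go on.
            if (desired == priority) { desired = None; return; }

            // Step 4: bump the desired priority if we are more important...
            if (priority > desired) desired = priority;

            // ...and let other continuations run before checking again.
            await Task.Yield();
        }
    }
}

// Minimal single-threaded pump so that Task.Yield comes back to the same thread.
public sealed class SingleThreadContext : SynchronizationContext
{
    private readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();

    public override void Post(SendOrPostCallback d, object? state) => queue.Add(() => d(state));

    public static void Run(Func<Task> main)
    {
        var previous = Current;
        var context = new SingleThreadContext();
        SetSynchronizationContext(context);
        try
        {
            var task = main();
            task.ContinueWith(_ => context.queue.CompleteAdding(), TaskScheduler.Default);
            foreach (var work in context.queue.GetConsumingEnumerable()) work();
            task.GetAwaiter().GetResult();
        }
        finally { SetSynchronizationContext(previous); }
    }
}

public static class PriorityDemo
{
    // Starts work items in the order 0, 1, 2 and records the order in which they get to run.
    public static List<int> Run()
    {
        var order = new List<int>();
        async Task Work(int priority)
        {
            await Prioritizer.Prioritize(priority);
            order.Add(priority);
        }
        SingleThreadContext.Run(() => Task.WhenAll(Work(0), Work(1), Work(2)));
        return order;
    }
}
```

In this sketch the items are started in the order 0, 1, 2 but get to run in the order 2, 1, 0. Note that waiting continuations keep rescheduling themselves through `Task.Yield()`, so the protocol trades some CPU time for ordering.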

We need to run this method every single time when we run await. For instance, KeepLease becomes this:

You can find the full snippet here. Let’s see it in action:

We can see that the code runs much slower and it takes 35 minutes to complete. However, all messages are processed successfully and the code scales automatically. We don’t need to manually control the thread pool size, but the application simply processes fewer or more messages depending on the actual CPU-bound processing time.

Summary

Async programming is very hard. We were told many times that it’s as simple as putting async here and await there. C# did a lot to get rid of deadlocks as nearly every platform now uses no synchronization context (as compared with old ASP.NET which had its own context or all the desktop apps that were running with a single thread). Also, C# uses a thread pool and can compensate for many programmer mistakes that would otherwise limit scalability.

However, asynchronous programming can be implemented in many other ways. You can’t assume that it will use many threads or that the coroutine will be triggered immediately. Many quirks can decrease the performance. Python’s asyncio is a great example of how asynchronous programming can work much differently, especially if you take Python’s performance into consideration. What’s IO-bound in C# can easily become CPU-bound in Python because Python is way slower.