Logging in a distributed system, Part 1 — Correlations

This is the first part of the Logging in a distributed system series. For your convenience, you can find the other parts using the links below (or by guessing the address):
Part 1 — Correlations
Part 2 — Action filters
Part 3 — Exceptions
Part 4 — Passing data to other systems
Part 5 — Parsing logs

If you are interested in the topic, see the talk page.

Distributed systems are hard. Patterns of Enterprise Application Architecture says that “You shouldn’t try to break up a single application into Web services that talk to each other unless you really need to.” However, nowadays microservices (and even nanoservices) are very popular. One issue when working with such an architecture comes down to logging: there are things which simply must be reliable, because without them debugging errors is close to impossible.

General idea

Generally, it is a good idea to have logs from different components gathered in one place. That’s why there are tools like the ELK stack (Elasticsearch, Logstash and Kibana), which collect logs, transform them into a common format and display them. Search capabilities and built-in alerts are also nice features.

In order to analyse logs, you need to be able to correlate them. That’s why a correlation id is often a “must-have” pattern. In this series I am going to show you a simple implementation of a logger handling correlation ids and Lamport’s happened-before relation for ordering events. Do not consider this the one solution to rule them all; it is just a base implementation ready to be extended as you need.

Components

Let’s assume that our system consists of a Sitefinity CMS, an Azure WebJob and a REST API based on WebAPI 2. All these components are hosted in Azure. We want to design code for logging events in a uniform way.

Interfaces

First, let’s start with interfaces:
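A minimal sketch of how these interfaces could look; all type and member names here are assumptions, not the original code:

```csharp
using System;

// Assumed severity levels for log entries.
public enum LoggingLevel
{
    Debug,
    Information,
    Warning,
    Error
}

public interface ILogger
{
    // The implementation is expected to enrich the message with
    // infrastructure details (node id, correlation id, timestamp, ...).
    void Log(LoggingLevel level, string message);
}

public interface ICorrelator
{
    Guid CorrelationId { get; set; }
    string Activity { get; set; }
    long LogicalTime { get; }

    // Advances the Lamport clock for a local event and returns the new value.
    long IncrementLogicalTime();

    // Merges a logical time received from another component (Lamport's rule).
    void UpdateLogicalTime(long receivedTime);
}
```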

We have an interface for our logger. It gives us only one method for logging an event, accepting an information level (see above) and a string message. Ideally, the logger implementation should add the necessary infrastructure details to the message (node id, correlation id, timestamp, etc.).

Since we need to maintain the logical time (Lamport’s happened-before) and the correlation ID, we have an interface to store this data. We also store the activity, which represents the currently executed action.

Also, we might use a bunch of constants to name the data we pass around:
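For instance, something along these lines (the header names are purely illustrative):

```csharp
// Hypothetical names of the headers carrying correlation data between components.
public static class LoggingConstants
{
    public const string CorrelationIdHeader = "X-Correlation-Id";
    public const string LogicalTimeHeader = "X-Logical-Time";
}
```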

This is it. Now we need to implement the interfaces. Let’s start with the correlators.

Correlators

We basically have two entry points: one for the web application and one for the web job. The web application should get the correlation ID from the request headers or create a new one. The web job always needs to create the ID on its own. Let’s start with the base correlator:
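A sketch of such a base correlator, building on the interface assumed above:

```csharp
using System;
using System.Threading;

public abstract class CorrelatorBase : ICorrelator
{
    private long logicalTime;

    public Guid CorrelationId { get; set; }
    public string Activity { get; set; }

    public long LogicalTime => Interlocked.Read(ref logicalTime);

    // Local event: bump the clock atomically.
    public long IncrementLogicalTime() => Interlocked.Increment(ref logicalTime);

    // Receive event: clock = max(local, received) + 1, retried with a
    // compare-and-swap loop instead of a lock.
    public void UpdateLogicalTime(long receivedTime)
    {
        long current, next;
        do
        {
            current = Interlocked.Read(ref logicalTime);
            next = Math.Max(current, receivedTime) + 1;
        } while (Interlocked.CompareExchange(ref logicalTime, next, current) != current);
    }
}
```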

The only logic here is for maintaining the logical time. Since we don’t want locks degrading performance, we use interlocked operations to update the logical time.

Let’s implement the correlator for web applications:
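A sketch for WebAPI 2, assuming the header names defined earlier:

```csharp
using System;
using System.Linq;
using System.Net.Http;

public class WebCorrelator : CorrelatorBase
{
    public WebCorrelator(HttpRequestMessage request)
    {
        // Reuse the correlation ID from the incoming request, or start a new one.
        CorrelationId =
            request.Headers.TryGetValues(LoggingConstants.CorrelationIdHeader, out var ids)
            && Guid.TryParse(ids.FirstOrDefault(), out var parsed)
                ? parsed
                : Guid.NewGuid();

        // Merge the caller's logical time, if it was passed along.
        if (request.Headers.TryGetValues(LoggingConstants.LogicalTimeHeader, out var times)
            && long.TryParse(times.FirstOrDefault(), out var time))
        {
            UpdateLogicalTime(time);
        }

        // Activity is intentionally left unset; it is filled in later.
    }
}
```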

The implementation is rather straightforward. The correlation ID and logical time are extracted from the headers, or generated if they are absent. The activity is not set yet; that will be done later.

We also need to implement a correlator for web jobs. This is pretty easy, since it doesn’t need to bother with headers and always generates the data on start:
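A possible sketch:

```csharp
using System;

public class WebJobCorrelator : CorrelatorBase
{
    public WebJobCorrelator()
    {
        // No incoming headers to inspect, so everything starts fresh.
        CorrelationId = Guid.NewGuid();
        // Logical time starts at zero; Activity is set when the job begins.
    }
}
```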

Loggers

We have a similar case with the loggers. Let’s start with the base class:
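A sketch of what the base class could look like; the exact entry layout (pipe-separated fields here) is an assumption:

```csharp
using System;

public abstract class LoggerBase : ILogger
{
    private readonly ICorrelator correlator;

    // Separate from the correlation ID: one per logger instance (e.g. per widget).
    private readonly Guid loggerId = Guid.NewGuid();

    protected LoggerBase(ICorrelator correlator)
    {
        this.correlator = correlator;
    }

    public void Log(LoggingLevel level, string message)
    {
        var entry = string.Join("|",
            DateTime.UtcNow.ToString("o"),
            level,
            NodeId,
            correlator.CorrelationId,
            loggerId,
            correlator.IncrementLogicalTime(),
            correlator.Activity,
            message);

        // Extra new lines around the entry make it easier to find when parsing.
        Write(Environment.NewLine + entry + Environment.NewLine);
    }

    protected abstract string NodeId { get; }
    protected abstract void Write(string entry);
}
```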

The method for logging a message adds the infrastructure details. The only interesting thing is why we have two IDs: correlation and logger. This is because Sitefinity internally might create multiple requests for each widget; these are in fact completely separate (each widget does its job on its own) but are part of one page (and so they should be correlated). You can treat this as a scope or something similar.

We also add fancy new lines before and after the message, because it makes parsing the logs easier. Since we want to push the logs to Azure OMS, we need to parse them in a job handler (which will be shown in one of the next parts).

Let’s go with the logger for Azure:
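A sketch; WEBSITE_INSTANCE_ID is one of the variables Azure App Service sets automatically and serves here as an assumed source of the node ID:

```csharp
using System;

public abstract class AzureLoggerBase : LoggerBase
{
    protected AzureLoggerBase(ICorrelator correlator)
        : base(correlator)
    {
    }

    // Assumption: derive the node ID from a platform-provided variable.
    protected override string NodeId =>
        Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID") ?? "unknown-node";
}
```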

We extract the node ID from environment variables which are set automatically by the Azure platform. We also have a hack for the missing file MonAgentListener.dll, which sometimes is not copied to the output directory. This library is required by the Azure trace listener and is part of the Azure SDK, so if your build server doesn’t have the SDK installed, you need to add the dll manually to your solution and make sure that it is copied to the bin directory.

Now, the logger for web applications:
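A sketch, with a hypothetical marker attribute standing in for whatever your DI conventions use:

```csharp
using System;
using System.Diagnostics;

// Hypothetical attribute telling the convention-based scanner to skip this type.
[AttributeUsage(AttributeTargets.Class)]
public class RegisteredManuallyAttribute : Attribute { }

[RegisteredManually]
public class WebLogger : AzureLoggerBase
{
    public WebLogger(ICorrelator correlator)
        : base(correlator)
    {
    }

    // Entries go through System.Diagnostics.Trace, which the Azure
    // trace listener picks up.
    protected override void Write(string entry)
    {
        Trace.WriteLine(entry);
    }
}
```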

Web apps have no way to log anything to the Azure infrastructure other than through a trace listener, so we simply use the Trace class. Also, since I register lots of stuff by convention, I mark this logger as registered manually with an attribute. This is only a matter of your DI container and its conventions.

The web job logger looks a bit different, since it should use the TextWriter injected by the framework:
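A sketch, assuming the classic WebJobs SDK pattern of a TextWriter parameter injected into the triggered function:

```csharp
using System.IO;

public class WebJobLogger : AzureLoggerBase
{
    private readonly TextWriter writer;

    public WebJobLogger(ICorrelator correlator, TextWriter writer)
        : base(correlator)
    {
        this.writer = writer;
    }

    protected override void Write(string entry)
    {
        writer.WriteLine(entry);
    }
}
```

Wiring it up inside a hypothetical queue-triggered function could then look like this:

```csharp
public static void ProcessQueueMessage(
    [QueueTrigger("orders")] string message, // "orders" is a made-up queue name
    TextWriter log)
{
    var correlator = new WebJobCorrelator();
    var logger = new WebJobLogger(correlator, log);
    logger.Log(LoggingLevel.Information, "Received message: " + message);
}
```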

And that’s all.

Summary

We have our loggers and correlators, and we are ready to use them in production code. In the next part we will see how to create them automatically with various action filters.