This is the second part of the Custom memory allocation series. For your convenience you can find other parts in the table of contents in Part 1 — Allocating object on a stack

Hi! Last time we saw how to allocate managed object on a stack. Today we are going to implement custom list which stores objects instead of references to them. This post is inspired by Unsafe Part 3: Benchmarking a java UnsafeArrayList

Introduction

First, we need to examine how the array is stored in memory. Let’s consider a one-dimensional, ordinary (not jagged) array of integers. It is a reference so it has pointer to method table. We can lock on it, so it has sync block. However, it has additional field between method table pointer and its content — one integer representing the size of the array.

So if we store an element in array[0] it is physically placed 8 bytes after the beginning of the array (to be precise: the address pointed by reference).

Our idea is simple: instead of storing references to objects we will store objects themselves in the array of integers. We will need to copy objects by hand and then we will need to recover the reference using pointer arithmetic. Let’s go.

Adding object to the list

Let’s start with declaring class and fields:

public class UnsafeList< T > : IEnumerable where T : class
{
	private readonly int _elementSize;
	private T _target;
	private int[] _storage;
	private int _currentIndex = -1;

	public UnsafeList(int size, int elementSize)
	{
		_elementSize = elementSize;
		_storage = new int[size * _elementSize];
		_target = default(T);
	}
	
	// ...
}

We declare generic class and store size of the element (including sync block), dummy reference called _target (we will use it to get references to stored objects), integer array for objects, and current index in the array. Next, let’s implement addition:

public void Add(T item)
{
	_currentIndex++;
	unsafe
	{
		TypedReference reference = __makeref(item);
		int* itemAddress = (int*)(*(int*)*(int*)&reference - sizeof(int));
		for (int i = 0; i < _elementSize; ++i)
		{
			var value = *itemAddress;
			_storage[_currentIndex * _elementSize + i] = value;
			itemAddress = itemAddress + 1;
		}
	}
}

We first create typed reference to added item. Next, we calculate the address of the element and move back four bytes to include sync block. Finally, we iterate over all object bytes and copy them one by one to our array.

Let's now implement getter:

public T Get(int index)
{
	unsafe
	{
		TypedReference arrayReference = __makeref(_storage);
		TypedReference targetReference = __makeref(_target);
		*(int*) *(int*) &targetReference = (*(int*) *(int*) (&arrayReference) + index * _elementSize*sizeof (int) + 3*sizeof (int));
		return _target;
	}
}

Looks pretty easy. We get two typed references: one pointing to storage array, another one pointing to dummy reference. Next, we modify dummy reference and store there an address of the element to retrieve. We perform simple pointer arithmetic:
(*(int*) *(int*) (&arrayReference) + index * _elementSize*sizeof (int) + 3*sizeof (int))
First, we retrieve the pointer to pointer to pointer to data. Next, we dereference just enough times to get the address of the array. Next, we move forward 3*sizeof(int): we need to skip two integers of the array (pointer to method table and size of the array), and we need to omit sync block of the object.

And this is it.

Testing program

Let's now evaluate our solution. We start with two helper functions:

class Poco
{
	public int Field1;
	public int Field2;
	public int Field3;
	public int Field4;
	public int Field5;
	public int Field6;
	public int Field7;
	public int Field8;
	public int Field9;
	public int Field10;
	public int Field11;
	public int Field12;
	public int Field13;
	public int Field14;
	public int Field15;
	public int Field16;
}


static Poco GetPoco(int element)
{
	var poco = new Poco
	{
		Field1 = element,
		Field2 = element - 1,
		Field3 = element - 2,
		Field4 = element - 3,
		Field5 = element - 4,
		Field6 = element - 5,
		Field7 = element - 6,
		Field8 = element - 7,
		Field9 = element - 7,
		Field10 = element - 6,
		Field11 = element - 5,
		Field12 = element - 4,
		Field13 = element - 3,
		Field14 = element - 2,
		Field15 = element - 1,
		Field16 = element
	};
	return poco;
}

static int SumPoco(Poco poco)
{
	int sum = 0;
	sum += poco.Field1;
	sum += poco.Field2;
	sum += poco.Field3;
	sum += poco.Field4;
	sum += poco.Field5;
	sum += poco.Field6;
	sum += poco.Field7;
	sum += poco.Field8;
	sum += poco.Field9;
	sum += poco.Field10;
	sum += poco.Field11;
	sum += poco.Field12;
	sum += poco.Field13;
	sum += poco.Field14;
	sum += poco.Field15;
	sum += poco.Field16;
	return sum;
}

We declare a class with sixteen fields. Next, we create a method which will create objects, and finally a method performing few additions.
Next, let's compare three different containers: ordinary array, built-in list, and our custom list.

The code testing the array goes as follows:

static void TestArray(int elementsCount)
{
	Console.WriteLine("Array");
	var stopwatch = new Stopwatch();

	stopwatch.Start();
	var array = new Poco[elementsCount];
	for (int i = 0; i < elementsCount; ++i)
	{
		array[i] = GetPoco(i);
	}
	stopwatch.Stop();
	Console.WriteLine(string.Format("Insertion time: {0} ", stopwatch.ElapsedMilliseconds));

	int sum = 0;
	stopwatch.Restart();
	for (int i = 0; i < elementsCount; ++i)
	{
		sum += SumPoco(array[i]);
	}
	stopwatch.Stop();

	Console.WriteLine(string.Format("Sum: {0} ", sum));
	Console.WriteLine(string.Format("Calculation time: {0} ", stopwatch.ElapsedMilliseconds));
}

We create an array, create and store few objects, check insertion time, restart the stopwatch, check calculation time, and print results. Almost the same goes for the built-in list:

static void TestNormal(int elementsCount)
{
	Console.WriteLine("Normal");
	var stopwatch = new Stopwatch();

	stopwatch.Start();
	var normalList = new List< Poco >(elementsCount);
	for (int i = 0; i < elementsCount; ++i)
	{
		normalList.Add( GetPoco(i) );
	}
	stopwatch.Stop();
	Console.WriteLine(string.Format("Insertion time: {0} ", stopwatch.ElapsedMilliseconds));

	int sum = 0;
	stopwatch.Restart();
	for (int i = 0; i < elementsCount; ++i)
	{
		sum += SumPoco( normalList[i] );
	}
	stopwatch.Stop();

	Console.WriteLine(string.Format("Sum: {0} ", sum));
	Console.WriteLine(string.Format("Calculation time: {0} ", stopwatch.ElapsedMilliseconds));
}

We create a list with predefined size in order to avoid reallocation. And now the code for our list:

static void TestCustom(int elementsCount)
{
	Console.WriteLine("Custom");
	var stopwatch = new Stopwatch();

	stopwatch.Start();
	var unsafeList = new UnsafeList< Poco >(elementsCount, 18);
	for (int i = 0; i < elementsCount; ++i)
	{
		unsafeList.Add(GetPoco(i));
	}
	stopwatch.Stop();
	Console.WriteLine(string.Format("Insertion time: {0} ", stopwatch.ElapsedMilliseconds));

	int sum = 0;
	stopwatch.Restart();
	for (int i = 0; i < elementsCount; ++i)
	{
		sum += SumPoco(unsafeList.Get(i));
	}
	stopwatch.Stop();

	Console.WriteLine(string.Format("Sum: {0} ", sum));
	Console.WriteLine(string.Format("Calculation time: {0} ", stopwatch.ElapsedMilliseconds));
}

As you can see, all these codes look almost the same. The only difference is API for performing operations on collections.

Let's now invoke the test methods:

static void Main(string[] args)
{
	var testsCount = 3;
	var elementsCount = 20 * 1000 * 1000;
	for (int i = 0; i < testsCount; ++i)
	{
		TestArray(elementsCount);
		GC.Collect();
		GC.WaitForPendingFinalizers();
	}
	Console.WriteLine("------------------------");
	for (int i = 0; i < testsCount; ++i)
	{
		TestNormal(elementsCount);
		GC.Collect();
		GC.WaitForPendingFinalizers();
	}
	Console.WriteLine("------------------------");
	for (int i = 0; i < testsCount; ++i)
	{
		TestCustom(elementsCount);
		GC.Collect();
		GC.WaitForPendingFinalizers();
	}
}

We test each collection three times. Each time we create 20 millions objects, perform test, and run GC. I performed the test on the same configuration as in part 1, you can see the results below:
Unsafe list benchmark
As we can see, insertion time is almost the same for ordinary array and for built-in list. However, inserting to unsafe list is much faster. When it comes to calculation time, unsafe list is a bit slower, probably because of additional time to extract the reference to the object.

You can test the code with this gist.

Summary

You can now compare achieved results with the blog post describing the java implementation. Our results are a bit different, but still interesting. You can compare different architectures as well.

Bonus chatter: what needs to be changed in order to get this code working on x64?