Analysing Optimisations in the Wire Serialiser

Recently Roger Johansson wrote a post titled Wire – Writing one of the fastest .NET serializers, describing the optimisations that were implemented to make Wire as fast as possible. He also followed up that post with a set of benchmarks, showing how Wire compares to other .NET serialisers:

Wire compared to other .NET serialisers

Using BenchmarkDotNet, this post will analyse the individual optimisations and measure how much difference each change makes. For reference, the full list of optimisations in the original blog post is:

  • Looking up value serializers by type
  • Looking up types when deserializing
  • Byte buffers, allocations and GC
  • Clever allocations
  • Boxing, Unboxing and Virtual calls
  • Fast creation of empty objects

Looking up value serializers by type

This optimisation changed code like this:

public ValueSerializer GetSerializerByType(Type type)
{
  ValueSerializer serializer;

  if (_serializers.TryGetValue(type, out serializer))
    return serializer;

  // more code to build custom type serializers... ignore for now
}

into this:

public ValueSerializer GetSerializerByType(Type type)
{
  if (ReferenceEquals(type.GetTypeInfo().Assembly, ReflectionEx.CoreAssembly))
  {
    if (type == TypeEx.StringType) // we simply keep a reference to each primitive type
      return StringSerializer.Instance;

    if (type == TypeEx.Int32Type)
      return Int32Serializer.Instance;

    if (type == TypeEx.Int64Type)
      return Int64Serializer.Instance;

    ...
  }
  ...
}

So a dictionary lookup has been replaced with a chain of if statements. In addition, the Type instances of the known types are cached, rather than being calculated every time. As the results show, the optimisation pays off in some circumstances but not in others, so it’s not a clear win. It depends on where the type sits in the chain of if statements: a type near the beginning (e.g. System.String) is found more quickly than one near the end (e.g. System.Byte[]), which makes sense as all the earlier comparisons have to be carried out first.

Benchmark results: Looking up value serializers by type

Full benchmark code and results

Looking up types when deserializing

The second optimisation works by removing unnecessary memory allocations. It does this by (all three ideas are sketched in the code below):

  • Using a custom struct (value type) rather than a class
  • Pre-calculating the hash code once, rather than each time a comparison is needed
  • Doing string comparisons with the raw byte[], rather than deserialising to a string
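
Combining those ideas, a minimal sketch of such a struct key might look like this (illustrative only, not Wire’s actual code):

using System;

struct ByteArrayKey
{
    public readonly byte[] Bytes;
    public readonly int Hash; // pre-calculated once, in the constructor

    public ByteArrayKey(byte[] bytes)
    {
        Bytes = bytes;
        Hash = ComputeHash(bytes);
    }

    public override int GetHashCode()
    {
        return Hash; // no per-lookup hashing
    }

    public override bool Equals(object obj)
    {
        // this overload forces a box/unbox; see the comparer fix below
        return obj is ByteArrayKey && Equals((ByteArrayKey)obj);
    }

    public bool Equals(ByteArrayKey other)
    {
        if (Hash != other.Hash || Bytes.Length != other.Bytes.Length)
            return false;

        // compare the raw bytes directly, no deserialising to a string
        for (int i = 0; i < Bytes.Length; i++)
            if (Bytes[i] != other.Bytes[i])
                return false;
        return true;
    }

    private static int ComputeHash(byte[] bytes)
    {
        unchecked
        {
            var hash = 17;
            foreach (var b in bytes)
                hash = hash * 31 + b;
            return hash;
        }
    }
}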

Benchmark results: Looking up types when deserializing

Full benchmark code and results

Note: these results nicely demonstrate how BenchmarkDotNet can show you memory allocations as well as the time taken.

Interestingly they hadn’t actually removed all the memory allocations, as the comparison between OptimisedLookup and OptimisedLookupCustomComparer shows. To fix this I sent a P.R. which removes the unnecessary boxing, by using a custom comparer rather than the default struct comparer.
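
The boxing happens because, unless a struct implements IEquatable<T>, the default comparer falls back to Equals(object), so the dictionary boxes the key on every lookup. Supplying a custom IEqualityComparer<T> keeps everything strongly typed. Continuing the hypothetical ByteArrayKey sketch from above:

using System;
using System.Collections.Generic;

class ByteArrayKeyComparer : IEqualityComparer<ByteArrayKey>
{
    public static readonly ByteArrayKeyComparer Instance = new ByteArrayKeyComparer();

    public bool Equals(ByteArrayKey x, ByteArrayKey y)
    {
        return x.Equals(y); // binds to the strongly-typed overload, so no boxing
    }

    public int GetHashCode(ByteArrayKey key)
    {
        return key.Hash; // uses the pre-calculated value directly
    }
}

// the comparer is supplied when the dictionary is created:
var lookup = new Dictionary<ByteArrayKey, Type>(ByteArrayKeyComparer.Instance);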

Byte buffers, allocations and GC

Again, removing unnecessary memory allocations was key to this optimisation, most of which can be seen in the NoAllocBitConverter. By its nature, serialisation spends a lot of time converting from the in-memory representation of an object to its serialised version, i.e. a byte[]. So several tricks were employed to ensure that temporary memory allocations were either removed completely or, where that wasn’t possible, done by re-using a buffer from a pool rather than allocating a new one each time (see “Buffer recycling”).
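
To illustrate the technique (a simplified sketch, not the actual NoAllocBitConverter code), an Int32 can be written straight into a re-used buffer, instead of calling BitConverter.GetBytes(..), which would allocate a new 4-byte array on every call:

using System.IO;

static void WriteInt32(Stream stream, int value, byte[] buffer)
{
    // fill the pooled/re-used buffer in place; nothing new is allocated
    buffer[0] = (byte)value;
    buffer[1] = (byte)(value >> 8);
    buffer[2] = (byte)(value >> 16);
    buffer[3] = (byte)(value >> 24);
    stream.Write(buffer, 0, 4);
}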

Benchmark results: String serialisation/deserialisation

Full benchmark code and results

Clever allocations

This optimisation is perhaps the most interesting, because it’s implemented by creating a custom data structure, tailored to the specific needs of Wire. So, rather than using the default .NET dictionary, they implemented FastTypeUShortDictionary. In essence this data structure optimises for holding a single item, but falls back to a regular dictionary when it grows larger. To see this in action, here is the code from its TryGetValue(..) method:

public bool TryGetValue(Type key, out ushort value)
{
    switch (_length)
    {
        case 0:
            value = 0;
            return false;
        case 1:
            if (key == _firstType)
            {
                value = _firstValue;
                return true;
            }
            value = 0;
            return false;
        default:
            return _all.TryGetValue(key, out value);
    }
}

As we’ve seen before, the performance gains aren’t clear-cut. For instance, it depends on whether FastTypeUShortDictionary contains the item you are looking for (Hit vs. Miss), but generally it is faster:

Benchmark results: FastTypeUShortDictionary vs. alternatives

Full benchmark code and results

Boxing, Unboxing and Virtual calls

This optimisation is based on a widely used trick that I imagine almost all .NET serialisers employ. For a serialiser to be generic, it has to be able to handle any type of object that is passed to it. Therefore the first thing it does is use reflection to find the public fields/properties of that object, so that it knows the data it has to serialise. Doing reflection like this time and time again gets expensive, so the way to get round it is to do the reflection once and then use dynamic code generation to compile a delegate that you can call again and again.

If you are interested in how to implement this, see the Wire compiler source or this Stack Overflow question. As shown in the results below, compiling code dynamically is much faster than reflection and only a little bit slower than if you read/write the property directly in C# code:
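
As a flavour of the technique, here is a minimal sketch that compiles a property getter with expression trees (illustrative only, and it assumes a TestClass with a Name property):

using System;
using System.Linq.Expressions;
using System.Reflection;

static Func<T, object> CompileGetter<T>(PropertyInfo property)
{
    // the reflection happens once, here...
    var instance = Expression.Parameter(typeof(T), "instance");
    var getAndBox = Expression.Convert(Expression.Property(instance, property), typeof(object));
    // ...then the compiled delegate is cached and re-used for every object
    return Expression.Lambda<Func<T, object>>(getAndBox, instance).Compile();
}

// used like this, with no reflection on each call:
var getName = CompileGetter<TestClass>(typeof(TestClass).GetProperty("Name"));
object name = getName(someTestClass);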

Benchmark results: Dynamic code generation

Full benchmark code and results

Fast creation of empty objects

The final optimisation trick is also based on dynamic code generation, but this time it deals purely with creating empty objects. Again, this is something that a serialiser does many times, so any savings are worth it.

Basically the benchmark is comparing code like this:

FormatterServices.GetUninitializedObject(type);

with dynamically generated code, based on Expression trees:

var newExpression = ExpressionEx.GetNewExpression(typeToUse);
Func<TestClass> optimisation = Expression.Lambda<Func<TestClass>>(newExpression).Compile();

However this trick only works if the constructor of the type being created is empty, because only then is running the constructor equivalent to skipping it; otherwise it has to fall back to the slow version. But as shown in the results below, the optimisation is a clear win and worth implementing:
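
Putting the two paths together, a per-type factory might look something like this sketch (the check for an empty constructor body is elided here):

using System;
using System.Linq.Expressions;
using System.Runtime.Serialization;

static Func<object> CreateFactory(Type type)
{
    var ctor = type.GetConstructor(Type.EmptyTypes);
    if (ctor != null) // a real implementation would also verify the constructor body is empty
    {
        // fast path: a compiled delegate that simply does "new T()"
        // (the Convert to object also makes this work for value types)
        var body = Expression.Convert(Expression.New(ctor), typeof(object));
        return Expression.Lambda<Func<object>>(body).Compile();
    }

    // slow path: create the object without running any constructor
    return () => FormatterServices.GetUninitializedObject(type);
}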

Benchmark results: Fast creation of empty objects

Full benchmark code and results


Summary

So it’s obvious that Roger Johansson and Szymon Kulec (who also contributed performance improvements) know their optimisations, and as a result they have steadily made the Wire serialiser faster, which makes it an interesting project to learn from.