Analysing Pause times in the .NET GC

Over the last few months there have been several blog posts looking at GC pauses in different programming languages or runtimes. It all started with a post looking at the latency of the Haskell GC, next came a follow-up that compared Haskell, OCaml and Racket, followed by Go GC in Theory and Practice, before a final post looking at the situation in Erlang.

After reading all these posts I wanted to see how the .NET GC compares to the other runtime implementations.


The posts above all use a similar test program to exercise the GC, based on the message-bus scenario that Pusher initially described, fortunately Franck Jeannin had already started work on a .NET version, so this blog post will make us of that.

At the heart of the test is the following code:

for (var i = 0; i < msgCount; i++)
{
    var sw = Stopwatch.StartNew();
    pushMessage(array, i);
    sw.Stop();
    if (sw.Elapsed > worst)
    {
        worst = sw.Elapsed;
    }
}

private static unsafe void pushMessage(byte[][] array, int id)
{
    array[id % windowSize] = createMessage(id);               
}

The full code is available

So we are creating a ‘message’ (that is actually a byte[1024]) and then putting it into a data structure (byte[][]). This is repeated 10 million times (msgCount), but at any one time there are only 200,000 (windowSize) messages in memory, because we overwrite old ‘messages’ as we go along.

We are timing how long it takes to add the message to the array, which should be a very quick operation. It’s not guaranteed that this time will always equate to GC pauses, but it’s pretty likely. However we can also double check the actual GC pause times by using the excellent PerfView tool, to give us more confidence.


Workstation GC vs. Server GC

Unlike the Java GC that is very configurable, the .NET GC really only gives you a few options:

  • Workstation
  • Server
  • Concurrent/Background

So we will be comparing the Server and Workstation modes, but as we want to reduce pauses we are going to always leave Concurrent/Background mode enabled.

As outlined in the excellent post Understanding different GC modes with Concurrency Visualizer, the 2 modes are optimised for different things (emphasis mine):

Workstation GC is designed for desktop applications to minimize the time spent in GC. In this case GC will happen more frequently but with shorter pauses in application threads. Server GC is optimized for application throughput in favor of longer GC pauses. Memory consumption will be higher, but application can process greater volume of data without triggering garbage collection.

Therefore Workstation mode should give us shorter pauses than Server mode and the results bear this out, below is a graph of the pause times at different percentiles, recorded with by HdrHistogram.NET (click for full-size image):

Histogram - Array - WKS v SVR

Note that the X-axis scale is logarithmic, the Workstation (WKS) pauses starts increasing at the 99.99%’ile, whereas the Server (SVR) pauses only start at the 99.9999%’ile, although they have a larger maximum.

Another way of looking at the results is the table below, here we can clearly see that Workstation has a-lot more GC pauses, although the max is smaller. But more significantly the total GC pause time is much higher and as a result the overall/elapsed time is twice as long (WKS v. SVR).

Workstation GC (Concurrent) vs. Server GC (Background) (On .NET 4.6 - Array tests - all times in milliseconds)

GC Mode Max GC Pause # GC Pauses Total GC Pause Time Elapsed Time Peak Working Set (MB)
Workstation - 1 28.0 1,797 10,266.2 21,688.3 550.37
Workstation - 2 23.2 1,796 9,756.6 21,018.2 543.50
Workstation - 3 19.3 1,800 9,676.0 21,114.6 531.24
Server - 1 104.6 7 646.4 7,062.2 2,086.39
Server - 2 107.2 7 664.8 7,096.6 2,092.65
Server - 3 106.2 6 558.4 7,023.6 2,058.12

Therefore if you only care about the reducing the maximum pause time then Workstation mode is a suitable option, but you will experience more GC pauses overall and so the throughput of your application will be reduced. In addition, the working set is higher for Server mode as it allocates 1 heap per CPU.

Fortunately in .NET we have the choice of which mode we want to use, according to the fantastic article Modern garbage collection the GO runtime has optimised for pause time only:

The reality is that Go’s GC does not really implement any new ideas or research. As their announcement admits, it is a straightforward concurrent mark/sweep collector based on ideas from the 1970s. It is notable only because it has been designed to optimise for pause times at the cost of absolutely every other desirable characteristic in a GC. Go’s tech talks and marketing materials don’t seem to mention any of these tradeoffs, leaving developers unfamiliar with garbage collection technologies to assume that no such tradeoffs exist, and by implication, that Go’s competitors are just badly engineered piles of junk.


Max GC Pause Time compared to Amount of Live Objects

To investigate things further, let’s look at how the maximum pause times vary with the number of live objects. If you refer back to the sample code, we will still be allocating 10,000,000 message (msgCount), but we will vary the amount that are kept around at any one time by changing the windowSize value. Here are the results (click for full-size image):

GC Pause times compared to WindowSize

So you can clearly see that the max pause time is proportional (linearly?) to the amount of live objects, i.e. the amount of objects that survive the GC. Why is this that case, well to get a bit more info we will again use PerfView to help us. If you compare the 2 tables below, you can see that the ‘Promoted MB’ is drastically different, a lot more memory is promoted when we have a larger windowSize, so the GC has more work to do and as a result the ‘Pause MSec’ times go up.

GC Events by Time - windowSize = 100,000
All times are in msec. Hover over columns for help.
GC
Index
GenPause
MSec
Gen0
Alloc
MB
Peak
MB
After
MB
Promoted
MB
Gen0
MB
Gen1
MB
Gen2
MB
LOH
MB
21N39.4431,516.3541,516.354108.647104.8310.000107.2000.0311.415
30N38.5161,651.4660.000215.847104.8000.000214.4000.0311.415
41N42.7321,693.9081,909.754108.647104.8000.000107.2000.0311.415
50N35.0671,701.0121,809.658215.847104.8000.000214.4000.0311.415
61N54.4241,727.3801,943.226108.647104.8000.000107.2000.0311.415
70N35.2081,603.8321,712.479215.847104.8000.000214.4000.0311.415

Full PerfView output

GC Events by Time - windowSize = 400,000
All times are in msec. Hover over columns for help.
GC
Index
GenPause
MSec
Gen0
Alloc
MB
Peak
MB
After
MB
Promoted
MB
Gen0
MB
Gen1
MB
Gen2
MB
LOH
MB
20N10.31976.17076.17076.13368.9830.00072.3180.0003.815
31N47.192666.0890.000708.556419.2310.000704.0160.7253.815
40N145.3471,023.3691,731.925868.610419.2000.000864.0700.7253.815
51N190.7361,278.3142,146.923433.340419.2000.000428.8000.7253.815
60N150.6891,235.1611,668.501862.140419.2000.000857.6000.7253.815
71N214.4651,493.2902,355.430433.340419.2000.000428.8000.7253.815
80N148.8161,055.4701,488.810862.140419.2000.000857.6000.7253.815
91N225.8811,543.3452,405.485433.340419.2000.000428.8000.7253.815
100N148.2921,077.1761,510.516862.140419.2000.000857.6000.7253.815
111N225.9171,610.3192,472.459433.340419.2000.000428.8000.7253.815

Full PerfView output


Going ‘off-heap’

Finally, if we really want to eradicate GC pauses in .NET, we can go off-heap. To do that we can write unsafe code like this:

var dest = array[id % windowSize];
IntPtr unmanagedPointer = Marshal.AllocHGlobal(dest.Length);
byte* bytePtr = (byte *) unmanagedPointer;

// Get the raw data into the bytePtr (byte *) 
// in reality this would come from elsewhere, e.g. a network packet
// but for the test we'll just cheat and populate it in a loop
for (int i = 0; i < dest.Length; ++i)
{
    *(bytePtr + i) = (byte)id;
}

// Copy the unmanaged byte array (byte*) into the managed one (byte[])
Marshal.Copy(unmanagedPointer, dest, 0, dest.Length);

Marshal.FreeHGlobal(unmanagedPointer);

Note: I wouldn’t recommend this option unless you have first profiled and determined that GC pauses are a problem, it’s called unsafe for a reason.

Histogram - Array - SVR v OffHeap

But as the graph shows, it clearly works (the off-heap values are there, honest!!). But it’s not that surprising, we are giving the GC nothing to do (because off-heap memory isn’t tracked by the GC), we get no GC pauses!


To finish let’s get a final work from Maoni Stephens, the main GC dev on the .NET runtime, from GC ETW events – 2 – Maoni’s WebLog:

It doesn’t even mean for the longest individual GC pauses you should always look at full GCs because full GCs can be done concurrently, which means you could have gen2 GCs whose pauses are shorter than ephemeral GCs. And even if full GCs did have longest individual pauses, it still doesn’t necessarily mean you should only look at them because you might be doing these GCs very infrequently, and ephemeral GCs actually contribute to most of the GC pause time if the total GC pauses are your problem.

Note: Ephemeral generations and segments - Because objects in generations 0 and 1 are short-lived, these generations are known as the ephemeral generations.

So if GC pause times are a genuine issue in your application, make sure you analyse them correctly!