A look at the internals of 'Tiered JIT Compilation' in .NET Core15 Dec 2017 - 2312 words
The .NET runtime (CLR) has predominantly used a just-in-time (JIT) compiler to convert your executable into machine code (leaving aside ahead-of-time (AOT) scenarios for the time being), as the official Microsoft docs say:
At execution time, a just-in-time (JIT) compiler translates the MSIL into native code. During this compilation, code must pass a verification process that examines the MSIL and metadata to find out whether the code can be determined to be type safe.
But how does that process actually work?
The same docs give us a bit more info:
JIT compilation takes into account the possibility that some code might never be called during execution. Instead of using time and memory to convert all the MSIL in a PE file to native code, it converts the MSIL as needed during execution and stores the resulting native code in memory so that it is accessible for subsequent calls in the context of that process. The loader creates and attaches a stub to each method in a type when the type is loaded and initialized. When a method is called for the first time, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to point directly to the generated native code. Therefore, subsequent calls to the JIT-compiled method go directly to the native code.
Simple really!! However if you want to know more, the rest of this post will explore this process in detail.
In addition, we will look at a new feature that is making its way into the Core CLR, called ‘Tiered Compilation’. This is a big change for the CLR, up till now .NET methods have only been JIT compiled once, on their first usage. Tiered compilation is looking to change that, allowing methods to be re-compiled into a more optimised version much like the Java Hotspot compiler.
How it works
But before we look at future plans, how does the current CLR allow the JIT to transform a method from IL to native code? Well, they say ‘a pictures speaks a thousand words’
Before the method is JITed
After the method has been JITed
The main things to note are:
- The CLR has put in a ‘precode’ and ‘stub’ to divert the initial method call to the
PreStubWorker()method (which ultimately calls the JIT). These are hand-written assembly code fragments consisting of only a few instructions.
- Once the method had been JITed into ‘native code’, a stable entry point it created. For the rest of the life-time of the method the CLR guarantees that this won’t change, so the rest of the run-time can depend on it remaining stable.
- The ‘temporary entry point’ doesn’t go away, it’s still available because there may be other methods that are expecting to call it. However the associated ‘precode fixup’ has been re-written or ‘back patched’ to point to the newly created ‘native code’ instead of
- The CLR doesn’t change the address of the
callinstruction in the method that called the method being JITted, it only changes the address inside the ‘precode’. But because all method calls in the CLR go via a precode, the 2nd time the newly JITed method is called, the call will end up at the ‘native code’.
For reference, the ‘stable entry point’ is the same memory location as the
IntPtr that is returned when you call the RuntimeMethodHandle.GetFunctionPointer() method.
If you want to see this process in action for yourself, you can either re-compile the CoreCLR source and add the relevant debug information as I did or just use WinDbg and follow the steps in this excellent blog post (for more on the same topic see ‘Advanced Call Processing in the CLR’ and Vance Morrison’s excellent write-up ‘Digging into interface calls in the .NET Framework: Stub-based dispatch’).
Finally, the different parts of the Core CLR source code that are involved are listed below:
- JIT Helpers for ‘PrecodeFixupThunk’
- PrecodeFixupThunk (i386 assembly)
- ThePreStub (i386 assembly)
Note: this post isn’t going to look at how the JIT itself works, if you are interested in that take a look as this excellent overview written by one of the main developers.
JIT and Execution Engine (EE) Interaction
The make all this work the JIT and the EE have to work together, to get an idea of what is involved, take a look at this comment describing the rules that determine which type of precode the JIT can use. All this info is stored in the EE as it’s the only place that has the full knowledge of what a method does, so the JIT has to ask which mode to work in.
In addition, the JIT has to ask the EE what the address of a functions entry point is, this is done via the following methods:
- Then calls MethodDesc::GetMultiCallableAddrOfCode(..)
Precode and Stubs
There are different types or ‘precode’ available, ‘FIXUP’, ‘REMOTING’ or ‘STUB’, you can see the rules for which one is used in MethodDesc::GetPrecodeType(). In addition, because they are such a low-level mechanism, they are implemented differently across CPU architectures, from a comment in the code:
There two implementation options for temporary entrypoints:
(1) Compact entrypoints. They provide as dense entrypoints as possible, but can’t be patched to point to the final code. The call to unjitted method is indirect call via slot.
(2) Precodes. The precode will be patched to point to the final code eventually, thus the temporary entrypoint can be embedded in the code. The call to unjitted method is direct call to direct jump.
We use (1) for x86 and (2) for 64-bit to get the best performance on each platform. For ARM (1) is used.
There’s also a whole lot more information about ‘precode’ available in the BOTR.
Finally, it turns out that you can’t go very far into the internals of the CLR without coming across ‘stubs’ (or ‘trampolines’, ‘thunks’, etc), for instance they’re used in
- Virtual Method (Interface) Dispatch
- Jump Stubs
- Shared Generics
- Dll Import callbacks
- and probably some more I’ve missed!
Before we go any further I want to point out that Tiered Compilation is very much work-in-progress. As an indication, to get it working you currently have to set an environment variable called
COMPLUS_EXPERIMENTAL_TieredCompilation. It appears that the current work is focussed on the infrastructure to make it possible (i.e. CLR changes), then I assume that there has to be a fair amount of testing and performance analysis before it’s enabled by default.
If you want to learn about the goals of the feature and how it fits into the wider process of ‘code versioning’, I recommend reading the excellent design docs, including the future roadmap possibilities.
To give an indications of what has been involved so far, there has been work going on in the:
- Debugger (e.g. Breakpoints aren’t hit if tiered jitting recompiled the method before the debugger was attached and Source line breakpoints stop working when tiered jitting replaces the code)
- Profiling APIs - e.g. Tiered jitting: Implement additional profiler APIs
- Diagnostics - (all tracked via Tiered jitting: Design/Implement appropriate diagnostics, e.g. Tiered Jitting: Fix IL to native mapping for ETW)
- Interpreter - yes the CLR has a built-in Interpreter
- Many other places
If you want to follow along you can take a look at the related issues/PRs, here are the main ones to get you started:
- Tiered Compilation step 1
- WIP - Tiered Jitting Part Deux
- All PRs by Noah Falk (main Microsoft Developer working on the feature)
There is also some nice background information available in Introduce a tiered JIT and if you want to understand how it will eventually makes use of changes in the JIT (‘MinOpts’), take a look at Low Tier Back-Off and JIT: enable aggressive inline policy for Tier1.
History - ReJIT
As an quick historical aside, you have previously been able to get the CLR to re-JIT a method for you, but it only worked with the Profiling APIs, which meant you had to write some C/C++ COM code to make it happen! In addition ReJIT only allowed the method to be re-compiled at the same level, so it wouldn’t ever produce more optimised code. It was mostly meant to help monitoring or profiling tools.
How it works
Finally, how does it work, again lets look at some diagrams. Firstly, as a recap, lets take a look at how things ends up once a method had been JITed, with tiered compilation turned off (the same diagram as above):
Now, as a comparison, here’s what the same stage looks like with tiered compilation enabled:
The main difference is that tiered compilation has forced the method call to go through another level of indirection, the ‘pre stub’. This is to make it possible to count the number of times the method is called, then once it has hit the threshold (currently 30), the ‘pre stub’ is re-written to point to the ‘optimised native code’ instead:
Note that the original ‘native code’ is still available, so if needed the changes can be reverted and the method call can go back to the unoptimised version.
Using a counter
We can see a bit more details about the counter in this comments from prestub.cpp:
/*************************** CALL COUNTER ***********************/
// If we are counting calls for tiered compilation, leave the prestub
// in place so that we can continue intercepting method invocations.
// When the TieredCompilationManager has received enough call notifications
// for this method only then do we back-patch it.
BOOL fCanBackpatchPrestub = TRUE;
BOOL fEligibleForTieredCompilation = IsEligibleForTieredCompilation();
CallCounter * pCallCounter = GetCallCounter();
fCanBackpatchPrestub = pCallCounter->OnMethodCalled(this);
In essence the ‘stub’ calls back into the TieredCompilationManager until the ‘tiered compilation’ is triggered, once that happens the ‘stub’ is ‘back-patched’ to stop it being called any more.
Why not ‘Interpreted’?
And the answer I got was:
There’s already an Interpreter available, or is it not considered suitable for production code?
Its a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn’t the easiest place to start.
How different is the overhead between non-optimised and optimised JITting?
On my machine non-optimized jitting used about ~65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.
Why not LLVM?
Finally, why aren’t they using a LLVM to compile the code, from Introduce a tiered JIT (comment)
There were (and likely still are) significant differences in the LLVM support needed for the CLR versus what is needed for Java, both in GC and in EH, and in the restrictions one must place on the optimizer. To cite just one example: the CLRs GC currently cannot tolerate managed pointers that point off the end of objects. Java handles this via a base/derived paired reporting mechanism. We’d either need to plumb support for this kind of paired reporting into the CLR or restrict LLVM’s optimizer passes to never create these kinds of pointers. On top of that, the LLILC jit was slow and we weren’t sure ultimately what kind of code quality it might produce.
So, figuring out how LLILC might fit into a potential multi-tier approach that did not yet exist seemed (and still seems) premature. The idea for now is to get tiering into the framework and use RyuJit for the second-tier jit. As we learn more, we may discover there is indeed room for higher tier jits, or, at least, understand better what else we need to do before such things make sense.
There is more background info in Introduce a tiered JIT
One of my favourite side-effects of Microsoft making .NET Open Source and developing out in the open is that we can follow along with work-in-progress features. It’s great being able to download the latest code, try them out and see how they work under-the-hood, yay for OSS!!
Discuss this post on Hacker News