Analysing C# code on GitHub with BigQuery

Just over a year ago, Google made all the open source code on GitHub available for querying within BigQuery and, as if that wasn’t enough, you can run a terabyte of queries each month for free!

So in this post I am going to look at all the C# source code on GitHub and see what we can find out from it. Handily, because BigQuery charges per byte read, a smaller, C# only, dataset has been made available. It is called fh-bigquery:github_extracts.contents_net_cs and has:

  • 5,885,933 unique ‘.cs’ files
  • 792,166,632 lines of code (LOC)
  • 37.17 GB of data

Which is a pretty comprehensive set of C# source code!

The rest of this post will attempt to answer the following questions:

  1. Tabs or Spaces?
  2. regions: ‘should be banned’ or ‘okay in some cases’?
  3. ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
  4. Do C# developers like writing functional code?

Then moving onto some less controversial C# topics:

  1. Which using statements are most widely used?
  2. What NuGet packages are most often included in a .NET project?
  3. How many lines of code (LOC) are in a typical C# file?
  4. What is the most widely thrown Exception?
  5. ‘async/await all the things’ or not?
  6. Do C# developers like using the var keyword? (Updated)

Before we end up looking at repositories, not just individual C# files:

  1. What is the most popular repository with C# code in it?
  2. Just how many files should you have in a repository?
  3. What are the most popular C# class names?
  4. ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss some edge cases; after all, Regular Expressions: Now You Have Two Problems:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Tabs or Spaces?

In the entire dataset there are 5,885,933 files, but here we only include those that have more than 10 lines starting with a tab or a space:

Tabs      Tabs %   Spaces      Spaces %   Total
799,055   17.15%   3,859,528   82.85%     4,658,583

Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default)
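To make the counting rule concrete, here is a minimal Python sketch of how a single file could be classified. It is not the BigQuery query behind the numbers above: the function name `classify_indentation`, the majority rule and the tie-breaking are all assumptions made for illustration.

```python
def classify_indentation(source: str, min_lines: int = 10):
    """Return 'tabs', 'spaces', or None when too few lines are indented.

    Assumption: a file counts as 'tabs' when more of its indented lines
    start with a tab than with a space; ties are reported as 'spaces'.
    """
    tabs = spaces = 0
    for line in source.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    # Mirror the filter described above: only keep files with more than
    # 10 lines that start with a tab or a space.
    if tabs + spaces <= min_lines:
        return None
    return "tabs" if tabs > spaces else "spaces"
```

Files that come back as `None` are simply left out of the totals, which is why the table adds up to fewer files than the whole dataset.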

If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.

regions: ‘should be banned’ or ‘okay in some cases’?

It turns out that an impressive 712,498 C# files (out of 5.9 million) contain at least one #region statement (query used); that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)
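As a rough illustration of the check, the Python sketch below flags a file that contains a #region directive and reproduces the percentage arithmetic. The pattern and the helper names `has_region` and `percentage` are assumptions, not the query that was actually run.

```python
import re

# Assumed pattern: a '#region' directive at the start of a line,
# optionally preceded by whitespace.
REGION = re.compile(r"^\s*#region\b", re.MULTILINE)

def has_region(source: str) -> bool:
    """True when the file contains at least one #region directive."""
    return REGION.search(source) is not None

def percentage(part: int, total: int) -> float:
    """Share of 'part' in 'total', as a percentage."""
    return 100.0 * part / total
```

With the figures from the post, `percentage(712498, 5885933)` works out to roughly 12.1.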

‘K&R’ or ‘Allman’, where do C# devs like to put their braces?

C# developers overwhelmingly prefer putting an opening brace { on its own line (query used):

separate line      same line          same line (initializer)   total (with brace)     total (all code)
81,306,320 (67%)   40,044,603 (33%)   3,631,947 (2.99%)         121,350,923 (15.32%)   792,166,632

(‘same line initializers’ include code like new { Name = "", .. }, new [] { 1, 2, 3.. })
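One way to picture the three buckets is a line-by-line classifier like the Python sketch below. It is only an approximation under stated assumptions: the `INITIALIZER` pattern and the function `classify_brace` are invented for this example, and braces inside strings or comments would be miscounted.

```python
import re

# Assumed heuristic for initializers: 'new', then anything that is not
# a brace or semicolon, then '{' (covers 'new { ... }' and 'new [] { ... }').
INITIALIZER = re.compile(r"\bnew\b[^{;]*\{")

def classify_brace(line: str):
    """Classify where the opening brace sits on one line of C#."""
    stripped = line.strip()
    if "{" not in stripped:
        return None  # this line has no opening brace at all
    if stripped == "{":
        return "separate line"
    if INITIALIZER.search(stripped):
        return "same line (initializer)"
    return "same line"
```

Feeding every line of a file through `classify_brace` and tallying the results with `collections.Counter` gives per-file counts in the same shape as the table.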

Do C# developers like writing functional code?

This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.

Here are the raw percentiles:

Percentile % of lines using lambdas
10 0.51
25 1.14
50 2.50
75 5.26
90 9.95
95 14.29
99 28.00

So we can say that:

  • 50% of all the C# files on GitHub use => on 2.50% (or less) of their lines.
  • 10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%)
  • 5% use => on 1 in 7 lines (14.29%)
  • 1% of files have lambdas on over 1 in 4 (28%) of their lines of code, that’s pretty impressive!
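The percentile figures can be read as "per-file percentage of lines containing =>, then percentiles across all files". A small Python sketch of that calculation follows. The nearest-rank percentile method and both function names are assumptions, since the post does not say how BigQuery computed them. Note that => also appears in expression-bodied members, so it is only a proxy for lambdas.

```python
import math

def lambda_line_percentage(source: str) -> float:
    """Percentage of lines in one file that contain the '=>' operator."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if "=>" in line)
    return 100.0 * hits / len(lines)

def percentile(values, p: int):
    """Nearest-rank percentile: the smallest value that has at least
    p percent of all values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Applying `lambda_line_percentage` to every file and then calling `percentile(results, 50)` would give the median row of the table.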

Which using statements are most widely used?

Now on to something a bit more substantial: what are the most widely used using statements in C# code?

The top 10 looks like this (the full results are available):

using statement count
using System.Collections.Generic; 1,780,646
using System; 1,477,019
using System.Linq; 1,319,830
using System.Text; 902,165
using System.Threading.Tasks; 628,195
using System.Runtime.InteropServices; 431,867
using System.IO; 407,848
using System.Runtime.CompilerServices; 338,686
using System.Collections; 289,867
using System.Reflection; 218,369

However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.

So if we adjust the list to take account of this, the top 10 looks like so:

using statement count
using System.IO; 407,848
using System.Collections; 289,867
using System.Reflection; 218,369
using System.Diagnostics; 201,341
using System.Threading; 179,168
using System.ComponentModel; 160,681
using System.Web; 160,323
using System.Windows.Forms; 137,003
using System.Globalization; 132,113
using System.Drawing; 127,033

Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:

using statement count
using NUnit.Framework; 119,463
using UnityEngine; 117,673
using Xunit; 99,099
using Newtonsoft.Json; 81,675
using Newtonsoft.Json.Linq; 29,416
using Moq; 23,546
using UnityEngine.UI; 20,355
using UnityEditor; 19,937
using Amazon.Runtime; 18,941
using log4net; 17,297
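The three lists above boil down to "extract using directives, count them, then filter by root namespace". Here is a hedged Python sketch of that pipeline. The `USING` pattern, the choice to count each namespace once per file, and the names `count_usings` and `third_party` are assumptions rather than the post's actual query.

```python
import re
from collections import Counter

# Assumed pattern: plain 'using Some.Namespace;' directives only.
# 'using static', aliases and 'using (...)' blocks do not match.
USING = re.compile(r"^\s*using\s+([A-Za-z_][\w.]*)\s*;", re.MULTILINE)

def count_usings(sources):
    """Count, per namespace, how many files import it."""
    counts = Counter()
    for source in sources:
        # A set ensures a namespace is counted once per file.
        counts.update(set(USING.findall(source)))
    return counts

def third_party(counts, excluded=("System", "Microsoft", "Windows")):
    """Drop namespaces whose root is System, Microsoft or Windows."""
    return Counter({ns: n for ns, n in counts.items()
                    if ns.split(".")[0] not in excluded})
```

Calling `third_party(count_usings(files)).most_common(10)` would produce a list shaped like the last table.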

What NuGet packages are most often included in a .NET project?

It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub; it’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!

package count
Newtonsoft.Json 45,055
Microsoft.Web.Infrastructure 16,022
Microsoft.AspNet.Razor 15,109
Microsoft.AspNet.WebPages 14,495
Microsoft.AspNet.Mvc 14,236
EntityFramework 14,191
Microsoft.AspNet.WebApi.Client 13,480
Microsoft.AspNet.WebApi.Core 12,210
Microsoft.Net.Http 11,625
jQuery 10,646
Microsoft.Bcl.Build 10,641
Microsoft.Bcl 10,349
NUnit 10,341
Owin 9,681
Microsoft.Owin 9,202
Microsoft.AspNet.WebApi.WebHost 9,007
WebGrease 8,743
Microsoft.AspNet.Web.Optimization 8,721
Microsoft.AspNet.WebApi 8,179
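For readers who have not looked inside a ‘packages.config’ file, it is a small XML document with one package element per dependency. The Python sketch below shows one way such files could be tallied. The function names are made up for this example, and counting each package once per file is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def package_ids(packages_config_xml: str):
    """Return the set of package ids declared in one packages.config."""
    root = ET.fromstring(packages_config_xml)
    return {pkg.get("id") for pkg in root.iter("package") if pkg.get("id")}

def count_packages(configs):
    """Count how many packages.config files reference each package."""
    counts = Counter()
    for xml_text in configs:
        try:
            counts.update(package_ids(xml_text))
        except ET.ParseError:
            # Skip files that are not well-formed XML.
            continue
    return counts
```

`count_packages(all_configs).most_common(19)` would then give a ranking in the same form as the table above.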

How many lines of code (LOC) are in a typical C# file?

Are C# developers prone to creating huge files that go on for 1000s of lines? Well, some are, but fortunately it’s a minority of us!!

[Figure: Percentiles of lines of code per file]

Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.

Oh dear, Uncle Bob isn’t going to be happy: whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:

[Image: Uncle Bob - Clean Code - Number of lines of code in a file]

And in case you’re wondering, here’s the Top 10 longest C# files!!

File Lines
MarMot/Input/test.marmot.cs 92663
src/CodenameGenerator/WordRepos/LastNamesRepository.cs 88810
cs_inputtest/cs_02_7000.cs 63004
cs_inputtest/cs_02_6000.cs 54004
src/ML NET20/Utility/UserName.cs 52014
MWBS/Dictionary/DefaultWordDictionary.cs 48912
Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs 48407
UrduProofReader/UrduLibs/Utils.cs 48255
cs_inputtest/cs_02_5000.cs 45004
css/style.cs 44366

What is the most widely thrown Exception?

There are a few interesting results in this query. For instance, who knew that so many ApplicationExceptions were thrown? And NotSupportedException being so high up the list is a bit worrying!!

Exception count
throw new ArgumentNullException 699,526
throw new ArgumentException 361,616
throw new NotImplementedException 340,361
throw new InvalidOperationException 260,792
throw new ArgumentOutOfRangeException 160,640
throw new NotSupportedException 110,019
throw new HttpResponseException 74,498
throw new ValidationException 35,615
throw new ObjectDisposedException 31,129
throw new ApplicationException 30,849
throw new UnauthorizedException 21,133
throw new FormatException 19,510
throw new SerializationException 17,884
throw new IOException 15,779
throw new IndexOutOfRangeException 14,778
throw new NullReferenceException 12,372
throw new InvalidDataException 12,260
throw new ApiException 11,660
throw new InvalidCastException 10,510
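A count like this can be approximated by matching throw new followed by a type name ending in Exception. The Python sketch below illustrates the idea. The `THROW` pattern and `count_thrown` are assumptions, and fully qualified names such as System.ArgumentException would be missed by this particular pattern.

```python
import re
from collections import Counter

# Assumed pattern: 'throw new <Something>Exception'.
THROW = re.compile(r"\bthrow\s+new\s+(\w*Exception)\b")

def count_thrown(sources):
    """Count every 'throw new XException' occurrence across all files."""
    counts = Counter()
    for source in sources:
        counts.update(THROW.findall(source))
    return counts
```

Unlike a per-file count, this tallies every occurrence, so a file that throws ArgumentNullException ten times contributes ten to the total.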

‘async/await all the things’ or not?

The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:

public async Task<int> GetDotNetCountAsync()
{
    // Suspends GetDotNetCountAsync() to allow the caller (the web server)
    // to accept another request, rather than blocking on this one.
    var html = await _httpClient.DownloadStringAsync("");

    // Count occurrences of the literal text ".NET" (the dot is escaped).
    return Regex.Matches(html, @"\.NET").Count;
}

But how much is it used? Using the query below:

SELECT Count(*) count
FROM [fh-bigquery:github_extracts.contents_net_cs]
WHERE REGEXP_MATCH(content, r'\sasync\s|\sawait\s')

I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.
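To see what that regular expression does and does not catch, here is the same pattern applied in Python. This is only a local stand-in for the BigQuery query, assuming the match is a search anywhere in the file contents. The function name `uses_async_or_await` is made up for the example.

```python
import re

# Same pattern as in the query: the keyword must have whitespace
# on both sides.
ASYNC_AWAIT = re.compile(r"\sasync\s|\sawait\s")

def uses_async_or_await(source: str) -> bool:
    """True when the file contains ' async ' or ' await ' style usage."""
    return ASYNC_AWAIT.search(source) is not None
```

Because whitespace is required on both sides, identifiers such as asyncMode are correctly ignored, but an await at the very start of a file, or one wrapped as (await x), is not counted. With the figures above, 218,643 out of 5,885,933 is roughly 3.7% of files.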

Do C# developers like using the var keyword?

My original query suggested that var is used less than async and await, finding only 130,590 files with at least one usage of the var keyword.

Update: thanks to jairbubbles for pointing out that my var regex was wrong and supplying a fixed version!

With the corrected regex, var is used more than async and await: there are 1,457,154 files that have at least one usage of the var keyword.
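The post does not show either version of the var regex, so the pattern below is purely a hypothetical illustration of how easy it is to get this wrong. It is not jairbubbles’ fix. It assumes var is followed by an identifier and then either an assignment or the in of a foreach.

```python
import re

# Hypothetical pattern: 'var name =' or 'var name in'.
VAR = re.compile(r"\bvar\s+\w+\s*(?:=(?!=)|\bin\b)")

def uses_var(source: str) -> bool:
    """True when the file appears to declare a local with 'var'."""
    return VAR.search(source) is not None
```

The word boundary stops identifiers such as variable from matching, and requiring = or in after the name avoids most mentions of the word in comments.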

Just how many files should you have in a repository?

90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.

[Figure: Number of C# Files per Repository]

(again the Y-axis (# files) is logarithmic)
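Statements such as "90% of repositories have 95 files or less" come from grouping files by repository and looking at the distribution of the counts. A minimal Python sketch follows, assuming the input is a list of (repository name, file path) pairs. That layout and both function names are assumptions for illustration.

```python
from collections import Counter

def files_per_repo(rows):
    """Count the '.cs' files in each repository.

    rows: iterable of (repo_name, path) pairs (an assumed layout).
    """
    return Counter(repo for repo, path in rows
                   if path.lower().endswith(".cs"))

def share_at_or_below(counts, limit: int) -> float:
    """Percentage of repositories with at most 'limit' C# files."""
    sizes = list(counts.values())
    return 100.0 * sum(1 for n in sizes if n <= limit) / len(sizes)
```

On the real data, `share_at_or_below(counts, 95)` would be expected to come out at about 90, matching the figure quoted above.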

The top 10 largest repositories, by number of C# files are shown below:

Repository # Files
23389
14241
13051
10652
10185
9338
8060
7946
7860
7765

What is the most popular repository with C# code in it?

This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):

repo stars files
11075 237
8576 6503
8422 6351
8046 73
7123 132
7115 10652
7024 512
6184 81
5674 207
5674 142
5336 766
5130 1501
3701 957
3432 248
3340 650

Interesting that the top spot is a Google repository! (The C# files in it are sample code for using the gRPC library from .NET.)

What are the most popular C# class names?

Assuming that I got the regex correct, the most popular C# class names are the following:

Class name Count
class C 182480
class Program 163462
class Test 50593
class Settings 40841
class Resources 39345
class A 34687
class App 28462
class B 24246
class Startup 18238
class Foo 15198

Yay for Foo, just sneaking into the Top 10!!
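In the spirit of "assuming that I got the regex correct", here is a small Python sketch of one possible approach: match the class keyword followed by an identifier and count the names. The `CLASS_DECL` pattern and `count_class_names` are assumptions, not the regex used for the table, and declarations mentioned in comments or strings would be counted too.

```python
import re
from collections import Counter

# Assumed pattern: the keyword 'class' followed by an identifier.
CLASS_DECL = re.compile(r"\bclass\s+([A-Za-z_]\w*)")

def count_class_names(sources):
    """Count every declared class name across all files."""
    counts = Counter()
    for source in sources:
        counts.update(CLASS_DECL.findall(source))
    return counts
```

Generic constraints such as where T : class, new() are not counted, because the keyword is not followed by an identifier there.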

‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

Finally, let’s look at the different file names used. As with the using statements, they are dominated by the default ones used in the Visual Studio templates:

File Count
AssemblyInfo.cs 386822
Program.cs 105280
Resources.Designer.cs 40881
Settings.Designer.cs 35392
App.xaml.cs 21928
Global.asax.cs 16133
Startup.cs 14564
HomeController.cs 13574
RouteConfig.cs 11278
MainWindow.xaml.cs 11169
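Counting file names only needs the last segment of each path. The sketch below shows the idea in Python, assuming paths use forward slashes as they do on GitHub. The function name `count_file_names` is invented for this example.

```python
import posixpath
from collections import Counter

def count_file_names(paths):
    """Count how often each file name (without its directory) occurs."""
    return Counter(posixpath.basename(p) for p in paths)
```

For example, `count_file_names(paths).most_common(10)` would return the ten most frequent names together with their counts.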

Discuss this post on Hacker News and /r/csharp

More Information

As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!

How BigQuery Works (only put in at the end of the blog post)

BigQuery analysis of other Programming Languages