Analysing C# code on GitHub with BigQuery
12 Oct 2017
Just over a year ago Google made all the open source code on GitHub available for querying within BigQuery and, as if that wasn’t enough, you can run a terabyte of queries each month for free!
So in this post I am going to look at all the C# source code on GitHub and what we can find out from it. Handily, a smaller, C#-only dataset has been made available (useful because in BigQuery you are charged per byte read). It is called fh-bigquery:github_extracts.contents_net_cs and has:
- 5,885,933 unique ‘.cs’ files
- 792,166,632 lines of code (LOC)
- 37.17 GB of data
Which is a pretty comprehensive set of C# source code!
The rest of this post will attempt to answer the following questions:
- Tabs or Spaces?
- regions: ‘should be banned’ or ‘okay in some cases’?
- ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
- Do C# developers like writing functional code?
Then moving onto some less controversial C# topics:
- Which using statements are most widely used?
- What NuGet packages are most often included in a .NET project?
- How many lines of code (LOC) are in a typical C# file?
- What is the most widely thrown Exception?
- ‘async/await all the things’ or not?
- Do C# developers like using the var keyword? (Updated)
Before we end up looking at repositories, not just individual C# files:
- What is the most popular repository with C# code in it?
- Just how many files should you have in a repository?
- What are the most popular C# class names?
- ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss out some edge-cases, after all Regular Expressions: Now You Have Two Problems:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Tabs or Spaces?
In the entire data-set there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space:
Tabs | Tabs % | Spaces | Spaces % | Total |
---|---|---|---|---|
799,055 | 17.15% | 3,859,528 | 82.85% | 4,658,583 |
Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default)
If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.
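The classification behind the table can be approximated locally. This is a minimal sketch, not the query used for the post: the function name `indentation_style` is made up for illustration, and the rule that a file belongs to whichever style has more indented lines is an assumption. Only the "more than 10 lines" filter comes from the text above.

```python
def indentation_style(source, min_lines=10):
    """Classify one file as 'tabs' or 'spaces'.

    Returns None when the file has too few indented lines to count.
    The majority rule used here is an assumption, not the post's query.
    """
    tabs = 0
    spaces = 0
    for line in source.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    # Mirror the post's filter: more than `min_lines` lines must start
    # with a tab or a space for the file to be included at all.
    if tabs + spaces <= min_lines:
        return None
    return "tabs" if tabs > spaces else "spaces"
```

Running this over a folder of ‘.cs’ files and tallying the results should give a rough local equivalent of the Tabs/Spaces columns.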
regions: ‘should be banned’ or ‘okay in some cases’?
It turns out that there are an impressive 712,498 C# files (out of 5.8 million) that contain at least one #region statement (query used), that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)
‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
C# developers overwhelmingly prefer putting an opening brace { on its own line (query used):
separate line | same line | same line (initializer) | total (with brace) | total (all code) |
---|---|---|---|---|
81,306,320 (67%) | 40,044,603 (33%) | 3,631,947 (2.99%) | 121,350,923 (15.32%) | 792,166,632 |
(‘same line initializers’ include code like new { Name = "", .. }, new [] { 1, 2, 3.. })
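To make the three brace categories concrete, here is a rough sketch of how lines could be bucketed. It is not the query used for the post: `count_brace_styles` and the `INITIALIZER` pattern are hypothetical, and treating initializers as a subset of ‘same line’ is an assumption based on how the totals in the table add up.

```python
import re

# Hypothetical pattern for 'new { ... }' and 'new [] { ... }' style initializers.
INITIALIZER = re.compile(r"\bnew\b[^{;]*\{")


def count_brace_styles(source):
    """Count lines with an opening brace, split by where the brace sits."""
    counts = {"separate line": 0, "same line": 0, "same line (initializer)": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if "{" not in stripped:
            continue
        if stripped == "{":
            # Allman style: the brace is the only thing on the line.
            counts["separate line"] += 1
        else:
            # K&R style: the brace shares the line with other code.
            counts["same line"] += 1
            if INITIALIZER.search(stripped):
                counts["same line (initializer)"] += 1
    return counts
```

A line-based approach like this ignores braces inside strings and comments, so it would overcount slightly on real code.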
Do C# developers like writing functional code?
This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.
Here are the raw percentiles:
Percentile | % of lines using lambdas |
---|---|
10 | 0.51 |
25 | 1.14 |
50 | 2.50 |
75 | 5.26 |
90 | 9.95 |
95 | 14.29 |
99 | 28.00 |
So we can say that:
- 50% of all C# files on GitHub use => on 2.50% (or less) of their lines
- 10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%)
- 5% use => on 1 in 7 lines (14.29%)
- 1% of files have lambdas on more than 1 in 4 of their lines (28%), that’s pretty impressive!
Which using statements are most widely used?
Now on to something a bit more substantial: what are the most widely used using statements in C# code?
The top 10 looks like this (the full results are available):
using statement | count |
---|---|
using System.Collections.Generic; | 1,780,646 |
using System; | 1,477,019 |
using System.Linq; | 1,319,830 |
using System.Text; | 902,165 |
using System.Threading.Tasks; | 628,195 |
using System.Runtime.InteropServices; | 431,867 |
using System.IO; | 407,848 |
using System.Runtime.CompilerServices; | 338,686 |
using System.Collections; | 289,867 |
using System.Reflection; | 218,369 |
However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.
So if we adjust the list to take account of this, the top 10 looks like so:
using statement | count |
---|---|
using System.IO; | 407,848 |
using System.Collections; | 289,867 |
using System.Reflection; | 218,369 |
using System.Diagnostics; | 201,341 |
using System.Threading; | 179,168 |
using System.ComponentModel; | 160,681 |
using System.Web; | 160,323 |
using System.Windows.Forms; | 137,003 |
using System.Globalization; | 132,113 |
using System.Drawing; | 127,033 |
Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:
using statement | count |
---|---|
using NUnit.Framework; | 119,463 |
using UnityEngine; | 117,673 |
using Xunit; | 99,099 |
using Newtonsoft.Json; | 81,675 |
using Newtonsoft.Json.Linq; | 29,416 |
using Moq; | 23,546 |
using UnityEngine.UI; | 20,355 |
using UnityEditor; | 19,937 |
using Amazon.Runtime; | 18,941 |
using log4net; | 17,297 |
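The three lists above all come from the same kind of count, followed by different filters. As a hedged illustration (the `USING` pattern and both function names are assumptions, not the regex from the gist), a local version could look like this:

```python
import re
from collections import Counter

# Matches plain 'using Some.Namespace;' directives at the start of a line.
# 'using static', alias directives and 'using (...)' blocks do not match.
USING = re.compile(r"^\s*using\s+([A-Za-z_][\w.]*)\s*;", re.MULTILINE)


def count_usings(files):
    """Count how many files import each namespace (at most once per file)."""
    counts = Counter()
    for source in files:
        counts.update(set(USING.findall(source)))
    return counts


def third_party(counts, excluded=("System", "Microsoft", "Windows")):
    """Drop namespaces whose first segment is in `excluded`."""
    return Counter({
        namespace: n
        for namespace, n in counts.items()
        if namespace.split(".")[0] not in excluded
    })
```

`third_party(count_usings(files)).most_common(10)` would then give a local version of the last table. Counting each namespace once per file is a design choice here, since the post does not say whether duplicates within a file were collapsed.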
What NuGet packages are most often included in a .NET project?
It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub; it’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!
package | count |
---|---|
Newtonsoft.Json | 45,055 |
Microsoft.Web.Infrastructure | 16,022 |
Microsoft.AspNet.Razor | 15,109 |
Microsoft.AspNet.WebPages | 14,495 |
Microsoft.AspNet.Mvc | 14,236 |
EntityFramework | 14,191 |
Microsoft.AspNet.WebApi.Client | 13,480 |
Microsoft.AspNet.WebApi.Core | 12,210 |
Microsoft.Net.Http | 11,625 |
jQuery | 10,646 |
Microsoft.Bcl.Build | 10,641 |
Microsoft.Bcl | 10,349 |
NUnit | 10,341 |
Owin | 9,681 |
Microsoft.Owin | 9,202 |
Microsoft.AspNet.WebApi.WebHost | 9,007 |
WebGrease | 8,743 |
Microsoft.AspNet.Web.Optimization | 8,721 |
Microsoft.AspNet.WebApi | 8,179 |
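A ‘packages.config’ file is a small XML document with one package element per dependency, each carrying an id attribute. The sketch below shows one way the counts above could be reproduced from raw file contents. The function names are made up, and counting each package once per file and skipping unparsable files are both assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def package_ids(packages_config_xml):
    """Return the package ids listed in one packages.config document."""
    root = ET.fromstring(packages_config_xml)
    return [p.get("id") for p in root.iter("package") if p.get("id")]


def count_packages(configs):
    """Count how many packages.config files reference each package id."""
    counts = Counter()
    for xml_text in configs:
        try:
            counts.update(set(package_ids(xml_text)))
        except ET.ParseError:
            # Skip malformed files rather than failing the whole run.
            continue
    return counts
```

On GitHub-scale data a regex over the raw text may be the more practical route, but parsing the XML avoids matching ids inside comments.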
How many lines of code (LOC) are in a typical C# file?
Are C# developers prone to creating huge files that go on for thousands of lines? Well, some are, but fortunately it’s the minority of us!!
Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.
Oh dear, Uncle Bob isn’t going to be happy, whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:
And in case you’re wondering, here’s the Top 10 longest C# files!!
File | Lines |
---|---|
MarMot/Input/test.marmot.cs | 92663 |
src/CodenameGenerator/WordRepos/LastNamesRepository.cs | 88810 |
cs_inputtest/cs_02_7000.cs | 63004 |
cs_inputtest/cs_02_6000.cs | 54004 |
src/ML NET20/Utility/UserName.cs | 52014 |
MWBS/Dictionary/DefaultWordDictionary.cs | 48912 |
Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs | 48407 |
UrduProofReader/UrduLibs/Utils.cs | 48255 |
cs_inputtest/cs_02_5000.cs | 45004 |
css/style.cs | 44366 |
What is the most widely thrown Exception?
There are a few interesting results in this query: for instance, who knew that so many ApplicationExceptions were thrown? And NotSupportedException being so high up the list is a bit worrying!!
Exception | count |
---|---|
throw new ArgumentNullException | 699,526 |
throw new ArgumentException | 361,616 |
throw new NotImplementedException | 340,361 |
throw new InvalidOperationException | 260,792 |
throw new ArgumentOutOfRangeException | 160,640 |
throw new NotSupportedException | 110,019 |
throw new HttpResponseException | 74,498 |
throw new ValidationException | 35,615 |
throw new ObjectDisposedException | 31,129 |
throw new ApplicationException | 30,849 |
throw new UnauthorizedException | 21,133 |
throw new FormatException | 19,510 |
throw new SerializationException | 17,884 |
throw new IOException | 15,779 |
throw new IndexOutOfRangeException | 14,778 |
throw new NullReferenceException | 12,372 |
throw new InvalidDataException | 12,260 |
throw new ApiException | 11,660 |
throw new InvalidCastException | 10,510 |
‘async/await all the things’ or not?
The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:
public async Task<int> GetDotNetCountAsync()
{
    // Suspends GetDotNetCountAsync() to allow the caller (the web server)
    // to accept another request, rather than blocking on this one.
    // Assumes _httpClient is a System.Net.Http.HttpClient field.
    var html = await _httpClient.GetStringAsync("http://dotnetfoundation.org");
    // Escape the dot so the pattern matches the literal text ".NET".
    return Regex.Matches(html, @"\.NET").Count;
}
But how much is it used? Using the query below:
SELECT Count(*) count
FROM
[fh-bigquery:github_extracts.contents_net_cs]
WHERE
REGEXP_MATCH(content, r'\sasync\s|\sawait\s')
I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.
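It is worth seeing what that pattern does and does not catch. The sketch below applies the same regular expression with Python’s `re.search`, on the assumption that REGEXP_MATCH looks for a match anywhere in the content. The function name `uses_async_or_await` is made up for illustration.

```python
import re

# Same pattern as the query: the keyword must have whitespace on both sides.
ASYNC_AWAIT = re.compile(r"\sasync\s|\sawait\s")


def uses_async_or_await(content):
    """True if the file content contains a whitespace-delimited async or await."""
    return ASYNC_AWAIT.search(content) is not None
```

For example, `uses_async_or_await("public async Task Run()")` is true, while `"var x = await(task);"` is missed because the keyword is followed by a bracket. A comment such as `// make this async later` would be counted, so the total is an approximation in both directions.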
Do C# developers like using the var keyword?
My original query suggested they use it less than async and await: it found 130,590 files that have at least one usage of the var keyword.
Update: thanks to jairbubbles for pointing out that my var regex was wrong and supplying a fixed version!
With the corrected regex, they use it more than async and await: there are 1,457,154 files that have at least one usage of the var keyword.
Just how many files should you have in a repository?
90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.
(again the Y-axis (# files) is logarithmic)
The top 10 largest repositories, by number of C# files, are shown below:
Repository | # Files |
---|---|
https://github.com/xen2/mcs | 23389 |
https://github.com/mater06/LEGOChimaOnlineReloaded | 14241 |
https://github.com/Microsoft/referencesource | 13051 |
https://github.com/dotnet/corefx | 10652 |
https://github.com/apo-j/Projects_Working | 10185 |
https://github.com/Microsoft/CodeContracts | 9338 |
https://github.com/drazenzadravec/nequeo | 8060 |
https://github.com/ClearCanvas/ClearCanvas | 7946 |
https://github.com/mwilliamson-firefly/aws-sdk-net | 7860 |
https://github.com/151706061/MacroMedicalSystem | 7765 |
What is the most popular repository with C# code in it?
This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):
repo | stars | files |
---|---|---|
https://github.com/grpc/grpc | 11075 | 237 |
https://github.com/dotnet/coreclr | 8576 | 6503 |
https://github.com/dotnet/roslyn | 8422 | 6351 |
https://github.com/facebook/yoga | 8046 | 73 |
https://github.com/bazelbuild/bazel | 7123 | 132 |
https://github.com/dotnet/corefx | 7115 | 10652 |
https://github.com/SeleniumHQ/selenium | 7024 | 512 |
https://github.com/Microsoft/WinObjC | 6184 | 81 |
https://github.com/qianlifeng/Wox | 5674 | 207 |
https://github.com/Wox-launcher/Wox | 5674 | 142 |
https://github.com/ShareX/ShareX | 5336 | 766 |
https://github.com/Microsoft/Windows-universal-samples | 5130 | 1501 |
https://github.com/NancyFx/Nancy | 3701 | 957 |
https://github.com/chocolatey/choco | 3432 | 248 |
https://github.com/JamesNK/Newtonsoft.Json | 3340 | 650 |
Interesting that the top spot is a Google Repository! (the C# files in it are sample code for using the GRPC library from .NET)
What are the most popular C# class names?
Assuming that I got the regex correct, the most popular C# class names are the following:
Class name | Count |
---|---|
class C | 182480 |
class Program | 163462 |
class Test | 50593 |
class Settings | 40841 |
class Resources | 39345 |
class A | 34687 |
class App | 28462 |
class B | 24246 |
class Startup | 18238 |
class Foo | 15198 |
Yay for Foo, just sneaking into the Top 10!!
‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
Finally let’s look at the different file names used. As with the using statements, they are dominated by the default ones used in the Visual Studio templates:
File | Count |
---|---|
AssemblyInfo.cs | 386822 |
Program.cs | 105280 |
Resources.Designer.cs | 40881 |
Settings.Designer.cs | 35392 |
App.xaml.cs | 21928 |
Global.asax.cs | 16133 |
Startup.cs | 14564 |
HomeController.cs | 13574 |
RouteConfig.cs | 11278 |
MainWindow.xaml.cs | 11169 |
Discuss this post on Hacker News and /r/csharp
More Information
As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!
How BigQuery Works (only put in at the end of the blog post)
- BigQuery under the hood
- Inside Capacitor, BigQuery’s next-generation columnar storage format
- In-memory query execution in Google BigQuery
- Counting uniques faster in BigQuery with HyperLogLog++
- Separation of compute and state in Google BigQuery and Cloud Dataflow (and why it matters)
- #94 BigQuery Under the Hood with Tino Tereshko and Jordan Tigani
- TECH TALK: BI Performance Benchmarks with Google BigQuery
BigQuery analysis of other Programming Languages
- Analyzing Go code with BigQuery
- Using BigQuery GitHub data to rank npm repositories
- Using BigQuery GitHub data to find out open source software development trends
- Using BigQuery to Analyze PHP on GitHub
- Extracting all Go regular expressions found on GitHub
- More advanced github code search
- Top angular directives on github, including custom directives
- 779,236 Java Logging Statements, 1,313 GitHub Repositories: ERROR, WARN or FATAL?
- /r/BigQuery