Matt on Software

Software, C#, RavenDB and other stuff



RavenDB v. Redis – redux

A while ago Demis Bellot did a nice write-up on RavenDB performance compared to Redis. The tl;dr is that Redis was 11.75x faster than RavenDB, when doing a like-for-like comparison of bulk inserts.

However, a few months ago a new API was added to RavenDB that vastly increases the Bulk Insert speed (there is also a nice post showing the implementation details). Using the new API, I updated the benchmark code with the following snippet:

// store, names and id come from the surrounding benchmark code
using (var bulkInsert = store.BulkInsert())
{
    foreach (var name in names)
    {
        bulkInsert.Store(new User
        {
            Id = "users/" + (++id),
            Email = name + "@" + name + ".name",
            Name = name
        });
    }
}
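
As an aside, the new API sends documents to the server in batches, rather than making one request per document. If I remember the implementation post correctly, the batching can be tuned via BulkInsertOptions (treat the exact option names as an assumption and check your client build):

using (var bulkInsert = store.BulkInsert(options: new BulkInsertOptions { BatchSize = 512 }))
{
    // store documents exactly as in the snippet above
}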

And it made a big difference!

[Chart: RavenDB BulkInsert v. Redis benchmark]

Redis is now (only) 2.37x faster than RavenDB

And it seems that this benchmark was one of the reasons Ayende/Oren implemented the feature in the first place.

Now it is very cool that RavenDB bulk inserts are an order of magnitude faster. And being within range of Redis (which does everything really fast) is a bonus. But actually I think this is all a bit misleading.

Lies, Damned Lies, and Statistics

If you are evaluating RavenDB based purely on the speed at which it inserts new documents, then you’re probably doing it wrong. You also need to look at the read performance, query performance, how long indexes take to become “non-stale”, what features it has, etc. My point is, looking purely at the write performance misses out on other things that are actually more important.



Fun with RavenDB document keys

One nice feature of RavenDB is that you can use structured document keys instead of an index. For instance, if you have documents with the following keys:

customer/101
customer/101/order/1
customer/101/order/2
customer/101/order/3
customer/101/invoice/1
customer/101/invoice/2

Getting all the orders for a given customer is as simple as:

session.Advanced.LoadStartingWith<Order>("customer/101/order", 0, 128)

Note: this method is only available in RavenDB build 1.2, which is currently an unstable build. However, if you are working with an earlier build, you can use the following extension method: 

public static IEnumerable<T> LoadStartingWith<T>(this IDocumentSession session,
                       string keyPrefix, int start = 0, int pageSize = 25)
{
    // TrackEntity<T> is only available on the concrete session type,
    // so we need to cast before we can use it
    var inMemorySession = session as InMemoryDocumentSessionOperations;
    if (inMemorySession == null)
    {
        throw new InvalidOperationException(
            "LoadStartingWith(..) only works on InMemoryDocumentSessionOperations");
    }

    // Fetch the raw documents by key prefix, then register each one with
    // the session, so it is tracked like any other loaded entity
    return session.Advanced.DatabaseCommands.StartsWith(keyPrefix, start, pageSize)
                .Select(inMemorySession.TrackEntity<T>)
                .ToList();
}
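
For example, with the extension method above in scope, fetching the first page of orders for customer 101 looks like this (Order being an assumed POCO, not a type from the RavenDB client):

var orders = session.LoadStartingWith<Order>("customer/101/order", pageSize: 128);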

Implementing Auto-Save

Another nice use case is when you need to implement “auto-save” functionality (I can’t take any credit for this idea; I borrowed it from Oren/Ayende).

The basic idea is that you have your documents structured like this:

articles/1
articles/1/auto-save

Whilst a user is editing a document, you can update the “auto-save” document in the background, every 30 secs for example.

Finally when they click “save” the main document “articles/1” can be updated.

Using this method you can easily revert unsaved changes and make sure that the user doesn’t lose their work. If you’re worried about space, you can also make the auto-save document auto-expire by using the expiration bundle.
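
To make the pattern concrete, here is a minimal sketch, assuming a simple Article POCO and an open IDocumentSession (the ids and method names are illustrative, not part of any RavenDB API):

public class Article
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
}

// Called from a background timer, e.g. every 30 secs
public void AutoSave(IDocumentSession session, Article draft)
{
    draft.Id = "articles/1/auto-save";
    session.Store(draft);
    session.SaveChanges();
}

// Called when the user clicks "save"
public void Save(IDocumentSession session)
{
    var draft = session.Load<Article>("articles/1/auto-save");
    var article = session.Load<Article>("articles/1")
                  ?? new Article { Id = "articles/1" };
    article.Title = draft.Title;
    article.Body = draft.Body;
    session.Store(article); // a no-op if the article was already tracked
    session.SaveChanges();
}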

Summary

This method is interesting because it allows you to fetch related data without having to go via an index. This is possible as the documents can be fetched from the document store directly.

You not only gain a performance boost, but you also save the disk space that the index would’ve taken.

However, you can only use this approach with StartsWith; more complex queries such as Contains aren’t possible.



How RavenDB indexes documents

When you create an index in RavenDB, you are able to write a Map statement that controls how your source JSON document is stored in the Lucene index.

So, given the following POCOs:

public class TShirt {
    public String Id { get; set; }
    public String Name { get; set; }
    public int BarcodeNumber { get; set; }
    public List<TShirtType> Types { get; set; }
}

public class TShirtType {
    public String Colour { get; set; }
    public String Size { get; set; }
}

You have 2 main options, which are shown below as a code sample and a diagram illustrating how the index looks.

Option 1

Flatten out the nested items and index every combination of TShirt/TShirtType as a single Lucene document

from shirt in docs.TShirts
from type in shirt.Types
    select new { 
        shirt.Name, 
        type.Colour, 
        type.Size,      
    }

[Diagram: one Lucene document per TShirt/TShirtType combination]

Option 2

Leave the nested TShirtType items inside the parent TShirt

from shirt in docs.TShirts
    select new { 
        shirt.Name, 
        Shirt_Types = shirt.Types.Select(t => t.Colour),
        Shirt_Sizes = shirt.Types.Select(t => t.Size)
    }

[Diagram: a single Lucene document per TShirt, with the nested values indexed as multiple fields]

Note: There are a few things that RavenDB is doing for you here

  • If your Map statement contains an item that is an IEnumerable, then it will “flatten” out the items for you and index each one as a separate field inside the Lucene document (see Option 2)
  • Imagine you run a query that matches several Lucene documents, but only 1 RavenDB document; you will still only get that RavenDB doc back once. This means that if you have a query with “Take(10)”, you will get 10 matching documents as expected. To see the code that handles this, take a look at Ayende’s blog post. A rough sketch of a query against this index is shown below.
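
To make this concrete, here is a rough sketch of a query against the Option 2 index, using the LuceneQuery API from that era (the index name "TShirts/Nested" is an assumption, as is querying the Shirt_Types field directly):

var shirts = session.Advanced.LuceneQuery<TShirt>("TShirts/Nested")
                    .WhereEquals("Shirt_Types", "Blue")
                    .Take(10)
                    .ToList();

Because each shirt’s Types are flattened into multiple Shirt_Types fields on a single Lucene document, this matches any shirt that has at least one blue type.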



Lookout – an AppHarbor build watcher

Update: you can download the app from here.

A few weeks ago AppHarbor announced an API Contest, to promote their API.

So I thought that I’d give it a go and write a desktop app that monitors your builds and gives you some helpful status messages. AppHarbor provide a nice .NET SDK, but it seems that no-one had tried to use it from a desktop app before! To be fair to them, they came up with a solution pretty quickly and it works well, except that it relies on the app being able to bind to port 80 on localhost.

Anyway, after a few weeks of trying to remember my Winforms skills, I came up with Lookout (geddit?), an AppHarbor build watcher. The source is up on GitHub, if anyone wants to have a play.

Features

You are notified when the build breaks

[Screenshot: build broken notification]

and then again when it’s fixed

[Screenshot: build fixed notification]

You can also see an overview of the current build, showing build and deployment status. Plus you can jump to more detailed info if needed and open the website itself.

[Screenshot: build succeeded info]

Finally, you can access the live diagnostic messages from your site itself, using the Super Simple Logging built into AppHarbor.

[Screenshot: live error messages from the application]

Future Plans

Depending on time, I hope to release an update in the next month, with the following features:

  • Allow the user to see info on historical builds
  • Control deployments directly from the app
  • More information on the current build, including unit tests and a more detailed build log
  • Tidy up the code; it’s a bit of a mess (it probably breaks all of the SOLID principles and has no concept of MVC!)



RavenDB Query Intersection

A while ago an interesting scenario came up on the RavenDB group that the current query mechanism couldn’t handle. The full details are in the documentation, so I’m not going to repeat them here, but the issue is due to the way that RavenDB has to index relational data in Lucene.

The solution was to allow users to Intersect queries on the server-side and then only get back the documents that match all the sub-queries. The code to do this is shown below:

session.Query<TShirt>("TShirtNested")
      .OrderBy(x => x.BarcodeNumber)
      .Where(x => x.Name == "Wolf")
      .Intersect()
      .Where(x => x.Types.Any(t => t.Colour == "Blue" && t.Size == "Small"))
      .Intersect()
      .Where(x => x.Types.Any(t => t.Colour == "Gray" && t.Size == "Large"))
      .ToList();


However, implementing this wasn’t straightforward, because RavenDB allows paging via Take()/Skip(). To see why this is an issue, let’s take a simple example, where the user expects 3 docs back, i.e. Take(3).

Internally each sub-query is processed in turn, collecting the matching RavenDB doc IDs each time. So first, Where(x => x.Types.Any(t => t.Colour == "Blue" && t.Size == "Small")) is applied, giving the following matches. Note: each row represents 1 RavenDB document, but several Lucene documents; therefore, to be a match the entire row must satisfy all the sub-queries.

[Diagram: Lucene documents matching Blue & Small]

So far, so good, we’ve got 3 RavenDB docs that match: “tshirt/1”, “tshirt/2” & “tshirt/3”. The problem comes when we apply the 2nd sub-query as well, Where(x => x.Types.Any(t => t.Colour == "Gray" && t.Size == "Large")):

[Diagram: matches after also applying Gray & Large, with tshirt/4 crossed out]

We’ve now had to discard “tshirt/4” as a match, because it doesn’t satisfy both sub-queries, so we only have 2 matches in total. The solution is to perform the entire search again, but this time ask Lucene for twice as many matches. This gives us more candidates that match each sub-query, so by the time they’re all applied we can give the user the 3 matching documents they asked for.

In reality, the 1st query starts by getting twice as many docs as you ask for anyway, so Take(4) causes it to get 8 docs, in the hope that this will leave 4 by the end. But every time it ends up with fewer than it needs, it doubles the amount and repeats the process.

The reason for going to all this effort is that it’s more efficient to get the results from Lucene in small pages; asking for thousands of results when we only need 100 is expensive. In reality most queries are limited to 128 matches, due to RavenDB being safe-by-default. So this mechanism is a cheap way of getting the number of docs we need, whilst still allowing for the scenario where, after all the sub-queries have been applied, we don’t have enough results and have to start again.
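
Here is a simplified sketch of that doubling strategy. This is not RavenDB’s actual implementation; runSubQueries stands in for running every sub-query against Lucene and returning the RavenDB doc ids that satisfy all of them:

// take = the page size the user asked for, totalDocs = docs in the index
static List<string> IntersectWithPaging(
    Func<int, List<string>> runSubQueries, int take, int totalDocs)
{
    var fetchSize = take * 2; // start by asking Lucene for twice as many docs
    while (true)
    {
        var matches = runSubQueries(fetchSize);

        // enough survivors, or we have already scanned the whole index
        if (matches.Count >= take || fetchSize >= totalDocs)
            return matches.Take(take).ToList();

        fetchSize *= 2; // too few matches survived the intersection, retry
    }
}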



json2csharp

I can’t believe that I’ve never heard of this before: check out json2csharp; it’s based on the json class generator.

Give it some JSON like this:

{
    "Glossary": {
        "Title": "example glossary",
        "GlossDiv": {
            "Title": "S",
            "Id": 1235434,
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

And it generates this C# code; it handles nested lists, references, strings v. ints, etc.

public class GlossDef
{
    public string para { get; set; }
    public List<string> GlossSeeAlso { get; set; }
}

public class GlossEntry
{
    public string ID { get; set; }
    public string SortAs { get; set; }
    public string GlossTerm { get; set; }
    public string Acronym { get; set; }
    public string Abbrev { get; set; }
    public GlossDef GlossDef { get; set; }
    public string GlossSee { get; set; }
}

public class GlossList
{
    public GlossEntry GlossEntry { get; set; }
}

public class GlossDiv
{
    public string Title { get; set; }
    public int Id { get; set; }
    public GlossList GlossList { get; set; }
}

public class Glossary
{
    public string Title { get; set; }
    public GlossDiv GlossDiv { get; set; }
}

public class RootObject
{
    public Glossary Glossary { get; set; }
}