How to write a LINQ to SQL provider in C# and .NET 4.0+ (part 1 of 4)

Writing your very own LINQ-to-SQL engine

The code accompanying this article can be found here: C# to SQL provider sample download.

For other parts in the series, please refer to the index below:

Part 1: Introduction (this)
Part 2: Expression Visitor
Part 3: Where Clause Visitor
Part 4: Compiling Expression Trees

Background

Two years after having finished the project, I finally decided to publish a series of posts on creating a C# to SQL query provider from scratch. Many years ago, I conceptualized some sort of a translator from .NET code into SQL to make data querying a more controlled, structured experience. That was back in .NET 2.0 days. Then along came LINQ to SQL, which wasn't very optimized in its first iterations and tended to produce unwieldy and overblown SQL queries. The performance was also unremarkable because it wasn't compiled, but interpreted. That is, whenever a LINQ statement was executed, the expression visitor kicked in every time, generating command text. And believe me, that is a very slow process, as in many cases SQL Server would execute the end result faster than it would take to generate the query! It got improved over time, with an option to create compiled statements being added in later framework versions.

Expressions namespace

When LINQ-to-SQL was introduced into the framework, along with it, arguably, one of the best features since .NET 2.0 was added – the expressions namespace. I've used it extensively for many years and numerous projects, and have the boldness to consider myself an expert in it. It allows to express programming logic in a series of expressions, forming a generic template that can be applied to a combination of inputs. There are two main areas where .NET expressions are very useful. One involves dynamic compilation scenarios where IL emit is typically used. The other, which made LINQ-to-SQL possible, is expression-typed lambda statements. That is, when the assignment target of a lambda is defined as an expression then the compiler will create a new instance of the latter. In other words, lambdas are syntactic sugar not only for delegates, but also expression objects. This is a brilliant design idea on behalf of the team behind .NET. This makes it possible not only for them, but for developers to write all kinds of extensions that integrate naturally and seamlessly into the language syntax, abstracting the magic away from the consumer.

The power of .NET expression trees

I used expressions to write high-performance components, such as serializers, formatters. Typically - any .NET reflection-based logic that needs all the benefits of reflection without the performance overhead imposed by it. An expression tree can be compiled to a method, making its execution just as fast as statically compiled code.

However, as mentioned previously, there's another useful feature of .NET expressions – seamless integration into the core language through expression-typed lambda methods. That's where the difference between LINQ-to-objects and LINQ-to-SQL lies. The extension methods of the former all take regular lambda delegate types as arguments. But! The data LINQ methods all take expression-typed lambda delegate arguments. Check for yourself!

The whys

So why re-invent the wheel? There are multiple reasons why I chose to write my own integrated query parser as opposed to using LINQ-to-SQL:

A project can have its own special ORM requirements, and cannot use LINQ-to-SQL, Entities or NHibernate, all of which have their own query provider implementations. LINQ-to-SQL, for example, was too inflexible for my purposes. What if I need to use it with DataReader, SqlConnection? What if I want to get the actual DbCommand object from the statement?
It's inefficient. I've found that the original versions produced clunky, unoptimized, unjustifiably nested, unreadable SQL text. Whatever optimizations that existed were done as a separate step in the final stages of parsing, further degrading the performance. I could see that it wasn't necessary, as the approach taken by the original parser wasn't the best. It wasn't necessary to create a new level of nesting for each subsequent clause or join, or any other operation. It wasn't that much work to write the parser in a way that it would deconstruct the whole query into logical parts (the select list, the source tables or resultsets, the where clause and the order clause). Writing the parser with that mindset produced a much cleaner result in the end, while improving the parsing performance.
The out of the box LINQ-to-SQL is not compiled. My version is compile-only. Meaning, that parsing is only done once, the command text is cached, the execution and result processing logic is compiled and also cached. On subsequent calls the language-integrated query is just as fast as straight SQL, if not faster due to various micro-optimizations.
Once written, I can do whatever I want with the provider and extend it as I like.
It's an extremely valuable experience.

Getting started with your own query provider

I will not sugar-coat this. Writing a functional, practical and production-usable query provider is not an exercise for the faint of heart. For me, most of the work was stretched over 3 months of writing code whenever I could find the time outside of my main job. Fortunately, it wasn't the busiest time of my life, but the degree of dedication required was fairly high. So, overall, I'd probably allow at least 100 hours for this exercise and scale it up or down, based on complexity and level of experience. The actual programming wasn't the most time-consuming part of it. Probably, only about 20%. The rest was thinking, white-board time and brainstorming.

The components of an integrated language query provider

So, just like in any engine, there are several logical components that work together to produce the output:

A class with extension methods that can be chained together to describe a query, piecing together an expression tree from the compiler-generated expressions originating from the method arguments.
An expression visitor to translate the expression tree into a SQL statement, identify constants and query text parameters.
Compiler that generates the IL code responsible for creating the SQL command, adding parameters to it, assigning context-specific values to those parameters, executing the command, creating a data reader, iterating through the result-set, instantiating data objects, populating them with values (performing any necessary conversions) and appending to the output collection.

That, in essence, is it! For educational purposes, let's make things simpler and only focus on one method in this series – “where”. The full implementation, which is included with the sample application, includes more methods, which, in my experience, are enough for most queries. Any complex scenarios that extend beyond the supported capabilities can usually be covered by views and other means. Let's get started!

Language integrated query extension methods

To be in line with LINQ-to-objects, the SQL implementation also relies on extension methods to build up the expression tree, although, there appears to be no reason other than compatibility for not using instance methods. We'll follow the same pattern and define extension methods representing query operations. Or, rather, one operation – the where clause filtering. Joins, ordering and other constructs are beyond of the scope of this article, but are available in the full implementation.

The complete code of the class containing the extension methods can be seen below. The explanations will follow.

public static partial class SqlQueryProvider
{
    private const int mMaxTablesInResultSet = 3;
    private static MethodInfo[] mWhereMethodInfo = new MethodInfo[mMaxTablesInResultSet];

    public static DataModelQueryCommand<List<T>> GetCommand<T>(this IDataModelQuery<T> query)
        where T : new()
    {
        ResultSetCommand<T> objCommand =
            ResultSetCommand<T>.Create(GetQuery<T>(query));

        return objCommand;
    } //end method

    public static DataModelQueryCommand<List<DbQueryResultItem<T, T2>>> GetCommand<T, T2>(this IDataModelQuery<T, T2> query)
        where T : new()
        where T2 : new()
    {
        ResultSetCommand<DbQueryResultItem<T, T2>> objCommand =
            ResultSetCommand<DbQueryResultItem<T, T2>>.Create(GetQuery<T>(query));

        return objCommand;
    } //end method

    public static DataModelQueryCommand<List<DbQueryResultItem<T, T2, T3>>> GetCommand<T, T2, T3>(this IDataModelQuery<T, T2, T3> query)
        where T : new()
        where T2 : new()
        where T3 : new()
    {
        ResultSetCommand<DbQueryResultItem<T, T2, T3>> objCommand =
            ResultSetCommand<DbQueryResultItem<T, T2, T3>>.Create(GetQuery<T>(query));

        return objCommand;
    } //end method

    public static IDataModelQuery<T> Where<T>(this IDataModelQuery<T> query, Expression<Func<T, bool>> predicate)
    {
        if (mWhereMethodInfo[0] == null)
            mWhereMethodInfo[0] = (MethodInfo)MethodBase.GetCurrentMethod();

        MethodInfo objMethod = mWhereMethodInfo[0].MakeGenericMethod(new Type[] { typeof(T) });

        return GetQueryForMethodCall<T>(query, objMethod, predicate);
    }//end method

    public static IDataModelQuery<T, T2> Where<T, T2>(this IDataModelQuery<T, T2> query, Expression<Func<DbQueryResultItem<T, T2>, bool>> predicate)
    {
        if (mWhereMethodInfo[1] == null)
            mWhereMethodInfo[1] = (MethodInfo)MethodBase.GetCurrentMethod();

        MethodInfo objMethod = mWhereMethodInfo[1].MakeGenericMethod(new Type[] { typeof(T), typeof(T2) });

        return (IDataModelQuery<T, T2>)GetQueryForMethodCall<T>(query, objMethod, predicate);
    }//end method

    public static IDataModelQuery<T, T2, T3> Where<T, T2, T3>(this IDataModelQuery<T, T2, T3> query, Expression<Func<DbQueryResultItem<T, T2, T3>, bool>> predicate)
    {
        if (mWhereMethodInfo[2] == null)
            mWhereMethodInfo[2] = (MethodInfo)MethodBase.GetCurrentMethod();

        MethodInfo objMethod = mWhereMethodInfo[2].MakeGenericMethod(new Type[] { typeof(T), typeof(T2), typeof(T3) });

        return (IDataModelQuery<T, T2, T3>)GetQueryForMethodCall<T>(query, objMethod, predicate);
    }//end method

    private static DataModelQueryExt<T> GetQueryForMethodCall<T>(IDataModelQuery query, MethodInfo method, Expression argumentValue)
    {
        DataModelQueryExt<T> objQuery = GetQuery<T>(query);

        if (argumentValue == null)
            objQuery.mExpressionTree = Expression.Call(method, objQuery.mExpressionTree);
        else
            objQuery.mExpressionTree = Expression.Call(method, objQuery.mExpressionTree, argumentValue);

        return objQuery;
    }//end method

    private static DataModelQueryExt<T> GetQuery<T>(IDataModelQuery query)
    {
        //Must be immutable

        if (query is DataModelQueryExt<T>)
            return ((DataModelQueryExt<T>)query).Copy();

        if (query is IDataModelQueryExt)
            return new DataModelQueryExt<T>(((IDataModelQueryExt)query).ExpressionTree, query.QueryName);

        return new DataModelQueryExt<T>(Expression.Constant(query), query.QueryName);
    }//end method
}

Essentially, there are two generic public methods with three overloads, for result-sets containing one, two or three tables. Although, as the join clause isn't covered here, only one overload will be used. There are also a couple of private methods that either create the query container object or wrap it in each subsequent call expression. All extension methods are called on an object of type IDataModelQuery, which is the container for the expression tree. When the “where” method is called on the object, we need to determine if the object already contains an expression tree, which will be wrapped in an Expression.Call expression as the instance on which the method is called, with the predicate as the first argument. If there is no expression tree yet, it needs to be initiated by encapsulating the query object inside an Expression.Constant. Why? Because the object representing the query is a generic type with the generic argument being the type of a result-set item. Later, in the expression visitor, we'll need to get the type of the data item in order to access information about its fields/columns, as well as which database table it represents.

The complete source code for C# to SQL integrated query provider can be downloaded from here.

In the next section, we'll examine how the provider parses the expression tree chained together from the extension methods. Click the link below to continue to the next section:

C# to SQL provider - Parsing the expression tree in expression visitor.