How to write a LINQ to SQL provider in C# and .NET 4.0+ (part 2 of 4) - expression visitor

Custom LINQ-to-SQL provider: part 2, the expression visitor

This is the second post in the series of tutorials on how to make a C# to SQL integrated query provider. If you missed the introduction, it can be found in the index below:

Part 1: Introduction
Part 2: Expression Visitor (this)
Part 3: Where Clause Visitor
Part 4: Compiling Expression Trees

The code accompanying this article can be found here:

C# to SQL provider sample download.

Parsing the expression tree – expression visitor

Now the heart and soul of the language integrated query provider. This section covers the process of deconstructing the expression tree, transforming it and translating it into an actual query text. This is all handled by a class called expression visitor. Its design is based on the visitor pattern. In the previous section the formation of the expression tree was explained. The resulting structure is then passed into the visitor, which needs to make sense out of it.

It's worth noting that the code that accompanies this article was written from ground up and doesn't contain any derivatives from other similar works, including the Microsoft LINQ.

Deconstructing the expression visitor - the select and from clauses

Throughout this series of articles, we'll be parsing the following seemingly simple statement:

(new DataModelQuery<FileModel>("TestQuery")) 
.Where(file => file.DirectoryId == test.Item1).GetCommand();

It can be found in the sample project attached to this article, along with the rest of the source code used to parse it.

So, what happens when the GetCommand method is executed? This is an extension method on IDataModelQuery object that contains the expression tree for the Where method. It calls into the Create method of DataModelQueryCommandExt private class that belongs to the provider. If this expression was already parsed before, the command should already be compiled and cached, so it can be returned right away. Otherwise, a new DbExpressionVisitor object is created.

The entry point into the visitor is coincidentally the non-specific “visit” method, which contains a big switch statement branching the execution to the appropriate function based on the type of expression passed to it. So, for example, if the type of expression is “call”, then it is cast into the CallExpression object, which would contain information about the actual real query method called from the consuming code. In this case, it's the “where” method, as this article doesn't cover any other for the sake of simplicity. The strongly typed expression is then passed to the appropriate function that determines the name of the method and its arguments. All the arguments, including the instance to which the method is applied need to be visited as well. The instance (first argument) is visited because in SQL query terms it's the source from which the select clause is generated. It can be a straight table select or a subquery. The second argument needs to be parsed because it's the lambda predicate, from which we need to generate the where clause text. What happens from here, is summarized in the diagram below:

First a StringBuilder instance is created to write the SQL query text. Then the select clause is written to it by analysing the leftmost argument - the DataModelQuery<FileModel> instance. FileModel type is the source table that appears in the from clause. Its properties marked with the DataColumn attribute are the columns that appear in the select clause. The names are derived from the attribute, where specified, or, otherwise - the member name itself.

The predicate, which is the next parameter (at index 1), needs to be visited as well to prepare it for the WhereClauseExpressionVisitor. A number of transformations happen here. First, the file argument introduces the table into the scope (note that a new BaseExpressionVisitor is created for each scope). The parameter is added to a dictionary that belongs to the new visitor.

Next the body of the lambda is parsed (file.DirectoryId == test.Item1), which is essentially a BinaryExpression with parts separated by the == operator. Both sides of it are visited. Each is a MemberExpression, but there are differences. The left one refers to the select list, while that on the right references something outside of the SQL query - a local variable in the C# code. Therefore the two are handled differently:

The parameter expression "file" in file.DirectoryId is replaced with another, custom expression that describes which table the field DirectoryId belongs to.
The C# variable "test" reference is replaced with RuntimeValueExpression object that describes how to retrieve its value at runtime. See, at runtime the original "test" ConstantExpression object will actually contain the instance exposed through its Value property. But, to use it in the query, we'll need to extract it somehow. For that, there needs to be a way of identifying the complete reference path to the "test" expression, given the whole expression tree at runtime. You may have noticed that we're passing ParentObjectInfo arguments into every method. Each references the previous one, the parent expression and its member from which the expression passed to the method was derived. Thus, at any given point in the call stack, it's possible to walk all the way up to the root expression. This means that we can use expressions to describe how to do that. And the reverse is also true. That is - getting to a certain node and retrieving its value. So, once a constant expression that represents a variable is stumbled upon, it is replaced with a whole new branch that, when compiled, sequentially calls all the members of the tree, until it gets the value. Without this, we wouldn't be able to extract parameters of our query. This remapping technique is discussed in more detail further down below.

There is nothing much else that the base visitor needs to do with the lambda, and so it is returned to the caller (the visit method for a CallExpression of the original visitor) in the form:

file => ScopedTablesExpression.DirectoryId == RunTimeValueExpression.Item1

The transformed predicate expression is then fed into the WhereClauseVisitor (described in the next section) to produce the final piece of the puzzle - the where clause, along with the query parameters. All of that is encapsulated within a custom SqlExpression and returned to the command-building code that uses the select clause mappings to generate DataReader-to-model bindings, and parameter information to compile a set of instructions for supplying query values at runtime.

In general, this is how things work. For more information on the WhereClauseExpressionVisitor and generation of compiled parameter and data reader bindings to populate the SqlCommand object and extract values from the resultset, please see the following sections:

Part 3: Where Clause Visitor
Part 4: Compiling Expression Trees

Abstraction of an abstraction - describing value access in an expression tree using expressions

As discussed earlier, it may be of interest to explain in more detail how remapping of variables referenced in the tree happens when encountered during parsing by the base visitor. So, in essence, certain branches are replaced by expression chains describing how to traverse from the root to the leaf (e.g. a ConstantExpression) to get its value.

Each visit method receives an object of type ParentObjectInfo as the last argument. This object, like a stack trace, describes all the expressions unwrapped to get to the one being currently visited. Every time that another visit call is made, the argument is constructed by encapsulating the previous such object, as well as the property name of the parent expression from which the child was derived. Why is this needed? This builds a trace of expressions in the hierarchy beginning at the root expression and leading up to the expression being evaluated in a given visit method. The RemapToCommonParameter method relies on this information to generate an expression chain that describes how to extract a runtime value out of the compiler-generated expression tree. This is a fairly abstract concept, an abstraction of an abstraction. When the compiler generates an expression based on the lambda syntax, it would usually contain a number of sub-expressions, each with their own children. So, a lambda expression (e.g. item => item == localVar.value) would have an argument expression and a body expression. The latter, in its turn, could be broken down into a binary expression (e.g. item == localVar.value), which is a composite of a parameter expression on the left and a member access expression on the right. Now, the expression tree structure will always remain the same on each run through the code. What may change is the values, object instances, etc. If we reconcile that with the fact that the provider needs to generate and compile a method to extract values out of the expression tree and populate them into the database command parameters, then this is what we get - the necessity to construct and compile an expression tree to evaluate an expression tree at runtime. So, in other words:

We have an expression tree put together from the language query. It is descriptory only. I.e. it represents the query code logic (methods and lambdas) using expressions. The expression tree is not intended to be compiled, but rather interpreted, undergoing transformations in the process.
We need to write logic that compiles a custom function that would extract variable values from that expression tree at runtime.
One way to express dynamic logic is through expressions. Hence the visitor generates some expressions to read from the original expression tree.
As the original tree is being parsed once to output the compiled functions, the visitor replaces the branches that need to be evaluated at runtime, i.e. those that describe variable access with expression chains that instruct how to read values from those variables. The substitute branches are then bound to the expressions that create and populate the database command parameters and compiled into a single function accepting the original expression as the argument from which the values are read. Confusing? No doubt.