Chapter 15. LINQ to SQL and ADO.NET Entity Framework

Bob Beauchemin

In ADO.NET 3.5 and 3.5 SP1, Microsoft introduced two models designed to abstract SQL statements into a high-level language and to operate on database data as objects. The first, LINQ to SQL, is a lightweight mapping of LINQ (Language Integrated Query) calls to the SQL Server database. The other model, the ADO.NET Entity Framework (EF), consists of an object-relational mapping (ORM) framework as well as query and view services built over the ADO.NET provider model. The Entity Framework has its own dialect of SQL (Entity SQL, or ESQL) and can use ESQL statements or LINQ queries to access data. Although neither framework uses vanilla T-SQL as its query language, both can generate SQL statements or use existing stored procedures to access the database. This chapter is not an introduction to these frameworks; I assume that you already know their basics. Instead, I discuss how they interact with SQL Server, especially with respect to performance.

One way to look at the performance of an abstraction layer is to examine and profile the ADO.NET code, but both EF and LINQ to SQL are T-SQL code generators (EF is not database-specific, but I'm talking only about SQL Server here); therefore, another way to address performance is to examine the generated T-SQL code. That is the perspective this chapter takes.

Many programmers who specialize in query tuning salivate over the prospect of tuning the bad queries that will assuredly result from these two data access stacks. In addition, many DBAs would like to ban LINQ or EF use when coding against their companies' databases. Yet most people who profess a dislike for the generated code have never seen (or have seen very little of) that code. For someone who writes and tunes T-SQL, code-generating programs and frameworks can be worrisome if the generated code compromises database performance. Both Entity Framework and LINQ to SQL have API calls that can expose their generated T-SQL; you can also use SQL Profiler to look at it. This chapter outlines some of the performance and manageability concerns that arise through the use of these models, beginning with the dynamic generation of SQL inside applications.

LINQ to SQL and performance


Concern

LINQ to SQL and EF will cause dynamic SQL to proliferate in applications, and will almost surely produce suboptimal dynamic SQL, causing database performance problems and plan cache pollution.


It's almost dogma among database programmers that static SQL in stored procedures is better for security than dynamic SQL constructed using string concatenation. Besides the obvious association between dynamic SQL and SQL injection, using dynamic SQL means that all users must be given access to the underlying tables, unless you use LINQ to SQL/EF strictly with views and stored procedures. With stored procedures, the DBA doesn't need to give each user access to the underlying tables, only EXECUTE permission on the procedures. With views, the DBA grants permission on the view, not the underlying tables.

Plan cache pollution refers to the fact that using many different variations of the same SQL statement produces multiple plans in the cache for what is essentially the same query. For example,

SELECT au_fname, au_lname FROM dbo.authors WHERE au_lname = 'Smith'

would produce a different query plan from this query:

SELECT au_fname, au_lname FROM dbo.authors WHERE au_lname = 'Jones'

In simple cases like this, the SQL Server query optimizer can perform what’s known as auto-parameterization, in which case either of the queries above becomes

(@1 varchar(8000))SELECT [au_fname],[au_lname] FROM [dbo].[authors] WHERE
[au_lname]=@1

LINQ to SQL and EF make every attempt to generate parameterized SQL, rather than SQL with embedded literals. Microsoft claims that LINQ to SQL minimizes, if not eradicates, the potential for SQL injection.

In the context of plan cache pollution, given their code-generation nature, LINQ to SQL and EF are likely to generate more homogeneous SQL than programmers who write parameterized queries themselves. And the programmers most likely to use only LINQ to SQL/EF in future projects are probably the ones causing plan cache pollution with hand-built dynamic SQL right now. For an extensive discussion of how the SQL Server plan cache works, I'd recommend Sangeetha Shekar's blog series on the plan cache in the SQL Programmability and API Development Team Blog.[1] In these articles Sangeetha (who works on the plan cache team) states that there's no cacheability difference between dynamic parameterized SQL and a stored procedure. Non-parameterized SQL suffers a slight cacheability penalty unless it's reused, which is highly unlikely.

1 Sangeetha Shekar’s blog series “Plan Cache Concepts Explained,” on the SQL Programmability & API Development Team Blog, begins with this entry: http://blogs.msdn.com/sqlprogrammability/archive/2007/01/08/plan-cache-concepts-explained.aspx.
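
If you want to verify plan reuse for yourself, the plan cache DMVs make it easy to watch. Here's a minimal sketch in C# (assuming VIEW SERVER STATE permission, your own connection string in connString, and the System and System.Data.SqlClient namespaces):

// List cached plans that touch the authors table, with their reuse counts
using (SqlConnection conn = new SqlConnection(connString))
{
    conn.Open();
    SqlCommand cmd = new SqlCommand(
        @"SELECT cp.usecounts, cp.objtype, st.text
          FROM sys.dm_exec_cached_plans AS cp
          CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
          WHERE st.text LIKE N'%authors%'", conn);
    using (SqlDataReader rdr = cmd.ExecuteReader())
        while (rdr.Read())
            Console.WriteLine("{0}\t{1}\t{2}", rdr[0], rdr[1], rdr[2]);
}

Run a few LINQ to SQL queries that differ only in their literal values, and the usecounts column and the number of cache entries will show you whether parameterization is doing its job.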

So far, it's been my experience that LINQ to SQL, being more table-centric in its mapping, will in general generate code that's closer to the code a T-SQL programmer would write. EF is more object-centric and sometimes generates SQL that's meant to construct object graphs, and therefore more expensive plans result. But, as an example of the fact that code generation can carry us only so far, neither framework can generate a full outer join when using LINQ without extension methods. The Entity Framework can use full outer joins when using the Entity SQL language directly.

One current plan cache pollution issue occurs when string literals are used in queries. The LINQ to SQL query

var query = from a in ctx.authors where a.city == "Oakland" select a;

generates the following parameterized query:

(@p0 varchar(7))SELECT [t0].[au_id], [t0].[au_lname], [t0].[au_fname],
[t0].[phone], [t0].[address], [t0].[city], [t0].[state], [t0].[zip],
[t0].[contract] FROM [dbo].[authors] AS [t0] WHERE [t0].[city] = @p0

The same query in the Entity Framework generates a T-SQL query with a string literal; to get a parameterized query, you'll need to use a variable in the LINQ to Entities query:

string city = "Oakland";
var query = from a in ctx.Authors where a.city == city select a;

Note that the parameter length is seven characters exactly (the size of the string). If I replaced "Oakland" with "Portland", the result would be a parameterized query with a different string length (varchar(8)) for @p0. This pollutes the plan cache with one query plan for each different string size, when it's only necessary to use the field size of the city column (20 in this case). SQL Server's built-in auto-parameterization always uses a string length of 8000 characters (or 4000 Unicode characters), and using one string size in the query parameter is preferable to one query plan per string size. Both LINQ to SQL and the EF have addressed the parameter length issue in the upcoming .NET 4.0 release by choosing a default length when none is specified, but in the meantime, using these frameworks means making query plan reuse compromises.
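
If a particular query's plan cache pollution hurts before then, one workaround is to drop down to raw ADO.NET for that query and declare the parameter at the column's size yourself. A minimal sketch, assuming the pubs authors table (city is varchar(20)), an open SqlConnection named conn, and the System.Data and System.Data.SqlClient namespaces:

// Declaring the parameter at the column size means every search string,
// whether "Oakland" or "Portland", reuses the same cached plan
SqlCommand cmd = new SqlCommand(
    "SELECT au_fname, au_lname FROM dbo.authors WHERE city = @city", conn);
cmd.Parameters.Add("@city", SqlDbType.VarChar, 20).Value = "Oakland";
SqlDataReader rdr = cmd.ExecuteReader();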

Generating SQL that uses projection


Concern

Using LINQ to SQL and EF will encourage SELECT * FROM...–style coding because you get back a nice, clean object instead of the less-useful anonymous type you receive by doing projection. This will also make covering indexes useless.


LINQ to SQL and EF can return something other than a whole object instance. Here’s an example:

// This returns a collection of author instances
var query = from a in ctx.authors
select a;

// this returns a collection of anonymous type instances
var query = from a in ctx.authors
join ta in ctx.titleauthors on a.au_id equals ta.au_id
join t in ctx.titles on ta.title_id equals t.title_id
select new { a.au_id, t.title_id };

The collection of authors returned by the first query is updatable, and a reference to it can be used outside the function in which it’s created. The anonymous type is not updatable and cannot be used outside the function in which it’s created without using reflection or CastByExample<T>. I can see a use for anonymous types in data binding: the good old dropdown list.

You don't necessarily need to return an anonymous type. You can define (by hand) a class that represents the projection of au_id and title_id, or query a view that returns an instance of a named type. But in order to do this on a large-scale project, you'd need to define a class for each projection in the entire project. As a database programmer, ask yourself, "Can I list every different projection (rowset) that every query in my project returns?" Very few programmers can answer "yes" to that question, even though being able to enumerate every projection your project returns may be a great tuning aid. Therefore, writing a class for every anonymous type is a good idea, but it's a lot of extra work; perhaps a tool will automate this in the future. If you don't write your own objects for each projection, you're using whole objects. That is SELECT * FROM in SQL.
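
For illustration, here's what writing such a class by hand might look like for the author/title projection shown earlier (the AuthorTitle class name is mine):

// A hand-written class representing the au_id/title_id projection
public class AuthorTitle
{
    public string au_id { get; set; }
    public string title_id { get; set; }
}

// The query now returns IQueryable<AuthorTitle> instead of an anonymous
// type, so it can be passed outside the function that creates it
var query = from a in ctx.authors
            join ta in ctx.titleauthors on a.au_id equals ta.au_id
            join t in ctx.titles on ta.title_id equals t.title_id
            select new AuthorTitle { au_id = a.au_id, title_id = t.title_id };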

This is similar to the issue you'd run into using stored procedures that return rowsets with LINQ to SQL; the designer generates a named type based on the shape of the first result set that a stored procedure returns, and generates nothing for the additional result sets of a multiple result set procedure. It's a good practice to handcode named types for the multiple result set procedure yourself. Using rowset-returning procedures with EF forces you to define a class to contain the rowset produced; aside from the extra work involved because the EF designer doesn't do this automatically, that's a good idea. But EF can't use procedures that return more than one rowset (SqlDataReader.NextResult in ADO.NET).
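
For the LINQ to SQL side, here's a sketch of what handcoding the multiple result set mapping looks like, using IMultipleResults in a partial class that extends your DataContext. The procedure name and the AuthorInfo/TitleInfo result classes are hypothetical; the namespaces needed are System.Data.Linq, System.Data.Linq.Mapping, System.Reflection, System.Linq, and System.Collections.Generic:

// Maps a method on the DataContext to a two-rowset stored procedure
[Function(Name = "dbo.GetAuthorAndTitles")]
[ResultType(typeof(AuthorInfo))]
[ResultType(typeof(TitleInfo))]
public IMultipleResults GetAuthorAndTitles(
    [Parameter(DbType = "varchar(11)")] string au_id)
{
    IExecuteResult result = this.ExecuteMethodCall(
        this, (MethodInfo)MethodInfo.GetCurrentMethod(), au_id);
    return (IMultipleResults)result.ReturnValue;
}

// Consuming code reads the rowsets in the order the procedure returns them
using (IMultipleResults results = ctx.GetAuthorAndTitles("172-32-1176"))
{
    List<AuthorInfo> authors = results.GetResult<AuthorInfo>().ToList();
    List<TitleInfo> titles = results.GetResult<TitleInfo>().ToList();
}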

How does this style relate to covering indexes? An overly simplistic definition of a covering index would be "a nonclustered index defined over the set of columns used from one table in a projection." These indexes make for nicely optimized query plans, and sometimes even help with concurrency issues in SELECTs. But if we're always doing SELECT * FROM, forget those covering indexes; the index used most commonly will be the clustered index on the base table.

You shouldn’t define a covering index for every projection just because you can. Every index consumes space and affects the performance of inserts, updates, and deletes; therefore, there’s a tradeoff. In fact, I’ve also heard it said that if you need many, many covering indexes, perhaps your database isn’t as well normalized as it could be, but I’m not really sure I buy this argument.

I'd say that the ease with which every projection can become a SELECT * FROM query when using objects is, for the most part, a valid worry. If you're concerned about database performance, you need to do coding beyond what the APIs provide.

Updating in the middle tier


Concern

Using LINQ to SQL and EF will encourage selecting rows into the middle tier and then updating or deleting them there, rather than issuing set-based SQL UPDATE or DELETE statements.


Neither LINQ to SQL nor Entity Framework currently contains analogs to SQL’s INSERT, UPDATE, or DELETE statements. Entity SQL could be expanded to include DML in the future, but LINQ to SQL doesn’t have a language that extends SQL. Both APIs can cause insert, update, and delete operations on the database. You create or manipulate object instances, then call SaveChanges (EF) or SubmitChanges (LINQ to SQL).
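
For reference, here's the pattern sketched in LINQ to SQL against the pubs mapping used earlier; one round trip fetches the row, and SubmitChanges issues a second one for the UPDATE:

// Fetch the object, change it in memory, then flush the change;
// SubmitChanges generates a parameterized UPDATE for the modified columns
var au = ctx.authors.Single(a => a.au_id == "172-32-1176");
au.phone = "408 496-7223";
ctx.SubmitChanges();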

The manipulate-objects-and-save pattern works well in LINQ to SQL and reasonably well in EF. The distinction is that in EF, if there are related entities (for example, an integrity constraint mandates that a title's publisher must exist in the publishers table), you must fetch the related entities first, causing an extra round trip to the database. One way to avoid the extra round trip is to synthesize a reference using EntityKey. I described this in a set of blog posts about deleting a single row without fetching related entities.[2]

2 See “Entity Framework Beta3—Deleting without fetching” on my blog: http://www.sqlskills.com/BLOGS/BOBB/post/Entity-Framework-Beta3-Deleting-without-fetching.aspx.

What about performing an update? The SaveChanges (EF) and SubmitChanges (LINQ to SQL) methods can perform multiple insert, update, and delete operations in a single round trip. But consider the number of database round trips involved in changing a single customer row: one round trip to select the row and another to update it. And what about a searched update in SQL (UPDATE...WHERE) that updates multiple rows, or an update based on a SQL join condition, or one using MERGE in SQL Server 2008? My favorite example, an update over a recursive common table expression, gathers all employees reporting to a certain manager and gives them all a raise. With the object APIs, all the affected rows must be fetched before they can be updated; even if batching the fetch statements keeps the number of round trips down, the sheer number of required fetches (database-generated network traffic) increases.

Let’s address the general get-then-update pattern first. I worried about this one until I realized that in most applications I’ve worked on, you don’t usually do a blind update or delete. A customer web application fetches a row (and related rows); a pair of eyes inspects the row to ensure this is indeed the right row, and then presses the user-interface button that causes an update or delete. Therefore, get-then-update is an integral part of most applications anyway. If the update or delete of a row affects related rows, this can be accomplished with cascading update or delete in the database.

But how about multiple, searched updates without fetching and inspecting all the rows involved? Neither LINQ to SQL nor EF has a straightforward way to deal with this. Alex James wrote an excellent four-part blog series about rolling your own searched update in EF over SQL Server, using .NET extension methods to obtain the generated SQL query text and string manipulation to turn it into an UPDATE statement,[3] but this method is neither compact nor straightforward. It also looks SQL Server–dependent; Microsoft would need to replicate it for each provider to make it part of the Entity Framework in a future release.

3 See “Rolling Your Own SQL Update On Top Of the Entity Framework,” by Alex James: http://blogs.msdn.com/alexj/archive/2007/12/07/rolling-your-own-sql-update-on-top-of-the-entity-framework-part-1.aspx.

LINQ to SQL contains the ultimate fallback method for this case. The DataContext.ExecuteCommand method lets you execute any SQL command, with parameters if needed. An example would look like this:

// Initialize the DataContext using constructor of your choice
DataContext db = new DataContext(fileOrServerOrConnection);
// Use SQL statement directly
db.ExecuteCommand("UPDATE Products SET UnitPrice = UnitPrice + 1.00");
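
ExecuteCommand also parameterizes for you: curly-brace placeholders are sent as SQL parameters rather than being concatenated into the statement. A variation on the example above (the increment and category values are arbitrary):

// {0} and {1} become @p0 and @p1 in the submitted batch,
// so the plan can be reused across different values
db.ExecuteCommand(
    "UPDATE Products SET UnitPrice = UnitPrice + {0} WHERE CategoryID = {1}",
    1.00m, 5);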

EF doesn't have an exact equivalent, because you query an object model over a conceptual schema rather than the data source itself, but EF does expose the underlying DbConnection instance; therefore, you can issue your own commands against the database tables.
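
Here's a minimal sketch of that technique, assuming an ObjectContext named ctx over a SQL Server store and the System.Data, System.Data.Common, and System.Data.EntityClient namespaces:

// ctx.Connection is an EntityConnection; StoreConnection is the
// actual connection to SQL Server underneath
EntityConnection entityConn = (EntityConnection)ctx.Connection;
DbConnection storeConn = entityConn.StoreConnection;
if (storeConn.State != ConnectionState.Open)
    storeConn.Open();
using (DbCommand cmd = storeConn.CreateCommand())
{
    cmd.CommandText =
        "UPDATE dbo.titles SET price = price * 1.1 WHERE type = 'business'";
    cmd.ExecuteNonQuery();  // one round trip, no entities fetched
}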

I'd suggest (or even mandate) using stored procedures for searched updates or deletes. The blind searched operation or multiple-statement update is accomplished in a single database round trip, and you can even use the OUTPUT clause in SQL Server's DML to obtain information in rowset form showing exactly what was changed or deleted. Because this is a database-specific operation, a stored procedure sounds like a good workaround for this problem. With the use of stored procedures as needed, and the realization that most apps use the get-then-update pattern anyway, I think I'll dismiss this worry.

Optimizing the number of database round trips


Concern

Queries generated by LINQ to SQL and EF get too much or too little data at a time. Too much data in one query is a waste. Too little data is also bad because it means extra database round trips.


Both LINQ to SQL and EF have good mechanisms to optimize data retrieval. In addition, the problem at hand isn't unique to an ORM (object-relational mapper) or even to databases. In a filesystem graphical user interface, you don't normally prefetch information about every file in the entire filesystem when someone wants to look at the contents of the C drive. On the other hand, if you know you're eventually going to display all of the related entities' information, you likely do want to fetch it up front. If not, perhaps you want to get all related entities at once when the first child entity is selected, or get the children one at a time as each child entity is selected.

LINQ to SQL addresses this with the DeferredLoadingEnabled property on the DataContext. It is set to true by default, which means that a query for customers retrieves only the Customer objects, even when those customers have orders. The related Order objects are retrieved with extra round trips to the database as each Customer instance's Orders property is accessed in code. A related DataContext property, LoadOptions, takes a DataLoadOptions instance that allows you to control exactly how much related data is retrieved; that is, do you want only related orders, or would you rather have the framework fetch orders, order details, and associated products in a single round trip? DataLoadOptions also allows you to filter the related rows; for example, you can specify that you want each customer's associated orders, but only the orders placed after a certain date.
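
Here's a sketch of both mechanisms, assuming Northwind-style Customer and Order classes (the date filter is arbitrary):

DataLoadOptions options = new DataLoadOptions();
// Eager-load each customer's orders, and each order's details,
// in the same round trip as the customers
options.LoadWith<Customer>(c => c.Orders);
options.LoadWith<Order>(o => o.Order_Details);
// Filter which related rows come back with each customer
options.AssociateWith<Customer>(
    c => c.Orders.Where(o => o.OrderDate >= new DateTime(2008, 1, 1)));
ctx.LoadOptions = options;  // must be set before the first query executes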

The ADO.NET Entity Framework does this a bit differently. It doesn't have a property that controls whether deferred loading is enabled; related entities are simply not loaded until you ask for them. To load associated entities, there is a separate Load method, and an IsLoaded property that you can check before loading. EF also has an Include method on the query that lets you specify which related entities should be loaded, if eager loading is desired. With EF you can also use entity splitting in your design if you know you always want to retrieve OrderID and OrderDate, but no other properties, from the Orders table, although object purists may frown on composing objects based only on commonly used queries.
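
Sketches of both loading styles follow; the titleauthors navigation property name is an assumption and will vary with your model:

// Explicit loading: an extra round trip, on demand
if (!author.titleauthors.IsLoaded)
    author.titleauthors.Load();

// Eager loading: Include brings the related entities back
// in the same store query as the authors
var query = from a in ctx.Authors.Include("titleauthors")
            select a;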

You can also retrieve only certain columns from a table (for example, all the columns in the Customers table except the column that contains the customer's picture) with either a related type (entity splitting) or an anonymous type. And you can always specify a join that returns an anonymous type, if desired, to get only the properties you need from related tables.

So I'd say that this worry is not only completely unwarranted, but that LINQ to SQL and EF actually make programmers think more about lazy loading versus eager loading. Eager loading may be clearer and more maintainable than a join, which always returns an anonymous rowset with columns from all tables interspersed; you know exactly what related data (at an object level) is being requested and retrieved. But be careful when eager loading joins more than two tables: the generated SQL code will produce an anonymous rowset for the first two tables, and separate queries for the remaining tables.

LINQ to SQL and stored procedures


Concern

Adoption of LINQ to SQL and EF will discourage the use of stored procedures to return rowsets. In addition, the code generators will use a subset of T-SQL query constructs, that is, only the constructs that the LINQ or ESQL language supports, rather than the full power of the T-SQL query language.


Aficionados always think of a stored procedure as representing a contract between consumer and provider. Although the database metadata indicates the number and type of parameters, and so constitutes a contract for them, this is absolutely not true for rowsets returned by stored procedures. No database metadata records anything about the returned rowsets, or even how many rowsets a stored procedure will return. In addition, errors that occur in the middle of a stored procedure might result in rowsets not being returned at all. Finally, there's always the possibility of returning multiple and/or different rowsets from a stored procedure with conditional code. That's not much of a rowset contract at all.

One way to ameliorate this problem in SQL Server is to use multi-statement table-valued functions to return one rowset with known metadata. The main problem with this is performance; a multi-statement table-valued function (TVF) is the equivalent of filling a table variable in code and then returning it. There is almost always extra disk I/O involved: the I/O of reading the base tables, plus the I/O of filling the table variable, plus the I/O of reading the table variable at the consumer. There are also optimizer considerations: SQL Server table variables have no statistics, so if the table-valued function is used as a row source in a larger query, there is no way to estimate the number of rows it returns. SQL Server 2008's strongly typed table-valued parameters would be an analogous concept, but currently these are limited to being input-only in procedures; no strongly typed rowset result is currently supported.

Now that we've determined that there is no more of a rowset contract for stored procedures than for ad hoc SQL (the difference is in SQL encapsulation), what about T-SQL extensions that Entity SQL doesn't support? These include database-specific extensions, like SQL Server's PIVOT operator, as well as ANSI SQL standard features, like ranking and windowing functions.

LINQ aficionados are quick to talk about implementing missing constructs through extension methods, but the long and short of this is that extension methods are a LINQ-ism, unrelated to LINQ to SQL. The mapping from LINQ constructs to SQL dialect statements is fixed and embedded in the LINQ to SQL product; extension methods can't change which T-SQL statement is produced. To control what is produced, you'd need to implement the equivalent concepts on the client side and leave the generated database code alone.

EF may have a better story here, because each provider writer implements the ESQL-to-SQL mapping. Conceivably you could write a custom provider that wraps the supplied provider and adds the T-SQL–specific extensions. However, the ESQL language itself doesn't have the capability of ODBC-like escape clauses; therefore, there'd be no way to express this extended SQL-based functionality in ESQL.

I’d classify the “subset of SQL” and “stored procedure rowset is an anonymous type” problems as issues that might be worked out in future releases of databases and frameworks. Until LINQ to SQL or EF provides escape clauses in the framework, the easiest way out is the ultimate escape clause, using the stored procedure that returns (anonymous) rowsets. And the more stored procedures are used (not insert, update, and delete procedures, which enhance the model, but rowset-returning procedures), the farther away from the model you get. This interferes with the usefulness of the model in general.

Tuning and LINQ to SQL queries


Concern

LINQ to SQL and EF generated queries will be untunable because, even if you discover a performance problem, you can’t change the underlying API code to produce the exact SQL query that you want. There are too many layers of abstraction to change it.


T-SQL is a declarative language; a query is simply a description of what you want, not a description of how to physically retrieve it. Sometimes, however, the programmer can rephrase a query to get better performance. Part of query tuning can consist of changing the SQL to get the plan you want, based on your intimate knowledge of the current data and the current use cases. As a simple example, you can switch between joins, correlated subqueries, and nested subqueries to see which one gives the best performance, or use EXISTS rather than a JOIN, or UNION ALL rather than an IN clause. The limitation is that the query processor can get smarter in the future, making your past work unnecessary. Usually, though, you've benefited from the rewritten SQL for those extra years until the query processor changes.

Because the LINQ to SQL or ESQL queries are programmatically transformed into SQL queries, it is time consuming, but not impossible, to rephrase LINQ or ESQL queries to produce subtly different SQL queries and thus better performance. Because this optimization technique (rephrasing LINQ queries to change the generated SQL) is in its infancy, and is also one layer removed from simply tuning the SQL, we’ll have to see how it progresses as the frameworks become more popular. The Entity Framework team is thinking of introducing hints and providing direct control over query composition in future releases.
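
As a small example of such rephrasing in LINQ to SQL, both of the following queries return authors who have books, but the first generates an INNER JOIN (plus a DISTINCT to remove duplicates) while the second generates a correlated EXISTS predicate; depending on the data, one plan may beat the other. The titleauthors navigation property is assumed to exist on the author class:

// Join formulation: translates to INNER JOIN plus SELECT DISTINCT
var q1 = (from a in ctx.authors
          join ta in ctx.titleauthors on a.au_id equals ta.au_id
          select a).Distinct();

// Any() formulation: translates to WHERE EXISTS (SELECT ...)
var q2 = from a in ctx.authors
         where a.titleauthors.Any()
         select a;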

Besides query rewrites, you can also hint queries, and T-SQL allows a variety of query hints. This helps when the query processor chooses a suboptimal plan (uncommon, but not unheard of) or when you have intimate knowledge of the data and use cases. Another reason for hints is to service different use cases with the same query: SQL queries have only one plan at a time (modulo parallelized plans), and you might have to satisfy different use cases by hinting the same query differently. Because the translation to SQL is deeply embedded in the LINQ and EF source code, I can't add a hint in the LINQ or ESQL code if I find a performance problem that a hint would help. This means going back to stored procedures (they work with hints) rather than using model-generated queries.

Hinting is usually less preferable than rewriting the SQL because hints tie the query processor's hands. For example, if the statistics change so that a different plan would work better, the query processor can't use this information because you've told it how to accomplish the query; you've changed SQL from a declarative language to an imperative one. It's best not to put query hints in code, but to move them into a separate layer. SQL Server 2005 provides such a layer: plan guides. A plan guide is a named database object that relates a hint to a query without changing the underlying code. You can add and drop plan guides, turn them on and off at will, or re-evaluate them when things (the statistics or use cases) change.

Can you use plan guides with LINQ to SQL or EF queries? There are two things to keep in mind. First, a plan guide for a SQL statement requires an exact match on a batch-by-batch basis. Machine-generated SQL will likely make an exact match easier, but you will have to check that the guides are being used each time LINQ/EF libraries change. Second, plan guides work best if you have a limited number in your database. They’re meant to be special-case, not to add another level of complexity to a situation that is complex and becomes more complex as the layers of abstraction increase. Therefore, although you can use plan guides, use them with care.

Is this issue worth worrying about? I think we'll need to wait and see. Will you fix a few bad-query problems by rephrasing LINQ/EF code before giving up entirely, or fix performance problems in the generated SQL by going back to stored procedures?

Summary

New database APIs that promise to abstract the programming model away from the underlying database model and its declarative SQL language are always exciting to programmers who want to concentrate on presentation, use cases, and business logic, and spend less time with database optimization. LINQ to SQL and EF (as well as other object-relational mappers) show promise in this area but the proof is in the efficiency of the generated code and the size and reuse potential of the plan cache that results. Should folks be waiting in anticipation of LINQ to SQL/EF related performance problems? I’m not the only one who thinks optimizing declarative languages will always have its place; we’ve seen much written already in books and blogs about the right and wrong way to use LINQ to SQL and the ADO.NET Entity Framework. This chapter should provide you with hints and concerns from a database performance perspective.

About the author

Bob Beauchemin is a database-centric application practitioner and architect, instructor, course author, writer, and Developer Skills Partner for SQLskills. Over the past few years he’s been writing and teaching his SQL Server 2005 and 2008 courses to students worldwide through Microsoft-sponsored programs, as well as private client-centric classes. He is the lead author of the books A Developer’s Guide to SQL Server 2005 and A First Look at SQL Server 2005 For Developers, author of Essential ADO.NET, and writes a database development column for MSDN magazine.
