LINQ, short for Language Integrated Query, provides a powerful set of
mechanisms for working with collections of information, along with a
convenient syntax. You can use LINQ with the arrays and lists we saw in the
previous chapter—anything that implements IEnumerable<T>
can be used with LINQ, and
there are LINQ providers for databases and XML documents. And even if you
have to deal with data that doesn’t fit into any of these categories, LINQ
is extensible, so in principle, a provider could be written for more or less
any information source that can be accessed from .NET. This chapter will
focus mainly on LINQ to Objects—the provider for running queries against
objects and collections—but the techniques shown here are applicable to
other LINQ sources.
Collections of data are ubiquitous, so LINQ can have a profound effect on how you program. Both of your authors have found that LINQ has changed how we write C# in ways we did not anticipate. Pre-LINQ versions of C# now feel like a different and significantly less powerful language. It may take a little while to get your head around how to use LINQ, but it’s absolutely worth the effort.
LINQ is not a single language feature—it’s the culmination of several elements that were added to version 3.0 of the C# language and version 3.5 of the .NET Framework. (Despite the different version numbers, these did in fact ship at the same time—they were both part of the Visual Studio 2008 release.) So as well as exploring the most visible aspect of LINQ—the query syntax—we’ll also examine the other associated language and framework features that contribute to LINQ.
C# 3.0 added query expressions to the
language—these look superficially similar to SQL queries in some respects,
but they do not necessarily involve a database. For example, we could use
the data returned by the GetAllFilesInDirectory
code from the preceding
chapter, reproduced here in Example 8-1. This
returns an IEnumerable<string>
containing the filenames of all the files found by recursively searching
the specified directory. In fact, as we mentioned in the last chapter, it
wasn’t strictly necessary to work that hard. We implemented the function
by hand to illustrate some details of how lazy evaluation works, but as
Example 8-1 shows, we can get the .NET
Framework class library to do the work for us. The Directory.EnumerateFiles
method still enumerates the files in a lazy fashion when used in this
recursive search mode—it works in much the same way as the example we
wrote in the previous chapter.
Example 8-1. Enumerating filenames
static IEnumerable<string> GetAllFilesInDirectory(string directoryPath) { return Directory.EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories); }
Since a LINQ query can work with any enumeration of objects, we can write a query that just returns the files larger than, say, 10 million bytes, as shown in Example 8-2.
Example 8-2. Using LINQ with an enumeration
var bigFiles = from file in GetAllFilesInDirectory(@"c:") where new FileInfo(file).Length > 10000000 select file; foreach (string file in bigFiles) { Console.WriteLine(file); }
As long as the C# file has a using
System.Linq;
directive at the top (and Visual Studio adds this
to new C# files by default) this code will work just fine. Notice that
we’ve done nothing special to enable the use of a query here—the
GetAllFilesInDirectory
method just returns the lazy enumeration provided by the
Directory
class. And more generally, this sort of query
works with anything that implements IEnumerable<T>
.
Let’s look at the query in more detail. It’s common to assign LINQ
query expressions into variables declared with the var
keyword, as Example 8-2 does:
var bigFiles = ...
This tells the compiler that we want it to deduce that variable’s
type for us. As it happens, it will be an IEnumerable<string>
, and we could have
written that explicitly, but as you’ll see shortly, queries sometimes end
up using anonymous types, at which point the use of var
becomes mandatory.
The first part of the query expression itself is always a from
clause. This describes the source of
information that we want to query, and also defines a so-called range variable:
from file in GetAllFilesInDirectory(@"c:")
The source appears on the right, after the in
keyword—this query runs on the files returned
by the GetAllFilesInDirectory
method.
The range variable, which appears between the from
and in
keywords, chooses the name by which we’ll refer to source items in the
rest of the query—file
in this example.
It’s similar to the iteration variable in a foreach
loop.
The next line in Example 8-2
is a where
clause:
where new FileInfo(file).Length > 10000000
This is an optional, although very common, LINQ query feature. It
acts as a filter—only items for which the expression is true will be
present in the results of the query. This clause constructs a FileInfo
object for the file, and then looks at
its Length
property so that the query
only returns files that are larger than the specified size.
The final part of the query describes what information we want to
come out of the query, and it must be either a select
or a group
clause. Example 8-2 uses a select
clause:
select file;
This is a trivial select
clause—it just
selects the range variable, which contains the filename. That’s why this
particular query ends up producing an IEnumerable<string>
. But we can put other
expressions in here—for example, we could write:
select File.ReadAllLines(file).Length;
This uses the File
class (defined
in System.IO
) to read the file’s text
into an array with one element per line, and then retrieves that array’s
Length
. This would make the query
return an IEnumerable<int>
,
containing the number of lines in each file.
You may be wondering exactly how this works. The code in a LINQ query expression looks quite different from most other C# code—it is, by design, somewhat reminiscent of database queries. But it turns out that all that syntax turns into straightforward method calls.
The C# language specification defines a process by which all LINQ
query expressions are converted into method invocations. Example 8-3 shows what
the query expression in Example 8-2 turns into. Incidentally, C#
ignores whitespace on either side of the .
syntax for member access, so the fact that
this example has been split across multiple lines to fit on the page
doesn’t stop it from compiling.
Example 8-3. LINQ query as method calls
var bigFiles = GetAllFilesInDirectory(@"c:"). Where(file => new FileInfo(file).Length > 10000000);
Let’s compare this with the components of the original query:
var bigFiles = from file in GetAllFilesInDirectory(@"c:") where new FileInfo(file).Length > 10000000 select file;
The source, which follows the in
keyword in the query expression, becomes
the starting point—that’s the enumeration returned by GetAllFilesInDirectory
in this case. The next
step is determined by the presence of the where
clause—this turns into a call to the
Where
method on the source
enumeration. As you can see, the condition in the where
clause has turned into a lambda
expression, passed as an argument to the Where
method.
The final select
clause has
turned into...nothing! That’s because it’s a trivial select
—it just selects the range
variable and nothing else, in which case there’s no need to do any
further processing of the information that comes out of the Where
method. If we’d had a slightly more
interesting expression in the select
clause, for example:
var bigFiles = from file in GetAllFilesInDirectory(@"c:")
where new FileInfo(file).Length > 10000000
select "File: " + file;
we would have seen a corresponding Select
method in the
equivalent function calls, as Example 8-4 shows.
Example 8-4. Where and Select as methods
var bigFiles = GetAllFilesInDirectory(@"c:"). Where(file => new FileInfo(file).Length > 10000000). Select(file => "File: " + file);
A question remains, though: where did the Where
and Select
methods here come from? GetAllFilesInDirectory
returns an IEnumerable<string>
, and if you examine
this interface (which we showed in the preceding chapter) you’ll see
that it doesn’t define a Where
method. And yet if you try these method-based equivalents of the query
expressions, you’ll find that they compile just fine as long as you have
a using System.Linq;
directive at the
top of the file, and a project reference to the System.Core
library. What’s going on? The
answer is that Where
and Select
in these examples are extension
methods.
One of the language features added to C# 3.0 for LINQ is
support for extension methods. These are methods
bolted onto a type by some other type. You can add new methods to an
existing type, even if you can’t change that type—perhaps it’s a type
built into the .NET Framework. For example, the built-in string
type is not something we get to change,
and it’s sealed, so we cannot derive from it either, but that doesn’t
stop us from adding new methods. Example 8-5 adds a new and not
very useful Backwards
method that
returns a copy of the string with the characters in reverse
order.[17]
Example 8-5. Adding an extension method to string
static class StringAdditions { // Naive implementation for illustrative purposes. // DO NOT USE in real code! public static string Backwards(this string input) { char[] characters = input.ToCharArray(); Array.Reverse(characters); return new string(characters); } }
Notice the this
keyword in front
of the first argument—that indicates that Backwards
is an extension method. Also notice
that the class is marked as static
—you can only define extension methods
in static classes.
As long as this class is in a namespace that’s in scope (either
because of a using
directive, or
because it’s in the same namespace as the code that wants to use it) you
can call this method as though it were a normal member of the string
class:
string stationName = "Finsbury Park"; Console.WriteLine(stationName.Backwards());
The Where
and Select
methods used in Example 8-4 are extension methods. The
System.Linq
namespace defines a static class called Enumerable
which defines these and numerous
other extension methods for IEnumerable<T>
.
Here’s the signature for one of the Where
overloads:
public static IEnumerable<TSource> Where<TSource>( this IEnumerable<TSource> source, Func<TSource, bool> predicate)
Notice that this is a generic method—the method itself takes a
type argument, called TSource
here,
and passes that through as the type argument T
for the first parameter’s IEnumerable<T>
. The result is that this
method extends IEnumerable<T>
,
whatever T
may be. In other words, as
long as the System.Linq
namespace is
in scope, all IEnumerable<T>
implementations appear to offer a Where
method.
Select
and Where
are examples of LINQ
operators—standard methods that are available wherever LINQ
is supported. The Enumerable
class in
System.Linq
provides all the LINQ
operators for IEnumerable<T>
,
but is not the only LINQ provider—it just provides query support for
collections in memory, and is sometimes referred to as LINQ to Objects.
In later chapters, we’ll see sources that support LINQ queries against
databases and XML documents. Anyone can write a new provider, because C#
neither knows nor cares what the source is or how it works—it just
mechanically translates query expressions into method calls, and as long
as the relevant LINQ operators are available, it will use them. This
leaves different data sources free to implement the various operators in
whatever way they see fit. Example 8-6 shows how you
could exploit this to provide custom implementations of the Select
and Where
operators.
Example 8-6. Custom implementation of some LINQ operators
public class Foo { public string Name { get; set; } public Foo Where(Func<Foo, bool> predicate) { return this; } public TResult Select<TResult>(Func<Foo, TResult> selector) { return selector(this); } }
These are normal methods rather than extension methods—we’re writing a custom type, so we can add LINQ operators directly to that type. Since C# just converts LINQ queries into method calls, it doesn’t matter whether LINQ operators are normal methods or extension methods. So with these methods in place, we could write the code shown in Example 8-7.
Example 8-7. Confusing but technically permissible use of a LINQ query
Foo source = new Foo { Name = "Fred" }; var result = from f in source where f.Name == "Fred" select f.Name;
C# will follow the rules for translating query expressions into method calls, just as it would for any query, so it will turn Example 8-7 into this:
Foo source = new Foo { Name = "Fred" }; var result = source.Where(f => f.Name == "Fred").Select(f => f.Name);
Since the Foo
class provides
the Where
and Select
operators that C# expects, this will
compile and run. It won’t be particularly useful, because our Where
implementation completely ignores the
predicate. And it’s also a slightly bizarre thing to do—our Foo
class doesn’t appear to represent any kind
of collection, so it’s rather misleading to use syntax that’s intended
to be used with collections. In fact, Example 8-7 has the same
effect as:
var result = source.Name;
So you’d never write code like Example 8-6 and Example 8-7 for a type as
simple as Foo
in practice—the purpose
of these examples is to illustrate that the C# compiler blindly
translates query expressions into method calls, and has no understanding
or expectation of what those calls might do. The real functionality of
LINQ lives entirely in the class library. Query expressions are just a
convenient syntax.
Query expressions can contain let
clauses. This is an interesting kind of
clause in that unlike most of the rest of a query, it doesn’t correspond
directly to any particular LINQ operator. It’s just a way of making it
easier to structure your query.
You would use a let
clause when
you need to use the same information in more than one place in a query.
For example, suppose we want to modify the query in Example 8-2 to return a FileInfo
object, rather than a filename. We
could do this:
var bigFiles = from file in GetAllFilesInDirectory(@"c:") where new FileInfo(file).Length > 10000000 select new FileInfo(file);
But this code repeats itself—it creates a FileInfo
object in the where
clause and then creates another one in
the select
clause. We can avoid this
repetition with a let
clause:
var bigFiles = from file in GetAllFilesInDirectory(@"c:")
let info = new FileInfo(file)
where info.Length > 10000000
select info;
The C# compiler jumps through some significant hoops to
make this work. There’s no need to know the details to make use of a
let
clause, but if you’re curious to
know how it works, here’s what happens. Under the covers it generates a
class containing two properties called file
and info
, and ends up generating two
queries:
var temp = from file in GetAllFilesInDirectory(@"c:") select new CompilerGeneratedType(file, new FileInfo(file)); var bigFiles = from item in temp where item.info.Length > 10000000 select item.info;
The purpose of the first query is to produce a sequence in which
the range variable is wrapped in the compiler-generated type, alongside
any variables declared with a let
clause. (It’s not actually called CompilerGeneratedType
, of course—the compiler
generates a unique, meaningless name.) This allows all these variables
to be available in all the clauses of the query.
Before we look in detail at the services LINQ offers, there are some features that apply across all of LINQ that you should be aware of.
LINQ query syntax makes implicit use of lambdas. The
expressions that appear in where
, select
, or most other clauses are written as
ordinary expressions, but as you’ve seen, the C# compiler turns queries
into a series of method calls, and the expressions become lambda
expressions.
Most of the time, you can just write the expressions you need and they work. But you need to be wary of code that has side effects. For example, it would be a bad idea to write the sort of query shown in Example 8-8.
Example 8-8. Unhelpful side effects in a query
int x = 10000;
var bigFiles = from file in GetAllFilesInDirectory(@"c:")
where new FileInfo(file).Length > x++
select file;
The where
clause here
increments a variable declared outside the scope of the query.
This is allowed (although it’s a bad idea) in LINQ to Objects. Some LINQ providers, such as the ones you would use with databases, will reject such a query at runtime.
This will have the potentially surprising result that the query
could return different files every time it runs, even if the underlying
data has not changed. Remember, the expression in the where
clause gets converted into an anonymous
method, which will be invoked once for every item in the query’s source.
The first time this runs, the local x
variable will be incremented once for every file on the disk. If the
query is executed again, that’ll happen again—nothing will reset
x
to its original state.
Moreover, queries are often executed sometime after the point at
which they are created, which can make code with side effects very hard
to follow—looking at the code in Example 8-8 it’s not possible to say
exactly when x
will be modified. We’d
need more context to know that—when exactly is the bigFiles
query evaluated? How many
times?
In practice, it is important to avoid side effects in queries.
This extends beyond simple things such as the ++
operator—you also need to be careful about
invoking methods from within a query expression. You’ll want to avoid
methods that change the state of your application.
It’s usually OK for expressions in a query to read variables from the surrounding scope, though. A small modification to Example 8-8 illustrates one way you could exploit this (see Example 8-9).
Example 8-9. Using a local variable in a query
int minSize = 10000; var bigFiles = from file in GetAllFilesInDirectory(@"c:") where new FileInfo(file).Length > minSize select file; var filesOver10k = bigFiles.ToArray(); minSize = 100000; var filesOver100k = bigFiles.ToArray(); minSize = 1000000; var filesOver1MB = bigFiles.ToArray(); minSize = 10000000; var filesOver10MB = bigFiles.ToArray();
This query makes use of a local variable as before, but this query
simply reads the value rather than modifying it. By changing the value
of that variable, we can modify how the query behaves the next time it
is evaluated. (The call to ToArray()
executes the query and puts the results into an array. This is one way
of forcing an immediate execution of the query.)
LINQ operators all share a common characteristic: they do
not modify the data they work on. For example, you can get LINQ to sort
the results of a query, but unlike Array.Sort
or List<T>.Sort
, which both modify the
order of an existing collection, sorting in LINQ works by producing a
new IEnumerable<T>
which returns objects in the specified order. The original collection is
not modified.
This is similar in style to .NET’s string
type. The string
class provides various methods that
look like they will modify the string, such as Trim
, ToUpper
, and Replace
. But strings are immutable, so all of
these methods work by building a new string—you get a modified copy,
leaving the original intact.
LINQ never tries to modify sources, so it’s able to work with
immutable sources. LINQ to Objects relies on IEnumerable<T>
, which does not provide
any mechanism for modifying the contents or order of the underlying
collection.
Of course, LINQ does not require sources to be immutable. IEnumerable<T>
can be implemented by
modifiable and immutable classes alike. The point is that LINQ will
never attempt to modify its source collections.
This approach is sometimes described as a functional style. Functional programming languages such as F# tend to have this characteristic—just as mathematical functions such as addition, multiplication, and trigonometric functions do not modify their inputs, neither does purely functional code. Instead, it generates new information based on its inputs—new enumerations layered on top of input enumerations in the case of LINQ.
C# is not a purely functional language—it’s possible and indeed common to write code that modifies things—but that doesn’t stop you from using a functional style, as LINQ shows.
Functional code is often highly composable—it tends to lead to APIs whose features can easily be combined in all sorts of different ways. This in turn can lead to more maintainable code—small, simple features are easier to design, develop, and test than complex, monolithic chunks of code, but you can still tackle complex problems by combining smaller features. Since LINQ works by passing a sequence to a method that transforms its input into a new sequence, you can plug together as many LINQ operators as you like. The fact that these operators never modify their inputs simplifies things. If multiple pieces of code are all vying to modify some data, it can become difficult to ensure that your program behaves correctly. But with a functional style, once data is produced it never changes—new calculations yield new data instead of modifying existing data. If you can be sure that some piece of data will never change, it becomes much easier to understand your code’s behavior, and you’ll have a better chance of making it work. This is especially important with multithreaded code.
Chapter 7 introduced the idea of
lazy enumeration (or deferred
execution, as it’s also sometimes called). As we saw,
iterating over an enumeration such as the one returned by GetAllFilesInDirectory
does the necessary work
one element at a time, rather than
processing everything up front. The query in Example 8-2 preserves this characteristic—if you run the code, you
won’t have to wait for GetAllFilesInDirectory
to finish before you
see any results; it will start printing filenames immediately. (Well,
almost immediately—it depends on how far it has to look before finding a
file large enough to get through the where
clause.) And in general, LINQ queries
will defer work as much as possible—merely having executed the code that
defines the query doesn’t actually do anything. So in our example, this
code:
var bigFiles = from file in GetAllFilesInDirectory(@"c:") where new FileInfo(file).Length > 10000000 select file;
does nothing more than describe the query. No
work is done until we start to enumerate the bigFiles
result with a foreach
loop. And at each iteration of that
loop, it does the minimum work required to get the next item—this might
involve retrieving multiple results from the underlying collection,
because the where
clause will keep
fetching items until it either runs out or finds one that matches the
condition. But even so, it does no more work than necessary.
The picture may change a little as you use some of the more advanced features described later in this chapter—for example, you can tell a LINQ query to sort your data, in which case it will probably have to look at all the results before it can work out the correct order. (Although even that’s not a given—it’s possible to write a source that knows all about ordering, and if you have special knowledge about your data source, it may be possible to write a source that delivers data in order while still fetching items lazily. We’ll see providers that do this when we look at how to use LINQ with databases in a later chapter.)
Although deferred execution is almost always a good thing, there’s one gotcha to bear in mind. Because the query doesn’t run up front, it will run every time you evaluate it. LINQ doesn’t keep a copy of the results when you execute the query, and there are good reasons you wouldn’t want it to—it could consume a lot of memory, and would prevent you from using the technique in Example 8-9. But it does mean that relatively innocuous-looking code can turn out to be quite expensive, particularly if you’re using a LINQ provider for a database. Inadvertently evaluating the query multiple times could cause multiple trips to the database server.
There are around 50 standard LINQ operators. The rest of this chapter describes the most important operators, broken down by the main areas of functionality. We’ll show how to use them both from a query expression (where possible) and with an explicit method call.
Sometimes it’s useful to call the LINQ query operator methods
explicitly, rather than writing a query expression. Some operators offer
overloads with advanced features that are not available in a query
expression. For example, sorting strings is a locale-dependent
operation—there are variations on what constitutes alphabetical ordering
in different languages. The query expression syntax for ordering data
always uses the current thread’s default culture for ordering. If you
need to use a different culture for some reason, or you want a
culture-independent order, you’ll need to call an overload of the
OrderBy
operator explicitly instead
of using an orderby
clause in a query
expression.
There are even some LINQ operators that don’t have an equivalent in a query expression. So understanding how LINQ uses methods is not just a case of looking at implementation details. It’s the only way to access some more advanced LINQ features.
You already saw the main filtering feature of LINQ. We
illustrated the where
clause and the
corresponding Where
operator in Example 8-2 and Example 8-3, respectively. Another filter
operator worth being aware of is called OfType
. It has no query expression equivalent,
so you can use it only with a method call. OfType
is useful when you have a collection
that could contain a mixture of types, and you only want to look at the
elements that have a particular type. For example, in a user interface
you might want to get hold of control elements (such as buttons),
ignoring purely visual elements such as images or drawings. You could
write this sort of code:
var controls = myPanel.Children.OfType<Control>();
If myPanel.Children
is a
collection of objects of some kind, this code will ensure that controls
is an enumeration that only returns
objects that can be cast to the Control
type.
Although OfType
has no
equivalent in a query expression, that doesn’t stop you from using it in
conjunction with a query expression—you can use the result of OfType
as the source for a query:
var controlNames = from control in myPanel.Children.OfType<Control>() where !string.IsNullOrEmpty(control.Name) select control.Name;
This uses the OfType
operator
to filter the items down to objects of type Control
, and then uses a where
clause to further filter the items to
just those with a nonempty Name
property.
Query expressions can contain an orderby
clause, indicating the order in which
you’d like the items to emerge from the query. In queries with no
orderby
clause, LINQ does not, in
general, make any guarantees about the order in which items emerge. LINQ
to Objects happens to return items in the order in which they emerge
from the source enumeration if you don’t specify an order, but other
LINQ providers will not necessarily define a default order. (In
particular, database LINQ providers typically return items in an
unpredictable order unless you explicitly specify an order.)
So as to have some data to sort, Example 8-10 brings back the
CalendarEvent
class from Chapter 7.
Example 8-10. Class representing a calendar event
class CalendarEvent { public string Title { get; set; } public DateTimeOffset StartTime { get; set; } public TimeSpan Duration { get; set; } }
When examples in this chapter refer to an events
variable, assume that it was
initialized with the data shown in Example 8-11.
Example 8-11. Some example data
List<CalendarEvent> events = new List<CalendarEvent> { new CalendarEvent { Title = "Swing Dancing at the South Bank", StartTime = new DateTimeOffset (2009, 7, 11, 15, 00, 00, TimeSpan.Zero), Duration = TimeSpan.FromHours(4) }, new CalendarEvent { Title = "Saturday Night Swing", StartTime = new DateTimeOffset (2009, 7, 11, 19, 30, 00, TimeSpan.Zero), Duration = TimeSpan.FromHours(6.5) }, new CalendarEvent { Title = "Formula 1 German Grand Prix", StartTime = new DateTimeOffset (2009, 7, 12, 12, 10, 00, TimeSpan.Zero), Duration = TimeSpan.FromHours(3) }, new CalendarEvent { Title = "Swing Dance Picnic", StartTime = new DateTimeOffset (2009, 7, 12, 15, 00, 00, TimeSpan.Zero), Duration = TimeSpan.FromHours(4) }, new CalendarEvent { Title = "Stompin' at the 100 Club", StartTime = new DateTimeOffset (2009, 7, 13, 19, 45, 00, TimeSpan.Zero), Duration = TimeSpan.FromHours(5) } };
Example 8-12 shows a LINQ query that orders these events by start time.
Example 8-12. Ordering items with LINQ
var eventsByStartTime = from ev in events orderby ev.StartTime select ev;
By default, the items will be sorted into ascending order. You can be explicit about this if you like:
var eventsByStartTime = from ev in events orderby ev.StartTime ascending select ev;
And, of course, you can sort into descending order too:
var eventsByStartTime = from ev in events orderby ev.StartTime descending select ev;
The expression in the orderby
clause does not need to correspond directly to a property of the source
object. It can be a more complex expression. For example, we could
extract just the time of day to produce the slightly confusing result of
events ordered by what time they start, regardless of date:
var eventsByStartTime = from ev in events orderby ev.StartTime.TimeOfDay select ev;
You can specify multiple criteria. Example 8-13 sorts the events: first by date (ignoring the time) and then by duration.
Example 8-13. Multiple sort criteria
var eventsByStartDateThenDuration = from ev in events orderby ev.StartTime.Date, ev.Duration select ev;
Four LINQ query operator methods correspond to the orderby
clause. Most obviously, there’s
OrderBy
, which takes a
single ordering criterion as a lambda:
var eventsByStartTime = events.OrderBy(ev => ev.StartTime);
That code has exactly the same effect as Example 8-12. Of course, like most LINQ
operators, you can chain this together with other ones. So we could
combine that with the Where
operator:
var longEvents = events.OrderBy(ev => ev.StartTime). Where(ev => ev.Duration > TimeSpan.FromHours(2));
This is equivalent to the following query:
var longEvents = from ev in events orderby ev.StartTime where ev.Duration > TimeSpan.FromHours(2) select ev;
You can customize the comparison mechanism used to sort the items
by using an overload that accepts a comparison object—it must implement
IComparer<TKey>
[18] where TKey
is the type
returned by the ordering expression. So in these examples, it would need
to be an IComparer<DateTimeOffset>
, since that’s
the type of the StartTime
property
we’re using to order the data. There’s not a lot of scope for discussion
about what order dates come in, so this is not a useful example for
plugging in an alternate comparison. However, string comparisons do vary
a lot—different languages have different ideas about what order letters
come in, particularly when it comes to letters with accents. The .NET
Framework class library offers a StringComparer
class that can provide an
IComparer<string>
implementation for any language and culture supported in .NET. The
following example uses this in conjunction with an overload of the
OrderBy
operator to sort the events
by their title, using a string sorting order appropriate for the
French-speaking Canadian culture, and configured for case
insensitivity:
CultureInfo cult = new CultureInfo("fr-CA"); // 2nd argument is true for case insensitivity StringComparer comp = StringComparer.Create(cult, true); var eventsByTitle = events.OrderBy(ev => ev.Title, comp);
There is no equivalent query expression—if you want to use
anything other than the default comparison for a type, you must use this
overload of the OrderBy
operator.
The OrderBy
operator method
always sorts in ascending order. To sort in descending order, there’s an
OrderByDescending
operator.
If you want to use multiple sort criteria, as in Example 8-13, a different operator comes into
play: you need to use either ThenBy
or ThenByDescending
. This is because
the OrderBy
and OrderByDescending
operators discard the order
of incoming elements and impose the specified order from scratch—that’s
the whole point of those operators. Refining an ordering by adding
further sort criteria is a different kind of operation, hence the
different operators. So the method-based equivalent of Example 8-13 would look like this:
var eventsByStartTime = events.OrderBy(ev => ev.StartTime). ThenBy(ev => ev.Duration);
Ordering will cause LINQ to Objects to iterate through the whole source collection before returning any elements—it can only sort items once it has seen all of the items.
Sometimes you’ll end up wanting to combine two sequences
of values into one. LINQ provides a very straightforward operator for
this: Concat
. There is no equivalent
in the query expression syntax. If you wanted to combine two lists of
events into one, you would use the code in Example 8-14.
Note that this does not modify the inputs. This builds a new
enumeration object that returns all the elements from existingEvents
, followed by all the elements
from newEvents
.
So this can be safer than the List<T>.AddRange
method shown in Chapter 7, because this
doesn’t modify anything. (Conversely, if you were expecting Example 8-14 to modify existingEvents
, you will be
disappointed.)
This is a good illustration of how LINQ uses the functional style described earlier.
Like mathematical functions, most LINQ operators calculate their outputs without
modifying their inputs. For example, if you have two int
variables called x
and y
,
you would expect to be able to calculate x+y
without that calculation changing either
x
or y
. Concatenation works the same way—you can
produce a sequence that is the concatenation of two inputs without
changing those inputs.
As with most LINQ operators, concatenation uses deferred
evaluation—it doesn’t start asking its source enumerations for elements
in advance. Only when you start to iterate through the contents of
allEvents
will this start retrieving
items from existingEvents
. (And it
won’t start asking for anything from newEvents
until it has retrieved all the
elements from existingEvents
.)
LINQ provides the ability to take flat lists of data and
group them. As Example 8-15 shows, we could
use this to write a LINQ-based alternative to the GetEventsByDay
method shown in Chapter 7.
Example 8-15. Simple LINQ grouping
var eventsByDay = from ev in events group ev by ev.StartTime.Date;
This will arrange the objects in the events
source into one group for each
day.
The eventsByDay
variable here
ends up with a slightly different type than anything we’ve seen before.
It’s an IEnumerable<IGrouping<DateTimeOffset,
CalendarEvent>>
. So eventsByDay
is an enumeration, and it
returns an item for each group found by the group
clause. Example 8-16 shows one way of using
this. It iterates through the collection of groupings, and for each
grouping it displays the Key
property—the value by which the items have been grouped—and then
iterates through the items in the group.
Example 8-16. Iterating through grouped results
foreach (var day in eventsByDay) { Console.WriteLine("Events for " + day.Key); foreach (var item in day) { Console.WriteLine(item.Title); } }
This produces the following output:
Events for 7/11/2009 12:00:00 AM Swing Dancing at the South Bank Saturday Night Swing Events for 7/12/2009 12:00:00 AM Formula 1 German Grand Prix Swing Dance Picnic Events for 7/13/2009 12:00:00 AM Stompin' at the 100 Club
This illustrates that the query in Example 8-15 has successfully grouped the events by
day, but let’s look at what returned in a little more detail. Each group
is represented as an IGrouping<TKey,
TElement>
, where TKey
is
the type of the expression used to group the data—a DateTimeOffset
in this case—and TElement
is the type of the elements making up
the groups—CalendarEvent
in this
example. IGrouping<TKey,
TElement>
derives from IEnumerable<TElement>
, so you can
enumerate through the contents of a group like you would any other
enumeration. (In fact, the only thing IGrouping<TKey, TElement>
adds is the
Key
property, which is the grouping
value.) So the query in Example 8-15 returns
a sequence of sequences—one for each group (see Figure 8-1).
While a LINQ query expression is allowed to end with a group
clause, as Example 8-15 does, it doesn’t have to finish there.
If you would like to do further processing, you can add an into
keyword on the
end, followed by an identifier. The continuation of the query after a
group
... into
clause will iterate over the groups, and
the identifier effectively becomes a new range variable. Example 8-17 uses this to convert
each group into an array. (Calling ToArray
on an IGrouping
effectively discards the Key
, and leaves you with just an array
containing that group’s contents. So this query ends up producing an
IEnumerable<CalendarEvent[]>
—a
collection of arrays.)
Example 8-17. Continuing a grouped query with into
var eventsByDay = from ev in events group ev by ev.StartTime.Date into dayGroup select dayGroup.ToArray();
Like the ordering operators, grouping will cause LINQ to Objects to evaluate the whole source sequence before returning any results.
The select
clause’s job is
to define how each item should look when it comes out of the query. The
official (if somewhat stuffy) term for this is
projection. The simplest possible kind of
projection just leaves the items as they are, as shown in Example 8-18.
Earlier, you saw this kind of trivial select
clause collapsing away to nothing.
However, that doesn’t happen here, because this is what’s called a
degenerate query—it contains nothing
but a trivial projection. (Example 8-2 was different, because it
contained a where
clause in addition
to the trivial select
.) LINQ never
reduces a query down to nothing at all, so when faced with a degenerate
query, it leaves the trivial select
in place, even though it appears to have nothing to do. So Example 8-18 becomes a call to the Select
LINQ operator method:
var projected = events.Select(ev => ev);
But projections often have work to do. For example, if we want to pick out event titles, we can write this:
var projected = from ev in events select ev.Title;
Again, this becomes a call to the Select
LINQ operator method, with a slightly
more interesting projection lambda:
var projected = events.Select(ev => ev.Title);
We can also calculate new values in the select
clause. This calculates the end time of
the events:
var projected = from ev in events select ev.StartTime + ev.Duration;
You can use any expression you like in the select
clause. In fact, there’s not even any
obligation to use the range variable, although it’s likely to be a bit
of a waste of time to construct a query against a data source if you
ultimately don’t use any data from that source. But C# doesn’t care—any
expression is allowed. The following slightly silly code generates one
random number for each event, in a way that is entirely unrelated to the
event in question:
Random r = new Random(); var projected = from ev in events select r.Next();
You can, of course, construct a new object in the select
clause. There’s one interesting
variation on this that often crops up in LINQ queries, which occurs when
you want the query to return multiple pieces of information for each
item. For example, we might want to display calendar events in a format
where we show both the start and the end times. This is slightly
different from how the CalendarEvent
class represents things—it stores the duration rather than the end time.
We could easily write a query that calculates the end time, but it
wouldn’t be very useful to have just that time. We’d want all the
details—the title, the start time, and the end time.
In other words, we’d be transforming the data slightly. We’d be
taking a stream of objects where each item contains Title
, StartTime
, and Duration
properties, and producing one where
each item contains a Title
, StartTime
, and EndTime
. Example 8-19 does
exactly this.
Example 8-19. Select clause with anonymous type
var projected = from ev in events select new { Title = ev.Title, StartTime = ev.StartTime, EndTime = ev.StartTime + ev.Duration };
This constructs a new object for each item. But while the
new
keyword is there,
notice that we’ve not specified the name of a type. All we have is the
object initialization syntax to populate various properties—the list of
values in braces after the new
keyword. We haven’t even defined a type anywhere in these examples that
has a Title
, a StartTime
, and an EndTime
property. And yet this compiles. And
we can go on to use the results as shown in Example 8-20.
Example 8-20. Using a collection with an anonymous item type
foreach (var item in projected) { Console.WriteLine("Event {0} starts at {1} and ends at {2}", item.Title, item.StartTime, item.EndTime); }
These two examples are using the anonymous type feature added in C# 3.0.
If we want to define a type to represent some information in our
application, we would normally use the class
or struct
keyword as described in Chapter 3. Typically, the
type definition would live in its own source file, and in a real
project we would want to devise unit tests to ensure that it works as
expected. This might be enough to put you off the idea of defining a
type for use in a very narrow context, such as having a convenient
container for the information coming out of a query. But it’s often
useful for the select
clause of a
query just to pick out a few properties from the source items,
possibly transforming the data in some way to get it into a convenient
representation.
Extracting just the properties you need can become important
when using LINQ with a database—database providers are typically
able to transform the projection into an equivalent SQL SELECT
statement. But if your LINQ query
just fetches the whole row, it will end up fetching every column
whether you need it or not, placing an unnecessary extra load on the
database and network.
There’s a trade-off here. Is the effort of creating a type worth the benefits if you’re only going to use it to hold the results of a query? If your code immediately does further processing of the data, the type will be useful to only a handful of lines of code. But if you don’t create the type, you have to deal with a compromise—you might not be able to structure the information coming out of your query in exactly the way you want.
C# 3.0 shifts the balance in favor of creating a type in this scenario, by removing most of the effort required, thanks to anonymous types. This is another language feature added mainly for the benefit of LINQ, although you can use it in other scenarios if you find it useful. An anonymous type is one that the C# compiler writes for you, based on the properties in the object initializer list. So when the compiler sees this expression from Example 8-19:
new { Title = ev.Title, StartTime = ev.StartTime, EndTime = ev.StartTime + ev.Duration };
it knows that it needs to supply a type, because we’ve not
specified a type name after the new
keyword. It will create a new class definition, and will define
properties for each entry in the initializer. It will work out what
types the properties should have from the types of the expressions in
the initializer. For example, the ev.Title
expression evaluates to a string,
so it will add a property called Title
of type string
.
Before generating a new anonymous type, the C# compiler checks to see if it has already generated one with properties of the same name and type, specified in the same order elsewhere in your project. If it has, it just reuses that type. So if different parts of your code happen to end up creating identical anonymous types, the compiler is smart enough to share the type definition. (Normally, the order in which properties are defined has no significance, but in the case of anonymous types, C# considers two types to be equivalent only if the properties were specified in the same order.)
The nice thing about this is that when we come to use the items in a collection based on an anonymous type (such as in Example 8-20) IntelliSense and compile-time checking work exactly as they always do—it’s just like working with a normal type, but we didn’t have to write it.
From the point of view of the .NET Framework, the type generated by the C# compiler is a perfectly ordinary type like any other. It neither knows nor cares that the compiler wrote the class for us. It’s anonymous only from the point of view of our C# code—the generated type does in fact have a name, it’s just a slightly odd-looking one. It’ll be something like this:
<>f__AnonymousType0`3
The C# compiler deliberately picks a name for the type that would be illegal as a C# class name (but which is still legal as far as .NET is concerned) in order to stop us from trying to use the class by its name—that would be a bad thing to do, because the compiler doesn’t guarantee to keep the name the same from one compilation to the next.
The anonymity of the type name means that anonymous types are
only any use within a single method. Suppose you wanted to return an
anonymous type (or an IEnumerable<SomeAnonymousType>
) from a
method—what would you write as the return type if the type in question
has no name? You could use Object
,
but the properties of the anonymous type won’t be visible. The best
you could do is use dynamic
, which
we describe in Chapter 18. This would make it possible
to access the properties, but without the aid of compile-time type
checking or IntelliSense. So the main purpose of anonymous types is
simply to provide a convenient way to get information from a query to
code later in the same method that does something with that
information.
Anonymous types would not be very useful without the var
keyword, another
feature introduced in C# 3.0. As we saw earlier, when you declare a
local variable with the var
keyword, the compiler works out the type from the expression you use
to initialize the variable. To see why we need this for anonymous
types to be useful, look at Example 8-19—how would you declare
the projected
local variable if we
weren’t using var
? It’s going to be
some sort of IEnumerable<T>
,
but what’s T
here? It’s an
anonymous type, so by definition we can’t write down its name. It’s
interesting to see how Visual Studio reacts if we ask it to show us
the type by hovering our mouse pointer over the variable—Figure 8-2 shows the
resultant data tip.
Visual Studio chooses to denote anonymous types with names such
as 'a
, 'b
, and so forth. These are not legal
names—they’re just placeholders, and the data tip pop up goes on to
show the structure of the anonymous types they represent.
Whether or not you’re using anonymous types in your projections, there’s an alternative form of projection that you will sometimes find useful when dealing with multiple sources.
Earlier, Example 8-15 used a
groupby
clause to add
some structure to a list of events—the result was a list containing
one group per day, with each group itself containing a list of events.
Sometimes it can be useful to go in the opposite direction—you may
have structured information that you would like to flatten into a
single list. You can do this in a query expression by writing multiple
from
clauses, as Example 8-21 shows.
Example 8-21. Flattening lists using multiple from clauses
var items = from day in eventsByday from item in day select item;
You can think of this as having roughly the same effect as the following code:
List<CalendarEvent> items = new List<CalendarEvent>(); foreach (IGrouping<DateTime, CalendarEvent> day in eventsByDay) { foreach (CalendarEvent item in day) { items.Add(item); } }
That’s not exactly how it works, because the LINQ query will use
deferred execution—it won’t
start iterating through the source items until you start trying to
iterate through the query. The foreach
loops, on the other hand, are
eager—they build the entire flattened list as soon as they run. But
lazy versus eager aside, the set of items produced is the same—for
each item in the first source, every item in the second source will be
processed.
Notice that this is very different from the concatenation
operator shown earlier. That also works with two sources, but it
simply returns all the items in the first source, followed by all
the items in the second source. But Example 8-21 will iterate
through the source of the second from
clause once for every item in the
source of the first from
clause.
(So concatenation and flattening are as different as addition and
multiplication.) Moreover, the second from
clause’s source expression typically
evaluates to a different result each time around.
In Example 8-21,
the second from
clause uses the
range variable from the first from
clause as its source. This is a common technique—it’s what enables
this style of query to flatten a grouped structure. But it’s not
mandatory—you can use any LINQ-capable source you like; for example,
any IEnumerable<T>
. Example 8-22 uses the same
source array for both from
clauses.
Example 8-22. Alternative use of multiple from clauses
int[] numbers = { 1, 2, 3, 4, 5 }; var multiplied = from x in numbers from y in numbers select x * y; foreach (int n in multiplied) { Console.WriteLine(n); }
The source contains five numbers, so the resultant multiplied
sequence contains 25 elements—the
second from
clause counts through
all five numbers for each time around the first from
clause.
The LINQ operator method for flattening multiple sources is
called SelectMany
. The
equivalent of Example 8-22 looks like
this:
var multiplied = numbers.SelectMany( x => numbers, (x, y) => x * y);
The first lambda is expected to return the collection over which
the nested iteration will be performed—the collection for the second
from
clause in the LINQ query. The
second lambda is the projection from the select
clause in the query. In queries with
a trivial final projection, a simpler form is used, so the equivalent
of Example 8-21
is:
var items = days.SelectMany(day => day);
Whether you’re using a multisource SelectMany
or a simple single-source
projection, there’s a useful variant that lets your projection know
each item’s position, by passing a number into the projection.
The Select
and
SelectMany
LINQ operators both
offer overloads that make it easy to number items. Example 8-23 uses this to build a list of numbered
event names.
Example 8-23. Adding item numbers
var numberedEvents = events. Select((ev, i) => string.Format("{0}: {1}", i + 1, ev.Title));
If we iterate over this, printing out each item:
foreach (string item in numberedEvents) { Console.WriteLine(item); }
the results look like this:
1: Swing Dancing at the South Bank 2: Formula 1 German Grand Prix 3: Swing Dance Picnic 4: Saturday Night Swing 5: Stompin' at the 100 Club
This illustrates how LINQ often makes for much more concise code
than was possible before C# 3.0. Remember that in Chapter 7, we wrote a function that takes an array
of strings and adds a number in a similar fashion. That required a
loop with several lines of code, and it worked only if we already
happened to have a collection of strings. Here we’ve turned a
collection of CalendarEvents
into a
collection of numbered event titles with just a single method
call.
As you get to learn LINQ, you’ll find this happens quite a lot—situations in which you might have written a loop, or a series of loops, can often turn into fairly simple LINQ queries.
The Zip
operator is
useful when you have two related sequences, where each element in one
sequence is somehow connected with the element at the same position in
the other sequence. You can unite the two sequences by
zipping them back into one. Obviously, the name has
nothing to do with the popular ZIP compression format. This operator is
named after zippers of the kind used in clothing.
This might be useful with a race car telemetry application of the kind we discussed in Chapter 2. You might end up with two distinct series of data produced by two different measurement sources. For example, fuel level readings and lap time readings could be two separate sequences, since such readings would likely be produced by different instruments. But if you’re getting one reading per lap in each sequence, it might be useful to combine these into a single sequence with one element per lap, as Example 8-24 shows.
Example 8-24. Zipping two sequences into one
IEnumerable<TimeSpan> lapTimes = GetLapTimes(); IEnumerable<double> fuelLevels = GetLapFuelLevels(); var lapInfo = lapTimes.Zip(fuelLevels, (time, fuel) => new { LapTime = time, FuelLevel = fuel });
You invoke the Zip
operator on
one of the input streams, passing in the second stream as the first
argument. The second argument is a projection function—it’s similar to
the projections used with the Select
operator, except it is passed two arguments, one for each stream. So the
lapInfo
sequence produced by Example 8-24 will contain one item per
lap, where the items are of an anonymous type, containing both the
LapTime
and the FuelLevel
in a single item.
Since the two sequences are of equal length here—the number of
laps completed—it’s clear how long the output sequence will be, but what
if the input lengths differ? The Zip
operator stops as soon as either one of the input sequences stops, so
the shorter of the two determines the length. Any spare elements in the
longer stream will not be used.
Sometimes you won’t want to work with an entire collection. For
example, in an application with limited screen space, you might want to
show just the next three events on the user’s calendar. While there is
no way to do this directly in a query expression, LINQ defines a
Take
operator for this purpose. As
Example 8-25 shows, you
can still use the query syntax for most of the query, using the Take
operator as the final stage.
Example 8-25. Taking the first few results of a query
var eventsByStart = from ev in events orderby ev.StartTime where ev.StartTime > DateTimeOffset.Now select ev; var next3Events = eventsByStart.Take(3);
LINQ also defines a Skip
operator which
does the opposite of Take
—it drops the first
three items (or however many you ask it to drop) and then returns all
the rest.
If you’re interested in only the very first item, you may find the
First
operator more convenient. If
you were to call Take(1)
, the method
would still return a collection of items. So this code would not
compile:
CalendarEvent nextEvent = eventsByStart.Take(1);
You’d get the following compiler error:
CS0266: Cannot implicitly convert type 'System.Collections.Generic.IEnumerable< CalendarEvent>' to CalendarEvent'. An explicit conversion exists (are you missing a cast?)
In other words, Take
always
returns an IEnumerable<CalendarEvent>
, even if we
ask for only one object. But this works:
CalendarEvent nextEvent = eventsByStart.First();
First
gets the first element
from the enumeration and returns that. (It then abandons the
enumerator—it doesn’t iterate all the way to the end of the
sequence.)
You may run into situations where the list might be empty. For
example, suppose you want to show the user’s next appointment for
today—it’s possible that there are no more appointments. If you call
First
in this scenario, it will throw
an exception. So there’s also a FirstOrDefault
operator, which returns the
default value when there are no elements (e.g., null
, if you’re dealing with a reference
type). The Last
and LastOrDefault
operators
are similar, except they return the very last element in the sequence,
or the default value in the case of an empty sequence.
A yet more specialized case is where you are expecting a sequence
to contain no more than one element. For example, suppose you modify the
CalendarEvent
class to add an
ID
property intended to be used as a
unique identifier for the event. (Most real calendar systems have a
concept of a unique ID to provide an unambiguous way of referring to a
particular calendar entry.) You might write this sort of query to find
an item by ID:
var matchingItem = from ev in events where ev.ID == theItemWeWant select ev;
If the ID
property is meant to
be unique, we would hope that this query returns no more than one item.
The presence of two or more items would point to a problem. If you use
either the First
or the FirstOrDefault
operator, you’d never notice
the problem—these would pick the first item and silently ignore any
more. As a general rule, you don’t want to ignore signs of trouble. In
this case, it would be better to use either Single
or SingleOrDefault
. Single
would be the right choice in cases
where failure to find a match would be an error, while SingleOrDefault
would be appropriate if you do
not necessarily expect to find a match. Either will throw an InvalidOperationException
if the sequence
contains more than one item. So given the previous query, you could use
the following:
CalendarEvent item = matchingItem.SingleOrDefault();
If a programming error causes multiple different calendar events
to end up with the same ID, this code will detect that problem. (And if
your code contains no such problem, this will work in exactly the same
way as FirstOrDefault
.)
You may need to discover at runtime whether certain characteristics are true about any or every element in a collection. For example, if the user is adding a new event to the calendar, you might want to warn him if the event overlaps with any existing items. First, we’ll write a helper function to do the date overlap test:
static bool TimesOverlap(DateTimeOffset startTime1, TimeSpan duration1, DateTimeOffset startTime2, TimeSpan duration2) { DateTimeOffset end1 = startTime1 + duration1; DateTimeOffset end2 = startTime2 + duration2; return (startTime1 < startTime2) ? (end1 > startTime2) : (startTime1 < end2); }
Then we can use this to see if any events overlap with the proposed time for a new entry:
DateTimeOffset newEventStart = new DateTimeOffset(2009, 7, 20, 19, 45, 00, TimeSpan.Zero); TimeSpan newEventDuration = TimeSpan.FromHours(5); bool overlaps = events.Any( ev => TimesOverlap(ev.StartTime, ev.Duration, newEventStart, newEventDuration));
The Any
operator looks to
see if there is at least one item for which the condition is true, and
it returns true
if it finds one and
false
if it gets to the end of the
collection without having found a single item that meets the condition.
So if overlaps
ends up false here, we
know that events
didn’t contain any
items whose time overlapped with the proposed new event time.
There’s also an All
operator that
returns true
only if all of the items
meet the condition. We could also have used this for our overlap
test—we’d just need to invert the sense of the test:
bool noOverlaps = events.All( ev => !TimesOverlap(ev.StartTime, ev.Duration, newEventStart, newEventDuration));
The All
operator returns
true
if you apply it to an empty
sequence. This surprises some people, but it’s difficult to say what
the right behavior is—what does
it mean to ask if some fact is true about all the elements if there
are no elements? This operator’s definition takes the view that it
returns false
if and only if at
least one element does not meet the condition. And while there is some
logic to that, you would probably feel misled if a company told you
“All our customers think our widgets are the best they’ve ever seen”
but neglected to mention that it has no customers.
There’s an overload of the Any
operator that doesn’t take a condition. You can use this to ask the
question: is there anything in this sequence? For example:
bool doIHaveToGetOutOfBedToday = eventsForToday.Any();
The Any
and All
operators are technically known as
quantifiers. More specifically,
they are sometimes referred to as the existential quantifier and the
universal quantifier, respectively.
You may also have come across the common mathematical notation for
these.
The existential quantifier is written as a backward E (∃), and is conventionally pronounced “there exists.”
This corresponds to the Any
operator—it’s true if at least one item
exists in the set that meets the condition.
The universal quantifier is written as an upside down A (∀), and is conventionally pronounced “for all.” It
corresponds to the All
operator,
and is true if all the elements in some set meet the condition. The
convention that the universal quantifier is true for any empty set
(i.e., that All
returns true
when you give it no elements,
regardless of the condition) has a splendid mathematical name: it is
called a vacuous truth.
Quantifiers are special cases of a more general operation called aggregation—aggregation operators perform calculations across all the elements in a set. The quantifiers are singled out as special cases because they have the useful property that the calculation can often terminate early: if you’re testing to see whether something is true about all the elements in the set, and you find an element for which it’s not true, you can stop right there. But for most whole-set operations that’s not true, so there are some more general-purpose aggregation operators.
Aggregation operators perform calculations that involve every
single element in a collection, producing a single value as the result.
This can be as simple as counting the number of elements—this involves
all the elements in the sense that you need to know how many elements
exist to get the correct count. And if you’re dealing with an IEnumerable<T>
, it
is usually necessary to iterate through the whole collection because in
general, enumerable sources don’t know how many items they contain in
advance. So the Count
operator
iterates through the entire collection, and returns the number of
elements it found.
LINQ to Objects has optimizations for some special cases. It
looks for an implementation of a standard ICollection<T>
interface, which defines a Count
property. (This
is distinct from the Count
operator, which, like all LINQ operators, is a method, not a
property.) Collections such as arrays and List<T>
that know how many items they
contain implement this interface. So the Count
operator may be able to avoid having
to enumerate the whole collection by using the Count
property. And more generally, the
nature of the Count
operator
depends on the source—database
LINQ providers can arrange for the database to calculate the correct
value for Count
, avoiding the need
to churn through an entire table just to count rows. But in cases
where there’s no way of knowing the count up front, such as the file
enumeration in Example 8-1, Count
can take a long time to
complete.
LINQ defines some specialized aggregation operators for numeric
values. The Sum
operator returns the
sum of the values of a given expression for all items in a collection.
For example, if you want to find out how many hours of meetings you have
in a collection of events, you could do this:
double totalHours = events.Sum(ev => ev.Duration.TotalHours);
Average
calculates the same
sum, but then divides the result by the number of items, returning the
mean value. Min
and Max
return the lowest and highest of the
values calculated by the expression.
There’s also a general-purpose aggregation operator called
Aggregate
. This lets you perform any
operation that builds up some value by performing some calculation on
each item in turn. In fact, Aggregate
is all you really need—the other aggregation operators are simply more
convenient.[19] For instance, Example 8-26 shows how to implement
Count
using Aggregate
.
The first argument here is a seed value—it’s
the starting point for the value that will be built up as the
aggregation runs. In this case, we’re building up a count, so we start
at 0. You can use any value of any type here—Aggregate
is a generic method that lets you
use whatever type you like.
The second argument is a delegate that will be invoked once for each item. It will be passed the current aggregated value (initially the seed value) and the current item. And then whatever this delegate returns becomes the new aggregated value, and will be passed in as the first argument when that delegate is called for the next item, and so on. So in this example, the aggregated value starts off at 0, and then we add 1 each time around. The final result is therefore the number of items.
Example 8-26 doesn’t look
at the individual items—it just counts them. If we wanted to implement
Sum
, we’d need to add a value from
the source item to the running total instead of just adding 1:
double hours = events.Aggregate(0.0, (total, ev) => total + ev.Duration.TotalHours);
Calculating an average is a little more involved—we need to
maintain both a running total and the count of the number of elements
we’ve seen, which we can do by using an anonymous type as the
aggregation value. And then we can use an overload of Aggregate
that lets us provide a separate
delegate to be used to determine the final value—that gives us the opportunity to
divide the total by the count:
double averageHours = events.Aggregate( new { TotalHours = 0.0, Count = 0 }, (agg, ev) => new { TotalHours = agg.TotalHours + ev.Duration.TotalHours, Count = agg.Count + 1 }, (agg) => agg.TotalHours / agg.Count);
Obviously, it’s easier to use the specialized Count
, Sum
,
and Average
operators, but this
illustrates the flexibility of Aggregate
.
While LINQ calls this mechanism Aggregate
, it is often known by other names.
In functional programming languages, it’s sometimes called
fold or reduce. The latter
name in particular has become slightly better known in recent years
thanks to Google’s much-publicized use of a programming system called
map/reduce. (LINQ’s name for map is Select
, incidentally.) LINQ’s names weren’t
chosen to be different for the sake of it—they are more consistent
with these concepts’ names in database query languages. Most
professional developers are currently likely to have rather more
experience with SQL than, say, Haskell or LISP.
LINQ provides operators for some common set-based
operations. If you have two collections, and you want to discover all
the elements that are present in both collections, you can use the
Intersect
operator:
var inBoth = set1.Intersect(set2);
It also offers a Union
operator, which provides all the elements from both input sets, but when
it comes to the second set it will skip any elements that were already
returned because they were also in the first set. So you could think of
this as being like Concat
, except it
detects and removes duplicates. In a similar vein, there’s the
Distinct
operator—this works on a single
collection, rather than a pair of collections. Distinct
ensures that it returns any
given element only once, so if your input collection happens to contain
duplicate entries, Distinct
will skip
over those.
Finally, the Except
operator
returns only those elements from the first set that do not also appear
in the second set.
LINQ supports joining of sources, in the sense typically associated with databases—given two sets of items, you can form a new set by combining the items from each set that have the same value for some attribute. This is a feature that tends not to get a lot of use when working with object models—relationships between objects are usually represented with references exposed via properties, so there’s not much need for joins. But joins can become much more important if you’re using LINQ with data from a relational database. (Although the Entity Framework, which we describe in a later chapter, is often able to represent relationships between tables as object references. It’ll use joins at the database level under the covers, but you may not need to use them explicitly in LINQ all that often.)
Even though joins are typically most useful when working with data structured for storage in a relational database, you can still perform joins across objects—it’s possible with LINQ to Objects even if it’s not all that common.
In our hypothetical calendar application, imagine that you want to add a feature where you can reconcile events in the user’s local calendar with events retrieved from his phone’s calendar, and you need to try to work out which of the imported events from the phone correspond to items already in the calendar. You might find that the only way to do this is to look for events with the same name that occur at the same time, in which case you might be able to use a join to build up a list of events from the two sources that are logically the same events:
var pairs = from localEvent in events join phoneEvent in phoneEvents on new { Title = localEvent.Title, Start = localEvent.StartTime } equals new { Title = phoneEvent.Name, Start = phoneEvent.Time } select new { Local = localEvent, Phone = phoneEvent };
A LINQ join expects to be able to compare just a single object in
order to determine whether two items should be joined. But we want to
join items only when both the title and the time match. So this example
builds an anonymously typed object to hold both values in order to be
able to provide LINQ with the single object it expects. (You can use
this technique for the grouping operators too, incidentally.) Note that
this example also illustrates how you would deal with the relevant
properties having different names. You can imagine that the imported
phone events might use different property names because you might need
to use some third-party import library, so this example shows how the
code would look if it called the relevant properties Name
and Time
instead of Title
and StartTime
. We fix this by mapping the
properties from the two sources into anonymous types that have the same
structure.
Sometimes it’s necessary to convert the results of a LINQ
query into a specific collection type. For example, you might have code
that expects an array or a List<T>
. You can still use LINQ queries
when creating these kinds of collections, thanks to the standard
ToArray
and ToList
operators. Example 8-17 used ToArray
to convert a grouping into an array of
objects. We could extend that further to convert the query into an array
of arrays, just like the original example from Chapter 7:
var eventsByDay = from ev in events group ev by ev.StartTime.Date into dayGroup select dayGroup.ToArray(); CalendarEvent[][] arrayOfEventsByDay = eventsByDay.ToArray();
In this example, eventsByDay
is
of type IEnumerable<CalendarEvent[]>
. The final
line then turns the enumeration into an array of arrays—a CalendarEvent[][]
.
Remember that LINQ queries typically use deferred execution—they
don’t start doing any work until you start asking them for elements. But
by calling ToList
or ToArray
, you will fully execute the query,
because it builds the entire list or array in one go.
As well as providing conversion operators for getting data out of
LINQ and into other data types, there are some operators for getting
data into LINQ’s world. Sometimes you will come across types that
provide only the old .NET 1.x-style nongeneric IEnumerable
interface. This is
problematic for LINQ because there’s no way for it to know what kinds of
objects it will find. You might happen to know that a collection will
always contain CalendarEvent
objects,
but this would be invisible to LINQ if you are working with a library
that uses old-style collections. So to work around this, LINQ defines a
Cast
operator—you can use this to
tell LINQ what sort of items you believe are in the collection:
IEnumerable oldEnum = GetCollectionFromSomewhere(); var items = from ev in oldEnum.Cast<CalendarEvent>() orderby ev.StartTime select ev;
As you would expect, this will throw an InvalidCastException
if it discovers any
elements in the collection that are not of the type you said. But be
aware that like most LINQ operators, Cast
uses deferred execution—it casts the
elements one at a time as they are requested, so any mismatch will not
be discovered at the point at which you call Cast
. The exception will be thrown at the
point at which you reach the first nonmatching item while enumerating
the query.
LINQ provides a convenient syntax for performing common operations on collections of data. The query expression syntax is reminiscent of database query languages, and can be used in conjunction with databases, as later chapters will show. But these queries are frequently used on objects in memory. The compiler transforms the query syntax into a series of method calls, meaning that the choice of LINQ implementation is determined by context—you can write your own custom LINQ provider, or use a built-in provider such as LINQ to Objects, LINQ to SQL, or LINQ to XML.
All providers implement standard operators—methods with well-known names and signatures that implement various common query features. The features include filtering, sorting, grouping, and the ability to transform data through a projection. You can also perform test and aggregation operations across entire sets. Queries can be composed—most operators’ output can be used as input to other operators. LINQ uses a functional style to maximize the flexibility of composition.
[17] This is even less useful than it sounds. If the string in question contains characters that are required to be used in strict sequence, such as combining characters or surrogates, naively reversing the character order will have peculiar results. But the point here is to illustrate how to add new methods to an existing type, not to explain why it’s surprisingly difficult to reverse a Unicode string.
[18] This is very similar to IComparable<T>
, introduced in the
preceding chapter. But while objects that implement IComparable<T>
can themselves be
compared with other objects of type T
, an IComparer<T>
compares two objects of
type T
—the objects being compared
are separate from the comparer.
[19] That’s true for LINQ to Objects. However, database LINQ
providers may implement Sum
,
Average
, and so on using
corresponding database query features. They might not be able to do
this optimization if you use the general-purpose Aggregate
operator.