Chapter 11. Files and Streams

Almost all programmers have to deal with storing, retrieving, and processing information in files at some time or another. The .NET Framework provides a number of classes and methods we can use to find, create, read, and write files and directories. In this chapter we'll look at some of the most common.

Files, though, are just one example of a broader group of entities that can be opened, read from, and/or written to in a sequential fashion, and then closed. .NET defines a common contract, called a stream, that is offered by all types that can be used in this way. We’ll see how and why we might access a file through a stream, and then we’ll look at some other types of streams, including a special storage medium called isolated storage which lets us save and load information even when we are in a lower-trust environment (such as the Silverlight sandbox). Finally, we’ll look at some of the other stream implementations in .NET by way of comparison. (Streams crop up in all sorts of places, so this chapter won’t be the last we see of them—they’re important in networking, for example.)

Inspecting Directories and Files

We, the authors of this book, have often heard our colleagues ask for a program to help them find duplicate files on their system. Let’s write something to do exactly that. We’ll pass the names of the directories we want to search on the command line, along with an optional switch to determine whether we want to recurse into subdirectories or not. In the first instance, we’ll do a very basic check for similarity based on filenames and sizes, as these are relatively cheap options. Example 11-1 shows our Main function.

Example 11-1. Main method of duplicate file finder

static void Main(string[] args)
{
    bool recurseIntoSubdirectories = false;

    if (args.Length < 1)
    {
        ShowUsage();
        return;
    }

    int firstDirectoryIndex = 0;

    if (args.Length > 1)
    {
        // see if we're being asked to recurse
        if (args[0] == "/sub")
        {
            if (args.Length < 2)
            {
                ShowUsage();
                return;
            }
            recurseIntoSubdirectories = true;
            firstDirectoryIndex = 1;
        }
    }

    // Get list of directories from command line.
    var directoriesToSearch = args.Skip(firstDirectoryIndex);

    List<FileNameGroup> filesGroupedByName =
        InspectDirectories(recurseIntoSubdirectories, directoriesToSearch);

    DisplayMatches(filesGroupedByName);

    Console.ReadKey();
}

The basic structure is pretty straightforward. First we inspect the command-line arguments to work out which directories we’re searching. Then we call InspectDirectories (shown later) to build a list of all the files in those directories. This groups the files by filename (without the full path) because we do not consider two files to be duplicates if they have different names. Finally, we pass this list to DisplayMatches, which displays any potential matches in the files we have found. DisplayMatches refines our test for duplicates further—it considers two files with the same name to be duplicates only if they have the same size. (That’s not foolproof, of course, but it’s surprisingly effective, and we will refine it further later in the chapter.)

Let’s look at each of these steps in more detail.

The code that parses the command-line arguments does a quick check to see that we’ve provided at least one command-line argument (in addition to the /sub switch if present) and we print out some usage instructions if not, using the method shown in Example 11-2.

Example 11-2. Showing command line usage

private static void ShowUsage()
{
    Console.WriteLine("Find duplicate files");
    Console.WriteLine("====================");
    Console.WriteLine(
        "Looks for possible duplicate files in one or more directories");
    Console.WriteLine();
    Console.WriteLine(
        "Usage: findduplicatefiles [/sub] DirectoryName [DirectoryName] ...");
    Console.WriteLine("/sub - recurse into subdirectories");
    Console.ReadKey();
}

The next step is to build a list of files grouped by name. We define a couple of classes for this, shown in Example 11-3. We create a FileNameGroup object for each distinct filename. Each FileNameGroup contains a nested list of FileDetails, providing the full path of each file that has that name, and also the size of that file.

Example 11-3. Types used to keep track of the files we’ve found

class FileNameGroup
{
    public string FileNameWithoutPath { get; set; }
    public List<FileDetails> FilesWithThisName { get; set; }
}

class FileDetails
{
    public string FilePath { get; set; }
    public long FileSize { get; set; }
}

For example, suppose the program searches two folders, c:\One and c:\Two, and suppose both of those folders contain a file called Readme.txt. Our list will contain a FileNameGroup whose FileNameWithoutPath is Readme.txt. Its nested FilesWithThisName list will contain two FileDetails entries, one with a FilePath of c:\One\Readme.txt and the other with c:\Two\Readme.txt. (And each FileDetails will contain the size of the relevant file in FileSize. If these two files really are copies of the same file, their sizes will, of course, be the same.)

We build these lists in the InspectDirectories method, which is shown in Example 11-4. This contains the meat of the program, because this is where we search the specified directories for files. Quite a lot of the code is concerned with the logic of the program, but this is also where we start to use some of the file APIs.

Example 11-4. InspectDirectories method

private static List<FileNameGroup> InspectDirectories(
    bool recurseIntoSubdirectories,
    IEnumerable<string> directoriesToSearch)
{
    var searchOption = recurseIntoSubdirectories ?
        SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;

    // Get the path of every file in every directory we're searching.
    var allFilePaths = from directory in directoriesToSearch
                       from file in Directory.GetFiles(directory, "*.*",
                                                        searchOption)
                       select file;

    // Group the files by local filename (i.e. the filename without the
    // containing path), and for each filename, build a list containing the
    // details for every file that has that filename.
    var fileNameGroups = from filePath in allFilePaths
                         let fileNameWithoutPath = Path.GetFileName(filePath)
                         group filePath by fileNameWithoutPath into nameGroup
                         select new FileNameGroup
                         {
                             FileNameWithoutPath = nameGroup.Key,
                             FilesWithThisName =
                              (from filePath in nameGroup
                               let info = new FileInfo(filePath)
                               select new FileDetails
                               {
                                   FilePath = filePath,
                                   FileSize = info.Length
                               }).ToList()
                         };

    return fileNameGroups.ToList();
}

To get it to compile, you’ll need to add:

using System.IO;

The parts of Example 11-4 that use the System.IO namespace to work with files and directories have been highlighted. We’ll start by looking at the use of the Directory class.

Examining Directories

Our InspectDirectories method calls the static GetFiles method on the Directory class to find the files we’re interested in. Example 11-5 shows the relevant code.

Example 11-5. Getting the files in a directory

var searchOption = recurseIntoSubdirectories ?
    SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;

// Get the path of every file in every directory we're searching.
var allFilePaths = from directory in directoriesToSearch
                   from file in Directory.GetFiles(directory, "*.*",
                                                        searchOption)
                   select file;

The overload of GetFiles we’re calling takes the directory we’d like to search, a filter (in the standard command-line form), and a value from the SearchOption enumeration, which determines whether to recurse down through all the subfolders.

Note

We’re using LINQ to Objects to build a list of all the files we require. As you saw in Chapter 8, a query with multiple from clauses works in a similar way to nested foreach loops. The code in Example 11-5 will end up calling GetFiles for each directory passed on the command line, and it will effectively concatenate the results of all those calls into a single list of files.

The GetFiles method returns the full path for each file concerned, but when it comes to finding matches, we just want the filename. We can use the Path class to get the filename from the full path.

Manipulating File Paths

The Path class provides methods for manipulating strings containing file paths. Imagine we have the path c:\directory1\directory2\MyFile.txt. Table 11-1 shows you how you can slice that with various different Path methods.

Table 11-1. The effect of various Path methods

Method name                    Result
GetDirectoryName               c:\directory1\directory2
GetExtension                   .txt (note the leading ".")
GetFileName                    MyFile.txt
GetFileNameWithoutExtension    MyFile
GetFullPath                    c:\directory1\directory2\MyFile.txt
GetPathRoot                    c:\

What if we use a network path? Table 11-2 shows the results of the same methods when applied to this path:

\\MyPC\Share1\directory2\MyFile.txt

Table 11-2. The effect of various Path methods with a network path

Method name                    Result
GetDirectoryName               \\MyPC\Share1\directory2
GetExtension                   .txt
GetFileName                    MyFile.txt
GetFileNameWithoutExtension    MyFile
GetFullPath                    \\MyPC\Share1\directory2\MyFile.txt
GetPathRoot                    \\MyPC\Share1

Notice how the path root includes the network hostname and the share name.

What happens if we don’t use a full path, but one relative to the current directory? And what’s the current directory anyway?

Path and the Current Working Directory

The framework maintains a process-wide idea of the current working directory, which is the root path relative to which any file operations that do not fully qualify the path are made. The Directory class (as you might imagine) gives us the ability to manipulate it. Rather than a static property, there are two static methods to query and set the current value: GetCurrentDirectory and SetCurrentDirectory. Example 11-6 shows a call to the latter.

Example 11-6. Setting the current directory

Directory.SetCurrentDirectory(@"c:\");
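
GetCurrentDirectory reads the value back. As a minimal sketch, assuming the call in Example 11-6 has just run:

// Returns the process-wide current working directory.
string current = Directory.GetCurrentDirectory();
Console.WriteLine(current);   // Prints: c:\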

Table 11-3 shows the results we'd get if we passed @"directory2\MyFile.txt" to the various Path methods after having run the code in Example 11-6. As you can see, most of the results reflect the fact that we've not provided a full path, but there's one exception: GetFullPath uses the current working directory if we provide it with a relative path.

Table 11-3. The effect of various Path methods with a relative path

Method name                    Result
GetDirectoryName               directory2
GetExtension                   .txt
GetFileName                    MyFile.txt
GetFileNameWithoutExtension    MyFile
GetFullPath                    c:\directory2\MyFile.txt
GetPathRoot                    <blank>

Warning

Path doesn’t check that the named file exists. It only looks at the input string and, in the case of GetFullPath, the current working directory.
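
We can see that purely string-based behavior in a quick sketch (the paths here are hypothetical, and no file needs to exist):

Directory.SetCurrentDirectory(@"c:\");

// These calls never touch the filesystem - they just examine strings.
Console.WriteLine(Path.GetFullPath(@"directory2\MyFile.txt"));
// Prints: c:\directory2\MyFile.txt

Console.WriteLine(Path.GetPathRoot(@"directory2\MyFile.txt"));
// Prints an empty string, because a relative path has no root.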

OK, in our example, we just want the filename without the path, so we use Path.GetFileName to retrieve it. Example 11-7 shows the relevant piece of Example 11-4.

Example 11-7. Getting the filename without the full path

var fileNameGroups = from filePath in allFilePaths
                     let fileNameWithoutPath = Path.GetFileName(filePath)
                     group filePath by fileNameWithoutPath into nameGroup
                     select ...

We then use the LINQ group operator (which was described in Chapter 8) to group all of the files by name.

Path contains a lot of other useful members that we’ll need a little bit later; but we can leave it for the time being, and move on to the other piece of information that we need for our matching code: the file size. The .NET Framework provides us with a class called FileInfo that contains a whole bunch of members that help us to discover things about a file.

Examining File Information

The various functions from the System.IO classes we’ve dealt with so far have all been static, but when it comes to retrieving information such as file size, we have to create an instance of a FileInfo object, passing its constructor the path of the file we’re interested in. That path can be either an absolute path like the ones we’ve seen already, or a path relative to the current working directory. FileInfo has a lot of overlapping functionality with other classes. For example, it provides a few helpers similar to Path to get details of the directory, filename, and extension.

However, the only method we’re really interested in for our example is its Length property, which tells us the size of the file. Every other member on FileInfo has a functional equivalent on other classes in the framework. Even Length is duplicated on the stream classes we’ll come to later, but it is simpler for us to use FileInfo if we don’t intend to open the file itself.
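
To give a flavor of that overlap, here's a sketch (example.txt is a hypothetical file assumed to exist in the current working directory):

FileInfo info = new FileInfo("example.txt");
Console.WriteLine(info.Length);           // The size we need, in bytes
Console.WriteLine(info.DirectoryName);    // Cf. Path.GetDirectoryName
Console.WriteLine(info.Extension);        // Cf. Path.GetExtension
Console.WriteLine(info.LastWriteTimeUtc); // Cf. File.GetLastWriteTimeUtc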

We use FileInfo in the final part of InspectDirectories, to put the file size into the per-file details. Example 11-8 shows the relevant excerpt from Example 11-4.

Example 11-8. Getting the file size

...
select new FileNameGroup
{
    FileNameWithoutPath = nameGroup.Key,
    FilesWithThisName =
     (from filePath in nameGroup
      let info = new FileInfo(filePath)
      select new FileDetails
      {
          FilePath = filePath,
          FileSize = info.Length
      }).ToList()
};

We’re now only one method short of a sort-of-useful program, and that’s the one that trawls through this information to find and display matches: DisplayMatches, which is shown in Example 11-9.

Example 11-9. DisplayMatches

private static void DisplayMatches(
    IEnumerable<FileNameGroup> filesGroupedByName)
{
    var groupsWithMoreThanOneFile = from nameGroup in filesGroupedByName
                                    where nameGroup.FilesWithThisName.Count > 1
                                    select nameGroup;

    foreach (var fileNameGroup in groupsWithMoreThanOneFile)
    {
        // Group the matches by the file size, then select those
        // with more than 1 file of that size.
        var matchesBySize = from file in fileNameGroup.FilesWithThisName
                            group file by file.FileSize into sizeGroup
                            where sizeGroup.Count() > 1
                            select sizeGroup;

        foreach (var matchedBySize in matchesBySize)
        {
            string fileNameAndSize = string.Format("{0} ({1} bytes)",
                fileNameGroup.FileNameWithoutPath, matchedBySize.Key);
            WriteWithUnderlines(fileNameAndSize);
            // Show each of the directories containing this file
            foreach (var file in matchedBySize)
            {
                Console.WriteLine(Path.GetDirectoryName(file.FilePath));
            }
            Console.WriteLine();
        }
    }
}

private static void WriteWithUnderlines(string text)
{
    Console.WriteLine(text);
    Console.WriteLine(new string('-', text.Length));
}

We start with a LINQ query that looks for the filenames that crop up in more than one folder, because those are the only candidates for being duplicates. We iterate through each such name with a foreach loop. Inside that loop, we run another LINQ query that groups the files of that name by size—see the first emphasized lines in Example 11-9. If InspectDirectories discovered three files called Program.cs, for example, and two of them were 278 bytes long while the other was 894 bytes long, this group clause would separate those three files into two groups. The where clause in the same query removes any groups that contain only one file.

So the matchesBySize variable refers to a query that returns a group for each set of two or more files that have the same size (and because we’re inside a loop that iterates through the names, we already know they have the same name). Those are our duplicate candidates. We then write out the filename and size (and an underline separator of the same length). Finally, we write out each file location containing candidate matches using Path.GetDirectoryName.

If we compile and run that lot, we’ll see the following output:

Find duplicate files
====================
Looks for possible duplicate files in one or more directories

Usage: findduplicatefiles [/sub] DirectoryName [DirectoryName] ...
/sub - recurse into subdirectories

We haven’t given it anywhere to look! How are we going to test our application? Well, we could provide it with some command-line parameters. If you open the project properties and switch to the Debug tab, you’ll see a place where you can add command-line arguments (see Figure 11-1).

Figure 11-1. Setting command-line arguments

However, we could do a bit better for test purposes. Example 11-10 shows a modified Main that supports a new /test command-line switch, which we can use to create test files and exercise the function.

Example 11-10. Adding a /test switch

static void Main(string[] args)
{
    bool recurseIntoSubdirectories = false;

    if (args.Length < 1)
    {
        ShowUsage();
        return;
    }

    int firstDirectoryIndex = 0;
    IEnumerable<string> directoriesToSearch = null;
    bool testDirectoriesMade = false;

    try
    {
        // Check to see if we are running in test mode
        if (args.Length == 1 && args[0] == "/test")
        {
            directoriesToSearch = MakeTestDirectories();
            testDirectoriesMade = true;
            recurseIntoSubdirectories = true;
        }
        else
        {
            if (args.Length > 1)
            {
                // see if we're being asked to recurse
                if (args[0] == "/sub")
                {
                    if (args.Length < 2)
                    {
                        ShowUsage();
                        return;
                    }
                    recurseIntoSubdirectories = true;
                    firstDirectoryIndex = 1;
                }
            }

            // Get list of directories from command line.
            directoriesToSearch = args.Skip(firstDirectoryIndex);
        }

        List<FileNameGroup> filesGroupedByName =
            InspectDirectories(recurseIntoSubdirectories, directoriesToSearch);

        DisplayMatches(filesGroupedByName);
        Console.ReadKey();
    }
    finally
    {
        if (testDirectoriesMade)
        {
            CleanupTestDirectories(directoriesToSearch);
        }
    }

}

In order to operate in test mode, we’ve added an alternative way to initialize the variable that holds the list of directories (directoriesToSearch). The original code, which initializes it from the command-line arguments (skipping over the /sub switch if present), is still present. However, if we find the /test switch, we initialize it to point at some test directories we’re going to create (in the MakeTestDirectories method). The rest of the code can then be left as it was (to avoid running some completely different program in our test mode). Finally, we add a bit of cleanup code at the end to remove any test directories if we created them.

So, how are we going to implement MakeTestDirectories? We want to create some temporary files, and write some content into them to exercise the various matching possibilities.

Creating Temporary Files

A quick look at Path reveals the GetTempFileName method. This creates a file of zero length in a directory dedicated to temporary files, and returns the path to that file.

Note

It is important to note that the file is actually created, whether you use it or not, and so you are responsible for cleaning it up when you are done, even if you don’t make any further use of it.

Let’s create another test console application, just to try out that method. We can do that by adding the following to our main function:

string fileName = Path.GetTempFileName();
// Display the filename
Console.WriteLine(fileName);
// And wait for some input
Console.ReadKey();

But wait! If we just compile and run that, we’ll leave the file we created behind on the system. We should make sure we delete it again when we’re done. There’s nothing special about a temporary file. We create it in an unusual way, and it ends up in a particular place, but once it has been created, it’s just like any other file in the filesystem. So, we can delete it the same way we’d delete any other file.

Deleting Files

The System.IO namespace provides the File class, which offers various methods for doing things with files. Deleting is particularly simple: we just use the static Delete method, as Example 11-11 shows.

Example 11-11. Deleting a file

string fileName = Path.GetTempFileName();
try
{
    // Use the file
    // ...
    // Display the filename
    Console.WriteLine(fileName);
    // And wait for some input
    Console.ReadKey();
}
finally
{
    // Then clean it up
    File.Delete(fileName);
}

Notice that we’ve wrapped the code in which we (could) manipulate the file further in a try block, and deleted it in a finally block. This ensures that whatever happens, we’ll always attempt to clean up after ourselves.

If you compile and run this test project now, you’ll see some output like this:

C:\Users\yourusername\AppData\Local\Temp\tmpCA8F.tmp

The exact text will depend on your operating system version, your username, and (of course) the random filename that was created for you. If you browse to that path, you will see a zero-length file of that name.

If you then press a key, allowing Console.ReadKey to return, it will drop through to the finally block, where we delete the temporary file, using the static Delete method on the File class.

There are lots of scenarios where this sort of temporary file creation is just fine, but it doesn’t really suit our example application’s needs. We want to create multiple temporary files, in multiple different directories. GetTempFileName doesn’t really do the job for us.

If we look at Path again, though, there’s another likely looking method: GetRandomFileName. This returns a random string of characters that can be used as either a file or a directory name. It uses a cryptographically strong random number generator (which can be useful in some security-conscious scenarios), and is statistically likely to produce a unique name, thus avoiding clashes. Unlike GetTempFileName it doesn’t actually create the file (or directory); that’s up to us.

If you run the code in Example 11-12:

Example 11-12. Showing a random filename

Console.WriteLine(Path.GetRandomFileName());

you’ll see output similar to this:

xnicz3rs.juc

(Obviously, the actual characters you see will, hopefully, be different, or the statistical uniqueness isn’t all that unique!)

So, we can use that method to produce our test file and directory names. But where are we going to put the files? Perhaps one of the various “well-known folders” Windows offers would suit our needs.

Well-Known Folders

Most operating systems have a bunch of well-known filesystem locations, and Windows is no exception. There are designated folders for things like the current user’s documents, pictures, or desktop; the program files directory where applications are installed; and the system folder.

The .NET Framework provides a class called Environment that provides information about the world our program runs in. Its static method GetFolderPath is the one that interests us right now, because it will return the path of various well-known folders. We pass it one of the Environment.SpecialFolder enumeration values. Example 11-13 retrieves the location of one of the folders in which applications can store per-user data.

Example 11-13. Getting a well-known folder location

string path = Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData);

Table 11-4 lists all of the well-known folders that GetFolderPath can return, and the location they give on the installed copy of Windows 7 (64-bit) belonging to one of the authors.

Table 11-4. Special folders

ApplicationData
  Example location: C:\Users\mwa\AppData\Roaming
  Purpose: A place for applications to store their own private information for a particular user; this may be located on a shared server, and available across multiple logins for the same user, on different machines, if the user's domain policy is configured to do so.

CommonApplicationData
  Example location: C:\ProgramData
  Purpose: A place for applications to store their own private information accessible to all users.

CommonProgramFiles
  Example location: C:\Program Files\Common Files
  Purpose: A place where shared application components can be installed.

Cookies
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Cookies
  Purpose: The location where Internet cookies are stored for this user; another potentially roaming location.

Desktop
  Example location: C:\Users\mwa\Desktop
  Purpose: The current user's desktop (virtual) folder.

DesktopDirectory
  Example location: C:\Users\mwa\Desktop
  Purpose: The physical directory where filesystem objects on the desktop are stored (currently, but not necessarily, the same as Desktop).

Favorites
  Example location: C:\Users\mwa\Favorites
  Purpose: The directory containing the current user's favorites links.

History
  Example location: C:\Users\mwa\AppData\Local\Microsoft\Windows\History
  Purpose: The directory containing the current user's Internet history.

InternetCache
  Example location: C:\Users\mwa\AppData\Local\Microsoft\Windows\Temporary Internet Files
  Purpose: The directory that contains the current user's Internet cache.

LocalApplicationData
  Example location: C:\Users\mwa\AppData\Local
  Purpose: A place for applications to store their private data associated with the current user. This is guaranteed to be on the local machine (as opposed to ApplicationData, which may roam with the user).

MyComputer
  Example location: <blank>
  Purpose: This is always an empty string because there is no real folder that corresponds to My Computer.

MyDocuments
  Example location: C:\Users\mwa\Documents
  Purpose: The folder in which the current user's documents (as opposed to private application datafiles) are stored.

MyMusic
  Example location: C:\Users\mwa\Music
  Purpose: The folder in which the current user's music files are stored.

MyPictures
  Example location: C:\Users\mwa\Pictures
  Purpose: The folder in which the current user's picture files are stored.

Personal
  Example location: C:\Users\mwa\Documents
  Purpose: The folder in which the current user's documents are stored (synonymous with MyDocuments).

ProgramFiles
  Example location: C:\Program Files
  Purpose: The directory in which applications are installed. Note that there is no special folder enumeration for the 32-bit applications directory on 64-bit Windows.

Programs
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Start Menu\Programs
  Purpose: The location where application shortcuts in the Start menu's Programs section are stored for the current user. This is another potentially roaming location.

Recent
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Recent
  Purpose: The folder where links to recently used documents are stored for the current user. This is another potentially roaming location.

SendTo
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\SendTo
  Purpose: The location that contains the links that form the Send To menu items in the shell. This is another potentially roaming location.

StartMenu
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Start Menu
  Purpose: The folder that contains the Start menu items for the current user. This is another potentially roaming location.

Startup
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup
  Purpose: The folder that contains links to programs that will run each time the current user logs in. This is another potentially roaming location.

System
  Example location: C:\Windows\system32
  Purpose: The Windows system folder.

Templates
  Example location: C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Templates
  Purpose: A location in which applications can store document templates for the current user. Again, this is a potentially roaming location.

Note

Notice that this doesn’t include all of the well-known folders we have these days, because the set of folders grows with each new version of Windows. Things like Videos, Games, Downloads, Searches, and Contacts are all missing. It also doesn’t support Windows 7 libraries in any meaningful sense. This is (sort of) by design. The method provides a lowest common denominator approach to finding useful folders on the system, in a way that works across all supported versions of the framework (including Windows Mobile).

So, we need to choose a path in which our current user is likely to have permission to create/read/write and delete files and directories. It doesn’t have to be one that the user can see under normal circumstances. In fact, we’re going to create files with extensions that are not bound to any applications and we should not do that in a place that’s visible to the user if we want our application to be a good Windows citizen.

Note

If you create a file in a place that’s visible to the user, like Documents or Desktop, you should ensure that it always has a default application associated with it.

There are two candidates for this in Table 11-4: LocalApplicationData and ApplicationData. Both of these offer places for applications to store files that the user wouldn’t normally see. (Of course, users can find these folders if they look hard enough. The goal here is to avoid putting our temporary test files in the same folders as the user’s documents.)

The difference between these two folders is that if the user has a roaming profile, files in the latter folder will be copied around the network as they move from one machine to another, while files in the former folder remain on the machine on which they were created. We’re building temporary files for test purposes, so LocalApplicationData looks like the right choice.

So, let’s return to our demo application, and start to implement the MakeTestDirectories method. The first thing we need to do is to create a few test directories. Example 11-14 contains some code to do that.

Example 11-14. Creating test directories

private static string[] MakeTestDirectories()
{
    string localApplicationData = Path.Combine(
        Environment.GetFolderPath(
            Environment.SpecialFolder.LocalApplicationData),
        @"Programming CSharpFindDuplicates");

    // Let's make three test directories
    var directories = new string[3];
    for (int i = 0; i < directories.Length; ++i)
    {
        string directory = Path.GetRandomFileName();
        // Combine the local application data with the
        // new random file/directory name
        string fullPath = Path.Combine(localApplicationData, directory);
        // And create the directory
        Directory.CreateDirectory(fullPath);
        directories[i] = fullPath;
        Console.WriteLine(fullPath);
    }
    return directories;
}

First, we use the GetFolderPath method to get the LocalApplicationData path. But we don’t want to work directly in that folder—applications are meant to create their own folders underneath this. Normally you’d create a folder named either for your company or for your organization, and then an application-specific folder inside that—we’ve used Programming CSharp as the organization name here, and FindDuplicates as the application name. We then use a for loop to create three directories with random names inside that. To create these new directories, we’ve used a couple of new methods: Path.Combine and Directory.CreateDirectory.

Concatenating Path Elements Safely

If you’ve written any code that manipulates paths before, you’ll have come across the leading/trailing slash dilemma. Does your path fragment have one or not? You also need to know whether the path fragment you’re going to append really is a relative path—are there circumstances under which you might need to deal with a fully qualified path instead? Path.Combine does away with all that anxiety. Not only will it check all those things for you and do the right thing, but it will even check that your paths contain only valid path characters.

Table 11-5 contains some example paths, and the result of combining them with Path.Combine.

Table 11-5. Example results of Path.Combine

Path 1            Path 2       Combined
C:\hello          world        C:\hello\world
C:\hello\         world        C:\hello\world
C:\hello          world\       C:\hello\world\
hello             world        hello\world
C:\hello          world.exe    C:\hello\world.exe
\\mybox\hello     world        \\mybox\hello\world
world             C:\hello     C:\hello

The last entry in that table is particularly interesting: notice that the second path is absolute, and so the combined path is “optimized” to just that second path.
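
Here's the same behavior as a sketch in code (like the rest of Path, Combine works purely on strings):

Console.WriteLine(Path.Combine(@"C:\hello", "world"));
// Prints: C:\hello\world

Console.WriteLine(Path.Combine(@"C:\hello\", "world"));
// Prints: C:\hello\world - the separator is not doubled up

Console.WriteLine(Path.Combine("world", @"C:\hello"));
// Prints: C:\hello - the second path is absolute, so it wins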

In our case, Example 11-14 combines the well-known folder with a subfolder name to get a folder location specific to this example. And then it combines that with our new temporary folder names, ready for creation.

Creating and Securing Directory Hierarchies

Directory.CreateDirectory is very straightforward: it does exactly what its name suggests. In fact, it will create any directories in the whole path that do not already exist, so you can create a deep hierarchy with a single call. (You'll notice that Example 11-14 didn't bother to create the Programming CSharp\FindDuplicates folders—those will get created automatically the first time we run, as a result of creating the temporary folders inside them.) A side effect of this is that it is safe to call it if all of the directories in the path already exist—it will just do nothing.
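
So a single call like the following sketch (with a made-up path) creates the whole chain, and repeating it is harmless:

// Creates Temp, Parent, Child, and Grandchild as necessary.
Directory.CreateDirectory(@"C:\Temp\Parent\Child\Grandchild");

// Calling it again does nothing - no exception is thrown.
Directory.CreateDirectory(@"C:\Temp\Parent\Child\Grandchild");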

In addition to the overload we’ve used, there’s a second which also takes a DirectorySecurity parameter:

Directory.CreateDirectory(string path, DirectorySecurity directorySecurity)

The DirectorySecurity class allows you to specify filesystem access controls with a relatively simple programming model. If you’ve tried using the Win32 ACL APIs, you’ll know that it is a nightmare of GUIDs, SSIDs, and lists sensitive to item ordering. This model does away with much of the complexity.

Let’s extend our create function to make sure that only our current user has read/write/modify permissions on these directories. Example 11-15 modifies the previous example by explicitly granting the current user full control of the newly created folders. The new or changed lines are highlighted.

Example 11-15. Configuring access control on new directories

private static string[] MakeTestDirectories()
{
    string localApplicationData = Path.Combine(
        Environment.GetFolderPath(
            Environment.SpecialFolder.LocalApplicationData),
        @"Programming CSharpFindDuplicates");

    // Get the name of the logged in user
    string userName = WindowsIdentity.GetCurrent().Name;
    // Make the access control rule
    FileSystemAccessRule fsarAllow =
        new FileSystemAccessRule(
            userName,
            FileSystemRights.FullControl,
            AccessControlType.Allow);
    DirectorySecurity ds = new DirectorySecurity();
    ds.AddAccessRule(fsarAllow);

    // Let's make three test directories
    var directories = new string[3];
    for (int i = 0; i < directories.Length; ++i)
    {
        string directory = Path.GetRandomFileName();
        // Combine the local application data with the
        // new random file/directory name
        string fullPath = Path.Combine(localApplicationData, directory);

        // And create the directory
        Directory.CreateDirectory(fullPath, ds);

        directories[i] = fullPath;
        Console.WriteLine(fullPath);
    }
    return directories;
}

You’ll need to add a couple of using directives to the top of the file before you can compile this code:

using System.Security.AccessControl;
using System.Security.Principal;

What do these changes do? First, we make use of a type called WindowsIdentity to find the current user, and fish out its name. If you happen to want to specify the name explicitly, rather than get the current user programmatically, you can do so (e.g., MYDOMAIN\SomeUserId).

Then, we create a FileSystemAccessRule, passing it the username, the FileSystemRights we want to set, and a value from the AccessControlType enumeration which determines whether we are allowing or denying those rights.

If you take a look at the FileSystemRights enumeration in MSDN, you should recognize the options from the Windows security permissions dialog in the shell. You can combine the individual values (as it is a Flags enumeration), or use one of the precanned sets as we have here.
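
For example, instead of the precanned FullControl set, we could combine individual flags. Here's a sketch, reusing the userName variable from Example 11-15:

// Allow reading and writing file data, but nothing more.
FileSystemRights rights = FileSystemRights.ReadData |
                          FileSystemRights.WriteData |
                          FileSystemRights.AppendData;

FileSystemAccessRule limitedRule = new FileSystemAccessRule(
    userName, rights, AccessControlType.Allow);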

If you compile this application, and modify the debug settings to pass just the /test switch as the only command-line argument, when you run it you’ll see output similar to the following (but with your user ID, and some different random directory names):

C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\yzw0iw3p.ysq
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\qke5k2ql.5et
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\5hkhspqa.osc

If we take a look at the folder in Explorer, you should see your new directories (something like Figure 11-2).

If you right-click on one of these and choose Properties, then examine the Security tab, you should see something like Figure 11-3.

Notice how the only user with permissions on this directory is the currently logged on user (in this case ian, on a domain called idg.interact). All of the usual inherited permissions have been overridden. Rather than the regular read/modify/write checkboxes, we’ve apparently got special permissions. This is because we set them explicitly in the code.

Figure 11-2. Newly created folders

Figure 11-3. Permissions on the new directory

We can have a look at that in more detail if we click the Advanced button, and switch to the Effective Permissions tab. Click the Select button to pick a user (see Figure 11-4). First, let’s look at the effective permissions for the local administrator (this is probably MachineNameAdministrator, unless you’ve changed your default administrator name to try to make things slightly harder for an attacker).

Figure 11-4. Selecting a user

If you click OK, you’ll see the effective permissions for Administrator on that folder (Figure 11-5).

You can scroll the scroll bar to prove it for yourself, but you can see that even Administrator cannot actually access your folder! (This is not, of course, strictly true. Administrators can take ownership of the folder and mess with the permissions themselves, but they cannot access the folder without changing the permissions first.) Try again with your own user ID. You will see results similar to Figure 11-6—we have full control. Scroll the list and you’ll see that everything is ticked.

What if we wanted “not quite” full control? Say we wanted to deny the ability to write extended attributes to the file. Well, we can update our code and add a second FileSystemAccessRule. Example 11-16 shows the additional code required.

Example 11-16. Denying permissions

private static string[] MakeTestDirectories()
{
    // ...
    FileSystemAccessRule fsarAllow =
        new FileSystemAccessRule(
            userName,
            FileSystemRights.FullControl,
            AccessControlType.Allow);
    ds.AddAccessRule(fsarAllow);

    FileSystemAccessRule fsarDeny =
        new FileSystemAccessRule(
            userName,
            FileSystemRights.WriteExtendedAttributes,
            AccessControlType.Deny);
    ds.AddAccessRule(fsarDeny);

    // ...
}

Notice that we’re specifying AccessControlType.Deny.

Before you compile and run this, delete the folders you created with the last run, using Explorer—we’ll write some code to do that automatically in a minute, because it will get very boring very quickly!

You should see very similar output to last time (just with some new directory names):

C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\slhwbtgo.sop
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\sfndkgn.ucm
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\tayf1uvg.y4y

Figure 11-5. Effective permissions for Administrator on the new folder

Figure 11-6. Effective permissions for the current user on the new folder

If you look at the permissions, you will now see both the Allow and the new Deny entries (Figure 11-7).

Figure 11-7. Permissions now that we’ve denied write extended attributes

As a double-check, take a look at the effective permissions for your current user (see Figure 11-8).

Figure 11-8. Effective permissions with write extended attributes denied

In Figure 11-8 you can see that we’ve no longer got Full control, because we’ve been specifically denied Write extended attributes. Of course, we could always give that permission back to ourselves, because we’ve been allowed Change permissions, but that’s not the point!

Note

Although that isn’t the point, security permissions of all kinds are a complex affair. If your users have local or domain administrator permissions, they can usually work around any other permissions you try to manage. You should always try to abide by the principle of least permission: don’t grant people more privileges than they really need to do the job. Although that will require a little more thinking up front, and can sometimes be a frustrating process while you try to configure a system, it is much preferable to a wide-open door.

OK, delete those new directories using Explorer, and we’ll write some code to clean up after ourselves. We need to delete the directories we’ve just created, by implementing our CleanupTestDirectories method.

Deleting a Directory

You’re probably ahead of us by now. Yes, we can delete a directory using Directory.Delete, as Example 11-17 shows.

Example 11-17. Deleting a directory

private static void CleanupTestDirectories(IEnumerable<string> directories)
{
    foreach (var directory in directories)
    {
        Directory.Delete(directory);
    }
}

We’re just iterating through the set of new directories we stashed away earlier, deleting them.

OK, we’ve got our test directories. We’d now like to create some test files to use. Just before we return from MakeTestDirectories, let’s add a call to a new method to create our files, as Example 11-18 shows.

Example 11-18. Creating files in the test directories

...
CreateTestFiles(directories);
return directories;

Example 11-19 shows that method.

Example 11-19. The CreateTestFiles method

private static void CreateTestFiles(IEnumerable<string> directories)
{
    string fileForAllDirectories = "SameNameAndContent.txt";
    string fileSameInAllButDifferentSizes = "SameNameDifferentSize.txt";

    int directoryIndex = 0;
    // Let's create a distinct file that appears in each directory
    foreach (string directory in directories)
    {
        directoryIndex++;

        // Create the distinct file for this directory
        string filename = Path.GetRandomFileName();
        string fullPath = Path.Combine(directory, filename);
        CreateFile(fullPath, "Example content 1");

        // And now the one that is in all directories, with the same content
        fullPath = Path.Combine(directory, fileForAllDirectories);
        CreateFile(fullPath, "Found in all directories");

        // And now the one that has the same name in
        // all directories, but with different sizes
        fullPath = Path.Combine(directory, fileSameInAllButDifferentSizes);

        StringBuilder builder = new StringBuilder();
        builder.AppendLine("Now with");
        builder.AppendLine(new string('x', directoryIndex));
        CreateFile(fullPath, builder.ToString());
    }
}

As you can see, we’re running through the directories, and creating three files in each. The first has a different, randomly generated filename in each directory, and remember, our application only considers files with the same names as being possible duplicates, so we expect the first file we add to each directory to be considered unique. The second file has the same filename and content (so they will all be the same size) in every folder. The third file has the same name every time, but its content varies in length.

Well, we can’t put off the moment any longer; we’re going to have to create a file, and write some content into it. There are lots and lots and lots (and lots) of different ways of doing that with the .NET Framework, so how do we go about picking one?

Writing Text Files

Our first consideration should always be to “keep it simple,” and use the most convenient method for the job. So, what is the job? We need to create a file, and write some text into it. File.WriteAllText looks like a good place to start.

Writing a Whole Text File at Once

The File class offers three methods that can write an entire file out in a single step: WriteAllBytes, WriteAllLines, and WriteAllText. The first of these works with binary, but our application has text. As you saw in Chapter 10, we could use an Encoding to convert our text into bytes, but the other two methods here will do that for us. (They both use UTF-8.)
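
For comparison, here's a sketch of the two methods we're not using (fullPath is assumed to hold a suitable target path, and the byte values are arbitrary):

// Writes one line per element, encoding the text as UTF-8.
File.WriteAllLines(fullPath, new[] { "first line", "second line" });

// The binary equivalent - the file will contain exactly these bytes.
File.WriteAllBytes(fullPath, new byte[] { 0x48, 0x69 });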

WriteAllLines takes a collection of strings, one for each line, but our code in Example 11-19 prepares content in the form of a single string. So as Example 11-20 shows, we use WriteAllText to write the file out with a single line of code. (In fact, we probably didn’t need to bother putting this code into a separate method. However, this will make it easier for us to illustrate some of the alternatives later.)

Example 11-20. Writing a string into a new file

private static void CreateFile(string fullPath, string contents)
{
    File.WriteAllText(fullPath, contents);
}

The path can be either relative or absolute, and the file will be created if it doesn’t already exist, and overwritten if it does.

This was pretty straightforward, but there’s one problem with this technique: it requires us to have the entire file contents ready at the point where we want to start writing text. This application already does that, but this won’t always be so. What if your program performs long and complex processing that produces very large volumes of text? Writing the entire file at once like this would involve having the whole thing in memory first. But there’s a slightly more complex alternative that makes it possible to generate gigabytes of text without consuming much memory.

Writing Text with a StreamWriter

The File class offers a CreateText method, which takes the path to the file to create (either relative or absolute, as usual), and creates it for you if it doesn’t already exist. If the file is already present, this method overwrites it. Unlike the WriteAllText method, it doesn’t write any data initially—the newly created file will be empty at first. The method returns an instance of the StreamWriter class, which allows you to write to the file. Example 11-21 shows the code we need to use that.

Example 11-21. Creating a StreamWriter

private static void CreateFile(string fullPath, string p)
{
    using (StreamWriter writer = File.CreateText(fullPath))
    {
        // Use the stream writer here
    }
}

We’re no longer writing the whole file in one big lump, so we need to let the StreamWriter know when we’re done. To make life easier for us, StreamWriter implements IDisposable, and closes the underlying file if Dispose is called. This means that we can wrap it in a using block, as Example 11-21 shows, and we can be assured that it will be closed even if an exception is thrown.

So, what is a StreamWriter? The first thing to note is that even though this chapter has “Stream” in the title, this isn’t actually a Stream; it’s a wrapper around a Stream. It derives from a class called TextWriter, which, as you might guess, is a base for types which write text into things, and a StreamWriter is a TextWriter that writes text into a Stream. TextWriter defines lots of overloads of Write and WriteLine methods, very similar to those we’ve been using on Console in all of our examples so far.

Note

If it is so similar in signature, why doesn’t Console derive from TextWriter? TextWriter is intended to be used with some underlying resource that needs proper lifetime management, so it implements IDisposable. Our code would be much less readable if we had to wrap every call on Console with a using block, or remember to call Dispose—especially as it isn’t really necessary. So, why make TextWriter implement IDisposable? We do that so that our text-writing code can be implemented in terms of this base class, without needing to know exactly what sort of TextWriter we’re talking to, and still handle the cleanup properly.
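
That design pays off when we write code against the base class. Here's a sketch (WriteGreeting is a hypothetical helper; note that while Console itself doesn't derive from TextWriter, its Out property does expose one):

static void WriteGreeting(TextWriter writer)
{
    writer.WriteLine("Hello, files and streams!");
}

// The same method can target the console...
WriteGreeting(Console.Out);

// ...or a file, via a StreamWriter (which does need disposing).
using (StreamWriter writer = File.CreateText("greeting.txt"))
{
    WriteGreeting(writer);
}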

The File class’s CreateText method calls a constructor on StreamWriter which opens the newly created file, and makes it ready for us to write; something like this:

return new StreamWriter(fullPath, false);

Note

There’s nothing to stop you from doing this yourself by hand, and there are many situations where you might want to do so; but the helper methods on File tend to make your code smaller, and more readable, so you should consider using those first. We’ll look at using StreamWriter (and its partner, StreamReader) in this way later in the chapter, when we’re dealing with different sorts of underlying streams.

Hang on, though. We've snuck a second parameter into that constructor. What does that Boolean mean? When you create a StreamWriter, you can choose to overwrite any existing file content (the default), or append to what is already there. The second Boolean parameter to the constructor controls that behavior. As it happens, passing false here means we want to overwrite.
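
So to append to an existing file rather than replace its contents, we'd pass true. A minimal sketch (app.log is a hypothetical filename):

// true means append: existing content is kept, new text goes on the end.
using (StreamWriter log = new StreamWriter("app.log", true))
{
    log.WriteLine("Another entry");
}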

Note

This is a great example of why it’s better to define nicely named enumerations, rather than controlling this sort of thing with a bool. If the value had not been false, but some mythical value such as OpenBehavior.Overwrite, we probably wouldn’t have needed to explain what it did. C# 4.0 added the ability to use argument names when calling methods, so we could have written new StreamWriter(fullPath, append: false), which improves matters slightly, but doesn’t help you when you come across code that hasn’t bothered to do that.

So, now we can easily complete the implementation of our CreateFile method, as shown in Example 11-22.

Example 11-22. Writing a string with StreamWriter

private static void CreateFile(string fullPath, string p)
{
    using (StreamWriter writer = File.CreateText(fullPath))
    {
        writer.Write(p);
    }
}

We just write the string we’ve been provided to the file. In this particular application, Example 11-22 isn’t an improvement on Example 11-20—we’re just writing a single string, so WriteAllText was a better fit. But StreamWriter is an important technique for less trivial scenarios.

OK, let’s build and run this code again (press F5 to make sure it runs in the debugger). And everything seems to be going very well. We see the output we’d hoped for:

C:\Users\mwa\AppData\Local\up022gsm.241
C:\Users\mwa\AppData\Local\gdovysqk.cqn
C:\Users\mwa\AppData\Local\xyhazu3n.4pw
SameNameAndContent.txt
----------------------
C:\Users\mwa\AppData\Local\up022gsm.241
C:\Users\mwa\AppData\Local\gdovysqk.cqn
C:\Users\mwa\AppData\Local\xyhazu3n.4pw

That is to say, one file is found duplicated in three directories. All the others have failed to match, exactly as we’d expect.

Unfortunately, almost before we’d had a chance to read that, the debugger halted execution to report an unhandled exception. It crashes in the code we added in Example 11-17 to delete the directories, because the directories are not empty.

For now, we’re going to have to clean up those directories by hand again, and make another change to our code. Clearly, the problem is that the Directory.Delete method doesn’t delete the files and directories inside the directory itself.

This is easily fixed, because there is another overload of that method which does allow us to delete the files recursively—you just pass a Boolean as the second parameter (true for recursive deletes, and false for the default behavior).

Warning

Don’t add this parameter unless you’re absolutely sure that the code is working correctly, looking only at the test directory, and not executing this code in nontest mode. We don’t want a host of emails appearing telling us that we deleted your entire, non-backed-up source and document tree because you followed this next instruction, having deviated slightly from the earlier instructions.

If you want to avoid having to clean up the directories by hand, though, and you’re really, really sure everything is fine, you could add this, at your own risk:

Directory.Delete(directory, true);

So far, we have quietly ignored the many, many things that can go wrong when you’re using files and streams. Now seems like a good time to dive into that murky topic.

When Files Go Bad: Dealing with Exceptions

Exceptions related to file and stream operations fall into three broad categories:

  • The usual suspects you might get from any method: incorrect parameters, null references, and so on

  • I/O-related problems

  • Security-related problems

The first category can, of course, be dealt with as normal—if they occur (as we discussed in Chapter 6) there is usually some bug or unexpected usage that you need to deal with.

The other two are slightly more interesting cases. We should expect problems with file I/O. Files and directories are (mostly) system-wide shared resources. This means that anyone can be doing something with them while you are trying to use them. As fast as you’re creating them, some other process might be deleting them. Or writing to them; or locking them so that you can’t touch them; or altering the permissions on them so that you can’t see them anymore. You might be working with files on a network share, in which case different computers may be messing with the files, or you might lose connectivity partway through working with a file.

This “global” nature of files also means that you have to deal with concurrency problems. Consider this piece of code, for example, that makes use of the (almost totally redundant) method File.Exists, shown in Example 11-23, which determines whether a file exists.

Example 11-23. The questionable File.Exists method

if (File.Exists("SomeFile.txt"))
{
    // Play with the file
}

Is it safe to play with the file in there, on the assumption that it exists?

No.

In another process, even from another machine if the directory is shared, someone could nip in and delete the file or lock it, or do something even more nefarious (like substitute it for something else). Or the user might have closed the lid of his laptop just after the method returns, and may well be in a different continent by the time he brings it out of sleep mode, at which point you won’t necessarily have access to the same network shares that seemed to be visible just one line of code ago.

So you have to code extremely defensively, and expect exceptions in your I/O code, even if you checked that everything looked OK before you started your work.
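
In practice, that usually means attempting the operation and handling failure, rather than trusting an up-front check. A sketch of that pattern:

try
{
    // Just try it - a prior File.Exists check couldn't guarantee
    // this would succeed anyway.
    string contents = File.ReadAllText("SomeFile.txt");
    // Play with the contents...
}
catch (FileNotFoundException)
{
    // The file wasn't there (or vanished before we got to it).
}
catch (IOException)
{
    // Some other I/O problem - possibly transient, so we might retry.
}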

Unlike most exceptions, though, abandoning the operation is not always the best choice. You often see transient problems, like a USB drive being temporarily unavailable, for example, or a network glitch temporarily hiding a share from us, or aborting a file copy operation. (Transient network problems are particularly common after a laptop resumes from suspend—it can take a few seconds to get back on the network, or maybe even minutes if the user is in a hotel and has to sign up for an Internet connection before connecting back to the office VPN. Abandoning the user’s data is not a user-friendly response to this situation.)

When an I/O problem occurs, the framework throws one of several exceptions derived from IOException (or, as we’ve already seen, IOException itself) listed here:

IOException

This is thrown when some general problem with I/O has occurred. This is the base for all of the more specific exception types, but it is sometimes thrown in its own right, with the Message text describing the actual problem. This makes it somewhat less useful for programmatic interpretation; you usually have to allow the user to intervene in some way when you catch one of these.

DirectoryNotFoundException

This is thrown when an attempt is made to access a directory that does not exist. This commonly occurs because of an error in constructing a path (particularly when relative paths are in play), or because some other process has moved or deleted a directory during an operation.

DriveNotFoundException

This is thrown when the root drive in a path is no longer available. This could be because a drive letter has been mapped to a network location which is no longer available, or a removable device has been removed. Or because you typed the wrong drive letter!

FileLoadException

This is a bit of an anomaly in the family of IOExceptions, and we’re including it in this list only because it can cause some confusion. It is thrown by the runtime when an assembly cannot be loaded; as such, it has more to do with assemblies than files and streams.

FileNotFoundException

This is thrown when an attempt is made to access a file that does not exist. As with DirectoryNotFoundException, this is often because there has been some error in constructing a path (absolute or relative), or because something was moved or deleted while the program was running.

PathTooLongException

This is an awkward little exception, and causes a good deal of confusion for developers (which is one reason correct behavior in the face of long paths is a part of Microsoft’s Designed For Windows test suite). It is thrown when a path provided is too long. But what is “too long”? The maximum length for a path in Windows used to be 260 characters (which isn’t very long at all). Recent versions allow paths up to about (but not necessarily exactly) 32,767 characters, but making use of that from .NET is awkward. If you fall foul of the problem, there’s a detailed discussion of Windows file and path lengths in the MSDN documentation at http://msdn.microsoft.com/library/aa365247, and a discussion of the .NET-specific issues at http://go.microsoft.com/fwlink/?LinkID=163666.

If you are doing anything with I/O operations, you will need to think about most, if not all, of these exceptions, deciding where to catch them and what to do when they occur.

Let’s look back at our example again, and see what we want to do with any exceptions that might occur. As a first pass, we could just wrap our main loop in a try/catch block, as Example 11-24 does. Since our application’s only job is to report its findings, we’ll just display a message if we encounter a problem.

Example 11-24. A first attempt at handling I/O exceptions

try
{
    List<FileNameGroup> filesGroupedByName =
        InspectDirectories(recurseIntoSubdirectories, directoriesToSearch);

    DisplayMatches(filesGroupedByName);
    Console.ReadKey();
}
catch (PathTooLongException ptlx)
{
    Console.WriteLine("The specified path was too long");
    Console.WriteLine(ptlx.Message);
}
catch (DirectoryNotFoundException dnfx)
{
    Console.WriteLine("The specified directory was not found");
    Console.WriteLine(dnfx.Message);
}
catch (IOException iox)
{
    Console.WriteLine(iox.Message);
}
catch (UnauthorizedAccessException uax)
{
    Console.WriteLine("You do not have permission to access this directory.");
    Console.WriteLine(uax.Message);
}
catch (ArgumentException ax)
{
    Console.WriteLine("The path provided was not valid.");
    Console.WriteLine(ax.Message);
}
finally
{
    if (testDirectoriesMade)
    {
        CleanupTestDirectories(directoriesToSearch);
    }
}

We’ve decided to provide specialized handling for the PathTooLongException and DirectoryNotFoundException exceptions, as well as generic handling for IOException (which, of course, we have to catch after the exceptions derived from it).

In addition to those IOException-derived types, we’ve also caught UnauthorizedAccessException. This is a security exception, rather than an I/O exception, and so it derives from a different base (SystemException). It is thrown if the user does not have permission to access the directory concerned.

Let’s see that in operation, by creating an additional test directory and denying ourselves access to it. Example 11-25 shows a function to create a directory where we deny ourselves the ListDirectory permission.

Example 11-25. Denying permission

private static string CreateDeniedDirectory(string parentPath)
{
    string deniedDirectory = Path.GetRandomFileName();
    string fullDeniedPath = Path.Combine(parentPath, deniedDirectory);
    string userName = WindowsIdentity.GetCurrent().Name;
    DirectorySecurity ds = new DirectorySecurity();
    FileSystemAccessRule fsarDeny =
        new FileSystemAccessRule(
            userName,
            FileSystemRights.ListDirectory,
            AccessControlType.Deny);
    ds.AddAccessRule(fsarDeny);

    Directory.CreateDirectory(fullDeniedPath, ds);
    return fullDeniedPath;
}

We can call it from our MakeTestDirectories method, as Example 11-26 shows (along with suitable modifications to the code to accommodate the extra directory).

Example 11-26. Modifying MakeTestDirectories for permissions test

private static string[] MakeTestDirectories()
{
    // ...
    // Let's make three test directories
    // and leave space for a fourth to test access denied behavior
    var directories = new string[4];
    for (int i = 0; i < directories.Length - 1; ++i)
    {
        ... as before ...
    }

    CreateTestFiles(directories.Take(3));

    directories[3] = CreateDeniedDirectory(localApplicationData);

    return directories;
}

But hold on a moment, before you build and run this. If we’ve denied ourselves permission to look at that directory, how are we going to delete it again in our cleanup code? Fortunately, because we own the directory that we created, we can modify the permissions again when we clean up.

Finding and Modifying Permissions

Example 11-27 shows a method which can give us back full control over any directory (provided we have permission to change the permissions). This code makes some assumptions about the existing permissions, but that’s OK here because we created the directory in the first place.

Example 11-27. Granting access to a directory

private static void AllowAccess(string directory)
{
    DirectorySecurity ds = Directory.GetAccessControl(directory);

    string userName = WindowsIdentity.GetCurrent().Name;

    // Remove the deny rule
    FileSystemAccessRule fsarDeny =
        new FileSystemAccessRule(
            userName,
            FileSystemRights.ListDirectory,
            AccessControlType.Deny);
    ds.RemoveAccessRuleSpecific(fsarDeny);

    // And add an allow rule
    FileSystemAccessRule fsarAllow =
        new FileSystemAccessRule(
            userName,
            FileSystemRights.FullControl,
            AccessControlType.Allow);
    ds.AddAccessRule(fsarAllow);

    Directory.SetAccessControl(directory, ds);
}

Notice how we’re using the GetAccessControl method on Directory to get hold of the directory security information. We then construct a filesystem access rule which matches the deny rule we created earlier, and call RemoveAccessRuleSpecific on the DirectorySecurity information we retrieved. This matches the rule up exactly, and then removes it if it exists (or does nothing if it doesn’t).

Finally, we add an allow rule to the set to give us full control over the directory, and then call the Directory.SetAccessControl method to set those permissions on the directory itself.

Let’s call that method from our cleanup code, compile, and run. (Don’t forget, we’re deleting files and directories, and changing permissions, so take care!)
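
The call itself might look something like the following sketch. (Your CleanupTestDirectories may differ in detail—what matters is that AllowAccess runs before the delete.)

private static void CleanupTestDirectories(IEnumerable<string> directories)
{
    foreach (string directory in directories)
    {
        // Give ourselves full control back before deleting,
        // so the denied directory doesn't make cleanup fail.
        AllowAccess(directory);
        Directory.Delete(directory, true);
    }
}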

Here’s some sample output:

C:\Users\mwa\AppData\Local\ufmnho4z.h5p
C:\Users\mwa\AppData\Local\5chw4maf.xyu
C:\Users\mwa\AppData\Local\s1ydovhu.0wk
You do not have permission to access this directory.
Access to the path 'C:\Users\mwa\AppData\Local\yjijkza.3cj' is denied.

These methods make it relatively easy to manage permissions when you create and manipulate files, but they don’t make it easy to decide what those permissions should be! It is always tempting just to make everything available to anyone—you can get your code compiled and “working” much quicker that way; but only for “not very secure” values of “working,” and that’s something that has to be of concern for every developer.

Warning

Your application could be the one that miscreants decide to exploit to turn your users’ PCs to the dark side.

We warmly recommend that you crank UAC up to the maximum (and put up with the occasional security dialog), run Visual Studio as a nonadministrator (as far as is possible), and think at every stage about the least possible privileges you can grant to your users that will still let them get their work done. Making your app more secure benefits everyone: not just your own users, but everyone who doesn’t receive a spam email or a hack attempt because the bad guys couldn’t exploit your application.

We’ve now handled the exception nicely—but is stopping really the best thing we could have done? Would it not be better to log the fact that we were unable to access particular directories, and carry on? Similarly, if we get a DirectoryNotFoundException or FileNotFoundException, wouldn’t we want to just carry on in this case? The fact that someone has deleted the directory from underneath us shouldn’t matter to us.

If we look again at our sample, it might be better to catch the DirectoryNotFoundException and FileNotFoundException inside the InspectDirectories method to provide a more fine-grained response to errors. Also, if we look at the documentation for FileInfo, we’ll see that it may actually throw a base IOException under some circumstances, so we should catch that here, too. And in all cases, we need to catch the security exceptions.

We’re relying on LINQ to iterate through the files and folders, which means it’s not entirely obvious where to put the exception handling. Example 11-28 shows the code from InspectDirectories that iterates through the folders, to get a list of files. We can’t put exception handling code into the middle of that query.

Example 11-28. Iterating through the directories

var allFilePaths = from directory in directoriesToSearch
                   from file in Directory.GetFiles(directory, "*.*",
                                                   searchOption)
                   select file;

However, we don’t have to. The simplest way to solve this is to put the code that gets the directories into a separate method, so we can add exception handling, as Example 11-29 shows.

Example 11-29. Putting exception handling in a helper method

private static IEnumerable<string> GetDirectoryFiles(
    string directory, SearchOption searchOption)
{
    try
    {
        return Directory.GetFiles(directory, "*.*", searchOption);
    }
    catch (DirectoryNotFoundException dnfx)
    {
        Console.WriteLine("Warning: The specified directory was not found");
        Console.WriteLine(dnfx.Message);
    }
    catch (UnauthorizedAccessException uax)
    {
        Console.WriteLine(
            "Warning: You do not have permission to access this directory.");
        Console.WriteLine(uax.Message);
    }

    return Enumerable.Empty<string>();
}

This method defers to Directory.GetFiles, but in the event of one of the expected errors, it displays a warning, and then just returns an empty collection.

Note

There’s a problem here when we ask GetFiles to search recursively: if it encounters a problem with even just one directory, the whole operation throws, and you’ll end up not looking in any directories. So while Example 11-29 makes a difference only when the user passes multiple directories on the command line, it’s not all that useful when using the /sub option. If you wanted to make your error handling more fine-grained still, you could write your own recursive directory search. The GetAllFilesInDirectory example in Chapter 7 shows how to do that.
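
If you do go that route, the overall shape might look something like this sketch (GetFilesRecursively is our own name here, not the Chapter 7 example):

private static IEnumerable<string> GetFilesRecursively(string directory)
{
    string[] files = null;
    string[] subdirectories = null;
    try
    {
        files = Directory.GetFiles(directory);
        subdirectories = Directory.GetDirectories(directory);
    }
    catch (DirectoryNotFoundException)
    {
        Console.WriteLine("Warning: The specified directory was not found");
    }
    catch (UnauthorizedAccessException)
    {
        Console.WriteLine(
            "Warning: You do not have permission to access this directory.");
    }

    if (files == null || subdirectories == null) { yield break; }

    foreach (string file in files)
    {
        yield return file;
    }
    foreach (string subdirectory in subdirectories)
    {
        // A problem in one subdirectory no longer aborts the others.
        foreach (string file in GetFilesRecursively(subdirectory))
        {
            yield return file;
        }
    }
}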

If we modify the LINQ query to use our GetDirectoryFiles helper, as shown in Example 11-30, the overall progress will be undisturbed by the error handling.

Example 11-30. Iterating in the face of errors

var allFilePaths = from directory in directoriesToSearch
                   from file in GetDirectoryFiles(directory,
                                                  searchOption)
                   select file;

And we can use a similar technique for the LINQ query that populates the fileNameGroups—it uses FileInfo, and we need to handle exceptions for that. Example 11-31 iterates through a list of paths, and returns details for each file that it was able to access successfully, displaying errors otherwise.

Example 11-31. Handling exceptions from FileInfo

private static IEnumerable<FileDetails> GetDetails(IEnumerable<string> paths)
{
    foreach (string filePath in paths)
    {
        FileDetails details = null;
        try
        {
            FileInfo info = new FileInfo(filePath);
            details = new FileDetails
            {
                FilePath = filePath,
                FileSize = info.Length
            };
        }
        catch (FileNotFoundException fnfx)
        {
            Console.WriteLine("Warning: The specified file was not found");
            Console.WriteLine(fnfx.Message);
        }
        catch (IOException iox)
        {
            Console.Write("Warning: ");
            Console.WriteLine(iox.Message);
        }
        catch (UnauthorizedAccessException uax)
        {
            Console.WriteLine(
                "Warning: You do not have permission to access this file.");
            Console.WriteLine(uax.Message);
        }

        if (details != null)
        {
            yield return details;
        }
    }
}

We can use this from the final LINQ query in InspectDirectories. Example 11-32 shows the modified query.

Example 11-32. Getting details while tolerating errors

var fileNameGroups = from filePath in allFilePaths
                     let fileNameWithoutPath = Path.GetFileName(filePath)
                     group filePath by fileNameWithoutPath into nameGroup
                     select new FileNameGroup
                     {
                         FileNameWithoutPath = nameGroup.Key,
                         FilesWithThisName = GetDetails(nameGroup).ToList()
                     };

Again, this enables the query to process all accessible items, while reporting errors for any problematic files without having to stop completely. If we compile and run again, we see the following output:

C:\Users\mwa\AppData\Local\dcyx0fv1.hv3
C:\Users\mwa\AppData\Local\nf2wqwr.y3s
C:\Users\mwa\AppData\Local\kfilxte4.exy
Warning: You do not have permission to access this directory.
Access to the path 'C:\Users\mwa\AppData\Local\n2gl4q1a.ycp' is denied.
SameNameAndContent.txt
----------------------
C:\Users\mwa\AppData\Local\dcyx0fv1.hv3
C:\Users\mwa\AppData\Local\nf2wqwr.y3s
C:\Users\mwa\AppData\Local\kfilxte4.exy

We’ve dealt cleanly with the directory to which we did not have access, and have continued with the job to a successful conclusion.

Now that we’ve found a few candidate files that may (or may not) be the same, can we actually check to see that they are, in fact, identical, rather than just coincidentally having the same name and length?

Reading Files into Memory

To compare the candidate files, we could load them into memory. The File class offers three likely looking static methods: File.ReadAllBytes, which treats the file as binary, and loads it into a byte array; File.ReadAllText, which treats it as text, and reads it all into a string; and File.ReadAllLines, which again treats it as text, but loads each line into its own string, and returns an array of all the lines. We could even call File.OpenText to obtain a StreamReader (the counterpart of StreamWriter, but for reading data—we’ll see this again later in the chapter).
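
Here’s the shape of those calls, just to see them side by side (the paths are arbitrary):

byte[] raw = File.ReadAllBytes(@"C:\temp\data.bin");      // whole file as bytes
string text = File.ReadAllText(@"C:\temp\notes.txt");     // whole file as one string
string[] lines = File.ReadAllLines(@"C:\temp\notes.txt"); // one string per line

using (StreamReader reader = File.OpenText(@"C:\temp\notes.txt"))
{
    string firstLine = reader.ReadLine(); // or read incrementally instead
}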

Because we’re looking at all file types, not just text, we need to use one of the binary-based methods. File.ReadAllBytes returns a byte[] containing the entire contents of the file. We could then compare the files byte for byte, to see if they are the same. Here’s some code to do that.

First, let’s update our DisplayMatches function to do the load and compare, as shown by the highlighted lines in Example 11-33.

Example 11-33. Updating DisplayMatches for content comparison

private static void DisplayMatches(
    IEnumerable<FileNameGroup> filesGroupedByName)
{
    var groupsWithMoreThanOneFile = from nameGroup in filesGroupedByName
                                    where nameGroup.FilesWithThisName.Count > 1
                                    select nameGroup;

    foreach (var fileNameGroup in groupsWithMoreThanOneFile)
    {
        // Group the matches by the file size, then select those
        // with more than 1 file of that size.
        var matchesBySize = from match in fileNameGroup.FilesWithThisName
                            group match by match.FileSize into sizeGroup
                            where sizeGroup.Count() > 1
                            select sizeGroup;

        foreach (var matchedBySize in matchesBySize)
        {
            List<FileContents> content = LoadFiles(matchedBySize);
            CompareFiles(content);
        }
    }
}

Notice that we want our LoadFiles function to return a List of FileContents objects. Example 11-34 shows the FileContents class.

Example 11-34. File content information class

internal class FileContents
{
    public string FilePath { get; set; }
    public byte[] Content { get; set; }
}

It just lets us associate the filename with the contents so that we can use it later to display the results. Example 11-35 shows the implementation of LoadFiles, which uses ReadAllBytes to load in the file content.

Example 11-35. Loading binary file content

private static List<FileContents> LoadFiles(IEnumerable<FileDetails> fileList)
{
    var content = new List<FileContents>();
    foreach (FileDetails item in fileList)
    {
        byte[] contents = File.ReadAllBytes(item.FilePath);
        content.Add(new FileContents
        {
            FilePath = item.FilePath,
            Content = contents
        });
    }
    return content;
}

We now need an implementation for CompareFiles, which is shown in Example 11-36.

Example 11-36. CompareFiles method

private static void CompareFiles(List<FileContents> files)
{
    Dictionary<FileContents, List<FileContents>> potentiallyMatched =
        BuildPotentialMatches(files);

    // Now, we're going to look at every byte in each
    CompareBytes(files, potentiallyMatched);

    DisplayResults(files, potentiallyMatched);
}

This isn’t exactly the most elegant way of comparing several files. We’re building a big dictionary of all of the potential matching combinations, and then weeding out the ones that don’t actually match. For large numbers of potential matches of the same size this could get quite inefficient, but we’ll not worry about that right now! Example 11-37 shows the function that builds those potential matches.

Example 11-37. Building possible match combinations

private static Dictionary<FileContents, List<FileContents>>
   BuildPotentialMatches(List<FileContents> files)
{
    // Builds a dictionary where the entries look like:
    //  { 0, { 1, 2, 3, 4, ... N } }
    //  { 1, { 2, 3, 4, ... N } }
    // ...
    //  { N - 1, { N } }
    // where N is one less than the number of files.
    var allCombinations = Enumerable.Range(0, files.Count - 1).ToDictionary(
        x => files[x],
        x => files.Skip(x + 1).ToList());

    return allCombinations;
}

This set of potential matches will be whittled down to the files that really are the same by CompareBytes, which we’ll get to momentarily. The DisplayResults method, shown in Example 11-38, runs through the matches and displays their names and locations.

Example 11-38. Displaying matches

private static void DisplayResults(
    List<FileContents> files,
    Dictionary<FileContents, List<FileContents>> currentlyMatched)
{
    if (currentlyMatched.Count == 0) { return; }

    var alreadyMatched = new List<FileContents>();

    Console.WriteLine("Matches");

    foreach (var matched in currentlyMatched)
    {
        // Don't do it if we've already matched it previously
        if (alreadyMatched.Contains(matched.Key))
        {
            continue;
        }
        else
        {
            alreadyMatched.Add(matched.Key);
        }
        Console.WriteLine("-------");
        Console.WriteLine(matched.Key.FilePath);
        foreach (var file in matched.Value)
        {
            Console.WriteLine(file.FilePath);
            alreadyMatched.Add(file);
        }
    }
    Console.WriteLine("-------");
}

This leaves the method shown in Example 11-39 that does the bulk of the work, comparing the potentially matching files, byte for byte.

Example 11-39. Byte-for-byte comparison of all potential matches

private static void CompareBytes(
    List<FileContents> files,
    Dictionary<FileContents, List<FileContents>> potentiallyMatched)
{
    // Remember, this only ever gets called with files of equal length.
    int fileLength = files[0].Content.Length;
    var sourceFilesWithNoMatches = new List<FileContents>();
    for (int fileByteOffset = 0; fileByteOffset < fileLength; ++fileByteOffset)
    {
        foreach (var sourceFileEntry in potentiallyMatched)
        {
            byte[] sourceContent = sourceFileEntry.Key.Content;
            for (int otherIndex = 0; otherIndex < sourceFileEntry.Value.Count;
                                                                 ++otherIndex)
            {
                // Check the byte at i in each of the two files, if they don't
                //  match, then we remove them from the collection
                byte[] otherContent =
                    sourceFileEntry.Value[otherIndex].Content;
                if (sourceContent[fileByteOffset] != otherContent[fileByteOffset])
                {
                    sourceFileEntry.Value.RemoveAt(otherIndex);
                    otherIndex -= 1;
                    if (sourceFileEntry.Value.Count == 0)
                    {
                        sourceFilesWithNoMatches.Add(sourceFileEntry.Key);
                    }
                }
            }
        }
        foreach (FileContents fileWithNoMatches in sourceFilesWithNoMatches)
        {
            potentiallyMatched.Remove(fileWithNoMatches);
        }
        // Don't bother with the rest of the file if
        // there are no further potential matches
        if (potentiallyMatched.Count == 0)
        {
            break;
        }
        sourceFilesWithNoMatches.Clear();
    }
}

We’re going to need to add a test file that differs only in the content. In CreateTestFiles add another filename that doesn’t change as we go round the loop:

string fileSameSizeInAllButDifferentContent =
    "SameNameAndSizeDifferentContent.txt";

Then, inside the loop (at the bottom), we’ll create a test file that will be the same length, but varying by only a single byte:

// And now one that is the same length, but with different content
fullPath = Path.Combine(directory, fileSameSizeInAllButDifferentContent);

builder = new StringBuilder();
builder.Append("Now with ");
builder.Append(directoryIndex);
builder.AppendLine(" extra");
CreateFile(fullPath, builder.ToString());

If you build and run, you should see some output like this, showing the one identical file we have in each file location:

C:\Users\mwa\AppData\Local\e33yz4hg.mjp
C:\Users\mwa\AppData\Local\ung2xdgo.k1c
C:\Users\mwa\AppData\Local\jcpagntt.ynd
Warning: You do not have permission to access this directory.
Access to the path 'C:\Users\mwa\AppData\Local\cmoof2kj.ekd' is denied.
Matches
-------
C:\Users\mwa\AppData\Local\e33yz4hg.mjp\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\ung2xdgo.k1c\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\jcpagntt.ynd\SameNameAndContent.txt
-------

Needless to say, this isn’t exactly efficient; and it is unlikely to work so well when you get to those DVD rips and massive media repositories. Even your 64-bit machine probably doesn’t have quite that much memory available to it.[24] There’s a way to make this more memory-efficient. Instead of loading the file completely into memory, we can take a streaming approach.

Streams

You can think of a stream like one of those old-fashioned news ticker tapes. To write data onto the tape, the bytes (or characters) in the file are typed out, one at a time, on the continuous stream of tape.

We can then wind the tape back to the beginning, and start reading it back, character by character, until either we stop or we run off the end of the tape. Or we could give the tape to someone else, and she could do the same. Or we could read, say, 1,000 characters off the tape, and copy them onto another tape which we give to someone to work on, then read the next 1,000, and so on, until we run out of characters.

Note

Once upon a time, we used to store programs and data in exactly this way, on a stream of paper tape with holes punched in it; the basic technology for this was invented in the 19th century. Later, we got magnetic tape, although that was less than useful in machine shops full of electric motors generating magnetic fields, so paper systems (both tape and punched cards) lasted well into the 1980s (when disk systems and other storage technologies became more robust, and much faster).

The concept of a machine that reads data items one at a time, and can step forward or backward through that stream, goes back to the very foundations of modern computing. It is one of those highly resilient metaphors that only really falls down in the face of highly parallelized algorithms: a single input stream is often the choke point for scalability in that case.

To illustrate this, let’s write a method that’s equivalent to File.ReadAllBytes using a stream (see Example 11-40).

Example 11-40. Reading from a stream

private static byte[] ReadAllBytes(string filename)
{
    using (FileStream stream = File.OpenRead(filename))
    {
        long streamLength = stream.Length;
        if (streamLength > 0x7fffffffL)
        {
            throw new InvalidOperationException(
               "Unable to allocate more than 0x7fffffffL bytes" +
               "of memory to read the file");
        }
        // Safe to cast to an int, because
        // we checked for overflow above
        int bytesToRead = (int) stream.Length;
        // This could be a big buffer!
        byte[] bufferToReturn = new byte[bytesToRead];
        // We're going to start at the beginning
        int offsetIntoBuffer = 0;
        while (bytesToRead > 0)
        {
            int bytesRead = stream.Read(bufferToReturn,
                                        offsetIntoBuffer,
                                        bytesToRead);
            if (bytesRead == 0)
            {
                throw new InvalidOperationException(
                    "We reached the end of file before we expected..." +
                    "Has someone changed the file while we weren't looking?");
            }
            // Read may return fewer bytes than we asked for, so be
            // ready to go round again.
            bytesToRead -= bytesRead;
            offsetIntoBuffer += bytesRead;
        }

        return bufferToReturn;
    }
}

The call to File.OpenRead creates us an instance of a FileStream. This class derives from the base Stream class, which defines most of the methods and properties we’re going to use.

First, we inspect the stream’s Length property to determine how many bytes we need to allocate in our result. This is a long, so it can support truly enormous files, even if we can allocate only 2 GB of memory.

Note

If you try using the stream.Length argument as the array size without checking it for size first, it will compile, so you might wonder why we’re doing this check. In fact, C# converts the argument to an int first, and if it’s too big, you’ll get an OverflowException at runtime. By checking the size explicitly, we can provide our own error message.

Then (once we’ve set up a few variables) we call stream.Read and ask it for all of the data in the stream. It is entitled to give us any number of bytes it likes, up to the number we ask for. It returns the actual number of bytes read, or 0 if we’ve hit the end of the stream and there’s no more data.

Warning

A common programming error is to assume that the stream will give you as many bytes as you asked for. Under simple test conditions it usually will if there’s enough data. However, streams can and sometimes do return you less in order to give you some data as soon as possible, even when you might think it should be able to give you everything. If you need to read a certain amount before proceeding, you need to write code to keep calling Read until you get what you require, as Example 11-40 does.

Notice that it returns us an int. So even if .NET did let us allocate arrays larger than 2 GB (which it doesn’t), a stream can only tell us that it has read 2 GB worth of data at a time, and in fact, the third argument to Read, where we tell it how much we want, is also an int, so 2 GB is the most we can ask for. So while FileStream is able to work with larger files thanks to the 64-bit Length property, it will split the data into more modest chunks of 2 GB or less when we read. But then one of the main reasons for using streams in the first place is to avoid having to deal with all the content in one go, so in practice we tend to work with much smaller chunks in any case.

So we always call the Read method in a loop. The stream maintains the current read position for us, but we need to work out where to write it in the destination array (offsetIntoBuffer). We also need to work out how many more bytes we have to read (bytesToRead).

We can now update the call to ReadAllBytes in our LoadFiles method so that it uses our new implementation:

byte[] contents = ReadAllBytes(item.FilePath);

Note

If this was all you were going to do, you wouldn’t actually implement ReadAllBytes yourself; you’d use the one in the framework! This is just by way of an example. We’re going to make more interesting use of streams shortly.

Build and run again, and you should see output with exactly the same form as before:

C:\Users\mwa\AppData\Local\1ssoimgj.wqg
C:\Users\mwa\AppData\Local\cjiymq5b.bfo
C:\Users\mwa\AppData\Local\diss5tgl.zae
Warning: You do not have permission to access this directory.
Access to the path 'C:\Users\mwa\AppData\Local\u1w0rj0o.2xe' is denied.
Matches
-------
C:\Users\mwa\AppData\Local\1ssoimgj.wqg\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\cjiymq5b.bfo\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\diss5tgl.zae\SameNameAndContent.txt
-------

That’s all very well, but we haven’t actually improved anything. We wanted to avoid loading all of those files into memory. Instead of loading the files, let’s update our FileContents class to hold a stream instead of a byte array, as Example 11-41 shows.

Example 11-41. FileContents using FileStream

internal class FileContents
{
    public string FilePath { get; set; }
    public FileStream Content { get; set; }
}

We’ll have to update the code that creates the FileContents too, in our LoadFiles method from Example 11-35. Example 11-42 shows the change required.

Example 11-42. Modifying LoadFiles

content.Add(new FileContents
                {
                    FilePath = item.FilePath,
                    Content = File.OpenRead(item.FilePath)
                });

(You can now delete our ReadAllBytes implementation, if you want.)

Because we’re opening all of those files, we need to make sure that we always close them all. We can’t just wrap them in using statements, because we’re handing off the references outside the scope of the function that creates them, so we’ll have to find somewhere else to call Close.

DisplayMatches (Example 11-33) ultimately causes the streams to be created by calling LoadFiles, so DisplayMatches should close them too. We can add a try/finally block in that method’s innermost foreach loop, as Example 11-43 shows.

Example 11-43. Closing streams in DisplayMatches

foreach (var matchedBySize in matchesBySize)
{
    List<FileContents> content = LoadFiles(matchedBySize);
    try
    {
        CompareFiles(content);
    }
    finally
    {
        foreach (var item in content)
        {
            item.Content.Close();
        }
    }
}

The last thing to update, then, is the CompareBytes method. The previous version, shown in Example 11-39, relied on loading all the files into memory upfront. The modified version in Example 11-44 uses streams.

Example 11-44. Stream-based CompareBytes

private static void CompareBytes(
    List<FileContents> files,
    Dictionary<FileContents, List<FileContents>> potentiallyMatched)
{
    // Remember, this only ever gets called with files of equal length.
    long bytesToRead = files[0].Content.Length;
    // We work through all the files at once, so allocate a buffer for each.
    Dictionary<FileContents, byte[]> fileBuffers =
        files.ToDictionary(x => x, x => new byte[1024]);

    var sourceFilesWithNoMatches = new List<FileContents>();
    while (bytesToRead > 0)
    {
        // Read up to 1k from all the files.
        int bytesRead = 0;
        foreach (var bufferEntry in fileBuffers)
        {
            FileContents file = bufferEntry.Key;
            byte[] buffer = bufferEntry.Value;
            int bytesReadFromThisFile = 0;
            while (bytesReadFromThisFile < buffer.Length)
            {
                int bytesThisRead = file.Content.Read(
                    buffer, bytesReadFromThisFile,
                    buffer.Length - bytesReadFromThisFile);
                if (bytesThisRead == 0) { break; }
                bytesReadFromThisFile += bytesThisRead;
            }
            if (bytesReadFromThisFile < buffer.Length
             && bytesReadFromThisFile < bytesToRead)
            {
                throw new InvalidOperationException(
                    "Unexpected end of file - did a file change?");
            }
            bytesRead = bytesReadFromThisFile; // Will be same for all files
        }
        bytesToRead -= bytesRead;

        foreach (var sourceFileEntry in potentiallyMatched)
        {
            byte[] sourceFileContent = fileBuffers[sourceFileEntry.Key];

            for (int otherIndex = 0; otherIndex < sourceFileEntry.Value.Count;
                                                                 ++otherIndex)
            {
                byte[] otherFileContent =
                    fileBuffers[sourceFileEntry.Value[otherIndex]];
                for (int i = 0; i < bytesRead; ++i)
                {
                    if (sourceFileContent[i] != otherFileContent[i])
                    {
                        sourceFileEntry.Value.RemoveAt(otherIndex);
                        otherIndex -= 1;
                        if (sourceFileEntry.Value.Count == 0)
                        {
                            sourceFilesWithNoMatches.Add(sourceFileEntry.Key);
                        }
                        break;
                    }
                }
            }
        }
        foreach (FileContents fileWithNoMatches in sourceFilesWithNoMatches)
        {
            potentiallyMatched.Remove(fileWithNoMatches);
        }
        // Don't bother with the rest of the file if there are
        // no further potential matches
        if (potentiallyMatched.Count == 0)
        {
            break;
        }
        sourceFilesWithNoMatches.Clear();
    }
}

Rather than reading entire files at once, we allocate small buffers, and read in 1 KB at a time. As with the previous version, this new one works through all the files of a particular name and size simultaneously, so we allocate a buffer for each file.

We then loop round, reading in a buffer’s worth from each file, and perform comparisons against just that buffer (weeding out any nonmatches). We keep going round until we either determine that none of the files match or reach the end of the files.

Notice how each stream remembers its position for us, with each Read starting where the previous one left off. And since we ensure that we read exactly the same quantity from all the files for each chunk (either 1 KB, or however much is left when we get to the end of the file), all the streams advance in unison.

This code has a somewhat more complex structure than before. The all-in-memory version in Example 11-39 had three loops—the outer one advanced one byte at a time, and then the inner two worked through the various potential match combinations. But because the outer loop in Example 11-44 advances one chunk at a time, we end up needing an extra inner loop to compare all the bytes in a chunk. We could have simplified this by only ever reading a single byte at a time from the streams, but in fact, this chunking has delivered a significant performance improvement. Testing against a folder full of source code, media resources, and compilation output containing 4,500 files (totaling about 500 MB), the all-in-memory version took about 17 seconds to find all the duplicates, but the stream version took just 3.5 seconds! Profiling the code revealed that this performance improvement was entirely a result of the fact that we were comparing the bytes in chunks. So for this particular application, the additional complexity was well worth it. (Of course, you should always measure your own code against representative problems—techniques that work well in one scenario don’t necessarily perform well everywhere.)

Moving Around in a Stream

What if we wanted to step forward or backward in the file? We can do that with the Seek method. Let’s imagine we want to print out the first 100 bytes of each file that we reject, for debug purposes. We can add some code to our CompareBytes method to do that, as Example 11-45 shows.

Example 11-45. Seeking within a stream

if (sourceFileContent[i] != otherFileContent[i])
{
    sourceFileEntry.Value.RemoveAt(otherIndex);
    otherIndex -= 1;
    if (sourceFileEntry.Value.Count == 0)
    {
        sourceFilesWithNoMatches.Add(sourceFileEntry.Key);
    }
#if DEBUG
    // Remember where we got to
    long currentPosition = sourceFileEntry.Key.Content.Position;
    // Seek to 0 bytes from the beginning
    sourceFileEntry.Key.Content.Seek(0, SeekOrigin.Begin);
    // Read up to the first 100 bytes
    for (int index = 0; index < 100; ++index)
    {
        var val = sourceFileEntry.Key.Content.ReadByte();
        if (val < 0) { break; }
        if (index != 0) { Console.Write(", "); }
        Console.Write(val);
    }
    Console.WriteLine();
    // Put it back where we found it
    sourceFileEntry.Key.Content.Seek(currentPosition, SeekOrigin.Begin);
#endif
    break;
}

We start by getting hold of the current position within the stream using the Position property. We do this so that the code doesn’t lose its place in the stream. (Even though we’ve detected a mismatch here, remember that we’re comparing lots of files—perhaps this same file matches one of the other candidates. So we’re not necessarily finished with it yet.)

The first parameter of the Seek method says how far we want to seek from our chosen origin—we’re passing 0 here because we want to go to the beginning of the file. The second parameter specifies that origin. SeekOrigin.Begin means the beginning of the file, and SeekOrigin.End means the end of the file. With SeekOrigin.End the offset is added to the end position, so to move backward into the file you pass a negative offset—Seek(-100, SeekOrigin.End) puts you 100 bytes before the end.

There’s also SeekOrigin.Current which allows you to move relative to the current position. You could use this to read 10 bytes ahead, for example (maybe to work out what you were looking at in context), and then seek back to where you were by calling Seek(-10, SeekOrigin.Current).

Warning

Not all streams support seeking. For example, some streams represent network connections, which you might use to download gigabytes of data. The .NET Framework doesn’t remember every single byte just in case you ask it to seek later on, so if you attempt to rewind such a stream, Seek will throw a NotSupportedException. You can find out whether seeking is supported from a stream’s CanSeek property.
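
Here are the three origins in action, guarded by CanSeek—a sketch that assumes the stream holds at least 15 bytes:

private static void DemonstrateSeeking(Stream stream)
{
    if (!stream.CanSeek) { return; }       // e.g., some network streams

    stream.Seek(0, SeekOrigin.Begin);      // jump to the very start
    stream.Seek(-10, SeekOrigin.End);      // 10 bytes before the end
    stream.Seek(-5, SeekOrigin.Current);   // back up 5 bytes from here
}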

Writing Data with Streams

We don’t just have to use streaming APIs for reading. We can write to the stream, too.

One very common programming task is to copy data from one stream to another. We use this kind of thing all the time—copying data, or concatenating the content of several files into another, for example. (If you want to copy an entire file, you’d use File.Copy, but streams give you the flexibility to concatenate or modify data, or to work with nonfile sources.)

Example 11-46 shows how to read data from one stream and write it into another. This is just for illustrative purposes—.NET 4 added a new CopyTo method to Stream which does this for you. In practice you’d need Example 11-46 only if you were targeting an older version of the .NET Framework, but it’s a good way to see how to write to a stream.

Example 11-46. Copying from one stream to another

private static void WriteTo(Stream source, Stream target, int bufferLength)
{
    bufferLength = Math.Max(100, bufferLength);
    var buffer = new byte[bufferLength];
    int bytesRead;

    do
    {
        bytesRead = source.Read(buffer, 0, buffer.Length);
        if (bytesRead != 0)
        {
            target.Write(buffer, 0, bytesRead);
        }
    } while (bytesRead > 0);
}

We create a buffer which is at least 100 bytes long. We then Read from the source and Write to the target, using the buffer as the intermediary. Notice that the Write method takes the same parameters as Read: the buffer, an offset into that buffer, and the number of bytes to write (which in this case is the number of bytes we just read from the source stream, hence the slightly confusing variable name). As with Read, it steadily advances the current position in the stream as it writes, just like that ticker tape. Unlike Read, Write will always process as many bytes as we ask it to, so with Write, there’s no need to keep looping round until it has written all the data.

Obviously, we need to keep looping until we’ve read everything from the source stream. Notice that we keep going until Read returns 0. This is how streams indicate that we’ve reached the end. (Some streams don’t know in advance how large they are, so you can rely on the Length property for only certain kinds of streams such as FileStream. Testing for a return value of 0 is the most general way to know that we’ve reached the end.)
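
Here’s a quick usage sketch, concatenating two files into a third (the filenames are ours):

using (FileStream first = File.OpenRead("first.bin"))
using (FileStream second = File.OpenRead("second.bin"))
using (FileStream combined = File.Create("combined.bin"))
{
    // WriteTo leaves the target position at the end of what it
    // wrote, so the second copy appends neatly after the first.
    WriteTo(first, combined, 4096);
    WriteTo(second, combined, 4096);
}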

Reading, Writing, and Locking Files

So, we’ve seen how to read and write data to and from streams, and how we can move the current position in the stream by seeking to some offset from a known position. Up until now, we’ve been using the File.OpenRead and File.OpenWrite methods to create our file streams. There is another method, File.Open, which gives us access to some extra features.

The simplest overload takes two parameters: a string which is the path for the file, and a value from the FileMode enumeration. What’s the FileMode? Well, it lets us specify exactly what we want done to the file when we open it. Table 11-6 shows the values available.

Table 11-6. FileMode enumeration

FileMode

Purpose

CreateNew

Creates a brand new file. Throws an exception if it already exists.

Create

Creates a new file, deleting any existing file and overwriting it if necessary.

Open

Opens an existing file, seeking to the beginning by default. Throws an exception if the file does not exist.

OpenOrCreate

Opens an existing file, or creates a new file if it doesn’t exist.

Truncate

Opens an existing file, and deletes all its contents. The file is automatically opened for writing only.

Append

Opens an existing file and seeks to the end of the file. The file is automatically opened for writing only. You can seek in the file, but only within any information you’ve appended—you can’t touch the existing content.

If you use this two-argument overload, the file will be opened in read/write mode. If that’s not what you want, another overload takes a third argument, allowing you to control the access mode with a value from the FileAccess enumeration. Table 11-7 shows the supported values.

Table 11-7. FileAccess enumeration

FileAccess

Purpose

Read

Open read-only.

Write

Open write-only.

ReadWrite

Open read/write.

All of the file-opening methods we’ve used so far have locked the file for our exclusive use until we close or Dispose the object—if any other program tries to open the file while we have it open, it’ll get an error. However, it is possible to play nicely with other users by opening the file in a shared mode. We do this by using the overload which specifies a value from the FileShare enumeration, which is shown in Table 11-8. This is a flags enumeration, so you can combine the values if you wish.

Table 11-8. FileShare enumeration

FileShare

Purpose

None

No one else can open the file while we’ve got it open.

Read

Other people can open the file for reading, but not writing.

Write

Other people can open the file for writing, but not reading (so read/write will fail, for example).

ReadWrite

Other people can open the file for reading or writing (or both). This is equivalent to Read | Write.

Delete

Other people can delete the file that you’ve created, even while you’ve still got it open. Use with care!

You have to be careful when opening files in a shared mode, particularly one that permits modifications. You are open to all sorts of potential exceptions that you could normally ignore (e.g., people deleting or truncating it from underneath you).
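
Putting the three enumerations together: a reader that tolerates other readers, but not writers, might look like this (a sketch—the filename is arbitrary):

using (FileStream log = File.Open("application.log",
                                  FileMode.Open,
                                  FileAccess.Read,
                                  FileShare.Read))
{
    // We can read; other processes may also open the file for
    // reading, but any attempt to open it for writing will fail
    // until we close it.
}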

If you need even more control over the file when you open it, you can create a FileStream instance directly.

FileStream Constructors

There are two types of FileStream constructors—those for interop scenarios, and the “normal” ones. The “normal” ones take a string for the file path, while the interop ones require either an IntPtr or a SafeFileHandle. These wrap a Win32 file handle that you have retrieved from somewhere. (If you’re not already using such a thing in your code, you don’t need to use these versions.) We’re not going to cover the interop scenarios here.

If you look at the list of constructors, the first thing you’ll notice is that quite a few of them duplicate the various permutations of FileShare, FileAccess, and FileMode overloads we had on File.Open.

You’ll also notice equivalents with one extra int parameter. This allows you to provide a hint for the system about the size of the internal buffer you’d like the stream to use. Let’s look at buffering in more detail.

Stream Buffers

Many streams provide buffering. This means that when you read and write, they actually use an intermediate in-memory buffer. When writing, they may store your data in an internal buffer, before periodically flushing the data to the actual output device. Similarly, when you read, they might read ahead a whole buffer full of data, and then return to you only the particular bit you need. In both cases, buffering aims to reduce the number of I/O operations—it means you can read or write data in relatively small increments without incurring the full cost of an operating system API call every time.

There are many layers of buffering for a typical storage device. There might be some memory buffering on the actual device itself (many hard disks do this, for example), the filesystem might be buffered (NTFS always does read buffering, and on a client operating system it’s typically write-buffered, although this can be turned off, and is off by default for the server configurations of Windows). The .NET Framework provides stream buffering, and you can implement your own buffers (as we did in our example earlier).

These buffers are generally put in place for performance reasons. Although the default buffer sizes are chosen for a reasonable trade-off between performance and robustness, for an I/O-intensive application, you may need to hand-tune this using the appropriate constructors on FileStream.
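
For example, you might request a larger buffer when you know you’re about to plow through a big file from start to finish. This is just a sketch—64 KB is an arbitrary figure, not a recommendation:

using (var stream = new FileStream("bigfile.dat",
                                   FileMode.Open,
                                   FileAccess.Read,
                                   FileShare.Read,
                                   64 * 1024)) // requested buffer size, in bytes
{
    // Read as usual - the bigger buffer means fewer, larger
    // trips to the operating system.
}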

Note

As usual, you can do more harm than good if you don’t measure the impact on performance carefully on a suitable range of your target systems. Most applications will not need to touch this value.

Even if you don’t need to tune performance, you still need to be aware of buffering for robustness reasons. If either the process or the OS crashes before the buffers are written out to the physical disk, you run the risk of data loss (hence the reason write buffering is typically disabled on the server). If you’re writing frequently to a Stream or StreamWriter, the .NET Framework will flush the write buffers periodically. It also ensures that everything is properly flushed when the stream is closed. However, if you just stop writing data but you leave the stream open, there’s a good chance data will hang around in memory for a long time without getting written out, at which point data loss starts to become more likely.

In general, you should close files as early as possible, but sometimes you’ll want to keep a file open for a long time, yet still ensure that particular pieces of data get written out. If you need to control that yourself, you can call Flush. This is particularly useful if you have multiple threads of execution accessing the same stream. You can synchronize writes and ensure that they are flushed to disk before the next worker gets in and messes things up! Later in this chapter, we’ll see an example where explicit flushing is extremely important.
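
In the meantime, here’s the basic shape of explicit flushing (a sketch; "audit.log" is an arbitrary file):

using (var writer = new StreamWriter("audit.log", true /* append */))
{
    writer.WriteLine("Transaction started");
    // Push the buffered text out now, rather than leaving it in
    // memory where a crash would lose it. (The OS and device may
    // still buffer further down the stack.)
    writer.Flush();

    // ...long-running work, with the stream still open...

    writer.WriteLine("Transaction complete");
}   // Dispose flushes and closes whatever remains.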

Setting Permissions During Construction

Another parameter we can set in the constructor is the FileSystemRights. We used this type earlier in the chapter to set filesystem permissions. FileStream lets us set these directly when we create a file using the appropriate constructor. Similarly, we can also specify an instance of a FileSecurity object to further control the permissions on the underlying file.
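
For instance, a sketch of creating a file with an access rule already in place might look like this, reusing the types from Example 11-25 (the filename and buffer size are ours; FileOptions is covered next):

string userName = WindowsIdentity.GetCurrent().Name;
FileSecurity security = new FileSecurity();
security.AddAccessRule(new FileSystemAccessRule(
    userName,
    FileSystemRights.FullControl,
    AccessControlType.Allow));

using (var stream = new FileStream("private.dat",
                                   FileMode.CreateNew,
                                   FileSystemRights.FullControl, // rights we request
                                   FileShare.None,
                                   4096,
                                   FileOptions.None,
                                   security)) // ACL applied at creation
{
    // The file comes into existence with our rule already set.
}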

Setting Advanced Options

Finally, we can optionally pass another enumeration to the FileStream constructor, FileOptions, which contains some advanced filesystem options. They are enumerated in Table 11-9. This is a flags-style enumeration, so you can combine these values.

Table 11-9. FileOptions enumeration

FileOptions

Purpose

None

No options at all.

WriteThrough

Ignores any filesystem-level buffers, and writes directly to the output device. This affects only the OS, and not any of the other layers of buffering, so it’s still your responsibility to call Flush.

RandomAccess

Indicates that we’re going to be seeking about in the file in an unsystematic way. This acts as a hint to the OS for its caching strategy. We might be writing a video-editing tool, for example, where we expect the user to be leaping about through the file.

SequentialScan

Indicates that we’re going to be sequentially reading from the file. This acts as a hint to the OS for its caching strategy. We might be writing a video player, for example, where we expect the user to play through the stream from beginning to end.

Encrypted

Indicates that we want the file to be encrypted so that it can be decrypted and read only by the user who created it.

DeleteOnClose

Deletes the file when it is closed. This is very handy for temporary files. If you use this option, you never hit the problem where the file still seems to be locked for a short while even after you’ve closed it (because its buffers are still flushing asynchronously).

Asynchronous

Allows the file to be accessed asynchronously.
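
DeleteOnClose is handy enough for temporary files to warrant a quick sketch (the temp path construction here is ours):

string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
using (var temp = new FileStream(tempPath,
                                 FileMode.CreateNew,
                                 FileAccess.ReadWrite,
                                 FileShare.None,
                                 4096,
                                 FileOptions.DeleteOnClose))
{
    // Use the file as scratch space...
}   // ...and it's cleaned up for us when the stream closes.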

The last option, Asynchronous, deserves a section all to itself.

Asynchronous File Operations

Long-running file operations are a common bottleneck. How many times have you clicked the Save button, and seen the UI lock up while the disk operation takes place (especially if you’re saving a large file to a network location)?

Developers commonly resort to a background thread to push these long operations off the main thread so that they can display some kind of progress or “please wait” UI (or let the user carry on working). We’ll look at that approach in Chapter 16; but you don’t necessarily have to go that far. You can use the asynchronous mode built into the stream instead. To see how it works, look at Example 11-47.

Example 11-47. Asynchronous file I/O

static void Main(string[] args)
{
    string path = "mytestfile.txt";
    // Create a test file
    using (var file = File.Create(path, 4096, FileOptions.Asynchronous))
    {
        // Some bytes to write
        byte[] myBytes = new byte[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        IAsyncResult asyncResult = file.BeginWrite(
            myBytes,
            0,
            myBytes.Length,
            // A callback function, written as an anonymous delegate
            delegate(IAsyncResult result)
            {
                // You *must* call EndWrite() exactly once
                file.EndWrite(result);
                // Then do what you like
                Console.WriteLine(
                    "Called back on thread {0} when the operation completed",
                    System.Threading.Thread.CurrentThread.ManagedThreadId);
            },
            null);

        // You could do something else while you waited...
        Console.WriteLine(
            "Waiting on thread {0}...",
            System.Threading.Thread.CurrentThread.ManagedThreadId);
        // Waiting on the main thread
        asyncResult.AsyncWaitHandle.WaitOne();
        Console.WriteLine(
            "Completed {0} on thread {1}...",
            asyncResult.CompletedSynchronously ?
                "synchronously" : "asynchronously",
            System.Threading.Thread.CurrentThread.ManagedThreadId);
        Console.ReadKey();
        return;
    }
}

If you put this code in a new console application, and then compile and run, you’ll get output similar to this (the actual thread IDs will vary from run to run):

Waiting on thread 10...
Completed asynchronously on thread 10...
Called back on thread 6 when the operation completed

So, what is happening?

When we create our file, we use an overload on File.Create that takes the FileOptions we discussed earlier. (Yes, back then we showed that by constructing the FileStream directly, but the File class supports this too.) This lets us open the file with asynchronous behavior enabled.

Then, instead of calling Write, we call BeginWrite. This takes two additional parameters. The first is a delegate to a callback function of type AsyncCallback, which the framework will call when it has finished the operation to let us know that it has completed. The second is an object that we can pass in, that will get passed back to us in the callback.

Note

This user state object is common to a lot of asynchronous operations, and is used to get information from the calling site to callbacks from the worker thread. It has become less useful in C# with the availability of lambdas and anonymous methods which have access to variables in their enclosing state.

We’ve used an anonymous method to provide the callback delegate. The first thing we do in that method is to call file.EndWrite, passing it the IAsyncResult we’ve been provided in the callback. You must call EndWrite exactly once for every time you call BeginWrite, because it cleans up the resources used to carry out the operation asynchronously. It doesn’t matter whether you call it from the callback, or on the main application thread (or anywhere else, for that matter). If the operation has not completed, it will block the calling thread until it does complete, then do its cleanup. Should you call it twice with the same IAsyncResult for any reason, the framework will throw an exception.

In a typical Windows Forms or WPF application, we’d probably put up some progress dialog of some kind, and just process messages until we got our callback. In a server-side application we’re more likely to want to kick off several pieces of work like this, and then wait for them to finish. To do this, the IAsyncResult provides us with an AsyncWaitHandle, which is an object we can use to block our thread until the work is complete.

So, when we run, our main thread happens to have the ID 10. It blocks until the operation is complete, and then prints out the message about being done. Notice that this was, as you’d expect, on the same thread with ID 10. But after that, we get a message printed out from our callback, which was called by the framework on another thread entirely.

It is important to note that your system may have behaved differently. It is possible that the callback might occur before execution continued on the main thread. You have to be extremely careful that your code doesn’t depend on these operations happening in a particular order.

Note

We’ll discuss these issues in a lot more detail in Chapter 16. We recommend you read that before you use any of these asynchronous techniques in production code.

Remember that we set the FileOptions.Asynchronous flag when we opened the file to get this asynchronous behavior? What happens if we don’t do that? Let’s tweak the code so that it opens with FileOptions.None instead, and see. Example 11-48 shows the statements from Example 11-47 that need to be modified.

Example 11-48. Not asking for asynchronous behavior

...
// Create a test file
using (var file = File.Create(path, 4096, FileOptions.None))
{
...

If you build and run that, you’ll see some output similar to this:

Waiting on thread 9...
Completed asynchronously on thread 9...
Called back on thread 10 when the operation completed

What’s going on? That all still seemed to be asynchronous!

Well yes, it was, but under the covers the problem was solved in two different ways. In the first case, the runtime used the underlying support Windows provides for asynchronous I/O in the filesystem to handle the asynchronous file operation. In the second case, the .NET Framework had to do some work for us: it grabbed a thread from the thread pool and executed the write operation on that thread to deliver the asynchronous behavior.

Note

That’s true right now, but bear in mind that these are implementation details and could change in future versions of the framework. The principle will remain the same, though.

So far, everything we’ve talked about has been related to files, but we can create streams over other things, too. If you’re a Silverlight developer, you’ve probably been skimming over all of this a bit—after all, if you’re running in the web browser you can’t actually read and write files in the filesystem. There is, however, another option that you can use (along with all the other .NET developers out there): isolated storage.

Isolated Storage

In the duplicate file detection application we built earlier in this chapter, we had to go to some lengths to find a location and pick filenames for the datafiles we wished to create in test mode, in order to guarantee that we didn’t collide with other applications. We also had to pick locations that we knew we would (probably) have permission to write to, and that we could then load again.

Isolated storage takes this one stage further and gives us a means of saving and loading data in a location unique to a particular piece of executing code. The physical location itself is abstracted away behind the API; we don’t need to know where the runtime is actually storing the data, just that the data is stored safely, and that we can retrieve it again. (Even if we want to know where the files are, the isolated storage API won’t tell us.) This helps to make the isolated storage framework a bit more operating-system-agnostic, and removes the need for full trust (unlike regular file I/O). Hence it can be used by Silverlight developers (who can target other operating systems such as Mac OS X) as well as those of us building server or desktop client applications for Windows.

This compartmentalization of the information by characteristics of the executing code gives us a slightly different security model from regular files. We can constrain access to particular assemblies, websites, and/or users, for instance, through an API that is much simpler (although much less sophisticated) than the regular file security.

Warning

Although isolated storage provides you with a simple security model to use from managed code, it does not secure your data effectively against unmanaged code running in a relatively high trust context and trawling the local filesystem for information. So, you should not trust sensitive data (credit card numbers, say) to isolated storage. That being said, if someone you cannot trust has successfully run unmanaged code in a trusted context on your box, isolated storage is probably the least of your worries.

Stores

Our starting point when using isolated storage is a store, and you can think of any given store as being somewhat like one of the well-known directories we dealt with in the regular filesystem. The framework creates a folder for you when you first ask for a store with a particular set of isolation criteria, and then gives you back the same folder each time you ask for the store with the same criteria. Instead of using the regular filesystem APIs, we then use special methods on the store to create, move, and delete files and directories within that store.

First, we need to get hold of a store. We do that by calling one of several static members on the IsolatedStorageFile class. Example 11-49 starts by getting the user store for a particular assembly. We’ll discuss what that means shortly, but for now it just means we’ve got some sort of a store we can use. It then goes on to create a folder and a file that we can use to cache some information, and retrieve it again on subsequent runs of the application.

Example 11-49. Creating folders and files in a store

static void Main(string[] args)
{
    IsolatedStorageFile store = IsolatedStorageFile.GetUserStoreForAssembly();
    // Create a directory - safe to call multiple times
    store.CreateDirectory("Settings");
    // Open or create the file
    using (IsolatedStorageFileStream stream = store.OpenFile(
                                "Settings\standardsettings.txt",
                                System.IO.FileMode.OpenOrCreate,
                                System.IO.FileAccess.ReadWrite))
    {
        UseStream(stream);
    }
    Console.ReadKey();
}

We create a directory in the store, called Settings. You don’t have to do this; you could put your file in the root directory for the store, if you wanted. Then, we use the OpenFile method on the store to open a file. We use the standard file path syntax to specify the file, relative to the root for this store, along with the FileMode and FileAccess values that we’re already familiar with. They all mean the same thing in isolated storage as they do with normal files. That method returns us an IsolatedStorageFileStream. This class derives from FileStream, so it works in pretty much the same way.

So, what shall we do with it now that we’ve got it? For the purposes of this example, let’s just write some text into it if it is empty. On a subsequent run, we’ll print the text we wrote to the console.

Reading and Writing Text

We’ve already seen StreamWriter, the handy wrapper class we can use for writing text to a stream. Previously, we got hold of one from File.CreateText, but remember we mentioned that there’s a constructor we can use to wrap any Stream (not just a FileStream) if we want to write text to it? Well, we can use that now, for our IsolatedStorageFileStream. Similarly, we can use the equivalent StreamReader to read text from the stream if it already exists. Example 11-50 implements the UseStream method that Example 11-49 called after opening the stream, and it uses both StreamReader and StreamWriter.

Example 11-50. Using StreamReader and StreamWriter with isolated storage

static void UseStream(Stream stream)
{
    if (stream.Length > 0)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
    else
    {
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.WriteLine(
                "Initialized settings at {0}", DateTime.Now.TimeOfDay);
            Console.WriteLine("Settings have been initialized");
        }
    }
}

In the case where we’re writing, we construct our StreamWriter (in a using block, because we need to Dispose it when we’re done), and then use the WriteLine method to write our content. Remember that WriteLine adds an extra new line on the end of the text, whereas Write just writes the text provided.

In the case where we are reading, on the other hand, we construct a StreamReader (also in a using block), and then read the entire content using ReadToEnd. This reads the entire content of the file into a single string.

So, if you build and run this once, you’ll see some output that looks a lot like this:

Settings have been initialized

That means we’ve run through the write path. Run a second (or subsequent) time, and you’ll see something more like this:

Initialized settings at 10:34:47.7014833

That means we’ve run through the read path.

Note

When you run this, you’ll notice that we end up outputting an extra blank line at the end, because we’ve read a whole line from the file—we called writer.WriteLine when generating the file—and then used Console.WriteLine, which adds another end of line after that. You have to be a little careful when manipulating text like this, to ensure that you don’t end up with huge amounts of unwanted whitespace because everyone in some processing chain is generously adding new lines or other whitespace at the end!
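If the stray blank line matters to you, one simple (if blunt) fix is to trim trailing whitespace before printing; a sketch:

Console.WriteLine(reader.ReadToEnd().TrimEnd());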

This is a rather neat result. We can use all our standard techniques for reading and writing to an IsolatedStorageFileStream once we’ve acquired a suitable file: the other I/O types such as StreamReader don’t need to know what kind of stream we’re using.

Defining “Isolated”

So, what makes isolated storage “isolated”? The .NET Framework partitions information written into isolated storage based on some characteristics of the executing code.

Several types of isolated store are available to you:

  • Isolation by user and assembly (optionally supporting roaming)

  • Isolation by user, domain, and assembly (optionally supporting roaming)

  • Isolation by user and application (optionally supporting roaming)

  • Isolation by user and site (only on Silverlight)

  • Isolation by machine and assembly

  • Isolation by machine, domain, and assembly

  • Isolation by machine and application

Silverlight supports only two of these: by user and site, and by user and application.

Isolation by user and assembly

In Example 11-49, we acquired a store isolated by user and assembly, using the static method IsolatedStorageFile.GetUserStoreForAssembly. This store is unique to a particular user and to the assembly in which the calling code is executing. You can try this out for yourself. If you log in to your box as a user other than the one under which you’ve already run our example app, and run it again, you’ll see some output like this:

Settings have been initialized

That means our settings file doesn’t exist (for this user), so we must have been given a new store.

As you might expect, the user is identified by the authenticated principal for the current thread. Typically, this is the logged-on user that ran the process; but this could have been changed by impersonation (in a web application, for example, you might be running in the context of the web user, rather than that of the ASP.NET process that hosts the site).

Identifying the assembly is slightly more complex. If you have signed the assembly, the framework uses the information in that signature (be it a strong name signature or a software publisher signature; the software publisher signature wins if the assembly has both).

If, on the other hand, the assembly is not signed, it will use the URL for the assembly. If it came from the Internet, it will be of the form:

http://some/path/to/myassembly.dll

If it came from the local filesystem, it will be of the form:

file:///C:/some/path/to/myassembly.dll

Figure 11-9 illustrates how multiple stores get involved when you have several users and several different assemblies. User 1 asks MyApp.exe to perform some task, which asks for user/assembly isolated storage. It gets Store 1. Imagine that User 1 then asks MyApp.exe to perform some other task that requires the application to call on MyAssembly.dll to carry out the work. If that in turn asks for user/assembly isolated storage, it will get a different store (labeled Store 2 in the diagram). We get a different store, because they are different assemblies.

When a different user, User 2, asks MyApp.exe to perform the first task, which then asks for user/assembly isolated storage, it gets a different store again—Store 3 in the diagram—because they are different users.

User and assembly isolation

Figure 11-9. User and assembly isolation

OK, what happens if we make two copies of MyApp.exe in two different locations, and run them both under the same user account? The answer is that it depends....

If the applications are not signed, the assembly identification rules mean that they don’t match, and so we get two different isolated stores.

If they are signed, the assembly identification rules mean that they do match, so we get the same isolated store.

Our app isn’t signed, so if we try this experiment, we’ll see the standard “first run” output for our second copy.

Warning

Be very careful when using isolated storage with signed assemblies. The information used from the signature includes the Name, Strong Name Key, and Major Version part of the version info. So, if you rev your application from 1.x to 2.x, all of a sudden you’re getting a different isolated storage scope, and all your existing data will “vanish.” One way to deal with this is to use a distinct DLL to access the store, and keep its version numbers constant.

Isolation by user, domain, and assembly

Isolating by domain means that we look for some information about the application domain in which we are running. Typically, this is the full URL of the assembly if it was downloaded from the Web, or the local path of the file.

Notice that this is the same rule as for the assembly identity if we didn’t sign it! The purpose of this isolation model is to allow a single signed assembly to get different stores if it is run from different locations. You can see a diagram that illustrates this in Figure 11-10.

Assembly and domain isolation compared

Figure 11-10. Assembly and domain isolation compared

To get a store with this isolation level, we can call the IsolatedStorageFile class’s GetUserStoreForDomain method.

Isolation by user and application

A third level of isolation is by user and application. What defines an “application”? Well, you have to sign the whole lot with a publisher’s (Authenticode) signature. A regular strong-name signature won’t do (as that will identify only an individual assembly).

Note

If you want to try this out quickly for yourself, you can run the ClickOnce Publication Wizard on the Publish tab of your example project settings. This will generate a suitable test certificate and sign the app.

To get a store with user and application isolation, we call the IsolatedStorageFile class’s GetUserStoreForApplication method.

Note

If you haven’t signed your application properly, this method will throw an exception.
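A hedged sketch of guarding against that (the exception in question is IsolatedStorageException):

try
{
    IsolatedStorageFile appStore =
        IsolatedStorageFile.GetUserStoreForApplication();
    // ...use appStore...
}
catch (IsolatedStorageException)
{
    // No application identity was available - the app probably hasn't
    // been signed and deployed in a way that establishes one.
    Console.WriteLine("Unable to open the application-scoped store.");
}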

So, it doesn’t matter which assembly you call from; as long as it is a part of the same application, it will get the same store. You can see this illustrated in Figure 11-11.

Application isolation

Figure 11-11. Application isolation

Note

This can be particularly useful for settings that might be shared between several different application components.

Machine isolation

What if your application or component has some data you want to make available to all users on the system? Maybe you want to cache common product information or imagery to avoid a download every time you start the app. For these scenarios you need machine isolation.

As you saw earlier, each isolation type for the user has a corresponding isolation type for the machine, and the same resolution rules apply in each case. The methods you need are:

GetMachineStoreForApplication
GetMachineStoreForDomain
GetMachineStoreForAssembly
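For example, a minimal sketch of grabbing a machine-wide, assembly-scoped store for shared cache data (the directory name is purely illustrative):

IsolatedStorageFile machineStore =
    IsolatedStorageFile.GetMachineStoreForAssembly();
machineStore.CreateDirectory("ProductImageCache");  // hypothetical name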

Managing User Storage with Quotas

Isolated storage has the ability to set quotas on particular storage scopes. This allows you to limit the amount of data that can be saved in any particular store. This is particularly important for applications that run with partial trust—you wouldn’t want Silverlight applications automatically loaded as part of a web page to be able to store vast amounts of data on your hard disk without your permission.

You can find out a store’s current quota by looking at the Quota property on a particular IsolatedStorageFile. This is a long, which indicates the maximum number of bytes that may be stored. This is not a “bytes remaining” count—you can use the AvailableFreeSpace property for that.

Note

Your available space will go down slightly when you create empty directories and files. This reflects the fact that such items consume space on disk even though they are nominally empty.

The quota can be increased using the IncreaseQuotaTo method, which takes a long specifying the new limit, in bytes, for the store. The new limit must be larger than the current one, or an ArgumentException is thrown. The call may or may not succeed—the user will be prompted, and may refuse your request for more space.
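Here’s a hedged sketch of asking for more room before a large write (assuming a store already in hand; the 5 MB figure is illustrative):

long needed = 5 * 1024 * 1024;  // the headroom we'd like, in bytes
if (store.AvailableFreeSpace < needed)
{
    // This may prompt the user, and returns false if the request is refused.
    if (!store.IncreaseQuotaTo(store.Quota + needed))
    {
        Console.WriteLine("The request for more space was refused.");
    }
}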

Warning

You cannot reduce the quota for a store once you’ve set it, so take care!

Managing Isolated Storage

As a user, you might want to look at the data stored in isolated storage by applications running on your machine. It can be complicated to manage and debug isolated storage, but there are a few tools and techniques to help you.

First, there’s the storeadm.exe tool. This allows you to inspect isolated storage for the current user (the default), the current machine (with the /machine option), or the current roaming user (with the /roaming option).

So, if you try running this command:

storeadm /MACHINE /LIST

you will see output similar to this (listing the various stores for this machine, along with the evidence that identifies them):

Microsoft (R) .NET Framework Store Admin 4.0.30319.1
Copyright (c) Microsoft Corporation.  All rights reserved.

Record #1
[Assembly]
<StrongName version="1"
Key="0024000004800000940000000602000000240000525341310004000001000100A5FE84898F
190EA6423A7D7FFB1AE778141753A6F8F8235CBC63A9C5D04143C7E0A2BE1FC61FA6EBB52E7FA9B
48D22BAF4027763A12046DB4A94FA3504835ED9F29CD031600D5115939066AABE59A4E61E932AEF
0C24178B54967DD33643FDE04AE50786076C1FB32F64915E8200729301EB912702A8FDD40F63DD5
A2DE218C7"
Name="ConsoleApplication7"
Version="1.0.0.0"/>

        Size : 0
Record #2
[Domain]
<StrongName version="1"
Key="0024000004800000940000000602000000240000525341310004000001000100A5FE84898F
190EA6423A7D7FFB1AE778141753A6F8F8235CBC63A9C5D04143C7E0A2BE1FC61FA6EBB52E7FA9B
48D22BAF4027763A12046DB4A94FA3504835ED9F29CD031600D5115939066AABE59A4E61E932AEF
0C24178B54967DD33643FDE04AE50786076C1FB32F64915E8200729301EB912702A8FDD40F63DD5
A2DE218C7"
Name="ConsoleApplication7"
Version="1.0.0.0"/>

[Assembly]
<StrongName version="1"
Key="0024000004800000940000000602000000240000525341310004000001000100A5FE84898F
190EA6423A7D7FFB1AE778141753A6F8F8235CBC63A9C5D04143C7E0A2BE1FC61FA6EBB52E7FA9B
48D22BAF4027763A12046DB4A94FA3504835ED9F29CD031600D5115939066AABE59A4E61E932AEF
0C24178B54967DD33643FDE04AE50786076C1FB32F64915E8200729301EB912702A8FDD40F63DD5
A2DE218C7"
Name="ConsoleApplication7"
Version="1.0.0.0"/>

        Size : 0

Notice that there are two stores in that example. One is identified by some assembly evidence (the strong name key, name, and major version info). The other is identified by both domain and assembly evidence. Because the sample application is in a single assembly, the assembly evidence for both stores happens to be identical!

Warning

You can also add the /REMOVE parameter which will delete all of the isolated storage in use at the specified scope. Be very careful if you do this, as you may well delete storage used by another application entirely.

That’s all very well, but you can’t see the place where those files are stored. That’s because the actual storage is intended to be abstracted away behind the API. Sometimes, however, it is useful to be able to go and pry into the actual storage itself.

Warning

Remember, this is an implementation detail, and it could change between versions. It has been consistent since the first version of the .NET Framework, but in the future, Microsoft could decide to store it all in one big file hidden away somewhere, or using some mystical API that we don’t have access to.

We can take advantage of the fact that the debugger can show us the private innards of the IsolatedStorageFile class. If we set a breakpoint on the store.CreateDirectory line in our sample application, we can inspect the IsolatedStorageFile object that was returned by GetUserStoreForAssembly on the previous line. You will see that there is a private field called m_RootDir. This is the actual root directory (in the real filesystem) for the store. You can see an example of that as it is on my machine in Figure 11-12.

IsolatedStorageFile internals

Figure 11-12. IsolatedStorageFile internals

If you copy that path and browse to it using Windows Explorer, you’ll see something like the folder in Figure 11-13.

There’s the Settings directory that we created! As you might expect, if you were to look inside, you’d see the standardsettings.txt file our program created.

An isolated storage folder

Figure 11-13. An isolated storage folder

As you can see, this is a very useful debugging technique, allowing you to inspect and modify the contents of files in isolated storage, and identify exactly which store you have for a particular scope. It does rely on implementation details, but since you’d only ever do this while debugging, the code you ultimately ship won’t depend on any nonpublic features of isolated storage.

OK. So far, we’ve seen two different types of stream: a regular file and an isolated storage file. We use our familiar stream tools and techniques (like StreamReader and StreamWriter) regardless of the underlying type.

So, what other kinds of stream exist? Well, there are lots; several subsystems in the .NET Framework provide stream-based APIs. We’ll see some networking ones in Chapter 13, for example. Another comes from the .NET Framework’s security features: CryptoStream, which is used for encrypting and decrypting a stream of data. There’s also MemoryStream in System.IO, which uses memory to store the data in the stream.

Streams That Aren’t Files

In this final section, we’ll look at a stream that is not a file. We’ll use a stream from .NET’s cryptographic services to encrypt a string. This encrypted string can be decrypted later as long as we know the key. The test program in Example 11-51 illustrates this.

Example 11-51. Using an encryption stream

static void Main(string[] args)
{
    byte[] key;
    byte[] iv;

    // Get the appropriate key and initialization vector for the algorithm
    SelectKeyAndIV(out key, out iv);

    string superSecret = "This is super secret";

    Console.WriteLine(superSecret);

    string encryptedText = EncryptString(superSecret, key, iv);

    Console.WriteLine(encryptedText);

    string decryptedText = DecryptString(encryptedText, key, iv);

    Console.WriteLine(decryptedText);

    Console.ReadKey();
}

It is going to write a message to the console, encrypt it, write the encrypted text to the console, decrypt it, and write the result of that back to the console. All being well, the first line should be the same as the last, and the middle line should look like gibberish!

Note

Of course, it’s not very useful to encrypt and immediately decrypt again. This example illustrates all the parts in one program—in a real application, decryption would happen in a different place than encryption.

The first thing we do is get a suitable key and initialization vector for our cryptographic algorithm. These are the two parts of the secret key that are shared between whoever is encrypting and decrypting our sensitive data.

A detailed discussion of cryptography is somewhat beyond the scope of this book, but here are a few key points to get us going. Unenciphered data is known as the plain text, and the encrypted version is known as cipher text. We use those terms even if we’re dealing with nontextual data. The key and the initialization vector (IV) are used by a cryptographic algorithm to encrypt the unenciphered data. A cryptographic algorithm that uses the same key and IV for both encryption and decryption is called a symmetric algorithm (for obvious reasons). Asymmetric algorithms also exist, but we won’t be using them in this example.

Needless to say, if an unauthorized individual gets hold of the key and IV, he can happily decrypt any of your cipher text, and you no longer have a communications channel free from prying eyes. It is therefore extremely important that you take care when sharing these secrets with the people who need them, to ensure that no one else can intercept them. (This turns out to be the hardest part—key management and especially human factors turn out to be security weak points far more often than the technological details. This is a book about programming, so we won’t even attempt to solve that problem. We recommend the book Secrets and Lies: Digital Security in a Networked World by Bruce Schneier [John Wiley & Sons] for more information.)

We’re calling a method called SelectKeyAndIV to get hold of the key and IV. In real life, you’d likely be sharing this information between different processes, usually even on different machines; but for the sake of this demonstration, we’re just creating them on the fly, as you can see in Example 11-52.

Example 11-52. Creating a key and IV

private static void SelectKeyAndIV(out byte[] key, out byte[] iv)
{
    var algorithm = TripleDES.Create();
    algorithm.GenerateIV();
    algorithm.GenerateKey();

    key = algorithm.Key;
    iv = algorithm.IV;
}

TripleDES is an example of a symmetric algorithm, so it derives from a class called SymmetricAlgorithm. All such classes provide a couple of methods called GenerateIV and GenerateKey that create cryptographically strong random byte arrays to use as an initialization vector and a key. We need cryptographically strong randomness here because the output of an ordinary pseudorandom generator (such as System.Random) is predictable enough that an attacker could reconstruct the key.

OK, with that done, we can now implement our EncryptString method. This takes the plain text string, the key, and the initialization vector, and returns us an encrypted string. Example 11-53 shows an implementation.

Example 11-53. Encrypting a string

private static string EncryptString(string plainText, byte[] key, byte[] iv)
{
    // Create a crypto service provider for the TripleDES algorithm
    var serviceProvider = new TripleDESCryptoServiceProvider();

    using (MemoryStream memoryStream = new MemoryStream())
    using (var cryptoStream = new CryptoStream(
                                    memoryStream,
                                    serviceProvider.CreateEncryptor(key, iv),
                                    CryptoStreamMode.Write))
    using (StreamWriter writer = new StreamWriter(cryptoStream))
    {
        // Write some text to the crypto stream, encrypting it on the way
        writer.Write(plainText);
        // Make sure that the writer has flushed to the crypto stream
        writer.Flush();
        // We also need to tell the crypto stream to flush the final block out to
        // the underlying stream, or we'll
        // be missing some content...
        cryptoStream.FlushFinalBlock();

        // Now, we want to get back whatever the crypto stream wrote to our memory
        // stream.
        return GetCipherText(memoryStream);
    }
}

We’re going to write our plain text to a CryptoStream, using the standard StreamWriter adapter. This works just as well over a CryptoStream as any other, but instead of coming out as plain text, it will be enciphered for us. How does that work?

An Adapting Stream: CryptoStream

CryptoStream is quite different from the other streams we’ve met so far. It doesn’t have any underlying storage of its own. Instead, it wraps another Stream, and then uses an ICryptoTransform either to transform the data written to it from plain text into cipher text before writing it to that underlying stream (if we put it into CryptoStreamMode.Write), or to transform what it reads from the underlying stream back into plain text before passing it on to the reader (if we put it into CryptoStreamMode.Read).

So, how do we get hold of a suitable ICryptoTransform? We’re making use of a factory class called TripleDESCryptoServiceProvider. This has a method called CreateEncryptor which will create an instance of an ICryptoTransform that uses the TripleDES algorithm to encrypt our plain text, with the specified key and IV.

Warning

A number of different algorithms are available in the framework, with various strengths and weaknesses. In general, they also have a number of different configuration options, the defaults for which can vary between versions of the .NET Framework and even versions of the operating system on which the framework is deployed. To be successful, you’re going to have to ensure that you match not just the key and the IV, but also the choice of algorithm and all its options. In general, you should carefully set everything up by hand, and avoid relying on the defaults (unlike this example, which, remember, is here to illustrate streams).

We provide all of those parameters to the CryptoStream constructor, and then we can use it (almost) like any other stream.

In fact, there is a proviso about CryptoStream. Because of the way that most cryptographic algorithms work on blocks of plain text, it has to buffer up what is being written (or read) until it has a full block, before encrypting it and writing it to the underlying stream.

This means that, when you finish writing to it, you might not have filled up the final block, and it might not have been flushed out to the destination stream. There are two ways of ensuring that this happens:

  • Dispose the CryptoStream.

  • Call FlushFinalBlock on the CryptoStream.

In many cases, the first solution is the simplest. However, when you call Dispose on the CryptoStream it will also Close the underlying stream, which is not always what you want to do. In this case, we’re going to use the underlying stream some more, so we don’t want to close it just yet. Instead, we call Flush on the StreamWriter to ensure that it has flushed all of its data to the CryptoStream, and then FlushFinalBlock on the CryptoStream itself, to ensure that the encrypted data is all written to the underlying stream.

We can use any sort of stream for that underlying stream. We could use a file stream on disk, or one of the isolated storage file streams we saw earlier in this chapter, for example. We could even use one of the network streams we’re going to see in Chapter 13. However, for this example we’d like to do everything in memory, and the framework has just the class for us: the MemoryStream.

In Memory Alone: The MemoryStream

MemoryStream is very simple in concept. It is just a stream that uses memory as its backing store. We can do all of the usual things like reading, writing, and seeking. It’s very useful when you’re working with APIs that require you to provide a Stream, and you don’t already have one handy.

If we use the default constructor (as in our example), we can read and write to the stream, and it will automatically grow in size as it needs to accommodate the data being written. Other constructors allow us to provide a start size suitable for our purposes (if we know in advance what that might be).

We can even provide a block of memory in the form of a byte[] array to use as the underlying storage for the stream. In that case, we are no longer able to resize the stream, and we will get a NotSupportedException if we try to write too much data. You would normally supply your own byte[] array when you already have one and need to pass it to something that wants to read from a stream.

We can find out the current size of the underlying block of memory (whether we allocated it explicitly, or whether it is being automatically resized) by looking at the stream’s Capacity property. Note that this is not the same as the number of bytes we’ve actually written to the stream. The automatic resizing tends to overallocate to avoid the overhead of constant reallocation when writing. To determine how many bytes you’ve actually written, look at the stream’s Length property, or compare the Position at the beginning and end of your write operations.
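A small sketch contrasting the two kinds of MemoryStream (the exact Capacity you see is an implementation detail):

var growable = new MemoryStream();
growable.Write(new byte[10], 0, 10);
Console.WriteLine("Length: {0}, Capacity: {1}",
    growable.Length, growable.Capacity);  // e.g., "Length: 10, Capacity: 256"

var fixedSize = new MemoryStream(new byte[4]);
fixedSize.Write(new byte[4], 0, 4);      // fine - exactly fills the buffer
// Writing even one more byte here would throw NotSupportedException.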

Having used the CryptoStream to write the cipher text into the stream, we need to turn that into a string we can show on the console.

Representing Binary As Text with Base64 Encoding

Unfortunately, the cipher text is not actually text at all—it is just a stream of bytes. We can’t use the UTF8Encoding.UTF8.GetString technique we saw in Chapter 10 to turn the bytes into text, because these bytes don’t represent UTF-8 encoded characters.

Instead, we need some other sort of text-friendly representation if we’re going to be able to print the encrypted text to the console. We could write each byte out as hex digits. That would be a perfectly reasonable string representation.

However, that’s not very compact (each byte is taking five characters in the string!):

0x01 0x0F 0x03 0xFA 0xB3

A much more compact textual representation is Base64 encoding. This is a very popular textual encoding of arbitrary data. It’s often used to embed binary in XML, which is a fundamentally text-oriented format.

And even better, the framework provides us with a convenient static helper method to convert from a byte[] to a Base64 encoded string: Convert.ToBase64String.

Note

If you’re wondering why there’s no Encoding class for Base64 to correspond to the Unicode, ASCII, and UTF-8 encodings we saw in Chapter 10, it’s because Base64 is a completely different kind of thing. Those other encodings are mechanisms that define binary representations of textual information. Base64 does the opposite—it defines a textual representation for binary information.
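The distinction fits in a couple of lines (a sketch, using types from System.Text):

byte[] utf8Bytes = Encoding.UTF8.GetBytes("text");  // encoding: text -> bytes
string base64 = Convert.ToBase64String(utf8Bytes);  // Base64: bytes -> text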

Example 11-54 shows how we make use of that in our GetCipherText method.

Example 11-54. Converting to Base64

private static string GetCipherText(MemoryStream memoryStream)
{
    byte[] buffer = memoryStream.ToArray();
    return System.Convert.ToBase64String(buffer, 0, buffer.Length);
}

We use a method on MemoryStream called ToArray to get a byte[] array containing all the data written to the stream.

Warning

Don’t be caught out by the GetBuffer method, which also returns a byte[] array. GetBuffer returns the whole underlying buffer, including any “extra” bytes that have been allocated but not yet used.

Finally, we call Convert.ToBase64String to get a string representation of the underlying data, passing it the byte[], along with a start offset into that buffer of zero (so that we start with the first byte), and the length.

That takes care of encryption. How about decryption? That’s actually a little bit easier. Example 11-55 shows how.

Example 11-55. Decryption

private static string DecryptString(string cipherText, byte[] key, byte[] iv)
{
    // Create a crypto service provider for the TripleDES algorithm
    var serviceProvider = new TripleDESCryptoServiceProvider();

    // Decode the cipher-text bytes back from the base-64 encoded string
    byte[] cipherTextBytes = Convert.FromBase64String(cipherText);

    // Create a memory stream over those bytes
    using (MemoryStream memoryStream = new MemoryStream(cipherTextBytes))
    // And create a cryptographic stream over the memory stream,
    // using the specified algorithm
    // (with the provided key and initialization vector)
    using (var cryptoStream =
                  new CryptoStream(
                      memoryStream,
                      serviceProvider.CreateDecryptor(key, iv),
                      CryptoStreamMode.Read))
    // Finally, create a stream reader over the stream, and recover the
    // original text
    using (StreamReader reader = new StreamReader(cryptoStream))
    {
        return reader.ReadToEnd();
    }
}

First, we use Convert.FromBase64String to convert our Base64 encoded string back to an array of bytes. We then construct a MemoryStream over that byte[] by passing it to the appropriate constructor.

As before, we wrap the MemoryStream with a CryptoStream, this time passing it the ICryptoTransform created by a call to CreateDecryptor on our TripleDESCryptoServiceProvider, and putting it into CryptoStreamMode.Read.

Finally, we construct our old friend the StreamReader over the CryptoStream, and read the content back as a string.

So, what’s actually happening here?

CryptoStream uses the ICryptoTransform to take care of turning the cipher text in the MemoryStream back into plain text. If you remember, that plain text is actually the set of UTF-8 encoded bytes we originally wrote to the stream with the StreamWriter back in the encryption phase. So, the StreamReader takes those and converts them back into a string for us. You can see that illustrated in Figure 11-14.

This is a very powerful example of how we can plug together various components in a kind of pipeline to achieve quite complex processing, from simple, easily understood building blocks that conform to a common pattern, but which have no dependencies on each other’s implementation details. The Stream abstraction is the key to this flexibility.

Encryption and decryption pipeline using streams

Figure 11-14. Encryption and decryption pipeline using streams

Summary

In this chapter we looked at the classes in the System.IO namespace that relate to files and streams. We saw how we can use static methods on the File, Directory, and Path classes to manage and manipulate files and folders in the filesystem, including creating, deleting, appending, and truncating data, as well as managing their access permissions.

We saw how to use StreamReader and StreamWriter to deal with reading and writing text from files, and how we can also read and write binary data using the underlying Stream objects themselves, including the ability to Seek backward and forward in the file.

We then looked at a special type of file stream called isolated storage. This gives us the ability to manage the scope of file access to particular users, machines, applications, or even assemblies. We gain control over quotas (the maximum amount of space any particular store is allowed to use), and get to use local file storage in normally restricted security contexts like that of a Silverlight application, for example.

Finally, we looked at some streams that aren’t files, including MemoryStream, which uses memory as its underlying storage mechanism, and CryptoStream, which has no storage of its own, delegating that responsibility to another stream. We showed how these patterns can be used to plug streams together into a processing pipeline.



