Almost all programmers have to deal with storing, retrieving, and processing information in files at some time or another. The .NET Framework provides a number of classes and methods we can use to find, create, read, and write files and directories. In this chapter we’ll look at some of the most common.
Files, though, are just one example of a broader group of entities that can be opened, read from, and/or written to in a sequential fashion, and then closed. .NET defines a common contract, called a stream, that is offered by all types that can be used in this way. We’ll see how and why we might access a file through a stream, and then we’ll look at some other types of streams, including a special storage medium called isolated storage which lets us save and load information even when we are in a lower-trust environment (such as the Silverlight sandbox). Finally, we’ll look at some of the other stream implementations in .NET by way of comparison. (Streams crop up in all sorts of places, so this chapter won’t be the last we see of them—they’re important in networking, for example.)
We, the authors of this book, have often heard our
colleagues ask for a program to help them find duplicate files on their
system. Let’s write something to do exactly that. We’ll pass the names of
the directories we want to search on the command line, along with an
optional switch to determine whether we want to recurse into
subdirectories or not. In the first instance, we’ll do a very basic check
for similarity based on filenames and sizes, as these are relatively cheap
options. Example 11-1 shows our
Main
function.
Example 11-1. Main method of duplicate file finder
static void Main(string[] args)
{
    bool recurseIntoSubdirectories = false;

    if (args.Length < 1)
    {
        ShowUsage();
        return;
    }

    int firstDirectoryIndex = 0;
    if (args.Length > 1)
    {
        // see if we're being asked to recurse
        if (args[0] == "/sub")
        {
            if (args.Length < 2)
            {
                ShowUsage();
                return;
            }
            recurseIntoSubdirectories = true;
            firstDirectoryIndex = 1;
        }
    }

    // Get list of directories from command line.
    var directoriesToSearch = args.Skip(firstDirectoryIndex);

    List<FileNameGroup> filesGroupedByName = InspectDirectories(
        recurseIntoSubdirectories, directoriesToSearch);

    DisplayMatches(filesGroupedByName);

    Console.ReadKey();
}
The basic structure is pretty straightforward. First we inspect the
command-line arguments to work out which directories we’re searching. Then
we call InspectDirectories
(shown
later) to build a list of all the files in those directories. This groups
the files by filename (without the full path) because we do not consider
two files to be duplicates if they have different names. Finally, we pass
this list to DisplayMatches, which
displays any potential matches in the files we have found. DisplayMatches
refines our test for duplicates
further—it considers two files with the same name to be duplicates only if
they have the same size. (That’s not foolproof, of course, but it’s
surprisingly effective, and we will refine it further later in the
chapter.)
Let’s look at each of these steps in more detail.
The code that parses the command-line arguments does a quick check
to see that we’ve provided at least one command-line argument (in addition
to the /sub
switch if present) and we
print out some usage instructions if not, using the method shown in
Example 11-2.
Example 11-2. Showing command line usage
private static void ShowUsage()
{
    Console.WriteLine("Find duplicate files");
    Console.WriteLine("====================");
    Console.WriteLine(
        "Looks for possible duplicate files in one or more directories");
    Console.WriteLine();
    Console.WriteLine(
        "Usage: findduplicatefiles [/sub] DirectoryName [DirectoryName] ...");
    Console.WriteLine("/sub - recurse into subdirectories");
    Console.ReadKey();
}
The next step is to build a list of files grouped by name. We define
a couple of classes for this, shown in Example 11-3. We create a
FileNameGroup
object for each distinct
filename. Each FileNameGroup
contains a
nested list of FileDetails, providing
the full path of each file that has that name, and also the size of that
file.
Example 11-3. Types used to keep track of the files we’ve found
class FileNameGroup
{
    public string FileNameWithoutPath { get; set; }
    public List<FileDetails> FilesWithThisName { get; set; }
}

class FileDetails
{
    public string FilePath { get; set; }
    public long FileSize { get; set; }
}
For example, suppose the program searches two folders, c:\One and c:\Two, and suppose both of those folders
contain a file called Readme.txt. Our
list will contain a FileNameGroup
whose
FileNameWithoutPath
is Readme.txt. Its nested FilesWithThisName
list will contain two FileDetails
entries, one with a FilePath
of c:\One\Readme.txt and the other with c:\Two\Readme.txt. (And each FileDetails
will contain the size of the
relevant file in FileSize
. If these two
files really are copies of the same file, their sizes will, of course, be
the same.)
We build these lists in the InspectDirectories
method, which is shown in Example 11-4.
This contains the meat of the program, because this is where we search the
specified directories for files. Quite a lot of the code is concerned with
the logic of the program, but this is also where we start to use some of
the file APIs.
Example 11-4. InspectDirectories method
private static List<FileNameGroup> InspectDirectories(
    bool recurseIntoSubdirectories,
    IEnumerable<string> directoriesToSearch)
{
    var searchOption = recurseIntoSubdirectories ?
        SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;

    // Get the path of every file in every directory we're searching.
    var allFilePaths = from directory in directoriesToSearch
                       from file in Directory.GetFiles(directory, "*.*",
                                                       searchOption)
                       select file;

    // Group the files by local filename (i.e. the filename without the
    // containing path), and for each filename, build a list containing the
    // details for every file that has that filename.
    var fileNameGroups = from filePath in allFilePaths
                         let fileNameWithoutPath = Path.GetFileName(filePath)
                         group filePath by fileNameWithoutPath into nameGroup
                         select new FileNameGroup
                         {
                             FileNameWithoutPath = nameGroup.Key,
                             FilesWithThisName =
                                 (from filePath in nameGroup
                                  let info = new FileInfo(filePath)
                                  select new FileDetails
                                  {
                                      FilePath = filePath,
                                      FileSize = info.Length
                                  }).ToList()
                         };

    return fileNameGroups.ToList();
}
To get it to compile, you’ll need to add:
using System.IO;
The parts of Example 11-4 that use
the System.IO
namespace to work with
files and directories have been highlighted. We’ll start by looking at the
use of the Directory
class.
Our InspectDirectories
method calls the static GetFiles
method on the
Directory
class to find the files we’re
interested in. Example 11-5 shows
the relevant code.
Example 11-5. Getting the files in a directory
var searchOption = recurseIntoSubdirectories ?
SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;
// Get the path of every file in every directory we're searching.
var allFilePaths = from directory in directoriesToSearch
from file in Directory.GetFiles(directory, "*.*",
searchOption)
select file;
The overload of GetFiles
we’re
calling takes the directory we’d like to search, a filter (in the standard
command-line form), and a value from the SearchOption
enumeration, which determines
whether to recurse down through all the subfolders.
We’re using LINQ to Objects to build a list of all the files we
require. As you saw in Chapter 8, a query with multiple
from
clauses works in a similar way
to nested foreach
loops. The code in
Example 11-5 will end up calling
GetFiles
for each directory passed on
the command line, and it will effectively concatenate the results of all
those calls into a single list of files.
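As a sketch of that equivalence, the following compares a multiple-from query with the corresponding nested foreach loops. To keep the comparison self-contained, a made-up in-memory dictionary stands in for Directory.GetFiles; the directory and file names are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NestedLoopsSketch
{
    static void Main()
    {
        // In-memory stand-in for Directory.GetFiles, so the
        // equivalence is easy to see without touching the disk.
        var fakeFileSystem = new Dictionary<string, string[]>
        {
            { @"c:\One", new[] { @"c:\One\Readme.txt" } },
            { @"c:\Two", new[] { @"c:\Two\Readme.txt", @"c:\Two\Notes.txt" } },
        };
        var directoriesToSearch = fakeFileSystem.Keys;

        // The query form...
        var fromQuery = (from directory in directoriesToSearch
                         from file in fakeFileSystem[directory]
                         select file).ToList();

        // ...and the equivalent nested foreach loops.
        var fromLoops = new List<string>();
        foreach (var directory in directoriesToSearch)
        {
            foreach (var file in fakeFileSystem[directory])
            {
                fromLoops.Add(file);
            }
        }

        // Both produce the same concatenated sequence.
        Console.WriteLine(fromQuery.SequenceEqual(fromLoops)); // True
    }
}
```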
The GetFiles
method returns the
full path for each file concerned, but when it comes to finding matches,
we just want the filename. We can use the Path
class to get the filename from the full
path.
The Path
class provides
methods for manipulating strings containing file paths. Imagine we have
the path c:\directory1\directory2\MyFile.txt. Table 11-1 shows you how you can
slice that with various different Path
methods.
Table 11-1. The effect of various Path methods
Method name | Result |
---|---|
GetDirectoryName | c:\directory1\directory2 |
GetExtension | .txt |
GetFileName | MyFile.txt |
GetFileNameWithoutExtension | MyFile |
GetFullPath | c:\directory1\directory2\MyFile.txt |
GetPathRoot | c:\ |
What if we use a network path? Table 11-2 shows the results of the same methods when applied to this path:
\\MyPC\Share1\directory2\MyFile.txt
Table 11-2. The effect of various Path methods with a network path
Method name | Result |
---|---|
GetDirectoryName | \\MyPC\Share1\directory2 |
GetExtension | .txt |
GetFileName | MyFile.txt |
GetFileNameWithoutExtension | MyFile |
GetFullPath | \\MyPC\Share1\directory2\MyFile.txt |
GetPathRoot | \\MyPC\Share1 |
Notice how the path root includes the network hostname and the share name.
What happens if we don’t use a full path, but one relative to the current directory? And what’s the current directory anyway?
The framework maintains a process-wide idea of the current working directory, which is
the root path relative to which any file operations that do
not fully qualify the path are made. The Directory
class (as you might imagine) gives
us the ability to manipulate it. Rather than a static property, there
are two static methods to query and set the current value: GetCurrentDirectory
and SetCurrentDirectory. Example 11-6 shows a call to the
latter.
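A sketch of what such a call looks like: Table 11-3’s results assume the working directory ends up as c:\directory1, but this sketch substitutes a directory it creates under the temp folder so it can run anywhere (the "directory1" name is just for illustration):

```csharp
using System;
using System.IO;

class CurrentDirectorySketch
{
    static void Main()
    {
        // Create a directory we know exists, then make it current.
        string target = Path.Combine(Path.GetTempPath(), "directory1");
        Directory.CreateDirectory(target);

        Directory.SetCurrentDirectory(target);

        // Relative file operations will now resolve against this path.
        Console.WriteLine(Directory.GetCurrentDirectory());
    }
}
```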
Table 11-3 shows
the results we’d get if we passed @"directory2\MyFile.txt"
to the various
Path
methods after having run the
code in Example 11-6. As you can
see, most of the results reflect the fact that we’ve not provided a full
path, but there’s one exception: GetFullPath
uses the current working directory
if we provide it with a relative path.
Table 11-3. The effect of various Path methods with a relative path
Method name | Result |
---|---|
GetDirectoryName | directory2 |
GetExtension | .txt |
GetFileName | MyFile.txt |
GetFileNameWithoutExtension | MyFile |
GetFullPath | c:\directory1\directory2\MyFile.txt |
GetPathRoot | <blank> |
Path
doesn’t check that the
named file exists. It only looks at the input string and, in the case
of GetFullPath
, the current working
directory.
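A quick sketch demonstrating that these really are purely string operations; the filename here is deliberately one that doesn’t exist:

```csharp
using System;
using System.IO;

class PurelyTextualSketch
{
    static void Main()
    {
        // Neither a directory nor a file by this name exists;
        // Path just slices the string it is given.
        Console.WriteLine(Path.GetExtension("no-such-file.txt"));
        Console.WriteLine(
            Path.GetFileNameWithoutExtension("no-such-file.txt"));
    }
}
```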
OK, in our example, we just want the filename without the path, so
we use Path.GetFileName
to retrieve
it. Example 11-7 shows
the relevant piece of Example 11-4.
Example 11-7. Getting the filename without the full path
var fileNameGroups = from filePath in allFilePaths
let fileNameWithoutPath = Path.GetFileName(filePath)
group filePath by fileNameWithoutPath into nameGroup
select ...
We then use the LINQ group operator (which was described in Chapter 8) to group all of the files by name.
Path
contains a lot of other
useful members that we’ll need a little bit later; but we can leave it
for the time being, and move on to the other piece of information that
we need for our matching code: the file size. The .NET Framework
provides us with a class called FileInfo
that contains
a whole bunch of members that help us to discover things about a
file.
The various functions from the System.IO
classes we’ve dealt with so far have
all been static, but when it comes to retrieving information such as file
size, we have to create an instance of a FileInfo
object, passing its constructor the
path of the file we’re interested in. That path can be either an absolute
path like the ones we’ve seen already, or a path relative to the current
working directory. FileInfo
has a lot
of overlapping functionality with other classes. For example, it provides
a few helpers similar to Path
to get
details of the directory, filename, and extension.
However, the only method we’re really interested in for our example
is its Length
property, which tells us
the size of the file. Every other member on FileInfo
has a functional equivalent on other
classes in the framework. Even Length
is duplicated on the stream classes we’ll come to later, but it is simpler
for us to use FileInfo
if we don’t
intend to open the file itself.
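As a minimal sketch of that pattern (the file and its contents are invented for illustration): create a small file, then read its size back through FileInfo:

```csharp
using System;
using System.IO;

class FileInfoSketch
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Ten ASCII characters -> a ten-byte file (UTF-8, no BOM).
            File.WriteAllText(path, "0123456789");

            // FileInfo is an instance type: construct it with the path.
            var info = new FileInfo(path);
            Console.WriteLine(info.Length); // 10
        }
        finally
        {
            // Clean up the temporary file.
            File.Delete(path);
        }
    }
}
```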
We use FileInfo
in the final part
of InspectDirectories
, to put the file
size into the per-file details. Example 11-8
shows the relevant excerpt from Example 11-4.
Example 11-8. Getting the file size
... select new FileNameGroup { FileNameWithoutPath = nameGroup.Key, FilesWithThisName = (from filePath in nameGroup let info =new FileInfo(filePath)
select new FileDetails { FilePath = filePath, FileSize =info.Length
}).ToList() };
We’re now only one method short of a sort-of-useful program, and
that’s the one that trawls through this information to find and display
matches: DisplayMatches, which is shown
in Example 11-9.
Example 11-9. DisplayMatches
private static void DisplayMatches(
    IEnumerable<FileNameGroup> filesGroupedByName)
{
    var groupsWithMoreThanOneFile = from nameGroup in filesGroupedByName
                                    where nameGroup.FilesWithThisName.Count > 1
                                    select nameGroup;

    foreach (var fileNameGroup in groupsWithMoreThanOneFile)
    {
        // Group the matches by the file size, then select those
        // with more than 1 file of that size.
        var matchesBySize = from file in fileNameGroup.FilesWithThisName
                            group file by file.FileSize into sizeGroup
                            where sizeGroup.Count() > 1
                            select sizeGroup;

        foreach (var matchedBySize in matchesBySize)
        {
            string fileNameAndSize = string.Format("{0} ({1} bytes)",
                fileNameGroup.FileNameWithoutPath, matchedBySize.Key);
            WriteWithUnderlines(fileNameAndSize);
            // Show each of the directories containing this file
            foreach (var file in matchedBySize)
            {
                Console.WriteLine(Path.GetDirectoryName(file.FilePath));
            }
            Console.WriteLine();
        }
    }
}

private static void WriteWithUnderlines(string text)
{
    Console.WriteLine(text);
    Console.WriteLine(new string('-', text.Length));
}
We start with a LINQ query that looks for the filenames that crop up
in more than one folder, because those are the only candidates for being
duplicates. We iterate through each such name with a foreach
loop. Inside that loop, we run another
LINQ query that groups the files of that name by size—see the first
emphasized lines in Example 11-9. If InspectDirectories
discovered three files called
Program.cs, for example, and two of
them were 278 bytes long while the other was 894 bytes long, this group
clause would separate those three files
into two groups. The where
clause in
the same query removes any groups that contain only one file.
So the matchesBySize
variable
refers to a query that returns a group for each set of two or more files
that have the same size (and because we’re inside a loop that iterates
through the names, we already know they have the same name). Those are our
duplicate candidates. We then write out the filename and size (and an
underline separator of the same length). Finally, we write out each file
location containing candidate matches using Path.GetDirectoryName.
If we compile and run that lot, we’ll see the following output:
Find duplicate files
====================
Looks for possible duplicate files in one or more directories

Usage: findduplicatefiles [/sub] DirectoryName [DirectoryName] ...
/sub - recurse into subdirectories
We haven’t given it anywhere to look! How are we going to test our
application? Well, we could provide it with some command-line parameters.
If you open the project properties and switch to the Debug
tab, you’ll see a
place where you can add command-line arguments (see Figure 11-1).
However, we could do a bit better for test purposes. Example 11-10 shows a modified Main
that supports a new /test
command-line
switch, which we can use to create test files and exercise the
function.
Example 11-10. Adding a /test switch
static void Main(string[] args)
{
    bool recurseIntoSubdirectories = false;

    if (args.Length < 1)
    {
        ShowUsage();
        return;
    }

    int firstDirectoryIndex = 0;
    IEnumerable<string> directoriesToSearch = null;
    bool testDirectoriesMade = false;
    try
    {
        // Check to see if we are running in test mode
        if (args.Length == 1 && args[0] == "/test")
        {
            directoriesToSearch = MakeTestDirectories();
            testDirectoriesMade = true;
            recurseIntoSubdirectories = true;
        }
        else
        {
            if (args.Length > 1)
            {
                // see if we're being asked to recurse
                if (args[0] == "/sub")
                {
                    if (args.Length < 2)
                    {
                        ShowUsage();
                        return;
                    }
                    recurseIntoSubdirectories = true;
                    firstDirectoryIndex = 1;
                }
            }

            // Get list of directories from command line.
            directoriesToSearch = args.Skip(firstDirectoryIndex);
        }

        List<FileNameGroup> filesGroupedByName = InspectDirectories(
            recurseIntoSubdirectories, directoriesToSearch);

        DisplayMatches(filesGroupedByName);

        Console.ReadKey();
    }
    finally
    {
        if (testDirectoriesMade)
        {
            CleanupTestDirectories(directoriesToSearch);
        }
    }
}
In order to operate in test mode, we’ve added an alternative way to
initialize the variable that holds the list of directories (directoriesToSearch). The original code, which
initializes it from the command-line arguments (skipping over the /sub
switch if present), is still present.
However, if we find the /test
switch,
we initialize it to point at some test directories we’re going to create
(in the MakeTestDirectories
method).
The rest of the code can then be left as it was (to avoid running some
completely different program in our test mode). Finally, we add a bit of
cleanup code at the end to remove any test directories if we created
them.
So, how are we going to implement MakeTestDirectories? We want to create some
temporary files, and write some content into them to exercise the various
matching possibilities.
A quick look at Path
reveals the GetTempFileName
method.
This creates a file of zero length in a directory dedicated to temporary
files, and returns the path to that file.
It is important to note that the file is actually created, whether you use it or not, and so you are responsible for cleaning it up when you are done, even if you don’t make any further use of it.
Let’s create another test console application, just to try out that method. We can do that by adding the following to our main function:
string fileName = Path.GetTempFileName();
// Display the filename
Console.WriteLine(fileName);
// And wait for some input
Console.ReadKey();
But wait! If we just compile and run that, we’ll leave the file we created behind on the system. We should make sure we delete it again when we’re done. There’s nothing special about a temporary file. We create it in an unusual way, and it ends up in a particular place, but once it has been created, it’s just like any other file in the filesystem. So, we can delete it the same way we’d delete any other file.
The System.IO
namespace
provides the File
class, which offers
various methods for doing things with files. Deleting is particularly
simple: we just use the static Delete
method, as Example 11-11 shows.
Example 11-11. Deleting a file
string fileName = Path.GetTempFileName();
try
{
    // Use the file
    // ...

    // Display the filename
    Console.WriteLine(fileName);
    // And wait for some input
    Console.ReadKey();
}
finally
{
    // Then clean it up
    File.Delete(fileName);
}
Notice that we’ve wrapped the code in which we (could) manipulate
the file further in a try
block, and
deleted it in a finally
block. This
ensures that whatever happens, we’ll always attempt to clean up after
ourselves.
If you compile and run this test project now, you’ll see some output like this:
C:\Users\yourusername\AppData\Local\Temp\tmpCA8F.tmp
The exact text will depend on your operating system version, your username, and (of course) the random filename that was created for you. If you browse to that path, you will see a zero-length file of that name.
If you then press a key, allowing Console.ReadKey
to return, it will drop through
to the finally
block, where we delete
the temporary file, using the static Delete
method on the File
class.
There are lots of scenarios where this sort of temporary file
creation is just fine, but it doesn’t really suit our example
application’s needs. We want to create multiple temporary files, in
multiple different directories. GetTempFileName
doesn’t really do the job for
us.
If we look at Path
again, though,
there’s another likely looking method: GetRandomFileName. This
returns a random string of characters that can be used as either a file or
a directory name. It uses a cryptographically strong random number generator (which can be useful in some
security-conscious scenarios), and is statistically likely to produce a
unique name, thus avoiding clashes. Unlike GetTempFileName, it doesn’t actually create the
file (or directory); that’s up to us.
If you run the code in Example 11-12, you’ll see output similar to this:
xnicz3rs.juc
(Obviously, the actual characters you see will, hopefully, be different, or the statistical uniqueness isn’t all that unique!)
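Incidentally, the names appear to come out in 8.3 form: eight characters, a dot, then a three-character extension. A quick sketch to check that shape (the class name is just for illustration):

```csharp
using System;
using System.IO;

class RandomNameSketch
{
    static void Main()
    {
        string name = Path.GetRandomFileName();

        // Eight characters, a dot at index 8, then three more:
        // twelve characters in all.
        Console.WriteLine(name.Length == 12 && name[8] == '.'); // True
    }
}
```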
So, we can use that method to produce our test file and directory names. But where are we going to put the files? Perhaps one of the various “well-known folders” Windows offers would suit our needs.
Most operating systems have a bunch of well-known filesystem locations, and Windows is no exception. There are designated folders for things like the current user’s documents, pictures, or desktop; the program files directory where applications are installed; and the system folder.
The .NET Framework provides a class called Environment
that provides information about the
world our program runs in. Its static method GetFolderPath
is the one that interests us right
now, because it will return the path of various well-known folders. We
pass it one of the Environment.SpecialFolder
enumeration values. Example 11-13 retrieves the location
of one of the folders in which applications can store per-user
data.
Example 11-13. Getting a well-known folder location
string path = Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData);
Table 11-4 lists all of the well-known
folders that GetFolderPath
can return,
and the location they give on the installed copy of Windows 7 (64-bit)
belonging to one of the authors.
Table 11-4. Special folders
Enumeration | Example location | Purpose |
---|---|---|
ApplicationData | C:\Users\mwa\AppData\Roaming | A place for applications to store their own private information for a particular user; this may be located on a shared server, and available across multiple logins for the same user, on different machines, if the user’s domain policy is configured to do so. |
CommonApplicationData | C:\ProgramData | A place for applications to store their own private information accessible to all users. |
CommonProgramFiles | C:\Program Files\Common Files | A place where shared application components can be installed. |
Cookies | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Cookies | The location where Internet cookies are stored for this user; another potentially roaming location. |
Desktop | C:\Users\mwa\Desktop | The current user’s desktop (virtual) folder. |
DesktopDirectory | C:\Users\mwa\Desktop | The physical directory where filesystem objects on the desktop are stored (currently, but not necessarily, the same as Desktop). |
Favorites | C:\Users\mwa\Favorites | The directory containing the current user’s favorites links. |
History | C:\Users\mwa\AppData\Local\Microsoft\Windows\History | The directory containing the current user’s Internet history. |
InternetCache | C:\Users\mwa\AppData\Local\Microsoft\Windows\Temporary Internet Files | The directory that contains the current user’s Internet cache. |
LocalApplicationData | C:\Users\mwa\AppData\Local | A place for applications to store their private data associated with the current user. This is guaranteed to be on the local machine (as opposed to ApplicationData, which may roam with the user). |
MyComputer | <blank> | This is always an empty string because there is no real folder that corresponds to My Computer. |
MyDocuments | C:\Users\mwa\Documents | The folder in which the current user’s documents (as opposed to private application datafiles) are stored. |
MyMusic | C:\Users\mwa\Music | The folder in which the current user’s music files are stored. |
MyPictures | C:\Users\mwa\Pictures | The folder in which the current user’s picture files are stored. |
Personal | C:\Users\mwa\Documents | The folder in which the current user’s documents are stored (synonymous with MyDocuments). |
ProgramFiles | C:\Program Files | The directory in which applications are installed. Note that there is no special folder enumeration for the 32-bit applications directory on 64-bit Windows. |
Programs | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Start Menu\Programs | The location where application shortcuts in the Start menu’s Programs section are stored for the current user. This is another potentially roaming location. |
Recent | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Recent | The folder where links to recently used documents are stored for the current user. This is another potentially roaming location. |
SendTo | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\SendTo | The location that contains the links that form the Send To menu items in the shell. This is another potentially roaming location. |
StartMenu | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Start Menu | The folder that contains the Start menu items for the current user. This is another potentially roaming location. |
Startup | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup | The folder that contains links to programs that will run each time the current user logs in. This is another potentially roaming location. |
System | C:\Windows\system32 | The Windows system folder. |
Templates | C:\Users\mwa\AppData\Roaming\Microsoft\Windows\Templates | A location in which applications can store document templates for the current user. Again, this is a potentially roaming location. |
Notice that this doesn’t include all of the well-known folders we have these days, because the set of folders grows with each new version of Windows. Things like Videos, Games, Downloads, Searches, and Contacts are all missing. It also doesn’t support Windows 7 libraries in any meaningful sense. This is (sort of) by design. The method provides a lowest common denominator approach to finding useful folders on the system, in a way that works across all supported versions of the framework (including Windows Mobile).
So, we need to choose a path in which our current user is likely to have permission to create/read/write and delete files and directories. It doesn’t have to be one that the user can see under normal circumstances. In fact, we’re going to create files with extensions that are not bound to any applications and we should not do that in a place that’s visible to the user if we want our application to be a good Windows citizen.
If you create a file in a place that’s visible to the user, like Documents or Desktop, you should ensure that it always has a default application associated with it.
There are two candidates for this in Table 11-4: LocalApplicationData
and ApplicationData. Both of
these offer places for applications to store files that the user wouldn’t
normally see. (Of course, users can find these folders if they look hard
enough. The goal here is to avoid putting our temporary test files in the
same folders as the user’s documents.)
The difference between these two folders is that if the user has a
roaming profile, files in the latter folder will be copied around the
network as they move from one machine to another, while files in the
former folder remain on the machine on which they were created. We’re
building temporary files for test purposes, so LocalApplicationData
looks like the right
choice.
So, let’s return to our demo application, and start to implement the
MakeTestDirectories
method. The first thing we need to do is to create a few test directories.
Example 11-14 contains some code to do
that.
Example 11-14. Creating test directories
private static string[] MakeTestDirectories()
{
    string localApplicationData = Path.Combine(
        Environment.GetFolderPath(
            Environment.SpecialFolder.LocalApplicationData),
        @"Programming CSharp\FindDuplicates");

    // Let's make three test directories
    var directories = new string[3];
    for (int i = 0; i < directories.Length; ++i)
    {
        string directory = Path.GetRandomFileName();
        // Combine the local application data with the
        // new random file/directory name
        string fullPath = Path.Combine(localApplicationData, directory);
        // And create the directory
        Directory.CreateDirectory(fullPath);
        directories[i] = fullPath;
        Console.WriteLine(fullPath);
    }
    return directories;
}
First, we use the GetFolderPath
method to
get the LocalApplicationData
path. But
we don’t want to work directly in that folder—applications are meant to
create their own folders underneath this. Normally you’d create a folder
named either for your company or for your organization, and then an
application-specific folder inside that—we’ve used Programming CSharp as the organization name
here, and FindDuplicates as the
application name. We then use a for
loop to create three directories with random names inside that. To create
these new directories, we’ve used a couple of new methods: Path.Combine
and Directory.CreateDirectory.
If you’ve written any code that manipulates paths before,
you’ll have come across the leading/trailing slash dilemma. Does your path
fragment have one or not? You also need to know whether the path fragment
you’re going to append really is a relative path—are there circumstances under which you
might need to deal with a fully qualified path instead? Path.Combine
does away
with all that anxiety. Not only will it check all those things for you and
do the right thing, but it will even check that your paths contain only
valid path characters.
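A couple of those guarantees can be checked directly. This sketch is written with Path.DirectorySeparatorChar so it behaves the same on any platform (on Windows the separator is the backslash):

```csharp
using System;
using System.IO;

class CombineSketch
{
    static void Main()
    {
        char sep = Path.DirectorySeparatorChar;

        // A trailing separator on the first fragment makes no difference;
        // Combine inserts the separator only when it's needed.
        string a = Path.Combine("directory1", "directory2");
        string b = Path.Combine("directory1" + sep, "directory2");
        Console.WriteLine(a == b); // True

        // If the second path is already rooted, the result is just
        // that second path.
        string rooted = Path.Combine("directory1", sep + "directory2");
        Console.WriteLine(rooted == sep + "directory2"); // True
    }
}
```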
Table 11-5 contains some
example paths, and the result of combining them with Path.Combine.
Table 11-5. Example results of Path.Combine
Path 1 | Path 2 | Combined |
---|---|---|
c:\directory1 | directory2 | c:\directory1\directory2 |
c:\directory1\ | directory2 | c:\directory1\directory2 |
c:\directory1 | directory2\ | c:\directory1\directory2\ |
directory1 | directory2 | directory1\directory2 |
c:\directory1 | c:\directory2 | c:\directory2 |
The last entry in that table is particularly interesting: notice that the second path is absolute, and so the combined path is “optimized” to just that second path.
In our case, Example 11-14 combines the well-known folder with a subfolder name to get a folder location specific to this example. And then it combines that with our new temporary folder names, ready for creation.
Directory.CreateDirectory
is very straightforward: it does exactly what its name suggests. In fact,
it will create any directories in the whole path that do not already
exist, so you can create a deep hierarchy with a single call. (You’ll
notice that Example 11-14 didn’t bother to
create the Programming CSharp\FindDuplicates folders—those will get created
automatically the first time we run as a result of creating the temporary
folders inside them.) A side effect of this is that it is safe to call it
if all of the directories in the path already exist—it will just do
nothing.
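A minimal sketch of both behaviors, using a throwaway location under the temp folder (the folder names are borrowed from the example for illustration):

```csharp
using System;
using System.IO;

class CreateDirectorySketch
{
    static void Main()
    {
        string root = Path.Combine(Path.GetTempPath(),
                                   Path.GetRandomFileName());
        string deep = Path.Combine(root, "Programming CSharp",
                                   "FindDuplicates");
        try
        {
            // One call creates every missing level of the hierarchy...
            Directory.CreateDirectory(deep);
            Console.WriteLine(Directory.Exists(deep)); // True

            // ...and calling it again when everything exists is a no-op.
            Directory.CreateDirectory(deep);
            Console.WriteLine(Directory.Exists(deep)); // True
        }
        finally
        {
            // Remove the throwaway hierarchy.
            Directory.Delete(root, recursive: true);
        }
    }
}
```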
In addition to the overload we’ve used, there’s a second
which also takes a DirectorySecurity
parameter:
Directory.CreateDirectory(string path, DirectorySecurity directorySecurity)
The DirectorySecurity
class
allows you to specify filesystem access controls with a relatively simple
programming model. If you’ve tried using the Win32 ACL APIs, you’ll know
that it is a nightmare of GUIDs, SSIDs, and lists sensitive to item
ordering. This model does away with much of the complexity.
Let’s extend our create function to make sure that only our current user has read/write/modify permissions on these directories. Example 11-15 modifies the previous example by explicitly granting the current user full control of the newly created folders. The new or changed lines are highlighted.
Example 11-15. Configuring access control on new directories
private static string[] MakeTestDirectories()
{
    string localApplicationData = Path.Combine(
        Environment.GetFolderPath(
            Environment.SpecialFolder.LocalApplicationData),
        @"Programming CSharp\FindDuplicates");

    // Get the name of the logged in user
    string userName = WindowsIdentity.GetCurrent().Name;
    // Make the access control rule
    FileSystemAccessRule fsarAllow = new FileSystemAccessRule(
        userName,
        FileSystemRights.FullControl,
        AccessControlType.Allow);
    DirectorySecurity ds = new DirectorySecurity();
    ds.AddAccessRule(fsarAllow);

    // Let's make three test directories
    var directories = new string[3];
    for (int i = 0; i < directories.Length; ++i)
    {
        string directory = Path.GetRandomFileName();
        // Combine the local application data with the
        // new random file/directory name
        string fullPath = Path.Combine(localApplicationData, directory);
        // And create the directory
        Directory.CreateDirectory(fullPath, ds);
        directories[i] = fullPath;
        Console.WriteLine(fullPath);
    }
    return directories;
}
You’ll need to add a couple of using
directives to the top of the file before
you can compile this code:
using System.Security.AccessControl;
using System.Security.Principal;
What do these changes do? First, we make use of a type called
WindowsIdentity
to find the current
user, and fish out its name. If you happen to want to specify the name
explicitly, rather than get the current user programmatically, you can do
so (e.g., MYDOMAIN\SomeUserId).
Then, we create a FileSystemAccessRule
, passing it the username,
the FileSystemRights
we want to set,
and a value from the AccessControlType
enumeration which determines whether we are allowing or denying those
rights.
If you take a look at the FileSystemRights
enumeration in MSDN, you should
recognize the options from the Windows security permissions dialog in the
shell. You can combine the individual values (as it is a Flags
enumeration), or use one of the precanned
sets as we have here.
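For example, here's a hedged sketch of building a rule from individual flags rather than one of the precanned sets (this particular combination is our own illustration, not something the duplicate finder needs):

```csharp
using System.Security.AccessControl;

class RightsExample
{
    public static FileSystemAccessRule MakeReadWriteRule(string userName)
    {
        // FileSystemRights is a Flags enumeration, so individual
        // values can be combined with the | operator. This trio is
        // an arbitrary example combination.
        FileSystemRights rights = FileSystemRights.ReadData
                                | FileSystemRights.WriteData
                                | FileSystemRights.AppendData;

        return new FileSystemAccessRule(
            userName, rights, AccessControlType.Allow);
    }
}
```

The combined value can then be passed to a `DirectorySecurity` via `AddAccessRule`, just as with `FullControl` in Example 11-15.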
If you compile this application, and modify the debug settings to
pass just the /test
switch as the only
command-line argument, when you run it you’ll see output similar to the
following (but with your user ID, and some different random directory
names):
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\yzw0iw3p.ysq
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\qke5k2ql.5et
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\5hkhspqa.osc
If we take a look at the folder in Explorer, you should see your new directories (something like Figure 11-2).
If you right-click on one of these and choose Properties, then examine the Security tab, you should see something like Figure 11-3.
Notice how the only user with permissions on
this directory is the currently logged on user (in this case ian
, on a domain called idg.interact
). All of the usual inherited
permissions have been overridden. Rather than the regular
read/modify/write checkboxes, we’ve apparently got special permissions. This is because we
set them explicitly in the code.
We can have a look at that in more detail if we click the Advanced
button, and switch to the Effective Permissions tab. Click the Select
button to pick a user (see Figure 11-4). First,
let’s look at the effective permissions for the local administrator (this
is probably MachineName\Administrator
, unless you’ve changed your
default administrator name to try to make things slightly harder for an
attacker).
If you click OK, you’ll see the effective permissions for Administrator on that folder (Figure 11-5).
You can scroll the scroll bar to prove it for yourself, but you can see that even Administrator cannot actually access your folder! (This is not, of course, strictly true. Administrators can take ownership of the folder and mess with the permissions themselves, but they cannot access the folder without changing the permissions first.) Try again with your own user ID. You will see results similar to Figure 11-6—we have full control. Scroll the list and you’ll see that everything is ticked.
What if we wanted “not quite” full control? Say we wanted to deny
the ability to write extended attributes to the file. Well, we can update
our code and add a second FileSystemAccessRule
. Example 11-16 shows the additional
code required.
Example 11-16. Denying permissions
private static string[] MakeTestDirectories()
{
    // ...
    FileSystemAccessRule fsarAllow = new FileSystemAccessRule(
        userName,
        FileSystemRights.FullControl,
        AccessControlType.Allow);
    ds.AddAccessRule(fsarAllow);

    FileSystemAccessRule fsarDeny = new FileSystemAccessRule(
        userName,
        FileSystemRights.WriteExtendedAttributes,
        AccessControlType.Deny);
    ds.AddAccessRule(fsarDeny);
    // ...
}
Notice that we’re specifying AccessControlType.Deny
.
Before you compile and run this, delete the folders you created with the last run, using Explorer—we’ll write some code to do that automatically in a minute, because it will get very boring very quickly!
You should see very similar output to last time (just with some new directory names):
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\slhwbtgo.sop
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\sfndkgn.ucm
C:\Users\yourId\AppData\Local\Programming CSharp\FindDuplicates\ayf1uvg.y4y
If you look at the permissions, you will now see both the Allow and the new Deny entries (Figure 11-7).
As a double-check, take a look at the effective permissions for your current user (see Figure 11-8).
In Figure 11-8 you
can see that we’ve no longer got Full
control
, because we’ve been specifically denied Write extended attributes
. Of course, we could
always give that permission back to ourselves, because we’ve been allowed
Change permissions
, but that’s not the
point!
Although that isn’t the point, security permissions of all kinds are a complex affair. If your users have local or domain administrator permissions, they can usually work around any other permissions you try to manage. You should always try to abide by the principle of least permission: don’t grant people more privileges than they really need to do the job. Although that will require a little more thinking up front, and can sometimes be a frustrating process while you try to configure a system, it is much preferable to a wide-open door.
OK, delete those new directories using Explorer, and we’ll write
some code to clean up after ourselves. We need to delete the directories
we’ve just created, by implementing our CleanupTestDirectories
method.
You’re probably ahead of us by now. Yes, we can delete a
directory using Directory.Delete
, as Example 11-17 shows.
Example 11-17. Deleting a directory
private static void CleanupTestDirectories(IEnumerable<string> directories)
{
foreach (var directory in directories)
{
Directory.Delete(directory);
}
}
We’re just iterating through the set of new directories we stashed away earlier, deleting them.
OK, we’ve got our test directories. We’d now like to create some
test files to use. Just before we return from MakeTestDirectories
, let’s add a call to a new
method to create our files, as Example 11-18 shows.
Example 11-18. Creating files in the test directories
...
CreateTestFiles(directories);
return directories;
Example 11-19 shows that method.
Example 11-19. The CreateTestFiles method
private static void CreateTestFiles(IEnumerable<string> directories)
{
    string fileForAllDirectories = "SameNameAndContent.txt";
    string fileSameInAllButDifferentSizes = "SameNameDifferentSize.txt";

    int directoryIndex = 0;
    // Let's create a distinct file that appears in each directory
    foreach (string directory in directories)
    {
        directoryIndex++;
        // Create the distinct file for this directory
        string filename = Path.GetRandomFileName();
        string fullPath = Path.Combine(directory, filename);
        CreateFile(fullPath, "Example content 1");

        // And now the one that is in all directories, with the same content
        fullPath = Path.Combine(directory, fileForAllDirectories);
        CreateFile(fullPath, "Found in all directories");

        // And now the one that has the same name in
        // all directories, but with different sizes
        fullPath = Path.Combine(directory, fileSameInAllButDifferentSizes);
        StringBuilder builder = new StringBuilder();
        builder.AppendLine("Now with");
        builder.AppendLine(new string('x', directoryIndex));
        CreateFile(fullPath, builder.ToString());
    }
}
As you can see, we’re running through the directories, and creating three files in each. The first has a different, randomly generated filename in each directory, and remember, our application only considers files with the same names as being possible duplicates, so we expect the first file we add to each directory to be considered unique. The second file has the same filename and content (so they will all be the same size) in every folder. The third file has the same name every time, but its content varies in length.
Well, we can’t put off the moment any longer; we’re going to have to create a file, and write some content into it. There are lots and lots and lots (and lots) of different ways of doing that with the .NET Framework, so how do we go about picking one?
Our first consideration should always be to “keep it
simple,” and use the most convenient method for the job. So, what is the
job? We need to create a file, and write some text into it. File.WriteAllText
looks like a good place to
start.
The File
class offers three
methods that can write an entire file out in a single step: WriteAllBytes
, WriteAllLines
, and WriteAllText
. The first of these works with
binary, but our application has text. As you saw in Chapter 10, we could use an Encoding
to convert our text into bytes, but
the other two methods here will do that for us. (They all use
UTF-8.)
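As a quick illustrative sketch of all three side by side (the folder and file names here are placeholders of our own):

```csharp
using System.IO;

class WriteAllExamples
{
    public static void Demo(string folder)
    {
        // Binary: you supply the raw bytes yourself.
        File.WriteAllBytes(Path.Combine(folder, "data.bin"),
                           new byte[] { 0x01, 0x02, 0x03 });

        // One string per line; WriteAllLines adds the line breaks for you.
        File.WriteAllLines(Path.Combine(folder, "lines.txt"),
                           new[] { "first line", "second line" });

        // A single string, written verbatim (UTF-8 encoded by default).
        File.WriteAllText(Path.Combine(folder, "text.txt"),
                          "all in one go");
    }
}
```

Each call creates the file if necessary, writes everything, and closes it again in one step.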
WriteAllLines
takes a
collection of strings, one for each line, but our code in Example 11-19 prepares content in the form of
a single string. So as Example 11-20 shows, we use WriteAllText
to write the file out with a
single line of code. (In fact, we probably didn’t need to bother putting
this code into a separate method. However, this will make it easier for
us to illustrate some of the alternatives later.)
Example 11-20. Writing a string into a new file
private static void CreateFile(string fullPath, string contents)
{
    File.WriteAllText(fullPath, contents);
}
The path can be either relative or absolute, and the file will be created if it doesn’t already exist, and overwritten if it does.
This was pretty straightforward, but there’s one problem with this technique: it requires us to have the entire file contents ready at the point where we want to start writing text. This application already does that, but this won’t always be so. What if your program performs long and complex processing that produces very large volumes of text? Writing the entire file at once like this would involve having the whole thing in memory first. But there’s a slightly more complex alternative that makes it possible to generate gigabytes of text without consuming much memory.
The File
class offers a
CreateText
method,
which takes the path to the file to create (either relative or absolute,
as usual), and creates it for you if it doesn’t already exist. If the
file is already present, this method overwrites it. Unlike the
WriteAllText
method, it
doesn’t write any data initially—the newly created file will be empty at
first. The method returns an instance of the StreamWriter
class, which allows you to write
to the file. Example 11-21 shows the code
we need to use that.
Example 11-21. Creating a StreamWriter
private static void CreateFile(string fullPath, string p)
{
using (StreamWriter writer = File.CreateText(fullPath))
{
// Use the stream writer here
}
}
We’re no longer writing the whole file in one big lump, so we need
to let the StreamWriter
know when we’re done. To
make life easier for us, StreamWriter
implements IDisposable
, and closes
the underlying file if Dispose
is
called. This means that we can wrap it in a using
block, as Example 11-21 shows, and we can be assured that
it will be closed even if an exception is thrown.
So, what is a StreamWriter
? The
first thing to note is that even though this chapter has “Stream” in the
title, this isn’t actually a Stream
;
it’s a wrapper around a Stream
. It
derives from a class called TextWriter
, which, as
you might guess, is a base for types which write text into things, and a
StreamWriter
is a TextWriter
that writes text into a Stream
. TextWriter
defines lots of overloads of
Write
and WriteLine
methods, very similar to those we’ve
been using on Console
in all of our
examples so far.
If it is so similar in signature, why doesn’t Console
derive from TextWriter
? TextWriter
is intended to be used with some
underlying resource that needs proper lifetime management, so it
implements IDisposable
. Our code
would be much less readable if we had to wrap every call on Console
with a using
block, or remember to call Dispose
—especially as it isn’t really
necessary. So, why make TextWriter
implement IDisposable
? We do that
so that our text-writing code can be implemented in terms of this base
class, without needing to know exactly what sort of TextWriter
we’re talking to, and still
handle the cleanup properly.
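To illustrate, a method written against the TextWriter base class can send its output to the console or to a file without modification (the ReportTo method here is our own sketch, not part of the chapter's example):

```csharp
using System;
using System.IO;

class TextWriterExample
{
    // Works with any TextWriter: Console.Out, a StreamWriter,
    // a StringWriter, and so on.
    public static void ReportTo(TextWriter writer)
    {
        writer.WriteLine("Duplicate search complete.");
    }

    static void Main()
    {
        // Console.Out is a TextWriter that needs no disposal.
        ReportTo(Console.Out);

        // A StreamWriter is a TextWriter wrapping a file,
        // so we dispose it when we're done.
        using (StreamWriter fileWriter = File.CreateText("report.txt"))
        {
            ReportTo(fileWriter);
        }
    }
}
```

The caller decides which concrete writer to use, and takes care of disposal where the underlying resource requires it.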
The File
class’s CreateText
method calls a constructor on
StreamWriter
which opens the newly
created file, and makes it ready for us to write; something like
this:
return new StreamWriter(fullPath, false);
There’s nothing to stop you from doing this yourself by hand,
and there are many situations where you might want to do so; but the
helper methods on File
tend to make
your code smaller, and more readable, so you should consider using
those first. We’ll look at using StreamWriter
(and its partner, StreamReader
) in this way later in the
chapter, when we’re dealing with different sorts of underlying
streams.
Hang on, though. We’ve snuck a second parameter into that
constructor. What does that Boolean mean? When you create a StreamWriter
, you can choose to overwrite any
existing file content (the default), or append to what is already there.
The second Boolean parameter to the constructor controls that behavior.
As it happens, passing false
here
means we want to overwrite.
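So, hypothetically, if we wanted to append to an existing log file rather than replace it each time, we'd pass true instead (the LogLine helper is our own sketch):

```csharp
using System.IO;

class AppendExample
{
    public static void LogLine(string logPath, string message)
    {
        // true means append to any existing content;
        // false (as File.CreateText uses) would overwrite.
        using (StreamWriter writer = new StreamWriter(logPath, true))
        {
            writer.WriteLine(message);
        }
    }
}
```

Each call reopens the file and adds a line to the end, creating the file on the first call if it doesn't yet exist.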
This is a great example of why it’s better to define nicely
named enumerations, rather than controlling this sort of thing with a
bool
. If the value had not been
false
, but some mythical value such
as OpenBehavior.Overwrite
, we
probably wouldn’t have needed to explain what it did. C# 4.0 added the
ability to use argument names when calling methods, so we could have
written new StreamWriter(fullPath, append:
false)
, which improves matters slightly, but doesn’t help
you when you come across code that hasn’t bothered to do that.
So, now we can easily complete the implementation of our
CreateFile
method, as
shown in Example 11-22.
Example 11-22. Writing a string with StreamWriter
private static void CreateFile(string fullPath, string p)
{
using (StreamWriter writer = File.CreateText(fullPath))
{
writer.Write(p);
}
}
We just write the string we’ve been provided to the file. In this
particular application, Example 11-22 isn’t an improvement on
Example 11-20—we’re just writing a
single string, so WriteAllText
was a
better fit. But StreamWriter
is an
important technique for less trivial scenarios.
OK, let’s build and run this code again (press F5 to make sure it runs in the debugger). And everything seems to be going very well. We see the output we’d hoped for:
C:\Users\mwa\AppData\Local\up022gsm.241
C:\Users\mwa\AppData\Local\gdovysqk.cqn
C:\Users\mwa\AppData\Local\xyhazu3n.4pw

SameNameAndContent.txt
----------------------
C:\Users\mwa\AppData\Local\up022gsm.241
C:\Users\mwa\AppData\Local\gdovysqk.cqn
C:\Users\mwa\AppData\Local\xyhazu3n.4pw
That is to say, one file is found duplicated in three directories. All the others have failed to match, exactly as we’d expect.
Unfortunately, almost before we’d had a chance to read that, the debugger halted execution to report an unhandled exception. It crashes in the code we added in Example 11-17 to delete the directories, because the directories are not empty.
For now, we’re going to have to clean up those directories by hand
again, and make another change to our code. Clearly, the problem is that
the Directory.Delete
method
doesn’t delete the files and directories inside the
directory itself.
This is easily fixed, because there is another overload of that
method which does allow us to delete the files recursively—you just pass
a Boolean as the second parameter (true
for recursive deletes, and false
for the default behavior).
Don’t add this parameter unless you’re absolutely sure that the code is working correctly, looking only at the test directory, and not executing this code in nontest mode. We don’t want a host of emails appearing telling us that we deleted your entire, non-backed-up source and document tree because you followed this next instruction, having deviated slightly from the earlier instructions.
If you want to avoid having to clean up the directories by hand, though, and you’re really, really sure everything is fine, you could add this, at your own risk:
Directory.Delete(directory, true);
So far, we have quietly ignored the many, many things that can go wrong when you’re using files and streams. Now seems like a good time to dive into that murky topic.
Exceptions related to file and stream operations fall into three broad categories:
The usual suspects you might get from any method: incorrect parameters, null references, and so on
I/O-related problems
Security-related problems
The first category can, of course, be dealt with as normal—if they occur (as we discussed in Chapter 6) there is usually some bug or unexpected usage that you need to deal with.
The other two are slightly more interesting cases. We should expect problems with file I/O. Files and directories are (mostly) system-wide shared resources. This means that anyone can be doing something with them while you are trying to use them. As fast as you’re creating them, some other process might be deleting them. Or writing to them; or locking them so that you can’t touch them; or altering the permissions on them so that you can’t see them anymore. You might be working with files on a network share, in which case different computers may be messing with the files, or you might lose connectivity partway through working with a file.
This “global” nature of files also means that you have to deal with
concurrency problems. Consider this piece of code, for example, that makes
use of the (almost totally redundant) method File.Exists
, shown in
Example 11-23, which determines
whether a file exists.
Example 11-23. The questionable File.Exists method
if (File.Exists("SomeFile.txt"))
{
    // Play with the file
}
Is it safe to play with the file in there, on the assumption that it exists?
No.
In another process, even from another machine if the directory is shared, someone could nip in and delete the file or lock it, or do something even more nefarious (like substitute it for something else). Or the user might have closed the lid of his laptop just after the method returns, and may well be in a different continent by the time he brings it out of sleep mode, at which point you won’t necessarily have access to the same network shares that seemed to be visible just one line of code ago.
So you have to code extremely defensively, and expect exceptions in your I/O code, even if you checked that everything looked OK before you started your work.
Unlike most exceptions, though, abandoning the operation is not always the best choice. You often see transient problems, like a USB drive being temporarily unavailable, for example, or a network glitch temporarily hiding a share from us, or aborting a file copy operation. (Transient network problems are particularly common after a laptop resumes from suspend—it can take a few seconds to get back on the network, or maybe even minutes if the user is in a hotel and has to sign up for an Internet connection before connecting back to the office VPN. Abandoning the user’s data is not a user-friendly response to this situation.)
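One common way to cope with transient failures is a short retry loop. Here's a minimal sketch (the attempt count and delay are arbitrary choices of ours, and a real application might also prompt the user before giving up):

```csharp
using System;
using System.IO;
using System.Threading;

class RetryExample
{
    public static string ReadWithRetry(string path, int maxAttempts = 3)
    {
        for (int attempt = 1; ; ++attempt)
        {
            try
            {
                return File.ReadAllText(path);
            }
            catch (IOException)
            {
                // Possibly a transient problem (network glitch,
                // drive briefly unavailable). Give up only after
                // several attempts; otherwise wait and retry.
                if (attempt >= maxAttempts) { throw; }
                Thread.Sleep(500);
            }
        }
    }
}
```

The final attempt rethrows, so a persistent failure still surfaces as an exception for the caller to handle.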
When an I/O problem occurs, the framework throws one of several
exceptions derived from IOException
(or, as we’ve already seen, IOException
itself) listed here:
IOException
This is thrown when some general problem with I/O has
occurred. This is the base for all of the more specific exception
types, but it is sometimes thrown in its own right, with the
Message
text describing the
actual problem. This makes it somewhat less useful for programmatic
interpretation; you usually have to allow the user to intervene in
some way when you catch one of these.
DirectoryNotFoundException
This is thrown when an attempt is made to access a directory that does not exist. This commonly occurs because of an error in constructing a path (particularly when relative paths are in play), or because some other process has moved or deleted a directory during an operation.
DriveNotFoundException
This is thrown when the root drive in a path is no longer available. This could be because a drive letter has been mapped to a network location which is no longer available, or a removable device has been removed. Or because you typed the wrong drive letter!
FileLoadException
This is a bit of an anomaly in the family of IOException
s, and we’re including it in
this list only because it can cause some confusion. It is thrown by
the runtime when an assembly cannot be loaded; as such, it has more
to do with assemblies than files and streams.
FileNotFoundException
This is thrown when an attempt is made to access a file that
does not exist. As with DirectoryNotFoundException
, this is often
because there has been some error in constructing a path (absolute
or relative), or because something was moved or deleted while the
program was running.
PathTooLongException
This is an awkward little exception, and causes a good deal of confusion for developers (which is one reason correct behavior in the face of long paths is a part of Microsoft’s Designed For Windows test suite). It is thrown when a path provided is too long. But what is “too long”? The maximum length for a path in Windows used to be 260 characters (which isn’t very long at all). Recent versions allow paths up to about (but not necessarily exactly) 32,767 characters, but making use of that from .NET is awkward. There’s a detailed discussion of Windows File and Path lengths if you fall foul of the problem in the MSDN documentation at http://msdn.microsoft.com/library/aa365247, and a discussion of the .NET-specific issues at http://go.microsoft.com/fwlink/?LinkID=163666.
If you are doing anything with I/O operations, you will need to think about most, if not all, of these exceptions, deciding where to catch them and what to do when they occur.
Let’s look back at our example again, and see what we want to do with any exceptions that might occur. As a first pass, we could just wrap our main loop in a try/catch block, as Example 11-24 does. Since our application’s only job is to report its findings, we’ll just display a message if we encounter a problem.
Example 11-24. A first attempt at handling I/O exceptions
try
{
    List<FileNameGroup> filesGroupedByName = InspectDirectories(
        recurseIntoSubdirectories, directoriesToSearch);
    DisplayMatches(filesGroupedByName);
    Console.ReadKey();
}
catch (PathTooLongException ptlx)
{
    Console.WriteLine("The specified path was too long");
    Console.WriteLine(ptlx.Message);
}
catch (DirectoryNotFoundException dnfx)
{
    Console.WriteLine("The specified directory was not found");
    Console.WriteLine(dnfx.Message);
}
catch (IOException iox)
{
    Console.WriteLine(iox.Message);
}
catch (UnauthorizedAccessException uax)
{
    Console.WriteLine("You do not have permission to access this directory.");
    Console.WriteLine(uax.Message);
}
catch (ArgumentException ax)
{
    Console.WriteLine("The path provided was not valid.");
    Console.WriteLine(ax.Message);
}
finally
{
    if (testDirectoriesMade)
    {
        CleanupTestDirectories(directoriesToSearch);
    }
}
We’ve decided to provide specialized handling for the PathTooLongException
and DirectoryNotFoundException
exceptions, as well as generic handling for IOException
(which, of course, we have to catch
after the exceptions derived from it).
In addition to those IOException
-derived types, we’ve also caught
UnauthorizedAccessException
. This is a
security exception, rather than an I/O exception, and so it derives from a
different base (SystemException
). It is
thrown if the user does not have permission to access the directory
concerned.
Let’s see that in operation, by creating an additional test
directory and denying ourselves access to it. Example 11-25 shows a function to create a directory
where we deny ourselves the ListDirectory
permission.
Example 11-25. Denying permission
private static string CreateDeniedDirectory(string parentPath)
{
    string deniedDirectory = Path.GetRandomFileName();
    string fullDeniedPath = Path.Combine(parentPath, deniedDirectory);
    string userName = WindowsIdentity.GetCurrent().Name;

    DirectorySecurity ds = new DirectorySecurity();
    FileSystemAccessRule fsarDeny = new FileSystemAccessRule(
        userName,
        FileSystemRights.ListDirectory,
        AccessControlType.Deny);
    ds.AddAccessRule(fsarDeny);

    Directory.CreateDirectory(fullDeniedPath, ds);
    return fullDeniedPath;
}
We can call it from our MakeTestDirectories
method, as Example 11-26
shows (along with suitable modifications to the code to accommodate the
extra directory).
Example 11-26. Modifying MakeTestDirectories for permissions test
private static string[] MakeTestDirectories()
{
    // ...
    // Let's make three test directories
    // and leave space for a fourth to test access denied behavior
    var directories = new string[4];
    for (int i = 0; i < directories.Length - 1; ++i)
    {
        // ... as before ...
    }
    CreateTestFiles(directories.Take(3));
    directories[3] = CreateDeniedDirectory(localApplicationData);
    return directories;
}
But hold on a moment, before you build and run this. If we’ve denied ourselves permission to look at that directory, how are we going to delete it again in our cleanup code? Fortunately, because we own the directory that we created, we can modify the permissions again when we clean up.
Example 11-27 shows a method which can give us back full control over any directory (providing we have the permission to change the permissions). This code makes some assumptions about the existing permissions, but that’s OK here because we created the directory in the first place.
Example 11-27. Granting access to a directory
private static void AllowAccess(string directory)
{
    DirectorySecurity ds = Directory.GetAccessControl(directory);
    string userName = WindowsIdentity.GetCurrent().Name;

    // Remove the deny rule
    FileSystemAccessRule fsarDeny = new FileSystemAccessRule(
        userName,
        FileSystemRights.ListDirectory,
        AccessControlType.Deny);
    ds.RemoveAccessRuleSpecific(fsarDeny);

    // And add an allow rule
    FileSystemAccessRule fsarAllow = new FileSystemAccessRule(
        userName,
        FileSystemRights.FullControl,
        AccessControlType.Allow);
    ds.AddAccessRule(fsarAllow);

    Directory.SetAccessControl(directory, ds);
}
Notice how we’re using the GetAccessControl
method
on Directory
to get hold of the
directory security information. We then construct a filesystem access
rule which matches the deny rule we created earlier, and call RemoveAccessRuleSpecific
on the DirectorySecurity
information we retrieved.
This matches the rule up exactly, and then removes it if it exists (or
does nothing if it doesn’t).
Finally, we add an allow rule to the set to
give us full control over the directory, and then call the Directory.SetAccessControl
method to set those
permissions on the directory itself.
Let’s call that method from our cleanup code, compile, and run. (Don’t forget, we’re deleting files and directories, and changing permissions, so take care!)
Here’s some sample output:
C:\Users\mwa\AppData\Local\ufmnho4z.h5p
C:\Users\mwa\AppData\Local\5chw4maf.xyu
C:\Users\mwa\AppData\Local\s1ydovhu.0wk
You do not have permission to access this directory.
Access to the path 'C:\Users\mwa\AppData\Local\yjijkza.3cj' is denied.
These methods make it relatively easy to manage permissions when you create and manipulate files, but they don’t make it easy to decide what those permissions should be! It is always tempting just to make everything available to anyone—you can get your code compiled and “working” much quicker that way; but only for “not very secure” values of “working,” and that’s something that has to be of concern for every developer.
Your application could be the one that miscreants decide to exploit to turn your users’ PCs to the dark side.
I warmly recommend that you crank UAC up to the maximum (and put up with the occasional security dialog), run Visual Studio as a nonadministrator (as far as is possible), and think at every stage about the least possible privileges you can grant to your users that will still let them get their work done. Making your app more secure benefits everyone: not just your own users, but everyone who doesn’t receive a spam email or a hack attempt because the bad guys couldn’t exploit your application.
We’ve now handled the exception nicely—but is stopping really the
best thing we could have done? Would it not be better to log the fact
that we were unable to access particular directories, and carry on?
Similarly, if we get a DirectoryNotFoundException
or FileNotFoundException
, wouldn’t we want to
just carry on in this case? The fact that someone has deleted the
directory from underneath us shouldn’t matter to us.
If we look again at our sample, it might be better to catch the
DirectoryNotFoundException
and
FileNotFoundException
inside the
InspectDirectories
method to provide a more fine-grained response to errors. Also, if we
look at the documentation for FileInfo
, we’ll see that it may actually
throw a base IOException
under some
circumstances, so we should catch that here, too. And in all cases, we need to catch the security
exceptions.
We’re relying on LINQ to iterate through the files and folders,
which means it’s not entirely obvious where to put the exception
handling. Example 11-28 shows the
code from InspectDirectories
that
iterates through the folders, to get a list of files. We can’t put
exception handling code into the middle of that query.
Example 11-28. Iterating through the directories
var allFilePaths = from directory in directoriesToSearch
                   from file in Directory.GetFiles(
                       directory, "*.*", searchOption)
                   select file;
However, we don’t have to. The simplest way to solve this is to put the code that gets the directories into a separate method, so we can add exception handling, as Example 11-29 shows.
Example 11-29. Putting exception handling in a helper method
private static IEnumerable<string> GetDirectoryFiles(
    string directory, SearchOption searchOption)
{
    try
    {
        return Directory.GetFiles(directory, "*.*", searchOption);
    }
    catch (DirectoryNotFoundException dnfx)
    {
        Console.WriteLine("Warning: The specified directory was not found");
        Console.WriteLine(dnfx.Message);
    }
    catch (UnauthorizedAccessException uax)
    {
        Console.WriteLine(
            "Warning: You do not have permission to access this directory.");
        Console.WriteLine(uax.Message);
    }
    return Enumerable.Empty<string>();
}
This method defers to Directory.GetFiles
, but in the event of one of
the expected errors, it displays a warning, and then just returns an
empty collection.
There’s a problem here when we ask GetFiles
to search recursively: if it
encounters a problem with even just one directory, the whole operation
throws, and you’ll end up not looking in any directories. So while
Example 11-29 makes a
difference only when the user passes multiple directories on the
command line, it’s not all that useful when using the /sub
option. If you wanted to make your
error handling more fine-grained still, you could write your own
recursive directory search. The GetAllFilesInDirectory
example in Chapter 7 shows how to do that.
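Such a search might look something like the following sketch, which handles errors one directory at a time so that a single inaccessible folder doesn't abort the whole walk (the method name and warning messages are our own):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class RecursiveSearch
{
    public static IEnumerable<string> GetFilesTolerantly(string directory)
    {
        string[] files = null;
        string[] subdirectories = null;
        try
        {
            files = Directory.GetFiles(directory);
            subdirectories = Directory.GetDirectories(directory);
        }
        catch (UnauthorizedAccessException)
        {
            Console.WriteLine(
                "Warning: skipping inaccessible directory " + directory);
        }
        catch (DirectoryNotFoundException)
        {
            Console.WriteLine("Warning: directory vanished: " + directory);
        }

        // A yield return can't live inside a try/catch, so we
        // enumerate outside it; a failed directory is just skipped.
        if (files != null && subdirectories != null)
        {
            foreach (string file in files) { yield return file; }
            foreach (string subdirectory in subdirectories)
            {
                foreach (string file in GetFilesTolerantly(subdirectory))
                {
                    yield return file;
                }
            }
        }
    }
}
```

Because each directory is listed in its own try block, one denied folder costs us only that folder's contents, not the entire search.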
If we modify the LINQ query to use this, as shown in Example 11-30, the overall progress will be undisturbed by the error handling.
Example 11-30. Iterating in the face of errors
var allFilePaths = from directory in directoriesToSearch
from file in GetDirectoryFiles(directory,
searchOption)
select file;
And we can use a similar technique for the LINQ query that
populates the fileNameGroups
—it uses FileInfo
, and we need to handle exceptions for
that. Example 11-31 iterates
through a list of paths, and returns details for each file that it was
able to access successfully, displaying errors otherwise.
Example 11-31. Handling exceptions from FileInfo
private static IEnumerable<FileDetails> GetDetails(IEnumerable<string> paths)
{
    foreach (string filePath in paths)
    {
        FileDetails details = null;
        try
        {
            FileInfo info = new FileInfo(filePath);
            details = new FileDetails
            {
                FilePath = filePath,
                FileSize = info.Length
            };
        }
        catch (FileNotFoundException fnfx)
        {
            Console.WriteLine("Warning: The specified file was not found");
            Console.WriteLine(fnfx.Message);
        }
        catch (IOException iox)
        {
            Console.Write("Warning: ");
            Console.WriteLine(iox.Message);
        }
        catch (UnauthorizedAccessException uax)
        {
            Console.WriteLine(
                "Warning: You do not have permission to access this file.");
            Console.WriteLine(uax.Message);
        }
        if (details != null)
        {
            yield return details;
        }
    }
}
We can use this from the final LINQ query in InspectDirectories
. Example 11-32 shows the modified
query.
Example 11-32. Getting details while tolerating errors
var fileNameGroups = from filePath in allFilePaths
let fileNameWithoutPath = Path.GetFileName(filePath)
group filePath by fileNameWithoutPath into nameGroup
select new FileNameGroup
{
FileNameWithoutPath = nameGroup.Key,
FilesWithThisName = GetDetails(nameGroup).ToList()
};
Again, this enables the query to process all accessible items, while reporting errors for any problematic files without having to stop completely. If we compile and run again, we see the following output:
C:\Users\mwa\AppData\Local\dcyx0fv1.hv3
C:\Users\mwa\AppData\Local\nf2wqwr.y3s
C:\Users\mwa\AppData\Local\kfilxte4.exy
Warning: You do not have permission to access this directory. Access to the path 'C:\Users\mwa\AppData\Local\2gl4q1a.ycp' is denied.
SameNameAndContent.txt
----------------------
C:\Users\mwa\AppData\Local\dcyx0fv1.hv3
C:\Users\mwa\AppData\Local\nf2wqwr.y3s
C:\Users\mwa\AppData\Local\kfilxte4.exy
We’ve dealt cleanly with the directory to which we did not have access, and have continued with the job to a successful conclusion.
Now that we’ve found a few candidate files that may (or may not) be the same, can we actually check to see that they are, in fact, identical, rather than just coincidentally having the same name and length?
To compare the candidate files, we could load them into
memory. The File
class offers three
likely looking static methods: ReadAllBytes
, which
treats the file as binary, and loads it into a byte array; File.ReadAllText
, which treats it as text, and
reads it all into a string; and File.ReadAllLines
, which
again treats it as text, but loads each line into its own string, and
returns an array of all the lines. (In .NET 4, File.ReadLines
offers a lazy alternative, returning the lines as an
IEnumerable<string>.) We could even call File.OpenText
to obtain a StreamReader
(the counterpart of the StreamWriter
, but for reading data—we'll see
this again later in the chapter).
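To make those helpers concrete, here's a minimal sketch that exercises each of them against a temporary file (the file and its contents are invented for illustration, not part of the duplicate finder):

```csharp
// A quick sketch of the File class's whole-file helpers, using a
// temporary file rather than any of the paths from the example.
using System;
using System.IO;

class ReadHelpersDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "line one" + Environment.NewLine + "line two");

        byte[] raw = File.ReadAllBytes(path);     // whole file as binary
        string text = File.ReadAllText(path);     // whole file as one string
        string[] lines = File.ReadAllLines(path); // one string per line

        Console.WriteLine(raw.Length > 0);
        Console.WriteLine(lines.Length);
        Console.WriteLine(text.Contains("line two"));

        File.Delete(path);
    }
}
```

All three load the entire file in one call, which is exactly the behavior we're about to reconsider for large files.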
Because we’re looking at all file types, not just text, we need to
use one of the binary-based methods. File.ReadAllBytes
returns a byte[]
containing the entire contents of the
file. We could then compare the files byte for byte, to see if they are
the same. Here’s some code to do that.
First, let’s update our DisplayMatches
function
to do the load and compare, as shown by the highlighted lines in Example 11-33.
Example 11-33. Updating DisplayMatches for content comparison
private static void DisplayMatches(
    IEnumerable<FileNameGroup> filesGroupedByName)
{
    var groupsWithMoreThanOneFile = from nameGroup in filesGroupedByName
                                    where nameGroup.FilesWithThisName.Count > 1
                                    select nameGroup;

    foreach (var fileNameGroup in groupsWithMoreThanOneFile)
    {
        // Group the matches by the file size, then select those
        // with more than 1 file of that size.
        var matchesBySize = from match in fileNameGroup.FilesWithThisName
                            group match by match.FileSize into sizeGroup
                            where sizeGroup.Count() > 1
                            select sizeGroup;

        foreach (var matchedBySize in matchesBySize)
        {
            List<FileContents> content = LoadFiles(matchedBySize);
            CompareFiles(content);
        }
    }
}
Notice that we want our LoadFiles
function to
return a List
of FileContents
objects. Example 11-34 shows the FileContents
class.
Example 11-34. File content information class
internal class FileContents
{
    public string FilePath { get; set; }
    public byte[] Content { get; set; }
}
It just lets us associate the filename with the contents so that we
can use it later to display the results. Example 11-35 shows the implementation of
LoadFiles
, which uses ReadAllBytes
to load in the file content.
Example 11-35. Loading binary file content
private static List<FileContents> LoadFiles(IEnumerable<FileDetails> fileList)
{
    var content = new List<FileContents>();
    foreach (FileDetails item in fileList)
    {
        byte[] contents = File.ReadAllBytes(item.FilePath);
        content.Add(new FileContents
        {
            FilePath = item.FilePath,
            Content = contents
        });
    }
    return content;
}
We now need an implementation for CompareFiles
, which is
shown in Example 11-36.
Example 11-36. CompareFiles method
private static void CompareFiles(List<FileContents> files)
{
    Dictionary<FileContents, List<FileContents>> potentiallyMatched =
        BuildPotentialMatches(files);

    // Now, we're going to look at every byte in each
    CompareBytes(files, potentiallyMatched);

    DisplayResults(files, potentiallyMatched);
}
This isn’t exactly the most elegant way of comparing several files. We’re building a big dictionary of all of the potential matching combinations, and then weeding out the ones that don’t actually match. For large numbers of potential matches of the same size this could get quite inefficient, but we’ll not worry about that right now! Example 11-37 shows the function that builds those potential matches.
Example 11-37. Building possible match combinations
private static Dictionary<FileContents, List<FileContents>>
    BuildPotentialMatches(List<FileContents> files)
{
    // Builds a dictionary where the entries look like:
    //   { 0, { 1, 2, 3, 4, ... N } }
    //   { 1, { 2, 3, 4, ... N } }
    //   ...
    //   { N - 1, { N } }
    // where N is one less than the number of files.
    var allCombinations = Enumerable.Range(0, files.Count - 1).ToDictionary(
        x => files[x],
        x => files.Skip(x + 1).ToList());
    return allCombinations;
}
This set of potential matches will be whittled down to the files
that really are the same by CompareBytes
, which we’ll get to momentarily.
The DisplayResults
method,
shown in Example 11-38, runs through the matches
and displays their names and locations.
Example 11-38. Displaying matches
private static void DisplayResults(
    List<FileContents> files,
    Dictionary<FileContents, List<FileContents>> currentlyMatched)
{
    if (currentlyMatched.Count == 0) { return; }

    var alreadyMatched = new List<FileContents>();
    Console.WriteLine("Matches");
    foreach (var matched in currentlyMatched)
    {
        // Don't do it if we've already matched it previously
        if (alreadyMatched.Contains(matched.Key))
        {
            continue;
        }
        else
        {
            alreadyMatched.Add(matched.Key);
        }
        Console.WriteLine("-------");
        Console.WriteLine(matched.Key.FilePath);
        foreach (var file in matched.Value)
        {
            Console.WriteLine(file.FilePath);
            alreadyMatched.Add(file);
        }
    }
    Console.WriteLine("-------");
}
This leaves the method shown in Example 11-39 that does the bulk of the work, comparing the potentially matching files, byte for byte.
Example 11-39. Byte-for-byte comparison of all potential matches
private static void CompareBytes(
    List<FileContents> files,
    Dictionary<FileContents, List<FileContents>> potentiallyMatched)
{
    // Remember, this only ever gets called with files of equal length.
    int fileLength = files[0].Content.Length;
    var sourceFilesWithNoMatches = new List<FileContents>();
    for (int fileByteOffset = 0; fileByteOffset < fileLength; ++fileByteOffset)
    {
        foreach (var sourceFileEntry in potentiallyMatched)
        {
            byte[] sourceContent = sourceFileEntry.Key.Content;
            for (int otherIndex = 0;
                 otherIndex < sourceFileEntry.Value.Count;
                 ++otherIndex)
            {
                // Check the byte at i in each of the two files, if they don't
                // match, then we remove them from the collection
                byte[] otherContent = sourceFileEntry.Value[otherIndex].Content;
                if (sourceContent[fileByteOffset] != otherContent[fileByteOffset])
                {
                    sourceFileEntry.Value.RemoveAt(otherIndex);
                    otherIndex -= 1;
                    if (sourceFileEntry.Value.Count == 0)
                    {
                        sourceFilesWithNoMatches.Add(sourceFileEntry.Key);
                    }
                }
            }
        }
        foreach (FileContents fileWithNoMatches in sourceFilesWithNoMatches)
        {
            potentiallyMatched.Remove(fileWithNoMatches);
        }
        // Don't bother with the rest of the file if
        // there are no further potential matches
        if (potentiallyMatched.Count == 0) { break; }
        sourceFilesWithNoMatches.Clear();
    }
}
We’re going to need to add a test file that differs only in the
content. In CreateTestFiles
add another
filename that doesn’t change as we go round the loop:
string fileSameSizeInAllButDifferentContent = "SameNameAndSizeDifferentContent.txt";
Then, inside the loop (at the bottom), we’ll create a test file that will be the same length, but varying by only a single byte:
// And now one that is the same length, but with different content
fullPath = Path.Combine(directory, fileSameSizeInAllButDifferentContent);
builder = new StringBuilder();
builder.Append("Now with ");
builder.Append(directoryIndex);
builder.AppendLine(" extra");
CreateFile(fullPath, builder.ToString());
If you build and run, you should see some output like this, showing the one identical file we have in each file location:
C:\Users\mwa\AppData\Local\e33yz4hg.mjp
C:\Users\mwa\AppData\Local\ung2xdgo.k1c
C:\Users\mwa\AppData\Local\jcpagntt.ynd
Warning: You do not have permission to access this directory. Access to the path 'C:\Users\mwa\AppData\Local\cmoof2kj.ekd' is denied.
Matches
-------
C:\Users\mwa\AppData\Local\e33yz4hg.mjp\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\ung2xdgo.k1c\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\jcpagntt.ynd\SameNameAndContent.txt
-------
Needless to say, this isn’t exactly efficient, and it is unlikely to work so well when you get to those DVD rips and massive media repositories. Even your 64-bit machine probably doesn’t have quite that much memory available to it.[24] There’s a way to make this more memory-efficient: instead of loading each file completely into memory, we can take a streaming approach.
You can think of a stream like one of those old-fashioned news ticker tapes. To write data onto the tape, the bytes (or characters) in the file are typed out, one at a time, on the continuous stream of tape.
We can then wind the tape back to the beginning, and start reading it back, character by character, until either we stop or we run off the end of the tape. Or we could give the tape to someone else, and she could do the same. Or we could read, say, 1,000 characters off the tape, and copy them onto another tape which we give to someone to work on, then read the next 1,000, and so on, until we run out of characters.
Once upon a time, we used to store programs and data in exactly this way, on a stream of paper tape with holes punched in it; the basic technology for this was invented in the 19th century. Later, we got magnetic tape, although that was less than useful in machine shops full of electric motors generating magnetic fields, so paper systems (both tape and punched cards) lasted well into the 1980s (when disk systems and other storage technologies became more robust, and much faster).
The concept of a machine that reads data items one at a time, and can step forward or backward through that stream, goes back to the very foundations of modern computing. It is one of those highly resilient metaphors that only really falls down in the face of highly parallelized algorithms: a single input stream is often the choke point for scalability in that case.
To illustrate this, let’s write a method that’s equivalent to
File.ReadAllBytes
using a stream (see
Example 11-40).
Example 11-40. Reading from a stream
private static byte[] ReadAllBytes(string filename)
{
    using (FileStream stream = File.OpenRead(filename))
    {
        long streamLength = stream.Length;
        if (streamLength > 0x7fffffffL)
        {
            throw new InvalidOperationException(
                "Unable to allocate more than 0x7fffffffL bytes " +
                "of memory to read the file");
        }

        // Safe to cast to an int, because
        // we checked for overflow above
        int bytesToRead = (int) stream.Length;
        // This could be a big buffer!
        byte[] bufferToReturn = new byte[bytesToRead];
        // We're going to start at the beginning
        int offsetIntoBuffer = 0;
        while (bytesToRead > 0)
        {
            int bytesRead = stream.Read(bufferToReturn,
                                        offsetIntoBuffer, bytesToRead);
            if (bytesRead == 0)
            {
                throw new InvalidOperationException(
                    "We reached the end of file before we expected... " +
                    "Has someone changed the file while we weren't looking?");
            }
            // Read may return fewer bytes than we asked for, so be
            // ready to go round again.
            bytesToRead -= bytesRead;
            offsetIntoBuffer += bytesRead;
        }
        return bufferToReturn;
    }
}
The call to File.OpenRead
creates
us an instance of a FileStream
. This
class derives from the base Stream
class, which defines most of the methods and properties we’re going to
use.
First, we inspect the stream’s Length
property to determine how many bytes we
need to allocate in our result. This is a long
, so it can support
truly enormous files, even if we can allocate only 2
GB of memory.
If you try using the stream.Length
argument as the array size
without checking it for size first, it will compile, so you might wonder
why we’re doing this check. In fact, C# converts the argument to an
int
first, and if it’s too big,
you’ll get an OverflowException
at
runtime. By checking the size explicitly, we can provide our own error
message.
Then (once we’ve set up a few variables) we call stream.Read
and ask it
for all of the data in the stream. It is entitled to give us any number of
bytes it likes, up to the number we ask for. It returns the actual number
of bytes read, or 0
if we’ve hit the
end of the stream and there’s no more data.
A common programming error is to assume that the stream will give
you as many bytes as you asked for. Under simple test conditions it
usually will if there’s enough data. However, streams can and sometimes
do return you less in order to give you some data
as soon as possible, even when you might think it should be able to give
you everything. If you need to read a certain amount before proceeding,
you need to write code to keep calling Read
until you get what you require, as Example 11-40 does.
Notice that it returns us an int
.
So even if .NET did let us allocate arrays larger than 2 GB (which it
doesn’t) a stream can only tell us that it has read 2 GB worth of data at
a time, and in fact, the third argument to Read
, where we tell it how much we want, is also
an int
, so 2 GB is the most we can ask
for. So while FileStream
is able to
work with larger files thanks to the 64-bit Length
property, it will split the data into
more modest chunks of 2 GB or less when we read. But then one of the main
reasons for using streams in the first place is to avoid having to deal
with all the content in one go, so in practice we tend to work with much
smaller chunks in any case.
So we always call the Read
method
in a loop. The stream maintains the current read position for us, but we
need to work out where to write it in the destination array (offsetIntoBuffer
). We also need to work out how
many more bytes we have to read (bytesToRead
).
We can now update the call to ReadAllBytes
in our LoadFiles
method so that
it uses our new implementation:
byte[] contents = ReadAllBytes(item.FilePath);
If this was all you were going to do, you wouldn’t actually
implement ReadAllBytes
yourself;
you’d use the one in the framework! This is just by way of an example.
We’re going to make more interesting use of streams shortly.
Build and run again, and you should see output with exactly the same form as before:
C:\Users\mwa\AppData\Local\1ssoimgj.wqg
C:\Users\mwa\AppData\Local\cjiymq5b.bfo
C:\Users\mwa\AppData\Local\diss5tgl.zae
Warning: You do not have permission to access this directory. Access to the path 'C:\Users\mwa\AppData\Local\u1w0rj0o.2xe' is denied.
Matches
-------
C:\Users\mwa\AppData\Local\1ssoimgj.wqg\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\cjiymq5b.bfo\SameNameAndContent.txt
C:\Users\mwa\AppData\Local\diss5tgl.zae\SameNameAndContent.txt
-------
That’s all very well, but we haven’t actually improved anything. We
wanted to avoid loading all of those files into memory. Instead of loading
the files, let’s update our FileContents
class to hold a stream instead of a
byte array, as Example 11-41
shows.
Example 11-41. FileContents using FileStream
internal class FileContents
{
public string FilePath { get; set; }
public FileStream Content { get; set; }
}
We’ll have to update the code that creates the FileContents
too, in our LoadFiles
method from Example 11-35. Example 11-42 shows the change required.
Example 11-42. Modifying LoadFiles
content.Add(new FileContents
{
FilePath = item.FilePath,
Content = File.OpenRead(item.FilePath)
});
(You can now delete our ReadAllBytes
implementation, if you
want.)
Because we’re opening all of those files, we need to make sure that
we always close them all. We can’t implement the using
pattern, because we’re handing off the
references outside the scope of the function that creates them, so we’ll
have to find somewhere else to call Close
.
DisplayMatches
(Example 11-33) ultimately causes
the streams to be created by calling LoadFiles
, so DisplayMatches
should close them too. We can add
a try/finally block in that method’s innermost foreach
loop, as Example 11-43 shows.
Example 11-43. Closing streams in DisplayMatches
foreach (var matchedBySize in matchesBySize)
{
    List<FileContents> content = LoadFiles(matchedBySize);
    try
    {
        CompareFiles(content);
    }
    finally
    {
        foreach (var item in content)
        {
            item.Content.Close();
        }
    }
}
The last thing to update, then, is the CompareBytes
method. The
previous version, shown in Example 11-39, relied on loading
all the files into memory upfront. The modified version in Example 11-44 uses streams.
Example 11-44. Stream-based CompareBytes
private static void CompareBytes(
    List<FileContents> files,
    Dictionary<FileContents, List<FileContents>> potentiallyMatched)
{
    // Remember, this only ever gets called with files of equal length.
    long bytesToRead = files[0].Content.Length;

    // We work through all the files at once, so allocate a buffer for each.
    Dictionary<FileContents, byte[]> fileBuffers =
        files.ToDictionary(x => x, x => new byte[1024]);

    var sourceFilesWithNoMatches = new List<FileContents>();
    while (bytesToRead > 0)
    {
        // Read up to 1k from all the files.
        int bytesRead = 0;
        foreach (var bufferEntry in fileBuffers)
        {
            FileContents file = bufferEntry.Key;
            byte[] buffer = bufferEntry.Value;
            int bytesReadFromThisFile = 0;
            while (bytesReadFromThisFile < buffer.Length)
            {
                int bytesThisRead = file.Content.Read(
                    buffer, bytesReadFromThisFile,
                    buffer.Length - bytesReadFromThisFile);
                if (bytesThisRead == 0) { break; }
                bytesReadFromThisFile += bytesThisRead;
            }
            if (bytesReadFromThisFile < buffer.Length
                && bytesReadFromThisFile < bytesToRead)
            {
                throw new InvalidOperationException(
                    "Unexpected end of file - did a file change?");
            }
            bytesRead = bytesReadFromThisFile; // Will be same for all files
        }
        bytesToRead -= bytesRead;

        foreach (var sourceFileEntry in potentiallyMatched)
        {
            byte[] sourceFileContent = fileBuffers[sourceFileEntry.Key];
            for (int otherIndex = 0;
                 otherIndex < sourceFileEntry.Value.Count;
                 ++otherIndex)
            {
                byte[] otherFileContent =
                    fileBuffers[sourceFileEntry.Value[otherIndex]];
                for (int i = 0; i < bytesRead; ++i)
                {
                    if (sourceFileContent[i] != otherFileContent[i])
                    {
                        sourceFileEntry.Value.RemoveAt(otherIndex);
                        otherIndex -= 1;
                        if (sourceFileEntry.Value.Count == 0)
                        {
                            sourceFilesWithNoMatches.Add(sourceFileEntry.Key);
                        }
                        break;
                    }
                }
            }
        }
        foreach (FileContents fileWithNoMatches in sourceFilesWithNoMatches)
        {
            potentiallyMatched.Remove(fileWithNoMatches);
        }
        // Don't bother with the rest of the file if there are
        // no further potential matches
        if (potentiallyMatched.Count == 0) { break; }
        sourceFilesWithNoMatches.Clear();
    }
}
Rather than reading entire files at once, we allocate small buffers, and read in 1 KB at a time. As with the previous version, this new one works through all the files of a particular name and size simultaneously, so we allocate a buffer for each file.
We then loop round, reading in a buffer’s worth from each file, and perform comparisons against just that buffer (weeding out any nonmatches). We keep going round until we either determine that none of the files match or reach the end of the files.
Notice how each stream remembers its position for us, with each
Read
starting where the previous one
left off. And since we ensure that we read exactly the same quantity from
all the files for each chunk (either 1 KB, or however much is left when we
get to the end of the file), all the streams advance in unison.
This code has a somewhat more complex structure than before. The all-in-memory version in Example 11-39 had three loops—the outer one advanced one byte at a time, and then the inner two worked through the various potential match combinations. But because the outer loop in Example 11-44 advances one chunk at a time, we end up needing an extra inner loop to compare all the bytes in a chunk. We could have simplified this by only ever reading a single byte at a time from the streams, but in fact, this chunking has delivered a significant performance improvement. Testing against a folder full of source code, media resources, and compilation output containing 4,500 files (totaling about 500 MB), the all-in-memory version took about 17 seconds to find all the duplicates, but the stream version took just 3.5 seconds! Profiling the code revealed that this performance improvement was entirely a result of the fact that we were comparing the bytes in chunks. So for this particular application, the additional complexity was well worth it. (Of course, you should always measure your own code against representative problems—techniques that work well in one scenario don’t necessarily perform well everywhere.)
What if we wanted to step forward or backward in the file? We can
do that with the Seek
method. Let’s
imagine we want to print out the first 100 bytes of each file that we
reject, for debug purposes. We can add some code to our CompareBytes
method to do that, as Example 11-45 shows.
Example 11-45. Seeking within a stream
if (sourceFileContent[i] != otherFileContent[i])
{
    sourceFileEntry.Value.RemoveAt(otherIndex);
    otherIndex -= 1;
    if (sourceFileEntry.Value.Count == 0)
    {
        sourceFilesWithNoMatches.Add(sourceFileEntry.Key);
    }
#if DEBUG
    // Remember where we got to
    long currentPosition = sourceFileEntry.Key.Content.Position;

    // Seek to 0 bytes from the beginning
    sourceFileEntry.Key.Content.Seek(0, SeekOrigin.Begin);

    // Read the first 100 bytes
    for (int index = 0; index < 100; ++index)
    {
        var val = sourceFileEntry.Key.Content.ReadByte();
        if (val < 0) { break; }
        if (index != 0) { Console.Write(", "); }
        Console.Write(val);
    }
    Console.WriteLine();

    // Put it back where we found it
    sourceFileEntry.Key.Content.Seek(currentPosition, SeekOrigin.Begin);
#endif
    break;
}
We start by getting hold of the current position within the stream
using the Position
property. We do
this so that the code doesn’t lose its place in the stream. (Even though
we’ve detected a mismatch here, remember we’re comparing lots of files
here—perhaps this same file matches one of the other candidates. So
we’re not necessarily finished with it yet.)
The first parameter of the Seek
method tells us how far we are going to seek from our origin—we’re
passing 0
here because we want to go
to the beginning of the file. The second tells us what that origin is
going to be. SeekOrigin.Begin
means
the beginning of the file, and SeekOrigin.End
means the end of the file. Offsets relative to the end count
backward, so you express them as negative numbers: to position
yourself 100 bytes before the end, you pass −100
, not 100
(a positive offset from SeekOrigin.End
would take you past the end of the file).
There’s also SeekOrigin.Current
which allows you to move relative to the current position. You could use
this to read 10 bytes ahead, for example (maybe to work out what you
were looking at in context), and then seek back to where you were by
calling Seek(-10,
SeekOrigin.Current)
.
Not all streams support seeking. For example, some streams
represent network connections, which you might use to download
gigabytes of data. The .NET Framework doesn’t remember every single
byte just in case you ask it to seek later on, so if you attempt to
rewind such a stream, Seek
will
throw a NotSupportedException
. You
can find out whether seeking is supported from a stream’s CanSeek
property.
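Here's a small sketch pulling these pieces together: checking CanSeek, then seeking from each of the three origins. The file and its five bytes are invented purely for illustration:

```csharp
// Demonstrates Seek with each SeekOrigin value, guarded by CanSeek.
using System;
using System.IO;

class SeekDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 10, 20, 30, 40, 50 });

        using (FileStream stream = File.OpenRead(path))
        {
            Console.WriteLine(stream.CanSeek);   // file streams support seeking

            stream.Seek(-2, SeekOrigin.End);     // two bytes back from the end
            Console.WriteLine(stream.ReadByte());

            stream.Seek(0, SeekOrigin.Begin);    // back to the start
            Console.WriteLine(stream.ReadByte());

            stream.Seek(2, SeekOrigin.Current);  // skip ahead two bytes
            Console.WriteLine(stream.ReadByte());
        }
        File.Delete(path);
    }
}
```

Note that ReadByte advances the position by one, so after reading the first byte the stream is at offset 1, and seeking 2 from there lands on the byte at offset 3.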
We don’t just have to use streaming APIs for reading. We can write to the stream, too.
One very common programming task is to copy data from one stream
to another. We use this kind of thing all the time—copying data, or
concatenating the content of several files into another, for example.
(If you want to copy an entire file, you’d use File.Copy
, but streams give you the
flexibility to concatenate or modify data, or to work with nonfile
sources.)
Example 11-46 shows how to
read data from one stream and write it into another. This is just for
illustrative purposes—.NET 4 added a new CopyTo
method to
Stream
which does this for you. In
practice you’d need Example 11-46 only if you were
targeting an older version of the .NET Framework, but it’s a good way to
see how to write to a stream.
Example 11-46. Copying from one stream to another
private static void WriteTo(Stream source, Stream target, int bufferLength)
{
    bufferLength = Math.Max(100, bufferLength);
    var buffer = new byte[bufferLength];
    int bytesRead;
    do
    {
        bytesRead = source.Read(buffer, 0, buffer.Length);
        if (bytesRead != 0)
        {
            target.Write(buffer, 0, bytesRead);
        }
    } while (bytesRead > 0);
}
We create a buffer which is at least 100 bytes long. We then
Read
from the source
and Write
to the target, using the
buffer as the intermediary. Notice that the Write
method takes the same parameters as the read: the buffer, an offset into
that buffer, and the number of bytes to write (which in this case is the
number of bytes read from the source buffer, hence the slightly
confusing variable name). As with Read
, it steadily advances the current
position in the stream as it writes, just like that ticker tape. Unlike
Read
, Write
will always process as many bytes as we
ask it to, so with Write
, there’s no
need to keep looping round until it has written all the data.
Obviously, we need to keep looping until we’ve
read everything from the source stream. Notice that
we keep going until Read
returns
0
. This is how streams indicate that
we’ve reached the end. (Some streams don’t know in advance how large
they are, so you can rely on the Length
property for only certain kinds of
streams such as FileStream
. Testing
for a return value of 0
is the most
general way to know that we’ve reached the end.)
So, we’ve seen how to read and write data to and from
streams, and how we can move the current position in the stream by seeking
to some offset from a known position. Up until now, we’ve been using the
File.OpenRead
and
File.OpenWrite
methods to create our
file streams. There is another method, File.Open
, which gives us access to some extra
features.
The simplest overload takes two parameters: a string which is the
path for the file, and a value from the FileMode
enumeration.
What’s the FileMode
? Well, it lets us
specify exactly what we want done to the file when we open it. Table 11-6 shows the values available.
Table 11-6. FileMode enumeration
FileMode | Purpose |
---|---|
CreateNew | Creates a brand-new file. Throws an exception if it already exists. |
Create | Creates a new file, deleting any existing file and overwriting it if necessary. |
Open | Opens an existing file, seeking to the beginning by default. Throws an exception if the file does not exist. |
OpenOrCreate | Opens an existing file, or creates a new file if it doesn’t exist. |
Truncate | Opens an existing file, and deletes all its contents. The file is automatically opened for writing only. |
Append | Opens an existing file and seeks to the end of the file. The file is automatically opened for writing only. You can seek in the file, but only within any information you’ve appended—you can’t touch the existing content. |
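To see a few of these modes in action, here's a short sketch (the filename is invented for the demonstration): CreateNew makes the file, Append adds to its end, and Truncate empties it while leaving it in place.

```csharp
// Exercising a few FileMode values via File.Open.
using System;
using System.IO;

class FileModeDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "filemode-demo.txt");
        File.Delete(path); // start from a clean slate; no error if absent

        // CreateNew throws if the file already exists; here it doesn't.
        using (FileStream s = File.Open(path, FileMode.CreateNew))
        {
            s.WriteByte((byte) 'A');
        }

        // Append opens write-only, positioned at the end of the file.
        using (FileStream s = File.Open(path, FileMode.Append))
        {
            s.WriteByte((byte) 'B');
        }
        Console.WriteLine(File.ReadAllText(path));

        // Truncate keeps the file but throws away its contents.
        using (FileStream s = File.Open(path, FileMode.Truncate)) { }
        Console.WriteLine(File.ReadAllBytes(path).Length);

        File.Delete(path);
    }
}
```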
If you use this two-argument overload, the file will be opened in
read/write mode. If that’s not what you want, another overload takes a
third argument, allowing you to control the access mode with a value from
the FileAccess
enumeration. Table 11-7 shows the supported values.
Table 11-7. FileAccess enumeration
FileAccess | Purpose |
---|---|
Read | Open read-only. |
Write | Open write-only. |
ReadWrite | Open read/write. |
All of the file-opening methods we’ve used so far have locked the
file for our exclusive use until we close or Dispose
the object—if any other program tries to
open the file while we have it open, it’ll get an error. However, it is
possible to play nicely with other users by opening the file in a
shared mode. We do this by using the overload which
specifies a value from the FileShare
enumeration, which is shown in Table 11-8.
This is a flags enumeration, so you can combine the values if you
wish.
Table 11-8. FileShare enumeration
FileShare | Purpose |
---|---|
None | No one else can open the file while we’ve got it open. |
Read | Other people can open the file for reading, but not writing. |
Write | Other people can open the file for writing, but not reading (so read/write will fail, for example). |
ReadWrite | Other people can open the file for reading or writing (or both). This is equivalent to combining Read and Write. |
Delete | Other people can delete the file that you’ve created, even while we’ve still got it open. Use with care! |
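The following sketch shows the difference between a shared and an exclusive open within a single process (the temporary file and its contents are invented for the demonstration). Two readers coexist happily when both specify FileShare.Read, but while a stream opened with FileShare.None is alive, a second open attempt fails with an IOException:

```csharp
// Contrasting FileShare.Read (cooperative) with FileShare.None (exclusive).
using System;
using System.IO;

class FileShareDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "shared");

        // Two readers can coexist if both permit sharing for read.
        using (var first = new FileStream(path, FileMode.Open,
                                          FileAccess.Read, FileShare.Read))
        using (var second = new FileStream(path, FileMode.Open,
                                           FileAccess.Read, FileShare.Read))
        {
            Console.WriteLine(second.ReadByte() == (byte) 's');
        }

        // With FileShare.None, any second attempt to open fails.
        using (var exclusive = new FileStream(path, FileMode.Open,
                                              FileAccess.Read, FileShare.None))
        {
            try
            {
                new FileStream(path, FileMode.Open, FileAccess.Read).Dispose();
                Console.WriteLine("opened anyway");
            }
            catch (IOException)
            {
                Console.WriteLine("sharing violation");
            }
        }
        File.Delete(path);
    }
}
```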
You have to be careful when opening files in a shared mode, particularly one that permits modifications. You are open to all sorts of potential exceptions that you could normally ignore (e.g., people deleting or truncating it from underneath you).
If you need even more control over the file when you open it, you
can create a FileStream
instance directly.
There are two types of FileStream
constructors—those for interop scenarios, and the “normal”
ones. The “normal” ones take a string for the file path, while the interop
ones require either an IntPtr
or a
SafeFileHandle
. These wrap a Win32 file
handle that you have retrieved from somewhere. (If you’re not already
using such a thing in your code, you don’t need to use these versions.)
We’re not going to cover the interop scenarios here.
If you look at the list of constructors, the first thing you’ll
notice is that quite a few of them duplicate the various permutations of
FileShare
, FileAccess
, and FileMode
overloads we had on File.Open
.
You’ll also notice equivalents with one extra int
parameter. This allows you to provide a hint
for the system about the size of the internal buffer you’d like the stream
to use. Let’s look at buffering in more detail.
Many streams provide buffering. This means that when you read and write, they actually use an intermediate in-memory buffer. When writing, they may store your data in an internal buffer, before periodically flushing the data to the actual output device. Similarly, when you read, they might read ahead a whole buffer full of data, and then return to you only the particular bit you need. In both cases, buffering aims to reduce the number of I/O operations—it means you can read or write data in relatively small increments without incurring the full cost of an operating system API call every time.
There are many layers of buffering for a typical storage device. There might be some memory buffering on the actual device itself (many hard disks do this, for example), the filesystem might be buffered (NTFS always does read buffering, and on a client operating system it’s typically write-buffered, although this can be turned off, and is off by default for the server configurations of Windows). The .NET Framework provides stream buffering, and you can implement your own buffers (as we did in our example earlier).
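One way to add such a framework-level buffer yourself is the BufferedStream class, which wraps any other stream. A minimal sketch (the 16 KB buffer size and byte counts are arbitrary choices for the demonstration):

```csharp
// Wrapping a FileStream in a BufferedStream: individual WriteByte calls
// accumulate in memory, and reach the file in larger batches.
using System;
using System.IO;

class BufferedStreamDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        using (FileStream raw = File.Create(path))
        using (var buffered = new BufferedStream(raw, 16 * 1024))
        {
            for (int i = 0; i < 1000; ++i)
            {
                // Each write lands in the in-memory buffer; the file is
                // only touched when the buffer fills or is flushed.
                buffered.WriteByte((byte)(i % 256));
            }
        } // disposing flushes the buffer through to the underlying file

        Console.WriteLine(new FileInfo(path).Length);
        File.Delete(path);
    }
}
```

In practice FileStream already buffers internally, so wrapping it like this mostly matters when you tune the sizes deliberately, or when the inner stream has no buffering of its own.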
These buffers are generally put in place for performance reasons.
Although the default buffer sizes are chosen for a reasonable trade-off
between performance and robustness, for an I/O-intensive application,
you may need to hand-tune this using the appropriate constructors on
FileStream
.
As usual, you can do more harm than good if you don’t measure the impact on performance carefully on a suitable range of your target systems. Most applications will not need to touch this value.
Even if you don’t need to tune performance, you still need to be
aware of buffering for robustness reasons. If either the process or the
OS crashes before the buffers are written out to the physical disk, you
run the risk of data loss (hence the reason write buffering is typically
disabled on the server). If you’re writing frequently to a Stream
or StreamWriter
, the .NET Framework will
flush the write buffers periodically. It also ensures that everything is
properly flushed when the stream is closed. However, if you just stop
writing data but you leave the stream open, there’s a good chance data
will hang around in memory for a long time without getting written out,
at which point data loss starts to become more likely.
In general, you should close files as early as possible, but
sometimes you’ll want to keep a file open for a long time, yet still
ensure that particular pieces of data get written out. If you need to
control that yourself, you can call Flush
. This is particularly useful if you have
multiple threads of execution accessing the same stream. You can
synchronize writes and ensure that they are flushed to disk before the
next worker gets in and messes things up! Later in this chapter, we’ll
see an example where explicit flushing is extremely important.
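As a sketch of that idea: suppose we keep a log file open for the lifetime of the application, but want each entry pushed out as soon as it is written. (The filename and the Log method here are invented for illustration.)

```csharp
using System;
using System.IO;

class FlushingLogger
{
    // One writer, held open for the life of the application.
    static readonly StreamWriter logWriter =
        new StreamWriter("app.log", true); // true = append

    public static void Log(string message)
    {
        logWriter.WriteLine("{0:o} {1}", DateTime.Now, message);

        // Push the data out of StreamWriter's buffer and the stream's
        // buffer now, rather than waiting for the stream to be closed.
        logWriter.Flush();
    }
}
```

Note that Flush guarantees only that the data has left the .NET buffers; the filesystem’s own write buffering, discussed above, still applies.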
Another parameter we can set in the constructor is the
FileSystemRights
. We used this type
earlier in the chapter to set filesystem permissions. FileStream
lets us set these directly when we
create a file using the appropriate constructor. Similarly, we can also
specify an instance of a FileSecurity
object to further control the permissions on the underlying file.
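For example, one of those constructors lets us create the file and apply a security descriptor in a single operation. This sketch assumes the .NET Framework on Windows, where these ACL types live in System.Security.AccessControl; the filename is invented:

```csharp
using System.IO;
using System.Security.AccessControl;
using System.Security.Principal;

class SecureFileCreation
{
    static void Main()
    {
        // Build a security descriptor granting the current user full control.
        var security = new FileSecurity();
        security.AddAccessRule(new FileSystemAccessRule(
            WindowsIdentity.GetCurrent().Name,
            FileSystemRights.FullControl,
            AccessControlType.Allow));

        // Create the file with those permissions applied as it is created.
        using (var file = new FileStream(
            "private.dat",
            FileMode.Create,
            FileSystemRights.FullControl,  // rights we want on our own handle
            FileShare.None,
            4096,                          // buffer size
            FileOptions.None,
            security))
        {
            file.WriteByte(42);
        }
    }
}
```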
Finally, we can optionally pass another enumeration to the
FileStream
constructor, FileOptions
, which contains some advanced
filesystem options. They are enumerated in Table 11-9. This is a flags-style enumeration, so you can combine these
values.
Table 11-9. FileOptions enumeration
FileOptions | Purpose |
---|---|
None | No options at all. |
WriteThrough | Ignores any filesystem-level buffers, and writes directly to the output device. This affects only the OS, not any of the other layers of buffering, so it’s still your responsibility to call Flush. |
RandomAccess | Indicates that we’re going to be seeking about in the file in an unsystematic way. This acts as a hint to the OS for its caching strategy. We might be writing a video-editing tool, for example, where we expect the user to be leaping about through the file. |
SequentialScan | Indicates that we’re going to be sequentially reading from the file. This acts as a hint to the OS for its caching strategy. We might be writing a video player, for example, where we expect the user to play through the stream from beginning to end. |
Encrypted | Indicates that we want the file to be encrypted so that it can be decrypted and read only by the user who created it. |
DeleteOnClose | Deletes the file when it is closed. This is very handy for temporary files. If you use this option, you never hit the problem where the file still seems to be locked for a short while even after you’ve closed it (because its buffers are still flushing asynchronously). |
Asynchronous | Allows the file to be accessed asynchronously. |
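Because this is a flags enumeration, the options combine with the | operator. For instance, a scratch file that we’ll write from start to finish and want cleaned up automatically might be opened like this (a sketch; the filename is invented):

```csharp
using System.IO;

class TempFileExample
{
    public static void Main()
    {
        using (var scratch = new FileStream(
            "scratch.tmp",
            FileMode.Create,
            FileAccess.ReadWrite,
            FileShare.None,
            4096,
            // Hint sequential access, and have the OS remove the file
            // for us when the handle is closed.
            FileOptions.SequentialScan | FileOptions.DeleteOnClose))
        {
            byte[] data = { 1, 2, 3 };
            scratch.Write(data, 0, data.Length);
        } // the file is deleted here
    }
}
```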
The last option, Asynchronous
,
deserves a section all to itself.
Long-running file operations are a common bottleneck. How many times have you clicked the Save button, and seen the UI lock up while the disk operation takes place (especially if you’re saving a large file to a network location)?
Developers commonly resort to a background thread to push these long operations off the main thread so that they can display some kind of progress or “please wait” UI (or let the user carry on working). We’ll look at that approach in Chapter 16; but you don’t necessarily have to go that far. You can use the asynchronous mode built into the stream instead. To see how it works, look at Example 11-47.
Example 11-47. Asynchronous file I/O
static void Main(string[] args)
{
    string path = "mytestfile.txt";
    // Create a test file
    using (var file = File.Create(path, 4096, FileOptions.Asynchronous))
    {
        // Some bytes to write
        byte[] myBytes = new byte[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        IAsyncResult asyncResult = file.BeginWrite(
            myBytes, 0, myBytes.Length,
            // A callback function, written as an anonymous delegate
            delegate(IAsyncResult result)
            {
                // You *must* call EndWrite() exactly once
                file.EndWrite(result);
                // Then do what you like
                Console.WriteLine(
                    "Called back on thread {0} when the operation completed",
                    System.Threading.Thread.CurrentThread.ManagedThreadId);
            },
            null);

        // You could do something else while you waited...
        Console.WriteLine(
            "Waiting on thread {0}...",
            System.Threading.Thread.CurrentThread.ManagedThreadId);

        // Waiting on the main thread
        asyncResult.AsyncWaitHandle.WaitOne();

        Console.WriteLine(
            "Completed {0} on thread {1}...",
            asyncResult.CompletedSynchronously ? "synchronously" : "asynchronously",
            System.Threading.Thread.CurrentThread.ManagedThreadId);
        Console.ReadKey();
        return;
    }
}
If you put this code in a new console application, and then compile and run, you’ll get output similar to this (the actual thread IDs will vary from run to run):
Waiting on thread 10...
Completed asynchronously on thread 10...
Called back on thread 6 when the operation completed
So, what is happening?
When we create our file, we use an overload on File.Create
that takes the FileOptions
we discussed
earlier. (Yes, back then we showed that by constructing the FileStream
directly, but
the File
class supports this too.) This
lets us open the file with asynchronous behavior enabled.
Then, instead of calling Write
,
we call BeginWrite
. This takes two
additional parameters. The first is a delegate to a callback function of
type AsyncCallback
, which the framework
will call when it has finished the operation to let us know that it has
completed. The second is an object that we can pass in, that will get
passed back to us in the callback.
This user state object is common to a lot of asynchronous operations, and is used to get information from the calling site to callbacks on the worker thread. It has become less useful in C# with the availability of lambdas and anonymous methods, which can access variables in their enclosing scope.
We’ve used an anonymous method to provide the callback delegate. The
first thing we do in that method is to call file.EndWrite
, passing it the IAsyncResult
we’ve been provided in the
callback. You must call EndWrite
exactly once for every time you call
BeginWrite
, because it cleans up the
resources used to carry out the operation asynchronously. It doesn’t
matter whether you call it from the callback, or on the main application
thread (or anywhere else, for that matter). If the operation has not
completed, it will block the calling thread until it does complete, then
do its cleanup. Should you call it twice with the same IAsyncResult for any reason, the framework will throw an exception.
In a typical Windows Forms or WPF application, we’d probably put up a progress dialog of some kind, and just process messages until we got
our callback. In a server-side application we’re more likely to want to
kick off several pieces of work like this, and then wait for them to
finish. To do this, the IAsyncResult
provides us with an AsyncWaitHandle
,
which is an object we can use to block our thread until the work is
complete.
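Here’s a sketch of that server-side pattern: kick off several asynchronous writes to different files, then block until every one of them has finished. (The filenames and payload are invented for illustration.)

```csharp
using System;
using System.IO;
using System.Threading;

class WaitForSeveral
{
    public static void Main()
    {
        byte[] payload = { 1, 2, 3, 4 };
        var files = new FileStream[3];
        var results = new IAsyncResult[3];
        var waitHandles = new WaitHandle[3];

        // Kick off three asynchronous writes to three different files.
        for (int i = 0; i < 3; i++)
        {
            files[i] = new FileStream(
                "output" + i + ".dat", FileMode.Create, FileAccess.Write,
                FileShare.None, 4096, FileOptions.Asynchronous);
            results[i] = files[i].BeginWrite(payload, 0, payload.Length,
                                             null, null);
            waitHandles[i] = results[i].AsyncWaitHandle;
        }

        // Block this thread until every operation has completed.
        WaitHandle.WaitAll(waitHandles);

        // One EndWrite per BeginWrite; it's fine to do this on the
        // main thread rather than in a callback.
        for (int i = 0; i < 3; i++)
        {
            files[i].EndWrite(results[i]);
            files[i].Dispose();
        }
        Console.WriteLine("All writes complete");
    }
}
```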
So, when we run, our main thread happens to have the ID 10
. It blocks until the operation is complete,
and then prints out the message about being done. Notice that this was, as
you’d expect, on the same thread with ID 10
. But after that, we get
a message printed out from our callback, which was called by the framework
on another thread entirely.
It is important to note that your system may have behaved differently. It is possible that the callback might occur before execution continued on the main thread. You have to be extremely careful that your code doesn’t depend on these operations happening in a particular order.
We’ll discuss these issues in a lot more detail in Chapter 16. We recommend you read that before you use any of these asynchronous techniques in production code.
Remember that we set the FileOptions.Asynchronous
flag when we opened the
file to get this asynchronous behavior? What happens if we don’t do that?
Let’s tweak the code so that it opens with FileOptions.None
instead, and see. Example 11-48 shows the statements
from Example 11-47 that need to be
modified.
Example 11-48. Not asking for asynchronous behavior
...
// Create a test file
using (var file = File.Create(path, 4096, FileOptions.None
))
{
...
If you build and run that, you’ll see some output similar to this:
Waiting on thread 9...
Completed asynchronously on thread 9...
Called back on thread 10 when the operation completed
What’s going on? That all still seemed to be asynchronous!
Well yes, it was, but under the covers the problem was solved in two different ways. In the first case, the .NET Framework used the underlying support Windows provides for asynchronous I/O in the filesystem to handle the asynchronous file operation. In the second case, the framework had to do some work for us: it grabbed a thread from the thread pool, and executed the write operation on that thread to deliver the asynchronous behavior.
That’s true right now, but bear in mind that these are implementation details and could change in future versions of the framework. The principle will remain the same, though.
So far, everything we’ve talked about has been related to files, but we can create streams over other things, too. If you’re a Silverlight developer, you’ve probably been skimming over all of this a bit—after all, if you’re running in the web browser you can’t actually read and write files in the filesystem. There is, however, another option that you can use (along with all the other .NET developers out there): isolated storage.
In the duplicate file detection application we built earlier in this chapter, we had to go to some lengths to find a location and pick filenames for the datafiles we wished to create in test mode, in order to guarantee that we didn’t collide with other applications. We also had to pick locations that we knew we would (probably) have permission to write to, and that we could then load from again.
Isolated storage takes this one stage further and gives us a means of saving and loading data in a location unique to a particular piece of executing code. The physical location itself is abstracted away behind the API; we don’t need to know where the runtime is actually storing the data, just that the data is stored safely, and that we can retrieve it again. (Even if we want to know where the files are, the isolated storage API won’t tell us.) This helps to make the isolated storage framework a bit more operating-system-agnostic, and removes the need for full trust (unlike regular file I/O). Hence it can be used by Silverlight developers (who can target other operating systems such as Mac OS X) as well as those of us building server or desktop client applications for Windows.
This compartmentalization of the information by characteristics of the executing code gives us a slightly different security model from regular files. We can constrain access to particular assemblies, websites, and/or users, for instance, through an API that is much simpler (although much less sophisticated) than the regular file security.
Although isolated storage provides you with a simple security model to use from managed code, it does not secure your data effectively against unmanaged code running in a relatively high trust context and trawling the local filesystem for information. So, you should not entrust sensitive data (credit card numbers, say) to isolated storage. That being said, if someone you cannot trust has successfully run unmanaged code in a trusted context on your box, isolated storage is probably the least of your worries.
Our starting point when using isolated storage is a store, and you can think of any given store as being somewhat like one of the well-known directories we dealt with in the regular filesystem. The framework creates a folder for you when you first ask for a store with a particular set of isolation criteria, and then gives back the same folder each time you ask for the store with the same criteria. Instead of using the regular filesystem APIs, we then use special methods on the store to create, move, and delete files and directories within that store.
First, we need to get hold of a store. We do that by calling one
of several static members on the IsolatedStorageFile
class. Example 11-49 starts by getting the
user store for a particular assembly. We’ll discuss what that means
shortly, but for now it just means we’ve got some sort of a store we can
use. It then goes on to create a folder and a file that we can use to
cache some information, and retrieve it again on subsequent runs of the
application.
Example 11-49. Creating folders and files in a store
static void Main(string[] args)
{
    IsolatedStorageFile store = IsolatedStorageFile.GetUserStoreForAssembly();

    // Create a directory - safe to call multiple times
    store.CreateDirectory("Settings");

    // Open or create the file
    using (IsolatedStorageFileStream stream = store.OpenFile(
        @"Settings\standardsettings.txt",
        System.IO.FileMode.OpenOrCreate,
        System.IO.FileAccess.ReadWrite))
    {
        UseStream(stream);
    }
    Console.ReadKey();
}
We create a directory in the store, called Settings. You don’t have to do this; you
could put your file in the root directory for the store, if you wanted.
Then, we use the OpenFile
method on the store to open a
file. We use the standard file path syntax to specify the file, relative
to the root for this store, along with the FileMode
and FileAccess
values that we’re already familiar
with. They all mean the same thing in isolated storage as they do with
normal files. That method returns us an IsolatedStorageFileStream
. This class derives
from FileStream
, so it works in
pretty much the same way.
So, what shall we do with it now that we’ve got it? For the purposes of this example, let’s just write some text into it if it is empty. On a subsequent run, we’ll print the text we wrote to the console.
We’ve already seen StreamWriter
, the handy
wrapper class we can use for writing text to a stream. Previously, we
got hold of one from File.CreateText
,
but remember we mentioned that there’s a constructor we can use to wrap
any Stream
(not just a FileStream
) if we want to
write text to it? Well, we can use that now, for our IsolatedStorageFileStream
. Similarly, we can
use the equivalent StreamReader
to
read text from the stream if it already exists. Example 11-50 implements the
UseStream
method that
Example 11-49 called after
opening the stream, and it uses both StreamReader
and StreamWriter
.
Example 11-50. Using StreamReader and StreamWriter with isolated storage
static void UseStream(Stream stream)
{
    if (stream.Length > 0)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
    else
    {
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.WriteLine(
                "Initialized settings at {0}",
                DateTime.Now.TimeOfDay);
            Console.WriteLine("Settings have been initialized");
        }
    }
}
In the case where we’re writing, we construct our StreamWriter
(in a using
block, because we need to Dispose
it when we’re done), and then use the
WriteLine
method to
write our content. Remember that WriteLine
adds an extra new line on the end of
the text, whereas Write
just writes
the text provided.
In the case where we are reading, on the other hand, we construct
a StreamReader
(also in a using
block), and then read the entire content
using ReadToEnd
. This reads the
entire content of the file into a single string.
So, if you build and run this once, you’ll see some output that looks a lot like this:
Settings have been initialized
That means we’ve run through the write path. Run a second (or subsequent) time, and you’ll see something more like this:
Initialized settings at 10:34:47.7014833
That means we’ve run through the read path.
When you run this, you’ll notice that we end up outputting an
extra blank line at the end, because we’ve read a whole line from the
file—we called writer.WriteLine
when generating the file—and then used Console.WriteLine
, which adds
another end of line after that. You have to be a
little careful when manipulating text like this, to ensure that you
don’t end up with huge amounts of unwanted whitespace because everyone
in some processing chain is generously adding new lines or other
whitespace at the end!
This is a rather neat result. We can use all our standard
techniques for reading and writing to an IsolatedStorageFileStream
once we’ve acquired
a suitable file: the other I/O types such as StreamReader
don’t need to know what kind of
stream we’re using.
So, what makes isolated storage “isolated”? The .NET Framework partitions information written into isolated storage based on some characteristics of the executing code.
Several types of isolated store are available to you:
Isolation by user and assembly (optionally supporting roaming)
Isolation by user, domain, and assembly (optionally supporting roaming)
Isolation by user and application (optionally supporting roaming)
Isolation by user and site (only on Silverlight)
Isolation by machine and assembly
Isolation by machine, domain, and assembly
Isolation by machine and application
Silverlight supports only two of these: by user and site, and by user and application.
In Example 11-49, we acquired a store isolated by user and assembly, using the static method IsolatedStorageFile.GetUserStoreForAssembly.
This store is unique to a particular user, and the assembly in which
the calling code is executing. You can try this out for yourself. If
you log in to your box as a user other than the one under which you’ve
already run our example app, and run it again, you’ll see some output
like this:
Settings have been initialized
That means our settings file doesn’t exist (for this user), so we must have been given a new store.
As you might expect, the user is identified by the authenticated principal for the current thread. Typically, this is the logged-on user that ran the process; but this could have been changed by impersonation (in a web application, for example, you might be running in the context of the web user, rather than that of the ASP.NET process that hosts the site).
Identifying the assembly is slightly more complex. If you have signed the assembly, it uses the information in that signature (be it a strong name signature, or a software publisher signature, with the software publishing signature winning if it has both).
If, on the other hand, the assembly is not signed, it will use the URL for the assembly. If it came from the Internet, it will be of the form:
http://some/path/to/myassembly.dll
If it came from the local filesystem, it will be of the form:
file:///C:/some/path/to/myassembly.dll
Figure 11-9 illustrates how multiple stores get involved when you have several users and several different assemblies. User 1 asks MyApp.exe to perform some task, which asks for user/assembly isolated storage. It gets Store 1. Imagine that User 1 then asks MyApp.exe to perform some other task that requires the application to call on MyAssembly.dll to carry out the work. If that in turn asks for user/assembly isolated storage, it will get a different store (labeled Store 2 in the diagram). We get a different store, because they are different assemblies.
When a different user, User 2, asks MyApp.exe to perform the first task, which then asks for user/assembly isolated storage, it gets a different store again—Store 3 in the diagram—because they are different users.
OK, what happens if we make two copies of MyApp.exe in two different locations, and run them both under the same user account? The answer is that it depends....
If the applications are not signed, the assembly identification rules mean that the two copies don’t match, and so we get two different isolated stores.
If they are signed, the assembly identification rules mean that they do match, so both copies get the same isolated store.
Our app isn’t signed, so if we try this experiment, we’ll see the standard “first run” output for our second copy.
Be very careful when using isolated storage with signed assemblies. The information used from the signature includes the Name, Strong Name Key, and Major Version part of the version info. So, if you rev your application from 1.x to 2.x, all of a sudden you’re getting a different isolated storage scope, and all your existing data will “vanish.” One way to deal with this is to use a distinct DLL to access the store, and keep its version numbers constant.
Isolating by domain means that we look for some information about the application domain in which we are running. Typically, this is the full URL of the assembly if it was downloaded from the Web, or the local path of the file.
Notice that this is the same rule as for the assembly identity if we didn’t sign it! The purpose of this isolation model is to allow a single signed assembly to get different stores if it is run from different locations. You can see a diagram that illustrates this in Figure 11-10.
To get a store with this isolation level, we can call the
IsolatedStorageFile
class’s
GetUserStoreForDomain
method.
A third level of isolation is by user and application. What defines an “application”? Well, you have to sign the whole lot with a publisher’s (Authenticode) signature. A regular strong-name signature won’t do (as that will identify only an individual assembly).
If you want to try this out quickly for yourself, you can run the ClickOnce Publication Wizard on the Publish tab of your example project settings. This will generate a suitable test certificate and sign the app.
To get a store with user and application isolation, we call the
IsolatedStorageFile
class’s
GetUserStoreForApplication
method.
If you haven’t signed your application properly, this method will throw an exception.
So, it doesn’t matter which assembly you call from; as long as it is a part of the same application, it will get the same store. You can see this illustrated in Figure 11-11.
This can be particularly useful for settings that might be shared between several different application components.
What if your application or component has some data you want to make available to all users on the system? Maybe you want to cache common product information or imagery to avoid a download every time you start the app. For these scenarios you need machine isolation.
As you saw earlier, there is an isolation type for the machine which corresponds to each isolation type for the user. The same resolution rules apply in each case. The methods you need are:
GetMachineStoreForApplication
GetMachineStoreForDomain
GetMachineStoreForAssembly
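So a component that wants a machine-wide cache, shared by every user, might do something like this (a sketch; the cache filename is invented, and on the .NET Framework the calling code needs sufficient trust to open a machine store):

```csharp
using System.IO;
using System.IO.IsolatedStorage;

class MachineCache
{
    static void Main()
    {
        // The same store for every user on this machine running this assembly.
        using (IsolatedStorageFile store =
            IsolatedStorageFile.GetMachineStoreForAssembly())
        using (IsolatedStorageFileStream stream = store.OpenFile(
            "productcache.dat", FileMode.OpenOrCreate, FileAccess.ReadWrite))
        using (var writer = new StreamWriter(stream))
        {
            writer.WriteLine("cached product data");
        }
    }
}
```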
Isolated storage has the ability to set quotas on particular storage scopes. This allows you to limit the amount of data that can be saved in any particular store. This is particularly important for applications that run with partial trust—you wouldn’t want Silverlight applications automatically loaded as part of a web page to be able to store vast amounts of data on your hard disk without your permission.
You can find out a store’s current quota by looking at the
Quota
property on a particular
IsolatedStorageFile
. This is a
long
, which indicates the maximum
number of bytes that may be stored. This is not a “bytes remaining”
count—you can use the AvailableFreeSpace
property for that.
Your available space will go down slightly when you create empty directories and files. This reflects the fact that such items consume space on disk even though they are nominally empty.
The quota can be increased using the IncreaseQuotaTo method, which takes a long specifying the new total size of the store, in bytes. This must be larger than the current quota, or an ArgumentException is thrown. The call may or may not succeed: the user will be prompted, and may refuse your request for more space.
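Here’s a sketch of how a partially trusted application might check for space before a large save, and ask for more if need be (the EnsureSpace helper and the 5 MB figure are invented for illustration):

```csharp
using System;
using System.IO.IsolatedStorage;

class QuotaCheck
{
    static void EnsureSpace(IsolatedStorageFile store, long bytesNeeded)
    {
        // AvailableFreeSpace is the "bytes remaining" figure;
        // Quota is the overall limit.
        if (store.AvailableFreeSpace >= bytesNeeded) { return; }

        // Ask for a bigger quota; the user may refuse, so check the result.
        long newQuota = Math.Max(store.Quota + bytesNeeded, 5 * 1024 * 1024);
        if (!store.IncreaseQuotaTo(newQuota))
        {
            throw new InvalidOperationException(
                "Not enough isolated storage space, and the user " +
                "declined to grant more.");
        }
    }
}
```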
You cannot reduce the quota for a store once you’ve set it, so take care!
As a user, you might want to look at the data stored in isolated storage by applications running on your machine. It can be complicated to manage and debug isolated storage, but there are a few tools and techniques to help you.
First, there’s the storeadm.exe tool. This allows you to inspect
isolated storage for the current user (by default), or the current
machine (by specifying the /machine
option) or current roaming user (by specifying /roaming
).
So, if you try running this command:
storeadm /MACHINE /LIST
you will see output similar to this (listing the various stores for this machine, along with the evidence that identifies them):
Microsoft (R) .NET Framework Store Admin 4.0.30319.1
Copyright (c) Microsoft Corporation. All rights reserved.

Record #1
[Assembly]
<StrongName version="1"
Key="0024000004800000940000000602000000240000525341310004000001000100A5FE84898F
190EA6423A7D7FFB1AE778141753A6F8F8235CBC63A9C5D04143C7E0A2BE1FC61FA6EBB52E7FA9B
48D22BAF4027763A12046DB4A94FA3504835ED9F29CD031600D5115939066AABE59A4E61E932AEF
0C24178B54967DD33643FDE04AE50786076C1FB32F64915E8200729301EB912702A8FDD40F63DD5
A2DE218C7"
Name="ConsoleApplication7"
Version="1.0.0.0"/>
    Size : 0

Record #2
[Domain]
<StrongName version="1"
Key="0024000004800000940000000602000000240000525341310004000001000100A5FE84898F
190EA6423A7D7FFB1AE778141753A6F8F8235CBC63A9C5D04143C7E0A2BE1FC61FA6EBB52E7FA9B
48D22BAF4027763A12046DB4A94FA3504835ED9F29CD031600D5115939066AABE59A4E61E932AEF
0C24178B54967DD33643FDE04AE50786076C1FB32F64915E8200729301EB912702A8FDD40F63DD5
A2DE218C7"
Name="ConsoleApplication7"
Version="1.0.0.0"/>
[Assembly]
<StrongName version="1"
Key="0024000004800000940000000602000000240000525341310004000001000100A5FE84898F
190EA6423A7D7FFB1AE778141753A6F8F8235CBC63A9C5D04143C7E0A2BE1FC61FA6EBB52E7FA9B
48D22BAF4027763A12046DB4A94FA3504835ED9F29CD031600D5115939066AABE59A4E61E932AEF
0C24178B54967DD33643FDE04AE50786076C1FB32F64915E8200729301EB912702A8FDD40F63DD5
A2DE218C7"
Name="ConsoleApplication7"
Version="1.0.0.0"/>
    Size : 0
Notice that there are two stores in that example. One is identified by some assembly evidence (the strong name key, name, and major version info). The other is identified by both domain and assembly evidence. Because the sample application is in a single assembly, the assembly evidence for both stores happens to be identical!
You can also add the /REMOVE
parameter which will delete all of the isolated storage in use at the
specified scope. Be very careful if you do this,
as you may well delete storage used by another application
entirely.
That’s all very well, but you can’t see the place where those files are stored. That’s because the actual storage is intended to be abstracted away behind the API. Sometimes, however, it is useful to be able to go and pry into the actual storage itself.
Remember, this is an implementation detail, and it could change between versions. It has been consistent since the first version of the .NET Framework, but in the future, Microsoft could decide to store it all in one big file hidden away somewhere, or using some mystical API that we don’t have access to.
We can take advantage of the fact that the debugger can show us the private innards of the IsolatedStorageFile class. If we set a breakpoint on the store.CreateDirectory line in our sample application, we can inspect the IsolatedStorageFile object that was returned by GetUserStoreForAssembly on the previous line. You will see that there is a private field called m_RootDir. This is the actual root directory (in the real filesystem) for the store. You can see an example of that as it is on my machine in Figure 11-12.
If you copy that path and browse to it using Windows Explorer, you’ll see something like the folder in Figure 11-13.
There’s the Settings directory that we created! As you might expect, if you were to look inside, you’d see the standardsettings.txt file our program created.
As you can see, this is a very useful debugging technique, allowing you to inspect and modify the contents of files in isolated storage, and identify exactly which store you have for a particular scope. It does rely on implementation details, but since you’d only ever do this while debugging, the code you ultimately ship won’t depend on any nonpublic features of isolated storage.
OK. So far, we’ve seen two different types of stream: a regular file, and an isolated storage file. We use our familiar stream tools and techniques (like StreamReader and StreamWriter), regardless of the underlying type.
So, what other kinds of stream exist? Well, there are lots; several subsystems in the .NET Framework provide stream-based APIs.
We’ll see some networking ones in Chapter 13, for
example. Another example is from the .NET Framework’s security features: CryptoStream
(which is used for encrypting and
decrypting a stream of data). There’s also a MemoryStream
in System.IO
which uses memory to store the data
in the stream.
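MemoryStream makes the point nicely that our reader and writer types don’t care what’s underneath them. Here’s a small sketch (the RoundTrip helper is invented for illustration) that writes text into memory and reads it straight back:

```csharp
using System.IO;

class MemoryStreamRoundTrip
{
    public static string RoundTrip(string text)
    {
        var buffer = new MemoryStream();

        // StreamWriter doesn't care that this is memory rather than a file.
        var writer = new StreamWriter(buffer);
        writer.Write(text);
        writer.Flush();      // push everything through to the MemoryStream

        buffer.Position = 0; // rewind before reading it back
        var reader = new StreamReader(buffer);
        return reader.ReadToEnd();
    }
}
```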
In this final section, we’ll look at a stream that is not a file. We’ll use a stream from .NET’s cryptographic services to encrypt a string. This encrypted string can be decrypted later as long as we know the key. The test program in Example 11-51 illustrates this.
Example 11-51. Using an encryption stream
static void Main(string[] args)
{
    byte[] key;
    byte[] iv;
    // Get the appropriate key and initialization vector for the algorithm
    SelectKeyAndIV(out key, out iv);

    string superSecret = "This is super secret";
    Console.WriteLine(superSecret);

    string encryptedText = EncryptString(superSecret, key, iv);
    Console.WriteLine(encryptedText);

    string decryptedText = DecryptString(encryptedText, key, iv);
    Console.WriteLine(decryptedText);
    Console.ReadKey();
}
It is going to write a message to the console, encrypt it, write the encrypted text to the console, decrypt it, and write the result of that back to the console. All being well, the first line should be the same as the last, and the middle line should look like gibberish!
Of course, it’s not very useful to encrypt and immediately decrypt again. This example illustrates all the parts in one program—in a real application, decryption would happen in a different place than encryption.
The first thing we do is get a suitable key and initialization vector for our cryptographic algorithm. These are the two parts of the secret key that are shared between whoever is encrypting and decrypting our sensitive data.
A detailed discussion of cryptography is somewhat beyond the scope of this book, but here are a few key points to get us going. Unenciphered data is known as the plain text, and the encrypted version is known as cipher text. We use those terms even if we’re dealing with nontextual data. The key and the initialization vector (IV) are used by a cryptographic algorithm to encrypt the unenciphered data. A cryptographic algorithm that uses the same key and IV for both encryption and decryption is called a symmetric algorithm (for obvious reasons). Asymmetric algorithms also exist, but we won’t be using them in this example.
Needless to say, if an unauthorized individual gets hold of the key and IV, he can happily decrypt any of your cipher text, and you no longer have a communications channel free from prying eyes. It is therefore extremely important that you take care when sharing these secrets with the people who need them, to ensure that no one else can intercept them. (This turns out to be the hardest part—key management and especially human factors turn out to be security weak points far more often than the technological details. This is a book about programming, so we won’t even attempt to solve that problem. We recommend the book Secrets and Lies: Digital Security in a Networked World by Bruce Schneier [John Wiley & Sons] for more information.)
We’re calling a method called SelectKeyAndIV
to get
hold of the key and IV. In real life, you’d likely be sharing this
information between different processes, usually even on different
machines; but for the sake of this demonstration, we’re just creating them
on the fly, as you can see in Example 11-52.
Example 11-52. Creating a key and IV
private static void SelectKeyAndIV(out byte[] key, out byte[] iv)
{
    var algorithm = TripleDES.Create();
    algorithm.GenerateIV();
    algorithm.GenerateKey();
    key = algorithm.Key;
    iv = algorithm.IV;
}
TripleDES
is an example of a
symmetric algorithm, so it derives from a class called SymmetricAlgorithm
. All such classes provide a
couple of methods called GenerateIV
and GenerateKey
that create cryptographically strong
random byte arrays to use as an initialization vector and a key. See the
sidebar below for an explanation of why we need to use a particular kind
of random number generator when cryptography is involved.
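That same cryptographically strong source is available directly, via the RNGCryptoServiceProvider class, should you ever need random bytes without creating an algorithm object first. Here's a minimal sketch (the 24-byte buffer is an illustrative choice, matching the 192-bit TripleDES key size):

```csharp
using System;
using System.Security.Cryptography;

class RandomBytesSketch
{
    static void Main()
    {
        // System.Random is fast but predictable -- fine for games, fatal for
        // keys. RNGCryptoServiceProvider draws on the OS's cryptographic
        // random source instead.
        byte[] key = new byte[24];
        using (var rng = new RNGCryptoServiceProvider())
        {
            rng.GetBytes(key);   // fills the array with strong random bytes
        }
        Console.WriteLine(Convert.ToBase64String(key));
    }
}
```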
OK, with that done, we can now implement our EncryptString
method.
This takes the plain text string, the key, and the initialization vector,
and returns us an encrypted string. Example 11-53
shows an implementation.
Example 11-53. Encrypting a string
private static string EncryptString(string plainText, byte[] key, byte[] iv)
{
    // Create a crypto service provider for the TripleDES algorithm
    var serviceProvider = new TripleDESCryptoServiceProvider();

    using (MemoryStream memoryStream = new MemoryStream())
    using (var cryptoStream = new CryptoStream(
        memoryStream,
        serviceProvider.CreateEncryptor(key, iv),
        CryptoStreamMode.Write))
    using (StreamWriter writer = new StreamWriter(cryptoStream))
    {
        // Write some text to the crypto stream, encrypting it on the way
        writer.Write(plainText);
        // Make sure that the writer has flushed to the crypto stream
        writer.Flush();
        // We also need to tell the crypto stream to flush the final block out
        // to the underlying stream, or we'll be missing some content...
        cryptoStream.FlushFinalBlock();
        // Now, we want to get back whatever the crypto stream wrote to our
        // memory stream.
        return GetCipherText(memoryStream);
    }
}
We’re going to write our plain text to a CryptoStream
, using the standard StreamWriter
adapter. This
works just as well over a CryptoStream
as any other, but instead of coming out as plain text, it will be
enciphered for us. How does that work?
CryptoStream
is quite
different from the other streams we’ve met so far. It doesn’t have any
underlying storage of its own. Instead, it wraps around another Stream
, and then uses an ICryptoTransform
either to transform the data
written to it from plain text into cipher text before writing it to that
output stream (if we put it into CryptoStreamMode.Write
), or to transform what
it has read from the underlying stream back into plain
text before passing it on to the reader (if we put it into CryptoStreamMode.Read
).
So, how do we get hold of a suitable ICryptoTransform
? We’re making use of a
factory class called TripleDESCryptoServiceProvider
. This has a
method called CreateEncryptor
which
will create an instance of an ICryptoTransform
that uses the TripleDES
algorithm to encrypt our plain text,
with the specified key and IV.
A number of different algorithms are available in the framework, with various strengths and weaknesses. In general, they also have a number of different configuration options, the defaults for which can vary between versions of the .NET Framework and even versions of the operating system on which the framework is deployed. To be successful, you’re going to have to ensure that you match not just the key and the IV, but also the choice of algorithm and all its options. In general, you should carefully set everything up by hand, and avoid relying on the defaults (unlike this example, which, remember, is here to illustrate streams).
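Here's a sketch of what "setting everything up by hand" might look like; the particular values shown (CBC chaining, PKCS7 padding, a 192-bit key) are illustrative choices, not recommendations:

```csharp
using System.Security.Cryptography;

// Pin down every setting explicitly rather than trusting the defaults,
// so that both ends of the conversation agree on the algorithm's options.
var algorithm = new TripleDESCryptoServiceProvider
{
    Mode = CipherMode.CBC,        // block chaining mode
    Padding = PaddingMode.PKCS7,  // how the final partial block is filled
    KeySize = 192,                // in bits
    BlockSize = 64                // TripleDES works on 64-bit blocks
};
```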
We provide all of those parameters to its constructor, and then we can use it (almost) like any other stream.
In fact, there is a proviso about CryptoStream
. Because of the way that most
cryptographic algorithms work on blocks of plain text, it has to buffer
up what is being written (or read) until it has a full block, before
encrypting it and writing it to the underlying stream.
This means that, when you finish writing to it, you might not have filled up the final block, and it might not have been flushed out to the destination stream. There are two ways of ensuring that this happens:
Dispose the CryptoStream
.
Call FlushFinalBlock
on the
CryptoStream
.
In many cases, the first solution is the simplest. However, when
you call Dispose
on the CryptoStream
it will also Close
the underlying stream, which is not
always what you want to do. In this case, we’re going to use the
underlying stream some more, so we don’t want to close it just yet.
Instead, we call Flush
on the
StreamWriter
to ensure that it has
flushed all of its data to the CryptoStream
, and then FlushFinalBlock
on the CryptoStream
itself, to ensure that the
encrypted data is all written to the underlying stream.
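If you do prefer the first option, one sketch (reusing the serviceProvider, key, iv, and plainText from Example 11-53) relies on a convenient quirk: MemoryStream.ToArray still works after the stream has been closed, so we can let disposal do all the flushing and collect the bytes afterward:

```csharp
// Let Dispose flush the final block. Disposing the CryptoStream closes
// the MemoryStream too, but ToArray still works on a closed MemoryStream.
byte[] cipherTextBytes;
using (var memoryStream = new MemoryStream())
{
    using (var cryptoStream = new CryptoStream(
        memoryStream,
        serviceProvider.CreateEncryptor(key, iv),
        CryptoStreamMode.Write))
    using (var writer = new StreamWriter(cryptoStream))
    {
        writer.Write(plainText);
    }   // disposal flushes the writer, the final crypto block, and the streams
    cipherTextBytes = memoryStream.ToArray();
}
```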
We can use any sort of stream for that underlying stream. We could
use a file stream on disk, or one of the isolated storage file streams
we saw earlier in this chapter, for example. We could even use one of
the network streams we’re going to see in Chapter 13.
However, for this example we’d like to do everything in memory, and the
framework has just the class for us: the MemoryStream
.
MemoryStream
is very
simple in concept. It is just a stream that uses memory as its backing
store. We can do all of the usual things like reading, writing, and
seeking. It’s very useful when you’re working with APIs that require you
to provide a Stream
, and you don’t
already have one handy.
If we use the default constructor (as in our example), we can read and write to the stream, and it will automatically grow in size as it needs to accommodate the data being written. Other constructors allow us to provide a start size suitable for our purposes (if we know in advance what that might be).
We can even provide a block of memory in the form of a byte[]
array to use as the underlying storage
for the stream. In that case, we are no longer able to resize the
stream, and we will get a NotSupportedException
if we try to write too
much data. You would normally supply your own byte[]
array when you already have one and
need to pass it to something that wants to read
from a stream.
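For example, here's a minimal sketch of the fixed-size behavior (the buffer size and the byte value written are arbitrary choices):

```csharp
using System;
using System.IO;

class FixedBufferSketch
{
    static void Main()
    {
        byte[] buffer = new byte[8];
        using (var fixedStream = new MemoryStream(buffer))
        {
            // Filling the buffer exactly is fine...
            fixedStream.Write(new byte[8], 0, 8);
            try
            {
                // ...but one byte too many cannot be accommodated,
                // because a stream over a supplied array cannot grow.
                fixedStream.WriteByte(0x2A);
            }
            catch (NotSupportedException)
            {
                Console.WriteLine("Fixed-size stream cannot grow");
            }
        }
    }
}
```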
We can find out the current size of the underlying block of memory
(whether we allocated it explicitly, or whether it is being
automatically resized) by looking at the stream’s Capacity
property. Note that this is
not the same as the maximum number of bytes we’ve
ever written to the stream. The automatic resizing tends to overallocate
to avoid the overhead of constant reallocation when writing. In general,
you can determine how many bytes you've actually written by looking at the Position in the stream at the beginning and end of your write operations, or the Length property of the MemoryStream.
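A quick sketch makes the distinction concrete (the exact Capacity value depends on the growth policy, so we only claim it is at least the Length):

```csharp
using System;
using System.IO;

class CapacitySketch
{
    static void Main()
    {
        using (var stream = new MemoryStream())
        {
            stream.Write(new byte[100], 0, 100);
            Console.WriteLine(stream.Length);    // 100: bytes actually written
            Console.WriteLine(stream.Position);  // 100: where the next write lands
            Console.WriteLine(stream.Capacity);  // at least 100, usually more
        }
    }
}
```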
Having used the CryptoStream
to
write the cipher text into the stream, we need to turn that into a
string we can show on the console.
Unfortunately, the cipher text is not actually text at
all—it is just a stream of bytes. We can’t use the UTF8Encoding.UTF8.GetString
technique we saw
in Chapter 10 to turn the bytes into text, because these
bytes don’t represent UTF-8 encoded characters.
Instead, we need some other sort of text-friendly representation if we’re going to be able to print the encrypted text to the console. We could write each byte out as hex digits. That would be a perfectly reasonable string representation.
However, that’s not very compact (each byte takes five characters in the string!):
0x01 0x0F 0x03 0xFA 0xB3
A much more compact textual representation is Base64 encoding. This is a very popular textual encoding of arbitrary data. It’s often used to embed binary in XML, which is a fundamentally text-oriented format.
And even better, the framework provides us with a convenient
static helper method to convert from a byte[]
to a Base64 encoded string: Convert.ToBase64String
.
If you’re wondering why there’s no Encoding
class for Base64 to correspond to
the Unicode, ASCII, and UTF-8 encodings we saw in Chapter 10, it’s because Base64 is a completely different
kind of thing. Those other encodings are mechanisms that define binary
representations of textual information. Base64 does the opposite—it
defines a textual representation for binary information.
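Here's a small sketch contrasting the two representations for the five bytes shown earlier:

```csharp
using System;

class Base64Sketch
{
    static void Main()
    {
        byte[] data = { 0x01, 0x0F, 0x03, 0xFA, 0xB3 };

        // Hex: readable, but verbose
        Console.WriteLine(BitConverter.ToString(data));   // 01-0F-03-FA-B3

        // Base64: roughly 4 characters for every 3 bytes
        string base64 = Convert.ToBase64String(data);
        Console.WriteLine(base64);                        // AQ8D+rM=

        // And it round-trips losslessly
        byte[] decoded = Convert.FromBase64String(base64);
        Console.WriteLine(decoded.Length);                // 5
    }
}
```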
Example 11-54 shows how we make use of
that in our GetCipherText
method.
Example 11-54. Converting to Base64
private static string GetCipherText(MemoryStream memoryStream)
{
    byte[] buffer = memoryStream.ToArray();
    return System.Convert.ToBase64String(buffer, 0, buffer.Length);
}
We use a method on MemoryStream
called ToArray
to get a
byte[]
array containing all the data
written to the stream.
Don’t be caught out by the GetBuffer method, which also returns a byte[] array. GetBuffer returns the whole underlying buffer, including any “extra” bytes that have been allocated but not yet used.
Finally, we call Convert.ToBase64String
to get a string
representation of the underlying data, passing it the byte[]
, along with a start offset into that
buffer of zero (so that we start with the first byte), and the
length.
That takes care of encryption. How about decryption? That’s actually a little bit easier. Example 11-55 shows how.
Example 11-55. Decryption
private static string DecryptString(string cipherText, byte[] key, byte[] iv)
{
    // Create a crypto service provider for the TripleDES algorithm
    var serviceProvider = new TripleDESCryptoServiceProvider();

    // Decode the cipher-text bytes back from the base-64 encoded string
    byte[] cipherTextBytes = Convert.FromBase64String(cipherText);

    // Create a memory stream over those bytes
    using (MemoryStream memoryStream = new MemoryStream(cipherTextBytes))
    // And create a cryptographic stream over the memory stream, using the
    // specified algorithm (with the provided key and initialization vector)
    using (var cryptoStream = new CryptoStream(
        memoryStream,
        serviceProvider.CreateDecryptor(key, iv),
        CryptoStreamMode.Read))
    // Finally, create a stream reader over the stream, and recover the
    // original text
    using (StreamReader reader = new StreamReader(cryptoStream))
    {
        return reader.ReadToEnd();
    }
}
First, we use Convert.FromBase64String
to convert our Base64
encoded string back to an array of bytes. We then construct a MemoryStream
over that byte[]
by passing it to the appropriate
constructor.
As before, we wrap the MemoryStream
with a CryptoStream
, this time passing it the
ICryptoTransform
created by a call to
CreateDecryptor
on our TripleDESCryptoServiceProvider
, and putting it
into CryptoStreamMode.Read
.
Finally, we construct our old friend the StreamReader
over the CryptoStream
, and read the content back as a
string.
So, what’s actually happening here?
CryptoStream
uses the ICryptoTransform
to take care of turning the
cipher text in the MemoryStream
back
into plain text. If you remember, that plain text is actually the set of
UTF-8 encoded bytes we originally wrote to the stream with the StreamWriter
back in the encryption phase. So,
the StreamReader
takes those and
converts them back into a string for us. You can see that illustrated in
Figure 11-14.
This is a very powerful example of how we can plug together
various components in a kind of pipeline to achieve quite complex
processing, from simple, easily understood building blocks that conform
to a common pattern, but which have no dependencies on each other’s
implementation details. The Stream
abstraction is the key to this flexibility.
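To tie the whole example together, here's a sketch of how the pieces from Examples 11-52 through 11-55 might be exercised (the plain text is, of course, arbitrary):

```csharp
// Generate a shared key and IV, encrypt a string, then recover it.
byte[] key, iv;
SelectKeyAndIV(out key, out iv);

string cipherText = EncryptString("Hello, crypto!", key, iv);
Console.WriteLine(cipherText);              // Base64-encoded cipher text

string recovered = DecryptString(cipherText, key, iv);
Console.WriteLine(recovered);               // Hello, crypto!
```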
In this chapter we looked at the classes in the System.IO
namespace that relate to files and
streams. We saw how we can use static methods on the File
, Directory
, and Path
classes to manage and manipulate files and
folders in the filesystem, including creating, deleting, appending, and
truncating data, as well as managing their access permissions.
We saw how to use StreamReader
and StreamWriter
to deal with reading
and writing text from files, and how we can also read and write binary
data using the underlying Stream
objects themselves, including the ability to Seek
backward and forward in the file.
We then looked at a special type of file stream called isolated storage. This gives us the ability to manage the scope of file access to particular users, machines, applications, or even assemblies. We gain control over quotas (the maximum amount of space any particular store is allowed to use), and get to use local file storage in normally restricted security contexts like that of a Silverlight application, for example.
Finally, we looked at some streams that aren’t files, including
MemoryStream
, which uses memory as its
underlying storage mechanism, and CryptoStream
, which has no storage of its own,
delegating that responsibility to another stream. We showed how these
patterns can be used to plug streams together into a processing
pipeline.
[24] In fact, it is slightly more constrained than that. The .NET Framework limits arrays to 2 GB, and will throw an exception if you try to load a larger file into memory all at once.