Chapter 11. Simple File Handling

Files are the most ubiquitous resource abstraction used in the Unix world. Resources such as memory, disk space, devices, and interprocess communication (IPC) channels are represented as files. By providing a uniform abstraction of these resources, Unix reduces the number of software interfaces programmers must master. The resources accessed through file operations are as follows:

regular files

The kind of files most computer users think of. They serve as data repositories that can grow arbitrarily large and allow random access. Unix files are byte-oriented—any other logical boundaries are purely application conventions; the kernel knows nothing about them.

pipes

Unix’s simplest IPC mechanism. Usually, one process writes information into the pipe while another reads from it. Pipes are what shells use to provide I/O redirection (for example, ls -lR | grep notes or ls | more), and many programs use pipes to feed input to programs that run as their subprocesses. There are two types of pipes: unnamed and named. Unnamed pipes are created as they are needed and disappear once both the read and the write ends of the pipe are closed. Unnamed pipes are so called because they do not exist in the file system and, therefore, have no file name.[1] Named pipes do have file names, and the file name is used to allow two independent processes to communicate through the pipe (similar to the way Unix domain sockets work[2]). Pipes are also known as FIFOs because the data is ordered in a first-in/first-out manner.

directories

Special files that consist of a list of files they contain. Old Unix implementations allowed programs to read and write them in exactly the same manner as regular files. To allow better abstraction, a special set of system calls was added to provide directory manipulation, although the directories are still opened and closed like regular files. Those functions are presented in Chapter 14.

device files

Most physical devices are represented as files. There are two types of device files: block devices and character devices. Block device files represent hardware devices[3] that cannot be read from a byte at a time; they must be read from in multiples of some block size. Under Linux, block devices receive special handling from the kernel[4] and can contain file systems.[5] Disk drives, including CDROM drives and RAM disks, are the most common block devices. Character devices can be read from a single character at a time, and the kernel provides no caching or ordering facilities for them. Modems, terminals, printers, sound cards, and mice are all character devices. Traditionally, special directory entries kept in the /dev directory allow user-space processes to access device resources as files.

symbolic links

A special kind of file that contains the path to another file. When a symbolic link (symlink) is opened, the system recognizes it as a symlink, reads its value, and opens the file it references instead of the symlink itself. When the value stored in the symbolic link is used, the system is said to be following the symlink. Unless otherwise noted, system calls are assumed to follow symlinks that are passed to them.

sockets

Like pipes, sockets provide an IPC channel. They are more flexible than pipes, and can create IPC channels between processes running on different machines. Sockets are discussed in Chapter 17.

[1] Under Linux, the /proc file system includes information on every file currently open on the system. Although this means that unnamed pipes can be found in a file system, they still do not have permanent file names, because they disappear when the processes using the pipe end.

[2] See Chapter 17 for more information on Unix domain sockets.

[3] Not all block devices represent actual hardware. A better description of a block device is an entity on which a file system can reside; Linux’s loopback block device maps a regular file to a logical block device that allows that file to contain a complete file system.

[4] Most notably, they are cached and access to them is ordered.

[5] This is different from some systems that are capable of mounting file systems on character devices, as well as block devices.

In many operating systems, there is a one-to-one correspondence between files and file names. Every file has a file name in a file system, and every file name maps to a single file. Unix divorces the two concepts, allowing for more flexibility.

The only unique identity a file has is its inode (an abbreviation of information node). A file’s inode contains all the information about a file, including the access permissions associated with it, its current size, and how many file names it has (which could be zero, one, twenty, or more). There are two types of inodes. The in-core inode is the only type we normally care about; every open file on the system has one. The kernel keeps in-core inodes in memory, and they are the same for all file-system types. The other type of inodes are on-disk inodes. Every file on a file system has an on-disk inode, and their exact structure depends on the type of file system the file is stored on. When a process opens a file on a file system, the on-disk inode is loaded into memory and converted into an in-core inode. When the in-core inode has been modified, it is transformed back into an on-disk inode and stored in the file system.[6]

On-disk and in-core inodes do not contain exactly the same information. Only the in-core inode, for example, keeps track of how many processes on the system are currently using the file associated with the inode.

As on-disk and in-core inodes are synchronized by the kernel, most system calls end up updating both inodes. When this is the case, we just refer to updating the inode; it is implied that both the in-core and on-disk inodes are affected. Some files (such as unnamed pipes) do not have any on-disk inode. In these cases, only the in-core inode is updated.

A file name exists only in a directory that relates that file name to the on-disk inode. You can think of the file name as a pointer to the on-disk inode for the file associated with it. The on-disk inode contains the number of file names that refer to that inode, called the link count. When a file is removed, the link count is decremented and if the link count is 0 and no processes already have the file open, the space is freed. If other processes have the file open, the disk space is freed when the final process has closed the file.

All of this means that it is possible to

  • Have multiple processes access a file that has never existed in a file system (such as a pipe)

  • Create a file on the disk, remove its directory entry, and continue to read and write from the file

  • Change /tmp/foo and see the changes immediately in /tmp/bar, if both file names refer to the same inode

Unix has always worked this way, but these operations can be disconcerting to new users and programmers. As long as you keep in mind that a file name is merely a pointer to a file’s on-disk inode, and that the inode is the real resource, you should be fine.

The File Mode

Every file on a system has a file type (such as unnamed pipe or character device), as well as a set of access permissions that define what processes may access that file. A file’s type and access permissions are combined in a 16-bit value (a C short) called the file mode.

The bottom 12 bits of a file’s mode represent the access permissions that govern access to the file, as well as file permission modifiers. The file permission modifiers serve a variety of functions. The most important functions are allowing the effective user and group IDs to change when the file is executed.

The file mode is usually written in up to six octal (base 8) digits. When represented in octal, the low-order three digits are the access bits, the next digit contains the file permission modifiers, and the high-order two digits indicate the file type. For example, a file whose mode is 0041777 has a file type of 04, a file permission modifier of 1, and access bits 0777.[7] Similarly, a file of mode 0100755 is of type 010, has no file permission modifiers set, and has access permissions of 0755.

File Access Permissions

Each of the three access digits represents the permissions for a different class of users. The first digit represents the permissions for the file’s owner, the second digit represents permissions for users in the file’s group, and the final digit represents the permissions for all other users. Each octal digit is made up of three bits, which represent read permission, write permission, and execute permission, from most significant to least significant bit. The term world permissions is commonly used to refer to the permission given to all three classes of users.

Let’s try to make the previous paragraph a little more concrete through some examples. Linux’s chmod command allows the user to specify an access mode in octal and then applies that mode to one or more files. If we have a file, somefile, which we would like to allow only the owner to write to but any user (including the owner) to read from, we would use mode 0644 (remember, this is in octal). The leading 6 is 110 in binary, which indicates that the type of user to which it refers (in this case, the owner) is given both read and write permission; the 4s are 100 in binary, giving the other types of users (group and other) only read permission.

$ chmod 0644 somefile
$ ls -l somefile
-rw-r--r--   1 ewt       devel       31 Feb 15 15:12 somefile

If we wanted to allow any member of group devel to write to the file, we would use mode 0664 instead.

$ chmod 0664 somefile
$ ls -l somefile
-rw-rw-r--   1 ewt      devel       31 Feb 15 15:12 somefile

If somefile is a shell script (a program that uses #! at its beginning to specify a command interpreter) that we want to execute, we must tell the system that the file is executable by turning on the execute bit. In this case, we are allowing the owner to read, write, and execute the file and members of group devel to read and execute the file. Other users may not manipulate the file in any way.

$ chmod 0750 somefile
$ ls -l somefile
-rwxr-x---   1 ewt      devel       31 Feb 15 15:12 somefile

Directories use the same set of access bits as normal files, but with slightly different semantics. Read permissions allow a process to access the directory itself, which lets a user list the contents of a directory. Write permissions allow a process to create new files in the directory and delete existing ones. The execute bit does not translate as well, however (what does it mean to execute a directory?). It allows a process to search a directory, which means it can access a file in that directory, as long as it knows the name of that file.

Most system directories on Linux machines have 0755 permissions and are owned by the root user. This lets all users on the system list the files in a directory and access those files by name, but restricts writing in that directory to the root user. Anonymous ftp sites that allow any person to submit files but do not want to let people download them until an administrator has looked at the contents of the files, normally set their incoming file directories to 0772. That allows all users to create new files in the directory without being allowed either to see the contents of the directory or access files in it.

More information on file access permissions may be found in any introductory Linux or Unix book [Sobell, 2002] [Welsh, 1996].

File Permission Modifiers

The file permission modifier digit is also a bitmask, whose values represent setuid, setgid, and the sticky bit. If the setuid bit is set for an executable file, the process’s effective user ID is set to the owner of the file when the program is executed (see page 111 for information on why this is useful). The setgid bit behaves the same way, but sets the effective group ID to the file’s group. The setuid bit has no meaning for files that are not executable, but if the setgid bit is set for a nonexecutable file, any locking done on the file is mandatory rather than advisory.[8] Under Linux, the setuid and setgid bits are ignored for shell scripts because setuid scripts tend to be insecure.

Neither setuid nor setgid bits have any obvious meaning for directories. The setuid bit actually has no semantics when it is set on a directory. If a directory’s setgid bit is set, all new files created in that directory are owned by the same group that owns the directory itself. This makes it easier to use directories for collaborative work among users.

The sticky bit, the least significant bit in the file permission modifier digit, has an interesting history behind its name. Older Unix implementations had to load an entire program into memory before they could begin executing it. This meant big programs had long startup times, which could get quite annoying. If a program had the sticky bit set, the operating system would attempt to leave the program “stuck” in memory for as long as possible, even when the program was not running, reducing the startup time for those programs. Although this was a bit of a kludge, it worked reasonably well for commonly used programs, such as the C compiler. Modern Unix implementations, including Linux, use demand loading to run programs, which loads the program piece by piece and makes the sticky bit unnecessary, so Linux ignores the sticky bit for regular files.

The sticky bit is still used for directories. Usually, any user with write permissions to a directory can erase any file in that directory. If a directory’s sticky bit is set, however, files may be removed only by the user who owns the file and the root user. This behavior is handy for directories that are repositories for files created by a wide variety of users, such as /tmp.

The final section of a file’s mode specifies the file’s type. It is contained in the high-order octal digits of the mode and is not a bitmask. Instead, the value of those digits equates to a specific file type (04 indicates a directory; 06 indicates a block device). A file’s type is set when the file is created. It can never be changed except by removing the file.

The include file <sys/stat.h> provides symbolic constants for all of the access bits, which can make code more readable. Both Linux and Unix users usually become comfortable with the octal representation of file modes, however, so it is common for programs to use the octal values directly. Table 11.1 lists the symbolic names used for both file access permissions and file permission modifiers.

Table 11.1. File Permission Constants

Name      Value     Description
S_ISUID   0004000   The program is setuid.
S_ISGID   0002000   The program is setgid.
S_ISVTX   0001000   The sticky bit.
S_IRWXU   00700     The file’s owner has read, write, and execute permissions.
S_IRUSR   00400     The file’s owner has read permission.
S_IWUSR   00200     The file’s owner has write permission.
S_IXUSR   00100     The file’s owner has execute permission.
S_IRWXG   00070     The file’s group has read, write, and execute permissions.
S_IRGRP   00040     The file’s group has read permission.
S_IWGRP   00020     The file’s group has write permission.
S_IXGRP   00010     The file’s group has execute permission.
S_IRWXO   00007     Other users have read, write, and execute permissions.
S_IROTH   00004     Other users have read permission.
S_IWOTH   00002     Other users have write permission.
S_IXOTH   00001     Other users have execute permission.

File Types

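The upper four bits of a file mode specify the file’s type. Table 11.2 lists the constants that relate to the file’s type. Because the type is a value rather than a bitmask, a mode should be bitwise ANDed with S_IFMT and the result compared against one of the other constants; testing a single type constant against the mode directly can match more than one file type.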

Table 11.2. File Type Constants

Name       Value (Octal)   Description
S_IFMT     00170000        This value, bitwise ANDed with the mode, gives the file type (which equals one of the other S_IF values).
S_IFSOCK   0140000         The file is a socket.
S_IFLNK    0120000         The file is a symbolic link.
S_IFREG    0100000         The file is a regular file.
S_IFBLK    0060000         The file represents a block device.
S_IFDIR    0040000         The file is a directory.
S_IFCHR    0020000         The file represents a character device.
S_IFIFO    0010000         The file represents a first-in/first-out communications pipe.

The following macros take a file mode as an argument and return true or false:

S_ISLNK(m)

True if the file is a symbolic link

S_ISREG(m)

True if the file is a regular file

S_ISDIR(m)

True if the file is a directory

S_ISCHR(m)

True if the file represents a character device

S_ISBLK(m)

True if the file represents a block device

S_ISFIFO(m)

True if the file is a first-in/first-out pipe

S_ISSOCK(m)

True if the file is a socket

The Process’s umask

The permissions given to newly created files depend on both a system’s setup and an individual user’s preferences. To relieve individual programs of the need to guess the permissions to use for a file, the system allows users to turn off particular permissions for newly created files (and directories, which are just special files). Every process has a umask, which specifies the permission bits to turn off when files are created. This allows a process to specify fairly liberal permissions (usually, world read and write permissions) and end up with the permissions the user would like. If the file is particularly sensitive, the creating process can specify more restrictive permissions than normal, because the umask never results in less restrictive permissions, only in more restrictive permissions.

The process’s current umask is set by the umask() system call.

#include <sys/stat.h>

mode_t umask(mode_t newmask);

The old umask is returned, and the process’s umask is set to the new value. Only read, write, and execute permissions may be specified for the file—you cannot use the umask to prevent the setuid, setgid, or sticky bits from being set. The umask command present in most shells allows the user to set the umask for the shell itself and its subsequent child processes.

As an example, the touch command creates new files with 0666 (world read and write) permissions. Because the user rarely wants this, he could force the touch command to turn off world and group write permissions for a file with a umask of 022, as shown by this example:

$ umask 022
$ touch foo
$ ls -l foo
-rw-r--r--   1 ewt     ewt        0 Feb 24 21:24 foo

If he prefers group write permissions, he can use a umask of 002 instead.

$ umask 002
$ touch foo2
$ ls -l foo2
-rw-rw-r--   1 ewt     ewt        0 Feb 24 21:24 foo2

If he wants all his files to be accessible only by himself, a 077 umask will accomplish the task.

$ umask 077
$ touch foo3
$ ls -l foo3
-rw-------   1 ewt     ewt         0 Feb 24 21:26 foo3

The process’s umask affects the open(), creat(), mknod(), and mkdir() system calls.

Basic File Operations

As a large proportion of Linux’s system calls manipulate files, we begin by showing you the functions that are most widely used. We discuss the more specialized functions later in this chapter. The functions used to read through directories are presented in Chapter 14 to help keep this chapter a bit more concise.

File Descriptors

When a process gains access to a file (usually called opening the file), the kernel returns a file descriptor that the process uses to perform subsequent operations on the file. File descriptors are small, nonnegative integers, which serve as indices into an array of open files the kernel maintains for each process.

The first three file descriptors for a process (0, 1, and 2) have standard usages. The first, 0, is known as standard input (stdin) and is where programs should take their interactive input from. File descriptor 1 is called standard output (stdout), and most output from the program should be directed there. Errors should be sent to standard error (stderr), which is file descriptor 2. The standard C library follows these rules, so gets() and printf() use stdin and stdout, respectively, and these conventions allow shells to properly redirect a process’s input and output.

The <unistd.h> header file provides the STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO macros, which evaluate to the stdin, stdout, and stderr file descriptors, respectively. Using these symbolic names can make code slightly more readable.

Many of the file operations that manipulate a file’s inode are available in two forms. The first form takes a file name as an argument. The kernel uses that argument to look up the file’s inode and performs the appropriate operation on the inode (this usually includes following symlinks). The second form takes a file descriptor as an argument and performs the operation on the inode it refers to. The two sets of system calls use similar names, with the system calls that expect a file descriptor argument prefixed with the letter f. For example, the chmod() system call changes the access permissions for the file referred to by the passed file name; fchmod() sets the access permissions for the file referred to by the specified file descriptor.

To make the rest of this discussion a bit less verbose, we present both versions of the system calls when they exist but discuss only the first version (which uses a file name).

Closing Files

One of the few operations that is the same for all types of files is closing the file. Here is how to close a file:

#include <unistd.h>

int close(int fd);

This is obviously a pretty basic operation. However, there is one important thing to remember about closing files—it could fail. Some systems (most notably, networked file systems such as NFS) do not try to store the final piece of written data in the file system until the file is closed. If that storage operation fails (the remote host may have crashed), then the close() returns an error. If your application is writing data but does not use synchronous writes (see the discussion of O_SYNC in the next section), you should always check the results of file closures. If close() fails, the updated file is corrupted in some unpredictable fashion! Luckily, this happens extremely rarely.

Opening Files in the File System

Although Linux provides many types of files, regular files are by far the most commonly used. Programs, configuration files, and data files all fall under that heading, and most applications do not (explicitly) use any other file type. There are two ways of opening files that have associated file names:

#include <fcntl.h>
#include <unistd.h>

int open(const char * pathname, int flags, mode_t mode);
int creat(const char * pathname, mode_t mode);

The open() function returns a file descriptor that references the file pathname. If the return value is less than 0, an error has occurred (as always, errno contains the error code). The flags argument describes the type of access the calling process wants, and also controls various attributes of how the file is opened and manipulated. An access mode must always be provided and is one of O_RDONLY, O_RDWR, and O_WRONLY, which request read-only, read-write, and write-only access, respectively. One or more of the following values may be bitwise OR’ed with the access mode to control other file semantics.

O_CREAT

If the file does not already exist, create it as a regular file.

O_EXCL

This flag should be used only with O_CREAT. When it is specified, open() fails if the file already exists. This flag allows a simple locking implementation, but it is unreliable across networked file systems like NFS.[9]

O_NOCTTY

The file being opened does not become the process’s controlling terminal (see page 136 for more information on controlling terminals). This flag matters only when a process without any controlling terminal is opening a tty device. If it is specified any other time, it is ignored.

O_TRUNC

If the file already exists, the contents are discarded and the file size is set to 0.

O_APPEND

All writes to the file occur at the end of the file, although random access reads are still permitted.

O_NONBLOCK[10]

The file is opened in nonblocking mode. Operations on normal files always block because they are stored on local hard disks with predictable response times, but operations on certain file types have unpredictable completion times. For example, reading from a pipe that does not have any data in it blocks the reading process until data becomes available. If O_NONBLOCK is specified, the read returns immediately rather than blocking; when no data is available, it fails with errno set to EAGAIN. Files that may take an indeterminate amount of time to perform an operation are called slow files.

O_SYNC

Normally, the kernel caches writes and records them to the hardware when it is convenient to do so. Although this implementation greatly increases performance, it is more likely to allow data loss than is immediately writing the data to disk. If O_SYNC is specified when a file is opened, all changes to the file are stored on the disk before the kernel returns control to the writing process. This is very important for some applications, such as database systems, in which write ordering is used to prevent data corruption in case of a system failure.

[9] For more information on file locking, see Chapter 13.

[10] O_NDELAY is the original name for O_NONBLOCK, but it is now obsolete.

The mode parameter specifies the access permissions for the file if it is being created, and it is modified by the process’s current umask. If O_CREAT is not specified, the mode is ignored.

The creat() function is exactly equivalent to

open(pathname, O_CREAT | O_WRONLY | O_TRUNC, mode)

We do not use creat() in this book because we find open() easier to read and to understand.[11]

Reading, Writing, and Moving Around

Although there are a few ways to read from and write to files, only the simplest is discussed here.[12] Reading and writing are nearly identical, so we discuss them simultaneously.

#include <unistd.h>

ssize_t read(int fd, void * buf, size_t length);
ssize_t write(int fd, const void * buf, size_t length);

Both functions take a file descriptor fd, a pointer to a buffer buf, and the length of that buffer. read() reads from the file descriptor and places the data read into the passed buffer; write() writes length bytes from the buffer to the file. Both functions return the number of bytes transferred, or -1 on an error (which implies no bytes were read or stored).

Now that we have covered these system calls, here is a simple example that creates the file hw in the current directory and writes Hello World! into it:

/* hwwrite.c */

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd;

    /* open the file, creating it if it's not there, and removing
       its contents if it is there */
    if ((fd = open("hw", O_TRUNC | O_CREAT | O_WRONLY, 0644)) < 0) {
        perror("open");
        exit(1);
    }

    /* the magic number of 13 is the number of characters which will
       be written: the 12 characters of the greeting plus the newline */
    if (write(fd, "Hello World!\n", 13) != 13) {
        perror("write");
        exit(1);
    }

    close(fd);

    return 0;
}

Here is what happens when we run hwwrite:

$ cat hw
cat: hw: No such file or directory
$ ./hwwrite
$ cat hw
Hello World!
$

Changing this function to read from a file is a simple matter of changing the open() to

open("hw", O_RDONLY);

and changing the write() of a static string to a read() into a buffer.

Unix files can be divided into two categories: seekable and nonseekable.[13] Nonseekable files are first-in/first-out channels that do not support random reads or writes, and data cannot be reread or overwritten. Seekable files allow the reads and the writes to occur anywhere in the file. Pipes and sockets are nonseekable files; block devices and regular files are seekable.

As FIFOs are nonseekable files, it is obvious where read() reads from (the beginning of the file) and write() writes to (the end of the file). Seekable files, on the other hand, have no obvious place for the operations to occur. Instead, both happen at the “current” location in the file and advance the current location after the operation. When a seekable file is initially opened, the current location is at the beginning of the file, or offset 0. If 10 bytes are read, the current position is then at offset 10, and a write of 5 more bytes overwrites the data, starting with the eleventh byte in the file (which is at offset 10, where the current position was). After such a write, the current position becomes offset 15, immediately after the overwritten data.

If the current position is the end of the file and the process tries to read from the file, read() returns 0 rather than an error. If more data is written at the end of the file, the file grows just enough to accommodate the extra data and the current position becomes the new end of the file. Each file descriptor keeps track of an independent current position[14] (it is not kept in the file’s inode), so if a file is opened multiple times by multiple processes (or by the same process, for that matter), reads and writes through one of the file descriptors do not affect the location of reads and writes made through the other file descriptor. Of course, the multiple writes could corrupt the file in other ways, so some sort of locking may be needed in these situations.

Files opened with O_APPEND have a slightly different behavior. For such files, the current position is moved to the end of the file before the kernel writes any data. After the write, the current position is moved to the end of the newly written data, as normal. For append-only files, this guarantees that the file’s current position is always at the end of the file immediately following a write().

Applications that want to read and write data from random locations in the file need to set the current position before reading and writing data, using lseek():

#include <unistd.h>

off_t lseek(int fd, off_t offset, int whence);

The current position for file fd is moved to offset bytes relative to whence, where whence is one of the following:

SEEK_SET[15]

The beginning of the file

SEEK_CUR

The current position in the file

SEEK_END

The end of the file

[15] As most systems define SEEK_SET as 0, it is common to see lseek(fd, offset, 0) used instead of lseek(fd, offset, SEEK_SET). This is not as portable (or readable) as SEEK_SET, but it is fairly common in old code.

For both SEEK_CUR and SEEK_END the offset may be negative. In this case, the current position is moved toward the beginning of the file (from whence) rather than toward the end of the file. For example, the following code moves the current position to five bytes from the end of the file:

lseek(fd, -5, SEEK_END);

The lseek() system call returns the new current position in the file relative to the beginning of the file, or -1 if an error occurred. Thus, lseek(fd, 0, SEEK_END) is a simple way of finding out how large a file is, but make sure you reset the current position before trying to read from fd.

Although the current position is not disturbed by other processes that access the file at the same time,[16] that does not mean multiple processes can safely write to a file simultaneously. Imagine the following sequence:

Process A                   Process B

lseek(fd, 0, SEEK_END);
                            lseek(fd, 0, SEEK_END);
                            write(fd, buf, 10);
write(fd, buf, 5);

In this case process A would have overwritten the first five bytes of process B’s data, which is probably not what was intended. If multiple processes need to append to a file simultaneously, the O_APPEND flag should be used, which makes the operation atomic.

Under most POSIX systems, processes are allowed to move the current position past the end of the file. If data is then written at that position, the file grows to the appropriate size, and the current position becomes the new end of the file. The only catch is that most systems do not actually allocate any disk space for the portion of the file that was never written to; they change only the logical size of the file.

Portions of files that are “created” in this manner are known as holes. Reading from a hole in a file returns a buffer full of zeros, and writing to one could fail with an out-of-disk-space error. All of this means that lseek() should not be used to reserve disk space for later use because that space may not be allocated. If your application needs to allocate some disk space for later use, you must use write(). Files with holes in them are often used for files that have data sparsely spaced throughout them, such as files that represent hash tables.

For a simple, shell-based demonstration of file holes, look at the following example (note that /dev/zero is a character device that returns as many zeros as a process tries to read from it).

$ dd if=/dev/zero of=foo bs=1k count=10
10+0 records in
10+0 records out
$ ls -l foo
-rw-rw-r--   1 ewt      ewt       10240 Feb  6 21:50 foo
$ du foo
10 foo
$ dd if=/dev/zero of=bar bs=1k count=1 seek=9
1+0 records in
1+0 records out
$ ls -l bar
-rw-rw-r--   1 ewt      ewt       10240 Feb  6 21:50 bar
$ du bar
1       bar
$

Although both foo and bar are 10K in size, bar uses only 1K of disk space because the other 9K were skipped over with a seek when the file was created instead of being written.
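The same effect can be sketched in C. The helper below (make_sparse() is a hypothetical name) seeks past the end of a new file and writes a single byte, leaving a hole in front of it. Whether the hole really occupies no disk space depends on the file system.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the file path containing a hole of hole_len bytes followed by
   one byte of real data. make_sparse() is a hypothetical helper.
   Returns 0 on success, -1 on error. */
int make_sparse(const char * path, off_t hole_len) {
    int fd;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    /* Move past the end of the (empty) file, then write one byte.
       Nothing is written to the bytes that were skipped. */
    if (lseek(fd, hole_len, SEEK_SET) == (off_t) -1 ||
        write(fd, "x", 1) != 1) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

After make_sparse("bar", 9 * 1024), ls -l should report a size of 9217 bytes, and reading anywhere in the first 9K returns zeros.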

Partial Reads and Writes

Although both read() and write() take a parameter that specifies how many bytes to read or write, neither one is guaranteed to process the requested number of bytes, even if no error has occurred. The simplest example of this is trying to read from a regular file that is already positioned at the end of the file. The system cannot actually read any bytes, but it is not exactly an error condition either. Instead, the read() call returns 0 bytes. In the same vein, if the current position was 10 bytes from the end of the file and an attempt was made to read more than 10 bytes from the file, 10 bytes would be read and the read() call would return the value 10. Again, this is not considered an error condition.

The behavior of read() also depends on whether the file was opened with O_NONBLOCK. On many file types, O_NONBLOCK does not make any difference at all. Files for which the system can guarantee an operation’s completion in a reasonable amount of time always block on reads and writes; they are sometimes referred to as fast files. This set of files includes local block devices and regular files. For other file types, such as pipes and such character devices as terminals, the process could be waiting for another process (or a human being) to either provide something for the process to read or free resources for the system to use when processing the write() request. In either case, the system has no way of knowing whether it will ever be able to complete the system call. When these files are opened with O_NONBLOCK, for each operation on the file, the system simply does as much as it is able to do immediately, and then returns to the calling process.
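To make the nonblocking case concrete, here is a small sketch. It assumes the descriptor is already open and turns O_NONBLOCK on with fcntl(), which is a choice made for the example instead of passing the flag to open(). The names read_if_ready() and READ_WOULD_BLOCK are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define READ_WOULD_BLOCK (-2)    /* hypothetical "try again later" code */

/* Read from fd without waiting. Returns the number of bytes read
   (0 means end of file), -1 on error, or READ_WOULD_BLOCK if no data
   is available right now. */
ssize_t read_if_ready(int fd, void * buf, size_t len) {
    int flags, saved_errno;
    ssize_t n;

    flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    n = read(fd, buf, len);
    saved_errno = errno;

    fcntl(fd, F_SETFL, flags);   /* restore the original flags */

    if (n < 0 && (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK))
        return READ_WOULD_BLOCK;

    errno = saved_errno;
    return n;
}
```

On an empty pipe whose write end is still open, a blocking read() would wait, while this helper returns READ_WOULD_BLOCK immediately.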

Nonblocking I/O is an important topic, and more examples of it are presented in Chapter 13. With the standardization of the poll() system call, however, the need for it (especially for reading) has diminished. If you find yourself using nonblocking I/O extensively, try to rethink your program in terms of poll() to see if you can make it more efficient.

To show a concrete example of reading and writing files, here is a simple reimplementation of cat. It copies stdin to stdout until there is no more input to copy.

/* cat.c */

#include <stdio.h>
#include <unistd.h>

/* While there is data on standard in (fd 0), copy it to standard
   out (fd 1). Exit once no more data is available. */

int main(void) {
    char buf[1024];
    ssize_t len;

    /* len will be > 0 while data is available and read() is
       successful */
    while ((len = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        /* a short write is treated as an error here */
        if (write(STDOUT_FILENO, buf, len) != len) {
            perror("write");
            return 1;
        }
    }

    /* len was <= 0; if len == 0, no more data is available.
       Otherwise, an error occurred. */
    if (len < 0) {
        perror("read");
        return 1;
    }

    return 0;
}
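The cat example treats a short write as an error. A program that writes to pipes or terminals may prefer to keep writing until everything has been sent. The loop below is one possible sketch of that approach; write_all() is a hypothetical helper name, and retrying after EINTR is a design choice made for the example.

```c
#include <errno.h>
#include <unistd.h>

/* Call write() repeatedly until all len bytes of buf have been
   written to fd. Returns len on success, or -1 if write() fails
   with an error other than EINTR. */
ssize_t write_all(int fd, const void * buf, size_t len) {
    const char * p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);

        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted; try again */
            return -1;
        }

        /* a partial write is not an error; skip what was written */
        p += n;
        left -= (size_t) n;
    }

    return (ssize_t) len;
}
```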

Shortening Files

Although regular files automatically grow when data is written to the end of them, there is no way for the system to automatically shrink files when the data at their end is no longer needed. After all, how would the system know when data becomes extraneous? It is a process’s responsibility to notify the system when a file may be truncated at a certain point.

#include <unistd.h>

int truncate(const char * pathname, off_t length);
int ftruncate(int fd, off_t length);

The file’s size is set to length, and any data in the file past the new end of the file is lost. If length is larger than the current size of the file, the file is actually grown to the indicated length (using holes if possible), although this behavior is not guaranteed by POSIX and should not be relied on in portable programs.
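As a brief illustration, the sketch below discards everything after the first keep bytes of a file. The helper name chop_file() is hypothetical.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Shorten the file named path so that only its first keep bytes
   remain. chop_file() is a hypothetical helper. Returns 0 on success,
   -1 on error. */
int chop_file(const char * path, off_t keep) {
    int fd, rc;

    fd = open(path, O_WRONLY);
    if (fd < 0) return -1;

    rc = ftruncate(fd, keep);    /* data past keep is lost */
    close(fd);

    return rc;
}
```

Calling truncate(path, keep) directly would do the same job in one step; ftruncate() is shown because programs usually have the file open already.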

Synchronizing Files

When a program writes data to the file, the data is normally stored in a kernel cache until it gets written to the physical medium (such as a hard drive), but the kernel returns control to that program as soon as the data is copied into the cache. This provides major performance improvements as it allows the kernel to order writes on the disk and to group multiple writes into a single block operation. In the event of a system failure, however, it has a few drawbacks that could be important. For example, an application that assumes data is stored in a database before the index entry for that data is stored might not handle a failure that results in just the index’s getting updated.

There are a few mechanisms applications can use to wait for data to get written to the physical medium. The O_SYNC flag, discussed on page 168, causes all writes to the file to block the calling process until the medium has been updated. While this certainly works, it is not a very neat approach. Normally, applications do not need to have every operation synchronized; more often, they need to make sure a set of operations has completed before beginning another set. The fsync() and fdatasync() system calls provide these semantics:

#include <unistd.h>

int fsync(int fd);
int fdatasync(int fd);

Both system calls suspend the application until all of the data for the file fd has been written. The fsync() call also waits for the file’s inode information, such as the access time, to get updated.[17] Neither of these system calls can ensure that the data gets written to nonvolatile storage, however. Modern disk drives have large caches, and a power failure could cause some data stored in those caches to get lost.
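The database and index scenario described earlier can be sketched as follows. The file layout and the names append_synced() and store_record() are assumptions made for illustration only.

```c
#include <fcntl.h>
#include <unistd.h>

/* Append len bytes to the file named path and wait until the kernel
   reports that they have been written out. Returns 0 on success and
   -1 on error. */
static int append_synced(const char * path, const void * buf,
                         size_t len) {
    int fd, ok;

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;

    ok = write(fd, buf, len) == (ssize_t) len && fsync(fd) == 0;
    if (close(fd) < 0) ok = 0;

    return ok ? 0 : -1;
}

/* Store a record before the index entry that refers to it. The index
   is not touched until fsync() has returned for the data file, so a
   failure should not leave an index entry without its data. */
int store_record(const char * data_path, const void * rec,
                 size_t rec_len, const char * index_path,
                 const void * entry, size_t entry_len) {
    if (append_synced(data_path, rec, rec_len) < 0) return -1;
    return append_synced(index_path, entry, entry_len);
}
```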

Other Operations

Linux’s file model does a good job of standardizing most file operations through generic functions such as read() and write() (for example, writing to a pipe is the same as writing to a file on disk). However, some devices have operations that are poorly modeled by this abstraction. For example, terminal devices, represented as character devices, need to provide a method to change the speed of the terminal, and a CD-ROM drive, represented by a block device, needs to know when it should play an audio track to help increase a programmer’s productivity.

All of these miscellaneous operations are accessed through a single system call, ioctl() (short for I/O control), which is prototyped like this:

#include <sys/ioctl.h>

int ioctl(int fd, int request, ...);

although it is almost always used like this:

int ioctl(int fd, int request, void * arg);

Whenever ioctl() is used, its first argument is the file being manipulated and the second argument specifies what operation is being requested. The final argument is usually a pointer to something, but what that something is, as well as the exact semantics of the return code, depends on what type of file fd refers to and what type of operation was requested. For some operations, arg is a long value instead of a pointer; in these instances, a typecast is normally used. There are many examples of ioctl() in this book, and you do not need to worry about using ioctl() until you come across them.

Querying and Changing Inode Information

Finding Inode Information

The beginning of this chapter introduced an inode as the data structure that tracks information about a file rather than just a single process’s view of it. For example, a file’s size is a constant at any given time—it does not change for different processes that have access to the file (compare this with the current position in the file, which is unique for the result of each open() rather than a property of the file itself). Linux provides three ways of reading a file’s inode information:

#include <sys/stat.h>

int stat(const char * pathname, struct stat * statbuf);
int lstat(const char *pathname, struct stat * statbuf);
int fstat(int fd, struct stat * statbuf);

The first version, stat(), returns the inode information for the file referenced by pathname, following any symlinks that are present. If you do not want to follow symlinks (to check if a file name is a symlink, for example), use lstat() instead, which does not follow them. The final version, fstat(), returns the inode referred to by an open file descriptor. All three system calls fill in the struct stat referenced by statbuf with information from the file’s inode. Table 11.3 describes the information available from struct stat.
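The difference between stat() and lstat() matters most when checking for symbolic links. The sketch below uses lstat() together with the standard S_ISLNK() macro; is_symlink() is a hypothetical helper name.

```c
#include <sys/stat.h>

/* Return 1 if path itself is a symbolic link, 0 if it is some other
   kind of file, and -1 if lstat() fails. Using stat() here would
   never return 1, because stat() follows the link and describes its
   target instead. */
int is_symlink(const char * path) {
    struct stat sb;

    if (lstat(path, &sb) < 0) return -1;

    return S_ISLNK(sb.st_mode) ? 1 : 0;
}
```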

Table 11.3. Members of struct stat

| Type | Field | Description |
|---|---|---|
| dev_t | st_dev | The device number the file resides on. |
| ino_t | st_ino | The file’s on-disk inode number. Each file has an on-disk inode number unique for the device it is on. Thus the (st_dev, st_ino) pair provides a unique identification of the file. |
| mode_t | st_mode | The mode of the file. This includes information on both the file permissions and the type of file. |
| nlink_t | st_nlink | The number of pathnames that reference this inode. This does not include symlinks, because symlinks reference other file names, not inodes. |
| uid_t | st_uid | The user ID that owns the file. |
| gid_t | st_gid | The group ID that owns the file. |
| dev_t | st_rdev | If the file is a character or block device, this gives the major and minor numbers of the file. See the discussion on mknod() on page 189 for more information on this member and the macros that manipulate its value. |
| off_t | st_size | The size of the file in bytes. This is defined only for regular files. |
| unsigned long | st_blksize | The block size for the file system storing the file. |
| unsigned long | st_blocks | The number of blocks allocated to the file. On Linux these are counted in 512-byte units, regardless of st_blksize. Normally, the allocated space is a little more than st_size because some of the space in the final block is unused. However, for files with holes, the allocated space can be substantially smaller than st_size. |
| time_t | st_atime | The most recent access time of the file. This is normally updated whenever the file’s data is read. |
| time_t | st_mtime | The most recent modification time of the file. It is updated whenever the file’s data has changed. |
| time_t | st_ctime | The most recent change time of the file or the inode, including owner, group, link count, and so on. |
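Because the (st_dev, st_ino) pair identifies a file uniquely, two pathnames can be tested for referring to the same file. The following is a sketch under that assumption; same_file() is a hypothetical helper name.

```c
#include <sys/stat.h>

/* Return 1 if the pathnames a and b refer to the same inode on the
   same device, 0 if they do not, and -1 if either stat() call fails.
   stat() follows symlinks, so a symlink and its target compare as the
   same file. */
int same_file(const char * a, const char * b) {
    struct stat sa, sb;

    if (stat(a, &sa) < 0 || stat(b, &sb) < 0) return -1;

    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Comparing st_ino alone is not enough, since inode numbers are unique only within one device.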

A Simple Example of stat()

Here is a simple program that displays information from lstat() for each file name passed as an argument. It illustrates how to use the values returned by the stat() family of functions.

/* statsamp.c */

/* For each file name passed on the command line, we display all of
   the information lstat() returns on the file. */

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define TIME_STRING_BUF 50

/* Make the user pass buf (of minimum length TIME_STRING_BUF) rather
   than using a static buf local to the function to avoid the use of
   static local variables and dynamic memory. No error should ever
   occur so we don't do any error checking. */
char * timeString(time_t t, char * buf) {
    struct tm * local;

    local = localtime(&t);
    strftime(buf, TIME_STRING_BUF, "%c", local);

    return buf;
}

/* Display all of the information we get from lstat() on the file
   named as our sole parameter. */
int statFile(const char * file) {
    struct stat statbuf;
    char timeBuf[TIME_STRING_BUF];

    if (lstat(file, &statbuf)) {
        fprintf(stderr, "could not lstat %s: %s\n", file,
                strerror(errno));
        return 1;
    }

    /* The casts make each value match its printf() format, since the
       widths of types such as ino_t and off_t vary between systems. */
    printf("Filename : %s\n", file);
    printf("On device: major %u/minor %u    Inode number: %lu\n",
           (unsigned) major(statbuf.st_dev),
           (unsigned) minor(statbuf.st_dev),
           (unsigned long) statbuf.st_ino);
    printf("Size     : %-10ld         Type: %07o       "
           "Permissions: %05o\n", (long) statbuf.st_size,
           (unsigned) (statbuf.st_mode & S_IFMT),
           (unsigned) (statbuf.st_mode & ~(S_IFMT)));
    printf("Owner    : %u                Group: %u"
           "          Number of links: %lu\n",
           (unsigned) statbuf.st_uid, (unsigned) statbuf.st_gid,
           (unsigned long) statbuf.st_nlink);
    /* st_ctime is the inode change time, not a creation time */
    printf("Change time  : %s\n",
           timeString(statbuf.st_ctime, timeBuf));
    printf("Modified time: %s\n",
           timeString(statbuf.st_mtime, timeBuf));
    printf("Access time  : %s\n",
           timeString(statbuf.st_atime, timeBuf));

    return 0;
}

int main(int argc, const char ** argv) {
    int i;
    int rc = 0;

    /* Call statFile() for each file name passed on the
       command line. */
    for (i = 1; i < argc; i++) {
        /* If statFile() ever fails, rc will end up non-zero. */
        rc |= statFile(argv[i]);

        /* this prints a blank line between entries, but not after
           the last entry */
        if ((argc - i) > 1) printf("\n");
    }

    return rc;
}

Easily Determining Access Rights

Although the mode of a file provides all the information a program needs to determine whether it is allowed to access a file, testing the set of permissions is tricky and error prone. As the kernel already includes the code to validate access permissions, a simple system call is provided that lets programs determine whether they may access a file in a certain way:

#include <unistd.h>

int access(const char * pathname, int mode);

The mode is a mask that contains one or more of the following values:

F_OK

The file exists. This requires execute permissions on all the directories in the path being checked, so it could fail for a file that does exist.

R_OK

The process may read from the file.

W_OK

The process may write to the file.

X_OK

The process may execute the file (or search the directory).

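access() returns 0 if the specified access modes are allowed. Otherwise it returns -1 and sets errno, to EACCES if the access is denied or to another value (such as ENOENT) if the check itself could not be made.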
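A short sketch of calling access() follows. The three-way return value and the name check_access() are choices made for this example, not part of the system call's interface.

```c
#include <errno.h>
#include <unistd.h>

/* Ask the kernel whether every kind of access named in mode is
   allowed on path. Returns 1 if it is, 0 if permission is denied
   (errno is EACCES), and -1 for any other failure, such as the file
   not existing. check_access() is a hypothetical helper. */
int check_access(const char * path, int mode) {
    if (access(path, mode) == 0) return 1;

    return (errno == EACCES) ? 0 : -1;
}
```

For example, check_access("notes.txt", R_OK | W_OK) asks whether the file may be both read and written. The answer can be out of date by the time the file is opened, so the result is a hint, not a guarantee.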

Changing a File’s Access Permissions

A file’s access permissions and access permission modifiers are changed by the chmod() system call.

#include <sys/stat.h>

int chmod(const char * pathname, mode_t mode);
int fchmod(int fd, mode_t mode);

Although chmod() allows you to specify a path, remember that file permissions are based on the inode, not on the file name. If a file has multiple hard links to it, changing the permissions of one of the file’s names changes the file’s permissions everywhere it appears in the file system. The mode parameter may be any combination of the access and access modifier bits just discussed, bitwise OR’ed together. As there are normally quite a few of these values specified at a time, it is common for programs to specify the new permission value directly in octal. Only the root user and the owner of a file are allowed to change the access permissions for a file—all others who attempt to do so get EPERM.

Changing a File’s Owner and Group

Just like a file’s permissions, a file’s owner and group are stored in the file’s inode, so all hard links to a file have the same owner and group. The same system call is used to change both the owner and the group of a file.

#include <unistd.h>

int chown(const char * pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);

The owner and group parameters specify the new owner and group for the file. If either is -1, that value is not changed. Only the root user may change the owner of a file. When a file’s owner is changed or the file is written to, the setuid bit for that file is always cleared for security reasons. Both the owner of a file and the root user may change the group that owns a file, but the owner must be a member of the group to which he is changing the file. If the file has its group execute bit set, the setgid bit is cleared for security reasons. If the setgid bit is set but the group execute bit is not, the file has mandatory locking enabled and the mode is preserved.

Changing a File’s Timestamps

A file’s owner may change the mtime and atime of a file to any arbitrary value. This makes these timestamps useless as audit trails, but it allows archiving tools like tar and cpio to reset a file’s timestamps to the same values they had when the file was archived. The ctime is changed when the mtime and atime are updated, so tar and cpio are unable to restore it.

There are two ways of changing these stamps: utime() and utimes(). utime() originated in System V and was adopted by POSIX, whereas utimes() is from BSD. Both functions are equivalent; they differ only in the way the new timestamps are specified.

#include <utime.h>
int utime(const char * pathname, struct utimbuf * buf);

#include <sys/time.h>
int utimes(const char * pathname, struct timeval * tvp);

The POSIX version, utime(), takes a struct utimbuf, which is defined in <utime.h> as

struct utimbuf {
    time_t actime;
    time_t modtime;
};

BSD’s utimes() instead takes tvp, an array of two struct timeval values. The structure is defined in <sys/time.h>.

struct timeval {
    long tv_sec;
    long tv_usec;
};

The first element, tvp[0], holds the new atime, and the second, tvp[1], holds the new mtime. Within each element, tv_sec gives the time in seconds and tv_usec adds a fraction of a second in microseconds.

If NULL is passed as the second argument to either function, both timestamps are set to the current time. The new atime and mtime are specified in elapsed seconds since the epoch (just like the value time() returns), as defined in Chapter 18.

Ext3 Extended Attributes

The primary file system used on Linux systems is the Third Extended File System,[18] commonly abbreviated as ext3. Although the ext3 file system supports all the traditional features of Unix file systems, such as the meanings of the various bits in each file’s mode, it allows for other attributes for each file. Table 11.4 describes the extra attributes currently supported, along with the symbolic name for each attribute. These flags may be set and inspected through the chattr and lsattr programs.

Table 11.4. Extended File Attributes

| Attribute | Definition |
|---|---|
| EXT3_APPEND_FL | If the file is opened for writing, O_APPEND must be specified. |
| EXT3_IMMUTABLE_FL | The file may not be modified or removed by any user, including root. |
| EXT3_NODUMP_FL | The file should be ignored by the dump command. |
| EXT3_SYNC_FL | The file must be updated synchronously, as if O_SYNC had been specified on opening. |

As the ext3 extended attributes are outside the standard file system interface, they cannot be modified through chmod() like the file attributes. Instead, ioctl() is used. Recall that ioctl() is defined as

#include <sys/ioctl.h>
#include <linux/ext3_fs.h>

int ioctl(int fd, int request, void * arg);

The file whose attributes are being changed must be open, just as for fchmod(). The request is EXT3_IOC_GETFLAGS to check the current flags for the file and EXT3_IOC_SETFLAGS to set them. In either case, arg must point to an int. If EXT3_IOC_GETFLAGS is used, the int is set to the current value of the file’s flags. If EXT3_IOC_SETFLAGS is used, the new value of the file’s flags is taken from the int pointed to by arg.

The append and immutable flags may be changed only by the root user, as they restrict the operations the root user is able to perform. The other flags may be modified by either the root user or the owner of the file whose flags are being modified.

Here is a small program that displays the flags for any files specified on the command line. It works only for files on an ext3 file system,[19] however. The ioctl() fails for files on any other type of file system.

/* checkflags.c */

/* For each file name passed on the command line, display
   information on that file's ext3 attributes. */

#include <errno.h>
#include <fcntl.h>
#include <linux/ext3_fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, const char ** argv) {
    const char ** filename = argv + 1;
    int fd;
    int flags;

    /* Iterate over each file name on the command line. The last
       pointer in argv[] is NULL, so this while() loop is legal. */
    while (*filename) {
        /* Unlike normal attributes, ext3 attributes can only
           be queried if we have a file descriptor (a file name
           isn't sufficient). We don't need write access to query
           the ext3 attributes, so O_RDONLY is fine. */
        fd = open(*filename, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "cannot open %s: %s\n", *filename,
                    strerror(errno));
            return 1;
        }

        /* This gets the attributes, and puts them into flags */
        if (ioctl(fd, EXT3_IOC_GETFLAGS, &flags)) {
            fprintf(stderr, "ioctl failed on %s: %s\n", *filename,
                    strerror(errno));
            return 1;
        }

        printf("%s:", *filename++);

        /* Check for each attribute, and display a message for each
           one which is turned on. */
        if (flags & EXT3_APPEND_FL) printf(" Append");
        if (flags & EXT3_IMMUTABLE_FL) printf(" Immutable");
        if (flags & EXT3_SYNC_FL) printf(" Sync");
        if (flags & EXT3_NODUMP_FL) printf(" Nodump");

        printf("\n");
        close(fd);
    }

    return 0;
}

The following is a similar program that sets the ext3 extended attributes for a given list of files. The first parameter must be a list of the flags to set. Each flag is represented in the list by a single letter: A for append-only, I for immutable, S for sync, and N for the nodump flag. This program does not preserve the current flags for the file; the flags specified on the command line are turned on, and all the others are turned off.

/* setflags.c */

/* The first parameter to this program is a string consisting of
   0 (an empty string is okay) or more of the letters I, A, S, and
   N. This string specifies which ext3 attributes should be turned
   on for the files which are specified on the rest of the command
   line -- the rest of the attributes are turned off. The letters
   stand for immutable, append-only, sync, and nodump, respectively.

   For example, the command "setflags IN file1 file2" turns on the
   immutable and nodump flags for files file1 and file2, but turns
   off the sync and append-only flags for those files. */

#include <errno.h>
#include <fcntl.h>
#include <linux/ext3_fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, const char ** argv) {
    const char ** filename = argv + 1;
    int fd;
    int flags = 0;

    /* Make sure the flags to set were specified, along with
       some file names. Allow a "0" to be set to indicate that all
       of the flags should be reset. */
    if (argc < 3) {
        fprintf(stderr, "setflags usage: [0][I][A][S][N] "
                        "<filenames>\n");
        return 1;
    }

    /* each letter represents a flag; set the flags which are
       specified */
    if (strchr(argv[1], 'I')) flags |= EXT3_IMMUTABLE_FL;
    if (strchr(argv[1], 'A')) flags |= EXT3_APPEND_FL;
    if (strchr(argv[1], 'S')) flags |= EXT3_SYNC_FL;
    if (strchr(argv[1], 'N')) flags |= EXT3_NODUMP_FL;

    /* iterate over all of the file names in argv[]; filename
       starts at argv[1] (the flag list), so the pre-increment
       moves it to the first file name */
    while (*(++filename)) {
        /* Unlike normal attributes, ext3 attributes can only
           be set if we have a file descriptor (a file name
           isn't sufficient). We don't need write access to set
           the ext3 attributes, so O_RDONLY is fine. */
        fd = open(*filename, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "cannot open %s: %s\n", *filename,
                    strerror(errno));
            return 1;
        }

        /* Sets the attributes as specified by the contents of
           flags. */
        if (ioctl(fd, EXT3_IOC_SETFLAGS, &flags)) {
            fprintf(stderr, "ioctl failed on %s: %s\n", *filename,
                    strerror(errno));
            return 1;
        }
        close(fd);
    }

    return 0;
}

Manipulating Directory Entries

Remember that directory entries (file names) are nothing more than pointers to on-disk inodes; almost all the important information concerning a file is stored in the inode. open() lets a process create directory entries that are regular files, but other functions are needed to create other types of files and to manipulate the entries themselves. The functions that allow you to create, remove, and search directories are covered in Chapter 14; socket files are introduced in Chapter 17. This section covers symbolic links, device files, and FIFOs.

Creating Device and Named Pipe Entries

Processes create device file entries and named pipes in the file system through mknod().

#include <fcntl.h>
#include <unistd.h>

int mknod(const char * pathname, mode_t mode, dev_t dev);

The pathname is the file name to create, mode is both the access mode of the new file (which gets modified by the current umask) and the new file type (S_IFIFO, S_IFBLK, or S_IFCHR). The final parameter, dev, contains the major and minor numbers of the device to create. The type of device (character or block) and the major number of the device tell the kernel which device driver is responsible for operations on that device file. The minor number is used internally by the device driver to differentiate among multiple devices it provides. Only the root user is allowed to create device files; all users may create named pipes.

The <sys/sysmacros.h> header file provides three macros for manipulating dev_t values. The makedev() macro takes a major number as its first argument and a minor number as its second, and it returns a dev_t suitable for mknod(). The major() and minor() macros take a dev_t value as their sole argument and return the device’s major and minor numbers, respectively.
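The three macros can be exercised without creating any device files. The sketch below formats a dev_t as "major:minor"; format_dev() is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Write the major and minor numbers packed into dev to buf as
   "major:minor". Returns what snprintf() returns: the number of
   characters the full string needs, not counting the final '\0'. */
int format_dev(dev_t dev, char * buf, size_t len) {
    return snprintf(buf, len, "%u:%u",
                    (unsigned) major(dev), (unsigned) minor(dev));
}
```

A value built with makedev(8, 1) should format as 8:1, showing that major() and minor() undo what makedev() packs together.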

The mknod program available under Linux provides a user-level interface to the mknod() system call (see man 1 mknod for details). Here is a simple reimplementation of mknod to illustrate the mknod() system call. Notice that the program creates the file with mode 0666 (giving read and write access to all users), and it depends on the process’s umask setting to get the permissions right.

/* mknod.c */

/* Create the device or named pipe specified on the command line.
   See the mknod(1) man page for details on the command line
   parameters. */

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

void usage(void) {
     fprintf(stderr, "usage: mknod <path> [b|c|u|p] "
                     "<major> <minor>\n");
     exit(1);
}

int main(int argc, const char ** argv) {
    /* named devMajor and devMinor so they are not confused with
       the major() and minor() macros */
    int devMajor = 0, devMinor = 0;
    const char * path;
    int mode = 0666;
    char *end;
    int args;

    /* We always need at least the type of inode to create, and
       the path for it. */
    if (argc < 3) usage();

    path = argv[1];

    /* the second argument tells us the type of node to create */
    if (!strcmp(argv[2], "b")) {
        mode |= S_IFBLK;
        args = 5;
    } else if (!strcmp(argv[2], "c") || !strcmp(argv[2], "u")) {
        mode |= S_IFCHR;
        args = 5;
    } else if (!strcmp(argv[2], "p")) {
        mode |= S_IFIFO;
        args = 3;
    } else {
        fprintf(stderr, "unknown node type %s\n", argv[2]);
        return 1;
    }

    /* args tells us how many parameters to expect, as we need more
       information to create device files than named pipes */
    if (argc != args) usage();

    if (args == 5) {
        /* get the major and minor numbers for the device file to
           create */
        devMajor = strtol(argv[3], &end, 0);
        if (*end) {
            fprintf(stderr, "bad major number %s\n", argv[3]);
            return 1;
        }

        devMinor = strtol(argv[4], &end, 0);
        if (*end) {
            fprintf(stderr, "bad minor number %s\n", argv[4]);
            return 1;
        }
    }

    /* if we're creating a named pipe, the final parameter is
       ignored */
    if (mknod(path, mode, makedev(devMajor, devMinor))) {
        fprintf(stderr, "mknod failed: %s\n", strerror(errno));
        return 1;
    }

    return 0;
}

Creating Hard Links

When multiple file names in the file system refer to a single inode, the files are called hard links to one another. All the file names must reside in the same physical file system (normally, this means they must all be on the same device). When a file has multiple hard links, each of those file names is an exact peer; there is no way to tell which file name was originally used. One benefit of this model is that removing one hard link does not remove the file from the device. The file stays until all the links to it are removed. The link() system call links a new file name to an existing inode.

#include <unistd.h>

int link(const char * origpath, const char * newpath);

The origpath refers to a pathname that already exists; newpath is the path for the new hard link. Any user may create a link to a file he has read access to, as long as he has write access for the directory in which he is creating the link and execute permissions on the directory origpath is in. Only the root user has ever been allowed to create hard links to directories, but doing so was generally a bad idea, because most file systems and several utilities do not handle it very well; it is now entirely disallowed.

Using Symbolic Links

Symbolic links are a more flexible type of link than hard links, but they do not share the peer relationship that hard links enjoy. Whereas hard links share an inode, symbolic links point to other file names. If the destination file name is removed, the symbolic link then points to a file that does not exist, resulting in a dangling link. Using symbolic links between subdirectories is common, and symbolic links may also cross physical file system boundaries, which hard links cannot.

Almost all system calls that access files by their pathname automatically follow symbolic links to find the proper inode. The following calls do not follow symbolic links under Linux:

  • lchown() (on current kernels, chown() itself follows symbolic links)

  • lstat()

  • readlink()

  • rename()

  • unlink()

Creating symbolic links is just like creating hard links, but the symlink() system call is used instead.

#include <unistd.h>

int symlink(const char * origpath, const char * newpath);

If the call is successful, the file newpath gets created as a symbolic link pointing to origpath (often newpath is said to contain origpath as its value).

Finding the value of the symbolic link is a bit more complicated.

#include <unistd.h>

int readlink(const char * pathname, char * buf,
             size_t bufsiz);

The buffer that buf points to is filled in with the contents of the pathname symlink; if buf is too short to hold them, the contents are silently truncated. bufsiz should contain the length of buf in bytes. Usually, the PATH_MAX constant is used for the size of the buffer, as that should be big enough to hold the contents of any symlink.[20] One oddity of readlink() is that it does not end the string it writes to buf with a '\0' character, so buf is not a valid C string even if readlink() succeeds. It instead returns the number of bytes written to buf on success, and -1 on error. Thanks to this quirk, code that uses readlink() often looks like this:

char buf[PATH_MAX + 1];
int bytes;

if ((bytes = readlink(pathname, buf, sizeof(buf) - 1)) < 0) {
    perror("error in readlink");
} else {
    buf[bytes] = '\0';
}
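
For the pathological cases in which PATH_MAX is not large enough, readlink() can be called repeatedly with a growing buffer until the value it returns is smaller than the buffer size. The following is one possible sketch of that approach; read_link_alloc() and read_link_demo() are hypothetical names, and the initial size of 64 bytes and the scratch directory under /tmp are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns a malloc()ed, '\0'-terminated copy of the symlink's value,
 * or NULL on error. The caller must free() the result. */
char *read_link_alloc(const char *pathname)
{
    size_t size = 64;                   /* arbitrary starting size */

    for (;;) {
        char *buf = malloc(size);
        ssize_t bytes;

        if (!buf)
            return NULL;

        bytes = readlink(pathname, buf, size);
        if (bytes < 0) {
            free(buf);
            return NULL;
        }
        if ((size_t) bytes < size) {    /* it fit, with room to spare */
            buf[bytes] = '\0';
            return buf;
        }

        /* the value may have been truncated; retry with more room */
        free(buf);
        size *= 2;
    }
}

/* Exercises read_link_alloc() on a symlink whose value is longer than
 * the initial buffer. Returns 0 if the value survives the round trip. */
int read_link_demo(void)
{
    char dir[] = "/tmp/rldemoXXXXXX";
    char linkpath[64], target[201];
    char *value;
    int rc = -1;

    if (!mkdtemp(dir))
        return -1;
    snprintf(linkpath, sizeof(linkpath), "%s/link", dir);
    memset(target, 'x', sizeof(target) - 1);
    target[sizeof(target) - 1] = '\0';

    if (symlink(target, linkpath) == 0) {   /* a dangling link is fine */
        value = read_link_alloc(linkpath);
        if (value && strcmp(value, target) == 0)
            rc = 0;
        free(value);
        unlink(linkpath);
    }
    rmdir(dir);
    return rc;
}
```

A return value equal to the buffer size is treated as possible truncation, because readlink() gives no other indication that the value did not fit.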

Removing Files

Removing a file removes the pointer to its inode and removes the file’s data if there are no other hard links to the file. If any processes have the file open, the file’s inode is preserved until the final process closes the file, and then the inode and the file’s data are both discarded. As there is no way of forcing a file to be removed immediately, this operation is called unlinking the file, since it removes a file name/inode link.

#include <unistd.h>

int unlink(const char * pathname);
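
Because the inode survives until the last open reference is closed, a process can unlink a file it still has open and keep using it. The sketch below illustrates this under the assumption that /tmp is writable; unlink_while_open() is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Creates a scratch file, unlinks it while it is still open, and then
 * reads its contents back through the open file descriptor.
 * Returns 0 if the data was still readable, -1 otherwise. */
int unlink_while_open(void)
{
    char path[] = "/tmp/unlinkdemoXXXXXX";
    const char msg[] = "still here";
    char buf[sizeof(msg)];
    int fd, rc = -1;

    if ((fd = mkstemp(path)) < 0)
        return -1;

    if (write(fd, msg, sizeof(msg)) == (ssize_t) sizeof(msg) &&
        unlink(path) == 0 &&                /* the name is gone now... */
        access(path, F_OK) < 0 &&           /* ...and cannot be found */
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof(buf)) == (ssize_t) sizeof(buf) &&
        memcmp(buf, msg, sizeof(msg)) == 0) /* but the data is intact */
        rc = 0;

    close(fd);          /* last reference: inode and data are discarded */
    unlink(path);       /* harmless if the name was already removed */
    return rc;
}
```

Programs often use this pattern for temporary files, since the file disappears on its own when the process exits, however it exits.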

Renaming Files

A file name may be changed to any other file name as long as both names are on the same physical partition (this is the same limit that applies to creating hard links). If the new file name already references a file, that name is unlinked before the move takes place. The rename() system call is guaranteed to be atomic. Other processes on the system always see the file in existence under one name or the other; there is no point at which the file does not exist under either name, nor under both names. As open files are not related to file names (only to inodes), renaming a file that other processes have open does not affect those other processes in any way. Here is what the system call looks like:

#include <stdio.h>

int rename(const char * oldpath, const char * newpath);

After the call, the file referenced by oldpath may be referenced by newpath, but not by oldpath.
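
The atomicity of rename() is commonly used to replace a file's contents without other processes ever seeing a partially written file. A minimal sketch of that idiom follows; replace_file() and replace_file_demo() are hypothetical names, the ".tmp" suffix is an arbitrary convention, and a production version would need a unique temporary name.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Replaces the contents of path with data by writing a temporary file
 * next to it and renaming that file over path. Returns 0 or -1. */
int replace_file(const char *path, const char *data)
{
    char tmp[4096];
    size_t len = strlen(data);
    int fd;

    /* the temporary file lives in the same directory, so both names
     * are on the same file system, as rename() requires */
    if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int) sizeof(tmp))
        return -1;

    if ((fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t) len) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Replaces a scratch file under /tmp and reads the result back.
 * Returns 0 if the new contents are in place, -1 otherwise. */
int replace_file_demo(void)
{
    char path[] = "/tmp/renamedemoXXXXXX";
    char buf[16] = "";
    int fd, rc = -1;

    if ((fd = mkstemp(path)) < 0)
        return -1;
    close(fd);

    if (replace_file(path, "new") == 0 &&
        (fd = open(path, O_RDONLY)) >= 0) {
        if (read(fd, buf, sizeof(buf) - 1) == 3 && strcmp(buf, "new") == 0)
            rc = 0;
        close(fd);
    }
    unlink(path);
    return rc;
}
```

A process that already had the old file open keeps reading the old contents, because its file descriptor refers to the old inode rather than to the name.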

Manipulating File Descriptors

Nearly all the file-related system calls we have talked about, with the exception of lseek(), manipulate a file’s inode, which causes their results to be shared among processes that have the file open. There are a few system calls that instead act on the file descriptor itself. The fcntl() system call can be used for numerous file descriptor manipulations. fcntl() looks like this:

#include <fcntl.h>

int fcntl(int fd, int command, long arg);

For many commands, arg is not used. We discuss most of fcntl()’s uses here. It is also used for file locking, file leases, and nonblocking I/O, which are discussed in Chapter 13, as well as directory change notification, which is presented in Chapter 14.

Changing the Access Mode for an Open File

The append mode (as indicated by the O_APPEND flag when the file is opened) and nonblocking mode (the O_NONBLOCK flag) can be turned on and off after a file has already been opened through fcntl()’s F_SETFL command. The arg parameter should be the flags that should be set; if one of the flags is not specified, it is turned off for fd.

F_GETFL may be used to query the current flags for the file. It returns all the flags, including the read/write mode the file was opened in. F_SETFL allows only setting the flags mentioned above; any other flags that are present in the arg parameter are ignored.

fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_RDONLY);

is perfectly legal, but it does not accomplish anything. Turning on append mode for a file descriptor looks like this:

fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_APPEND);

Note that care was taken to preserve the O_NONBLOCK setting. Turning append mode off looks similar:

fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_APPEND);
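
The one-line forms above ignore the possibility that F_GETFL itself fails. One way to wrap the same read-modify-write sequence with error checking is sketched below; set_status_flag() and append_demo() are hypothetical names, and the scratch file under /tmp is an assumption made for illustration.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Turns the status flag `flag` (O_APPEND or O_NONBLOCK) on or off for
 * fd, leaving the other flags as they were. Returns 0 or -1. */
int set_status_flag(int fd, int flag, int on)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    flags = on ? (flags | flag) : (flags & ~flag);
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}

/* With O_APPEND on, a write lands at the end of the file even after
 * seeking back to the start. Returns 0 if that is what happened. */
int append_demo(void)
{
    char path[] = "/tmp/appenddemoXXXXXX";
    char buf[8] = "";
    int fd, rc = -1;

    if ((fd = mkstemp(path)) < 0)
        return -1;
    unlink(path);                   /* the file lives until close() */

    if (write(fd, "abc", 3) == 3 &&
        set_status_flag(fd, O_APPEND, 1) == 0 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        write(fd, "d", 1) == 1 &&           /* appended, not at offset 0 */
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof(buf) - 1) == 4 &&
        memcmp(buf, "abcd", 4) == 0)
        rc = 0;

    close(fd);
    return rc;
}
```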

Modifying the close-on-exec Flag

During an exec() system call, file descriptors are normally left open for the new program to use. In certain cases, you may wish to have files closed when you call exec(). Rather than closing them by hand, you can ask the system to close a certain file descriptor when exec() is called through fcntl()’s F_GETFD and F_SETFD commands. If the close-on-exec flag is set when F_GETFD is used, fcntl() returns non-0; otherwise, it returns 0. The close-on-exec flag is set through F_SETFD; it is disabled if arg is 0 and enabled otherwise.

Here is how you would force fd to be closed when the process exec()s:

fcntl(fd, F_SETFD, 1);
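
The same operation can be written with FD_CLOEXEC, the symbolic name that <fcntl.h> gives to the close-on-exec bit, and with a read-modify-write sequence so that the flag can also be turned off again. This is a sketch rather than required practice; set_cloexec() is a hypothetical helper name.

```c
#include <fcntl.h>

/* Enables (on != 0) or disables (on == 0) the close-on-exec flag for
 * fd. Returns 0 on success and -1 on error. */
int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD, 0);

    if (flags < 0)
        return -1;
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

A caller would typically invoke set_cloexec(fd, 1) right after opening a file that a child program should not inherit.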

Duplicating File Descriptors

Occasionally, a process needs to create a new file descriptor that references a file that is already open. Shells use this functionality to redirect standard input, output, and error at the user’s request. If the process does not care what file descriptor is used for the new reference, it should use dup().

#include <unistd.h>

int dup(int oldfd);

dup() returns a file descriptor that references the same inode as oldfd, or -1 on an error. The oldfd is still a valid file descriptor and still references the original file. The new file descriptor is always the smallest file descriptor currently available. If the process needs the new file descriptor to have a particular value (such as 0 to change its standard input), it should use dup2() instead.

#include <unistd.h>

int dup2(int oldfd, int newfd);

If newfd references an already open file descriptor, that file descriptor is closed. If the call succeeds, it returns the new file descriptor and newfd references the same file as oldfd. The fcntl() system call provides almost the same functionality through the F_DUPFD command. The first argument, fd, is the already open file descriptor. The new file descriptor is the first available file descriptor that is the same or larger than the last argument to fcntl(). (This is different than how dup2() works.) You could implement dup2() through fcntl() like this:

int dup2(int oldfd, int newfd) {
    close(newfd);                   /* ensure newfd is available */
    return fcntl(oldfd, F_DUPFD, newfd);
}

Creating two duped file descriptors that reference the same file is not the same as opening a file twice. Nearly all duped file descriptors’ attributes are shared; they share a common current position, access mode, and locks. (These items are stored in a file structure,[21] one of which is created each time the file is opened. A file descriptor refers to a file structure, and dup()ed file descriptors refer to a single file structure.) The only attribute that can be independently controlled for the two file descriptors is their close-on-exec status. After the process has fork()ed, the parent’s open files are inherited by the child, and those pairs of file descriptors (one in the parent and one in the new child) behave exactly like a file descriptor that has been duped with the current position and most other attributes being shared.[22]
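
The shared current position is easy to observe by seeking on one file descriptor and querying the others. The following sketch compares a dup()ed file descriptor with one obtained from a second open(); shared_position_demo() is a hypothetical name, and the scratch file under /tmp is an assumption.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Compares a dup()ed file descriptor with a second open() of the same
 * file. Returns 0 if the duplicate shares the current position and
 * the independently opened file descriptor does not, -1 otherwise. */
int shared_position_demo(void)
{
    char path[] = "/tmp/dupdemoXXXXXX";
    int fd, dupfd = -1, openfd = -1, rc = -1;

    if ((fd = mkstemp(path)) < 0)
        return -1;

    if (write(fd, "0123456789", 10) == 10 &&
        (dupfd = dup(fd)) >= 0 &&
        (openfd = open(path, O_RDONLY)) >= 0 &&
        lseek(fd, 4, SEEK_SET) == 4 &&
        lseek(dupfd, 0, SEEK_CUR) == 4 &&   /* same file structure */
        lseek(openfd, 0, SEEK_CUR) == 0)    /* its own file structure */
        rc = 0;

    if (openfd >= 0)
        close(openfd);
    if (dupfd >= 0)
        close(dupfd);
    close(fd);
    unlink(path);
    return rc;
}
```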

Creating Unnamed Pipes

Unnamed pipes are similar to named pipes, but they do not exist in the file system. They have no pathnames associated with them, and they and all their remnants disappear after the final file descriptor that references them is closed. They are almost exclusively used for interprocess communication between a child and its parent process or between sibling processes.

Shells use unnamed pipes to execute commands such as ls | head. The ls process writes to the same pipe that head reads its input from, yielding the results the user intended.

Creating an unnamed pipe results in two file descriptors, one of which is read-only and the other one of which is write-only.

#include <unistd.h>

int pipe(int fds[2]);

The sole parameter is filled in with the two returned file descriptors, fds[0] for reading and fds[1] for writing.
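
A typical use of pipe() pairs it with fork(): each process closes the end it does not need, and reading continues until every write end has been closed. The sketch below shows a parent reading what its child writes; read_from_child() is a hypothetical name, and error handling is kept to a minimum.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that writes msg into a pipe; the parent reads it into
 * buf (at most len - 1 bytes) and '\0'-terminates it.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_from_child(const char *msg, char *buf, size_t len)
{
    int fds[2];
    pid_t pid;
    ssize_t n, total = 0;

    if (len == 0 || pipe(fds) < 0)
        return -1;

    if ((pid = fork()) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                     /* child: the writer */
        close(fds[0]);                  /* it never reads */
        if (write(fds[1], msg, strlen(msg)) < 0)
            _exit(1);
        _exit(0);                       /* exiting closes fds[1] */
    }

    close(fds[1]);                      /* parent: the reader */
    while ((size_t) total < len - 1 &&
           (n = read(fds[0], buf + total, len - 1 - total)) > 0)
        total += n;
    buf[total] = '\0';

    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

The parent must close its own copy of fds[1]; otherwise read() would never report end-of-file, because a write end of the pipe would still be open.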

Adding Redirection to ladsh

Now that we have covered the basics of file manipulation, we can teach ladsh to redirect input and output through files and pipes. ladsh2.c, which we present here, handles pipes (denoted by a | in ladsh commands, just as in most shells) and input and output redirection for arbitrary file descriptors. We show only the modified pieces of code here—full source to ladsh2.c is available from http://ladweb.net/lad/src/. The changes to parseCommand() are a simple exercise in string parsing, so we do not bother discussing them here.

The Data Structures

Although ladsh1.c included the concept of a job as multiple processes (presumably tied together by pipes), it did not provide a way of specifying which files to use for a child’s input and output. To allow for this, new data structures are introduced and existing ones modified.

                       REDIRECT_APPEND };

struct redirectionSpecifier {
    enum redirectionType type;  /* type of redirection */
    int fd;                 /* file descriptor being redirected */
    char * filename;        /* file to redirect fd to */
};

struct childProgram {
    pid_t pid;              /* 0 if exited */
    char ** argv;           /* program name and arguments */
    int numRedirections;    /* elements in redirection array */
    struct redirectionSpecifier * redirections; /* I/O redirs */
};

struct redirectionSpecifier tells ladsh2.c how to set up a single file descriptor. It contains an enum redirectionType that tells us whether this redirection is an input redirection, an output redirection that should be appended to an already existing file, or an output redirection that replaces any existing file. It also includes the file descriptor that is being redirected, as well as the name of the file involved. Each child program (struct childProgram) now specifies an arbitrary number of redirections for itself.

These new data structures are not involved in setting up pipes between processes. As a job is defined as multiple child processes with pipes tying them together, there is no need for more explicit information describing the pipes. Figure 11.1 shows how these new data structures would look for the command tail < input-file | sort > output-file.

Figure 11.1. Job Data Structures for ladsh2.c

Changing the Code

Once parseCommand() has set up the data structures properly, running the commands in the proper sequence is easy enough, as long as you watch the details. First of all, we added a loop to runCommand() to start the child processes, because there could now be multiple children. Before entering the loop, we set up nextin and nextout, which are the file descriptors to use for the standard input and the standard output of the next process we start. To begin with, we use the same stdin and stdout as the shell.

Now we take a look at what happens inside the loop. The basic idea is as follows:

  1. If this is the final process in the job, make sure nextout points at stdout. Otherwise, we need to connect the output of this process to the write end of an unnamed pipe.

  2. Fork the new process. Inside the child, redirect stdin and stdout as specified by nextout, nextin, and any file redirections that were specified.

  3. Back in the parent, we close the nextin and nextout used by the just-started child (unless they are the shell’s own stdin or stdout).

  4. Now set up the next process in the job to receive its input from the output of the process we just created (through nextin).

Here is how these ideas translate into C:

    nextin = 0, nextout = 1;
    for (i = 0; i < newJob.numProgs; i++) {
        if ((i + 1) < newJob.numProgs) {
            pipe(pipefds);
            nextout = pipefds[1];
        } else {
            nextout = 1;
        }

        if (!(newJob.progs[i].pid = fork())) {
            if (nextin != 0) {
                dup2(nextin, 0);
                close(nextin);
            }

            if (nextout != 1) {
                dup2(nextout, 1);
                close(nextout);
            }

            /* explicit redirections override pipes */
            setupRedirections(newJob.progs + i);

            execvp(newJob.progs[i].argv[0], newJob.progs[i].argv);
            fprintf(stderr, "exec() of %s failed: %s\n",
                    newJob.progs[i].argv[0],
                    strerror(errno));
            exit(1);
        }

        /* put our child in the process group whose leader is the
           first process in this pipe */
        setpgid(newJob.progs[i].pid, newJob.progs[0].pid);

        if (nextin != 0) close(nextin);
        if (nextout != 1) close(nextout);

        /* If there isn't another process, nextin is garbage
           but it doesn't matter */
        nextin = pipefds[0];

The only other code added to ladsh2.c to allow redirection was setupRedirections(), the source of which appears unchanged in all subsequent versions of ladsh. Its job is to process the struct redirectionSpecifier specifiers for a child job and modify the child’s file descriptors as appropriate. We recommend reading over the function as it appears in Appendix B to ensure you understand its implementation.
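
To give a feel for the kind of work a single redirection involves, here is a rough sketch built only from open(), dup2(), and close(). It is not the Appendix B implementation of setupRedirections(); redirect_fd() is a hypothetical helper, and the choice of open() flags for each redirection type is our assumption.

```c
#include <fcntl.h>
#include <unistd.h>

/* Makes fd refer to filename, opened with the given open() flags.
 * Returns 0 on success and -1 on error. */
int redirect_fd(int fd, const char *filename, int flags)
{
    int openfd = open(filename, flags, 0666);

    if (openfd < 0)
        return -1;

    if (openfd != fd) {
        /* dup2() closes whatever fd referred to before */
        if (dup2(openfd, fd) < 0) {
            close(openfd);
            return -1;
        }
        close(openfd);              /* fd now holds the only reference */
    }
    return 0;
}
```

Under these assumptions, an input redirection would pass O_RDONLY, an overwriting output redirection O_WRONLY | O_CREAT | O_TRUNC, and an appending one O_WRONLY | O_CREAT | O_APPEND. Calling the helper in the child after the pipe setup is what lets explicit redirections override pipes.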



[6] Linux has always used the term inode for both types, while other Unix variants have reserved inode for on-disk inodes and call in-core inodes vnodes. While using the vnode terminology is less confusing, we choose to use inode for both types to keep consistent with Linux standards.

[7] This is the mode usually used for the /tmp directory.

[8] See Chapter 13 for more information on file locking.

[11] creat() is misspelled, anyway.

[12] readv(), writev(), and mmap() are discussed in Chapter 13; sendmsg() and recvmsg() are mentioned in Chapter 17.

[13] Although this division is almost clean, TCP sockets support out-of-band data, which makes it a bit dirtier. Out-of-band data is outside the scope of this book; [Stevens, 2004] provides a complete description.

[14] Almost independent; see the discussion of dup() on page 196 for the exceptions to this.

[16] Well, not usually, anyway. If processes share file descriptors (meaning file descriptors that arose from a single open() call), those processes share the same file structure and the same current position. The most common way for this to happen is for files after a fork(), as discussed on page 197. The other way this can happen is if a file descriptor is passed to another process through a Unix domain socket, which is described on pages 424-425.

[17] The inode information for files is listed in Table 11.3.

[18] So named because it is a journaling version of the Second Extended File System, the successor to the Linux Extended File System, which was designed as a more complete file system than the Minix file system, the only one Linux originally supported.

[19] It actually works just fine on an ext2 file system as well. The two file systems are very similar (an ext3 file system can even be mounted as an ext2 one), and the programs presented here work on both. In fact, if all of the places 3 appears in the source are changed to 2 the programs still compile and function identically.

[20] Although PATH_MAX is not guaranteed to be large enough, it is, for all practical purposes. If you are dealing with pathological cases, you should call readlink() iteratively, making the buffer bigger until readlink() returns a value smaller than bufsiz.

[21] File structures are also known as file table entries or open file objects, depending on the operating system.

[22] The file descriptor in each process refers to the same file structure.
