Each VFS object is stored in a suitable data structure, which includes both the object attributes and a pointer to a table of object methods. The kernel may dynamically modify the methods of the object and, hence, it may install specialized behavior for the object. The following sections explain the VFS objects and their interrelationships in detail.
A superblock object consists of a super_block structure whose fields are described in Table 12-2.
Table 12-2. The fields of the superblock object
| Type | Field | Description |
|---|---|---|
| | | Pointers for superblock list |
| | | Device identifier |
| | | Block size in bytes |
| | | Block size in number of bits |
| | | Modified (dirty) flag |
| | | Maximum size of the files |
| | | Filesystem type |
| | | Superblock methods |
| | | Disk quota methods |
| | | Mount flags |
| | | Filesystem magic number |
| | | Dentry object of mount directory |
| | | Semaphore used for unmounting |
| | | Superblock semaphore |
| | | Reference counter |
| | | Secondary reference counter |
| | | List of modified inodes |
| | | List of inodes involved in I/O |
| | | List of file objects assigned to the superblock |
| | | Pointer to the block device driver descriptor |
| | | Pointers for a list of superblock objects of a given filesystem type (see Section 12.3.2) |
| | | Options for disk quota |
| | | Specific filesystem information |
All superblock objects are linked in a circular doubly linked list. The first element of this list is represented by the super_blocks variable, while the s_list field of the superblock object stores the pointers to the adjacent elements in the list. The sb_lock spin lock protects the list against concurrent accesses in multiprocessor systems.
The last u union field includes superblock information that belongs to a specific filesystem; for instance, as we shall see later in Chapter 17, if the superblock object refers to an Ext2 filesystem, the field stores an ext2_sb_info structure, which includes the disk allocation bit masks and other data of no concern to the VFS common file model.
In general, data in the u field is duplicated in memory for reasons of efficiency. Any disk-based filesystem needs to access and update its allocation bitmaps in order to allocate or release disk blocks. The VFS allows these filesystems to act directly on the u union field of the superblock in memory without accessing the disk.
This approach leads to a new problem, however: the VFS superblock might end up no longer synchronized with the corresponding superblock on disk. It is thus necessary to introduce an s_dirt flag, which specifies whether the superblock is dirty, that is, whether the data on the disk must be updated. The lack of synchronization leads to the familiar problem of a corrupted filesystem when a site’s power goes down without giving the user the chance to shut down the system cleanly. As we shall see in Section 14.2.4, Linux minimizes this problem by periodically copying all dirty superblocks to disk.
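The combination of the circular superblock list and the s_dirt flag can be sketched in user space. This is only an illustrative model, not kernel code: the list_head helpers are a cut-down stand-in for the kernel's list primitives, and toy_super_block, disk_writes, sb_entry, toy_sync_supers, and demo_sync_supers are hypothetical names. Locking (the sb_lock spin lock) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's list_head (assumption, not <linux/list.h>). */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

/* Insert node just before head, i.e. at the tail of the circular list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Toy superblock: only the fields this sketch needs. */
struct toy_super_block {
    struct list_head s_list;  /* links to the adjacent superblocks */
    unsigned char s_dirt;     /* nonzero when the disk copy is stale */
    int disk_writes;          /* hypothetical: counts simulated disk writes */
};

/* Recover the superblock that embeds a given s_list node. */
#define sb_entry(ptr) \
    ((struct toy_super_block *)((char *)(ptr) - \
        offsetof(struct toy_super_block, s_list)))

/* Walk the list once and "write" every dirty superblock. */
static int toy_sync_supers(struct list_head *super_blocks)
{
    struct list_head *p;
    int flushed = 0;

    for (p = super_blocks->next; p != super_blocks; p = p->next) {
        struct toy_super_block *sb = sb_entry(p);
        if (sb->s_dirt) {
            sb->disk_writes++;  /* stands in for the real write to disk */
            sb->s_dirt = 0;
            flushed++;
        }
    }
    return flushed;
}

int demo_sync_supers(void)
{
    struct list_head super_blocks;
    struct toy_super_block a = { .s_dirt = 1 };
    struct toy_super_block b = { .s_dirt = 0 };
    struct toy_super_block c = { .s_dirt = 1 };
    int first, second;

    list_init(&super_blocks);
    list_add_tail(&a.s_list, &super_blocks);
    list_add_tail(&b.s_list, &super_blocks);
    list_add_tail(&c.s_list, &super_blocks);

    first = toy_sync_supers(&super_blocks);   /* a and c are dirty */
    second = toy_sync_supers(&super_blocks);  /* nothing left to flush */
    assert(b.disk_writes == 0);               /* clean superblock untouched */
    return first * 10 + second;
}
```

A second pass flushes nothing because the first pass cleared every s_dirt flag, which is the property a periodic writer relies on.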
The methods associated with a superblock are called superblock operations. They are described by the super_operations structure whose address is included in the s_op field.
Each specific filesystem can define its own superblock operations. When the VFS needs to invoke one of them, say read_inode( ), it executes the following:
sb->s_op->read_inode(inode);
where sb stores the address of the superblock object involved. The read_inode field of the super_operations table contains the address of the suitable function, which is therefore directly invoked.
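The indirect call through s_op can be reproduced with a small user-space model. The sketch below is an assumption-laden illustration rather than the kernel's definitions: the structures keep only a few fields, and toyfs_read_inode, toyfs_sops, vfs_read_inode, and demo_read_inode are hypothetical names. It also shows the usual guard against a method that a filesystem left unimplemented.

```c
#include <assert.h>
#include <stddef.h>

struct inode;
struct super_block;

/* Cut-down method table: two of the many superblock methods. */
struct super_operations {
    void (*read_inode)(struct inode *);
    void (*write_super)(struct super_block *);
};

struct super_block {
    unsigned char s_dirt;
    const struct super_operations *s_op;
};

struct inode {
    unsigned long i_ino;       /* identifies the inode on disk */
    unsigned long i_size;
    struct super_block *i_sb;
};

/* Hypothetical filesystem method: pretends to read the inode from disk. */
static void toyfs_read_inode(struct inode *inode)
{
    inode->i_size = inode->i_ino * 100;
}

/* The filesystem fills in only the methods it implements. */
static const struct super_operations toyfs_sops = {
    .read_inode = toyfs_read_inode,
    /* .write_super is left NULL */
};

/* Dispatch through the table, refusing to call a missing method. */
static int vfs_read_inode(struct inode *inode)
{
    struct super_block *sb = inode->i_sb;

    if (!sb->s_op || !sb->s_op->read_inode)
        return -1;
    sb->s_op->read_inode(inode);
    return 0;
}

unsigned long demo_read_inode(unsigned long ino)
{
    struct super_block sb = { .s_dirt = 0, .s_op = &toyfs_sops };
    struct inode inode = { .i_ino = ino, .i_size = 0, .i_sb = &sb };

    assert(sb.s_op->write_super == NULL);  /* unimplemented method */
    if (vfs_read_inode(&inode) != 0)
        return 0;
    return inode.i_size;
}
```

Because the caller only sees the table, a different filesystem can supply a different read_inode without any change to the dispatching code.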
Let’s briefly describe the superblock operations, which implement higher-level operations like deleting files or mounting disks. They are listed in the order they appear in the super_operations table:
read_inode(inode)
Fills the fields of the inode object whose address is passed as the parameter from the data on disk; the i_ino field of the inode object identifies the specific filesystem inode on the disk to be read.
read_inode2(inode, p)
Similar to the previous one, but the inode is identified by a 64-bit number pointed to by p. This method should disappear as soon as the whole VFS architecture moves to 64-bit quantities; for now, it is used by the ReiserFS filesystem only.
dirty_inode(inode)
Invoked when the inode is marked as modified (dirty). Used by filesystems like ReiserFS and Ext3 to update the filesystem journal on disk.
write_inode(inode, flag)
Updates a filesystem inode with the contents of the inode object passed as the parameter; the i_ino field of the inode object identifies the filesystem inode on disk that is concerned. The flag parameter indicates whether the I/O operation should be synchronous.
put_inode(inode)
Releases the inode object whose address is passed as the parameter. As usual, releasing an object does not necessarily mean freeing memory, since other processes may still use that object.
delete_inode(inode)
Deletes the data blocks containing the file, the disk inode, and the VFS inode.
put_super(super)
Releases the superblock object whose address is passed as the parameter (because the corresponding filesystem is unmounted).
write_super(super)
Updates a filesystem superblock with the contents of the object indicated.
write_super_lockfs(super)
Blocks changes to the filesystem and updates the superblock with the contents of the object indicated. The method should be implemented by journaling filesystems, and should be invoked by the Logical Volume Manager (LVM) driver. It is currently not in use.
unlockfs(super)
Undoes the block of filesystem updates achieved by the write_super_lockfs( ) superblock method.
statfs(super, buf)
Returns statistics on a filesystem by filling the buf buffer.
remount_fs(super, flags, data)
Remounts the filesystem with new options (invoked when a mount option must be changed).
clear_inode(inode)
Like put_inode, but also releases all pages that contain data concerning the file that corresponds to the indicated inode.
umount_begin(super)
Interrupts a mount operation because the corresponding unmount operation has been started (used only by network filesystems).
fh_to_dentry(super, filehandle, len, filehandletype, parent)
Used by the Network File System (NFS) kernel thread knfsd to return the dentry object corresponding to a given file handle. (A file handle is an identifier of an NFS file.)
dentry_to_fh(dentry, filehandle, lenp, need_parent)
Used by the NFS kernel thread knfsd to derive the file handle corresponding to a given dentry object.
show_options(seq_file, vfsmount)
Used to display the filesystem-specific options.
The preceding methods are available to all possible filesystem types. However, only a subset of them applies to each specific filesystem; the fields corresponding to unimplemented methods are set to NULL. Notice that no read_super method to read a superblock is defined: how could the kernel invoke a method of an object yet to be read from disk? We’ll find the read_super method in another object describing the filesystem type (see the later Section 12.4).
All information needed by the filesystem to handle a file is included in a data structure called an inode. A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. An inode object in memory consists of an inode structure whose fields are described in Table 12-3.
Table 12-3. The fields of the inode object
| Type | Field | Description |
|---|---|---|
| | | Pointers for the hash list |
| | | Pointers for the inode list |
| | | Pointers for the dentry list |
| | | Pointers for the modified buffers list |
| | | Pointers for the modified data buffers list |
| | | inode number |
| | | Usage counter |
| | | Device identifier |
| | | File type and access rights |
| | | Number of hard links |
| | | Owner identifier |
| | | Group identifier |
| | | Real device identifier |
| | | File length in bytes |
| | | Time of last file access |
| | | Time of last file write |
| | | Time of last inode change |
| | | Block size in number of bits |
| | | Block size in bytes |
| | | Number of blocks of the file |
| | | Version number, automatically incremented after each use |
| | | inode semaphore |
| | | Secondary inode semaphore used when removing or renaming the inode |
| | | inode operations |
| | | Default file operations |
| | | Pointer to superblock object |
| | | inode wait queue |
| | | Pointer to file lock list |
| | | Pointer to an |
| | | |
| | | inode disk quotas |
| | | Pointers of a list of block device file inodes (see Chapter 13) |
| | | Used if the file is a pipe (see Chapter 19) |
| | | Pointer to the block device driver |
| | | Pointer to the character device driver |
| | | Bit mask of directory notify events |
| | | Used for directory notifications |
| | | inode state flags |
| | | Filesystem mount flags |
| | | Nonzero if file is a socket |
| | | Usage counter for writing processes |
| | | File creation flags |
| | | inode version number (used by some filesystems) |
| | | Specific filesystem information |
The final u union field is used to include inode information that belongs to a specific filesystem. For instance, as we shall see in Chapter 17, if the inode object refers to an Ext2 file, the field stores an ext2_inode_info structure.
Each inode object duplicates some of the data included in the disk inode, for instance, the number of blocks allocated to the file. When the value of the i_state field is equal to I_DIRTY_SYNC, I_DIRTY_DATASYNC, or I_DIRTY_PAGES, the inode is dirty, that is, the corresponding disk inode must be updated; the I_DIRTY macro can be used to check the value of these three flags at once (see later for details). Other values of the i_state field are I_LOCK (the inode object is involved in an I/O transfer), I_FREEING (the inode object is being freed), and I_CLEAR (the inode object contents are no longer meaningful).
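Checking "these three flags at once" amounts to testing i_state against a combined bit mask. The sketch below assumes that the flags are independent bits; the numeric values are chosen for illustration and should be checked against the kernel headers of the version at hand. toy_inode_is_dirty and demo_inode_state are hypothetical helpers.

```c
#include <assert.h>

/* Assumed flag values for this sketch: one bit per state flag. */
#define I_DIRTY_SYNC      1
#define I_DIRTY_DATASYNC  2
#define I_DIRTY_PAGES     4
#define I_LOCK            8
#define I_FREEING        16
#define I_CLEAR          32

/* Combined mask covering the three "dirty" flags. */
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* Nonzero if any of the three dirty bits is set in i_state. */
static int toy_inode_is_dirty(unsigned long i_state)
{
    return (i_state & I_DIRTY) != 0;
}

int demo_inode_state(void)
{
    unsigned long locked_only = I_LOCK;
    unsigned long dirty_pages_locked = I_DIRTY_PAGES | I_LOCK;

    assert(!toy_inode_is_dirty(locked_only));
    return toy_inode_is_dirty(dirty_pages_locked);
}
```

With a mask, the test still works when other bits such as I_LOCK are set at the same time, which a plain equality comparison would miss.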
Each inode object always appears in one of the following circular doubly linked lists:
The list of valid unused inodes, typically those mirroring valid disk inodes and not currently used by any process. These inodes are not dirty and their i_count field is set to 0. The first and last elements of this list are referenced by the next and prev fields, respectively, of the inode_unused variable. This list acts as a disk cache.
The list of in-use inodes, typically those mirroring valid disk inodes and used by some process. These inodes are not dirty and their i_count field is positive. The first and last elements are referenced by the inode_in_use variable.
The list of dirty inodes. The first and last elements are referenced by the s_dirty field of the corresponding superblock object.
Each of the lists just mentioned links the i_list fields of the proper inode objects.
Inode objects are also included in a hash table named inode_hashtable. The hash table speeds up the search of the inode object when the kernel knows both the inode number and the address of the superblock object corresponding to the filesystem that includes the file.[84] Since hashing may induce collisions, the inode object includes an i_hash field that contains a backward and a forward pointer to other inodes that hash to the same position; this field creates a doubly linked list of those inodes. The hash table also includes a special chain list for the inodes not assigned to a superblock (such as the inodes used by sockets; see Chapter 18); its first and last elements are referenced by the anon_hash_chain variable.
The methods associated with an inode object are also called inode operations. They are described by an inode_operations structure, whose address is included in the i_op field. Here are the inode operations in the order they appear in the inode_operations table:
create(dir, dentry, mode)
Creates a new disk inode for a regular file associated with a dentry object in some directory.
lookup(dir, dentry)
Searches a directory for an inode corresponding to the filename included in a dentry object.
link(old_dentry, dir, new_dentry)
Creates a new hard link that refers to the file specified by old_dentry in the directory dir; the new hard link has the name specified by new_dentry.
unlink(dir, dentry)
Removes the hard link of the file specified by a dentry object from a directory.
symlink(dir, dentry, symname)
Creates a new inode for a symbolic link associated with a dentry object in some directory.
mkdir(dir, dentry, mode)
Creates a new inode for a directory associated with a dentry object in some directory.
rmdir(dir, dentry)
Removes from a directory the subdirectory whose name is included in a dentry object.
mknod(dir, dentry, mode, rdev)
Creates a new disk inode for a special file associated with a dentry object in some directory. The mode and rdev parameters specify, respectively, the file type and the device’s major number.
rename(old_dir, old_dentry, new_dir, new_dentry)
Moves the file identified by old_dentry from the old_dir directory to the new_dir one. The new filename is included in the dentry object that new_dentry points to.
readlink(dentry, buffer, buflen)
Copies into a memory area specified by buffer the file pathname corresponding to the symbolic link specified by the dentry.
follow_link(inode, dir)
Translates a symbolic link specified by an inode object; if the symbolic link is a relative pathname, the lookup operation starts from the specified directory.
truncate(inode)
Modifies the size of the file associated with an inode. Before invoking this method, it is necessary to set the i_size field of the inode object to the required new size.
permission(inode, mask)
Checks whether the specified access mode is allowed for the file associated with inode.
revalidate(dentry)
Updates the cached attributes of a file specified by a dentry object (usually invoked by the network filesystem).
setattr(dentry, iattr)
Notifies a “change event” after touching the inode attributes.
getattr(dentry, iattr)
Used by networking filesystems when noticing that some cached inode attributes must be refreshed.
The methods just listed are available to all possible inodes and filesystem types. However, only a subset of them applies to a specific inode and filesystem; the fields corresponding to unimplemented methods are set to NULL.
A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and consists of a file structure, whose fields are described in Table 12-4. Notice that file objects have no corresponding image on disk, and hence no “dirty” field is included in the file structure to specify that the file object has been modified.
Table 12-4. The fields of the file object
| Type | Field | Description |
|---|---|---|
| | | Pointers for generic file object list |
| | | dentry object associated with the file |
| | | Mounted filesystem containing the file |
| | | Pointer to file operation table |
| | | File object’s usage counter |
| | | Flags specified when opening the file |
| | | Process access mode |
| | | Current file offset (file pointer) |
| | | Read-ahead flag |
| | | Maximum number of pages to be read-ahead |
| | | File pointer after last read-ahead |
| | | Number of read-ahead bytes |
| | | Number of read-ahead pages |
| | | Data for asynchronous I/O via signals |
| | | User’s UID |
| | | User’s GID |
| | | Error code for network write operation |
| | | Version number, automatically incremented after each use |
| | | Needed for tty driver |
| | | Descriptor for direct access buffer (see Section 15.2) |
| | | Lock for direct I/O transfer |
The main information stored in a file object is the file pointer, that is, the current position in the file from which the next operation will take place. Since several processes may access the same file concurrently, the file pointer cannot be kept in the inode object. Each file object is always included in one of the following circular doubly linked lists:
The list of “unused” file objects. This list acts both as a memory cache for the file objects and as a reserve for the superuser; it allows the superuser to open a file even if the dynamic memory in the system is exhausted. Since the objects are unused, their f_count fields are 0. The first element of the list is a dummy and it is stored in the free_list variable. The kernel makes sure that the list always contains at least NR_RESERVED_FILES objects, usually 10.
The list of “in use” file objects not yet assigned to a superblock. The f_count field of each element in this list is set to 1. The first element of the list is a dummy and it is stored in the anon_list variable.
Several lists of “in use” file objects already assigned to superblocks. Each superblock object stores in the s_files field the dummy first element of a list of file objects; thus, file objects of files belonging to different filesystems are included in different lists. The f_count field of each element in such a list is set to 1 plus the number of processes that are using the file object.
Regardless of which list a file object is in at the moment, the pointers of the next and previous elements in the list are stored in the f_list field of the file object. The files_lock spin lock protects the lists against concurrent accesses in multiprocessor systems.
The size of the list of “unused” file objects is stored in the nr_free_files field of the files_stat variable. The get_empty_filp( ) function is invoked when the VFS must allocate a new file object. The function checks whether the “unused” list has more than NR_RESERVED_FILES items, in which case one can be used for the newly opened file. Otherwise, it falls back to normal memory allocation.
The files_stat variable also includes the nr_files field (which stores the number of file objects included in all lists) and the max_files field (which is the maximum number of allocatable file objects, i.e., the maximum number of files that can be accessed at the same time in the system).[85]
As we explained earlier in Section 12.1.1, each filesystem includes its own set of file operations that perform such activities as reading and writing a file. When the kernel loads an inode into memory from disk, it stores a pointer to these file operations in a file_operations structure whose address is contained in the i_fop field of the inode object. When a process opens the file, the VFS initializes the f_op field of the new file object with the address stored in the inode so that further calls to file operations can use these functions. If necessary, the VFS may later modify the set of file operations by storing a new value in f_op.
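The hand-off from i_fop to f_op, and the way a read method advances the file pointer through its offset argument, can be modeled in a few lines. This is a user-space sketch under simplifying assumptions: the structures carry only the fields involved, the "file" is a fixed in-memory string, and toy_read, toy_fops, toy_open, toy_vfs_read, and demo_file_read are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct file;

/* Cut-down method table with a single operation. */
struct file_operations {
    long (*read)(struct file *, char *, size_t, long *);
};

struct inode {
    const struct file_operations *i_fop;  /* default file operations */
};

struct file {
    const struct file_operations *f_op;   /* copied from the inode on open */
    long f_pos;                            /* the file pointer */
};

/* Hypothetical read method backed by a constant string. */
static long toy_read(struct file *filp, char *buf, size_t count, long *offset)
{
    static const char data[] = "hello, vfs";
    size_t len = sizeof(data) - 1;
    size_t start, i;

    (void)filp;
    if (*offset < 0 || (size_t)*offset >= len)
        return 0;                          /* end of file */
    start = (size_t)*offset;
    if (count > len - start)
        count = len - start;
    for (i = 0; i < count; i++)
        buf[i] = data[start + i];
    *offset += (long)count;                /* advance the caller's offset */
    return (long)count;
}

static const struct file_operations toy_fops = { .read = toy_read };

/* On open, the file object inherits the inode's operations. */
static void toy_open(struct inode *inode, struct file *filp)
{
    filp->f_op = inode->i_fop;
    filp->f_pos = 0;
}

/* Dispatch through f_op, passing the file pointer as the offset. */
static long toy_vfs_read(struct file *filp, char *buf, size_t count)
{
    if (!filp->f_op || !filp->f_op->read)
        return -1;
    return filp->f_op->read(filp, buf, count, &filp->f_pos);
}

long demo_file_read(void)
{
    struct inode ino = { .i_fop = &toy_fops };
    struct file f;
    char buf[4];
    long n1, n2, n3;

    toy_open(&ino, &f);
    n1 = toy_vfs_read(&f, buf, sizeof(buf));
    n2 = toy_vfs_read(&f, buf, sizeof(buf));
    n3 = toy_vfs_read(&f, buf, sizeof(buf));
    assert(f.f_pos == 10);                 /* whole 10-byte string consumed */
    return n1 * 100 + n2 * 10 + n3;
}
```

Keeping the position in the file object rather than in the inode is what lets two opens of the same file read from different offsets.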
The following list describes the file operations in the order in which they appear in the file_operations table:
llseek(file, offset, origin)
Updates the file pointer.
read(file, buf, count, offset)
Reads count bytes from a file starting at position *offset; the value *offset (which usually corresponds to the file pointer) is then incremented.
write(file, buf, count, offset)
Writes count bytes into a file starting at position *offset; the value *offset (which usually corresponds to the file pointer) is then incremented.
readdir(dir, dirent, filldir)
Returns the next directory entry of a directory in dirent; the filldir parameter contains the address of an auxiliary function that extracts the fields in a directory entry.
poll(file, poll_table)
Checks whether there is activity on a file and goes to sleep until something happens on it.
ioctl(inode, file, cmd, arg)
Sends a command to an underlying hardware device. This method applies only to device files.
mmap(file, vma)
Performs a memory mapping of the file into a process address space (see Chapter 15).
open(inode, file)
Opens a file by creating a new file object and linking it to the corresponding inode object (see Section 12.6.1 later in this chapter).
flush(file)
Called when a reference to an open file is closed, that is, when the f_count field of the file object is decremented. The actual purpose of this method is filesystem-dependent.
release(inode, file)
Releases the file object. Called when the last reference to an open file is closed, that is, when the f_count field of the file object becomes 0.
fsync(file, dentry)
Writes all cached data of the file to disk.
fasync(fd, file, on)
Enables or disables asynchronous I/O notification by means of signals.
lock(file, cmd, file_lock)
Applies a lock to the file (see Section 12.7 later in this chapter).
readv(file, vector, count, offset)
Reads bytes from a file and puts the results in the buffers described by vector; the number of buffers is specified by count.
writev(file, vector, count, offset)
Writes bytes into a file from the buffers described by vector; the number of buffers is specified by count.
sendpage(file, page, offset, size, pointer, fill)
Transfers data from this file to another file; this method is used by sockets (see Chapter 18).
get_unmapped_area(file, addr, len, offset, flags)
Gets an unused address range to map the file (used for frame buffer memory mappings).
The methods just described are available to all possible file types. However, only a subset of them applies to a specific file type; the fields corresponding to unimplemented methods are set to NULL.
We mentioned in Section 12.1.1 that the VFS considers each directory a file that contains a list of files and other directories. We shall discuss in Chapter 17 how directories are implemented on a specific filesystem. Once a directory entry is read into memory, however, it is transformed by the VFS into a dentry object based on the dentry structure, whose fields are described in Table 12-5. The kernel creates a dentry object for every component of a pathname that a process looks up; the dentry object associates the component to its corresponding inode. For example, when looking up the /tmp/test pathname, the kernel creates a dentry object for the / root directory, a second dentry object for the tmp entry of the root directory, and a third dentry object for the test entry of the /tmp directory.
Notice that dentry objects have no corresponding image on disk, and hence no field is included in the dentry structure to specify that the object has been modified. Dentry objects are stored in a slab allocator cache called dentry_cache; dentry objects are thus created and destroyed by invoking kmem_cache_alloc( ) and kmem_cache_free( ).
Table 12-5. The fields of the dentry object
| Type | Field | Description |
|---|---|---|
| | | Dentry object usage counter |
| | | Dentry flags |
| | | Inode associated with filename |
| | | Dentry object of parent directory |
| | | Pointers for list in hash table entry |
| | | Pointers for unused list |
| | | Pointers for the list of dentry objects included in parent directory |
| | | For directories, list of dentry objects of subdirectories |
| | | List of associated inodes (alias) |
| | | Flag set to 1 if and only if the dentry is the mount point for a filesystem |
| | | Filename |
| | | Used by |
| | | Dentry methods |
| | | Superblock object of the file |
| | | Dentry cache flags |
| | | Filesystem-dependent data |
| | | Space for short filename |
Each dentry object may be in one of four states:
Free
The dentry object contains no valid information and is not used by the VFS. The corresponding memory area is handled by the slab allocator.
Unused
The dentry object is not currently used by the kernel. The d_count usage counter of the object is 0, but the d_inode field still points to the associated inode. The dentry object contains valid information, but its contents may be discarded if necessary in order to reclaim memory.
In use
The dentry object is currently used by the kernel. The d_count usage counter is positive and the d_inode field points to the associated inode object. The dentry object contains valid information and cannot be discarded.
Negative
The inode associated with the dentry does not exist, either because the corresponding disk inode has been deleted or because the dentry object was created by resolving a pathname of a nonexisting file. The d_inode field of the dentry object is set to NULL, but the object still remains in the dentry cache so that further lookup operations to the same file pathname can be quickly resolved. The term “negative” is misleading since no negative value is involved.
Since reading a directory entry from disk and constructing the corresponding dentry object requires considerable time, it makes sense to keep in memory dentry objects that you’ve finished with but might need later. For instance, people often edit a file and then compile it, or edit and print it, or copy it and then edit the copy. In such cases, the same file needs to be repeatedly accessed.
To maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of data structures:
A set of dentry objects in the in-use, unused, or negative state.
A hash table to derive the dentry object associated with a given filename and a given directory quickly. As usual, if the required object is not included in the dentry cache, the hashing function returns a null value.
The dentry cache also acts as a controller for an inode cache. The inodes in kernel memory that are associated with unused dentries are not discarded, since the dentry cache is still using them. Thus, the inode objects are kept in RAM and can be quickly referenced by means of the corresponding dentries.
All the “unused” dentries are included in a doubly linked “Least Recently Used” list sorted by time of insertion. In other words, the dentry object that was last released is put in front of the list, so the least recently used dentry objects are always near the end of the list. When the dentry cache has to shrink, the kernel removes elements from the tail of this list so that the most recently used objects are preserved. The addresses of the first and last elements of the LRU list are stored in the next and prev fields of the dentry_unused variable. The d_lru field of the dentry object contains pointers to the adjacent dentries in the list.
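The LRU discipline can be shown with a short user-space model: released dentries go to the front of the list, and shrinking takes victims from the tail. Everything here is a simplified assumption rather than the kernel's code; the list helpers, toy_dentry, lru_entry, toy_dput, toy_prune_one, and demo_dentry_lru are hypothetical, and the list head is passed explicitly instead of being a global dentry_unused variable.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

/* Insert node right after head, i.e. at the front of the list. */
static void list_add(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

static void list_del(struct list_head *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node->prev = node;
}

struct toy_dentry {
    int d_count;              /* usage counter */
    struct list_head d_lru;   /* links into the unused (LRU) list */
    char name;                /* one-letter label for the demo */
};

#define lru_entry(ptr) \
    ((struct toy_dentry *)((char *)(ptr) - offsetof(struct toy_dentry, d_lru)))

/* Drop one reference; an unused dentry goes to the front of the LRU list. */
static void toy_dput(struct toy_dentry *d, struct list_head *dentry_unused)
{
    assert(d->d_count > 0);
    if (--d->d_count == 0)
        list_add(&d->d_lru, dentry_unused);
}

/* Remove and return the least recently used dentry, or NULL if none. */
static struct toy_dentry *toy_prune_one(struct list_head *dentry_unused)
{
    struct list_head *tail = dentry_unused->prev;

    if (tail == dentry_unused)
        return NULL;
    list_del(tail);
    return lru_entry(tail);
}

char demo_dentry_lru(void)
{
    struct list_head dentry_unused;
    struct toy_dentry a = { .d_count = 1, .name = 'a' };
    struct toy_dentry b = { .d_count = 1, .name = 'b' };
    struct toy_dentry c = { .d_count = 1, .name = 'c' };
    struct toy_dentry *victim;

    list_init(&dentry_unused);
    toy_dput(&a, &dentry_unused);   /* released first */
    toy_dput(&b, &dentry_unused);
    toy_dput(&c, &dentry_unused);   /* released last, now at the front */

    victim = toy_prune_one(&dentry_unused);
    return victim ? victim->name : '\0';
}
```

The dentry released first is the one reclaimed first, so recently released entries survive a shrink.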
Each “in use” dentry object is inserted into a doubly linked list specified by the i_dentry field of the corresponding inode object (since each inode could be associated with several hard links, a list is required). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list. Both fields are of type struct list_head.
An “in use” dentry object may become “negative” when the last hard link to the corresponding file is deleted. In this case, the dentry object is moved into the LRU list of unused dentries. Each time the kernel shrinks the dentry cache, negative dentries move toward the tail of the LRU list so that they are gradually freed (see Section 16.7.6).
The hash table is implemented by means of a dentry_hashtable array. Each element is a pointer to a list of dentries that hash to the same hash table value. The array’s size depends on the amount of RAM installed in the system. The d_hash field of the dentry object contains pointers to the adjacent elements in the list associated with a single hash value. The hash function produces its value from both the address of the dentry object of the directory and the filename.
The dcache_lock spin lock protects the dentry cache data structures against concurrent accesses in multiprocessor systems. The d_lookup( ) function looks in the hash table for a given parent dentry object and filename.
The methods associated with a dentry object are called dentry operations; they are described by the dentry_operations structure, whose address is stored in the d_op field. Although some filesystems define their own dentry methods, the fields are usually NULL and the VFS replaces them with default functions. Here are the methods, in the order they appear in the dentry_operations table:
d_revalidate(dentry, flag)
Determines whether the dentry object is still valid before using it for translating a file pathname. The default VFS function does nothing, although network filesystems may specify their own functions.
d_hash(dentry, name)
Creates a hash value; this function is a filesystem-specific hash function for the dentry hash table. The dentry parameter identifies the directory containing the component. The name parameter points to a structure containing both the pathname component to be looked up and the value produced by the hash function.
d_compare(dir, name1, name2)
Compares two filenames; name1 should belong to the directory referenced by dir. The default VFS function is a normal string match. However, each filesystem can implement this method in its own way. For instance, MS-DOS does not distinguish capital from lowercase letters.
d_delete(dentry)
Called when the last reference to a dentry object is deleted (d_count becomes 0). The default VFS function does nothing.
d_release(dentry)
Called when a dentry object is going to be freed (released to the slab allocator). The default VFS function does nothing.
d_iput(dentry, ino)
Called when a dentry object becomes “negative”, that is, it loses its inode. The default VFS function invokes iput( ) to release the inode object.
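The fallback from a NULL method to a default function, and a filesystem-specific d_compare that ignores case, can be sketched as follows. This is a simplified user-space model: toy_qstr, toy_dentry_operations, and the toy_ functions are hypothetical, the directory argument is omitted, and the convention that a comparison returns 0 on a match is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A name with an explicit length (hypothetical stand-in). */
struct toy_qstr {
    const char *name;
    size_t len;
};

/* Default comparison: exact byte-for-byte match. Returns 0 on a match. */
static int toy_default_compare(const struct toy_qstr *a,
                               const struct toy_qstr *b)
{
    if (a->len != b->len)
        return 1;
    return memcmp(a->name, b->name, a->len) != 0;
}

/* Case-insensitive comparison, as an MS-DOS-like filesystem might want. */
static int toy_nocase_compare(const struct toy_qstr *a,
                              const struct toy_qstr *b)
{
    size_t i;

    if (a->len != b->len)
        return 1;
    for (i = 0; i < a->len; i++)
        if (tolower((unsigned char)a->name[i]) !=
            tolower((unsigned char)b->name[i]))
            return 1;
    return 0;
}

struct toy_dentry_operations {
    int (*d_compare)(const struct toy_qstr *, const struct toy_qstr *);
};

/* Use the filesystem's method if present, else the default. */
static int toy_names_match(const struct toy_dentry_operations *d_op,
                           const struct toy_qstr *a,
                           const struct toy_qstr *b)
{
    if (d_op && d_op->d_compare)
        return d_op->d_compare(a, b) == 0;
    return toy_default_compare(a, b) == 0;
}

int demo_d_compare(void)
{
    struct toy_qstr upper = { "README", 6 };
    struct toy_qstr lower = { "readme", 6 };
    struct toy_dentry_operations msdos_like = {
        .d_compare = toy_nocase_compare
    };
    int plain, nocase;

    assert(toy_names_match(NULL, &upper, &upper));  /* identical names */
    plain = toy_names_match(NULL, &upper, &lower);         /* default */
    nocase = toy_names_match(&msdos_like, &upper, &lower); /* custom */
    return plain * 10 + nocase;
}
```

The same pair of names is a mismatch under the default comparison and a match under the case-insensitive one, which is why the comparison has to be a per-filesystem method.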
We mentioned in Section 1.5 that each process has its own current working directory and its own root directory. These are just two examples of data that must be maintained by the kernel to represent the interactions between a process and a filesystem. A whole data structure of type fs_struct is used for that purpose (see Table 12-6) and each process descriptor has an fs field that points to the process fs_struct structure.
Table 12-6. The fields of the fs_struct structure
| Type | Field | Description |
|---|---|---|
| | | Number of processes sharing this table |
| | | Read/write spin lock for the table fields |
| | | Bit mask used when opening the file to set the file permissions |
| | | Dentry of the root directory |
| | | Dentry of the current working directory |
| | | Dentry of the emulated root directory (always |
| | | Mounted filesystem object of the root directory |
| | | Mounted filesystem object of the current working directory |
| | | Mounted filesystem object of the emulated root directory (always |
A second table, whose address is contained in the files field of the process descriptor, specifies which files are currently opened by the process. It is a files_struct structure whose fields are illustrated in Table 12-7.
Table 12-7. The fields of the files_struct structure
| Type | Field | Description |
|---|---|---|
| | | Number of processes sharing this table |
| | | Read/write spin lock for the table fields |
| | | Current maximum number of file objects |
| | | Current maximum number of file descriptors |
| | | Maximum file descriptors ever allocated plus 1 |
| | | Pointer to array of file object pointers |
| | | Pointer to file descriptors to be closed on |
| | | Pointer to open file descriptors |
| | | Initial set of file descriptors to be closed on |
| | | Initial set of file descriptors |
| | | Initial array of file object pointers |
The fd field points to an array of pointers to file objects. The size of the array is stored in the max_fds field. Usually, fd points to the fd_array field of the files_struct structure, which includes 32 file object pointers. If the process opens more than 32 files, the kernel allocates a new, larger array of file pointers and stores its address in the fd field; it also updates the max_fds field.
For every file with an entry in the fd array, the array index is the file descriptor. Usually, the first element (index 0) of the array is associated with the standard input of the process, the second with the standard output, and the third with the standard error (see Figure 12-3). Unix processes use the file descriptor as the main file identifier. Notice that, thanks to the dup( ), dup2( ), and fcntl( ) system calls, two file descriptors may refer to the same opened file, that is, two elements of the array could point to the same file object. Users see this all the time when they use shell constructs like 2>&1 to redirect the standard error to the standard output.
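The relationship between descriptors and file objects can be illustrated with a toy descriptor table. This is a user-space sketch, not the kernel's files_struct: it never grows the array, it uses f_count simply as the number of table slots that point to the object, and toy_files_init, toy_fd_install, toy_dup, and demo_dup are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_OPEN_DEFAULT 32   /* size of the embedded array */

struct toy_file {
    int f_count;   /* in this sketch: number of descriptors using the object */
    long f_pos;    /* file pointer, shared by all those descriptors */
};

struct toy_files_struct {
    int max_fds;                                    /* size of the fd array */
    struct toy_file **fd;                           /* current array */
    struct toy_file *fd_array[TOY_NR_OPEN_DEFAULT]; /* initial array */
};

static void toy_files_init(struct toy_files_struct *files)
{
    int i;

    files->max_fds = TOY_NR_OPEN_DEFAULT;
    files->fd = files->fd_array;   /* fd initially points to fd_array */
    for (i = 0; i < files->max_fds; i++)
        files->fd[i] = NULL;
}

/* Store filp in the lowest free slot; the slot index is the descriptor. */
static int toy_fd_install(struct toy_files_struct *files,
                          struct toy_file *filp)
{
    int i;

    for (i = 0; i < files->max_fds; i++) {
        if (files->fd[i] == NULL) {
            files->fd[i] = filp;
            filp->f_count++;
            return i;
        }
    }
    return -1;   /* table full; a real kernel would expand the array */
}

/* Make a second descriptor refer to the same file object. */
static int toy_dup(struct toy_files_struct *files, int oldfd)
{
    if (oldfd < 0 || oldfd >= files->max_fds || files->fd[oldfd] == NULL)
        return -1;
    return toy_fd_install(files, files->fd[oldfd]);
}

int demo_dup(void)
{
    struct toy_files_struct files;
    struct toy_file tty = { 0, 0 };
    struct toy_file log = { 0, 0 };
    int fd_in, fd_out, fd_err;

    toy_files_init(&files);
    fd_in = toy_fd_install(&files, &tty);
    fd_out = toy_fd_install(&files, &log);
    fd_err = toy_dup(&files, fd_out);      /* like 2>&1 */
    assert(fd_in == 0 && fd_out == 1 && fd_err == 2);

    files.fd[fd_out]->f_pos = 42;          /* move the pointer via fd 1 */
    return files.fd[fd_err]->f_pos == 42 && log.f_count == 2;
}
```

Since both descriptors point to one object, moving the file pointer through one of them is visible through the other.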
A process cannot use more than NR_OPEN (usually, 1,048,576) file descriptors. The kernel also enforces a dynamic bound on the maximum number of file descriptors in the rlim[RLIMIT_NOFILE] structure of the process descriptor; this value is usually 1,024, but it can be raised if the process has root privileges.
The open_fds field initially contains the address of the open_fds_init field, which is a bitmap that identifies the file descriptors of currently opened files. The max_fdset field stores the number of bits in the bitmap. Since the fd_set data structure includes 1,024 bits, there is usually no need to expand the size of the bitmap. However, the kernel may dynamically expand the size of the bitmap if this turns out to be necessary, much as in the case of the array of file objects.
The kernel provides an fget( ) function to be invoked when the kernel starts using a file object. This function receives as its parameter a file descriptor fd. It returns the address in current->files->fd[fd] (that is, the address of the corresponding file object), or NULL if no file corresponds to fd. In the first case, fget( ) increments the file object usage counter f_count by 1.
The kernel also provides an fput( ) function to be invoked when a kernel control path finishes using a file object. This function receives as its parameter the address of a file object and decrements its usage counter, f_count. Moreover, if this field becomes 0, the function invokes the release method of the file operations (if defined), releases the associated dentry object and filesystem descriptor, decrements the i_writecount field in the inode object (if the file was opened for writing), and finally moves the file object from the “in use” list to the “unused” one.
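The pairing of fget( ) and fput( ) is a plain reference-counting protocol, which the following user-space sketch imitates. It is not the kernel implementation: the descriptor table is passed explicitly instead of being reached through current->files, no locking is shown, the released flag merely stands in for the cleanup work done when the counter drops to 0, and all toy_ and demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file {
    int f_count;    /* usage counter */
    int released;   /* set when the last reference is dropped */
};

/* Look up descriptor fd; take a reference if a file object is there. */
static struct toy_file *toy_fget(struct toy_file **fd_table, int max_fds,
                                 int fd)
{
    struct toy_file *filp;

    if (fd < 0 || fd >= max_fds)
        return NULL;
    filp = fd_table[fd];
    if (filp)
        filp->f_count++;
    return filp;
}

/* Drop a reference; the last one triggers the (simulated) release work. */
static void toy_fput(struct toy_file *filp)
{
    if (--filp->f_count == 0)
        filp->released = 1;   /* stands in for release method and cleanup */
}

int demo_fget_fput(void)
{
    /* One reference is held on behalf of the descriptor table entry. */
    struct toy_file f = { .f_count = 1, .released = 0 };
    struct toy_file *fd[4] = { &f, NULL, NULL, NULL };
    struct toy_file *filp;

    assert(toy_fget(fd, 4, 1) == NULL);   /* no file behind descriptor 1 */

    filp = toy_fget(fd, 4, 0);            /* f_count: 1 -> 2 */
    assert(filp == &f && f.f_count == 2);

    toy_fput(filp);                       /* f_count: 2 -> 1, still open */
    assert(!f.released);

    fd[0] = NULL;                         /* the descriptor is closed */
    toy_fput(&f);                         /* last reference: 1 -> 0 */
    return f.released;
}
```

The object outlives the temporary user and is released only when the reference held for the descriptor itself is dropped.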
[84] Actually, a Unix process may open a file and then unlink it. The i_nlink field of the inode could become 0, yet the process is still able to act on the file. In this particular case, the inode is removed from the hash table, even if it still belongs to the in-use or dirty list.
[85] By default, max_files stores the value 8,192, but the system administrator can tune this parameter by writing into the /proc/sys/fs/file-max file.