Each filesystem has its own root directory. The filesystem whose root directory is the root of the system’s directory tree is called the root filesystem. Other filesystems can be mounted on the system’s directory tree; the directories on which they are inserted are called mount points. A mounted filesystem is the child of the mounted filesystem to which the mount point directory belongs. For instance, the /proc virtual filesystem is a child of the root filesystem (and the root filesystem is the parent of /proc).
In most traditional Unix-like kernels, each filesystem can be mounted only once. Suppose that an Ext2 filesystem stored in the /dev/fd0 floppy disk is mounted on /flp by issuing the command:
mount -t ext2 /dev/fd0 /flp
Until the filesystem is unmounted by issuing a umount command, any other mount command acting on /dev/fd0 fails.
However, Linux 2.4 is different: it is possible to mount the same filesystem several times. For instance, issuing the following command right after the previous one will likely succeed in Linux:
mount -t ext2 -o ro /dev/fd0 /flp-ro
As a result, the Ext2 filesystem stored in the floppy disk is mounted both on /flp and on /flp-ro; therefore, its files can be accessed through both /flp and /flp-ro (in this example, accesses through /flp-ro are read-only).
Of course, if a filesystem is mounted n times, its root directory can be accessed through n mount points, one per mount operation. Although the same filesystem can be accessed by several paths, it is really unique. Thus, there is just one superblock object for all of them, no matter how many times the filesystem has been mounted.
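The relationship between mount operations and superblocks can be pictured with a small user-space model. This is only a sketch under stated assumptions: the names toy_sb, toy_mount, toy_get_sb, and toy_do_mount are hypothetical and are not kernel functions; the model merely shows that a second mount of the same device reuses the existing superblock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

/* Hypothetical stand-in for a superblock object: one per device. */
struct toy_sb {
    char dev[32];
    int  mount_count;          /* how many mounts refer to this superblock */
};

/* Hypothetical stand-in for a mounted filesystem descriptor. */
struct toy_mount {
    char           mountpoint[32];
    struct toy_sb *sb;
};

static struct toy_sb    toy_sbs[TOY_MAX];
static struct toy_mount toy_mounts[TOY_MAX];
static int toy_nr_sbs, toy_nr_mounts;

/* Return the superblock for dev, creating it on first use. */
static struct toy_sb *toy_get_sb(const char *dev)
{
    for (int i = 0; i < toy_nr_sbs; i++)
        if (strcmp(toy_sbs[i].dev, dev) == 0)
            return &toy_sbs[i];
    if (toy_nr_sbs == TOY_MAX)
        return NULL;
    struct toy_sb *sb = &toy_sbs[toy_nr_sbs++];
    strncpy(sb->dev, dev, sizeof(sb->dev) - 1);
    sb->dev[sizeof(sb->dev) - 1] = '\0';
    sb->mount_count = 0;
    return sb;
}

/* Record a new mount; every mount of the same device shares one superblock. */
static struct toy_mount *toy_do_mount(const char *dev, const char *dir)
{
    assert(dev != NULL && dir != NULL);
    if (toy_nr_mounts == TOY_MAX)
        return NULL;
    struct toy_sb *sb = toy_get_sb(dev);
    if (sb == NULL)
        return NULL;
    struct toy_mount *m = &toy_mounts[toy_nr_mounts++];
    strncpy(m->mountpoint, dir, sizeof(m->mountpoint) - 1);
    m->mountpoint[sizeof(m->mountpoint) - 1] = '\0';
    m->sb = sb;
    sb->mount_count++;
    return m;
}
```

Calling toy_do_mount( ) twice with /dev/fd0, once for /flp and once for /flp-ro, yields two distinct mount records whose sb fields hold the same address.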
Mounted filesystems form a hierarchy: the mount point of a filesystem might be a directory of a second filesystem, which in turn is already mounted over a third filesystem, and so on.[86]
It is also possible to stack multiple mounts on a single mount point. Each new mount on the same mount point hides the previously mounted filesystem, although processes already using the files and directories under the old mount can continue to do so. When the topmost mounting is removed, then the next lower mount is once more made visible.
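The hiding behavior of stacked mounts resembles a stack of names attached to one directory. The following is a minimal sketch, assuming a fixed-size stack; toy_mountpoint and the toy_stack_* helpers are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_MAX 4

/* Hypothetical model of one mount point with mounts stacked on it. */
struct toy_mountpoint {
    const char *stack[STACK_MAX];  /* mounted filesystems, oldest first */
    int depth;
};

/* Mount fs on top; returns 0 on success, -1 if the toy stack is full. */
static int toy_stack_mount(struct toy_mountpoint *mp, const char *fs)
{
    assert(mp != NULL && fs != NULL);
    if (mp->depth == STACK_MAX)
        return -1;
    mp->stack[mp->depth++] = fs;
    return 0;
}

/* Remove the topmost mount; returns 0 on success, -1 if nothing is mounted. */
static int toy_stack_umount(struct toy_mountpoint *mp)
{
    assert(mp != NULL);
    if (mp->depth == 0)
        return -1;
    mp->stack[--mp->depth] = NULL;
    return 0;
}

/* The filesystem that a new pathname lookup would see, or NULL if none. */
static const char *toy_visible(const struct toy_mountpoint *mp)
{
    return mp->depth ? mp->stack[mp->depth - 1] : NULL;
}
```

Mounting a second filesystem makes toy_visible( ) return the newer one; removing it makes the older one visible again.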
As you can imagine, keeping track of mounted filesystems can quickly become a nightmare. For each mount operation, the kernel must save in memory the mount point and the mount flags, as well as the relationships between the filesystem to be mounted and the other mounted filesystems. Such information is stored in data structures named mounted filesystem descriptors; each descriptor is a data structure that has type vfsmount, whose fields are shown in Table 12-11.
Table 12-11. The fields of the vfsmount data structure
Type | Field | Description |
---|---|---|
 | mnt_hash | Pointers for the hash table list |
 | mnt_parent | Points to the parent filesystem on which this filesystem is mounted |
 |  | Points to the |
 |  | Points to the |
 | mnt_sb | Points to the superblock object of this filesystem |
 | mnt_mounts | Head of the parent list of descriptors (relative to this filesystem) |
 | mnt_child | Pointers for the parent list of descriptors (relative to the parent filesystem) |
 |  | Usage counter |
 | mnt_flags | Flags |
 | mnt_devname | Device file name |
 | mnt_list | Pointers for global list of descriptors |
The vfsmount data structures are kept in several doubly linked circular lists:
A circular doubly linked “global” list including the descriptors of all mounted filesystems. The head of the list is a first dummy element, which is represented by the vfsmntlist variable. The mnt_list field of the descriptor contains the pointers to adjacent elements in the list.
A hash table indexed by the address of the vfsmount descriptor of the parent filesystem and the address of the dentry object of the mount point directory. The hash table is stored in the mount_hashtable array, whose size depends on the amount of RAM in the system. Each item of the table is the head of a circular doubly linked list storing all descriptors that have the same hash value. The mnt_hash field of the descriptor contains the pointers to adjacent elements in this list.
For each mounted filesystem, a circular doubly linked list including all child mounted filesystems. The head of each list is stored in the mnt_mounts field of the mounted filesystem descriptor; moreover, the mnt_child field of the descriptor stores the pointers to the adjacent elements in the list.
The mount_sem semaphore protects the lists of mounted filesystem objects from concurrent accesses.
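The per-filesystem list of children can be illustrated with a stripped-down circular doubly linked list. This is a sketch, not the kernel code: toy_list, toy_vfsmount, toy_attach, and the other toy_* names are hypothetical, and only the mnt_parent, mnt_mounts, and mnt_child fields mentioned in the text are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node (hypothetical helper type). */
struct toy_list { struct toy_list *next, *prev; };

static void toy_list_init(struct toy_list *h) { h->next = h->prev = h; }

/* Insert item right after head. */
static void toy_list_add(struct toy_list *item, struct toy_list *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

struct toy_vfsmount {
    struct toy_vfsmount *mnt_parent;
    struct toy_list mnt_mounts;   /* head of this mount's list of children */
    struct toy_list mnt_child;    /* links this mount into its parent's list */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A fresh descriptor has no children and is its own parent. */
static void toy_mnt_init(struct toy_vfsmount *m)
{
    m->mnt_parent = m;
    toy_list_init(&m->mnt_mounts);
    toy_list_init(&m->mnt_child);
}

/* Make child a child of parent. */
static void toy_attach(struct toy_vfsmount *child, struct toy_vfsmount *parent)
{
    assert(child != parent);
    child->mnt_parent = parent;
    toy_list_add(&child->mnt_child, &parent->mnt_mounts);
}

/* Walk the circular list until we are back at the head. */
static int toy_count_children(const struct toy_vfsmount *m)
{
    int n = 0;
    for (const struct toy_list *p = m->mnt_mounts.next;
         p != &m->mnt_mounts; p = p->next)
        n++;
    return n;
}
```

Because the list is circular, the traversal needs no NULL check: it stops when it returns to the head embedded in the parent descriptor.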
The mnt_flags field of the descriptor stores the value of several flags that specify how some kinds of files in the mounted filesystem are handled. The flags are listed in Table 12-12.
Table 12-12. Mounted filesystem flags
Name | Description |
---|---|
MNT_NOSUID | Forbid |
MNT_NODEV | Forbid access to device files in the mounted filesystem |
MNT_NOEXEC | Disallow program execution in the mounted filesystem |
The following functions handle the mounted filesystem descriptors:
alloc_vfsmnt( )
Allocates and initializes a mounted filesystem descriptor
free_vfsmnt(mnt)
Frees the mounted filesystem descriptor pointed to by mnt
lookup_mnt(parent, mountpoint)
Looks up a descriptor in the hash table and returns its address
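A lookup keyed by the pair (parent descriptor, mount point dentry) might be sketched as follows. The code is illustrative only: the hash function, the fixed table size, and the singly linked chains are simplifying assumptions, and every toy_* name is hypothetical rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 16   /* power of two; the real table is sized from RAM */

struct toy_dentry { const char *name; };

struct toy_vfsmount {
    struct toy_vfsmount *mnt_parent;
    struct toy_dentry   *mnt_mountpoint;
    struct toy_vfsmount *hash_next;   /* simplified: singly linked chain */
};

static struct toy_vfsmount *toy_hashtable[TOY_HASH_SIZE];

/* Hypothetical hash mixing the two addresses; not the kernel's function. */
static unsigned toy_hash(const struct toy_vfsmount *parent,
                         const struct toy_dentry *mountpoint)
{
    uintptr_t h = (uintptr_t)parent / 16 + (uintptr_t)mountpoint / 16;
    return (unsigned)((h ^ (h >> 4)) & (TOY_HASH_SIZE - 1));
}

/* Add a descriptor to the chain selected by its (parent, mountpoint) pair. */
static void toy_hash_insert(struct toy_vfsmount *mnt)
{
    assert(mnt != NULL);
    unsigned i = toy_hash(mnt->mnt_parent, mnt->mnt_mountpoint);
    mnt->hash_next = toy_hashtable[i];
    toy_hashtable[i] = mnt;
}

/* Return the descriptor mounted on mountpoint within parent, or NULL. */
static struct toy_vfsmount *toy_lookup_mnt(const struct toy_vfsmount *parent,
                                           const struct toy_dentry *mountpoint)
{
    unsigned i = toy_hash(parent, mountpoint);
    for (struct toy_vfsmount *p = toy_hashtable[i]; p != NULL; p = p->hash_next)
        if (p->mnt_parent == parent && p->mnt_mountpoint == mountpoint)
            return p;
    return NULL;
}
```

Both keys must match: two descriptors can share a bucket, so the chain walk compares the parent and the mount point explicitly.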
Mounting the root filesystem is a crucial part of system initialization. It is a fairly complex procedure because the Linux kernel allows the root filesystem to be stored in many different places, such as a hard disk partition, a floppy disk, a remote filesystem shared via NFS, or even a fictitious block device kept in RAM.
To keep the description simple, let’s assume that the root filesystem is stored in a partition of a hard disk (the most common case, after all). While the system boots, the kernel finds the major number of the disk that contains the root filesystem in the ROOT_DEV variable. The root filesystem can be specified as a device file in the /dev directory either when compiling the kernel or by passing a suitable “root” option to the initial bootstrap loader. Similarly, the mount flags of the root filesystem are stored in the root_mountflags variable. The user specifies these flags either by using the rdev external program on a compiled kernel image or by passing a suitable rootflags option to the initial bootstrap loader (see Appendix A).
Mounting the root filesystem is a two-stage procedure, shown in the following list.
The kernel mounts the special rootfs filesystem, which just provides an empty directory that serves as initial mount point.
The kernel mounts the real root filesystem over the empty directory.
Why does the kernel bother to mount the rootfs filesystem before the real one? Well, the rootfs filesystem allows the kernel to easily change the real root filesystem. In fact, in some cases, the kernel mounts and unmounts several root filesystems, one after the other. For instance, the initial bootstrap floppy disk of a distribution might load in RAM a kernel with a minimal set of drivers, which mounts as root a minimal filesystem stored in a RAM disk. Next, the programs in this initial root filesystem probe the hardware of the system (for instance, they determine whether the hard disk is EIDE, SCSI, or whatever), load all needed kernel modules, and remount the root filesystem from a physical block device.
The first stage is performed by the init_mount_tree( ) function, which is executed during system initialization:
struct file_system_type root_fs_type;
root_fs_type.name = "rootfs";
root_fs_type.read_super = rootfs_read_super;
root_fs_type.fs_flags = FS_NOMOUNT;
register_filesystem(&root_fs_type);
root_vfsmnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
The root_fs_type variable stores the descriptor object of the rootfs special filesystem; its fields are initialized, and then it is passed to the register_filesystem( ) function (see the earlier Section 12.3.2). The do_kern_mount( ) function mounts the special filesystem and returns the address of a new mounted filesystem object; this address is saved by init_mount_tree( ) in the root_vfsmnt variable. From now on, root_vfsmnt represents the root of the tree of the mounted filesystems.
The do_kern_mount( ) function receives the following parameters:
type
The type of filesystem to be mounted
flags
The mount flags (see Table 12-13 in the later Section 12.4.2)
name
The device file name of the block device storing the filesystem (or the filesystem type name for special filesystems)
data
Pointers to additional data to be passed to the read_super method of the filesystem
The function takes care of the actual mount operation by performing the following operations:
1. Checks whether the current process has the privileges for the mount operation (the check always succeeds when the function is invoked by init_mount_tree( ) because the system initialization is carried out by a process owned by root).
2. Invokes get_fs_type( ) to search the list of filesystem types and locate the name stored in the type parameter; get_fs_type( ) returns the address of the corresponding file_system_type descriptor.
3. Invokes alloc_vfsmnt( ) to allocate a new mounted filesystem descriptor and stores its address in the mnt local variable.
4. Initializes the mnt->mnt_devname field with the content of the name parameter.
5. Allocates a new superblock and initializes it. do_kern_mount( ) checks the flags in the file_system_type descriptor to determine how to do this:
   a. If FS_REQUIRES_DEV is on, invokes get_sb_bdev( ) (see the later Section 12.4.2)
   b. If FS_SINGLE is on, invokes get_sb_single( ) (see the later Section 12.4.2)
   c. Otherwise, invokes get_sb_nodev( )
6. If the FS_NOMOUNT flag in the file_system_type descriptor is on, sets the MS_NOUSER flag in the superblock object.
7. Initializes the mnt->mnt_sb field with the address of the new superblock object.
8. Initializes the mnt->mnt_root and mnt->mnt_mountpoint fields with the address of the dentry object corresponding to the root directory of the filesystem.
9. Initializes the mnt->mnt_parent field with the value in mnt (the newly mounted filesystem has no parent).
10. Releases the s_umount semaphore of the superblock object (it was acquired when the object was allocated in Step 5).
11. Returns the address mnt of the mounted filesystem object.
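The type lookup and the choice among the three superblock allocators can be sketched in user space as follows. This is an illustration, not the kernel source: the toy_* names are hypothetical, the flag bit values are placeholders rather than the values in the kernel headers, and the function only reports which allocator would be chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values; the kernel headers define the real ones. */
#define TOY_FS_REQUIRES_DEV 0x1
#define TOY_FS_SINGLE       0x2
#define TOY_FS_NOMOUNT      0x4

/* Which allocator do_kern_mount( ) would call, per the steps above. */
enum toy_sb_getter { TOY_GET_SB_BDEV, TOY_GET_SB_SINGLE, TOY_GET_SB_NODEV };

struct toy_fs_type {
    const char *name;
    int fs_flags;
    struct toy_fs_type *next;     /* singly linked list of registered types */
};

static struct toy_fs_type *toy_file_systems;

/* Add a filesystem type to the head of the list of registered types. */
static void toy_register_filesystem(struct toy_fs_type *fs)
{
    assert(fs != NULL);
    fs->next = toy_file_systems;
    toy_file_systems = fs;
}

/* Find a registered type by name, or return NULL. */
static struct toy_fs_type *toy_get_fs_type(const char *name)
{
    assert(name != NULL);
    for (struct toy_fs_type *p = toy_file_systems; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Mirror the flag test order: device-backed, then single, else nodev. */
static enum toy_sb_getter toy_pick_getter(const struct toy_fs_type *fs)
{
    if (fs->fs_flags & TOY_FS_REQUIRES_DEV)
        return TOY_GET_SB_BDEV;
    if (fs->fs_flags & TOY_FS_SINGLE)
        return TOY_GET_SB_SINGLE;
    return TOY_GET_SB_NODEV;
}
```

A type registered with only the no-mount flag, as rootfs is in the code shown earlier, falls through both tests and ends up with the nodev allocator.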
When the do_kern_mount( ) function is invoked by init_mount_tree( ) to mount the rootfs special filesystem, neither the FS_REQUIRES_DEV flag nor the FS_SINGLE flag is set, so the function uses get_sb_nodev( ) to allocate the superblock object. This function executes the following steps:
1. Invokes get_unnamed_dev( ) to allocate a new fictitious block device identifier (see the earlier Section 12.3.1).
2. Invokes the read_super( ) function, passing to it the filesystem type object, the mount flags, and the fictitious block device identifier. In turn, this function performs the following actions:
   a. Allocates a new superblock object and puts its address in the local variable s.
   b. Initializes the s->s_dev field with the block device identifier.
   c. Initializes the s->s_flags field with the mount flags (see Table 12-13).
   d. Acquires the sb_lock spin lock.
   e. Initializes the s->s_type field with the filesystem type descriptor of the filesystem.
   f. Inserts the superblock in the global circular list whose head is super_blocks.
   g. Inserts the superblock in the filesystem type list whose head is s->s_type->fs_supers.
   h. Releases the sb_lock spin lock.
   i. Acquires for writing the s->s_umount read/write semaphore.
   j. Acquires the s->s_lock semaphore.
   k. Invokes the read_super method of the filesystem type.
   l. Sets the MS_ACTIVE flag in s->s_flags.
   m. Releases the s->s_lock semaphore.
   n. Returns the address s of the superblock.
3. If the filesystem type is implemented by a kernel module, increments its usage counter.
4. Returns the address of the new superblock.
The second stage of the mount operation for the root filesystem is performed by the mount_root( ) function near the end of the system initialization. For the sake of brevity, we consider the case of a disk-based filesystem whose device files are handled in the traditional way (we briefly discuss in Chapter 13 how the devfs virtual filesystem offers an alternative way to handle device files). In this case, the function performs the following operations:
1. Allocates a buffer and fills it with a list of filesystem type names. This list is either passed to the kernel in the rootfstype boot parameter or is built by scanning the elements in the singly linked list of filesystem types.
2. Invokes the bdget( ) and blkdev_get( ) functions to check whether the ROOT_DEV root device exists and is properly working.
3. Invokes get_super( ) to search for a superblock object associated with the ROOT_DEV device in the super_blocks list. Usually none is found because the root filesystem is still to be mounted. The check is made, however, because it is possible to remount a previously mounted filesystem. Usually the root filesystem is mounted twice during the system boot: the first time as a read-only filesystem so that its integrity can be safely checked; the second time for reading and writing so that normal operations can start. We’ll suppose that no superblock object associated with the ROOT_DEV device is found in the super_blocks list.
4. Scans the list of filesystem type names built in Step 1. For each name, invokes get_fs_type( ) to get the corresponding file_system_type object, and invokes read_super( ) to attempt to read the corresponding superblock from disk. As described earlier, this function allocates a new superblock object and attempts to fill it by using the method to which the read_super field of the file_system_type object points. Since each filesystem-specific method uses unique magic numbers, all read_super( ) invocations will fail except the one that attempts to fill the superblock by using the method of the filesystem really used on the root device. The read_super( ) method also creates an inode object and a dentry object for the root directory; the dentry object maps to the inode object.
5. Allocates a new mounted filesystem object and initializes its fields with the ROOT_DEV block device name, the address of the superblock object, and the address of the dentry object of the root directory.
6. Invokes the graft_tree( ) function, which inserts the new mounted filesystem object in the children list of root_vfsmnt, in the global list of mounted filesystem objects, and in the mount_hashtable hash table.
7. Sets the root and pwd fields of the fs_struct table of current (the init process) to the dentry object of the root directory.
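The probing loop over filesystem types can be sketched as follows. This is a simplified user-space model: the disk is reduced to a single 16-bit magic number, all toy_* names are hypothetical, and the two probe functions stand in for filesystem-specific read_super methods. The magic values used are those commonly documented for Ext2 (0xEF53) and Minix (0x137F).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy superblock: just the recognized magic and the owning type name. */
struct toy_superblock {
    uint16_t    magic;
    const char *fstype;
};

/* Probe signature: returns 0 and fills sb on success, -1 on mismatch. */
typedef int (*toy_read_super_t)(uint16_t disk_magic, struct toy_superblock *sb);

struct toy_fs_type {
    const char      *name;
    toy_read_super_t read_super;
};

static int toy_ext2_read_super(uint16_t disk_magic, struct toy_superblock *sb)
{
    if (disk_magic != 0xEF53)
        return -1;                 /* not an Ext2 filesystem */
    sb->magic = disk_magic;
    sb->fstype = "ext2";
    return 0;
}

static int toy_minix_read_super(uint16_t disk_magic, struct toy_superblock *sb)
{
    if (disk_magic != 0x137F)
        return -1;                 /* not a Minix filesystem */
    sb->magic = disk_magic;
    sb->fstype = "minix";
    return 0;
}

/* Try each type in order; return the first whose probe accepts the disk. */
static const struct toy_fs_type *
toy_probe_root(const struct toy_fs_type *types, size_t n,
               uint16_t disk_magic, struct toy_superblock *sb)
{
    assert(types != NULL && sb != NULL);
    for (size_t i = 0; i < n; i++)
        if (types[i].read_super(disk_magic, sb) == 0)
            return &types[i];
    return NULL;                   /* no registered type recognized the disk */
}
```

Only the probe whose magic number matches fills the superblock; the others return an error without touching it, so the order of the list affects speed but not the outcome.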
Once the root filesystem is initialized, additional filesystems may be mounted. Each must have its own mount point, which is just an already existing directory in the system’s directory tree.
The mount( ) system call is used to mount a filesystem; its sys_mount( ) service routine acts on the following parameters:
The pathname of a device file containing the filesystem, or NULL if it is not required (for instance, when the filesystem to be mounted is network-based)
The pathname of the directory on which the filesystem will be mounted (the mount point)
The filesystem type, which must be the name of a registered filesystem
The mount flags (permitted values are listed in Table 12-13)
A pointer to a filesystem-dependent data structure (which may be NULL)
Table 12-13. Mount flags
Macro | Description |
---|---|
 | Files can only be read |
MS_NOSUID | Forbid |
MS_NODEV | Forbid access to device files |
MS_NOEXEC | Disallow program execution |
 | Write operations are immediate |
MS_REMOUNT | Remount the filesystem changing the mount flags |
 | Mandatory locking allowed |
 | Do not update file access time |
 | Do not update directory access time |
MS_BIND | Create a “bind mount,” which allows making a file or directory visible at another point of the system directory tree |
MS_MOVE | Atomically move a mounted filesystem to another mount point |
 | Should recursively create “bind mounts” for a directory subtree (still unfinished in 2.4.18) |
 | Generate kernel messages on mount errors |
The sys_mount( ) function copies the value of the parameters into temporary kernel buffers, acquires the big kernel lock, and invokes the do_mount( ) function. Once do_mount( ) returns, the service routine releases the big kernel lock and frees the temporary kernel buffers.
The do_mount( ) function takes care of the actual mount operation by performing the following operations:
1. Checks whether the sixteen highest-order bits of the mount flags are set to the “magic” value 0xc0ed; in this case, they are cleared. This is a legacy hack that allows the sys_mount( ) service routine to be used with old C libraries that do not handle the highest-order flags.
2. If any of the MS_NOSUID, MS_NODEV, or MS_NOEXEC flags passed as a parameter are set, clears them and sets the corresponding flag (MNT_NOSUID, MNT_NODEV, MNT_NOEXEC) in the mounted filesystem object.
3. Looks up the pathname of the mount point by invoking path_init( ) and path_walk( ) (see the later Section 12.5).
4. Examines the mount flags to determine what has to be done. In particular:
   a. If the MS_REMOUNT flag is specified, the purpose is usually to change the mount flags in the s_flags field of the superblock object and the mounted filesystem flags in the mnt_flags field of the mounted filesystem object. The do_remount( ) function performs these changes.
   b. Otherwise, checks the MS_BIND flag. If it is specified, the user is asking to make visible a file or directory on another point of the system directory tree. Usually, this is done when mounting a filesystem stored in a regular file instead of a physical disk partition (loopback). The do_loopback( ) function accomplishes this task.
   c. Otherwise, checks the MS_MOVE flag. If it is specified, the user is asking to change the mount point of an already mounted filesystem. The do_move_mount( ) function does this atomically.
   d. Otherwise, invokes do_add_mount( ). This is the most common case. It is triggered when the user asks to mount either a special filesystem or a regular filesystem stored in a disk partition. do_add_mount( ) performs the following actions:
      i. Invokes do_kern_mount( ), passing to it the filesystem type, the mount flags, and the block device name. As already described in Section 12.4.1, do_kern_mount( ) takes care of the actual mount operation.
      ii. Acquires the mount_sem semaphore.
      iii. Initializes the flags in the mnt_flags field of the new mounted filesystem object allocated by do_kern_mount( ).
      iv. Invokes graft_tree( ) to insert the new mounted filesystem object in the global list, in the hash table, and in the children list of the parent mounted filesystem.
      v. Releases the mount_sem semaphore.
5. Invokes path_release( ) to terminate the pathname lookup of the mount point (see the later Section 12.5).
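The flag handling and the four-way dispatch performed by do_mount( ) can be condensed into a small user-space sketch. Everything prefixed with toy_ or TOY_ is hypothetical; the numeric values are intended to mirror those in the Linux 2.4 headers (for example, a magic of 0xC0ED in the top sixteen bits) but should be treated as illustrative. The function only reports which helper would be invoked; it performs no mount.

```c
#include <assert.h>

/* Illustrative values, intended to mirror the Linux 2.4 headers. */
#define TOY_MS_NOSUID   2UL
#define TOY_MS_NODEV    4UL
#define TOY_MS_NOEXEC   8UL
#define TOY_MS_REMOUNT  32UL
#define TOY_MS_BIND     4096UL
#define TOY_MS_MOVE     8192UL
#define TOY_MS_MGC_VAL  0xC0ED0000UL   /* legacy magic in the top 16 bits */
#define TOY_MS_MGC_MSK  0xFFFF0000UL

#define TOY_MNT_NOSUID  1
#define TOY_MNT_NODEV   2
#define TOY_MNT_NOEXEC  4

/* Which helper the dispatch would call. */
enum toy_mount_action { TOY_REMOUNT, TOY_LOOPBACK, TOY_MOVE, TOY_ADD };

struct toy_decision {
    unsigned long flags;           /* MS_* flags left after processing */
    int mnt_flags;                 /* MNT_* flags for the mount descriptor */
    enum toy_mount_action action;
};

static struct toy_decision toy_do_mount_flags(unsigned long flags)
{
    struct toy_decision d = { 0, 0, TOY_ADD };

    /* Discard the legacy magic value if it is present. */
    if ((flags & TOY_MS_MGC_MSK) == TOY_MS_MGC_VAL)
        flags &= ~TOY_MS_MGC_MSK;

    /* Move the per-mount flags out of the MS_* set. */
    if (flags & TOY_MS_NOSUID)
        d.mnt_flags |= TOY_MNT_NOSUID;
    if (flags & TOY_MS_NODEV)
        d.mnt_flags |= TOY_MNT_NODEV;
    if (flags & TOY_MS_NOEXEC)
        d.mnt_flags |= TOY_MNT_NOEXEC;
    flags &= ~(TOY_MS_NOSUID | TOY_MS_NODEV | TOY_MS_NOEXEC);

    /* Remount is tested first, then bind, then move; add is the default. */
    if (flags & TOY_MS_REMOUNT)
        d.action = TOY_REMOUNT;
    else if (flags & TOY_MS_BIND)
        d.action = TOY_LOOPBACK;
    else if (flags & TOY_MS_MOVE)
        d.action = TOY_MOVE;
    else
        d.action = TOY_ADD;

    d.flags = flags;
    return d;
}
```

The order of the tests matters: a request carrying both the remount and the move bits is treated as a remount, because the remount check comes first.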
The core of the mount operation is the do_kern_mount( ) function, which we already described in the earlier Section 12.4.1. Recall that this function checks the filesystem type flags to determine how the mount operation is to be done. For a regular disk-based filesystem, the FS_REQUIRES_DEV flag is set, so do_kern_mount( ) invokes the get_sb_bdev( ) function, which performs the following actions:
1. Invokes path_init( ) and path_walk( ) to look up the pathname of the block device (see Section 12.5).
2. Invokes blkdev_get( ) to open the block device storing the regular filesystem.
3. Searches the list of superblock objects; if a superblock relative to the block device is already present, returns its address. This means that the filesystem is already mounted and will be mounted again.
4. Otherwise, allocates a new superblock object, initializes its s_dev, s_bdev, s_flags, and s_type fields, and inserts it into the global list of superblocks and the superblock list of the filesystem type descriptor.
5. Acquires the s_lock semaphore of the superblock.
6. Invokes the read_super method of the filesystem type to access the superblock information on disk and fill the other fields of the new superblock object.
7. Sets the MS_ACTIVE flag of the superblock.
8. Releases the s_lock semaphore of the superblock.
9. If the filesystem type is implemented by a kernel module, increments its usage counter.
10. Invokes path_release( ) to terminate the mount point lookup operation.
The umount( ) system call is used to unmount a filesystem. The corresponding sys_umount( ) service routine acts on two parameters: a filename (either a mount point directory or a block device filename) and a set of flags. It performs the following actions:
1. Invokes path_init( ) and path_walk( ) to look up the mount point pathname (see the next section). Once finished, the functions return the address d of the dentry object corresponding to the pathname.
2. If the resulting directory is not the mount point of a filesystem, returns the -EINVAL error code. This check is done by verifying that d->mnt->mnt_root contains the address of the dentry object d.
3. If the filesystem to be unmounted has not been mounted on the system directory tree, returns the -EINVAL error code. (Recall that some special filesystems have no mount point.) This check is done by invoking the check_mnt( ) function on d->mnt.
4. If the user does not have the privileges required to unmount the filesystem, returns the -EPERM error code.
5. Invokes do_umount( ), which performs the following operations:
   a. Retrieves the address of the superblock object from the mnt_sb field of the mounted filesystem object.
   b. If the user asked to force the unmount operation, interrupts any ongoing mount operation by invoking the umount_begin superblock operation.
   c. If the filesystem to be unmounted is the root filesystem and the user didn’t ask to actually detach it, invokes do_remount_sb( ) to remount the root filesystem read-only and terminates.
   d. Acquires the mount_sem semaphore for writing and the dcache_lock dentry spin lock.
   e. If the mounted filesystem does not include mount points for any child mounted filesystem, or if the user asked to forcibly detach the filesystem, invokes umount_tree( ) to unmount the filesystem (together with all children).
   f. Releases mount_sem and dcache_lock.
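The sequence of checks that precedes the actual unmount can be sketched as a single validation function. This is a user-space illustration under several assumptions: the toy_* types are hypothetical, the d->mnt notation follows the text above rather than the kernel’s data layout, and returning -EBUSY when child mounts are present and no forced detach was requested is an assumption of this sketch, since the steps above do not name the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_dentry;

struct toy_vfsmount {
    struct toy_dentry *mnt_root;   /* root directory of the mounted filesystem */
    int nr_children;               /* number of child mounted filesystems */
    int attached;                  /* nonzero if mounted on the directory tree */
};

struct toy_dentry {
    struct toy_vfsmount *mnt;      /* mount that the lookup ended in */
};

/* Return 0 if the unmount may proceed, or a negative error code. */
static int toy_umount_check(const struct toy_dentry *d,
                            int is_privileged, int force_detach)
{
    assert(d != NULL && d->mnt != NULL);

    /* The pathname must resolve to the root of a mounted filesystem. */
    if (d->mnt->mnt_root != d)
        return -EINVAL;

    /* Filesystems never attached to the directory tree cannot be unmounted. */
    if (!d->mnt->attached)
        return -EINVAL;

    /* The caller needs the privilege to unmount. */
    if (!is_privileged)
        return -EPERM;

    /* Assumption of this sketch: child mounts block a non-forced unmount. */
    if (d->mnt->nr_children > 0 && !force_detach)
        return -EBUSY;

    return 0;
}
```

The checks run from cheapest to most specific, so an unprivileged caller naming a directory that is not a mount root gets -EINVAL rather than -EPERM, matching the order of the steps listed above.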
[86] Quite surprisingly, the mount point of a filesystem might be a directory of the same filesystem, provided that it was already mounted before. For instance:
mount -t ext2 /dev/fd0 /flp; touch /flp/foo
mkdir /flp/mnt; mount -t ext2 /dev/fd0 /flp/mnt
Now, the empty foo file on the floppy filesystem can be accessed both as /flp/foo and /flp/mnt/foo.