In
this section, we illustrate how the VFS derives an inode from the
corresponding file pathname. When a process must identify a file, it
passes its file pathname to some VFS system call, such as
open( )
, mkdir( )
,
rename( )
, or stat( )
.
The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames. All filenames except the last must identify directories.
If the first character of the pathname is /
,
the pathname is absolute, and the search starts from the directory
identified by current->fs->root
(the process
root directory). Otherwise, the pathname is relative and the search
starts from the directory identified by
current->fs->pwd
(the process-current
directory).
Having in hand the inode of the initial directory, the code examines the entry matching the first name to derive the corresponding inode. Then the directory file that has that inode is read from disk and the entry matching the second name is examined to derive the corresponding inode. This procedure is repeated for each name included in the path.
The dentry cache considerably speeds up the procedure, since it keeps the most recently used dentry objects in memory. As we saw before, each such object associates a filename in a specific directory to its corresponding inode. In many cases, therefore, the analysis of the pathname can avoid reading the intermediate directories from the disk.
However, things are not as simple as they look, since the following Unix and VFS filesystem features must be taken into consideration:
The access rights of each directory must be checked to verify whether the process is allowed to read the directory’s content.
A filename can be a symbolic link that corresponds to an arbitrary pathname; in this case, the analysis must be extended to all components of that pathname.
Symbolic links may induce circular references; the kernel must take this possibility into account and break endless loops when they occur.
A filename can be the mount point of a mounted filesystem. This situation must be detected, and the lookup operation must continue into the new filesystem.
Pathname lookup is performed by three functions: path_init( )
, path_walk( )
, and
path_release( )
. They are always invoked in this
exact order.
The path_init( )
function receives three
parameters:
name
A pointer to the file pathname to be resolved.
flags
The value of flags that represent how the looked-up file is going to be accessed. The flags are listed in Table 12-17 in the later section Section 12.6.1.[87]
nd
The address of a struct nameidata
data structure.
The struct nameidata
data structure is filled by
path_walk( )
with data pertaining to the pathname
lookup operation. The fields of this structure are shown in Table 12-14.
Table 12-14. The fields of the nameidata data structure
Type |
Field |
Description |
---|---|---|
|
|
Address of the dentry object |
|
|
Address of the mounted filesystem object |
|
|
Last component of the pathname (used when the
|
|
|
Lookup flags |
|
|
Type of last component of the pathname (used when the
|
The dentry
and mnt
fields point
respectively to the dentry object and the mounted filesystem object
of the last resolved component in the pathname. Once
path_walk( )
successfully returns, these two
fields “describe” the file that is
identified by the given pathname.
The flags
field stores the value of some flags
used in the lookup operation; they are listed in Table 12-15.
Table 12-15. The flags of the lookup operation
Macro |
Description |
---|---|
|
If the last component is a symbolic link, interpret (follow) it. |
|
The last component must be a directory. |
|
There are still filenames to be examined in the pathname (used only by NFS). |
|
The pathname must identify an existing file. |
|
Look up the directory including the last component of the pathname. |
|
Do not consider the emulated root directory (always set for the 80 × 86 architecture). |
The goal of the path_init( )
function consists of
initializing the nameidata
structure, which it
does in the following manner:
Sets the dentry
field with the address of the
dentry object of the directory where the pathname lookup operation
starts. If the pathname is relative (it doesn’t
start with a slash), the field points to the dentry of the working
directory (current->fs->pwd
); otherwise it
points to the dentry of the root directory of the process
(current->fs->root
).
Sets the mnt
field with the address of the mounted
filesystem object relative to the directory where the pathname lookup
operation starts: either current->fs->pwdmnt
or current->fs->rootmnt
, according to
whether the pathname is relative or absolute.
Initializes the flags
field with the value of the
flags
parameter.
Initializes the last_type
field to
LAST_ROOT
.
Once path_init( )
initializes the
nameidata
data structure, the path_walk( )
function takes care of the lookup operation, and stores
in the nameidata
structure the pointers to the
dentry object and mounted filesystem object relative to the last
component of the pathname. The function also increments the usage
counters of the objects referenced by
nd->dentry
and nd->mnt
so
that the caller function may safely access them once
path_walk( )
returns. When the caller finishes
accessing them, it invokes the third function of the set,
path_release( )
, which receives as a parameter the
address of the nameidata
data structure and
decrements the two usage counters of nd->dentry
and nd->mnt
.
We are now ready to describe the core of the pathname lookup
operation, namely the path_walk( )
function. It
receives as parameters a pointer name
to the
pathname to be resolved and the address nd
of the
nameidata
data structure. The function initializes
to zero the total_link_count
of the current
process (see the later section Section 12.5.3), and then invokes
link_path_walk( )
. This latter function acts on
the same two parameters of path_walk( )
.
To make things a bit easier, we first describe what
link_path_walk( )
does when
LOOKUP_PARENT
is not set and the pathname does not
contain symbolic links (standard pathname lookup). Next, we discuss
the case in which LOOKUP_PARENT
is set: this type
of lookup is required when creating, deleting, or renaming a
directory entry, that is, during a parent pathname lookup. Finally,
we explain how the function resolves the symbolic links.
When the LOOKUP_PARENT
flag is cleared,
link_path_walk( )
performs the following steps.
Initializes the lookup_flags
local variable with
nd->flags
.
Skips any leading slash ( / ) before the first component of the pathname.
If the remaining pathname is empty, returns the value 0. In the
nameidata
data structure, the
dentry
and mnt
fields point to
the object relative to the last resolved component of the original
pathname.
If the link_count
field in the descriptor of the
current process is positive, sets the
LOOKUP_FOLLOW
flag in the
lookup_flags
local variable (see Section 12.5.3).
Executes a cycle that breaks name
into components
(the intermediate slashes are treated as filename separators); for
each component found, the function:
Retrieves the address of the inode object of the last resolved
component from nd->dentry->d_inode
.
Checks that the permissions of the last resolved component stored
into the inode allow execution (in Unix, a directory can be traversed
only if it is executable). If the inode has a custom
permission
method, the function executes it;
otherwise, it executes the vfs_permission( )
function, which examines the access mode stored in the
i_mode
inode field and the privileges of the
running process.
Considers the next component to be resolved. From its name, it computes a hash value for the dentry cache hash table.
Skips any trailing slash ( / ) after the slash that terminates the name of the component to be resolved.
If the component to be resolved is the last one in the original pathname, jump to Step 6.
If the name of the component is “.” (a single dot), continues with the next component (“.” refers to the current directory, so it has no effect inside a pathname).
If the name of the component is “. .” (two dots), tries to climb to the parent directory:
If the last resolved directory is the process’s root
directory (nd->dentry
is equal to
current->fs->root
and
nd->mnt
is equal to
current->fs->rootmnt
), continues with the
next component.
If the last resolved directory is the root directory of a mounted
filesystem (nd->dentry
is equal to
nd->mnt->mnt_root
), sets
nd->mnt
to
nd->mnt->mnt_parent
and
nd->dentry
to
nd->mnt->mnt_mountpoint
, and then restarts
Step 5.g. (Recall that several filesystems can be mounted on the same
mount point).
If the last resolved directory is not the root directory of a mounted
filesystem, sets nd->dentry
to
nd->dentry->d_parent
and continues with the
next component.
The component name is neither “.”
nor “. .”, so the function must
look it up in the dentry cache. If the low-level filesystem has a
custom d_hash
dentry method, the function invokes
it to modify the hash value already computed in Step 5.c.
Invokes cached_lookup( )
, passing as parameters
nd->dentry
, the name of the component to be
resolved, the hash value, and the LOOKUP_CONTINUE
flag, which specifies that this is not the last component of the
pathname. The function invokes d_lookup( )
to
search the dentry object of the component in the dentry cache. If
cached_lookup( )
fails in finding the dentry in
the cache, link_walk_path( )
invokes
real_lookup( )
to read the directory from disk and
create a new dentry object. In either case, we can assume at the end
of this step that the dentry
local variable points
to the dentry object of the component name to be resolved in this
cycle.
Checks whether the component just resolved (dentry
local variable) refers to a directory that is a mount point for some
filesystem (dentry->d_mounted
is set to 1). In
this case, invokes lookup_mnt( )
, passing to it
dentry
and nd->mnt
, in order
to get the address mounted
of the child mounted
filesystem object. Next, it sets dentry
to
mounted->mnt_root
and
nd->mnt
to mounted
. Then it
repeats the whole step (several filesystems can be mounted on the
same mount point).
Checks whether the inode object dentry->d_inode
has a custom follow_link
method. If this is the
case, the component is a symbolic link, which is described in the
later section Section 12.5.3.
Checks that dentry
points to the dentry object of
a directory
(dentry->d_inode->i_op->lookup
method is
defined). If not, returns the error -ENOTDIR
,
because the component is in the middle of the original pathname.
Sets nd->dentry
to dentry
and continues with the next component of the pathname.
Now all components of the original pathname are resolved except the
last one. If the pathname has a trailing slash, it sets the
LOOKUP_FOLLOW
and
LOOKUP_DIRECTORY
in the
lookup_flags
local variable to force
interpretation of the last component as a directory name.
Checks the value of the LOOKUP_PARENT
flag in the
lookup_flags
variable. In the following, we assume
that the flag is set to 0, and we postpone the opposite case to the
next section.
If the name of the last component is
“.” (a single dot), terminates the
execution returning the value 0 (no error). In the
nameidata
structure that nd
points to, the dentry
and mnt
fields refer to the objects relative to the next-to-last component of
the pathname (any component “.” has
no effect inside a pathname).
If the name of the last component is “. .” (two dots), tries to climb to the parent directory:
If the last resolved directory is the process’s root
directory (nd->dentry
is equal to
current->fs->root
and
nd->mnt
is equal to
current->fs->rootmnt
), terminates the
execution returning the value 0 (no error).
nd->dentry
and nd->mnt
refer to the objects relative to the next to the last component of
the pathname—that is, to the root directory of the process.
If the last resolved directory is the root directory of a mounted
filesystem (nd->dentry
is equal to
nd->mnt->mnt_root
), sets
nd->mnt
to
nd->mnt->mnt_parent
and
nd->dentry
to
nd->mnt->mnt_mountpoint
, and then restarts
Step 5.j.
If the last resolved directory is not the root directory of a mounted
filesystem, sets nd->dentry
to
nd->dentry->d_parent
, and terminates the
execution returning the value 0 (no error).
nd->dentry
and nd->mnt
refer to the objects relative to the next-to-last component of the
pathname.
The name of the last component is neither
“.” nor “.
.”, so the function must look it up in the dentry
cache. If the low-level filesystem has a custom
d_hash
dentry method, the function invokes it to
modify the hash value already computed in Step 5.c.
Invokes cached_lookup( )
, passing as parameters
nd->dentry
, the name of the component to be
resolved, the hash value, and no flag
(LOOKUP_CONTINUE
is not set because this is the
last component of the pathname). If cached_lookup( )
fails in finding the dentry in the cache, it also invokes
real_lookup( )
to read the directory from disk and
create a new dentry object. In either case, we can assume at the end
of this step that the dentry
local variable points
to the dentry object of the component name to be resolved in this
cycle.
Checks whether the component just resolved (dentry
local variable) refers to a directory that is a mount point for some
filesystem (dentry->d_mounted
is set to 1). In
this case, invokes lookup_mnt( )
, passing to it
dentry
and nd->mnt
, in order
to get the address mounted
of the child mounted
filesystem object. Next, it sets dentry
to
mounted->mnt_root
and
nd->mnt
to mounted
. Then it
repeats the whole step (because several filesystems can be mounted on
the same mount point).
Checks whether LOOKUP_FOLLOW
flag is set in
lookup_flags
and the inode object
dentry->d_inode
has a custom
follow_link
method. If this is the case, the
component is a symbolic link that must be interpreted, as described
in the later section Section 12.5.3.
Sets nd->dentry
with the value stored in the
dentry
local variable. This dentry object is
“the result” of the lookup
operation.
Checks whether nd->dentry->d_inode
is
NULL
. This happens when there is no inode
associated with the dentry object, usually because the pathname
refers to a nonexisting file. In this case:
If either LOOKUP_POSITIVE
or
LOOKUP_DIRECTORY
is set in
lookup_flags
, it terminates, returning the error
code -ENOENT
.
Otherwise, it terminates returning the value 0 (no error).
nd->dentry
points to the negative dentry object
created by the lookup operation.
There is an inode associated with the last component of the pathname.
If the LOOKUP_DIRECTORY
flag is set in
lookup_flags
, checks that the inode has a custom
lookup
method—that is, it is a directory. If
not, terminates returning the error code -ENOTDIR
.
Terminates returning the value 0 (no error).
nd->dentry
and nd->mnt
refer to the last component of the pathname.
In many cases, the real target of a lookup operation is not the last
component of the pathname, but the next-to-last one. For example,
when a file is created, the last component denotes the filename of
the not yet existing file, and the rest of the pathname specifies the
directory in which the new link must be inserted. Therefore, the
lookup operation should fetch the dentry object of the next-to-last
component. For another example, unlinking a file identified by the
pathname /foo/bar
consists of removing
bar
from the directory foo
.
Thus, the kernel is really interested in accessing the file directory
foo
rather than bar
.
The LOOKUP_PARENT
flag is used whenever the lookup
operation must resolve the directory containing the last component of
the pathname, rather than the last component itself.
When the LOOKUP_PARENT
flag is set, the
path_walk( )
function also sets up the
last
and last_type
fields of
the nameidata
data structure. The
last
field stores the name of the last component
in the pathname. The last_type
field identifies
the type of the last component; it may be set to one of the values
shown in Table 12-16.
Table 12-16. The values of the last_type field in the nameidata data structure
Value |
Description |
---|---|
|
Last component is a regular filename |
|
Last component is " |
|
Last component is “.” |
|
Last component is “. .” |
|
Last component is a symbolic link into a special filesystem |
The LAST_ROOT
flag is the default value set by
path_init( )
when the whole pathname lookup
operation starts (see the description at the beginning of
Section 12.5). If the pathname
turns out to be just "/
“, the kernel does not change the initial
value of the last_type
field. The
LAST_BIND
flag is set by the
follow_link
inode object’s method
of symbolic links in special filesystems (see the next section).
The remaining values of the last_type
field are
set by link_path_walk( )
when the
LOOKUP_PARENT
flag is on; in this case, the
function performs the same steps described in the previous section up
to Step 7. From Step 7 onward, however, the lookup operation for the
last component of the pathname is different:
Sets nd->last
to the name of the last component
Initializes nd->last_type
to
LAST_NORM
If the name of the last component is
“.” (a single dot), sets
nd->last_type
to LAST_DOT
If the name of the last component is “.
.” (two dots), sets
nd->last_type
to LAST_DOTDOT
Terminates by returning the value 0 (no error)
As you can see, the last component is not interpreted at all. Thus,
when the function terminates, the dentry
and
mnt
fields of the nameidata
data structure point to the objects relative to the directory that
includes the last component.
Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel.
For example, if /foo/bar
is a symbolic link
pointing to (containing the pathname) ../dir
,
the pathname /foo/bar/file
must be resolved by
the kernel as a reference to the file /dir/file
.
In this example, the kernel must perform two different lookup
operations. The first one resolves /foo/bar
;
when the kernel discovers that bar is the name
of a symbolic link, it must retrieve its content and interpret it as
another pathname. The second pathname operation starts from the
directory reached by the first operation and continues until the last
component of the symbolic link pathname has been resolved. Next, the
original lookup operation resumes from the dentry reached in the
second one and with the component following the symbolic link in the
original pathname.
To further complicate the scenario, the pathname included in a symbolic link may include other symbolic links. You might think that the kernel code that resolves the symbolic links is hard to understand, but this is not true; the code is actually quite simple because it is recursive.
However, untamed recursion is intrinsically dangerous. For instance,
suppose that
a symbolic link points to itself.
Of course, resolving a pathname including such a symbolic link may
induce an endless stream of recursive invocations, which in turn
quickly leads to a kernel stack overflow. The
link_count
field in the descriptor of the current
process is used to avoid the problem: the field is incremented before
each recursive execution and decremented right after. If the field
reaches the value 5, the whole lookup operation terminates with an
error code. Therefore, the level of nesting of symbolic links can be
at most 5.
Furthermore, the total_link_count
field in the
descriptor of the current process keeps track of how many symbolic
links (even nonnested) were followed in the original lookup
operation. If this counter reaches the value 40, the lookup operation
aborts. Without this counter, a malicious user could create a
pathological pathname including many consecutive symbolic links that
freezes the kernel in a very long lookup operation.
This is how the code basically works: once the
link_path_walk( )
function retrieves the dentry
object associated with a component of the pathname, it checks whether
the corresponding inode object has a custom
follow_link
method (see Step 5.k and Step 13 in
Section 12.5.1). If so,
the inode is a symbolic link that must be interpreted before
proceeding with the lookup operation of the original pathname.
In this case, the link_path_walk( )
function
invokes do_follow_link( )
, passing to it the
address of the dentry object of the symbolic link and the address of
the nameidata
data structure. In turn,
do_follow_link( )
performs the following steps:
Checks that current->link_count
is less than 5;
otherwise, returns the error code -ELOOP
Checks that current->total_link_count
is less
than 40; otherwise, returns the error code -ELOOP
If the current->need_resched
flag is set,
invokes schedule( )
to give a chance to preempt
the running process
Increments current->link_count
and
current->total_link_count
Updates the access time of the inode object associated with the symbolic link to be resolved
Invokes the follow_link
method of the inode,
passing to it the addresses of the dentry object and of the
nameidata
data structure
Decrements the current->link_count
field
Returns the error code returned by the follow_link
method (0 for no error)
The follow_link
method is a filesystem-dependent
function that reads the pathname stored in the symbolic link from the
disk. Having filled a buffer with the symbolic
link’s pathname, most follow_link
methods end up invoking the vfs_follow_link( )
function and returning the value taken from it. In turn, the
vfs_follow_link( )
does the following:
Checks whether the first character of the symbolic link pathname is a
slash; if so, the dentry
and
mnt
fields of the nameidata
data structure are set so they refer to the current process root
directory.
Invokes link_path_walk( )
to resolve the symbolic
link pathname, passing to it the nameidata
data
structure.
Returns the value taken from link_path_walk( )
.
When do_follow_link( )
finally terminates, it
returns the address of the dentry object referred to by the symbolic
link to the original execution of link_path_walk( )
. The link_path_walk( )
assigns this
address to the dentry
local variable, and then
proceeds with the next step.
[87] There is,
however, a small difference in how the O_RDONLY
,
O_WRONLY
, and O_RDWR
flags are
encoded. The bit at index 0 (lowest-order) of the
flags
parameter is set only if the file access
requires read privileges; similarly, the bit at index 1 is set only
if the file access requires write privileges. Conversely, for the
open( )
system call, the value of the
O_WRONLY
flag is stored in the bit at index 0,
while the O_RDWR
flag is stored in the bit at
index 1; thus, the O_RDONLY
flag is true when both
bits are cleared. Notice that it is not possible to specify in the
open( )
system call that a file access does not
require either read or write privileges; this makes sense, however,
in a pathname lookup operation involving symbolic links.