For the sake of brevity, we cannot discuss the implementation of all the VFS system calls listed in Table 12-1. However, it could be useful to sketch out the implementation of a few system calls, just to show how VFS’s data structures interact.
Let’s reconsider the example proposed at the beginning of this chapter: a user issues a shell command that copies the MS-DOS file /floppy/TEST to the Ext2 file /tmp/test. The command shell invokes an external program like cp, which we assume executes the following code fragment:
    inf = open("/floppy/TEST", O_RDONLY, 0);
    outf = open("/tmp/test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    do {
        len = read(inf, buf, 4096);
        write(outf, buf, len);
    } while (len);
    close(outf);
    close(inf);
Actually, the code of the real cp program is more complicated, since it must also check for possible error codes returned by each system call. In our example, we just focus our attention on the “normal” behavior of a copy operation.
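To give an idea of what the extra checks look like, here is a minimal sketch of the same copy loop with error handling added. It is not the code of the real cp program; the function name copy_file and its return convention are assumptions made for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy src to dst.
 * Returns 0 on success, -1 on error (errno describes the failure). */
int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    ssize_t len;
    int inf, outf, saved;

    inf = open(src, O_RDONLY);
    if (inf < 0)
        return -1;
    outf = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (outf < 0) {
        saved = errno;
        close(inf);
        errno = saved;
        return -1;
    }

    /* read() returns 0 at end of file and -1 on error */
    while ((len = read(inf, buf, sizeof(buf))) != 0) {
        ssize_t off = 0;

        if (len < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            goto fail;
        }
        /* write() may transfer fewer bytes than requested */
        while (off < len) {
            ssize_t n = write(outf, buf + off, (size_t)(len - off));
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                goto fail;
            }
            off += n;
        }
    }
    close(inf);
    return close(outf);

fail:
    saved = errno;
    close(inf);
    close(outf);
    errno = saved;
    return -1;
}
```

A caller would invoke copy_file("/floppy/TEST", "/tmp/test") and inspect errno if the function returns -1.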
The open( ) system call is serviced by the sys_open( ) function, which receives as parameters the pathname filename of the file to be opened, some access mode flags flags, and a permission bit mask mode, which is used if the file must be created. If the system call succeeds, it returns a file descriptor, that is, the index assigned to the new file in the current->files->fd array of pointers to file objects; otherwise, it returns -1.
In our example, open( ) is invoked twice: the first time to open /floppy/TEST for reading (O_RDONLY flag) and the second time to open /tmp/test for writing (O_WRONLY flag). If /tmp/test does not already exist, it is created (O_CREAT flag) with read and write access for the owner only (octal number 0600 in the third parameter). Conversely, if the file already exists, it is rewritten from scratch (O_TRUNC flag). Table 12-17 lists all flags of the open( ) system call.
Table 12-17. The flags of the open( ) system call
Flag name | Description |
---|---|
O_RDONLY | Open for reading |
O_WRONLY | Open for writing |
O_RDWR | Open for both reading and writing |
O_CREAT | Create the file if it does not exist |
O_EXCL | With O_CREAT, fail if the file already exists |
O_NOCTTY | Never consider the file as a controlling terminal |
O_TRUNC | Truncate the file (remove all existing contents) |
O_APPEND | Always write at end of the file |
O_NONBLOCK | No system calls will block on the file |
O_NDELAY | Same as O_NONBLOCK |
O_SYNC | Synchronous write (block until physical write terminates) |
FASYNC | Asynchronous I/O notification via signals |
O_DIRECT | Direct I/O transfer (no kernel buffering) |
O_LARGEFILE | Large file (size greater than 2 GB) |
O_DIRECTORY | Fail if file is not a directory |
O_NOFOLLOW | Do not follow a trailing symbolic link in pathname |
Let’s describe the operation of the sys_open( ) function. It performs the following steps:

1. Invokes getname( ) to read the file pathname from the process address space.

2. Invokes get_unused_fd( ) to find an empty slot in current->files->fd. The corresponding index (the new file descriptor) is stored in the fd local variable.

3. Invokes the filp_open( ) function, passing as parameters the pathname, the access mode flags, and the permission bit mask. This function, in turn, executes the following steps:

   a. Copies the access mode flags into namei_flags, but encodes the access mode flags O_RDONLY, O_WRONLY, and O_RDWR with the format expected by the pathname lookup functions (see the earlier Section 12.5).

   b. Invokes open_namei( ), passing to it the pathname, the modified access mode flags, and the address of a local nameidata data structure. The function performs the lookup operation in the following manner:

      - If O_CREAT is not set in the access mode flags, starts the lookup operation with the LOOKUP_PARENT flag not set. Moreover, the LOOKUP_FOLLOW flag is set only if O_NOFOLLOW is cleared, while the LOOKUP_DIRECTORY flag is set only if the O_DIRECTORY flag is set.

      - If O_CREAT is set in the access mode flags, starts the lookup operation with the LOOKUP_PARENT flag set. Once the path_walk( ) function successfully returns, checks whether the requested file already exists. If not, allocates a new disk inode by invoking the create method of the parent inode.

      The open_namei( ) function also executes several security checks on the file located by the lookup operation. For instance, the function checks whether the inode associated with the dentry object found really exists, whether it is a regular file, and whether the current process is allowed to access it according to the access mode flags. Also, if the file is opened for writing, the function checks that the file is not locked by other processes.

   c. Invokes the dentry_open( ) function, passing to it the access mode flags and the addresses of the dentry object and the mounted filesystem object located by the lookup operation. In turn, this function:

      i. Allocates a new file object.

      ii. Initializes the f_flags and f_mode fields of the file object according to the access mode flags passed to the open( ) system call.

      iii. Initializes the f_dentry and f_vfsmnt fields of the file object according to the addresses of the dentry object and the mounted filesystem object passed as parameters.

      iv. Sets the f_op field to the contents of the i_fop field of the corresponding inode object. This sets up all the methods for future file operations.

      v. Inserts the file object into the list of opened files pointed to by the s_files field of the filesystem’s superblock.

      vi. If the O_DIRECT flag is set, preallocates a direct access buffer (see Section 15.3).

      vii. If the open method of the file operations is defined, invokes it.

      viii. Returns the address of the file object.

4. Sets current->files->fd[fd] to the address of the file object returned by dentry_open( ).

5. Returns fd.
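The bookkeeping part of these steps (finding a free slot, copying the inode's file operations into the file object, and installing the file object in the descriptor array) can be sketched as a user-space toy model. This is not kernel code: every name prefixed with toy_ is hypothetical, the pathname lookup and all permission checks are left out, and the caller supplies the storage for the file object.

```c
#include <stddef.h>

#define TOY_MAX_FDS 8

/* Stand-ins for the kernel's file operations, inode, and file objects. */
struct toy_file_ops { const char *name; };
struct toy_inode    { const struct toy_file_ops *i_fop; };
struct toy_file {
    unsigned int f_flags;
    const struct toy_file_ops *f_op;
};

/* Stand-in for current->files: an array of pointers to file objects. */
struct toy_files { struct toy_file *fd[TOY_MAX_FDS]; };

/* Analogous to step 2: return the index of the first empty slot, or -1. */
static int toy_get_unused_fd(const struct toy_files *files)
{
    for (int i = 0; i < TOY_MAX_FDS; i++)
        if (files->fd[i] == NULL)
            return i;
    return -1;
}

/* Analogous to steps 2 through 5, with the lookup already done:
 * inode is the object that the pathname lookup would have located. */
int toy_open(struct toy_files *files, struct toy_file *filp,
             struct toy_inode *inode, unsigned int flags)
{
    int fd = toy_get_unused_fd(files);

    if (fd < 0)
        return -1;
    filp->f_flags = flags;         /* record the access mode flags */
    filp->f_op = inode->i_fop;     /* methods for future file operations */
    files->fd[fd] = filp;          /* install the file object */
    return fd;
}
```

In this model, two successive calls on an empty table yield descriptors 0 and 1, and a slot that has been reset to NULL is handed out again by the next call.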
Let’s return to the code in our cp example. The open( ) system calls return two file descriptors, which are stored in the inf and outf variables. Then the program starts a loop: at each iteration, a portion of the /floppy/TEST file is copied into a local buffer (read( ) system call), and then the data in the local buffer is written into the /tmp/test file (write( ) system call).
The read( ) and write( ) system calls are quite similar. Both require three parameters: a file descriptor fd, the address buf of a memory area (the buffer containing the data to be transferred), and a number count that specifies how many bytes should be transferred. Of course, read( ) transfers the data from the file into the buffer, while write( ) does the opposite. Both system calls return either the number of bytes that were successfully transferred or -1 to signal an error condition.
A return value less than count does not mean that an error occurred. The kernel is always allowed to terminate the system call even if not all requested bytes were transferred, and the user application must accordingly check the return value and reissue, if necessary, the system call. Typically, a small value is returned when reading from a pipe or a terminal device, when reading past the end of the file, or when the system call is interrupted by a signal. The End-Of-File condition (EOF) can easily be recognized by a 0 return value from read( ). This condition will not be confused with an abnormal termination due to a signal, because if read( ) is interrupted by a signal before any data is read, an error occurs.
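A user application typically wraps read( ) in a loop that reissues the system call after a short transfer. The following is a minimal sketch of such a wrapper; the name read_full is an assumption made for this example, not a standard library function.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until count bytes have been
 * transferred, end of file is reached, or a real error occurs.
 * Returns the number of bytes read (less than count only at EOF),
 * or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);

        if (n == 0)
            break;              /* EOF: nothing more to read */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        done += (size_t)n;      /* short read: ask for the rest */
    }
    return (ssize_t)done;
}
```

The same pattern applies to write( ), with the loop advancing through the buffer until every byte has been accepted.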
The read or write operation always takes place at the file offset specified by the current file pointer (field f_pos of the file object). Both system calls update the file pointer by adding the number of transferred bytes to it.
In short, both sys_read( ) (the service routine of read( )) and sys_write( ) (the service routine of write( )) perform almost the same steps:

1. Invoke fget( ) to derive from fd the address file of the corresponding file object and increment the usage counter file->f_count.

2. Check whether the flags in file->f_mode allow the requested access (read or write operation).

3. Invoke locks_verify_area( ) to check whether there are mandatory locks for the file portion to be accessed (see Section 12.7 later in this chapter).

4. Invoke either file->f_op->read or file->f_op->write to transfer the data. Both functions return the number of bytes that were actually transferred. As a side effect, the file pointer is properly updated.

5. Invoke fput( ) to decrement the usage counter file->f_count.

6. Return the number of bytes actually transferred.
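The dispatch through file->f_op can be mimicked in a user-space toy model. The sketch below is only an illustration under simplifying assumptions: all toy_ names are hypothetical, the "file" is a read-only memory buffer, and the mandatory lock check of step 3 is omitted.

```c
#include <stddef.h>
#include <string.h>

#define TOY_FMODE_READ 1u

struct toy_file;

/* Stand-in for the table of file operations. */
struct toy_file_operations {
    long (*read)(struct toy_file *, char *, size_t);
};

/* Stand-in for the file object. */
struct toy_file {
    unsigned int f_mode;
    long f_pos;
    int f_count;
    const struct toy_file_operations *f_op;
    const char *data;           /* backing bytes of the toy file */
    size_t size;
};

/* A read method for in-memory files: copies bytes starting at f_pos
 * and advances the file pointer by the amount transferred. */
long toy_mem_read(struct toy_file *file, char *buf, size_t count)
{
    size_t left = file->size - (size_t)file->f_pos;

    if (count > left)
        count = left;
    memcpy(buf, file->data + file->f_pos, count);
    file->f_pos += (long)count;
    return (long)count;
}

/* Mirrors steps 1, 2, 4, 5, and 6. Returns -1 on any failure. */
long toy_sys_read(struct toy_file **fdtable, int nfds, int fd,
                  char *buf, size_t count)
{
    struct toy_file *file;
    long ret = -1;

    if (fd < 0 || fd >= nfds || fdtable[fd] == NULL)
        return -1;
    file = fdtable[fd];
    file->f_count++;                                 /* like fget( ) */
    if ((file->f_mode & TOY_FMODE_READ) && file->f_op && file->f_op->read)
        ret = file->f_op->read(file, buf, count);    /* method dispatch */
    file->f_count--;                                 /* like fput( ) */
    return ret;
}
```

Because the caller only goes through f_op, a different filesystem could be modeled by supplying another toy_file_operations table without touching toy_sys_read( ).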
The loop in our example code terminates when the read( ) system call returns the value 0, that is, when all bytes of /floppy/TEST have been copied into /tmp/test. The program can then close the open files, since the copy operation has completed.
The close( ) system call receives as its parameter fd, which is the file descriptor of the file to be closed. The sys_close( ) service routine performs the following operations:

1. Gets the file object address stored in current->files->fd[fd]; if it is NULL, returns an error code.

2. Sets current->files->fd[fd] to NULL. Releases the file descriptor fd by clearing the corresponding bits in the open_fds and close_on_exec fields of current->files (see Chapter 20 for the Close on Execution flag).

3. Invokes filp_close( ), which performs the following operations:

   a. Invokes the flush method of the file operations, if defined.

   b. Releases any mandatory lock on the file.

   c. Invokes fput( ) to release the file object.
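The first step is visible from user space: once a descriptor has been closed, its slot is empty, so closing it again fails with the EBADF error. The following sketch shows this; the helper name close_checked is an assumption made for the example.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: close fd and return 0 on success,
 * or the errno value reported by close() on failure. */
int close_checked(int fd)
{
    if (close(fd) == 0)
        return 0;
    return errno;
}
```

In a single-threaded program, calling close_checked( ) twice on the same descriptor should return 0 the first time and EBADF the second time, provided that no other file is opened in between and reuses the slot.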