We won’t be able to discuss all system calls related to networking. However, we shall examine the basic ones, namely those needed to send a UDP datagram.
In most Unix-like systems, the User Mode code fragment that sends a datagram looks like the following:
int sockfd; /* socket descriptor */
struct sockaddr_in addr_local, addr_remote; /* IPv4 address descriptors */
const char mesg[] = "Hello, how are you?";

sockfd = socket(PF_INET, SOCK_DGRAM, 0);

addr_local.sin_family = AF_INET;
addr_local.sin_port = htons(50000);
addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
bind(sockfd, (struct sockaddr *) &addr_local, sizeof(struct sockaddr_in));

addr_remote.sin_family = AF_INET;
addr_remote.sin_port = htons(49152);
inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
connect(sockfd, (struct sockaddr *) &addr_remote, sizeof(struct sockaddr_in));

write(sockfd, mesg, strlen(mesg)+1);
Obviously, this listing does not represent the complete source code of the program. For instance, we have not defined a main( ) function, we have omitted the proper #include directives for loading the header files, and we have not checked the return values of the system calls. However, the listing includes all network-related system calls issued by the program to send a UDP datagram.
Let’s describe the system calls in the order the program uses them.
The socket( )
system call creates a new endpoint
for a communication between two or more processes. In our example
program, it is invoked in this way:
sockfd = socket(PF_INET, SOCK_DGRAM, 0);
The socket( )
system call returns a file
descriptor. In fact, a socket is similar to an opened file because it
is possible to read and write data on it by means of the usual
read( )
and write( )
system
calls.
The first parameter of the socket( )
system call
represents the network architecture that will be used for the
communication, as well as a particular network layer protocol adopted
by the network architecture. The PF_INET
macro
denotes both the IPS architecture and Version 4 of the IP protocol
(IPv4). Linux supports several different network architectures; a few
of them are shown in Table 18-1 earlier in this
chapter.
The second parameter of the socket( )
system call
specifies the basic model of communication inside the network
architecture. As we already know, the IPS architecture offers
essentially two alternative models of communication:
SOCK_STREAM
Reliable, connection-oriented, stream-based communication implemented by the TCP transport protocol
SOCK_DGRAM
Unreliable, connection-less, datagram-based communication implemented by the UDP transport protocol
Moreover, the special SOCK_RAW
value creates a
socket that can be used to directly access the network layer protocol
(in our case, the IPv4 protocol).
In general, a network architecture might offer other models of
communication. For instance, SOCK_SEQPACKET
specifies a reliable, connection-oriented, datagram-based
communication, while SOCK_RDM
specifies a
reliable, connection-less, datagram-based communication; however,
neither of them is available in the IPS.
The third parameter of the socket( ) system call specifies the transport protocol to be used in the communication; in general, for any model of communication, the network architecture might offer several different protocols. Passing the value 0 selects the default protocol for the specified communication model. Of course, when using the IPS, the value 0 selects the TCP transport protocol (IPPROTO_TCP) for the SOCK_STREAM model and the UDP protocol (IPPROTO_UDP) for the SOCK_DGRAM model. On the other hand, the SOCK_RAW model allows the programmer to specify any one of the network-layer service protocols of the IPS: for instance, the Internet Control Message Protocol (IPPROTO_ICMP), the Exterior Gateway Protocol (IPPROTO_EGP), or the Internet Group Management Protocol (IPPROTO_IGMP).
The socket( )
system call is implemented by means
of the sys_socket( )
service routine, which
essentially performs three actions:
Allocates a descriptor for the new BSD socket (see Section 18.1.3).
Initializes the new descriptor according to the specified network architecture, communication model, and protocol.
Allocates the first available file descriptor of the process and associates a new file object with that file descriptor and with the socket object.
Let’s
return to the service routine of the socket( )
system call. After having allocated a new BSD socket, the function
must initialize it according to the given network architecture,
communication model, and protocol.
For every known network architecture, the kernel stores a pointer to
an object of type net_proto_family
in the
net_families
array. Essentially, the object just
defines the create
method, which is invoked
whenever the kernel initializes a new socket of that network
architecture.
The create
method corresponding to the
PF_INET
architecture is implemented by
inet_create( )
. This function checks whether the
communication model and the protocol specified as parameters of the
socket( )
system call are compatible with the IPS
network architecture; then it allocates and initializes a new INET
socket and links it to the parent BSD socket.
Before terminating, the service routine of socket( ) allocates a new file object and a new dentry object for the socket’s file in the sockfs filesystem; then it associates these objects with the process that raised the system call through a new file descriptor (see Section 12.2.6).
As far as the VFS is concerned, any file associated with a socket is in no way special. The corresponding dentry object and inode object are included in the dentry cache and in the inode cache, respectively. The process that created the socket can access the file by means of the system calls that act on already opened files — that is, the system calls that receive a file descriptor as a parameter. Of course, the file object methods are implemented by functions that operate on the socket rather than on the file.
As far as the User Mode process is concerned, however, the
socket’s file is somewhat peculiar. In fact, a
process can never issue an open( )
system call on
such a file because it never appears on the system directory tree
(remember that the sockfs special filesystem has
no visible mount point). For the same reason, it is not possible to
remove a socket file through the unlink( )
system
call: the inodes belonging to the sockfs
filesystem are automatically destroyed by the kernel whenever the
socket is closed (released).
Once the socket( )
system call completes, a new
socket is created and initialized. It represents a new communication
channel that can be identified by the following five elements:
protocol, local IP address, local port number, remote IP address, and
remote port number.
Only the “protocol” element has been set so far. Hence, the next action of the User Mode process consists of setting the “local IP address” and the “local port number.” These two elements identify the process that is sending packets onto the socket so the receiving process on the remote machine can determine who is talking and where the answers should be sent.[119]
The corresponding instructions in our simple program are the following:
struct sockaddr_in addr_local;

addr_local.sin_family = AF_INET;
addr_local.sin_port = htons(50000);
addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
bind(sockfd, (struct sockaddr *) &addr_local, sizeof(struct sockaddr_in));
The addr_local local variable is of type struct sockaddr_in and represents an IPS identifier for a socket. It includes three significant fields:
sin_family
The protocol family (AF_INET, AF_INET6, or AF_PACKET; this is the same as the macros in Table 18-1).
sin_port
The port number.
sin_addr
The network address. In the IPS architecture, it is composed of a single 32-bit field s_addr storing the IP address.
Therefore, our program sets the fields of the
addr_local
variable to the protocol family
AF_INET
, the port number 50,000, and the IP
address 192.160.80.240. Notice how the dotted notation of the IP
address is translated into a hexadecimal number.
In the 80 × 86 architecture, the numbers are represented in the “little endian” format (the byte at the lower address is the least significant one), while the IPS architecture requires that the numbers be represented in the “big endian” format (the byte at the lower address is the most significant one). Several functions, such as htons( ) and htonl( ), are used to ensure that data is sent in the network byte order; other functions, such as ntohs( ) and ntohl( ), ensure that received data is converted from the network to the host byte order.
The bind( ) system call receives as parameters the socket file descriptor and the address of addr_local. It also receives the length of the struct sockaddr_in data structure; in fact, bind( ) can be used for sockets of any network architecture, as well as for Unix sockets and any other type of socket whose addresses have a different length.
The sys_bind( ) service routine copies the data of the addr_local variable into the kernel address space, retrieves the address of the BSD socket object (struct socket) that corresponds to the file descriptor, and invokes its bind method. In the IPS architecture, this method is implemented by the inet_bind( ) function.
The inet_bind( )
function performs essentially the
following operations:
Invokes the inet_addr_type( )
function to check
whether the IP address passed to the bind( )
system call corresponds to the address of some network card device of
the host; if not, it returns an error code. However, the User Mode
program may pass the special IP address INADDR_ANY
(0.0.0.0), which essentially delegates to the kernel the task of
assigning the IP sender address.
If the port number passed to the bind( )
system
call is smaller than 1,024, checks whether the User Mode process has
superuser privileges (this is the
CAP_NET_BIND_SERVICE
capability; see
Section 20.1.1). However, the User Mode process may
pass the value 0 as the port number; the kernel assigns a random,
unused port number (see below).
Sets the rcv_saddr
and saddr
fields of the INET socket object with the IP address passed to the
system call (the former field is used when looking in the routing
table, while the latter is included in the header of outgoing
packets). Usually, the fields hold the same value, except for special
transmission modes like broadcast and multicasting.
Invokes the get_port
protocol method of the INET
socket object to check whether there already exists an INET socket
for the transport protocol using the same local port number and IP
address as the one being initialized. For IPv4 sockets using the UDP
transport protocol, the method is implemented by the
udp_v4_get_port( )
function. To speed up the
lookup operation, the function uses a per-protocol hash table.
Moreover, if the User Mode program specified a value of 0 for the
port, the function assigns an unused number to the socket.
Stores the local port number in the sport
field of
the INET socket object.
The next operation of the User Mode process consists of setting the
“remote IP address” and the
“remote port number,” so the kernel
knows where datagrams written to the socket have to be sent. This is
achieved by invoking the connect( )
system call.
It is important to observe that a User Mode program is in no way
obliged to connect a UDP socket to a destination host. In fact, the
program may use the sendto( )
and
sendmsg( )
system calls to transmit datagrams over
the socket, each time specifying the destination
host’s IP address and port number. Similarly, the
program may receive datagrams from a UDP socket by invoking the
recvfrom( )
and recvmsg( )
system calls. However, the connect( ) system call is required if the User Mode program transfers data on the socket by means of the read( ) and write( ) system calls.
Since our program is going to use the write( )
system call to send its datagram, it invokes connect( )
to set up the destination of the message. The relevant
instructions are:
struct sockaddr_in addr_remote;

addr_remote.sin_family = AF_INET;
addr_remote.sin_port = htons(49152);
inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
connect(sockfd, (struct sockaddr *) &addr_remote, sizeof(struct sockaddr_in));
The program initializes the addr_remote
local
variable by writing into it the IP address 192.160.80.110 and the
port number 49,152. This is very similar to the initialization of the
addr_local
variable discussed in the previous
section; however, this time the program invoked the
inet_pton( )
library helper function to convert a
string representing the IP address in dotted notation into a number
in the network order format.
The connect( )
system call receives the same
parameters as the bind( )
system call. It copies
the data of the addr_remote
variable into the
kernel address space, retrieves the address of the BSD socket object
(struct
socket
) corresponding
to the file descriptor, and invokes its connect
method. In the IPS architecture, this method is implemented by either the inet_dgram_connect( ) function for UDP or the inet_stream_connect( ) function for TCP.
Our simple program uses a UDP socket, so let’s describe what the inet_dgram_connect( ) function does:
If the socket does not have a local port number, invokes inet_autobind( ) to automatically assign an unused value. In our case, the program issued a bind( ) system call before invoking connect( ), but an application using UDP is not really obliged to do so.
Invokes the connect
method of the INET socket
object.
The UDP protocol implements the INET socket’s
connect
method by means of the
udp_connect( )
function, which executes the
following actions:
If the INET socket already has a destination host, removes it from the destination cache (which is the dst_cache field of the sock object; see the earlier Section 18.1.5).
Invokes the ip_route_connect( ) function to establish a route to the host identified by the IP address passed as a parameter of connect( ). In turn, this function invokes ip_route_output_key( ) to search for an entry corresponding to the route in the route cache (see the earlier Section 18.1.6.2). If the route cache does not include the desired entry, ip_route_output_key( ) invokes ip_route_output_slow( ) to look up a suitable entry in the FIB (see the earlier Section 18.1.6.1). Let’s assume that, once this step terminates, a route is found, so the address of a suitable rtable object is determined.
Initializes the daddr
field of the INET socket
object with the remote IP address found in the
rtable
object. Usually, it coincides with the IP
address specified by the user as a parameter of the connect( )
system call.
Initializes the dport
field of the INET socket
object with the remote port number specified as a parameter of the
connect( )
system call.
Puts the value TCP_ESTABLISHED
in the
state
field of the INET socket object (when used
by UDP, the flag indicates that the INET socket is
“connected” to a destination host).
Sets the dst_cache entry of the sock object to the address of the dst_entry object embedded in the rtable object (see the earlier Section 18.1.5).
Finally, our example program is ready to send messages to the remote host; it simply writes the data onto the socket:
write(sockfd, mesg, strlen(mesg)+1);
The write( )
system call triggers the
write
method of the file object associated with
the sockfd
file descriptor. For socket files, this
method is implemented by the sock_write( )
function, which performs the following actions:
Determines the address of the socket
object
embedded in the file’s inode.
Allocates and initializes a “message
header”; namely, a msghdr
data
structure, which stores various control information.
Invokes the sock_sendmsg( )
function, passing to
it the addresses of the socket
object and the
msghdr
data structure. In turn, this function
performs the following actions:
Invokes scm_send( ) to check the contents of the message header and allocate an scm_cookie (socket control message) data structure, storing into it a few fields of control information distilled from the message header.
Invokes the sendmsg
method of the
socket
object, passing to it the addresses of the
socket object, message header, and scm_cookie
data
structure.
Invokes scm_destroy( )
to release the
scm_cookie
data structure.
Since the BSD socket has been set up specifying the UDP protocol, the
addresses of the socket
object’s
methods are stored in the inet_dgram_ops
table. In
particular, the sendmsg
method is implemented by
the inet_sendmsg( )
function, which extracts the
address of the INET socket stored in the BSD socket and invokes the
sendmsg
method of the INET socket.
Again, since the INET socket has been set up specifying the UDP
protocol, the addresses of the sock
object’s methods are stored in the
udp_prot
table. In particular, the
sendmsg
method is implemented by the
udp_sendmsg( )
function.
The udp_sendmsg( )
function receives as parameters the addresses of the
sock
object and the message header
(msghdr
data structure), and performs the
following actions:
Allocates a udpfakehdr
data structure, which
contains the UDP header of the packet to be sent.
Determines the address of the rtable
describing
the route to the destination host from the
dst_cache
field of the sock
object.
Invokes ip_build_xmit( )
, passing to it the
addresses of all relevant data structures, like the
sock
object, the UDP header, the
rtable
object, and the address of a UDP-specific
function that constructs the packet to be transmitted.
The ip_build_xmit( )
function is used to transmit an IP datagram. It performs the
following actions:
Invokes sock_alloc_send_skb( ) to allocate a new socket buffer together with the corresponding socket buffer descriptor (see the earlier Section 18.1.7).
Determines the position inside the socket buffer where the payload shall go (the payload is placed near the end of the socket buffer, so its position depends on the payload size).
Writes the IP header on the socket buffer, leaving space for the UDP header.
Invokes either udp_getfrag_nosum( )
or
udp_getfrag( )
to copy the data of the UDP
datagram from the User Mode buffer; the latter function also
computes, if required, the checksum of the data and of the UDP header
(the UDP standard specifies that this checksum computation be
optional).[120]
Invokes the output
method of the
dst_entry
object, passing to it the address of the
socket buffer descriptor.
The output
method of
the dst_entry
object invokes the function of the
data link layer that writes the hardware header (and trailer, if
required) of the packet in the buffer.
The output
method of the IP’s
dst_entry
object is usually implemented by the
ip_output( )
function, which receives as a
parameter the address skb
of the socket buffer
descriptor. In turn, this function essentially performs the following
actions:
Checks whether there is already a suitable hardware header descriptor in the cache by looking at the hh field of the skb->dst destination cache object (see the earlier Section 18.1.5). If the field is not NULL, the cache includes the header, so it copies the hardware header into the socket buffer, and then invokes the hh_output method of the hh_cache object.
Otherwise, if the skb->dst->hh
field is
NULL
, the header must be prepared from scratch.
Thus, the function invokes the output
method of
the neighbour
object pointed to by the
neighbour
field of skb->dst
,
which is implemented by the neigh_resolve_output( )
function. To compose the header, the latter function
invokes a suitable method of the net_device
object
relative to the network card device that shall transmit the packet,
and then inserts the new hardware header in the cache.
Both the hh_output
method of the
hh_cache
object and the output
method of the neighbour
object end up invoking the
dev_queue_xmit( )
function.
The dev_queue_xmit( )
function takes care of
queueing the socket buffer for later transmission. In general,
network cards are slow devices, and at any given instant there can be
many packets waiting to be transmitted. They are usually processed
with a First-In, First-Out policy (hence the queue of packets), even
if the Linux kernel offers several sophisticated packet scheduling
algorithms to be used in high-performance routers. As a general rule,
all network card devices define their own queue of packets waiting to
be transmitted. Exceptions are virtual devices like the loopback
device (lo) and the devices offered by various
tunneling protocols, but we don’t discuss these
further.
A queue of socket buffers is implemented through a complex
Qdisc
object. Thanks to this data structure, the
packet scheduling functions can efficiently manipulate the queue and
quickly select the “best” packet to
be sent. However, for the purpose of our simple description, the
queue is just a list of socket buffer descriptors.
Essentially, dev_queue_xmit( )
performs the
following actions:
Checks whether the driver of the network device (whose descriptor is
stored in the dev
field of the socket buffer
descriptor) defines its own queue of packets waiting to be
transmitted (the address of the Qdisc
object is
stored in the qdisc
field of the
net_device
object).
Invokes the enqueue
method of the corresponding
Qdisc
object to append the socket buffer to the
queue.
Invokes the qdisc_run( )
function to ensure that
the network device is actively sending the packets in the queue.
The chain of functions executed by the sys_write( )
system call service routine ends here. As you see, the
final result consists of a new packet that is appended to the
transmit queue of a network card device.
In the next section, we look at how our packet is processed by the network card.
[119] Actually, when a process uses the UDP protocol, it can omit the invocation of the bind( ) system call. In this case, the kernel automatically assigns a local address and a local port number to the socket as soon as the program issues a connect( ) system call or sends its first datagram.
[120] You might wonder why the IP header is written in the socket buffer before the UDP header. Well, the UDP standard dictates that the checksum, if used, has to be computed on the payload, the UDP header, and a 12-byte pseudo-header built from fields of the IP header (including the source and destination IP addresses). The simplest way to compute the UDP checksum is thus to write the IP header before the UDP header.