Advanced – the VM split

What we have seen so far is actually not the complete picture; in reality, this address space needs to be shared between user and kernel space.

This section is considered advanced. We leave it to the reader to decide whether to dive into the details that follow. While they are very useful, especially from a debugging viewpoint, they are not strictly required for following the rest of this book.

Recall what we mentioned in the Library segments section: if a Hello, world application is to work, it needs to have a mapping to the printf(3) glibc routine. This is achieved by having the dynamic or shared libraries memory-mapped into the process VAS at runtime (by the loader program).

A similar argument could be made for any and every system call issued by the process: we understood from Chapter 1, Linux System Architecture, that the system call code actually lives within the kernel address space. Thus, for a system call to succeed, the CPU's Instruction Pointer (the IP or PC register) must be re-vectored to the address of the system call code, which, of course, is within kernel address space. Now, if the process VAS consisted of just the text, data, library, and stack segments, as we have been suggesting so far, how would this work? Recall the fundamental rule of virtual memory: you cannot look outside the box (the available address space).

In order for this whole scheme to succeed, therefore, even the kernel virtual address space (yes, please note, even the kernel address space is considered virtual) must somehow be mapped into the process VAS.

As we saw earlier, on a 32-bit system, the total VAS available to a process is 4 GB. So far, the implicit assumption has been that the top of the process VAS on 32-bit is therefore 4 GB. That's right. A second implicit assumption has been that the stack segment (consisting of stack frames) lies right there, at the 4 GB point at the top. Well, that's incorrect (please refer to Fig 11).

The reality is this: the OS creates the process VAS, and arranges for the segments within it; however, it reserves some amount of virtual memory at the top end for the kernel or OS-mapping (meaning, the kernel code, data structures, stacks, and drivers). By the way, this segment, which contains kernel code and data, is usually referred to as the kernel segment.

How much VM is kept for the kernel segment? Ah, that's a tunable or a configurable that is set by kernel developers (or the system administrator) at kernel-configuration time; it's called VMSPLIT. This is the point in the VAS where we split the address space between the OS kernel and user mode memory – the text, data, library, and stack segments!

In fact, for clarity, let's reproduce Fig 11 (as Fig 14), but this time, explicitly reveal the VM Split:

Figure 14: The process VM Split

Let's not get into the gory details here: suffice it to say that on an IA-32 (Intel x86 32-bit), the splitting point is typically the 3 GB point. So, we have a ratio:

userspace VAS : kernel VAS :: 3 GB : 1 GB (on the IA-32)

Remember, this is tunable. On other systems, such as a typical ARM-32 platform, the split might be like this instead:

userspace VAS : kernel VAS :: 2 GB : 2 GB (on the ARM-32)

On an x86_64 with a gargantuan 2^64 VAS (that's a mind-boggling 16 Exabytes!), only a small fraction of the address space is actually used; with the common 48-bit addressing configuration, it would be:

userspace VAS : kernel VAS :: 128 TB : 128 TB (on the x86_64)

Now one can clearly see why we use the term monolithic to describe the Linux OS architecture – each process is indeed like a single, large piece of stone!

Each process contains both of the following:

  • Userspace mappings

    • Text (code)
    • Data
      • Initialized data
      • Uninitialized data (BSS)
      • Heap
    • Library mappings
    • Other mappings
    • Stack
  • Kernel segments

Every process alive has the kernel VAS (or kernel segment, as it's usually called) mapped into the top end of its own VAS.

This is a crucial point. Let's look at a real-world case: on the Intel IA-32 running the Linux OS, the default value of VMSPLIT is 3 GB (which is 0xc0000000). Thus, on this processor, the VM layout for each process is as follows:

  • 0x0 to 0xbfffffff : userspace mappings, that is, text, data, library and stack.
  • 0xc0000000 to 0xffffffff : kernel space or the kernel segment.

This is made clear in the following diagram:

Fig 15: Full process VAS on the IA-32

Notice how the top gigabyte of VAS for every process is the same – the kernel segment. Also keep in mind that this layout is not the same on all systems: the VMSPLIT and the sizes of the user and kernel segments vary with the CPU architecture.

The prctl(2) system call has long been a part of Linux; since Linux 3.3 it also supports the PR_SET_MM option (and, from Linux 3.10 onward, this option no longer requires a kernel built with CONFIG_CHECKPOINT_RESTORE). Looking up the man page reveals all kinds of interesting, though non-portable (Linux-only), things one could do. For example, prctl(2), used with the PR_SET_MM parameter, lets a suitably privileged process (one with the CAP_SYS_RESOURCE capability, such as a process running as root) essentially specify its VAS layout, its segments, in terms of start and end virtual addresses for text, data, heap, and stack. This is certainly not required for normal applications.
