Some details about Linux memory management
In this web page, I'll try
to give you enough of the truth about Linux memory
management to get your projects done,
without diving into every single detail (that would take
several chapters of a book, as it does in
Understanding the Linux Kernel).
The x86 memory management architecture uses both segmentation and
paging; we will cover both of these concepts in class in several weeks.
Very roughly speaking, a segment is a partition of a process's
address space that has its own protection policy. So, in the x86
architecture, it is possible to split the range of memory addresses
that a process sees into multiple contiguous segments, and assign
different protection modes to each. Paging is a technique
for mapping small, fixed-size (usually 4 KB) regions of a process's
address space, called pages, to equally sized chunks of real, physical
memory. Paging thus controls how regions inside a segment are mapped
onto physical RAM.
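To make the idea of a 4 KB page concrete, here is a minimal sketch of how a 32-bit address splits into a page number and an offset within that page. The function and macro names are made up for illustration, and the frame number passed to physical_address() is a hypothetical value; in a real system the kernel's page tables supply it.

```c
#include <stdint.h>

/* Assumption: 4 KB pages, as mentioned in the text. All names here are
 * illustrative and are not taken from the Linux source. */
#define PAGE_SHIFT_BITS 12u
#define PAGE_BYTES      (1u << PAGE_SHIFT_BITS)   /* 4096 */
#define OFFSET_MASK     (PAGE_BYTES - 1u)         /* low 12 bits */

/* Index of the virtual page that contains addr. */
uint32_t page_number(uint32_t addr)
{
    return addr >> PAGE_SHIFT_BITS;
}

/* Byte offset of addr inside its page. */
uint32_t page_offset(uint32_t addr)
{
    return addr & OFFSET_MASK;
}

/* Combine the physical frame a page is mapped to with the unchanged
 * offset. The frame number is whatever the page tables say it is. */
uint32_t physical_address(uint32_t frame, uint32_t addr)
{
    return (frame << PAGE_SHIFT_BITS) | page_offset(addr);
}
```

For example, address 0x08048123 lies in page 0x08048 at offset 0x123; if that page happened to be mapped to frame 0x00123, the physical address would be 0x00123123.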
In Linux, the OS designers decided to carve up the 32-bit address
space of each process into two segments (there are some details
about additional segments that I'm hiding, but that's OK for now):
- one segment (addresses 0x00000000 through
0xBFFFFFFF) for user-level, process-specific data such
as the program's code, static data, heap, and stack. Every
process has its own, independent user segment.
- one segment (addresses 0xC0000000 through
0xFFFFFFFF) for kernel-specific data such as the
kernel instructions, data, and some stacks on which kernel code can
execute. More interestingly, a region in this segment is
directly mapped to physical memory, so that the kernel can
access physical memory locations without having to
worry about address translation. (We'll talk about address
translation later on in class.) The same kernel segment is
mapped into every process, but processes can access it only
when executing in privileged kernel mode.
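As a rough sketch of the layout above, the following shows how an address can be classified by segment and how a directly mapped kernel address relates to a physical one. It assumes the direct-mapped region begins right at 0xC0000000, which the text does not state, and the names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Boundary between the user and kernel segments, taken from the text.
 * The macro name is illustrative. */
#define KERNEL_BASE 0xC0000000u

/* True if vaddr falls in the kernel segment (0xC0000000..0xFFFFFFFF). */
bool in_kernel_segment(uint32_t vaddr)
{
    return vaddr >= KERNEL_BASE;
}

/* ASSUMPTION: the direct-mapped region starts at KERNEL_BASE, so the
 * translation is a constant offset with no page-table lookup.
 * Only meaningful for addresses inside that region. */
uint32_t direct_map_to_phys(uint32_t vaddr)
{
    return vaddr - KERNEL_BASE;
}

uint32_t phys_to_direct_map(uint32_t paddr)
{
    return paddr + KERNEL_BASE;
}
```

Under that assumption, kernel address 0xC0100000 would correspond to physical address 0x00100000, which is why the kernel can reach physical memory with simple arithmetic.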
So, when a process is executing its program's
instructions in user mode, it may only access addresses below
0xC0000000; any access to an address at or above this boundary
results in a fault. However, when the process begins
executing in the kernel (for instance, after having made a system
call), the CPU's protection level is changed to supervisor mode
(and some segmentation registers are changed), meaning that the
process is thereby able to access addresses at or above 0xC0000000.
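The rule in this paragraph can be modeled in a few lines. This is only a toy model with invented names: it captures the privilege check and deliberately ignores whether the address is actually mapped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; the boundary value comes from the text. */
#define KERNEL_BASE 0xC0000000u

enum cpu_mode { USER_MODE, SUPERVISOR_MODE };

/* Returns true if an access to addr would fault purely because of the
 * user/kernel split. Unmapped pages are not modeled here. */
bool access_faults(enum cpu_mode mode, uint32_t addr)
{
    return mode == USER_MODE && addr >= KERNEL_BASE;
}
```

The same address, say 0xC0001000, faults in user mode but is accessible once the process is running kernel code in supervisor mode.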
Because kernel-critical data structures and the mapping to
physical memory are contained in the kernel segment, it is imperative
that a user-level process can't trick the kernel into unwittingly
reading or writing memory locations in this segment. Therefore, whenever
a user-level process passes an address (e.g., a C pointer) into
the kernel through a system call, that address needs to be carefully
checked. In particular, the kernel needs to make sure that all
such addresses are below 0xC0000000. (There is one more
check that must be made, namely that the address refers to
a piece of the process's address space that has actually been
mapped or allocated, but ignore this for now, since the kernel
uses some clever tricks to do this check transparently to your
system call implementation.)
How do you make sure that the address is legal? You could write
the code to do the check for the boundary 0xC0000000
yourself, but then your system call implementation will contain
assumptions that might change across versions of the Linux kernel;
if later versions of the Linux kernel decide to add a third
segment, or simply change the boundary between user and kernel
segments by a few bytes, your code will suddenly become buggy.
(In fact, this web page may already be out of date with respect
to the current kernel, but the essence of the model is still
correct.)
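For illustration only, here is roughly what such a hand-written check might look like. Note the hard-coded boundary, which is exactly the fragile assumption described above, and note that even this simple check has a subtlety: computing addr + size directly can wrap around in 32-bit arithmetic. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hard-coded boundary: the brittle assumption the text warns about. */
#define USER_LIMIT 0xC0000000u

/* True if the whole range [addr, addr + size) lies below USER_LIMIT.
 * Written without computing addr + size, because that sum can wrap
 * around for large values and make a kernel address look legal. */
bool user_range_ok(uint32_t addr, uint32_t size)
{
    return size <= USER_LIMIT && addr <= USER_LIMIT - size;
}
```

A naive test such as addr + size <= USER_LIMIT would accept addr = 0xFFFFFFF0 with size = 0x20, since the sum wraps to 0x10. Details like this are one more reason to rely on the kernel's own routines.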
Instead, take a look at what other system call implementations (such
as sys_gettimeofday()) do: they make use of various
convenience routines, such as copy_to_user() and copy_from_user(), to
copy bytes to and from user space.
These convenience routines do all of the checking on your behalf,
using another convenience routine called access_ok().
Thus, if the kernel implementation changes, the kernel designers
will modify these convenience routines to be correct, and your
code will continue to work.
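The pattern can be sketched in ordinary user-space C. Everything prefixed with mock_ or fake_ below is a hypothetical stand-in written for this example, not the kernel's actual code; the real routines have different signatures and live inside the kernel. What the sketch does mirror is the calling convention: the copy routine returns the number of bytes it could not copy, and the system call turns any nonzero result into -EFAULT.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical stand-ins for illustration. */
#define MOCK_USER_LIMIT 0xC0000000u   /* boundary from the text */
#define FAKE_PAGE_BASE  0x08048000u   /* made-up mapped user page */
#define FAKE_PAGE_SIZE  4096u

/* Backing store for the one simulated user page. */
unsigned char fake_user_page[FAKE_PAGE_SIZE];

/* Stand-in for the boundary check: is [uaddr, uaddr + size) entirely
 * below the user/kernel boundary? Avoids overflow in uaddr + size. */
int mock_access_ok(uint32_t uaddr, uint32_t size)
{
    return size <= MOCK_USER_LIMIT && uaddr <= MOCK_USER_LIMIT - size;
}

/* Stand-in for a copy-to-user routine. Returns the number of bytes
 * NOT copied, so 0 means success. */
uint32_t mock_copy_to_user(uint32_t uaddr, const void *src, uint32_t n)
{
    if (!mock_access_ok(uaddr, n))
        return n;                       /* address in kernel segment */
    if (uaddr < FAKE_PAGE_BASE || n > FAKE_PAGE_SIZE ||
        uaddr - FAKE_PAGE_BASE > FAKE_PAGE_SIZE - n)
        return n;                       /* models an unmapped address */
    memcpy(fake_user_page + (uaddr - FAKE_PAGE_BASE), src, n);
    return 0;
}

/* A made-up system call that stores a value at a user-supplied
 * address. It never inspects the address itself. */
long mock_sys_get_answer(uint32_t user_ptr)
{
    int32_t answer = 42;

    if (mock_copy_to_user(user_ptr, &answer, sizeof answer) != 0)
        return -EFAULT;
    return 0;
}
```

The point of the design is that mock_sys_get_answer() contains no boundary constant at all. If the boundary moved, only the checking routine would need to change.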