Project 2 - Fair-Share Scheduling

Out: Thursday April 12th, 2001, during sections
Due: Thursday April 26th, 2001, during sections

Assignment Goals

Very useful hint

You might want to take a look at this link before you dive too deeply into this assignment. Note that this link describes the Linux 2.2 kernel, and we're working with the 2.4 kernel in this course. So, some things have changed (in particular, multiprocessor scheduling has been simplified).

Background

Linux actually contains three scheduling policies, one for "general use" and two for a kind of real-time processing. We'll be concerned with the general use policy, which is a descendant of the multi-level feedback approach discussed in the book. However, the other two have some penetration into the code you'll end up modifying, so you need to understand at least a tiny bit about them.

Quoting from "man sched_setscheduler":

SCHED_OTHER is the default universal time-sharing scheduler policy used by most processes, SCHED_FIFO and SCHED_RR are intended for special time-critical applications that need precise control over the way in which runnable processes are selected for execution. Processes scheduled with SCHED_OTHER must be assigned the static priority 0, processes scheduled under SCHED_FIFO or SCHED_RR can have a static priority in the range 1 to 99. Only processes with superuser privileges can get a static priority higher than 0 and can therefore be scheduled under SCHED_FIFO or SCHED_RR. The system calls sched_get_priority_min and sched_get_priority_max can be used to find out the valid priority range for a scheduling policy in a portable way on all POSIX.1b conforming systems.
And further: SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the standard Linux time-sharing scheduler that is intended for all processes that do not require special static priority real-time mechanisms. The process to run is chosen from the static priority 0 list based on a dynamic priority that is determined only inside this list. The dynamic priority is based on the nice level (set by the nice or setpriority system call) and increased for each time quantum the process is ready to run, but denied to run by the scheduler. This ensures fair progress among all SCHED_OTHER processes.
Note that these describe the logical operation of the scheduler;  the actual implementation might (and probably will) not be quite what you'd expect from reading these descriptions.
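
To give you a feel for the gap between the man-page description and the code, here is a rough sketch of the central 2.4 mechanism, paraphrased from memory of kernel/sched.c (so verify it against your own tree): each process carries a counter of remaining clock ticks, schedule() picks the runnable process with the best "goodness" (which depends on counter and nice), and when every runnable process has run its counter down to zero, all counters are refilled from the nice level.

/* Paraphrased sketch of the 2.4 counter-refill step in schedule(); the
 * surrounding condition is simplified (the real code checks whether the
 * best goodness found was zero) -- check kernel/sched.c for the details. */
if (every_runnable_counter_is_zero) {           /* placeholder condition */
        struct task_struct *p;

        read_lock(&tasklist_lock);
        for_each_task(p)
                p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
        read_unlock(&tasklist_lock);
}

Note in particular the (p->counter >> 1) term: a process that slept through the epoch keeps half of its unused quantum, which is roughly how the "fair progress" behavior quoted above comes about.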

As with many schedulers, one of the key concepts in the Linux scheduler is round-robin (RR) scheduling.  Part of the point of RR scheduling is fairness.  In particular, there is fairness among processes, in the sense that each process (of a given priority) receives an equal allocation of CPU time over an appropriately measured interval. One problem with RR scheduling is that it is per process fairness, not per user fairness.  What this means is that a user who starts many processes will receive, in total, a much larger fraction of system resources than will a user who has only a few processes running.

Fair-share schedulers address this problem, using the RR mechanism, but with the goal of providing equal allocation per user (that is, summing over all processes being run for that user), not per process.  As an example, if user A has four times as many processes running as user B, each of user A's processes would receive only one fourth of the allocation per second, say, given to processes for user B, so that users A and B split the CPU resource equally.  Under the default scheduler, user A would receive 80% of the CPU.
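
To make that arithmetic concrete, here is a tiny, purely illustrative helper (nothing here comes from the Linux source) that computes the per-process share under both policies:

/* Hypothetical illustration of the arithmetic above (not kernel code).
 * With 4 processes for user A and 1 for user B:
 *   default RR:  each process gets 1/5, so user A gets 80%, user B 20%
 *   fair-share:  each user gets 1/2, so each of A's processes gets 12.5% */
double rr_share(int total_procs)
{
        return 1.0 / total_procs;
}

double fair_share(int num_users, int user_procs)
{
        return (1.0 / num_users) / user_procs;
}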

Complications:

The examples above assume that all processes are in tight CPU loops.  That isn't the typical case.  Instead, processes perform IO and other blocking operations.  This complicates the definition of "fair share."  (A more detailed discussion of this than you need for this assignment is given in A Coarse Grained Fair Share Scheduler, as well as in other references available on the web.)

Part of this assignment (to be completed in Part 2) is for you to decide just what the objectives of your scheduler are.  The objectives are distinct from the mechanisms used to actually perform the scheduling.  Typically, the mechanisms do not achieve the objectives, except in extreme cases (e.g., all running processes are in tight CPU loops).  In fact, typically it's difficult or impossible to give a convincing argument about just what "should happen" for a given set of real processes.

If it were me doing this assignment, I'd try to adopt the mechanisms and objectives of the current Linux scheduler, subject to the change to fair-share as the overall flavor of the scheduler, on the theory (which is true) that the current scheduler represents a survival-of-the-fittest evolution of the algorithm, based on observations of its performance in actual use over a number of years.

In any case, though, you're free to make whatever decisions you think are appropriate, but you should try to be complete in your Part 2 write-up about what those decisions are.

The Assignment

Before you begin the assignment, log into spinlock or coredump, and install some tools in your directory:
We have created a tar archive which contains the source code and binaries of a number of utilities that you will find useful during the course of this project. We'll describe them below. To extract the files from the archive, go to your working directory and use the following command:
tar -xzf /cse451kernels/jj/fairshare.tar.gz
A directory called fairshare/ will be created, and the utilities will be extracted into it.
 

Project Utilities Overview

You'll be working on the scheduler.  If the system hangs or crashes, you'll know your changes haven't worked.  But, how will you know that they are working?

We've written some utility programs to help with this.  These might be of some use in looking at how the existing scheduler behaves, but are intended primarily to help evaluate your changes.  The source to the utilities is available, but they were written with the intention that you'd never have to look at it.  The source can be used to install the utilities on Linux machines outside of the lab (they're already installed in the lab), or to customize the utilities if you want them to do things they don't already do.

There are two major utilities, one that shows a full-screen display of the fraction of CPU allocated to each user and each user process, and one to help launch a set of applications as if they were launched by a set of distinct users.

Monitor Utility

Command:  monitor [refresh interval in seconds]

The monitor prints a full-screen display showing CPU usage by all user-level processes (that is, excluding root and a few daemons).  The display is updated every "refresh interval" seconds, defaulting to 2 seconds if no parameter is given.  (2 seconds is also the minimum interval allowed.)

Here is some sample output:

Elapsed time: 2.13 seconds
 
testuser: 100% ( 99%, 0%)
( tcsh) 18122 0% ( 0%, 0%)
( piglet) 18168 99% ( 99%, 0%)
( monitor) 18170 0% ( 0%, 0%)
The first line shows the amount of real time that passed between screen updates.  The next line shows that user testuser consumed 100% of the CPU during those 2.13 seconds.  testuser was running three processes: a shell (with PID 18122), a process called piglet, and the monitor itself.  The piglet process used the CPU for 99% of the 2.13 seconds; 99% of the 2.13 seconds was spent in user mode and 0% in system mode.  (Conversion to integers in the monitor program, as well as the Heisenberg principle, causes some loss of precision.)
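
If you want to sanity-check the monitor with a tool of your own, one common way to get numbers like these -- and this is only a guess at how monitor itself works -- is to sample each process's utime and stime (fields 14 and 15 of /proc/<pid>/stat, in clock ticks) on every refresh and report the deltas as a fraction of the elapsed real time:

/* Hypothetical sketch of per-process CPU sampling via /proc; this is a
 * guess at the technique, not the monitor's actual source. */
#include <stdio.h>
#include <unistd.h>

/* utime + stime for pid, in clock ticks, or -1 on error */
long cpu_ticks(int pid)
{
        char path[64];
        unsigned long utime, stime;
        FILE *f;

        sprintf(path, "/proc/%d/stat", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        /* skip pid, comm, state and ten more fields, then read utime and
         * stime (assumes the command name contains no spaces) */
        if (fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   &utime, &stime) != 2) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return (long)(utime + stime);
}

/* fraction of the elapsed real time used between two samples */
double cpu_fraction(long before, long after, double elapsed_seconds)
{
        return (after - before) / (double)sysconf(_SC_CLK_TCK) / elapsed_seconds;
}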

The monitor program accepts two keyboard inputs.  The inputs are acted upon only the next time the screen is updated (for reasons that you can guess from the original client code in project assignment two).  The inputs are:

 
'q'       Quit the program
<space>   Toggle between real-time and CPU-time modes
The sample output explained above is real-time mode: fractions are given relative to the actual elapsed time between refreshes of the screen.  In CPU-time mode, fractions are relative to the total amount of CPU used during that interval.  So, if the refresh interval were exactly 2 seconds, one process were the only one that ran during that time, and it used 1 second of CPU, it would show 50% consumption in real-time mode and 100% consumption in CPU-time mode.

Spawn Utility and Its Cronies

Command: spawn [configuration file]

The spawn command reads a configuration file, each line of which specifies a user ID (UID) and the name of an executable.  It forks a new process for each line and runs the named executable in it.  It also sets the UID of the process to the value given in the line of the configuration file.  This allows you to launch a set of processes as if they had been launched by a set of distinct users.  Because spawn sets the UID of processes, you must be logged in as root to use it.
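
The heart of such a launcher is presumably just fork, setuid, and exec for each configuration line; the following is a hypothetical sketch of that idea, not spawn's actual source:

/* Hypothetical sketch of launching one configuration line the way a tool
 * like spawn might; error handling and path resolution are omitted. */
#include <unistd.h>
#include <sys/types.h>

void launch(uid_t uid, const char *program)
{
        pid_t pid = fork();

        if (pid == 0) {                 /* child */
                setuid(uid);            /* works only if the parent runs as root */
                execl(program, program, (char *)NULL);
                _exit(1);               /* only reached if exec failed */
        }
        /* parent: remember pid so the process can be killed later */
}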

Two sample executables are also provided, piglet and io.  Piglet is just a tight CPU loop.  Io is a program that reads every file on the system (that is accessible to the UID of the process).  The io program is thus very IO bound, and not at all CPU bound.

A sample configuration file, spawn.conf, has also been provided.  Its contents are:

9000 piglet
9000 piglet
9000 piglet
9001 piglet
9001 piglet
When spawn reads this file it will launch five copies of the piglet program, three as user 9000 and two as user 9001.  (These UIDs were chosen because they do not conflict with any real users on the SPL machines.)  Here is output from the monitor program after launching spawn with this configuration file:
Elapsed time: 2.12 seconds
 
testuser:  0% ( 0%, 0%)
( csh) 7760 0% ( 0%, 0%)
( csh) 7847 0% ( 0%, 0%)
( monitor) 7868 0% ( 0%, 0%)
User 9000: 60% ( 60%, 0%)
( piglet) 7840 19% ( 19%, 0%)
( piglet) 7841 20% ( 20%, 0%)
( piglet) 7842 20% ( 20%, 0%)
User 9001: 39% ( 39%, 0%)
( piglet) 7843 19% ( 19%, 0%)
( piglet) 7844 19% ( 19%, 0%)
(The two new user names show up as "User UID" because there is no actual user name for those UIDs on that system.)

Exactly what spawn does is a tiny bit complicated, but it's designed to work for almost anything you're trying to do.  So, you don't have to understand the next paragraph - it should just work.

The complication has to do with (a) where the configuration file is, and (b) where the executables named in the configuration file are.  The rules are these:

    If no configuration file is given on the command line, the file is assumed to exist in the same directory as the spawn executable, and to be named spawn.conf.

    If a configuration file name is given and the path is absolute (e.g., ~/FairShareProject/myspawn.conf), that path is used.  If the path is not absolute (e.g., myspawn.conf), it's interpreted relative to the current working directory.

    Executable file names in the configuration file can be either absolute path names or relative names.  If relative, they are relative to the directory in which the configuration file was found.

One last thing.  The spawn program includes one additional function:  it kills everything it has spawned after 2 minutes.  This is intended to be a convenience, in case you forget to kill something that is either in a tight CPU loop or reading the entire file system.  It is easy to change the 2 minute interval to something else, if you want, or to defeat this entirely.
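
If you do want to change or remove that behavior, the underlying idea is presumably no more than remembering the spawned PIDs, sleeping for the interval, and signalling everything; a minimal, hypothetical sketch:

/* Hypothetical sketch of the "kill everything after 2 minutes" idea. */
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

void reap_after(pid_t *children, int nchildren, unsigned int seconds)
{
        sleep(seconds);                         /* e.g., 120 for two minutes */
        while (nchildren-- > 0)
                kill(children[nchildren], SIGKILL);
}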

One other last thing.  There is absolutely no reason to run piglet or io on any machines besides VMware virtual machines (or your own, if you're working at home), and you should never run them on any shared machine.  This includes all four general purpose instructional machines (ceylon, fiji, sumatra, and tahiti), as well as spinlock and coredump.

The really last thing: If you have trouble with spawn you can also create users on your own and let them run test programs:

To do this, log in as root on VMware and create users with the newusers command.  (See man newusers for the exact syntax.)
newusers expects the following information for each user:
account:password:UID:GID:GECOS:directory:shell

To create a testuser, type the following:
echo testuser:none:101:200::/:/bin/tcsh | newusers

You can log in as several users by switching to different consoles with ALT+F# (e.g., ALT+F1 is the first console, ALT+F2 the second, ...).
 

Assignment Overview

There are 3 parts to this assignment. In Part 1, you will read the source code to the existing scheduler, and design (on paper) the modifications necessary to support fair-share scheduling, as defined above. In Part 2, you will implement, debug, and test your modifications. Part 3 is an extension to your fair-share scheduling and involves creating a new system call which can set user nice levels.

Warning: it should be obvious that if you can find an implementation of fair-share scheduling for Linux on the web, we can find that same implementation. You should do this project without referring to an existing implementation of fair-share scheduling.

Linux Source and Configuration

The most directly relevant code is in .../kernel/sched.c.  It is certain that you will need to look at other source files as well.
You should make decisions appropriate to execution on multiprocessors when writing your code.  Of course, our (virtual) machines are not multiprocessors, so this is a bit of pretending.  Nevertheless, you should write code that would be sensible for execution on a multiprocessor, even though your code has to be tested only on uniprocessors.  This means that you must worry about race conditions.  However, I don't believe that you need to create any new policy or mechanism specific to multiprocessor scheduling.
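
For example, any new bookkeeping that involves walking the task list (say, counting how many runnable processes a user currently owns) should be done under the same locks the existing code uses; in 2.4 that is tasklist_lock for task-list walks and runqueue_lock for run-queue state.  A minimal sketch, assuming a per-user count is something your design needs (the helper itself is made up):

/* Hypothetical helper: count the runnable processes owned by a user.
 * for_each_task, tasklist_lock, p->uid and p->state are real 2.4 names;
 * whether you need such a count at all depends on your design. */
static int count_runnable_for(uid_t uid)
{
        struct task_struct *p;
        int n = 0;

        read_lock(&tasklist_lock);
        for_each_task(p)
                if (p->uid == uid && p->state == TASK_RUNNING)
                        n++;
        read_unlock(&tasklist_lock);
        return n;
}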

Recommended Procedure

It is very inconvenient to have a hard bug in the scheduler - the machine simply won't run your copy of the kernel, and you'll spend days rebooting VMware.

For that reason, I recommend that you (a) think hard about your code before you recompile and install, and (b) make modifications incrementally, not all at once.  If the machine fails to boot, or to run, it's a good guess that the last set of changes you made is the problem.  (It's not certain, of course.)  Making sure that set of changes is small can help find the problem.  Additionally, you might find that instrumenting the code (printk's, or similar sorts of things) helps you, and even that building some additional tools (e.g., a new system call, and a user-level application that invokes it) is worth the effort.
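
One caution about printk's in the scheduler: an unconditional printk on every context switch will bury you in output (and slow the machine down).  Rate-limiting by hand works well enough; a hypothetical sketch, where next is the task schedule() has just chosen and the message contents are placeholders for whatever you actually track:

/* Hypothetical instrumentation inside schedule(): report at most about
 * once per second instead of on every context switch. */
{
        static unsigned long last_report;

        if (time_after(jiffies, last_report + HZ)) {
                last_report = jiffies;
                printk(KERN_DEBUG "fairshare: picked pid %d uid %d counter %ld\n",
                       next->pid, (int)next->uid, next->counter);
        }
}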

One more hint:

Try to modify the scheduler only for users with UID >= 100.  Values between 0 and 99 are typically reserved for system accounts, and experience shows that you will have far fewer system failures if you let root and the daemons do their jobs as they always have ;)
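
In code, that just means guarding your new policy with a UID test so that system processes fall through to the stock behavior; a minimal sketch (the function name is made up):

/* Hypothetical guard: apply fair-share only to ordinary users. */
static inline int fairshare_applies(struct task_struct *p)
{
        return p->uid >= 100;   /* leave root (0) and daemons (1-99) alone */
}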

Part 1: Design your Fair-Share Scheduler
For Part 1 read the source code to the existing Linux scheduler, and design (on paper) how you want your fair-share scheduler to work.

WHAT TO TURN IN FOR PART 1

As part of your writeup for this assignment, you need to answer the following questions:

  1. What does the current Linux scheduler do? Make sure you address all of the following questions:
  2. What are the objectives of your fair-share scheduler? Make sure you tell us how you want to allocate CPU among processes (a) that all belong to a single user, and (separately) (b) that are not in tight CPU loops. That is, tell us what you want "fair-share" to mean.  It is okay, expected even, that some of these questions can be answered best by referring to the implementation (just as exactly what the current scheduler does can be defined fully only by its implementation).
Part 2: Implement your Fair-Share Scheduler
Implement your fair-share scheduler, as you designed it in Part 1.

WHAT TO TURN IN FOR PART 2:

Part 3: Write a usernice program
Nice is a tool which lets you run a program with modified scheduling priority.  Nice takes two parameters: the first is the nice level [in the range from -20 (highest) to 19 (lowest)], the second is the program to execute.  The program is then executed with an adjusted scheduling priority (nice 19 /bin/bash, for example, would run a shell with much less priority, or attention, than usual).  Have a look at sched.c for the details.

The task for Part 3 is to create a program called usernice which takes two parameters.  The first one is the nice level [between -20 and +19] and the second one is the UID (user ID).  The idea is to give users different priorities, so that not every user gets the same fair share; instead, you can give some users more or less attention.
In order to achieve this goal you will have to modify sched.c again and implement user nice levels.  Then you will have to add a new system call to the kernel which can set the user nice levels.  The last step is creating a program which uses this system call in order to alter user scheduling priorities.
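
Once the system call exists, the user-level half can be quite small.  Here is a hedged sketch, assuming you have given the call a number __NR_usernice in the kernel's system call table (the name, number, and argument order are entirely your choice, and your libc headers won't know about them, so you may have to #define the number yourself to match your kernel):

/* Hypothetical user-level usernice.c; __NR_usernice must match whatever
 * slot you assigned when adding the system call to the kernel. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        int nice_level, uid;

        if (argc != 3) {
                fprintf(stderr, "usage: usernice <nice level> <uid>\n");
                return 1;
        }
        nice_level = atoi(argv[1]);
        uid = atoi(argv[2]);

        if (syscall(__NR_usernice, nice_level, uid) < 0) {
                perror("usernice");
                return 1;
        }
        return 0;
}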

WHAT TO TURN IN FOR PART 3:

What to Turn In