Computer Architecture, the Boot Process, the Kernel and systemd

Von-Neumann Architecture –

A brief but interesting history of computing machines can be found at Wikipedia.

Modern day computers are “stored-program” computers that include an instruction set and memory that can hold sets of instructions (i.e. a program). A notable scientists that helped develop the theoretical basis for the stored-program computer is Alan Turing who in 1936 wrote a paper that described a universal computing machine; a machine that can theoretically execute any algorithm. He called it the Universal computing machine. Today we call it the Turing machine.

Another pioneer was Von-Neumann who, while working on the Manhattan Project in 1944, wrote the first logical design of a stored-program computer, for the EDVAC machine. The design was conceptually equivalent to Eckert and Mauchly’s ENIAC machine, developed at the University of Pennsylvania.

The logical description produced by Von-Neumann gave him a great deal of notoriety, and as a result, the architecture of a stored-program computer is dubbed Von-Neumann Architecture.

A high-level diagram of the Von-Neumann architecture for computers is shown below.

As you can see, the computer contains a central processing unit (CPU) as well as random access memory (RAM) and devices for input and output. The CPU contains a control unit and arithmetic/logic unit (ALU) which handle the execution of instructions that are stored in RAM.

Although we’ll delve deeper into computer architecture in CSCI-340, for now, understand the that today’s computers are still based on the Von-Neumann architecture.

Boot Process

In a modern-day computer, the basic input/output system (BIOS) is a program that initializes the computer when you turn it on.

A typical boot sequence includes the following steps.

Power is given to the machine. The chipset sends a reset signal to the CPU until a clean power signal is received.
The CPU is preprogramed to look at a particular memory address in ROM for a jump instruction to the start of the BIOS boot program.
The BIOS boot program starts with a power-on self test (POST). The POST includes the following checks.
1. Verify CPU registers
2. Verify BIOS code
3. Verify DMA (direct memory access) which allows access to memory without CPU
4. Verify timer, interrupt controller
5. Locate, determine size and set up access to main memory
6. Identify and set up access to devices available for booting
7. Identify and set up access to system busses
8. Setup a user interface for BIOS configuration
The BIOS runs the video BIOS that stored on a chip on the video card (dedicated) or included in the BIOS (integrated).
The BIOS looks for other device BIOSs and runs them (eg. IDE/ATA hard disk BIOS).
The BIOS displays a start-up screen allowing access to BIOS configuration.
The BIOS performs other tests like a memory check and displays the results to the screen.
The BIOS tests and configures hardware like hard drives, COM and LPT ports.
The BIOS detects and configures Plug and Play devices.
The BIOS displays a summary screen.
The BIOS determines the boot drive.
The BIOS locates the boot sector on the drive.
The BIOS loads the boot loader code in the boot sector into memory and starts running the boot loader.
The boot loader loads and starts running a second-stage boot loader like GNU GRUB, located in the master boot record in the first sector of the bootable drive.
The second-stage boot loader loads the file system drivers, displays a menu to choose a kernel and loads the kernel and passes control to the kernel.

The Kernel

An operating systems contains numerous programs that together manage the resources of the computer. The primary program in an operating system is called the kernel. The kernel handles the following operating system functions (among others):

Process creation and CPU scheduling
Memory management
Interprocess communication
Device I/O (including networks)
Managing the file systems

Throughout the semester, we’ll be discussing aspects common to all operating systems. But to see how we, as programmers, can utilize the services an operating system provides, we’ll be using the Linux operating system. As such, it may be useful to refer to the various articles found here which which discuss many aspects of the Linux operating system and these which focus on the Linux kernel.

Loading the Linux Kernel

The Linux kernel is loaded in two stages:

The kernel is loaded into memory as a compressed image file (bzImage or zImage). It is decompressed, a few functions such as basic memory control are set up and control is switched to the kernel by calling startup().
startup() (also called process 0) established memory management, detects and configures hardware capabilities then calls start_kernel().
start_kernel() performs a wide range of initializations, then find, loads, decompresses and starts Initrd.
Initrd loads all necessary drivers. Next, it initializes virtual devices related to the file system. Then it creates a root device and mounts the root partition. At this point kernel is loaded into memory and operational.

The kernel, as well as the various subsystems started by the kernel, stay loaded in RAM while the computer is running.

The First Process, Systemd

While the kernel performs the core functions of the operating system, other programs are needed to make the system usable.

On most Linux systems today, the kernel starts systemd, a program that:

Manages on-demand starting of daemons like windowing system
Manages network configurations
Manages user sessions
Manages the console
Manages virtual devices (/dev)
Contains a cron-style task scheduler
Starts an event logging facility
Provides system state snapshot and restore
Allows power management
Becomes parent of orphaned children.

These system processes are usually performed by daemon processes. A daemon is a process that runs in the background without user interaction.

Systemd, initially released in 2010, replaced the inherited initd process. Most major Linux distributions have adopted systemd.

OSX has a similar daemon called launchd – but is not so ambitious.

Systemd has PID 1. If a parent of a child process dies, systemd becomes the child’s parent. When a process dies, systemd reclaims its resources.

Kernel Mode, User Mode, and System Calls

Kernel Mode vs. User Mode –

The CPU executes instructions on behalf of the kernel, other system programs and user programs.

When executing instructions on behalf of the kernel, the system is in kernel mode (kernel space)
When executing instructions on behalf of other system programs and user program, it is in user mode (user space)

There is a mode bit in the hardware to distinguish which mode the computer is running in. When the system is in kernel mode the system is executing trusted code. This trusted code has full access to all of the resources on the computer. When in user mode, processes must make system calls to request access to system resources.

System Calls

When a user space program needs access to system resources it may request access by calling system functions or by calling library functions which in turn make system calls.

System calls are calls to trusted kernel code that provide access to system resources. Many UNIX based systems (e.g. MacOS, Linux) provide a set of system calls that are POSIX (Portable Operating System Interface) compliant.

When a system call is made, the following process occurs:

The call is made through a wrapper macro that was compiled into the program binary
The wrapper assembly code pushes the parameters onto a parameter stack and requests the CPU switch from user mode (if in user mode) to kernel mode and start executing a specific routine using an interrupt instruction.
The kernel then takes over and identifies which system call was called and executes the correct system-call handler using the data on the stack.
After the system call is executed, a value is usually returned (and sets errno if the return code is negative) and the CPU may switch to user mode.

Resource: http://www.tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html

There are dozens of system calls available for application programmers. A quick reference can be found here. POSIX compliant operating systems also include manual pages that provide information on system functions and C standard library functions. The manual pages can be accessed via the web or in a terminal. To view a manual page in a terminal, simply type man followed by the name of the function on the command line.

Making System Calls in C

Every system call has its declaration defined in a header file. To determine the name of the header file, refer to the system call’s man page. When you’ve identified the name of the header file, reference the header file in your source code with an include statement as shown below.

#include <header_file_name.h>

As mentioned above, most system calls return a negative number upon error and set a global variable named errno to an integral value which can be used to identify the error that occurred when the system call was being executed. You can determine what values are returned by a system call by reading the system call’s man page. When calling a system function, it is important to check the return value of the system call. Often, an application program can not proceed if a system call does not complete properly. The C standard library function strerror() can be used to print a human readable error message based on the value in the global variable errno. The header file for strerror is <string.h>.

For example, kill() is a system call that sends a signal to a running process based on the process’ process identifier (pid). Upon error, kill() returns -1. To properly call kill we should check the value returned by kill() and if it is -1, we should log the error and then act accordingly.

pid_t pid;
int signo;
...

if (kill(pid, signo) == -1) {
    printf("Error calling kill: %s\n", strerror(errno));
    ...
}

Processes

At any one moment in time only a single program is being executed on a CPU. In a second of time, multiple programs may have run, each for perhaps 100 milliseconds, which makes it appear as if the processor is running programs in parallel.

Since the processor is switching back and forth between programs, the system must keep track of data for each program. The stored data for a running program is called a process. The act of swapping one processes off the CPU for another is called a context switch.

A process has memory (not necessarily contiguous) allocated for it to store

the executable code that is running
a stack
heap
environmental variables
initialized global variables
BSS (block started by symbol) that contains uninitialized variables

Each process has a unique process identifier (pid). You can view the pids (and other information) of the process running on a Linux machine by issuing the following command on the command line.

ps -ax

Process Creation

The fork() system call is the only way to create a new process in a POSIX compliant operating system.

#include <unistd.h>
pid_t fork(void);

When fork() is executed it creates a new child process. The child process is a near identical copy of the parent process that called fork(). The two processes (parent and child) share the same address space where their executable code is stored, but they have different stacks and context switch state information so that when they execute they can diverge in their execution. Immediately, following the call to fork(), the child’s stack contains the same information as its parent’s stack and both processes continue executing at the point of execution following the call to fork() that created the child. The fork() method however returns a value of 0 to the child process and the pid of the child process to the parent process. This allows the programmer to inspect the return value for fork() and have the child process do one thing and the parent process do another.

Changing the Executable Code of a Process

Fork() is often used to create a new process that runs a different executable file than that of the parent. This is accomplished by using one of the C standard library exec functions defined in <unistd.h>. These functions do not return on success. Instead the text, data, and stack of the calling process are replaced by that of the program loaded and execution begins at the entry point of the new executable.

    pid_t pid;
    int status;

    if ((pid = fork()) == 0) {                    // child process
        char *path = "./print_time";
        char *cmd = "print_time";
        char *arg = "Child 1";

        if(execl(path, cmd, arg, NULL) == -1) {
            printf("%s\n", strerror(errno));
            exit(1);
        }
    } else {                                      // parent process
        if (wait(&status) == -1) {
            printf("%s\n", strerror(errno));
            exit(1);
        }
    }

The execl() function takes a c-string specifying the path to an executable file followed by a NULL terminated list of arguments that are to be passed to new executable file. The first argument in the list is traditionally the name of the executable. In the example above, we assume there is a program named print_time that resides in the current working directory, hence the use of “./” in the path variable. We also assume that print_time takes a string (“Child 1”) as an argument. Other examples can be found here.

The wait() function, defined in <sys/wait.h>, halts execution of the calling process until a child process’ state changes (e.g. it terminates) or a signal handler interrupts the call.

/proc

All information about a process’ state is stored by the kernel in memory. On Linux, the kernel maintains what is known as the /proc file system for this purpose. The state information stored includes:

process state
program counter (next instruction)
data in CPU registers
scheduling information
memory management information
accounting information
I/O status information

The data in /proc is mounted to the root file system and can be view by a root user.

Process States

In between creation and termination, a process can be in one of three states:

Running (actually using the CPU at that instant)
Ready (runnable; waiting to run on the CPU)
Blocked (unable to run until some external event happens)

The scheduler is the process that decides which process will run and for how long.

4 transitions are possible:

Ready -> Running (scheduler decides to run the process)
Running -> Ready (context switch)
Running ->Blocked (I/O or event wait)
Blocked -> Ready (I/O or event completion)

The states listed above are generalized and should not be confused with the actual states used by a specific kernel. For example, in the Linux kernel, there are 5 states. A process that is in the TASK_RUNNING state is either running on the CPU or is in the runnable queue.

Process Termination

Processes can terminate due to the following

normal end of code
gracefully termination by programmer after error (using exit())
terminated by OS due to fatal error
killed by another process (using kill())

The kill() system call can be used to send a signal to the OS requesting a process be terminated.

int kill(pid_t pid, int sig);

The parameters to kill include an integer indicating the pid of the process to which the signal is to be sent and an integer indicating the type of signal to be sent.

Two common signals are SIGTERM (15) and SIGKILL (9). If SIGTERM is received, the OS may forward the signal to the process. If the process has registered a signal handler it can catch the signal and execute the code in the signal handler rather than automatically terminating. The SIGKILL signal requests immediate termination and if the process which issues the kill() system call has authority over the killee, the process is terminated.

A list of signals for a Linux system can be found at /usr/include/bits/signum.h. An example showing the GNU C signum.h file can be seen here.

Example

#include <stdio.h>

void main() {
    while(1) {
        printf(“X ”);
    }
}

If you compile and run the example above you will see “X” being repeatedly printed to stdout. By pressing CTRL+C, the shell will send the SIGINT signal to the process causing it to terminate.

You can also send a signal to a process using the kill command on the command line as shown below. Here the -9 flag indicates that a SIGKILL signal should be sent to the process having the pid 5678.

kill -9 5678

Orphan and Zombie Processes

If a parent process terminates before a child process, the child process is called an orphan process and is inherited by systemd.

If the child process terminates while the parent is still running, the child’s process information stays in memory until the parent calls wait() which returns the child’s return status. While the child’s process information is in memory, the child’s process (though terminated) is called a zombie process.

System Limits, Process Limits, & Process Usage

System Limits

The amount of RAM, the CPU speed, the number of currently running processes, and other factors limit the resources available to a process. The operating system is responsible for establishing and enforcing the resource limits.

Recall that the kernel maintains a virtual file system (/proc) to maintain system information. For example, when on Linux, we can display the total number of files that can be open at a time using the following command:

cat /proc/sys/fs/file-max

Programmatically, we can call sysconf() which returns some of the values of the system limits.

#include <unistd.h>
long sysconf(int resource)

Sysconf() takes as a resource identifier and returns a long value. Some of the resource identifiers are:

_SC_CHILD_MAX: The max number of simultaneous processes per user ID.
_SC_CLK_TCK: The number of clock ticks per second.
_SC_OPEN_MAX: The maximum number of files a process can have open at any time.
_SC_STREAM_MAX: The maximum number of streams that a process can have open at any time.

Getting Process Limits

“The resource limits for a process are normally established by (the kernel) when the system is initialized and then inherited by each successive process. Each implementation has its own way of tuning the various limits” (Advanced Programming in UNIX, section 7.11). On Linux, the kernel creates directories named /proc/#, where # is a process’ pid, for each user process. The kernel uses this space to store process information.

We can programmatically determine some of the limits on a process’ resources using the getrlimit() function.

#include <sys/time.h>
#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlptr)

Getrlimit() takes a resource identifier and a pointer to a struct rlimit object (shown below) as parameters. The function determines a resource limit (current and max) for the current process and sets the values in the struct rlimit. We can then retrieve the limits by accessing the struct fields.

struct rlimit {
    rlim_t rlim_cur;          /* soft limit - current limit */
    rlim_t rlim_max;          /* hard limit - maximum value for rlim_cur */
};

Some notes:

The rlim_t type is defined as an unsigned long long. The C standard says an unsigned long long must be at least 64 bits (8 bytes). We can print the value of sizeof(unsigned long long) if we need that information.
An infinite limit is specified by the constant RLIM_INFINITY.
<sys/resource.h> specifies the types of resources and structures for resource limits.

Below is a list of resources that have limits we can query. All of the resources available can be found on the getrlimit() man page.

RLIMIT_CPU: The maximum amount of CPU time in seconds. When the soft limit is exceeded, the SIGXXCPU signal is sent to the process.
RLIMIT_DATA: The maximum size in bytes of the data segment (initialized data, uninitialized data and heap – does not include the stack).
RLIMIT_STACK: The maximum size in bytes of the stack.
RLIMIT_NOFILE: The maximum number of open files per process.
RLIMIT_AS: The maximum size in bytes of a process’ total available memory.
RLIMIT_NPROC: The maximum number of child processes per real user ID.

Setting Process Limits

A process can change its own limits and can change other process’ limits, if it is privileged. To do so, it uses the setrlimit() function.

#include <sys/time.h>
#include <sys/resource.h>
int setrlimit(int resource, const struct rlimit *rlptr)

Setrlimit() attempts to change the current and maximum limit for a process resource. When using this method, values for both limits must be set. It is sometimes helpful to call getrlimit() to get the current limits. The following rules govern the changing of resource limits for processes:

A process can change its soft limit up to its hard limit.
A process can lower its hard limit down to its soft limit. This process is irreversible for normal users.
Super users processes can raise a hard limit.

Exceeding System Limits

Question: Can a superuser increase the hard limit above the system limit?

Answer: Yes. According to the setrlimit() man page, “A privileged process (superuser) may make arbitrary changes to either limit value.”

Question: What happens when a system limit has been reached and a system call tries to exceed the system limit?

Answer: The system call will usually return -1 and set errno to an appropriate value. For example, if the user has a soft limit of unlimited for the number of open files and the system has reached its max, according to the man page for open(), the system call will return -1 and set errno to ENFILE: “The system limit on the total number of open files has been reached”.

Resource Usage

We can get information about the amount of resources currently used by a process using getrusage().

#include <sys/time.h>
#include <sys/resource.h>
int getrusage(int who, struct rusage *usage)

The first parameter of the function identifies whose statistics should collected. The valid values for who are:

RUSAGE_SELF (the calling process)
RUSAGE_CHILDREN (all children that have terminated and have been waited for)
RUSAGE_THREAD (the calling thread)

The second parameter is a pointer to a struct (shown below) that will hold the statistics.

struct rusage {
    struct timeval ru_utime; /* user CPU time used */
    struct timeval ru_stime; /* system CPU time used */
    long   ru_maxrss;        /* maximum resident set size */
    long   ru_ixrss;         /* integral shared memory size */
    long   ru_idrss;         /* integral unshared data size */
    long   ru_isrss;         /* integral unshared stack size */
    long   ru_minflt;        /* page reclaims (soft page faults) */
    long   ru_majflt;        /* page faults (hard page faults) */
    long   ru_nswap;         /* swaps */
    long   ru_inblock;       /* block input operations */
    long   ru_oublock;       /* block output operations */
    long   ru_msgsnd;        /* IPC messages sent */
    long   ru_msgrcv;        /* IPC messages received */
    long   ru_nsignals;      /* signals received */
    long   ru_nvcsw;         /* voluntary context switches */
    long   ru_nivcsw;        /* involuntary context switches */
};

Process CPU Time

We can see how much time a process has executed on the CPU using the times() function. Some of these usage statistics are also supplied by getrusage().

<sys/times.h>
clock_t times (struct tms *buffer)

Times() returns the number of clock ticks from some arbitrary but particular point in time (like system start-up). The tms struct has the following fields:

tms_utime: the CPU time used during the execution of user instructions of the calling process
tms_stime: the CPU time used by the system on behalf of the calling process
tms_cutime: the recursive sum of tms_utime and tms_cutime of all child processes /* only for children that the process waited for using a wait() function */
tms_cstime: the recursive sum of tms_stime and tms_cstime of all child processes /* only for children that the process waited for using a wait() function */

We can get the number of ticks per second by passing _SC_CLK_TCK to sysconf(). The function will return the number of clock ticks per second.

Host Information

We can get information about the host machine using uname().

#include <sys/utsname.h>
int uname(struct utsname *name)

uname() takes a pointer to a struct utsname object and sets the fields of that struct. These fields include:

sysname: Name of the operating system implementation
nodename: Network name of this machine
release: Release level of the operating system
version: Version level of the operating system
machine: Machine hardware platform

User IDs

Every process has 6 or more user ids associated with it. The real user ID and real group ID specify who is running the program. They are taken from the password file when the user logs in. These can be changed by super user.

real user ID
real group ID

We can set a special bit in the file’s mode word (suid bit) that says, “when this file is executed, set the effective user ID of the process to be the owner of the file”.

effective user ID
effective group ID

We can set the suid bit of an executable file using chmod. We preface the usual 3-digit argument to chmod with a 4^th digit with value 4 to turn the suid bit on. For example:

$ chmod 4755 executable_file

If the file is owned by root, when the file as executed it would run as root. This is what happens, for example, when you run passwd.

The original effective user and group ids that are set by the exec() function are saved.

saved set-user-ID
saved set-group-ID

We can get the current ids using the following system functions.

#include <unistd.h>
uid_t getuid(void)                /* get the user id of the calling process */
uid_t geteuid(void)               /* get the effective user id */
gid_t getegid(void)               /* get the effective group id */
get_t getgid(void)                /* get the gid of the calling process */

We can change the current ids using the following system functions.

int setuid(uid_t uid)
int setgid(gid_t gid);
int seteuid(uid_t uid);
int setegid(gid_t gid);

Process Ids

We can get the pid and parent pid for the calling process using the following functions:

#include <unistd.h>
pid_t getpid(void)               /* get the pid of the calling process */
pid_t getppid(void)              /* get the parent pid of the calling process */

Login Name

A program, when run by a super user, can get the username of the person running the process by calling getlogin().

#include <unistd.h>
char* getlogin(void)