0x321 Concurrency

This page describes the design and implementation of concurrency in the kernel (mostly Linux).

1. CPU Virtualization

Concept (Limited Direct Execution) For performance, it is desirable that each process run directly on the CPU, which raises the question of how the OS can regain control of the CPU while a process is running. There are two ways to solve this.

cooperative approach This assumes that each process behaves reasonably and gives control back to the OS frequently. Two mechanisms allow the transfer of control:

  • normal system call: a syscall naturally transfers control to the kernel
  • yield system call: a yield system call does nothing but transfer control. Some early OSes (e.g. Mac OS 9) used this approach, but it is not a good idea in general; consider, for example, a process stuck in an infinite loop

non-cooperative approach Another way is to use the timer interrupt. The timer raises an interrupt on the scale of milliseconds, and the kernel's interrupt handler can then manage processes.

When the interrupt is triggered, the procedure is similar to a system call. The hardware has some responsibilities (e.g. saving the running state of the program to the kernel stack) so that return-from-trap works.

During the interrupt, the OS can decide to switch to another process (context switch): it saves a few register values of the current process and restores those of the next one, so the return-from-trap execution resumes the new process instead of the old one.
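
The save/restore mechanics can be imitated in user space with the POSIX ucontext API. This is only an analogy and a minimal sketch; a real kernel performs the switch inside the timer-interrupt handler and saves state to the kernel stack:

    /* Userspace illustration of context-switch mechanics via ucontext. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task(void) {
        puts("task: running");
        swapcontext(&task_ctx, &main_ctx);  /* save own state, resume main */
        puts("task: resumed");
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;       /* resume main when task returns */
        makecontext(&task_ctx, task, 0);

        swapcontext(&main_ctx, &task_ctx);  /* save main's registers, load task's */
        puts("main: regained control");
        swapcontext(&main_ctx, &task_ctx);  /* switch back to finish the task */
        puts("main: done");
        return 0;
    }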

1.1. Scheduling

The most basic scheduling approaches assume that the run-time of each job is known. They fall into two groups.

The first group minimizes the turnaround time, defined as the average of \(T_{turnaround} = T_{completion} - T_{arrival}\) over all jobs

  • Shortest Job First (SJF): literally run the shortest job first
  • Shortest Time-To-Completion First (STCF): SJF may commit to a heavy job that arrived when it was the only one. In a preemptive OS, the scheduler can instead reschedule jobs based on their remaining run-time.
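
For example, with jobs A, B, and C arriving at \(t=0\) with run-times 10, 20, and 30, SJF runs them as A, B, C and yields an average turnaround of \((10 + 30 + 60)/3 \approx 33.3\); running them in the reverse order would yield \((30 + 50 + 60)/3 \approx 46.7\).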

The second group targets the response time, defined as the time from a job's arrival to its first run: \(T_{response} = T_{firstrun} - T_{arrival}\)

  • round robin (RR): cycle through the jobs, running each for a time slice. The time slice should not be too short, because context switches have a non-trivial cost.

There is a trade-off between these two groups. Additionally, the assumption that the run-time of each job is known is unrealistic. In the real world, the following schedulers are commonly used.

1.1.1. Multi-level Feedback Queue

There are multiple queues corresponding to different job priorities. The scheduler runs with the following rules:

  • A higher-priority queue runs before lower-priority ones; jobs within a queue run round-robin
  • When a new job arrives, it is put into the highest priority queue
  • When a job consumes its time allotment, its priority is reduced
  • After some period, all jobs will move up to the top priority (priority boost)

This scheduler both reduces response time (by keeping interactive jobs in the higher-priority queues) and alleviates the starvation issue (by the priority boost).
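
As a rough illustration (not a real scheduler), here is a toy simulation of the demote-on-allotment and boost rules; the job names and parameters are made up, and round-robin within a queue is omitted for brevity:

    /* Toy MLFQ: 3 priority levels, demotion after a fixed allotment,
     * periodic priority boost. */
    #include <stdio.h>

    #define NQUEUE    3
    #define ALLOTMENT 2    /* ticks a job may use at one priority level */
    #define BOOST     10   /* period of the priority boost, in ticks */

    struct job { const char *name; int remaining, prio, used; };

    int main(void) {
        struct job jobs[] = { {"interactive", 4, 0, 0}, {"batch", 12, 0, 0} };
        int njobs = 2;
        for (int tick = 0; tick < 32; tick++) {
            if (tick % BOOST == 0)               /* rule: boost all jobs */
                for (int i = 0; i < njobs; i++)
                    jobs[i].prio = jobs[i].used = 0;
            struct job *cur = NULL;              /* highest-priority runnable job */
            for (int p = 0; p < NQUEUE && !cur; p++)
                for (int i = 0; i < njobs && !cur; i++)
                    if (jobs[i].remaining > 0 && jobs[i].prio == p)
                        cur = &jobs[i];
            if (!cur) break;                     /* all jobs finished */
            printf("tick %2d: run %s (prio %d)\n", tick, cur->name, cur->prio);
            cur->remaining--;
            if (++cur->used >= ALLOTMENT && cur->prio < NQUEUE - 1) {
                cur->prio++;                     /* rule: allotment used, demote */
                cur->used = 0;
            }
        }
        return 0;
    }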

Schedulers can be given hints about priority (e.g. the nice command).

Many real-world OSes use this family of schedulers (e.g. Solaris, the BSD family, the Windows NT family...).

2. Process

2.1. Process Scheduler

2.1.1. O(n) Scheduler

Used up to Linux 2.4: on every scheduling decision it iterates over the whole list of runnable tasks to pick the next one.

2.1.2. O(1) Scheduler

Introduced in early Linux 2.6: per-CPU priority arrays with bitmaps make picking the next task a constant-time operation.

2.1.3. Completely Fair Scheduler

The default since Linux 2.6.23: it tracks each task's virtual runtime in a red-black tree and always runs the leftmost (least-served) task.

2.2. POSIX

API (fork (2), vfork (BSD))

  • return value is -1 on error (e.g. RLIMIT_NPROC reached), 0 in the child process, and the child's PID in the parent process
  • file descriptors are shared between child and parent after fork (e.g. an lseek in the child is visible from the parent)
  • child and parent use different virtual page tables, but both point to the same physical pages. The text segment is read-only, so it need not be copied; the data, heap, and stack pages are copy-on-write, meaning a new page is assigned when either the child or the parent modifies one
  • the parent process usually runs first by default after Linux 2.6.32, controllable via /proc/sys/kernel/sched_child_runs_first. Running the child first reduces page copying when it immediately calls exec; running the parent first avoids TLB flushes and CPU state swapping
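
A minimal sketch of the return-value convention and the shared file offset; /etc/hostname is just an arbitrary readable file chosen for illustration:

    /* fork: -1 on error, 0 in the child, child's PID in the parent;
     * the open file offset is shared across the fork. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd == -1) { perror("open"); exit(1); }

        pid_t pid = fork();
        if (pid == -1) {               /* error, e.g. RLIMIT_NPROC reached */
            perror("fork"); exit(1);
        } else if (pid == 0) {         /* child */
            lseek(fd, 3, SEEK_SET);    /* moves the offset shared with the parent */
            _exit(0);
        }
        waitpid(pid, NULL, 0);         /* parent */
        printf("offset seen by parent: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
        return 0;
    }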

API (_exit (2), exit (3))

  • _exit is the system call; exit is its wrapper in libc
  • exit performs several actions before _exit: running exit handlers and flushing stdio stream buffers
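
A small sketch of the difference; replacing the final exit(0) with _exit(0) would silently drop both the handler and the buffered output:

    #include <stdio.h>
    #include <stdlib.h>

    static void handler(void) { puts("\nexit handler ran"); }

    int main(void) {
        atexit(handler);            /* register an exit handler */
        printf("buffered output");  /* no newline: stays in the stdio buffer */
        exit(0);                    /* runs the handler, then flushes stdio */
    }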

API (clone (2))

  • finer control over process creation
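
A sketch of that finer control using the glibc clone() wrapper: the child shares the parent's address space via CLONE_VM, roughly the building block that thread libraries use:

    /* clone with CLONE_VM: the child writes to memory the parent sees. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    #define STACK_SIZE (64 * 1024)
    static int shared = 0;

    static int child_fn(void *arg) {
        shared = 42;                 /* visible to the parent thanks to CLONE_VM */
        return 0;
    }

    int main(void) {
        char *stack = malloc(STACK_SIZE);
        if (!stack) exit(1);
        /* the stack grows down on x86, so pass the top of the allocation */
        pid_t pid = clone(child_fn, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
        if (pid == -1) { perror("clone"); exit(1); }
        waitpid(pid, NULL, 0);
        printf("shared = %d\n", shared);   /* prints 42 */
        free(stack);
        return 0;
    }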

2.3. Windows API

Windows does not have fork; use spawn-style creation (fork + exec in one step) via CreateProcess instead.

API (CreateProcess)
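
A minimal sketch using the ANSI variant CreateProcessA; notepad.exe is an arbitrary example program:

    /* Spawn a child process and wait for it: roughly fork + exec + wait. */
    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        STARTUPINFOA si = { .cb = sizeof(si) };
        PROCESS_INFORMATION pi;
        char cmd[] = "notepad.exe";   /* CreateProcess may modify this buffer */

        if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
            fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
            return 1;
        }
        WaitForSingleObject(pi.hProcess, INFINITE);  /* analogous to wait() */
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
        return 0;
    }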

3. Thread

NPTL Design

There are two main Linux threading implementations for POSIX threads:

  • LinuxThreads: the original implementation, with many deviations from SUSv3. No longer provided since glibc 2.4
  • Native POSIX Threads Library (NPTL): the modern implementation, with better performance and closer compliance with SUSv3

3.1. Linux

3.1.1. Thread Creation

API (pthread_create (3))

  • specify a start function and its argument to create a new thread

API (pthread_exit, pthread_join, pthread_detach (3))

  • pthread_detach: a detached thread can no longer be joined
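
A minimal create/join sketch (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        printf("hello from thread %d\n", *(int *)arg);
        return NULL;                /* same effect as pthread_exit(NULL) here */
    }

    int main(void) {
        pthread_t tid;
        int id = 1;
        if (pthread_create(&tid, NULL, worker, &id) != 0) return 1;
        pthread_join(tid, NULL);    /* would fail if the thread were detached */
        return 0;
    }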

3.1.2. Thread Synchronizations

3.1.2.1. Mutex

Currently, pthread mutex on Linux is mainly implemented with futex (fast userspace mutex).

API (pthread_mutex_lock, pthread_mutex_unlock (3))

  • pthread_mutex_lock blocks if the mutex has already been acquired by another thread
  • lock and unlock each cost roughly 25 ns
  • static allocation: PTHREAD_MUTEX_INITIALIZER
  • dynamic allocation: pthread_mutex_init
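
A minimal sketch protecting a shared counter with a statically allocated mutex (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* static allocation */
    static long counter = 0;

    static void *incr(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* blocks while another thread holds it */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, incr, NULL);
        pthread_create(&b, NULL, incr, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 with the lock */
        return 0;
    }
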
3.1.2.2. Condition Variables

3.1.3. Green Thread

User-space level threads, supported by a runtime or VM rather than the kernel

  • non-preemptive (each thread must yield control by itself)
  • less expensive than kernel threads

3.2. Windows

Unlike Linux, threads are implemented at the OS level in Windows and SunOS.

4. Event-Driven

User-level threads are implemented by user code; the OS does not recognize them.

  • Advantage: less context-switch time
  • Disadvantage: blocking operations (e.g. IO); one user thread blocking on IO blocks the entire process, and therefore all user threads

In general, user-level threads can be mapped onto kernel threads using one of four models.

  • Many-to-one
  • One-to-one
  • Many-to-many
  • Two-level

5. Synchronization

Synchronization implementations are usually hardware dependent, but synchronization can also be implemented in software by taking advantage of shared memory (e.g. Peterson's algorithm).

6. Synchronization Primitives

A race condition happens when multiple threads enter a critical section at roughly the same time and attempt to modify shared data. This is possible because updates are usually not executed atomically. For instance, an increment on x86 typically takes three instructions: load the memory into a register, update the register, and store it back to memory. To prevent these issues, we can build a set of synchronization primitives.
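
A minimal sketch of the lost update: two threads increment a shared counter without synchronization, and the result usually falls short of the expected 2000000 (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;

    static void *incr(void *arg) {
        for (int i = 0; i < 1000000; i++)
            counter++;          /* load, add, store: interleavings lose updates */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, incr, NULL);
        pthread_create(&b, NULL, incr, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }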

6.1. Busy Waiting

Busy waiting is usually not considered a good strategy because it wastes CPU time. Another problem is priority inversion, in which a high-priority process depends on a resource held by a low-priority process.

  • spinlock: hardware implementation. It can be built on a test-and-set instruction (e.g. xchg)
  • disable interrupts: only workable on a single CPU, and risky in general
  • Peterson: software implementation. It uses shared memory to implement busy waiting
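
A sketch of Peterson's algorithm for two threads; on modern hardware it is only correct with sequentially consistent memory accesses, hence the C11 atomics (compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int flag[2];   /* flag[i]: thread i wants to enter */
    static atomic_int turn;      /* which thread must wait */
    static long counter = 0;

    static void lock(int self) {
        atomic_store(&flag[self], 1);
        atomic_store(&turn, 1 - self);      /* politely let the other go first */
        while (atomic_load(&flag[1 - self]) && atomic_load(&turn) == 1 - self)
            ;                               /* busy wait */
    }

    static void unlock(int self) { atomic_store(&flag[self], 0); }

    static void *run(void *arg) {
        int self = *(int *)arg;
        for (int i = 0; i < 100000; i++) {
            lock(self);
            counter++;                      /* critical section */
            unlock(self);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int id[2] = {0, 1};
        pthread_create(&t[0], NULL, run, &id[0]);
        pthread_create(&t[1], NULL, run, &id[1]);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        printf("counter = %ld\n", counter); /* 200000: mutual exclusion held */
        return 0;
    }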

6.2. Hardware Primitives

There are a couple of hardware primitives that support concurrency features.

Test-and-Set (atomic exchange): atomically swap *ptr with a given value (x86: xchg)

Compare-and-Swap: compare *ptr with an expected value; if they are equal, set *ptr to a new value. This is the most popular synchronization primitive. (x86: cmpxchg)

Load-Link/Store-Conditional (LL/SC): load *ptr, and later store a new value to it only if no other thread has written it in between (e.g. RISC-V: lr/sc, ARM: ldxr/stxr, MIPS: ll/sc...)

Fetch-and-Add (atomic adder): atomically increment *ptr and return the old value (x86: lock xadd)
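
For example, a minimal spinlock can be built from test-and-set with C11 atomic_exchange, which compiles to xchg on x86 (compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int held = 0;     /* 0 = free, 1 = held */
    static long counter = 0;

    static void spin_lock(void) {
        while (atomic_exchange(&held, 1))   /* test-and-set until it was free */
            ;                               /* spin (a yield here would be kinder) */
    }

    static void spin_unlock(void) { atomic_store(&held, 0); }

    static void *run(void *arg) {
        for (int i = 0; i < 100000; i++) {
            spin_lock();
            counter++;
            spin_unlock();
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, run, NULL);
        pthread_create(&b, NULL, run, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", counter);  /* 200000 */
        return 0;
    }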

6.2.1. Atomic

It has interfaces such as read, increment, and decrement to modify an int atomically. These are implemented with the LOCK prefix on x86 and the ldrex/strex exclusive monitor on ARM.
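
A small sketch with C11 atomics; atomic_fetch_add compiles to a LOCK-prefixed xadd on x86:

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void) {
        atomic_int n = 0;
        atomic_fetch_add(&n, 1);           /* atomic increment */
        atomic_fetch_sub(&n, 1);           /* atomic decrement */
        printf("%d\n", atomic_load(&n));   /* atomic read: prints 0 */
        return 0;
    }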

6.3. Software Primitives

6.3.1. Lock

Lock is a mutual exclusion primitive. It can be implemented with a semaphore initialized to \(k=1\).

The previous primitives can be used to implement simple spinlocks, which typically suffer from a performance issue: waiting threads waste the time slices assigned to them. There are a couple of solutions:

  • yield: deschedule itself instead of spinning while waiting
  • park: put itself into a queue and sleep; when the resource is unlocked, the thread is woken up (unpark)

6.3.2. Condition Variable

Condition Variable is an ordering primitive.

A condition variable is an efficient way to send notifications among threads. The notification itself could also be implemented by spinning on a flag, which is of course inefficient.

A condition variable has two interfaces:

  • wait: put the calling thread to sleep until the condition is signaled. The caller must hold a lock when calling wait; the lock protects the predicate state, and wait releases it while sleeping and re-acquires it before returning
  • signal: wake one thread sleeping on this condition, or do nothing if no one is waiting. If more than one thread is blocked on a condition variable, the scheduling policy determines the order in which threads are unblocked

A simple rule with condition variables is to always check the condition with while instead of if, because the condition may have changed by the time the waiting thread starts running (Mesa semantics).

Producer/Consumer Problem This is a synchronization problem in which producers put items into a buffer and consumers take them out. It appears in many places (multithreaded servers, shell pipes...).
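
A minimal sketch with a one-slot buffer, one mutex, and two condition variables; note the while (not if) around each wait (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
    static int buffer, full = 0;

    static void *producer(void *arg) {
        for (int i = 1; i <= 5; i++) {
            pthread_mutex_lock(&m);
            while (full)                          /* while, not if */
                pthread_cond_wait(&not_full, &m); /* releases m while sleeping */
            buffer = i; full = 1;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        for (int i = 0; i < 5; i++) {
            pthread_mutex_lock(&m);
            while (!full)
                pthread_cond_wait(&not_empty, &m);
            printf("consumed %d\n", buffer); full = 0;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }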

6.3.3. Semaphore

The semaphore was originally introduced by Dijkstra as a single primitive for everything related to synchronization. Indeed, semaphores can be used to implement both locks and condition variables.

It is an object with an integer value that we can modify. In the POSIX standard, it has the following interfaces:

  • sem_wait (P): decrement the value; go to sleep if it is negative
  • sem_post (V): increment the value; wake one thread if any are sleeping

In this formulation, the value of the semaphore, when negative, equals the number of waiting threads.
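
A minimal sketch of a POSIX semaphore initialized to 1 acting as a lock (compile with -pthread):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t sem;
    static long counter = 0;

    static void *incr(void *arg) {
        for (int i = 0; i < 100000; i++) {
            sem_wait(&sem);          /* P: decrement, sleep if unavailable */
            counter++;
            sem_post(&sem);          /* V: increment, wake a sleeper */
        }
        return NULL;
    }

    int main(void) {
        sem_init(&sem, 0, 1);        /* initial value 1: behaves as a lock */
        pthread_t a, b;
        pthread_create(&a, NULL, incr, NULL);
        pthread_create(&b, NULL, incr, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", counter);  /* 200000 */
        sem_destroy(&sem);
        return 0;
    }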

7. Reference

[1] Bovet, Daniel P., and Marco Cesati. Understanding the Linux Kernel. 3rd ed. O'Reilly, 2005. Covers Linux kernel 2.6.

[2] Silberschatz, Abraham, Peter Baer Galvin, and Greg Gagne. Operating System Concepts (the dinosaur book).

[3] Tanenbaum, Andrew S., and Albert S. Woodhull. Operating Systems: Design and Implementation (the MINIX book).

[4] Russinovich, Mark, David A. Solomon, and Alex Ionescu. Windows Internals.

[5] Lions, John. Lions' Commentary on UNIX 6th Edition.

[6] Kerrisk, Michael. The Linux Programming Interface: A Linux and UNIX System Programming Handbook. No Starch Press, 2010.