0x321 Concurrency
This page describes the interfaces and implementations of processes, threads, and synchronization.
1. Process
A process is a program in execution.
1.1. Kernel
The kernel represents each process with a structure called the Process Control Block (also known as the process descriptor or task structure), which stores all information about the process.
In the Linux kernel it is called task_struct and is stored in a doubly-linked list (known as the process table).
struct task_struct {
volatile long state; // Process state (e.g., TASK_RUNNING, TASK_STOPPED)
struct thread_info *thread_info; // Low-level thread information (kernel stack, flags)
struct exec_domain *exec_domain; // Execution domain information (deprecated)
struct mm_struct *mm; // Memory management information (address space)
struct fs_struct *fs; // Filesystem information
struct files_struct *files; // File descriptor table
struct signal_struct *signal; // Signal handlers and signals pending
struct sighand_struct *sighand; // Signal handling information
...
};
The process table is visible from user mode through /proc (each process appears as a directory /proc/<PID>).
1.2. Process API
1.2.1. POSIX
API (fork (2), vfork (BSD))
- The return value is -1 on error (e.g., when RLIMIT_NPROC is reached), 0 in the child process, and the child's PID in the parent process (see the sketch below).
- File descriptors are shared between child and parent after fork, since both refer to the same open file descriptions (e.g., an lseek in the child is visible to the parent).
- Child and parent use different virtual page tables, but both point to the same physical pages. The text segment is read-only, so it need not be copied; the data, heap, and stack pages are copy-on-write, meaning a new page is allocated only when either the child or the parent modifies one of them.
- The parent process usually runs first by default since Linux 2.6.32, but this can be controlled via /proc/sys/kernel/sched_child_runs_first. Running the child first reduces page copying when the child immediately calls exec; running the parent first reduces TLB flushes and CPU state swapping.
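A minimal sketch of the fork conventions described above (return values and copy-on-write), assuming Linux with glibc:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 42;                      // becomes copy-on-write after fork
    pid_t pid = fork();
    if (pid == -1) {                 // -1: error (e.g., RLIMIT_NPROC reached)
        perror("fork");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {           // 0: we are in the child
        x = 7;                       // write triggers a private page copy
        printf("child:  x = %d\n", x);
        _exit(EXIT_SUCCESS);
    } else {                         // >0: parent receives the child's PID
        wait(NULL);
        printf("parent: x = %d\n", x);  // still 42; the write was not shared
    }
    return 0;
}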
API (_exit (2), exit(3))
- _exit is the system call; exit is its wrapper in libc
- exit performs several actions before calling _exit: running exit handlers and flushing stdio stream buffers
API (clone (2))
- provides finer control over process creation (see the Thread section below)
1.2.2. Windows API
Windows does not have fork; use spawn-style APIs (fork + exec combined) or CreateProcess instead.
API (CreateProcess)
1.3. IPC API
The lowest-level IPC on Windows is done by COM (Component Object Model). On Linux, there are two families of IPC: System V IPC and POSIX IPC. POSIX IPC is the newer family and is thread-safe, but it is not supported on some operating systems.
1.3.1. System V IPC
1.3.2. POSIX IPC
1.3.2.1. Pipe
Pipes are the oldest method of IPC (dating from Third Edition Unix). A pipe is a unidirectional byte stream; random access (e.g., lseek) is not allowed. (Pipes appear not to be supported in the recent Zircon microkernel.)
Writes of up to PIPE_BUF bytes (4096 on Linux) are guaranteed to be atomic. The pipe capacity can be modified with fcntl (up to about 1 MiB on Linux); a larger buffer can help reduce context switches.
API (int pipe(int filedes[2]) (2))
filedes[0] is the read end, filedes[1] is the write end. If all write fds are closed, reads on the read end return EOF.
Normally pipe is followed by fork, and the child process inherits copies of the parent's fds. Usually one process closes the read end and the other closes the write end.
Bidirectional IPC can be implemented with two pipes.
Closing unused pipe fds is important (see the sketch below):
- If redundant write fds are not closed, the read fd never sees EOF.
- If redundant read fds are not closed, writers never receive the SIGPIPE signal.
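A minimal sketch of the usual pipe-then-fork pattern, including closing the unused ends:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {               // child: the reader
        close(fd[1]);                // close unused write end so EOF arrives
        char buf[64];
        ssize_t n = read(fd[0], buf, sizeof(buf));
        if (n > 0)
            write(STDOUT_FILENO, buf, (size_t)n);
        close(fd[0]);
        _exit(0);
    }
    close(fd[0]);                    // parent: close unused read end
    const char *msg = "hello through the pipe\n";
    write(fd[1], msg, strlen(msg));
    close(fd[1]);                    // the reader now sees EOF
    wait(NULL);
    return 0;
}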
1.3.3. Windows IPC
Windows has the following IPC mechanisms:
- Clipboard
- COM
- Data Copy
- DDE
- File Mapping
- MailSlot
- Pipes
- Windows Socket
- RPC
1.3.3.1. COM IPC
In my personal understanding, COM does roughly the following, in Linux-equivalent terms:
- fork a process
- open a pipe to share message memory in binary form
- construct an object over the binary data (the COM interface defines how to construct the object) and manage its lifetime
- close the process
1.3.3.2. WinRT IPC
WinRT is a wrapper over COM.
2. Thread
2.1. Kernel
2.1.1. Linux
The Linux kernel treats processes and threads as essentially the same thing (i.e., task_struct), with the only distinction being the level of resource sharing between them.
Both use clone to create a task with a configurable level of sharing:
- CLONE_FILES: share the same file descriptor table
- CLONE_PARENT: whether to set up a parent-child relationship
- CLONE_VM: share the same memory space (vs. a copy-on-write copy)
Process creation (i.e., fork) calls clone with the least sharing, while thread creation (pthread_create) calls clone with the most sharing.
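A rough sketch of thread-like creation with clone (the flags here are an illustrative subset; the real pthread_create passes more, such as CLONE_SIGHAND and CLONE_THREAD):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int child_fn(void *arg) {
    printf("child task sees: %s\n", (char *)arg);
    return 0;
}

int main(void) {
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) return 1;

    // Thread-like sharing: same address space, fd table, and fs info.
    int flags = CLONE_VM | CLONE_FILES | CLONE_FS | SIGCHLD;
    // The stack grows downward on most architectures, so pass its top.
    pid_t pid = clone(child_fn, stack + stack_size, flags, "shared VM");
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}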
2.1.2. Windows
Unlike Linux, Windows implements the thread as a distinct object at the OS level.
A process is stored in an EPROCESS (executive process block), which contains a list of the process's threads as KTHREAD structures (thread control blocks).
2.2. Thread API
There are two main Linux threading implementations of POSIX threads:
- LinuxThreads: the original implementation, with many deviations from SUSv3. No longer provided since glibc 2.4.
- Native POSIX Threads Library (NPTL): the modern implementation, with better performance and closer compliance with SUSv3.
2.2.1. POSIX (pthread)
API (pthread_create (3))
- specifies the start function and its argument for the new thread
API (pthread_exit, pthread_join, pthread_detach (3))
- pthread_detach: a detached thread cannot be joined (a create/join sketch follows)
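A minimal create/join sketch (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {      // start function of the new thread
    printf("worker got %d\n", *(int *)arg);
    return NULL;
}

int main(void) {
    pthread_t tid;
    int arg = 42;
    if (pthread_create(&tid, NULL, worker, &arg) != 0)
        return 1;
    pthread_join(tid, NULL);          // wait for the thread to finish
    return 0;
}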
Mutex: Currently, pthread mutex on Linux is mainly implemented with futex (fast userspace mutex).
API (pthread_mutex_lock, pthread_mutex_unlock (3))
- pthread_mutex_lock blocks if the mutex has already been acquired by another thread
- lock and unlock each cost around 25 ns in the uncontended case
- static allocation: PTHREAD_MUTEX_INITIALIZER
- dynamic allocation: pthread_mutex_init (see the usage sketch below)
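A minimal usage sketch (static allocation shown here; pthread_mutex_init would replace the initializer for dynamic allocation):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // static allocation
static long counter = 0;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    // blocks if another thread holds it
        counter++;                    // critical section
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  // always 200000 with the mutex
    return 0;
}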
3. CPU Virtualization
For better performance, it is desirable that each process run directly on the CPU, which raises the question of how the OS regains control of the CPU while a process is running. There are two ways to solve this.
Cooperative approach: this assumes that each process behaves reasonably and frequently gives control back to the OS. Two mechanisms allow the transfer of control:
- normal system call: any syscall of course transfers control to the kernel
- yield system call: a system call that does nothing but transfer control. Some early OSes (e.g., Mac OS 9) used this approach, but it is not a good idea in general; consider, for example, a process stuck in an infinite loop.
Non-cooperative approach: another way is to use the timer interrupt. The timer raises an interrupt on a millisecond scale, and the kernel implements the interrupt handler to manage processes.
When the interrupt is triggered, the flow is similar to a system call: the hardware has some responsibilities (e.g., saving the running state of the program to the kernel stack) so that return-from-trap works.
During the interrupt, the OS can decide to switch to another process (a context switch): it saves a few register values of the current process and restores those of the next, so that return-from-trap resumes the new process instead of the old one.
3.1. Scheduling
The most basic scheduling approaches assume that the run time of each job is known. Under that assumption there are two groups of approaches.
The first group minimizes the turnaround time, \(T_{turnaround} = T_{completion} - T_{arrival}\), averaged over all jobs:
- Shortest Job First (SJF): literally run the shortest job first
- Shortest Time-to-Completion First (STCF): SJF may keep running a heavy job that started when it was the only one. In a preemptive OS, the scheduler should instead reschedule jobs based on their remaining run time.
The second group targets the response time, \(T_{response} = T_{firstrun} - T_{arrival}\), the time from a job's arrival until it first runs:
- round robin (RR): schedule jobs in a loop, each for one time slice. The time slice should not be too short, because context switches have a non-trivial cost.
There is a trade-off between these two groups. Moreover, the assumption that each job's run time is known is unrealistic. In the real world, the following schedulers are commonly used.
3.1.1. Multi-level Feedback Queue
There are multiple queues corresponding to different priorities of jobs. It runs with the following rules:
- The queue with higher priority runs first; within a queue, jobs run round-robin
- When a new job arrives, it is put into the highest priority queue
- When a job consumes its time allotment, its priority is reduced
- After some period, all jobs will move up to the top priority (priority boost)
This scheduler both reduces response time (by keeping interactive jobs in the higher-priority queues) and alleviates starvation (via the priority boost).
Schedulers can be given hints about priority (e.g., the nice command).
Many real-world OSes use this family of schedulers (e.g., Solaris, the BSD family, the Windows NT family...).
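A toy sketch of the rules above (a hypothetical simulation with made-up constants, not an OS implementation):

#include <stdio.h>

#define LEVELS 3        // number of priority queues
#define ALLOTMENT 2     // time slices a job may use per level
#define BOOST_PERIOD 10 // ticks between priority boosts

typedef struct { int id, level, used, remaining; } Job;

static int rr_next = 0; // rotates picks within a level (round-robin)

static Job *pick(Job *jobs, int n) {
    for (int lvl = 0; lvl < LEVELS; lvl++)
        for (int k = 0; k < n; k++) {
            Job *j = &jobs[(rr_next + k) % n];
            if (j->remaining > 0 && j->level == lvl) {
                rr_next = (rr_next + k + 1) % n;
                return j;
            }
        }
    return NULL;        // all jobs finished
}

int main(void) {
    Job jobs[] = {{1, 0, 0, 5}, {2, 0, 0, 3}};
    int n = 2;
    for (int tick = 0; ; tick++) {
        if (tick % BOOST_PERIOD == 0)           // rule: priority boost
            for (int i = 0; i < n; i++)
                jobs[i].level = jobs[i].used = 0;
        Job *j = pick(jobs, n);                 // highest queue first, RR within
        if (!j) break;
        j->remaining--; j->used++;
        printf("tick %d: job %d (level %d)\n", tick, j->id, j->level);
        if (j->used >= ALLOTMENT && j->level < LEVELS - 1) {
            j->level++;                         // rule: allotment used, demote
            j->used = 0;
        }
    }
    return 0;
}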
3.1.2. O(n) Scheduler
3.1.3. O(1) Scheduler
3.1.4. Completely Fair Scheduler
3.2. Event-Driven
User-level threads are implemented in user space; the OS does not recognize them.
- Advantage: less context-switch overhead.
- Disadvantage: blocking operations (e.g., I/O). One user thread blocking on I/O blocks the entire process, and therefore all user threads.
In general, user-level threads can be mapped onto kernel threads using one of four models:
- Many-to-one
- One-to-one
- Many-to-many
- Two-level
4. Synchronization
Synchronization implementation is usually hardware-dependent, but it can also be implemented in software by taking advantage of shared memory (e.g., Peterson's algorithm).
4.1. Synchronization Primitives
A race condition happens when multiple threads enter a critical section at roughly the same time and attempt to modify shared data. This happens because the update is usually not executed atomically.
For instance, an increment operation on x86 usually takes three instructions (see the sketch after this list):
- move memory to register
- update register
- move back to memory.
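As a sketch in C, the racy read-modify-write looks like this; two threads interleaving these three steps can both read the same old value and lose an update:

int counter = 0;            // shared between threads

void unsafe_increment(void) {
    int tmp = counter;      // (1) move memory to register
    tmp = tmp + 1;          // (2) update register
    counter = tmp;          // (3) move back to memory
}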
To prevent those issues, we can build a set of synchronization primitives.
4.1.1. Busy Waiting
Busy waiting is usually not considered a good strategy because it wastes CPU time. Another problem is priority inversion, in which a high-priority process depends on a resource held by a low-priority process.
- spinlock: hardware implementation; can be built with a test-and-set instruction (e.g., xchg)
- disable interrupts: a drastic approach, only viable on a single processor
- Peterson's algorithm: software implementation; uses shared memory to implement busy waiting
4.1.2. Hardware Primitives
There are several hardware primitives that support synchronization (a spinlock sketch built on the first follows this list):
Test-and-Set (atomic exchange): swap *ptr with a given value atomically (x86: xchg)
Compare-and-Set: compare *ptr with a value and, if equal, set a new value. This is the most popular synchronization primitive. (x86: CMPXCHG)
Load-Link/Store-Conditional (LL/SC): load *ptr, then store a new value to it only if no other thread has written it since the load (e.g., RISC-V: lr/sc, ARM: ldxr/stxr, MIPS: ll/sc...)
Fetch-and-Add (atomic adder): increment *ptr atomically and return the old value
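A minimal sketch of a spinlock built on test-and-set, assuming C11 <stdatomic.h> (atomic_exchange compiles to xchg on x86):

#include <stdatomic.h>

typedef struct { atomic_int flag; } spinlock_t;    // 0 = free, 1 = held

static void spin_lock(spinlock_t *l) {
    // Atomically swap in 1; if the old value was 1, someone else holds it.
    while (atomic_exchange(&l->flag, 1) == 1)
        ;                                          // busy-wait (spin)
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->flag, 0);
}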
4.1.2.1. Atomic
It provides interfaces such as read, increment, and decrement to modify an int atomically. These are implemented with the LOCK prefix on x86 and the ldrex/strex exclusive monitor on ARM.
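For instance, a minimal sketch with C11 atomics (on x86 the fetch-add below is emitted with a LOCK prefix):

#include <stdatomic.h>

atomic_int counter = 0;

void inc(void) {
    atomic_fetch_add(&counter, 1);   // atomic read-modify-write
}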
4.1.3. Software Primitives
4.1.3.1. Lock
A lock is a mutual exclusion primitive. It can be implemented with a semaphore with \(k=1\).
The previous hardware primitives can be used to implement simple spinlocks, which typically suffer from a performance issue: waiting threads waste the time slices assigned to them. There are a couple of solutions:
- yield: deschedule itself instead of spinning while waiting
- put itself into a queue and sleep (park); when the resource is unlocked, the thread is woken up (unpark)
A yield-based sketch follows.
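A minimal sketch of the yield approach, assuming C11 atomics plus POSIX sched_yield:

#include <sched.h>
#include <stdatomic.h>

static atomic_int flag = 0;          // 0 = free, 1 = held

static void yield_lock(void) {
    while (atomic_exchange(&flag, 1) == 1)
        sched_yield();               // deschedule instead of spinning
}

static void yield_unlock(void) {
    atomic_store(&flag, 0);
}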
4.1.3.2. Condition Variable
Condition Variable is an ordering primitive.
A condition variable is an efficient way to send notifications among threads. The notification feature itself could be implemented by spinning, but that is of course inefficient.
A condition variable has two interfaces:
- wait: put the thread to sleep until the condition changes. The caller must hold a lock when calling wait, and the thread resumes only when the condition has been signaled AND the mutex is available. The lock protects the predicate state.
- signal: wake one sleeping thread waiting on this condition; do nothing if no one is waiting. If more than one thread is blocked on a condition variable, the scheduling policy determines the order in which threads are unblocked.
A simple rule with condition variables is to always use while instead of if to check the condition, because the condition may have changed by the time the waiting thread starts running (Mesa semantics).
Producer/Consumer Problem: a synchronization problem in which producers put items into buffers and consumers take them out. It appears in many places (multithreaded servers, shell pipes...); see the sketch below.
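A minimal sketch of the producer/consumer pattern with a condition variable pair (note the while loops, per Mesa semantics; the buffer size and item counts are made up):

#include <pthread.h>
#include <stdio.h>

#define CAP 8
static int buf[CAP], count = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void put(int item) {
    pthread_mutex_lock(&m);               // the lock protects the predicate
    while (count == CAP)                  // while, not if: Mesa semantics
        pthread_cond_wait(&not_full, &m);
    buf[count++] = item;
    pthread_cond_signal(&not_empty);      // wake one waiting consumer
    pthread_mutex_unlock(&m);
}

static int get(void) {
    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&not_empty, &m);
    int item = buf[--count];
    pthread_cond_signal(&not_full);       // wake one waiting producer
    pthread_mutex_unlock(&m);
    return item;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 16; i++) put(i);
    return NULL;
}

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (int i = 0; i < 16; i++)
        printf("got %d\n", get());        // main thread acts as the consumer
    pthread_join(p, NULL);
    return 0;
}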
4.1.3.3. Semaphore
The semaphore was originally introduced by Dijkstra as a single primitive for everything related to synchronization; indeed, semaphores can be used to implement both locks and condition variables.
It is an object with an integer value that we can modify. In the POSIX standard, it has the following interfaces:
- sem_wait (P): decrement the value; go to sleep if it becomes negative
- sem_post (V): increment the value; wake one thread if any are sleeping
When the semaphore's value is negative, its absolute value equals the number of waiting threads (a minimal usage sketch follows).
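A minimal usage sketch with a POSIX unnamed semaphore (initial value 1, so it behaves as a lock with \(k=1\)):

#include <semaphore.h>
#include <stdio.h>

int main(void) {
    sem_t sem;
    sem_init(&sem, 0, 1);    // pshared = 0: shared between threads; value 1

    sem_wait(&sem);          // P: decrement; would sleep if unavailable
    printf("in critical section\n");
    sem_post(&sem);          // V: increment; wakes a sleeper if any

    sem_destroy(&sem);
    return 0;
}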