This page is my summary of the user space API of POSIX and Linux. Main reference is The Linux Programming Interface 
Similar APIs are grouped together, section
fork (2), vfork (BSD)
The return value is -1 on error (e.g. when RLIMIT_NPROC is reached), 0 in the child process, and the child's PID in the parent process.
- File descriptors are shared between child and parent after fork: both refer to the same open file description, so e.g. an lseek in the child is visible from the parent.
- Child and parent use different page tables, but both point to the same physical pages. The text segment is read-only, so it never needs to be copied. The data, heap, and stack pages are copy-on-write: when either the child or the parent tries to modify such a page, a new page is assigned to it.
- The parent process usually runs first by default since Linux 2.6.32, but this can be controlled through /proc/sys/kernel/sched_child_runs_first. The benefit of running the child first is fewer copied pages when the child immediately calls exec. The benefit of running the parent first is less TLB and CPU switching overhead, since the parent's state is already loaded.
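As a small illustration of the return values and the shared file offset described above, here is a hedged sketch. The function name `offset_after_child_seek` and the scratch file name are made up for this example; only `fork`, `lseek`, `mkstemp` and `waitpid` are real APIs.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the parent's file offset after the child
 * has called lseek() on the inherited descriptor, or -1 on failure. */
off_t offset_after_child_seek(off_t child_offset)
{
    char path[] = "/tmp/fork_demo_XXXXXX";
    int fd = mkstemp(path);               /* open a scratch file */
    if (fd == -1)
        return -1;
    unlink(path);                         /* removed once fd is closed */

    pid_t pid = fork();
    if (pid == -1) {                      /* error, e.g. RLIMIT_NPROC reached */
        close(fd);
        return -1;
    }
    if (pid == 0) {                       /* child: fork() returned 0 */
        lseek(fd, child_offset, SEEK_SET);
        _exit(EXIT_SUCCESS);
    }

    /* parent: fork() returned the child's PID */
    off_t result = -1;
    if (waitpid(pid, NULL, 0) == pid)
        result = lseek(fd, 0, SEEK_CUR);  /* query the current offset */
    close(fd);
    return result;
}
```

Calling `offset_after_child_seek(123)` should give 123 in the parent, because both descriptors refer to the same open file description.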
_exit (2), exit(3)
- _exit is the system call; exit is its wrapper in libc
- exit performs several actions before calling _exit: it runs the exit handlers and flushes the stdio stream buffers
- finer control over process creation
There are two main implementations of POSIX threads on Linux:
- LinuxThreads: the original implementation. Many deviations from SUSv3. No longer provided after glibc 2.4
- Native POSIX Threads Library (NPTL): the modern implementation. Better performance and closer compliance with SUSv3. Here is a good tutorial for pthread.
- specify function and argument to start a new thread
pthread_exit, pthread_join, pthread_detach (3)
- pthread_detach: cannot be joined after detach
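A minimal sketch of creating a thread with a start function and an argument, then joining it to collect the result. The names `square` and `square_in_thread` are invented for the example; on older glibc versions this needs to be compiled with `-pthread`.

```c
#include <pthread.h>
#include <stdint.h>

/* Start function: receives its argument as void * and returns a void * */
static void *square(void *arg)
{
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);   /* same effect as pthread_exit((void *)(n * n)) */
}

/* Hypothetical helper: computes n * n in a new thread.
 * Returns -1 if a pthread call fails. */
long square_in_thread(long n)
{
    pthread_t tid;
    void *result;

    if (pthread_create(&tid, NULL, square, (void *)(intptr_t)n) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)   /* would fail on a detached thread */
        return -1;
    return (long)(intptr_t)result;
}
```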
Currently, pthread mutex on Linux is mainly implemented with futex (fast userspace mutex).
pthread_mutex_lock, pthread_mutex_unlock (3)
- pthread_mutex_lock blocks if the mutex is already held by another thread
- lock and unlock each take about 25 ns
- static allocation: PTHREAD_MUTEX_INITIALIZER
- dynamic allocation: pthread_mutex_init
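To illustrate lock/unlock around a shared variable, here is a sketch using a statically allocated mutex. The thread count, the iteration count and all function names are arbitrary choices for the example.

```c
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* static allocation */
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);     /* blocks while another thread holds it */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Hypothetical helper: runs the workers and returns the final counter */
long count_with_mutex(void)
{
    pthread_t tids[NUM_THREADS];
    int started = 0;

    counter = 0;
    for (; started < NUM_THREADS; started++)
        if (pthread_create(&tids[started], NULL, worker, NULL) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

Without the lock/unlock pair the increments could race and the final value would likely be lower than `NUM_THREADS * INCREMENTS`.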
User-space level thread, supported by VM not kernel
nonpreemptivethread (each thread return resource to OS by itself)
- less expensive than normal thread
stat, lstat, fstat (2)
- stat: return info about a file. Requires execute (search) permission on all parent directories of the path
- lstat: similar to stat, but returns info about the link itself if a symbolic link is specified
- fstat: return info about the file referred to by a file descriptor
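The difference between the three calls can be sketched as follows. The helper name and the scratch paths under /tmp are assumptions made for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if stat() follows a symlink while lstat()
 * describes the link itself and fstat() agrees with stat(); 0 if not;
 * -1 if the scratch files could not be set up. */
int stat_follows_link(void)
{
    char target[] = "/tmp/stat_demo_XXXXXX";
    char link_path[sizeof target + 8];
    struct stat via_stat, via_lstat, via_fstat;
    int ok = -1;

    int fd = mkstemp(target);                 /* regular file */
    if (fd == -1)
        return -1;
    snprintf(link_path, sizeof link_path, "%s.lnk", target);

    if (symlink(target, link_path) == 0) {
        if (stat(link_path, &via_stat) == 0 &&
            lstat(link_path, &via_lstat) == 0 &&
            fstat(fd, &via_fstat) == 0)
            ok = S_ISREG(via_stat.st_mode) &&      /* followed the link */
                 S_ISLNK(via_lstat.st_mode) &&     /* the link itself */
                 via_stat.st_ino == via_fstat.st_ino;
        unlink(link_path);
    }
    close(fd);
    unlink(target);
    return ok;
}
```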
According to POSIX, I/O is intended to be atomic to ordinary files and pipes and FIFOs
- O_RDONLY, O_WRONLY, O_RDWR: file access mode
- O_CREAT: create the file if it does not exist
- O_DIRECT: bypass OS page cache
- O_EXCL: use together with O_CREAT to prevent race condition of creation. Only one process will succeed when creating, others will fail.
- O_APPEND: appending writes are atomic on most file systems (exceptions include NFS and HDFS).
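A short sketch of the O_CREAT | O_EXCL combination, which is commonly used for lock files. The function name `create_exclusive` is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: tries to create `path` exclusively.
 * Returns 1 if this call created the file, 0 if it already existed,
 * and -1 on any other error. */
int create_exclusive(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return errno == EEXIST ? 0 : -1;
    close(fd);
    return 1;
}
```

If several processes call this with the same path at the same time, only one of them should get 1; the others get 0.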
- read and write at a specific offset without modifying the current file offset
- It is equivalent to atomically performing: lseek to the new offset, the I/O, lseek back to the original offset.
- useful for multithreaded applications.
- scatter/gather IO
- readv: read from fd into multiple buffers atomically
- writev: write from multiple buffers into fd atomically
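A gather write can be sketched like this: two separate buffers go out in a single writev call. The helper name and the buffer contents are arbitrary, and a pipe is used only so that the result can be read back.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: gathers two buffers into a pipe with one writev()
 * and reads the result back into `out`. Returns bytes read or -1. */
ssize_t gather_into_pipe(char *out, size_t out_size)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    char head[] = "GET ";
    char body[] = "/index";
    struct iovec iov[2] = {
        { .iov_base = head, .iov_len = strlen(head) },
        { .iov_base = body, .iov_len = strlen(body) },
    };

    ssize_t n = -1;
    if (writev(fds[1], iov, 2) == 10)         /* 4 + 6 bytes in one call */
        n = read(fds[0], out, out_size);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```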
- perform control operations on an open file descriptor
The inotify API was added in kernel 2.6 to replace dnotify.
- inotify_init: create an inotify instance
- inotify_add_watch: add items to the watch lists
- read from the inotify file descriptor to retrieve inotify_event structs
- lsof: list open files
This section describes memory related API in user space
mmap, munmap (2)
- can select whether the mapped memory is private (MAP_PRIVATE) or shared (MAP_SHARED)
- can map a file or map anonymous memory (an option for memory allocation)
- offset and addr should be page aligned in mmap; length will be rounded up to a multiple of the page size (the page size can be retrieved with getpagesize(2) on Linux)
mprotect, madvise, mlock, msync (2)
- mprotect: change the protection (PROT_READ, PROT_WRITE, PROT_EXEC) of a region
- madvise: tell the OS the expected read pattern (e.g. random or sequential) so that it can make good guesses
- mlock: lock the region to prevent it from being swapped out
- msync: force memory to be written back to the file (synchronously or asynchronously)
brk, sbrk (2)
- heap allocation system call
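- brk: change the current program break by specifying the new program break address. The raw Linux system call brk(0) returns the current program break; the glibc wrapper returns 0 or -1 instead
- sbrk: specify an increment; sbrk(0) returns the current program break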
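A small sketch of moving the program break with sbrk, where sbrk(0) is used to query the current break. The helper name is made up, and the code assumes a single-threaded program in which nothing else (such as malloc) moves the break in between.

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: grows the heap by `increment` bytes, measures how
 * far the program break moved, then shrinks it back.
 * Returns the observed difference, or -1 if sbrk() fails. */
intptr_t grow_program_break(intptr_t increment)
{
    char *before = sbrk(0);                 /* current program break */
    if (sbrk(increment) == (void *)-1)
        return -1;
    char *after = sbrk(0);
    sbrk(-increment);                       /* restore the old break */
    return after - before;
}
```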
malloc, free (3)
- memory allocation can be implemented with brk, sbrk, or mmap.
- free memory blocks are managed (as a linked list) in user space to reduce syscalls
- malloc first searches for a free block in the current memory lists. If one is found, it returns the block and marks it as used. If none is found, it calls brk or sbrk to allocate new memory. This avoids issuing a system call on every allocation.
- free returns the block to the managed memory lists (in user space) without calling brk or sbrk.
- first-fit strategy: used by the K&R implementation and by malloc in embedded systems. Find the first block that is large enough. The problem is memory fragmentation.
- best fit strategy:
calloc, realloc (3)
- calloc: malloc with zero initialization
- realloc: resize an existing block; can be used in vector and map implementations
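To make the vector use case concrete, here is a hedged sketch of a growable int array built on realloc, along with a check that calloc memory is zeroed. The struct and function names are invented for the example, and doubling the capacity is just one common growth policy.

```c
#include <stdlib.h>

/* Hypothetical growable array; initialise with {0} */
struct int_vec {
    int *data;
    size_t len;
    size_t cap;
};

/* Appends value, doubling the capacity with realloc() when full.
 * Returns 0 on success, -1 if out of memory (old data stays valid). */
int vec_push(struct int_vec *v, int value)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 4;
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        v->data = p;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}

/* Returns 1 if n ints from calloc() are all zero, -1 on allocation failure */
int calloc_is_zeroed(size_t n)
{
    int *a = calloc(n, sizeof *a);
    if (a == NULL)
        return -1;
    int ok = 1;
    for (size_t i = 0; i < n; i++)
        if (a[i] != 0)
            ok = 0;
    free(a);
    return ok;
}
```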
memalign, posix_memalign (3)
- allocate memory with a specific alignment.
- useful for SSE, AVX…
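A minimal sketch of an aligned allocation with posix_memalign, for example 32 bytes for AVX loads. The wrapper name `alloc_aligned` is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocates `size` bytes aligned to `alignment`.
 * The alignment must be a power of two and a multiple of sizeof(void *).
 * Returns NULL on failure; release the block with free(). */
void *alloc_aligned(size_t alignment, size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
}
```

Note that posix_memalign reports errors through its return value rather than through errno.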
- allocate memory on stack
Users / Groups
nonsensitive system password file
- login name: unique user name
- encrypted password: DES hash of password, x if shadow password enabled
- User ID (UID): superuser (root) if value 0
- Group ID (GID): group id of the first group
- Comment: text about the user
- home directory: HOME variable
- login shell: shell
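These fields can be read programmatically with getpwuid(3), which is not covered in the notes above but is the standard libc lookup for this file. The sketch below prints them in the same order as the list; the function name is made up.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: prints the password-file fields of `uid`.
 * Returns 0 if the user was found, -1 otherwise. */
int print_passwd_entry(uid_t uid)
{
    struct passwd *pw = getpwuid(uid);
    if (pw == NULL)
        return -1;

    printf("%s:%s:%lu:%lu:%s:%s:%s\n",
           pw->pw_name,                     /* login name */
           pw->pw_passwd,                   /* "x" if shadow passwords are used */
           (unsigned long)pw->pw_uid,       /* 0 for root */
           (unsigned long)pw->pw_gid,       /* first group */
           pw->pw_gecos,                    /* comment */
           pw->pw_dir,                      /* home directory */
           pw->pw_shell);                   /* login shell */
    return 0;
}
```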
sensitive password file. password hash is saved here
group info (note that part of the info is saved in /etc/passwd). The current login user’s groups can be checked with groups (1)
- group name: unique group name
- encrypted password: group password, x if shadow password enabled
- group ID (GID): group id
- user list: users
The lowest-level IPC on Windows is done by COM (Component Object Model). On Linux, there are two families of IPC: System V IPC and POSIX IPC. POSIX IPC is newer and thread safe, but it is not supported on some operating systems.
System V IPC
Pipes are the oldest method of IPC (from Version 3 Unix). A pipe is a unidirectional byte stream; random access (e.g. lseek) is not allowed. However, pipes do not appear to be supported in the recent Zircon microkernel.
Writes of up to PIPE_BUF bytes (4096 on Linux) are guaranteed to be atomic. The pipe capacity can be modified with fcntl using F_SETPIPE_SZ (up to about 1 MiB on Linux); a larger capacity helps reduce context switches.
int pipe(int filedes[2]) (2)
- filedes[0] is the read end, filedes[1] is the write end. If all write fds are closed, then all read fds receive EOF
- Normally pipe is followed by fork, and the child process inherits copies of the parent’s fds. Usually one process closes the read end, and the other closes the write end.
- Bidirectional IPC can be implemented with two pipes
- closing unused pipe fds is important.
- If redundant write fds are not closed, then the reader never receives EOF.
- If redundant read fds are not closed, then the writer does not receive the SIGPIPE signal when the real reader is gone
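The usual pipe-then-fork pattern, including closing the unused ends, can be sketched as follows. The helper name `read_from_child` is hypothetical; index 0 of the array is the read end and index 1 is the write end.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes `msg` into the pipe, the parent
 * reads until EOF. Returns the number of bytes read, or -1 on failure. */
ssize_t read_from_child(const char *msg, char *out, size_t out_size)
{
    int filedes[2];
    if (pipe(filedes) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(filedes[0]);
        close(filedes[1]);
        return -1;
    }
    if (pid == 0) {
        close(filedes[0]);                    /* child only writes */
        if (write(filedes[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(filedes[1]);
        _exit(0);
    }

    close(filedes[1]);          /* parent only reads; required to ever see EOF */
    size_t total = 0;
    ssize_t n = 0;
    while (total < out_size &&
           (n = read(filedes[0], out + total, out_size - total)) > 0)
        total += (size_t)n;
    close(filedes[0]);
    waitpid(pid, NULL, 0);
    return n < 0 ? -1 : (ssize_t)total;
}
```

If the parent kept its copy of the write end open, the read loop would block forever instead of seeing EOF.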
system call to create a socket and return its file descriptor.
- There are three socket domains: AF_UNIX for sockets on the same host, AF_INET for IPv4, AF_INET6 for IPv6.
- There are two types: SOCK_STREAM for connection-oriented communication (e.g: TCP), SOCK_DGRAM for connectionless communication (e.g.: UDP)
- bind socket to an address. addr is a generic structure to handle both pathname (for unix socket) and IP (for inet socket)
- Listening for incoming connections, backlog is the limit of pending connections.
server-side interface to accept a connection
- can be configured as either blocking or nonblocking
- client side interface
- connecting to a peer socket
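The socket, bind, listen, connect and accept sequence can be sketched in a single process with an AF_UNIX stream socket. The helper name and the socket path are assumptions for the example; real servers and clients normally live in separate processes.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: sends `msg` from a client socket to a server socket
 * bound at `path`. Returns the bytes received by the server, or -1. */
ssize_t unix_socket_roundtrip(const char *path, const char *msg,
                              char *out, size_t out_size)
{
    struct sockaddr_un addr;
    size_t len = strlen(msg);
    ssize_t n = -1;
    int conn = -1;

    if (strlen(path) >= sizeof addr.sun_path)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);
    unlink(path);                                   /* remove a stale socket */

    int server = socket(AF_UNIX, SOCK_STREAM, 0);
    int client = socket(AF_UNIX, SOCK_STREAM, 0);

    if (server != -1 && client != -1 &&
        bind(server, (struct sockaddr *)&addr, sizeof addr) == 0) {
        /* connect() on a UNIX stream socket returns once the connection is
         * queued in the backlog, so accept() can be called afterwards. */
        if (listen(server, 5) == 0 &&
            connect(client, (struct sockaddr *)&addr, sizeof addr) == 0 &&
            (conn = accept(server, NULL, NULL)) != -1 &&
            write(client, msg, len) == (ssize_t)len)
            n = read(conn, out, out_size);
        unlink(path);
    }
    if (conn != -1)
        close(conn);
    if (client != -1)
        close(client);
    if (server != -1)
        close(server);
    return n;
}
```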
- getaddrinfo (3): domain -> address
- getnameinfo (3): address -> domain
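A sketch of the two conversions used together. To keep the example independent of DNS, it passes numeric flags so that no lookup happens; dropping those flags would resolve real host names. The function name is hypothetical.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: parses a numeric host and port with getaddrinfo()
 * and formats them back with getnameinfo().
 * Returns 0 on success or a getaddrinfo-style error code. */
int numeric_roundtrip(const char *host, const char *port,
                      char *host_out, size_t host_len,
                      char *port_out, size_t port_len)
{
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;                    /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;   /* no DNS lookup */

    int rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0)
        return rc;

    rc = getnameinfo(res->ai_addr, res->ai_addrlen,
                     host_out, host_len, port_out, port_len,
                     NI_NUMERICHOST | NI_NUMERICSERV);
    freeaddrinfo(res);
    return rc;
}
```

Error codes from both functions can be turned into text with gai_strerror.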
 Kerrisk, Michael. The Linux programming interface: a Linux and UNIX system programming handbook. No Starch Press, 2010.