0x322 Linux Interface

This page is my summary of the user-space API of POSIX and Linux. The main reference is The Linux Programming Interface [1].

Similar APIs are grouped together, and the manual section number is given in parentheses after each API.

Process

fork (2), vfork (BSD)

  • The return value is -1 on error (e.g. RLIMIT_NPROC reached), 0 in the child process, and the child's PID in the parent process.
  • The child gets copies of the parent's file descriptors, and each copy refers to the same open file description, so the file offset is shared (e.g. an lseek in the child is visible in the parent).
  • The child and parent use separate page tables, but both initially point to the same physical pages. The text segment is read-only, so it never needs copying. The data, heap, and stack pages are copy-on-write: when either process modifies such a page, it gets its own private copy.
  • Since Linux 2.6.32 the parent usually runs first by default, but this can be controlled through /proc/sys/kernel/sched_child_runs_first. Running the child first avoids wasted page copies when the child immediately calls exec. Running the parent first is cheaper because the parent's state is already loaded in the CPU and TLB.
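
The return values and the shared file offset can be sketched as follows. This is a minimal example, not taken from the reference; the helper name and the temporary file path are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that moves the file offset of an
 * inherited descriptor, then return the offset seen by the parent
 * (or -1 on error). */
off_t offset_seen_by_parent_after_child_seek(void)
{
    char path[] = "/tmp/fork_demo_XXXXXX";
    int fd = mkstemp(path);            /* temporary file, offset starts at 0 */
    if (fd == -1)
        return -1;
    unlink(path);                      /* file goes away once fd is closed */

    pid_t pid = fork();
    if (pid == -1) {                   /* error, e.g. RLIMIT_NPROC reached */
        close(fd);
        return -1;
    }
    if (pid == 0) {                    /* child: fork returned 0 */
        lseek(fd, 100, SEEK_SET);
        _exit(EXIT_SUCCESS);
    }

    /* parent: fork returned the child's PID */
    if (waitpid(pid, NULL, 0) == -1) {
        close(fd);
        return -1;
    }
    off_t off = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return off;
}
```

The parent should see offset 100, because both descriptors refer to the same open file description.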

_exit (2), exit(3)

  • _exit is the system call; exit is a libc function that eventually calls it
  • exit performs several actions before calling _exit: it runs the exit handlers (registered with atexit or on_exit) and flushes the stdio stream buffers
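
The difference in buffer flushing can be observed with a small experiment. This is a sketch with a hypothetical helper name: a child writes into a fully buffered stdio stream and then terminates with either exit or _exit.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: a child writes 5 bytes through stdio and terminates
 * with exit() or _exit(). Returns the file size seen by the parent
 * afterwards, or -1 on error. */
long file_size_after_child(int use_exit)
{
    char path[] = "/tmp/exit_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    pid_t pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        FILE *fp = fdopen(fd, "w");    /* regular file: fully buffered */
        if (fp == NULL)
            _exit(1);
        fputs("hello", fp);            /* stays in the user-space buffer */
        if (use_exit)
            exit(0);                   /* runs handlers, flushes the buffer */
        _exit(0);                      /* terminates at once, buffer is lost */
    }

    waitpid(pid, NULL, 0);
    struct stat st;
    long size = (fstat(fd, &st) == 0) ? (long)st.st_size : -1;
    close(fd);
    return size;
}
```

Note that exit in a forked child also flushes any stdio buffers inherited from the parent, which is one reason children that do not exec usually end with _exit.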

clone (2)

  • finer control over process creation

Threads

There are two main Linux implementations of POSIX threads:

  • LinuxThreads: the original implementation. It has many deviations from SUSv3 and is no longer provided since glibc 2.4.
  • Native POSIX Threads Library (NPTL): the modern implementation, with better performance and closer compliance with SUSv3. Here is a good tutorial for pthread.

Thread Creation

pthread_create (3)

  • specify function and argument to start a new thread

pthread_exit, pthread_join, pthread_detach (3)

  • pthread_detach: cannot be joined after detach
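
A minimal sketch of creating a thread and collecting its result with pthread_join. The function names are illustrative only.

```c
#include <pthread.h>
#include <stddef.h>

/* Start routine: receives the argument that was passed to pthread_create. */
static void *square(void *arg)
{
    int *value = arg;
    *value = *value * *value;
    return value;                      /* same effect as pthread_exit(value) */
}

/* Hypothetical helper: square n in a new thread; returns -1 on failure. */
int square_in_thread(int n)
{
    pthread_t tid;
    void *result;

    if (pthread_create(&tid, NULL, square, &n) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)   /* wait and fetch the return value */
        return -1;
    return *(int *)result;
}
```

Passing the address of a local variable is safe here only because the caller joins the thread before returning.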

Thread Synchronizations

Mutex

Currently, pthread mutex on Linux is mainly implemented with futex (fast userspace mutex).

pthread_mutex_lock, pthread_mutex_unlock (3)

  • pthread_mutex_lock blocks if the mutex is already held by another thread
  • an uncontended lock or unlock takes on the order of 25 ns each (the exact figure depends on the hardware)
  • static allocation: PTHREAD_MUTEX_INITIALIZER
  • dynamic allocation: pthread_mutex_init
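
As an illustration, the sketch below protects a shared counter with a statically initialized mutex. The thread count, iteration count, and names are arbitrary choices for this example.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER; /* static allocation */

static void *add_many(void *unused)
{
    (void)unused;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);   /* blocks while another thread holds it */
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Hypothetical helper: run the threads and return the final counter value,
 * or -1 if a thread could not be created. */
long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];

    counter = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        if (pthread_create(&threads[i], NULL, add_many, NULL) != 0)
            return -1;
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return counter;
}
```

Without the lock, the increments from different threads could interleave and the final value would usually be lower than NUM_THREADS * INCREMENTS.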

Condition Variables

Green Thread

User-space threads, scheduled by a runtime or VM rather than by the kernel

  • nonpreemptive (each thread gives up the CPU by itself)
  • less expensive than a kernel thread

File System

File Attributes

stat, lstat, fstat (2)

  • stat: returns info about a file. Requires execute (search) permission on every directory in the path
  • lstat: like stat, but if the path is a symbolic link it returns info about the link itself
  • fstat: returns info about the file referred to by an open file descriptor
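
The difference between stat and lstat shows up on a symbolic link. A small sketch follows; the helper name and temporary paths are assumptions.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a regular file and a symlink to it, then call
 * stat() and lstat() on the symlink. Returns 1 if stat() reports a regular
 * file and lstat() reports a symbolic link, 0 if not, -1 on error. */
int stat_follows_but_lstat_does_not(void)
{
    char dir[] = "/tmp/stat_demo_XXXXXX";
    char target[64], link_path[64];
    struct stat via_stat, via_lstat;
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(target, sizeof target, "%s/target", dir);
    snprintf(link_path, sizeof link_path, "%s/link", dir);

    FILE *fp = fopen(target, "w");
    if (fp != NULL) {
        fclose(fp);
        if (symlink(target, link_path) == 0) {
            if (stat(link_path, &via_stat) == 0 &&
                lstat(link_path, &via_lstat) == 0)
                result = S_ISREG(via_stat.st_mode) &&
                         S_ISLNK(via_lstat.st_mode);
            unlink(link_path);
        }
        unlink(target);
    }
    rmdir(dir);
    return result;
}
```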

File IO

According to POSIX, I/O is intended to be atomic to ordinary files and pipes and FIFOs

open (2)

  • O_RDONLY, O_WRONLY, O_RDWR: file access mode
  • O_CREAT: create the file if it does not exist
  • O_DIRECT: bypass the OS page cache
  • O_EXCL: used together with O_CREAT to avoid a race condition during creation. The existence check and the creation are one atomic step, so only one process succeeds in creating the file and the others fail.
  • O_APPEND: every write goes to the end of the file. The seek and the write are atomic on most file systems (but not on e.g. NFS, HDFS).
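
A typical use of O_CREAT with O_EXCL is a simple lock file. The sketch below uses a hypothetical helper name; EEXIST is the errno value open reports when the file already exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to create `path` exclusively.
 * Returns 1 if this call created the file, 0 if it already existed,
 * -1 on any other error. */
int create_exclusive(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return (errno == EEXIST) ? 0 : -1;
    close(fd);
    return 1;
}
```

If several processes call this on the same path at the same time, exactly one of them gets 1.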

pread/pwrite (2)

  • read or write at a specific offset without changing the current file offset of the fd.
  • Equivalent to atomically performing: lseek to the new offset, do the I/O, lseek back to the original offset.
  • useful for multithreaded applications.
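
A short sketch showing that pread and pwrite leave the file offset untouched. The helper name and file content are made up for the example.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write "0123456789" to a temp file with pwrite(),
 * read 3 bytes at offset 4 with pread() into `out` (NUL-terminated, so it
 * needs at least 4 bytes). Returns the descriptor's own offset afterwards,
 * or -1 on error. */
off_t pread_demo(char *out)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    off_t result = -1;
    if (pwrite(fd, "0123456789", 10, 0) == 10 &&
        pread(fd, out, 3, 4) == 3) {
        out[3] = '\0';
        result = lseek(fd, 0, SEEK_CUR);   /* still where it started */
    }
    close(fd);
    return result;
}
```

Because no shared offset is modified, several threads can call pread on the same descriptor without coordinating their seeks.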

readv/writev (2)

  • scatter/gather IO
  • readv reads from an fd into multiple buffers atomically
  • writev writes from multiple buffers to an fd atomically
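
A minimal scatter/gather sketch: a header and a body are written with one writev call and read back into two separate buffers with one readv call. A pipe is used here only to keep the example self-contained.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: gather a 4-byte header and a 5-byte body into one
 * writev() on a pipe, then scatter them into two buffers with readv().
 * Returns the number of bytes read back, or -1 on error. */
ssize_t scatter_gather_demo(char header_out[4], char body_out[5])
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    char header[4] = {'H', 'D', 'R', ':'};
    char body[5]   = {'h', 'e', 'l', 'l', 'o'};
    struct iovec out[2] = {
        { .iov_base = header, .iov_len = sizeof header },
        { .iov_base = body,   .iov_len = sizeof body   },
    };
    struct iovec in[2] = {
        { .iov_base = header_out, .iov_len = 4 },
        { .iov_base = body_out,   .iov_len = 5 },
    };

    ssize_t n = -1;
    if (writev(fds[1], out, 2) == 9)    /* one call, two source buffers */
        n = readv(fds[0], in, 2);       /* one call, two destination buffers */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```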

fcntl (2)

  • perform control operations on an open file descriptor

IO Multiplexing

select

poll

epoll

File Monitoring

The inotify API was added in kernel 2.6.13 to replace dnotify.

  • inotify_init: create an inotify instance
  • inotify_add_watch: add items to the watch lists
  • read from the inotify file descriptor to retrieve inotify_event structs
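
The three steps above can be sketched as follows. The helper name, the watched temporary directory, and the choice of IN_CREATE are assumptions for the example.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: watch a fresh temp directory, create a file in it,
 * and return the mask of the first event read (0 on error). */
uint32_t first_event_after_create(void)
{
    char dir[] = "/tmp/inotify_demo_XXXXXX";
    char path[64];
    /* Room for one event plus a file name, aligned for struct access. */
    char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    uint32_t mask = 0;

    if (mkdtemp(dir) == NULL)
        return 0;
    snprintf(path, sizeof path, "%s/new_file", dir);

    int ifd = inotify_init();                       /* create the instance */
    if (ifd != -1) {
        if (inotify_add_watch(ifd, dir, IN_CREATE) != -1) {
            FILE *fp = fopen(path, "w");            /* triggers IN_CREATE */
            if (fp != NULL) {
                fclose(fp);
                ssize_t n = read(ifd, buf, sizeof buf);
                if (n >= (ssize_t)sizeof(struct inotify_event)) {
                    const struct inotify_event *ev =
                        (const struct inotify_event *)buf;
                    mask = ev->mask;
                }
                unlink(path);
            }
        }
        close(ifd);
    }
    rmdir(dir);
    return mask;
}
```

A real monitor would loop over read and walk every event in the buffer, since one read can return several events.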

Other IO

io_uring

Command

  • lsof: list open files

Signals

Memory

This section describes memory related API in user space

Shared Memory

mmap, munmap (2)

  • mmap can select whether memory is private (MAP_PRIVATE) or shared (MAP_SHARED)
  • mmap can map file or map anonymous memory (an option for memory allocation).
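  • On Linux, offset must be a multiple of the page size. addr is only a hint unless MAP_FIXED is given, in which case it must be page-aligned too. length is rounded up to a multiple of the page size (the page size can be retrieved with sysconf(_SC_PAGESIZE) or the legacy getpagesize(2))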
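
The difference between MAP_SHARED and MAP_PRIVATE can be seen with an anonymous mapping and fork. This is a sketch with a hypothetical helper name, not code from the reference.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map one int anonymously with the given sharing flag
 * (MAP_SHARED or MAP_PRIVATE), let a child overwrite it, and return the
 * value the parent sees afterwards (-1 on error). */
int value_seen_by_parent(int sharing_flag)
{
    /* addr = NULL lets the kernel pick the address; the length is rounded
     * up to a whole page. */
    int *value = mmap(NULL, sizeof *value, PROT_READ | PROT_WRITE,
                      sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (value == MAP_FAILED)
        return -1;
    *value = 1;

    pid_t pid = fork();
    if (pid == -1) {
        munmap(value, sizeof *value);
        return -1;
    }
    if (pid == 0) {
        *value = 42;                   /* shared: visible to the parent;
                                          private: copy-on-write, invisible */
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    int seen = *value;
    munmap(value, sizeof *value);
    return seen;
}
```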

mprotect, madvise, mlock, msync (2)

  • mprotect: change protection (PROT_READ, PROT_WRITE, PROT_EXEC) of a region
  • madvise: tell the OS the expected access pattern so it can make good guesses (e.g. random or sequential)
  • mlock: lock the region in RAM to prevent it from being swapped out.
  • msync: force modified memory to be written back to the mapped file (sync or async)

Memory Allocation

brk, sbrk (2)

  • heap allocation system call
  • change current program break.
  • brk specifies the new program break address
  • sbrk specifies an increment and returns the previous program break, so sbrk(0) returns the current program break
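
A small sketch of moving the program break with sbrk. The helper name is made up, and real programs should normally leave the break to malloc.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: grow the heap by `bytes` with sbrk() and return how
 * far the program break moved, or -1 on error. */
intptr_t grow_break(intptr_t bytes)
{
    void *before = sbrk(0);             /* sbrk(0): current program break */
    if (before == (void *)-1)
        return -1;
    if (sbrk(bytes) == (void *)-1)      /* on success returns the old break */
        return -1;
    void *after = sbrk(0);
    return (intptr_t)((char *)after - (char *)before);
}
```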

malloc, free (3)

  • memory allocation can be implemented with brk, sbrk, or mmap.
  • free memory blocks are managed (as a linked list) in user space to reduce syscalls
  • malloc first searches the free list for a block that is large enough. If one is found, it is marked as used and returned. If not, malloc calls brk or sbrk to obtain new memory. Searching first avoids a system call on every allocation. free returns the block to the free list (in user space), usually without calling brk or sbrk.
  • first fit strategy: used by the K&R malloc and by malloc implementations in embedded systems. It picks the first block that is large enough. The drawback is memory fragmentation.
  • best fit strategy: used by the glibc malloc implementation.
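
To make the free-list idea concrete, here is a toy first-fit allocator over a fixed arena. It is a simplified sketch, not the K&R or glibc code: the names toy_malloc and toy_free are hypothetical, the arena stands in for memory obtained with sbrk, and free blocks are never merged.

```c
#include <stddef.h>

#define ARENA_SIZE 4096
#define ALIGNMENT  _Alignof(max_align_t)

/* Header stored in front of every block; the list is kept in address order. */
typedef struct block {
    _Alignas(max_align_t) size_t size;  /* payload size in bytes */
    int used;
    struct block *next;
} block_t;

static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static block_t *head;

void *toy_malloc(size_t size)
{
    if (head == NULL) {                 /* first call: one big free block */
        head = (block_t *)arena;
        head->size = ARENA_SIZE - sizeof(block_t);
        head->used = 0;
        head->next = NULL;
    }
    size = (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1);   /* round up */

    for (block_t *b = head; b != NULL; b = b->next) {
        if (b->used || b->size < size)
            continue;                   /* first fit: take the first match */
        if (b->size >= size + sizeof(block_t) + ALIGNMENT) {
            /* Split: the tail of this block becomes a new free block. */
            block_t *rest = (block_t *)((unsigned char *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->used = 0;
            rest->next = b->next;
            b->size = size;
            b->next = rest;
        }
        b->used = 1;
        return b + 1;                   /* payload starts after the header */
    }
    return NULL;                        /* a real malloc would ask the kernel
                                           for more memory at this point */
}

void toy_free(void *ptr)
{
    if (ptr != NULL)
        ((block_t *)ptr - 1)->used = 0; /* back on the list, no syscall */
}
```

Freeing a block and then requesting a smaller size returns the same address again, which shows both the reuse without system calls and how first fit carves up early blocks and so fragments memory.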

calloc, realloc (3)

  • calloc: malloc with zero initialization
  • realloc: resize an existing block; can be used in vector, map implementations

memalign, posix_memalign (3)

  • allocate memory with a specific alignment.
  • useful for SSE, AVX…
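
A minimal sketch of an aligned allocation with posix_memalign. The helper name and the 32-byte alignment in the usage note are illustrative choices.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` floats aligned to `alignment` bytes.
 * Returns NULL on failure; release the result with free(). */
float *alloc_aligned_floats(size_t count, size_t alignment)
{
    void *ptr = NULL;
    /* alignment must be a power of two and a multiple of sizeof(void *) */
    if (posix_memalign(&ptr, alignment, count * sizeof(float)) != 0)
        return NULL;
    return ptr;
}
```

For example, alloc_aligned_floats(8, 32) returns a buffer whose address is a multiple of 32, which suits aligned 256-bit vector loads.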

alloca (3)

  • allocate memory on the stack; it is released automatically when the calling function returns

Credentials

Users / Groups

/etc/passwd

nonsensitive system password file

  • login name: unique user name
  • encrypted password: DES hash of password, x if shadow password enabled
  • User ID (UID): superuser (root) if value 0
  • Group ID (GID): ID of the user's primary (first) group
  • Comment: text about the user
  • home directory: HOME variable
  • login shell: shell

/etc/shadow

sensitive password file. password hash is saved here

/etc/group

group info (note that part of the info is saved in /etc/passwd). The current login user’s groups can be checked with groups (1)

  • group name: unique group name
  • encrypted password: group password, x if shadow password enabled
  • group ID (GID): group id
  • user list: users

IO

IPC

The lowest level IPC on Windows is done by COM (component object model). On Linux, there are two families of IPC: System V IPC and POSIX IPC. POSIX IPC is the newer one and is thread safe, but it is not supported on some operating systems.

System V IPC

POSIX IPC

Pipe

Pipes are the oldest method of IPC (from Version 3 Unix). A pipe is a unidirectional byte stream; random access (e.g. lseek) is not allowed. However, pipes do not appear to be supported in the recent Zircon microkernel.

Writes of up to PIPE_BUF bytes (4096 on Linux) are guaranteed to be atomic. The pipe capacity can be modified with fcntl (up to about 1 MiB by default on Linux). A larger capacity can help reduce context switches.

int pipe(int filedes[2]) (2)

  • filedes[0] is the read end, filedes[1] is the write end. Once all write fds are closed, readers receive EOF
  • Normally pipe is followed by fork, and the child process inherits copies of the parent’s fds. Usually one process closes the read end and the other closes the write end.
  • Bidirectional IPC can be implemented with two pipes
  • Closing unused pipe fds is important.
    • If redundant write fds are not closed, the reader never sees EOF.
    • If redundant read fds are not closed, the writer does not receive SIGPIPE when the real reader goes away
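
The usual pipe-then-fork pattern, including closing the unused ends, can be sketched like this. The helper name is hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes `msg` into a pipe and the parent
 * reads until EOF (or until `out` is full). Returns the number of bytes
 * received, or -1 on error. */
ssize_t read_from_child(const char *msg, char *out, size_t out_size)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);                  /* child only writes */
        ssize_t w = write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(w == (ssize_t)strlen(msg) ? 0 : 1);
    }

    close(fds[1]);                      /* parent only reads; if this end
                                           stayed open, read() would never
                                           return EOF */
    ssize_t total = 0, n = 0;
    while ((size_t)total < out_size &&
           (n = read(fds[0], out + total, out_size - (size_t)total)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (n == -1) ? -1 : total;
}
```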

Socket

socket (2)

  • system call to create a socket and return its file descriptor.
  • There are three main socket domains: AF_UNIX for sockets on the same host, AF_INET for IPv4, AF_INET6 for IPv6.
  • There are two main types: SOCK_STREAM for connection-oriented communication (e.g. TCP), SOCK_DGRAM for connectionless communication (e.g. UDP)

Stream Socket

Reference: The Linux Programming Interface [1]

bind (2)

  • bind socket to an address. addr is a generic structure to handle both pathname (for unix socket) and IP (for inet socket)

listen (2)

  • listen for incoming connections; backlog is the limit on pending connections.

accept (2)

  • server side interface to accept a connection
  • can be configured as either blocking or nonblocking

connect (2)

  • client side interface
  • connecting to a peer socket
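
The four calls fit together as shown in this sketch of a one-shot exchange over a UNIX domain stream socket. The helper name and the socket path passed by the caller are assumptions; an Internet socket would use sockaddr_in instead of sockaddr_un.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the parent binds and listens on `path`, a forked
 * child connects and sends `msg`, and the parent accepts and reads it.
 * Returns the number of bytes received, or -1 on error. */
ssize_t unix_stream_roundtrip(const char *path, const char *msg,
                              char *out, size_t out_size)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);

    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener == -1)
        return -1;
    unlink(path);                       /* remove a stale socket file */
    if (bind(listener, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(listener, 1) == -1) {
        close(listener);
        return -1;
    }

    pid_t pid = fork();                 /* listening starts before the fork,
                                           so the client cannot be too early */
    if (pid == -1) {
        close(listener);
        unlink(path);
        return -1;
    }
    if (pid == 0) {                     /* client side */
        close(listener);
        int c = socket(AF_UNIX, SOCK_STREAM, 0);
        if (c == -1 ||
            connect(c, (struct sockaddr *)&addr, sizeof addr) == -1)
            _exit(1);
        ssize_t w = write(c, msg, strlen(msg));
        close(c);
        _exit(w < 0);
    }

    ssize_t n = -1;
    int conn = accept(listener, NULL, NULL);   /* blocks until a client connects */
    if (conn != -1) {
        n = read(conn, out, out_size);
        close(conn);
    }
    close(listener);
    waitpid(pid, NULL, 0);
    unlink(path);
    return n;
}
```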

Datagram Socket

Reference: The Linux Programming Interface [1]

  • recvfrom
  • sendto

DNS

  • getaddrinfo (3): domain -> address
  • getnameinfo (3): address -> domain
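
Both calls can be combined in a small sketch that resolves a host name and prints its first address in numeric form. The helper name is hypothetical.

```c
#define _GNU_SOURCE
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolve `host` to its first address and write it in numeric form into
 * `out`. Returns 0 on success, or a getaddrinfo/getnameinfo error code. */
int first_address(const char *host, char *out, socklen_t out_size)
{
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(host, NULL, &hints, &res);   /* name -> address */
    if (rc != 0)
        return rc;                      /* decode with gai_strerror(rc) */

    /* address -> text; NI_NUMERICHOST skips the reverse DNS lookup */
    rc = getnameinfo(res->ai_addr, res->ai_addrlen, out, out_size,
                     NULL, 0, NI_NUMERICHOST);
    freeaddrinfo(res);
    return rc;
}
```

Dropping NI_NUMERICHOST makes getnameinfo perform the reverse lookup (address -> domain) instead.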

Reference

[1] Kerrisk, Michael. The Linux programming interface: a Linux and UNIX system programming handbook. No Starch Press, 2010.