0x363 Database

1. History

  • 1960: Charles Bachman designed the Integrated Data Store (IDS), the first DBMS. (Turing 1973)
  • 1964: IBM designed IMS (hierarchical model) for Apollo program (BOM for Saturn V)
  • 1969: CODASYL (network model) published
  • 1970: Codd from the IBM research lab published the paper introducing the relational model. (Turing 1981)
  • 1973: the Ingres project (led by Stonebraker) demonstrated that it was possible to build a practical and efficient implementation of the relational model. (Turing 2014)
  • 1974: SQL designed by Donald Chamberlin and Raymond Boyce at IBM (SQL became an ANSI standard in 1986)
  • 1979: Oracle database
  • 1995: MySQL (MySQL AB -> Sun Microsystems -> Oracle in 2010); it was forked into MariaDB by its co-founder.
  • 1996: PostgreSQL (originally called POSTGRES, a successor of the Ingres database)

2. Data Model

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.

2.1. Relational Model

The relational model is a data model that represents data as unordered sets of tuples.

2.1.1. Key Concepts

A relational database contains a set of named relations (or tables). Each relation has a set of named attributes (or columns). Each tuple (or row) has a value for each attribute; tuples in a relation are not ordered. Each attribute has a type (or domain).

The schema is the formal description of the relations in a database; it plays a role similar to typing in programming languages.

An instance is the actual contents at a given point in time. There is a special value NULL for unknown or undefined values, which plays an important role in relational databases.

A key is an attribute whose value is unique in each tuple, or a set of attributes whose combined values are unique. Keys are often denoted by underlining the key attributes. Relational databases typically build special indexes on keys for efficiency.

There are two types of languages in the relational model:

  • DDL (data definition language): create, drop
  • DML (data manipulation language): select, insert

2.1.2. Querying Relational Database

The steps in creating and using a relational database

  • Design the schema; create it using DDL
  • Bulk load the initial data
  • Repeat: execute queries and modifications

Queries return relations (or tables).
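The steps above can be sketched with SQLite from Python's standard library; the table, columns, and data here are illustrative, not part of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Design schema; create using DDL
cur.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, gpa REAL)")

# 2. Bulk load initial data
cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Alice", 3.9), (2, "Bob", 3.5), (3, "Carol", 3.7)],
)

# 3. Repeat: execute queries and modifications; the result is itself a table
cur.execute("SELECT name FROM student WHERE gpa > 3.6 ORDER BY name")
print(cur.fetchall())  # -> [('Alice',), ('Carol',)]
```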

2.1.3. Star schema and Snowflakes

Definition (star schema) The star schema is adopted by relational data warehouses; it classifies tables as either dimension tables or fact tables.

Definition (fact table) A fact table stores observations or events and can be very large. It contains dimension keys referencing rows in the dimension tables.

Definition (dimension table) A dimension table describes the business entities.

Definition (snowflake schema) A star schema stores redundant data in dimension tables, while a snowflake schema fully normalizes dimension tables and avoids data redundancy.
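A minimal star schema can be sketched in SQL (via Python's sqlite3); the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: describes a business entity
cur.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)"
)
# Fact table: stores events, holding keys into the dimension tables
cur.execute(
    """CREATE TABLE fact_sales (
           sale_id INTEGER PRIMARY KEY,
           product_id INTEGER REFERENCES dim_product(product_id),
           quantity INTEGER)"""
)

cur.execute("INSERT INTO dim_product VALUES (1, 'widget', 'hardware')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [(1, 1, 5), (2, 1, 3)])

# A typical warehouse query joins the fact table to its dimensions
cur.execute(
    """SELECT p.name, SUM(f.quantity)
       FROM fact_sales f JOIN dim_product p USING (product_id)
       GROUP BY p.name"""
)
print(cur.fetchall())  # -> [('widget', 8)]
```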

2.2. Hierarchical Model

The hierarchical model (or document model) represents all data as a tree structure. It works well for one-to-many relationships, but makes many-to-many relationships difficult.

To handle many-to-many relationships, we can

  • denormalize the data (duplicating it)
  • resolve the join inside the application.

2.2.1. XML

XML (Extensible Markup Language) is a standard for data representation and exchange. XML has more flexibility when compared with the relational model.

The basic constructs of XML

  • tagged elements, which can be nested
  • attributes
  • text (can be thought of as the content of a leaf node)

A well-formed XML document adheres to basic structural requirements:

  • single root
  • matched tags, no unclosed tags
  • unique attributes within elements

A valid XML document adheres to the basic structural requirements and also to a content-specific specification. Examples of languages for describing the specification are:

  • Document Type Descriptor (DTD)
  • XML Schema (XSD)

The specification can be provided to the XML parser to decide whether the XML is valid. For example, xmllint can be used for validation.

Document Type Descriptor (DTD) DTD is a grammar-like language for specifying elements, attributes, nesting, ordering, and occurrences. It also has special attribute types such as ID and IDREF(S), which work like pointers.

XML Schema (XSD) In addition to the DTD features, XSD also specifies data types, keys, and pointers. It is itself written in XML.
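The well-formedness check (as opposed to validation against a DTD/XSD) can be sketched with Python's built-in XML parser; the sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc: str) -> bool:
    """Return True if the document parses: single root, matched tags."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(is_well_formed('<course id="cs145"><title>Databases</title></course>'))  # True
print(is_well_formed('<course><title>Databases</course>'))  # False: mismatched tags
```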

2.2.2. JSON

JSON is a standard for serializing data objects into a human-readable format. It is useful for data interchange and for representing and storing semistructured data.
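A round trip through Python's json module shows both uses (the record itself is illustrative):

```python
import json

# A semistructured record: nested objects and arrays, no fixed schema
record = {"name": "Alice", "courses": [{"id": "cs145", "grade": "A"}]}

# Serialize to a human-readable string for interchange or storage
text = json.dumps(record)
print(text)

# Deserialize; the round trip preserves the structure
assert json.loads(text) == record
```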

2.3. Network Model

The network model represents data using a graph structure. Links between records in the network model are like pointers in programming languages. It was standardized by CODASYL.

Querying the network model requires manually following a path from a root record.

3. Query Language

3.1. Relational Algebra

Relational algebra defines the interface for retrieving and manipulating tuples in relations: the input is a relation and the output is a relation. It was proposed by Codd while working at IBM.

A query (or expression) on a set of relations produces a relation as a result.

Note that the semantics of relational algebra removes duplicates from the result (because relations are sets of rows).

operator (select) select (\(\sigma\)) picks certain rows based on a condition.


operator (project) project (\(\Pi\)) picks certain columns.


operator (cross product) combines two relations.

\[A \times B\]

Suppose \(A\) has \(s\) tuples and \(B\) has \(t\) tuples; then the cross product has \(st\) tuples. The relation name is attached to each attribute to disambiguate.

\[|A \times B| = |A||B|\]

operator (natural join) Enforces equality on all attributes with the same name, and eliminates one copy of the duplicate attributes.

\[A \bowtie B\]

Natural join does not add any expressive power to the algebra; it can be expressed using selection, projection, and cross product:

\[ A \bowtie B = \Pi_{schema(A) \cup schema(B)} (\sigma_{A.A_1 = B.A_1, A.A_2 = B.A_2, ...} (A \times B))\]

operator (theta join) Applies a condition \(\theta\) to the join; this is the basic join operation used in DBMSs.

\[A \bowtie_{\theta} B = \sigma_{\theta} (A \times B)\]
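The operators above can be sketched over Python sets of tuples, including a theta join expressed as a selection over the cross product; the relations and attribute positions are illustrative.

```python
# Relations as sets of tuples; attribute positions are fixed per relation
R = {(1, "Alice"), (2, "Bob")}                   # R(id, name)
S = {(1, "cs145"), (1, "cs245"), (2, "cs229")}   # S(id, course)

def select(rel, pred):
    """sigma: keep the rows satisfying the condition."""
    return {t for t in rel if pred(t)}

def project(rel, indices):
    """pi: keep the given columns (the set automatically removes duplicates)."""
    return {tuple(t[i] for i in indices) for t in rel}

def cross(r, s):
    """cartesian product: concatenate every pair of tuples."""
    return {a + b for a in r for b in s}

# Theta join as sigma_theta(R x S); here theta is R.id = S.id
joined = select(cross(R, S), lambda t: t[0] == t[2])
print(project(joined, (1, 3)))  # pairs of (name, course)
```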

The following are set operators.

operator (union) combines rows vertically; the schemas must match

\[A \cup B\]

operator (difference)

\[ A - B\]

operator (intersection) intersection can be expressed with union and difference, e.g. \(A \cap B = A - (A - B)\)

\[ A \cap B\]

operator (rename) renames the schema of a relation

\[\rho_{R(A_1, A_2, ..., A_n)}(E)\]

The rename operator is used to unify schemas for set operators, whose input schemas must match.

3.2. SQL

SQL is a declarative language partially implementing relational algebra, derived from SEQUEL at IBM. A table in SQL is an unordered multiset of rows.

statement (select)

\[\text{Select }A_1, A_2, ..., A_n\text{ From }R_1, R_2, ..., R_m\text{ where condition}\]

This is equivalent to the following relational algebra expression

\[\Pi_{A_1, A_2, ..., A_n}(\sigma_{condition}(R_1 \times R_2 \times ... \times R_m))\]

duplicates: select may return duplicates because SQL uses a multiset model; to remove duplicates, add distinct after select

order: select returns results without any guaranteed order; to force an order, add order by attr at the end.
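Both points can be observed with a small example (sqlite3 shown; the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE enrolled (sid INTEGER, dept TEXT)")
cur.executemany("INSERT INTO enrolled VALUES (?, ?)",
                [(1, "cs"), (2, "cs"), (3, "ee")])

# Plain SELECT keeps duplicates: SQL tables are multisets, and the row
# order is not guaranteed
cur.execute("SELECT dept FROM enrolled")
print(cur.fetchall())  # e.g. [('cs',), ('cs',), ('ee',)]

# DISTINCT removes duplicates; ORDER BY forces an output order
cur.execute("SELECT DISTINCT dept FROM enrolled ORDER BY dept")
print(cur.fetchall())  # [('cs',), ('ee',)]
```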

4. Index Management

An index is an additional structure derived from the primary data. Maintaining an index incurs overhead, especially on writes: well-chosen indexes speed up read queries, but every index slows down writes.

When the entire database cannot fit into memory, we store the data on disk and maintain an index mapping each key to a disk offset. The following three are the most common indexes; they can be kept in memory or on disk.

4.1. Hash Index

A hash map can be used as an in-memory index; one typical hash-based index is the following.

4.1.1. Bitcask

Bitcask is the default storage engine for Riak

This introduction paper is quite easy to understand.

We use a hash index to map each key to its offset in a log-structured data file.

  • Whenever we append a new key-value pair to the file, we update the hash map to point the key to the new offset.
  • The log is split into segments of a certain size, which are compacted later.

The hash map points to the file id (segment id) and the offset within that file.

(Figures omitted: the layout of the key field in the hash map and of the value field in the data file.)


Pros:

  • high-speed reads/writes (the paper says it can do 5-6k writes per second)
  • good for cases where the value of each key is updated frequently

Cons:

  • the entire hash map must be kept in memory
  • range queries are not efficient (might be achievable through a secondary index though)
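A minimal sketch of the idea, with an append-only log plus an in-memory hash index; the record format here is made up for illustration and is not Bitcask's actual on-disk format.

```python
class TinyBitcask:
    """Append-only log with a hash index mapping key -> log offset."""

    def __init__(self):
        self.log = bytearray()   # stands in for the on-disk segment file
        self.index = {}          # key -> offset of the latest record

    def put(self, key: str, value: str):
        record = f"{key},{value}\n".encode()
        self.index[key] = len(self.log)   # point the key at the new offset
        self.log += record                # append, never overwrite

    def get(self, key: str):
        off = self.index[key]             # one lookup, one "seek"
        line = bytes(self.log[off:self.log.index(b"\n", off)])
        return line.decode().split(",", 1)[1]

db = TinyBitcask()
db.put("user1", "alice")
db.put("user1", "alicia")   # an update appends; the index skips the stale record
print(db.get("user1"))      # -> alicia
```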

4.2. SSTable (memtable)

When we cannot fit the entire hash map into memory, we can use SSTables together with an in-memory memtable.

An SSTable (Sorted String Table) is a segment file whose key-value pairs are sorted by key; new key-value pairs are first buffered in the memtable (e.g., a red-black/AVL tree).

4.2.1. LSM-Tree and memtable

As the data grows, we write SSTables out to storage and use the memtable to hold only the latest key-value pairs.

memtable (e.g., red-black/AVL trees) stores new key-value pairs temporarily. When it gets bigger than some threshold (e.g., a few MB), we write the pairs out, sorted, into a new SSTable.

Pros (over hash index):

  • merging segments is simple and efficient using a priority queue (see the related Leetcode problem)
  • the in-memory index can be sparse: one key every few kilobytes (because key-value pairs are sorted within each segment)


4.2.2. LevelDB

4.2.3. Cassandra

4.3. B-Tree

5. Storage Management

One way to manage storage is file mapping (mmap); it works well for reading, but not when there are multiple writers. Instead, databases tend to manage storage themselves using database pages (different databases use different page sizes).

Heap File A heap file is an unordered collection of pages. There are two styles of heap file.

  • Linked List: a header page manages pointers to data pages and to free pages
  • Page Directory: tracks the location and the free slots of each page

5.1. Layout Model

Each page contains a header (page size, checksum, version, ...). There are two types of layouts:

  • page-oriented layout
  • log-structured layout

5.1.1. Page-Oriented Model (Tuple-Oriented Model)

The tuple-oriented model is usually associated with B-Trees.

Pros:

  • fast reads

Slotted Tuples

Reference: CMU 15445 Lecture 3

5.1.2. Log-Structured-based Model

The log-structured model appends new log records to pages (HBase, Cassandra, LevelDB, ...).

Pros:

  • fast writes

The most popular model is the LSM-Tree + SSTable.

SSTable (Sorted String Table): requires the key-value pairs to be sorted by key within each segment.

LSM-Tree An LSM-Tree's storage consists of multiple SSTables:

  • when the memtable gets bigger than some threshold, we write it out to disk as an SSTable
  • from time to time, we merge and compact segments to discard overwritten values

5.2. Storage Model

5.2.1. Row Model (N-ary Storage Model)

Reference: CMU 15445 Lecture 4

5.2.2. Column Model (Decomposition Storage Model)

The DBMS stores the values of a single attribute for all tuples contiguously in a page

Reference: CMU 15445 Lecture 4

Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes

Large values:

  • move them to an overflow page and store a pointer
  • move them to an external file and store the path

Column Compression

The sequence of values in each column is usually repetitive, so we can compress it using several techniques; compression reduces memory bandwidth and helps vectorize processing.

Compression techniques include, for example:

  • bitmap encoding: for each unique value, we store a bitmap
  • run-length encoding: if the number of distinct values is very large, the bitmaps are sparse, and we can compress them further with run-length encoding
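Both techniques can be sketched in a few lines; the column values are illustrative.

```python
# A repetitive column of categorical values
column = ["cs", "cs", "ee", "cs", "ee", "ee"]

# Bitmap encoding: one bitmap per distinct value
bitmaps = {v: [1 if x == v else 0 for x in column] for v in set(column)}
print(bitmaps["cs"])  # [1, 1, 0, 1, 0, 0]

# Run-length encoding of a bitmap: (bit, run length) pairs
def rle(bits):
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

print(rle(bitmaps["cs"]))  # [(1, 2), (0, 1), (1, 1), (0, 2)]
```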

Compression Order

We can impose an order on all rows using some column to improve processing efficiency and compression. Different replicas can be sorted by different columns.

5.2.3. Wide-Column Model

It is a mix of the row-oriented and column-oriented models:

  • column families are stored separately (as in the column model)
  • but within a column family, multiple columns are stored together (as in the row model)

6. Memory Management

Reference: CMU 15445 Lecture 5

The buffer pool is a large memory region organized as an array of fixed-size pages. An array entry (slot) is called a frame. A DBMS may have multiple buffer pools.

The page table keeps track of the pages currently in the buffer pool, and maintains per-page metadata such as a dirty flag and a pin/reference counter for concurrency.

6.1. Optimization

  • scan sharing: decide which page to evict based on the access pattern of multiple queries
  • pre-fetching: prefetch pages based on the query
  • buffer pool bypass: bypass the buffer pool to reduce overhead
  • OS page cache: use O_DIRECT to bypass the OS page cache (one exception is PostgreSQL, which relies on it)

6.1.1. Buffer Replacement Policies

The following are the most common policies. Enterprise databases usually have much better policies than open-source databases.

  • LRU (Least Recently Used): maintain a timestamp per page and evict pages in timestamp order. The disadvantage of LRU is sequential flooding: under a sequential scan, the most recent page is the most useless.
  • Clock: an approximation of LRU. Set a bit to 1 when a page is accessed, and evict a page whose bit is 0. It has the same problem as LRU.
  • LRU-K: take the history of the last K references as timestamps and compute the intervals between subsequent accesses; the intervals help predict when the page will be accessed next. A typical K is 2.
  • Localization: restrict a query to its private buffer rather than polluting the whole buffer pool.
  • Priority Hints: provide hints about whether a page is important. Dirty pages also need to be taken into account (e.g., written back in the background).
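Plain LRU can be sketched with an ordered dict; the page ids and "page data" are illustrative, and a real buffer pool would track dirty flags and pins as described above.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Toy buffer pool: holds `capacity` pages, evicts the least recently used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data; dict order = recency

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: mark as most recent
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)       # miss: evict least recent
        self.pages[page_id] = f"data-{page_id}"  # simulate fetching from disk
        return self.pages[page_id]

pool = LRUBufferPool(capacity=3)
for p in [1, 2, 3, 1]:
    pool.access(p)
pool.access(4)            # evicts page 2, the least recently used
print(list(pool.pages))   # [3, 1, 4]
```

A sequential scan over more pages than the capacity would evict every page right before its reuse, which is the sequential-flooding weakness noted above.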

7. Query Optimization

7.1. Join Order and Algorithm

The performance of a query plan is largely determined by the join order. Most query optimizers determine the join order via dynamic programming.

To actually perform a join, there are several join algorithms.

Algorithm (nested loop join) two nested loops over \(R\) and \(S\)

def nested_loop_join(R, S, condition):
    for r in R:                   # outer loop over R
        for s in S:               # inner loop over S
            if condition(r, s):   # check the join condition
                yield r + s       # emit the combined tuple <r, s>

Algorithm (blocked nested loop join) A generalization of nested loop join: suppose \(|R| < |S|\); cache blocks of \(R\) in memory and scan \(S\) only once per block of \(R\).

Algorithm (sort-merge join) Sort \(R\) and \(S\) by the join attribute, then process them with a linear scan

Algorithm (hash join) Build a hash table on the smaller relation, then probe it with the other relation.
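The build-and-probe structure can be sketched as follows; the relations and the choice of join key are illustrative.

```python
from collections import defaultdict

def hash_join(R, S, r_key, s_key):
    """Build a hash table on the smaller relation R, then probe it with S."""
    table = defaultdict(list)
    for r in R:                          # build phase
        table[r[r_key]].append(r)
    out = []
    for s in S:                          # probe phase
        for r in table.get(s[s_key], []):
            out.append(r + s)
    return out

R = [(1, "Alice"), (2, "Bob")]                   # R(id, name), the smaller relation
S = [(1, "cs145"), (2, "cs229"), (1, "cs245")]   # S(id, course)
print(hash_join(R, S, r_key=0, s_key=0))
# [(1, 'Alice', 1, 'cs145'), (2, 'Bob', 2, 'cs229'), (1, 'Alice', 1, 'cs245')]
```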

8. Transaction

property (ACID)

  • Atomicity: a commit is either fully successful or rolled back to the prior state completely
  • Consistency: maintaining database invariants: a transaction will bring the database from one valid state to another
  • Isolation: a transaction is not affected by other concurrently running transactions
  • Durability: a successful commit survives permanently (usually meaning it is recorded in non-volatile storage)

8.1. Concurrency Control

There are three categories.

Optimistic concurrency control (OCC)

Allows transactions to execute concurrent read/write operations and determines whether the result of the combined execution is serializable. If there is a conflict, one of the conflicting transactions is aborted.

Multiversion concurrency control (MVCC)

Pessimistic concurrency control (PCC)

9. Reference