0x363 Database

1. History

  • 1960: Charles Bachman designed the Integrated Data Store (IDS), the first DBMS. (Turing 1973)
  • 1964: IBM designed IMS (hierarchical model) for Apollo program (BOM for Saturn V)
  • 1969: CODASYL (network model) published
  • 1970: Codd from the IBM research lab published the paper introducing the relational model. (Turing 1981)
  • 1973: the Ingres project (led by Stonebraker) demonstrated that it was possible to build a practical and efficient implementation of the relational model. (Turing 2014)
  • 1974: SQL designed by Donald Chamberlin and Raymond Boyce at IBM (SQL became an ANSI standard in 1986)
  • 1979: Oracle database
  • 1995: MySQL (MySQL AB -> Sun Microsystems -> Oracle in 2010); it was forked into MariaDB by its co-founder.
  • 1996: PostgreSQL (originally called POSTGRES, a successor of the Ingres database)

2. Data Model

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.

2.1. Relational Model

The relational model is a data model that represents data as unordered sets of tuples.

2.1.1. Key Concepts

A relational database contains a set of named relations (or tables). Each relation has a set of named attributes (or columns). Each tuple (or row) has a value for each attribute; tuples in a relation are not ordered. Each attribute has a type (or domain).

The schema is the formal description of the relations in a database; it plays a role similar to typing in programming languages.

An instance is the actual contents at a given point in time. There is a special value NULL for unknown or undefined values, which plays an important role in relational databases.

A key is an attribute whose value is unique in each tuple, or a set of attributes whose combined values are unique. Keys are often denoted by underlining the key attributes. Relational databases typically build special indexes on keys for efficiency.

There are two types of languages in the relational model:

  • DDL (data definition language): create, drop
  • DML (data manipulation language): select, insert

2.1.2. Querying Relational Database

The steps in creating and using a relational database

  • Design the schema; create it using DDL
  • Bulk load the initial data
  • Repeat: execute queries and modifications

Queries return relations (or tables).
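The steps above can be sketched with SQLite from Python's standard library; the table, columns, and data here are illustrative, not part of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Design schema; create using DDL
cur.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, gpa REAL)")

# 2. Bulk load initial data
cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Alice", 3.9), (2, "Bob", 3.5), (3, "Carol", 3.7)],
)

# 3. Repeat: execute queries and modifications; the result is itself a table
cur.execute("SELECT name FROM student WHERE gpa > 3.6 ORDER BY name")
print(cur.fetchall())  # -> [('Alice',), ('Carol',)]
```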

2.1.3. Star schema and Snowflakes

Definition (star schema) The star schema is adopted by relational data warehouses; it classifies tables as either dimension tables or fact tables.

Definition (fact table) A fact table stores observations or events and can be very large. It contains dimension keys referencing rows in the dimension tables.

Definition (dimension table) A dimension table describes the business entities.

Definition (snowflake schema) A star schema stores redundant data in dimension tables, while a snowflake schema fully normalizes dimension tables and avoids data redundancy.
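A minimal star schema can be sketched in SQL (via Python's sqlite3); the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: describes a business entity
cur.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)"
)
# Fact table: stores events, holding keys into the dimension tables
cur.execute(
    """CREATE TABLE fact_sales (
           sale_id INTEGER PRIMARY KEY,
           product_id INTEGER REFERENCES dim_product(product_id),
           quantity INTEGER)"""
)

cur.execute("INSERT INTO dim_product VALUES (1, 'widget', 'hardware')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [(1, 1, 5), (2, 1, 3)])

# A typical warehouse query joins the fact table to its dimensions
cur.execute(
    """SELECT p.name, SUM(f.quantity)
       FROM fact_sales f JOIN dim_product p USING (product_id)
       GROUP BY p.name"""
)
print(cur.fetchall())  # -> [('widget', 8)]
```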

2.2. Hierarchical Model

The hierarchical model (or document model) represents all data as a tree structure. It works well for one-to-many relationships, but makes many-to-many relationships difficult.

To handle many-to-many relationships, we can

  • denormalize the data (duplicating it)
  • resolve the join inside the application.

2.2.1. XML

XML (Extensible Markup Language) is a standard for data representation and exchange. XML has more flexibility when compared with the relational model.

The basic constructs of XML

  • tagged elements, which can be nested
  • attributes
  • text (can be thought of as the content of a leaf node)

A well-formed XML document adheres to basic structural requirements:

  • single root
  • matched tags, no unclosed tags
  • unique attributes within elements

A valid XML document adheres to the basic structural requirements and also to a content-specific specification. Examples of languages for describing the specification are:

  • Document Type Descriptor (DTD)
  • XML Schema (XSD)

The specification can be provided to the XML parser to decide whether the XML is valid. For example, xmllint can be used for validation.

Document Type Descriptor (DTD) DTD is a grammar-like language for specifying elements, attributes, nesting, ordering, and occurrences. It also has special attribute types such as ID and IDREF(S), which work like pointers.

XML Schema (XSD) In addition to the DTD features, XSD also specifies data types, keys, and pointers. It is itself written in XML.
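The well-formedness check (as opposed to validation against a DTD/XSD) can be sketched with Python's built-in XML parser; the sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc: str) -> bool:
    """Return True if the document parses: single root, matched tags."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(is_well_formed('<course id="cs145"><title>Databases</title></course>'))  # True
print(is_well_formed('<course><title>Databases</course>'))  # False: mismatched tags
```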

2.2.2. JSON

JSON is a standard for serializing data objects into a human-readable format. It is useful for data interchange and for representing and storing semistructured data.
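A round trip through Python's json module shows both uses (the record itself is illustrative):

```python
import json

# A semistructured record: nested objects and arrays, no fixed schema
record = {"name": "Alice", "courses": [{"id": "cs145", "grade": "A"}]}

# Serialize to a human-readable string for interchange or storage
text = json.dumps(record)
print(text)

# Deserialize; the round trip preserves the structure
assert json.loads(text) == record
```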

2.3. Network Model

The network model represents data using a graph structure. Links between records in the network model are like pointers in programming languages. It was standardized by CODASYL.

Querying the network model requires manually following a path from a root record.

3. Query Language

3.1. Relational Algebra

Relational algebra defines the interface for retrieving and manipulating tuples in relations: the input is a relation and the output is a relation. It was proposed by Codd while working at IBM.

A query (or expression) on a set of relations produces a relation as a result.

Note that the semantics of relational algebra removes duplicates from the result (because relations are sets of rows).

operator (select) select (\(\sigma\)) picks certain rows based on a condition.


operator (project) project (\(\Pi\)) picks certain columns.


operator (cross product) combines two relations.

\[A \times B\]

Suppose \(A\) has \(s\) tuples and \(B\) has \(t\) tuples; then the cross product has \(st\) tuples. The relation name is attached to each attribute to disambiguate.

\[|A \times B| = |A||B|\]

operator (natural join) Enforces equality on all attributes with the same name, and eliminates one copy of the duplicate attributes.

\[A \bowtie B\]

Natural join does not add any expressive power to the algebra; it can be expressed using selection, projection, and cross product:

\[ A \bowtie B = \Pi_{schema(A) \cup schema(B)} (\sigma_{A.A_1 = B.A_1, A.A_2 = B.A_2, ...} (A \times B))\]

operator (theta join) Applies a condition \(\theta\) to the join; this is the basic join operation used in DBMSs.

\[A \bowtie_{\theta} B = \sigma_{\theta} (A \times B)\]
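The operators above can be sketched over Python sets of tuples, including a theta join expressed as a selection over the cross product; the relations and attribute positions are illustrative.

```python
# Relations as sets of tuples; attribute positions are fixed per relation
R = {(1, "Alice"), (2, "Bob")}                   # R(id, name)
S = {(1, "cs145"), (1, "cs245"), (2, "cs229")}   # S(id, course)

def select(rel, pred):
    """sigma: keep the rows satisfying the condition."""
    return {t for t in rel if pred(t)}

def project(rel, indices):
    """pi: keep the given columns (the set automatically removes duplicates)."""
    return {tuple(t[i] for i in indices) for t in rel}

def cross(r, s):
    """cartesian product: concatenate every pair of tuples."""
    return {a + b for a in r for b in s}

# Theta join as sigma_theta(R x S); here theta is R.id = S.id
joined = select(cross(R, S), lambda t: t[0] == t[2])
print(project(joined, (1, 3)))  # pairs of (name, course)
```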

The following are set operators.

operator (union) combines rows vertically; the schemas must match

\[A \cup B\]

operator (difference)

\[ A - B\]

operator (intersection) intersection can be expressed with union and difference, e.g. \(A \cap B = A - (A - B)\)

\[ A \cap B\]

operator (rename) renames the schema of a relation

\[\rho_{R(A_1, A_2, ..., A_n)}(E)\]

The rename operator is used to unify schemas for set operators, whose input schemas must match.

3.2. SQL

SQL is a declarative language partially implementing relational algebra, derived from SEQUEL at IBM. A table in SQL is an unordered multiset of rows.

statement (select)

\[\text{Select }A_1, A_2, ..., A_n\text{ From }R_1, R_2, ..., R_m\text{ where condition}\]

This is equivalent to the following relational algebra expression

\[\Pi_{A_1, A_2, ..., A_n}(\sigma_{condition}(R_1 \times R_2 \times ... \times R_m))\]

duplicates: select may return duplicates because SQL uses a multiset model; to remove duplicates, add distinct after select

order: select returns results without any guaranteed order; to force an order, add order by attr at the end.
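Both points can be observed with a small example (sqlite3 shown; the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE enrolled (sid INTEGER, dept TEXT)")
cur.executemany("INSERT INTO enrolled VALUES (?, ?)",
                [(1, "cs"), (2, "cs"), (3, "ee")])

# Plain SELECT keeps duplicates: SQL tables are multisets, and the row
# order is not guaranteed
cur.execute("SELECT dept FROM enrolled")
print(cur.fetchall())  # e.g. [('cs',), ('cs',), ('ee',)]

# DISTINCT removes duplicates; ORDER BY forces an output order
cur.execute("SELECT DISTINCT dept FROM enrolled ORDER BY dept")
print(cur.fetchall())  # [('cs',), ('ee',)]
```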

4. Index Management

An index is an additional structure derived from the primary data. Maintaining an index incurs overhead, especially on writes: well-chosen indexes speed up read queries, but every index slows down writes.

When the entire database cannot fit into memory, we store the data on disk and maintain an index mapping each key to a disk offset. The following three are the most common indexes; they can be kept in memory or on disk.

4.1. Hash Index

A hash map can be used as an in-memory index; one typical hash-based index is the following.

4.1.1. Bitcask

Bitcask is the default storage engine for Riak

This introduction paper is quite easy to understand.

We use a hash index to map each key to its offset in a log-structured data file.

  • Whenever we append a new key-value pair to the file, we update the hash map to point the key to the new offset.
  • The log is split into segments of a certain size, which are compacted later.

The hash map points to the file id (segment id) and the offset within that file.

(Figures omitted: the layout of the key field in the hash map and of the value field in the data file.)


Pros:

  • high-speed reads/writes (the paper says it can do 5-6k writes per second)
  • good for cases where the value of each key is updated frequently

Cons:

  • the entire hash map must be kept in memory
  • range queries are not efficient (might be achievable through a secondary index though)
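A minimal sketch of the idea, with an append-only log plus an in-memory hash index; the record format here is made up for illustration and is not Bitcask's actual on-disk format.

```python
class TinyBitcask:
    """Append-only log with a hash index mapping key -> log offset."""

    def __init__(self):
        self.log = bytearray()   # stands in for the on-disk segment file
        self.index = {}          # key -> offset of the latest record

    def put(self, key: str, value: str):
        record = f"{key},{value}\n".encode()
        self.index[key] = len(self.log)   # point the key at the new offset
        self.log += record                # append, never overwrite

    def get(self, key: str):
        off = self.index[key]             # one lookup, one "seek"
        line = bytes(self.log[off:self.log.index(b"\n", off)])
        return line.decode().split(",", 1)[1]

db = TinyBitcask()
db.put("user1", "alice")
db.put("user1", "alicia")   # an update appends; the index skips the stale record
print(db.get("user1"))      # -> alicia
```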

4.2. SSTable (memtable)

When we cannot fit the entire hash map into memory, we can use SSTables together with an in-memory memtable.

An SSTable (Sorted String Table) is a segment file whose key-value pairs are sorted by key; new key-value pairs are first buffered in the memtable (e.g., a red-black/AVL tree).

4.2.1. LSM-Tree and memtable

As the data grows, we write SSTables out to storage and use the memtable to hold only the latest key-value pairs.

memtable (e.g., red-black/AVL trees) stores new key-value pairs temporarily. When it gets bigger than some threshold (e.g., a few MB), we write the pairs out, sorted, into a new SSTable.

Pros (over hash index):

  • merging segments is simple and efficient using a priority queue (see the related Leetcode problem)
  • the in-memory index can be sparse: one key every few kilobytes (because key-value pairs are sorted within each segment)


4.2.2. LevelDB

4.2.3. Cassandra

4.3. B-Tree

5. Storage Management

One way to manage storage is file mapping (mmap); it works well for reading, but not when there are multiple writers. Instead, databases tend to manage storage themselves using database pages (different databases use different page sizes).

Heap File A heap file is an unordered collection of pages. There are two styles of heap file.

  • Linked List: a header page manages pointers to data pages and to free pages
  • Page Directory: tracks the location and the free slots of each page

5.1. Layout Model

Each page contains a header (page size, checksum, version, ...). There are two types of layouts:

  • page-oriented layout
  • log-structured layout

5.1.1. Page-Oriented Model (Tuple-Oriented Model)

The tuple-oriented model is usually associated with B-Trees.

Pros:

  • fast reads

Slotted Tuples

Reference: CMU 15445 Lecture 3

5.1.2. Log-Structured-based Model

The log-structured model appends new log records to pages (HBase, Cassandra, LevelDB, ...).

Pros:

  • fast writes

The most popular model is the LSM-Tree + SSTable.

SSTable (Sorted String Table): requires the key-value pairs to be sorted by key within each segment.

LSM-Tree An LSM-Tree's storage consists of multiple SSTables:

  • when the memtable gets bigger than some threshold, we write it out to disk as an SSTable
  • from time to time, we merge and compact segments to discard overwritten values

5.2. Storage Model

5.2.1. Row Model (N-ary Storage Model)

Reference: CMU 15445 Lecture 4

5.2.2. Column Model (Decomposition Storage Model)

The DBMS stores the values of a single attribute for all tuples contiguously in a page

Reference: CMU 15445 Lecture 4

Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes

Large values:

  • move them to an overflow page and store a pointer
  • move them to an external file and store the path

Column Compression

The sequence of values in each column is usually repetitive, so we can compress it using several techniques; compression reduces memory bandwidth and helps vectorize processing.

Compression techniques include, for example:

  • bitmap encoding: for each unique value, we store a bitmap
  • run-length encoding: if the number of distinct values is very large, the bitmaps are sparse, and we can compress them further with run-length encoding
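Both techniques can be sketched in a few lines; the column values are illustrative.

```python
# A repetitive column of categorical values
column = ["cs", "cs", "ee", "cs", "ee", "ee"]

# Bitmap encoding: one bitmap per distinct value
bitmaps = {v: [1 if x == v else 0 for x in column] for v in set(column)}
print(bitmaps["cs"])  # [1, 1, 0, 1, 0, 0]

# Run-length encoding of a bitmap: (bit, run length) pairs
def rle(bits):
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

print(rle(bitmaps["cs"]))  # [(1, 2), (0, 1), (1, 1), (0, 2)]
```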

Compression Order

We can impose an order on all rows using some column to improve processing efficiency and compression. Different replicas can be sorted by different columns.

5.2.3. Wide-Column Model

It is a mix of the row-oriented and column-oriented models:

  • column families are stored separately (as in the column model)
  • but within a column family, multiple columns are stored together (as in the row model)

6. Memory Management

Reference: CMU 15445 Lecture 5

The buffer pool is a large memory region organized as an array of fixed-size pages. An array entry (slot) is called a frame. A DBMS may have multiple buffer pools.

The page table keeps track of the pages currently in the buffer pool, and maintains per-page metadata such as a dirty flag and a pin/reference counter for concurrency.

6.1. Optimization

  • scan sharing: decide which page to evict based on the access pattern of multiple queries
  • pre-fetching: prefetch pages based on the query
  • buffer pool bypass: bypass the buffer pool to reduce overhead
  • OS page cache: use O_DIRECT to bypass the OS page cache (one exception is PostgreSQL, which relies on it)

6.1.1. Buffer Replacement Policies

The following are the most common policies. Enterprise databases usually have much better policies than open-source databases.

  • LRU (Least Recently Used): maintain a timestamp per page and evict pages in timestamp order. The disadvantage of LRU is sequential flooding: under a sequential scan, the most recent page is the most useless.
  • Clock: an approximation of LRU. Set a bit to 1 when a page is accessed, and evict a page whose bit is 0. It has the same problem as LRU.
  • LRU-K: take the history of the last K references as timestamps and compute the intervals between subsequent accesses; the intervals help predict when the page will be accessed next. A typical K is 2.
  • Localization: restrict a query to its private buffer rather than polluting the whole buffer pool.
  • Priority Hints: provide hints about whether a page is important. Dirty pages also need to be taken into account (e.g., written back in the background).
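Plain LRU can be sketched with an ordered dict; the page ids and "page data" are illustrative, and a real buffer pool would track dirty flags and pins as described above.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Toy buffer pool: holds `capacity` pages, evicts the least recently used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data; dict order = recency

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: mark as most recent
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)       # miss: evict least recent
        self.pages[page_id] = f"data-{page_id}"  # simulate fetching from disk
        return self.pages[page_id]

pool = LRUBufferPool(capacity=3)
for p in [1, 2, 3, 1]:
    pool.access(p)
pool.access(4)            # evicts page 2, the least recently used
print(list(pool.pages))   # [3, 1, 4]
```

A sequential scan over more pages than the capacity would evict every page right before its reuse, which is the sequential-flooding weakness noted above.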

7. Query Optimization

7.1. Join Order and Algorithm

The performance of a query plan is largely determined by the join order. Most query optimizers determine the join order via dynamic programming.

To actually perform a join, there are several join algorithms.

Algorithm (nested loop join) two nested loops over \(R\) and \(S\)

def nested_loop_join(R, S, condition):
    for r in R:                   # outer loop over R
        for s in S:               # inner loop over S
            if condition(r, s):   # check the join condition
                yield r + s       # emit the combined tuple <r, s>

Algorithm (blocked nested loop join) A generalization of nested loop join: suppose \(|R| < |S|\); cache blocks of \(R\) in memory and scan \(S\) only once per block of \(R\).

Algorithm (sort-merge join) Sort \(R\) and \(S\) by the join attribute, then process them with a linear scan

Algorithm (hash join) Build a hash table on the smaller relation, then probe it with the other relation.
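The build-and-probe structure can be sketched as follows; the relations and the choice of join key are illustrative.

```python
from collections import defaultdict

def hash_join(R, S, r_key, s_key):
    """Build a hash table on the smaller relation R, then probe it with S."""
    table = defaultdict(list)
    for r in R:                          # build phase
        table[r[r_key]].append(r)
    out = []
    for s in S:                          # probe phase
        for r in table.get(s[s_key], []):
            out.append(r + s)
    return out

R = [(1, "Alice"), (2, "Bob")]                   # R(id, name), the smaller relation
S = [(1, "cs145"), (2, "cs229"), (1, "cs245")]   # S(id, course)
print(hash_join(R, S, r_key=0, s_key=0))
# [(1, 'Alice', 1, 'cs145'), (2, 'Bob', 2, 'cs229'), (1, 'Alice', 1, 'cs245')]
```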

8. Transaction

property (ACID)

  • Atomicity: a commit is either fully successful or rolled back to the prior state completely
  • Consistency: maintaining database invariants: a transaction will bring the database from one valid state to another
  • Isolation: a transaction is not affected by other concurrently running transactions
  • Durability: a successful commit survives permanently (usually meaning it is recorded in non-volatile storage)

8.1. Concurrency Control

There are three categories.

Optimistic concurrency control (OCC)

Allows transactions to execute concurrent read/write operations and determines whether the result of the combined execution is serializable. If there is a conflict, one of the conflicting transactions is aborted.

Multiversion concurrency control (MVCC)

Pessimistic concurrency control (PCC)

9. Reference