
0x363 Storage

1. Serialization

Programs usually work with data in two formats:

  • in memory, data is kept in objects, which are optimized for efficient CPU access via pointers
  • on disk or over the network, data has to be encoded into a self-contained sequence of bytes

The translation from object to byte sequence is called encoding (also marshalling or serialization); the reverse is decoding (unmarshalling, deserialization).

Language-Specific Format

Many languages come with built-in encoding support:

  • Java: java.io.Serializable
  • Ruby: Marshal
  • Python: pickle

They are convenient but have the following problems:

  • not portable across languages
  • security risk: decoding can instantiate arbitrary classes
  • data-versioning issues (i.e., backward/forward compatibility)
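For example, a pickle round trip in Python (the security caveat above is why pickled bytes must never be loaded from an untrusted source):

```python
import pickle

# encode an in-memory object into a Python-specific byte sequence
person = {"user_name": "alice", "favorite_number": 7, "interests": ["hiking"]}
data = pickle.dumps(person)

# decode the bytes back into an object
restored = pickle.loads(data)
assert restored == person

# caveat: unpickling can instantiate arbitrary classes, so never
# load pickled data from an untrusted source
```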

1.1. Row-Oriented Format

There are binary encodings of JSON such as MessagePack, but they are still not very efficient. Compact, schema-based binary encodings such as Protocol Buffers are a better fit when both sides can share a schema, e.g. for communication within an organization.

Protocol Buffer is a language-neutral, platform-neutral extensible mechanism for serializing structured data.

message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}

It has the following benefits:

  • compact: field name strings are omitted and replaced by numeric field tags
  • compatibility: backward/forward compatibility can be achieved by paying attention to the tag numbers
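The compactness comes from the wire format: each encoded field is prefixed with a key combining the tag number and a wire type, computed as `(tag << 3) | wire_type`. A sketch for the `Person` schema above:

```python
# protobuf wire format: key = (field_tag << 3) | wire_type
# wire types: 0 = varint, 2 = length-delimited (strings, bytes, sub-messages)
def field_key(tag: int, wire_type: int) -> int:
    return (tag << 3) | wire_type

# user_name is tag 1, a string -> wire type 2, encoded as one byte 0x0A
assert field_key(1, 2) == 0x0A
# favorite_number is tag 2, an int64 varint -> wire type 0, key byte 0x10
assert field_key(2, 0) == 0x10
```

This is why the field name never appears on the wire: one small varint key identifies both the field and how to skip it if the reader doesn't know the tag.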


See the following Python snippet for converting between object, binary, and text formats.

address_book = addressbook_pb2.AddressBook()

# encode object into binary
with open("address_book.pb", "wb") as f:
    f.write(address_book.SerializeToString())

# decode binary into an object
with open("address_book.pb", "rb") as f:
    data = f.read()

# either parse into an existing message ...
address_book.ParseFromString(data)
# ... or construct a new one from the same bytes
address_book = addressbook_pb2.AddressBook.FromString(data)


from google.protobuf import text_format

# encode object into string
with open("address_book.pbtxt", "w") as f:
    f.write(text_format.MessageToString(address_book))

# decode string into object
with open("address_book.pbtxt", "r") as f:
    text_format.Parse(f.read(), address_book)

There are a few other similar schema-based formats, such as Apache Avro and Apache Thrift.

1.2. Column-Oriented Format

Examples include Apache Parquet and Capacitor (Dremel's format), discussed in the Column Model section below.

2. Storage Model

A DBMS's storage model specifies how it physically organizes tuples on disk and in memory.

It has different performance characteristics based on the target workload (OLTP vs OLAP)

2.1. Row Model (N-ary Storage Model)

Row-oriented model:

  • stores (almost) all attributes of a single tuple contiguously in a single page (e.g. a 4 KB page)
  • ideal for OLTP workloads
  • fast reads and writes of entire tuples

Reference: CMU 15445 Lecture 4

2.1.1. Layout

Each page contains a header (page size, checksum, version...).

There are two types of layouts:

  • page-oriented (tuple-oriented) layout
  • log-structured layout

The tuple-oriented layout is usually paired with a B-Tree index.

Pros:

  • fast read

Slotted Tuples: within a page, a slot array in the header maps slot ids to tuple offsets, so tuples can be moved inside the page without changing their record ids.

Reference: CMU 15445 Lecture 3

The log-structured layout appends new log records to pages (HBase, Cassandra, LevelDB, ...).

Pros:

  • fast write

The most popular model is the LSM-Tree with SSTables.

SSTable (Sorted String Table): requires the key-value pairs in each segment to be sorted by key.

LSM-Tree: the storage consists of an in-memory memtable plus multiple on-disk SSTables

  • when the memtable grows bigger than some threshold, we write it out to disk as an SSTable
  • from time to time, we merge and compact segments to discard overwritten values
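The cycle above can be sketched with a toy memtable and segment list (illustrative only; real engines add write-ahead logs, sparse indexes, bloom filters, etc.):

```python
# toy LSM-tree: an in-memory memtable plus immutable sorted segments (SSTables)
MEMTABLE_LIMIT = 2          # flush threshold (tiny, for illustration)

memtable = {}               # recent writes, key -> value
sstables = []               # on-"disk" segments, oldest first; each sorted by key

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # flush: write the memtable out as an immutable sorted segment
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    # search segments newest-first so later writes shadow older ones
    for segment in reversed(sstables):
        for k, v in segment:        # a real SSTable would binary-search
            if k == key:
                return v
    return None

def compact():
    # merge all segments, discarding overwritten values
    merged = {}
    for segment in sstables:        # oldest first, so newer values win
        merged.update(segment)
    sstables[:] = [sorted(merged.items())]
```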

2.2. Column Model (Decomposition Storage Model)

Column Model:

  • stores the values of a single attribute for all tuples contiguously in a page
  • ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes

Reference: CMU 15445 Lecture 4
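The contrast with the row model can be sketched by laying the same tuples out both ways:

```python
# the same three tuples laid out under each model
tuples = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 35)]

# row model (NSM): each tuple's attributes sit next to each other
row_layout = [attr for tup in tuples for attr in tup]
# -> [1, 'alice', 30, 2, 'bob', 25, 3, 'carol', 35]

# column model (DSM): each attribute's values sit next to each other,
# so a scan over one attribute touches only that column's pages
col_layout = [list(col) for col in zip(*tuples)]
# -> [[1, 2, 3], ['alice', 'bob', 'carol'], [30, 25, 35]]

# e.g. AVG(age) reads only the third column
avg_age = sum(col_layout[2]) / len(col_layout[2])   # 30.0
```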

Large values:

  • move them to an overflow page and store a pointer, or
  • move them to an external file and store the path

Capacitor in Dremel

Capacitor is the column-oriented storage format used in Dremel.


2.2.1. Compression

Column Compression

The sequence of values in each column is usually repetitive, so we can compress it with several techniques. Compression reduces memory bandwidth usage and helps vectorize processing.

Compression techniques include:

  • bitmap encoding: for each distinct value, store a bitmap with one bit per row
  • run-length encoding: if the number of distinct values is large, the bitmaps become sparse, and we can compress them further with run-length encoding
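A minimal sketch of run-length encoding for such a column:

```python
# run-length encode a column as (value, run_count) pairs;
# works best when the column is sorted so equal values are adjacent
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1            # extend the current run
        else:
            runs.append([v, 1])         # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

column = ["US", "US", "US", "UK", "UK", "US"]
encoded = rle_encode(column)            # [('US', 3), ('UK', 2), ('US', 1)]
assert rle_decode(encoded) == column
```

Note that sorting the column first would merge the two `US` runs into one, which is one reason imposing a sort order (next paragraph) helps compression.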

Compression Order

We can sort all rows by a chosen column to improve processing efficiency and compression. Different replicas can be sorted by different columns.

2.3. Mixed Model

A mix of the row-oriented and column-oriented models:

Wide-column model:

  • column families are stored separately (as in the column model)
  • but within a column family, multiple columns are stored together (as in the row model)

BigTable

BigTable is a sparse, multi-dimensional map:

\[ \text{key}(\text{row}, \text{column}, \text{timestamp}) \rightarrow \text{value} \]

Apache Parquet

See this twitter blog

good intro video

3. Storage Management

One way to manage storage is memory-mapping files (mmap). This works well for reads, but not when there are multiple writers. Instead, databases tend to manage storage themselves using database pages (page sizes differ across DBMSs).

Heap File: a heap file is an unordered collection of pages. There are two styles of heap file:

  • Linked List: a header page maintains pointers to the data pages and free pages
  • Page Directory: a directory tracks the location of, and free slots in, each page
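A toy sketch of the page-directory style (the names and fixed slot count are illustrative):

```python
# toy page directory: maps page_id -> free slot count, so finding a page
# with room does not require scanning every page on disk
PAGE_SLOTS = 4

pages = {}       # page_id -> list of records
directory = {}   # page_id -> number of free slots

def insert(record):
    # find any page with a free slot via the directory
    for page_id, free in directory.items():
        if free > 0:
            pages[page_id].append(record)
            directory[page_id] -= 1
            return page_id
    # no room anywhere: allocate a new page and register it
    page_id = len(pages)
    pages[page_id] = [record]
    directory[page_id] = PAGE_SLOTS - 1
    return page_id
```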