Skip to content

0x335 Distribution

1. Scheduler

1.1. Kubernetes

1.2. slurm

how to use slurm

scontrol

scontrol update nodename=<nodename> state=resume

1.3. YARN

ResourceManager

  • keeps the metadata of jobs
  • hosts on a different host from HDFS NameNode

NodeManager

  • run on each node, co-located with HDFS DataNode
  • manage YARN container (resource allocation done by resourcemanager)

2. MapReduce

2.1. Implementations

Model (FlumeJava) Google's pipeline

3. Spark

3.1. RDD

Documentation

resilient distributed dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel.

3.1.1. Operations

There are two types of operations in Spark

  • transformations: return a new RDD from existing ones. Lazy-execution. (e.g: map, reduceByKey, flatMap)
  • actions: returns a value to the driver program after computation. (e.g: reduce)
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)

3.2. Structured API

3.2.1. SQL

3.2.2. DataFrame

3.2.3. Dataset

4. MPI

5. Batch Processing

6. Stream Processing

6.1. Kafka (Streaming)

Kafka is a general publish/subscribe messaging system, it can publish message from different sources into multiple topics, and let subscriber receive those topics

7. Reference

  • [1] Docker Deep Dive: Zero to Docker in a single book