0x335 Distribution
1. Scheduler
1.1. Kubernetes
1.2. slurm
how to use slurm
scontrol
scontrol update nodename=<nodename> state=resume
1.3. YARN
ResourceManager
- keeps the metadata of jobs
- hosts on a different host from HDFS NameNode
NodeManager
- run on each node, co-located with HDFS DataNode
- manage YARN container (resource allocation done by resourcemanager)
2. MapReduce
2.1. Implementations
Model (FlumeJava) Google's pipeline
3. Spark
3.1. RDD
resilient distributed dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel.
3.1.1. Operations
There are two types of operations in Spark
- transformations: return a new RDD from existing ones. Lazy-execution. (e.g: map, reduceByKey, flatMap)
- actions: returns a value to the driver program after computation. (e.g: reduce)
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
3.2. Structured API
3.2.1. SQL
3.2.2. DataFrame
3.2.3. Dataset
4. MPI
5. Batch Processing
6. Stream Processing
6.1. Kafka (Streaming)
Kafka is a general publish/subscribe messaging system, it can publish message from different sources into multiple topics, and let subscriber receive those topics
7. Reference
- [1] Docker Deep Dive: Zero to Docker in a single book