0x371 Computing
This note covers topics in shared-nothing parallel computing.
2. MPI
MPI (Message Passing Interface) defines the syntax and semantics of portable message-passing library routines.
Check this tutorial
2.1. Communicator
Communicator objects connect groups of processes in the MPI session. The default one is MPI_COMM_WORLD, which contains all processes.
2.2. Point-to-Point Communication
2.2.1. Send and Recv
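MPI_Send and MPI_Recv are both blocking calls. MPI_Send sends count elements of type datatype from the buffer data to the destination rank. MPI_Recv receives up to count elements of type datatype from the source rank into the buffer data. They have the following syntax: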
MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)
2.2.2. All-to-All
2.3. Collective Operations
2.4. Python API
mpi4py provides a Python API. The following example can be launched with something like mpirun -n 8 /python hello_mpi.py
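import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # index of this process within the communicator
world = comm.Get_size()  # total number of processes in the communicator

# Pin each rank to its own GPU (assumes one node with at least `world` GPUs).
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
print("rank ", rank, " world ", world)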
3. Batch Processing
3.1. MapReduce
3.1.1. Abstraction
MapReduce has three phases:
Map phase:
- reads a collection of values or KV pairs from the input source
- the Mapper is applied to each input element and emits zero or more KV pairs
Shuffle phase:
- groups the KV pairs emitted by the Mapper by key, and outputs each key together with a stream of the values for that key
Reduce phase:
- takes the key-grouped data and reduces the values of each key, with different keys processed in parallel
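The three phases above can be sketched in a single process as a word count. This is only a minimal illustration in plain Python, not a distributed implementation: the names mapper, shuffle, reducer and map_reduce are hypothetical helpers chosen for this note, and a real framework would run the map and reduce steps on many workers and shuffle over the network.

```python
from collections import defaultdict
from functools import reduce


def mapper(line):
    """Map: emit one (word, 1) pair per word in the input element."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle: group values by key, giving (key, [values]) per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def reducer(key, values):
    """Reduce: fold all values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def map_reduce(inputs):
    """Run map, shuffle and reduce sequentially over an iterable of lines."""
    mapped = (pair for element in inputs for pair in mapper(element))
    grouped = shuffle(mapped)
    # Each key is reduced independently, which is what makes this step parallelizable.
    return dict(reducer(key, values) for key, values in grouped)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "The end"]
    print(map_reduce(lines))
```

Because the reducer only ever sees the values of one key, the per-key calls do not depend on each other and could be distributed across workers.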
3.2. Spark
3.2.1. Lower Abstraction (RDD)
RDD (Resilient Distributed Dataset)
- low-level API that expresses how to compute something
- compile-time type safety (e.g. Integer, String, Binary)
There are two kinds of operations:
- transformations: lazy operations that only describe a new RDD (e.g. map, filter, flatMap, reduceByKey)
- actions: eager operations that compute a result and trigger execution (e.g. count, collect, reduce, take)
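The difference between lazy transformations and eager actions can be illustrated without Spark. The sketch below is a toy model in plain Python, not the Spark API: ToyRDD is a hypothetical class that only records map and filter steps and runs them when collect or count is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source  # any re-iterable collection
        self._ops = ops        # recorded (kind, function) steps, not yet executed

    # Transformations: return a new ToyRDD with one more recorded step.
    def map(self, fn):
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _run(self):
        """Build the iterator pipeline from the recorded steps."""
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: execute the recorded steps and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


if __name__ == "__main__":
    calls = []

    def traced_square(x):
        calls.append(x)  # records when the function is actually invoked
        return x * x

    rdd = ToyRDD(range(5)).map(traced_square).filter(lambda x: x % 2 == 0)
    print("calls before action:", calls)
    print("collect:", rdd.collect())
    print("calls after action:", calls)
```

In this toy model every action re-runs the whole recorded pipeline from the source, since nothing is cached between actions.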
3.2.2. Higher Abstraction
High-level abstractions are automatically converted into RDD operations via plan optimization:
- Spark SQL
- DataFrame API
- Dataset API
3.3. FlumeJava
3.3.1. Abstraction
4. Stream Processing
4.1. MillWheel
https://research.google/pubs/millwheel-fault-tolerant-stream-processing-at-internet-scale/