0x371 Computing
This note covers topics in shared-nothing parallel computing.
1. MPI
MPI defines the syntax and semantics of portable message-passing library routines.
Check this tutorial
1.1. Communicator
Communicator objects connect groups of processes in an MPI session. The default one is MPI_COMM_WORLD, which spans all processes in the job.
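A quick sketch of deriving a new communicator from the default one using mpi4py (assuming mpi4py is installed; the even/odd split is an arbitrary example):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Split MPI_COMM_WORLD into two sub-communicators keyed by rank parity;
# each process gets a new rank within its sub-communicator.
subcomm = comm.Split(color=rank % 2, key=rank)
print("world rank", rank, "-> sub rank", subcomm.Get_rank())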
1.2. Point-to-Point Communication
1.2.1. Send and Recv
MPI_Send and MPI_Recv are blocking APIs that send and receive count elements of type datatype through the data buffer, with the following signatures:
MPI_Send(
void* data,
int count,
MPI_Datatype datatype,
int destination,
int tag,
MPI_Comm communicator)
MPI_Recv(
void* data,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm communicator,
MPI_Status* status)
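For illustration, a minimal blocking exchange using the mpi4py equivalents of these calls (a sketch assuming exactly two ranks):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Blocking send: returns once the send buffer can be reused.
    comm.send([1, 2, 3], dest=1, tag=0)
elif rank == 1:
    # Blocking receive: returns once a matching message has arrived.
    data = comm.recv(source=0, tag=0)
    print("rank 1 received", data)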
1.2.2. All-to-All
1.3. Collective Operations
1.4. Python API
mpi4py provides a Python API; see the following example. It can be launched with something like mpirun -n 8 python hello_mpi.py
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD   # the default communicator
rank = comm.Get_rank()  # this process's rank within the communicator
world = comm.Get_size() # total number of processes

# Pin each rank to its own GPU (assumes at least one GPU per rank).
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

print("rank", rank, "world", world)
2. Batch Processing
2.1. MapReduce
MapReduce (Dean and Ghemawat, 2008)1 is a programming model for batch processing.
2.1.1. Abstraction
MapReduce has three phases (a toy sketch follows the list):
Map phase:
- read a collection of values or KV pairs from an input source
- a Mapper is applied to each input element and emits zero or more KV pairs
Shuffle phase:
- group KV pairs with the same key from the Mapper and output each key with the stream of values for that key
Reduce phase:
- take the key-grouped data and reduce the values in parallel
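A toy single-process word count illustrating the three phases in plain Python (illustration only, not a MapReduce runtime):

from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog"]

# Map phase: the mapper emits zero or more (key, value) pairs per input element.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group pairs with the same key into (key, stream of values).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: reduce the values for each key (in parallel in a real system).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, ...}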
2.2. Spark
2.2.1. Lower Abstraction (RDD)
RDD (Resilient Distributed Dataset)
- low-level API describing how to perform the computation
- compile-time type safety (e.g. Integer, String, Binary)
There are two kinds of manipulations, contrasted in the sketch below:
- transformations: lazy operations, e.g. map, filter, flatMap, reduceByKey
- actions: compute a result (e.g. count, collect, reduce, take); they are eager and trigger execution
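A short PySpark sketch contrasting lazy transformations with an eager action (assuming a local pyspark installation):

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

lines = sc.parallelize(["a b", "b c", "c"])
# Transformations build a lineage lazily; nothing executes yet.
pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
# The action is eager and triggers execution of the whole lineage.
print(counts.collect())  # [('a', 1), ('b', 2), ('c', 2)] in some order
sc.stop()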
2.2.2. Higher Abstraction
High-level abstractions are automatically converted into RDDs via plan optimization (an example follows the list):
- Spark SQL
- DataFrame API
- Dataset API
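For example, a word count in the DataFrame API; Spark optimizes the logical plan before lowering it to RDD operations (a sketch, assuming pyspark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame([("a",), ("b",), ("b",)], ["word"])
# The query optimizer compiles this declarative plan down to RDD stages.
df.groupBy("word").count().show()
spark.stop()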
2.3. FlumeJava
FlumeJava (Chambers et al., 2010)2 is a library of immutable parallel collections together with operations for processing them in parallel.
Execution is based on a dataflow graph constructed internally, which gets optimized and mapped onto appropriate primitives (e.g. MapReduce).
2.3.1. PCollection
PCollection<T> is an unordered, immutable bag of elements; one can be created from an in-memory Java Collection<T>.
2.3.2. PTransform
PTransform is the name used in the Apache Beam API for an operation that transforms a PCollection into another. The general-purpose predefined PTransforms are the following:
2.3.2.1. ParDo
parallelDo (ParDo) is the main way to manipulate a PCollection by invoking a data-parallel operation. It takes a DoFn object:
# The input PCollection of strings.
words = ...

# The DoFn to perform on each element in the input PCollection.
class ComputeWordLengthFn(beam.DoFn):
    def process(self, element):
        return [len(element)]

# Apply a ParDo to the PCollection "words" to compute lengths for each word.
word_lengths = words | beam.ParDo(ComputeWordLengthFn())
There are a few lightweight DoFn wrappers provided, combined in the sketch below:
- beam.Map: emits exactly one output for each input element (e.g. words | beam.Map(len))
- beam.FlatMap: emits zero, one, or more outputs for each input element
- beam.Filter: copies an input to the output only if it satisfies a predicate
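A minimal pipeline combining the three (a sketch on the local runner; the input strings are made up):

import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(["a", "bb", "", "ccc"])
    (words
     | "DropEmpty" >> beam.Filter(lambda w: len(w) > 0)  # conditional copy
     | "Lengths" >> beam.Map(len)                        # one output per input
     | "Repeat" >> beam.FlatMap(lambda n: range(n))      # zero or more outputs
     | "Print" >> beam.Map(print))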
2.3.2.2. GroupByKey
GroupByKey is a reduction operation, analogous to the Shuffle phase in MapReduce. It has the signature PTable<K,V> -> PTable<K, Collection<V>>.
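In the Beam Python API this looks like the following sketch (toy data):

import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | beam.Create([("cat", 1), ("dog", 5), ("cat", 2)])
    # PTable<K,V> -> PTable<K, Collection<V>>: all values are grouped per key.
    grouped = pairs | beam.GroupByKey()
    grouped | beam.Map(print)  # e.g. ('cat', [1, 2]) and ('dog', [5])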
2.3.2.3. Combine
Combine is a transform that reduces a collection of elements. The combining function has to be commutative and associative to enable optimization.
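A sketch with sum, which is commutative and associative, so the runner may pre-combine partial sums on each worker:

import apache_beam as beam

with beam.Pipeline() as p:
    nums = p | beam.Create([1, 2, 3, 4])
    # CombineGlobally reduces the whole PCollection to a single element.
    total = nums | beam.CombineGlobally(sum)
    total | beam.Map(print)  # 10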
2.3.3. Optimization
The FlumeJava optimizer transforms an execution plan into an optimized one through a series of independent graph transformations:
- ParallelDo fusion: separate parallelDo operations f and g are replaced by a single \(g \circ f\) (sketched after this list)
- Combiner lifting: input elements are combined per key and window before they are shuffled
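A conceptual sketch of ParallelDo fusion (illustration only; the real optimizer rewrites the dataflow graph, and f and g here are made-up functions):

import apache_beam as beam

def f(x):
    return x + 1

def g(x):
    return x * 2

with beam.Pipeline() as p:
    nums = p | beam.Create([1, 2, 3])
    # Written as two ParDos, but an optimizer can fuse them into a single
    # pass that applies g(f(x)) to each element, avoiding a materialization.
    fused = nums | beam.Map(f) | beam.Map(g)
    fused | beam.Map(print)  # 4, 6, 8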
3. Stream Processing
3.1. MillWheel
MillWheel (Akidau et al., 2013)3 is a framework for fault-tolerant stream processing at internet scale. Its design focuses on:
- latency reduction
- unbounded data
- correctness
- out-of-order processing
4. References
1. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.
2. Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, efficient data-parallel pipelines. ACM SIGPLAN Notices, 45(6):363–375.
3. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033–1044.