0x371 Computing

This note covers topics in shared-nothing parallel computing.

1. MPI

MPI defines the syntax and semantics of portable message-passing library routines.

Check this tutorial

1.1. Communicator

Communicator objects connect groups of processes in an MPI session. The default communicator is MPI_COMM_WORLD.

1.2. Point-to-Point Communication

1.2.1. Send and Recv

MPI_Send and MPI_Recv are both blocking APIs that send and receive count elements of type datatype from a data buffer, with the following signatures:

MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)
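
A minimal point-to-point sketch of the same pattern in Python using mpi4py (the lowercase send/recv calls transfer pickled Python objects; the script name hello_sendrecv.py is just an example). Launch with e.g. mpirun -n 2 python hello_sendrecv.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"payload": [1, 2, 3]}
    # Blocking send to rank 1 with tag 0.
    comm.send(data, dest=1, tag=0)
elif rank == 1:
    # Blocking receive matching the sender's rank and tag.
    data = comm.recv(source=0, tag=0)
    print("rank 1 received", data)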

1.2.2. All-to-All

1.3. Collective Operations

1.4. Python API

mpi4py provides a Python API; see the following example.

It can be launched with something like mpirun -n 8 python hello_mpi.py

import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()    # this process's index within the communicator
world = comm.Get_size()   # total number of processes

# Pin each rank to its own GPU (one visible device per process).
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
print("rank ", rank, " world ", world)

2. Batch Processing

2.1. MapReduce

MapReduce (Dean and Ghemawat, 2008)1 is a programming model for batch processing.

2.1.1. Abstraction

MapReduce has three phases, sketched in the word-count example below:

Map phase:

  • read a collection of values or KV pairs from an input source
  • the Mapper is applied to each input element and emits zero or more KV pairs

Shuffle phase:

  • group the KV pairs emitted by the Mapper by key, and output each key together with a stream of its values

Reduce phase:

  • take the key-grouped data and reduce the values of each key in parallel
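
A framework-free word-count sketch of the three phases (the helper names here are illustrative, not part of any MapReduce API):

from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) KV pair for every word in the input.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Reduce phase: reduce the grouped values of one key.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = (pair for line in lines for pair in mapper(line))
print([reducer(k, vs) for k, vs in shuffle(mapped)])
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]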

2.2. Spark

2.2.1. Lower Abstraction (RDD)

RDD (Resilient Distributed Dataset)

  • low-level API that describes how to do a computation
  • compile-time type safety (e.g. Integer, String, Binary)

There are two kinds of operations (see the sketch below):

  • transformations: lazy operations, e.g. map, filter, flatMap, reduceByKey
  • actions: compute a result (e.g. count, collect, reduce, take); they are eager and trigger execution
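
A minimal PySpark sketch of transformations vs. actions, assuming a local Spark installation:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

rdd = sc.parallelize(["a b a", "b c"])
# Transformations are lazy: nothing is executed here.
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
# collect is an action: it triggers execution of the whole lineage.
print(counts.collect())  # e.g. [('a', 2), ('b', 2), ('c', 1)]

sc.stop()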

2.2.2. Higher Abstraction

High-level abstractions are automatically converted into RDDs via plan optimization (see the sketch below):

  • Spark SQL
  • DataFrame API
  • Dataset API
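
A minimal sketch of the higher-level APIs in PySpark (the table and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-example").getOrCreate()

df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "cnt"])
# DataFrame API: a declarative plan that is optimized and executed as RDD operations.
df.filter(df["cnt"] > 3).select("name").show()

# The same query through Spark SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE cnt > 3").show()

spark.stop()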

2.3. FlumeJava

FlumeJava (Chambers et al., 2010)2 is a library of immutable parallel collections and of operations for processing them in parallel.

Execution is based on a dataflow graph constructed internally, which gets optimized and mapped onto appropriate primitives (e.g. MapReduce).

2.3.1. PCollection

PCollection<T> is an unordered, immutable bag of elements; it can be created from an in-memory Java Collection<T>.
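
In the Apache Beam Python SDK (used in the examples below), a PCollection can likewise be created from an in-memory collection; a minimal sketch:

import apache_beam as beam

with beam.Pipeline() as p:
    # An immutable, unordered PCollection built from an in-memory list.
    words = p | beam.Create(["to", "be", "or", "not", "to", "be"])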

2.3.2. PTransform

PTransform is the name used in the Apache Beam API for an operation that transforms a PCollection. The general-purpose predefined PTransforms are the following:

2.3.2.1. ParDo

parallelDo (ParDo) is the main way to manipulate a PCollection: it invokes a data-parallel operation on each element. It takes a DoFn object:

# The input PCollection of Strings.
words = ...

# The DoFn to perform on each element in the input PCollection.
class ComputeWordLengthFn(beam.DoFn):
  def process(self, element):
    return [len(element)]

# Apply a ParDo to the PCollection "words" to compute lengths for each word.
word_lengths = words | beam.ParDo(ComputeWordLengthFn())

There are a few lightweight DoFn wrappers provided, sketched below:

  • beam.Map: emits exactly one output for each input element (e.g. words | beam.Map(len))
  • beam.FlatMap: emits zero, one, or more outputs for each input element
  • beam.Filter: copies an input element to the output only if it satisfies a predicate
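
A small self-contained sketch of the wrappers above (the element values and transform labels are illustrative):

import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(["to", "be", "or", "not"])
    lengths = words | "Lengths" >> beam.Map(len)                     # one output per input
    letters = words | "Letters" >> beam.FlatMap(list)                # zero or more outputs per input
    longer  = words | "Longer" >> beam.Filter(lambda w: len(w) > 2)  # conditional copy
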
2.3.2.2. GroupByKey

GroupByKey is a reduction operation, analogous to the Shuffle phase in MapReduce. It has the signature PTable<K,V> -> PTable<K, Collection<V>>.
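
A minimal Beam sketch of GroupByKey (the KV pairs are illustrative):

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("cat", 1), ("dog", 5), ("cat", 9)])
        # (K, V) pairs become (K, iterable of V), one output per distinct key.
        | beam.GroupByKey()
        | beam.Map(lambda kv: (kv[0], list(kv[1])))
        | beam.Map(print)  # e.g. ('cat', [1, 9]) and ('dog', [5])
    )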

2.3.2.3. Combine

Combine is a transform that reduces a collection of elements. The combining function has to be commutative and associative to enable optimization (see combiner lifting below).
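
A minimal Beam sketch of a per-key Combine using sum, which is commutative and associative:

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("cat", 1), ("dog", 5), ("cat", 9)])
        # sum is commutative and associative, so partial sums can be
        # computed before the shuffle (combiner lifting).
        | beam.CombinePerKey(sum)
        | beam.Map(print)  # e.g. ('cat', 10) and ('dog', 5)
    )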

2.3.3. Optimization

The FlumeJava optimizer transforms an execution plan into an optimized one through a series of independent graph transformations:

ParallelDo Fusion: separate parallelDo operations f and g are replaced by a single \(g \circ f\)

Combiner Lifting: input elements are combined per key and window before they are shuffled
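
A toy illustration of ParallelDo fusion as function composition (plain Python, not the optimizer itself):

f = lambda x: x + 1          # first parallelDo
g = lambda x: x * 2          # second parallelDo
fused = lambda x: g(f(x))    # g composed with f, applied in a single pass

data = [1, 2, 3]
assert list(map(fused, data)) == list(map(g, map(f, data)))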

3. Stream Processing

3.1. MillWheel

MillWheel (Akidau et al., 2013)3 is a framework for fault-tolerant stream processing at internet scale. Its design focuses on:

  • latency reduction
  • unbounded data
  • correctness
  • out-of-order processing

4. References

  1. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113. 

  2. Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, efficient data-parallel pipelines. ACM SIGPLAN Notices, 45(6):363–375. 

  3. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033–1044.