6 Distributed Computing Models
The world of big clusters and complex message passing
7 Map-Reduce
7.1 The (real) beating heart of Big Data
The Map→Reduce pattern is the most common pattern for processing (real) Big Data.
It is heavily used by Google, Facebook, and IBM.
Apache Hadoop is a popular Map-Reduce framework (its implementation is called MapReduce, not to be confused with the more general Map→Reduce pattern).
Hadoop is backed by HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator):
- HDFS is a distributed file system (a file system spread across a cluster of computers)
- YARN is a resource manager (a program that manages the resources of a cluster)
7.2 Split-Apply-Combine pattern
- Split: partition the data into smaller pieces
- Apply: process each piece independently
- Combine: merge the partial results
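A minimal single-machine sketch of the pattern in Python (the function names and the choice of ProcessPoolExecutor are ours, for illustration):

from concurrent.futures import ProcessPoolExecutor

def split(data, n_pieces):
    # Split: partition the data into roughly equal chunks.
    k = max(1, len(data) // n_pieces)
    return [data[i:i + k] for i in range(0, len(data), k)]

def apply_fn(piece):
    # Apply: process one piece independently (here: sum it).
    return sum(piece)

def combine(results):
    # Combine: merge the partial results.
    return sum(results)

if __name__ == "__main__":
    data = list(range(100))
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(apply_fn, split(data, 4)))
    print(combine(partials))  # 4950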
7.3 Map
Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:
Map(k1, v1) → list((k2, v2))
⟹ heavily parallelized (each Map call processes an independent input pair)
7.4 Reduce
The values associated with the same key are grouped together.
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list((k3, v3))
7.5 Diagram
7.6 Canonical example: Word Count, I
The canonical MapReduce example counts the occurrences of each word in a set of documents.
def map(name, document):
    # name: document name
    # document: document contents (list of words)
    for word in document:
        emit(word, 1)

def reduce(word, partialCounts):
    # word: a word
    # partialCounts: a list of aggregated partial counts
    total = 0
    for pc in partialCounts:
        total += pc
    emit(word, total)
7.7 Canonical example: Word Count, II
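To make the full pipeline concrete, here is a single-process Python sketch that also shows the shuffle step (grouping emitted values by key) between Map and Reduce. The driver run_mapreduce and the sample documents are illustrative inventions, not part of Hadoop:

from collections import defaultdict

def map_wordcount(name, document):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_wordcount(word, partial_counts):
    # Sum the partial counts for one word.
    return (word, sum(partial_counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (name, document) pair.
    pairs = []
    for name, doc in documents.items():
        pairs.extend(map_fn(name, doc))
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to each (key, values) group.
    return [reduce_fn(k, vs) for k, vs in groups.items()]

docs = {"doc1": "the cat sat on the mat", "doc2": "the dog"}
print(run_mapreduce(docs, map_wordcount, reduce_wordcount))
# [('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1)]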
7.8 Spark, spiritual successor of MapReduce
Spark is widely used for machine learning on large data sets (often faster than MapReduce by an order of magnitude).
Spark is largely inspired by the MapReduce pattern but extends it by organizing computations as a DAG (directed acyclic graph) of operations rather than the "linear" Map→Reduce data flow.
⟹ complex distributed computing.
Spark emphasizes ease of use: cluster resources are exposed in a simple, functional way.
7.9 Spark, code example: Word Count
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
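Here flatMap splits each line into individual words, map emits a (word, 1) pair per word, and reduceByKey sums the counts for each word across the cluster: the direct analogue of the Map and Reduce phases above.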
7.10 Spark, another example: machine learning
# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label,
# and show the results.
model.transform(df).show()
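For context: the slide omits imports and setup; LogisticRegression comes from pyspark.ml.classification, and sqlContext and data are assumed to have been created earlier.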
8 Message-Passing Patterns
At the heart of any distributed computing system is the message-passing pattern.
Processes pass messages to each other over a network (or through shared memory).
8.1 Diagram
8.2 Main message-passing functions
- Scatter: partition the data into smaller pieces and send them to the different processes.
- Gather: collect the data from the different processes and merge them.
- Broadcast: send the same data to all the processes.
- Reduce: merge the data from all the processes and produce a single result.
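A minimal sketch of these four functions, assuming the mpi4py library (an assumption; the source does not name a specific message-passing library):

# Run with e.g.: mpirun -n 4 python message_passing.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Scatter: the root partitions the data and sends one piece to each process.
chunks = [list(range(i * 3, (i + 1) * 3)) for i in range(size)] if rank == 0 else None
chunk = comm.scatter(chunks, root=0)

# Broadcast: every process receives the same value from the root.
factor = comm.bcast(10 if rank == 0 else None, root=0)

# Each process works on its own piece.
partial = sum(x * factor for x in chunk)

# Gather: the root collects the partial results from all processes.
partials = comm.gather(partial, root=0)

# Reduce: the partial results are merged into a single value on the root.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("partials:", partials, "total:", total)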