My notes on the paper "MapReduce: Simplified Data Processing on Large Clusters"

What’s MapReduce?

A programming model in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication.

Why MapReduce?

Allows us to express the simple computations we are trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution and load balancing in a library.

Programming Model

Express the computation as two functions: Map and Reduce.

Map: takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library then groups together all intermediate values associated with the same intermediate key “I” and passes them to the Reduce function.

Reduce: accepts an intermediate key “I” and a set of values for that key. It merges these values together to form a possibly smaller set of values.

types:

map    (k1, v1)       -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)
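The canonical example in the paper is counting word occurrences. A minimal Python sketch of that job, following the types above — note that the grouping (shuffle) step shown in `run_mapreduce` is my simulation of what the library does, not user code; the function and variable names here are my own:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused), value: document contents.
    # Emits an intermediate (word, 1) pair for every word: (k1, v1) -> list(k2, v2).
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word.
    # Merges them into a smaller set of values: (k2, list(v2)) -> list(v2).
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy stand-in for the run-time system: apply map to each input pair,
    # group intermediate values by key, then apply reduce per key.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
# counts == {'dog': [1], 'fox': [1], 'lazy': [1], 'quick': [1], 'the': [2]}
```

In the real library the same user-written Map and Reduce run in parallel across many machines, with the grouping done by partitioning and sorting the intermediate data.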