What’s MapReduce?
A programming model which users can specify a map
function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce
function that merges all intermediate values associated with the same intermediate key.
The run-time system takes care of the detail of partitioning
the input data, scheduling the program’s execution across a set of machines
, handling machine failures, and managing the required inter-machine communication.
Why MapReduce?
Allow us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library
.
Programming Model
Expressing the computation as 2 functions: Map
and Reduce
.
Map: takes an input pair and produces a set of intermediate key/value pairs, and then groups together all intermediate values associated with the same intermediate key “I” and passes them to the Reduce
function.
Reuce: accepts an intermediate key “I” and a set of values for that keys. It merges together these values to form a possibly smaller set of values.
types:
1 | map (k1, v1) -> list(k2, v2) |