Friday, March 1, 2013

How MapReduce works

Inspired by the map and reduce operations in functional languages, Google came up with a programming model and implementation to handle operations on huge data in a time-efficient manner.

In order to use MapReduce, the user needs to define two functions:
the Map function, which generates an intermediate map 
and the Reduce function, which operates on the intermediate map resulting in desired output.

When the user program calls the MapReduce function,

  1. The MapReduce library in the user program splits the input files into M pieces and copies of the program are started on a cluster of machines. One of these copies is master and is responsible for assigning the M map tasks and the R reduce tasks to the workers(others).
  2. When a map task is completed, the intermediate key/value pairs are buffered in memory. These buffered pairs are periodically written to local disk which is partitioned into R regions by the partitioning function. The master is responsible for forwarding these locations to the reducer workers. 
  3. When the master notifies a reduce worker about the locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. 
  4. After reading all the intermediate data, the reduce worker sorts the data by the intermediate keys. This ensures that all occurrences of a key are grouped together.
  5. Then, the reduce worker iterates over the sorted data and for each unique key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. 
  6. The output from the Reduce function is appended to a final output file for that reduce partition.

Once all the map and reduce tasks are completed, the MapReduce call in the user program returns back to the user code and the output of the MapReduce execution is available in the R output files.