A simple, minimal implementation of MapReduce, the distributed computation framework introduced by Jeffrey Dean and Sanjay Ghemawat at Google in 2004.
MapReduce is a programming model and processing technique designed for distributed computing on large datasets. It consists of two main phases: the Map phase, which processes input files in parallel and emits key/value pairs, and the Reduce phase, which performs a summary operation on the mapped data, grouped by key.
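The two phases can be sketched with the classic word-count example. This is an illustrative sketch, not this repository's code: the `KeyValue` type and the `Map`/`Reduce` signatures are assumptions modeled on common MapReduce implementations (the in-memory grouping in `main` stands in for the shuffle step that a real framework performs across intermediate files).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// KeyValue is the unit of data passed from Map to Reduce.
// (Hypothetical type for illustration; not taken from this repository.)
type KeyValue struct {
	Key   string
	Value string
}

// Map splits its input into words and emits a ("word", "1") pair per word.
func Map(contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: strings.ToLower(w), Value: "1"})
	}
	return kvs
}

// Reduce receives every value emitted for one key and sums the counts.
func Reduce(key string, values []string) string {
	return fmt.Sprintf("%d", len(values))
}

func main() {
	kvs := Map("the quick brown fox jumps over the lazy dog the end")

	// Group values by key: the "shuffle" between Map and Reduce.
	grouped := map[string][]string{}
	for _, kv := range kvs {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}

	// Emit results in sorted key order, like the sequential reference output.
	keys := make([]string, 0, len(grouped))
	for k := range grouped {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %s\n", k, Reduce(k, grouped[k]))
	}
}
```

In the distributed version, the grouping shown inline here is what the intermediate files exist for: each map task partitions its output by key so reduce tasks can collect their share.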
- Minimum viable implementation that produces the same output as a sequential MapReduce application.
- Graceful exit of all forked threads / goroutines.
To run distributed MapReduce (default):

```shell
make
```

To run sequential MapReduce (for testing/benchmarking):

```shell
make run_seq
```

To clean up the generated directories (also run as part of the previous targets):

```shell
make clean
```
- ./plugins: Directory where the compiled plugin (wordcounter) will be stored.
- ./intermediates: Directory for storing intermediate files generated during the Map phase.
- ./outputs: Directory for storing the final output files after the Reduce phase.