A. At a high level, MapReduce involves two steps
- Step 1 is mapping: take one input and process it into many outputs, emitted as key/value pairs. For example, take some wooden blocks and split them into smaller blocks, tagging each one.
- Step 2 is reducing: process those many outputs into one combined result. For example, color each block, then glue the blocks of the same color together into a single block.
Why do we need to map and then reduce? When data sets are huge, splitting up the tasks lets different machines process the pieces in parallel, which is far more efficient. The data itself is split across server nodes. This is the MapReduce way of processing large data, though not every piece of data can be easily split and then combined.
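To make the parallelism concrete, here is a hypothetical single-machine sketch in Python: the input is split into chunks, each chunk is processed independently (on a real cluster each chunk would live on a different server node), and the partial results are combined at the end. The chunk size and the per-chunk work are made-up stand-ins.

```python
# Hypothetical single-machine sketch of the split/process/combine idea;
# on a real cluster each chunk would be handled by a different machine.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for whatever per-chunk work a mapper would do.
    return sum(chunk)

data = list(range(100))
# Split the data into chunks of 25 elements each.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Process the chunks in parallel, then combine the partial results.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)
print(total)  # 4950, same as summing the data in one pass
```

The key property is that `process_chunk` needs only its own chunk, so the chunks can be processed in any order, on any machine, without coordination until the final combine.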
Example: Input = "big brown fox jumped over a brown fence". The requirement is to count the number of occurrences of each word.
1. Break the input "big brown fox jumped over a brown fence" into individual words and map each one into a key/value pair [key, value]: [big, 1], [brown, 1], [fox, 1], [jumped, 1], [over, 1], [a, 1], [brown, 1], [fence, 1].
2. Sort them by the key.
3. Send all the mapped 'brown' key/value pairs to the same reducer node and group their values, e.g. 'brown' => [1, 1], where each occurrence of 'brown' contributes its count of 1.
4. Count up all the values for each word, so that e.g. 'brown' appears 2 times.
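The steps above can be sketched in Python, with one process standing in for the mapper, the sort/shuffle, and the reducer nodes. This is a minimal sketch, not how a real framework like Hadoop structures the code.

```python
# Minimal word-count sketch following the four steps above.
from itertools import groupby

def map_phase(text):
    # Step 1: emit a [word, 1] pair for every word in the input.
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    # Steps 2-3: sort the pairs by key, then group pairs with the same
    # key together -- on a cluster each group goes to one reducer node.
    pairs = sorted(pairs)
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    # Step 4: count up all the values for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

pairs = map_phase("big brown fox jumped over a brown fence")
counts = reduce_phase(shuffle_phase(pairs))
print(counts["brown"])  # 2 -- 'brown' appears 2 times
```

The shuffle is the step that makes this parallelizable: because all pairs for a given key end up at the same reducer, each reducer can finish its words without seeing anyone else's data.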
"Facebook created Cassandra, Google created BigTable and MapReduce, Amazon created SimpleTable and LinkedIn created Project Voldemort", etc. We know there are quite a few solutions out there.