MapReduce in the mongo shell

One of the most interesting features that has been underappreciated and not widely supported throughout MongoDB history, is the ability to write MapReduce natively using the shell.

MapReduce is a data processing method for getting aggregation results from large sets of data. The main advantage of this is that it is inherently parallelizable as evidence by frameworks such as Hadoop.

MapReduce is really useful when used to implement a data pipeline. Multiple MapReduce commands can be chained to produce different results. An example of this is aggregating data by using different reporting periods (such as hour, day, week, month, and year) where we use the output of each more granular reporting period to produce a less granular report.

A simple example of MapReduce in our examples given that our input books collection is as follows:

> db.books.find()
{ "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30 }
{ "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50 }
{ "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40 }

Our map and reduce functions are defined as follows:

> var mapper = function() {
emit(this.id, 1);
};

In this mapper, we simply output a key of id of each document with a value of 1:

> var reducer = function(id, count) {
return Array.sum(count);
};

In reducer, we sum across all values (where each one has a value of 1):

> db.books.mapReduce(mapper, reducer, { out:"books_count" });
{
"result" : "books_count",
"timeMillis" : 16613,
"counts" : {
"input" : 3,
"emit" : 3,
"reduce" : 1,
"output" : 1
},
"ok" : 1
}
> db.books_count.find()
{ "_id" : null, "value" : 3 }
>

Our final output will be a document with no ID, since we didn't output any value for ID, and a value of six, since there are six documents in the input dataset.

Using MapReduce, MongoDB will apply a map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer will get key-value pairs with the same key as input, processing all multiple values. The reducer's output will be a single key-value pair for each key.

Optionally, we can use a finalize function to further process the results of mapper and reducer. MapReduce functions use JavaScript and run within the mongod process. MapReduce can output inline as a single document, subject to the 16 MB document size limit, or as multiple documents in an output collection. Input and output collections can be sharded.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset