Consolidating read querying

We should aim to serve each read with as few queries as possible. We can achieve this by embedding information into sub-documents instead of keeping it in separate entities. The trade-off is an increased write load, because the same data points live in multiple documents and must be updated everywhere when they change in one place.

The design considerations here are as follows:

The read performance benefits from data duplication/denormalization.
The data integrity benefits from data references (DBRef or in-application code, using an attribute as a foreign key).

We should denormalize if our read/write ratio is high (our data rarely changes, but it is read many times between changes), if our data can afford to be inconsistent for brief periods of time, and, most importantly, if we absolutely need our reads to be as fast as possible and are willing to pay the price in consistency and write performance.
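The trade-off can be sketched in plain JavaScript, with in-memory arrays standing in for collections. The `users` collection, the `user_name` field, and the helper functions are hypothetical and only illustrate the idea: the denormalized read needs no second lookup, while a rename has to touch every document that holds a copy.

```javascript
// Hypothetical in-memory stand-in for a users collection.
const users = [
  { _id: 1, name: 'alice' },
  { _id: 2, name: 'bob' },
];

// Normalized: each review references its author through user_id (a foreign key).
const normalizedBook = {
  isbn: '1001',
  reviews: [
    { user_id: 1, rating: 5 },
    { user_id: 2, rating: 3 },
  ],
};

// Denormalized: the author's name is copied into every review.
const denormalizedBook = {
  isbn: '1001',
  reviews: [
    { user_id: 1, user_name: 'alice', rating: 5 },
    { user_id: 2, user_name: 'bob', rating: 3 },
  ],
};

// Normalized read: one extra lookup per review to resolve the name.
function reviewerNamesNormalized(book, userList) {
  return book.reviews.map(
    (review) => userList.find((user) => user._id === review.user_id).name
  );
}

// Denormalized read: everything is already in the book document.
function reviewerNamesDenormalized(book) {
  return book.reviews.map((review) => review.user_name);
}

// Rename a user and count how many documents had to be written:
// the user document itself, plus every book that embeds a copy of the name.
function renameUser(userList, books, userId, newName) {
  let documentsWritten = 0;
  const user = userList.find((candidate) => candidate._id === userId);
  user.name = newName;
  documentsWritten += 1;
  for (const book of books) {
    let touched = false;
    for (const review of book.reviews) {
      if (review.user_id === userId && 'user_name' in review) {
        review.user_name = newName;
        touched = true;
      }
    }
    if (touched) {
      documentsWritten += 1;
    }
  }
  return documentsWritten;
}
```

With one denormalized book, renaming a user costs two document writes instead of one, and that number grows with every book the user has reviewed.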

The most obvious candidates for fields that we should denormalize (embed) are dependent fields. If we have an attribute or a document structure that we don't plan to query on its own, but only as part of a containing attribute/document, then it makes sense to embed it, rather than have it in a separate document/collection.

Using our MongoDB books example, a book can have a related data structure that refers to a review from a reader of the book. If our most common use case is showing a book along with its associated reviews, then we can embed reviews into the book document.

The downside to this design is that finding all of the book reviews by a user will be costly, as we will have to iterate over all of the books to collect the associated reviews. Denormalizing users and embedding their reviews in the user documents as well can be a solution to this problem.
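A minimal in-memory sketch of why this lookup is costly: every book has to be visited, even those the user never reviewed. The second book (`1002`), its review, and the `reviewsByUser` helper are hypothetical additions for illustration.

```javascript
// Hypothetical sample data: reviews are embedded in each book document.
const books = [
  {
    isbn: '1001',
    reviews: [
      { user_id: 1, text: 'great book', rating: 5 },
      { user_id: 2, text: 'not so bad book', rating: 3 },
    ],
  },
  {
    isbn: '1002',
    reviews: [{ user_id: 1, text: 'useful reference', rating: 4 }],
  },
];

// Collect every review written by one user. The work is proportional to
// the total number of books and reviews, not to the user's review count.
function reviewsByUser(allBooks, userId) {
  return allBooks.flatMap((book) =>
    book.reviews
      .filter((review) => review.user_id === userId)
      .map((review) => ({ isbn: book.isbn, ...review }))
  );
}
```

Calling `reviewsByUser(books, 1)` scans both books to return the user's two reviews. A copy of those reviews embedded in the user document would answer the same question with a single read.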

A counterexample is data that can grow unbounded. In our example, embedding reviews along with heavy metadata can become a problem if a book document hits the 16 MB document size limit. A solution is to distinguish between data structures that we expect to grow rapidly and those that we don't, and to keep an eye on their sizes through monitoring processes that query our live dataset at off-peak times and report on attributes that may pose a risk down the line.
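Such a monitoring check could look roughly like the sketch below. It is an approximation under stated assumptions: the JSON byte length is only a rough proxy for the real BSON size, and the `findDocumentsAtRisk` helper and its `warnRatio` threshold are hypothetical names chosen for illustration.

```javascript
// The 16 MB limit on a single document, in bytes.
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024;

// Rough proxy only: JSON byte length differs from the actual BSON size.
function approximateSizeBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), 'utf8');
}

// Report documents whose approximate size has reached a fraction
// (warnRatio, an assumed threshold) of the limit.
function findDocumentsAtRisk(docs, warnRatio = 0.5) {
  return docs
    .map((doc) => ({ isbn: doc.isbn, bytes: approximateSizeBytes(doc) }))
    .filter((entry) => entry.bytes >= MAX_DOCUMENT_BYTES * warnRatio);
}
```

Running a check like this on a schedule at off-peak times gives early warning well before a write fails against the limit.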

Don't embed data that can grow unbounded.

When we embed attributes, we have to decide whether we will use a sub-document or an enclosing array.

When we have a unique identifier to access the sub-document, we should embed it as a sub-document. If we don't know exactly how to access it or we need the flexibility to be able to query for an attribute's values, then we should embed it in an array.

For example, with our books collection, if we decide to embed reviews into each book document, we have the following two design options:

A book document with an array:

{
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: [
    { 'user_id': 1, text: 'great book', rating: 5 },
    { 'user_id': 2, text: 'not so bad book', rating: 3 },
  ]
}

A book with an embedded document, where each review is keyed by a unique identifier (here, the user ID):

{
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: {
    '1': { 'user_id': 1, text: 'great book', rating: 5 },
    '2': { 'user_id': 2, text: 'not so bad book', rating: 3 },
  }
}

The array structure has the advantage that we can directly query MongoDB for all of the reviews with a rating greater than 4 through the embedded array reviews.
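As an illustration, the plain JavaScript below mirrors in memory what such a filter selects. The `reviewsAbove` helper is hypothetical. On the database side, a query filter such as `{ 'reviews.rating': { $gt: 4 } }` matches the books that contain at least one such review, and returning only the matching array elements would additionally need a projection or an aggregation stage such as `$filter`.

```javascript
// Sample book document using the array layout.
const bookWithReviewArray = {
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: [
    { user_id: 1, text: 'great book', rating: 5 },
    { user_id: 2, text: 'not so bad book', rating: 3 },
  ],
};

// In-memory equivalent of selecting the array elements whose rating
// is strictly greater than minRating.
function reviewsAbove(book, minRating) {
  return book.reviews.filter((review) => review.rating > minRating);
}
```

Because every element of the array has the same shape, the condition can be expressed once against the `rating` field, whichever user wrote the review.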

Using the embedded document structure, on the other hand, we can retrieve all of the reviews the same way that we would using the array, but if we want to filter them, it has to be done on the application side, rather than on the database side.
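The application-side filtering could be sketched as follows, assuming the reviews are embedded as a sub-document keyed by user ID. The `filterKeyedReviews` helper is hypothetical.

```javascript
// Sample book document using the keyed sub-document layout.
const bookWithKeyedReviews = {
  isbn: '1001',
  title: 'Mastering MongoDB',
  reviews: {
    '1': { text: 'great book', rating: 5 },
    '2': { text: 'not so bad book', rating: 3 },
  },
};

// Direct access works when we know the key.
const reviewFromUserOne = bookWithKeyedReviews.reviews['1'];

// Filtering by value has to walk every key in application code.
function filterKeyedReviews(reviews, minRating) {
  return Object.entries(reviews)
    .filter(([, review]) => review.rating > minRating)
    .map(([userId, review]) => ({ user_id: Number(userId), ...review }));
}
```

The keyed layout makes the lookup by identifier trivial, at the cost of fetching the whole `reviews` sub-document before any value-based filter can run.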
