Data normalization is the process of organizing documents and collections to minimize redundancy and dependency. You normalize data by identifying object properties that are subobjects and storing each of those subobjects as a separate document in a collection other than the one that holds the object’s document. Typically you do this for objects that have a one-to-many or many-to-many relationship with subobjects.
The advantage of normalizing data is that the database will generally be smaller because only a single copy of each subobject exists, in its own collection, instead of being duplicated across many objects in a single collection. Also, if you modify the information in the subobject frequently, you only need to modify a single instance rather than every record in the object’s collection that has that subobject.
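The update advantage can be illustrated with a minimal sketch. It uses plain Python dictionaries as stand-ins for collections rather than a real database driver, and the store values, the user count, and both helper functions are hypothetical names chosen only for illustration.

```python
# Hypothetical in-memory stand-ins for the two designs; not a database API.
store = {"name": "Corner Market", "street": "1 Main St",
         "city": "Springfield", "zip": "00000"}

# Embedded design: every user document carries its own copy of the store.
embedded_users = [{"name": f"user{i}", "favoriteStore": dict(store)}
                  for i in range(1000)]

# Normalized design: one store document; users hold only its _id.
stores = {1: dict(store)}
normalized_users = [{"name": f"user{i}", "favoriteStore": 1}
                    for i in range(1000)]


def change_street_embedded(users, new_street):
    """Update every embedded copy; return the number of documents written."""
    writes = 0
    for user in users:
        user["favoriteStore"]["street"] = new_street
        writes += 1
    return writes


def change_street_normalized(store_docs, store_id, new_street):
    """Update the single shared store document; return the number of writes."""
    store_docs[store_id]["street"] = new_street
    return 1


embedded_writes = change_street_embedded(embedded_users, "2 Oak Ave")
normalized_writes = change_street_normalized(stores, 1, "2 Oak Ave")
print(embedded_writes, normalized_writes)
```

In this sketch the embedded design touches one document per user, while the normalized design touches a single store document regardless of how many users reference it.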
A major disadvantage of normalizing data is that when you look up objects that require the normalized subobject, a separate lookup must occur to retrieve and link the subobject. This can result in a significant performance hit if you access that data frequently.
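The extra lookup can be made visible with a small sketch. The Collection class below is a hypothetical in-memory stand-in that counts lookups; it is not a real driver API, and the sample documents are invented for illustration.

```python
class Collection:
    """Hypothetical in-memory collection that counts lookups by _id."""

    def __init__(self, docs):
        self._docs = {doc["_id"]: doc for doc in docs}
        self.lookups = 0

    def find_by_id(self, doc_id):
        self.lookups += 1
        return self._docs.get(doc_id)


users = Collection([{"_id": 1, "name": "Ada", "favoriteStore": 10}])
stores = Collection([{"_id": 10, "name": "Corner Market"}])


def user_with_store(user_id):
    """Fetch a user, then resolve its store reference with a second lookup."""
    user = users.find_by_id(user_id)                   # lookup 1: the user
    store = stores.find_by_id(user["favoriteStore"])   # lookup 2: the store
    return {**user, "favoriteStore": store}


result = user_with_store(1)
total_lookups = users.lookups + stores.lookups
print(total_lookups)
```

Under these assumptions, reading one user together with its store costs two lookups instead of one, which is the overhead that grows when the data is read often.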
An example of when it makes sense to normalize data is a system that contains users that have a favorite store. Each User is an object with name, phone, and favoriteStore properties. The favoriteStore property is also a subobject that contains name, street, city, and zip properties.
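Before normalization, a User object with this shape might look like the following sketch. The property names come from the description above; every value is a made-up placeholder.

```python
import json

# A single User document with the store embedded as a subobject.
# All values are hypothetical placeholders.
user = {
    "name": "Ada",
    "phone": "555-0100",
    "favoriteStore": {
        "name": "Corner Market",
        "street": "1 Main St",
        "city": "Springfield",
        "zip": "00000",
    },
}

print(json.dumps(user, indent=2))
```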
However, thousands of users may share the same favorite store, so a single store is related to many users. Therefore, it doesn’t make sense to store the FavoriteStore object data in each User object because that would result in thousands of duplicate copies. Instead, each FavoriteStore object should include an _id property that documents in the Users collection can reference. The application can then use the favoriteStore reference ID to link data from the Users collection to FavoriteStore documents in the FavoriteStores collection.
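One way to picture the split is a small function that turns embedded user documents into separate Users and FavoriteStores lists. This is only a sketch under stated assumptions: the normalize function, the integer _id values, and the sample data are hypothetical, and a real database would typically generate its own _id values.

```python
def normalize(embedded_users):
    """Split embedded user documents into Users and FavoriteStores lists."""
    favorite_stores = []  # one document per distinct store
    users = []            # user documents that hold only a store _id
    ids_by_store = {}     # maps a store's field values to its assigned _id

    for user in embedded_users:
        store = user["favoriteStore"]
        key = (store["name"], store["street"], store["city"], store["zip"])
        if key not in ids_by_store:
            store_id = len(favorite_stores) + 1  # simple integer _id for illustration
            ids_by_store[key] = store_id
            favorite_stores.append({"_id": store_id, **store})
        users.append({
            "name": user["name"],
            "phone": user["phone"],
            "favoriteStore": ids_by_store[key],  # reference, not a copy
        })
    return users, favorite_stores


market = {"name": "Corner Market", "street": "1 Main St",
          "city": "Springfield", "zip": "00000"}
bakery = {"name": "Oak Bakery", "street": "2 Oak Ave",
          "city": "Springfield", "zip": "00000"}

embedded = [
    {"name": "Ada", "phone": "555-0100", "favoriteStore": dict(market)},
    {"name": "Ben", "phone": "555-0101", "favoriteStore": dict(market)},
    {"name": "Cy", "phone": "555-0102", "favoriteStore": dict(bakery)},
]

users, favorite_stores = normalize(embedded)
print(len(users), len(favorite_stores))
```

Here the two users who share a store end up with the same favoriteStore value, and the store's details are kept once in the FavoriteStores list.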
Figure 11.1 illustrates the structure of the Users and FavoriteStores collections described above.