Balancing data – how to track and keep our data balanced

One of the advantages of sharding in MongoDB is that it is mostly transparent to the application and requires minimal administration and operational effort.

One of the core tasks that MongoDB needs to perform continuously is balancing data between shards. Whether we implement range-based or hash-based sharding, MongoDB needs to maintain bounds (chunk ranges) over the shard key so that it can figure out which shard to direct every new document insert or update toward. As our data grows, these bounds may need to be readjusted to avoid having a hot shard that ends up holding the majority of our data.

For the sake of this example, let's assume that there is a data type named extra_tiny_int that holds integer values in the half-open range [-12, 12). If we enable sharding on an extra_tiny_int field, then the initial bounds of our data will be the whole range of values, denoted by $minKey: -12 and $maxKey: 11.

After we insert some initial data, MongoDB will generate chunks and recalculate the bounds of each chunk to try and balance our data.

By default, the initial number of chunks created by MongoDB is 2 × number of shards.

In our case of two shards and four initial chunks, the initial bounds will be calculated as follows:

  • Chunk1: [-12, -6)
  • Chunk2: [-6, 0)
  • Chunk3: [0, 6)
  • Chunk4: [6, 12)

Here, '[' is inclusive and ')' is exclusive.
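The initial split of the key space can be sketched as a simple calculation. The function below is a hypothetical illustration of the 2 × shards rule over an integer key range (MongoDB's real splitting operates on chunk metadata held by the config servers):

```python
def initial_chunks(min_key, max_key, num_shards):
    """Split the half-open key range [min_key, max_key) into
    2 * num_shards equally sized chunks."""
    n = 2 * num_shards
    width = (max_key - min_key) // n
    bounds = [min_key + i * width for i in range(n)] + [max_key]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]

# Two shards over extra_tiny_int's range [-12, 12) -> four chunks
print(initial_chunks(-12, 12, 2))
# [(-12, -6), (-6, 0), (0, 6), (6, 12)]
```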

The following diagram illustrates the preceding explanation:

After we insert some data, our chunks will look as follows:

  • ShardA:
    • Chunk1: -12, -8, -7
    • Chunk2: -6
  • ShardB:
    • Chunk3: 0, 2
    • Chunk4: 7, 8, 9, 10, 11, 11, 11, 11
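Routing a document to a chunk is then a matter of finding the bounds that contain its shard key value. A minimal sketch with hypothetical names (in reality, mongos performs this lookup against the chunk metadata on the config servers):

```python
# Chunk bounds from the example: name -> half-open [lower, upper) range
CHUNKS = {
    "chunk1": (-12, -6),
    "chunk2": (-6, 0),
    "chunk3": (0, 6),
    "chunk4": (6, 12),
}

def route(key):
    """Return the chunk whose [lower, upper) range contains key."""
    for name, (lower, upper) in CHUNKS.items():
        if lower <= key < upper:
            return name
    raise ValueError(f"{key} is outside the shard key range")

print(route(-7), route(0), route(11))  # chunk1 chunk3 chunk4
```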

The following diagram illustrates the preceding explanation:

In this case, we observe that chunk4 has more items than any other chunk. MongoDB will first split chunk4 into two new chunks, attempting to keep the size of each chunk under a certain threshold (64 MB, by default).

Now, instead of chunk4, we have chunk4A: 7, 8, 9, 10 and chunk4B: 11, 11, 11, 11.

The following diagram illustrates the preceding explanation:

The new bounds of the two chunks are as follows:

  • chunk4A: [6,11)
  • chunk4B: [11,12)
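The split of chunk4 can be sketched as follows. This is an illustrative simplification that splits at the median key so that each half holds roughly the same number of documents (MongoDB actually decides splits based on chunk size in bytes):

```python
def split_chunk(lower, upper, keys):
    """Split the chunk [lower, upper) at the median of its keys,
    returning the two resulting half-open ranges."""
    keys = sorted(keys)
    split_point = keys[len(keys) // 2]  # the median key becomes the new boundary
    return (lower, split_point), (split_point, upper)

# chunk4 held the keys 7, 8, 9, 10, 11, 11, 11, 11 in the range [6, 12)
chunk4a, chunk4b = split_chunk(6, 12, [7, 8, 9, 10, 11, 11, 11, 11])
print(chunk4a, chunk4b)  # (6, 11) (11, 12)
```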

Note that chunk4B can only hold one value. This is now an indivisible chunk—a chunk that cannot be broken down into smaller ones anymore—and will grow in size unbounded, causing potential performance issues down the line.

This clarifies why we need to use a high-cardinality field as our shard key and why something like a Boolean, which only has true/false values, is a bad choice for a shard key.
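The cardinality limit can be made concrete: a chunk boundary can only sit between distinct shard key values, so the number of chunks a collection can ever be split into is capped by the key's cardinality. A toy illustration:

```python
def max_chunks(values):
    """The number of chunks is capped by the number of distinct
    shard key values (the key's cardinality)."""
    return len(set(values))

print(max_chunks([True, False, True, True]))  # 2 -> at most two chunks, ever
print(max_chunks(range(1_000_000)))           # plenty of room to split
```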

In our case, we now have two chunks in ShardA and three chunks in ShardB. Let's look at the following table of migration thresholds:

  Number of chunks    Migration threshold
  Fewer than 20       2
  20–79               4
  80 and greater      8

We have not reached our migration threshold yet, since 3 − 2 = 1, which is below the threshold of 2.

The migration threshold is calculated as the difference between the number of chunks in the shard with the highest count of chunks and the number of chunks in the shard with the lowest count of chunks. For example:

  • Shard1 -> 85 chunks
  • Shard2 -> 86 chunks
  • Shard3 -> 92 chunks

In the preceding example, balancing will not occur until Shard3 (or Shard2) reaches 93 chunks, because the migration threshold for 80 or more chunks is 8, and the difference between Shard3 and Shard1 is still only 7 chunks (92 − 85).
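The decision captured by the table and the example above can be sketched in a few lines. The function names are hypothetical (the real balancer runs as a background process against the config server metadata):

```python
def migration_threshold(num_chunks):
    """Migration threshold from the table above."""
    if num_chunks < 20:
        return 2
    if num_chunks < 80:
        return 4
    return 8

def needs_balancing(chunk_counts):
    """Balance when the spread between the most and least loaded
    shards reaches the threshold for the current chunk count."""
    most, least = max(chunk_counts), min(chunk_counts)
    return most - least >= migration_threshold(most)

print(needs_balancing([85, 86, 92]))  # False: 92 - 85 = 7, below the threshold of 8
print(needs_balancing([85, 86, 93]))  # True:  93 - 85 = 8, threshold reached
```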

If we continue adding data in chunk4A, it will eventually be split into chunk4A1 and chunk4A2.

Now we have four chunks in ShardB (chunk3, chunk4A1, chunk4A2, and chunk4B) and two chunks in ShardA (chunk1 and chunk2).

The following diagram illustrates the relationships of the chunks to the shards:

The MongoDB balancer will now migrate one chunk from ShardB to ShardA, as 4 − 2 = 2, which reaches the migration threshold of 2 for fewer than 20 chunks. The balancer will adjust the boundaries between the two shards in order to be able to query more effectively (targeted queries). The following diagram illustrates the preceding explanation:

As you can see from the preceding diagram, MongoDB will try to split >64 MB chunks in half in terms of size. The bounds between the two resulting chunks may be completely uneven if our data distribution is uneven to begin with. MongoDB can split chunks into smaller ones, but cannot merge them automatically. We need to manually merge chunks, a delicate and operationally expensive procedure.
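The bookkeeping that a manual merge performs can be sketched as follows: two or more adjacent half-open ranges on the same shard collapse into one. This is an illustrative simulation only; in MongoDB the operation is carried out with the mergeChunks administrative command, subject to the same contiguity requirement:

```python
def merge_chunks(chunks):
    """Merge adjacent chunks: consecutive half-open ranges [a, b), [b, c)
    collapse into [a, c). The chunks must be contiguous, as MongoDB requires."""
    merged_lower, merged_upper = chunks[0]
    for lower, upper in chunks[1:]:
        if lower != merged_upper:
            raise ValueError("chunks are not contiguous")
        merged_upper = upper
    return (merged_lower, merged_upper)

# Undo the earlier split: chunk4A [6, 11) + chunk4B [11, 12) -> [6, 12)
print(merge_chunks([(6, 11), (11, 12)]))  # (6, 12)
```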
