OSD variance and fillage

The variance in OSD utilization is crucial in determining whether one or more OSDs need to be reweighted to even out the workload. A reasonably uniform distribution of PGs, and thus of data, among OSDs is also important so that outliers allocated significantly more than their share do not become full. In this section, we'll explore this phenomenon and how to watch for it. In Chapter 19, Operations and Maintenance, we described a strategy for mitigating uneven distribution; here we explore why it happens and how to adjust thresholds to meet your local needs.

The variance for some OSDs can increase when objects are not allocated in a balanced manner across all PGs. The CRUSH algorithm treats every PG as equal and thus distributes data based on PG count and OSD weight, but this does not always match real-world demands. Client-side workloads are not aware of PGs or their distribution in the cluster; they only know about S3 objects or RBD image blocks. Thus, depending on the names of the underlying RADOS objects that comprise the higher-level S3 objects or RBD image blocks, those objects may be mapped to a small set of PGs that are colocated on a small set of OSDs. If this occurs, the OSDs housing those PGs will show a higher variance and are likely to reach capacity sooner than others, so we need to watch out for OSDs with higher utilization variance. Similarly, we should watch for OSDs that show significantly lower variance over a long period of time; these OSDs are not pulling their weight and should be examined to determine why. OSDs with a variance equal to or close to 1.0 can be considered to be in their optimal state.
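One convenient way to spot such outliers is the ceph osd df command, which reports each OSD's utilization along with a VAR column showing that utilization relative to the cluster mean (exact column names may vary slightly between releases). OSDs whose variance is well above 1.00 deserve attention before they approach their capacity limits, and those well below 1.00 may warrant investigation as well:

root@ceph-client0:~# ceph osd df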

CRUSH weights reflect disk capacity, while the OSD reweight described in the previous chapter acts as an override (or fudge factor) to even out PG distribution. Ceph has configurable limits that allow us to control the amount of space utilized within an OSD and to take action when those limits are reached. The two main configuration options are mon_osd_nearfull_ratio and mon_osd_full_ratio. The near-full ratio acts as a warning: any OSD that exceeds this threshold causes the overall Ceph cluster status to change from HEALTH_OK to HEALTH_WARN. The full ratio is a hard threshold above which client I/O cannot complete on the affected OSD until usage drops back below it. These values default to 85% and 95% of total capacity, respectively.
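For reference, here is a sketch of how these ratios might be supplied as initial values in ceph.conf; as described below, the values recorded in the PG map take precedence once the cluster has been running:

[mon]
    mon_osd_nearfull_ratio = 0.85
    mon_osd_full_ratio = 0.95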

It is vital to note that, unlike most Ceph settings, the compiled-in defaults and any changes in ceph.conf provide only initial values. Once a cluster goes through multiple OSD map updates, the near-full and full ratios from the PG map override those initial values. To ensure that the thresholds we set are always respected for all drives within the cluster, the values we configure for the near-full and full ratios should match those in the PG map. We can set the PG map values as shown below:

root@ceph-client0:~# ceph pg set_nearfull_ratio 0.85
root@ceph-client0:~# ceph pg set_full_ratio 0.95

These commands may also be used to raise the thresholds in an emergency in order to restore cluster operation until the situation can be addressed properly. Be warned, though, that these thresholds exist for a reason: it is vital for the continued well-being of your cluster to promptly address very full OSDs.
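If the cluster does report a warning or error state because of fillage, the ceph health detail command lists the specific OSDs that have crossed the near-full or full thresholds together with their utilization, which tells us exactly where to focus reweighting or capacity expansion:

root@ceph-client0:~# ceph health detail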
