Flags

Ceph has a number of flags that usually apply to the cluster as a whole. These flags direct Ceph's behavior in various ways, and when set they are reported by ceph status.
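
A quick way to see which flags are currently set, without the rest of the status output, is to grep the osdmap dump; on the Jewel cluster used below this prints a single line (the output shown here is illustrative):

# ceph osd dump | grep flags
flags sortbitwise,require_jewel_osds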

The most commonly utilized flag is noout, which directs Ceph not to automatically mark out any OSDs that enter the down state. Since OSDs will not be marked out while the flag is set, the cluster will not start the backfill/recovery process that would otherwise restore optimal replication. This is most useful when performing maintenance, including simple reboots. By telling Ceph "Hold on, we'll be right back," we save the overhead and churn of automatically triggered data movement.

Here's an example of rebooting an OSD node within a Jewel cluster. First, we run ceph -s to check the overall status of the cluster. This is an excellent habit to get into before, during, and after even the simplest of maintenance; if something is amiss it is almost always best to restore full health before stirring the pot.

# ceph -s
   cluster 3369c9c6-bfaf-4114-9c31-576afa64d0fe
     health HEALTH_OK
     monmap e2: 5 mons at {mon001=10.8.45.10:6789/0,mon002=10.8.45.11:6789/0,mon003=10.8.45.143:6789/0,mon004=10.80.46.10:6789/0,mon005=10.80.46.11:6789/0}
            election epoch 24, quorum 0,1,2,3,4 mon001,mon002,mon003,mon004,mon005
     osdmap e33039: 280 osds: 280 up, 280 in
            flags sortbitwise,require_jewel_osds
      pgmap v725092: 16384 pgs, 1 pools, 0 bytes data, 1 objects
            58364 MB used, 974 TB / 974 TB avail
               16384 active+clean

Here we see that the cluster health is HEALTH_OK, and that all 280 OSDs are both up and in. Squeaky clean healthy cluster. Note that two flags are already set: sortbitwise and require_jewel_osds. The sortbitwise flag denotes an internal sorting change required for certain new features. The require_jewel_osds flag avoids compatibility problems by preventing pre-Jewel OSDs from joining the cluster. Both should always be set on clusters running Jewel or later releases, and you will not need to mess with either directly unless you're upgrading from an earlier release.
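
If you do find yourself finishing such an upgrade, these flags are set with the same mechanism we are about to use for noout; something along these lines, run only once every OSD in the cluster is running Jewel:

# ceph osd set sortbitwise
# ceph osd set require_jewel_osds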

# ceph osd set noout
set noout

Now that we've ensured that the cluster is healthy going into our maintenance, we set the noout flag to forestall unwarranted recovery. Note that the cluster health immediately becomes HEALTH_WARN and that a line is added by ceph status showing the reason for the warning state.

# ceph status
   cluster 3369c9c6-bfaf-4114-9c31-576afa64d0fe
     health HEALTH_WARN
            noout flag(s) set
     monmap e2: 5 mons at {mon001=10.8.45.10:6789/0,mon002=10.8.45.11:6789/0,mon003=10.8.45.143:6789/0,mon004=10.80.46.10:6789/0,mon005=10.80.46.11:6789/0}
            election epoch 24, quorum 0,1,2,3,4 mon001,mon002,mon003,mon004,mon005
     osdmap e33050: 280 osds: 280 up, 280 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v725101: 16384 pgs, 1 pools, 0 bytes data, 1 objects
            58364 MB used, 974 TB / 974 TB avail
               16384 active+clean
  

At this point we still have all OSDs up and in; we're good to proceed.

# ssh osd013 shutdown -r now

We've rebooted a specific OSD node, say to apply a new kernel or to straighten out a confused HBA. Ceph quickly marks the OSDs on this system down; 20 are provisioned on this particular node. Note a new line in the status output below calling out these down OSDs, with a corresponding adjustment to the number of up OSDs on the osdmap line. It's easy to gloss over disparities in the osdmap entry, especially when using fonts in which values like 260 and 280 look similar, so we're pleased that Ceph explicitly alerts us to the situation.
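
To see exactly which OSDs are down and where they live, ceph health detail lists them individually, or you can filter the CRUSH tree; for example:

# ceph health detail | grep down
# ceph osd tree | grep -w down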

This cluster has CRUSH rules that require copies of data to reside in disjoint racks. With the default replicated pool size and min_size settings of three and two respectively, all placement groups (PGs) whose acting sets include one of these OSDs are marked undersized and degraded. With only one replica out of service, Ceph continues serving data without missing a beat. This example cluster is idle, but in production writes will arrive while the host is down. These operations manifest in the output of ceph status as additional lines listing PGs in the backfill_wait state, indicating that Ceph has data that yearns to be written to the OSDs that are (temporarily) down.
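
If you are unsure of a pool's replication settings, they can be checked directly; a quick sketch, assuming a pool named rbd:

# ceph osd pool get rbd size
size: 3
# ceph osd pool get rbd min_size
min_size: 2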

If we had not set the noout flag, after a short grace period Ceph would have proceeded to map that new data (and the existing data allocated to the down OSDs) to other OSDs in the same failure domain. Since a multi-rack replicated pool usually specifies the failure domain as the rack, that would mean that each of the three surviving hosts in the same rack as osd013 would receive a share. Then when the host (and its OSDs) came back up, Ceph would map that data back to its original locations, and lots of recovery would ensue to move it home. This double movement of data is superfluous so long as we get the OSDs back up within a reasonable amount of time.
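
The length of that grace period is governed by the mon_osd_down_out_interval option, which defaults to 600 seconds. You can confirm the running value on a monitor via its admin socket; a sketch, assuming we are logged into mon001:

# ceph daemon mon.mon001 config get mon_osd_down_out_interval
{
    "mon_osd_down_out_interval": "600"
}

Raising the interval is an alternative for longer maintenance windows, but for planned work the noout flag remains the simpler and more explicit tool.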

# ceph status
    cluster 3369c9c6-bfaf-4114-9c31-576afa64d0fe
     health HEALTH_WARN
            3563 pgs degraded
            3261 pgs stuck unclean
            3563 pgs undersized
            20/280 in osds are down
            noout flag(s) set
     monmap e2: 5 mons at {mon001=10.8.45.10:6789/0,mon002=10.8.45.11:6789/0,mon003=10.8.45.143:6789/0,mon004=10.80.46.10:6789/0,mon005=10.80.46.11:6789/0}
            election epoch 24, quorum 0,1,2,3,4 mon001,mon002,mon003,mon004,mon005
     osdmap e33187: 280 osds: 260 up, 280 in; 3563 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v725174: 16384 pgs, 1 pools, 0 bytes data, 1 objects
            70498 MB used, 974 TB / 974 TB avail
               12821 active+clean
                3563 active+undersized+degraded  

Note also that the tail end of this output tallies PGs that are in certain combinations of states. Here 3563 PGs are noted again as undersized but also active, since they are still available for client operations. The balance of the cluster's PGs are reported as active+clean. Sum the two numbers and we get 16384, as reported on the pgmap line.

We can exploit these PG state sums as a handy programmatic test for cluster health. During operations like rolling reboots, it is prudent to ensure complete cluster health between iterations. One way to do that is to compare the total PG count with the number of PGs that are active+clean. Since PGs can carry additional states alongside active+clean, such as active+clean+scrubbing and active+clean+scrubbing+deep, we need to sum all such combinations. Here is a simple Ansible play that implements this check.

# Assumes an earlier task registered total_PG with the cluster's total PG count.
- name: wait until clean PG == total PG
  # Escape the '+' so awk matches the literal string "active+clean"
  shell: "ceph -s | awk '/active\\+clean/ { total += $1 }; END { print total }'"
  register: clean_PG
  until: total_PG.stdout|int == clean_PG.stdout|int
  retries: 20
  delay: 20
  delegate_to: "{{ ceph_primary_mon }}"
  run_once: true

There is room for improvement: we could feed the check the more surgical ceph pg stat instead of ceph -s, and use the safer -f json or -f json-pretty output formats along with the jq utility to guard against inter-release changes in the plain output. This is, as they say, left as an exercise for the reader.
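
As a starting point for that exercise, here is a rough sketch of the same check using ceph status -f json and jq. The key names (.pgmap.num_pgs and .pgmap.pgs_by_state) reflect the Jewel-era JSON schema; verify them against your own release before relying on this:

# ceph status -f json | jq '(.pgmap.num_pgs) as $total | ([.pgmap.pgs_by_state[] | select(.state_name | startswith("active+clean")) | .count] | add) == $total'
true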

While we were diverted by the above tip, the OSD node we rebooted came back up and its OSDs rejoined the cluster. Note that the warning that 20 out of 280 OSDs are down is gone, reflected on the osdmap line as well. The health status, however, remains at HEALTH_WARN so long as we have a flag set that limits cluster behavior. This helps us remember to remove temporary flags when we no longer need their specific behavior modification.

# ceph status
   cluster 3369c9c6-bfaf-4114-9c31-576afa64d0fe
    health HEALTH_WARN
           noout flag(s) set
    monmap e2: 5 mons at {mon001=10.8.45.10:6789/0,mon002=10.8.45.11:6789/0,mon003=10.8.45.143:6789/0,mon004=10.80.46.10:6789/0,mon005=10.80.46.11:6789/0}
           election epoch 24, quorum 0,1,2,3,4 mon001,mon002,mon003,mon004,mon005
    osdmap e33239: 280 osds: 280 up, 280 in
           flags noout,sortbitwise,require_jewel_osds
     pgmap v725292: 16384 pgs, 1 pools, 0 bytes data, 1 objects
           58364 MB used, 974 TB / 974 TB avail
              16384 active+clean

We'll proceed to remove that flag. This double-negative command would irk your high school grammar teacher, but here it makes total sense in the context of a flag setting.

# ceph osd unset noout
unset noout

Now that we are no longer tying Ceph's hands, the cluster's health status has returned to HEALTH_OK and our exercise is complete.

# ceph status
   cluster 3369c9c6-bfaf-4114-9c31-576afa64d0fe
     health HEALTH_OK
     monmap e2: 5 mons at {mon001=10.8.45.10:6789/0,mon002=10.8.45.11:6789/0,mon003=10.8.45.143:6789/0,mon004=10.80.46.10:6789/0,mon005=10.80.46.11:6789/0}
            election epoch 24, quorum 0,1,2,3,4 mon001,mon002,mon003,mon004,mon005
     osdmap e33259: 280 osds: 280 up, 280 in
            flags sortbitwise,require_jewel_osds
      pgmap v725392: 16384 pgs, 1 pools, 0 bytes data, 1 objects
            58364 MB used, 974 TB / 974 TB avail
               16384 active+clean

With the Luminous release, the plain-format output of ceph status has changed quite a bit, which underscores the value of using the -f json output format for scripting tasks. The examples below are from a Luminous cluster, first in a healthy state:

# ceph -s
  cluster:
    id:     2afa26cb-95e0-4830-94q4-5195beakba930c
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum mon01,mon02,mon03,mon04,mon05
    mgr: mon04(active), standbys: mon05
    osd: 282 osds: 282 up, 282 in

  data:
    pools:   1 pools, 16384 pgs
    objects: 6125k objects, 24502 GB
    usage:   73955 GB used, 912 TB / 985 TB avail
    pgs:     16384 active+clean

Here is the same format during maintenance, with the noout flag set and two OSD nodes (48 OSDs in all) down:

# ceph -s
  cluster:
    id:     2af1107b-9950-4800-94a4-51951701a02a930c
    health: HEALTH_WARN
            noout flag(s) set
            48 osds down
            2 hosts (48 osds) down
            Degraded data redundancy: 3130612/18817908 objects degraded (16.636%), 8171 pgs unclean, 8172 pgs degraded, 4071 pgs undersized

  services:
    mon: 5 daemons, quorum mon01,mon02,mon03,mon04,mon05
    mgr: mon04(active), standbys: mon05
    osd: 282 osds: 234 up, 282 in
         flags noout

  data:
    pools:   1 pools, 16384 pgs
    objects: 6125k objects, 24502 GB
    usage:   73941 GB used, 912 TB / 985 TB avail
    pgs:     3130612/18817908 objects degraded (16.636%)
             8212 active+clean
             8172 active+undersized+degraded

  io:
    client: 8010 MB/s wr, 0 op/s rd, 4007 op/s wr

In this last example the cluster is actively recovering; note that the io: section now reports recovery throughput rather than client traffic:

# ceph status
  cluster:
    id:     2afa26cb-95e0-4830-94q4-5195beakba930c
    health: HEALTH_WARN
            Degraded data redundancy: 289727/18817908 objects degraded (1.540%), 414 pgs unclean, 414 pgs degraded

  services:
    mon: 5 daemons, quorum mon01,mon02,mon03,mon04,mon05
    mgr: mon04(active), standbys: mon05
    osd: 283 osds: 283 up, 283 in

  data:
    pools:   1 pools, 16384 pgs
    objects: 6125k objects, 24502 GB
    usage:   73996 GB used, 916 TB / 988 TB avail
    pgs:     289727/18817908 objects degraded (1.540%)
             15970 active+clean
               363 active+recovery_wait+degraded
                51 active+recovering+degraded

  io:
    recovery: 1031 MB/s, 258 objects/s

Other flags include noin, norecover, nobackfill, and norebalance. Their effects are nuanced and their use cases are few. It is possible to shoot yourself in the foot with them, so research them and experiment on a busy but non-production cluster to gain a full understanding of their dynamics before messing with them.
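
They are set and cleared with the same syntax as noout; for illustration only, pausing and resuming data rebalancing with norebalance looks like this:

# ceph osd set norebalance
set norebalance
# ceph osd unset norebalance
unset norebalance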

Later in this chapter, we'll touch on the noscrub and nodeep-scrub flags.
