Scrubs

Data corruption is rare, but it does happen, a phenomenon often described as bit-rot. Sometimes we write to a drive, and a surface or cell failure results in reads failing or returning something other than what we wrote. HBA misconfiguration, SAS expander flakes, firmware design flaws, drive electronics errors, and medium failures can also corrupt data. Surface errors affect between 1 in 10^16 and 1 in 10^14 of the bits stored on HDDs. Drives can also become unseated due to human error or even a truck rumbling by. This author has also seen literal cosmic rays flip bits.

Ceph prides itself on strong data integrity and has a mechanism to alert us to this situation: scrubs. Scrubs are somewhat analogous to fsck on a filesystem and to the patrol reads or surface scans that many HBAs run. The idea is to check each replicated copy of data to ensure that the copies are mutually consistent. Since copies of data are distributed throughout the cluster, this is done at Placement Group (PG) granularity. Each PG's objects reside together on the same set of OSDs, so it is natural and efficient to scrub them together.

When a given PG is scrubbed, the primary OSD for that PG calculates a checksum of its data and requests that the other OSDs in the PG's Acting Set do the same. The checksums are then compared; if they agree, all is well. If they do not agree, Ceph flags the PG in question with the inconsistent state, and access to the affected data may be impaired until the inconsistency is repaired.
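
We need not wait for a scheduled pass if we suspect trouble with a particular PG: scrubs can be requested on demand, and the primary OSD will schedule the work as resources permit. Here is a minimal sketch; the PG ID 1.2f is a hypothetical placeholder.

# ceph pg scrub 1.2f
# ceph pg deep-scrub 1.2f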

Ceph scrubs, like dark glasses, come in two classes: light and deep.

Light scrubs are also known as shallow scrubs or simply scrubs and are, well, lightweight; by default they are performed for every PG every day. They checksum and compare only object metadata (such as size, XATTRs, and omap data) and thus complete quickly without consuming much in the way of resources. Filesystem errors and rare Ceph bugs can be caught by light scrubs.

Deep scrubs read and checksum all of a PG's object payload data. Since each replica of a PG may hold multiple gigabytes of data, these require substantially more resources than light scrubs. Large reads from media take much longer to complete and contend with ongoing client operations, so deep scrubs are spread across a longer period of time; by default, this is one week.
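
The intervals that govern both classes of scrub can be inspected on a live OSD through its admin socket; exact option names and defaults vary somewhat across releases. A quick sketch, run on a host that holds osd.0, will list settings including osd_scrub_min_interval, osd_scrub_max_interval, and osd_deep_scrub_interval.

# ceph daemon osd.0 config show | grep scrub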

Many Ceph operators find that even weekly runs result in unacceptable impact and use the osd_deep_scrub_interval setting in ceph.conf to spread them out over a longer period. There are also options to align deep scrubs with off-hours or other times of lessened client workload. One may also prevent deep scrubs from slowing recovery (as Ceph scrambles to restore data replication) by configuring osd_scrub_during_recovery with a value of false. This applies at PG granularity, not across the entire cluster, and helps avoid the dreaded blocked requests that can result when scrub and recovery operations align to disrupt pesky user traffic.
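<br>
As a sketch of the off-hours approach, the ceph.conf stanza below confines the start of new scrubs to a nightly window and keeps them out of the way of recovery. The osd_scrub_begin_hour and osd_scrub_end_hour options are available in Jewel-era and later releases; check your release's documentation for the exact set of knobs.

[osd]
osd_scrub_begin_hour = 22          # only begin new scrubs from 22:00 ...
osd_scrub_end_hour = 6             # ... until 06:00 local time
osd_scrub_during_recovery = false  # do not scrub PGs that are recovering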

Scrub settings can be dynamically adjusted throughout the cluster through the injection mechanism we explored earlier in this chapter, with the requisite caveat regarding permanence. Say that, as our awesome cluster has become a victim of its own success, we find that deep scrubs are becoming too aggressive: we want to space them out over four weeks instead of cramming them into just one, but we don't want to serially restart hundreds of OSD daemons. We can stage the change in ceph.conf and also immediately update the running config (note: the value is specified in seconds; 4 weeks x 7 days x 24 hours x 60 minutes x 60 seconds = 2419200).

# ceph tell osd.* injectargs '--osd_deep_scrub_interval 2419200' 
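
The corresponding ceph.conf entry, staged so that the new interval survives OSD restarts, might look like this:

[osd]
osd_deep_scrub_interval = 2419200    # 4 weeks, in seconds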

We can also exploit flags to turn scrubs off and on entirely throughout the cluster if, say, they are getting in the way of a deep dive into cluster protocol dynamics, or if one wishes to test or refute an assertion that they are responsible for poor performance. Leaving deep scrubs disabled for a long period of time can result in a thundering herd phenomenon when they are re-enabled, especially in Ceph's Hammer and earlier releases, so it is wise not to forget to turn them back on.

A well-engineered cluster does just fine with scrubs running even during rolling reboots, so you are encouraged to resist the urge to disable them for the duration of maintenance. Ceph Jewel and later releases leverage osd_scrub_interval_randomize_ratio to mitigate the thundering herd effect by slewing scrub scheduling, which spreads scrubs more evenly over time and helps prevent them from clumping up. Remember that light scrubs consume little in the way of resources, so it even more rarely makes sense to disable them.

# ceph osd set noscrub 
set noscrub 
# ceph osd set nodeep-scrub 
set nodeep-scrub 
... 
# ceph osd unset noscrub 
# ceph osd unset nodeep-scrub 

Bit-flips affecting stored data can remain latent for some time until a user thinks to read the affected object and gets back an error -- or worse, something different from what she wrote. Ceph stores multiple replicas of data to ensure durability and accessibility, but serves reads only from the lead (primary) OSD in each placement group's Acting Set. This means that errors in the other replicas may go unnoticed until they are overwritten or until peering results in a change in lead OSD designation. This is why we actively seek them out with deep scrubs: to find and address them before they can impair client operations.
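
One way to verify that deep scrubs really are visiting our data is to check a PG's scrub timestamps. A sketch follows; the PG ID 1.2f is again a hypothetical placeholder, the timestamps are illustrative, and the exact field names within the ceph pg query output vary by release.

# ceph pg 1.2f query | grep scrub_stamp
    "last_scrub_stamp": "2015-12-19 03:11:28.345926",
    "last_deep_scrub_stamp": "2015-12-13 17:02:44.128502",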

Here's an example of HDD errors due to a drive firmware design flaw precipitating an inconsistent PG. The format of such errors varies depending on drive type, Linux distribution, and kernel version, but they're similar with respect to the data presented.

Dec 15 10:55:44 csx-ceph1-020 kernel: end_request: I/O error, dev sdh, sector 1996791104
Dec 15 10:55:44 csx-ceph1-020 kernel: end_request: I/O error, dev sdh, sector 3936989616
Dec 15 10:55:44 csx-ceph1-020 kernel: end_request: I/O error, dev sdh, sector 4001236872
Dec 15 13:00:18 csx-ceph1-020 kernel: XFS (sdh1): xfs_log_force: error 5 returned.
Dec 15 13:00:48 csx-ceph1-020 kernel: XFS (sdh1): xfs_log_force: error 5 returned.
...

Here we see the common pattern where drive errors result in large numbers of filesystem errors.

The server in the above example uses an HBA that we can manage with LSI's storcli utility, so let's see what we can discover about what happened.

[root@csx-ceph1-020 ~]# /opt/MegaRAID/storcli/storcli64 /c0 /eall /s9 show all | grep Error
Media Error Count = 66
Other Error Count = 1

On these HBAs, Media Error Count typically reflects surface (medium) errors, and Other Error Count reflects something more systemic, like drive electronics failure or accidental unseating.
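
When in doubt, the drive's own SMART counters can corroborate the HBA's view. Here's a sketch assuming smartmontools is installed; on MegaRAID-family HBAs we address the drive by slot with -d megaraid,N (slot 9 here, to match the storcli example above), while a directly attached or passthrough drive can be queried without that option.

# smartctl -a -d megaraid,9 /dev/sdh | grep -i error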

Here's a log excerpt that shows Ceph's diligent deep scrubbing discovering the affected data replica.

2015-12-19 09:10:30.403351 osd.121 10.203.1.22:6815/3987 10429 : [ERR] 20.376b shard 121: soid a0c2f76b/rbd_data.5134a9222632125.0000000000000001/head//20 candidate had a read error
2015-12-19 09:10:33.224777 osd.121 10.203.1.22:6815/3987 10430 : [ERR] 20.376b deep-scrub 0 missing, 1 inconsistent objects
2015-12-19 09:10:33.224834 osd.121 10.203.1.22:6815/3987 10431 : [ERR] 20.376b deep-scrub 1 errors

This is then reflected in the output of ceph health or ceph status.

root@csx-a-ceph1-001:~# ceph status
  cluster ab84e9c8-e141-4f41-aa3f-bfe66707f388
   health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
   osdmap e46754: 416 osds: 416 up, 416 in
    pgmap v7734947: 59416 pgs: 59409 active+clean, 1 active+clean+inconsistent, 6 active+clean+scrubbing+deep
root@csx-a-ceph1-001:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 20.376b is active+clean+inconsistent, acting [11,38,121]
1 scrub errors

The affected placement group has been flagged as inconsistent. Because this state impairs the availability of data, even if only a tiny fraction of a percent of the cluster's total, the overall cluster health is set to HEALTH_ERR, which alerts us to a situation that must be addressed immediately.

Remediation of an inconsistent PG generally begins with identifying the ailing OSD; the log messages above direct us to osd.121. We use the procedure described later in this chapter to remove the OSD from service. The removal itself may clear the inconsistent state of the PG, or we may need to manually repair it.

root@csx-a-ceph1-001:~# ceph pg repair 20.376b

This directs Ceph to replace the faulty replicas of data by reading clean, authoritative copies from the OSDs in the balance of the Acting Set.
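
On Jewel and later releases we can also ask Ceph exactly which objects and shards were flagged, before or after issuing the repair; the JSON detail emitted varies by release.

# rados list-inconsistent-obj 20.376b --format=json-pretty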
