The 40,000-foot view

To see the overall topology of a Ceph cluster at a glance, run ceph osd tree. This shows us at once the hierarchy of CRUSH buckets, including the name and weight of each bucket, whether each OSD is marked up or down, a weight adjustment, and an advanced attribute called primary affinity. This cluster was initially provisioned with 3 racks, each housing 4 hosts, for a total of 12 OSD nodes. Each OSD node (also known as a host or a server) in turn houses 24 OSD drives.

    # ceph osd tree
    ID  WEIGHT    TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
     -1 974.89661 root default
    -14 330.76886     rack r1
     -2  83.56099         host data001
      0   3.48199             osd.0       up  1.00000          1.00000
    ...
     23   3.48199             osd.23      up  1.00000          1.00000
     -3  80.08588         host data002
     24   3.48199             osd.24      up  1.00000          1.00000
     25   3.48199             osd.25      up  1.00000          1.00000
     26   3.48199             osd.26      up  1.00000          1.00000
     27   3.48199             osd.27      up  1.00000          1.00000
     28   3.48199             osd.28      up  1.00000          1.00000
     29   3.48199             osd.29      up  1.00000          1.00000
     30   3.48199             osd.30      up  1.00000          1.00000
     31   3.48199             osd.31      up  1.00000          1.00000
     32   3.48199             osd.32      up  1.00000          1.00000
     34   3.48199             osd.34      up  1.00000          1.00000
     35   3.48199             osd.35      up  1.00000          1.00000
     36   3.48199             osd.36      up  1.00000          1.00000
     37   3.48199             osd.37      up  1.00000          1.00000
     38   3.48199             osd.38      up  1.00000          1.00000
     39   3.48199             osd.39    down        0          1.00000
     40   3.48199             osd.40      up  1.00000          1.00000
     41   3.48199             osd.41      up  1.00000          1.00000
     42   3.48199             osd.42      up  1.00000          1.00000
     43   3.48199             osd.43      up  1.00000          1.00000
     44   3.48199             osd.44      up  1.00000          1.00000
     45   3.48199             osd.45      up  1.00000          1.00000
     46   3.48199             osd.46      up  1.00000          1.00000
     47   3.48199             osd.47      up  1.00000          1.00000
     -4  83.56099         host data003
     48   3.48199             osd.48      up  1.00000          1.00000
    ...
     -5  83.56099         host data004
     72   3.48199             osd.72      up  1.00000          1.00000
    ...
     95   3.48199             osd.95      up  1.00000          1.00000
    -15 330.76810     rack r2
     -6  83.56099         host data005
     96   3.48199             osd.96      up  1.00000          1.00000
    ...
     -7  80.08557         host data006
    120   3.48199             osd.120     up  1.00000          1.00000
    ...
     -8  83.56055         host data007
     33   3.48169             osd.33      up  1.00000          1.00000
    144   3.48169             osd.144     up  1.00000          1.00000
    ...
    232   3.48169             osd.232     up  1.00000          1.00000
     -9  83.56099         host data008
    168   3.48199             osd.168     up  1.00000          1.00000
    -16 313.35965     rack r3
    -10  83.56099         host data009
    192   3.48199             osd.192     up  1.00000          1.00000
    ...
    -11  69.63379         host data010
    133   3.48169             osd.133     up  1.00000          1.00000
    ...
    -12  83.56099         host data011
    239   3.48199             osd.239     up  1.00000          1.00000
    ...
    -13  76.60388         host data012
    ...
    286   3.48199             osd.286     up  1.00000          1.00000
  

Let's go over what this tree is telling us. Note that a number of similar lines have been replaced with ellipses for brevity, a practice we will continue throughout this and following chapters.

After the column headers the first data line is:

-1 974.89661 root default 

The first column is an ID number that Ceph uses internally, and with which we rarely need to concern ourselves. The second column under the WEIGHT heading is the CRUSH weight. By default, the CRUSH weight of any bucket corresponds to its raw capacity in TB; in this case we have a bit shy of a petabyte (PB) of raw space. We'll see that this weight is the sum of the weights of the buckets under the root in the tree.

Since this cluster uses the conventional replication factor of 3, roughly 325 TB of usable space is currently available. The balance of the line is root default, which tells us that this CRUSH bucket is of the root type, and that its name is default. Complex Ceph clusters can contain multiple roots, but most need only one.
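
As a quick sanity check, we can confirm a pool's replication factor and divide the raw weight by it. This is only a rough sketch: the pool name rbd below is a placeholder for whichever pool interests you, the size value shown is illustrative, and real usable space is further reduced by filesystem overhead and the cluster's full ratios.

    # ceph osd pool get rbd size
    size: 3
    # echo "scale=2; 974.89661 / 3" | bc
    324.96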

The next line is as follows:

-14 330.76886     rack r1 

It shows a bucket of type rack, with a weight of roughly 330 TB. Skipping ahead a bit we see two more rack buckets, with weights of roughly 330 and 313 TB. Their sum gets us to the roughly 974 TB capacity (weight) of the root bucket. When rack weights are unequal, as in our example, it is usually because they contain different numbers of host buckets (or simply hosts), or, more often, because their underlying hosts have unequal weights.
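
We can verify this sum with quick arithmetic on the rack weights shown in the tree above:

    # echo "330.76886 + 330.76810 + 313.35965" | bc
    974.89661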

Next we see the following:

-2  83.56099         host data001 

This indicates a bucket of type host, with the name data001. As with root and rack buckets, the weight reflects the raw capacity (before replication) of the underlying buckets in the hierarchy. Below rack r1 in our hierarchy we see hosts named data001, data002, data003, and data004. We also see that host data002 presents a somewhat lower weight than the other three hosts. This may mean that a mixture of drive sizes has been deployed or that some drives were missed during initial deployment. In our example, though, the host simply contains 23 OSD buckets (or simply OSDs) instead of the expected 24. This reflects a drive that has failed and been removed entirely, or one that was not deployed in the first place.
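
One quick way to confirm how many OSDs a given host bucket holds is to count its children in the JSON form of the tree, which we will cover shortly, using the jq utility; a sketch:

    # ceph osd tree -f json | jq 
      '.nodes[] | select(.name=="data002") | .children | length'
    23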

Under each host bucket we see a number of OSD entries.

 24 3.48199 osd.24 up 1.00000 1.00000

In our example, these drives are SAS SSDs each nominally 3840 GB in size, which we describe as the marketing capacity. The discrepancy between that figure and the 3.48199 TB weight presented here is due to multiple factors:

  • The marketing capacity is expressed in base 10 units; everything else uses base 2 units
  • Each drive carves out 10 GB for journal use
  • XFS filesystem overhead
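
A rough back-of-the-envelope conversion illustrates the first two factors. This is only a sketch: the exact journal size and filesystem overhead vary by deployment.

    # echo "scale=5; 3840 * 10^9 / 2^40" | bc    # decimal GB expressed in binary TB
    3.49245
    # echo "scale=5; 3.49245 - 10/1024" | bc     # subtract the journal allocation
    3.48269

The small remaining gap between this figure and the 3.48199 weight shown above is absorbed by XFS overhead.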

Note also that one OSD under data002 is marked down. This could be due to the OSD process having been killed, or to a hardware failure. The CRUSH weight is unchanged, but the weight adjustment (the REWEIGHT column) is set to 0, which means that data previously allocated to this drive has been directed elsewhere. When we successfully restart the OSD process, the weight adjustment returns to 1.00000 and data backfills onto the drive.
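
A minimal sketch of investigating and restarting such an OSD, assuming a systemd-based deployment and that the affected OSD is osd.39 as above; older sysvinit-based deployments use the service command instead:

    # ceph osd find 39                 # report the OSD's host and CRUSH location
    # systemctl status ceph-osd@39     # on that host, see why the daemon stopped
    # systemctl restart ceph-osd@39    # attempt to bring the OSD back up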

Note also that while many Ceph commands will present OSDs and other items sorted, the IDs (or names) of the OSDs on a given host or rack are a function of the cluster's history. When deployed sequentially, the numbers increment neatly, but over time, as OSDs and hosts are added and removed, discontinuities will accrue. In the above example, note that OSD 33 (also known as osd.33) currently lives on host data007 instead of on data002, as one might expect from the prevailing pattern. This reflects the following sequence of events:

  • Drive failed on data002 and was removed
  • Drive failed on data007 and was removed
  • The replacement drive on data007 was deployed as a new OSD

When deploying OSDs, Ceph generally picks the lowest unused number; in our case that was 33. It is futile to try to maintain any given arrangement of OSD numbers; it will change over time as drives and hosts come and go and as the cluster is expanded.
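
Rather than scanning the tree by eye, we can ask Ceph directly where a given OSD currently lives:

    # ceph osd find 33    # reports the OSD's IP address and CRUSH location, including its host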

A number of Ceph status commands accept an optional -f json or -f json-pretty switch, which results in output in a form less readable by humans, but more readily parsed by code. The default output format of a command may change between releases, but the JSON output formats are mostly constant. For this reason, management and monitoring scripts are encouraged to use the -f json output format to ensure continued proper operation when Ceph itself is upgraded.

    # ceph osd tree -f json
    
    {"nodes":[{"id":-1,"name":"default","type":"root","type_id":10,"children":[-16,-15,-14]},{"id":-14,"name":"r1","type":"rack","type_id":3,"children":[-5,-4,-3,-2]},{"id":-2,"name":"data001","type":"host","type_id":1,"children":
    [23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0]},{"id":0,"name":"osd.0","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":1,"name":"osd
    .1","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":2,"name":"osd.2","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"
    up","reweight":1.000000,"primary_affinity":1.000000},{"id":3,"name":"osd.3","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":4,"name":"osd.4","type":"
    ...

The -f json-pretty output format is something of a compromise: it includes structure to aid programmatic parsing, but also uses whitespace to allow humans to readily inspect visually.

    # ceph osd tree -f json-pretty 
    {
        "nodes": [
            {
                "id": -1,
                "name": "default",
                "type": "root",
                "type_id": 10,
                "children": [
                    -16,
                    -15,
                    -14
                ]
            },
            {
                "id": -14,
                "name": "r1",
                "type": "rack",
                "type_id": 3,
                "children": [
                    -5,
                    -4,
                    -3,
                    -2
                ]
            },
            {
                "id": -2,
                "name": "data001",
                "type": "host",
                "type_id": 1,
                "children": [
                    23,
                    22,
                    21,
                    ...
  

One may, for example, use the jq utility to extract a list of OSDs that have a non-default reweight adjustment value. This approach saves a lot of tedious and error-prone coding with awk or perl.

    # ceph osd tree -f json | jq 
      '.nodes[]|select(.type=="osd")|select(.reweight != 1)|.id'
    11
    66
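
The same approach works for any field in the JSON output. For instance, to list the IDs of OSDs currently marked down, which against the tree captured above reports only osd.39:

    # ceph osd tree -f json | jq 
      '.nodes[]|select(.type=="osd")|select(.status=="down")|.id'
    39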
  

Some Ceph commands that show detailed status emit hundreds or even thousands of lines of output. It is strongly suggested that you enable unlimited scrollback in your terminal application and pipe such commands through a pager, for example less.
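
For example, ceph pg dump prints a line for every placement group in the cluster, so on a large cluster it is best viewed through a pager:

    # ceph pg dump | less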

iTerm2 is a free package for macOS that offers a wealth of features not found in Apple's bundled Terminal.app. It can be downloaded from https://www.iterm2.com/.