To see the overall topology of a Ceph cluster at a glance, run ceph osd tree. This shows us the hierarchy of CRUSH buckets all at once: the name of each bucket, its weight, whether it is marked up or down, a weight adjustment, and an advanced attribute called primary affinity. This cluster was initially provisioned with 3 racks, each housing 4 hosts, for a total of 12 OSD nodes. Each OSD node (also known as a host or server) in turn houses 24 OSD drives.
# ceph osd tree
ID   WEIGHT    TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1  974.89661 root default
-14  330.76886     rack r1
 -2   83.56099         host data001
  0    3.48199             osd.0           up  1.00000          1.00000
...
 23    3.48199             osd.23          up  1.00000          1.00000
 -3   80.08588         host data002
 24    3.48199             osd.24          up  1.00000          1.00000
 25    3.48199             osd.25          up  1.00000          1.00000
 26    3.48199             osd.26          up  1.00000          1.00000
 27    3.48199             osd.27          up  1.00000          1.00000
 28    3.48199             osd.28          up  1.00000          1.00000
 29    3.48199             osd.29          up  1.00000          1.00000
 30    3.48199             osd.30          up  1.00000          1.00000
 31    3.48199             osd.31          up  1.00000          1.00000
 32    3.48199             osd.32          up  1.00000          1.00000
 34    3.48199             osd.34          up  1.00000          1.00000
 35    3.48199             osd.35          up  1.00000          1.00000
 36    3.48199             osd.36          up  1.00000          1.00000
 37    3.48199             osd.37          up  1.00000          1.00000
 38    3.48199             osd.38          up  1.00000          1.00000
 39    3.48199             osd.39        down        0          1.00000
 40    3.48199             osd.40          up  1.00000          1.00000
 41    3.48199             osd.41          up  1.00000          1.00000
 42    3.48199             osd.42          up  1.00000          1.00000
 43    3.48199             osd.43          up  1.00000          1.00000
 44    3.48199             osd.44          up  1.00000          1.00000
 45    3.48199             osd.45          up  1.00000          1.00000
 46    3.48199             osd.46          up  1.00000          1.00000
 47    3.48199             osd.47          up  1.00000          1.00000
 -4   83.56099         host data003
 48    3.48199             osd.48          up  1.00000          1.00000
...
 -5   83.56099         host data004
 72    3.48199             osd.72          up  1.00000          1.00000
...
 95    3.48199             osd.95          up  1.00000          1.00000
-15  330.76810     rack r2
 -6   83.56099         host data005
 96    3.48199             osd.96          up  1.00000          1.00000
...
 -7   80.08557         host data006
120    3.48199             osd.120         up  1.00000          1.00000
...
 -8   83.56055         host data007
 33    3.48169             osd.33          up  1.00000          1.00000
144    3.48169             osd.144         up  1.00000          1.00000
...
232    3.48169             osd.232         up  1.00000          1.00000
 -9   83.56099         host data008
168    3.48199             osd.168         up  1.00000          1.00000
-16  313.35965     rack r3
-10   83.56099         host data009
192    3.48199             osd.192         up  1.00000          1.00000
...
-11   69.63379         host data010
133    3.48169             osd.133         up  1.00000          1.00000
...
-12   83.56099         host data011
239    3.48199             osd.239         up  1.00000          1.00000
...
-13   76.60388         host data012
...
286    3.48199             osd.286         up  1.00000          1.00000
Let's go over what this tree is telling us. Note that a number of similar lines have been replaced with ellipses for brevity, a practice we will continue throughout this and following chapters.
After the column headers the first data line is:
-1 974.89661 root default
The first column is an ID number that Ceph uses internally, and with which we rarely need to concern ourselves. The second column under the WEIGHT heading is the CRUSH weight. By default, the CRUSH weight of any bucket corresponds to its raw capacity in TB; in this case we have a bit shy of a petabyte (PB) of raw space. We'll see that this weight is the sum of the weights of the buckets under the root in the tree.
Since this cluster utilizes the conventional replication factor of 3, roughly 324 TB of usable space is currently available. The balance of the line is root default, which tells us that this CRUSH bucket is of the root type, and that its name is default. Complex Ceph clusters can contain multiple roots, but most need only one.
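As a quick sanity check of the usable space figure, we can divide the raw weight by the replication factor with any calculator, for example bc:

# echo 'scale=2; 974.89661 / 3' | bc
324.96

Remember that this is a back-of-the-envelope figure; in practice we also leave headroom for recovery and for the cluster's near-full thresholds.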
The next line is as follows:
-14 330.76886 rack r1
It shows a bucket of type rack with a weight of roughly 330 TB. Skipping ahead a bit, we see two more rack buckets, with weights of roughly 330 and 313 TB respectively. Their sum gets us to the roughly 974 TB capacity (weight) of the root bucket. When rack weights are unequal, as in our example, usually either the racks contain different numbers of host buckets (or simply hosts), or, more often, their underlying hosts have unequal weights.
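Indeed, adding up the three rack weights from the tree above reproduces the root weight exactly:

# echo '330.76886 + 330.76810 + 313.35965' | bc
974.89661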
Next we see the following:
-2 83.56099 host data001
This indicates a bucket of type host, with the name data001. As with root and rack buckets, the weight reflects the raw capacity (before replication) of the buckets beneath it in the hierarchy. Below rack r1 in our hierarchy we see hosts named data001, data002, data003, and data004. We also see that host data002 presents a somewhat lower weight than the other three hosts. This can mean that a mixture of drive sizes has been deployed, or that some drives were missed during initial deployment. In our example, though, the host simply contains 23 OSD buckets (or simply OSDs) instead of the expected 24. This reflects a drive that has failed and been removed entirely, or one that was never deployed in the first place.
Under each host bucket we see a number of OSD entries.
24 3.48199 osd.24 up 1.00000 1.00000
In our example, these drives are SAS SSDs, each nominally 3840 GB in size, which we describe as the marketing capacity. The discrepancy between that figure and the 3.48199 TB weight presented here is due to multiple factors, which we will reconcile with a quick calculation after the following list:
- The marketing capacity is expressed in base 10 units; everything else uses base 2 units
- Each drive carves out 10 GB for journal use
- XFS filesystem overhead
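A back-of-the-envelope calculation shows how these factors add up; note that the exact journal size and filesystem overhead vary by deployment, and here we use the 10 GB journal figure from our own provisioning:

# echo 'scale=5; (3840 * 10^9) / 2^40' | bc
3.49245
# echo 'scale=5; (3840 * 10^9 - 10 * 10^9) / 2^40' | bc
3.48336

Converting the 3840 GB marketing capacity to base 2 units yields roughly 3.49; subtracting the journal brings us to roughly 3.48; and XFS overhead accounts for the small remaining difference down to the 3.48199 weight shown in the tree.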
Note also that one OSD under data002 is marked down. This could be because the OSD process was killed, or because of a hardware failure. The CRUSH weight is unchanged, but the weight adjustment (reweight) is set to 0, which means that data previously allocated to this drive has been directed elsewhere. When we successfully restart the OSD process, the weight adjustment returns to 1 and data backfills onto the drive.
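A quick way to spot down OSDs without scanning the entire tree is to filter the plain output with grep, which in our example surfaces osd.39:

# ceph osd tree | grep -w down
 39    3.48199             osd.39        down        0          1.00000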
Note also that while many Ceph commands present OSDs and other items sorted, the IDs (or names) of the OSDs on a given host or rack are a function of the cluster's history. When deployed sequentially the numbers increment neatly, but over time, as OSDs and hosts are added and removed, discontinuities accrue. In the above example, note that OSD 33 (also known as osd.33) currently lives on host data007, instead of on data002 as one might expect from the prevailing pattern. This reflects the following sequence of events:
- Drive failed on data002 and was removed
- Drive failed on data007 and was removed
- The replacement drive on data007 was deployed as a new OSD
When deploying OSDs, Ceph generally picks the lowest unused number; in our case that was 33. It is futile to try to maintain any given arrangement of OSD numbers; it will change over time as drives and hosts come and go and as the cluster is expanded.
A number of Ceph status commands accept an optional -f json or -f json-pretty switch, which produces output in a form less readable by humans but more readily parsed by code. The default output format of a command may change between releases, but the JSON output formats are mostly stable. For this reason, management and monitoring scripts are encouraged to use the -f json output format to ensure continued proper operation when Ceph itself is upgraded.
# ceph osd tree -f json
{"nodes":[{"id":-1,"name":"default","type":"root","type_id":10,"children":[-16,-15,-14]},{"id":-14,"name":"r1","type":"rack","type_id":3,"children":[-5,-4,-3,-2]},{"id":-2,"name":"data001","type":"host","type_id":1,"children":[23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0]},{"id":0,"name":"osd.0","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":1,"name":"osd.1","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":2,"name":"osd.2","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":3,"name":"osd.3","type":"osd","type_id":0,"crush_weight":3.481995,"depth":3,"exists":1,"status":"up","reweight":1.000000,"primary_affinity":1.000000},{"id":4,"name":"osd.4","type":"
...
The -f json-pretty output format is something of a compromise: it includes structure to aid programmatic parsing, but also uses whitespace and indentation so that humans can readily inspect it visually.
# ceph osd tree -f json-pretty
{
    "nodes": [
        {
            "id": -1,
            "name": "default",
            "type": "root",
            "type_id": 10,
            "children": [
                -16,
                -15,
                -14
            ]
        },
        {
            "id": -14,
            "name": "r1",
            "type": "rack",
            "type_id": 3,
            "children": [
                -5,
                -4,
                -3,
                -2
            ]
        },
        {
            "id": -2,
            "name": "data001",
            "type": "host",
            "type_id": 1,
            "children": [
                23,
                22,
                21,
...
One may, for example, use the jq utility to extract a list of OSDs that have a non-default reweight adjustment value. This approach saves a lot of tedious and error-prone coding with awk or Perl.
# ceph osd tree -f json | jq '.nodes[]|select(.type=="osd")|select(.reweight != 1)|.id'
11
66
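A similar query, substituting the status field, lists OSDs that are currently marked down; for the tree we captured at the beginning of this section, it would report ID 39:

# ceph osd tree -f json | jq '.nodes[]|select(.type=="osd")|select(.status=="down")|.id'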
Some Ceph commands that show detailed status emit hundreds or thousands of lines of output. It is strongly suggested that you enable unlimited scrollback in your terminal application and pipe such commands through a pager, for example less.
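For example, ceph pg dump prints a line for every placement group in the cluster and can easily run to many thousands of lines, so it is a natural candidate for a pager:

# ceph pg dump | less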