OSD logs

Next, let's check out entries from osd.8's log. Since Ceph OSDs are far more numerous than MONs and process a much higher rate of transactions, their logs grow much more quickly, so be sure to allocate ample space for them. As above, this cluster is running Ceph's Jewel release.
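
If you want a quick read on how fast the logs on an OSD host are growing, or you need to rein in a chatty daemon before a filesystem fills, something like the following sketch can help. It assumes default log locations and Jewel-era injectargs syntax, and the debug levels shown are illustrative rather than recommendations:

# How much space are this host's Ceph logs consuming right now?
du -sh /var/log/ceph/
ls -lhS /var/log/ceph/ceph-osd.*.log

# Temporarily dial back OSD logging cluster-wide; restore your usual levels later
ceph tell osd.* injectargs '--debug_osd 1/5 --debug_filestore 1/5'

Most packaged installations also ship a logrotate policy (typically /etc/logrotate.d/ceph) that you can tighten if log space is chronically scarce.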

2017-09-18 00:19:08.268212 7fe07b990700  1 leveldb: Compacting 4@0 + 5@1 files
2017-09-18 00:19:08.327958 7fe07b990700  1 leveldb: Generated table #14159: 55666 keys, 2129686 bytes
2017-09-18 00:19:08.382062 7fe07b990700  1 leveldb: Generated table #14160: 52973 keys, 2129584 bytes
2017-09-18 00:19:08.434451 7fe07b990700  1 leveldb: Generated table #14161: 54404 keys, 2129467 bytes
2017-09-18 00:19:08.488200 7fe07b990700  1 leveldb: Generated table #14162: 54929 keys, 2129347 bytes
2017-09-18 00:19:08.553861 7fe07b990700  1 leveldb: Generated table #14163: 53000 keys, 2129511 bytes
2017-09-18 00:19:08.570029 7fe07b990700  1 leveldb: Generated table #14164: 14956 keys, 600195 bytes
2017-09-18 00:19:08.594003 7fe07b990700  1 leveldb: Generated table #14165: 17061 keys, 710889 bytes
2017-09-18 00:19:08.594016 7fe07b990700  1 leveldb: Compacted 4@0 + 5@1 files => 11958679 bytes
2017-09-18 00:19:08.595153 7fe07b990700  1 leveldb: compacted to: files[ 0 7 17 0 0 0 0 ]
2017-09-18 00:19:08.595317 7fe07b990700  1 leveldb: Delete type=2 #14154

This is normal housekeeping. The OSD maintains an internal LevelDB metadata database that it periodically redds up (compacts).
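
Curious how large that internal store has become? On a Jewel FileStore OSD it usually lives under the OSD's data directory; the path below assumes the default layout and osd.8's ID, so adjust for your own host:

# Size of osd.8's LevelDB omap store (default FileStore layout assumed)
du -sh /var/lib/ceph/osd/ceph-8/current/omap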

2017-09-18 00:19:53.953129 7fe0b5c2f700  0 -- 10.24.49.15:6816/123339 >> 10.24.49.147:6834/1012539 pipe(0x557b7551b400 sd=83 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b76a0ea80).accept connect_seq 15 vs existing 15 state standby
2017-09-18 00:19:53.953290 7fe0b5c2f700  0 -- 10.24.49.15:6816/123339 >> 10.24.49.147:6834/1012539 pipe(0x557b7551b400 sd=83 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b76a0ea80).accept connect_seq 16 vs existing 15 state standby
2017-09-18 00:19:54.933317 7fe0b07db700  0 -- 10.24.49.15:6816/123339 >> 10.24.50.15:6844/135559 pipe(0x557b7f57e000 sd=38 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b7fb54700).accept connect_seq 57 vs existing 57 state standby
2017-09-18 00:19:54.933774 7fe0b07db700  0 -- 10.24.49.15:6816/123339 >> 10.24.50.15:6844/135559 pipe(0x557b7f57e000 sd=38 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b7fb54700).accept connect_seq 58 vs existing 57 state standby

The OSD has decided to relieve its boredom by accepting some connection requests.

2017-09-18 00:34:54.052375 7fe0b5c2f700  0 -- 10.240.49.15:6816/123339 >> 10.240.49.147:6834/1012539 pipe(0x557b7551b400 sd=83 :6816 s=2 pgs=1045 cs=17 l=0 c=0x557b86d4e600).fault with nothing to send, going to standby
2017-09-18 00:34:55.031169 7fe0b07db700  0 -- 10.240.49.15:6816/123339 >> 10.240.50.15:6844/135559 pipe(0x557b7f57e000 sd=38 :6816 s=2 pgs=3948 cs=59 l=0 c=0x557b76c85e00).fault with nothing to send, going to standby

The word fault in these messages is scary, but here it simply means that an idle connection was closed and parked on standby; these entries are a normal part of the lifecycle of connections between OSDs.
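
If you ever need to see more of this messenger chatter, say while chasing a flaky network link, you can raise the messenger debug level on a single OSD at runtime. A sketch using Jewel-era injectargs; debug_ms is verbose, so remember to turn it back down:

# Crank up messenger logging on osd.8, then restore the default level
ceph tell osd.8 injectargs '--debug_ms 1'
ceph tell osd.8 injectargs '--debug_ms 0/5'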

2017-09-18 02:08:51.710567 7fe0e466f700  0 log_channel(cluster) log [INF] : 1.35da deep-scrub starts
2017-09-18 02:08:52.040149 7fe0e466f700  0 log_channel(cluster) log [INF] : 1.35da deep-scrub ok

Here a deep scrub started and completed successfully. We love it when that happens.
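
If you want to check when a given PG was last deep-scrubbed, or nudge one along yourself, the sketch below reuses the PG ID from the log entry above; substitute your own:

# When was PG 1.35da last deep-scrubbed?
ceph pg 1.35da query | grep -i deep_scrub_stamp

# Ask its primary OSD to start a deep scrub now
ceph pg deep-scrub 1.35da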

2017-09-18 02:27:44.036497 7fe0926fc700  0 -- 10.240.49.15:6816/123339 >> 10.240.50.17:6834/1124935 pipe(0x557b95384800 sd=34 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b80759200).accept connect_seq 4 vs existing 3 state standby
2017-09-18 02:39:53.195393 7fe118064700 -1 osd.8 33380 heartbeat_check: no reply from 0x557b91ccd390 osd.272 since back 2017-09-18 02:39:32.671169 front 2017-09-18 02:39:32.671169 (cutoff 2017-09-18 02:39:33.195374)
2017-09-18 02:39:54.195516 7fe118064700 -1 osd.8 33380 heartbeat_check: no reply from 0x557b91ccd390 osd.272 since back 2017-09-18 02:39:32.671169 front 2017-09-18 02:39:32.671169 (cutoff 2017-09-18 02:39:34.195502)

Ugh!!! The OSD with ID number 272 has gone incommunicado. These entries may indicate a network problem, exhaustion of the nf_conntrack table, or that the ceph-osd daemon behind osd.272 has given up the ghost; a few quick checks, sketched below, can narrow it down. Go ahead and take a garlic fries break. You don't want to tackle the crash dump that follows on an empty stomach. Ellipses indicate spots that have been edited for relative brevity.
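
What might those quick checks look like? A minimal sketch, assuming a systemd-managed host and that the nf_conntrack module is loaded; the placeholder hostname and the OSD ID come from the example above and will differ on your cluster:

# Which host and device does osd.272 live on?
ceph osd find 272

# Is that host even reachable from here?
ping -c 3 <host-reported-above>

# On osd.272's host: is the daemon still running?
systemctl status ceph-osd@272

# Also on that host: is the conntrack table close to full?
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

Fries finished? Good. On to the block itself.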

    -4> 2017-09-15 21:58:20.149473 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) write couldn't open meta/#-1:97b6cd72:::osdmap.25489:0#: (117) Structure needs cleaning
    -3> 2017-09-15 21:58:20.149536 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) error (117) Structure needs cleaning not handled on operation 0x55a18d942b88 (17413.0.1, or op 1, counting from 0)
    -2> 2017-09-15 21:58:20.149544 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) unexpected error code
    -1> 2017-09-15 21:58:20.149545 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) transaction dump: { "ops": [ { "op_num": 0, "op_name": "write", "collection": "meta", "oid": "#-1:725cc562:::inc_osdmap.25489:0#", "length": 203, "offset": 0, "bufferlist length": 203 }, { "op_num": 1, "op_name": "write", "collection": "meta", "oid": "#-1:97b6cd72:::osdmap.25489:0#", "length": 404169, "offset": 0, "bufferlist length": 404169 }, { "op_num": 2, "op_name": "write", "collection": "meta", "oid": "#-1:7b3f43c4:::osd_superblock:0#", "length": 417, "offset": 0, "bufferlist length": 417 } ] }
     0> 2017-09-15 21:58:20.152050 7f158d64f700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f158d64f700 time 2017-09-15 21:58:20.149608
os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55a1826b007b]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xefd) [0x55a1823a26ed]
 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55a1823a842b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2b5) [0x55a1823a8715]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x55a1826a145e]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x55a1826a2340]
 7: (()+0x8184) [0x7f159bbff184]
 8: (clone()+0x6d) [0x7f1599d28ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 1 optracker
...
  log_file /var/log/ceph/ceph-osd.231.log
--- end dump of recent events ---

2017-09-15 21:58:20.155893 7f158d64f700 -1 *** Caught signal (Aborted) ** in thread 7f158d64f700 thread_name:tp_fstore_op

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x8f3942) [0x55a1825bb942]
 2: (()+0x10330) [0x7f159bc07330]
 3: (gsignal()+0x37) [0x7f1599c61c37]
 4: (abort()+0x148) [0x7f1599c65028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55a1826b0255]
 6: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xefd) [0x55a1823a26ed]
...

Yikes! This definitely does not serve Vaal. Here the Ceph OSD code has encountered an error it can't cope with. Backtraces and information about the ops that were being processed are dumped for forensics. We'll single out two lines from the dozens above to discuss specifically.

The first of these shows a failed code assertion. Assertion failures often afford a quick stab at the general cause of the problem. In this case, the OSD formerly known as 231 has encountered a problem in FileStore code. FileStore assertions like this almost always reflect a problem with the underlying hardware, or at the very least an XFS filesystem that has gone astray for some other reason.

os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")

In this specific case, the kernel log shows I/O errors on the underlying SSD drive.
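
How would you confirm that yourself? The sketch below checks the kernel ring buffer, the drive's SMART data, and (with the OSD stopped and its filesystem unmounted) runs a read-only XFS check. The device names are placeholders, smartctl requires the smartmontools package, and xfs_repair -n only reports problems without modifying anything:

# Any recent I/O or XFS errors in the kernel log?
dmesg -T | grep -Ei 'i/o error|xfs'

# What does the drive itself report? (substitute the real device)
smartctl -a /dev/sdX

# With ceph-osd@231 stopped and its data partition unmounted, check XFS read-only
systemctl stop ceph-osd@231
umount /var/lib/ceph/osd/ceph-231
xfs_repair -n /dev/sdX1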

Skipping down to the second line, we see

2017-09-15 21:58:20.155893 7f158d64f700 -1 *** Caught signal (Aborted) ** in thread 7f158d64f700 thread_name:tp_fstore_op

When grepping (or groping) through voluminous OSD logs to discern failure causes, it can be handy to search for the word signal in order to quickly zero in on the region of interest. Note, though, that Ceph daemons also log signal messages when they are shut down normally, say for a server reboot.
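
In practice that search might look like the sketch below. The paths assume default log locations, and zgrep picks up logs that logrotate has already compressed:

# Where do signal messages appear in this host's OSD logs?
grep -in signal /var/log/ceph/ceph-osd.*.log

# Don't forget rotated, compressed logs
zgrep -il signal /var/log/ceph/ceph-osd.*.log*.gz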
