OSD logs

Next, let's check out entries from osd.8's log. Since Ceph OSDs are far more numerous than MONs and process a much higher rate of transactions, their logs grow much more quickly, so be sure to allocate ample space for them. As above, this cluster is running Ceph's Jewel release.
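
If you want a quick read on how fast the logs on an OSD host are growing, or you need to rein in a chatty daemon before a filesystem fills, something like the following sketch can help. It assumes default log locations and Jewel-era injectargs syntax, and the debug levels shown are illustrative rather than recommendations:

# How much space are this host's Ceph logs consuming right now?
du -sh /var/log/ceph/
ls -lhS /var/log/ceph/ceph-osd.*.log

# Temporarily dial back OSD logging cluster-wide; restore your usual levels later
ceph tell osd.* injectargs '--debug_osd 1/5 --debug_filestore 1/5'

Most packaged installations also ship a logrotate policy (typically /etc/logrotate.d/ceph) that you can tighten if log space is chronically scarce.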

2017-09-18 00:19:08.268212 7fe07b990700  1 leveldb: Compacting 4@0 + 5@1 files
2017-09-18 00:19:08.327958 7fe07b990700  1 leveldb: Generated table #14159: 55666 keys, 2129686 bytes
2017-09-18 00:19:08.382062 7fe07b990700  1 leveldb: Generated table #14160: 52973 keys, 2129584 bytes
2017-09-18 00:19:08.434451 7fe07b990700  1 leveldb: Generated table #14161: 54404 keys, 2129467 bytes
2017-09-18 00:19:08.488200 7fe07b990700  1 leveldb: Generated table #14162: 54929 keys, 2129347 bytes
2017-09-18 00:19:08.553861 7fe07b990700  1 leveldb: Generated table #14163: 53000 keys, 2129511 bytes
2017-09-18 00:19:08.570029 7fe07b990700  1 leveldb: Generated table #14164: 14956 keys, 600195 bytes
2017-09-18 00:19:08.594003 7fe07b990700  1 leveldb: Generated table #14165: 17061 keys, 710889 bytes
2017-09-18 00:19:08.594016 7fe07b990700  1 leveldb: Compacted 4@0 + 5@1 files => 11958679 bytes
2017-09-18 00:19:08.595153 7fe07b990700  1 leveldb: compacted to: files[ 0 7 17 0 0 0 0 ]
2017-09-18 00:19:08.595317 7fe07b990700  1 leveldb: Delete type=2 #14154

This is normal housekeeping. The OSD maintains an internal LevelDB metadata database that it periodically redds up (compacts).
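
Curious how large that internal store has become? On a Jewel FileStore OSD it usually lives under the OSD's data directory; the path below assumes the default layout and osd.8's ID, so adjust for your own host:

# Size of osd.8's LevelDB omap store (default FileStore layout assumed)
du -sh /var/lib/ceph/osd/ceph-8/current/omap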

2017-09-18 00:19:53.953129 7fe0b5c2f700  0 -- 10.24.49.15:6816/123339 >> 10.24.49.147:6834/1012539 pipe(0x557b7551b400 sd=83 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b76a0ea80).accept connect_seq 15 vs existing 15 state standby
2017-09-18 00:19:53.953290 7fe0b5c2f700  0 -- 10.24.49.15:6816/123339 >> 10.24.49.147:6834/1012539 pipe(0x557b7551b400 sd=83 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b76a0ea80).accept connect_seq 16 vs existing 15 state standby
2017-09-18 00:19:54.933317 7fe0b07db700  0 -- 10.24.49.15:6816/123339 >> 10.24.50.15:6844/135559 pipe(0x557b7f57e000 sd=38 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b7fb54700).accept connect_seq 57 vs existing 57 state standby
2017-09-18 00:19:54.933774 7fe0b07db700  0 -- 10.24.49.15:6816/123339 >> 10.24.50.15:6844/135559 pipe(0x557b7f57e000 sd=38 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b7fb54700).accept connect_seq 58 vs existing 57 state standby

The OSD has decided to relieve its boredom by accepting some connection requests.

2017-09-18 00:34:54.052375 7fe0b5c2f700  0 -- 10.240.49.15:6816/123339 >> 10.240.49.147:6834/1012539 pipe(0x557b7551b400 sd=83 :6816 s=2 pgs=1045 cs=17 l=0 c=0x557b86d4e600).fault with nothing to send, going to standby
2017-09-18 00:34:55.031169 7fe0b07db700  0 -- 10.240.49.15:6816/123339 >> 10.240.50.15:6844/135559 pipe(0x557b7f57e000 sd=38 :6816 s=2 pgs=3948 cs=59 l=0 c=0x557b76c85e00).fault with nothing to send, going to standby

The word fault in these messages is scary, but here it simply means that an idle connection was closed and parked on standby; these entries are a normal part of the lifecycle of connections between OSDs.
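
If you ever need to see more of this messenger chatter, say while chasing a flaky network link, you can raise the messenger debug level on a single OSD at runtime. A sketch using Jewel-era injectargs; debug_ms is verbose, so remember to turn it back down:

# Crank up messenger logging on osd.8, then restore the default level
ceph tell osd.8 injectargs '--debug_ms 1'
ceph tell osd.8 injectargs '--debug_ms 0/5'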

2017-09-18 02:08:51.710567 7fe0e466f700  0 log_channel(cluster) log [INF] : 1.35da deep-scrub starts
2017-09-18 02:08:52.040149 7fe0e466f700  0 log_channel(cluster) log [INF] : 1.35da deep-scrub ok

Here a deep scrub started and completed successfully. We love it when that happens.
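
If you want to check when a given PG was last deep-scrubbed, or nudge one along yourself, the sketch below reuses the PG ID from the log entry above; substitute your own:

# When was PG 1.35da last deep-scrubbed?
ceph pg 1.35da query | grep -i deep_scrub_stamp

# Ask its primary OSD to start a deep scrub now
ceph pg deep-scrub 1.35da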

2017-09-18 02:27:44.036497 7fe0926fc700  0 -- 10.240.49.15:6816/123339 >> 10.240.50.17:6834/1124935 pipe(0x557b95384800 sd=34 :6816 s=0 pgs=0 cs=0 l=0 c=0x557b80759200).accept connect_seq 4 vs existing 3 state standby
2017-09-18 02:39:53.195393 7fe118064700 -1 osd.8 33380 heartbeat_check: no reply from 0x557b91ccd390 osd.272 since back 2017-09-18 02:39:32.671169 front 2017-09-18 02:39:32.671169 (cutoff 2017-09-18 02:39:33.195374)
2017-09-18 02:39:54.195516 7fe118064700 -1 osd.8 33380 heartbeat_check: no reply from 0x557b91ccd390 osd.272 since back 2017-09-18 02:39:32.671169 front 2017-09-18 02:39:32.671169 (cutoff 2017-09-18 02:39:34.195502)

Ugh!!! The OSD with ID number 272 has gone incommunicado. These entries may indicate a network problem, exhaustion of the nf_conntrack table, or that the ceph-osd daemon behind osd.272 has given up the ghost; a few quick checks, sketched below, can narrow it down. Go ahead and take a garlic fries break. You don't want to tackle the crash dump that follows on an empty stomach. Ellipses indicate spots that have been edited for relative brevity.
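
What might those quick checks look like? A minimal sketch, assuming a systemd-managed host and that the nf_conntrack module is loaded; the placeholder hostname and the OSD ID come from the example above and will differ on your cluster:

# Which host and device does osd.272 live on?
ceph osd find 272

# Is that host even reachable from here?
ping -c 3 <host-reported-above>

# On osd.272's host: is the daemon still running?
systemctl status ceph-osd@272

# Also on that host: is the conntrack table close to full?
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

Fries finished? Good. On to the block itself.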

    -4> 2017-09-15 21:58:20.149473 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) write couldn't open meta/#-1:97b6cd72:::osdmap.25489:0#: (117) Structure needs cleaning
    -3> 2017-09-15 21:58:20.149536 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) error (117) Structure needs cleaning not handled on operation 0x55a18d942b88 (17413.0.1, or op 1, counting from 0)
    -2> 2017-09-15 21:58:20.149544 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) unexpected error code
    -1> 2017-09-15 21:58:20.149545 7f158d64f700  0 filestore(/var/lib/ceph/osd/ceph-231) transaction dump: { "ops": [ { "op_num": 0, "op_name": "write", "collection": "meta", "oid": "#-1:725cc562:::inc_osdmap.25489:0#", "length": 203, "offset": 0, "bufferlist length": 203 }, { "op_num": 1, "op_name": "write", "collection": "meta", "oid": "#-1:97b6cd72:::osdmap.25489:0#", "length": 404169, "offset": 0, "bufferlist length": 404169 }, { "op_num": 2, "op_name": "write", "collection": "meta", "oid": "#-1:7b3f43c4:::osd_superblock:0#", "length": 417, "offset": 0, "bufferlist length": 417 } ] }
     0> 2017-09-15 21:58:20.152050 7f158d64f700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f158d64f700 time 2017-09-15 21:58:20.149608
os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55a1826b007b]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xefd) [0x55a1823a26ed]
 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55a1823a842b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2b5) [0x55a1823a8715]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x55a1826a145e]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x55a1826a2340]
 7: (()+0x8184) [0x7f159bbff184]
 8: (clone()+0x6d) [0x7f1599d28ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 1 optracker
...
  log_file /var/log/ceph/ceph-osd.231.log
--- end dump of recent events ---

2017-09-15 21:58:20.155893 7f158d64f700 -1 *** Caught signal (Aborted) ** in thread 7f158d64f700 thread_name:tp_fstore_op

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x8f3942) [0x55a1825bb942]
 2: (()+0x10330) [0x7f159bc07330]
 3: (gsignal()+0x37) [0x7f1599c61c37]
 4: (abort()+0x148) [0x7f1599c65028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55a1826b0255]
 6: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xefd) [0x55a1823a26ed]
...

Yikes! This definitely does not serve Vaal. Here the Ceph OSD code has encountered an error it can't cope with. Backtraces and information about the ops that were being processed are dumped for forensics. We'll single out two lines from the dozens above to discuss specifically.

The first of these shows a failed code assertion. Assertion failures often afford a quick stab at the general cause of the problem. In this case, the OSD formerly known as 231 has encountered a problem in FileStore code. FileStore assertions like this almost always reflect a problem with the underlying hardware, or at the very least an XFS filesystem that has gone astray for some other reason.

os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")

In this specific case, the kernel log shows I/O errors on the underlying SSD drive.
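
How would you confirm that yourself? The sketch below checks the kernel ring buffer, the drive's SMART data, and (with the OSD stopped and its filesystem unmounted) runs a read-only XFS check. The device names are placeholders, smartctl requires the smartmontools package, and xfs_repair -n only reports problems without modifying anything:

# Any recent I/O or XFS errors in the kernel log?
dmesg -T | grep -Ei 'i/o error|xfs'

# What does the drive itself report? (substitute the real device)
smartctl -a /dev/sdX

# With ceph-osd@231 stopped and its data partition unmounted, check XFS read-only
systemctl stop ceph-osd@231
umount /var/lib/ceph/osd/ceph-231
xfs_repair -n /dev/sdX1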

Skipping down to the second line, we see

2017-09-15 21:58:20.155893 7f158d64f700 -1 *** Caught signal (Aborted) ** in thread 7f158d64f700 thread_name:tp_fstore_op

When grepping (or groping) through voluminous OSD logs to discern failure causes, it can be handy to search for the word signal in order to quickly zero in on the region of interest. Note, though, that Ceph daemons also log signal messages when they are shut down normally, say for a server reboot.
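
In practice that search might look like the sketch below. The paths assume default log locations, and zgrep picks up logs that logrotate has already compressed:

# Where do signal messages appear in this host's OSD logs?
grep -in signal /var/log/ceph/ceph-osd.*.log

# Don't forget rotated, compressed logs
zgrep -il signal /var/log/ceph/ceph-osd.*.log*.gz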
