Problem determination tools
This appendix describes Linux tools that are frequently used to gather data from a running system. This appendix also describes troubleshooting practices.
This appendix contains the following topics:
Logs and configuration data gathering tools
Linux, as an open technology, has many tools that help system administrators gather relevant information from a system before, during, and after a problem occurs. This section covers two of these tools:
The sosreport tool
The Scale-out LC System Event Log Collection Tool
The sosreport tool
Sosreport is an extensible and portable data collection tool that is primarily aimed at Linux distributions and other UNIX-like operating systems.
This section describes the installation and use of this tool with Ubuntu. The tool can also be used with Red Hat Enterprise Linux (RHEL).
The main purpose of this tool is to gather system logs and configuration data from a running system. It is provided as a standard package in Ubuntu and collects the logs and configuration information into a single compressed archive.
Installing sosreport
To install the sosreport tool, install the package by running the following command (the system needs connectivity to an Ubuntu package repository):
$ sudo apt-get install sosreport
Reading package lists ...
(...) Processing triggers for man-db (2.7.5-1) ...
Setting up sosreport (3.4-1~ubuntu16.04.1) ...
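After installation, you can list the collection plug-ins that sosreport enables on your system. The -l flag is available in sosreport 3.x; check sosreport --help if your version differs:
$ sudo sosreport -l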
Running sosreport
To run the sosreport tool, run the following command (you need root permission):
$ sudo sosreport
sosreport (version 3.4)
(...)
No changes will be made to system configuration.
Press ENTER to continue, or CTRL-C to quit.
Please enter your first initial and last name [comp02]:
Please enter the case id that you are generating this report for []:
(...)
Your sos report has been generated and saved in:
/tmp/sosreport-comp02-20170825144916.tar.xz
The checksum is: 908cbe3c9a8ecf3e2cb916a79666b916.
Please send this file to your support representative.
A compressed .tar.xz file is generated under /tmp. You can expand it to inspect the contents or send it for analysis.
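For example, using the file name from the sample run above (your file name and checksum differ), you can verify the archive against the checksum that sosreport printed (an MD5 sum in this example) and expand it for inspection:
$ cd /tmp
$ md5sum sosreport-comp02-20170825144916.tar.xz
$ tar -xJf sosreport-comp02-20170825144916.tar.xz
$ ls sosreport-comp02-*/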
The Scale-out LC System Event Log Collection Tool
This tool is a Perl script that is provided by IBM Support that collects logs from a remote or local system through an SSH connection to its baseboard management controller (BMC). The script can be used with an open problem management report (PMR) to provide more information to IBM Support for systems under a support contract with IBM. The tool can be used to gather and centralize logs and configurations from many Power Systems LC systems in a central Linux repository.
Prerequisites
Before you install the tool, you must meet these prerequisites:
The operating system where the collector tool is run must be Linux.
The Linux system must have network connectivity to the BMC.
The Linux system that is used to perform the data collection must have the following tool packages installed:
 – ipmitool
 – perl
 – sshpass
To install these packages, run the following command:
$ sudo apt-get install perl sshpass ipmitool
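To confirm that the packages are installed and that the BMC is reachable from the collection host before you run the tool (replace the placeholder with your BMC address), run the following commands:
$ dpkg -l perl sshpass ipmitool
$ ping -c 2 <BMC host name or IP address>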
Where to get the tool
The tool can be downloaded from Scale-out LC System Event Log Collection Tool.
There is more than one version of the tool, depending on the system's machine type and model. Choose the version that applies to your system and download the corresponding plc.zip file.
Installation steps
To install the tool, complete the following steps:
1. Copy the plc.zip package to a Linux system that has network connectivity to the BMC of the Scale-out LC server from which you need to collect data.
2. Extract the plc.zip file into a directory of your choice by running the following command:
$ unzip plc.zip
 
Archive: plc.zip
  inflating: eSEL2.pl
  inflating: led_status.sh
  inflating: plc.pl
  inflating: README
The directory now contains the following files:
 – plc.pl
 – eSEL2.pl
 – led_status.sh
 – README
 
Note: The files that are generated by this script are saved in the same directory from which the script is run.
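Depending on how the files are packaged, the extracted scripts might not be marked as executable. If that is the case, set the execute permission before you run them:
$ chmod +x plc.pl led_status.sh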
Usage
This section shows the tool command syntax, its flags, a sample run, and its result:
$ plc.pl { -b bmc_address | -i } [-a admin_pw] [-s sysadmin_pw] [-h host -u user -p password] [-f]
Here are the flags:
-b BMC host name or IP address
-a BMC ADMIN password if changed from the default (admin)
-s BMC sysadmin password if changed from the default (superuser)
-i Interactive mode
-f Collect BMC firmware image
-h Linux host address
-u Linux host user ID
-p Linux host password
To use the tool, run the following command (in this example, the BMC IP is 10.10.10.10):
$ ./plc.pl -b 10.10.10.10 -a admin -f
Getting BMC data
Warning: Permanently added '10.10.10.10' (RSA) to the list of known hosts.
..........................
Getting IPMI Data
........
To list the resulting file, run the following command:
$ ls -l
10.7.22.1-2017-06-30.1045.powerlc.tar.gz
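Because the collector only needs network access to each BMC, a simple shell loop can centralize collections from several Scale-out LC systems onto one Linux host. The following sketch assumes a hypothetical bmc_list.txt file with one BMC host name or IP address per line, and BMCs that still use the default ADMIN password:
$ for bmc in $(cat bmc_list.txt); do ./plc.pl -b "$bmc" -a admin; done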
Errors
This section highlights a few error messages and their exit codes:
If sshpass is not found on the system, then the plc.pl script prints sshpass is required and not found on this system and exits with return code 1.
If the BMC is not reachable by a ping, then the plc.pl script prints Unable to ping bmchostname/IPaddress and exits with return code 2.
If a command not found error is returned when running the plc.pl command, try running the command by prefixing the command with ./ so that the command is ./plc.pl.
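Because the script uses distinct return codes, a wrapper can react to them automatically. Here is a minimal sketch that assumes the same BMC address as the earlier example:
#!/bin/bash
# Run the collector and act on its documented return codes
./plc.pl -b 10.10.10.10 -a admin
case $? in
  1) echo "sshpass is not installed on this system; install it and rerun." ;;
  2) echo "The BMC is not reachable; check the network path to the BMC." ;;
esac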
Troubleshooting pointers for Linux on Power
IBM provides troubleshooting and problem analysis techniques for Linux on Power Systems at IBM Knowledge Center.
IBM Knowledge Center is a good source for error codes, analysis tools, procedures, and other resources to help you solve problems with Linux on Power, especially for NVIDIA graphics processing units (GPUs) running in these systems.
Solving a RAID failure
Although a software RAID is not a requirement of IBM PowerAI, it is a preferred practice to mitigate local disk failures. Creating a RAID device is not mandatory for the installation of IBM PowerAI, but it provides high availability (HA) if one of the disks (solid-state drives (SSDs) or hard disk drives (HDDs)) in the RAID array fails. The steps that are required to create a two-disk RAID array are described in “Creating a RAID1 array” on page 82.
RAID failure
This section covers the recovery procedure after a failure of one disk that is part of a RAID array.
This test assumes that a failure occurs in /dev/sda, which contains the PReP boot area.
In this example, we completed the following sequence of steps to simulate the disk failure scenario and subsequent recovery. In the case of an actual failure, we would have started from step 3.
1. Confirm the current software RAID configuration.
2. Simulate a failure in one disk (pseudo failure to /dev/sda).
3. Remove the disk in /dev/sda.
4. Add a disk and create a partition.
Confirming the current software RAID configuration
To confirm the current software RAID configuration, run the following commands to check the state of the software-defined RAID:
$ sudo mdadm --detail /dev/md0
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Wed Mar 8 16:10:02 2017
State: clean
Active Devices: 2
Working Devices: 2
Failed Devices: 0
Spare Devices: 0
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 1696
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
 
$ sudo cat /proc/mdstat
Personalities: [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0: active raid1 sda2[0] sdb2[1]
976622592 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
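For a quick health check between full reports, the same information can be reduced to a few lines (in the /proc/mdstat output, [UU] means that both mirror members are active):
$ sudo mdadm --detail /dev/md0 | grep -E 'State|Devices'
$ grep -A 2 md0 /proc/mdstat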
Simulating a failure in one disk
To simulate a failure in one disk, complete the following steps.
 
Note: This exercise was completed on an established and understood system. The actual parameters, options, and device names depend on your actual system. As some of the commands that are listed are of a destructive nature, verify your configuration before performing these changes.
1. Simulate a failure on one disk that is part of the RAID array by running the following command:
$ sudo mdadm --fail /dev/md0 /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md0
2. Check the status of the RAID array after the failure by running the following command:
$ sudo mdadm --detail /dev/md0
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Wed Mar 8 16:12:11 2017
State: clean, degraded
Active Devices: 1
Working Devices: 1
Failed Devices: 1
Spare Devices: 0
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 1700
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 18 1 active sync /dev/sdb2
0 8 2 - faulty /dev/sda2
Removing the disk in /dev/sda
To remove a disk in /dev/sda, complete the following steps:
1. Remove the disk from the RAID definition:
 – Exclude /dev/sda2 from the software-defined RAID:
$ sudo mdadm --remove /dev/md0 /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md0
 – Exclude /dev/sda from the system:
$ echo 1 | sudo tee /sys/block/sda/device/delete
 – Confirm that /dev/sda is excluded from the system:
$ sudo fdisk -l
2. Physically remove the disk from the system.
Adding a disk and creating a partition
To add a disk and create a partition, complete the following steps:
1. Install the disk in the system.
2. Rescan the SCSI bus so that the system recognizes the new disk:
$ echo "- - -" | sudo tee /sys/class/scsi_host/host0/scan
3. Confirm that the disk is recognized as /dev/sda:
$ sudo fdisk -l
4. Create a partition on the disk:
$ sudo fdisk /dev/sda
Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only until you decide to write them.
Be careful before using the write command.
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p):
 
Using default response p.
Partition number (1-4, default 1):
First sector (2048-1953525167, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-1953525167, default 1953525167): 16383
 
Created a new partition 1 of type 'Linux' and of size 7 MiB.
Command (m for help): n
Partition type
p primary (1 primary, 0 extended, 3 free)
e extended (container for logical partitions)
Select (default p):
 
Using default response p.
 
Partition number (2-4, default 2):
First sector (16384-1953525167, default 16384):
Last sector, +sectors or +size{K,M,G,T,P} (16384-1953525167, default 1953525167):
Created a new partition 2 of type 'Linux' and of size 931.5 GiB.
Command (m for help): t
Partition number (1, 2, default 2): 1
Partition type (type L to list all types): 41
 
 
Changed type of partition 'Linux' to 'PPC PReP Boot'.
 
 
Command (m for help): p
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x19bcc23d
 
Device Boot Start End Sectors Size Id Type
/dev/sda1 2048 16383 14336 7M 41 PPC PReP Boot
/dev/sda2 16384 1953525167 1953508784 931.5G 83 Linux
 
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
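Note: As an alternative to the interactive fdisk session, the partition table can be copied from the surviving disk. This sketch assumes that /dev/sdb is the healthy disk, that /dev/sda is the empty replacement, and that the dump file name is arbitrary; verify the device names first because the second command overwrites the partition table on /dev/sda:
$ sudo sfdisk -d /dev/sdb > sdb-table.dump
$ sudo sfdisk /dev/sda < sdb-table.dump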
5. Copy the boot area from /dev/sdb1 to /dev/sda1:
$ sudo dd if=/dev/sdb1 of=/dev/sda1
14336+0 records in
14336+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.119115 s, 61.6 MB/s
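Optionally, verify that the copy matches the source partition (cmp produces no output when the two partitions are identical):
$ sudo cmp /dev/sdb1 /dev/sda1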
6. Add the disk to the RAID definition:
$ sudo mdadm --add /dev/md0 /dev/sda2
mdadm: added /dev/sda2
7. Check that the RAID array rebuild has finished (a monitoring tip follows the sample outputs):
 – Sample output when the array is still being rebuilt:
$ sudo mdadm --detail /dev/md0
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Wed Mar 8 17:20:31 2017
State: clean, degraded, recovering
Active Devices: 1
Working Devices: 2
Failed Devices: 0
Spare Devices: 1
Rebuild Status: 0% complete
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 1883
Number Major Minor RaidDevice State
2 8 2 0 spare rebuilding /dev/sda2
1 8 18 1 active sync /dev/sdb2
 – Sample output when the array rebuilding has finished:
$ sudo mdadm --detail /dev/md0
 
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Thu Mar 9 09:58:42 2017
State: clean
Active Devices: 2
Working Devices: 2
Failed Devices: 0
Spare Devices: 0
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 3277
Number Major Minor RaidDevice State
2 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
The RAID array rebuild is now complete.
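While the array is resynchronizing, you can also follow the rebuild progress from /proc/mdstat instead of rerunning mdadm --detail; the recovery line shows the percentage that is complete:
$ watch -n 5 cat /proc/mdstat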
 