Problem determination tools
This appendix describes Linux tools that are frequently used to gather data from a running system. This appendix also describes troubleshooting practices.
This appendix contains the following topics:
Logs and configuration data gathering tools
Linux, as an open technology, has many tools that help system administrators gather relevant information from a system before, during, and after a problem occurs. This section covers two of these tools:
The sosreport tool
The Scale-out LC System Event Log Collection Tool
The sosreport tool
Sosreport is an extensible and portable data collection tool that is primarily aimed at Linux distributions and other UNIX-like operating systems.
This section describes the installation and use of this tool with Ubuntu. The tool can also be used with Red Hat Enterprise Linux (RHEL).
The main purpose of this tool is to gather system logs and configuration data from a running system. It is provided as a standard package in Ubuntu and collects the logs and configuration information into a single compressed archive.
Installing sosreport
To install the sosreport tool, install the package by running the following command (the system needs connectivity to an Ubuntu package repository):
$ sudo apt-get install sosreport
Reading package lists ...
(...) Processing triggers for man-db (2.7.5-1) ...
Setting up sosreport (3.4-1~ubuntu16.04.1) ...
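After installation, you can list the collection plug-ins that sosreport enables on your system. The -l flag is available in sosreport 3.x; check sosreport --help if your version differs:
$ sudo sosreport -l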
Running sosreport
To run the sosreport tool, run the following command (you need root permission):
$ sudo sosreport
sosreport (version 3.4)
(...)
No changes will be made to system configuration.
Press ENTER to continue, or CTRL-C to quit.
Please enter your first initial and last name [comp02]:
Please enter the case id that you are generating this report for []:
(...)
Your sos report has been generated and saved in:
/tmp/sosreport-comp02-20170825144916.tar.xz
The checksum is: 908cbe3c9a8ecf3e2cb916a79666b916.
Please send this file to your support representative.
A compressed .tar.xz file is generated under /tmp. You can expand it to inspect the contents or send it for analysis.
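For example, using the file name from the sample run above (your file name and checksum differ), you can verify the archive against the checksum that sosreport printed (an MD5 sum in this example) and expand it for inspection:
$ cd /tmp
$ md5sum sosreport-comp02-20170825144916.tar.xz
$ tar -xJf sosreport-comp02-20170825144916.tar.xz
$ ls sosreport-comp02-*/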
The Scale-out LC System Event Log Collection Tool
This tool is a Perl script that is provided by IBM Support that collects logs from a remote or local system through an SSH connection to its baseboard management controller (BMC). The script can be used with an open problem management report (PMR) to provide more information to IBM Support for systems under a support contract with IBM. The tool can be used to gather and centralize logs and configurations from many Power Systems LC systems in a central Linux repository.
Prerequisites
Before you install the tool, you must meet these prerequisites:
The operating system where the collector tool is run must be Linux.
The Linux system must have network connectivity to the BMC.
The Linux system that is used to perform the data collection must have the following tool packages installed:
 – ipmitool
 – perl
 – sshpass
To install these packages, run the following command:
$ sudo apt-get install perl sshpass ipmitool
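To confirm that the packages are installed and that the BMC is reachable from the collection host before you run the tool (replace the placeholder with your BMC address), run the following commands:
$ dpkg -l perl sshpass ipmitool
$ ping -c 2 <BMC host name or IP address>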
Where to get the tool
The tool can be downloaded from Scale-out LC System Event Log Collection Tool.
There is more than one version of the tool, depending on the system's machine type and model. Choose the version that applies to your system and download the corresponding plc.zip file.
Installation steps
To install the tool, complete the following steps:
1. Copy the plc.zip package to a Linux system that has network connectivity to the BMC of the Scale-out LC server from which you need to collect data.
2. Extract the plc.zip file into a directory of your choice by running the following command:
$ unzip plc.zip
 
Archive: plc.zip
  inflating: eSEL2.pl
  inflating: led_status.sh
  inflating: plc.pl
  inflating: README
The directory now contains the following files:
 – plc.pl
 – eSEL2.pl
 – led_status.sh
 – README
 
Note: The files that are generated by this script are saved in the same directory from which the script is run.
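Depending on how the files are packaged, the extracted scripts might not be marked as executable. If that is the case, set the execute permission before you run them:
$ chmod +x plc.pl led_status.sh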
Usage
This section shows the tool command syntax, its flags, a sample run, and its result:
$ plc.pl { -b bmc_address | -i } [-a admin_pw] [-s sysadmin_pw] [-h host -u user -p password] [-f]
Here are the flags:
-b BMC host name or IP address
-a BMC ADMIN password if changed from the default (admin)
-s BMC sysadmin password if changed from the default (superuser)
-i Interactive mode
-f Collect BMC firmware image
-h Linux host address
-u Linux host user ID
-p Linux host password
To use the tool, run the following command (in this example, the BMC IP is 10.10.10.10):
$ ./plc.pl -b 10.10.10.10 -a admin -f
Getting BMC data
Warning: Permanently added '10.10.10.10' (RSA) to the list of known hosts.
..........................
Getting IPMI Data
........
To list the resulting file, run the following command:
$ ls -l
10.7.22.1-2017-06-30.1045.powerlc.tar.gz
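Because the collector only needs network access to each BMC, a simple shell loop can centralize collections from several Scale-out LC systems onto one Linux host. The following sketch assumes a hypothetical bmc_list.txt file with one BMC host name or IP address per line, and BMCs that still use the default ADMIN password:
$ for bmc in $(cat bmc_list.txt); do ./plc.pl -b "$bmc" -a admin; done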
Errors
This section highlights a few error messages and their exit codes:
If sshpass is not found on the system, then the plc.pl script prints sshpass is required and not found on this system and exits with return code 1.
If the BMC is not reachable by a ping, then the plc.pl script prints Unable to ping bmchostname/IPaddress and exits with return code 2.
If a command not found error is returned when running the plc.pl command, try running the command by prefixing the command with ./ so that the command is ./plc.pl.
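Because the script uses distinct return codes, a wrapper can react to them automatically. Here is a minimal sketch that assumes the same BMC address as the earlier example:
#!/bin/bash
# Run the collector and act on its documented return codes
./plc.pl -b 10.10.10.10 -a admin
case $? in
  1) echo "sshpass is not installed on this system; install it and rerun." ;;
  2) echo "The BMC is not reachable; check the network path to the BMC." ;;
esac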
Troubleshooting pointers for Linux on Power
IBM provides troubleshooting and problem analysis techniques for Linux on Power Systems at IBM Knowledge Center.
IBM Knowledge Center is a good source for error codes, analysis tools, procedures, and other resources to help you solve problems with Linux on Power, especially for NVIDIA graphics processing units (GPUs) running in these systems.
Solving a RAID failure
Although a software RAID is not a requirement of IBM PowerAI, it is a preferred practice to mitigate local disk failures. Creating a RAID device is not mandatory for the installation of IBM PowerAI, but it provides high availability (HA) if one of the disks (solid-state drives (SSDs) or hard disk drives (HDDs)) in the RAID array fails. The steps that are required to create a two-disk RAID array are described in “Creating a RAID1 array” on page 82.
RAID failure
This section covers the recovery procedure after a failure of one disk that is part of a RAID array.
This test assumes that a failure occurs in /dev/sda, which contains the PReP boot area.
In this example, we completed the following sequence of steps to simulate the disk failure scenario and subsequent recovery. In the case of an actual failure, we would have started from step 3.
1. Confirm the current software RAID configuration.
2. Simulate a failure in one disk (pseudo failure to /dev/sda).
3. Remove the disk in /dev/sda.
4. Add a disk and create a partition.
Confirming the current software RAID configuration
To confirm the current software RAID configuration, run the following commands to check the state of the software-defined RAID:
$ sudo mdadm --detail /dev/md0
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Wed Mar 8 16:10:02 2017
State: clean
Active Devices: 2
Working Devices: 2
Failed Devices: 0
Spare Devices: 0
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 1696
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
 
$ sudo cat /proc/mdstat
Personalities: [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0: active raid1 sda2[0] sdb2[1]
976622592 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
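For a quick health check between full reports, the same information can be reduced to a few lines (in the /proc/mdstat output, [UU] means that both mirror members are active):
$ sudo mdadm --detail /dev/md0 | grep -E 'State|Devices'
$ grep -A 2 md0 /proc/mdstat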
Simulating a failure in one disk
To simulate a failure in one disk, complete the following steps.
 
Note: This exercise was completed on an established and understood system. The actual parameters, options, and device names depend on your actual system. As some of the commands that are listed are of a destructive nature, verify your configuration before performing these changes.
1. Simulate a failure on one disk that is part of the RAID array by running the following command:
$ sudo mdadm --fail /dev/md0 /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md0
2. Check the status of the RAID array after the failure by running the following command:
$ sudo mdadm --detail /dev/md0
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Wed Mar 8 16:12:11 2017
State: clean, degraded
Active Devices: 1
Working Devices: 1
Failed Devices: 1
Spare Devices: 0
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 1700
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 18 1 active sync /dev/sdb2
0 8 2 - faulty /dev/sda2
Removing the disk in /dev/sda
To remove a disk in /dev/sda, complete the following steps:
1. Remove the disk from the RAID definition:
 – Exclude /dev/sda2 from the software-defined RAID:
$ sudo mdadm --remove /dev/md0 /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md0
 – Exclude /dev/sda from the system:
$ echo 1 | sudo tee /sys/block/sda/device/delete
 – Confirm that /dev/sda is excluded from the system:
$ sudo fdisk -l
2. Physically remove the disk from the system.
Adding a disk and creating a partition
To add a disk and create a partition, complete the following steps:
1. Install the disk in the system.
2. Rescan the SCSI bus so that the system recognizes the new disk:
$ echo "- - -" | sudo tee /sys/class/scsi_host/host0/scan
3. Confirm that the disk is recognized as /dev/sda:
$ sudo fdisk -l
4. Create a partition on the disk:
$ sudo fdisk /dev/sda
Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only until you decide to write them.
Be careful before using the write command.
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p):
 
Using default response p.
Partition number (1-4, default 1):
First sector (2048-1953525167, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-1953525167, default 1953525167): 16383
 
Created a new partition 1 of type 'Linux' and of size 7 MiB.
Command (m for help): n
Partition type
p primary (1 primary, 0 extended, 3 free)
e extended (container for logical partitions)
Select (default p):
 
Using default response p.
 
Partition number (2-4, default 2):
First sector (16384-1953525167, default 16384):
Last sector, +sectors or +size{K,M,G,T,P} (16384-1953525167, default 1953525167):
Created a new partition 2 of type 'Linux' and of size 931.5 GiB.
Command (m for help): t
Partition number (1, 2, default 2): 1
Partition type (type L to list all types): 41
 
 
Changed type of partition 'Linux' to 'PPC PReP Boot'.
 
 
Command (m for help): p
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x19bcc23d
 
Device Boot Start End Sectors Size Id Type
/dev/sda1 2048 16383 14336 7M 41 PPC PReP Boot
/dev/sda2 16384 1953525167 1953508784 931.5G 83 Linux
 
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
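Note: As an alternative to the interactive fdisk session, the partition table can be copied from the surviving disk. This sketch assumes that /dev/sdb is the healthy disk, that /dev/sda is the empty replacement, and that the dump file name is arbitrary; verify the device names first because the second command overwrites the partition table on /dev/sda:
$ sudo sfdisk -d /dev/sdb > sdb-table.dump
$ sudo sfdisk /dev/sda < sdb-table.dump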
5. Copy the boot area from /dev/sdb1 to /dev/sda1:
$ sudo dd if=/dev/sdb1 of=/dev/sda1
14336+0 records in
14336+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.119115 s, 61.6 MB/s
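Optionally, verify that the copy matches the source partition (cmp produces no output when the two partitions are identical):
$ sudo cmp /dev/sdb1 /dev/sda1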
6. Add the disk to the RAID definition:
$ sudo mdadm --add /dev/md0 /dev/sda2
mdadm: added /dev/sda2
7. Check that the RAID array rebuild has finished (a monitoring tip follows the sample outputs):
 – Sample output when the array is still being rebuilt:
$ sudo mdadm --detail /dev/md0
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Wed Mar 8 17:20:31 2017
State: clean, degraded, recovering
Active Devices: 1
Working Devices: 2
Failed Devices: 0
Spare Devices: 1
Rebuild Status: 0% complete
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 1883
Number Major Minor RaidDevice State
2 8 2 0 spare rebuilding /dev/sda2
1 8 18 1 active sync /dev/sdb2
 – Sample output when the array rebuilding has finished:
$ sudo mdadm --detail /dev/md0
 
/dev/md0:
Version: 1.2
Creation Time: Mon Mar 6 10:38:20 2017
Raid Level: raid1
Array Size: 976622592 (931.38 GiB 1000.06 GB)
Used Dev Size: 976622592 (931.38 GiB 1000.06 GB)
Raid Devices: 2
Total Devices: 2
Persistence: Superblock is persistent
Intent Bitmap: Internal
Update Time: Thu Mar 9 09:58:42 2017
State: clean
Active Devices: 2
Working Devices: 2
Failed Devices: 0
Spare Devices: 0
Name: ubuntu-disktest:0 (local to host ubuntu-disktest)
UUID: 8d57fc2d:c43f9176:8cf093de:44a0db03
Events: 3277
Number Major Minor RaidDevice State
2 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
The RAID array rebuild is now complete.
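While the array is resynchronizing, you can also follow the rebuild progress from /proc/mdstat instead of rerunning mdadm --detail; the recovery line shows the percentage that is complete:
$ watch -n 5 cat /proc/mdstat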
 