Troubleshooting

Occasionally, problems or unusual behavior can arise in processes. Solaris provides a number of tools to help us troubleshoot problems.

Truss

Solaris provides a tool called truss that shows the system calls a program makes while it runs. A simple example is the rm command, which deletes a file. To delete a file, a program must make a system call asking the kernel to perform the action. That system call is named “unlink”; the name is appropriate because if a file has several links and we remove one of them, the file itself remains. Only when the last link is removed does the file cease to exist (see Chapter 6, “The Filesystem and Its Contents”). The following command runs rm under truss and displays every system call made while it ran:

hydrogen# ls -l testfile
-rw-r--r--   1 root     other        583 Dec 22 17:46 testfile
hydrogen# truss rm testfile
execve("/usr/bin/rm", 0xEFFFFD0C, 0xEFFFFD18)  argc = 2
open("/dev/zero", O_RDONLY)                     = 3
mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xEF7C0000
stat("/usr/bin/rm", 0xEFFFFA00)                 = 0
open("/usr/lib/libc.so.1", O_RDONLY)            = 4
fstat(4, 0xEFFFF7BC)                            = 0
mmap(0x00000000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0xEF7B0000
mmap(0x00000000, 770048, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0xEF6C0000
munmap(0xEF764000, 61440)                       = 0
mmap(0xEF773000, 27668, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 4, 667648) = 0xEF773000
mmap(0xEF77A000, 5480, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xEF77A000
close(4)                                        = 0
open("/usr/lib/libdl.so.1", O_RDONLY)           = 4
fstat(4, 0xEFFFF7BC)                            = 0
mmap(0xEF7B0000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 4, 0) = 0xEF7B0000
close(4)                                        = 0
open("/usr/platform/SUNW,SPARCstation-LX/lib/libc_psr.so.1", O_RDONLY) Err#2 ENOENT
close(3)                                        = 0
brk(0x00022C20)                                 = 0
brk(0x00024C20)                                 = 0
open("/usr/lib/locale/en_GB/en_GB.so.2", O_RDONLY) = 3
fstat(3, 0xEFFFF19C)                            = 0
mmap(0x00000000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xEF7A0000
mmap(0x00000000, 86016, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xEF6A0000
munmap(0xEF6A4000, 61440)                       = 0
mmap(0xEF6B3000, 5934, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 12288) = 0xEF6B3000
close(3)                                        = 0
open("/dev/zero", O_RDONLY)                     = 3
mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xEF790000
close(3)                                        = 0
munmap(0xEF7A0000, 4096)                        = 0
getrlimit64(RLIMIT_NOFILE, 0xEFFFFC98)          = 0
lstat64("testfile", 0xEFFFFBA0)                 = 0
access("testfile", 2)                           = 0
unlink("testfile")                              = 0
llseek(0, 0, SEEK_CUR)                          = 2660
_exit(0)
hydrogen#

Each line of output shows one system call along with the parameters passed to it. The value returned by the call appears after the equals sign. For many calls, such as close() and unlink(), a return value of zero means success, but this is not always the case. The open() system call asks the kernel to open a file. As we saw in Chapter 6, “The Filesystem and Its Contents,” each time a file is opened a file descriptor is assigned to it, and open() returns the number of that descriptor, so a non-zero value here indicates success. When a call fails, truss shows the error number and name instead of a return value, as with the open() that reports Err#2 ENOENT above. When a program wants to close a file it calls close() and passes the file descriptor as the parameter.
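The descriptor and error behaviour described above can be tried outside truss. The following is a minimal Python sketch, not part of the original example: it uses os.open() and os.close(), which wrap the open() and close() system calls, and the helper name try_open and the file names are purely illustrative.

```python
import errno
import os
import tempfile

def try_open(path):
    """Return (fd, None) on success or (None, error_name) on failure."""
    try:
        fd = os.open(path, os.O_RDONLY)   # wraps the open() system call
    except OSError as exc:
        # Translate the error number into its symbolic name, e.g. 2 -> ENOENT
        return None, errno.errorcode[exc.errno]
    os.close(fd)                          # close() takes the descriptor
    return fd, None

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "testfile")
    with open(present, "w") as handle:
        handle.write("data\n")

    fd, err = try_open(present)
    missing_fd, missing_err = try_open(os.path.join(tmp, "no-such-file"))

print("existing file: fd =", fd, "error =", err)
print("missing file:  fd =", missing_fd, "error =", missing_err)
```

The successful call returns a small non-negative integer, which is the descriptor number, while the failed call reports the same ENOENT name that truss prints alongside Err#2.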

Even a relatively simple program makes many system calls. The first system call comes from the truss process as it executes the rm process by calling the execve() system call (system calls are usually written with empty brackets following their name).

Many other system calls follow before the one that actually does what we want, which is the call to unlink(). The command finishes with the _exit() call, passing zero because the command succeeded in deleting the file.
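The link-counting behaviour of unlink() mentioned earlier can be sketched in a few lines. This is an illustrative Python example under stated assumptions, not something taken from the truss session: it works in a temporary directory, the file names are made up, and it assumes the filesystem supports hard links.

```python
import os
import tempfile

def link_count(path):
    """Return the number of hard links recorded for the file."""
    return os.stat(path).st_nlink

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "testfile")
    second = os.path.join(tmp, "testfile.link")   # hypothetical second name

    with open(original, "w") as handle:
        handle.write("some data\n")

    os.link(original, second)             # give the file a second link
    links_before = link_count(original)

    os.unlink(original)                   # remove one link only
    links_after = link_count(second)
    with open(second) as handle:
        contents = handle.read()          # the data is still reachable

    os.unlink(second)                     # remove the last link
    gone = not os.path.exists(second)

print("links before unlink:", links_before)
print("links after unlink: ", links_after)
print("file gone after last unlink:", gone)
```

Removing the first name leaves the data intact under the second name; the file only disappears when the final link is removed.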

The truss command can be very useful for troubleshooting processes that are not doing what they should. If the program you are trying to run ends without doing anything, then you can use truss as in the above example. You may find that a program is terminating prematurely because a system call is failing, in which case you should see the offending system call near the end of the truss output. Possible problems you would pick up here include failure to create a file (maybe due to a lack of permissions) or failure to open a file (maybe it is not there or has incorrect permissions). Alternatively, you may find that a program you are running appears to be hanging for no apparent reason. In this case, you can use truss to examine an already running program by using the “-p” option:

hydrogen# ps -ef | tail -5
  jsmith   537   535  0 15:38:34 pts/0    0:01 -sh
  jsmith   643   537  1 15:53:11 pts/0    0:00 pg
    root   579   577  0 15:44:39 pts/1    0:00 -sh
    root   644   579  2 15:53:17 pts/1    0:00 ps -ef
    root   577   173  0 15:44:38 ?        0:00 in.telnetd
hydrogen#

If, for example, we were worried that process ID 643 (shown above) had hung, we could examine it using truss to see what it was doing:

hydrogen# truss -p 643
read(0, 0xEF73B150, 1024)       (sleeping...)
hydrogen#

Here we see that the process is currently in the read() system call, but it is sleeping rather than reading any data. This means that the process (in this case pg) is trying to read from file descriptor 0, its standard input; no data is available and no end of file has been reached, so it simply waits. In this case, we can see what must have happened. The user jsmith has run the pg command without supplying a filename, so it is reading the standard input instead. The standard input is attached to the keyboard, so pg will read whatever is typed until it receives the EOF character (usually <control-d>). The user who ran the command is not typing anything, so the process sleeps while it waits for data. This is a very simple example, but it demonstrates the type of troubleshooting that can be performed by using truss to examine the system calls a process is making.
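The three situations a reader can be in (no data yet, data available, end of file) can be illustrated with a pipe instead of a keyboard. The sketch below is a Python illustration only; it assumes a Unix-like system where select() works on pipes, and the helper name readable_now is hypothetical.

```python
import os
import select

def readable_now(fd):
    """True if a read() on fd would return at once instead of sleeping."""
    ready, _, _ = select.select([fd], [], [], 0)   # zero timeout: just poll
    return bool(ready)

read_end, write_end = os.pipe()

# Writer is open but silent: a read() here would sleep, like pg waiting on stdin.
idle = readable_now(read_end)

# Once something is written, read() returns the data straight away.
os.write(write_end, b"hello\n")
has_data = readable_now(read_end)
data = os.read(read_end, 1024)

# Closing the last writer is the pipe's equivalent of typing the EOF character.
os.close(write_end)
at_eof = readable_now(read_end)
eof_data = os.read(read_end, 1024)    # an empty result signals end of file
os.close(read_end)

print("would sleep while idle:", not idle)
print("data read:", data)
print("read at EOF returns:", eof_data)
```

A process that truss shows as sleeping in read() is in the first state: the other end is still open, but nothing has arrived.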

Pargs

This command was only introduced with Solaris 9, but provides a number of useful features that would make a system administrator wonder how (s)he got on without it.

The default action of pargs is to display all the arguments that were supplied to a running process. This is very useful, but can't we get this information from a ps listing? We can for most processes, but there is a fixed length limit to the amount of information displayed by ps so we may not see all the arguments and parameters that a certain process was started with.

The following example shows the console login process. If we were a bit unsure of the arguments it was called with we could simply look using ps:

hydrogen# ps -ft console
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root   244     1  0 13:22:56 console  0:00 /usr/lib/saf/ttymon -g -h -p hydrogen console login:  -T sun -d /dev/console
hydrogen#

However, if we look using pargs we see that some information was missing from the ps listing:

hydrogen# pargs 244
244:    /usr/lib/saf/ttymon -g -h -p hydrogen console login:  -T sun -d /dev/console
argv[0]: /usr/lib/saf/ttymon
argv[1]: -g
argv[2]: -h
argv[3]: -p
argv[4]: hydrogen console login:
argv[5]: -T
argv[6]: sun
argv[7]: -d
argv[8]: /dev/console
argv[9]: -l
argv[10]: console
argv[11]: -m
argv[12]: ldterm,ttcompat
hydrogen#
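The difference between the two listings can be modelled with a short Python sketch. This is a simplified illustration, not how ps or pargs are implemented: the 80-character limit is an assumption chosen for the example, the function names are hypothetical, and argv[4] is assumed to end in a space (which would account for the double space in the ps line). The exact truncation point on a real system may differ.

```python
# Arguments as reported by pargs for the ttymon process in the text.
ARGV = [
    "/usr/lib/saf/ttymon", "-g", "-h", "-p", "hydrogen console login: ",
    "-T", "sun", "-d", "/dev/console", "-l", "console", "-m", "ldterm,ttcompat",
]

# Assumed size of the fixed-length field that a ps-style listing prints from.
PS_ARGS_LIMIT = 80

def ps_style(argv, limit=PS_ARGS_LIMIT):
    """Join the arguments with spaces and cut the result at the limit."""
    return " ".join(argv)[:limit]

def pargs_style(argv):
    """List every argument separately, however long the total is."""
    return [f"argv[{index}]: {arg}" for index, arg in enumerate(argv)]

shown = ps_style(ARGV)
full = pargs_style(ARGV)

print(shown)
for line in full:
    print(line)
```

Because the joined string is longer than the limit, the trailing arguments never reach the ps-style line, whereas the per-argument listing is unaffected by the total length.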

There are other options to pargs. Possibly the most useful is the “-e” option, which will display the environment variables of a process:

hydrogen# pargs -e 244
244:    /usr/lib/saf/ttymon -g -h -p hydrogen console login:  -T sun -d /dev/console
envp[0]: PATH=/usr/sbin:/usr/bin
envp[1]: TZ=Europe/Stockholm
hydrogen#
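It helps to remember that each process carries its own copy of the environment, fixed when the process is started, which is why inspecting a running process can show values that differ from those in your own shell. The following Python sketch illustrates the idea under some assumptions: the child here simply reports its own TZ variable rather than being inspected from outside, and the helper name child_timezone is hypothetical.

```python
import os
import subprocess
import sys

def child_timezone(tz_value):
    """Start a child process with TZ set and return the TZ value it sees."""
    env = dict(os.environ)        # copy our environment for the child
    env["TZ"] = tz_value          # change the copy only, not our own process
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('TZ', ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

parent_tz_before = os.environ.get("TZ")
child_tz = child_timezone("Europe/Stockholm")
parent_tz_after = os.environ.get("TZ")

print("child sees TZ =", child_tz)
print("parent TZ unchanged:", parent_tz_before == parent_tz_after)
```

The child reports the value it was started with, while the parent's own setting is untouched.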

If you look at the man page you will see that there are a few other options to pargs, but these are the most useful.

Prex

This command gets a mention here, but that is about all. It has existed in Solaris for a while, but its man page only appeared in Solaris 8. Prex is a very powerful tool that is much more informative than truss, but it does take some time to get used to. It enables you to control tracing and set probe points in running processes or even in the kernel itself. If you are familiar with debugging tools, such as sdb, then you may want to experiment with prex to see what it can offer.
