In this chapter, we've seen a lot of individual scripts and programs that enable the DBA to deal with different parts of the database. Now, it's time to pull them together and create a "dashboard" for the OnLine system.
Of all the tools that come with OnLine, there is nothing that provides a quick, one-page peek into the engine. Various invocations of tbstat will allow you to look at individual items, but nothing gives you a quick status report. If the OnLine system is a database engine, certainly there should be a dashboard. Like a dashboard, the status script gives you or your users one place to look for indications of current problems or problems to come.
This script depends on having the following scripts located in your $INFORMIXDIR/local_bin directory:
logging
archiving
thislog
dbdf
Here is the code for the status script:
INFORMIX 103> cat status
#! /bin/csh
clear
set INFORMIXLOCALBIN = "$INFORMIXDIR/local_bin"
echo "Please wait....."
set list = ( `$INFORMIXDIR/bin/tbstat -p | fgrep -v e` )
set dskreads = $list[1]
set pagreads = $list[2]
set bufreads = $list[3]
set readcache = $list[4]
set dskwrits = $list[5]
set pagwrits = $list[6]
set bufwrits = $list[7]
set writecache = $list[8]
set isamtot = $list[9]
set open = $list[10]
set start = $list[11]
set read = $list[12]
set write = $list[13]
set rewrite = $list[14]
set delete = $list[15]
set commit = $list[16]
set rollbk = $list[17]
set ovtbls = $list[18]
set ovlock = $list[19]
set ovuser = $list[20]
set ovbuff = $list[21]
set usercpu = $list[22]
set syscpu = $list[23]
set numckpts = $list[24]
set flushes = $list[25]
set bufwaits = $list[26]
set lokwaits = $list[27]
set lokreqs = $list[28]
set deadlks = $list[29]
set dltouts = $list[30]
set lchwaits = $list[31]
set ckpwaits = $list[32]
set compress = $list[33]
if ( $lokwaits > 0 ) then
    @ lokhold = ( $lokwaits * 100 ) / $lokreqs
    @ lokfigure = 100 - $lokhold
else
    @ lokfigure = 100
endif
if ( $lchwaits > 0 ) then
    @ lchhold = ( $lchwaits * 100 ) / $isamtot
    @ lchfigure = 100 - $lchhold
else
    @ lchfigure = 100
endif
fgrep Checkpoint $INFORMIXDIR/online.log | tail -6 >! ~/temp_stat
#ps -efa > ~/tempjobs
ps -a > ~/tempjobs
@ NUMBERFULL = `$INFORMIXDIR/bin/tbstat -l | grep -v RSAM | grep U | wc -l`
@ cpunum = `echo $syscpu | awk -F. '{print $1}'`
@ usernum = `echo $usercpu | awk -F. '{print $1}'`
if ( $cpunum == 0 ) then
    @ cpunum = 1
endif
@ ratio = $usernum / $cpunum
echo "     INFORMIX STATUS:COMPUTER NAME: `hostname`     TIME: `date`
`$INFORMIXDIR/bin/tbstat -`
CPU UTILIZATION: `uptime`

RATIOS:
READ CACHE   WRITE CACHE   LOCK HIT%   LATCH HIT%   USER/SYSTEM
 $readcache        $writecache       $lokfigure         $lchfigure        ${ratio}-1

WAITS:
CHECKPOINTS   BUFFERS   DEADLOCKS
  $ckpwaits        $bufwaits      $deadlks
"
echo "
LOGS:
There are now ${NUMBERFULL} log(s) in use.
This includes the current log:
`$INFORMIXLOCALBIN/thislog`"
echo "
CHECKPOINTS:
The last six checkpoints occurred as follows:
`cat ~/temp_stat | sed 's/Checkpoint Completed/    /g'`"
echo ========================================================================
echo -n "
CONTINUOUS BACKUP OF LOG FILES:  `$INFORMIXLOCALBIN/logging`
ARCHIVE:  `$INFORMIXLOCALBIN/archiving`
"
echo " "
sh $INFORMIXLOCALBIN/dbdf
$INFORMIXDIR/bin/tbstat -d >! ~/temphold99
fgrep PD ~/temphold99 > /dev/null
if ( $status == 0 ) then
    echo "CAUTION: A CHUNK IS DOWN"
endif
rm ~/temp_stat* ~/temphold* ~/tempjobs*
echo " "
This script is designed to be relatively portable across systems. I've avoided hard coding anything into the system, with the exception of some of the code in the dbdf script that excludes page usage in my logspaces. Since the logfiles are preallocated, the space usage does not vary. You may or may not want to see the data.
The screen output looks like this:
Please wait.....
     INFORMIX STATUS:COMPUTER NAME: batman     TIME: Fri Jan 21 15:40:11 CST 1994
RSAM Version 4.10.UD4 -- On-Line -- Up 3 days 17:41:18 -- 18704 Kbytes
CPU UTILIZATION: 3:40pm up 3 days, 17:46, 79 users, load average: 12.94, 13.50, 13.27

RATIOS:
READ CACHE   WRITE CACHE   LOCK HIT%   LATCH HIT%   USER/SYSTEM
  98.39         97.10         100          100          9-1

WAITS:
CHECKPOINTS   BUFFERS   DEADLOCKS
   125          9941         5

LOGS:
There are now 2 log(s) in use.
This includes the current log:
e06d26f0   11   U---C-L   3214   70f4f4   30000   13435   44.78

CHECKPOINTS:
The last six checkpoints occurred as follows:
12:59:42    13:31:53    14:04:41    14:22:41    14:56:41    15:33:37
=============================================================================

CONTINUOUS BACKUP OF LOG FILES:  OFF
ARCHIVE:  OFF

rootdbs is 37.34 percent full and has 167305 free pages
dbspace2 is 23.56 percent full and has 206379 free pages
dbspace1 is 89.84 percent full and has 166120 free pages
$lowdbspace is 92.90 percent full and has 61061 free pages
(Chunk down warning appears here if applicable)
The screen does tend to get a bit crowded. In fact, if the dbdf script reports more than four or five dbspaces in your system, your report may scroll off the screen. If this is the case, you could either reformat the screen output, taking out some of the blank lines that I put in for readability, or you could pipe the result through the UNIX more utility, which would allow the user to page through the output.
The first four lines of the screen give you basic status and identifying information about the system. The first line includes the computer name (from the UNIX hostname utility, which some UNIX versions may lack) and the status date. The second line is the header from tbstat and tells you the RSAM version of OnLine, its status (on-line, off-line, checkpoint), its uptime, and the size of its shared-memory segment. The fourth line gives you the output of your UNIX uptime utility so that you can get a feel for the load on your system.
The RATIOS section gives you the read cache and write cache figures from tbstat -p. A well-tuned system will have 95+ percent for read and 85+ percent for write cache hit rates. The next two numbers are derived from tbstat -p output. The LOCK HIT % entry gives you the percentage of time that requests for locks were successful. It is calculated by this formula:
lock hit % = 100 - (lock waits x 100) / (lock requests)
This number should be close to 100 percent at all times. If this number begins to drop, it means that jobs are having to wait for locks to be released. Look for applications that are holding locks for too long or for problems with long transactions.
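The same arithmetic the status script does with csh @ expressions can be sketched as a small POSIX sh function. This is a sketch only: the variable names mirror the script above, the sample numbers are invented, and the integer division matches the csh behavior.

#!/bin/sh
# Sketch of the LOCK HIT % arithmetic from the status script.
# lokwaits and lokreqs would normally be parsed from tbstat -p output;
# they are passed as arguments here so the arithmetic stands alone.
lock_hit_pct() {
    lokwaits=$1
    lokreqs=$2
    if [ "$lokwaits" -gt 0 ]; then
        # integer arithmetic, like the csh @ operator
        echo $(( 100 - (lokwaits * 100) / lokreqs ))
    else
        echo 100
    fi
}

lock_hit_pct 0 500000       # no waits at all: prints 100
lock_hit_pct 25000 500000   # 5% of requests waited: prints 95

Because the division is integer division, small wait counts round away entirely, just as they do in the csh script.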
The LATCH HIT % statistic is similar to the LOCK HIT % statistic and shows you the number of times that latch requests were successful. There is no statistic available for the number of latch requests issued by the engine, so we used the statistics for the total number of ISAM calls. This gives a count of the number of database primitives that were executed. The formula for this calculation is
latch hit % = 100 - (latch waits x 100) / (number of ISAM calls)
This number should also remain at 100 percent. If it starts to go down, look at contention for resources such as buffers, logfiles, etc.
The WAITS line tells you the actual number of times that processes had to wait for checkpoints to complete and for buffers to get free. Use this in conjunction with the LATCH HIT and LOCK HIT percentages if you begin to see slowdowns. The DEADLOCKS number gives the number of times that the engine has taken action to avoid a deadlock situation. This is not the number of times the engine has been deadlocked. Theoretically, a deadlock cannot happen. If your DEADLOCKS count is large, look to your applications. Jobs are failing and are being rolled back because of these potential deadlocks. You have a design error somewhere.
The LOGS lines give you a count of the number of logs that have not yet been freed by being backed up to tape. This count also includes logs that have been backed up but not yet released because they hold information about uncommitted transactions. Watch the number of logs carefully: if it gets close to your total number of logfiles, you could be looking at a long-transaction rollback. Check this screen to be sure that you have continuous backup of logfiles running. If it is running and you still have many logs in use, run tbstat -u and look for transactions that have a lot of reads or writes; they are probably the long ones. The second LOGS line shows you information about the current logfile, in the same format as tbstat -l output.
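The "close to your total number of logfiles" check lends itself to automation. Here is a sketch in POSIX sh: the in-use count would come from the same tbstat -l pipeline the status script uses, both numbers are parameters so the logic stands alone, and the two-log safety margin is an arbitrary choice of mine, not something prescribed anywhere.

#!/bin/sh
# Sketch: warn when the logs-in-use count nears the total number of logfiles.
# inuse would come from: tbstat -l | grep -v RSAM | grep U | wc -l
# total is your configured number of logical logs.
# The two-log margin below is an invented threshold for illustration.
check_logs() {
    inuse=$1
    total=$2
    if [ "$inuse" -ge $(( total - 2 )) ]; then
        echo "WARNING: $inuse of $total logs in use -- check for a long transaction"
    else
        echo "$inuse of $total logs in use"
    fi
}

check_logs 2 6   # prints: 2 of 6 logs in use
check_logs 5 6   # prints the warning line

A check like this could run from cron and mail the DBA, rather than waiting for someone to look at the status screen.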
The CHECKPOINTS lines show you the times of the last six checkpoints. If these checkpoints are closer together than you usually see, look for some jobs running that are causing a lot of logging. Long inserts and long deletes are a common culprit here. Unless you begin to get into logfile problems, there's not a lot that you can do. If you find that the checkpoints occur too frequently, you may need to tune the database.
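"Closer together than you usually see" can be hard to judge from six raw timestamps. A sketch using awk converts the times into the intervals between checkpoints; the times below are the ones from the sample screen output above.

#!/bin/sh
# Sketch: convert the six checkpoint times (HH:MM:SS) into the intervals
# between them, in whole minutes. Times are from the sample output above.
echo "12:59:42 13:31:53 14:04:41 14:22:41 14:56:41 15:33:37" |
awk '{
    for (i = 1; i <= NF; i++) {
        split($i, t, ":")                  # HH:MM:SS -> seconds since midnight
        secs = t[1] * 3600 + t[2] * 60 + t[3]
        if (i > 1) printf "%d min  ", (secs - prev) / 60
        prev = secs
    }
    print ""
}'
# prints: 32 min  32 min  18 min  34 min  36 min

The 18-minute gap stands out immediately, where it is easy to miss in the raw list of times.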
The line immediately after the double lines gives you a quick status of your logging and archiving processes. If you are doing archives using the scripts in this section, you will be piping the output to the archive.log file. If so, when you are doing an archive, the percentage complete will appear after ARCHIVE. If not, it will just read ON or OFF. Continuous backup of logs will be a simple ON/OFF display.
The final section gives you dbspace-by-dbspace figures for percentage used and free pages, followed by a warning about any chunks that are offline. Note that this script does not consider mirror chunks. If you are using Informix mirroring, you may wish to modify the script to show the mirror-chunk data.
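The dbspace report is also easy to scan mechanically. A sketch: the input below mimics the dbdf report format shown in the sample output (the dbspace names are illustrative), and the 90 percent threshold is an arbitrary choice, not anything this chapter prescribes.

#!/bin/sh
# Sketch: flag dbspaces in dbdf-style output that are over 90 percent full.
# In practice the report would come from: sh $INFORMIXLOCALBIN/dbdf
report='rootdbs is 37.34 percent full and has 167305 free pages
dbspace1 is 89.84 percent full and has 166120 free pages
dbspace3 is 92.90 percent full and has 61061 free pages'

# Field 3 of each line is the percent-full figure; compare it numerically.
echo "$report" | awk '$3 > 90 { print "CAUTION:", $1, "is", $3, "percent full" }'
# prints: CAUTION: dbspace3 is 92.90 percent full

Like the chunk-down check, this kind of filter turns a report you must read into a warning you cannot miss.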