Monitoring and Troubleshooting

The Most Common Cause of Database Crashes for OnLine Systems

In an environment with many programmers who are familiar with UNIX, your most common problem will occur when somebody kills an engine process. Informix does not take kindly to the untimely death of a sqlturbo engine or an oninit process. This often causes the database to abort, necessitating a restart and recovery that can take anywhere from minutes to hours. In general, if a sqlturbo process is killed with a kill -9 command while it is in a critical section of code or while it is holding a latch, the engine will abort.

This problem is best solved by enforcing a rule that no one but the DBA may kill -9 a sqlturbo process. It's safe for anyone to kill -15 a process. The catch is that a process given a kill -15 will not necessarily die immediately. If the sqlturbo process goes into rollback because it received the signal in the middle of a transaction, it will hang around until the rollback completes. An impatient user may try a few more kill -15 commands and finally go for a sure kill with kill -9. If the system is in rollback, OnLine will probably crash; if the process is holding any locks or is in the middle of a critical write, the OnLine system will crash. The Informix command tbmode -z works the same way: the job doesn't immediately disappear.

Both kill -15 and tbmode -z are safe to use at any time, as long as the user is prepared to be patient. It's usually best to have users check with the DBA if their jobs don't die within about 15 minutes.

Processes that take longer than about 15 minutes to die can pose a quandary even to the DBA. It's possible that the query is a disjoint query and the system is trying to process millions of rows. You could be in a long transaction that is on the verge of either going into forced rollback or filling up all of your logs and causing a crash. If the DBA can't identify exactly what's happening, it is usually best to let the process finish its job, unless this will cause other problems in the database such as filling up the logs. If the runaway job begins to affect performance or response time, the DBA usually needs to kill it.

By the time the DBA sees the typical rogue sqlturbo process, the user has probably already tried to kill it with kill -15 or tbmode -z, and it's probably in rollback. Rollback can be identified by the flags --R--X-- in the tbstat -u output. The UNIX ps command will tell the DBA whether the sqlturbo process is getting any CPU time. If so, the rollback is in progress. No matter what you do, the rollback has to complete. It will either complete with the system online, or it will complete during the recovery stage of database startup. Online is usually best.
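
A quick hedged check, assuming the flags, pid, and user name are the second, third, and fourth columns of your tbstat -u output (verify against your own system first):

    # List user threads whose flags show a rollback in progress.
    tbstat -u | awk '$2 ~ /R/ { print "pid " $3 " (" $4 ") is rolling back" }'
    # Hypothetical PID: run this twice, a minute or so apart; if the
    # TIME column climbs, the rollback is making progress.
    ps -fp 12345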

The worst case for the DBA is a process written in ESQL/C, or one that for some other reason responds only to a kill -9. Here, if you absolutely have to stop the process, a kill -9 command is your last resort. If your UNIX operating system has process control (do you have a bg command to put a job in the background?), read the next paragraph before doing a kill -9. Otherwise, do it and cross your fingers.

Some UNIX operating systems that support placing jobs in the foreground or background offer options to the kill command that could give you an out in this situation: signals that stop execution of a process and later restart it. On a Pyramid, these options are kill -STOP and kill -CONT. They are not universal, and they are not uniform between versions of UNIX, so check your manpage with man kill to see whether you have them.

If you can stop the process with job control or with your version of kill -STOP, the UNIX ps command will eventually show the process getting no time and sitting in a sleeping state. Now use tbstat -u and check the flags for the sqlturbo you want to kill. An X in the flags means the process is in a critical section of code. If you see one, use your equivalent of kill -CONT to restart the process, then keep stopping and starting it until ps shows it getting no time and the tbstat -u flags show no X.

Next, note the address column of the tbstat -u output for the sqlturbo you are trying to kill, run a tbstat -s command, and look for that address in its address column. If the address is there, your process is holding a latch, and a kill -9 will abort the engine. If the process is not holding a latch, you are safe in doing a kill -9. And if this procedure doesn't work and you really needed to kill the process, you haven't lost anything: you would have had to run the kill -9 anyway.
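
Here is a minimal sketch of that stop-and-check cycle as a shell script, assuming an OnLine system where the address, flags, and pid are the first three columns of tbstat -u output (check yours before trusting it):

    #!/bin/sh
    pid=$1                          # the sqlturbo you want to kill
    kill -STOP "$pid"
    # Pull this backend's address and flags out of tbstat -u.
    set -- `tbstat -u | awk -v p="$pid" '$3 == p { print $1, $2 }'`
    addr=$1 ; flags=$2
    [ -z "$addr" ] && { echo "pid not found in tbstat -u" ; exit 1 ; }
    case "$flags" in
    *X*)                            # still in a critical section
        echo "critical section; restarting it, try again later"
        kill -CONT "$pid" ;;
    *)  if tbstat -s | grep "$addr" > /dev/null
        then
            echo "holding a latch; kill -9 would abort the engine"
            kill -CONT "$pid"
        else
            echo "no X flag, no latch; kill -9 should be safe"
            kill -9 "$pid"
        fi ;;
    esac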

No matter what you do, no matter how careful you are, you will crash like this sometimes. If the consequences are bad enough, the users usually learn to make the DBA kill the tough jobs.

Most of these problems have gone away with the introduction of the multithreaded architecture of the IDS engines. The only jobs a user could kill would be the various oninit processes that make up the engine itself. Killing one of these has nothing but bad effects, the best being an immediate crash of the system and the worst being the need to restore from an archive. There's no point in even considering killing an oninit process unless it's the last thing you try prior to rebooting your system.

Thawing Out a Frozen System

One of the true tests of the DBA's understanding of the system occurs when everything suddenly freezes up. Nothing is getting done. Users find their screens just sitting there. Jobs either fail or disappear into the ether. The calls start coming in, and the DBA has to go to work.

After learning that there is a problem, the first thing to do is to isolate it. Is the operating system up? If you are getting no screen response to your keystrokes, the problem is probably somewhere upstream of Informix. Either the OS is down or there is some sort of communication problem between the terminals and the system. How about the network? Call the system administrator and/or the network guru.

Assuming that you can actually communicate with the computer, run the Informix tb/onstat -u command. If it comes back with a shared memory error, your engine is down. Go into your online.log (or whatever you are naming it in your $TB/ONCONFIG file) and look at the bottom of the file. Look for lines with the word abort in them.
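
For example (the path here is just an illustration; your log lives wherever the MSGPATH parameter in your $TB/ONCONFIG file points):

    # Check the tail of the message log for abort messages.
    tail -100 /usr/informix/online.log | grep -i abort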

If you can identify the job that caused the abort, you will probably need to chase down the culprit and string him up by his thumbs. The username script is useful here.

Other than as a postmortem investigation, there's nothing else you can do. If the system has crashed, use tb/onmonitor or tb/onmode or one of the NT GUIs to bring it back up and hope that you aren't facing a long recovery period.

If you do not get a shared memory error when running tb/onstat -u, take a look at the flags column of the output. If you haven't committed all the flags to memory, run the seeuser script to look at the output with the added "training wheels" of informative column headings. The first character of the flag field will usually indicate the problem. If many or most of the processes show a "C," an "S," a "G," or an "L" in the first field, these jobs are waiting for something to happen. Usually, you can correlate this information with the information in the tb/onstat header and figure out why the engine is waiting.
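
For a quick census of those first flag characters, a hedged one-liner (this assumes the flags are field 2 of tbstat -u and that real user entries have ten fields; verify against your own output):

    tbstat -u | awk 'NF >= 10 && $1 != "address" { print substr($2, 1, 1) }' |
        sort | uniq -c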

If you are still in the dark about the situation, go back to the online.log. Look toward the end, or at the approximate time of the initial problem report. Look for any kind of error message that you aren't used to seeing. This implies that you have to know what is normal and what is odd. That's why you should always be monitoring your online.log. Any time you shut the system down or start it up, take a peek at the online.log file and verify that all is OK. Look for anything out of the ordinary.

If you don't find anything by looking at the online.log, run a few tb/onstat commands. Run a tb/onstat -p and see if anything is out of place. You may find high numbers in the deadlocks section, or in the ovlock, ovuser, or other resource-limitation sections. Run the status program. It will help you spot anything that looks out of place.
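
The raw counters mean more when you can see which ones are climbing. A simple monitor loop (tbstat on OnLine, onstat on IDS; the file name is arbitrary):

    while true
    do
        date
        tbstat -p               # onstat -p on IDS
        sleep 60
    done >> /tmp/profile.log    # interrupt with ^C when you're done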

If you're still frozen up, run a tb/onstat -l command. Are your logfiles full? Shame on you. You probably haven't backed up your logs, or you've failed to turn on continuous backup of logfiles. Go in and do a manual backup of your logfiles and you'll be moving again.
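
Something like this, with the OnLine commands shown (substitute onstat and ontape on IDS):

    tbstat -l        # a full logfile shows 100.00 in the %used column
    tbtape -a        # back up all of the full logical logs now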

Still lost? Try running a few iterations of watch_hot and see if your system is getting any disk activity. If it is, something's going on. Do you know what it is? Maybe it's a backend run amok or a disjoint query. Run your UNIX ps command. Do you see any processes with lots of time on them? They may be the culprits. If these are oninit processes, don't be alarmed at high user times. That's normal. What's not normal is if these oninit processes are sucking up all of your CPU time. This usually means that some massive query is running or that the engine has shunted itself out into the bushes and needs to be rebooted.
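
A rough way to surface the time hogs, assuming your ps supports the POSIX -o option (column names and sort behavior vary between UNIX flavors):

    # Accumulated CPU time, biggest first; the lexical sort is crude
    # but good enough to spot a runaway.
    ps -eo time,pid,user,args | sort -r | head -15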

If you see some backends that are running wild, trace them down using ps. If all of your jobs are run as individual users, the job will be easier. Who owns the sqlturbo? What else is she doing? One frequent cause of sqlturbos running amok is INFORMIX-4GL, which sometimes leaves the sqlturbos running if the 4GL process is killed or orphaned. There is no known cure for this other than a DBA (or a script) that watches for processes that are running with no frontends attached.
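
A minimal sketch of such a watcher: a backend whose local frontend has died is usually inherited by init, so its PPID drops to 1. This only catches frontends on the same machine, so treat it as a starting point:

    ps -ef | awk '$3 == 1 && /sqlturbo/ { print "orphaned backend, pid " $2 }'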

In IDS systems, you won't see sqlturbo processes. Instead, look at onstat -g sql to see what is running. Beware that on very busy SMP systems, onstat -g sql may never return: the system tables that this command accesses may be changing so fast that onstat can't keep up with them. Do a few onstat -D commands and see where the disk reads and writes are occurring. If nothing's happening but the CPU is still hogged up, suspect an engine run amok. On IDS systems, check for any jobs that are running with a very high PDQPRIORITY. These jobs may be hogging all of the decision-support memory. Run an onstat -g mgm and look for jobs that are held up at one of the gates.
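
A small sketch of that sequence (the temp file names are arbitrary):

    # Snapshot per-chunk disk I/O twice and compare.
    onstat -D > /tmp/d1 ; sleep 30 ; onstat -D > /tmp/d2
    diff /tmp/d1 /tmp/d2     # chunks whose read/write counters moved
    onstat -g mgm            # queries queued at the memory grant manager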

If all of your jobs are run as a particular user, it may be more difficult to track down the particular job causing a problem. Look at the PID (process ID) and PPID (parent process ID) of the offending backends. Trace the ownership of the offender. The offender's PPID will be the PID of the job that started it. You may have to trace it back through several levels, but eventually you'll come up with some job that started the problem.
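
A hypothetical helper for that climb, assuming a ps that supports the POSIX -o and -p options:

    #!/bin/sh
    pid=$1                      # the PID of the offending backend
    while [ "$pid" -gt 1 ]
    do
        ps -fp "$pid"                             # show this process
        pid=`ps -o ppid= -p "$pid" | tr -d ' '`   # step up to its parent
        [ -z "$pid" ] && break                    # parent already gone
    done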

Or maybe you won't. Oftentimes a job that starts a series of processes will either be killed when the user realizes it is doing the wrong thing, will simply die of its own accord, or will consume so many operating system resources that the kernel kills it. These "orphan" or "zombie" processes often cause problems with Informix, as the child processes can continue to run even if the parent is no longer viable. In these cases, you often have to take a chance on killing the process or just let it run until the operating system finally gets around to killing it.

Jobs That Just Fail

Sometimes you will need to troubleshoot problems that are less drastic than the wholesale freezing up of your system. Maybe a program is not behaving as planned, or maybe a job being run by one of your users fails for no apparent reason.

The approach is essentially the same as listed above except that you are often under less pressure to solve it because everyone else is still running normally. Avoid the temptation to put the troubleshooting job off until later. Usually when something like this comes up, it is a warning about other potential problems that are just lurking around and waiting for a chance to grab you.

In solving problems with specific jobs, it is most important that you get a full report of the symptoms and error messages from the user. Have them write down any error messages they get, or better yet, tell them not to touch the terminal until you have a chance to come see it. Don't ever assume that just because a user tells you something, it is correct, complete, or relevant. Users often think they are doing one thing while they are really doing something completely different.

You still follow the logical progression of identifying and isolating the symptoms, checking the logs and utilities, forming a theory about the problem, and acting on the theory. Make lots of use of your logfiles. If the user's job has methods of logging errors or tracing execution, use them.

Many, if not most, of your failing jobs are caused by inconsiderate or uninformed user activity: that is, by a user who is not following the general rules of courtesy and sanity in a multiuser database environment.

Some of the rules that are often broken are:

  • Careless use of transactions

  • Failure to promptly commit or roll back a transaction

  • Creation of long transactions

  • Trying to update or delete from a large, heavily used table while the system is busy

  • Careless query formulation

  • Creation of disjoint queries that return billions of rows

  • Failure to lock a table in exclusive mode while doing mass inserts into it or deletes from it, causing the system to exceed the allowed number of locks (see the sketch after this list)
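
For that last item, here is a minimal sketch of the polite way to run a mass delete, fed through dbaccess. The database, table, and date are hypothetical, and remember that a huge delete is itself a long-transaction risk in a logged database, so make sure it fits in your logical logs:

    # Lock the table exclusively so the delete takes one table lock
    # instead of thousands of row locks.
    echo "
    BEGIN WORK;
    LOCK TABLE order_items IN EXCLUSIVE MODE;
    DELETE FROM order_items WHERE ship_date < '01/01/1990';
    COMMIT WORK;
    " | dbaccess stores7 -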

If you find that the job that is failing is a long one or one that acts upon many rows, suspect either locking problems or long transaction problems. Either one should show up in the online.log as well as in specific error messages returned by the failing job.

If all else fails, consider whether your problem is really in the engine at all. Could you be reaching some sort of resource limits in the operating system? Could it be a terminal problem or a glitch on the line?

Above all, try not to be too dogmatic in your approach to troubleshooting. Try to be skeptical and flexible. Some of the worst troubleshooting problems are caused by red herrings. All the evidence seems to be pointing to one particular area. You've recently been having problems in just that area, and now, here it is again! Don't be too quick to make that assumption. It often seems that system problems come in clumps with complex failures of multiple systems at once. Unless you keep an open mind and look at each problem as an individual entity, you'll find yourself, like Alice, "chasing rabbits."
