Deciding on the service tree

Before configuring anything, it's useful to think through the setup, and doubly so with IT services. A large service tree might look impressive, but it may represent the actual functionality poorly and can even obscure the real system state.

Low disk space is important, but on its own it doesn't bring the system down, so it doesn't affect the SLA. The best approach is likely to include only those checks that show whether a service is available or operating in an acceptable manner; for example, the SLA might require some performance level to be maintained. Unless we want a large, complicated IT service tree, we should identify the key factors in delivering the service and monitor those.
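To make the idea concrete, here is a minimal Python sketch of a service tree whose leaves are only SLA-relevant checks. It is not how Zabbix stores or calculates IT services; the class name, the service names, and the rule that a parent is up only when all of its children are up are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """A node in a hypothetical service tree."""
    name: str
    ok: bool = True  # leaf state, fed by an SLA-relevant check
    children: list = field(default_factory=list)

    def is_up(self):
        # A leaf reports its own state; a parent is up only if
        # every child is up (an assumed propagation rule).
        if not self.children:
            return self.ok
        return all(child.is_up() for child in self.children)


# Only checks that decide availability or acceptable performance
# become leaves. A low-disk-space trigger is deliberately left out:
# it is still monitored elsewhere, but it is not part of this tree.
page = Service("Web page responds")
perf = Service("Page load time within limit")
shop = Service("Online shop", children=[page, perf])
```

With this layout, setting `perf.ok = False` makes `shop.is_up()` return `False`, while a full disk changes nothing in the tree until it actually breaks one of the two leaf checks.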

What are the key factors? If the service is simple enough and can be tested easily, we could have a direct test. Maybe the SLA requires that a website is available; in that case, a simple web.page.get item would suffice. If it's a web page-based system, we might want to check the page itself, log in, and perform some operation as a logged-in user; this is possible with web scenarios.
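The logic behind such a direct test can be sketched outside Zabbix as well. The following Python example is only an illustration of the idea, not a reimplementation of web.page.get or of web scenarios; the function names and the 2-second limit are assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Fetch url once and return (status_code_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 500.
        status = error.code
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, broken response.
        status = None
    return status, time.monotonic() - start


def meets_sla(status, elapsed, max_seconds=2.0):
    """Count the page as available only if it loads, and fast enough."""
    return status == 200 and elapsed <= max_seconds
```

A caller would combine the two, for example `meets_sla(*probe(url))` for whatever URL the SLA covers. Keeping the decision in `meets_sla` separate from the network call makes the availability rule easy to adjust when the SLA also sets a performance level.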

We discussed web monitoring in more detail in Chapter 12, Monitoring Web Pages.

Sometimes, it might not be possible to use the interface directly: maybe it isn't possible to have a special user for monitoring purposes, or we aren't allowed to connect to the actual interface. In that case, we should use lower-level monitoring, concentrating on the main pieces of the system that must be available. We should still attempt to have the highest-level checks possible. For example, we could check whether the web server software is running, whether we can connect to a TCP port, and whether we can connect to the backend database from the frontend system. Memory or disk usage on the database system and low-level database health don't matter from the high-level monitoring point of view. All of that should be monitored, of course, but a delete query rate that is too high usually doesn't affect the top-level service. On the other hand, if a service goes down, the same tree might not show us that it happened because a disk filled up. That is an operational failure, though, and we can expect the responsible personnel to be using such low-level triggers with proper dependencies to resolve the issue.
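When only lower-level checks are possible, the "can we connect to a TCP port" test mentioned above could look like the following Python sketch. It illustrates the idea rather than any Zabbix item; the function names and the check names used with them are hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or the name did not resolve.
        return False


def failed_checks(results):
    """Given a mapping of check name -> bool, list the checks that failed."""
    return sorted(name for name, ok in results.items() if not ok)
```

The results of several such checks, for instance one for the web server port and one for the database port as seen from the frontend system, can be collected in a dictionary and passed to `failed_checks`. An empty list means every piece the service depends on was reachable; memory, disk, and query-rate details stay out of the picture.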
