Having an on-call schedule

Once alerts are configured and start firing, it makes no sense to suppress them before 8 AM and after 5 PM. In other words, alerts of a certain severity must be followed up even outside business hours.

In many companies where having alerts is new, there is some form of implicit expectation that some people will be available outside of office hours (alongside their regular duties) to handle these alerts. Sometimes, when an alert is raised only once or twice a year, and there are no agreements about response times, this might not even be a problem at all.

However, in many organizations, especially over time, there is an expectation that these alerts are responded to within a certain period. In addition, the number of alerts may increase as systems become larger and more complex, or as the number of systems grows.

The way to cope with this is to create an on-call schedule, together with formal agreements on what is expected of engineers and how the organization will reward them for their efforts. This lets the organization set clear expectations and lets engineers guard their free time based on these agreements. Having enough downtime helps engineers relax between periods of higher stress, so that they stay alert while on call and are ready to react when it is expected of them.

There is much material available on what constitutes a healthy on-call schedule and what doesn't, and the keyword here is healthy. Some general pointers are as follows:

  • Those who are on call during non-business hours should not be on call during business hours as well.
  • Provide engineers who are on call with reasonable compensation for staying close to a phone, remaining sober, and so on. What is reasonable differs from situation to situation, but the more demanding being on call is, the higher the compensation should be.
  • Provide the proper tools for being on call. For example, when a response time of 30 minutes or less is expected, provide those on call with a backpack with a laptop, phone, and means to connect to the internet.
  • Ensure that each employee is off call for at least 75% of the time.
  • Allow employees to take time off in lieu, so they can start work later if they had to respond to an alert overnight.
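
The 75% pointer above can be checked automatically against a planned rotation. The following is a minimal sketch in Python, not a reference to any particular scheduling tool: the Shift type, the function names, and the 0.25 threshold parameter are all hypothetical, and it assumes that shifts for the same engineer do not overlap.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Shift:
    """One on-call shift (hypothetical structure for illustration)."""
    engineer: str
    start: datetime
    end: datetime  # exclusive


def on_call_fraction(shifts, period_start, period_end):
    """Return {engineer: fraction of the period spent on call}.

    Assumes shifts of the same engineer do not overlap each other.
    """
    total = (period_end - period_start).total_seconds()
    seconds = defaultdict(float)
    for shift in shifts:
        # Clip each shift to the period under review.
        start = max(shift.start, period_start)
        end = min(shift.end, period_end)
        if end > start:
            seconds[shift.engineer] += (end - start).total_seconds()
    return {name: value / total for name, value in seconds.items()}


def overloaded(shifts, period_start, period_end, max_fraction=0.25):
    """List engineers who are on call more than max_fraction of the period."""
    fractions = on_call_fraction(shifts, period_start, period_end)
    return sorted(name for name, f in fractions.items() if f > max_fraction)


def weekly_rotation(names, first_day):
    """Build consecutive one-week shifts, one per name, in the given order."""
    week = timedelta(days=7)
    return [
        Shift(name, first_day + i * week, first_day + (i + 1) * week)
        for i, name in enumerate(names)
    ]
```

With a four-week period, a weekly rotation over four engineers keeps everyone at exactly 25% on call, so overloaded returns an empty list. If one of them covers two of the four weeks, that engineer is reported. A check like this only covers the numerical pointer; compensation, tooling, and time off in lieu still have to be agreed on between people.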

After every disturbance of the normal operation of a system, whether this is during business hours or after, a live site incident review can be performed to learn what happened and how to reduce the chance of it happening again.
