
Improving the On-Call Experience with Alert Management

Involving the team is critical to establishing “just enough” monitoring that proactively alerts you to real problems without being overwhelming.
Jan 22nd, 2024 6:26am
Featured image by Steve Bidmead from Pixabay.

Your application is experiencing user-facing issues, and the clock is ticking. Who knows first — you or your customers?

The answer depends on how much your organization has invested in monitoring and observability. In a perfect world, you’d have comprehensive monitors in place up and down the stack that cover key aspects of the user experience. They would also notify the right engineer at the right time with appropriate urgency to mitigate any issues before they impact customer experience.

If you’re reading this, you know that is not the world we live in. Far from it. Many organizations rely on reports from coworkers, customers or users to know when there are application issues.

Beware of underestimating the impact of glitches, disruptions and outages. New research shows that customers may forgive occasional outages, but downtime can still cost your organization credibility and some customers. Monitoring can be the difference between building or breaking customer trust, and between enabling or exhausting engineers.

Though it can be tricky to strike the right balance and configure “just enough” monitoring, knowing what questions to ask and anti-patterns to look for can help you make the case for investing in improving the on-call experience.

Why Do We Monitor, Again?

We set up monitors for things that matter to us, like an alarm to wake up in time for work or to report smoke or carbon monoxide in our homes. When it comes to technology, we use monitors to know whether the system is working as expected.

Unlike smoke alarms, there is no standard, one-size-fits-all set of monitors for technology stacks. Which signals you should monitor varies between companies, departments and even teams. It’s not harmful if you’re accidentally sent two of the same marketing email, but it is definitely not acceptable if you are accidentally charged twice for rent. Context deeply matters here.

The expectation with monitoring is that when you’re paged, the issue is urgent, real and requires your direct investigation and intervention. But the gap between that expectation and reality is bigger than the Grand Canyon: 59% of surveyed cloud native developers reported that only half of the alerts they receive are helpful or usable.

I suspect that part of the struggle with monitors stems from their proactive nature. To set up or evaluate a monitor, you need to know what conditions you’re looking for and what threshold indicates an issue. This requires knowledge about the technology being monitored, the monitoring library or technology, operations experience, and access or permissions. No one sets out to create the on-call rotation from hell — it’s just that it’s hard to get monitoring right.
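To make that concrete, here is a minimal sketch of what “condition, threshold and duration” can look like as a Prometheus-style alerting rule. The metric name, service label, 5% threshold and 10-minute window are illustrative assumptions for a hypothetical checkout service, not recommendations to copy:

```yaml
# Sketch only: metric names, labels and thresholds are assumptions.
groups:
  - name: checkout-availability
    rules:
      - alert: CheckoutHighErrorRate
        # Condition and threshold: ratio of 5xx responses to all responses.
        expr: |
          sum(rate(http_requests_total{service="checkout", status=~"5.."}[10m]))
            /
          sum(rate(http_requests_total{service="checkout"}[10m])) > 0.05
        # Duration: only fire if the condition holds for 10 minutes,
        # so a brief blip doesn't page anyone.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout error rate above 5% for 10 minutes"
```

Every value in that rule encodes a judgment call about your system, which is exactly why copying it blindly between teams rarely works.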

How Good Monitoring Goes Bad

The quest to dial in “just enough” monitoring, the kind that keeps customers happy and engineers productive and fulfilled, is never-ending. It is a moving target as your company goes through different stages, adopts new technologies and architectures, and has opinionated engineers join or leave. A well-tuned monitor that works today may not be the right monitor eight months from now, and over time, individual services, monitors or team rotations can veer into one of two extremes: overmonitoring or undermonitoring.

Undermonitoring

Undermonitoring is when there is not enough monitoring in place watching over important operations or workflows. This puts the burden of discovery and investigation, which should be shouldered by automated monitors, on customers and engineers. The hard truth is that when customers have problems with an app or website, they are quick to point fingers. More than half blame the brand over factors like their internet provider or hardware.

While I don’t believe in “stress-free” incidents, my experience is that engineers who are alerted proactively (when the call is coming from inside the house) have a less stressful time investigating issues than engineers who are reactively trying to fix an issue that is already impacting customers.

Symptoms of undermonitoring include:

  • Relying on users or customers to tell you when things are broken.
  • The rotation is suspiciously quiet, and teammates are in the habit of not keeping laptops with them during primary shifts.
  • Using the “scream test” (removing access to see if anyone complains) to validate changes.
  • Issues are almost always addressed reactively, when someone else lets you know.
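If that list sounds familiar, a common first step is a symptom-based monitor on the user-facing path, so the page comes from inside the house instead of from a customer ticket. A minimal sketch, assuming you run the Prometheus blackbox exporter against a hypothetical public health endpoint:

```yaml
# Sketch only: assumes the blackbox exporter probes the endpoint below
# and exports its standard probe_success metric.
groups:
  - name: user-facing-availability
    rules:
      - alert: PublicEndpointDown
        # probe_success is 1 when the probe succeeds and 0 when it fails;
        # averaging over 5 minutes tolerates a single flaky probe.
        expr: avg_over_time(probe_success{instance="https://shop.example.com/health"}[5m]) < 0.9
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Public endpoint has been failing health probes for 5 minutes"
```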

Overmonitoring

Overmonitoring is when a high volume of alerts is firing, alerts are duplicated or grouped improperly (a “page storm”), or there is a high volume of false positives. The biggest impact is on the developers’ on-call experience. Instead of missing issues because there’s no monitoring, they could miss issues because the alerts indicate that everything is broken.

Symptoms of overmonitoring include:

  • Thinking certain alerts are normal and they’re safe to ignore.
  • Your first instinct after getting a page is to investigate whether or not it’s valid.
  • Alerts are added and rarely tuned or removed.
  • Permanently muted alerts are allowed to linger instead of being cleaned up.
  • Monitors are copied from the internet with default thresholds left as is, rather than being adapted to your environment or system or renamed with context about what they monitor.
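Some of this noise is a notification problem rather than a rule problem. Here is a hedged Alertmanager sketch that groups related alerts into a single notification and mutes per-service warnings while a cluster-wide critical alert is already firing; the receiver name and the cluster, service and severity label conventions are assumptions about your environment:

```yaml
# Sketch only: receiver names and label conventions are assumptions.
route:
  receiver: team-pager
  # Collapse alerts sharing these labels into one notification
  # instead of a storm of near-duplicate pages.
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s        # wait briefly so related alerts arrive together
  group_interval: 5m
  repeat_interval: 4h

inhibit_rules:
  # While a cluster-wide critical alert fires, suppress the per-service
  # warnings for the same cluster rather than paging for each of them.
  - source_matchers: ['severity="critical"', 'alertname="ClusterDown"']
    target_matchers: ['severity="warning"']
    equal: ["cluster"]

receivers:
  - name: team-pager   # paging integration (PagerDuty, Opsgenie, etc.) omitted
```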

How to Turn Bad Monitoring Good

If you recognize signs of overmonitoring or undermonitoring, and need to chart a course of action to “just enough” monitoring, this guide is for you. Note that this doesn’t address bigger issues, such as your team owning too many services or unreliability stemming from architecture and design.

If you’re ready to improve on-call duty, my first word of advice is not to do it alone. Effective monitoring takes continuous, intentional effort from everyone involved in the process. On-call and alert management is a shared responsibility for your team or rotation (if multiple teams are involved). Before diving into PromQL and pager stats, think about how to get your team or the rotation on board and what challenges you might experience bringing them in.

Resist the urge to start by looking at the sprawling set of queries you have and hacking away. You are one piece of a large puzzle. Bringing in perspectives from others in your rotation, engineers from dependent teams, and management is critical for effecting change and solving the actual problems.

To get a holistic view of your alerting environment — and improve the overall on-call experience — ask questions of yourself, your team and management. The answers to these questions can highlight the impact of on-call operations and help you advocate for bigger investments and change. Here are some questions to ask.
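When you do get to the PromQL and pager stats, numbers make these conversations easier. One option, sketched here as a Prometheus recording rule (the same expression can be run ad hoc in the Prometheus UI), uses the built-in ALERTS series to see which alerts spend the most time firing:

```yaml
# Sketch only: this counts firing samples of the built-in ALERTS series,
# which approximates time spent firing, not the number of distinct pages.
groups:
  - name: alert-noise-report
    rules:
      - record: team:alerts_firing_samples:7d
        expr: sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))
```

A short, sorted list of the noisiest alert names is often all it takes to focus the team’s first tuning pass.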

Questions to Ask Yourself

It’s time to look in the mirror and be honest with yourself.

  • What has been painful about managing alerts while on call? How has it affected you personally and professionally?
  • What specifically do you want to improve? Quieting pager storms? Decreasing frequent after-hours pages? Lack of documentation and runbooks?
  • What do you think is an attainable goal between now and your next time as primary on-call engineer? What is attainable in this quarter?
  • What is the appetite for change within your rotation or company?

Questions to Ask of Your Rotation

Tolerance for bad on-call experiences varies widely, even between people in the same rotation. On-call duty spans from professional working hours into our precious personal time. This wide and deeply personal variance is why it’s important to talk through standards and expectations for your specific on-call rotation.

I developed a very simple on-call “feels” survey that can be a great launchpad for further discussions about the state of your rotation’s experience across your team:

Ask everyone to rate the following statements on this scale:

Strongly Agree | Agree | Undecided | Disagree | Strongly Disagree

  • I have been able to get overrides and swaps when needed.
  • I am able to decline project sprint work and interviews during primary rotation.
  • I feel confident picking up the pager and going on call.
  • I am confident that when I am paged, the alert will be actionable.
  • At the end of my primary rotation week, I felt burned out by the demands of on-call duty.
  • I am able to dedicate time to proactively improving on-call production during primary rotations.
  • I feel that my manager has sufficient insight into my on-call duties.
  • Finally, ask an open-ended question: What are the top three processes that should be reexamined this quarter? For example, deployment freezes, on-call handoffs, tuning alerts, standardizing runbooks, on-call holiday compensation, production readiness checklists, etc.

Questions to Ask Teams in Your Service Neighborhood

If you’re working in a cloud native environment with many containerized microservices, you’re familiar with managing and understanding dependencies. There are technical dependencies between your services and other teams, between the cloud infrastructure and your applications, and between your service and third-party APIs. Your orbit involves the teams and services that depend on you and those you depend on — I call this your “service neighborhood.”

If you’re an app developer, it’s key to understand the line of responsibility between you and the team(s) managing infrastructure for you (e.g., the Kafka team, the Kubernetes team). What should you be alerted for, what should they be alerted for, and in what scenarios would it make sense for both parties to be paged? Can they share any runbooks, or helpful queries or dashboards to bookmark? Will they review your set of monitors and provide feedback?

Whoever is closest to the component or technology is best-suited to give monitoring recommendations.
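One concrete output of these conversations is encoding that line of responsibility into alert routing, so infrastructure alerts page the platform team and application-level symptoms page you. A sketch, assuming an agreed-upon team label on alert rules and placeholder receiver names:

```yaml
# Sketch only: the "team" label and receiver names are assumptions agreed
# on with the teams in your service neighborhood; integrations omitted.
route:
  receiver: app-team-pager          # default owner for unmatched alerts
  routes:
    - matchers: ['team="kafka-platform"']
      receiver: kafka-team-pager    # broker and infrastructure alerts page them
    - matchers: ['team="app"']
      receiver: app-team-pager      # application symptoms page your rotation

receivers:
  - name: app-team-pager
  - name: kafka-team-pager
```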

Questions to Ask Management

Talking to management about the impact on you and your work, on the customer experience, and on the ability to execute on organizational business goals will help you be more effective in improving the on-call experience. This isn’t about engineering managers or product or project managers not “being technical enough” or being unable to empathize with the demands of on-call duty; it’s about bridging the gap between what you care about and what they care about.

Ask management:

  • How many times did reactive firefighting mean missing project deadlines or deliverables?
  • How has on-call duty affected your ability to perform work or your work-life balance?
  • What are the business risks associated with not addressing on-call and monitoring issues?

This conversation is a two-way street. Ask engineering managers and product managers to help answer the question: “What impacts to the user experience or product functionality within our team’s domain are urgent enough to wake someone up for?” This helps to scope and prioritize your efforts.
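Once you have an answer, it can be captured as a severity convention on your alert rules, so urgency is decided once, in daylight, rather than debated at 3 a.m. A sketch under the assumption of a two-tier convention and hypothetical recording rules for the error ratio:

```yaml
# Sketch only: the recording rule names, thresholds and severity values
# are assumptions to be agreed on with your team and management.
groups:
  - name: payments-urgency
    rules:
      - alert: PaymentFailuresFast
        expr: job:payment_failure_ratio:rate5m > 0.05
        for: 5m
        labels:
          severity: page       # urgent enough to wake someone up
      - alert: PaymentFailuresSlow
        expr: job:payment_failure_ratio:rate6h > 0.01
        for: 1h
        labels:
          severity: ticket     # important, but handled during business hours
```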

Closing Thoughts

Effective monitors can be the difference between mobilizing a proactive incident response and a haphazard, reactive one. Customers and seasoned engineers with institutional knowledge are not an infinitely renewable resource, and companies have a responsibility to do right by them by investing in effective monitoring. Lay the foundation for improving your on-call experience today by gathering requirements from all perspectives before launching improvement initiatives.
