Strategy: make on-call obsolete, not convenient
Solving the inherent on-call problems is far more important (and more urgent) than improving the alerts’ UX,
e.g. having proper documentation per alert (a playbook) or including a link in every alert to the
Why? It’s a paradox: if you’re doing a good job, you’ll handle fewer incidents, both during and outside work hours, so on-call documentation and how-tos get used rarely and investing in them for every alert pays off less. We’ll invest in alerts’ UX mostly where the risk is too high (often after the fact, or beforehand with “Game Days”, more on that later), rather than for every alert we have.
This strategy makes it a requirement for everyone to learn and understand how to build and operate resilient distributed systems. Operational health is part of every estimate an engineer gives and part of every Design Review we present.
Operational health will include tests (local strategy, production strategy), monitoring, and alerts we seek to have to avoid the on-call burden.
As part of making on-call obsolete, Mature Products (internal link redacted) will have explicit and visible SLOs (Service Level Objectives) to our clients around availability, latency, throughput, etc. Products will also include SLOs for the team that operates them, e.g. for the number of low-level and high-level alerts, during work hours and off work hours. These SLOs should be monitored and automatically reported weekly (email the team) to track progress.
The only failures we won’t be able to automate are those that are too expensive or too rare. In those cases, nothing will save us from calling our “domain expert” and getting them to join us during an on-call incident. After these incidents happen, doing a team retrospective and “lessons learned” from it, including documenting it, will be extremely meaningful and effective.
Measure where we are now
Each Product should have an Operational Health dashboard in Kibana/Honeycomb, i.e. “[Product] Operational Health”. This will be based on events gathered from PagerDuty, and will be provided as part of Forter’s Observability stack.
Internal measurements baseline (dashboard):
- Daily number of high alerts during work hours
  - Why: see how noisy (good and bad) our alerts are. This is a good proxy for how much investment we’ll need to make these alerts obsolete as the system becomes more robust. Question the value/usefulness of each alert as a high alert that might wake someone up.
- Daily number of low alerts during work hours
  - Why: see how noisy (good and bad) our alerts are. This is a good proxy for how much investment we’ll need to make these alerts obsolete as the system becomes more robust. Question the value/usefulness of each alert.
- Daily number of high alerts off work hours
  - Why: understand how often people get paged and need to act.
- Daily number of false alerts off work hours, i.e. waking up to see that the problem was auto-resolved, or non-actionable alerts, i.e. waking up to find that an alert is “notification” only.
  - Why: it is frustrating to get paged only to learn I’m not required. This is noise our system should have resolved without us, either in the alert system (via state, e.g. “service A reported error X 5 times in the past T seconds”) or in the production system (via state, e.g. count failures and report a special event if the count is above X).
- Weekly number of false-negative alerts off work hours, i.e. we didn’t wake up while the system was in a bad state
  - Why: spot areas where we lack alerts that could have saved us money or honored our uptime/latency SLA. The assumption is that it’s very rare, so long it is only when it’s clear (part of BetterNext) – we usually detect those in the morning, by an analyst or engineer.
  - We should manually report that to some spreadsheet or push it via Betty (bot) to our ELK with proper tagging.
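The baseline above can be sketched as a small aggregation over alert events. This is a minimal illustration, not the dashboard’s actual implementation: the `Alert` fields, the 09:00-18:00 Monday-to-Friday working hours, and the metric names are all assumptions, and real PagerDuty events would first need to be mapped into this shape (including time zone handling).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, time

# Assumed working hours; adjust per team and convert timestamps to local time first.
WORK_START = time(9, 0)
WORK_END = time(18, 0)
WORK_DAYS = {0, 1, 2, 3, 4}  # Monday..Friday, as returned by datetime.weekday()


@dataclass(frozen=True)
class Alert:
    """Hypothetical, simplified alert event (not the PagerDuty schema)."""
    triggered_at: datetime
    urgency: str       # "high" or "low"
    actionable: bool   # False if auto-resolved or notification-only


def is_work_hours(ts: datetime) -> bool:
    return ts.weekday() in WORK_DAYS and WORK_START <= ts.time() < WORK_END


def daily_baseline(alerts):
    """Count alerts per (date, metric), e.g. (2024-01-08, "high_off")."""
    counts = Counter()
    for alert in alerts:
        day = alert.triggered_at.date()
        period = "work" if is_work_hours(alert.triggered_at) else "off"
        counts[(day, f"{alert.urgency}_{period}")] += 1
        # False alerts are tracked only off work hours, where they cost the most.
        if period == "off" and not alert.actionable:
            counts[(day, "false_off")] += 1
    return counts
```

False negatives cannot be derived from alert events, since by definition no alert fired; they still have to be reported manually as described above.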
Tactics to consider
Systems act differently outside of working hours
When you build Products, you should plan for their Operational strategy to look different at different hours. Our scale-out mechanisms are tested mostly during hours of high traffic, which usually means they are exercised far less during off hours.
The system should be aware of the work/off hours of the team that supports it. The system should change its scale strategy given the amount of traffic OR the on-call shifts of the team. The former will save us money, the latter will save us energy/morale. In many cases, paying more for extra capacity during off-hours will make sense, if we know how to cap it, e.g. not more than $X/day.
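As a rough sketch of this idea, the function below adds headroom when the supporting team is off hours and caps the extra spend at a daily budget. Every name and number here (`ScalePolicy`, the headroom factor, the cost figures) is an assumption made for illustration, not a description of an existing system.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    """Illustrative values only; none of these come from a real system."""
    requests_per_instance: float   # sustainable requests/second per instance
    off_hours_headroom: float      # extra capacity fraction while the team is off
    cost_per_instance_hour: float  # USD
    max_extra_cost_per_day: float  # the "$X/day" cap on off-hours headroom
    off_hours_per_day: float       # hours per day the headroom is kept running


def desired_instances(policy, requests_per_second, team_is_off_hours):
    """Return the instance count for the current traffic and team shift."""
    baseline = max(1, math.ceil(requests_per_second / policy.requests_per_instance))
    if not team_is_off_hours:
        return baseline
    wanted_extra = math.ceil(baseline * policy.off_hours_headroom)
    # Cap the extra instances so that running them for the whole off-hours
    # window stays within the daily budget.
    cost_per_extra_instance = policy.cost_per_instance_hour * policy.off_hours_per_day
    affordable_extra = math.floor(policy.max_extra_cost_per_day / cost_per_extra_instance)
    return baseline + min(wanted_extra, affordable_extra)
```

With a policy of 100 requests/second per instance, 50% headroom, $0.50 per instance-hour, a $30/day cap and 15 off hours per day, 1000 requests/second needs 10 instances during work hours and 14 off hours: the headroom asks for 5 extra instances but the cap only pays for 4.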
Systems act differently under load
The system should know if it’s under a heavy load and change its behavior. Sometimes load can be expected (e.g. holidays sales). Use that information.
Why? Retries are a dangerous solution for distributed systems under load, as we can completely DoS ourselves with such policies. Yet under low load, a retry is an extremely simple and powerful solution.
Given an external system that monitors throughput or other indications of load (e.g. a holidays calendar) and updates a dedicated storage every 30 seconds, we can change how our systems behave, in the same spirit as Circuit Breakers (open, closed, half-open).
We can turn to a more naive solution for X minutes until we see the load is back to normal, e.g. use more fast path optimization, move more traffic to low queue, don’t allow certain resource consuming flows, report alerts as a low priority instead of high priority, etc.
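A minimal sketch of this behavior switch follows. It is not a full circuit breaker with open/closed/half-open states; it only shows a caller consulting a shared load signal to decide whether to retry and how to prioritize alerts. `LoadSignal`, the 90-second staleness limit, and the choice to treat a stale signal as heavy load are assumptions, and a production retry would also want backoff and jitter.

```python
import time
from dataclasses import dataclass
from enum import Enum


class LoadState(Enum):
    NORMAL = "normal"
    HEAVY = "heavy"


@dataclass(frozen=True)
class LoadSignal:
    """Hypothetical record written by the external load monitor."""
    state: LoadState
    updated_at: float  # epoch seconds of the last update


MAX_SIGNAL_AGE_SECONDS = 90  # assumption: three missed 30-second updates


def effective_state(signal, now):
    """Treat a stale signal as HEAVY, i.e. fail towards the cautious behavior."""
    if now - signal.updated_at > MAX_SIGNAL_AGE_SECONDS:
        return LoadState.HEAVY
    return signal.state


def call_with_policy(operation, signal, now=None, max_retries=3):
    """Retry only under normal load; under heavy load, try exactly once."""
    now = time.time() if now is None else now
    heavy = effective_state(signal, now) is LoadState.HEAVY
    attempts = 1 if heavy else 1 + max_retries
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # sketch only: real code should catch narrower errors
            last_error = exc
    raise last_error


def alert_priority(signal, now=None):
    """Downgrade alerts to low priority while the system is under heavy load."""
    now = time.time() if now is None else now
    return "low" if effective_state(signal, now) is LoadState.HEAVY else "high"
```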
Learn from our Domain Experts via “Game Days”
One way to proactively deal with the paradox mentioned above is to practice “Team Game Day” (example: here). This will get the domain expert to create a few scenarios where the team can practice their intuition (“what do you think will happen?”) and explain how they’d debug it if it happens.
The domain expert can share their knowledge and explain how they’d approach it if it happens. The team can finish the day by writing proper documentation that mostly covers how to quickly reduce the number of options to look for (how to eliminate potential issues during an incident) and how to use the current tools the team has – important dashboards, how to read alerts, etc.
Audit Existing Alerts
Existing systems have lots of alerts that are non-actionable, non-urgent, indicate an indirect symptom, or should be aggregated along a dimension, e.g. all machines going down in a single AvailabilityZone will page one-by-one. Auditing these alerts and deleting, de-duplicating, rephrasing, and reassigning them can reduce the team’s load as a proactive measure.
This should come hand-in-hand with a team constantly feeling like they are the sole owners of the alerts they receive and, as such, are trusted to delete those alerts they deem to be irrelevant (with a review process to be decided inside the team).
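To illustrate aggregation along a dimension, the sketch below collapses alerts that share a name and a dimension value (here an availability zone) into a single page. The alert dictionaries, their keys (`name`, `host`, `availability_zone`), and the threshold of 3 are hypothetical; real alerting tools usually offer their own grouping rules, which are preferable when available.

```python
from collections import defaultdict


def aggregate_by_dimension(alerts, dimension, min_group_size=3):
    """Collapse alerts sharing (name, dimension value) into one summary page.

    Groups smaller than min_group_size are passed through unchanged.
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["name"], alert.get(dimension))].append(alert)

    pages = []
    for (name, value), members in groups.items():
        if value is not None and len(members) >= min_group_size:
            pages.append({
                "name": name,
                dimension: value,
                "count": len(members),
                "hosts": sorted(m["host"] for m in members),
            })
        else:
            pages.extend(members)
    return pages
```

For example, three `host_down` alerts from the same zone become one page that lists the three hosts, while a single `host_down` alert from another zone still pages on its own.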
Supporting urgent changes (deployments) is not on-call
The need (and Operational thinking) here is completely different. There is no technical problem
involved but rather a business one. This should be measured separately and tools should be built to
tackle that either automatically or by empowering our clients to safely and securely make the change.
The team should be called only if the change failed to happen safely due to some technical problem.
Some systems were born from this need: Spectator and Confetti (internal systems) are just two examples. Consider if we need more (or better) tools to cover use-cases that happen often.
Consider letting your clients deploy the changes if the deployment is automated.
If this is very common in your team, you should track how often it happens and when, and set KPIs to track and solve them. Solutions will be very different, and usually involve tools and training our clients on how to use them effectively.
Moving away from fire fighting to cruise control?
Owners: Engineering Managers
1. Make sure you have Operational Health dashboards for your Products. Use data from PagerDuty to describe the team’s on-call burden. Make it visible to the team and to your clients. Use data to understand your Products’ stability and performance; don’t rely on your memory.
2. Set an Alerting SLO for your Products and team (a team owns a suite of products): e.g. no more
than 1 high alert during off-hours every month, no more than 10 high alerts during work hours a
month, etc. Cover both high and low alerts. Alert fatigue is dangerous.
3. SLO breached? Understand what can be done immediately. Treat it just like any other production incident, and delay current efforts while communicating with stakeholders. Engineering managers should stand behind this and verify it happens.
4. Group alerts by system and run a diagnostic to understand what can be done to resolve these alerts completely. Use the strategies above if needed. Bring an estimate of how much effort it will take versus how many alerts (or how often) it will solve.
5. Invest in these areas every quarter. Make it part of your quarterly planning. Chiefs, Tech
Leads, and Engineering Managers should represent and push to prioritize that with their clients.
If the team is busy with firefighting, it cannot work on new efforts. It’s in your clients’ best
interest for you to focus on stability to get the team focusing (later on) on adding value.
6. Set clear internal SLOs for your Products and Services. Explicitly “advertise” them to Engineering. Start the process towards only high-alerting on things that directly impact any of the SLIs. This, as a long-term strategy, will give teams the psychological safety they need to remove “bad” alerts.
This is an ongoing effort. Most probably, though, you should stop now and invest in the first three items, and then make sure the fourth and fifth are taken into account every quarter.
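Checking an Alerting SLO can be as simple as comparing monthly counts against limits. The sketch below uses the two example limits from the list above; the metric names and the shape of `monthly_counts` are assumptions, and limits for low alerts would need to be added by each team.

```python
# Example limits taken from the text; the metric names are hypothetical.
ALERTING_SLO = {
    "high_off_hours": 1,    # no more than 1 high alert during off-hours per month
    "high_work_hours": 10,  # no more than 10 high alerts during work hours per month
}


def slo_breaches(monthly_counts, slo=ALERTING_SLO):
    """Return a readable line for every metric that exceeded its monthly limit."""
    breaches = []
    for metric, limit in slo.items():
        observed = monthly_counts.get(metric, 0)
        if observed > limit:
            breaches.append(f"{metric}: {observed} alerts, limit is {limit}")
    return breaches
```

An empty result means the SLO was met for the month; a non-empty one is the trigger to treat the breach like any other production incident.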
Keeping on-call hygiene
- Engineering Managers – Make sure that people take time off after an incident. If someone was woken up
during the night, they are expected to take a few hours off (at Forter’s expense!) depending on the
length of the incident – If it was a long one, we expect people to take a day off to
- Product Chief (EM to verify) – Send a monthly report of the statistics and the effort that was
made to improve Operational Health. You can do it per Product and for the team (all products) so
the team will see the trend. Remind them what was done to create this impact.
Note: if the Product is stable for more than 3 quarters, that can wait for the
Director’s quarterly update (see next bullet).
- Directors will share on a quarterly basis their Products-space, SLOs, and actual performance.
This will be done as part of the “Quarterly lookback” slides that each Director is presenting to
- Practice a “Game Day” at least once every 2 quarters until the majority of the team feels
comfortable operating the tools the team has, eliminating options quickly, and having enough
context shared between team members to help each other out without depending only on your Domain
Experts. Directors should organize it in their groups, assuming multiple teams within the group
will be influenced by it, and to build some group knowledge.
- AWS page about the game concept
- Stripe’s post, via the Wayback Machine
- Google’s DiRT talk
- Dark Debt