By Adrian Hilton, Customer Reliability Engineer
Editor’s note: As your company starts to run many applications or services, you’ll bump up against the limit of what your primary SRE (or Ops) team can support. In this installment of CRE Life Lessons, we’re going to look at how you can make good, principled and defensible decisions about which of your company’s applications and services you should give to your SREs to support, and how to decide when that subset needs to change.
At Google, we’re fortunate to have Site Reliability Engineering (SRE) teams supporting both our horizontal infrastructure, such as storage, networking and load balancing, and our major applications, such as Search, Maps and Photos. Nevertheless, the combination of software engineering and systems engineering skills required for the role makes it hard to find and recruit SREs, and demand for them steadily outstrips supply.
Over time we’ve found some practical limits to the number of applications that an SRE team can support, and learned the characteristics of applications that are more trouble to support than others. If your company runs many production applications, your SRE team is unlikely to be able to support them all.
Q: How will I know when my company’s SRE team is at its limit? How do I choose the best subset of applications to support? When should the SRE team drop support for an application?
Good questions all; let’s explore them in more detail.
Practical limitations on SRE support
At Google, the rule of thumb for the minimum SRE team needed to staff a pager rotation without burnout is six engineers; for a 24/7 pager rotation with a target response time under 30 minutes, we don’t want any engineer to be on-call for more than 12 continuous hours, because we don’t want paging alerts interrupting their sleep. This implies two groups of six engineers each, with a wide geographic spread so that each team can handle pages mostly during its daytime.
At any one time, there’s usually a designated primary who responds to pages, and a secondary who catches fall-through pages, e.g., if the primary is temporarily out of contact or is in the middle of managing an incident. The primary and secondary handle normal ops work, freeing the rest of the team for project work such as improving reliability, building better monitoring or increasing automation of ops tasks. Every engineer therefore spends two weeks out of six focused on operational work — one as primary, one as secondary.
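To make the arithmetic concrete, here’s a minimal sketch of this staffing math. The one-week primary and secondary stints are an illustrative assumption; the constants simply restate the rule of thumb above.

```python
# Back-of-the-envelope staffing math for the 2 x 6 rotation described above.
ENGINEERS_PER_SITE = 6   # rule-of-thumb minimum per site to avoid burnout
SITES = 2                # two sites, each covering ~12 daytime hours
OPS_WEEKS = 2            # one week as primary, one as secondary

def ops_fraction(team_size: int, ops_weeks: int) -> float:
    """Fraction of an engineer's time spent on operational duty,
    assuming one-week stints rotating through the whole team."""
    return ops_weeks / team_size

total = ENGINEERS_PER_SITE * SITES
frac = ops_fraction(ENGINEERS_PER_SITE, OPS_WEEKS)
print(f"Total engineers needed: {total}")   # -> 12
print(f"Ops time per engineer: {frac:.0%}") # -> 33%, i.e., 2 weeks in 6
```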
Q: Surely 12 to 16 engineers can handle support for all the applications your development team can feasibly write?
Actually, no. Our experience is that there’s a definite cognitive limit to how many different applications or services an SRE team can effectively manage; every engineer needs to be sufficiently familiar with each app to troubleshoot, diagnose and resolve most of its production problems. If you want to make it easy to support many apps at once, make them as similar as possible: design them to use common patterns and back-end services, standardize on common tools for operational tasks like rollout, monitoring and alerting, and deploy them on similar schedules. This reduces the per-app cognitive load, but doesn’t eliminate it.
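As one illustration of that standardization, here’s a minimal sketch of a common “ops descriptor” that every service could be required to publish, so an on-caller finds rollout, monitoring and playbook information in the same shape everywhere. The schema, field names and URLs are our own hypothetical example, not a Google-internal format.

```python
# A uniform per-service ops descriptor reduces per-app cognitive load:
# whoever is on call knows exactly where to look for every service.
# All names and URLs below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass(frozen=True)
class OpsDescriptor:
    service: str
    dashboard_url: str    # every app uses the same monitoring stack
    rollout_tool: str     # one standardized release mechanism
    playbook_url: str     # troubleshooting docs in a predictable place
    page_latency_ms: int  # alerting threshold, uniformly defined

checkout = OpsDescriptor(
    service="checkout",
    dashboard_url="https://monitoring.example.com/checkout",
    rollout_tool="standard-rollout",
    playbook_url="https://playbooks.example.com/checkout",
    page_latency_ms=500,
)
```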
If you do have enough SREs, you might consider forming two teams (again, subject to the 2 x 6 minimum staffing limit) and giving them separate responsibilities. At Google, it’s not unusual for a single SRE team to split into front-end and back-end shards as its system grows, with each shard taking responsibility for supporting only its half of the system. (We call this team mitosis.)
Your SRE team’s maximum number of supported services will be strongly influenced by factors such as the following (a rough capacity model sketch follows the list):
the regular operational tasks needed to keep the services running well, for example: releases, bug fixes and non-urgent alerts/bugs. These can be reduced (but not eliminated) by automation;
“interrupts” — unscheduled, non-critical human requests. We’ve found these awkwardly resistant to efforts to reduce them; the most effective strategy has been self-service tools that handle the 50-70% of requests that are repeat queries;
emergency alert response, incident management and follow-up. The best way to spend less time on these is to make the service more reliable and to have better-tuned alerts (i.e., alerts that are actionable and that, when they fire, strongly indicate real problems with the service).
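Here’s the promised capacity model, as a deliberately crude sketch: it adds up the three kinds of load above per service and compares the total against the team’s ops budget. Every number in it is an illustrative assumption to be replaced by your own measurements.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    routine_hours: float    # weekly releases, bug fixes, non-urgent alerts
    interrupt_hours: float  # weekly unscheduled, non-critical requests
    incident_hours: float   # weekly paging, incident management, follow-up

    @property
    def ops_hours(self) -> float:
        return self.routine_hours + self.interrupt_hours + self.incident_hours

def can_support(services, team_size, hours_per_engineer=40.0,
                max_ops_fraction=0.5):
    """True if total weekly ops load fits in the team's ops budget
    (the 50% ceiling is discussed later in this post)."""
    budget = team_size * hours_per_engineer * max_ops_fraction
    return sum(s.ops_hours for s in services) <= budget

portfolio = [
    Service("checkout", routine_hours=20, interrupt_hours=10, incident_hours=8),
    Service("search",   routine_hours=15, interrupt_hours=6,  incident_hours=5),
]
print(can_support(portfolio, team_size=12))  # 64h load vs. 240h budget -> True
```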
Q: What about the four weeks out of six during which an SRE isn’t doing operational work — could we use that time to increase our SRE team’s supported service capacity?
You could, but at Google we view this as “eating your seed corn.” The goal is to have machines do everything that machines can do, and for that to happen you need to leave breathing room for your SREs to do project work such as producing new automation for your service. In our experience, once a team crosses the 50% ops-work threshold, it quickly descends a slippery slope to 100% ops. In that condition you lose the engineering effort that yields medium-to-long-term operational benefits such as reducing the frequency, duration and impact of future incidents. When you move your SRE team into nearly full-time ops work, you lose the benefit of its engineering design and development skills.
Note in particular that SRE engineering project work can reduce operational load by addressing many of the factors described above that limit how many services an SRE team can support.
Given the above, you may well find yourself in a position where you want your SRE team to onboard a new service but, in practice, they’re not able to support it on a sustainable basis.
You’re out of SRE support capacity – now what?
At Google, our working principle is that any service not explicitly supported by SRE must be supported by its developer team; if you have enough developers to write a new application, you probably have enough developers to support it. Our developers tend to use the same monitoring, rollout and incident management tools as the SREs they work with, so the operational support workload is similar. In any case, we like the developers who wrote an application to support it directly for a while, so they get a good feel for how customers are experiencing it. What they learn doing so helps SREs onboard the service later.
Q: What about the next application we want the developers to write? Won’t they be too busy supporting the current application?
This may be true — the current application may be generating a high operational workload due to excessive alerts or a lack of automation. However, this gives the developer team a practical incentive to spend time making the application easier to support: tuning alerts, spending developer time on automation and reducing the velocity of functional changes.
When developers are overloaded with operational work, SREs might be able to lend operational expertise and development effort to reduce the developers’ workloads to a manageable level. However, SREs still shouldn’t take on operational responsibility for the service, as this won’t solve the fundamental problem.
When one team develops an application and another team bears the brunt of the operational work for it, moral hazard thrives. Developers want high development velocity; it’s not in their interest to spend days running down and eliminating every odd bug that occasionally causes their server to run out of memory and need a restart. Meanwhile, the operational team is getting paged to do those restarts several times a day — it’s very much in their interest to get that bug fixed, since it’s their sleep being interrupted. Not surprisingly, when developers bear the operational load for their own system, they too are incentivized to spend time making it easier to support. This also turns out to be important for persuading an SRE team to support their application, as we shall see later.
Choosing which applications to support
The easiest way to prioritize the applications for SRE to support is by revenue or other business criticality, i.e., how much it matters if the service goes down. After all, having an SRE team supporting your service should improve its reliability and availability.
Q: Sounds good to me; surely prioritizing by business impact is always the right choice?
Not always. There are services that actually don’t need much support work; a good example is a simple infrastructure service (say, a distributed key-value store) that has reached maturity and is updated only infrequently. Since nothing is really changing in the service, it’s unlikely to break spontaneously. Even if it’s a critical dependency of several user-facing applications, it might not make sense to dedicate SRE support to it; rather, let its developers hold the pager and handle the low volume of operational work.
At Google, we consider that SRE teams have seven areas of focus that developer teams typically don’t have:
Monitoring and metrics. For example, detecting response latency, error or unanswered query rate, and peak utilization of resources
Emergency response. Running on-call rotations, traffic-dip detection, primary/secondary/escalation, writing playbooks, running Wheels of Misfortune
Capacity planning. Doing quarterly projections, handling a sudden sustained load spike, running utilization-improvement projects
Service turn-up and turn-down. For services which run in many locations (e.g., to reduce end-user latency), planning location turn-up/down schedules and automating the process to reduce risks and operational load
Change management. Canarying, 1% experiments, rolling upgrades, quick-fail rollbacks, and measuring error budgets (see the worked example after this list)
Performance. Stress and load testing, resource-usage efficiency monitoring and optimization
Data Integrity. Ensuring that non-reconstructible data is stored resiliently and highly available for reads, including the ability to rapidly restore it from backups
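To illustrate the error-budget item under change management: an availability SLO implies a budget of allowed unreliability (1 minus the target), which change management can spend on risky rollouts. A minimal sketch of the arithmetic, assuming a 30-day window; the SLO values are examples, not recommendations.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime minutes an availability SLO permits per window."""
    return (1.0 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):  # example SLOs, not recommendations
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min/30d")
# 99.00% -> 432.0   99.90% -> 43.2   99.99% -> 4.3
```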
With the possible exception of “emergency response” and “data integrity,” our key-value store wouldn’t benefit substantially from any of these areas of expertise, and the marginal benefit of having SREs rather than developers support it is low. On the other hand, the opportunity cost of spending SRE support capacity on it is high; there are likely to be other applications that would benefit more from SRE expertise.
One other reason SREs might take on responsibility for an infrastructure service that doesn’t need SRE expertise is that it’s a crucial dependency of services they already run. In that case, there could be a significant benefit to them in having visibility into, and control of, changes to that service.
In part 2 of this blog post, we’ll take a look at how our SRE team could determine how — and indeed, whether — to onboard a business-critical service once it has been identified as able to benefit from SRE support.
Source: Google Cloud Platform