Site Reliability Engineering

What this chapter is about
This chapter introduces the whole idea of Site Reliability Engineering (SRE) by explaining why Google created it in the first place. It looks at the old way companies used to run their systems—traditional “sysadmin + operations” teams—and shows why that approach starts to break down once things get big and complex.
Then it explains Google’s alternative: treating operations as an engineering problem. Instead of relying on people to manually keep systems running, SRE focuses on building software and automation that does the heavy lifting.

Main ideas (explained simply)
▪️ The old Dev vs Ops split doesn’t scale
Traditionally, developers build things and operations teams keep those things running.
The problem? As systems grow, so does the manual work. You end up throwing more and more people at the problem, and the two sides often have competing goals (Dev wants speed, Ops wants stability).
▪️ Google’s approach: hire software engineers to run systems
Google flipped the model.
Instead of asking ops folks to do repetitive manual tasks, they hired people with strong engineering skills who would automate those tasks away.
▪️ SREs should spend no more than 50% of their time on ops work
If SREs spend all day fixing outages and responding to alerts, they’ll never have time to improve the system.
So Google caps it: SREs spend at most half their time on “ops” tasks such as tickets, on-call, and manual chores.
The rest is reserved for designing better tools, smarter automation, and more reliable systems.
▪️ Reliability is measured, not guessed
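Instead of aiming for “100% uptime” (which is impossible and unnecessary), SRE uses something called an error budget: the amount of unreliability a service is allowed, which is one minus its availability target. A service with a 99.99% target, for example, has a 0.01% budget.
This budget becomes the shared language between Dev and SRE: while budget remains, new releases can go out; once it is spent, the focus shifts to stability.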
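To make the arithmetic concrete, here is a minimal Python sketch. The 99.9% target, the 30-day window, and the function names are assumptions chosen for illustration, not figures from the chapter.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability target.

    slo is a fraction such as 0.999; 100% is rejected because it
    would leave no budget at all.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target over a 30-day window.
    print(f"{error_budget_minutes(0.999):.1f} minutes of budget")
    print(f"{budget_remaining(0.999, 21.6):.0%} of budget left")
```

With these assumed numbers the budget works out to roughly 43 minutes per 30 days, so about 22 minutes of downtime would use up half of it.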
▪️ Most outages come from changes
New releases, config updates, traffic shifts—these are the usual suspects.
SRE focuses on gradually rolling changes out, watching how they behave, and rolling them back fast if needed.
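Those three steps can be sketched as a small loop. This is only an illustration: `set_traffic` and `observed_error_rate` are hypothetical callbacks standing in for whatever deploy and monitoring tooling a team actually uses, and the stage sizes and threshold are made-up values.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    set_traffic: Callable[[float], None],
    observed_error_rate: Callable[[], float],
    max_error_rate: float,
) -> bool:
    """Move traffic to a new version one stage at a time.

    Returns True if every stage passed, False if the change was rolled back.
    """
    for fraction in stages:
        set_traffic(fraction)  # roll the change out a bit further
        if observed_error_rate() > max_error_rate:  # watch how it behaves
            set_traffic(0.0)  # roll back fast
            return False
    return True


if __name__ == "__main__":
    history = []
    ok = progressive_rollout(
        stages=[0.01, 0.10, 1.0],          # assumed: 1%, 10%, then everyone
        set_traffic=history.append,        # stand-in for a real deploy call
        observed_error_rate=lambda: 0.001, # stand-in for a real metric query
        max_error_rate=0.01,
    )
    print(ok, history)
```

A real rollout would also wait between stages so there is enough traffic to judge; that delay is left out here to keep the sketch short.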

▪️ Monitoring = only alert humans when they must act
A machine should filter the noise and only wake up a human when action is truly needed.
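One way to picture that filtering is a tiny routing function. The three outputs (page, ticket, log) follow the general SRE idea that only urgent, actionable signals should interrupt a person; the names `Output` and `route` and the two yes/no inputs are assumptions made for this sketch.

```python
from enum import Enum


class Output(Enum):
    PAGE = "page"      # a human must act right now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # nobody needs to look; kept for later diagnosis


def route(needs_human: bool, urgent: bool) -> Output:
    """Decide where a monitoring signal should go."""
    if needs_human and urgent:
        return Output.PAGE
    if needs_human:
        return Output.TICKET
    return Output.LOG


if __name__ == "__main__":
    print(route(needs_human=True, urgent=True).value)
    print(route(needs_human=True, urgent=False).value)
    print(route(needs_human=False, urgent=False).value)
```

The point of the design is that paging is the exception: anything a machine can handle or defer never wakes anyone up.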
▪️ Write things down, and learn from every outage
SRE uses playbooks for emergencies and blameless postmortems afterward.
The goal isn’t to blame people—it’s to fix the system.
▪️ SRE teams also plan capacity and care about efficiency
They don’t just react to problems—they look ahead to make sure there’s enough infrastructure, and they continually tune systems to run smoothly and cost‑effectively.

Why it matters
This chapter is important because it sets the tone for the whole SRE philosophy:
👉 Operations should be done by writing software, not by doing manual work.
👉 Reliability is something you can measure and manage, not guess.
👉 Good systems grow because of engineering, not heroic effort.
For anyone working in tech, especially if you’re moving toward SRE, architecture, or leadership, this chapter helps you rethink what “keeping systems running” really means.

Real‑world connection
Here are some simple ways to think about the ideas:

The ops/dev conflict is like drivers vs mechanics.
Drivers want speed; mechanics want safety. Without shared goals, it’s an endless argument.

Automation is like building a dishwasher.
If you wash every plate by hand forever, you’re stuck. If you spend a week building a dishwasher, you free up time every day after that.

Error budgets are like speed limits.
You can drive fast, but only as long as it stays safe. If accidents increase, you slow down.

Progressive rollouts are like trying new food with one bite first.
You don’t eat the whole plate before checking if you like the taste.

Blameless postmortems are like sports replay studies.
Teams review mistakes to improve—not to shame players.

Friendly takeaways
Here are some easy‑to‑remember ideas from this chapter:
🌼 1. Manual work doesn’t scale
If you do the same task repeatedly, automate it or improve the system so it disappears.
🌼 2. Reliability isn’t free—make it a conscious choice
Aim for useful reliability, not perfection.
🌼 3. Teams work better when they share the same goals
Error budgets give Dev and Ops a common target.
🌼 4. Most problems come from change
If you roll changes out gradually, watch them closely, and can roll back fast, you’ll prevent or shorten most outages.
🌼 5. Learning beats blaming
Healthy engineering cultures improve systems instead of punishing individuals.