Learning Goals
3 min
By the end of this lesson you can:
- Explain and apply idempotence as a design default, not an afterthought.
- Build observability: logs, metrics, and a heartbeat so silence means failure.
- Design alerting that pages on real problems without crying wolf.
- Adopt the supporting habits: config over hard-coding, secrets in env, dry-run by default.
Warm-Up · Script vs. System
5 min
A SCRIPT → A SYSTEM
runs when you run it → runs unattended, on schedule
works on your machine → works anywhere (config + env)
prints to your screen → logs to files you can read later
breaks silently → alerts you before users notice
re-running may double-act → re-running is safe (idempotent)
"it worked once" → "it keeps working"
You've learned every tool in this level. The DevOps mindset is the set of habits that combine them into something dependable. Three pillars: idempotence (safe to re-run), observability (you can see what happened), alerting (you find out about problems first). Master these and your automations become infrastructure people trust.
New Concept · The Three Pillars
14 min
Pillar 1 · Idempotence: design for re-runs
Schedulers retry. Humans re-run. Crashes resume. So every operation should be safe to repeat. Make idempotence the default, not a patch:
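from pathlib import Path

# db, record_id, value, send and already_sent are placeholders for your own job

# create-if-missing, not create (which errors on the 2nd run)
Path("output").mkdir(parents=True, exist_ok=True)

# upsert, not insert (no duplicate rows on re-run; needs a PRIMARY KEY or UNIQUE column)
db.execute("INSERT OR REPLACE INTO t VALUES (?, ?)", (record_id, value))

# check-then-act with a key (no double email/charge)
if record_id not in already_sent:
    send(record_id)
    already_sent.add(record_id)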
Ask of every step: "what happens if this runs twice?" If the answer is "a problem," redesign it. Idempotence is what makes retries and resumes (Lesson 45) safe.
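To see the "runs twice" test pass in practice, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name `sales` and the helpers `save_sale` and `count_sales` are made up for illustration; they are not part of any earlier lesson.

```python
import sqlite3

def save_sale(conn: sqlite3.Connection, sale_id: str, amount: float) -> None:
    """Insert the sale, or overwrite it if this id was already stored."""
    conn.execute(
        "INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)",
        (sale_id, amount),
    )
    conn.commit()

def count_sales(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY is what lets REPLACE find and overwrite the existing row.
conn.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT PRIMARY KEY, amount REAL)")

save_sale(conn, "A-1", 19.99)
save_sale(conn, "A-1", 19.99)  # simulated retry of the same step
print(count_sales(conn))       # 1
```

Without the primary key, the second call would quietly add a duplicate row, so the constraint is part of the idempotent design, not an optional extra.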
Pillar 2 · Observability — see what happened
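import json
from datetime import datetime
from pathlib import Path

# log, n, elapsed, errs and write_heartbeat come from the surrounding job

# 1) structured logs (Lessons 13-14): what, when, how serious
log.info("processed %d records in %.1fs", n, elapsed)

# 2) metrics: numbers you can track over time
metrics = {"records": n, "duration_s": elapsed, "errors": errs}
with Path("metrics.jsonl").open("a", encoding="utf-8") as f:  # closes the file after the append
    f.write(json.dumps(metrics) + "\n")

# 3) heartbeat: prove "it ran", so silence == failure
write_heartbeat(success=True, at=datetime.now())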
- Logs tell the story of one run (and persist via rotation).
- Metrics let you spot trends — "the job is getting slower," "errors are creeping up."
- Heartbeat turns absence into a signal: if the expected success ping doesn't arrive, something is wrong even if nothing crashed loudly.
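As a rough illustration of spotting "the job is getting slower" from a metrics file, the sketch below compares the most recent runs against the earlier ones. It assumes each JSONL line carries a `duration_s` field like the metrics example above; the function name `is_getting_slower`, the window of 5 runs, and the 1.5x factor are arbitrary choices for illustration, not recommended values.

```python
import json
from statistics import mean

def is_getting_slower(lines, window=5, factor=1.5):
    """True if the last `window` runs average more than `factor` x the earlier runs."""
    durations = [json.loads(line)["duration_s"] for line in lines if line.strip()]
    if len(durations) < 2 * window:
        return False  # not enough history to judge a trend
    baseline = mean(durations[:-window])  # everything before the recent window
    recent = mean(durations[-window:])    # the most recent runs
    return recent > factor * baseline
```

A typical call would pass in `Path("metrics.jsonl").read_text().splitlines()`. A "yes" here is usually a warning-level signal, not a page.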
Pillar 3 · Alerting — find out first
The goal: you learn about a problem before your users do. But over-alerting is its own failure — if everything pages, people ignore everything.
Good alert: actionable, urgent, rare → page someone
Noise: informational, frequent, "FYI" → log it, don't page
Alert fatigue: too many alerts → all ignored → the real danger
# alert on what's actionable, at the right severity, debounced
if overall == "critical":
    raise_alert("health", "critical", details)   # pages on-call (Lesson 34)
elif overall == "warning":
    raise_alert("health", "warning", details)    # Slack only
# everything else → just logs
For every alert ask: "Is this actionable (can someone do something) and urgent (does it need doing now)?" If not, it's a log line, not an alert. Debounce (Lesson 34) so a sustained problem pages once, and escalate (Lesson 33) for the rare true emergency. Protect people's attention like the scarce resource it is.
The supporting habits
- Config over hard-coding — paths, thresholds, recipients in config/env (Lesson 8), so the same code runs in dev and prod.
- Secrets in the environment — never in code or git (Lessons 8, 25); rotate on exposure.
- Dry-run by default for destructive actions — show before you act (Lessons 38, 44).
- Fail loud, recover gracefully — exit non-zero, alert, but checkpoint so recovery is cheap (Lesson 45).
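"Dry-run by default" can be as simple as a flag that defaults to the safe behaviour. The sketch below is illustrative only: the function name `purge_tmp_files`, the `apply` parameter, and the `*.tmp` pattern are assumptions, not code from Lessons 38 or 44.

```python
from pathlib import Path

def purge_tmp_files(folder: Path, apply: bool = False) -> list[Path]:
    """List *.tmp files in `folder`; delete them only when apply=True."""
    targets = sorted(folder.glob("*.tmp"))
    for path in targets:
        if apply:
            path.unlink(missing_ok=True)  # idempotent: fine if it is already gone
            print(f"deleted {path.name}")
        else:
            print(f"[dry-run] would delete {path.name}")
    return targets
```

Calling `purge_tmp_files(Path("out"))` only prints the plan; the caller has to opt in with `apply=True` before anything is removed.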
Worked Example · A Production Checklist Applied
12 min
Goal: take a naive script and harden it through the three pillars, the same transformation you'd apply to anything before trusting it in production.
# BEFORE: a "works on my machine" script
import requests, csv

data = requests.get("https://api.example.com/sales?key=sk-live-abc123").json()
with open("/Users/me/report.csv", "w") as f:     # hard-coded path
    w = csv.writer(f)
    for row in data:                              # crashes on one bad row
        w.writerow([row["id"], row["amount"]])
print("done")                                     # no record, no alert
# AFTER: production-grade
import os, csv, json, logging, sys
from pathlib import Path
from datetime import datetime

from dotenv import load_dotenv
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

HERE = Path(__file__).resolve().parent
load_dotenv(HERE / ".env")                        # secrets from env
logging.basicConfig(level=logging.INFO,           # observability: logs
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("sales")

def session():
    retry = Retry(total=5, backoff_factor=1,
                  status_forcelist=[429, 500, 502, 503, 504])
    s = requests.Session()
    s.mount("https://", HTTPAdapter(max_retries=retry))
    return s

def run() -> int:
    out = Path(os.getenv("OUTPUT_DIR", HERE / "out"))   # config over hard-coding
    out.mkdir(parents=True, exist_ok=True)               # idempotent
    started = datetime.now()
    try:
        r = session().get("https://api.example.com/sales",
                          headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
                          timeout=15)                     # secret in env, resilient
        r.raise_for_status()
        rows, errors = r.json(), 0
        target = out / "report.csv"
        with open(target, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            w.writerow(["id", "amount"])
            for row in rows:
                try:
                    w.writerow([row["id"], float(row["amount"])])
                except (KeyError, ValueError, TypeError):  # per-item isolation (TypeError: null amount)
                    errors += 1
                    log.warning("bad row skipped: %s", row)
        elapsed = (datetime.now() - started).total_seconds()
        # report rows actually written, not rows received
        log.info("wrote %d rows (%d skipped) in %.1fs", len(rows) - errors, errors, elapsed)
        # metrics + heartbeat: one timestamped line per successful run
        with (out / "metrics.jsonl").open("a", encoding="utf-8") as m:
            m.write(json.dumps({"at": started.isoformat(), "rows": len(rows),
                                "errors": errors, "duration_s": elapsed}) + "\n")
        return 0
    except Exception as e:
        log.exception("run failed")
        # alert(f"sales export FAILED: {e}", "critical")  # Lesson 34
        return 1

if __name__ == "__main__":
    sys.exit(run())                                       # fail loud (exit code)
Read the diff
Same job, transformed: the secret moved to the environment, the path became config with a safe default, mkdir(exist_ok=True) made it idempotent, a resilient session handles transient errors, per-row try/except isolates bad data, logs + a metrics line + (commented) alert give observability, and sys.exit fails loud for the scheduler. None of it is new — it's the habits applied. That checklist is what "production-ready" means.
Try It Yourself
13 min
Take any automation you've written and answer, for each step: "what happens if it runs twice?" List the non-idempotent steps and how you'd fix each (mkdir exist_ok, upsert, key check).
Add a metrics line (rows/duration/errors as JSONL) and a heartbeat file to one of your jobs. Then write a tiny checker that warns if the heartbeat is older than expected — proving silence is detectable.
Hint
import json, time
from pathlib import Path

HB = Path("heartbeat.json")

def beat():
    HB.write_text(json.dumps({"at": time.time()}))

def check(max_age_s=3600):
    if not HB.exists():
        print("NEVER RAN")
        return
    age = time.time() - json.loads(HB.read_text())["at"]
    print("STALE" if age > max_age_s else f"ok ({age:.0f}s ago)")
List every place one of your tools could fail. For each, decide: page (critical), Slack (warning), or just log (info). Justify each with the "actionable + urgent?" test. The goal: as few pages as possible, none of them noise.
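One way to keep this triage consistent is to encode the "actionable + urgent?" test as a tiny function. The sketch below rests on an assumption about how the two questions map onto the three tiers (in particular, it treats "actionable but not urgent" as the Slack tier), and the failure names are invented for illustration.

```python
def route(actionable: bool, urgent: bool) -> str:
    """Map the two-question test to a destination."""
    if actionable and urgent:
        return "page"    # critical: someone must act now
    if actionable:
        return "slack"   # warning: someone should act, but it can wait
    return "log"         # info: nothing to do, keep it for the record

# Hypothetical failure modes mapped to (actionable, urgent)
FAILURES = {
    "API key rejected": (True, True),
    "disk 80% full": (True, False),
    "one bad row skipped": (False, False),
}

for name, (actionable, urgent) in FAILURES.items():
    print(f"{name:22} -> {route(actionable, urgent)}")
```

Your own list will differ; the point is that each entry gets an explicit answer to both questions before it is allowed to page anyone.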
Mini-Challenge · The Readiness Linter
8 min
Write a "production-readiness" checklist as a script that scans one of your automations and reports which habits it has: secrets in env (not hard-coded; reuse Lesson 25's secret scanner), uses logging (not bare print), has error handling, idempotent dir creation, an exit code, and config instead of hard-coded paths. Output a score and the gaps.
Sample solution
import re
from pathlib import Path

CHECKS = {
    "uses logging":        lambda s: "logging" in s and "log." in s,
    "no bare print debug": lambda s: s.count("print(") <= 2,
    "has error handling":  lambda s: "try:" in s and "except" in s,
    "secrets from env":    lambda s: "os.environ" in s or "getenv" in s,
    "no hardcoded secret": lambda s: not re.search(r"sk-[A-Za-z0-9]{16,}", s),
    "idempotent mkdir":    lambda s: "exist_ok=True" in s or "mkdir" not in s,
    "has exit code":       lambda s: "sys.exit" in s or "return 0" in s,
}

def lint(path: str) -> None:
    src = Path(path).read_text(encoding="utf-8")
    passed = 0
    for name, test in CHECKS.items():
        ok = test(src)
        passed += ok
        print(f"  [{'✓' if ok else ' '}] {name}")
    print(f"\nReadiness: {passed}/{len(CHECKS)}")

lint("my_automation.py")
Non-negotiables: checks for logging, error handling, env secrets, idempotence, exit code; prints a score + gaps.
Recap
3 min
The DevOps mindset turns scripts into systems via three pillars: idempotence (every step safe to re-run; the default, not a patch), observability (logs + metrics + heartbeat, so you can always see what happened and silence means failure), and alerting (page only on actionable, urgent, rare problems; protect attention from alert fatigue). Supporting habits: config over hard-coding, secrets in env, dry-run for destructive actions, fail loud but recover cheaply. None of this is new code; it is the disciplined application of everything in Level 7. Carry this checklist into the capstone.
Vocabulary Card
- idempotence
- A property where running an operation repeatedly is as safe as running it once.
- observability
- Being able to understand a system's behaviour from its outputs (logs, metrics).
- heartbeat
- A periodic "I'm alive/I ran" signal whose absence indicates failure.
- alert fatigue
- When too many alerts cause people to ignore them all.
Homework
4 min
Pick your best automation from this level and do a full "production hardening" pass against the three pillars and supporting habits. Run your readiness linter before and after. Write a short before/after note listing each habit you added and why it matters. This is the dress rehearsal for the capstone, so bring something genuinely production-grade.
Sample · before/after hardening note
Tool: daily sales export. Readiness: 2/7 → 7/7
Added:
+ secrets to .env (was hard-coded API key): leak/rotation safety
+ output path from config w/ default: runs in dev & prod unchanged
+ mkdir(exist_ok=True): idempotent, re-run safe
+ resilient session (retries): survives transient API blips
+ per-row try/except: one bad record no longer kills the export
+ logging + metrics.jsonl + heartbeat: silence now = failure
+ sys.exit(1) + critical alert on failure: fails loud, pages me
Why it matters: before, a 3am API hiccup meant no report and no warning. Now it retries, and if it truly fails I'm paged with the reason, while a re-run is safe and resumes cleanly.
Non-negotiables: a real hardening pass touching all three pillars, linter before/after, and a why for each habit.