PY-L7-46 · DevOps Mindset: Idempotence, Logging, Alerts

Learning Goals

3 min

By the end of this lesson you can:

Explain and apply idempotence as a design default, not an afterthought.
Build observability: logs, metrics, and a heartbeat so silence means failure.
Design alerting that pages on real problems without crying wolf.
Adopt the supporting habits: config over hard-coding, secrets in env, dry-run by default.

Warm-Up · Script vs. System

5 min

A SCRIPT                       A SYSTEM
runs when you run it       →   runs unattended, on schedule
works on your machine      →   works anywhere (config + env)
prints to your screen      →   logs to files you can read later
breaks silently            →   alerts you before users notice
re-running may double-act  →   re-running is safe (idempotent)
"it worked once"           →   "it keeps working"

Today's big idea

You've learned every tool in this level. The DevOps mindset is the set of habits that combine them into something dependable. Three pillars: idempotence (safe to re-run), observability (you can see what happened), alerting (you find out about problems first). Master these and your automations become infrastructure people trust.

New Concept · The Three Pillars

14 min

Pillar 1 · Idempotence — design for re-runs

Schedulers retry. Humans re-run. Crashes resume. So every operation should be safe to repeat. Make idempotence the default, not a patch:

# create-if-missing, not create (which errors on the 2nd run)
Path("output").mkdir(parents=True, exist_ok=True)

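# upsert, not insert (no duplicate rows on re-run)
# SQLite syntax; it only dedupes if the table has a PRIMARY KEY or UNIQUE column
db.execute("INSERT OR REPLACE INTO t VALUES (?, ?)", (id, value))

# check-then-act with a key (no double email/charge)
# already_sent must be stored somewhere that survives restarts
if id not in already_sent:
    send(id)
    already_sent.add(id)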

Ask of every step: "what happens if this runs twice?" If the answer is "a problem," redesign it. Idempotence is what makes retries and resumes (Lesson 45) safe.
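
The upsert and key-check patterns above can be tried end to end with the standard-library sqlite3 module. This is a minimal sketch under assumptions: the table names (sales, sent) and the helpers record_sale and send_once are made up for illustration, and the in-memory database stands in for a real file.

```python
import sqlite3

def record_sale(db: sqlite3.Connection, sale_id: str, amount: float) -> None:
    """Upsert: the PRIMARY KEY on id makes a repeat overwrite instead of duplicate."""
    db.execute(
        "INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)",
        (sale_id, amount),
    )

def send_once(db: sqlite3.Connection, sale_id: str, send) -> bool:
    """Call send(sale_id) only if this key has not been recorded before."""
    cur = db.execute("INSERT OR IGNORE INTO sent (id) VALUES (?)", (sale_id,))
    if cur.rowcount == 0:          # key already present: an earlier run handled it
        return False
    send(sale_id)
    return True

# Hypothetical schema; use a file path instead of ":memory:" so state survives restarts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id TEXT PRIMARY KEY, amount REAL)")
db.execute("CREATE TABLE sent (id TEXT PRIMARY KEY)")

outbox = []                        # stands in for a real email/charge call
for _ in range(2):                 # simulate the job running twice
    record_sale(db, "A-1", 19.99)
    send_once(db, "A-1", outbox.append)
    db.commit()

row_count = db.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count, outbox)
```

Two runs leave one row and one send. Whether you record the key before or after the send is a trade-off: this sketch records first, so a send that raises leaves the key behind unless you roll back instead of committing.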

Pillar 2 · Observability — see what happened

# 1) structured logs (Lessons 13-14): what, when, how serious
log.info("processed %d records in %.1fs", n, elapsed)

# 2) metrics: numbers you can track over time
metrics = {"records": n, "duration_s": elapsed, "errors": errs}
with open("metrics.jsonl", "a", encoding="utf-8") as f:   # append one line per run
    f.write(json.dumps(metrics) + "\n")

# 3) heartbeat: prove "it ran", so silence == failure
write_heartbeat(success=True, at=datetime.now())

Logs tell the story of one run, and rotation keeps that history on disk without letting it grow forever.
Metrics let you spot trends — "the job is getting slower," "errors are creeping up."
Heartbeat turns absence into a signal: if the expected success ping doesn't arrive, something is wrong even if nothing crashed loudly.
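
To make "spot trends" concrete, here is a small sketch that reads metrics lines like the ones above and compares recent runs to earlier ones. The function name duration_trend, the window size, and the sample numbers are assumptions for illustration, not part of any earlier lesson.

```python
import json
from statistics import mean

def duration_trend(jsonl_text: str, window: int = 3) -> float:
    """Mean duration of the last `window` runs divided by the mean of all earlier runs."""
    durations = [
        json.loads(line)["duration_s"]
        for line in jsonl_text.splitlines()
        if line.strip()                      # tolerate blank lines
    ]
    if len(durations) < 2 * window:
        raise ValueError("not enough runs to compare")
    recent = mean(durations[-window:])
    baseline = mean(durations[:-window])
    return recent / baseline

# Made-up sample: three runs at 10s, then three runs at 20s.
sample = "\n".join(
    json.dumps({"records": 100, "duration_s": d, "errors": 0})
    for d in [10, 10, 10, 20, 20, 20]
)
ratio = duration_trend(sample)
print(f"recent runs take {ratio:.1f}x the baseline")
```

A ratio well above 1 is the "job is getting slower" signal. In a real job you would pass in the contents of metrics.jsonl and decide what ratio deserves a warning.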

Pillar 3 · Alerting — find out first

The goal: you learn about a problem before your users do. But over-alerting is its own failure — if everything pages, people ignore everything.

Good alert       actionable, urgent, rare        → page someone
Noise            informational, frequent, "FYI"   → log it, don't page
Alert fatigue    too many alerts → all ignored    → the real danger

# alert on what's actionable, at the right severity, debounced
if overall == "critical":
    raise_alert("health", "critical", details)   # pages on-call (Lesson 34)
elif overall == "warning":
    raise_alert("health", "warning", details)    # Slack only
# everything else → just logs

The alerting test

For every alert ask: "Is this actionable (can someone do something) and urgent (does it need doing now)?" If not, it's a log line, not an alert. Debounce (Lesson 34) so a sustained problem pages once, and escalate (Lesson 33) for the rare true emergency. Protect people's attention like the scarce resource it is.
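
One way to picture debouncing is a cooldown per alert key. The sketch below is an assumption about how it could be done, not the Lesson 34 implementation; the Debouncer class and its method names are hypothetical.

```python
from datetime import datetime, timedelta

class Debouncer:
    """Allow one alert per key per cooldown window (hypothetical helper)."""

    def __init__(self, cooldown: timedelta):
        self.cooldown = cooldown
        self._last_sent = {}                 # key -> time of the last alert sent

    def should_send(self, key: str, now: datetime) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False                     # still inside the cooldown: stay quiet
        self._last_sent[key] = now
        return True

    def clear(self, key: str) -> None:
        """Call when the problem recovers, so the next failure alerts immediately."""
        self._last_sent.pop(key, None)

debounce = Debouncer(cooldown=timedelta(minutes=30))
t0 = datetime(2024, 1, 1, 3, 0)              # fixed times keep the demo deterministic
decisions = [
    debounce.should_send("health", t0 + timedelta(minutes=m))
    for m in (0, 5, 10, 45)                  # the same problem seen by four checks
]
print(decisions)
```

Four failing checks produce two alerts instead of four. This version keeps its state in memory, so a scheduled job that starts fresh each run would need to store the last-sent times in a file or database.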

The supporting habits

Config over hard-coding — paths, thresholds, recipients in config/env (Lesson 8), so the same code runs in dev and prod.
Secrets in the environment — never in code or git (Lessons 8, 25); rotate on exposure.
Dry-run by default for destructive actions — show before you act (Lessons 38, 44).
Fail loud, recover gracefully — exit non-zero, alert, but checkpoint so recovery is cheap (Lesson 45).
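
"Dry-run by default" usually means the destructive path needs an explicit opt-in flag. Here is a minimal sketch with argparse. The --apply flag name, the cleanup function, and the *.tmp pattern are assumptions chosen for the demo.

```python
import argparse
import tempfile
from pathlib import Path

def cleanup(folder: Path, apply: bool) -> list[Path]:
    """Return the matching files; delete them only when apply is True."""
    targets = sorted(folder.glob("*.tmp"))
    for p in targets:
        if apply:
            p.unlink(missing_ok=True)        # already gone is fine: safe to re-run
    return targets

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Remove *.tmp files")
    parser.add_argument("folder", type=Path)
    parser.add_argument("--apply", action="store_true",
                        help="actually delete (default: dry run, list only)")
    return parser.parse_args(argv)

# Demo on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as d:
    folder = Path(d)
    (folder / "a.tmp").write_text("x")
    (folder / "keep.txt").write_text("y")

    args = parse_args([str(folder)])                     # no --apply: dry run
    planned = [p.name for p in cleanup(args.folder, args.apply)]
    still_there = (folder / "a.tmp").exists()

    args = parse_args([str(folder), "--apply"])          # explicit opt-in
    cleanup(args.folder, args.apply)
    deleted = not (folder / "a.tmp").exists()
    kept = (folder / "keep.txt").exists()

print(planned, still_there, deleted, kept)
```

The safe behaviour is what you get when you forget a flag. Forgetting --apply costs you a second run; forgetting a --dry-run flag could cost you the files.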

Worked Example · A Production Checklist Applied

12 min

Goal: take a naive script and harden it through the three pillars — the same transformation you'd apply to anything before trusting it in production.

# BEFORE — a "works on my machine" script
import requests, csv

data = requests.get("https://api.example.com/sales?key=sk-live-abc123").json()
with open("/Users/me/report.csv", "w") as f:        # hard-coded path
    w = csv.writer(f)
    for row in data:                                 # crashes on one bad row
        w.writerow([row["id"], row["amount"]])
print("done")                                        # no record, no alert

# AFTER — production-grade
import os, csv, json, logging, sys
from pathlib import Path
from datetime import datetime
from dotenv import load_dotenv
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

HERE = Path(__file__).resolve().parent
load_dotenv(HERE / ".env")                           # secrets from env
logging.basicConfig(level=logging.INFO,              # observability: logs
    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("sales")

def session():
    retry = Retry(total=5, backoff_factor=1,
                  status_forcelist=[429, 500, 502, 503, 504])
    s = requests.Session()
    s.mount("https://", HTTPAdapter(max_retries=retry))
    return s

def run() -> int:
    out = Path(os.getenv("OUTPUT_DIR", HERE / "out"))  # config over hard-coding
    out.mkdir(parents=True, exist_ok=True)             # idempotent
    started = datetime.now()
    try:
        r = session().get("https://api.example.com/sales",
                          headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
                          timeout=15)                  # secret in env, resilient
        r.raise_for_status()
        rows, errors = r.json(), 0
        target = out / "report.csv"
        with open(target, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            w.writerow(["id", "amount"])
            for row in rows:
                try:
                    w.writerow([row["id"], float(row["amount"])])
                except (KeyError, TypeError, ValueError):   # per-item isolation
                    errors += 1                             # TypeError covers amount=None
                    log.warning("bad row skipped: %s", row)
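        elapsed = (datetime.now() - started).total_seconds()
        written = len(rows) - errors                   # skipped rows were not written
        log.info("wrote %d rows (%d skipped) in %.1fs", written, errors, elapsed)
        # metrics: one JSON line per run
        with open(out / "metrics.jsonl", "a", encoding="utf-8") as m:
            m.write(json.dumps({"at": started.isoformat(), "rows": written,
                                "errors": errors, "duration_s": elapsed}) + "\n")
        # heartbeat: overwritten on every success, so a stale file means trouble
        (out / "heartbeat.json").write_text(
            json.dumps({"at": datetime.now().timestamp()}))
        return 0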
    except Exception as e:
        log.exception("run failed")
        # alert(f"sales export FAILED: {e}", "critical")   # Lesson 34
        return 1

if __name__ == "__main__":
    sys.exit(run())                                    # fail loud (exit code)

Read the diff

Same job, transformed: the secret moved to the environment, the path became config with a safe default, mkdir(exist_ok=True) made it idempotent, a resilient session handles transient errors, per-row try/except isolates bad data, logs + a metrics line + (commented) alert give observability, and sys.exit fails loud for the scheduler. None of it is new — it's the habits applied. That checklist is what "production-ready" means.
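
One gap remains if you want re-runs to be fully safe: a crash halfway through writing report.csv leaves a truncated file behind. A common remedy, sketched here as an optional extra rather than part of the example above, is to write to a temporary file and swap it into place. The helper name write_csv_atomic is hypothetical.

```python
import csv
import os
import tempfile
from pathlib import Path

def write_csv_atomic(target: Path, header, rows) -> None:
    """Write to a sibling temp file, then swap it in with os.replace."""
    tmp = target.with_name(target.name + ".tmp")   # same folder, so same filesystem
    with open(tmp, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
    os.replace(tmp, target)                        # readers see the old or new file, never half

# Demo in a throwaway directory: run the export twice.
with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "report.csv"
    write_csv_atomic(target, ["id", "amount"], [[1, 9.5]])
    write_csv_atomic(target, ["id", "amount"], [[1, 9.5], [2, 3.0]])
    lines = target.read_text(encoding="utf-8").splitlines()
    leftovers = sorted(p.name for p in Path(d).iterdir())

print(lines, leftovers)
```

The second run replaces the first report completely and leaves no temp file behind, which is the property you want when a scheduler retries.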

Try It Yourself

13 min

01 🟢 Idempotence audit

Take any automation you've written and answer, for each step: "what happens if it runs twice?" List the non-idempotent steps and how you'd fix each (mkdir exist_ok, upsert, key check).

02 🟡 Add a heartbeat + metrics

Add a metrics line (rows/duration/errors as JSONL) and a heartbeat file to one of your jobs. Then write a tiny checker that warns if the heartbeat is older than expected — proving silence is detectable.

Hint

import json, time
from pathlib import Path
HB = Path("heartbeat.json")

def beat():
    HB.write_text(json.dumps({"at": time.time()}))

def check(max_age_s=3600):
    if not HB.exists():
        print("NEVER RAN")
        return
    age = time.time() - json.loads(HB.read_text())["at"]
    print("STALE" if age > max_age_s else f"ok ({age:.0f}s ago)")

03 🔴 Triage your alerts

List every place one of your tools could fail. For each, decide: page (critical), Slack (warning), or just log (info). Justify each with the "actionable + urgent?" test. The goal: as few pages as possible, none of them noise.

Mini-Challenge · The Readiness Linter

8 min

Write a "production-readiness" checklist as a script that scans one of your automations and reports which habits it has: secrets in env (not hard-coded — reuse Lesson 25's secret scanner), uses logging (not bare print), has error handling, idempotent dir creation, an exit code, config not hard-coded paths. Output a score and the gaps.

Show a sample solution

import re
from pathlib import Path

CHECKS = {
    "uses logging":        lambda s: "logging" in s and "log." in s,
    "no bare print debug": lambda s: s.count("print(") <= 2,
    "has error handling":  lambda s: "try:" in s and "except" in s,
    "secrets from env":    lambda s: "os.environ" in s or "getenv" in s,
    "no hardcoded secret": lambda s: not re.search(r"\bsk-[A-Za-z0-9_-]{8,}", s),  # also catches sk-live-abc123
    "idempotent mkdir":    lambda s: "exist_ok=True" in s or "mkdir" not in s,
    "has exit code":       lambda s: "sys.exit" in s or "return 0" in s,
}

def lint(path: str) -> None:
    src = Path(path).read_text(encoding="utf-8")
    passed = 0
    for name, test in CHECKS.items():
        ok = test(src)
        passed += ok
        print(f"  [{'✓' if ok else ' '}] {name}")
    print(f"\nReadiness: {passed}/{len(CHECKS)}")

lint("my_automation.py")

Non-negotiables: checks for logging, error handling, env secrets, idempotence, exit code; prints a score + gaps.

Recap

3 min

The DevOps mindset turns scripts into systems via three pillars: idempotence (every step safe to re-run — the default, not a patch), observability (logs + metrics + heartbeat, so you can always see what happened and silence means failure), and alerting (page only on actionable, urgent, rare problems — protect attention from alert fatigue). Supporting habits: config over hard-coding, secrets in env, dry-run for destructive actions, fail loud but recover cheaply. None of this is new code — it's the disciplined application of everything in Level 7. Carry this checklist into the capstone.

Vocabulary Card

idempotence: A property where running an operation many times leaves the same result as running it once, so repeats are safe.
observability: Being able to understand a system's behaviour from its outputs (logs, metrics, heartbeats).
heartbeat: A periodic "I'm alive/I ran" signal whose absence indicates failure.
alert fatigue: When too many alerts cause people to ignore them all.

Homework

4 min

Pick your best automation from this level and do a full "production hardening" pass against the three pillars and supporting habits. Run your readiness linter before and after. Write a short before/after note listing each habit you added and why it matters. This is the dress rehearsal for the capstone — bring something genuinely production-grade.

Sample · before/after hardening note

Tool: daily sales export.  Readiness: 2/7 → 7/7

Added:
+ secrets to .env (was hard-coded API key) — leak/rotation safety
+ output path from config w/ default — runs in dev & prod unchanged
+ mkdir(exist_ok=True) — idempotent, re-run safe
+ resilient session (retries) — survives transient API blips
+ per-row try/except — one bad record no longer kills the export
+ logging + metrics.jsonl + heartbeat — silence now = failure
+ sys.exit(1) + critical alert on failure — fails loud, pages me

Why it matters: before, a 3am API hiccup meant no report and no
warning. Now it retries, and if it truly fails I'm paged with the
reason, and a re-run is safe because it rewrites the whole report.

Non-negotiables: a real hardening pass touching all three pillars, linter before/after, and a why for each habit.
