Learning Goals
3 min
By the end of this lesson you can:
- Read CPU, memory, and disk usage with psutil.
- Check an HTTP endpoint's health and a TCP port's reachability.
- Turn raw metrics into a status (OK / WARN / CRITICAL) via thresholds.
- Produce a structured health report ready to alert on.
Warm-Up · What "Healthy" Means
5 min
"Is the server OK?" breaks into concrete, checkable questions:

Resource   Healthy?           Check with
CPU        not pegged         psutil.cpu_percent()
memory     room to spare      psutil.virtual_memory()
disk       not nearly full    psutil.disk_usage()
service    HTTP 200           requests.get(/health)
port       accepting conns    socket.connect_ex()
pip install psutil requests
Monitoring = measure a metric, compare it to a threshold, classify the result (OK/WARN/CRITICAL). psutil gives you the system metrics in one cross-platform library; requests and socket check services. Wrap each check to return a structured result, and the alert system from Lesson 34 does the rest.
New Concept · Measuring Health
14 min
System metrics with psutil
    import psutil

    cpu = psutil.cpu_percent(interval=1)   # % over a 1-second sample
    print(f"CPU: {cpu}%")

    mem = psutil.virtual_memory()
    print(f"Memory: {mem.percent}% used ({mem.available // 1_000_000} MB free)")

    disk = psutil.disk_usage("/")
    print(f"Disk: {disk.percent}% used ({disk.free // 1_000_000_000} GB free)")

    print(f"Boot: {psutil.boot_time()} procs: {len(psutil.pids())}")
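- cpu_percent(interval=1) blocks for one second and measures usage over that window. A non-blocking first call (interval=None) has no earlier reading to compare against and returns a meaningless 0.0.
- virtual_memory() and disk_usage(path) both expose a .percent plus raw bytes.
- The same calls work on Windows, macOS, and Linux, with no OS-specific commands.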
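The raw byte counts behind .percent are easier to read once scaled. A minimal sketch, assuming decimal units (1 GB = 1,000,000,000 bytes, matching the divisions in the snippet above); human_bytes is a hypothetical helper name, not part of psutil:

```python
def human_bytes(n: int) -> str:
    """Format a byte count using decimal units (1 kB = 1000 bytes)."""
    units = ["B", "kB", "MB", "GB", "TB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits below 1000,
        # or at the largest unit we know about.
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_bytes(512))
print(human_bytes(8_300_000_000))
```

With psutil available, it could be called as human_bytes(psutil.virtual_memory().available).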
Checking a service endpoint
    import requests

    def http_ok(url: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            r = requests.get(url, timeout=timeout)
            return r.status_code == 200, f"HTTP {r.status_code}"
        except requests.RequestException as e:
            return False, str(e)
A /health endpoint returning 200 is the classic "is the app alive?" check — far better than just "is the port open," because it proves the app actually responds.
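To make the check concrete, here is a minimal sketch of the server side, using only the standard library. The handler name, the JSON body, and the 404 for other paths are illustrative assumptions; a real app would usually expose this route through its own web framework. The client side uses urllib instead of requests so the block runs without third-party packages.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: 200 on /health, 404 on everything else."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def status_of(url: str, timeout: float = 5) -> int:
    """Return the HTTP status code, or 0 if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0       # no answer at all (refused, timeout, DNS failure)


# Port 0 asks the OS for any free port, so the demo never collides.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

health_status = status_of(f"http://127.0.0.1:{port}/health")
missing_status = status_of(f"http://127.0.0.1:{port}/nope")
print(health_status, missing_status)

server.shutdown()
server.server_close()
```

The point of the split between e.code and 0 is the same one the lesson makes: an error status still proves something answered, while 0 means nothing responded at all.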
Checking a TCP port
    import socket

    def port_open(host: str, port: int, timeout: int = 3) -> bool:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            try:
                return s.connect_ex((host, port)) == 0   # 0 = connected
            except OSError:
                return False   # e.g. the hostname did not resolve
connect_ex returns 0 if it can open a connection — a lightweight "is something listening on this port?" check (e.g. is the database on 5432 up?).
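connect_ex on an AF_INET socket covers IPv4 only, and it can still raise if the hostname fails to resolve. One alternative sketch uses socket.create_connection, which resolves the host and tries each returned address; the function name port_reachable is an assumption for illustration. The demo opens a throwaway local listener so it runs without network access.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False


# Demo: listen on an OS-assigned local port, probe it, then close and probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

while_listening = port_reachable("127.0.0.1", demo_port)
listener.close()
after_close = port_reachable("127.0.0.1", demo_port, timeout=1)
print(while_listening, after_close)
```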
Classifying against thresholds
    def classify(value: float, warn: float, crit: float) -> str:
        if value >= crit:
            return "critical"
        if value >= warn:
            return "warning"
        return "ok"

    print(classify(92, warn=80, crit=90))   # critical
    print(classify(85, warn=80, crit=90))   # warning
    print(classify(40, warn=80, crit=90))   # ok
Two thresholds turn a number into a verdict. Pick them deliberately: WARN = "look into it soon," CRITICAL = "act now." These map straight onto the alert levels from Lesson 34.
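Keeping thresholds in one table makes them easy to review and tune. A sketch, assuming the classify function shown above and the threshold values used in this lesson's worked example; THRESHOLDS and classify_all are hypothetical names:

```python
def classify(value: float, warn: float, crit: float) -> str:
    """Same rule as above: crit is checked first, then warn."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


# metric name -> (warn, crit), both in percent
THRESHOLDS = {
    "cpu": (80, 95),
    "memory": (80, 95),
    "disk": (80, 90),
}


def classify_all(metrics: dict[str, float]) -> dict[str, str]:
    """Classify each metric; every name must have an entry in THRESHOLDS."""
    results = {}
    for name, value in metrics.items():
        warn, crit = THRESHOLDS[name]
        if warn > crit:
            # A warn level above crit would make "warning" unreachable.
            raise ValueError(f"{name}: warn {warn} must not exceed crit {crit}")
        results[name] = classify(value, warn, crit)
    return results


print(classify_all({"cpu": 12.4, "memory": 85.0, "disk": 91.0}))
```

The warn-versus-crit check is a small guard against a misconfigured pair silently skipping the warning level.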
psutil reads the local machine. To check a remote server, either SSH in and run the checks there (Lesson 41 — uptime, df, etc.), or have each server expose a /health endpoint you hit over HTTP. A central monitor usually does the latter for services and SSH for deeper metrics.
Worked Example · A Health-Check Runner
12 min
Goal: run a suite of checks, classify each, roll up an overall status, and return a structured report: the thing you'd schedule and feed to your alert system.
    import logging
    from datetime import datetime

    import psutil
    import requests

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("health")

    def classify(value, warn, crit):
        return "critical" if value >= crit else "warning" if value >= warn else "ok"

    def run_checks() -> dict:
        checks = []

        cpu = psutil.cpu_percent(interval=1)
        checks.append({"name": "cpu", "value": cpu,
                       "status": classify(cpu, 80, 95), "unit": "%"})

        mem = psutil.virtual_memory().percent
        checks.append({"name": "memory", "value": mem,
                       "status": classify(mem, 80, 95), "unit": "%"})

        disk = psutil.disk_usage("/").percent
        checks.append({"name": "disk", "value": disk,
                       "status": classify(disk, 80, 90), "unit": "%"})

        # service check
        try:
            ok = requests.get("http://localhost:8000/health", timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        checks.append({"name": "web", "value": ok,
                       "status": "ok" if ok else "critical", "unit": ""})

        # roll-up: worst status wins
        order = {"ok": 0, "warning": 1, "critical": 2}
        overall = max((c["status"] for c in checks), key=lambda s: order[s])

        report = {"time": datetime.now().isoformat(), "overall": overall, "checks": checks}
        for c in checks:
            log.info("%-7s %s %s → %s", c["name"], c["value"], c["unit"], c["status"])
        log.info("OVERALL: %s", overall.upper())
        return report

    report = run_checks()

    # hand off to the alert system (Lesson 34):
    # if report["overall"] != "ok":
    #     alert(f"Health {report['overall']}: {report['checks']}", report['overall'])
    INFO cpu     12.4 % → ok
    INFO memory  47.0 % → ok
    INFO disk    91.0 % → critical
    INFO web     True  → ok
    INFO OVERALL: CRITICAL
Read the code
Each check returns the same shape (name, value, status, unit), so they're uniform to log, roll up, and alert on. The overall status is simply the worst of all checks: a single critical makes the whole box critical. Notice the seam at the bottom: when overall isn't OK, you call the Lesson 34 alert() with the matching level. Health checks produce events; the alert system routes them. Schedule this every few minutes (Lessons 35 and 36) and you have continuous monitoring.
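The roll-up and the report shape can be exercised without touching the machine. A sketch with hard-coded check results (the values are illustrative, not measurements); roll_up is a hypothetical helper name, and serializing with json is one possible hand-off format rather than something the lesson prescribes:

```python
import json

SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def roll_up(checks: list[dict]) -> str:
    """Return the worst status among the checks; an empty list counts as ok."""
    return max((c["status"] for c in checks), key=SEVERITY.__getitem__, default="ok")


checks = [
    {"name": "cpu", "value": 12.4, "status": "ok", "unit": "%"},
    {"name": "memory", "value": 85.0, "status": "warning", "unit": "%"},
    {"name": "disk", "value": 91.0, "status": "critical", "unit": "%"},
]
report = {"overall": roll_up(checks), "checks": checks}

# JSON keeps the report machine-readable for whatever consumes it next.
encoded = json.dumps(report, indent=2)
print(encoded)
```

Treating an empty check list as "ok" is a design choice; a stricter monitor might prefer to flag a run that executed no checks.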
Try It Yourself
13 min
Print your own CPU %, memory %, and disk % using psutil. Run it while opening some apps and watch the numbers change.
Write checks for a URL you control (or a public one) and a port (e.g. 443 on a website). Report each as up/down with the reason on failure.
Hint
    print(http_ok("https://example.com"))   # e.g. (True, 'HTTP 200')
    print(port_open("example.com", 443))    # True if the port accepts connections
Use psutil.process_iter to list the top 5 processes by memory usage. Useful for "what's eating all the RAM?" diagnostics.
Hint
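    import psutil

    procs = [
        (p.info["memory_percent"], p.info["name"])
        for p in psutil.process_iter(["name", "memory_percent"])
        if p.info["memory_percent"] is not None   # None when access is denied
    ]
    # Sort on the memory figure only, highest first.
    for mem, name in sorted(procs, key=lambda t: t[0], reverse=True)[:5]:
        print(f"{mem:5.1f}% {name}")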
Mini-Challenge · The Config-Driven Monitor
8 min
Build a monitor whose checks come from a config (JSON/dict): a list of targets like {"type": "http", "url": ...}, {"type": "port", "host":..., "port":...}, {"type": "disk", "path":..., "warn":..., "crit":...}. Run them all and produce a status report. New checks = config entries, not code.
Show a sample solution
    import socket

    import psutil
    import requests

    def classify(value, warn, crit):   # same helper as in the worked example
        return "critical" if value >= crit else "warning" if value >= warn else "ok"

    def check_disk(c):
        v = psutil.disk_usage(c["path"]).percent
        return {"name": c["path"], "value": v,
                "status": classify(v, c["warn"], c["crit"])}

    def check_http(c):
        try:
            ok = requests.get(c["url"], timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        return {"name": c["url"], "value": ok, "status": "ok" if ok else "critical"}

    def check_port(c):
        with socket.socket() as s:
            s.settimeout(3)
            try:
                ok = s.connect_ex((c["host"], c["port"])) == 0
            except OSError:   # e.g. the hostname did not resolve
                ok = False
        return {"name": f"{c['host']}:{c['port']}", "value": ok,
                "status": "ok" if ok else "critical"}

    # One handler per check type; a new type means one new entry here.
    HANDLERS = {"disk": check_disk, "http": check_http, "port": check_port}

    CONFIG = [
        {"type": "disk", "path": "/", "warn": 80, "crit": 90},
        {"type": "http", "url": "https://example.com"},
        {"type": "port", "host": "example.com", "port": 443},
    ]

    for c in CONFIG:
        print(HANDLERS[c["type"]](c))
Non-negotiables: checks from config, a handler per type, uniform result shape, easily extensible.
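One design point the sample leaves open is what happens when a config entry has an unknown type or a handler raises. A sketch of a defensive runner, assuming such failures should surface as critical results rather than crash the run; run_config, the "error" field, and the stub handler are hypothetical:

```python
def run_config(config: list[dict], handlers: dict) -> list[dict]:
    """Run every config entry; turn bad entries and handler errors into results."""
    results = []
    for entry in config:
        kind = entry.get("type", "<missing>")
        handler = handlers.get(kind)
        if handler is None:
            results.append({"name": kind, "value": None, "status": "critical",
                            "error": f"unknown check type: {kind}"})
            continue
        try:
            results.append(handler(entry))
        except Exception as e:  # one failing check must not stop the others
            results.append({"name": kind, "value": None, "status": "critical",
                            "error": str(e)})
    return results


def check_always_ok(entry: dict) -> dict:
    """Stub handler standing in for a real disk/http/port check."""
    return {"name": entry["label"], "value": True, "status": "ok"}


demo = run_config(
    [{"type": "stub", "label": "a"}, {"type": "typo"}, {"type": "stub"}],
    {"stub": check_always_ok},
)
for r in demo:
    print(r)
```

The third entry omits the "label" key the stub expects, so it exercises the exception path while the run still completes.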
Recap
3 min
Health monitoring is measure → compare → classify. psutil gives cross-platform CPU (cpu_percent(interval=1)), memory, and disk metrics; requests checks an HTTP /health endpoint and socket.connect_ex checks a port. Turn each number into OK/WARN/CRITICAL with two thresholds, give every check a uniform result shape, and roll up to the worst status. psutil reads the local box; for remote, SSH in (Lesson 41) or hit a health endpoint. The report feeds straight into the alert system (Lesson 34) and runs on a schedule (35/36): continuous monitoring, assembled from parts you already have.
Vocabulary Card
- psutil
- A cross-platform library for system metrics (CPU, memory, disk, processes).
- health endpoint
- A URL (e.g. /health) that returns 200 when the app is alive.
- threshold
- A value that separates OK from WARN from CRITICAL.
- roll-up
- Combining many check statuses into one overall status (worst wins).
Homework
4 min
Build healthcheck.py: a config-driven monitor with CPU/memory/disk (psutil), at least one HTTP and one port check, threshold classification, and a worst-wins overall status. Wire it to your Lesson 34 alert system so a non-OK result fires the right-level alert (and recovers when healthy again). Schedule it to run every few minutes and confirm it alerts when you push a metric over threshold (e.g. fill a small disk/partition or stop a service).
Sample · healthcheck.py wiring
    from alerts import raise_alert, clear_alert   # Lesson 34

    def main():
        report = run_checks()   # config-driven, returns overall+checks
        if report["overall"] == "ok":
            clear_alert("health")   # recovery if previously alerting
        else:
            details = ", ".join(
                f"{c['name']}={c['value']}{c.get('unit', '')}"
                for c in report["checks"] if c["status"] != "ok"
            )
            raise_alert("health", report["overall"],
                        f"health {report['overall']}: {details}")

    if __name__ == "__main__":
        main()   # without this call, the cron job below would do nothing

    # cron: */5 * * * * python healthcheck.py
    # Debounce in raise_alert means a sustained-high disk pages once,
    # clear_alert sends the all-clear when it drops back under threshold.
Non-negotiables: psutil + service checks, thresholds, worst-wins, wired to alerts with recovery, scheduled, alert proven.
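The raise-once, clear-on-recovery behaviour described in the wiring comments can be pictured as a comparison between the previous and the current overall status. The sketch below is a hypothetical stand-in, not the Lesson 34 implementation; next_action and its return values are assumptions for illustration:

```python
from typing import Optional


def next_action(previous: str, current: str) -> Optional[str]:
    """Decide what to send when the overall status moves from previous to current."""
    if current == "ok":
        # Only announce recovery if we were alerting before.
        return "clear" if previous != "ok" else None
    if current != previous:
        # New problem, or a change of severity: alert at the new level.
        return f"raise:{current}"
    return None  # same non-ok status as the last run: stay quiet (debounce)


history = ["ok", "warning", "warning", "critical", "ok", "ok"]
previous = "ok"
actions = []
for current in history:
    actions.append(next_action(previous, current))
    previous = current
print(actions)
```

In a cron-driven script the previous status would have to persist between runs, for example in a small state file, since each invocation starts fresh.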