PY-L7-43 · Server Health Checks

Learning Goals

3 min

By the end of this lesson you can:

Read CPU, memory, and disk usage with psutil.
Check an HTTP endpoint's health and a TCP port's reachability.
Turn raw metrics into a status (OK / WARN / CRITICAL) via thresholds.
Produce a structured health report ready to alert on.

Warm-Up · What "Healthy" Means

5 min

"Is the server OK?" breaks into concrete, checkable questions:

Resource   healthy?            check with
CPU        not pegged          psutil.cpu_percent()
memory     room to spare       psutil.virtual_memory()
disk       not nearly full     psutil.disk_usage()
service    HTTP 200            requests.get(/health)
port       accepting conns     socket.connect_ex()

pip install psutil requests

Today's big idea

Monitoring = measure a metric, compare it to a threshold, classify the result (OK/WARN/CRITICAL). psutil gives you the system metrics in one cross-platform library; requests and socket check services. Wrap each check to return a structured result, and the alert system from Lesson 34 does the rest.

New Concept · Measuring Health

14 min

System metrics with psutil

import psutil

cpu = psutil.cpu_percent(interval=1)        # % over a 1-second sample
print(f"CPU: {cpu}%")

mem = psutil.virtual_memory()
print(f"Memory: {mem.percent}% used ({mem.available // 1_000_000} MB available)")

disk = psutil.disk_usage("/")
print(f"Disk: {disk.percent}% used ({disk.free // 1_000_000_000} GB free)")

print(f"Boot: {psutil.boot_time()}  procs: {len(psutil.pids())}")

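cpu_percent(interval=1) blocks for one second and reports utilization over that window. With interval=None the first call returns a meaningless 0.0, because there is no earlier sample to compare against.
virtual_memory() and disk_usage(path) both expose a .percent plus raw byte counts.
The same calls run on Windows, macOS, and Linux with no OS-specific commands, though some extra fields exist only on certain platforms.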
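The snapshot above prints psutil.boot_time() as raw epoch seconds, which is hard to read at a glance. Here is a minimal sketch of turning that number into a boot time and an uptime. describe_boot is a hypothetical helper, not part of psutil, and it takes plain timestamps so it runs without psutil installed.

```python
from datetime import datetime, timedelta, timezone

def describe_boot(boot_ts: float, now_ts: float) -> str:
    """Format an epoch boot timestamp as a UTC boot time plus uptime."""
    booted = datetime.fromtimestamp(boot_ts, tz=timezone.utc)
    up = timedelta(seconds=int(now_ts - boot_ts))
    hours, rem = divmod(up.seconds, 3600)   # up.seconds excludes whole days
    minutes = rem // 60
    return f"booted {booted:%Y-%m-%d %H:%M} UTC, up {up.days}d {hours}h {minutes}m"

# Fixed sample timestamps so the output is reproducible.
print(describe_boot(boot_ts=1_700_000_000, now_ts=1_700_000_000 + 93_784))
# booted 2023-11-14 22:13 UTC, up 1d 2h 3m
```

On a real machine you would call it as describe_boot(psutil.boot_time(), time.time()).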

Checking a service endpoint

import requests

def http_ok(url: str, timeout: int = 5) -> tuple[bool, str]:
    try:
        r = requests.get(url, timeout=timeout)
        return r.status_code == 200, f"HTTP {r.status_code}"
    except requests.RequestException as e:
        return False, str(e)

A /health endpoint returning 200 is the classic "is the app alive?" check — far better than just "is the port open," because it proves the app actually responds.
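If you have no service to test against, you can stand one up yourself. Below is a sketch of the smallest possible /health endpoint, using only the standard library. HealthHandler, start_health_server, and fetch are hypothetical names, and the JSON body is an assumption; a real application would expose the route through its own web framework.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this demo

def start_health_server() -> HTTPServer:
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def fetch(port: int, path: str) -> tuple[int, str]:
    """GET a path from the local demo server; return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()

server = start_health_server()
print(fetch(server.server_port, "/health"))    # (200, '{"status": "ok"}')
print(fetch(server.server_port, "/other")[0])  # 404
server.shutdown()
server.server_close()
```

While the server is running, http_ok(f"http://127.0.0.1:{server.server_port}/health") from the lesson should report it as healthy.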

Checking a TCP port

import socket

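def port_open(host: str, port: int, timeout: float = 3) -> bool:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            return s.connect_ex((host, port)) == 0   # 0 = connected
    except OSError:          # e.g. the hostname does not resolve
        return False

connect_ex returns 0 when the connection succeeds and an error code otherwise, which makes it a lightweight "is something listening on this port?" check (e.g. is the database on 5432 up?). It still raises for problems such as an unresolvable hostname, so the function catches OSError and reports those as closed too.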
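port_open collapses every failure into a bare False. When you want the reason as well, in the same style as http_ok, one possible variant looks like the sketch below. port_status is a hypothetical name, and the exact error text depends on your operating system.

```python
import errno
import os
import socket

def port_status(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Return (is_open, reason) instead of a bare bool."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            code = s.connect_ex((host, port))
    except OSError as e:                  # e.g. DNS failure (socket.gaierror)
        return False, f"error: {e}"
    if code == 0:
        return True, "connected"
    # Translate the numeric error code into a name and a message.
    return False, f"{errno.errorcode.get(code, code)}: {os.strerror(code)}"

# Demo against a throwaway local listener, so no network access is needed.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))           # port 0 = let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
print(port_status("127.0.0.1", port))     # (True, 'connected')
listener.close()
print(port_status("127.0.0.1", port))     # now closed: (False, <OS-specific reason>)
```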

Classifying against thresholds

def classify(value: float, warn: float, crit: float) -> str:
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"

print(classify(92, warn=80, crit=90))    # critical
print(classify(85, warn=80, crit=90))    # warning
print(classify(40, warn=80, crit=90))    # ok

Two thresholds turn a number into a verdict. Pick them deliberately: WARN = "look into it soon," CRITICAL = "act now." These map straight onto the alert levels from Lesson 34.
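classify assumes that higher is worse, which suits usage percentages. Some metrics run the other way: free disk space in GB gets worse as it shrinks. A minimal sketch of the mirrored version follows; classify_low is a hypothetical name, and the sanity check on the threshold order is an added assumption, not something the lesson requires.

```python
def classify_low(value: float, warn: float, crit: float) -> str:
    """Like classify, but for metrics where smaller is worse (e.g. free GB)."""
    if crit > warn:
        raise ValueError("for low-is-bad metrics, crit must not exceed warn")
    if value <= crit:
        return "critical"
    if value <= warn:
        return "warning"
    return "ok"

print(classify_low(3, warn=20, crit=5))     # critical
print(classify_low(12, warn=20, crit=5))    # warning
print(classify_low(50, warn=20, crit=5))    # ok
```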

Local vs. remote checks

psutil reads only the local machine. To check a remote server, either SSH in and run the checks there (Lesson 41: uptime, df, etc.) or have each server expose a /health endpoint that you poll over HTTP. A central monitor typically polls health endpoints for service status and falls back to SSH for deeper metrics.
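The central-monitor pattern can be sketched as a fan-out: one function polls many health URLs in parallel and collects the results. poll_all and fake_check are hypothetical names, and the hostnames are placeholders. The checker is passed in as a parameter so the sketch runs without any network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

CheckResult = tuple[bool, str]

def poll_all(urls: list[str],
             check: Callable[[str], CheckResult]) -> dict[str, CheckResult]:
    """Run check(url) for every URL concurrently; map each URL to its result."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(check, urls)      # results come back in input order
        return dict(zip(urls, results))

def fake_check(url: str) -> CheckResult:
    """Stand-in for a real HTTP check, used only for this demo."""
    if "web1" in url:
        return True, "HTTP 200"
    return False, "connection refused"

servers = ["http://web1.example/health", "http://web2.example/health"]
for url, (ok, detail) in poll_all(servers, fake_check).items():
    print(f"{'UP  ' if ok else 'DOWN'} {url}  ({detail})")
```

In a real monitor you would pass http_ok from this lesson instead of fake_check; threads suit this job because each check spends most of its time waiting on the network.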

Worked Example · A Health-Check Runner

12 min

Goal: run a suite of checks, classify each, roll up an overall status, and return a structured report — the thing you'd schedule and feed to your alert system.

import psutil, requests, socket, logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("health")

def classify(value, warn, crit):
    return "critical" if value >= crit else "warning" if value >= warn else "ok"

def run_checks() -> dict:
    checks = []

    cpu = psutil.cpu_percent(interval=1)
    checks.append({"name": "cpu", "value": cpu,
                   "status": classify(cpu, 80, 95), "unit": "%"})

    mem = psutil.virtual_memory().percent
    checks.append({"name": "memory", "value": mem,
                   "status": classify(mem, 80, 95), "unit": "%"})

    disk = psutil.disk_usage("/").percent
    checks.append({"name": "disk", "value": disk,
                   "status": classify(disk, 80, 90), "unit": "%"})

    # service check
    try:
        ok = requests.get("http://localhost:8000/health", timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    checks.append({"name": "web", "value": ok,
                   "status": "ok" if ok else "critical", "unit": ""})

    # roll-up: worst status wins
    order = {"ok": 0, "warning": 1, "critical": 2}
    overall = max((c["status"] for c in checks), key=lambda s: order[s])

    report = {"time": datetime.now().isoformat(),
              "overall": overall, "checks": checks}
    for c in checks:
        log.info("%-7s %s %s  → %s", c["name"], c["value"], c["unit"], c["status"])
    log.info("OVERALL: %s", overall.upper())
    return report

report = run_checks()
# hand off to the alert system (Lesson 34):
# if report["overall"] != "ok":
#     alert(f"Health {report['overall']}: {report['checks']}", report['overall'])

INFO cpu     12.4 %  → ok
INFO memory  47.0 %  → ok
INFO disk    91.0 %  → critical
INFO web     True   → ok
INFO OVERALL: CRITICAL

Read the code

Every check produces the same shape (name, value, status, unit), so results are uniform to log, roll up, and alert on. The overall status is simply the worst of all checks: a single critical makes the whole box critical. Notice the seam at the bottom: when overall isn't OK, you call the Lesson 34 alert() with the matching level. Health checks produce events; the alert system routes them. Schedule this every few minutes (Lessons 35 and 36) and you have continuous monitoring.
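Because the report is built from plain dicts, lists, numbers, and strings, it serializes directly to JSON for a log file or a dashboard. The sketch below also sorts the checks worst-first so the problem is the first thing a reader sees. worst_first and to_json are hypothetical helpers, and the timestamp in the sample report is made up.

```python
import json

SEVERITY = {"ok": 0, "warning": 1, "critical": 2}

def worst_first(checks: list[dict]) -> list[dict]:
    """Sort checks so the most severe come first (ties keep their order)."""
    return sorted(checks, key=lambda c: SEVERITY[c["status"]], reverse=True)

def to_json(report: dict) -> str:
    """Serialize a report with its checks ordered worst-first."""
    ordered = {**report, "checks": worst_first(report["checks"])}
    return json.dumps(ordered, indent=2)

# Sample report mirroring the log output above.
sample = {
    "time": "2024-01-01T12:00:00",
    "overall": "critical",
    "checks": [
        {"name": "cpu", "value": 12.4, "status": "ok", "unit": "%"},
        {"name": "memory", "value": 47.0, "status": "ok", "unit": "%"},
        {"name": "disk", "value": 91.0, "status": "critical", "unit": "%"},
        {"name": "web", "value": True, "status": "ok", "unit": ""},
    ],
}

print([c["name"] for c in worst_first(sample["checks"])])
# ['disk', 'cpu', 'memory', 'web']
print(to_json(sample))
```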

Try It Yourself

13 min

01 🟢 Snapshot your machine

Print your own CPU %, memory %, and disk % using psutil. Run it while opening some apps and watch the numbers change.

02 🟡 Endpoint + port checks

Write checks for a URL you control (or a public one) and a port (e.g. 443 on a website). Report each as up/down with the reason on failure.

Hint

print(http_ok("https://example.com"))     # (True, "HTTP 200")
print(port_open("example.com", 443))      # True

03 🔴 Top processes

Use psutil.process_iter to list the top 5 processes by memory usage. Useful for "what's eating all the RAM?" diagnostics.

Hint

import psutil
procs = [(p.info["memory_percent"], p.info["name"])
         for p in psutil.process_iter(["name", "memory_percent"])
         if p.info["memory_percent"] is not None]   # None = access denied
for mem, name in sorted(procs, key=lambda t: t[0], reverse=True)[:5]:
    print(f"{mem:5.1f}%  {name}")

Mini-Challenge · The Config-Driven Monitor

8 min

Build a monitor whose checks come from a config (JSON/dict): a list of targets like {"type": "http", "url": ...}, {"type": "port", "host":..., "port":...}, {"type": "disk", "path":..., "warn":..., "crit":...}. Run them all and produce a status report. New checks = config entries, not code.

Show a sample solution

import psutil, requests, socket

def classify(value, warn, crit):      # same helper as in the worked example
    return "critical" if value >= crit else "warning" if value >= warn else "ok"

def check_disk(c):
    v = psutil.disk_usage(c["path"]).percent
    return {"name": c["path"], "value": v,
            "status": classify(v, c["warn"], c["crit"])}

def check_http(c):
    try:
        ok = requests.get(c["url"], timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    return {"name": c["url"], "value": ok,
            "status": "ok" if ok else "critical"}

def check_port(c):
    try:
        with socket.socket() as s:
            s.settimeout(3)
            ok = s.connect_ex((c["host"], c["port"])) == 0
    except OSError:                   # e.g. the hostname does not resolve
        ok = False
    return {"name": f"{c['host']}:{c['port']}", "value": ok,
            "status": "ok" if ok else "critical"}

# One handler per check type; a new type means one new function plus one entry here.
HANDLERS = {"disk": check_disk, "http": check_http, "port": check_port}

CONFIG = [
    {"type": "disk", "path": "/", "warn": 80, "crit": 90},
    {"type": "http", "url": "https://example.com"},
    {"type": "port", "host": "example.com", "port": 443},
]

for c in CONFIG:
    print(HANDLERS[c["type"]](c))

Non-negotiables: checks from config, a handler per type, uniform result shape, easily extensible.
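The sample solution keeps CONFIG inline as a Python list. If the config lives in a JSON file instead, it helps to validate it up front, so a typo fails with a clear message rather than a KeyError in the middle of a run. Here is a minimal sketch; load_config and the REQUIRED table are hypothetical, and the required keys simply mirror the three target types named in the challenge.

```python
import json

# Keys each check type must provide, taken from the challenge description.
REQUIRED = {
    "disk": {"path", "warn", "crit"},
    "http": {"url"},
    "port": {"host", "port"},
}

def load_config(text: str) -> list[dict]:
    """Parse a JSON list of check entries and validate each one."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("config must be a JSON list of check entries")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i}: expected an object")
        kind = entry.get("type")
        if kind not in REQUIRED:
            raise ValueError(f"entry {i}: unknown type {kind!r}")
        missing = REQUIRED[kind] - set(entry)
        if missing:
            raise ValueError(f"entry {i}: missing {sorted(missing)}")
    return entries

raw = """[
  {"type": "disk", "path": "/", "warn": 80, "crit": 90},
  {"type": "port", "host": "example.com", "port": 443}
]"""
print(len(load_config(raw)), "checks loaded")   # 2 checks loaded
```

To read from disk, pass the file's contents, for example load_config(Path("checks.json").read_text()), where checks.json is a filename of your choosing.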

Recap

3 min

Health monitoring is measure → compare → classify. psutil gives cross-platform CPU (cpu_percent(interval=1)), memory, and disk metrics; requests checks an HTTP /health endpoint and socket.connect_ex checks a port. Turn each number into OK/WARN/CRITICAL with two thresholds, give every check a uniform result shape, and roll up to the worst status. psutil reads the local box; for remote, SSH in (Lesson 41) or hit a health endpoint. The report feeds straight into the alert system (Lesson 34) and runs on a schedule (35/36) — continuous monitoring, assembled from parts you already have.

Vocabulary Card

psutil: A cross-platform library for system metrics (CPU, memory, disk, processes).
health endpoint: A URL (e.g. /health) that returns 200 when the app is alive.
threshold: A value that separates OK from WARN from CRITICAL.
roll-up: Combining many check statuses into one overall status (worst wins).

Homework

4 min

Build healthcheck.py: a config-driven monitor with CPU/memory/disk (psutil), at least one HTTP and one port check, threshold classification, and a worst-wins overall status. Wire it to your Lesson 34 alert system so a non-OK result fires the right-level alert (and recovers when healthy again). Schedule it to run every few minutes and confirm it alerts when you push a metric over threshold (e.g. fill a small disk/partition or stop a service).

Sample · healthcheck.py wiring

from alerts import raise_alert, clear_alert   # Lesson 34

def main():
    report = run_checks()              # config-driven, returns overall+checks
    if report["overall"] == "ok":
        clear_alert("health")          # recovery if previously alerting
    else:
        details = ", ".join(f"{c['name']}={c['value']}{c.get('unit','')}"
                            for c in report["checks"]
                            if c["status"] != "ok")
        raise_alert("health", report["overall"],
                    f"health {report['overall']}: {details}")

if __name__ == "__main__":
    main()

# cron: */5 * * * * python healthcheck.py
# Debounce in raise_alert means a sustained-high disk pages only once;
# clear_alert sends the all-clear when it drops back under threshold.

Non-negotiables: psutil + service checks, thresholds, worst-wins, wired to alerts with recovery, scheduled, alert proven.
