PY-L8-23 · Process Inspection & Anomaly Detection

Learning Goals

3 min

By the end of this lesson you can:

Diff a live host snapshot against a baseline to find new processes/ports/connections.
Inspect process lineage (parent → child) and binary paths for red flags.
Apply layered anomaly heuristics and score suspicion.
Reason about false positives and the limits of heuristic detection.

Warm-Up · Normal Is the Best Detector

5 min

You can't maintain a list of every possible piece of malware: there are millions of samples, and new ones appear daily. But you can know what your own clean system looks like. Anything that deviates from that known-good baseline is worth a look.

Today's big idea

Anomaly detection flips the problem: instead of recognising "bad" (impossible to enumerate), you recognise "not normal." Diff a live snapshot against your baseline (Lesson 22) and the new things — a process that appeared, a port that opened, a connection to an unfamiliar IP — bubble up. Add heuristics about how a process looks (its path, parent, name) and you have lightweight host-based intrusion detection. It's imperfect (false positives), but it's how real EDR tools start.

New Concept · Diff, Lineage & Heuristics

14 min

1. Diff against the baseline

import json, psutil

def current_names() -> set:
    names = set()
    for p in psutil.process_iter(["name"]):
        names.add(p.info["name"])
    return names

def diff_processes(baseline_path: str) -> set:
    # Load the known-good process names; the with-block closes the file for us.
    with open(baseline_path) as f:
        base = set(json.load(f)["process_names"])
    now = current_names()
    new = now - base                 # processes present NOW but not in baseline
    return new

print("new since baseline:", diff_processes("host_baseline.json"))

Set subtraction does the work: now - baseline = what appeared. Do the same for listening ports and established remotes. New items aren't necessarily malicious (you may have launched a new app), but they're where to look first.
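The same subtraction can be applied to every category in one pass. The sketch below is a minimal illustration, not the course's reference code: it assumes the baseline is a dict with the keys process_names and established_remotes (both appear in this lesson) plus a listening_ports key, which is an assumed name for illustration. The sample data is made up; in practice the live dict would be filled from psutil.

```python
CATEGORIES = ("process_names", "listening_ports", "established_remotes")

def diff_snapshot(baseline: dict, live: dict) -> dict:
    """Return, per category, the items present in `live` but absent from `baseline`."""
    new_items = {}
    for key in CATEGORIES:
        appeared = set(live.get(key, [])) - set(baseline.get(key, []))
        if appeared:
            new_items[key] = sorted(appeared)
    return new_items

# Made-up snapshots for illustration only.
baseline = {
    "process_names": ["sshd", "nginx"],
    "listening_ports": [22, 80],
    "established_remotes": ["203.0.113.10:443"],
}
live = {
    "process_names": ["sshd", "nginx", "nc"],
    "listening_ports": [22, 80, 4444],
    "established_remotes": ["203.0.113.10:443", "198.51.100.7:4444"],
}

print(diff_snapshot(baseline, live))
# {'process_names': ['nc'], 'listening_ports': [4444], 'established_remotes': ['198.51.100.7:4444']}
```

Categories with nothing new are left out of the result, so an empty dict means the host matches its baseline.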

2. Inspect process lineage

import psutil

def lineage(pid: int) -> list[str]:
    """Walk parent chain: who launched this process?"""
    chain = []
    try:
        p = psutil.Process(pid)
        while p is not None:
            chain.append(f"{p.name()}({p.pid})")
            p = p.parent()           # None once we reach the top of the tree
    except psutil.Error:
        pass                         # process vanished or access denied: keep what we have
    return chain        # e.g. ['evil.bin(900)', 'bash(880)', 'sshd(440)', ...]

Lineage is forensic gold. A shell (bash, cmd, powershell) spawned by a web server process is a huge red flag — it often means a webshell or RCE gave an attacker command execution. Legitimate apps have predictable parents; attacker payloads often don't.
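One way to turn that red flag into a check is to scan a parent chain for a shell that has a web server anywhere above it. This is a rough sketch under assumptions: the chain is a list of plain process names ordered child-first (without the PID suffix), and the WEB_SERVERS and SHELLS sets are illustrative, not exhaustive.

```python
WEB_SERVERS = {"nginx", "apache2", "httpd"}              # illustrative, not exhaustive
SHELLS = {"sh", "bash", "cmd.exe", "powershell.exe"}

def shell_under_web_server(chain: list[str]) -> bool:
    """True if any shell in the chain has a web server among its ancestors.

    `chain` is ordered child-first, e.g. ['sh', 'php-fpm', 'nginx', 'systemd'].
    """
    names = [n.lower() for n in chain]
    for i, name in enumerate(names):
        ancestors = names[i + 1:]
        if name in SHELLS and any(a in WEB_SERVERS for a in ancestors):
            return True
    return False

print(shell_under_web_server(["sh", "php-fpm", "nginx", "systemd"]))   # True
print(shell_under_web_server(["bash", "sshd", "systemd"]))             # False
```

Checking all ancestors rather than only the direct parent catches the case where an intermediate worker process sits between the web server and the shell.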

3. Anomaly heuristics (layered, scored)

import psutil

SUSPICIOUS_PARENTS = {"nginx", "apache2", "httpd", "java", "node"}  # shouldn't spawn shells
SHELLS = {"sh", "bash", "cmd.exe", "powershell.exe"}

def suspicion_score(p: psutil.Process) -> tuple[int, list[str]]:
    score, reasons = 0, []
    try:
        exe = (p.exe() or "").lower()
        name = p.name().lower()
        par = p.parent()                       # look the parent up once
        parent = par.name().lower() if par else ""
        in_temp = "/tmp" in exe or "\\temp\\" in exe or "/dev/shm" in exe

        if in_temp:
            score += 3; reasons.append("binary in a temp/world-writable dir")
        if name in SHELLS and parent in SUSPICIOUS_PARENTS:
            score += 5; reasons.append(f"shell spawned by {parent} (RCE?)")
        # Windows reports e.g. "NT AUTHORITY\SYSTEM", so compare only the account part.
        if in_temp and p.username().split("\\")[-1] in ("root", "SYSTEM"):
            score += 3; reasons.append("privileged process from temp dir")
        if not exe:
            score += 1; reasons.append("no backing executable (memory-only?)")
    except psutil.Error:
        pass                                   # process exited or access denied: keep partial score
    return score, reasons

No single heuristic is proof, so we score and combine. A binary in /tmp (+3) that is a shell spawned by nginx (+5) and runs as root (+3) totals 11, high enough to demand investigation. Each rule encodes a known attacker behaviour (TTP).

⚠️ Heuristics produce false positives — by design

A high score means "investigate," not "guilty." Legitimate software sometimes runs from unusual paths or spawns shells. The goal is to narrow hundreds of processes to a handful worth a human look — then you confirm with context (signatures, file origin, network behaviour) before any action. Tuning to reduce noise without missing real threats is the eternal balancing act of detection.
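Tuning often comes down to two knobs: an allow-list for known-good oddities and a threshold below which findings are dropped. The sketch below illustrates that idea only; the Finding type, the allow-list entry, and the threshold value are all hypothetical and would need to be chosen for your own machine.

```python
from typing import NamedTuple

class Finding(NamedTuple):
    score: int
    name: str
    exe: str

# Hypothetical allow-list entries: (process name, trusted path prefix).
ALLOWLIST = {("dpkg", "/tmp/apt-dpkg-install-")}
THRESHOLD = 3   # hypothetical cut-off: drop anything scoring lower

def triage(findings, allowlist=ALLOWLIST, threshold=THRESHOLD):
    """Drop allow-listed and low-scoring findings, then rank the rest."""
    kept = []
    for f in findings:
        allow_listed = any(
            f.name == name and f.exe.startswith(prefix)
            for name, prefix in allowlist
        )
        if allow_listed or f.score < threshold:
            continue
        kept.append(f)
    return sorted(kept, key=lambda f: f.score, reverse=True)

sample = [
    Finding(3, "dpkg", "/tmp/apt-dpkg-install-x1/preinst"),  # allow-listed
    Finding(8, "sh", "/tmp/.x/sh"),
    Finding(1, "newapp", "/usr/bin/newapp"),                 # below threshold
    Finding(3, "dpkg", "/tmp/evil/dpkg"),                    # right name, wrong path
]
for f in triage(sample):
    print(f.score, f.name, f.exe)
# 8 sh /tmp/.x/sh
# 3 dpkg /tmp/evil/dpkg
```

Matching on name and path together is deliberate: an allow-list keyed on the name alone would let an attacker hide by naming a payload after a trusted tool.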

Worked Example · A Mini Host-IDS

12 min

Goal: combine the baseline diff with the heuristic scorer into a small host intrusion-detection scan that ranks processes by suspicion and explains why — so a human can triage fast.

import json, time, psutil

SHELLS = {"sh", "bash", "zsh", "cmd.exe", "powershell.exe"}
SHOULDNT_SPAWN_SHELL = {"nginx", "apache2", "httpd", "node", "java", "python"}

def score(p: psutil.Process, baseline_names: set) -> tuple[int, list[str]]:
    s, reasons = 0, []
    try:
        name = p.name().lower()
        exe = (p.exe() or "").lower()
        par = p.parent()                     # look the parent up once
        parent = par.name().lower() if par else ""
        if name not in baseline_names:
            s += 1; reasons.append("not in baseline")
        if any(d in exe for d in ("/tmp", "\\temp\\", "/dev/shm")):
            s += 3; reasons.append("runs from temp/world-writable dir")
        if name in SHELLS and parent in SHOULDNT_SPAWN_SHELL:
            s += 5; reasons.append(f"shell child of {parent} — possible RCE")
        # Measures CPU since the priming call in scan(); can exceed 100 on multi-core hosts.
        if p.cpu_percent() > 85:
            s += 2; reasons.append("very high CPU (miner?)")
    except psutil.Error:
        pass                                 # process exited or access denied: keep partial score
    return s, reasons

def scan(baseline_path: str) -> None:
    with open(baseline_path) as f:
        baseline = {n.lower() for n in json.load(f)["process_names"]}

    procs = list(psutil.process_iter())
    for p in procs:                          # the first cpu_percent() call returns a meaningless 0.0
        try:
            p.cpu_percent()
        except psutil.Error:
            pass
    time.sleep(1)                            # measurement window for the CPU heuristic

    findings = []
    for p in procs:
        s, reasons = score(p, baseline)
        if s > 0:
            try:
                findings.append((s, p.pid, p.name(), reasons))
            except psutil.Error:
                continue                     # exited between scoring and reporting

    print("Host IDS scan — investigate highest scores first:\n")
    for s, pid, name, reasons in sorted(findings, key=lambda f: f[0], reverse=True)[:10]:
        flag = "🔴" if s >= 5 else "🟡" if s >= 3 else "⚪"
        print(f"{flag} score {s:2}  {name} (pid {pid})")
        for r in reasons:
            print(f"            • {r}")

scan("host_baseline.json")

Host IDS scan — investigate highest scores first:

🔴 score  9  sh (pid 9123)
            • not in baseline
            • runs from temp/world-writable dir
            • shell child of nginx — possible RCE
🟡 score  3  xmrig (pid 9201)
            • not in baseline
            • very high CPU (miner?)
⚪ score  1  newapp (pid 9300)
            • not in baseline

Read the code

This is the shape of real host-based detection: combine baseline-diff ("is it new?") with behavioural heuristics ("does it look like an attacker TTP?"), score, and rank so the analyst's attention goes to the 🔴 nine before the ⚪ one. The top finding — a shell, running from /tmp, spawned by nginx — is a textbook web-RCE signature worth dropping everything for. The bottom finding (a new app you installed) is almost certainly benign noise. The tool doesn't decide guilt; it focuses the human, which is exactly what detection tooling is for.

Try It Yourself

13 min

01 🟢 Baseline diff

Capture a baseline, launch a new application, then diff to confirm the new process (and any new ports/connections) is detected. This proves the "recognise not-normal" approach.

02 🟡 Lineage tracer

Write a function that prints the full parent chain of a given process. Run it on a deliberately-spawned shell (open a terminal, find its PID) and observe the lineage up to your shell/init.

Hint

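import os, psutil

def lineage(pid):
    chain = []
    try:
        p = psutil.Process(pid)
        while p is not None:
            chain.append(p.name())
            p = p.parent()           # None once we reach the top of the tree
    except psutil.Error:
        pass                         # stop at the first process we cannot read
    return " ← ".join(chain)

print(lineage(os.getpid()))          # swap in the PID of the shell you spawned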

03 🔴 Tune the scorer

Run the mini host-IDS on your own machine and examine the false positives. Adjust the heuristics/weights (e.g. allow-list known-good processes that legitimately run from unusual paths) to cut noise without removing the genuinely-suspicious rules. Document what you changed and why.

Mini-Challenge · Connection Anomaly Detector

8 min

Extend detection to the network: diff current established remote addresses against a baseline of known-good destinations, and flag connections to new remotes — especially from unexpected processes or to non-standard ports. This catches beaconing that a pure process check would miss.

Sample solution

import json, psutil

def connection_anomalies(baseline_path: str) -> None:
    with open(baseline_path) as f:
        known = set(json.load(f)["established_remotes"])
    # Note: on some platforms (e.g. macOS) net_connections() needs elevated privileges.
    for c in psutil.net_connections("inet"):
        if c.status != psutil.CONN_ESTABLISHED or not c.raddr:
            continue
        remote = f"{c.raddr.ip}:{c.raddr.port}"
        try:
            proc = psutil.Process(c.pid).name() if c.pid else "?"
        except psutil.Error:
            proc = "?"
        if remote not in known:
            non_web = c.raddr.port not in (80, 443)
            flag = "🔴" if non_web else "🟡"
            note = " · non-web port" if non_web else ""
            print(f"{flag} NEW outbound: {proc} → {remote} (not in baseline{note})")

connection_anomalies("host_baseline.json")

Non-negotiables: diff remotes vs. baseline, map to process, and raise severity for non-standard ports / unexpected processes.
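The "unexpected processes" part of that requirement could be handled by grading each new connection on two independent signals. The sketch below is one possible shape, not a definitive rule set: EXPECTED_TALKERS is a hypothetical list of programs you expect to make outbound connections, and the three severity labels are arbitrary.

```python
WEB_PORTS = {80, 443}
EXPECTED_TALKERS = {"firefox", "chrome", "curl"}   # hypothetical: tune per host

def connection_severity(proc: str, port: int) -> str:
    """Grade a new outbound connection: one strike per unusual signal."""
    strikes = 0
    if port not in WEB_PORTS:
        strikes += 1                    # non-standard destination port
    if proc.lower() not in EXPECTED_TALKERS:
        strikes += 1                    # process not expected to talk outbound
    return ("low", "medium", "high")[strikes]

for proc, port in [("firefox", 443), ("firefox", 8443), ("nc", 4444)]:
    print(proc, port, connection_severity(proc, port))
# firefox 443 low
# firefox 8443 medium
# nc 4444 high
```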

Recap

3 min

Anomaly detection recognises "not normal" rather than enumerating "bad." Diff a live snapshot against a known-good baseline (processes, ports, remotes) to surface what's new; inspect process lineage (a shell spawned by a web server screams RCE) and binary paths (temp dirs are suspicious); then combine layered heuristics into a score and rank, so an analyst's attention goes to the worst first. It's the core of host-based IDS/EDR — and it's inherently noisy, so a high score means "investigate," never "auto-act." Tuning the false-positive/false-negative balance is the real skill.

Vocabulary Card

anomaly detection: Flagging deviations from known-good behaviour, not known-bad signatures.
process lineage: The parent→child chain showing what launched a process.
TTP: Tactics, Techniques & Procedures — attacker behaviours heuristics encode.
suspicion score: A combined weight from multiple heuristics, used to rank for triage.

Homework

4 min

Build a mini host-IDS that diffs processes and connections against a baseline and scores suspicion with lineage + path + resource heuristics. Run it on your own machine, tune away the worst false positives, and write up: the heuristics you kept and why, two false positives you saw, and how you'd confirm a real finding before acting.

Sample · host-IDS tuning notes

Heuristics kept (each maps to a real attacker behaviour):
- shell child of a web/app server (+5): the strongest RCE signal.
- binary in /tmp, /dev/shm, %TEMP% (+3): malware drop locations.
- new process not in baseline (+1): weak alone, useful combined.
- sustained high CPU (+2): miner indicator.

False positives I saw:
1. A package manager ran a script from a temp dir (+3) — legit.
   Fix: allow-list the known package-manager parent.
2. VS Code spawned 'bash' (it's not a "web server", so it didn't
   trip the +5 rule) but a new helper process scored +1 — benign;
   acceptable low noise.

Confirming a real finding before acting:
- check the binary's path + digital signature (is it signed/known?)
- check its network connections (beaconing to a strange IP?)
- check the parent chain (how did it get launched?)
- only THEN isolate/kill — and only on a host I administer.

Non-negotiables: process+connection diff with scored heuristics, real tuning against your machine, documented false positives, and a confirm-before-act process.
