Learning Goals
3 minBy the end of this lesson you can:
- Detect changed files reliably by hashing their contents.
- Do incremental backups — copy only what changed.
- Keep versioned, dated snapshots and rotate old ones (retention).
- One-way sync a source folder to a destination (mirror), including deletes.
Warm-Up · The 3-2-1 Rule
5 minBackup wisdom in one line: 3 copies of your data, on 2 different media, with 1 off-site. We're building the engine that makes the extra copies — automatically and efficiently.
A naive backup re-copies everything every time — slow and wasteful. A smart one copies only the files that changed, which means you must reliably answer "did this file change?" The trustworthy answer is a content hash: if two files have the same hash, they're identical; if different, they changed. Hash-based incremental backups are fast, correct, and the foundation of every real backup tool.
New Concept · Hashing, Incremental, Retention
14 minHashing a file
import hashlib from pathlib import Path def file_hash(path: Path, algo: str = "sha256") -> str: h = hashlib.new(algo) with open(path, "rb") as f: for chunk in iter(lambda: f.read(65536), b""): # read in 64KB blocks h.update(chunk) return h.hexdigest()
Reading in chunks lets you hash huge files without loading them into memory. The hex digest is a fingerprint: same content → same hash. Comparing hashes (not timestamps) is the reliable change-detector — timestamps can lie, content can't.
A manifest of the last backup
import json def load_manifest(path: Path) -> dict: return json.loads(path.read_text()) if path.exists() else {} def save_manifest(path: Path, manifest: dict) -> None: path.write_text(json.dumps(manifest, indent=2))
Store {relative_path: hash} for the last backup. Next run, re-hash the source and compare to the manifest to find what's new or changed — that's the "incremental" magic.
Incremental copy
import shutil def incremental_backup(src: Path, dest: Path, manifest_path: Path) -> dict: old = load_manifest(manifest_path) new, changed = {}, [] for f in src.rglob("*"): if not f.is_file(): continue rel = str(f.relative_to(src)) digest = file_hash(f) new[rel] = digest if old.get(rel) != digest: # new or modified target = dest / rel target.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(f, target) # copy2 keeps timestamps changed.append(rel) save_manifest(manifest_path, new) return {"total": len(new), "copied": len(changed), "files": changed}
Only files whose hash differs from the manifest get copied. The first run copies everything; later runs copy just the deltas — often a handful of files instead of thousands.
Versioned snapshots + retention
from datetime import datetime def snapshot(src: Path, backup_root: Path, keep: int = 7) -> Path: stamp = datetime.now().strftime("%Y%m%d-%H%M%S") dest = backup_root / stamp shutil.copytree(src, dest) # retention: keep only the newest <keep> snapshots snaps = sorted([d for d in backup_root.iterdir() if d.is_dir()], reverse=True) for old in snaps[keep:]: shutil.rmtree(old) return dest
Dated snapshot folders give you point-in-time recovery ("restore yesterday"); the retention loop keeps disk bounded by deleting all but the newest N — the rotation pattern from Lesson 7.
A backup keeps history (snapshots). A mirror sync makes the destination exactly match the source — including deleting from the destination files that no longer exist in the source. Mirror is dangerous: a deleted source file is gone from the mirror too, so it's not a substitute for versioned backups. Use mirror for "keep this folder identical"; use snapshots for "don't lose history."
Worked Example · An Incremental Backup Tool
12 minGoal: a CLI that does a fast incremental backup, reports exactly what changed, and offers a verify mode that re-hashes the backup to confirm integrity.
import argparse, hashlib, json, shutil, logging from pathlib import Path from datetime import datetime logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s") log = logging.getLogger("backup") def file_hash(path: Path) -> str: h = hashlib.sha256() with open(path, "rb") as f: for chunk in iter(lambda: f.read(65536), b""): h.update(chunk) return h.hexdigest() def backup(src: str, dest: str) -> None: src_p, dest_p = Path(src), Path(dest) dest_p.mkdir(parents=True, exist_ok=True) manifest_p = dest_p / ".manifest.json" old = json.loads(manifest_p.read_text()) if manifest_p.exists() else {} new, copied, unchanged = {}, 0, 0 for f in src_p.rglob("*"): if not f.is_file(): continue rel = str(f.relative_to(src_p)) digest = file_hash(f) new[rel] = digest if old.get(rel) == digest: unchanged += 1 continue target = dest_p / rel target.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(f, target) copied += 1 log.info("copied %s", rel) manifest_p.write_text(json.dumps(new, indent=2)) log.info("backup done: %d copied, %d unchanged, %d total", copied, unchanged, len(new)) def verify(dest: str) -> None: dest_p = Path(dest) manifest = json.loads((dest_p / ".manifest.json").read_text()) bad = 0 for rel, expected in manifest.items(): f = dest_p / rel if not f.exists() or file_hash(f) != expected: log.error("CORRUPT/MISSING: %s", rel); bad += 1 log.info("verify: %d files, %d problems", len(manifest), bad) if __name__ == "__main__": p = argparse.ArgumentParser(description="Incremental backup.") sub = p.add_subparsers(dest="cmd", required=True) b = sub.add_parser("run"); b.add_argument("src"); b.add_argument("dest") v = sub.add_parser("verify"); v.add_argument("dest") a = p.parse_args() backup(a.src, a.dest) if a.cmd == "run" else verify(a.dest)
$ python backup.py run documents backups/docs INFO copied report.docx INFO copied notes/todo.txt INFO backup done: 2 copied, 148 unchanged, 150 total $ python backup.py verify backups/docs INFO verify: 150 files, 0 problems
Read the code
The manifest of hashes makes the second run almost instant: 148 unchanged files are skipped, only the 2 edited ones copy. The verify command re-hashes the backup against the manifest to catch silent corruption (bit-rot, bad disk) — a real backup tool you can trust must prove the copy is intact, not just assume it. Schedule backup.py run with cron (Lesson 36) and your documents back themselves up incrementally every night.
Try It Yourself
13 minHash two copies of a file (one edited), confirm the hashes differ, then make them identical and confirm the hashes match. This proves hashing detects change.
Run the backup on a test folder, edit one file, run again, and confirm only the edited file is copied the second time (watch the "copied" count).
Improve Lesson 6's size-based duplicate finder to be exact: group files by hash (not size) so only byte-identical files are flagged. Report sets of duplicates and the space wasted.
Hint
from collections import defaultdict groups = defaultdict(list) for f in Path("photos").rglob("*"): if f.is_file(): groups[file_hash(f)].append(f) for digest, files in groups.items(): if len(files) > 1: wasted = files[0].stat().st_size * (len(files) - 1) print(f"{len(files)} copies, {wasted} bytes wasted:") for f in files: print(" ", f)
Mini-Challenge · One-Way Mirror Sync
8 minWrite mirror(src, dest, dry_run=True) that makes dest exactly match src: copy new/changed files and delete files in dest that no longer exist in src. Default to dry_run=True (just print what would happen) — because deletion is dangerous, the safe default is to show, not do.
Show a sample solution
import shutil from pathlib import Path def mirror(src: str, dest: str, dry_run: bool = True) -> None: src_p, dest_p = Path(src), Path(dest) src_files = {str(f.relative_to(src_p)) for f in src_p.rglob("*") if f.is_file()} dest_files = {str(f.relative_to(dest_p)) for f in dest_p.rglob("*") if f.is_file()} for rel in src_files: # copy new/changed s, d = src_p / rel, dest_p / rel if not d.exists() or file_hash(s) != file_hash(d): print(f"{'WOULD COPY' if dry_run else 'COPY'}: {rel}") if not dry_run: d.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(s, d) for rel in dest_files - src_files: # delete extras print(f"{'WOULD DELETE' if dry_run else 'DELETE'}: {rel}") if not dry_run: (dest_p / rel).unlink() mirror("source", "backup") # safe preview # mirror("source", "backup", dry_run=False) # actually do it
Non-negotiables: copies new/changed, deletes extras, defaults to dry-run preview, hash-based change detection.
Recap
3 minSmart backups copy only what changed, and the reliable change-detector is a content hash (compare hashes, not timestamps). Store a manifest of path → hash, re-hash the source each run, and copy just the deltas — instant after the first run. Keep dated snapshots for point-in-time recovery and a retention loop to bound disk. Verify backups by re-hashing against the manifest. Know the difference between a versioned backup (keeps history) and a mirror sync (deletes to match, dangerous — default to dry-run). These primitives underpin every real backup system, including the database backups of Lesson 39.
Vocabulary Card
- content hash
- A fingerprint of a file's bytes; identical content → identical hash.
- incremental backup
- Copying only files that changed since the last backup.
- snapshot / retention
- A dated full copy; retention prunes all but the newest N.
- mirror sync
- Making a destination exactly match a source, including deletions.
Homework
4 minBuild timemachine.py: incremental backup with a hash manifest, dated snapshots with retention (keep N), a verify command, and a safe dry-run mirror mode — all behind argparse subcommands and logging. Back up a real folder of yours, edit a file, back up again, and confirm only the change copied. Then verify integrity. Note how you'd schedule it nightly.
Sample · timemachine.py usage
$ python timemachine.py backup ~/docs backups/docs INFO 150 copied, 0 unchanged (first run) # edit one file… $ python timemachine.py backup ~/docs backups/docs INFO 1 copied, 149 unchanged (incremental!) $ python timemachine.py snapshot ~/docs backups/snaps --keep 7 INFO snapshot 20260528-080000 created; pruned 1 old $ python timemachine.py verify backups/docs INFO 150 files, 0 problems $ python timemachine.py mirror ~/docs backups/mirror # dry-run preview WOULD DELETE: old_draft.txt # Schedule nightly: 0 2 * * * /usr/bin/python3 /path/timemachine.py backup ~/docs backups/docs >> bk.log 2>&1
Non-negotiables: incremental via hashes, snapshots+retention, verify, dry-run mirror, a scheduling note.