Learning Goals
3 min
By the end of this lesson you can:
- List a folder with iterdir and search it with glob.
- Search a whole tree recursively with rglob (or glob("**/...")).
- Read file metadata: size with stat().st_size, modified time with stat().st_mtime.
- Filter, sort, and summarise the files you find.
Warm-Up · Glob Patterns
5 min
A "glob" is a wildcard pattern for matching filenames. You've seen them in the shell:
*.csv         every file ending in .csv (in one folder)
report_*      every file starting with "report_"
data_??.txt   data_01.txt, data_99.txt (? matches one character)
**/*.py       every .py file in this folder AND all subfolders
* matches any run of characters in one folder; ** means "this folder and every folder under it." pathlib understands these patterns directly: folder.rglob("*.csv") finds every CSV in an entire tree, lazily, one at a time.
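To see the four patterns above in action, here is a minimal sketch that builds a throwaway folder and runs each pattern through Path.glob. The file names and the matches helper are made up for illustration; they are not part of the lesson's project.

```python
import tempfile
from pathlib import Path

def matches(root: Path, pattern: str) -> list[str]:
    """Hypothetical helper: sorted matches, relative to root, with / separators."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

# Build a small temporary tree so the patterns have something to match.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["sales.csv", "report_q1.txt", "data_01.txt", "data_123.txt", "sub/deep.csv"]:
        (root / name).touch()

    top_level_csv = matches(root, "*.csv")     # * stays inside one folder
    reports = matches(root, "report_*")
    two_char = matches(root, "data_??.txt")    # each ? matches exactly one character
    all_csv = matches(root, "**/*.csv")        # ** also descends into subfolders

print(top_level_csv)  # ['sales.csv']
print(reports)        # ['report_q1.txt']
print(two_char)       # ['data_01.txt']
print(all_csv)        # ['sales.csv', 'sub/deep.csv']
```

Note that data_123.txt is not matched by data_??.txt, because ?? demands exactly two characters.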
New Concept · Searching Trees
14 min
One folder: iterdir and glob
from pathlib import Path

folder = Path("data")

for item in folder.iterdir():      # everything in this folder, files + subfolders
    print(item)

for csv in folder.glob("*.csv"):   # only CSVs in this folder
    print(csv)
iterdir lists everything one level deep; glob(pattern) filters to matches in that one folder.
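Because iterdir returns files and subfolders mixed together, a common next step is to separate them. The sketch below does that with is_dir() and contrasts the result with glob("*.csv"). The split_listing helper and the file names are assumptions chosen for the demo.

```python
import tempfile
from pathlib import Path

def split_listing(folder: Path) -> tuple[list[str], list[str]]:
    """Hypothetical helper: (file names, subfolder names) directly inside folder."""
    files, dirs = [], []
    for item in folder.iterdir():
        if item.is_dir():
            dirs.append(item.name)
        else:
            files.append(item.name)
    return sorted(files), sorted(dirs)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "a.csv").touch()
    (root / "notes.txt").touch()
    (root / "raw" / "b.csv").touch()   # one level down: invisible to iterdir and glob

    files, dirs = split_listing(root)
    csvs = sorted(p.name for p in root.glob("*.csv"))

print(files)  # ['a.csv', 'notes.txt']
print(dirs)   # ['raw']
print(csvs)   # ['a.csv']
```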
The whole tree: rglob
for py in Path("project").rglob("*.py"):   # every .py, however deep
    print(py)

# rglob("*.py") is shorthand for glob("**/*.py")
rglob ("recursive glob") descends into every subfolder. This is the workhorse of file automation — "find all the X anywhere under here."
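As a quick check of the shorthand, this sketch compares rglob("*.py") with glob("**/*.py") on a temporary tree and records how deep each match sits. The folder layout is invented for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "pkg" / "utils").mkdir(parents=True)
    for rel in ["main.py", "pkg/core.py", "pkg/utils/helpers.py", "pkg/README.md"]:
        (project / rel).touch()

    via_rglob = sorted(project.rglob("*.py"))
    via_glob = sorted(project.glob("**/*.py"))

    # Depth = number of folders between the root and the file.
    depths = {p.name: len(p.relative_to(project).parts) - 1 for p in via_rglob}

same = via_rglob == via_glob
print(same)    # True
print(depths)
```

The README.md file is ignored by both calls, and the .py files at depth 0, 1, and 2 are all found.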
They're lazy generators
glob and rglob yield paths one at a time, so they handle huge trees without loading everything into memory. Wrap the result in list() if you need a count or want to reuse the paths; sorted() accepts the generator directly.
pdfs = list(Path("docs").rglob("*.pdf"))
print(f"Found {len(pdfs)} PDFs")
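Laziness also means the results can only be walked once. The sketch below, using made-up file names, takes the first two matches with itertools.islice, counts the rest without building a list, and then shows that nothing is left.

```python
import tempfile
from itertools import islice
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    for i in range(5):
        (docs / f"file_{i}.pdf").touch()

    matches = docs.rglob("*.pdf")          # nothing has been collected yet
    first_two = list(islice(matches, 2))   # pull only two paths
    remaining = sum(1 for _ in matches)    # count the rest without storing them
    leftover = list(matches)               # already exhausted, so this is empty

print(len(first_two), remaining, leftover)  # 2 3 []
```

If you need to loop over the same matches twice, store them in a list first or call rglob again.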
Reading metadata with stat()
from datetime import datetime

p = Path("report.csv")
info = p.stat()
print(info.st_size)                            # size in bytes
print(datetime.fromtimestamp(info.st_mtime))   # last modified, as a datetime

- st_size: file size in bytes.
- st_mtime: last-modified time, as a Unix timestamp (seconds). Convert with datetime.fromtimestamp.
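One way to package the two fields is a small formatting helper. The describe function below is a hypothetical name, and the demo sets the file's modified time explicitly with os.utime so the output is predictable.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path

def describe(path: Path) -> str:
    """Hypothetical helper: one-line summary of a file's size and modified time."""
    st = path.stat()
    modified = datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M")
    return f"{path.name}: {st.st_size} bytes, modified {modified}"

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "report.csv"
    p.write_bytes(b"a,b\n1,2\n")   # 8 bytes on every platform

    # Pin the access and modified times to a known local time for the demo.
    stamp = datetime(2024, 1, 15, 9, 30).timestamp()
    os.utime(p, (stamp, stamp))

    summary = describe(p)

print(summary)  # report.csv: 8 bytes, modified 2024-01-15 09:30
```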
Filter and sort what you find
# the five biggest log files anywhere under /var
logs = Path("/var").rglob("*.log")
biggest = sorted(logs, key=lambda p: p.stat().st_size, reverse=True)[:5]
for p in biggest:
    mb = p.stat().st_size / 1_000_000
    print(f"{mb:6.1f} MB {p}")
Because the results are just Path objects, all your list skills apply: sorted with a key, slicing for top-N, comprehensions for filtering.
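The three list skills named above can be combined in one small function. This is a sketch rather than a fixed recipe: largest, min_bytes, and top are illustrative names, and each file's size is read once and kept next to its name so stat() is not called again while sorting.

```python
import tempfile
from pathlib import Path

def largest(root: Path, pattern: str, min_bytes: int = 0, top: int = 3) -> list[tuple[int, str]]:
    """Hypothetical helper: top-N (size, name) pairs at or above min_bytes."""
    # Comprehension: read each size once and pair it with the file name.
    sized = [(p.stat().st_size, p.name) for p in root.rglob(pattern) if p.is_file()]
    # Comprehension again: filter by size.
    kept = [pair for pair in sized if pair[0] >= min_bytes]
    # sorted + slicing: biggest first, then keep the top N.
    return sorted(kept, reverse=True)[:top]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    sizes = {"a.log": 10, "b.log": 300, "nested/c.log": 200, "d.log": 5, "e.txt": 999}
    for rel, n in sizes.items():
        (root / rel).write_bytes(b"x" * n)

    result = largest(root, "*.log", min_bytes=10, top=2)

print(result)  # [(300, 'b.log'), (200, 'c.log')]
```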
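On Linux, *.CSV won't match data.csv. To be safe, match case-insensitively yourself: [p for p in folder.iterdir() if p.suffix.lower() == ".csv"]. (On Python 3.12 and later, glob and rglob also accept case_sensitive=False.) Also note that, unlike the shell and the glob module, pathlib's glob does not skip dot-files: folder.glob("*") returns names like .gitignore too, so filter them out yourself if you don't want them.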
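The same suffix comparison extends to a whole tree by swapping iterdir for rglob("*"). The sketch below assumes a hypothetical find_by_suffix helper and a few invented file names with mixed-case extensions.

```python
import tempfile
from pathlib import Path

def find_by_suffix(root: Path, suffix: str) -> list[str]:
    """Hypothetical helper: files under root whose extension matches, ignoring case."""
    wanted = suffix.lower()
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == wanted
    )

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "old").mkdir()
    for rel in ["data.csv", "EXPORT.CSV", "old/Backup.Csv", "notes.txt"]:
        (root / rel).touch()

    found = find_by_suffix(root, ".CSV")

print(found)  # ['EXPORT.CSV', 'data.csv', 'old/Backup.Csv']
```

Comparing lowercased suffixes behaves the same on every operating system, which is the point of doing the match yourself.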
Worked Example · A Disk-Usage Report
12 min
Goal: scan a folder tree and report total size, file count, and a breakdown by extension, like a mini "what's eating my disk" tool.
from pathlib import Path
from collections import defaultdict

def disk_report(root: str) -> None:
    base = Path(root)
    total_bytes = 0
    count = 0
    by_ext = defaultdict(lambda: [0, 0])   # ext -> [count, bytes]

    for f in base.rglob("*"):
        if f.is_file():
            size = f.stat().st_size
            ext = f.suffix.lower() or "(none)"
            total_bytes += size
            count += 1
            by_ext[ext][0] += 1
            by_ext[ext][1] += size

    print(f"{count} files, {total_bytes/1_000_000:.1f} MB total\n")
    print(f"{'ext':<8}{'files':>8}{'MB':>10}")
    for ext, (n, b) in sorted(by_ext.items(), key=lambda kv: kv[1][1], reverse=True):
        print(f"{ext:<8}{n:>8}{b/1_000_000:>10.2f}")

disk_report("project")

128 files, 14.3 MB total

ext        files        MB
.png          41      8.92
.py           63      3.10
.json         18      1.74
.md            6      0.55
Read the code
rglob("*") walks the whole tree; is_file() skips folders; stat().st_size gives each size. A defaultdict accumulates count and bytes per extension, then sorted by bytes ranks the disk hogs. Notice how the metadata + sorting pattern from the concept section drives a genuinely useful tool. Swap the print loop for a CSV writer and you've got a report you can email — which is exactly where Lesson 22 is headed.
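Swapping the print loop for a CSV writer could look roughly like the sketch below. It separates the counting from the output so the same totals can go to any file-like object. The function names, the column headers, and the use of io.StringIO in place of a real file are assumptions for illustration only.

```python
import csv
import io
import tempfile
from collections import defaultdict
from pathlib import Path

def ext_totals(root: Path) -> dict[str, list[int]]:
    """Hypothetical helper: ext -> [file count, total bytes] for a tree."""
    totals = defaultdict(lambda: [0, 0])
    for f in root.rglob("*"):
        if f.is_file():
            ext = f.suffix.lower() or "(none)"
            totals[ext][0] += 1
            totals[ext][1] += f.stat().st_size
    return dict(totals)

def write_csv(totals: dict[str, list[int]], out) -> None:
    """Write one row per extension, largest total first, to a text stream."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ext", "files", "bytes"])   # assumed column names
    for ext, (n, b) in sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True):
        writer.writerow([ext, n, b])

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "img.png").write_bytes(b"x" * 500)
    (root / "a.py").write_bytes(b"x" * 120)
    (root / "b.py").write_bytes(b"x" * 80)
    (root / "LICENSE").write_bytes(b"x" * 40)

    buffer = io.StringIO()   # stands in for an open file
    write_csv(ext_totals(root), buffer)

report = buffer.getvalue()
print(report)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") as the out argument.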
Try It Yourself
13 min
Count how many .py files live under a folder tree using rglob. Then count .txt. Print both.
Find every file under a folder modified in the last 7 days. Compare stat().st_mtime against time.time() - 7*86400.
Hint
import time
from pathlib import Path

cutoff = time.time() - 7 * 86400
recent = [p for p in Path("project").rglob("*")
          if p.is_file() and p.stat().st_mtime > cutoff]
for p in recent:
    print(p)
Group files under a tree by their size in bytes. Print any size that has more than one file — a quick (imperfect) duplicate finder. (Lesson 38 makes it exact with hashing.)
Hint
from pathlib import Path
from collections import defaultdict

groups = defaultdict(list)
for p in Path("downloads").rglob("*"):
    if p.is_file():
        groups[p.stat().st_size].append(p)

for size, files in groups.items():
    if len(files) > 1:
        print(f"{size} bytes:")
        for f in files:
            print(" ", f)
Mini-Challenge · The Cleanup Scout
8 min
Write scout(root) that finds "cleanup candidates": files bigger than 50 MB OR not modified in over a year. Print each with its size and age in days, sorted biggest-first. Crucially, it should only report and never delete (that's a human decision).
Show a sample solution
import time
from pathlib import Path

def scout(root: str) -> None:
    now = time.time()
    hits = []
    for p in Path(root).rglob("*"):
        if not p.is_file():
            continue
        st = p.stat()
        age_days = (now - st.st_mtime) / 86400
        if st.st_size > 50_000_000 or age_days > 365:
            hits.append((st.st_size, age_days, p))

    for size, age, p in sorted(hits, reverse=True):
        print(f"{size/1_000_000:7.1f} MB {age:5.0f}d {p}")

scout("downloads")
Non-negotiables: recursive scan, size OR age filter, sorted output, and report-only (no deletion).
Recap
3 min
iterdir lists one folder; glob(pattern) filters it; rglob(pattern) searches the entire tree. All three yield Path objects lazily, so wrap the result in list() to count it, or pass it straight to sorted(). Each path carries metadata via stat(): st_size for bytes and st_mtime for last-modified, which you convert with datetime.fromtimestamp. Combine search + metadata + sorted and you can report on, filter, and prioritise files across a whole tree. That trio powers nearly every file automation you'll write.
Vocabulary Card
- glob: A wildcard pattern (*, ?) for matching filenames.
- rglob: Recursive glob; searches a folder and all its subfolders.
- stat(): Returns a path's metadata: size, modified time, permissions, and more.
- st_mtime: Last-modified time as a Unix timestamp (seconds since 1970).
Homework
4 min
Build treesize.py <folder> (using argparse from Lesson 3) that prints the 10 largest files under a folder tree, newest-modified first as a tie-breaker, showing size in MB and a human-readable modified date. Add a --ext flag to restrict to one extension.
Sample · treesize.py
import argparse
from pathlib import Path
from datetime import datetime

p = argparse.ArgumentParser(description="Largest files in a tree.")
p.add_argument("folder")
p.add_argument("--ext", help="restrict to one extension, e.g. .png")
a = p.parse_args()

pattern = f"*{a.ext}" if a.ext else "*"
files = [f for f in Path(a.folder).rglob(pattern) if f.is_file()]

# largest first; among equal sizes, the most recently modified comes first
files.sort(key=lambda f: (f.stat().st_size, f.stat().st_mtime), reverse=True)

for f in files[:10]:
    st = f.stat()
    when = datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M")
    print(f"{st.st_size/1_000_000:8.2f} MB {when} {f}")
Non-negotiables: argparse folder + --ext, recursive search, top-10 by size, readable date.