PY-L7-06 · pathlib in Practice: Find, Walk, Glob

Learning Goals

3 min

By the end of this lesson you can:

List a folder with iterdir and search it with glob.
Search a whole tree recursively with rglob (or glob("**/...")).
Read file metadata: size with stat().st_size, modified time with stat().st_mtime.
Filter, sort, and summarise the files you find.

Warm-Up · Glob Patterns

5 min

A "glob" is a wildcard pattern for matching filenames. You've seen them in the shell:

*.csv        every file ending in .csv (in one folder)
report_*     every file starting with "report_"
data_??.txt  data_01.txt, data_99.txt — ? matches one character
**/*.py      every .py file in this folder AND all subfolders

Today's big idea

* matches any run of characters within a single file or folder name, so it never crosses into a subfolder; ** means "this folder and every folder under it." pathlib understands these patterns directly: folder.rglob("*.csv") finds every CSV in an entire tree, lazily, one at a time.
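To see the patterns from the warm-up in action, here is a minimal sketch that builds a throwaway tree in a temporary folder and runs each pattern against it. The file names and the helpers build_demo_tree and names are made up for illustration; only Path.glob is the real API being shown.

```python
import tempfile
from pathlib import Path


def build_demo_tree(root: Path) -> None:
    """Create a tiny, made-up tree to try patterns against."""
    for rel in ["a.csv", "report_q1.txt", "data_01.txt", "data_100.txt",
                "main.py", "pkg/util.py", "pkg/deep/core.py"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()


def names(root: Path, pattern: str) -> list[str]:
    """Sorted matches for pattern, as paths relative to root."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_tree(root)
    csv_here = names(root, "*.csv")           # one folder only
    reports = names(root, "report_*")         # prefix match
    two_chars = names(root, "data_??.txt")    # exactly two characters
    py_here = names(root, "*.py")             # top level only
    py_everywhere = names(root, "**/*.py")    # top level and all subfolders

print(two_chars)        # data_100.txt is not matched: it has three characters
print(py_here)
print(py_everywhere)
```

Comparing py_here with py_everywhere is the quickest way to see what ** adds.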

New Concept · Searching Trees

14 min

One folder: iterdir and glob

from pathlib import Path

folder = Path("data")

for item in folder.iterdir():       # everything in this folder, files + subfolders
    print(item)

for csv in folder.glob("*.csv"):    # only CSVs in this folder
    print(csv)

iterdir lists everything one level deep; glob(pattern) filters to matches in that one folder.

The whole tree: rglob

for py in Path("project").rglob("*.py"):    # every .py, however deep
    print(py)

# rglob("*.py") is shorthand for glob("**/*.py")

rglob ("recursive glob") descends into every subfolder. This is the workhorse of file automation — "find all the X anywhere under here."
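As a quick check of the shorthand, the sketch below creates a small temporary project (the folder and file names are invented for this example) and confirms that rglob("*.py") and glob("**/*.py") find the same files at every depth.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "src" / "core").mkdir(parents=True)
    for rel in ["setup.py", "src/app.py", "src/core/engine.py", "src/notes.txt"]:
        (project / rel).touch()

    via_rglob = sorted(project.rglob("*.py"))
    via_glob = sorted(project.glob("**/*.py"))

    # How many path parts below the project root each match sits at.
    depths = {p.name: len(p.relative_to(project).parts) for p in via_rglob}

print(via_rglob == via_glob)
print(depths)
```

The depths dictionary shows that matches come from the root folder itself as well as from nested folders.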

They're lazy generators

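glob and rglob (like iterdir) yield paths one at a time, so they handle huge trees without loading everything into memory. The results can be consumed only once. Wrap them in list() if you need a count or want to reuse the results (sorted() also accepts the generator directly and returns a list):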

pdfs = list(Path("docs").rglob("*.pdf"))
print(f"Found {len(pdfs)} PDFs")
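The "one at a time" behaviour is easy to miss, so here is a small sketch of what it means in practice, run against five made-up PDF names in a temporary folder. It assumes nothing beyond rglob returning a one-shot iterator: once you have looped over it, it is empty.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    for i in range(5):
        (docs / f"file_{i}.pdf").touch()

    matches = docs.rglob("*.pdf")

    # An iterator returns itself from iter(); a list would not.
    is_one_shot = iter(matches) is matches

    first = next(matches)                  # pull just one result
    remaining = sum(1 for _ in matches)    # count the rest without building a list
    leftover = list(matches)               # nothing left: already consumed

    # any() stops at the first hit, so it is a cheap "is there at least one?" test.
    any_txt = any(docs.rglob("*.txt"))

print(is_one_shot, remaining, leftover, any_txt)
```

If you need the results twice, store list(...) once and reuse that list rather than calling rglob again.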

Reading metadata with stat()

from datetime import datetime

p = Path("report.csv")
info = p.stat()

print(info.st_size)                          # size in bytes
print(datetime.fromtimestamp(info.st_mtime)) # last modified, as a datetime

st_size — file size in bytes.
st_mtime — last-modified time, as a Unix timestamp (seconds). Convert with datetime.fromtimestamp.
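The two fields combine naturally into a one-line description of a file. The sketch below is one possible way to do that; describe is a hypothetical helper, and the file contents and the fixed modification time are set by the example itself (with os.utime) so the result is predictable.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path


def describe(path: Path) -> str:
    """Return 'name: N bytes, modified YYYY-MM-DD' for one file."""
    info = path.stat()                               # one stat() call, reused below
    modified = datetime.fromtimestamp(info.st_mtime)
    return f"{path.name}: {info.st_size} bytes, modified {modified:%Y-%m-%d}"


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.csv"
    report.write_bytes(b"id,value\n1,10\n")          # 14 bytes

    # Pin the modified time to a known local moment for the demo.
    fixed = datetime(2024, 1, 15, 12, 0).timestamp()
    os.utime(report, (fixed, fixed))

    size = report.stat().st_size
    summary = describe(report)

print(summary)
```

Calling stat() once and reading both fields from the result avoids asking the file system twice for the same file.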

Filter and sort what you find

# the five biggest log files anywhere under /var
logs = Path("/var").rglob("*.log")
biggest = sorted(logs, key=lambda p: p.stat().st_size, reverse=True)[:5]

for p in biggest:
    mb = p.stat().st_size / 1_000_000
    print(f"{mb:6.1f} MB  {p}")

Because the results are just Path objects, all your list skills apply: sorted with a key, slicing for top-N, comprehensions for filtering.
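Here is a sketch of the comprehension-based filtering mentioned above, combined with sorted and slicing. largest_files and its parameters are hypothetical names chosen for this example, and the tree is a handful of dummy files in a temporary folder.

```python
import tempfile
from pathlib import Path


def largest_files(root: Path, pattern: str, min_bytes: int,
                  top: int) -> list[tuple[int, Path]]:
    """Top-N files matching pattern that are at least min_bytes, biggest first."""
    # Read each size once and carry it alongside the path.
    sized = [(p.stat().st_size, p) for p in root.rglob(pattern) if p.is_file()]
    big_enough = [(size, p) for size, p in sized if size >= min_bytes]
    return sorted(big_enough, key=lambda pair: pair[0], reverse=True)[:top]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "small.log").write_bytes(b"x" * 10)
    (root / "medium.log").write_bytes(b"x" * 500)
    (root / "sub" / "large.log").write_bytes(b"x" * 2000)
    (root / "sub" / "ignored.txt").write_bytes(b"x" * 9000)

    result = [(size, p.name) for size, p in largest_files(root, "*.log", 100, 2)]

print(result)
```

Pairing each path with its size up front means the sort key and the printout can share one stat() call per file.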

Case sensitivity & hidden files

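On Linux, *.CSV won't match data.csv; on Windows it will. To be safe, match case-insensitively yourself: [p for p in folder.iterdir() if p.suffix.lower() == ".csv"]. (Python 3.12 also added a case_sensitive argument to glob and rglob.) Also note that, unlike the shell and the standalone glob module, pathlib's glob("*") does return dot-files like .gitignore. If you don't want them, filter them out yourself, for example with not p.name.startswith(".").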
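The case-insensitive comparison can be wrapped in a small reusable function. This is only a sketch: find_by_suffix is a hypothetical helper, and the mixed-case file names are invented to exercise it.

```python
import tempfile
from pathlib import Path


def find_by_suffix(root: Path, suffix: str) -> list[Path]:
    """All files under root whose extension equals suffix, ignoring case."""
    wanted = suffix.lower()
    return [p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() == wanted]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["a.csv", "B.CSV", "c.Csv", "notes.txt"]:
        (root / name).touch()

    found = {p.name for p in find_by_suffix(root, ".csv")}

print(sorted(found))
```

Because the comparison happens in Python rather than in the pattern, it behaves the same on Linux, macOS, and Windows.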

Worked Example · A Disk-Usage Report

12 min

Goal: scan a folder tree and report total size, file count, and a breakdown by extension — like a mini "what's eating my disk" tool.

from pathlib import Path
from collections import defaultdict

def disk_report(root: str) -> None:
    base = Path(root)
    total_bytes = 0
    count = 0
    by_ext = defaultdict(lambda: [0, 0])    # ext -> [count, bytes]

    for f in base.rglob("*"):
        if f.is_file():
            size = f.stat().st_size
            ext = f.suffix.lower() or "(none)"
            total_bytes += size
            count += 1
            by_ext[ext][0] += 1
            by_ext[ext][1] += size

    print(f"{count} files, {total_bytes/1_000_000:.1f} MB total\n")
    print(f"{'ext':<8}{'files':>8}{'MB':>10}")
    for ext, (n, b) in sorted(by_ext.items(),
                              key=lambda kv: kv[1][1], reverse=True):
        print(f"{ext:<8}{n:>8}{b/1_000_000:>10.2f}")

disk_report("project")

128 files, 14.3 MB total

ext        files        MB
.png          41      8.92
.py           63      3.10
.json         18      1.74
.md            6      0.55

Read the code

rglob("*") walks the whole tree; is_file() skips folders; stat().st_size gives each size. A defaultdict accumulates count and bytes per extension, then sorted by bytes ranks the disk hogs. Notice how the metadata + sorting pattern from the concept section drives a genuinely useful tool. Swap the print loop for a CSV writer and you've got a report you can email — which is exactly where Lesson 22 is headed.
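As a rough preview of the "swap the print loop for a CSV writer" idea, the sketch below splits the work into two steps: gather the per-extension totals, then write them out with the standard csv module. summarise_by_ext and write_report are hypothetical names, the column headers are an assumption, and the sample tree is created by the example itself.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def summarise_by_ext(root: Path) -> dict[str, list[int]]:
    """Map each lowercase extension to [file count, total bytes]."""
    by_ext: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for f in root.rglob("*"):
        if f.is_file():
            ext = f.suffix.lower() or "(none)"
            by_ext[ext][0] += 1
            by_ext[ext][1] += f.stat().st_size
    return dict(by_ext)


def write_report(summary: dict[str, list[int]], out_path: Path) -> None:
    """Write one CSV row per extension, largest total first."""
    rows = sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ext", "files", "bytes"])
        for ext, (n, b) in rows:
            writer.writerow([ext, n, b])


with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp) / "tree"
    tree.mkdir()
    (tree / "a.py").write_bytes(b"x" * 30)
    (tree / "b.PY").write_bytes(b"x" * 20)
    (tree / "logo.png").write_bytes(b"x" * 400)
    (tree / "README").write_bytes(b"x" * 5)

    summary = summarise_by_ext(tree)
    report = Path(tmp) / "report.csv"    # kept outside the scanned tree
    write_report(summary, report)
    report_lines = report.read_text(encoding="utf-8").splitlines()

print(report_lines)
```

Separating "collect" from "output" keeps the scanning logic unchanged whether the result is printed, written to CSV, or sent elsewhere.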

Try It Yourself

13 min

01 🟢 Count by type

Count how many .py files live under a folder tree using rglob. Then count .txt. Print both.

02 🟡 Recently changed

Find every file under a folder modified in the last 7 days. Compare stat().st_mtime against time.time() - 7*86400.

Hint

import time
from pathlib import Path

cutoff = time.time() - 7 * 86400
recent = [p for p in Path("project").rglob("*")
          if p.is_file() and p.stat().st_mtime > cutoff]
for p in recent:
    print(p)

03 🔴 Find duplicates by size

Group files under a tree by their size in bytes. Print any size that has more than one file — a quick (imperfect) duplicate finder. (Lesson 38 makes it exact with hashing.)

Hint

from pathlib import Path
from collections import defaultdict

groups = defaultdict(list)
for p in Path("downloads").rglob("*"):
    if p.is_file():
        groups[p.stat().st_size].append(p)

for size, files in groups.items():
    if len(files) > 1:
        print(f"{size} bytes:")
        for f in files:
            print("  ", f)

Mini-Challenge · The Cleanup Scout

8 min

Write scout(root) that finds "cleanup candidates": files bigger than 50 MB OR not modified in over a year. Print each with its size and age in days, sorted biggest-first. Crucially, it should only report — never delete (that's a human decision).

Show a sample solution

import time
from pathlib import Path

def scout(root: str) -> None:
    now = time.time()
    hits = []
    for p in Path(root).rglob("*"):
        if not p.is_file():
            continue
        st = p.stat()
        age_days = (now - st.st_mtime) / 86400
        if st.st_size > 50_000_000 or age_days > 365:
            hits.append((st.st_size, age_days, p))

    for size, age, p in sorted(hits, reverse=True):
        print(f"{size/1_000_000:7.1f} MB  {age:5.0f}d  {p}")

scout("downloads")

Non-negotiables: recursive scan, size OR age filter, sorted output, and report-only (no deletion).

Recap

3 min

iterdir lists one folder; glob(pattern) filters it; rglob(pattern) searches the entire tree. All three return lazy generators of Path objects, so wrap them in list() to count or reuse the results, or pass them straight to sorted(). Each path carries metadata via stat(): st_size is the size in bytes and st_mtime is the last-modified timestamp, which you convert with datetime.fromtimestamp. Combine search + metadata + sorted and you can report on, filter, and prioritise files across a whole tree. That trio powers nearly every file automation you'll write.

Vocabulary Card

glob: A wildcard pattern (*, ?) for matching filenames.
rglob: Recursive glob — searches a folder and all its subfolders.
stat(): Returns a path's metadata: size, modified time, permissions, and more.
st_mtime: Last-modified time as a Unix timestamp (seconds since 1970).

Homework

4 min

Build treesize.py <folder> (using argparse from Lesson 3) that prints the 10 largest files under a folder tree, newest-modified first as a tie-breaker, showing size in MB and a human-readable modified date. Add a --ext flag to restrict to one extension.

Sample · treesize.py

import argparse
from pathlib import Path
from datetime import datetime

p = argparse.ArgumentParser(description="Largest files in a tree.")
p.add_argument("folder")
p.add_argument("--ext", help="restrict to one extension, e.g. .png")
a = p.parse_args()

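# Accept both "png" and ".png"; with no --ext, match everything.
ext = a.ext if not a.ext or a.ext.startswith(".") else f".{a.ext}"
pattern = f"*{ext}" if ext else "*"
files = [f for f in Path(a.folder).rglob(pattern) if f.is_file()]
# Largest first; among equal sizes, newest-modified first.
files.sort(key=lambda f: (f.stat().st_size, f.stat().st_mtime),
           reverse=True)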

for f in files[:10]:
    st = f.stat()
    when = datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M")
    print(f"{st.st_size/1_000_000:8.2f} MB  {when}  {f}")

Non-negotiables: argparse folder + --ext, recursive search, top-10 by size, readable date.

