PY-L7-20 · PDF Automation: pypdf Read & Extract Text

Learning Goals

3 min

By the end of this lesson you can:

Open a PDF and read its page count and metadata with pypdf.
Extract text from one page or a whole document.
Search extracted text and pull out fields with regex.
Recognise the limits of text extraction (scanned PDFs need OCR).

Warm-Up · PDFs Are Weird

5 min

A PDF isn't a document of text — it's a set of drawing instructions: "put this glyph at x=72, y=640." There's often no concept of "paragraph" or even reliable reading order. Two PDFs that look identical can extract very differently.

pip install pypdf      # modern, pure-Python PDF library

Today's big idea

pypdf reconstructs text as best it can from those drawing instructions. For digitally-generated PDFs (exported from Word, a website, an accounting system) it works well. For scanned PDFs — images of paper — there's no text to extract; you'd need OCR (e.g. pytesseract). Know which kind you have before you trust the output.
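One rough way to tell the two kinds apart is to look at how much text actually came out of each page. The sketch below is a heuristic, not a pypdf feature: `looks_scanned` and the `min_chars` threshold are hypothetical names and values chosen for illustration, and the page texts are assumed to come from `page.extract_text()` (introduced in the next section).

```python
def looks_scanned(page_texts: list[str], min_chars: int = 20) -> bool:
    """Guess whether a PDF is image-only from its extracted page texts.

    Heuristic (an assumption, tune it on your own files): if every page
    yields fewer than `min_chars` non-whitespace characters, there is
    probably no text layer and OCR would be needed.
    """
    for text in page_texts:
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) >= min_chars:
            return False  # at least one page has real text
    return True  # no page had meaningful text (also true for zero pages)


# Hand-written stand-ins for what page.extract_text() might return
digital = ["Invoice INV-2026-0042\nTotal: $1,234.56", "Terms: net 30 days"]
scanned = ["", "  \n", ""]

print(looks_scanned(digital))  # False
print(looks_scanned(scanned))  # True
```

A threshold like this can misjudge a near-empty cover page or a scan with a stray text stamp, so treat the result as a hint for a human, not a verdict.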

New Concept · Reading PDFs

14 min

Opening and inspecting

from pypdf import PdfReader

reader = PdfReader("invoice.pdf")
print(len(reader.pages))            # number of pages

meta = reader.metadata              # None when the PDF carries no metadata at all
if meta is not None:
    print(meta.title)               # document title (may be None)
    print(meta.author)
    print(meta.creation_date)

reader.pages is a list-like sequence of page objects; reader.metadata holds title/author/dates. The metadata object itself can be None, and so can any individual field, so guard both.
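The guard can be wrapped in a small helper so every script handles missing metadata the same way. This is a minimal sketch: `describe_metadata` and the "(unknown)" fallback are hypothetical, and the `SimpleNamespace` objects merely stand in for whatever `reader.metadata` returns so the example runs without a PDF.

```python
from types import SimpleNamespace


def describe_metadata(meta) -> dict:
    """Return title/author/created as strings, with a fallback for gaps.

    `meta` is expected to be whatever reader.metadata gave you, which may
    be None when the PDF carries no metadata at all.
    """
    fields = {"title": "title", "author": "author", "created": "creation_date"}
    info = {}
    for label, attr in fields.items():
        value = getattr(meta, attr, None) if meta is not None else None
        info[label] = str(value) if value else "(unknown)"
    return info


# Stand-ins for reader.metadata (hypothetical values, no PDF needed)
full = SimpleNamespace(title="March Invoice", author="Acme Ltd", creation_date=None)
print(describe_metadata(full))
print(describe_metadata(None))
```

With a real file you would call `describe_metadata(reader.metadata)`.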

Extracting text

# one page (0-based)
first_page = reader.pages[0]
print(first_page.extract_text())

# the whole document
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(len(full_text), "characters extracted")

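Note the or "": it guarantees that every item passed to join is a string. pypdf documents extract_text() as returning a str (an empty one for image-only pages), so the guard is defensive, but it costs nothing and a single None would make join raise a TypeError. Pages are 0-based here (reader.pages behaves like a Python list), unlike openpyxl's 1-based cells.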
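The guard and the 0-based versus 1-based distinction can be seen together in a short sketch. `extract_pages` and `FakePage` are hypothetical: `FakePage` only imitates the one method used here so the example runs without a PDF, and it deliberately returns None once to show the guard at work.

```python
def extract_pages(pages) -> list[str]:
    """Return one string per page, never None, in page order."""
    return [(page.extract_text() or "") for page in pages]


class FakePage:
    """Hypothetical stand-in for a pypdf page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("Page one text"), FakePage(None), FakePage("Page three")]
texts = extract_pages(pages)

# enumerate(..., start=1) converts the 0-based index to a human page number
for number, text in enumerate(texts, start=1):
    print(number, len(text))
```

With a real document you would pass `reader.pages` instead of the fake list.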

Encrypted PDFs

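reader = PdfReader("protected.pdf")
if reader.is_encrypted:
    # decrypt() returns a falsy value when the password is wrong
    if not reader.decrypt("the-password"):   # supply the password you were given
        raise SystemExit("wrong password for protected.pdf")
text = reader.pages[0].extract_text()

Many statements are password-protected. decrypt unlocks a PDF you have the legitimate password for: use it to read your own documents, not to bypass protection you aren't entitled to. Check its return value, because reading pages from a file that is still locked raises an error.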

Pulling out fields with regex

import re

text = reader.pages[0].extract_text() or ""

# find an invoice number like "INV-2026-0042"
m = re.search(r"INV-\d{4}-\d{4}", text)
invoice_no = m.group() if m else None

# find a total like "Total: $1,234.56"
m = re.search(r"Total:\s*\$([\d,]+\.\d{2})", text)
total = m.group(1).replace(",", "") if m else None

print(invoice_no, total)

Once you have the text, it's a string — your regex and parsing skills do the rest. This is how you read 500 invoices and pull every total into a spreadsheet automatically.
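If the totals are headed for a spreadsheet or will be summed, it may help to convert them to numbers straight away. The sketch below reuses the total pattern from above and returns a `Decimal`, which avoids float rounding on money; `parse_total` and the sample text are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional

TOTAL_RE = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")


def parse_total(text: str) -> Optional[Decimal]:
    """Find 'Total: $1,234.56' in text and return it as a Decimal."""
    m = TOTAL_RE.search(text)
    if m is None:
        return None  # the "not found" case: let the caller decide
    return Decimal(m.group(1).replace(",", ""))


sample = "Invoice INV-2026-0042\nTotal: $1,234.56\nThank you"
print(parse_total(sample))            # 1234.56
print(parse_total("no amount here"))  # None
```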

Extraction is messy — plan for it

Spacing, column order, and line breaks vary wildly. Build your regex against several real samples, anchor on stable labels ("Total:", "Invoice #"), and always handle the "not found" case. Never assume a field is present — log and skip when it isn't.
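One way to act on this advice is to keep a handful of sample strings next to the pattern and rerun them whenever the regex changes. The samples below are hand-written imitations of how different PDFs might extract, not real output, and `find_total` is a hypothetical helper.

```python
import re

# Tolerant pattern: any whitespace (including line breaks) after the label,
# optional currency sign. The label "Total:" is the stable anchor, and \b
# keeps labels such as "SubTotal:" from matching.
TOTAL_RE = re.compile(r"\bTotal:\s*\$?\s*([\d,]+\.\d{2})")

# Hand-written samples imitating how different PDFs might extract
SAMPLES = {
    "tight":      "Total:$950.50",
    "spaced":     "Total:      $ 1,200.00",
    "line_break": "Total:\n$75.00",
    "missing":    "Amount due on receipt",
}


def find_total(text: str) -> str:
    """Return the total as a plain number string, or "" when not found."""
    m = TOTAL_RE.search(text)
    return m.group(1).replace(",", "") if m else ""


results = {name: find_total(text) for name, text in SAMPLES.items()}
for name, value in results.items():
    print(f"{name:<11} {value or 'NOT FOUND'}")
```

When a new invoice layout breaks the pattern, add its extracted text to the samples first, then adjust the regex until every sample passes again.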

Worked Example · Batch Invoice Extractor

12 min

Goal: scan a folder of invoice PDFs, pull the invoice number, date, and total from each, write a CSV summary, and log any that don't parse — a genuine accounts-payable automation.

import re, csv, logging
from pathlib import Path
from pypdf import PdfReader

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("invoices")

PATTERNS = {
    # the captured number must contain a digit, so "Invoice Date" is not read as "D"
    "invoice_no": re.compile(r"Invoice\s*#?\s*([A-Z0-9-]*\d[A-Z0-9-]*)"),
    "date":       re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total":      re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}

def extract(pdf: Path) -> dict | None:
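    try:
        reader = PdfReader(str(pdf))
        text = "\n".join(p.extract_text() or "" for p in reader.pages)
    except Exception as exc:  # corrupt or still-encrypted file: log it, keep the batch going
        log.warning("%s: could not read (%s)", pdf.name, exc)
        return None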
    if not text.strip():
        log.warning("%s: no text (scanned image?) — skipping", pdf.name)
        return None

    record = {"file": pdf.name}
    for field, pattern in PATTERNS.items():
        m = pattern.search(text)
        record[field] = m.group(1).replace(",", "") if m else ""
    if not record["total"]:
        log.warning("%s: no total found", pdf.name)
    return record

def run(folder: str, out: str) -> None:
    records = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        rec = extract(pdf)
        if rec:
            records.append(rec)
            log.info("%s → %s", pdf.name, rec["total"] or "?")

    with open(out, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["file", "invoice_no", "date", "total"])
        w.writeheader()
        w.writerows(records)
    log.info("wrote %d invoices → %s", len(records), out)

run("invoices", "invoice_summary.csv")

INFO inv-001.pdf → 1200.00
INFO inv-002.pdf → 950.50
WARNING inv-003.pdf: no text (scanned image?) — skipping
INFO wrote 2 invoices → invoice_summary.csv

Read the code

Compiled regex patterns in a dict keep the extraction rules readable and easy to extend. The two guard clauses are the real lesson: an empty extraction usually means a scanned image (logged and skipped), and a missing total is flagged but doesn't crash the batch. The output is a CSV that feeds straight into Lesson 15's tools or Lesson 19's Excel report. This turns hours of manual data entry into seconds while honestly surfacing the documents a human still needs to look at.
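To make "the documents a human still needs to look at" concrete, the records could be split into complete and needs-review lists before writing. This is a sketch under assumptions: `split_for_review`, `REQUIRED`, and the extra "missing" key are hypothetical additions, and the records are hand-written in the shape extract() returns.

```python
REQUIRED = ("invoice_no", "date", "total")


def split_for_review(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate fully parsed records from ones with any empty field."""
    complete, review = [], []
    for rec in records:
        missing = [f for f in REQUIRED if not rec.get(f)]
        if missing:
            # copy the record and note which fields a human must fill in
            review.append({**rec, "missing": ",".join(missing)})
        else:
            complete.append(rec)
    return complete, review


# Hand-written records in the shape extract() returns
records = [
    {"file": "inv-001.pdf", "invoice_no": "INV-2026-0042",
     "date": "2026-03-01", "total": "1200.00"},
    {"file": "inv-002.pdf", "invoice_no": "", "date": "2026-03-02", "total": ""},
]
complete, review = split_for_review(records)
print(len(complete), "complete;", len(review), "need review")
print(review[0]["file"], "is missing:", review[0]["missing"])
```

The review list could then go to its own CSV so nobody has to dig through the log to find the problem files.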

Try It Yourself

13 min

01 🟢 PDF info

Open any PDF and print its page count, title, author, and the first 200 characters of text. Guard against None metadata.

02 🟡 Keyword search

Write find_in_pdf(path, word) that returns the page numbers (1-based, for humans) where a word appears. Case-insensitive.

Hint

from pypdf import PdfReader

def find_in_pdf(path, word):
    reader = PdfReader(path)
    hits = []
    for i, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        if word.lower() in text:
            hits.append(i)
    return hits

print(find_in_pdf("report.pdf", "revenue"))
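The hint uses a plain substring test, so searching for "nda" would also hit "standard". If whole-word matches are wanted, a regex with word boundaries is one option. The sketch below works on a list of already-extracted page texts; `pages_with_word` and the sample texts are hypothetical.

```python
import re


def pages_with_word(page_texts: list[str], word: str) -> list[int]:
    """Return 1-based page numbers where `word` appears as a whole word."""
    # re.escape treats the word literally; \b marks word boundaries
    rx = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [n for n, text in enumerate(page_texts, start=1) if rx.search(text or "")]


# Hand-written page texts standing in for extract_text() results
texts = ["Standard terms apply.", "This NDA is binding.", "", "Sign the nda today."]
print(pages_with_word(texts, "nda"))  # [2, 4]
```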

03 🔴 Bulk word count

Scan a folder of PDFs and report the total word count across all of them, plus the longest single document. Skip image-only PDFs with a warning.

Hint

from pathlib import Path
from pypdf import PdfReader

totals = {}
for pdf in Path("docs").glob("*.pdf"):
    # join with "\n" so the last word of one page and the first of the next stay separate
    text = "\n".join(p.extract_text() or "" for p in PdfReader(str(pdf)).pages)
    if not text.strip():
        print("skip (no text):", pdf.name)
        continue
    totals[pdf.name] = len(text.split())

print("total words:", sum(totals.values()))
if totals:  # max() raises ValueError on an empty dict
    print("longest:", max(totals, key=totals.get))

Mini-Challenge · The Document Indexer

8 min

Build build_index(folder) that scans a folder of PDFs and produces a search index: a dict mapping each keyword (from a given list) to the list of files containing it. Then write search(index, word) that returns the matching files. A tiny full-text search over a document pile.

Show a sample solution

from pathlib import Path
from collections import defaultdict
from pypdf import PdfReader

def build_index(folder: str, keywords: list[str]) -> dict:
    index = defaultdict(list)
    for pdf in Path(folder).glob("*.pdf"):
        # join with "\n" so a keyword can't match across a page boundary
        text = "\n".join(p.extract_text() or ""
                         for p in PdfReader(str(pdf)).pages).lower()
        for kw in keywords:
            if kw.lower() in text:
                index[kw.lower()].append(pdf.name)
    return dict(index)

def search(index: dict, word: str) -> list[str]:
    return index.get(word.lower(), [])

idx = build_index("contracts", ["nda", "termination", "renewal"])
print(search(idx, "renewal"))

Non-negotiables: keyword→files index, case-insensitive, a working search lookup.

Recap

3 min

pypdf's PdfReader opens a PDF: len(reader.pages) for the count, reader.metadata for title/author/dates (guard for None), and page.extract_text() for the text (guard for empty or missing text on image pages). Once you have text it's just a string: use regex to pull fields like invoice numbers and totals, always handling the "not found" case. Extraction is messy and order isn't guaranteed, so test against real samples, and remember that scanned PDFs are images needing OCR, not text extraction. Pipe the results to CSV/Excel and you've automated document data entry.

Vocabulary Card

PdfReader: pypdf's object for opening and reading a PDF.
extract_text: Reconstructs a page's text from its drawing instructions (empty for image-only pages).
metadata: A PDF's title, author, and dates — any can be missing.
OCR: Optical character recognition — needed to read text from scanned images.

Homework

4 min

Build pdfgrep.py <folder> <pattern> (argparse) that searches every PDF in a folder for a regex pattern and prints each match with its file and page number. Add a --context flag that also prints ~40 characters around each match. Warn on image-only PDFs and handle encrypted files gracefully.

Sample · pdfgrep.py

import argparse, re
from pathlib import Path
from pypdf import PdfReader

p = argparse.ArgumentParser(description="grep across PDFs")
p.add_argument("folder")
p.add_argument("pattern")
p.add_argument("--context", action="store_true")
a = p.parse_args()
rx = re.compile(a.pattern, re.IGNORECASE)

for pdf in sorted(Path(a.folder).glob("*.pdf")):
    try:
        reader = PdfReader(str(pdf))
        # an empty password opens some files; anything else we can't read
        if reader.is_encrypted and not reader.decrypt(""):
            print(f"!! {pdf.name}: encrypted, password required")
            continue
        texts = [page.extract_text() or "" for page in reader.pages]
    except Exception as exc:
        print(f"!! {pdf.name}: {exc}")
        continue

    if not any(t.strip() for t in texts):
        print(f"!! {pdf.name}: no text (scanned image?)")
        continue

    for i, text in enumerate(texts, start=1):
        for m in rx.finditer(text):
            if a.context:
                start = max(0, m.start() - 40)
                end = m.end() + 40
                snippet = text[start:end].replace("\n", " ")
                print(f"{pdf.name} p{i}: …{snippet}…")
            else:
                print(f"{pdf.name} p{i}: {m.group()}")

Non-negotiables: regex search, file+page output, --context snippets, image/encrypted handling.
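The context logic is easier to test when it lives in its own function. The sketch below is one possible shape, not part of the sample above: `snippet` and its `width` parameter are hypothetical, and unlike the sample it adds an ellipsis only on a side that was actually cut.

```python
import re


def snippet(text: str, start: int, end: int, width: int = 40) -> str:
    """Return the match plus up to `width` characters on each side."""
    lo = max(0, start - width)            # never slice before the string start
    hi = min(len(text), end + width)
    body = " ".join(text[lo:hi].split())  # collapse line breaks and runs of spaces
    prefix = "…" if lo > 0 else ""        # mark only the sides that were cut
    suffix = "…" if hi < len(text) else ""
    return f"{prefix}{body}{suffix}"


page = "Payment terms\nTotal:   $1,234.56 due within 30 days of the invoice date."
for m in re.finditer(r"\$[\d,]+\.\d{2}", page):
    print(snippet(page, m.start(), m.end(), width=10))
```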
