Learning Goals
3 min
By the end of this lesson you can:
- Open a PDF and read its page count and metadata with pypdf.
- Extract text from one page or a whole document.
- Search extracted text and pull out fields with regex.
- Recognise the limits of text extraction (scanned PDFs need OCR).
Warm-Up · PDFs Are Weird
5 min
A PDF isn't a document of text; it's a set of drawing instructions: "put this glyph at x=72, y=640." There's often no concept of "paragraph" or even a reliable reading order. Two PDFs that look identical can extract very differently.
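To make the "drawing instructions" idea concrete, here is a toy sketch that rebuilds lines from positioned glyphs. The tuple layout, the reconstruct_lines name, and the line_tolerance value are assumptions made for illustration; real PDFs carry fonts, transforms, and kerning, and this is not how pypdf is implemented.

```python
# Toy model only: each glyph is an (x, y, char) tuple in arbitrary order.
from collections import defaultdict


def reconstruct_lines(glyphs, line_tolerance=2.0):
    """Group glyphs into rows by y position, then order each row by x."""
    rows = defaultdict(list)
    for x, y, ch in glyphs:
        # Snap nearby y values to the same row key so slightly offset
        # glyphs end up on the same line.
        rows[round(y / line_tolerance)].append((x, ch))
    # PDF y coordinates grow upward, so the largest y is the top line.
    ordered = sorted(rows.items(), key=lambda item: -item[0])
    return ["".join(ch for _, ch in sorted(row)) for _, row in ordered]


# Glyphs deliberately listed out of reading order, as a PDF is free to do.
glyphs = [(84, 640, "i"), (72, 620, "y"), (72, 640, "H"), (78, 620, "o"), (84, 620, "u")]
print(reconstruct_lines(glyphs))  # ['Hi', 'you']
```

Even this tiny version needs a guess (the tolerance) to decide what counts as "the same line", which is why two visually identical PDFs can extract differently.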
pip install pypdf # modern, pure-Python PDF library
pypdf reconstructs text as best it can from those drawing instructions. For digitally-generated PDFs (exported from Word, a website, an accounting system) it works well. For scanned PDFs — images of paper — there's no text to extract; you'd need OCR (e.g. pytesseract). Know which kind you have before you trust the output.
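One rough way to find out which kind of PDF you have is to extract every page and see whether anything meaningful comes back. The sketch below assumes a hypothetical helper named looks_scanned and an arbitrary min_chars threshold; neither is part of pypdf.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True if the pages carry too little text to be a digital PDF.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    total = sum(len((t or "").strip()) for t in page_texts)
    return total < min_chars


def check_pdf(path):
    # Imported lazily so looks_scanned can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return looks_scanned([page.extract_text() for page in reader.pages])


print(looks_scanned(["", "   ", None]))                              # True
print(looks_scanned(["Invoice INV-2026-0042\nTotal: $1,234.56"]))    # False
```

A True result is only a hint that OCR may be needed; a digital PDF with a nearly blank page would trip the same check.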
New Concept · Reading PDFs
14 min
Opening and inspecting

from pypdf import PdfReader

reader = PdfReader("invoice.pdf")
print(len(reader.pages))           # number of pages

meta = reader.metadata             # may be None if the PDF has no info dictionary
if meta is not None:
    print(meta.title)              # document title (may be None)
    print(meta.author)
    print(meta.creation_date)
reader.pages is a list-like sequence of page objects; reader.metadata holds the title, author, and dates. The metadata object itself can be None, and so can any individual field, so always guard.
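The guard can be wrapped in a small helper so every metadata read goes through the same check. This is a minimal sketch: meta_field and its "(unknown)" default are hypothetical names, and SimpleNamespace stands in for reader.metadata so the example runs without a PDF.

```python
from types import SimpleNamespace


def meta_field(metadata, name, default="(unknown)"):
    """Read one metadata attribute, tolerating a missing object or field."""
    value = getattr(metadata, name, None) if metadata is not None else None
    return default if value is None else value


# Stand-ins for reader.metadata, so the sketch runs without a file.
full = SimpleNamespace(title="Q3 Invoice", author=None)
print(meta_field(full, "title"))    # Q3 Invoice
print(meta_field(full, "author"))   # (unknown)
print(meta_field(None, "title"))    # (unknown)
```

With a real reader the call would look like meta_field(reader.metadata, "title").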
Extracting text
# one page (0-based)
first_page = reader.pages[0]
print(first_page.extract_text())

# the whole document
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(len(full_text), "characters extracted")
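Note the or "": it protects the join from a page whose extraction comes back as None. Recent pypdf versions return an empty string for image-only pages, so the guard is cheap insurance rather than a strict requirement. Pages are 0-based here (reader.pages behaves like a Python list), unlike openpyxl's 1-based cells.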
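Because people say "page 3" while the code wants index 2, a small conversion helper avoids off-by-one mistakes. Both function names below are hypothetical and only illustrate the idea.

```python
def to_index(page_number, page_count):
    """Convert a 1-based page number (as a person says it) to a 0-based index."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def join_pages(page_texts):
    # Same guard as in the lesson: treat a missing extraction as an empty string.
    return "\n".join(t or "" for t in page_texts)


print(to_index(1, 10))                      # 0
print(repr(join_pages(["a", None, "b"])))   # 'a\n\nb'
```

With a real reader, the third page would be reader.pages[to_index(3, len(reader.pages))].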
Encrypted PDFs
reader = PdfReader("protected.pdf")
if reader.is_encrypted:
    reader.decrypt("the-password")   # supply the password you were given
text = reader.pages[0].extract_text()
Many statements are password-protected. decrypt unlocks a PDF you have the legitimate password for — for reading your own documents, not bypassing protection you aren't entitled to.
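A wrong password does not raise at the decrypt call itself; as far as I know, pypdf's decrypt returns a falsy value when the password is rejected, and the error only appears later when you touch the pages. The sketch below checks that return value up front. unlock is a hypothetical helper, and FakeReader is a stand-in so the example runs without a protected file.

```python
def unlock(reader, password):
    """Return True if the reader is readable, decrypting it first when needed."""
    if not reader.is_encrypted:
        return True
    # Assumption: decrypt() returns a falsy value when the password is rejected.
    return bool(reader.decrypt(password))


class FakeReader:
    """Hypothetical stand-in for PdfReader, used only to exercise unlock()."""

    def __init__(self, secret=None):
        self.is_encrypted = secret is not None
        self._secret = secret

    def decrypt(self, password):
        return 1 if password == self._secret else 0


print(unlock(FakeReader(), "anything"))        # True  (not encrypted)
print(unlock(FakeReader("s3cret"), "s3cret"))  # True
print(unlock(FakeReader("s3cret"), "guess"))   # False
```

In a batch job, a False result is the cue to log the file and move on instead of crashing on the first page access.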
Pulling out fields with regex
import re

text = reader.pages[0].extract_text() or ""

# find an invoice number like "INV-2026-0042"
m = re.search(r"INV-\d{4}-\d{4}", text)
invoice_no = m.group() if m else None

# find a total like "Total: $1,234.56"
m = re.search(r"Total:\s*\$([\d,]+\.\d{2})", text)
total = m.group(1).replace(",", "") if m else None

print(invoice_no, total)
Once you have the text, it's a string — your regex and parsing skills do the rest. This is how you read 500 invoices and pull every total into a spreadsheet automatically.
Spacing, column order, and line breaks vary wildly. Build your regex against several real samples, anchor on stable labels ("Total:", "Invoice #"), and always handle the "not found" case. Never assume a field is present — log and skip when it isn't.
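The advice above can be tried without any PDF by running a pattern over a handful of text samples. This is a minimal sketch: the TOTAL pattern, the find_total name, and the sample strings are illustrative assumptions, not taken from real invoices.

```python
import re

# \b stops "Subtotal" from matching; the colon and dollar sign are optional.
TOTAL = re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE)


def find_total(text):
    # Collapse runs of whitespace (including line breaks) into single spaces
    # so every sample is matched in the same normalised form.
    flat = re.sub(r"\s+", " ", text)
    m = TOTAL.search(flat)
    return m.group(1).replace(",", "") if m else None


samples = [
    "Total: $1,234.56",
    "TOTAL\n   $ 950.50",
    "Total :1200.00",
    "Subtotal: $10.00",
    "Amount due on receipt",
]
for s in samples:
    print(repr(s), "->", find_total(s))
```

The last two samples return None, which is the "not found" case the caller has to log and skip.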
Worked Example · Batch Invoice Extractor
12 min
Goal: scan a folder of invoice PDFs, pull the invoice number, date, and total from each, write a CSV summary, and log any that don't parse. This is a genuine accounts-payable automation.

import csv
import logging
import re
from pathlib import Path

from pypdf import PdfReader

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("invoices")

# One compiled pattern per field; each captures the value in group 1.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*#?\s*([A-Z0-9-]+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}


def extract(pdf: Path) -> dict | None:
    reader = PdfReader(str(pdf))
    text = "\n".join(p.extract_text() or "" for p in reader.pages)
    if not text.strip():
        log.warning("%s: no text (scanned image?) — skipping", pdf.name)
        return None
    record = {"file": pdf.name}
    for field, pattern in PATTERNS.items():
        m = pattern.search(text)
        record[field] = m.group(1).replace(",", "") if m else ""
    if not record["total"]:
        log.warning("%s: no total found", pdf.name)
    return record


def run(folder: str, out: str) -> None:
    records = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        rec = extract(pdf)
        if rec:
            records.append(rec)
            log.info("%s → %s", pdf.name, rec["total"] or "?")
    with open(out, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["file", "invoice_no", "date", "total"])
        w.writeheader()
        w.writerows(records)
    log.info("wrote %d invoices → %s", len(records), out)


run("invoices", "invoice_summary.csv")
INFO inv-001.pdf → 1200.00
INFO inv-002.pdf → 950.50
WARNING inv-003.pdf: no text (scanned image?) — skipping
INFO wrote 2 invoices → invoice_summary.csv
Read the code
Compiled regex patterns in a dict keep the extraction rules readable and easy to extend. The two guard clauses are the real lesson: an empty extraction means a scanned image (logged and skipped), and a missing total is flagged but doesn't crash the batch. The output is a CSV — straight into Lesson 15's tools or Lesson 19's Excel report. This turns hours of manual data entry into seconds, while honestly surfacing the documents a human still needs to look at.
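To surface the documents a human still needs to look at, the extracted records can be split into complete ones and ones with gaps. This is a sketch under stated assumptions: split_for_review and REQUIRED are hypothetical names, and the sample records are made up to mirror the shape that extract() returns.

```python
REQUIRED = ("invoice_no", "date", "total")


def split_for_review(records):
    """Separate complete records from those with any empty required field."""
    complete, review = [], []
    for rec in records:
        missing = [field for field in REQUIRED if not rec.get(field)]
        if missing:
            review.append((rec["file"], missing))
        else:
            complete.append(rec)
    return complete, review


# Made-up records in the same shape as the worked example's output.
records = [
    {"file": "inv-001.pdf", "invoice_no": "A-17", "date": "2026-01-05", "total": "1200.00"},
    {"file": "inv-002.pdf", "invoice_no": "A-18", "date": "", "total": ""},
]
complete, review = split_for_review(records)
print(len(complete), "complete")
for name, missing in review:
    print("review:", name, "missing", ", ".join(missing))
```

Writing the review list to its own file gives whoever checks the batch a short, specific to-do list.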
Try It Yourself
13 min
Open any PDF and print its page count, title, author, and the first 200 characters of text. Guard against None metadata.
Write find_in_pdf(path, word) that returns the page numbers (1-based, for humans) where a word appears. Case-insensitive.
Hint
from pypdf import PdfReader


def find_in_pdf(path, word):
    reader = PdfReader(path)
    hits = []
    for i, page in enumerate(reader.pages, start=1):  # 1-based for humans
        text = (page.extract_text() or "").lower()
        if word.lower() in text:
            hits.append(i)
    return hits


print(find_in_pdf("report.pdf", "revenue"))
Scan a folder of PDFs and report the total word count across all of them, plus the longest single document. Skip image-only PDFs with a warning.
Hint
from pathlib import Path

from pypdf import PdfReader

totals = {}
for pdf in Path("docs").glob("*.pdf"):
    # join pages with a newline so words at page boundaries don't fuse
    text = "\n".join(p.extract_text() or "" for p in PdfReader(str(pdf)).pages)
    if not text.strip():
        print("skip (no text):", pdf.name)
        continue
    totals[pdf.name] = len(text.split())

print("total words:", sum(totals.values()))
if totals:  # max() raises ValueError on an empty dict
    print("longest:", max(totals, key=totals.get))
Mini-Challenge · The Document Indexer
8 min
Build build_index(folder, keywords) that scans a folder of PDFs and produces a search index: a dict mapping each keyword (from a given list) to the list of files containing it. Then write search(index, word) that returns the matching files. The result is a tiny full-text search over a document pile.
Show a sample solution
from collections import defaultdict
from pathlib import Path

from pypdf import PdfReader


def build_index(folder: str, keywords: list[str]) -> dict:
    index = defaultdict(list)
    for pdf in Path(folder).glob("*.pdf"):
        pages = PdfReader(str(pdf)).pages
        # newline between pages so text from adjacent pages doesn't fuse
        text = "\n".join(p.extract_text() or "" for p in pages).lower()
        for kw in keywords:
            if kw.lower() in text:
                index[kw.lower()].append(pdf.name)
    return dict(index)


def search(index: dict, word: str) -> list[str]:
    return index.get(word.lower(), [])


idx = build_index("contracts", ["nda", "termination", "renewal"])
print(search(idx, "renewal"))
Non-negotiables: keyword→files index, case-insensitive, a working search lookup.
Recap
3 min
pypdf's PdfReader opens a PDF: len(reader.pages) for the count, reader.metadata for title/author/dates (guard for None), and page.extract_text() for the text (guard for empty or missing text on image pages). Once you have text it's just a string: use regex to pull fields like invoice numbers and totals, always handling the "not found" case. Extraction is messy and order isn't guaranteed, so test against real samples, and remember that scanned PDFs are images needing OCR, not text extraction. Pipe the results to CSV/Excel and you've automated document data entry.
Vocabulary Card
- PdfReader: pypdf's object for opening and reading a PDF.
- extract_text: Reconstructs a page's text from its drawing instructions (may be empty for image-only pages).
- metadata: A PDF's title, author, and dates; any can be missing.
- OCR: Optical character recognition, needed to read text from scanned images.
Homework
4 min
Build pdfgrep.py <folder> <pattern> (argparse) that searches every PDF in a folder for a regex pattern and prints each match with its file and page number. Add a --context flag that also prints ~40 characters around each match. Warn on image-only PDFs and handle encrypted files gracefully.
Sample · pdfgrep.py
import argparse
import re
from pathlib import Path

from pypdf import PdfReader

p = argparse.ArgumentParser(description="grep across PDFs")
p.add_argument("folder")
p.add_argument("pattern")
p.add_argument("--context", action="store_true")
a = p.parse_args()

rx = re.compile(a.pattern, re.IGNORECASE)

for pdf in sorted(Path(a.folder).glob("*.pdf")):
    try:
        reader = PdfReader(str(pdf))
        # Try the empty password; decrypt() returns a falsy value on failure.
        if reader.is_encrypted and not reader.decrypt(""):
            print(f"!! {pdf.name}: encrypted, password required")
            continue
        texts = [page.extract_text() or "" for page in reader.pages]
    except Exception as e:
        print(f"!! {pdf.name}: {e}")
        continue
    if not any(t.strip() for t in texts):
        print(f"!! {pdf.name}: no text (scanned image?)")
        continue
    for i, text in enumerate(texts, start=1):
        for m in rx.finditer(text):
            if a.context:
                start = max(0, m.start() - 40)
                end = m.end() + 40
                snippet = text[start:end].replace("\n", " ")
                print(f"{pdf.name} p{i}: …{snippet}…")
            else:
                print(f"{pdf.name} p{i}: {m.group()}")
Non-negotiables: regex search, file+page output, --context snippets, image/encrypted handling.