Learning Goals
3 minBy the end of this lesson you can:
- Install and import
openpyxl, and open a workbook. - List worksheets and select one by name or as the active sheet.
- Read a cell by coordinate (
ws["B2"]) or row/column index. - Iterate over rows to pull a sheet into Python data.
Warm-Up · CSV vs. Real Excel
5 minCSV is plain text — one table, no formatting, no formulas. A real .xlsx file is a zipped bundle of XML with multiple sheets, fonts, colours, merged cells, charts, and formulas. You can't read it with open().
pip install openpyxl # the standard library for .xlsx in Python
An Excel file is a Workbook containing one or more Worksheets, each a grid of Cells addressed like A1, B2. openpyxl gives you those three objects. Read cells by their familiar Excel coordinates, or iterate rows like you did with CSV — but now with sheets, types, and formatting intact.
New Concept · Workbook → Sheet → Cell
14 minOpening a workbook
from openpyxl import load_workbook wb = load_workbook("sales.xlsx", data_only=True) print(wb.sheetnames) # ['Summary', 'January', 'February']
data_only=True is important: it reads the computed values of formula cells (what you see in Excel) rather than the formula text like "=SUM(B2:B10)". Use it whenever you want results, not formulas.
Selecting a sheet
ws = wb.active # the sheet that was open when saved ws = wb["January"] # by name print(ws.title) # "January" print(ws.max_row, ws.max_column) # dimensions, e.g. 50 4
Reading single cells
# by Excel coordinate print(ws["A1"].value) # the header in A1 print(ws["B2"].value) # a data cell # by row, column number (both 1-based!) print(ws.cell(row=2, column=2).value) # same as B2
Unlike Python lists (0-based), openpyxl rows and columns start at 1 — matching Excel. ws.cell(row=1, column=1) is A1. Forgetting this is the #1 beginner bug.
Iterating rows
# every row as a tuple of values for row in ws.iter_rows(values_only=True): print(row) # ('Date', 'Region', 'Amount'), then data rows… # skip the header, read columns A and C only for row in ws.iter_rows(min_row=2, min_col=1, max_col=3, values_only=True): date, region, amount = row print(date, amount)
values_only=True gives plain values (not cell objects). min_row=2 skips the header. min_col/max_col select a column range — exactly the controls you need to read a real report.
Turning a sheet into dicts (like DictReader)
rows = ws.iter_rows(values_only=True) headers = next(rows) # first row = column names records = [dict(zip(headers, row)) for row in rows] print(records[0]) # {'Date': ..., 'Region': 'KL', 'Amount': 120}
This bridges Excel to the dict-of-rows shape you used for CSV in Lesson 15 — so all your CSV-era logic (filter, group, validate) applies unchanged.
Cell types come through
Unlike CSV (everything is a string), openpyxl preserves types: numbers arrive as int/float, dates as datetime objects, booleans as bool. No manual int() conversion needed — though always guard against None for empty cells.
Worked Example · Summarise a Multi-Sheet Workbook
12 minGoal: a workbook has one sheet per month, each with Date, Region, Amount. Read them all and report the grand total plus a per-region breakdown — the report a manager currently builds by hand every quarter.
from openpyxl import load_workbook from collections import defaultdict import logging logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s") log = logging.getLogger("xlsx") def summarise(path: str) -> None: wb = load_workbook(path, data_only=True, read_only=True) by_region = defaultdict(float) grand_total = 0.0 for name in wb.sheetnames: ws = wb[name] rows = ws.iter_rows(min_row=2, values_only=True) # skip header sheet_total = 0.0 for date, region, amount in rows: if amount is None: # empty cell — skip continue amount = float(amount) by_region[region] += amount sheet_total += amount grand_total += sheet_total log.info("%s: %.2f", name, sheet_total) wb.close() print(f"\nGrand total: {grand_total:.2f}") for region, total in sorted(by_region.items(), key=lambda kv: kv[1], reverse=True): print(f" {region:12} {total:>10.2f}") summarise("quarterly.xlsx")
INFO January: 28100.00 INFO February: 26450.50 INFO March: 29660.00 Grand total: 84210.50 Central 31200.00 North 28010.50 South 25000.00
Read the code
We loop sheets via wb.sheetnames, skip each header with min_row=2, and accumulate into a defaultdict — the exact pattern from the CSV lessons, now reading native Excel. Two production touches: read_only=True streams large workbooks without loading everything into memory, and the if amount is None guard handles the blank cells real spreadsheets always have. The quarterly hand-built report is now one function call.
Try It Yourself
13 minOpen any .xlsx (export one from a spreadsheet app or make one). Print its sheet names, and for the first sheet its dimensions and the value of cell A1.
Read a sheet into a list of dicts (header row as keys) and print the first three records and the total row count.
Hint
from openpyxl import load_workbook ws = load_workbook("data.xlsx", data_only=True).active rows = ws.iter_rows(values_only=True) headers = next(rows) records = [dict(zip(headers, r)) for r in rows] print(records[:3], "…", len(records), "rows")
Write column_values(ws, name) that finds the column whose header matches name and returns all its values below the header — regardless of which letter that column is in.
Hint
def column_values(ws, name): headers = [c.value for c in ws[1]] # first row idx = headers.index(name) # 0-based position return [row[idx] for row in ws.iter_rows(min_row=2, values_only=True)] print(column_values(ws, "Amount"))
Mini-Challenge · The Excel-to-CSV Converter
8 minWrite xlsx_to_csv(path) that exports every sheet of a workbook to its own CSV file (book__SheetName.csv), preserving values. Handle dates by formatting them as YYYY-MM-DD strings. This bridges Excel into all your Lesson 15 CSV tools.
Show a sample solution
import csv from datetime import datetime from pathlib import Path from openpyxl import load_workbook def cellval(v): if isinstance(v, datetime): return v.strftime("%Y-%m-%d") return "" if v is None else v def xlsx_to_csv(path: str) -> None: wb = load_workbook(path, data_only=True, read_only=True) stem = Path(path).stem for name in wb.sheetnames: ws = wb[name] out = f"{stem}__{name}.csv" with open(out, "w", newline="", encoding="utf-8") as f: w = csv.writer(f) for row in ws.iter_rows(values_only=True): w.writerow([cellval(v) for v in row]) print("wrote", out) wb.close() xlsx_to_csv("report.xlsx")
Non-negotiables: one CSV per sheet, date formatting, None→empty, read_only for safety.
Recap
3 minopenpyxl opens real .xlsx files: a Workbook holds Worksheets (list via wb.sheetnames, select with wb["Name"]), each a grid of Cells read by coordinate (ws["B2"].value) or 1-based index. Use data_only=True for computed values, read_only=True for big files, and iter_rows(values_only=True) — with min_row=2 to skip headers — to pull a sheet into Python data. Zip headers with rows to get the dict-of-rows shape from CSV land, and guard against None empty cells. Next: writing styled workbooks back out.
Vocabulary Card
- load_workbook
- Opens an .xlsx file into a Workbook object.
- data_only
- Reads computed formula results instead of formula text.
- iter_rows
- Iterates a sheet's rows;
values_only=Trueyields plain values. - 1-based indexing
- openpyxl rows/columns start at 1, matching Excel (not Python lists).
Homework
4 minMake a small .xlsx with a header row and 20+ data rows (e.g. a class gradebook: Name, Subject, Score). Write analyse.py <file> that reads it, prints the highest and lowest scorer, the average per subject, and how many rows had a blank or invalid score. Use iter_rows and guard against None.
Sample · analyse.py
import argparse from collections import defaultdict from openpyxl import load_workbook p = argparse.ArgumentParser(); p.add_argument("file") a = p.parse_args() ws = load_workbook(a.file, data_only=True).active rows = ws.iter_rows(min_row=2, values_only=True) scores = [] # (name, subject, score) bad = 0 subject_totals = defaultdict(lambda: [0, 0]) # subject -> [sum, count] for name, subject, score in rows: if not isinstance(score, (int, float)): bad += 1 continue scores.append((name, subject, score)) subject_totals[subject][0] += score subject_totals[subject][1] += 1 if scores: top = max(scores, key=lambda r: r[2]) low = min(scores, key=lambda r: r[2]) print(f"Top: {top[0]} ({top[2]})") print(f"Low: {low[0]} ({low[2]})") for subj, (total, n) in subject_totals.items(): print(f"{subj}: avg {total/n:.1f}") print(f"Invalid/blank scores: {bad}")
Non-negotiables: reads real .xlsx, top/low scorer, per-subject average, counts invalid scores.