PY-L7-18 · Excel Automation: openpyxl Read

Learning Goals

3 min

By the end of this lesson you can:

Install and import openpyxl, and open a workbook.
List worksheets and select one by name or as the active sheet.
Read a cell by coordinate (ws["B2"]) or row/column index.
Iterate over rows to pull a sheet into Python data.

Warm-Up · CSV vs. Real Excel

5 min

CSV is plain text — one table, no formatting, no formulas. A real .xlsx file is a zipped bundle of XML with multiple sheets, fonts, colours, merged cells, charts, and formulas. You can't read it with open().

pip install openpyxl     # the standard library for .xlsx in Python

Today's big idea

An Excel file is a Workbook containing one or more Worksheets, each a grid of Cells addressed like A1, B2. openpyxl gives you those three objects. Read cells by their familiar Excel coordinates, or iterate rows like you did with CSV — but now with sheets, types, and formatting intact.

New Concept · Workbook → Sheet → Cell

14 min

Opening a workbook

from openpyxl import load_workbook

wb = load_workbook("sales.xlsx", data_only=True)
print(wb.sheetnames)            # ['Summary', 'January', 'February']

data_only=True is important: it reads the computed values of formula cells (what you see in Excel) rather than the formula text like "=SUM(B2:B10)". Use it whenever you want results, not formulas.

Selecting a sheet

ws = wb.active                  # the sheet that was open when saved
ws = wb["January"]              # by name
print(ws.title)                 # "January"
print(ws.max_row, ws.max_column)  # dimensions, e.g. 50 4

Reading single cells

# by Excel coordinate
print(ws["A1"].value)           # the header in A1
print(ws["B2"].value)           # a data cell

# by row, column number (both 1-based!)
print(ws.cell(row=2, column=2).value)   # same as B2

⚠️ openpyxl is 1-based

Unlike Python lists (0-based), openpyxl rows and columns start at 1 — matching Excel. ws.cell(row=1, column=1) is A1. Forgetting this is the #1 beginner bug.

Iterating rows

# every row as a tuple of values
for row in ws.iter_rows(values_only=True):
    print(row)        # ('Date', 'Region', 'Amount'), then data rows…

# skip the header, read columns A and C only
for row in ws.iter_rows(min_row=2, min_col=1, max_col=3, values_only=True):
    date, region, amount = row
    print(date, amount)

values_only=True gives plain values (not cell objects). min_row=2 skips the header. min_col/max_col select a column range — exactly the controls you need to read a real report.

Turning a sheet into dicts (like DictReader)

rows = ws.iter_rows(values_only=True)
headers = next(rows)                       # first row = column names
records = [dict(zip(headers, row)) for row in rows]
print(records[0])      # {'Date': ..., 'Region': 'KL', 'Amount': 120}

This bridges Excel to the dict-of-rows shape you used for CSV in Lesson 15 — so all your CSV-era logic (filter, group, validate) applies unchanged.

Cell types come through

Unlike CSV (everything is a string), openpyxl preserves types: numbers arrive as int/float, dates as datetime objects, booleans as bool. No manual int() conversion needed — though always guard against None for empty cells.

Worked Example · Summarise a Multi-Sheet Workbook

12 min

Goal: a workbook has one sheet per month, each with Date, Region, Amount. Read them all and report the grand total plus a per-region breakdown — the report a manager currently builds by hand every quarter.

from openpyxl import load_workbook
from collections import defaultdict
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("xlsx")

def summarise(path: str) -> None:
    wb = load_workbook(path, data_only=True, read_only=True)
    by_region = defaultdict(float)
    grand_total = 0.0

    for name in wb.sheetnames:
        ws = wb[name]
        rows = ws.iter_rows(min_row=2, values_only=True)   # skip header
        sheet_total = 0.0
        for date, region, amount in rows:
            if amount is None:           # empty cell — skip
                continue
            amount = float(amount)
            by_region[region] += amount
            sheet_total += amount
        grand_total += sheet_total
        log.info("%s: %.2f", name, sheet_total)

    wb.close()
    print(f"\nGrand total: {grand_total:.2f}")
    for region, total in sorted(by_region.items(),
                                key=lambda kv: kv[1], reverse=True):
        print(f"  {region:12} {total:>10.2f}")

summarise("quarterly.xlsx")

INFO January: 28100.00
INFO February: 26450.50
INFO March: 29660.00

Grand total: 84210.50
  Central        31200.00
  North          28010.50
  South          25000.00

Read the code

We loop sheets via wb.sheetnames, skip each header with min_row=2, and accumulate into a defaultdict — the exact pattern from the CSV lessons, now reading native Excel. Two production touches: read_only=True streams large workbooks without loading everything into memory, and the if amount is None guard handles the blank cells real spreadsheets always have. The quarterly hand-built report is now one function call.

Try It Yourself

13 min

01 🟢 Inspect a workbook

Open any .xlsx (export one from a spreadsheet app or make one). Print its sheet names, and for the first sheet its dimensions and the value of cell A1.

02 🟡 Sheet to records

Read a sheet into a list of dicts (header row as keys) and print the first three records and the total row count.

Hint

from openpyxl import load_workbook

ws = load_workbook("data.xlsx", data_only=True).active
rows = ws.iter_rows(values_only=True)
headers = next(rows)
records = [dict(zip(headers, r)) for r in rows]
print(records[:3], "…", len(records), "rows")

03 🔴 Find a column by name

Write column_values(ws, name) that finds the column whose header matches name and returns all its values below the header — regardless of which letter that column is in.

Hint

def column_values(ws, name):
    headers = [c.value for c in ws[1]]      # first row
    idx = headers.index(name)                # 0-based position
    return [row[idx] for row in
            ws.iter_rows(min_row=2, values_only=True)]

print(column_values(ws, "Amount"))

Mini-Challenge · The Excel-to-CSV Converter

8 min

Write xlsx_to_csv(path) that exports every sheet of a workbook to its own CSV file (book__SheetName.csv), preserving values. Handle dates by formatting them as YYYY-MM-DD strings. This bridges Excel into all your Lesson 15 CSV tools.

Show a sample solution

import csv
from datetime import datetime
from pathlib import Path
from openpyxl import load_workbook

def cellval(v):
    if isinstance(v, datetime):
        return v.strftime("%Y-%m-%d")
    return "" if v is None else v

def xlsx_to_csv(path: str) -> None:
    wb = load_workbook(path, data_only=True, read_only=True)
    stem = Path(path).stem
    for name in wb.sheetnames:
        ws = wb[name]
        out = f"{stem}__{name}.csv"
        with open(out, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            for row in ws.iter_rows(values_only=True):
                w.writerow([cellval(v) for v in row])
        print("wrote", out)
    wb.close()

xlsx_to_csv("report.xlsx")

Non-negotiables: one CSV per sheet, date formatting, None→empty, read_only for safety.

Recap

3 min

openpyxl opens real .xlsx files: a Workbook holds Worksheets (list via wb.sheetnames, select with wb["Name"]), each a grid of Cells read by coordinate (ws["B2"].value) or 1-based index. Use data_only=True for computed values, read_only=True for big files, and iter_rows(values_only=True) — with min_row=2 to skip headers — to pull a sheet into Python data. Zip headers with rows to get the dict-of-rows shape from CSV land, and guard against None empty cells. Next: writing styled workbooks back out.

Vocabulary Card

load_workbook: Opens an .xlsx file into a Workbook object.
data_only: Reads computed formula results instead of formula text.
iter_rows: Iterates a sheet's rows; values_only=True yields plain values.
1-based indexing: openpyxl rows/columns start at 1, matching Excel (not Python lists).

Homework

4 min

Make a small .xlsx with a header row and 20+ data rows (e.g. a class gradebook: Name, Subject, Score). Write analyse.py <file> that reads it, prints the highest and lowest scorer, the average per subject, and how many rows had a blank or invalid score. Use iter_rows and guard against None.

Sample · analyse.py

import argparse
from collections import defaultdict
from openpyxl import load_workbook

p = argparse.ArgumentParser(); p.add_argument("file")
a = p.parse_args()

ws = load_workbook(a.file, data_only=True).active
rows = ws.iter_rows(min_row=2, values_only=True)

scores = []          # (name, subject, score)
bad = 0
subject_totals = defaultdict(lambda: [0, 0])   # subject -> [sum, count]
for name, subject, score in rows:
    if not isinstance(score, (int, float)):
        bad += 1
        continue
    scores.append((name, subject, score))
    subject_totals[subject][0] += score
    subject_totals[subject][1] += 1

if scores:
    top = max(scores, key=lambda r: r[2])
    low = min(scores, key=lambda r: r[2])
    print(f"Top:  {top[0]} ({top[2]})")
    print(f"Low:  {low[0]} ({low[2]})")
for subj, (total, n) in subject_totals.items():
    print(f"{subj}: avg {total/n:.1f}")
print(f"Invalid/blank scores: {bad}")

Non-negotiables: reads real .xlsx, top/low scorer, per-subject average, counts invalid scores.

ws = wb.active # the sheet that was open when saved ws = wb["January"] # by name print(ws.title) # "January" print(ws.max_row, ws.max_column) # dimensions, e.g. 50 4

# by Excel coordinate print(ws["A1"].value) # the header in A1 print(ws["B2"].value) # a data cell # by row, column number (both 1-based!) print(ws.cell(row=2, column=2).value) # same as B2

# every row as a tuple of values for row in ws.iter_rows(values_only=True): print(row) # ('Date', 'Region', 'Amount'), then data rows… # skip the header, read columns A and C only for row in ws.iter_rows(min_row=2, min_col=1, max_col=3, values_only=True): date, region, amount = row print(date, amount)

rows = ws.iter_rows(values_only=True) headers = next(rows) # first row = column names records = [dict(zip(headers, row)) for row in rows] print(records[0]) # {'Date': ..., 'Region': 'KL', 'Amount': 120}

from openpyxl import load_workbook from collections import defaultdict import logging logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s") log = logging.getLogger("xlsx") def summarise(path: str) -> None: wb = load_workbook(path, data_only=True, read_only=True) by_region = defaultdict(float) grand_total = 0.0 for name in wb.sheetnames: ws = wb[name] rows = ws.iter_rows(min_row=2, values_only=True) # skip header sheet_total = 0.0 for date, region, amount in rows: if amount is None: # empty cell — skip continue amount = float(amount) by_region[region] += amount sheet_total += amount grand_total += sheet_total log.info("%s: %.2f", name, sheet_total) wb.close() print(f"\nGrand total: {grand_total:.2f}") for region, total in sorted(by_region.items(), key=lambda kv: kv[1], reverse=True): print(f" {region:12} {total:>10.2f}") summarise("quarterly.xlsx")

from openpyxl import load_workbook ws = load_workbook("data.xlsx", data_only=True).active rows = ws.iter_rows(values_only=True) headers = next(rows) records = [dict(zip(headers, r)) for r in rows] print(records[:3], "…", len(records), "rows")

def column_values(ws, name): headers = [c.value for c in ws[1]] # first row idx = headers.index(name) # 0-based position return [row[idx] for row in ws.iter_rows(min_row=2, values_only=True)] print(column_values(ws, "Amount"))

import csv from datetime import datetime from pathlib import Path from openpyxl import load_workbook def cellval(v): if isinstance(v, datetime): return v.strftime("%Y-%m-%d") return "" if v is None else v def xlsx_to_csv(path: str) -> None: wb = load_workbook(path, data_only=True, read_only=True) stem = Path(path).stem for name in wb.sheetnames: ws = wb[name] out = f"{stem}__{name}.csv" with open(out, "w", newline="", encoding="utf-8") as f: w = csv.writer(f) for row in ws.iter_rows(values_only=True): w.writerow([cellval(v) for v in row]) print("wrote", out) wb.close() xlsx_to_csv("report.xlsx")

Excel Automation: `openpyxl` Read

Learning Goals

Warm-Up · CSV vs. Real Excel

New Concept · Workbook → Sheet → Cell

Opening a workbook

Selecting a sheet

Reading single cells

Iterating rows

Turning a sheet into dicts (like DictReader)

Cell types come through

Worked Example · Summarise a Multi-Sheet Workbook

Read the code

Try It Yourself

Mini-Challenge · The Excel-to-CSV Converter

Recap

Vocabulary Card

Homework

Sample · analyse.py

Excel Automation: `openpyxl` Read

Learning Goals

Warm-Up · CSV vs. Real Excel

New Concept · Workbook → Sheet → Cell

Opening a workbook

Selecting a sheet

Reading single cells

Iterating rows

Turning a sheet into dicts (like DictReader)

Cell types come through

Worked Example · Summarise a Multi-Sheet Workbook

Read the code

Try It Yourself

Mini-Challenge · The Excel-to-CSV Converter

Recap

Vocabulary Card

Homework

Sample · analyse.py