PY-L7-23 · Web Scraping for Automation: Robust Selectors

Learning Goals

3 min

By the end of this lesson you can:

Fetch a page with requests and parse it with BeautifulSoup.
Choose robust selectors (semantic, attribute-based) over fragile ones (deep CSS chains).
Scrape politely: check robots.txt, set a User-Agent, and rate-limit.
Handle missing elements without crashing the run.

Warm-Up · Why Scrapers Rot

5 min

A scraper you wrote last month suddenly returns nothing. Why? The site changed a class name, reordered a div, or wrapped the content in a new container. Fragile selectors break on the tiniest redesign:

FRAGILE:  div.col-3 > div:nth-child(2) > span.text-sm.font-bold
          (breaks if ANY wrapper or order changes)

ROBUST:   [data-testid="price"]   or   h1   or   .product-title
          (survives layout changes; anchored to meaning)

Today's big idea

Scrape by meaning, not by position. Prefer IDs, data-* attributes, semantic tags (h1, article), and stable class names that describe content (.price) over layout (.col-3). And always assume the element might be missing — defensive parsing is what makes a scraper survive in the wild.
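To make the contrast concrete, here is a minimal sketch that runs the fragile and the robust selector from above against two versions of the same snippet. Both HTML strings are invented for illustration and only stand in for a "before" and "after" redesign.

```python
from bs4 import BeautifulSoup

# Hypothetical markup before and after a redesign (invented for this sketch).
BEFORE = """
<div class="col-3">
  <div>Widget</div>
  <div><span class="text-sm font-bold" data-testid="price">$9.99</span></div>
</div>
"""
AFTER = """
<div class="col-4">
  <section>
    <div>Widget</div>
    <div><span class="text-lg" data-testid="price">$9.99</span></div>
  </section>
</div>
"""

FRAGILE = "div.col-3 > div:nth-child(2) > span.text-sm.font-bold"
ROBUST = '[data-testid="price"]'

def grab(html: str, selector: str) -> str:
    """Return the text of the first match, or "" when nothing matches."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else ""

results = {
    "fragile_before": grab(BEFORE, FRAGILE),
    "fragile_after": grab(AFTER, FRAGILE),
    "robust_before": grab(BEFORE, ROBUST),
    "robust_after": grab(AFTER, ROBUST),
}
print(results)
```

The positional chain finds the price only in the first version; the attribute selector finds it in both, because the redesign changed layout classes and nesting but not the data-testid.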

New Concept · Fetch, Parse, Select Robustly

14 min

Fetch politely

import requests

headers = {"User-Agent": "MyResearchBot/1.0 (contact@example.com)"}
resp = requests.get("https://example.com/products", headers=headers, timeout=10)
resp.raise_for_status()       # raise on 404/500 etc.
html = resp.text

A descriptive User-Agent identifies your bot honestly (some sites block blank/default ones).
timeout= prevents a hung request from freezing your script forever.
raise_for_status() turns HTTP errors into exceptions you can catch.
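A short sketch of what catching those exceptions can look like. To keep it runnable offline, it builds requests.Response objects by hand instead of fetching anything; describe_failure is a hypothetical helper name, not part of requests.

```python
import requests

def describe_failure(resp: requests.Response) -> str:
    """Return "ok" or a short description of the HTTP error."""
    try:
        resp.raise_for_status()
    except requests.HTTPError as e:
        return f"HTTP error: {e.response.status_code}"
    return "ok"

# Hand-built responses stand in for real ones so the sketch runs offline.
missing = requests.Response()
missing.status_code = 404
fine = requests.Response()
fine.status_code = 200

print(describe_failure(missing))
print(describe_failure(fine))

# Timeouts and connection failures share a base class with HTTPError,
# so one `except requests.RequestException` covers all three.
print(issubclass(requests.Timeout, requests.RequestException))
print(issubclass(requests.ConnectionError, requests.RequestException))
```

In a real run you would wrap the requests.get call itself in the try block, since a timeout is raised there rather than by raise_for_status().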

Parse with BeautifulSoup

from bs4 import BeautifulSoup        # pip install beautifulsoup4

soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1")                       # first <h1>
price = soup.select_one('[data-testid="price"]')   # CSS selector, first match
items = soup.select("article.product")        # all matches → list

find(tag) / find_all(tag) — by tag name and attributes.
select_one(css) / select(css) — by CSS selector.
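The two families often express the same query. The sketch below, run against an invented three-card snippet, shows equivalent calls side by side and what each returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration.
html = """
<article class="product" data-sku="A1"><h2>Lamp</h2></article>
<article class="product" data-sku="B2"><h2>Desk</h2></article>
<article class="advert"><h2>Sale!</h2></article>
"""
soup = BeautifulSoup(html, "html.parser")

# Same query, two spellings: tag name + class filter vs. CSS selector.
by_find = [a["data-sku"] for a in soup.find_all("article", class_="product")]
by_css = [a["data-sku"] for a in soup.select("article.product")]

# Attribute lookups: attrs={...} for find, [attr="value"] for CSS.
one_find = soup.find("article", attrs={"data-sku": "B2"})
one_css = soup.select_one('article[data-sku="B2"]')

# No match: the single-result calls give None, the list call gives an empty list.
no_find = soup.find("table")
no_css = soup.select_one("table")
no_list = soup.select("table")

print(by_find, by_css)
print(one_find.h2.get_text(), one_css.h2.get_text())
print(no_find, no_css, no_list)
```

The None results in the last group are the reason every extraction in this lesson goes through a guard.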

Robust extraction with guards

def text_of(node) -> str:
    return node.get_text(strip=True) if node else ""

title_node = soup.select_one("h1, .product-title, [data-testid='title']")
title = text_of(title_node)        # "" if none of the selectors matched

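Two robustness tricks: a comma-separated selector accepts any of several options, and a text_of helper returns "" instead of crashing when the node is None. Note that select_one returns the first matching element in document order, not the first alternative in your list, so keep each alternative specific enough that it cannot match an unrelated element earlier on the page. Never assume find succeeded.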
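One detail worth checking when you combine selectors with commas: select_one returns whichever matching element comes first in the document, regardless of the order the alternatives are written in. If you need a strict order of preference, a small loop does it. The sketch below uses invented markup, and first_match is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup: the site-wide <h1> comes before the product title.
html = """
<header><h1>MegaShop</h1></header>
<main><p class="product-title">Blue Kettle</p></main>
"""
soup = BeautifulSoup(html, "html.parser")

def first_match(node, selectors):
    """Try selectors in priority order; return the first hit or None."""
    for css in selectors:
        hit = node.select_one(css)
        if hit is not None:
            return hit
    return None

# Comma list: document order decides, so the earlier <h1> is returned
# even though .product-title is written first.
in_document_order = soup.select_one(".product-title, h1")

# Explicit loop: the order of the list decides.
in_priority_order = first_match(soup, [".product-title", "h1"])

print(in_document_order.get_text(strip=True))
print(in_priority_order.get_text(strip=True))
```

Inside a single product card the two approaches usually agree, because only one element matches; the difference matters mostly for page-level lookups like a title.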

Looping over items defensively

products = []
for card in soup.select("article.product, .product-card, [data-product]"):
    name  = text_of(card.select_one(".name, h2, [data-name]"))
    price = text_of(card.select_one(".price, [data-testid='price']"))
    if name:                       # skip empty/garbage cards
        products.append({"name": name, "price": price})
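To watch the loop cope with inconsistent cards, here is the same logic run against a small invented fixture: one complete card, one with no price, and one with no name. The skipped counter is an addition for this sketch.

```python
from bs4 import BeautifulSoup

def text_of(node) -> str:
    return node.get_text(strip=True) if node else ""

# Invented fixture: three card styles, the last one has no name at all.
html = """
<article class="product"><span class="name">Lamp</span><span class="price">$20</span></article>
<div class="product-card"><h2>Desk</h2></div>
<div data-product="3"><span class="price">$5</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

products, skipped = [], 0
for card in soup.select("article.product, .product-card, [data-product]"):
    name = text_of(card.select_one(".name, h2, [data-name]"))
    price = text_of(card.select_one(".price, [data-testid='price']"))
    if not name:            # nothing usable to identify the product
        skipped += 1
        continue
    products.append({"name": name, "price": price})

print(products)
print("skipped:", skipped)
```

The card without a price is kept with an empty string, while the card without a name is dropped and counted, so nothing raises.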

Check robots.txt & rate-limit

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if not rp.can_fetch("MyResearchBot", "https://example.com/products"):
    raise SystemExit("Disallowed by robots.txt — do not scrape this path")

# between requests, pause so you don't hammer the server:
time.sleep(1)                      # 1 request/second is courteous
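You can try the robots.txt logic without touching a server, because RobotFileParser also accepts the file's lines directly through parse(). The rules below are invented for illustration; the sketch also reads an optional Crawl-delay and falls back to one second when none is declared.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

AGENT = "MyResearchBot"
allowed = rp.can_fetch(AGENT, "https://example.com/products")
blocked = rp.can_fetch(AGENT, "https://example.com/admin/users")

# crawl_delay() returns None when the file declares no delay for this agent.
delay = rp.crawl_delay(AGENT) or 1

print(allowed, blocked, delay)
```

Passing delay to time.sleep between requests lets the site's own preference override your default pause.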

⚠️ Scrape ethically and legally

Respect robots.txt and the site's Terms of Service. Rate-limit (a sleep between requests) so you don't overload the server. Never scrape personal data or copyrighted content you aren't permitted to use. If the site offers an API (next lessons), use that instead — it's more stable and explicitly allowed. When in doubt, ask permission.

Worked Example · A Resilient Product Scraper

12 min

Goal: scrape a product-listing page into a CSV, written so a minor site redesign won't break it — multi-selector fallbacks, guards, polite delays, and logging.

import csv, time, logging
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("scraper")

HEADERS = {"User-Agent": "PriceWatch/1.0 (you@example.com)"}

def text_of(node) -> str:
    return node.get_text(strip=True) if node else ""

def fetch(url: str) -> BeautifulSoup:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

def scrape(url: str) -> list[dict]:
    soup = fetch(url)
    cards = soup.select("article.product, .product-card, [data-product]")
    log.info("found %d product cards", len(cards))

    products = []
    for card in cards:
        name  = text_of(card.select_one(".name, h2, [data-name]"))
        price = text_of(card.select_one(".price, [data-testid='price']"))
        if not name:
            log.debug("skipping card with no name")
            continue
        products.append({"name": name, "price": price})
    return products

def run(urls: list[str], out: str) -> None:
    all_products = []
    for url in urls:
        try:
            all_products += scrape(url)
        except requests.RequestException as e:
            log.error("failed on %s: %s", url, e)
        time.sleep(1)                       # be polite between pages

    with open(out, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["name", "price"])
        w.writeheader(); w.writerows(all_products)
    log.info("wrote %d products → %s", len(all_products), out)

run(["https://example.com/products?page=1",
     "https://example.com/products?page=2"], "products.csv")

INFO found 24 product cards
INFO found 24 product cards
INFO wrote 47 products → products.csv

Read the code

Every selector has fallbacks (.name, h2, [data-name]), so if the site renames .name to .product-name tomorrow, the h2 or [data-name] alternative likely still catches it. The text_of guard means a missing price yields "" rather than a crash. A card with no name is skipped, which is why 48 cards can yield 47 rows in the sample output; the skip is logged at DEBUG level, so it does not appear when the level is INFO. One failed page is logged and skipped while the others still run, and the sleep(1) keeps it courteous. This is the difference between a demo and a scraper you can actually leave running, and it feeds straight into the CSV tools from Lesson 15.
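The "one failed page does not stop the run" behaviour can be checked offline by swapping in a stand-in for scrape. In this sketch run_all mirrors the try/except from run, and fake_scrape is a hypothetical function that simulates an outage on page 2; neither name comes from the worked example.

```python
import logging
import requests

log = logging.getLogger("scraper-demo")

def run_all(urls, scrape):
    """Collect rows from every URL, logging and skipping the ones that fail."""
    rows, failed = [], []
    for url in urls:
        try:
            rows += scrape(url)
        except requests.RequestException as e:
            log.error("failed on %s: %s", url, e)
            failed.append(url)
    return rows, failed

def fake_scrape(url):
    # Hypothetical stand-in for scrape(): page 2 simulates a network failure.
    if url.endswith("page=2"):
        raise requests.ConnectionError("simulated outage")
    return [{"name": f"item from {url}", "price": ""}]

urls = [f"https://example.com/products?page={n}" for n in (1, 2, 3)]
rows, failed = run_all(urls, fake_scrape)

print(len(rows), "rows;", "failed:", failed)
```

Returning the failed URLs alongside the rows is one possible design; it makes it easy to retry just those pages later.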

Try It Yourself

13 min

Practise on sites built for scraping practice (e.g. books.toscrape.com or quotes.toscrape.com) — they're explicitly safe and legal to scrape.

01 🟢 Grab the title

Fetch a page and print its <title> and first <h1>. Use a User-Agent and a timeout, and call raise_for_status().

02 🟡 List with fallbacks

Scrape all quotes (or book titles) from a practice site into a list, using a multi-selector with a fallback and the text_of guard so a missing field never crashes.

Hint

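soup = fetch("https://quotes.toscrape.com")   # fetch() and text_of() come from the worked example
quotes = []
for q in soup.select(".quote"):
    # each fallback targets a different attribute, so it still helps if a class is renamed
    text = text_of(q.select_one(".text, [itemprop='text']"))
    author = text_of(q.select_one(".author, [itemprop='author']"))
    if text:
        quotes.append({"text": text, "author": author})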

03 🔴 Follow pagination

Scrape across multiple pages by finding the "next" link and following it until there isn't one — with a sleep between requests and a page cap to avoid runaway loops.

Hint

import time
from urllib.parse import urljoin

url = "https://quotes.toscrape.com/"
for _ in range(20):                 # cap pages
    soup = fetch(url)               # fetch() from the worked example
    # …extract this page…
    nxt = soup.select_one("li.next a[href]")
    if not nxt:
        break
    url = urljoin(url, nxt["href"])  # resolves relative and absolute hrefs alike
    time.sleep(1)
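"Next" links come in several forms: root-relative, relative to the current page, or fully absolute. A minimal sketch of how the standard library's urljoin resolves each form against the URL of the page the link was found on; the href values are examples of each form, not scraped data.

```python
from urllib.parse import urljoin

# (URL of the current page, href found in its "next" link)
cases = [
    ("https://quotes.toscrape.com/page/2/", "/page/3/"),                  # root-relative
    ("https://books.toscrape.com/", "catalogue/page-2.html"),             # relative to the site root page
    ("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"),  # relative to the current directory
    ("https://example.com/list", "https://example.com/list?page=2"),      # already absolute
]

resolved = [urljoin(current, href) for current, href in cases]
for url in resolved:
    print(url)
```

Joining against the current page URL rather than a fixed base is what makes the third case come out right.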

Mini-Challenge · The Change Detector

8 min

Build a tool that scrapes one value (e.g. a price or headline) from a practice page, saves it to a JSON file, and on the next run compares to the saved value — printing "unchanged" or "CHANGED: old → new." This is the heart of price-watch and uptime-watch bots.

Sample solution

import json, requests
from pathlib import Path
from bs4 import BeautifulSoup

STATE = Path("watch.json")
HEADERS = {"User-Agent": "Watcher/1.0"}

def current(url, selector):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()              # don't record an error page as the value
    soup = BeautifulSoup(resp.text, "html.parser")
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None

def check(url, selector):
    now = current(url, selector)
    if now is None:                      # selector matched nothing: keep the old baseline
        print("selector matched nothing; saved value left untouched")
        return
    old = json.loads(STATE.read_text())["value"] if STATE.exists() else None
    if old is None:
        print("first run, baseline:", now)
    elif now != old:
        print(f"CHANGED: {old} → {now}")
    else:
        print("unchanged:", now)
    STATE.write_text(json.dumps({"value": now}))

check("https://books.toscrape.com", ".price_color")

Non-negotiables: scrape one value, persist it, compare on next run, report change.

Recap

3 min

Durable scraping = robust selectors + polite, defensive code. Fetch with requests (descriptive User-Agent, timeout, raise_for_status), parse with BeautifulSoup (find/select), and select by meaning — IDs, data-*, semantic tags, comma-separated fallbacks — not fragile positional CSS chains. Guard every extraction (a node may be None), skip and log failures rather than crashing, and always respect robots.txt, Terms of Service, and rate limits. When an API exists, prefer it (next lessons) — it's more stable and explicitly permitted.

Vocabulary Card

selector: A rule (CSS or tag/attribute) that picks elements out of HTML.
robust selector: One anchored to meaning (id, data-*, semantic tag) that survives redesigns.
robots.txt: A file declaring which paths bots may or may not fetch.
rate limiting: Pausing between requests to avoid overloading a server.

Homework

4 min

Using a practice scraping site, build catalogue.py that scrapes a full multi-page listing into a CSV with at least three fields, using robust multi-selectors, guards, a robots.txt check, a per-request sleep, and logging. Add a short comment block at the top stating the site, that it's scraping-permitted, and which selectors you chose and why.

Sample · catalogue.py (core)

# Site: books.toscrape.com, a sandbox built FOR scraping practice.
# Selectors: ".product_pod" anchors each book (stable semantic class);
#   "h3 a[title]" for the title (the title attr is the full name);
#   ".price_color" for price. All survive minor restyles.
import csv, time, logging, requests
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("cat")
HEADERS = {"User-Agent": "BookCat/1.0 (you@example.com)"}
BASE = "https://books.toscrape.com/"

rp = RobotFileParser(); rp.set_url(BASE + "robots.txt"); rp.read()

def text_of(n): return n.get_text(strip=True) if n else ""

books, url = [], BASE
for _ in range(50):                      # page cap against runaway loops
    if not rp.can_fetch("BookCat", url):
        log.error("blocked by robots.txt: %s", url); break
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as e:
        log.error("failed on %s: %s", url, e); break
    soup = BeautifulSoup(resp.text, "html.parser")
    for pod in soup.select(".product_pod"):
        link = pod.select_one("h3 a[title]")   # only match links that carry a title attr
        books.append({
            "title": link["title"] if link else "",
            "price": text_of(pod.select_one(".price_color")),
            "stock": text_of(pod.select_one(".availability")),
        })
    nxt = soup.select_one("li.next a[href]")
    if not nxt:
        break
    url = urljoin(url, nxt["href"])      # href is relative to the current page
    time.sleep(1)

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["title", "price", "stock"])
    w.writeheader(); w.writerows(books)
log.info("scraped %d books", len(books))

Non-negotiables: robots check, robust selectors, guards, sleep, logging, and a justification comment.
