Learning Goals
3 min
By the end of this lesson you can:
- Fetch a page with requests and parse it with BeautifulSoup.
- Choose robust selectors (semantic, attribute-based) over fragile ones (deep CSS chains).
- Scrape politely: check robots.txt, set a User-Agent, and rate-limit.
- Handle missing elements without crashing the run.
Warm-Up · Why Scrapers Rot
5 min
A scraper you wrote last month suddenly returns nothing. Why? The site changed a class name, reordered a div, or wrapped the content in a new container. Fragile selectors break on the tiniest redesign:
    FRAGILE: div.col-3 > div:nth-child(2) > span.text-sm.font-bold
             (breaks if ANY wrapper or order changes)
    ROBUST:  [data-testid="price"] or h1 or .product-title
             (survives layout changes; anchored to meaning)
Scrape by meaning, not by position. Prefer IDs, data-* attributes, semantic tags (h1, article), and stable class names that describe content (.price) over ones that describe layout (.col-3). And always assume the element might be missing: defensive parsing is what makes a scraper survive in the wild.
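To make the contrast concrete, here is a small sketch that runs both selectors against two hand-written HTML snippets. The markup in BEFORE and AFTER is invented for illustration and is not taken from any real site. The sketch uses BeautifulSoup, which the next section introduces.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical markup: the same price before and after a small redesign
# (the column class changed and a wrapper <section> was added).
BEFORE = """
<div class="col-3">
  <div>Sale</div>
  <div><span class="text-sm font-bold" data-testid="price">£9.99</span></div>
</div>
"""
AFTER = """
<div class="col-4">
  <section>
    <div>Sale</div>
    <div><span class="text-sm font-bold" data-testid="price">£9.99</span></div>
  </section>
</div>
"""

FRAGILE = "div.col-3 > div:nth-child(2) > span.text-sm.font-bold"
ROBUST = '[data-testid="price"]'


def matches(html: str, css: str) -> bool:
    """Return True if the selector finds at least one element."""
    return BeautifulSoup(html, "html.parser").select_one(css) is not None


results = {
    "fragile/before": matches(BEFORE, FRAGILE),
    "fragile/after": matches(AFTER, FRAGILE),
    "robust/before": matches(BEFORE, ROBUST),
    "robust/after": matches(AFTER, ROBUST),
}
for label, found in results.items():
    print(f"{label}: {'found' if found else 'MISSING'}")
```

Only the fragile selector goes missing after the redesign, because it depends on the column class and the exact nesting, while the attribute selector depends on neither.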
New Concept · Fetch, Parse, Select Robustly
14 min
Fetch politely
    import requests

    headers = {"User-Agent": "MyResearchBot/1.0 (contact@example.com)"}
    resp = requests.get("https://example.com/products", headers=headers, timeout=10)
    resp.raise_for_status()  # raise on 404/500 etc.
    html = resp.text
- A descriptive User-Agent identifies your bot honestly (some sites block blank/default ones).
- timeout= prevents a hung request from freezing your script forever.
- raise_for_status() turns HTTP errors into exceptions you can catch.
Parse with BeautifulSoup
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")                           # first <h1>, or None
    price = soup.select_one('[data-testid="price"]')  # CSS selector, first match or None
    items = soup.select("article.product")            # all matches → list
- find(tag) / find_all(tag): by tag name and attributes.
- select_one(css) / select(css): by CSS selector.
Robust extraction with guards
    def text_of(node) -> str:
        return node.get_text(strip=True) if node else ""

    title_node = soup.select_one("h1, .product-title, [data-testid='title']")
    title = text_of(title_node)  # "" if none of the selectors matched
Two robustness tricks: a comma-separated selector accepts any of several options (select_one returns the first matching element in document order, not the match for the first selector in the list), and a text_of helper returns "" instead of crashing when the node is None. Never assume find or select_one succeeded.
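Because a comma-separated selector picks whichever matching element appears first in the page, it cannot express "prefer .product-title, fall back to h2". When the order of preference matters, one option is to try the selectors one at a time. The sketch below shows the difference; first_match is a hypothetical helper name and the HTML is invented for the example.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical markup: a promo heading appears BEFORE the real title.
HTML = """
<div>
  <h2 class="promo">Free shipping</h2>
  <p class="product-title">Blue Kettle</p>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")


def text_of(node) -> str:
    return node.get_text(strip=True) if node else ""


def first_match(root, selectors):
    """Try each CSS selector in priority order; return the first hit or None."""
    for css in selectors:
        node = root.select_one(css)
        if node is not None:
            return node
    return None


# Selector list: the <h2> wins because it comes first in the document.
by_list = text_of(soup.select_one(".product-title, h2"))
# Priority order: .product-title is tried first, so it wins.
by_priority = text_of(first_match(soup, [".product-title", "h2"]))
# Nothing matches: the guard still returns "" instead of raising.
nothing = text_of(first_match(soup, [".sku", "[data-sku]"]))

print(by_list)      # Free shipping
print(by_priority)  # Blue Kettle
```

The comma-separated form is still fine when any of the alternatives would be an acceptable answer, which is the case for most of the selectors in this lesson.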
Looping over items defensively
    products = []
    for card in soup.select("article.product, .product-card, [data-product]"):
        name = text_of(card.select_one(".name, h2, [data-name]"))
        price = text_of(card.select_one(".price, [data-testid='price']"))
        if name:  # skip empty/garbage cards
            products.append({"name": name, "price": price})
Check robots.txt & rate-limit
    import time
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    if not rp.can_fetch("MyResearchBot", "https://example.com/products"):
        raise SystemExit("Disallowed by robots.txt; do not scrape this path")

    # between requests, pause so you don't hammer the server:
    time.sleep(1)  # 1 request/second is courteous
Respect robots.txt and the site's Terms of Service. Rate-limit (a sleep between requests) so you don't overload the server. Never scrape personal data or copyrighted content you aren't permitted to use. If the site offers an API (next lessons), use that instead — it's more stable and explicitly allowed. When in doubt, ask permission.
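To see how the robots.txt rules are evaluated without touching the network, the parser can be fed the file's text directly with parse(). The sketch below assumes a made-up robots.txt (SAMPLE_ROBOTS) and two hypothetical helpers, load_rules and polite_delay. It also reads the optional Crawl-delay value, which some sites declare, and uses it when it is longer than the one-second default.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


def polite_delay(rp: RobotFileParser, agent: str, minimum: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if it is
    longer than our own minimum, otherwise the minimum."""
    declared = rp.crawl_delay(agent)  # None when the file declares no delay
    return max(minimum, float(declared or 0))


AGENT = "MyResearchBot"
rules = load_rules(SAMPLE_ROBOTS)
allowed_products = rules.can_fetch(AGENT, "https://example.com/products")
allowed_admin = rules.can_fetch(AGENT, "https://example.com/admin/users")
delay = polite_delay(rules, AGENT)

print(allowed_products, allowed_admin, delay)  # True False 2.0
```

In a real run you would keep rp.set_url(...) and rp.read() as shown earlier, and pass the result of polite_delay to time.sleep between requests.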
Worked Example · A Resilient Product Scraper
12 min
Goal: scrape a product-listing page into a CSV, written so a minor site redesign won't break it, using multi-selector fallbacks, guards, polite delays, and logging.
    import csv
    import logging
    import time

    import requests
    from bs4 import BeautifulSoup

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("scraper")
    HEADERS = {"User-Agent": "PriceWatch/1.0 (you@example.com)"}

    def text_of(node) -> str:
        return node.get_text(strip=True) if node else ""

    def fetch(url: str) -> BeautifulSoup:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")

    def scrape(url: str) -> list[dict]:
        soup = fetch(url)
        cards = soup.select("article.product, .product-card, [data-product]")
        log.info("found %d product cards", len(cards))
        products = []
        for card in cards:
            name = text_of(card.select_one(".name, h2, [data-name]"))
            price = text_of(card.select_one(".price, [data-testid='price']"))
            if not name:
                log.debug("skipping card with no name")
                continue
            products.append({"name": name, "price": price})
        return products

    def run(urls: list[str], out: str) -> None:
        all_products = []
        for url in urls:
            try:
                all_products += scrape(url)
            except requests.RequestException as e:
                log.error("failed on %s: %s", url, e)
            time.sleep(1)  # be polite between pages
        with open(out, "w", newline="", encoding="utf-8") as f:
            w = csv.DictWriter(f, fieldnames=["name", "price"])
            w.writeheader()
            w.writerows(all_products)
        log.info("wrote %d products → %s", len(all_products), out)

    run(["https://example.com/products?page=1",
         "https://example.com/products?page=2"], "products.csv")
Output:
    INFO found 24 product cards
    INFO found 24 product cards
    INFO wrote 47 products → products.csv
Read the code
Every selector has fallbacks (.name, h2, [data-name]), so if the site renames .name to .product-name tomorrow, the h2 or [data-name] alternative likely still catches it. The text_of guard means a missing price yields "" rather than a crash, one failed page is logged and skipped (the others still run), and the sleep(1) keeps it courteous. This is the difference between a demo and a scraper you can actually leave running — and it feeds straight into the CSV tools from Lesson 15.
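One way to check these claims without a live site is to run the same card-parsing loop against a hand-written HTML string. In the sketch below, parse_products is a hypothetical name for the loop from scrape() with the fetching removed, and REDESIGNED is invented markup in which .name has been renamed to .product-name, one card has no price, and one card has no name.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def text_of(node) -> str:
    return node.get_text(strip=True) if node else ""


def parse_products(html: str) -> list[dict]:
    """Same extraction logic as scrape(), but fed an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("article.product, .product-card, [data-product]"):
        name = text_of(card.select_one(".name, h2, [data-name]"))
        price = text_of(card.select_one(".price, [data-testid='price']"))
        if name:  # cards without a name are skipped
            products.append({"name": name, "price": price})
    return products


# Hypothetical markup after a redesign: ".name" became ".product-name".
REDESIGNED = """
<article class="product">
  <h2 class="product-name">Kettle</h2><span class="price">£20</span>
</article>
<article class="product">
  <h2 class="product-name">Toaster</h2>
</article>
<article class="product">
  <span class="price">£5</span>
</article>
"""

products = parse_products(REDESIGNED)
for row in products:
    print(row)
```

The h2 fallback still finds both names, the Toaster row gets an empty price instead of raising, and the nameless third card is dropped. This is consistent with the sample output above, where 48 cards were found but 47 rows were written.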
Try It Yourself
13 min
Practise on sites built for scraping practice (e.g. books.toscrape.com or quotes.toscrape.com); they're explicitly safe and legal to scrape.
1. Fetch a page and print its <title> and first <h1>. Use a User-Agent and a timeout, and call raise_for_status().
2. Scrape all quotes (or book titles) from a practice site into a list, using a multi-selector with a fallback and the text_of guard so a missing field never crashes.
   Hint
       # fetch() and text_of() are the helpers from the worked example
       soup = fetch("https://quotes.toscrape.com")
       quotes = []
       for q in soup.select(".quote"):
           text = text_of(q.select_one(".text, span.text"))
           author = text_of(q.select_one(".author, small"))
           if text:
               quotes.append({"text": text, "author": author})
3. Scrape across multiple pages by finding the "next" link and following it until there isn't one, with a sleep between requests and a page cap to avoid runaway loops.
   Hint
       import time

       url = "https://quotes.toscrape.com/"
       for _ in range(20):  # cap pages
           soup = fetch(url)
           # …extract this page…
           nxt = soup.select_one("li.next a")
           if not nxt:
               break
           url = "https://quotes.toscrape.com" + nxt["href"]
           time.sleep(1)
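Gluing the site root onto the href by hand works only while the "next" link is root-relative (it starts with /). A more general option is urllib.parse.urljoin, which resolves the href against the URL of the page it was found on, the way a browser would. The sketch below is illustrative: next_url is a hypothetical helper, and the href values are examples of the shapes a pager might use.

```python
from urllib.parse import urljoin


def next_url(current_url: str, href: str) -> str:
    """Resolve a link found on current_url into an absolute URL."""
    return urljoin(current_url, href)


# Root-relative href: replaces the whole path.
root_relative = next_url("https://quotes.toscrape.com/page/1/", "/page/2/")

# Href relative to the site index page.
from_index = next_url("https://books.toscrape.com/", "catalogue/page-2.html")

# Href relative to a page that already sits inside a directory.
page_relative = next_url(
    "https://books.toscrape.com/catalogue/page-2.html", "page-3.html"
)

print(root_relative)  # https://quotes.toscrape.com/page/2/
print(from_index)     # https://books.toscrape.com/catalogue/page-2.html
print(page_relative)  # https://books.toscrape.com/catalogue/page-3.html
```

In the pagination loop this would read url = urljoin(url, nxt["href"]), with no site-specific string handling.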
Mini-Challenge · The Change Detector
8 min
Build a tool that scrapes one value (e.g. a price or headline) from a practice page, saves it to a JSON file, and on the next run compares it to the saved value, printing "unchanged" or "CHANGED: old → new". This is the heart of price-watch and uptime-watch bots.
Sample solution
    import json
    from pathlib import Path

    import requests
    from bs4 import BeautifulSoup

    STATE = Path("watch.json")
    HEADERS = {"User-Agent": "Watcher/1.0"}

    def current(url, selector):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()  # don't record an error page as the new value
        soup = BeautifulSoup(resp.text, "html.parser")
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    def check(url, selector):
        now = current(url, selector)
        old = json.loads(STATE.read_text())["value"] if STATE.exists() else None
        if old is None:
            print("first run, baseline:", now)
        elif now != old:
            print(f"CHANGED: {old} → {now}")
        else:
            print("unchanged:", now)
        STATE.write_text(json.dumps({"value": now}))

    check("https://books.toscrape.com", ".price_color")
Non-negotiables: scrape one value, persist it, compare on next run, report change.
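The compare-and-persist step can be exercised without any scraping by separating it from the fetch. The sketch below is one possible variant, not the sample solution itself: describe_change and check_value are hypothetical names, the prices are made-up inputs, and it makes one design choice the sample does not, namely keeping the old baseline when the selector finds nothing, so a temporarily missing element is not recorded as the new value.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def describe_change(old: Optional[str], new: Optional[str]) -> str:
    """Pure comparison: turn (saved value, fresh value) into a message."""
    if new is None:
        return f"no value found; baseline kept: {old}"
    if old is None:
        return f"first run, baseline: {new}"
    if new != old:
        return f"CHANGED: {old} → {new}"
    return f"unchanged: {new}"


def check_value(state: Path, new: Optional[str]) -> str:
    """Load the saved value, compare, and persist the new one if present."""
    old = json.loads(state.read_text())["value"] if state.exists() else None
    message = describe_change(old, new)
    if new is not None:  # assumption: never overwrite with "nothing"
        state.write_text(json.dumps({"value": new}))
    return message


# Simulate four consecutive runs with made-up scraped values.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "watch.json"
    messages = [
        check_value(state_file, value)
        for value in ["£51.77", "£51.77", "£49.99", None]
    ]

for line in messages:
    print(line)
```

Keeping the comparison in its own function means the four cases can be checked with plain values before the tool is pointed at a real page.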
Recap
3 min
Durable scraping = robust selectors + polite, defensive code. Fetch with requests (descriptive User-Agent, timeout, raise_for_status), parse with BeautifulSoup (find/select), and select by meaning (IDs, data-*, semantic tags, comma-separated fallbacks), not by fragile positional CSS chains. Guard every extraction (a node may be None), skip and log failures rather than crashing, and always respect robots.txt, Terms of Service, and rate limits. When an API exists, prefer it (next lessons): it's more stable and explicitly permitted.
Vocabulary Card
- selector: a rule (CSS or tag/attribute) that picks elements out of HTML.
- robust selector: one anchored to meaning (id, data-*, semantic tag) that survives redesigns.
- robots.txt: a file declaring which paths bots may or may not fetch.
- rate limiting: pausing between requests to avoid overloading a server.
Homework
4 min
Using a practice scraping site, build catalogue.py that scrapes a full multi-page listing into a CSV with at least three fields, using robust multi-selectors, guards, a robots.txt check, a per-request sleep, and logging. Add a short comment block at the top stating the site, that it's scraping-permitted, and which selectors you chose and why.
Sample · catalogue.py (core)
    # Site: books.toscrape.com, a sandbox built FOR scraping practice.
    # Selectors: ".product_pod" anchors each book (stable semantic class);
    #   "h3 a[title]" for the title (the title attr is the full name);
    #   ".price_color" for price. All survive minor restyles.
    import csv
    import logging
    import time
    from urllib.robotparser import RobotFileParser

    import requests
    from bs4 import BeautifulSoup

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("cat")
    HEADERS = {"User-Agent": "BookCat/1.0 (you@example.com)"}
    BASE = "https://books.toscrape.com/"

    rp = RobotFileParser()
    rp.set_url(BASE + "robots.txt")
    rp.read()

    def text_of(n):
        return n.get_text(strip=True) if n else ""

    books, url = [], BASE
    for _ in range(50):  # page cap
        if not rp.can_fetch("BookCat", url):
            log.error("blocked by robots.txt: %s", url)
            break
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for pod in soup.select(".product_pod"):
            # a[title] guarantees the attribute exists before we index it
            title = pod.select_one("h3 a[title]")
            books.append({
                "title": title["title"] if title else "",
                "price": text_of(pod.select_one(".price_color")),
                "stock": text_of(pod.select_one(".availability")),
            })
        nxt = soup.select_one("li.next a")
        if not nxt:
            break
        url = BASE + "catalogue/" + nxt["href"].replace("catalogue/", "")
        time.sleep(1)

    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["title", "price", "stock"])
        w.writeheader()
        w.writerows(books)
    log.info("scraped %d books", len(books))
Non-negotiables: robots check, robust selectors, guards, sleep, logging, and a justification comment.