Learning Goals
3 min
- Install and use beautifulsoup4 (plus the lxml parser).
- Pick elements with soup.find / soup.find_all.
- Use CSS selectors with soup.select / soup.select_one.
- Read text, attributes, and children of each result.
Warm-Up · Install & First Soup
5 min
    pip install beautifulsoup4 lxml requests

    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1 class="hero">Hello world</h1>
      <p>Paragraph one.</p>
      <p class="lead">Important paragraph.</p>
    </body></html>
    """

    soup = BeautifulSoup(html, "lxml")
    print(soup.h1.text)                        # → Hello world
    print(soup.find("p").text)                 # → Paragraph one.
    print(soup.find("p", class_="lead").text)  # → Important paragraph.
Pick the simplest method that gets the job done: shortcuts like soup.h1 for one-of-a-kind tags; find/find_all for tag + attribute filters; select when you need full CSS power.
New Concept · find, find_all, select
14 min
The page we'll use
    html = """
    <ul class="book-list">
      <li class="book">
        <a href="/b/1">Refactoring</a>
        <span class="price">RM 89</span>
      </li>
      <li class="book featured">
        <a href="/b/2">The Pragmatic Programmer</a>
        <span class="price">RM 95</span>
      </li>
      <li class="book">
        <a href="/b/3">Clean Code</a>
        <span class="price">RM 79</span>
      </li>
    </ul>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
find — first match
    first = soup.find("li", class_="book")
    print(first.a.text)                             # → Refactoring
    print(first.a["href"])                          # → /b/1
    print(first.find("span", class_="price").text)  # → RM 89
Two patterns to notice:
- tag.text: the text inside the element.
- tag["attr"]: reads an attribute. tag.get("attr") returns None if it's missing.
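The difference between the two attribute lookups only shows up when an attribute is absent. Here is a minimal sketch, assuming a made-up two-link snippet (not part of the lesson's page) and the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link has no title attribute.
snippet = '<a href="/b/1" title="Refactoring">one</a><a href="/b/2">two</a>'
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("a")

print(first["title"])            # attribute present: prints its value
print(second.get("title"))       # attribute missing: .get() returns None
print(second.get("title", "?"))  # .get() also accepts a fallback value

try:
    second["title"]              # square brackets raise KeyError when missing
except KeyError:
    print("no title attribute")
```

Square brackets suit attributes you expect on every element; .get() is the safer choice when some elements may lack the attribute.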
find_all — every match
    for book in soup.find_all("li", class_="book"):
        title = book.a.text
        price = book.find("span", class_="price").text
        print(f" {title:<30} {price}")

     Refactoring                    RM 89
     The Pragmatic Programmer       RM 95
     Clean Code                     RM 79
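The featured book shows up in that output even though its class attribute is "book featured". That is because class_ matches any single class on an element. The sketch below uses a trimmed copy of the book list and the built-in html.parser (an assumption made so it runs without lxml) to compare three filters:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the book list, prices omitted for brevity.
html = """
<ul class="book-list">
  <li class="book"><a href="/b/1">Refactoring</a></li>
  <li class="book featured"><a href="/b/2">The Pragmatic Programmer</a></li>
  <li class="book"><a href="/b/3">Clean Code</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_books = soup.find_all("li", class_="book")     # any <li> that has class "book"
featured = soup.find_all("li", class_="featured")  # any <li> that has class "featured"
both = soup.select("li.book.featured")             # CSS: both classes required

print(len(all_books), len(featured), len(both))
print(featured[0].a.text)
```

When an element must carry several classes at once, the CSS form li.book.featured tends to read more clearly than chaining find_all filters.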
select / select_one — full CSS
    # The first book that's also featured
    print(soup.select_one("li.book.featured a").text)  # → The Pragmatic Programmer

    # Every price span
    for p in soup.select(".book .price"):
        print(p.text)

    # Direct children only: the > combinator
    for li in soup.select("ul.book-list > li"):
        print(li.a.text)
get_text(strip=True) for tidy text
Real-world HTML has lots of whitespace. get_text(strip=True) strips each piece of text inside the element; get_text(" ", strip=True) also joins those pieces with a single space.
    soup.find("li", class_="book").get_text(" ", strip=True)  # → Refactoring RM 89
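To see why the separator argument matters, here is a small sketch on a made-up messy paragraph (hypothetical markup, parsed with the built-in html.parser so lxml is not required):

```python
from bs4 import BeautifulSoup

# Hypothetical messy markup with stray spaces and a nested tag.
messy = "<p>  Hello\n   <b>big</b>   world  </p>"
p = BeautifulSoup(messy, "html.parser").p

raw = p.get_text()                    # whitespace kept as it appears in the source
tight = p.get_text(strip=True)        # pieces stripped, then joined with ""
spaced = p.get_text(" ", strip=True)  # pieces stripped, then joined with " "

print(repr(raw))
print(repr(tight))   # 'Hellobigworld'
print(repr(spaced))  # 'Hello big world'
```

Without a separator the stripped pieces run together, so the " " form is usually the one you want for text that spans nested tags.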
Worked Example · Scrape books.toscrape.com
12 min
    # scrape_books.py: title + price + rating from the first page
    import requests
    from bs4 import BeautifulSoup

    URL = "https://books.toscrape.com/"
    r = requests.get(URL, timeout=10,
                     headers={"User-Agent": "advaslearning-py-l4 (demo)"})
    r.raise_for_status()

    soup = BeautifulSoup(r.text, "lxml")

    books = []
    for card in soup.select("article.product_pod"):
        title = card.h3.a["title"]  # full title is the attribute
        price = card.select_one(".price_color").get_text(strip=True)
        rating = card.select_one("p.star-rating")["class"][1]  # ["star-rating", "Three"]
        books.append({"title": title, "price": price, "rating": rating})

    # First 5 books in page order, just to show the shape of the data
    for b in books[:5]:
        print(f" {b['rating']:<6} {b['price']:<8} {b['title']}")

    print(f"\nfound {len(books)} books on the page")
Sample output
     Three  £51.77   A Light in the Attic
     One    £53.74   Tipping the Velvet
     One    £50.10   Soumission
     Four   £47.82   Sharp Objects
     Five   £54.23   Sapiens: A Brief History of Humankind

    found 20 books on the page
Read the code
Three selectors capture everything:
- article.product_pod: the repeating book block.
- .price_color: the price text.
- p.star-rating: the rating is stored in the CSS class, not the text.
That last one is typical of real-world pages: the data you want is often in an attribute rather than in the text. Inspect carefully.
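Once the rating word is out of the class list, turning it into a number is a dictionary lookup. The sketch below is one possible approach, not the site's own convention: the card markup is a hypothetical stand-in that mimics the structure described above, WORD_TO_STARS is an assumed helper table, and the built-in html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical card that mimics the structure described above.
card_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
card = BeautifulSoup(card_html, "html.parser").select_one("article.product_pod")

# Assumed lookup table for ratings spelled out as words.
WORD_TO_STARS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

classes = card.select_one("p.star-rating")["class"]  # a list of class names
# Pick whichever class is a rating word instead of relying on position [1].
stars = next((WORD_TO_STARS[c] for c in classes if c in WORD_TO_STARS), None)

print(card.h3.a["title"], stars)
```

Searching the class list for a known word is slightly more defensive than indexing with [1], which would break if the class order ever changed.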
Try It Yourself
13 min
From any page you fetch, print every h1, h2, and h3, prefixed with the level number.
Hint
    for tag in soup.find_all(["h1", "h2", "h3"]):
        print(f" H{tag.name[1]} {tag.get_text(strip=True)}")
Pass a list to find_all to match any of those tag names.
Print href + visible text for every <a> on the page. Skip empty texts.
Hint
    for a in soup.find_all("a"):
        href = a.get("href")
        txt = a.get_text(strip=True)
        if href and txt:
            print(f" {txt:<40} → {href}")
Scrape books.toscrape.com, parse the price to a float (strip the £), and print only books priced between £20 and £30.
Hint
    for card in soup.select("article.product_pod"):
        raw = card.select_one(".price_color").get_text(strip=True)
        price = float(raw.replace("£", ""))
        if 20 <= price <= 30:
            print(f" £{price:.2f} {card.h3.a['title']}")
Mini-Challenge · All 50 Pages
8 min
books.toscrape.com has 50 pages. Loop over them (https://books.toscrape.com/catalogue/page-{n}.html) and accumulate every book's title and price into a single CSV. Sleep 0.3s between pages. Print "saved 1000 books" when done.
Show one possible solution
    # books_all.py: scrape all 50 pages → books.csv
    import csv, time, requests
    from bs4 import BeautifulSoup

    BASE = "https://books.toscrape.com/catalogue/page-{}.html"
    HEADERS = {"User-Agent": "advaslearning-py-l4 (educational)"}

    rows = []
    for n in range(1, 51):
        r = requests.get(BASE.format(n), headers=HEADERS, timeout=10)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "lxml")
        for card in soup.select("article.product_pod"):
            rows.append({
                "title": card.h3.a["title"],
                "price": card.select_one(".price_color").get_text(strip=True),
                "rating": card.select_one("p.star-rating")["class"][1],
            })
        time.sleep(0.3)

    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
        w.writeheader()
        w.writerows(rows)

    print(f"saved {len(rows)} books")
Non-negotiables: User-Agent, sleep between pages, write a clean CSV.
Recap
3 min
Three methods cover most scraping tasks: find for one element, find_all for many, select when you want a full CSS selector. Data lives in .text, in tag["attr"], or sometimes in a class name. Inspect, isolate the selector, write it once. Tomorrow we ship a real project: a price tracker.
Vocabulary Card
- soup
- The BeautifulSoup object — the root of the parsed tree.
- find / find_all
- Tag-name search, optionally with attribute filters.
- select / select_one
- CSS-selector search. Use when you need >, .class.class, nth-of-type, etc.
- get_text(strip=True)
- Text content with whitespace cleaned up.
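As a quick illustration of the selectors named in the card, here is a sketch on a trimmed copy of the lesson's book list. The :not() and attribute-suffix selectors are extras covered by the "etc.", and the built-in html.parser is an assumption made so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the book list, prices omitted for brevity.
html = """
<ul class="book-list">
  <li class="book"><a href="/b/1">Refactoring</a></li>
  <li class="book featured"><a href="/b/2">The Pragmatic Programmer</a></li>
  <li class="book"><a href="/b/3">Clean Code</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Second <li> among its siblings, then the link inside it.
second = soup.select_one("li:nth-of-type(2) a").text
# Books that do NOT carry the featured class.
plain = [li.a.text for li in soup.select("li.book:not(.featured)")]
# Link whose href ends in /3.
by_href = soup.select_one('a[href$="/3"]').text

print(second)
print(plain)
print(by_href)
```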
Homework
4 min
Scrape any non-login page you find interesting (a Wikipedia article's tables, IMDb's top 250, your school's news page). Extract a small dataset (≥ 10 rows, ≥ 3 columns) and save it as JSON. Include a docstring at the top explaining where the data came from, when you scraped it, and the selector you used. (Be polite: check robots.txt, send a User-Agent, and sleep between requests.)
Sample skeleton
    """scrape_news.py
    Source  : https://example.com/news
    Date    : 2026-05-28
    Selector: article.story (title, date, link)
    """
    import json, requests
    from bs4 import BeautifulSoup

    r = requests.get(
        "https://example.com/news",
        timeout=10,
        headers={"User-Agent": "advaslearning-py-l4 (homework)"},
    )
    r.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(r.text, "lxml")

    data = []
    for a in soup.select("article.story")[:10]:
        data.append({
            "title": a.select_one("h2").get_text(strip=True),
            "date": a.select_one("time").get_text(strip=True),
            "link": a.select_one("a")["href"],
        })

    with open("news.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"saved {len(data)} stories")
Non-negotiables: docstring with provenance, polite headers, real selector, JSON saved.
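The robots.txt part of being polite can be checked from Python's standard library. The sketch below feeds a made-up robots.txt (hypothetical rules, not those of any real site) to urllib.robotparser so it runs offline. Against a live site you would instead call set_url() with the site's robots.txt address and then read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(robots_lines)

AGENT = "advaslearning-py-l4 (homework)"
ok_news = rp.can_fetch(AGENT, "https://example.com/news")
ok_private = rp.can_fetch(AGENT, "https://example.com/private/report")

print(ok_news, ok_private)  # → True False
```

If can_fetch returns False for the page you had in mind, pick a different page for the homework.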