PY-L4-15 · BeautifulSoup — Selecting Elements

Learning Goals

3 min

Install and use beautifulsoup4 (plus the lxml parser).
Pick elements with soup.find / soup.find_all.
Use CSS selectors with soup.select / soup.select_one.
Read text, attributes, and children of each result.

Warm-Up · Install & First Soup

5 min

pip install beautifulsoup4 lxml requests

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="hero">Hello world</h1>
  <p>Paragraph one.</p>
  <p class="lead">Important paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.h1.text)               # → Hello world
print(soup.find("p").text)        # → Paragraph one.
print(soup.find("p", class_="lead").text)  # → Important paragraph.

Today's big idea

Pick the simplest method that gets the job done: shortcuts like soup.h1 for one-of-a-kind tags; find/find_all for tag + attribute filters; select when you need full CSS power.

New Concept · find, find_all, select

14 min

The page we'll use

html = """
<ul class="book-list">
  <li class="book">
    <a href="/b/1">Refactoring</a>
    <span class="price">RM 89</span>
  </li>
  <li class="book featured">
    <a href="/b/2">The Pragmatic Programmer</a>
    <span class="price">RM 95</span>
  </li>
  <li class="book">
    <a href="/b/3">Clean Code</a>
    <span class="price">RM 79</span>
  </li>
</ul>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")

find — first match

first = soup.find("li", class_="book")
print(first.a.text)        # → Refactoring
print(first.a["href"])     # → /b/1
print(first.find("span", class_="price").text)  # → RM 89

Two patterns to notice:

tag.text — the text inside the element.
tag["attr"] — reads an attribute and raises KeyError if it's missing. tag.get("attr") returns None instead.
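
The difference between the two attribute lookups only shows up when an attribute is absent. A minimal sketch, using a made-up two-link snippet (not part of the lesson page) and Python's built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href attribute.
snippet = '<a href="/b/1">Refactoring</a><a>No link here</a>'
soup = BeautifulSoup(snippet, "html.parser")

with_href, without_href = soup.find_all("a")

print(with_href["href"])         # → /b/1
print(without_href.get("href"))  # → None

try:
    without_href["href"]         # subscript access on a missing attribute
except KeyError:
    print("KeyError: no href on this tag")

# .get also accepts a fallback value, like dict.get
fallback = without_href.get("href", "#")
print(fallback)                  # → #
```

Reach for tag.get when an attribute is optional, and for tag["attr"] when its absence should stop the script.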

find_all — every match

for book in soup.find_all("li", class_="book"):
    title = book.a.text
    price = book.find("span", class_="price").text
    print(f"  {title:<30} {price}")

  Refactoring                    RM 89
  The Pragmatic Programmer       RM 95
  Clean Code                     RM 79

select / select_one — full CSS

# First book that's also featured (select_one returns a single match)
print(soup.select_one("li.book.featured a").text)
# → The Pragmatic Programmer

# Every price span
for p in soup.select(".book .price"):
    print(p.text)

# Direct children only — > selector
for li in soup.select("ul.book-list > li"):
    print(li.a.text)

get_text(strip=True) for tidy text

Real-world HTML has lots of whitespace. get_text(strip=True) strips each piece of text and drops the empty ones; get_text(" ", strip=True) also joins the pieces with a single space.

soup.find("li").get_text(" ", strip=True)   # → Refactoring RM 89
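
To see why the separator matters, compare the three ways of reading text from one list item. This is a sketch on a single <li> copied from the book list, parsed with the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Same shape as one <li> from the lesson's book list.
snippet = """
<li class="book">
  <a href="/b/1">Refactoring</a>
  <span class="price">RM 89</span>
</li>
"""
li = BeautifulSoup(snippet, "html.parser").li

raw = li.text                          # untouched text, whitespace included
glued = li.get_text(strip=True)        # pieces stripped, then concatenated
spaced = li.get_text(" ", strip=True)  # pieces stripped, joined by a space

print(repr(raw))  # newlines and indentation are kept
print(glued)      # → RefactoringRM 89
print(spaced)     # → Refactoring RM 89
```

Without a separator, the title and price run together, so pass " " whenever an element has more than one child.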

Worked Example · Scrape books.toscrape.com

12 min

# scrape_books.py — title + price + rating from the first page
import requests
from bs4 import BeautifulSoup

URL = "https://books.toscrape.com/"

r = requests.get(URL, timeout=10,
                 headers={"User-Agent": "advaslearning-py-l4 (demo)"})
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")

books = []
for card in soup.select("article.product_pod"):
    title  = card.h3.a["title"]                            # full title is the attribute
    price  = card.select_one(".price_color").get_text(strip=True)
    rating = card.select_one("p.star-rating")["class"][1]  # ["star-rating", "Three"]
    books.append({"title": title, "price": price, "rating": rating})

# First 5 books, just to preview the data
for b in books[:5]:
    print(f"  {b['rating']:<6}  {b['price']:<8}  {b['title']}")

print(f"\nfound {len(books)} books on the page")

Sample output

  Three   £51.77    A Light in the Attic
  One     £53.74    Tipping the Velvet
  One     £50.10    Soumission
  Four    £47.82    Sharp Objects
  Five    £54.23    Sapiens: A Brief History of Humankind

found 20 books on the page
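
If your own run prints Â£51.77 instead of £51.77, the likely cause is a decoding mismatch: when a response declares no charset, requests falls back to ISO-8859-1 for r.text, while the page bytes are UTF-8. Whether this happens depends on the server's headers, so treat the following as a sketch of the mechanism rather than a description of this site. It uses hand-made bytes instead of a live request, and html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hand-made page bytes standing in for r.content (UTF-8, declared in <meta>).
page_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p class="price_color">£51.77</p></body></html>'
).encode("utf-8")

# What r.text would contain if the bytes were decoded as ISO-8859-1.
wrong_text = page_bytes.decode("iso-8859-1")
bad = BeautifulSoup(wrong_text, "html.parser").select_one(".price_color").text

# Passing the raw bytes lets BeautifulSoup detect the encoding itself.
good = BeautifulSoup(page_bytes, "html.parser").select_one(".price_color").text

print(bad)   # → Â£51.77
print(good)  # → £51.77
```

In the script above, the equivalent change would be BeautifulSoup(r.content, "lxml") in place of BeautifulSoup(r.text, "lxml").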

Read the selectors

Three selectors capture everything:

article.product_pod — the repeating book block.
.price_color — the price text.
p.star-rating — the rating is stored in the CSS class, not the text.

That last one is real-world: the data you want is often in an attribute, not in the text. Inspect carefully.
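
Because the rating arrives as a word taken from the class list, a small lookup table turns it into a number you can sort or filter on. A sketch on a hand-written card that mimics the site's markup; the RATING_WORDS table and rating_of helper are illustrative names, not part of the lesson's script:

```python
from bs4 import BeautifulSoup

# Hand-written card mimicking one product block.
card_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="x.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

# Hypothetical lookup table: class word → number of stars.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_of(card):
    """Return the star rating as an int, or None if no known word is present."""
    classes = card.select_one("p.star-rating").get("class", [])
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


card = BeautifulSoup(card_html, "html.parser").select_one("article.product_pod")
stars = rating_of(card)
print(stars)  # → 3
```

Searching the class list for a known word, rather than indexing with [1], keeps working if the order of the classes ever changes.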

Try It Yourself

13 min

01 🟢 All headings

From any page you fetch, print every h1, h2, and h3, prefixed with the level number.

Hint

for tag in soup.find_all(["h1", "h2", "h3"]):
    print(f"  H{tag.name[1]}  {tag.get_text(strip=True)}")

Pass a list to find_all to match any of those tag names.

02 🟡 Every link with text

Print href + visible text for every <a> on the page. Skip empty texts.

Hint

for a in soup.find_all("a"):
    href = a.get("href")
    txt  = a.get_text(strip=True)
    if href and txt:
        print(f"  {txt:<40} → {href}")

03 🔴 Books in a price range

Scrape books.toscrape.com, parse the price to a float (strip the £), and print only books priced between £20 and £30.

Hint

for card in soup.select("article.product_pod"):
    raw = card.select_one(".price_color").get_text(strip=True)
    price = float(raw.replace("£", ""))
    if 20 <= price <= 30:
        print(f"  £{price:.2f}  {card.h3.a['title']}")

Mini-Challenge · All 50 Pages

8 min

books.toscrape.com has 50 pages. Loop over them (https://books.toscrape.com/catalogue/page-{n}.html) and accumulate every book's title and price into a single CSV. Sleep 0.3s between pages. Print "saved 1000 books" when done.

Show one possible solution

# books_all.py — scrape all 50 pages → books.csv
import csv, time, requests
from bs4 import BeautifulSoup

BASE = "https://books.toscrape.com/catalogue/page-{}.html"
HEADERS = {"User-Agent": "advaslearning-py-l4 (educational)"}

rows = []
for n in range(1, 51):
    r = requests.get(BASE.format(n), headers=HEADERS, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    for card in soup.select("article.product_pod"):
        rows.append({
            "title":  card.h3.a["title"],
            "price":  card.select_one(".price_color").get_text(strip=True),
            "rating": card.select_one("p.star-rating")["class"][1],
        })
    time.sleep(0.3)

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
    w.writeheader()
    w.writerows(rows)
print(f"saved {len(rows)} books")

Non-negotiables: User-Agent, sleep between pages, write a clean CSV.
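
One thing a "clean CSV" buys you: titles that contain commas or quotes are escaped by the csv module instead of spilling into extra columns. A small round-trip sketch with made-up rows, written to an in-memory buffer rather than books.csv:

```python
import csv
import io

# Made-up rows: the first title contains a comma, the second contains quotes.
rows = [
    {"title": "Sapiens, Abridged", "price": "£54.23"},
    {"title": 'The "Best" Book', "price": "£20.00"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())  # awkward titles are wrapped in quotes

# Reading the data back recovers the original values unchanged.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)  # → True
```

Building lines by hand with ",".join(...) would break on the first row, which is why the solution uses csv.DictWriter.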

Recap

3 min

Three methods cover most scraping tasks: find for one element, find_all for many, select when you want a full CSS selector. Data lives in .text, in tag["attr"], or sometimes in a class name. Inspect, isolate the selector, write it once. Tomorrow we ship a real project: a price tracker.

Vocabulary Card

soup: The BeautifulSoup object — the root of the parsed tree.
find / find_all: Tag-name search, optionally with attribute filters.
select / select_one: CSS-selector search. Use when you need >, .class.class, nth-of-type, etc.
get_text(strip=True): Text content with whitespace cleaned up.
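
The selector features named in the select entry can be tried on a cut-down copy of the lesson's book list. A minimal sketch (html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = """
<ul class="book-list">
  <li class="book"><a href="/b/1">Refactoring</a></li>
  <li class="book featured"><a href="/b/2">The Pragmatic Programmer</a></li>
  <li class="book"><a href="/b/3">Clean Code</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# > : direct children only
direct = [li.a.text for li in soup.select("ul.book-list > li")]

# .class.class : the element must carry both classes
featured = [li.a.text for li in soup.select("li.book.featured")]

# :nth-of-type(n) : the n-th <li> among its siblings (1-based)
third = soup.select_one("li:nth-of-type(3) a").text

print(direct)    # → ['Refactoring', 'The Pragmatic Programmer', 'Clean Code']
print(featured)  # → ['The Pragmatic Programmer']
print(third)     # → Clean Code
```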

Homework

4 min

Scrape any non-login page you find interesting (a Wikipedia article's tables, IMDb's top 250, your school's news page). Extract a small dataset (≥ 10 rows, ≥ 3 columns) and save it as JSON. Include a docstring at the top explaining where the data came from, when you scraped it, and the selector you used. (Be polite — robots.txt, User-Agent, sleeps.)

Sample skeleton

"""scrape_news.py
Source : https://example.com/news
Date   : 2026-05-28
Selector: article.story (title, date, link)
"""
import json, requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://example.com/news", timeout=10,
    headers={"User-Agent": "advaslearning-py-l4 (homework)"},
)
r.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(r.text, "lxml")
data = []
for a in soup.select("article.story")[:10]:
    data.append({
        "title": a.select_one("h2").get_text(strip=True),
        "date":  a.select_one("time").get_text(strip=True),
        "link":  a.select_one("a")["href"],
    })

with open("news.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
print(f"saved {len(data)} stories")

Non-negotiables: docstring with provenance, polite headers, real selector, JSON saved.
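
The homework's politeness checklist mentions robots.txt. One way to check it from Python is the standard library's urllib.robotparser. The sketch below parses made-up rules from memory so it runs offline; the rules and the example.com URLs are placeholders, not a real site's policy:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, parsed from memory instead of fetched.
rules = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

agent = "advaslearning-py-l4 (homework)"
ok_news = parser.can_fetch(agent, "https://example.com/news")
ok_private = parser.can_fetch(agent, "https://example.com/private/page")

print(ok_news)     # → True
print(ok_private)  # → False
```

Against a live site you would instead call parser.set_url() with the site's robots.txt address and then parser.read() before asking can_fetch.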
