PY-L4-14 · Web Scraping — HTML Basics for Scrapers

Learning Goals

3 min

Recognise the four parts of an HTML element: tag, attributes, text, children.
Read the DOM tree of a simple page by hand.
Use the browser's "Inspect Element" to find a CSS selector for any node.
Apply the scraping ethics checklist: robots.txt, ToS, rate limit, identification.

Warm-Up · Parts of a Tag

5 min

<a href="/about" class="link primary">About us</a>
└┬┘└────┬─────┘└──────┬────────┘└──┬──┘
tag   attribute    attribute    text content
       (href)      (class)

Four things to spot:

Tag name — a, div, p, img.
Attributes — key="value" pairs inside the opening tag.
Text content — what's between the open and close tag.
Children — nested tags between the open and close.
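
To make the parts concrete, here is a minimal sketch that pulls them out of the warm-up tag using only Python's standard-library html.parser module. The class name PartsSpotter and its parts list are made up for this illustration; BeautifulSoup, which arrives tomorrow, offers a friendlier interface for the same idea. The warm-up tag has no children, so only three of the four parts show up.

```python
from html.parser import HTMLParser


class PartsSpotter(HTMLParser):
    """Records the tag name, attributes and text of every element it sees."""

    def __init__(self):
        super().__init__()
        self.parts = []  # (kind, value) tuples in document order

    def handle_starttag(self, tag, attrs):
        self.parts.append(("tag", tag))
        for name, value in attrs:  # attrs is a list of (name, value) pairs
            self.parts.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.parts.append(("text", data.strip()))


spotter = PartsSpotter()
spotter.feed('<a href="/about" class="link primary">About us</a>')
spotter.close()

for kind, value in spotter.parts:
    print(f"{kind:10} {value}")
```

Notice that class="link primary" arrives as one attribute whose value holds two class names separated by a space.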

Today's big idea

HTML is a tree. Every node has a tag, optional attributes, optional text, and an ordered list of children. BeautifulSoup's job is to let you walk that tree by tag name, attribute, or CSS selector.

New Concept · The DOM Tree

14 min

A tiny page

<html>
  <head>
    <title>Books</title>
  </head>
  <body>
    <h1>Best Sellers</h1>
    <ul class="book-list">
      <li class="book">
        <a href="/b/1">Refactoring</a>
        <span class="price">RM 89</span>
      </li>
      <li class="book featured">
        <a href="/b/2">The Pragmatic Programmer</a>
        <span class="price">RM 95</span>
      </li>
    </ul>
  </body>
</html>

Draw it as a tree:

html
├── head
│   └── title  "Books"
└── body
    ├── h1     "Best Sellers"
    └── ul.book-list
        ├── li.book
        │   ├── a (href=/b/1)  "Refactoring"
        │   └── span.price      "RM 89"
        └── li.book.featured
            ├── a (href=/b/2)  "The Pragmatic Programmer"
            └── span.price      "RM 95"
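
The same drawing can be produced by a program. The sketch below is one possible way to do it with the standard-library html.parser: it indents each tag by its depth in the tree. TreePrinter is a hypothetical helper, and it assumes every tag is explicitly closed and that text directly follows the start tag of its element, which holds for the tiny page but not for real-world HTML in general.

```python
from html.parser import HTMLParser

PAGE = """<html>
  <head><title>Books</title></head>
  <body>
    <h1>Best Sellers</h1>
    <ul class="book-list">
      <li class="book">
        <a href="/b/1">Refactoring</a>
        <span class="price">RM 89</span>
      </li>
      <li class="book featured">
        <a href="/b/2">The Pragmatic Programmer</a>
        <span class="price">RM 95</span>
      </li>
    </ul>
  </body>
</html>"""


class TreePrinter(HTMLParser):
    """Builds one indented line per element, e.g. '      li.book'."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        label = ".".join([tag] + classes.split())  # li + book featured -> li.book.featured
        self.lines.append("  " * self.depth + label)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # assumes every start tag has a matching end tag

    def handle_data(self, data):
        if data.strip():  # attach text to the element that was just opened
            self.lines[-1] += f'  "{data.strip()}"'


printer = TreePrinter()
printer.feed(PAGE)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the output corresponds to one branch level in the hand-drawn tree.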

CSS selectors — the universal address

tag          a            every <a>
.class       .price       every element whose class list includes "price"
#id          #checkout    the element with id="checkout"
descendant   ul li        every <li> inside a <ul>, at any depth
combo        li.book a    every <a> inside an <li> whose classes include "book"

Selectors compose just like CSS in your front-end work. Tomorrow you'll feed these to soup.select().
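
Until soup.select() arrives, the meaning of each selector can be spelled out by hand. The sketch below is a stand-in, not how BeautifulSoup works internally: it parses the tiny page with the standard-library xml.etree.ElementTree (possible only because that page happens to be well-formed XML) and writes each selector as an explicit loop. The helper has_class is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
  <head><title>Books</title></head>
  <body>
    <h1>Best Sellers</h1>
    <ul class="book-list">
      <li class="book">
        <a href="/b/1">Refactoring</a>
        <span class="price">RM 89</span>
      </li>
      <li class="book featured">
        <a href="/b/2">The Pragmatic Programmer</a>
        <span class="price">RM 95</span>
      </li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(PAGE)


def has_class(el, name):
    """True if name is one of the space-separated tokens in el's class attribute."""
    return name in el.get("class", "").split()


# a  ->  every <a>
links = [a.text for a in root.iter("a")]

# .price  ->  every element whose class list includes "price"
prices = [el.text for el in root.iter() if has_class(el, "price")]

# ul li  ->  every <li> anywhere below a <ul>
items = [li for ul in root.iter("ul") for li in ul.iter("li")]

# li.book a  ->  every <a> below an <li> whose classes include "book"
book_links = [a.get("href")
              for li in root.iter("li") if has_class(li, "book")
              for a in li.iter("a")]

# li.book.featured a  ->  both classes must be present on the <li>
featured_links = [a.get("href")
                  for li in root.iter("li")
                  if has_class(li, "book") and has_class(li, "featured")
                  for a in li.iter("a")]

print(links)           # ['Refactoring', 'The Pragmatic Programmer']
print(prices)          # ['RM 89', 'RM 95']
print(len(items))      # 2
print(book_links)      # ['/b/1', '/b/2']
print(featured_links)  # ['/b/2']
```

The li.book line matches both list items because a class selector tests for one token in the class list, so class="book featured" still counts as "book".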

Find your selector in 3 clicks

Right-click the element on a real page in Chrome / Firefox / Edge → Inspect.
In the Elements panel, right-click the highlighted tag → Copy → Copy selector.
Paste; clean it up. (Browser-generated selectors are often too long.)

Ethical scraping checklist

Check https://site.com/robots.txt. Don't scrape any path listed under Disallow for your user agent; Disallow: / covers the whole site.
Read the site's Terms of Service — many big sites forbid scraping.
Identify yourself with a User-Agent that names you and your purpose.
Rate-limit: at most ~1 request per second; usually slower.
Cache aggressively so you re-fetch only when needed.
Never scrape behind a login wall unless you have permission.
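
Several items on the checklist can be enforced in code rather than left to memory. The sketch below is one possible shape, not a definitive implementation: PoliteFetcher is a hypothetical wrapper that checks robots.txt rules with the standard-library urllib.robotparser, waits between requests, and caches pages it has already fetched. The fetch argument is an assumption: any callable that takes a URL and returns text. A fake one is used here so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "advaslearning-py-l4 (educational scraping demo)"


class PoliteFetcher:
    """Wraps a fetch function with a robots.txt check, a rate limit and a cache."""

    def __init__(self, robots_txt, fetch, min_delay=1.0):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())  # rules from text we already have
        self.fetch = fetch          # callable(url) -> str (assumption)
        self.min_delay = min_delay  # seconds between real requests
        self.cache = {}
        self.last_request = None

    def get(self, url):
        if url in self.cache:  # cached pages cost the server nothing
            return self.cache[url]
        if not self.rules.can_fetch(USER_AGENT, url):
            raise PermissionError(f"robots.txt disallows {url}")
        if self.last_request is not None:
            wait = self.min_delay - (time.monotonic() - self.last_request)
            if wait > 0:
                time.sleep(wait)  # rate limit
        body = self.fetch(url)
        self.last_request = time.monotonic()
        self.cache[url] = body
        return body


# Offline demo with made-up rules and a fake fetch function
ROBOTS = "User-agent: *\nDisallow: /private/\n"
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(ROBOTS, fake_fetch, min_delay=0.2)
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/a")  # served from the cache, no second call
fetcher.get("https://example.com/b")  # waits about 0.2 s first
blocked = not fetcher.rules.can_fetch(USER_AGENT, "https://example.com/private/x")

print(calls)    # ['https://example.com/a', 'https://example.com/b']
print(blocked)  # True
```

In a real scraper, fetch could be a small function that calls requests.get with the User-Agent header and a timeout and returns the response text.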

Worked Example · Fetch & Inspect

12 min

Hit a friendly scraping playground: https://books.toscrape.com exists specifically to be scraped. Save the raw HTML to a file, then print the first 500 characters to see its shape.

# fetch_books.py: get the HTML, save it, print a slice
import requests

HEADERS = {
    "User-Agent": "advaslearning-py-l4 (educational scraping demo)",
}

r = requests.get("https://books.toscrape.com/", headers=HEADERS, timeout=10)
r.raise_for_status()

print(f"status: {r.status_code}")
print(f"length: {len(r.text):,} chars")

# Save the full page so we can read it offline
with open("books_home.html", "w", encoding="utf-8") as f:
    f.write(r.text)

# Print the first 500 chars to peek at the shape
print("\n--- first 500 chars ---")
print(r.text[:500])

Sample output (truncated)

status: 200
length: 51,224 chars

--- first 500 chars ---
<!DOCTYPE html>
<html lang="en-us">
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>
        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        ...

Open books_home.html in VS Code, search for "product_pod", and you'll see the repeating block we'll target tomorrow with BeautifulSoup.
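
The same search can be done from Python instead of the editor. This is a small sketch under stated assumptions: show_marker is a hypothetical helper, and the sample string is invented stand-in markup so the code runs without the saved file.

```python
def show_marker(html, marker, context=60):
    """Return (count, snippet): how often marker occurs, plus text around the first hit."""
    pos = html.find(marker)
    if pos == -1:
        return 0, ""
    start = max(0, pos - context)
    end = pos + len(marker) + context
    return html.count(marker), html[start:end]


# Invented stand-in markup; with the real file you would read it first:
#   with open("books_home.html", encoding="utf-8") as f:
#       html = f.read()
sample = (
    '<ol><li><article class="product_pod"><h3>A</h3></article></li>'
    '<li><article class="product_pod"><h3>B</h3></article></li></ol>'
)

count, snippet = show_marker(sample, "product_pod", context=20)
print(count)    # 2
print(snippet)
```

A count well above one is a good sign that the marker belongs to a repeating block worth targeting.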

Read the diff

Three good habits already: a descriptive User-Agent, a timeout, and saving the raw page so we can inspect it without re-hitting the server. Tomorrow we add BeautifulSoup and start extracting.

Try It Yourself

13 min

01 🟢 Draw the tree

Take this HTML and draw the DOM tree on paper:

<div class="card">
  <h2>News</h2>
  <ul>
    <li><a href="/1">Article one</a></li>
    <li><a href="/2">Article two</a></li>
  </ul>
</div>

Hint

div.card
├── h2  "News"
└── ul
    ├── li
    │   └── a href=/1  "Article one"
    └── li
        └── a href=/2  "Article two"

02 🟡 Write the selectors

For the same HTML, write a CSS selector for: (a) the heading, (b) every link, (c) only the second link.

Hint

a) .card h2
b) .card a
c) .card li:nth-of-type(2) a

03 🔴 Inspect a real page

Open https://news.ycombinator.com. Right-click any story title → Inspect. Find the CSS selector that uniquely identifies story titles. Write it down.

Hint

At time of writing: span.titleline > a
(this changes; that's why scrapers are fragile)

Selector changes break scrapers. That's why robust scrapers keep fallback selectors and run on a regular schedule, so breakage shows up early.
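
One way to picture a fallback list: try each selector in order and fail loudly when none of them matches. The sketch below is an illustration only. select_with_fallback is a hypothetical helper, the select argument stands in for something like tomorrow's soup.select, and a.story-title is a made-up "old" selector.

```python
def select_with_fallback(select, selectors):
    """Try each selector in order; return (selector, matches) for the first hit."""
    for selector in selectors:
        matches = select(selector)
        if matches:
            return selector, matches
    # Nothing matched: raise so the breakage is noticed instead of silently ignored
    raise LookupError(f"no selector matched: {selectors}")


# Fake page: maps a selector to what it would match (made-up data)
fake_page = {"span.titleline > a": ["Story one", "Story two"]}


def fake_select(selector):
    return fake_page.get(selector, [])


used, titles = select_with_fallback(
    fake_select,
    ["a.story-title", "span.titleline > a"],  # old guess first, current one second
)
print(used)    # span.titleline > a
print(titles)  # ['Story one', 'Story two']
```

Logging which selector was used on each run gives an early warning when the first choice stops working.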

Mini-Challenge · robots.txt Audit

8 min

Fetch robots.txt from three sites (e.g., github.com, reddit.com, books.toscrape.com). Print which paths are disallowed for a generic bot (User-agent: *).

Show one possible solution

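# robots_audit.py: list Disallow paths in each site's "User-agent: *" group
import requests

SITES = ["https://github.com", "https://www.reddit.com",
         "https://books.toscrape.com"]
HEADERS = {"User-Agent": "advaslearning-py-l4 (robots.txt audit)"}

for site in SITES:
    print(f"\n🤖 {site}/robots.txt")
    r = requests.get(f"{site}/robots.txt", headers=HEADERS, timeout=5)
    if not r.ok:
        print(f"  (no robots.txt served: HTTP {r.status_code})")
        continue
    in_star_block = False
    prev_was_agent = False  # consecutive User-agent lines share one group
    for line in r.text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if line.lower().startswith("user-agent:"):
            is_star = line.split(":", 1)[1].strip() == "*"
            in_star_block = is_star or (prev_was_agent and in_star_block)
            prev_was_agent = True
            continue
        prev_was_agent = False
        if in_star_block and line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            # An empty Disallow blocks nothing; only "Disallow: /" blocks everything
            print(f"  ⛔ {path}" if path else "  ✅ (empty Disallow: nothing blocked)")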

Non-negotiables: parse the User-agent: * block specifically; not every site has one, and not every disallow line is for everyone.

Recap

3 min

HTML is a tree of tags with attributes, text and children. CSS selectors are the addressing scheme — same syntax you used for styling. Always check robots.txt and ToS, set a real User-Agent, rate-limit, cache. Tomorrow we let BeautifulSoup walk the tree and pull out what we want.

Vocabulary Card

DOM: Document Object Model — the in-memory tree representation of an HTML document.
tag / attribute / text: Three of the four parts of an element (the fourth is its children): <a href="...">Text</a>.
CSS selector: String like div.card a that addresses elements in the tree.
robots.txt: A file at site root that tells bots which paths to skip. Honour it.

Homework

4 min

Build page_audit.py. Given a URL on the command line:

Fetch the page (with a descriptive User-Agent and timeout).
Save the HTML to {host}.html.
Print: response status, page length, title tag text, count of <a> tags (rough — count occurrences of "<a " in the text), count of <img> tags.
Print the first 10 disallow lines from that site's robots.txt.

Sample · page_audit.py

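# page_audit.py: quick audit of one page plus its robots.txt
import re
import sys
from urllib.parse import urlparse

import requests

if len(sys.argv) != 2:
    sys.exit("usage: python page_audit.py <url>")

url = sys.argv[1]
parts = urlparse(url)
host = parts.netloc.replace(".", "_").replace(":", "_")  # safe file name
HEADERS = {"User-Agent": "advaslearning-py-l4 (audit)"}

r = requests.get(url, timeout=10, headers=HEADERS)
r.raise_for_status()
with open(f"{host}.html", "w", encoding="utf-8") as f:
    f.write(r.text)

print(f"status:    {r.status_code}")
print(f"length:    {len(r.text):,} chars")

# Rough counts from the raw text; a real parser would be more accurate
title = re.search(r"<title[^>]*>(.*?)</title>", r.text, re.S | re.I)
print(f"title:     {title.group(1).strip() if title else '(none)'}")
print(f"<a>  tags: {r.text.count('<a ')}")
print(f"<img>tags: {r.text.count('<img')}")

print("\nrobots.txt disallow (first 10)")
robots = requests.get(f"{parts.scheme}://{parts.netloc}/robots.txt",
                      timeout=5, headers=HEADERS)
lines = robots.text.splitlines() if robots.ok else []  # no file, no rules
shown = 0
in_star = False
prev_was_agent = False  # consecutive User-agent lines share one group
for line in lines:
    line = line.split("#", 1)[0].strip()  # drop comments and padding
    if not line:
        continue
    if line.lower().startswith("user-agent:"):
        is_star = line.split(":", 1)[1].strip() == "*"
        in_star = is_star or (prev_was_agent and in_star)
        prev_was_agent = True
        continue
    prev_was_agent = False
    if in_star and line.lower().startswith("disallow:") and shown < 10:
        path = line.split(":", 1)[1].strip()
        if path:  # an empty Disallow blocks nothing, so skip it
            print(f"  ⛔ {path}")
            shown += 1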

Non-negotiables: real User-Agent, file saved, friendly summary, robots block parsed correctly.
