Learning Goals
3 min
- Recognise the four parts of an HTML element: tag, attributes, text, children.
- Read the DOM tree of a simple page by hand.
- Use the browser's "Inspect Element" to find a CSS selector for any node.
- Apply the scraping ethics checklist: robots.txt, ToS, rate limit, identification.
Warm-Up · Parts of a Tag
5 min
<a href="/about" class="link primary">About us</a>
└┬┘└─────┬─────┘ └────────┬─────────┘ └──┬───┘
tag  attribute        attribute    text content
      (href)           (class)
Four things to spot:
- Tag name: a, div, p, img.
- Attributes: key="value" pairs inside the opening tag.
- Text content: what's between the opening and closing tags.
- Children: nested tags between the opening and closing tags.
HTML is a tree. Every node has a tag, optional attributes, optional text, and an ordered list of children. BeautifulSoup's job is to let you walk that tree by tag name, attribute, or CSS selector.
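To see those parts from code before BeautifulSoup arrives, here is a minimal sketch that uses only Python's standard-library html.parser. The TagSpotter class is a made-up name for illustration, not part of any library.

```python
from html.parser import HTMLParser


class TagSpotter(HTMLParser):
    """Hypothetical helper: records tag names, attributes, and text as they are parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags
        if data.strip():
            self.events.append(("text", data.strip()))


spotter = TagSpotter()
spotter.feed('<a href="/about" class="link primary">About us</a>')
for event in spotter.events:
    print(event)
```

The first event carries the tag name and a dict of attributes; the second carries the text content. Children would show up as further "tag" events before the parent closes.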
New Concept · The DOM Tree
14 min
A tiny page
<html>
  <head>
    <title>Books</title>
  </head>
  <body>
    <h1>Best Sellers</h1>
    <ul class="book-list">
      <li class="book">
        <a href="/b/1">Refactoring</a>
        <span class="price">RM 89</span>
      </li>
      <li class="book featured">
        <a href="/b/2">The Pragmatic Programmer</a>
        <span class="price">RM 95</span>
      </li>
    </ul>
  </body>
</html>
Draw it as a tree:
html
├── head
│   └── title "Books"
└── body
    ├── h1 "Best Sellers"
    └── ul.book-list
        ├── li.book
        │   ├── a (href=/b/1) "Refactoring"
        │   └── span.price "RM 89"
        └── li.book.featured
            ├── a (href=/b/2) "The Pragmatic Programmer"
            └── span.price "RM 95"
CSS selectors: the universal address
Pattern        Example     Matches
tag            a           every <a>
.class         .price      every element with class="price"
#id            #checkout   the element with id="checkout"
parent child   ul li       every <li> inside a <ul>
combo          li.book a   every <a> inside <li class="book">
Selectors compose just like CSS in your front-end work. Tomorrow you'll feed these to soup.select().
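To make the "address" idea concrete without BeautifulSoup, the sketch below records the chain of open tags leading to each element of the book list, then checks by hand what li.book a means. It uses only the standard library; PathRecorder and matches_li_book_a are hypothetical names, and the matcher handles this one selector only, not CSS in general.

```python
from html.parser import HTMLParser

PAGE = """
<ul class="book-list">
  <li class="book"><a href="/b/1">Refactoring</a> <span class="price">RM 89</span></li>
  <li class="book featured"><a href="/b/2">The Pragmatic Programmer</a> <span class="price">RM 95</span></li>
</ul>
"""


class PathRecorder(HTMLParser):
    """Hypothetical helper: records the chain of open tags leading to each element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.paths = []   # one path per element seen

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        label = ".".join([tag] + classes)   # e.g. "li.book.featured"
        self.stack.append(label)
        self.paths.append(list(self.stack))

    def handle_endtag(self, tag):
        # Assumes well-formed HTML where every tag is closed, as in PAGE
        if self.stack:
            self.stack.pop()


def matches_li_book_a(path):
    """'li.book a': an <a> with some ancestor <li> whose classes include 'book'."""
    *ancestors, element = path
    is_link = element.split(".")[0] == "a"
    has_book_li = any(
        a.split(".")[0] == "li" and "book" in a.split(".")[1:] for a in ancestors
    )
    return is_link and has_book_li


recorder = PathRecorder()
recorder.feed(PAGE)
for path in recorder.paths:
    print(" > ".join(path))

hits = [p for p in recorder.paths if matches_li_book_a(p)]
print(len(hits), "links match li.book a")
```

Note that the link inside li.book.featured also matches: a class selector only asks that the class is present, not that it is the only one.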
Find your selector in 3 clicks
- Right-click the element on a real page in Chrome / Firefox / Edge → Inspect.
- In the Elements panel, right-click the highlighted tag → Copy → Copy selector.
- Paste; clean it up. (Browser-generated selectors are often too long.)
Ethical scraping checklist
- Check https://site.com/robots.txt. If a path is listed under Disallow, don't scrape it; Disallow: / puts the whole site off limits.
- Read the site's Terms of Service; many big sites forbid scraping.
- Identify yourself with a User-Agent that names you and your purpose.
- Rate-limit: at most ~1 request per second; usually slower.
- Cache aggressively so you re-fetch only when needed.
- Never scrape behind a login wall unless you have permission.
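Part of the checklist can be turned into code. The sketch below checks URLs against robots.txt rules with the standard-library urllib.robotparser and spaces requests out with a small throttle. The robots.txt text, the example.com URLs, the User-Agent string, and the Throttle class are all invented for this example; a real script would download the site's actual robots.txt.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent: name yourself and your purpose
USER_AGENT = "my-course-bot (educational demo; contact: you@example.com)"

# Sample robots.txt text, invented for this example
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if the sample rules let our bot fetch this URL."""
    return robots.can_fetch(USER_AGENT, url)


class Throttle:
    """Hypothetical helper: makes sure calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last = None

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()


throttle = Throttle(min_interval=1.0)
for url in ["https://example.com/books/", "https://example.com/private/data"]:
    if not allowed(url):
        print("skip (disallowed):", url)
        continue
    throttle.wait()
    print("ok to fetch:", url)   # a real script would call requests.get here
```

Terms of Service and login walls cannot be checked by code; those two items still need a human to read and decide.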
Worked Example · Fetch & Inspect
12 min
Hit a friendly scraping playground: https://books.toscrape.com exists specifically to be scraped. Print the raw HTML, then save the page for inspection.
# fetch_books.py: get the HTML, save it, peek at the start
import requests

HEADERS = {
    "User-Agent": "advaslearning-py-l4 (educational scraping demo)",
}

r = requests.get("https://books.toscrape.com/", headers=HEADERS, timeout=10)
r.raise_for_status()  # stop early on 4xx / 5xx responses
print(f"status: {r.status_code}")
print(f"length: {len(r.text):,} chars")

# Save the full page so we can read it offline
with open("books_home.html", "w", encoding="utf-8") as f:
    f.write(r.text)

# Print the first 500 chars to peek at the shape
print("\n--- first 500 chars ---")
print(r.text[:500])
Sample output (truncated)
status: 200
length: 51,224 chars
--- first 500 chars ---
<!DOCTYPE html>
<html lang="en-us">
<head>
<title>
All products | Books to Scrape - Sandbox
</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
...
Open books_home.html in VS Code, search for "product_pod", and you'll see the repeating block we'll target tomorrow with BeautifulSoup.
Read the diff
Three good habits already: a descriptive User-Agent, a timeout, and saving the raw page so we can inspect it without re-hitting the server. Tomorrow we add BeautifulSoup and start extracting.
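The "save it so we don't re-hit the server" habit can be wrapped in a small helper. This is one possible sketch, not part of the lesson's script: load_or_fetch is a hypothetical name, and the fetch step is passed in as a function so the demo can run offline with a fake.

```python
import tempfile
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached text if the file exists; otherwise call fetch() and save the result.

    fetch is any zero-argument function returning the page text, for example
    lambda: requests.get(url, headers=HEADERS, timeout=10).text
    """
    path = Path(cache_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    path.write_text(text, encoding="utf-8")
    return text


# Offline demo: a fake fetch that counts how often it is called
calls = []


def fake_fetch():
    calls.append(1)
    return "<html>demo</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "books_home.html"
    first = load_or_fetch(cache, fake_fetch)    # goes to the "network"
    second = load_or_fetch(cache, fake_fetch)   # served from disk

print("fetch calls:", len(calls))
```

A cache like this never expires on its own; delete the file when you want a fresh copy.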
Try It Yourself
13 min
Take this HTML and draw the DOM tree on paper:
<div class="card">
  <h2>News</h2>
  <ul>
    <li><a href="/1">Article one</a></li>
    <li><a href="/2">Article two</a></li>
  </ul>
</div>
Hint
div.card
├── h2 "News"
└── ul
    ├── li
    │   └── a href=/1 "Article one"
    └── li
        └── a href=/2 "Article two"
For the same HTML, write a CSS selector for: (a) the heading, (b) every link, (c) only the second link.
Hint
a) .card h2
b) .card a
c) .card li:nth-of-type(2) a
Open https://news.ycombinator.com. Right-click any story title → Inspect. Find the CSS selector that uniquely identifies story titles. Write it down.
Hint
At time of writing: span.titleline > a (this changes; that's why scrapers are fragile)
Selector changes break scrapers. That's why robust scrapers use fallback selectors, and why you should run them regularly to spot breakage early.
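One way to sketch the fallback idea: try a list of selectors in order and keep the first one that returns anything. first_match is a hypothetical helper, and the selectors and page data below are stand-ins so the example runs offline. With BeautifulSoup you would pass soup.select as the select function.

```python
def first_match(select, selectors):
    """Try each selector in order; return (selector, results) for the first non-empty hit.

    select is any function that maps a CSS selector string to a list of matches.
    """
    for selector in selectors:
        results = select(selector)
        if results:
            return selector, results
    return None, []


# Stand-in for a parsed page: only the newer selector still matches anything
fake_page = {"span.titleline > a": ["Story A", "Story B"]}


def fake_select(selector):
    return fake_page.get(selector, [])


used, titles = first_match(fake_select, ["a.old-title-link", "span.titleline > a"])
print(used, len(titles))

# When nothing matches, treat it as a signal that the page layout changed
missing, nothing = first_match(fake_select, ["a.old-title-link"])
print(missing, nothing)
```

Logging which selector was used makes it easier to notice when the scraper has silently dropped to a fallback.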
Mini-Challenge · robots.txt Audit
8 min
Fetch robots.txt from three sites (e.g., github.com, reddit.com, books.toscrape.com). Print which paths are disallowed for a generic bot (User-agent: *).
Show one possible solution
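# robots_audit.py: list Disallow paths that apply to every bot
import requests

SITES = ["https://github.com", "https://www.reddit.com", "https://books.toscrape.com"]
HEADERS = {"User-Agent": "advaslearning-py-l4 (robots audit)"}

for site in SITES:
    print(f"\n🤖 {site}/robots.txt")
    r = requests.get(f"{site}/robots.txt", headers=HEADERS, timeout=5)
    if not r.ok:
        # No robots.txt (or an error page): nothing to parse
        print(f"  (HTTP {r.status_code}, skipped)")
        continue
    in_star_block = False
    for line in r.text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("user-agent:"):
            in_star_block = line.split(":", 1)[1].strip() == "*"
        elif in_star_block and line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            # An empty Disallow value means nothing is disallowed
            print(f"  ⛔ {path or '(empty: nothing disallowed)'}")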
Non-negotiables: parse the User-agent: * block specifically; not every site has one, and not every disallow line is for everyone.
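The solution above flips its flag on every User-agent line, so it misses rules when several User-agent lines share one group, which the robots.txt standard (RFC 9309) allows. A slightly more careful sketch, written as a pure function so it can be tried offline; disallowed_for_all is a hypothetical name and the sample text is invented:

```python
def disallowed_for_all(robots_text):
    """Return the Disallow paths from groups that include User-agent: *."""
    paths = []
    group_agents = []        # agents named by the current group
    reading_agents = False   # True while we are inside a run of User-agent lines
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                group_agents = []   # a new group starts here
            group_agents.append(value)
            reading_agents = True
        else:
            reading_agents = False
            # An empty Disallow value means "nothing is disallowed", so skip it
            if field == "disallow" and "*" in group_agents and value:
                paths.append(value)
    return paths


# Sample text, invented for this example
SAMPLE = """\
# demo file
User-agent: somebot
Disallow: /only-for-somebot

User-agent: otherbot
User-agent: *
Disallow: /search   # trailing comment
Disallow:
Allow: /public
"""

print(disallowed_for_all(SAMPLE))
```

For real projects, the standard library's urllib.robotparser already implements this grouping and also answers "may I fetch this URL?" directly.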
Recap
3 min
HTML is a tree of tags with attributes, text and children. CSS selectors are the addressing scheme, with the same syntax you used for styling. Always check robots.txt and the ToS, set a real User-Agent, rate-limit, and cache. Tomorrow we let BeautifulSoup walk the tree and pull out what we want.
Vocabulary Card
- DOM: Document Object Model, the in-memory tree representation of an HTML document.
- tag / attribute / text: The three pieces of every element, as in <a href="...">Text</a>.
- CSS selector: A string like div.card a that addresses elements in the tree.
- robots.txt: A file at the site root that tells bots which paths to skip. Honour it.
Homework
4 min
Build page_audit.py. Given a URL on the command line:
- Fetch the page (with a descriptive User-Agent and timeout).
- Save the HTML to {host}.html.
- Print: response status, page length, title tag text, count of <a> tags (rough: count occurrences of "<a " in the text), count of <img> tags.
- Print the first 10 disallow lines from that site's robots.txt.
Sample · page_audit.py
# page_audit.py: quick audit of one page plus its robots.txt
import re
import sys
from urllib.parse import urlparse

import requests

HEADERS = {"User-Agent": "advaslearning-py-l4 (audit)"}

url = sys.argv[1]
parts = urlparse(url)
host = parts.netloc.replace(".", "_")

r = requests.get(url, timeout=10, headers=HEADERS)
r.raise_for_status()

with open(f"{host}.html", "w", encoding="utf-8") as f:
    f.write(r.text)

print(f"status: {r.status_code}")
print(f"length: {len(r.text):,} chars")
title = re.search(r"<title[^>]*>(.*?)</title>", r.text, re.S | re.I)
print(f"title: {title.group(1).strip() if title else '(none)'}")
print(f"<a> tags: {r.text.count('<a ')}")  # rough count, as the brief allows
print(f"<img> tags: {r.text.count('<img')}")

print("\nrobots.txt disallow (first 10)")
robots = requests.get(f"{parts.scheme}://{parts.netloc}/robots.txt", headers=HEADERS, timeout=5)
lines = robots.text.splitlines() if robots.ok else []  # skip error pages
shown = 0
in_star = False
for line in lines:
    line = line.strip()
    if line.lower().startswith("user-agent:"):
        in_star = line.split(":", 1)[1].strip() == "*"
    elif in_star and line.lower().startswith("disallow:") and shown < 10:
        print(f"  ⛔ {line.split(':', 1)[1].strip()}")
        shown += 1
Non-negotiables: real User-Agent, file saved, friendly summary, robots block parsed correctly.