PY-L5-45 · RAG Lite — Search Your Notes with an LLM

Learning Goals

3 min

Explain RAG: retrieve relevant text, then generate grounded in it.
Split documents into chunks.
Retrieve the best chunks (TF-IDF now; embeddings as a stretch).
Build a prompt that says "answer ONLY from this context".

Warm-Up · Open-Book Exam

5 min

Closed-book (plain LLM): answers from memory → can hallucinate,
                          doesn't know your private notes.
Open-book (RAG):          1. find the relevant pages in YOUR notes
                          2. give them to the model
                          3. "answer using only these pages"

Today's big idea

RAG turns a closed-book exam into an open-book one. You retrieve the relevant text, paste it into the prompt, and instruct the model to answer only from it. This grounds answers in your real documents and makes hallucination much less likely.

New Concept · Chunk, Retrieve, Generate

14 min

1. Chunk your documents

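def chunk(text, size=500, overlap=50):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")  # else the loop never advances
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i + size])
        i += size - overlap        # step forward, re-reading the last `overlap` characters
    return chunks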

Chunks must be small enough to fit several in a prompt, big enough to be meaningful. Overlap avoids cutting an answer in half at a boundary.
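To see the overlap at work, here is a small check on a toy string. It is a sketch only: the chunking logic is repeated so the snippet runs on its own, and the tiny `size` and `overlap` values are chosen purely for illustration.

```python
def chunk(text, size=500, overlap=50):
    # Same sliding-window logic as the lesson's chunk() function.
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i + size])
        i += size - overlap
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk(text, size=10, overlap=3)   # toy sizes, for illustration only
for p in pieces:
    print(p)
# Each piece starts with the last 3 characters of the piece before it.
```

With real notes you would keep the defaults (or tune them), but the pattern is the same: text near a boundary appears in two neighbouring chunks, so either one can be retrieved.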

2. Retrieve with TF-IDF (no API needed)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(chunks):
    vec = TfidfVectorizer(stop_words="english")
    matrix = vec.fit_transform(chunks)
    return vec, matrix

def retrieve(query, chunks, vec, matrix, k=3):
    q = vec.transform([query])
    sims = cosine_similarity(q, matrix)[0]
    top = sims.argsort()[-k:][::-1]
    return [chunks[i] for i in top]

TF-IDF + cosine similarity finds the chunks whose words best match the question — a cheap, no-API retriever. (Real systems use neural embeddings; see the stretch.)
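To make "words best match" concrete, the scoring can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical helper names, the tokenizer is deliberately crude, and no stop words are removed. The IDF formula used, ln((1 + n) / (1 + df)) + 1, is the smoothed form.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, split on spaces, trim punctuation.
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    # Rare terms get a higher weight than terms that appear everywhere.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Mitosis divides one cell into two identical cells.",
    "Photosynthesis turns light into sugar.",
    "The cell membrane controls which molecules enter the cell.",
]
vectors, idf = tfidf_vectors(docs)

query = "What is mitosis?"
# Query terms the index has never seen are ignored.
q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
scores = [cosine(q_vec, v) for v in vectors]
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the first document shares a word with the query, so it is the only one with a non-zero score. A query that used different wording, such as "cell splitting", would match on "cell" alone, which is the weakness the embeddings stretch addresses.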

3. Generate grounded in the context

import anthropic
client = anthropic.Anthropic()

def answer(query, context_chunks):
    context = "\n\n---\n\n".join(context_chunks)
    msg = client.messages.create(
        model="claude-haiku-4-5", max_tokens=400,
        system="Answer ONLY using the provided context. "
               "If the answer isn't there, say 'I don't know from the notes.'",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return msg.content[0].text

The honesty rule

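The system prompt's "if it's not in the context, say you don't know" rule gives the model permission to decline instead of inventing an answer. Without it, a RAG system is far more likely to fill gaps from memory. Even with it, the model can still slip, so spot-check answers against the retrieved text.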
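It can help to build the prompt separately from the API call, so you can print exactly what the model will see before spending any tokens. A minimal sketch, assuming a hypothetical helper named `build_prompt` (not part of the lesson code) and made-up sample chunks:

```python
SYSTEM = ("Answer ONLY using the provided context. "
          "If the answer isn't there, say 'I don't know from the notes.'")

def build_prompt(query, context_chunks):
    # Hypothetical helper: returns the system prompt and the user message.
    context = "\n\n---\n\n".join(context_chunks)   # same separator as the lesson code
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return SYSTEM, user

# Made-up chunks, for illustration only.
sample_chunks = ["Mitosis divides one cell into two.", "Cells have membranes."]
system, user = build_prompt("What is mitosis?", sample_chunks)
print(system)
print(user)
```

The two returned strings slot into the `system=` and `messages=` arguments of the `answer` function above.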

Worked Example · Q&A Over Your Notes

12 min

# rag_lite.py — ask questions about a text file
import anthropic
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

client = anthropic.Anthropic()

def chunk(text, size=500, overlap=50):
    out, i = [], 0
    while i < len(text):
        out.append(text[i:i + size])
        i += size - overlap
    return out

def answer(query, chunks, vec, matrix):
    q = vec.transform([query])
    top = cosine_similarity(q, matrix)[0].argsort()[-3:][::-1]
    context = "\n\n---\n\n".join(chunks[i] for i in top)
    msg = client.messages.create(
        model="claude-haiku-4-5", max_tokens=400,
        system="Answer ONLY from the context. If absent, say "
               "'Not found in the notes.' Cite the relevant phrase.",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQ: {query}"}],
    )
    return msg.content[0].text

# build the index once
text = Path("biology_notes.txt").read_text(encoding="utf-8")
chunks = chunk(text)
vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(chunks)

# ask away
for q in ["What is mitosis?", "Who won the 2018 World Cup?"]:
    print(f"Q: {q}\nA: {answer(q, chunks, vec, matrix)}\n")

Sample output

Q: What is mitosis?
A: Mitosis is cell division producing two identical daughter cells, as the
   notes state: "mitosis divides one cell into two genetically identical cells."

Q: Who won the 2018 World Cup?
A: Not found in the notes.

Read the output

The model answered the biology question from the notes (and cited the phrase) but correctly refused the World Cup question — because it's not in the context and the system prompt forbids guessing. That refusal is the whole point: a grounded, honest assistant over your documents.

Try It Yourself

13 min

01 🟢 Index your notes

Point RAG-Lite at one of your own study-notes text files. Ask 3 questions; confirm it answers from the text and refuses off-topic ones.

02 🟡 Show the sources

Print which chunks were retrieved for each question, so you can verify the answer is grounded.
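One possible starting point, offered as a sketch rather than the expected solution: rank the similarity scores yourself and print a short preview of each retrieved chunk. Here `top_k_with_scores` and `show_sources` are hypothetical helper names, and the `chunks` and `sims` values are made up to stand in for your real chunks and the cosine scores from the retriever.

```python
def top_k_with_scores(sims, k=3):
    # Indices of the k highest scores, best first, paired with their scores.
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return [(i, sims[i]) for i in order[:k]]

def show_sources(chunks, sims, k=3, preview=60):
    lines = []
    for rank, (i, score) in enumerate(top_k_with_scores(sims, k), start=1):
        snippet = chunks[i][:preview].replace("\n", " ")
        lines.append(f"[{rank}] chunk {i} (score {score:.2f}): {snippet}")
    return lines

# Made-up data, for illustration only.
chunks = [
    "Cells are the basic unit of life.",
    "Mitosis divides one cell into two genetically identical cells.",
    "Photosynthesis turns light into sugar.",
    "Meiosis produces four cells with half the chromosomes.",
]
sims = [0.10, 0.72, 0.0, 0.35]

for line in show_sources(chunks, sims):
    print(line)
```

A score of 0 means the chunk shares no indexed words with the question, which is a useful warning sign when it still ends up in the top results.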

03 🔴 Real embeddings

Swap TF-IDF for neural embeddings (sentence-transformers, model "all-MiniLM-L6-v2"). Compare retrieval quality on questions phrased differently from the notes.

Hint

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

embed = SentenceTransformer("all-MiniLM-L6-v2")
# normalize_embeddings=True returns unit-length vectors, so a dot product is cosine similarity
chunk_vecs = embed.encode(chunks, normalize_embeddings=True)
q_vec = embed.encode([query], normalize_embeddings=True)
sims = (chunk_vecs @ q_vec.T).ravel()   # one cosine score per chunk

Embeddings match by meaning, so "cell splitting" can retrieve a chunk about "mitosis" even with no shared words.
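A dot product only equals cosine similarity when both vectors have length 1, which is why normalisation matters when comparing embeddings. The toy check below uses two-dimensional vectors in plain Python as stand-ins for real embeddings; `normalise` and `dot` are hypothetical helper names.

```python
import math

def normalise(v):
    # Scale a vector to length 1 without changing its direction.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]          # same direction as a, but twice as long

raw = dot(a, b)                          # grows with vector length
cos = dot(normalise(a), normalise(b))    # depends on direction only

print(raw)
print(round(cos, 6))
```

The raw dot product is inflated by vector length, while the normalised one reports that the two vectors point the same way. Ranking chunks by raw dot products can therefore favour long vectors over genuinely similar ones.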

Mini-Challenge · Study-Notes Web App

8 min

Combine RAG-Lite with the Lesson 44 Flask app: upload/select a notes file, then chat with it. The backend retrieves + grounds every answer. A personal "chat with my notes" tool.

Recap

3 min

RAG = chunk your docs → retrieve the relevant chunks (TF-IDF or embeddings) → generate an answer grounded in them, with a system prompt that forbids guessing. It gives the model knowledge it never had and cuts hallucination. Embeddings retrieve by meaning, not just keywords. That's the last LLM technique — next, the ethics that govern all of this.

Vocabulary Card

RAG: Retrieval-Augmented Generation — retrieve relevant text, then generate grounded in it.
chunk: A small slice of a document, sized to fit several in a prompt.
embedding: A vector capturing a text's meaning; similar meanings sit close together.
grounding: Forcing the model to answer from provided sources, not its memory.

Homework

4 min

Build a RAG-Lite tool over a document you care about (your notes, a rulebook, a manual). Show 3 grounded answers + 1 honest "not found". Print retrieved chunks for transparency. Bonus: try real embeddings and compare.
