PY-L5-43 · Streaming LLM Responses

Learning Goals

3 min

Use client.messages.stream to receive tokens live.
Print text as it arrives (the typing effect).
Understand why streaming improves perceived speed.
Collect the full text + stop reason after streaming.

Warm-Up · Wait vs Watch

5 min

non-streaming:  ...... (10s of blank screen) ...... full answer appears
streaming:      answer appears word-by-word as it's generated

Same total time, completely different feel. Streaming lets the user start reading immediately and signals "I'm working". Every good chat UI streams.

Today's big idea

The model generates one token at a time anyway. Non-streaming buffers them all before returning; streaming hands you each chunk of text as it's produced. The final result is the same, but the experience is much better, and for long replies streaming is essential.
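The idea can be tried without calling any API. The sketch below is only an illustration: fake_model is a hypothetical stand-in generator, not part of the anthropic library, and it simply yields a fixed list of tokens. Both functions end up with the same text; the streaming one just lets the caller see each token along the way.

```python
def fake_model(tokens):
    """Hypothetical stand-in for a model: produces one token at a time."""
    for token in tokens:
        yield token


def non_streaming(tokens):
    """Buffer every token, then return the whole reply at once."""
    return "".join(fake_model(tokens))


def streaming(tokens, on_token):
    """Hand each token to on_token as soon as it is produced."""
    parts = []
    for token in fake_model(tokens):
        on_token(token)      # the caller sees this token right away
        parts.append(token)
    return "".join(parts)


tokens = ["The ", "sea ", "is ", "wide."]
seen = []                    # records tokens in the order they were delivered
buffered = non_streaming(tokens)
streamed = streaming(tokens, seen.append)
print(buffered == streamed)  # compares the two final texts
print(seen)                  # delivery order of the streamed tokens
```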

New Concept · messages.stream

14 min

The streaming context manager

import anthropic
client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=500,
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)   # print each chunk, no newline
print()   # final newline

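stream.text_stream yields text chunks as they arrive. end="" stops print from adding a newline after every chunk, and flush=True pushes each chunk to the terminal right away instead of letting it sit in Python's output buffer.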
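To check the printing behaviour on its own, here is a small offline sketch. type_out is a hypothetical helper and demo_chunks is made-up data standing in for what stream.text_stream would yield; writing to an io.StringIO makes it easy to inspect exactly what was printed.

```python
import io
import sys


def type_out(chunks, out=None):
    """Write each chunk immediately, with no newline between chunks."""
    out = out if out is not None else sys.stdout
    for chunk in chunks:
        print(chunk, end="", flush=True, file=out)
    print(file=out)  # one final newline once the reply is complete


# Stand-in chunks; a real stream.text_stream would supply these.
demo_chunks = ["Waves ", "roll ", "in."]

buffer = io.StringIO()
type_out(demo_chunks, out=buffer)   # capture the output for inspection
type_out(demo_chunks)               # same thing, printed to the terminal
```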

Get the full message afterwards

with client.messages.stream(...) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()   # full text, usage, stop_reason
print("\ntokens:", final.usage.output_tokens)
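The final message also carries stop_reason, which says why generation ended (for example "end_turn" for a natural finish or "max_tokens" when the reply hit the limit). The sketch below is one possible way to report it; describe_finish is a hypothetical helper, and fake_final is a stand-in object that only mimics the attribute names so the example runs without an API key.

```python
from types import SimpleNamespace


def describe_finish(final):
    """Summarise a finished message: why it stopped and how many tokens it used."""
    note = ""
    if final.stop_reason == "max_tokens":
        note = " (reply was cut off at max_tokens)"
    return (
        f"stop_reason={final.stop_reason}{note}, "
        f"input={final.usage.input_tokens}, output={final.usage.output_tokens}"
    )


# Stand-in object with the same attribute names as a real final message.
fake_final = SimpleNamespace(
    stop_reason="max_tokens",
    usage=SimpleNamespace(input_tokens=15, output_tokens=500),
)
print(describe_finish(fake_final))
```

In a real script you would pass the object returned by stream.get_final_message() instead of fake_final.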

Build the string while streaming

chunks = []
with client.messages.stream(...) as stream:
    for text in stream.text_stream:
        chunks.append(text)
        print(text, end="", flush=True)
full_reply = "".join(chunks)

Lower-level events (optional)

For fine control (tool-use deltas, etc.) you can iterate stream directly and inspect event types. For most apps, text_stream is all you need.
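As a rough sketch of what inspecting event types can look like: the streaming API documents events such as message_start, content_block_delta, and message_stop, where a content_block_delta carries either a text_delta or (for tool use) an input_json_delta. The helper collect_text and the fake_events list below are hypothetical stand-ins so the example runs offline; the SDK may also emit additional helper events, so treat the exact set as an assumption to check against the version you have installed.

```python
from types import SimpleNamespace


def collect_text(events):
    """Pull text out of stream events, ignoring every other event type."""
    parts = []
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            parts.append(event.delta.text)
    return "".join(parts)


# Stand-in events shaped like the documented streaming events.
fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="Hello")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="input_json_delta", partial_json="{")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text=" there")),
    SimpleNamespace(type="message_stop"),
]
print(collect_text(fake_events))
```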

Worked Example · A Streaming Chat REPL

12 min

# stream_chat.py — a multi-turn streaming chat in the terminal
import anthropic
client = anthropic.Anthropic()

SYSTEM = "You are a concise, friendly assistant. Keep replies under 80 words."
history = []   # remembers the conversation

print("Chat (type 'quit' to exit)\n")
while True:
    user = input("You: ").strip()
    if user.lower() in ("quit", "exit"):
        break

    history.append({"role": "user", "content": user})

    print("AI: ", end="", flush=True)
    chunks = []
    with client.messages.stream(
        model="claude-haiku-4-5", max_tokens=400,
        system=SYSTEM, messages=history,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            chunks.append(text)
    print("\n")

    history.append({"role": "assistant", "content": "".join(chunks)})

Sample session

You: what's a fun fact about octopuses?
AI: Octopuses have three hearts and blue blood! Two hearts pump blood to the
gills, and the third pumps it to the body. Even cooler — the body heart stops
beating when they swim, which is why they often prefer to crawl. 🐙

You: do they sleep?
AI: Yes! And they may even dream — their skin flickers with colour while...

Read the diff

Two things make this a real chat: streaming (replies type out live) and history (we append both sides, so the model remembers context — "they" in the second question resolves to octopuses). Append the user message before the call and the assistant's full reply after. That's the entire conversational loop.
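One gap in the loop is worth knowing about: if the API call fails partway, history is left ending with a user message that was never answered. A possible way to handle that is sketched below. run_turn is a hypothetical helper, and stream_reply is an assumed callable that takes the history and yields text chunks (in the real script it would wrap client.messages.stream); fake_stream stands in for it here so the example runs offline.

```python
def run_turn(history, user_text, stream_reply):
    """Append the user message, stream the reply, then append the reply.

    stream_reply is a hypothetical callable: it takes the history list and
    yields text chunks. If it raises, the user message is removed again so
    history never ends with an unanswered user turn.
    """
    history.append({"role": "user", "content": user_text})
    chunks = []
    try:
        for text in stream_reply(history):
            print(text, end="", flush=True)
            chunks.append(text)
    except Exception:
        history.pop()  # roll back the unanswered user message
        raise
    print()
    reply = "".join(chunks)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_stream(history):
    """Stand-in for the API: reports how many messages it was sent."""
    yield "I can see "
    yield f"{len(history)} message(s)."


history = []
run_turn(history, "hello", fake_stream)
run_turn(history, "still there?", fake_stream)
```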

Try It Yourself

13 min

01 🟢 Stream a story

Stream a 200-word story to the terminal with the typing effect. Print the token count at the end.

02 🟡 Memory test

In the chat REPL, tell it your name, then ask it later. Confirm history makes it remember. Remove history and show it forgets.

03 🔴 Cap the history

Long chats cost more tokens each turn, because the whole history is sent again every time. Keep only the last N message pairs (a sliding window) so cost stays bounded. Show that the input token count per turn levels off instead of growing.

Hint

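MAX_MESSAGES = 6   # keep last 6 messages (3 user/assistant pairs)
# trim after appending the assistant reply so the window starts with a user message
history = history[-MAX_MESSAGES:]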
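A slightly more defensive version of the same idea is sketched below. trim_history is a hypothetical helper; it assumes history strictly alternates user and assistant messages, counts in pairs rather than single messages, and makes sure the trimmed window still begins with a user message.

```python
def trim_history(history, max_pairs):
    """Keep only the last max_pairs user/assistant pairs.

    Assumes history alternates user, assistant, user, ... and that it may
    end with a user message that has not been answered yet.
    """
    keep = max_pairs * 2
    if history and history[-1]["role"] == "user":
        keep += 1  # keep the pending user message plus the pairs before it
    trimmed = history[-keep:] if keep else []
    # The window should open with a user message, so drop a leading assistant one.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


# Build five made-up question/answer pairs, then keep the last three.
history = []
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

window = trim_history(history, max_pairs=3)
print(len(window), window[0]["content"])
```

The trade-off is that anything outside the window is forgotten, so a name mentioned in the first turn will eventually drop out of memory.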

Mini-Challenge · Streaming + History

8 min

Combine streaming with the chat history and a system persona to build a polished terminal assistant. Add a /clear command to reset history and a token counter that shows the running cost of the session.
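As a starting point, the two extra pieces could be shaped roughly like this. SessionStats and handle_command are hypothetical names, and the usage object is a stand-in with the same input_tokens and output_tokens attributes as final.usage; turning token totals into a money figure is left out because prices vary by model.

```python
from types import SimpleNamespace


class SessionStats:
    """Running totals of tokens used across the session."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, usage):
        """usage is any object with input_tokens and output_tokens attributes."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    def summary(self):
        total = self.input_tokens + self.output_tokens
        return f"[tokens: {self.input_tokens} in, {self.output_tokens} out, {total} total]"


def handle_command(line, history):
    """Return True if line was a slash command and has been handled."""
    if line.strip().lower() == "/clear":
        history.clear()
        print("(history cleared)")
        return True
    return False


# Offline demo with made-up usage numbers.
stats = SessionStats()
stats.add(SimpleNamespace(input_tokens=20, output_tokens=45))
stats.add(SimpleNamespace(input_tokens=80, output_tokens=60))
print(stats.summary())

history = [{"role": "user", "content": "hi"}]
was_command = handle_command("/clear", history)
```

In the chat loop, call handle_command right after reading input and continue to the next iteration when it returns True, so the command itself is never sent to the model.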

Recap

3 min

Streaming hands you tokens as they're generated via client.messages.stream + stream.text_stream. Print with end="", flush=True for the live typing effect, and collect chunks to keep the full reply. Maintain a history list (append user then assistant) for memory, and cap it to control cost. Next: put this in a Flask web app.

Vocabulary Card

streaming: Receiving the reply incrementally as it's generated.
text_stream: Iterator yielding text chunks during a stream.
conversation history: The running list of messages that gives the model memory.
context window: The max tokens the model can attend to; long histories fill it (and cost more).

Homework

4 min

Build a polished terminal chat: streaming, a system persona, conversation memory, a /clear command, a sliding-window history cap, and a running token counter. Commit it with a README.
