Learning Goals
3 min
- Use client.messages.stream to receive tokens live.
- Print text as it arrives (the typing effect).
- Understand why streaming improves perceived speed.
- Collect the full text + stop reason after streaming.
Warm-Up · Wait vs Watch
5 min
non-streaming: ...... (10s of blank screen) ...... full answer appears
streaming: answer appears word-by-word as it's generated
Same total time, completely different feel. Streaming lets the user start reading immediately and signals "I'm working". Every good chat UI streams.
The model generates one token at a time anyway. Non-streaming buffers them all before returning; streaming hands you each token as it's produced. Same result, much better experience — and essential for long replies.
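The buffering difference can be simulated without calling the API. The sketch below is illustrative only: `generate` is a hypothetical stand-in for the model (not part of the anthropic SDK), and the helper counts how many tokens have been produced before the user sees anything.

```python
def generate(tokens, log):
    """Hypothetical stand-in for the model: yields one token at a time."""
    for tok in tokens:
        log.append(tok)  # record that this token has been generated
        yield tok


def tokens_before_first_output(tokens, streaming):
    """Count tokens generated before anything is shown to the user."""
    log = []
    gen = generate(tokens, log)
    if streaming:
        next(gen)      # the first token can be displayed immediately
    else:
        "".join(gen)   # the whole reply is buffered before display
    return len(log)


TOKENS = ["The ", "sea ", "is ", "wide."]
print("streaming:    ", tokens_before_first_output(TOKENS, streaming=True))   # 1
print("non-streaming:", tokens_before_first_output(TOKENS, streaming=False))  # 4
```

With streaming the reader waits for one token; without it, for all of them. The longer the reply, the bigger that gap.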
New Concept · messages.stream
14 min
The streaming context manager
    import anthropic

    client = anthropic.Anthropic()

    with client.messages.stream(
        model="claude-haiku-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)  # print each chunk, no newline
    print()  # final newline
stream.text_stream yields text chunks as they arrive. Passing end="" and flush=True to print() keeps the chunks on one line and makes each appear immediately instead of waiting in the output buffer.
Get the full message afterwards
    with client.messages.stream(...) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        final = stream.get_final_message()  # full text, usage, stop_reason

    print("\ntokens:", final.usage.output_tokens)
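The final message also carries the stop reason, which tells you whether the reply ended naturally or hit the max_tokens limit. Here is a minimal sketch of reading it, assuming the object exposes `stop_reason` and `usage.output_tokens`. The `summarize` helper and the `fake_final` object are hypothetical stand-ins so the code runs offline.

```python
from types import SimpleNamespace


def summarize(final):
    """Return a one-line summary of a finished message.

    Assumes `final` exposes `stop_reason` and `usage.output_tokens`,
    like the object returned by stream.get_final_message().
    """
    note = ""
    if final.stop_reason == "max_tokens":
        note = " (reply was cut off; consider raising max_tokens)"
    return (
        f"stop_reason={final.stop_reason}, "
        f"output_tokens={final.usage.output_tokens}{note}"
    )


# Stand-in object for offline experimentation; not a real API response.
fake_final = SimpleNamespace(
    stop_reason="max_tokens",
    usage=SimpleNamespace(output_tokens=500),
)
print(summarize(fake_final))
```

In a real run you would call summarize(final) after the stream finishes.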
Build the string while streaming
    chunks = []
    with client.messages.stream(...) as stream:
        for text in stream.text_stream:
            chunks.append(text)
            print(text, end="", flush=True)

    full_reply = "".join(chunks)
Lower-level events (optional)
For fine control (tool-use deltas, etc.) you can iterate stream directly and inspect event types. For most apps, text_stream is all you need.
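As a rough sketch of what inspecting event types can look like, the function below walks a sequence of events and branches on `event.type`. It assumes that text events have type "text" and carry the chunk in `event.text`, and that a finished block is signalled by "content_block_stop". Check those names against the SDK version you use. `collect_text` and the fake events are hypothetical, there only so the example runs without an API call.

```python
from types import SimpleNamespace


def collect_text(events):
    """Gather text from a sequence of stream events.

    Assumes each event has a `type` attribute and that events of
    type "text" carry the new chunk in `event.text`.
    """
    parts = []
    blocks_finished = 0
    for event in events:
        if event.type == "text":
            parts.append(event.text)
        elif event.type == "content_block_stop":
            blocks_finished += 1
        # any other event type is ignored in this sketch
    return "".join(parts), blocks_finished


# Stand-in events for offline testing; a real call would iterate `stream`.
fake_events = [
    SimpleNamespace(type="text", text="Hello"),
    SimpleNamespace(type="text", text=", sea"),
    SimpleNamespace(type="content_block_stop"),
]
print(collect_text(fake_events))  # ('Hello, sea', 1)
```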
Worked Example · A Streaming Chat REPL
12 min
    # stream_chat.py: a multi-turn streaming chat in the terminal
    import anthropic

    client = anthropic.Anthropic()
    SYSTEM = "You are a concise, friendly assistant. Keep replies under 80 words."
    history = []  # remembers the conversation

    print("Chat (type 'quit' to exit)\n")
    while True:
        user = input("You: ").strip()
        if user.lower() in ("quit", "exit"):
            break
        if not user:
            continue  # skip empty input rather than sending a blank message
        history.append({"role": "user", "content": user})

        print("AI: ", end="", flush=True)
        chunks = []
        with client.messages.stream(
            model="claude-haiku-4-5",
            max_tokens=400,
            system=SYSTEM,
            messages=history,
        ) as stream:
            for text in stream.text_stream:
                print(text, end="", flush=True)
                chunks.append(text)
        print("\n")
        # store the assistant's full reply so the next turn has context
        history.append({"role": "assistant", "content": "".join(chunks)})
Sample session
You: what's a fun fact about octopuses?
AI: Octopuses have three hearts and blue blood! Two hearts pump blood to the gills, and the third pumps it to the body. Even cooler: the body heart stops beating when they swim, which is why they often prefer to crawl. 🐙
You: do they sleep?
AI: Yes! And they may even dream: their skin flickers with colour while...
Read the diff
Two things make this a real chat: streaming (replies type out live) and history (we append both sides, so the model remembers context — "they" in the second question resolves to octopuses). Append the user message before the call and the assistant's full reply after. That's the entire conversational loop.
Try It Yourself
13 min
Stream a 200-word story to the terminal with the typing effect. Print the token count at the end.
In the chat REPL, tell it your name, then ask it later. Confirm history makes it remember. Remove history and show it forgets.
Long chats cost more tokens each turn. Keep only the last N message pairs (a sliding window) so cost stays bounded. Show the token count stays flat.
Hint
    MAX_TURNS = 6  # keep last 6 messages
    history = history[-MAX_TURNS:]
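A plain slice can leave an assistant message at the front of the window (for example with an odd cap, or when trimming right after appending the user message), and the API expects the conversation to open with a user turn. One possible way to guard against that is sketched below. `trim_history` is a hypothetical helper name, not something from the SDK.

```python
def trim_history(history, max_messages=6):
    """Keep only the most recent messages, starting on a user turn."""
    trimmed = history[-max_messages:]
    # Drop leading assistant messages so the window opens with a user message.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


# Build a 5-turn conversation (10 messages) to trim.
history = []
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

window = trim_history(history, max_messages=6)
print(len(window), window[0]["content"])  # 6 question 2
```

The function returns a new list, so you can pass the trimmed window as messages= while still keeping the full history for logging.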
Mini-Challenge · Streaming + Tools
8 min
Combine streaming with the chat history and a system persona to build a polished terminal assistant. Add a /clear command to reset history and a token counter that shows the running cost of the session.
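One possible shape for the /clear command is a small function that runs before the user's text is appended to history. This is only a sketch. The name `handle_command` and its return convention are assumptions made for illustration.

```python
def handle_command(user_input, history):
    """Apply a slash command in place; return True if the input was a command."""
    if user_input.strip().lower() == "/clear":
        history.clear()  # forget the conversation so far
        return True
    return False


history = [
    {"role": "user", "content": "My name is Sam."},
    {"role": "assistant", "content": "Nice to meet you, Sam!"},
]
print(handle_command("hello", history), len(history))   # False 2
print(handle_command("/clear", history), len(history))  # True 0
```

In the REPL loop, a line such as `if handle_command(user, history): continue` placed before history.append keeps the command itself from being sent to the model.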
Recap
3 min
Streaming hands you tokens as they're generated via client.messages.stream + stream.text_stream. Print with end="", flush=True for the live typing effect, and collect chunks to keep the full reply. Maintain a history list (append user then assistant) for memory, and cap it to control cost. Next: put this in a Flask web app.
Vocabulary Card
- streaming: Receiving the reply incrementally as it's generated.
- text_stream: Iterator yielding text chunks during a stream.
- conversation history: The running list of messages that gives the model memory.
- context window: The max tokens the model can attend to; long histories fill it (and cost more).
Homework
4 min
Build a polished terminal chat: streaming, a system persona, conversation memory, a /clear command, a sliding-window history cap, and a running token counter. Commit it with a README.
Extend stream_chat.py with the sliding-window cap and a /clear branch. Accumulate final.usage.output_tokens across turns for the running counter.
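The running counter could be kept in a small accumulator like the one sketched here. It assumes each turn's usage object exposes `input_tokens` and `output_tokens`. The `SessionUsage` class is hypothetical, and the prices passed to `cost` are placeholders you must replace with real figures for your model.

```python
from types import SimpleNamespace


class SessionUsage:
    """Running token totals for one chat session."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, usage):
        """Add one turn's usage (expects input_tokens and output_tokens)."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    def cost(self, usd_per_million_in, usd_per_million_out):
        """Estimate cost from caller-supplied prices per million tokens."""
        total = (self.input_tokens * usd_per_million_in
                 + self.output_tokens * usd_per_million_out)
        return total / 1_000_000


session = SessionUsage()
# Stand-in usage objects; in the REPL, pass final.usage after each turn.
session.add(SimpleNamespace(input_tokens=40, output_tokens=60))
session.add(SimpleNamespace(input_tokens=120, output_tokens=55))
print(session.input_tokens, session.output_tokens)  # 160 115
```

Tracking input tokens alongside output tokens matters here because the whole history is resent every turn, so input usually dominates the cost of a long chat.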