Learning Goals
3 min
By the end of this lesson you can:
- Load and dump JSON files with json.load/json.dump (and the string forms).
- Reach into nested structures safely with .get and defaults.
- Validate that incoming JSON has the keys and types you expect.
- Transform a nested document into a flat shape (e.g. ready for CSV).
Warm-Up · JSON ↔ Python
5 min
JSON maps almost perfectly onto Python types you already know:
JSON → Python
object {} → dict
array [] → list
string → str
number → int / float
true/false → True / False
null → None
Once parsed, JSON is just nested dicts and lists: everything you mastered in earlier levels. The two real skills are reaching in safely (data from the outside world is often missing keys) and reshaping it into the structure your program wants. Trust nothing; validate everything at the boundary.
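To see the mapping in action, here is a minimal sketch that parses a small JSON string and inspects the Python types that come back. The document text and variable names are made up for illustration.

```python
import json

# A made-up JSON document that touches every row of the table above.
text = '{"name": "Aisha", "age": 31, "score": 9.5, "tags": ["a", "b"], "active": true, "nickname": null}'

parsed = json.loads(text)

# Collect the Python type name of each parsed value.
types = {key: type(value).__name__ for key, value in parsed.items()}
print(types)

# Going the other way, Python values become JSON literals again.
print(json.dumps({"active": True, "nickname": None}))
# {"active": true, "nickname": null}
```

Note that null comes back as None, whose type name is NoneType, and that true/false become real Python bools rather than strings.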
New Concept · Load, Reach, Validate, Reshape
14 min
Loading and dumping
import json
from pathlib import Path

# from a file
data = json.loads(Path("config.json").read_text(encoding="utf-8"))

# back to a file, pretty-printed
Path("out.json").write_text(
    json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
json.loads(s) parses a string; json.load(file) reads from a file object. (The trailing s stands for "string".) json.dumps(obj, indent=2) makes readable output; ensure_ascii=False keeps emoji and accented characters as themselves instead of writing them as \uXXXX escapes.
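The file-object forms work the same way as the string forms. As a minimal sketch, assuming a throwaway temporary directory and a hypothetical file name settings.json, a round trip with json.dump and json.load might look like this:

```python
import json
import tempfile
from pathlib import Path

settings = {"title": "Café ☕", "retries": 3}  # sample data for illustration

# Work in a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.json"  # hypothetical file name

    # json.dump writes straight to an open file object.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(settings, f, indent=2, ensure_ascii=False)

    # json.load reads and parses from an open file object.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == settings)  # True: the round trip preserves the data
```

Either style is fine; the pathlib one-liners are compact, while the file-object forms avoid holding the whole text in a separate string first.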
Reaching in safely
user = {
    "name": "Aisha",
    "address": {"city": "Kuala Lumpur"},
    "roles": ["admin", "editor"],
}

# risky: KeyError if 'address' or 'zip' is missing
# zip_code = user["address"]["zip"]

# safe against missing keys: .get with defaults
city = user.get("address", {}).get("city", "unknown")
zip_code = user.get("address", {}).get("zip", "—")
# "or" also covers a key that is present but null or empty
first_role = (user.get("roles") or ["none"])[0]
print(city, zip_code, first_role)
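External JSON is unreliable: keys vanish, values are null. Chaining .get(key, default) walks the tree without raising KeyError when a key is missing. The .get("address", {}) trick gives you an empty dict to keep chaining on. The default only applies when the key is absent, though; if the key is present but null, .get returns None, so use the (value or fallback) form, as the roles line does.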
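The difference between a missing key and a null value is easy to trip over. The following sketch uses a made-up record whose address is null to show why the "or" form is the safer chain:

```python
record = {"name": "Ben", "address": None}  # key present, value null

# .get's default is NOT used here, because the key exists.
address = record.get("address", {})
print(address)  # None

# Chaining .get on that None would raise AttributeError, so swap any
# falsy value (None included) for the fallback with "or" instead.
city = (record.get("address") or {}).get("city", "unknown")
print(city)  # unknown
```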
Validating the shape
def validate(record: dict) -> list[str]:
    errors = []
    if not isinstance(record.get("name"), str) or not record["name"]:
        errors.append("name must be a non-empty string")
    if not isinstance(record.get("age"), int):
        errors.append("age must be an integer")
    if not isinstance(record.get("emails"), list):
        errors.append("emails must be a list")
    return errors

bad = {"name": "", "age": "old", "emails": "a@b.com"}
print(validate(bad))
# ['name must be a non-empty string', 'age must be an integer', 'emails must be a list']
A validator that returns a list of problems (rather than raising at the first) lets you report everything wrong at once — far friendlier for whoever sends you the data. For big or shared schemas, libraries like pydantic (Level 8) or jsonschema automate this; the hand-rolled version is plenty for small jobs.
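One subtlety in a hand-rolled type check: in Python, bool is a subclass of int, so a JSON true would pass an isinstance(value, int) test. The sketch below shows one possible way to tighten the check; the helper name is_real_int is made up for illustration.

```python
def is_real_int(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

print(isinstance(True, int))  # True
print(is_real_int(True))      # False
print(is_real_int(42))        # True
print(is_real_int("42"))      # False
```

Whether this matters depends on your data; for an "age" field it usually does.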
Transforming / reshaping
# nested API response → flat rows for a CSV
api = {
    "users": [
        {"id": 1, "name": "Aisha", "address": {"city": "KL"}},
        {"id": 2, "name": "Ben", "address": {"city": "London"}},
    ]
}

flat = [
    {"id": u["id"], "name": u["name"],
     "city": u.get("address", {}).get("city", "")}
    for u in api["users"]
]
print(flat)
# [{'id': 1, 'name': 'Aisha', 'city': 'KL'}, {'id': 2, ...}]
Flattening nested JSON into a list of flat dicts is the bridge to Lesson 15's DictWriter — JSON in, CSV out. The comprehension reads as "for each user, build the flat record I want."
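When the nesting is deeper or not known in advance, picking fields by hand gets tedious. As an alternative sketch (not the approach used above), a small recursive helper can flatten any nested dict into dotted keys; the name flatten_keys and the dot separator are assumptions chosen for illustration.

```python
def flatten_keys(node: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining key paths with dots."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_keys(value, path))  # recurse into nested dicts
        else:
            flat[path] = value  # lists and scalars are kept as-is
    return flat

row = flatten_keys(
    {"id": 1, "name": "Aisha", "address": {"city": "KL", "geo": {"lat": 3.1}}}
)
print(row)
# {'id': 1, 'name': 'Aisha', 'address.city': 'KL', 'address.geo.lat': 3.1}
```

The hand-picked comprehension gives you control over column names; the generic helper trades that control for not having to know the shape up front.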
Worked Example · API Dump → Clean CSV
12 min
Goal: take a messy nested JSON export, validate each record, flatten the good ones to CSV, and log the bad ones. This is a realistic "clean the export" job.
import csv
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("json2csv")

def flatten(user: dict) -> dict:
    # "or" fallbacks cover keys that are present but null, not just missing
    contact = user.get("contact") or {}
    address = user.get("address") or {}
    return {
        "id": user.get("id", ""),
        "name": (user.get("name") or "").strip(),
        "email": (contact.get("email") or "").lower(),
        "city": address.get("city", ""),
        "roles": "|".join(user.get("roles") or []),  # list → pipe-joined string
    }

def is_valid(user: dict) -> bool:
    # a record needs a truthy id and something that looks like an email
    email = (user.get("contact") or {}).get("email") or ""
    return bool(user.get("id")) and "@" in email

raw = json.loads(Path("export.json").read_text(encoding="utf-8"))
users = raw.get("users", [])

good, bad = [], 0
for u in users:
    if is_valid(u):
        good.append(flatten(u))
    else:
        bad += 1
        log.warning("skipping invalid record id=%s", u.get("id", "?"))

with open("users.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["id", "name", "email", "city", "roles"])
    w.writeheader()
    w.writerows(good)

log.info("wrote %d users, skipped %d invalid", len(good), bad)
WARNING skipping invalid record id=7
WARNING skipping invalid record id=?
INFO wrote 48 users, skipped 2 invalid
Read the code
Two small functions carry the logic: flatten reshapes one nested user into a flat row (note roles, a list, becomes a pipe-joined string for CSV), and is_valid guards the boundary. Every reach goes through .get with a fallback, so a missing key never crashes the batch and one bad record doesn't sink the other 48. This JSON→validate→flatten→CSV pipeline is one of the most common automations you'll ever write.
Try It Yourself
13 min
Load a JSON file, add a new top-level key (e.g. "processed_at" with the current ISO timestamp), and write it back pretty-printed. Confirm the file is still valid JSON.
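Hint: one possible shape for this exercise is sketched below. It assumes a hypothetical data.json created in a temporary directory so the example runs on its own, and it uses a UTC timestamp; swap in your own file and time zone.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"  # hypothetical file for the demo
    path.write_text('{"items": [1, 2, 3]}', encoding="utf-8")

    # Load, add the new top-level key, and write back pretty-printed.
    data = json.loads(path.read_text(encoding="utf-8"))
    data["processed_at"] = datetime.now(timezone.utc).isoformat()
    path.write_text(
        json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")

    # Re-parsing the file is the check that it is still valid JSON.
    check = json.loads(path.read_text(encoding="utf-8"))

print(sorted(check))  # ['items', 'processed_at']
```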
Write deep_get(data, *keys, default=None) that walks a chain of keys through nested dicts and returns the default if any link is missing. deep_get(d, "a", "b", "c") ≈ d["a"]["b"]["c"] but never crashes.
Hint
def deep_get(data, *keys, default=None):
    for key in keys:
        if isinstance(data, dict) and key in data:
            data = data[key]
        else:
            return default
    return data

print(deep_get(user, "address", "zip", default="—"))
Write check(record, schema) where schema is a dict mapping field → expected type (e.g. {"name": str, "age": int}). Return a list of mismatch messages. Test it on good and bad records.
Hint
def check(record, schema):
    errors = []
    for field, expected in schema.items():
        value = record.get(field)
        if not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(value).__name__}")
    return errors

print(check({"name": "A", "age": "x"}, {"name": str, "age": int}))
Mini-Challenge · The Config Merger
8 min
Write deep_merge(base, override) that merges two nested JSON-like dicts: values in override win, but nested dicts merge recursively rather than replacing wholesale. This is how layered config (defaults + user overrides) works in real apps.
Show a sample solution
def deep_merge(base: dict, override: dict) -> dict:
    result = dict(base)
    for key, value in override.items():
        if (key in result and isinstance(result[key], dict)
                and isinstance(value, dict)):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

defaults = {"server": {"host": "localhost", "port": 8000}, "debug": False}
user = {"server": {"port": 9000}, "debug": True}
print(deep_merge(defaults, user))
# {'server': {'host': 'localhost', 'port': 9000}, 'debug': True}
Non-negotiables: recursive merge for nested dicts, override wins for scalars, base untouched.
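The "base untouched" requirement can be checked directly. The sketch below repeats the sample solution's deep_merge so it runs on its own, then compares base against a snapshot. It also points out a caveat of the shallow dict(base) copy: branches the override never touches are shared with base, so copy.deepcopy is one option if you need a fully independent result.

```python
import copy

def deep_merge(base: dict, override: dict) -> dict:
    # Same logic as the sample solution, repeated so this block runs alone.
    result = dict(base)
    for key, value in override.items():
        if (key in result and isinstance(result[key], dict)
                and isinstance(value, dict)):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

defaults = {"server": {"host": "localhost", "port": 8000}, "log": {"level": "info"}}
snapshot = copy.deepcopy(defaults)

merged = deep_merge(defaults, {"server": {"port": 9000}})
print(defaults == snapshot)  # True: merging did not modify base

# Caveat: branches the override never touched are shared, not copied.
print(merged["log"] is defaults["log"])  # True

# Deep-copying base first gives a fully independent result.
independent = deep_merge(copy.deepcopy(defaults), {"server": {"port": 9000}})
print(independent["log"] is defaults["log"])  # False
```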
Recap
3 min
Parsed JSON is just nested dicts and lists. Load with json.loads/load, dump with json.dumps(obj, indent=2, ensure_ascii=False). Reach into external data with chained .get(key, default) so missing keys never crash you. Validate at the boundary (a function returning a list of problems reports everything at once) and reshape nested documents into the flat structure your code (or a CSV) needs. JSON→validate→flatten→CSV is a workhorse pipeline; for big schemas, lean on pydantic/jsonschema later.
Vocabulary Card
- json.loads / dumps: Parse a JSON string into Python / serialise Python back to a JSON string.
- .get(key, default): Safe dict access that returns a fallback instead of raising.
- schema validation: Checking incoming data has the expected keys and types.
- flatten: Turning nested structures into flat records (e.g. for CSV).
Homework
4 min
Build jsontool.py with argparse subcommands: validate <file> (checks each record against a hard-coded schema and prints a report of problems), flatten <file> <out.csv> (turns a nested JSON array into a CSV), and pretty <file> (re-writes the file with 2-space indentation). Handle malformed JSON with a clear error rather than a crash.
Sample · jsontool.py (core)
import argparse
import csv
import json
import sys
from pathlib import Path

def read_json(path):
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        # report on stderr and exit non-zero instead of dumping a traceback
        print(f"Invalid JSON in {path}: {e}", file=sys.stderr)
        sys.exit(1)

def cmd_pretty(a):
    data = read_json(a.file)
    Path(a.file).write_text(
        json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    print("formatted", a.file)

def cmd_flatten(a):
    data = read_json(a.file)
    records = data if isinstance(data, list) else data.get("items", [])
    # column set = every key seen in any record
    fields = sorted({k for r in records for k in r})
    with open(a.out, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        w.writeheader()
        for r in records:
            # nested values are written as-is; flatten them first if needed
            w.writerow({k: r.get(k, "") for k in fields})
    print(f"wrote {len(records)} rows → {a.out}")

def cmd_validate(a):
    schema = {"id": int, "name": str}
    data = read_json(a.file)
    for i, r in enumerate(data):
        bad = [f for f, t in schema.items() if not isinstance(r.get(f), t)]
        if bad:
            print(f"record {i}: bad fields {bad}")

p = argparse.ArgumentParser()
sub = p.add_subparsers(dest="cmd", required=True)
for name, fn in [("pretty", cmd_pretty), ("validate", cmd_validate)]:
    sp = sub.add_parser(name)
    sp.add_argument("file")
    sp.set_defaults(func=fn)
fl = sub.add_parser("flatten")
fl.add_argument("file")
fl.add_argument("out")
fl.set_defaults(func=cmd_flatten)

args = p.parse_args()
args.func(args)
Non-negotiables: three subcommands, JSONDecodeError handling, flatten to CSV, schema validate report.
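For the "clear error rather than a crash" requirement, it may help to know that json.JSONDecodeError carries the position of the problem. The sketch below parses a deliberately malformed string (made up for illustration) and builds a message from the exception's msg, lineno and colno attributes.

```python
import json

broken = '{"name": "Aisha", "age": }'  # deliberately malformed

try:
    json.loads(broken)
except json.JSONDecodeError as e:
    # msg, lineno and colno are attributes of JSONDecodeError.
    message = f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}"
    print(message)
```

Pointing the user at a line and column is usually more helpful than a bare "invalid JSON", especially for large files.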