PY-L7-17 · JSON Automation: Validate & Transform

Learning Goals

3 min

By the end of this lesson you can:

Load and dump JSON files with json.load/json.dump (and the string forms).
Reach into nested structures safely with .get and defaults.
Validate that incoming JSON has the keys and types you expect.
Transform a nested document into a flat shape (e.g. ready for CSV).

Warm-Up · JSON ↔ Python

5 min

JSON maps almost perfectly onto Python types you already know:

JSON          Python
object {}  →  dict
array []   →  list
string     →  str
number     →  int / float
true/false →  True / False
null       →  None
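
A quick way to see the mapping is to parse a small literal and print the type of each value. The document below is made up for illustration and uses each JSON type once.

```python
import json

# A made-up JSON document that uses every JSON type once
text = (
    '{"name": "Aisha", "age": 31, "score": 9.5, '
    '"active": true, "tags": ["a", "b"], "manager": null}'
)
parsed = json.loads(text)

# Print each key next to the Python type it was parsed into
for key, value in parsed.items():
    print(f"{key:8} {type(value).__name__}")
```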

Today's big idea

Once parsed, JSON is just nested dicts and lists — everything you mastered in earlier levels. The two real skills are reaching in safely (data from the outside world is often missing keys) and reshaping it into the structure your program wants. Trust nothing; validate everything at the boundary.

New Concept · Load, Reach, Validate, Reshape

14 min

Loading and dumping

import json
from pathlib import Path

# from a file
data = json.loads(Path("config.json").read_text(encoding="utf-8"))

# back to a file, pretty-printed
Path("out.json").write_text(
    json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")

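json.loads(s) parses a string; json.load(file) reads from an open file object. The trailing s stands for "string", and json.dumps / json.dump follow the same pattern for writing.
json.dumps(obj, indent=2) makes readable output; ensure_ascii=False keeps emoji and accented characters as themselves instead of \uXXXX escapes.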
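
The file-object forms, json.load and json.dump, can be sketched as follows. The settings dict and the temporary file name are made up for illustration; a throwaway directory is used so the sketch does not touch real files.

```python
import json
import tempfile
from pathlib import Path

settings = {"title": "Café ☕", "retries": 3}  # made-up sample data

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.json"

    # json.dump writes straight to an open file object
    with open(path, "w", encoding="utf-8") as f:
        json.dump(settings, f, indent=2, ensure_ascii=False)

    # json.load reads straight from an open file object
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == settings)  # True: the data survived the round trip
```

Either style works; the Path.read_text / write_text version is shorter, while the file-object version avoids building the whole string in memory first.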

Reaching in safely

user = {
    "name": "Aisha",
    "address": {"city": "Kuala Lumpur"},
    "roles": ["admin", "editor"],
}

# risky: KeyError if 'address' or 'zip' is missing
# zip_code = user["address"]["zip"]

# safe: .get with defaults, no KeyError when a key is missing
city = user.get("address", {}).get("city", "unknown")
zip_code = user.get("address", {}).get("zip", "—")
first_role = (user.get("roles") or ["none"])[0]
print(city, zip_code, first_role)

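External JSON is unreliable: keys vanish and values turn out to be null. Chaining .get(key, default) walks the tree without raising KeyError, and the .get("address", {}) trick gives you an empty dict to keep chaining on. One caveat: the default applies only when the key is absent, so a key that is present but null still needs the (value or fallback) form used on the roles line.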
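
The or form on the roles line matters as soon as a value is null rather than missing. A minimal sketch, using a made-up record whose address is null and whose roles list is empty:

```python
user = {"name": "Ben", "address": None, "roles": []}

# The key exists, so the {} default is NOT used and we get None back
address = user.get("address", {})
print(address)  # None

# "or {}" swaps any falsy value (None, or a missing key) for an empty dict
city = (user.get("address") or {}).get("city", "unknown")
first_role = (user.get("roles") or ["none"])[0]
print(city, first_role)  # unknown none
```

Calling address.get("city") on the first result would raise AttributeError, which is exactly the crash the or form avoids.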

Validating the shape

def validate(record: dict) -> list[str]:
    errors = []
    if not isinstance(record.get("name"), str) or not record["name"]:
        errors.append("name must be a non-empty string")
    if not isinstance(record.get("age"), int):
        errors.append("age must be an integer")
    if not isinstance(record.get("emails"), list):
        errors.append("emails must be a list")
    return errors

bad = {"name": "", "age": "old", "emails": "a@b.com"}
print(validate(bad))
# ['name must be a non-empty string', 'age must be an integer', 'emails must be a list']

A validator that returns a list of problems (rather than raising at the first) lets you report everything wrong at once — far friendlier for whoever sends you the data. For big or shared schemas, libraries like pydantic (Level 8) or jsonschema automate this; the hand-rolled version is plenty for small jobs.
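
One detail the hand-rolled check glosses over: in Python, bool is a subclass of int, so a JSON true would pass the isinstance(..., int) test for age. A stricter check might look like the sketch below; is_int is a hypothetical helper name, not part of the lesson's validate function.

```python
def is_int(value) -> bool:
    # bool is a subclass of int, so rule it out explicitly
    return isinstance(value, int) and not isinstance(value, bool)

print(isinstance(True, int))                    # True: the surprise
print(is_int(True), is_int(42), is_int("42"))   # False True False
```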

Transforming / reshaping

# nested API response → flat rows for a CSV
api = {
    "users": [
        {"id": 1, "name": "Aisha", "address": {"city": "KL"}},
        {"id": 2, "name": "Ben",   "address": {"city": "London"}},
    ]
}

flat = [
    {"id": u["id"], "name": u["name"],
     "city": u.get("address", {}).get("city", "")}
    for u in api["users"]
]
print(flat)
# [{'id': 1, 'name': 'Aisha', 'city': 'KL'}, {'id': 2, ...}]

Flattening nested JSON into a list of flat dicts is the bridge to Lesson 15's DictWriter — JSON in, CSV out. The comprehension reads as "for each user, build the flat record I want."
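
To make the bridge concrete, here is a minimal sketch that feeds the flat rows to csv.DictWriter. It writes to an in-memory io.StringIO buffer as a stand-in for an open file, which is an assumption made so the example runs without touching disk.

```python
import csv
import io

# The flat rows produced by the comprehension above
flat = [
    {"id": 1, "name": "Aisha", "city": "KL"},
    {"id": 2, "name": "Ben", "city": "London"},
]

buffer = io.StringIO()  # stands in for open("users.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "city"])
writer.writeheader()
writer.writerows(flat)

csv_text = buffer.getvalue()
print(csv_text)
```

Because every row has exactly the keys named in fieldnames, DictWriter needs no extra options here.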

Worked Example · API Dump → Clean CSV

12 min

Goal: take a messy nested JSON export, validate each record, flatten the good ones to CSV, and log the bad ones — a realistic "clean the export" job.

import json, csv, logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("json2csv")

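def flatten(user: dict) -> dict:
    # "or" fallbacks cover keys that are present but null, not just missing
    contact = user.get("contact") or {}
    address = user.get("address") or {}
    return {
        "id":    user.get("id", ""),
        "name":  (user.get("name") or "").strip(),
        "email": (contact.get("email") or "").lower(),
        "city":  address.get("city", ""),
        "roles": "|".join(user.get("roles") or []),   # list → pipe-joined string
    }

def is_valid(user: dict) -> bool:
    # a record needs a truthy id and an email that at least contains "@"
    contact = user.get("contact") or {}
    return bool(user.get("id")) and "@" in (contact.get("email") or "")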

raw = json.loads(Path("export.json").read_text(encoding="utf-8"))
users = raw.get("users", [])

good, bad = [], 0
for u in users:
    if is_valid(u):
        good.append(flatten(u))
    else:
        bad += 1
        log.warning("skipping invalid record id=%s", u.get("id", "?"))

with open("users.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["id", "name", "email", "city", "roles"])
    w.writeheader(); w.writerows(good)

log.info("wrote %d users, skipped %d invalid", len(good), bad)

WARNING skipping invalid record id=7
WARNING skipping invalid record id=?
INFO wrote 48 users, skipped 2 invalid

Read the code

Two small functions carry the logic: flatten reshapes one nested user into a flat row (note roles, a list, becomes a pipe-joined string for CSV), and is_valid guards the boundary. Every reach goes through .get with a fallback, so a missing key never crashes the batch and one bad record doesn't sink the other 48. This JSON→validate→flatten→CSV pipeline is one of the most common automations you'll ever write.

Try It Yourself

13 min

01 🟢 Round-trip

Load a JSON file, add a new top-level key (e.g. "processed_at" with the current ISO timestamp), and write it back pretty-printed. Confirm the file is still valid JSON.

02 🟡 Safe deep-get

Write deep_get(data, *keys, default=None) that walks a chain of keys through nested dicts and returns the default if any link is missing. deep_get(d, "a", "b", "c") ≈ d["a"]["b"]["c"] but never crashes.

Hint

def deep_get(data, *keys, default=None):
    for key in keys:
        if isinstance(data, dict) and key in data:
            data = data[key]
        else:
            return default
    return data

print(deep_get(user, "address", "zip", default="—"))

03 🔴 Schema checker

Write check(record, schema) where schema is a dict mapping field → expected type (e.g. {"name": str, "age": int}). Return a list of mismatch messages. Test it on good and bad records.

Hint

def check(record, schema):
    errors = []
    for field, expected in schema.items():
        value = record.get(field)
        if not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(value).__name__}")
    return errors

print(check({"name": "A", "age": "x"}, {"name": str, "age": int}))

Mini-Challenge · The Config Merger

8 min

Write deep_merge(base, override) that merges two nested JSON-like dicts: values in override win, but nested dicts merge recursively rather than replacing wholesale. This is how layered config (defaults + user overrides) works in real apps.

Show a sample solution

def deep_merge(base: dict, override: dict) -> dict:
    result = dict(base)
    for key, value in override.items():
        if (key in result and isinstance(result[key], dict)
                and isinstance(value, dict)):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

defaults = {"server": {"host": "localhost", "port": 8000}, "debug": False}
user     = {"server": {"port": 9000}, "debug": True}
print(deep_merge(defaults, user))
# {'server': {'host': 'localhost', 'port': 9000}, 'debug': True}

Non-negotiables: recursive merge for nested dicts, override wins for scalars, base untouched.
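
The "base untouched" requirement can be checked by snapshotting the base before merging. The sketch below repeats the sample solution so it runs on its own; the names overrides, snapshot and merged are introduced here for illustration.

```python
import copy

def deep_merge(base: dict, override: dict) -> dict:
    # same logic as the sample solution above
    result = dict(base)
    for key, value in override.items():
        if (key in result and isinstance(result[key], dict)
                and isinstance(value, dict)):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

defaults = {"server": {"host": "localhost", "port": 8000}, "debug": False}
overrides = {"server": {"port": 9000}, "debug": True}

snapshot = copy.deepcopy(defaults)   # independent copy to compare against
merged = deep_merge(defaults, overrides)

print(merged)
print(defaults == snapshot)  # True: base was not modified
```

One thing to keep in mind: nested dicts that override does not touch are shared between base and the result, so mutating the result later could still reach into base. Starting from copy.deepcopy(base) instead of dict(base) would avoid that, at the cost of copying everything.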

Recap

3 min

Parsed JSON is just nested dicts and lists. Load with json.loads/load, dump with json.dumps(obj, indent=2, ensure_ascii=False). Reach into external data with chained .get(key, default) so missing keys never crash you. Validate at the boundary — a function returning a list of problems reports everything at once — and reshape nested documents into the flat structure your code (or a CSV) needs. JSON→validate→flatten→CSV is a workhorse pipeline; for big schemas, lean on pydantic/jsonschema later.

Vocabulary Card

json.loads / dumps: Parse a JSON string into Python / serialise Python back to a JSON string.
.get(key, default): Safe dict access that returns a fallback instead of raising.
schema validation: Checking incoming data has the expected keys and types.
flatten: Turning nested structures into flat records (e.g. for CSV).

Homework

4 min

Build jsontool.py with argparse subcommands: validate <file> (checks each record against a hard-coded schema and prints a report of problems), flatten <file> <out.csv> (turns a nested JSON array into a CSV), and pretty <file> (re-writes the file with 2-space indentation). Handle malformed JSON with a clear error rather than a crash.

Sample · jsontool.py (core)

import argparse, json, csv, sys
from pathlib import Path

def read_json(path):
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        print(f"Invalid JSON in {path}: {e}", file=sys.stderr)  # errors go to stderr
        sys.exit(1)

def cmd_pretty(a):
    data = read_json(a.file)
    Path(a.file).write_text(
        json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    print("formatted", a.file)

def cmd_flatten(a):
    data = read_json(a.file)
    records = data if isinstance(data, list) else data.get("items", [])
    fields = sorted({k for r in records for k in r})
    with open(a.out, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        w.writeheader()
        for r in records:
            w.writerow({k: r.get(k, "") for k in fields})
    print(f"wrote {len(records)} rows → {a.out}")

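def cmd_validate(a):
    schema = {"id": int, "name": str}
    data = read_json(a.file)
    # accept a bare array or an object wrapping one under "items", as cmd_flatten does
    records = data if isinstance(data, list) else data.get("items", [])
    for i, r in enumerate(records):
        bad = [f for f, t in schema.items()
               if not isinstance(r.get(f), t)]
        if bad:
            print(f"record {i}: bad fields {bad}")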

p = argparse.ArgumentParser(); sub = p.add_subparsers(dest="cmd", required=True)
for name, fn in [("pretty", cmd_pretty), ("validate", cmd_validate)]:
    sp = sub.add_parser(name); sp.add_argument("file"); sp.set_defaults(func=fn)
fl = sub.add_parser("flatten"); fl.add_argument("file"); fl.add_argument("out")
fl.set_defaults(func=cmd_flatten)
args = p.parse_args(); args.func(args)

Non-negotiables: three subcommands, JSONDecodeError handling, flatten to CSV, schema validate report.
