PY-L4-46 · Statistics & Datetime Deep Dive

Learning Goals

3 min

Compute mean, median, mode, stdev, variance, percentiles.
Spot when median tells a different story than mean (outliers!).
Resample time series with df.resample (D, W, ME).
Compute rolling and expanding statistics.

Warm-Up · Mean vs Median

5 min

import statistics
salaries = [3500, 3800, 3900, 4000, 4200, 50000]
print(statistics.mean(salaries))     # ≈ 11_566.67  ← dragged up by one big number
print(statistics.median(salaries))   # 3_950.0      ← reflects most people

Use median when outliers exist. Use mean when the data is symmetric and outliers are rare.
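To see how much a single value moves each statistic, the sketch below recomputes both with the top salary dropped. The `trimmed` list is introduced here purely for illustration.

```python
import statistics

salaries = [3500, 3800, 3900, 4000, 4200, 50000]
trimmed = salaries[:-1]  # hypothetical: the same list without the 50000 value

# How far does each statistic move when the single large value is included?
mean_shift = statistics.mean(salaries) - statistics.mean(trimmed)
median_shift = statistics.median(salaries) - statistics.median(trimmed)

print(f"mean moves by   {mean_shift:.2f}")
print(f"median moves by {median_shift:.2f}")
```

The mean shifts by thousands while the median shifts by 50, which is why the median is the safer summary here.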

Today's big idea

Each summary statistic answers a slightly different question. Pick the one that matches what you actually want to know — and tell your reader why.

New Concept · Stats Toolkit

14 min

statistics module (stdlib)

import statistics as st

st.mean(x)         # arithmetic average
st.median(x)       # middle value
st.mode(x)         # most-common value
st.stdev(x)        # sample standard deviation
st.variance(x)     # sample variance
st.quantiles(x, n=4)   # [Q1, Q2, Q3]
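The calls above can be tried on a small list. This is a minimal sketch with made-up data; `x` is an assumption, not a dataset from the lesson.

```python
import statistics as st

x = [2, 4, 4, 5, 7, 9]  # made-up sample data

print(st.mean(x))
print(st.median(x))   # 4.5, the average of the two middle values
print(st.mode(x))     # 4, the only value that appears twice

# n=4 returns three cut points that split the data into four groups.
q1, q2, q3 = st.quantiles(x, n=4)
print(q1, q2, q3)

# stdev/variance divide by n - 1 (sample);
# pstdev/pvariance divide by n (population), so they come out smaller.
print(st.stdev(x), st.pstdev(x))
```

Note that `st.quantiles` uses the "exclusive" method by default, so Q1 and Q3 may differ slightly from what pandas reports for the same data.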

pandas conveniences

s = df["score"]
s.mean(), s.median(), s.std(), s.var()
s.quantile(0.5)        # = median
s.quantile([0.25, 0.5, 0.75])

s.skew()               # asymmetry
s.kurtosis()           # tail heaviness

s.value_counts()        # frequency table
s.value_counts(bins=5)  # auto-binned histogram counts
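A self-contained version of the same calls, assuming a small made-up `score` Series in place of `df["score"]`:

```python
import statistics as st
import pandas as pd

s = pd.Series([55, 60, 65, 70, 70, 75, 80, 95], name="score")  # made-up scores

print(s.quantile([0.25, 0.5, 0.75]))   # Q1, median, Q3 in one call
print(s.value_counts().head(3))        # most frequent values first

# bins=5 cuts the value range into five equal-width intervals and counts each.
binned = s.value_counts(bins=5, sort=False)
print(binned)

# pandas .std() defaults to the sample formula (ddof=1), like statistics.stdev.
same_std = abs(s.std() - st.stdev(s.tolist())) < 1e-9
print(same_std)
```

Passing `sort=False` keeps the bins in value order rather than ordering them by count, which reads more like a histogram.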

Outlier detection — z-score

mu, sigma = s.mean(), s.std()
z = (s - mu) / sigma
outliers = s[z.abs() > 3]
print(outliers)

An absolute z-score above 3 (|z| > 3) is the classic "more than 3 standard deviations from the mean" rule. Use it as a starting point, not a verdict: domain knowledge matters.
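One reason the threshold is only a starting point: in a very small sample, a single extreme value inflates the standard deviation so much that its own z-score stays modest. The sketch below reuses the warm-up salaries to illustrate this; the names `flag_3` and `flag_2` are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([3500, 3800, 3900, 4000, 4200, 50000])  # warm-up salaries

z = (s - s.mean()) / s.std()   # s.std() is the sample std (ddof=1)
print(z.round(2))

# With n = 6 values, |z| can never exceed (n - 1) / sqrt(n), about 2.04,
# so the 3-sigma rule cannot flag anything in a sample this small.
flag_3 = s[z.abs() > 3]
flag_2 = s[z.abs() > 2]
print(len(flag_3), flag_2.tolist())
```

Here the 50000 salary is obviously unusual, yet only the looser |z| > 2 cut-off catches it.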

Time-series resample

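df = df.set_index("date")             # "date" must be datetime dtype; resample needs a DatetimeIndex
df["sales"].resample("W").sum()       # weekly total (weeks end on Sunday by default)
df["sales"].resample("ME").mean()     # monthly mean (Month End; pandas < 2.2 uses "M")
df["sales"].resample("QE").max()      # quarterly max (Quarter End; pandas < 2.2 uses "Q")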
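To try resampling without a CSV, the sketch below builds two weeks of made-up daily sales with the dates stored as strings, converts them, and sums by week. The data and column values are assumptions for illustration.

```python
import pandas as pd

# Made-up data: 14 days starting Monday 2024-01-01, dates stored as text.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D").strftime("%Y-%m-%d"),
    "sales": list(range(1, 15)),
})

df["date"] = pd.to_datetime(df["date"])   # strings -> datetime64
df = df.set_index("date")                 # resample needs a DatetimeIndex

weekly = df["sales"].resample("W").sum()  # each bin is labelled by its Sunday
print(weekly)
```

Each weekly row is labelled with the last day of its bin, so the first label is 2024-01-07 rather than 2024-01-01.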

Rolling and expanding

df["sales"].rolling(window=7).mean()        # 7-row moving average (7 days if the data is daily)
df["sales"].rolling(window=7).std()         # 7-row rolling std
df["sales"].expanding().mean()              # running cumulative mean
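A tiny example makes the difference between the two visible. This sketch uses a made-up five-value Series and a window of 3 so the results are easy to check by hand; `min_periods` is shown as one way to avoid the leading gaps.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values

# Fixed-size window: the first two rows have fewer than 3 values, so they are NaN.
roll = s.rolling(window=3).mean()                          # NaN, NaN, 2.0, 3.0, 4.0

# min_periods=1 reports a result as soon as one value is available.
roll_partial = s.rolling(window=3, min_periods=1).mean()   # 1.0, 1.5, 2.0, 3.0, 4.0

# Expanding window: every row uses all values seen so far.
running = s.expanding().mean()                             # 1.0, 1.5, 2.0, 2.5, 3.0

print(pd.DataFrame({"roll": roll, "roll_partial": roll_partial, "running": running}))
```

The rolling column follows recent values, while the expanding column settles toward the overall mean as more rows accumulate.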

Worked Example · A Stats Report

12 min

import pandas as pd

df = pd.read_csv("clean.csv", parse_dates=["date"])
df["total"] = df["quantity"] * df["price"]
s = df["total"]

print(f"n        : {s.count()}")
print(f"mean     : {s.mean():.2f}")
print(f"median   : {s.median():.2f}")
print(f"stdev    : {s.std():.2f}")
print(f"p25/p75  : {s.quantile(0.25):.2f}, {s.quantile(0.75):.2f}")
print(f"min/max  : {s.min():.2f}, {s.max():.2f}")
print(f"skew     : {s.skew():.2f}")

# Outliers
z = (s - s.mean()) / s.std()
print(f"\noutliers (|z| > 2): {(z.abs() > 2).sum()} rows")
print(df.loc[z.abs() > 2, ["date", "customer", "product", "total"]])

# Weekly resample ("W-MON": each week ends on a Monday and is labelled by that Monday)
weekly = df.set_index("date")["total"].resample("W-MON").sum()
print("\nweekly revenue:")
print(weekly)

Read the diff

Seven stats answer most questions about distribution. The outlier list shows specific rows, not just a count. The weekly resample turns daily noise into a story. PCED tasks fit this exact shape — given a series, report.
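The `"W-MON"` anchor in the worked example is easy to misread as "weeks starting Monday". The sketch below, on made-up daily totals, shows that each bin ends on a Monday and includes it, compared with the default `"W"` which ends on Sunday.

```python
import pandas as pd

# Made-up daily totals: 14 days starting Monday 2024-01-01.
idx = pd.date_range("2024-01-01", periods=14, freq="D")
total = pd.Series(range(1, 15), index=idx)

by_sun = total.resample("W").sum()       # bins end on Sunday (the default)
by_mon = total.resample("W-MON").sum()   # bins end on Monday

print(by_sun)
print(by_mon)
```

With `"W-MON"`, the first Monday forms a one-day bin by itself, so the same 14 days produce three rows instead of two. If the report should show Monday-to-Sunday weeks, plain `"W"` is the closer fit.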

Try It Yourself

13 min

01 🟢 Six numbers

Print n, mean, median, stdev, min, max for any numeric column.

02 🟡 Find outliers

Print rows where the value is more than 2 stdev from the mean.

03 🔴 Compare monthly to weekly

For a long enough series, plot weekly sum vs monthly sum. Comment on what each emphasises.

Mini-Challenge · Z-Score by Group

8 min

For each product, compute the per-row z-score of total within that product. Show the top 3 highest z-scores overall.

Show one possible solution

def z(s):
    return (s - s.mean()) / s.std()
df["z"] = df.groupby("product")["total"].transform(z)
print(df.nlargest(3, "z")[["date", "product", "total", "z"]])

transform applies a function within each group and aligns back to the original index — that's how you write per-group z-scores cleanly.
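To see why the per-group version matters, this sketch compares a global z-score with a per-product one on a made-up six-row frame. The `z_global` and `z_group` column names are introduced here for illustration.

```python
import pandas as pd

# Made-up data: product B sells at roughly ten times the price of product A.
df = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B", "B"],
    "total":   [10, 20, 30, 100, 100, 400],
})

def z(s):
    return (s - s.mean()) / s.std()

df["z_global"] = z(df["total"])                              # one mean/std for all rows
df["z_group"] = df.groupby("product")["total"].transform(z)  # mean/std per product
print(df)
```

The A row with total 30 is the highest within its product (z_group of 1.0) but sits below the overall mean, so its global z-score is negative. The result of `transform` has the same length and index as `df`, which is what lets it be assigned straight back as a column.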

Recap

3 min

Mean for symmetric, median for skewed; std + quantiles describe spread; z-score finds extremes. resample rolls daily series into weekly / monthly. rolling + expanding show local and cumulative trends. The PCED exam is mostly "given a Series, here are eight questions" — practice fluency.

Homework

4 min

Take your real CSV. Produce a one-page Markdown stats report: n, mean, median, stdev, percentiles, three outliers (if any), a weekly resample plot. This is the kind of artefact PCED tasks ask for.


