PY-L4-32 · Challenge — Investigate a Real Dataset

Goals

3 min

Pick a real dataset (data.gov.my, World Bank, Our World in Data, Kaggle).
Run the full pipeline: load, profile, clean, transform, answer.
Answer five specific questions with code and a one-line interpretation.
Commit your notebook / script so a teammate can re-run it.

Pick Your Dataset

5 min

Good places to find a small, clean-enough CSV:

data.gov.my — official Malaysian open data. Try the daily cases CSV, or the Bank Negara rates.
ourworldindata.org — global CSV exports of every chart.
kaggle.com/datasets — "Malaysia", "ASEAN", or any topic you care about.

Pick something with at least 1000 rows, 5+ columns, and at least one date column. Save the URL — your load step will use pd.read_csv(URL).
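Once the file loads, the three selection criteria can be checked in code rather than by eye. The sketch below is one possible way to do it. The helper name check_dataset and the demo columns are made up for illustration, and the demo frame is synthetic, standing in for your real download.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_dataset(df: pd.DataFrame, min_rows: int = 1000, min_cols: int = 5) -> dict:
    """Return a pass/fail flag for each selection criterion (hypothetical helper)."""
    # A column only counts as a date column if pandas parsed it as datetime.
    date_cols = [c for c in df.columns if is_datetime64_any_dtype(df[c])]
    return {
        "enough_rows": len(df) >= min_rows,
        "enough_columns": df.shape[1] >= min_cols,
        "has_date_column": len(date_cols) > 0,
    }


# Synthetic stand-in for a real download (column names are assumptions).
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=1200, freq="D"),
    "state": ["A", "B", "C"] * 400,
    "cases_new": range(1200),
    "cases_recovered": range(1200),
    "tests": range(1200),
})
print(check_dataset(demo))
```

If has_date_column comes back False on your own file, the usual cause is that the date column was read as text; pass parse_dates=[...] to pd.read_csv or convert it with pd.to_datetime.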

Today's big idea

Real data analysis is iterative. Run, look, refine. Don't plan five steps ahead — load, look at head(), write the next line.

The Five Questions

14 min

Pick five from this menu (or adapt):

How many rows, columns, missing values? What's the date range?
What's the all-time average and standard deviation of a key numeric column?
Which group (state, category, year) has the highest average / total / count?
What's the 7-day rolling average of the main metric? Plot it or print the last 10 values.
Top 5 and bottom 5 values overall.
Compare two slices (e.g., this year vs last). Biggest movers?
Are there outliers? Define one (e.g., > mean + 3×std) and list them.
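The outlier question is the one item on the menu that the skeleton does not cover, so here is a minimal sketch of the mean + 3×std rule. The function name find_outliers and the synthetic data are assumptions for illustration; adapt the column name and the multiplier k to your dataset.

```python
import pandas as pd


def find_outliers(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Rows whose value in `column` exceeds mean + k * std (hypothetical helper)."""
    threshold = df[column].mean() + k * df[column].std()
    return df[df[column] > threshold]


# Synthetic data: 99 ordinary days and one spike.
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "cases_new": [10] * 99 + [500],
})
outliers = find_outliers(demo, "cases_new")
print(outliers)
```

Keep in mind that extreme values inflate the mean and the standard deviation they are measured against, so on a very short series even an obvious spike may not cross the threshold.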

Skeleton — fill in the blanks

# investigate.py
import pandas as pd

URL = "https://example.com/your.csv"
df = pd.read_csv(URL, parse_dates=["date"])    # <-- adapt

# Q1 — shape, missing
print(f"shape   : {df.shape}")
print(f"missing : {df.isna().sum().sum()}")
print(f"date    : {df['date'].min().date()} → {df['date'].max().date()}")

# Q2 — mean + std
metric = "cases_new"  # <-- adapt
print(f"\nmean {metric}: {df[metric].mean():.2f}  std: {df[metric].std():.2f}")

# Q3 — top group
group = "state"  # <-- adapt
print(df.groupby(group)[metric].mean().sort_values(ascending=False).head(5))

# Q4 — rolling
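# rolling(7) counts rows, not days: collapse to one row per date first
daily = df.groupby("date")[metric].sum()       # <-- adapt (use .mean() for rates)
roll = daily.rolling(7).mean()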
print("\n7-day rolling, last 10:")
print(roll.tail(10).round(1))

# Q5 — top/bottom
print("\ntop 5:")
print(df.nlargest(5, metric)[[metric, "date", group]])
print("\nbottom 5 (non-zero):")
print(df[df[metric] > 0].nsmallest(5, metric)[[metric, "date", group]])
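A note on the rolling step: rolling(7) averages the last seven rows, which equals seven days only when the series has exactly one row per day and no gaps. The sketch below, on a small synthetic series with a one-week gap, contrasts a row-count window with a time-based window. The dates and values are invented for illustration.

```python
import pandas as pd

# Synthetic series: three consecutive days, then a one-week gap.
dates = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-10"])
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=dates)

# Row-count window: always the last 3 rows, however far apart they are.
by_rows = s.rolling(3).mean()

# Time-based window: only rows within the last 3 calendar days.
# Requires a sorted datetime index.
by_days = s.rolling("3D").mean()

print(pd.DataFrame({"value": s, "last_3_rows": by_rows, "last_3_days": by_days}))
```

On the final date the row-count window still averages in values from a week earlier, while the time-based window sees only that day's value. Which behaviour you want depends on the question, so check for gaps before picking one.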

Worked Example — data.gov.my Daily Cases

12 min

(The skeleton above, applied to one specific file. If the URL is gone by the time you read this, swap in any time-series CSV with a date and a numeric metric.)

import pandas as pd

URL = "https://storage.data.gov.my/healthcare/covid_cases.csv"
df = pd.read_csv(URL, parse_dates=["date"])

# Q1
print("shape:", df.shape)
print("missing:")
print(df.isna().sum())

# Q2
print(f"\nmean cases_new: {df['cases_new'].mean():.1f}  "
      f"std: {df['cases_new'].std():.1f}")

# Q3 — by state? if the file is national-only, by year
df["year"] = df["date"].dt.year
print(df.groupby("year")["cases_new"].sum().sort_values(ascending=False))

# Q4 — rolling
roll = df.set_index("date")["cases_new"].sort_index().rolling(7).mean()
print("\n7-day rolling (last 10):")
print(roll.tail(10).round(0))

# Q5 — worst days
print("\nworst 5 days:")
print(df.nlargest(5, "cases_new")[["date", "cases_new"]])

# Q6 — biggest movers year-on-year
yoy = (df.groupby("year")["cases_new"].sum().reset_index()
         .sort_values("year"))
yoy["delta"] = yoy["cases_new"].diff()
print("\nyear-on-year change:")
print(yoy)

Your Turn

13 min

Now do it for your dataset. Pace:

5 min — find the dataset; load it; shape + head.
10 min — clean (dropna, dtypes, dedupe).
30 min — answer the five questions in code.
10 min — write a one-sentence interpretation per question.
5 min — save the notebook / commit the script.
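The cleaning step (dropna, dtypes, dedupe) could look like the sketch below. The helper name clean and the messy input are made up for illustration, and dropping every row with an unparseable date or metric is an assumption; for your data, filling or flagging may be the better call.

```python
import pandas as pd


def clean(df: pd.DataFrame, date_col: str, metric: str) -> pd.DataFrame:
    """Fix dtypes, drop unusable rows, remove exact duplicates (hypothetical helper)."""
    out = df.copy()
    # dtypes: unparseable entries become NaT / NaN instead of raising an error
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    out[metric] = pd.to_numeric(out[metric], errors="coerce")
    # dropna: keep only rows that have both a date and a metric value
    out = out.dropna(subset=[date_col, metric])
    # dedupe: remove rows that are identical in every column
    out = out.drop_duplicates()
    return out.sort_values(date_col).reset_index(drop=True)


# Synthetic messy input: a duplicate row, a bad date, a bad number.
raw = pd.DataFrame({
    "date": ["2023-01-02", "2023-01-01", "2023-01-01", "not a date", "2023-01-03"],
    "cases_new": ["5", "3", "3", "7", "n/a"],
})
tidy = clean(raw, "date", "cases_new")
print(tidy)
print(tidy.dtypes)
```

Print len(raw) and len(tidy) side by side when you run this on real data, so you know how many rows the cleaning removed and can mention it in your interpretation.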

Stretch · Reproducible Report

8 min

Put the script and a README in a folder. The README should say where the data came from, what assumptions you made, and how to re-run the script. Include the date you ran it. This is what makes data work shareable.
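If you would rather generate the README than type it, a sketch along these lines writes the four required pieces: source, assumptions, re-run command, and run date. The function name write_readme, the section titles, and the example assumptions are all illustrative choices, not a required format.

```python
import tempfile
from datetime import date
from pathlib import Path


def write_readme(folder: str, source_url: str, assumptions: list[str], run_command: str) -> Path:
    """Write a minimal README.md into `folder` and return its path (hypothetical helper)."""
    lines = [
        "# Investigation",
        "",
        f"Data source: {source_url}",
        f"Date run: {date.today().isoformat()}",
        "",
        "## Assumptions",
        *[f"- {a}" for a in assumptions],
        "",
        "## How to re-run",
        f"    {run_command}",
        "",
    ]
    path = Path(folder) / "README.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


# Demo in a temporary folder; in real use, pass your project folder instead.
with tempfile.TemporaryDirectory() as tmp:
    readme = write_readme(
        tmp,
        "https://example.com/your.csv",
        ["Rows with a missing date were dropped.", "Duplicate rows were removed."],
        "python investigate.py",
    )
    text = readme.read_text(encoding="utf-8")
    print(text)
```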

Recap

3 min

You ran the full pandas pipeline on real data. Five questions, five answers, one interpretation each. This is the work data analysts and junior data scientists do every day. Next week we make it look good with matplotlib.

Homework · The Report

4 min

Write a one-page markdown file with the title "Five things I learned about <dataset>". Embed your code and the result for each question. Submit the markdown + the .py file.
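One way to embed code and results without copy-pasting from the terminal is to build each report section as a string. The sketch below is only a suggestion: the helper report_section, the section layout, and the three-row demo frame are assumptions, and writing the report by hand is just as acceptable.

```python
import pandas as pd

# Markdown code-fence marker, built from single backticks.
FENCE = "`" * 3


def report_section(question: str, code: str, result: str, interpretation: str) -> str:
    """Return one markdown section: heading, code, output, one-line interpretation."""
    return "\n".join([
        f"## {question}",
        "",
        f"{FENCE}python",
        code,
        FENCE,
        "",
        FENCE,
        result,
        FENCE,
        "",
        interpretation,
        "",
    ])


# Synthetic data standing in for your dataset.
demo = pd.DataFrame({"state": ["A", "A", "B"], "cases_new": [10, 20, 5]})
code = 'df.groupby("state")["cases_new"].mean()'
# to_string() gives the same text that print() would show.
result = demo.groupby("state")["cases_new"].mean().to_string()

section = report_section(
    "Q3: which state has the highest average?",
    code,
    result,
    "State A averages 15 new cases a day, three times state B.",
)
print(section)
```

Join five such sections under your title line and write the result to a .md file with pathlib.Path.write_text.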
