Goals
3 min
- Pick a real dataset (data.gov.my, World Bank, OWID, Kaggle).
- Run the full pipeline: load, profile, clean, transform, answer.
- Answer five specific questions with code and a one-line interpretation.
- Commit your notebook / script so a teammate can re-run it.
Pick Your Dataset
5 min
Good places to find a small, clean-enough CSV:
- data.gov.my — official Malaysian open data. Try the daily cases CSV, or the Bank Negara rates.
- ourworldindata.org — global CSV exports of every chart.
- kaggle.com/datasets — "Malaysia", "ASEAN", or any topic you care about.
Pick something with at least 1000 rows, 5+ columns, and at least one date column. Save the URL — your load step will use pd.read_csv(URL).
Real data analysis is iterative. Run, look, refine. Don't plan five steps ahead — load, look at head(), write the next line.
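The checklist above (at least 1000 rows, 5+ columns, one date column) can be checked right after loading. A minimal sketch, assuming a hypothetical helper named `meets_criteria` and a synthetic frame standing in for your CSV; the column names are illustrative only:

```python
import pandas as pd


def meets_criteria(df, min_rows=1000, min_cols=5):
    """Check a loaded frame against the dataset checklist (hypothetical helper)."""
    # Columns that pandas parsed as datetimes (e.g. via parse_dates=[...])
    date_cols = [c for c in df.columns
                 if pd.api.types.is_datetime64_any_dtype(df[c])]
    return {
        "rows_ok": len(df) >= min_rows,
        "cols_ok": df.shape[1] >= min_cols,
        "date_cols": date_cols,
    }


# Synthetic stand-in for pd.read_csv(URL, parse_dates=["date"])
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=1200, freq="D"),
    "state": ["A", "B", "C"] * 400,
    "cases_new": range(1200),
    "cases_recovered": range(1200),
    "deaths_new": [0] * 1200,
})

print(df.head())            # look first, then write the next line
print(meets_criteria(df))
```

If `date_cols` comes back empty on your own file, the date column was probably read as text; passing its name to `parse_dates` in `pd.read_csv` is the usual fix.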
The Five Questions
14 min
Pick five from this menu (or adapt):
- How many rows, columns, missing values? What's the date range?
- What's the all-time average and standard deviation of a key numeric column?
- Which group (state, category, year) has the highest average / total / count?
- What's the 7-day rolling average of the main metric? Plot or print last 10 values.
- Top 5 and bottom 5 values overall.
- Compare two slices (e.g., this year vs last). Biggest movers?
- Are there outliers? Define one (e.g., > mean + 3×std) and list them.
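The outlier question is the one item on this menu that needs an explicit definition before any code. A minimal sketch of the mean + 3×std rule, assuming a synthetic series with one obvious spike; `cases_new` is only an example column name:

```python
import pandas as pd

# Synthetic metric: 29 ordinary days and one spike (illustrative values only)
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=30, freq="D"),
    "cases_new": [100] * 29 + [1000],
})

metric = "cases_new"
# Outlier rule from the menu: anything above mean + 3 * std
threshold = df[metric].mean() + 3 * df[metric].std()
outliers = df[df[metric] > threshold]

print(f"threshold: {threshold:.1f}")
print(outliers)
```

The rule is a convention, not a law: on heavily skewed data it may flag nothing or flag too much, so state the definition you chose in your interpretation.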
Skeleton — fill in the blanks
    # investigate.py
    import pandas as pd

    URL = "https://example.com/your.csv"
    df = pd.read_csv(URL, parse_dates=["date"])  # <-- adapt

    # Q1 — shape, missing
    print(f"shape   : {df.shape}")
    print(f"missing : {df.isna().sum().sum()}")
    print(f"date    : {df['date'].min().date()} → {df['date'].max().date()}")

    # Q2 — mean + std
    metric = "cases_new"  # <-- adapt
    print(f"\nmean {metric}: {df[metric].mean():.2f}  std: {df[metric].std():.2f}")

    # Q3 — top group
    group = "state"  # <-- adapt
    print(df.groupby(group)[metric].mean().sort_values(ascending=False).head(5))

    # Q4 — rolling
    # Collapse to one value per date first; otherwise rolling(7) would average
    # 7 rows of mixed groups rather than 7 dates. Use .mean() if a sum makes no sense.
    daily = df.groupby("date")[metric].sum()
    roll = daily.rolling(7).mean()
    print("\n7-day rolling, last 10:")
    print(roll.tail(10).round(1))

    # Q5 — top/bottom
    print("\ntop 5:")
    print(df.nlargest(5, metric)[[metric, "date", group]])
    print("\nbottom 5 (non-zero):")
    print(df[df[metric] > 0].nsmallest(5, metric)[[metric, "date", group]])
Worked Example — data.gov.my Daily Cases
12 min
(The skeleton above, applied to one specific file. If the URL is gone by the time you read this, swap in any time-series CSV with a date and a numeric metric.)
    import pandas as pd

    URL = "https://storage.data.gov.my/healthcare/covid_cases.csv"
    df = pd.read_csv(URL, parse_dates=["date"])

    # Q1
    print("shape:", df.shape)
    print("missing:")
    print(df.isna().sum())

    # Q2
    print(f"\nmean cases_new: {df['cases_new'].mean():.1f}  "
          f"std: {df['cases_new'].std():.1f}")

    # Q3 — by state? if the file is national-only, by year
    df["year"] = df["date"].dt.year
    print(df.groupby("year")["cases_new"].sum().sort_values(ascending=False))

    # Q4 — rolling
    roll = df.set_index("date")["cases_new"].sort_index().rolling(7).mean()
    print("\n7-day rolling (last 10):")
    print(roll.tail(10).round(0))

    # Q5 — worst days
    print("\nworst 5 days:")
    print(df.nlargest(5, "cases_new")[["date", "cases_new"]])

    # Q6 — biggest movers year-on-year
    yoy = (df.groupby("year")["cases_new"].sum().reset_index()
             .sort_values("year"))
    yoy["delta"] = yoy["cases_new"].diff()
    print("\nyear-on-year change:")
    print(yoy)
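One caveat on the rolling step above: `rolling(7)` averages the last seven rows, which equals seven days only when the data has exactly one row per day and no gaps. A sketch of a time-based window for comparison, assuming a synthetic series with one missing day; `rolling("7D")` needs a sorted datetime index:

```python
import pandas as pd

dates = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03",
                        "2023-01-05", "2023-01-06", "2023-01-07",
                        "2023-01-08", "2023-01-09"])  # 2023-01-04 is missing
s = pd.Series([10, 20, 30, 50, 60, 70, 80, 90], index=dates, name="cases_new")

by_rows = s.rolling(7).mean()      # last 7 rows, whatever dates they cover
by_days = s.rolling("7D").mean()   # rows that fall within the last 7 calendar days

print(pd.DataFrame({"by_rows": by_rows, "by_days": by_days}))
```

The two columns diverge wherever the gap falls inside the window. Note also that the row-based version is empty until seven rows exist, while the time-based one starts from the first row.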
Your Turn
13 min
Now do it for your dataset. Pace:
- 5 min — find the dataset; load it; shape + head.
- 10 min — clean (dropna, dtypes, dedupe).
- 30 min — answer the five questions in code.
- 10 min — write a one-sentence interpretation per question.
- 5 min — save the notebook / commit the script.
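The cleaning step in the pace list (dropna, dtypes, dedupe) might look like the following. This is a sketch on a synthetic frame with made-up values, not a recipe for every dataset; which rows are safe to drop depends on your data:

```python
import pandas as pd

# Synthetic raw frame with the usual problems (illustrative values only)
raw = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03", None],
    "state": ["Johor", "Johor", "Johor", "Kedah", "Kedah"],
    "cases_new": ["10", "12", "12", "n/a", "7"],
})

clean = raw.copy()
# dtypes: text -> datetime / number; unparseable values become NaT / NaN
clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
clean["cases_new"] = pd.to_numeric(clean["cases_new"], errors="coerce")
# dropna: remove rows that lack a usable date or metric
clean = clean.dropna(subset=["date", "cases_new"])
# dedupe: remove exact duplicate rows
clean = clean.drop_duplicates().reset_index(drop=True)

print(f"rows: {len(raw)} -> {len(clean)}")
print(clean.dtypes)
```

Printing the row count before and after makes it easy to note in your write-up how much data the cleaning removed.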
Stretch · Reproducible Report
8 min
Put the script and a README in a folder. The README says where the data came from, what assumptions you made, and how to re-run the script. Include the date you ran it. This is what makes data work shareable.
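If you would rather generate the README than type it, a small helper can stamp the run date for you. A sketch only: `build_readme` is a hypothetical function, and the URL, assumption, and command are placeholders to replace with your own:

```python
from datetime import date


def build_readme(source_url, assumptions, run_command):
    """Return README text: data source, assumptions, re-run steps, run date."""
    lines = [
        "Data source",
        f"  {source_url}",
        "",
        "Assumptions",
        *[f"  - {a}" for a in assumptions],
        "",
        "How to re-run",
        f"  {run_command}",
        "",
        f"Last run: {date.today().isoformat()}",
    ]
    return "\n".join(lines) + "\n"


text = build_readme(
    source_url="https://example.com/your.csv",   # placeholder URL
    assumptions=["rows with a missing date were dropped"],
    run_command="python investigate.py",
)
print(text)
```

To save it next to the script, write `text` to a file named README.md, for example with `pathlib.Path("README.md").write_text(text, encoding="utf-8")`.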
Recap
3 min
You ran the full pandas pipeline on real data. Five questions, five answers, one interpretation each. This is the work data analysts and junior data scientists do every day. Next week we make it look good with matplotlib.
Homework · The Report
4 min
Write a one-page markdown file with the title "Five things I learned about <dataset>". Embed your code and the result for each question. Submit the markdown + the .py file.