Learning Goals
3 min
- Install pandas and import it as pd.
- Load a CSV with pd.read_csv.
- Get a fast feel for the data with head, info, describe, shape, columns.
- Inspect dtypes, and see why pandas sometimes guesses wrongly.
Warm-Up · Install & First Load
5 min
pip install pandas

import pandas as pd
df = pd.read_csv("students.csv")
print(df)

      name  age  score
0   Aisyah   13     88
1  Wei Jie   14     75
2   Suresh   13     92
3      Mei   14     80
Three things to spot: column names came from the header, the index started at 0, every dtype was inferred from the values. That's pandas in one breath.
A DataFrame is a 2-D labelled table. Rows have an index, columns have names. Every operation either reshapes the table or pulls a summary out of it. The 5-method tour below tells you what you're looking at within seconds.
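To make the index, column names, and dtypes concrete, here is a minimal sketch that rebuilds the warm-up table from an in-memory string. Using io.StringIO in place of students.csv is an assumption made only so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Assumption: the same four rows as the warm-up, held in a string instead of a file.
csv_text = """name,age,score
Aisyah,13,88
Wei Jie,14,75
Suresh,13,92
Mei,14,80
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.index)             # row labels: a RangeIndex starting at 0
print(df.columns.tolist())  # column names taken from the header row
print(df.dtypes)            # one inferred dtype per column

# Label-based lookup: row label 2, column "name"
print(df.loc[2, "name"])    # → Suresh
```

The index and the column names are both labels, which is why df.loc takes one of each.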
New Concept · The 5-Method Tour
14 min
1. shape: "how big?"
print(df.shape) # → (4, 3) -- 4 rows, 3 columns
2. columns: "what columns?"
print(df.columns.tolist()) # → ['name', 'age', 'score']
3. head / tail: "what does it look like?"
df.head()   # first 5 rows
df.head(3)  # first 3
df.tail(2)  # last 2
4. info: "dtypes and missing?"
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4 non-null      object
 1   age     4 non-null      int64
 2   score   4 non-null      int64
dtypes: int64(2), object(1)
Three columns, no missing values, two int64s and one object (string). If a column has the wrong dtype, you spot it here.
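A wrong dtype usually traces back to a single stray value. The sketch below assumes a hypothetical CSV in which one age was typed as "-", and shows how that one cell changes what pandas infers for the whole column. The data is invented for illustration, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical data: one age was entered as "-" instead of a number.
csv_text = """name,age,score
Aisyah,13,88
Wei Jie,-,75
Suresh,13,92
"""

# Default load: the stray "-" makes pandas read the whole age column as text.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Declaring "-" as a missing-value marker lets pandas infer a numeric dtype.
fixed = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
print(fixed.dtypes)               # age is float64, because NaN is a float
print(fixed["age"].isna().sum())  # → 1
```

Comparing the two dtypes printouts is the same check that info() gives you at a glance.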
5. describe: "what do the numbers look like?"
df.describe()

             age      score
count   4.000000   4.000000
mean   13.500000  83.750000
std     0.577350   7.675719
min    13.000000  75.000000
25%    13.000000  78.750000
50%    13.500000  84.000000
75%    14.000000  89.000000
max    14.000000  92.000000
Counts, mean, std, min, max, and three quartiles for every numeric column. One method, half a stats course.
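Nothing in describe() is magic: each row is an ordinary Series method. As a sketch, assuming only the four warm-up scores, the snippet below recomputes the mean and standard deviation by hand to show that std is the sample standard deviation (divisor n - 1), and that the quartiles are interpolated between sorted values.

```python
import pandas as pd

# The score column from the warm-up table.
scores = pd.Series([88, 75, 92, 80], name="score")

summary = scores.describe()

# Mean and sample standard deviation (ddof=1), computed by hand.
manual_mean = scores.sum() / len(scores)
manual_std = (((scores - manual_mean) ** 2).sum() / (len(scores) - 1)) ** 0.5

print(summary["mean"], manual_mean)                    # both 83.75
print(round(summary["std"], 6), round(manual_std, 6))  # the two values agree

# Quartiles use linear interpolation between sorted values by default.
print(scores.quantile([0.25, 0.5, 0.75]).tolist())     # → [78.75, 84.0, 89.0]
```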
Common kwargs to read_csv
sep=";"             different delimiter
header=None         no header row
names=[...]         supply your own column names
parse_dates=["d"]   parse a column as datetime on load
na_values=["-"]     treat extra strings as missing
dtype={"id": str}   force a column's dtype
Worked Example · A Friendly First Look
12 min
Save the L4-08 cleaned orders as clean.csv. Run this in a notebook or script:
import pandas as pd

df = pd.read_csv("clean.csv", parse_dates=["date"])

print("📁 file shape:", df.shape)
print("📋 columns :", df.columns.tolist())
print("📅 date range:", df["date"].min(), "→", df["date"].max())
print()
print("first 5 rows:")
print(df.head())
print()
print("summary:")
print(df.describe(include="all"))
Sample output
📁 file shape: (5, 6)
📋 columns : ['order_id', 'customer', 'product', 'quantity', 'price', 'date']
📅 date range: 2026-05-01 00:00:00 → 2026-05-03 00:00:00

first 5 rows:
   order_id customer product  quantity  price       date
0      1001    Ahmad    Roti         3   1.50 2026-05-01
1      1002      Mei    Milo         2   3.00 2026-05-01
2      1003   Suresh    Nasi         3   8.00 2026-05-02
3      1005     Devi    Nasi         2   8.00 2026-05-03
4      1006      Ali    Milo         4   3.00 2026-05-03

summary:
        order_id customer product  quantity     price                 date
count          5        5       5  5.000000  5.000000                    5
unique       NaN        5       3       NaN       NaN                  NaN
top          NaN    Ahmad    Nasi       NaN       NaN                  NaN
freq         NaN        1       2       NaN       NaN                  NaN
mean      1003.4      NaN     NaN  2.800000  4.700000  2026-05-02 00:00:00
...
Read the diff
The parse_dates= kwarg turned date into real datetime64. describe(include="all") shows stats for non-numeric columns too — counts and modes. Five lines tell you everything you need before starting analysis.
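To see what parse_dates changes, the sketch below loads the same text twice, once without the kwarg and once with it. The three rows are a hypothetical extract invented for illustration, and io.StringIO stands in for clean.csv.

```python
import io

import pandas as pd

# Hypothetical three-row extract; only the date column matters here.
csv_text = """order_id,product,date
1001,Roti,2026-05-01
1002,Milo,2026-05-01
1003,Nasi,2026-05-02
"""

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain["date"].dtype)   # text: the dates are just strings
print(parsed["date"].dtype)  # a datetime64 dtype

# Datetime columns support date arithmetic and the .dt accessor.
span = parsed["date"].max() - parsed["date"].min()
print(span.days)             # → 1
print(parsed["date"].dt.day_name().tolist())
```

With real datetimes, min, max, and mean work on the date column just as they do on numbers, which is why describe can summarise it.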
Try It Yourself
13 min
Pick any CSV from earlier in the level. Load it and print shape, columns, dtypes, head(3).
Take a CSV that has IDs starting with zero (e.g., "0123"). Pandas will guess int64 and strip the leading zero. Force the column to str with dtype=.
Hint
import pandas as pd

df = pd.read_csv("ids.csv", dtype={"id": str})
print(df.dtypes)
print(df["id"].head())
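If you do not have such a file handy, the effect can be reproduced in memory. This is a small sketch with two made-up IDs; io.StringIO stands in for ids.csv.

```python
import io

import pandas as pd

# Hypothetical IDs with leading zeros.
csv_text = """id,name
0123,Aisyah
0045,Mei
"""

guessed = pd.read_csv(io.StringIO(csv_text))
forced = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(guessed["id"].tolist())  # → [123, 45]  (leading zeros lost)
print(forced["id"].tolist())   # → ['0123', '0045']
```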
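read_csv accepts URLs. Load https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv directly. Print the top 5 cities by something interesting from the file.
Hint
import pandas as pd

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv"
df = pd.read_csv(url)
print(df.head())
print(df.columns.tolist())  # confirm the real column names before sorting
# "Population" is a placeholder: swap in a numeric column the file actually has
print(df.sort_values("Population", ascending=False).head())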
Mini-Challenge · Build a One-Page Profile
8 min
Write profile.py. Given any CSV path on argv, print a one-page profile:
- shape
- column names + dtypes
- missing-value counts per column
- numeric summary (describe())
- 5 sample rows
One possible solution
# profile.py
import sys

import pandas as pd

df = pd.read_csv(sys.argv[1])

print(f"shape : {df.shape}")
print("\ncolumns + dtypes:")
print(df.dtypes)
print("\nmissing per column:")
print(df.isna().sum())
print("\nsummary:")
print(df.describe(include="all").T)
print("\nsample 5 rows:")
print(df.sample(min(5, len(df))))
Non-negotiables: .isna().sum() for missing counts, .sample() for a varied peek (better than head for huge files).
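The missing-value count is worth a closer look, because the result is itself a Series you can query. The sketch below uses hypothetical data with a few blank cells and also computes the spread (max - min) of each numeric column. The data and variable names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical data with a few blank cells, which read_csv loads as NaN.
csv_text = """name,age,score
Aisyah,13,88
Wei Jie,,75
Suresh,13,
Mei,,80
"""

df = pd.read_csv(io.StringIO(csv_text))

missing = df.isna().sum()  # missing-value count per column
print(missing)
print(missing.idxmax())    # → age

# Spread (max - min) of each numeric column; NaN values are skipped.
numeric = df.select_dtypes("number")
spread = numeric.max() - numeric.min()
print(spread.idxmax())     # → score
```

idxmax returns the label of the largest entry, so it answers "which column?" directly instead of making you scan the printout.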
Recap
3 min
pd.read_csv loads, then 5 methods (shape, columns, head, info, describe) tell you what you've got. The parse_dates and dtype kwargs save you from later astype calls. Run this 5-line ritual on every new dataset.
Vocabulary Card
- DataFrame
- Pandas's 2-D table — rows with an index, columns with names.
- Series
- One column of a DataFrame. Indexing with df["col"] yields one.
- dtype
- The type pandas chose for a column — int64, float64, object (string), datetime64, bool, category.
- NaN
- "Not a Number" — pandas's missing-value marker.
Homework
4 min
Find a real CSV that matters to you: your school's timetable, an Open Data Malaysia file (data.gov.my), or a Kaggle public dataset. Run your profile.py on it. Write a 4-line markdown note answering:
- How many rows + columns?
- Which column had the most missing values?
- Which numeric column had the widest spread (max - min)?
- Surprise — anything unexpected?
The deliverable is the markdown note, not code. Aim for short, specific answers that you couldn't have written before looking at the file.