PY-L4-25 · Pandas — Loading & Exploring DataFrames

Learning Goals

3 min

Install pandas and import it as pd.
Load a CSV with pd.read_csv.
Get a fast feel for the data with head, info, describe, shape, columns.
Inspect dtypes, and see why pandas sometimes guesses wrong.

Warm-Up · Install & First Load

5 min

pip install pandas

import pandas as pd

df = pd.read_csv("students.csv")
print(df)

      name  age  score
0   Aisyah   13     88
1  Wei Jie   14     75
2   Suresh   13     92
3      Mei   14     80

Three things to spot: column names came from the header, the index started at 0, every dtype was inferred from the values. That's pandas in one breath.
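To check those three observations without a file on disk, here is a minimal sketch that passes the same four rows to read_csv through io.StringIO. The in-memory text is an assumption for illustration; the lesson itself reads students.csv.

```python
import io
import pandas as pd

# Same rows as students.csv, held in memory so the sketch runs anywhere.
csv_text = """name,age,score
Aisyah,13,88
Wei Jie,14,75
Suresh,13,92
Mei,14,80
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())   # names taken from the header row
print(df.index.tolist())     # default index counts up from 0
print(df.dtypes)             # one inferred dtype per column
```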

Today's big idea

A DataFrame is a 2-D labelled table. Rows have an index, columns have names. Every operation either reshapes the table or pulls a summary out of it. The 5-method tour below tells you what you're looking at within seconds.
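One way to see the "reshape or summarise" split is the rough sketch below. It rebuilds the warm-up table by hand (an assumption, so no file is needed) and applies one operation of each kind.

```python
import pandas as pd

# The warm-up table, typed in directly instead of loaded from students.csv.
df = pd.DataFrame(
    {
        "name": ["Aisyah", "Wei Jie", "Suresh", "Mei"],
        "age": [13, 14, 13, 14],
        "score": [88, 75, 92, 80],
    }
)

# Reshape: keep some rows and columns; the result is still a table.
older = df.loc[df["age"] == 14, ["name", "score"]]
print(older)

# Summarise: collapse a column to a single number.
print(df["score"].mean())   # → 83.75
```

Note that the filtered table keeps the original index labels (1 and 3), which is what "labelled" means in practice.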

New Concept · The 5-Method Tour

14 min

1. shape — "how big?"

print(df.shape)   # → (4, 3)  -- 4 rows, 3 columns

2. columns — "what columns?"

print(df.columns.tolist())   # → ['name', 'age', 'score']

3. head / tail — "what does it look like?"

df.head()       # first 5 rows
df.head(3)      # first 3
df.tail(2)      # last 2

4. info — "dtypes and missing?"

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4 non-null      object
 1   age     4 non-null      int64
 2   score   4 non-null      int64
dtypes: int64(2), object(1)

Three columns, no missing values, two int64s and one object (string). If a column has the wrong dtype, you spot it here.
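As a rough illustration of what a "wrong dtype" looks like in info, the sketch below uses a made-up file with one blank score and one age typed as a word. Both the data and the in-memory loading are assumptions for the example.

```python
import io
import pandas as pd

# Hypothetical file: one score is blank, one age was typed as text.
csv_text = """name,age,score
Aisyah,13,88
Wei Jie,fourteen,75
Suresh,13,
"""

df = pd.read_csv(io.StringIO(csv_text))
df.info()

# age is no longer numeric because of "fourteen";
# score became float64 because the blank cell is stored as NaN.
print(df["score"].isna().sum())   # → 1
```

In the info output, look for a Non-Null Count below the row count and for a text dtype on a column that should hold numbers.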

5. describe — "what do the numbers look like?"

df.describe()

             age      score
count   4.000000   4.000000
mean   13.500000  83.750000
std     0.577350   7.675719
min    13.000000  75.000000
25%    13.000000  78.750000
50%    13.500000  84.000000
75%    14.000000  89.000000
max    14.000000  92.000000

Counts, mean, std, min, max, and three quartiles for every numeric column. One method, half a stats course.
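Each row of describe can be reproduced with a method of its own. The sketch below does that for the score column, typed in by hand here as an assumption so it runs without the CSV.

```python
import pandas as pd

score = pd.Series([88, 75, 92, 80], name="score")

summary = score.describe()
print(summary)

# Each row of describe() matches a method you can call directly.
print(score.mean())           # → 83.75
print(score.quantile(0.25))   # → 78.75
print(score.median())         # → 84.0

# std() uses the sample formula (divides by n - 1), so it is larger
# than the population version you get with ddof=0.
print(score.std())
print(score.std(ddof=0))
```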

Common kwargs to `read_csv`

sep=";"            different delimiter
header=None        no header row
names=[...]        supply your own column names
parse_dates=["d"]  parse a column as datetime on load
na_values=["-"]    treat extra strings as missing
dtype={"id": str}  force a column's dtype
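These kwargs can be combined in one call. The sketch below assumes a made-up export that is semicolon-separated, has no header row, and writes "-" for missing amounts; the column names id, d, and amount are hypothetical.

```python
import io
import pandas as pd

# Hypothetical export: semicolon-separated, no header row, "-" for missing.
raw = """0123;2026-05-01;8.50
0456;2026-05-02;-
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                    # different delimiter
    header=None,                # first line is data, not column names
    names=["id", "d", "amount"],
    parse_dates=["d"],          # load d as datetime
    na_values=["-"],            # treat "-" as missing
    dtype={"id": str},          # keep the leading zeros in id
)

print(df)
print(df.dtypes)
```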

Worked Example · A Friendly First Look

12 min

Save the L4-08 cleaned orders as clean.csv. Run this in a notebook or script:

import pandas as pd

df = pd.read_csv("clean.csv", parse_dates=["date"])

print("📁 file shape:", df.shape)
print("📋 columns   :", df.columns.tolist())
print("📅 date range:", df["date"].min(), "→", df["date"].max())
print()
print("first 5 rows:")
print(df.head())
print()
print("summary:")
print(df.describe(include="all"))

Sample output

📁 file shape: (5, 6)
📋 columns   : ['order_id', 'customer', 'product', 'quantity', 'price', 'date']
📅 date range: 2026-05-01 00:00:00 → 2026-05-03 00:00:00

first 5 rows:
   order_id  customer product  quantity  price       date
0      1001     Ahmad    Roti         3   1.50 2026-05-01
1      1002       Mei    Milo         2   3.00 2026-05-01
2      1003    Suresh    Nasi         3   8.00 2026-05-02
3      1005      Devi    Nasi         2   8.00 2026-05-03
4      1006       Ali    Milo         4   3.00 2026-05-03

summary:
       order_id customer product  quantity     price                 date
count        5        5       5  5.000000  5.000000                    5
unique     NaN        5       3       NaN       NaN                  NaN
top        NaN     Ahmad    Nasi       NaN       NaN                  NaN
freq       NaN        1       2       NaN       NaN                  NaN
mean    1003.4      NaN     NaN  2.800000  4.700000  2026-05-02 00:00:00
...

Read the diff

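The parse_dates= kwarg turned date into a real datetime64 column, which is why min() and max() return dates rather than strings. describe(include="all") adds rows for non-numeric columns too: count, unique, top (the most frequent value), and freq. These few lines give you a solid first picture before you start analysis.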
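To see what parse_dates changes, the sketch below loads the same two-row text twice, once with the kwarg and once without. The tiny in-memory CSV is an assumption standing in for clean.csv.

```python
import io
import pandas as pd

# Two made-up rows with the same date format as clean.csv.
csv_text = """order_id,date
1001,2026-05-01
1002,2026-05-03
"""

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain["date"].dtype)    # text: pandas does not parse dates unless asked
print(parsed["date"].dtype)   # a datetime64 dtype

# Date arithmetic only makes sense on the parsed column.
span = parsed["date"].max() - parsed["date"].min()
print(span.days)   # → 2
```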

Try It Yourself

13 min

01 🟢 Load a small CSV

Pick any CSV from earlier in the level. Load it and print shape, columns, dtypes, head(3).

02 🟡 Fix a guessed dtype

Take a CSV that has IDs starting with zero (e.g., "0123"). Pandas will guess int64 and strip the leading zero. Force the column to str with dtype=.

Hint

df = pd.read_csv("ids.csv", dtype={"id": str})
print(df.dtypes)
print(df["id"].head())
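If you have no such file to hand, the following sketch shows the before and after on two made-up rows. The contents are an assumption; any column of zero-padded IDs behaves the same way.

```python
import io
import pandas as pd

# Hypothetical ids.csv contents.
csv_text = """id,name
0123,Aisyah
0456,Mei
"""

guessed = pd.read_csv(io.StringIO(csv_text))
forced = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(guessed["id"].tolist())   # → [123, 456]
print(forced["id"].tolist())    # → ['0123', '0456']
```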

03 🔴 Live from a URL

read_csv accepts URLs. Load https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv directly. Print top 5 cities by something interesting from the file.

Hint

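url = "https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv"
# skipinitialspace=True drops any spaces that follow the commas
df = pd.read_csv(url, skipinitialspace=True)
print(df.columns.tolist())   # check the real column names before sorting
print(df.head())
# "Population" may not exist in this file; sort by a numeric column that does
col = df.select_dtypes("number").columns[0]
print(df.sort_values(col, ascending=False).head())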

Mini-Challenge · Build a One-Page Profile

8 min

Write profile.py. Given any CSV path on argv, print a one-page profile:

shape
column names + dtypes
missing-value counts per column
numeric summary (describe())
5 sample rows

Show one possible solution

# profile.py
import sys

import pandas as pd

df = pd.read_csv(sys.argv[1])

print(f"shape   : {df.shape}")
print("\ncolumns + dtypes:")
print(df.dtypes)
print("\nmissing per column:")
print(df.isna().sum())
print("\nsummary:")
print(df.describe(include="all").T)
print("\nsample 5 rows:")
print(df.sample(min(5, len(df))))

Non-negotiables: .isna().sum() for missing counts, .sample() for a varied peek (better than head for huge files).
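The two non-negotiables can be tried on a synthetic table before pointing profile.py at a real file. The sketch below assumes a made-up 1,000-row frame with a gap in every tenth row, and passes random_state to sample so the peek is the same on every run.

```python
import pandas as pd

# Hypothetical frame standing in for a large file: 1,000 rows, some gaps.
df = pd.DataFrame({"row": range(1000)})
df["value"] = df["row"].where(df["row"] % 10 != 0)   # blank out every 10th value

print(df.isna().sum())              # missing count per column
print(df.head(5)["row"].tolist())   # → [0, 1, 2, 3, 4]

# random_state makes the random peek repeatable between runs.
peek = df.sample(min(5, len(df)), random_state=0)
print(peek)
```

head always shows the top of the file, which can be unrepresentative if the file is sorted; sample draws rows from anywhere in the table.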

Recap

3 min

pd.read_csv loads, then 5 methods (shape, columns, head, info, describe) tell you what you've got. parse_dates and dtype kwargs save you from later astype calls. Run this 5-line ritual on every new dataset.

Vocabulary Card

DataFrame: Pandas's 2-D table — rows with an index, columns with names.
Series: One column of a DataFrame. Selecting df["col"] returns one.
dtype: The type pandas chose for a column — int64, float64, object (string), datetime64, bool, category.
NaN: "Not a Number" — pandas's missing-value marker.
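The DataFrame and Series entries are easy to tell apart in code. A minimal sketch, using a two-row table invented for the purpose:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Aisyah", "Mei"], "score": [88, 80]})

col = df["score"]        # single brackets: a Series
table = df[["score"]]    # double brackets: a one-column DataFrame

print(type(col).__name__)     # → Series
print(type(table).__name__)   # → DataFrame

# Iterating a Series walks through its values, one per row.
for value in col:
    print(value)
```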

Homework

4 min

Find a real CSV that matters to you — your school's timetable, an Open Data Malaysia file (data.gov.my), a Kaggle public dataset. Run your profile.py on it. Write a 4-line markdown note answering:

How many rows + columns?
Which column had the most missing values?
Which numeric column had the widest spread (max - min)?
Surprise — anything unexpected?
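If you get stuck on the missing-values or spread questions, one possible approach is sketched below. It uses a small made-up table with one gap; your own file will have different column names.

```python
import pandas as pd

# Hypothetical table standing in for your own CSV.
df = pd.DataFrame(
    {
        "name": ["Aisyah", "Wei Jie", "Suresh", "Mei"],
        "age": [13, 14, 13, None],
        "score": [88, 75, 92, 80],
    }
)

# Column with the most missing values.
missing = df.isna().sum()
print(missing.idxmax())   # → age

# Spread (max - min) for every numeric column, then the widest one.
numeric = df.select_dtypes("number")
spread = numeric.max() - numeric.min()
print(spread)
print(spread.idxmax())    # → score
```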
