PY-L5-04 · NumPy Basics — Arrays & Vectorisation

Learning Goals

3 min

Create arrays with np.array, np.zeros, np.arange, np.linspace.
Do vectorised maths — operate on the whole array, no loops.
Use broadcasting, slicing, boolean masks, and reshape.
Compute axis-wise stats (sum, mean along rows / columns).

Warm-Up · Loop vs Vector

5 min

# The Python way — a loop
nums = [1, 2, 3, 4]
doubled = [n * 2 for n in nums]      # [2, 4, 6, 8]

# The NumPy way — vectorised, no loop
import numpy as np
arr = np.array([1, 2, 3, 4])
doubled = arr * 2                     # array([2, 4, 6, 8])

Same values. But NumPy runs the operation in optimised C under the hood, so for a million numbers it is often one to two orders of magnitude faster than the Python loop, and the code is shorter.

Today's big idea

Stop writing loops over numbers. Express the operation on the whole array. This is "vectorisation" — the mental shift that makes ML code fast and readable.
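
As a minimal sketch of that shift, the snippet below converts temperatures from Celsius to Fahrenheit twice: once with a loop, once as a single array expression. The variable names and sample values are illustrative and not part of the lesson's data.

```python
import numpy as np

celsius = [0.0, 21.5, 37.0, 100.0]   # illustrative sample values

# Loop version: one element at a time
fahrenheit_loop = []
for c in celsius:
    fahrenheit_loop.append(c * 9 / 5 + 32)

# Vectorised version: one expression over the whole array
celsius_arr = np.array(celsius)
fahrenheit_vec = celsius_arr * 9 / 5 + 32

print(fahrenheit_loop)
print(fahrenheit_vec)
print(np.allclose(fahrenheit_loop, fahrenheit_vec))   # True
```

The vectorised line reads like the formula itself, which is the readability gain the lesson is after.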

New Concept · The ndarray

14 min

Creating arrays

import numpy as np

np.array([1, 2, 3])              # from a list
np.zeros((2, 3))                 # 2×3 of zeros
np.ones(4)                       # [1. 1. 1. 1.]
np.arange(0, 10, 2)              # [0 2 4 6 8]
np.linspace(0, 1, 5)             # [0.   0.25 0.5  0.75 1.  ]
np.random.rand(3)                # 3 random floats in [0,1)
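
When you need random arrays that come out the same on every run, NumPy also offers a seeded generator through np.random.default_rng. A short sketch, where the seed 42 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)          # 42 is an arbitrary seed
floats = rng.random(3)                   # 3 floats in [0, 1)
ints = rng.integers(0, 10, size=(2, 3))  # 2x3 integers from 0 to 9

# Re-creating the generator with the same seed repeats the sequence
rng_again = np.random.default_rng(42)
print(np.array_equal(floats, rng_again.random(3)))   # True
```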

Shape, dtype, ndim

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)    # (2, 3)
print(a.ndim)     # 2
print(a.dtype)    # int64 on most platforms (int32 on Windows with NumPy 1.x)
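
NumPy picks one dtype for the whole array from the values you pass in. The sketch below shows that inference and an explicit conversion with astype; the array names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1, 2, 3.5])      # one float makes the whole array float
as_float = ints.astype(float)       # explicit conversion returns a new array

print(ints.dtype.kind)      # i  (integer; exact width depends on platform)
print(floats.dtype)         # float64
print(as_float.dtype)       # float64
print(ints.size)            # 3  (total number of elements)
```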

Vectorised maths & broadcasting

a = np.array([1, 2, 3, 4])
print(a + 10)        # [11 12 13 14]   scalar broadcasts
print(a ** 2)        # [ 1  4  9 16]
print(a + a)         # [2 4 6 8]        element-wise

b = np.array([[1, 2, 3],
              [4, 5, 6]])
print(b + np.array([10, 20, 30]))   # row broadcasts to every row
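
Broadcasting lines shapes up from the last axis backwards: two lengths are compatible when they are equal or one of them is 1. The sketch below shows a row, a column, and a shape that fails; the names row and col are illustrative.

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6]])           # shape (2, 3)

row = np.array([10, 20, 30])        # shape (3,): matches the last axis
col = np.array([[100], [200]])      # shape (2, 1): one value per row

print(b + row)
# [[11 22 33]
#  [14 25 36]]
print(b + col)
# [[101 102 103]
#  [204 205 206]]

# A length-2 1-D array does not line up with the last axis (length 3)
try:
    b + np.array([100, 200])
except ValueError as err:
    print("cannot broadcast:", err)
```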

Slicing & boolean masks

a = np.array([10, 20, 30, 40, 50])
print(a[1:4])         # [20 30 40]
print(a[a > 25])      # [30 40 50]   boolean mask
a[a > 25] = 0          # set matching elements
print(a)               # [10 20  0  0  0]
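
Masks can be combined, and np.where gives you a non-destructive alternative to assigning through a mask. A brief sketch using the same starting array:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

# Combine conditions with & (and), | (or), ~ (not); parentheses are required
mid = a[(a > 15) & (a < 45)]
print(mid)                         # [20 30 40]

# np.where builds a new array instead of modifying a in place
capped = np.where(a > 25, 25, a)
print(capped)                      # [10 20 25 25 25]
print(a)                           # [10 20 30 40 50]  (unchanged)

# Count matches: True counts as 1
print((a > 25).sum())              # 3
```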

Reshape & axes

a = np.arange(6)             # [0 1 2 3 4 5]
m = a.reshape(2, 3)          # [[0 1 2]
                             #  [3 4 5]]
print(m.sum())               # 15   (everything)
print(m.sum(axis=0))         # [3 5 7]   down columns
print(m.sum(axis=1))         # [ 3 12]   across rows
print(m.mean(axis=0))        # [1.5 2.5 3.5]   column means

axis=0 collapses rows (gives a per-column result); axis=1 collapses columns (per-row). Memorise this — it confuses everyone at first.
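
One way to check the rule is to look at the shape of the result: the axis you name is the one that disappears. The sketch below also uses keepdims=True, which keeps that axis with length 1 so the result can broadcast back against the original.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)     # shape (2, 3)

print(m.sum(axis=0).shape)         # (3,)  axis 0 (length 2) is gone
print(m.sum(axis=1).shape)         # (2,)  axis 1 (length 3) is gone

# keepdims=True keeps the collapsed axis with length 1,
# so the result still broadcasts against m
row_totals = m.sum(axis=1, keepdims=True)    # shape (2, 1)
print(m / row_totals)              # each row divided by its own total
```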

Worked Example · Normalise Features

12 min

A core ML chore: scale each feature column to mean 0, std 1. With NumPy it's two lines.

import numpy as np

# 5 samples, 3 features
X = np.array([
    [180, 80, 25],
    [165, 60, 30],
    [175, 72, 22],
    [190, 95, 40],
    [160, 55, 28],
], dtype=float)

# Per-column mean and std (axis=0 = down the rows)
mu    = X.mean(axis=0)
sigma = X.std(axis=0)

X_scaled = (X - mu) / sigma     # broadcasting does it all

print("means after scaling :", X_scaled.mean(axis=0).round(2))
print("stds  after scaling :", X_scaled.std(axis=0).round(2))
print(X_scaled.round(2))

Sample output

means after scaling : [ 0. -0.  0.]
stds  after scaling : [1. 1. 1.]
[[ 0.56  0.53 -0.65]
 [-0.84 -0.87  0.16]
 [ 0.09 -0.03 -1.14]
 [ 1.5   1.58  1.79]
 [-1.31 -1.22 -0.16]]

Read the diff

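No loops. X - mu subtracts each column's mean from every value in that column (broadcasting); dividing by sigma then scales each column to unit standard deviation. With its default settings, scikit-learn's StandardScaler applies the same formula, including the population standard deviation that X.std uses by default. You just wrote it by hand.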
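
The two-line version divides by zero if a column is constant (its std is 0). The sketch below wraps the same calculation in a hypothetical helper, standardise, and assumes one common workaround: a constant column is only centred, not divided. X_demo is made-up data for illustration.

```python
import numpy as np

def standardise(X):
    """Scale each column to mean 0 and std 1 (hypothetical helper)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Assumption: a constant column is only centred, not divided,
    # to avoid division by zero
    safe_sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / safe_sigma

X_demo = np.array([
    [180, 1, 25],
    [165, 1, 30],
    [175, 1, 22],
], dtype=float)                     # middle column is constant

Z = standardise(X_demo)
print(Z[:, 1])                      # [0. 0. 0.]
print(np.isfinite(Z).all())         # True
```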

Try It Yourself

13 min

01 🟢 Array maths

Make an array of the numbers 1-10. Print: the array squared, the sum, the mean, and only the even numbers.

Hint

a = np.arange(1, 11)
print(a ** 2, a.sum(), a.mean(), a[a % 2 == 0])

02 🟡 Per-row totals

Make a 4×3 array of random integers 0-9. Print each row's sum and each column's max.

Hint

m = np.random.randint(0, 10, size=(4, 3))
print(m)
print("row sums :", m.sum(axis=1))
print("col maxes:", m.max(axis=0))

03 🔴 Min-max scaling

Scale each column of a feature matrix to the range [0, 1] using (X - X.min(0)) / (X.max(0) - X.min(0)). Verify every column's min is 0 and max is 1.

Hint

# X is the float feature matrix from the worked example
lo, hi = X.min(axis=0), X.max(axis=0)
Xn = (X - lo) / (hi - lo)
print(Xn.min(axis=0), Xn.max(axis=0))   # [0. 0. 0.] [1. 1. 1.]

Mini-Challenge · Distance Without Loops

8 min

Given one point and an array of other points, compute the Euclidean distance from the point to every other — vectorised, no Python loop. (This is the heart of K-Nearest Neighbours, Lesson 12.)

Show one possible solution

import numpy as np

point  = np.array([3.0, 4.0])
others = np.array([[0, 0], [3, 0], [6, 8], [3, 5]], dtype=float)

# (others - point) broadcasts; square, sum across columns, sqrt
diffs = others - point
dists = np.sqrt((diffs ** 2).sum(axis=1))
print(dists.round(2))     # [5. 4. 5. 1.]

nearest = others[dists.argmin()]
print("nearest point:", nearest)   # [3. 5.]

Non-negotiables: no Python loop, use broadcasting + axis=1 sum, find the closest with argmin. You just wrote the core of KNN.
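
An alternative sketch of the same idea: np.linalg.norm with axis=1 does the square, sum, and sqrt in one call, and argsort extends "the nearest point" to "the k nearest". The value k = 2 is an illustrative choice, not something the lesson fixes.

```python
import numpy as np

point  = np.array([3.0, 4.0])
others = np.array([[0, 0], [3, 0], [6, 8], [3, 5]], dtype=float)

# np.linalg.norm with axis=1 computes one Euclidean length per row
dists = np.linalg.norm(others - point, axis=1)
print(dists)                       # [5. 4. 5. 1.]

# argsort returns indices ordered from nearest to farthest
k = 2                              # illustrative choice of k
nearest_k = np.argsort(dists)[:k]
print(nearest_k)                   # [3 1]
print(others[nearest_k])
# [[3. 5.]
#  [3. 0.]]
```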

Recap

3 min

NumPy's ndarray does maths on whole arrays at once (vectorisation), which is fast and concise. Broadcasting lets a scalar or row apply across a whole array. Slicing and boolean masks select elements; reshape changes dimensions; axis=0/axis=1 control whether stats run down columns or across rows. Most Python ML libraries accept or build on NumPy arrays.

Vocabulary Card

ndarray: NumPy's N-dimensional array — the core data type for numeric computing.
vectorisation: Applying an operation to a whole array at once, instead of looping element by element.
broadcasting: NumPy stretching a smaller array to match a larger one's shape during maths.
axis: Which dimension to operate along. axis=0 = down rows (per-column), axis=1 = across columns (per-row).

Homework

4 min

Write numpy_drills.py covering: array creation 3 ways, vectorised maths, a boolean-mask filter, a reshape, and axis-wise stats. Add a timing comparison: square a million numbers with a Python loop vs NumPy, print both times.

Sample · the timing part

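import time

import numpy as np

n = 1_000_000
pylist = list(range(n))

# perf_counter is a high-resolution clock meant for timing short intervals
t0 = time.perf_counter()
sq_loop = [x * x for x in pylist]
t1 = time.perf_counter()

# int64 avoids overflow on platforms where the default integer is 32-bit
arr = np.arange(n, dtype=np.int64)
t2 = time.perf_counter()
sq_np = arr ** 2
t3 = time.perf_counter()

print(f"python loop: {t1 - t0:.3f}s")
print(f"numpy      : {t3 - t2:.4f}s")
print(f"speedup    : {(t1 - t0) / (t3 - t2):.0f}x")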

Non-negotiables: all five drills + a real timing comparison showing NumPy's speedup.
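
A single timed run can be noisy. If you want a steadier number, one option is the standard library's timeit module; the sketch below repeats each version 5 times (an arbitrary count) and keeps the fastest run.

```python
import timeit

import numpy as np

n = 1_000_000
pylist = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Run each version 5 times and keep the fastest run
loop_time = min(timeit.repeat(lambda: [x * x for x in pylist], number=1, repeat=5))
numpy_time = min(timeit.repeat(lambda: arr ** 2, number=1, repeat=5))

print(f"python loop: {loop_time:.4f}s")
print(f"numpy      : {numpy_time:.4f}s")
print(f"speedup    : {loop_time / numpy_time:.0f}x")
```

Taking the minimum rather than the mean reduces the influence of other programs competing for the CPU during a run.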
