Learn Python Series (#56) - NumPy Beyond the Basics

Repository

https://github.com/realScipio/learn-python-series

What will I learn

You will learn why memory layout (C-order vs Fortran-order) matters for performance;
how views vs copies work in NumPy and when modifications propagate;
advanced broadcasting rules and powerful patterns like distance matrices;
vectorization patterns that replace Python loops entirely;
structured arrays for heterogeneous data without Pandas overhead;
how NumPy connects to the machine learning ecosystem (scikit-learn, PyTorch).

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
Familiarity with NumPy basics from episode #11;
The ambition to learn Python programming.

Difficulty

Intermediate, advanced

Curriculum (of the `Learn Python Series`):

GitHub Account

https://github.com/realScipio

Learn Python Series (#56) - NumPy Beyond the Basics

In episode #11 we covered NumPy fundamentals - creating arrays, basic operations, reshaping, and why NumPy is fast compared to Python lists. That was the "what." This episode is the "why" and "how" at a deeper level.

Understanding memory layout, broadcasting rules, and vectorization patterns is what separates using NumPy from understanding it. And understanding it is what you need before stepping into machine learning, where NumPy is the foundation everything else sits on. Every major ML library - scikit-learn, PyTorch, TensorFlow - operates on array-like structures that follow NumPy's conventions. The mental models you build here transfer directly.

Nota bene: This episode goes deeper than typical "NumPy tutorial" material. We're getting into how arrays work at the memory level, because that's where performance intuition comes from. If you know why certain operations are fast and others aren't, you stop guessing and start writing efficient code on the first try.

Memory Layout - Why C-Order vs Fortran-Order Matters

A 2D NumPy array looks like a grid on screen, but in memory it's a flat sequence of bytes. The question is: which dimension is contiguous?

import numpy as np

# C-order (row-major): rows are contiguous in memory
c_array = np.array([[1, 2, 3], [4, 5, 6]], order='C')
# Memory layout: [1, 2, 3, 4, 5, 6]
# Row 0 is bytes 0-2, row 1 is bytes 3-5 - contiguous

# Fortran-order (column-major): columns are contiguous
f_array = np.array([[1, 2, 3], [4, 5, 6]], order='F')
# Memory layout: [1, 4, 2, 5, 3, 6]
# Column 0 is bytes 0,1, column 1 is bytes 2,3 - contiguous

Why does this matter? Because of how CPU caches work. When you access one byte of memory, the CPU loads an entire cache line (typically 64 bytes - that's 8 float64 values) into L1 cache. If your next access is adjacent in memory, it's already in cache - a "cache hit," which is ~100x faster than fetching from main RAM.

big = np.random.rand(10000, 10000)  # ~800 MB

# Fast: summing along axis=1 (along rows - contiguous in C-order)
row_sums = big.sum(axis=1)   # Accesses memory sequentially

# Slower: summing along axis=0 (along columns - strided access)
col_sums = big.sum(axis=0)   # Jumps 80,000 bytes between elements

The difference can be 2-5x for large arrays. NumPy defaults to C-order, so row-wise operations are naturally fast. If your algorithm is inherently column-oriented (common in statistics and linear algebra), you can create arrays in F-order - or just be aware that column operations will be slower and plan accordingly.

Checking and controlling layout:

print(big.flags)
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False

# Strides tell you the byte-jump per dimension
print(big.strides)
# (80000, 8) - 80000 bytes to next row, 8 bytes to next column

The strides tuple is the key to understanding performance. (80000, 8) means moving to the next row jumps 80,000 bytes (10,000 elements × 8 bytes each), while moving to the next column jumps just 8 bytes. Operations along the small-stride dimension are fast; operations along the large-stride dimension generate cache misses.

Views vs Copies - Know What You Have

NumPy aggressively avoids copying data. Many operations return views - arrays that share memory with the original:

a = np.array([1, 2, 3, 4, 5])

# These create VIEWS (shared memory):
b = a[1:4]            # Slice
c = a.reshape(5, 1)   # Reshape (same total size)
d = a.T               # Transpose

# These create COPIES (independent memory):
e = a[np.array([0, 2, 4])]   # Fancy indexing (arbitrary index array)
f = a[[0, 2, 4]]             # Also fancy indexing
g = a.copy()                  # Explicit copy

Modifying a view modifies the original - and this is by design, not a bug:

b = a[1:4]
b[0] = 99
print(a)   # [1, 99, 3, 4, 5] - a was modified!

This is the opposite of Pandas' copy-on-write behavior (which we covered in episode #54). NumPy gives you shared memory for performance - slicing a 1GB array creates a view in nanoseconds instead of copying gigabytes. But you must be aware of it.

How to check:

print(b.base is a)   # True - b is a view of a
print(e.base is a)   # False - e is a copy, independent

# Or check if two arrays share memory
print(np.shares_memory(a, b))   # True
print(np.shares_memory(a, e))   # False

The rule of thumb: basic indexing (slices, single integers, ...) returns views. Advanced indexing (arrays of indices, boolean arrays) returns copies. When in doubt, check np.shares_memory().

When transposing creates trouble:

big = np.random.rand(1000, 1000)
transposed = big.T   # View - no data copied, just strides reversed

# But transposed is now F-contiguous, not C-contiguous
print(transposed.flags['C_CONTIGUOUS'])  # False

# If you need C-contiguous for performance
contiguous = np.ascontiguousarray(transposed)  # Now a copy, C-order

Advanced Broadcasting - The Full Rules

Episode #11 covered the basics: adding a scalar to an array, combining 1D and 2D arrays. The full broadcasting rules handle arbitrary dimensions:

The three rules (applied from the trailing dimensions backward):

If arrays have different numbers of dimensions, pad the smaller shape with 1s on the left
Dimensions of size 1 are stretched to match the other array's dimension
If dimensions are neither equal nor 1 - it fails with a shape mismatch error

a = np.ones((3, 4))          # Shape: (3, 4)
b = np.array([1, 2, 3, 4])   # Shape: (4,) → padded to (1, 4) → stretched to (3, 4)

result = a + b   # Works: (3, 4) + (4,) → (3, 4)

Outer Operations via Broadcasting

This is where broadcasting gets genuinely powerful. By adding dimensions with np.newaxis, you can compute outer products, distance matrices, and cross-tables without any loops:

x = np.array([1, 2, 3])[:, np.newaxis]   # Shape: (3, 1)
y = np.array([10, 20, 30, 40])            # Shape: (4,)

# Multiplication table via broadcasting
table = x * y   # Shape: (3, 1) × (4,) → (3, 4)
# [[ 10,  20,  30,  40],
#  [ 20,  40,  60,  80],
#  [ 30,  60,  90, 120]]

np.newaxis (which is just None) inserts a dimension of size 1 at the specified position. This is the universal tool for controlling which dimensions get broadcast against each other.

Distance Matrix - A Classic Pattern

Computing all pairwise distances between points is a common operation in data science (nearest neighbors, clustering, spatial analysis). Broadcasting makes it elegant:

# 1000 points in 3D space
points = np.random.rand(1000, 3)

# Pairwise distance matrix WITHOUT loops
# Step 1: compute pairwise differences
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
# shapes: (1000, 1, 3) - (1, 1000, 3) → (1000, 1000, 3)

# Step 2: Euclidean distance
distances = np.sqrt((diff ** 2).sum(axis=2))   # Shape: (1000, 1000)

This computes all 1,000,000 pairwise distances in one vectorized operation. A Python loop doing the same thing would take minutes; this takes tens of milliseconds. The memory cost is significant (a 1000×1000×3 intermediate array = ~24 MB), which is the tradeoff - broadcasting trades memory for speed.

For very large point sets where the intermediate array doesn't fit in memory, use scipy.spatial.distance.cdist instead - it computes the same result with a more memory-efficient implementation.

When Broadcasting Fails

a = np.ones((3, 4))
b = np.ones((3, 5))

# a + b → ValueError: operands could not be broadcast together
# Trailing dimensions: 4 and 5 - neither is 1, neither equal → fails

The error message tells you the shapes. When you get a broadcasting error, print both shapes, align them from the right, and check which dimension doesn't match.

Vectorization Patterns - Replacing Loops

The core skill of NumPy programming: every time you write for element in array, ask "can I do this without a loop?"

Pattern 1: Boolean Masking

data = np.random.randn(1_000_000)

# Python loop: ~500ms
result_slow = [x for x in data if x > 0]

# NumPy boolean mask: ~5ms (100x faster)
result_fast = data[data > 0]

# Combining conditions (use & for AND, | for OR, ~ for NOT)
mask = (data > -1) & (data < 1)   # Within 1 standard deviation
filtered = data[mask]

Boolean masks are also useful for conditional assignment:

# Set all negative values to zero
data[data < 0] = 0

# This is an in-place operation on the original array!

Pattern 2: `np.where` for Conditional Selection

prices = np.array([100, -5, 200, -10, 300])

# Clip negative values to zero
clipped = np.where(prices > 0, prices, 0)
# [100, 0, 200, 0, 300]

# Conditional based on another array
signals = np.where(prices > 150, 'buy', 'hold')
# ['hold', 'hold', 'buy', 'hold', 'buy']

Pattern 3: `np.select` for Multi-Branch Logic

When you need if/elif/elif/else:

values = np.random.randn(1000) * 100

conditions = [
    values > 100,
    values > 0,
    values > -50,
]
choices = ['strong_buy', 'buy', 'hold']

labels = np.select(conditions, choices, default='sell')

np.select evaluates conditions in order and returns the first match - exactly like if/elif/else. For two-branch logic, use np.where. For three or more branches, use np.select.

Pattern 4: `np.searchsorted` for Binning

# Define bucket boundaries
boundaries = np.array([0, 100, 500, 1000, 5000, 50000])
prices = np.array([50, 250, 750, 3000, 45000, 80])

bins = np.searchsorted(boundaries, prices)
# [1, 2, 3, 4, 5, 1] - which bucket each price belongs to

np.searchsorted uses binary search, making it O(log n) per element. For assigning millions of data points to hundreds of bins, this is orders of magnitude faster than a loop with if/elif chains.

Pattern 5: `np.einsum` for Complex Tensor Operations

Einstein summation notation handles matrix operations concisely:

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)

# Matrix multiplication (equivalent to A @ B)
C = np.einsum('ij,jk->ik', A, B)

# Batch matrix multiplication - 100 pairs of 3×4 and 4×5 matrices
batch_A = np.random.rand(100, 3, 4)
batch_B = np.random.rand(100, 4, 5)
batch_C = np.einsum('bij,bjk->bik', batch_A, batch_B)

# Diagonal extraction from a batch of matrices
diags = np.einsum('bii->bi', np.random.rand(100, 4, 4))

# Trace (sum of diagonal) for each matrix in a batch
traces = np.einsum('bii->b', np.random.rand(100, 4, 4))

The notation reads as: "for indices i,j in the first operand and j,k in the second, contract over the shared index j and output indices i,k." Repeated indices on the left that don't appear on the right are summed over. It takes a bit to internalize, but once you do, einsum replaces entire pages of loop-based tensor code.

Structured Arrays - Tables Without Pandas

When you need heterogeneous columnar data (mixed types per column) but Pandas is too heavy:

dt = np.dtype([
    ('name', 'U20'),      # Unicode string, max 20 chars
    ('price', 'f8'),       # 64-bit float
    ('volume', 'i8'),      # 64-bit integer
    ('active', '?'),       # Boolean
])

trades = np.array([
    ('BTC/USDT', 50000.0, 1000, True),
    ('ETH/USDT', 3000.0,  5000, True),
    ('SOL/USDT', 100.0,   20000, False),
], dtype=dt)

# Access by field name - returns a view (no copy!)
print(trades['price'])     # [50000.  3000.   100.]
print(trades['name'])      # ['BTC/USDT' 'ETH/USDT' 'SOL/USDT']

# Filter with boolean masks
expensive = trades[trades['price'] > 1000]

# Sort by field
sorted_trades = np.sort(trades, order='volume')

Structured arrays are useful for:

Reading binary data files with mixed types (sensor logs, network packet captures)
Interfacing with C libraries via ctypes or cffi
Lightweight tabular data where Pandas import time or memory overhead isn't justified
Intermediate representations when processing data that will eventually become a DataFrame

They're not a Pandas replacement - no groupby, no merge, no method chaining. But when you need a typed, contiguous, zero-overhead table, structured arrays are the tool.

NumPy's Role in the ML Ecosystem

Understanding NumPy arrays is understanding the lingua franca of Python data science:

# scikit-learn expects NumPy arrays (or array-likes)
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])     # Features: shape (5, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # Targets: shape (5,)

model = LinearRegression().fit(X, y)
print(f"Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
# Slope: 2.00, Intercept: 0.04

# PyTorch tensors convert seamlessly to/from NumPy
import torch

numpy_array = np.array([1.0, 2.0, 3.0])
tensor = torch.from_numpy(numpy_array)   # Shared memory!
tensor[0] = 99.0
print(numpy_array)   # [99.  2.  3.] - shared memory, like NumPy views

The array protocol (__array_interface__ and the newer __dlpack__ interface) means any library that follows NumPy conventions can interoperate with zero-copy data sharing. This is why the entire Python data science ecosystem works together seamlessly - Pandas DataFrames, scikit-learn models, PyTorch tensors, TensorFlow operations all speak the same array language.

The mental model: NumPy arrays are the universal data format. Pandas DataFrames contain NumPy arrays (or Arrow arrays) per column. Scikit-learn expects features as 2D arrays and targets as 1D arrays. PyTorch converts to/from NumPy with shared memory. Understanding shapes, strides, broadcasting, and vectorized operations makes every library in the ecosystem more intuitive.

Performance Tips Summary

Situation	Do this	Not this
Element-wise operation	Vectorized ops (`a + b`)	Python loop
Conditional selection	Boolean mask (`a[a > 0]`)	List comprehension
Conditional assignment	`np.where()` / `np.select()`	Loop with if/else
Multi-branch logic	`np.select()`	Nested `np.where()`
Iterate over rows	Rethink - reshape and vectorize	`for row in array`
Need contiguous data	`np.ascontiguousarray()`	Ignore layout, wonder why it's slow
Complex tensor ops	`np.einsum()`	Nested loops
Binning/bucketing	`np.searchsorted()`	Loop with if/elif
Large pairwise computations	`scipy.spatial.distance`	Broadcasting with huge intermediates

What to remember from this one

In this episode, we went deeper into NumPy than the basics:

Memory layout (C-order vs F-order) directly impacts cache performance - row operations are fast by default because NumPy uses C-order, and strides tell you exactly how many bytes each dimension step costs
Views share memory with the original array (slicing, reshaping, transposing); copies are independent (fancy indexing, explicit .copy()) - check with np.shares_memory()
Broadcasting rules: pad shapes with 1s on the left, stretch dimensions of size 1, fail if dimensions are neither equal nor 1
np.newaxis is the key to outer operations - it adds a dimension of size 1 for broadcasting control
Vectorization patterns replace loops: boolean masks for filtering, np.where/np.select for conditionals, np.searchsorted for binning, np.einsum for tensor operations
Structured arrays provide typed, contiguous tabular data without Pandas overhead - useful for binary data, C interop, and lightweight tables
NumPy is the universal array format - Pandas, scikit-learn, PyTorch, and TensorFlow all interoperate through NumPy's array protocol

This is the foundation that everything in data science and machine learning builds upon. Shapes, broadcasting, vectorized operations, memory layout - these concepts reappear in every library, every framework, every model. With Pandas and NumPy solid, you're equipped for whatever comes next.

De groeten! Thanks for reading, and until next time ;-)

Hive account@scipio

Learn Python Series (#56) - NumPy Beyond the Basics

Learn Python Series (#56) - NumPy Beyond the Basics

Repository

What will I learn

Requirements

Difficulty

Curriculum (of the Learn Python Series):

GitHub Account

Learn Python Series (#56) - NumPy Beyond the Basics

Memory Layout - Why C-Order vs Fortran-Order Matters

Views vs Copies - Know What You Have

Advanced Broadcasting - The Full Rules

Outer Operations via Broadcasting

Distance Matrix - A Classic Pattern

When Broadcasting Fails

Vectorization Patterns - Replacing Loops

Pattern 1: Boolean Masking

Pattern 2: np.where for Conditional Selection

Pattern 3: np.select for Multi-Branch Logic

Pattern 4: np.searchsorted for Binning

Pattern 5: np.einsum for Complex Tensor Operations

Structured Arrays - Tables Without Pandas

NumPy's Role in the ML Ecosystem

Performance Tips Summary

What to remember from this one

De groeten! Thanks for reading, and until next time ;-)

Curriculum (of the `Learn Python Series`):

Pattern 2: `np.where` for Conditional Selection

Pattern 3: `np.select` for Multi-Branch Logic

Pattern 4: `np.searchsorted` for Binning

Pattern 5: `np.einsum` for Complex Tensor Operations