NumPy calls highly optimised BLAS/LAPACK libraries (OpenBLAS, Intel MKL) written in Fortran/C and hand-tuned for CPU cache architecture. These libraries achieve near-theoretical peak CPU performance.
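A minimal sketch of what this means in practice: the single `@` call below dispatches to a BLAS matrix-multiply routine, and we verify it computes the same result as a naive pure-Python triple loop (all names and sizes here are illustrative, not from the source).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 60))
B = rng.standard_normal((60, 30))

# One call: NumPy hands this to the linked BLAS library (OpenBLAS, MKL, ...).
C = A @ B

# Naive pure-Python equivalent of the same computation, element by element.
C_loop = np.zeros((40, 30))
for i in range(40):
    for j in range(30):
        for k in range(60):
            C_loop[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, C_loop)  # same math, very different execution path
```

You can inspect which BLAS backend your NumPy build links against with `np.show_config()`; the output varies by install.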
Vectorised gradient descent for multiple linear regression:
- Model: ŷ = Xw + b (matrix form, X is m×n)
- Predictions: ŷ = X·w + b (single matrix multiply)
- Errors: e = ŷ − y (element-wise subtraction)
- Gradient for w: ∇w = (1/m) Xᵀ·e (matrix-vector multiply)
- Update: w := w − α·∇w (element-wise)
The entire gradient descent step for all n parameters reduces to two matrix operations. Compare this to a nested loop over m examples and n features: the loop executes m×n scalar multiply-adds one at a time in interpreted code, while the vectorised version hands the same work to parallel, cache-tuned BLAS routines.
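The update rules above can be sketched as one short function. This is a minimal illustration on synthetic data (the function name, data, and learning rate are assumptions for the example, not from the source):

```python
import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One vectorised gradient-descent step for linear regression.

    X: (m, n) feature matrix; y: (m,) targets; w: (n,) weights; b: scalar bias.
    """
    m = X.shape[0]
    y_hat = X @ w + b          # predictions: single matrix-vector multiply
    e = y_hat - y              # errors: element-wise subtraction
    grad_w = (X.T @ e) / m     # gradient for w: one more matrix-vector multiply
    grad_b = e.mean()          # gradient for b
    return w - alpha * grad_w, b - alpha * grad_b

# Illustrative usage on synthetic, noise-free data:
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0
w, b = np.zeros(3), 0.0
for _ in range(500):
    w, b = gradient_step(X, y, w, b, alpha=0.1)
```

Note that there is no loop over the n parameters anywhere: `X.T @ e` computes all n partial derivatives in one call.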
On GPUs, operations like matrix multiply run as CUDA kernels: the GPU's thousands of cores each compute a small portion of the result in parallel. A single matrix multiplication that takes seconds on CPU takes milliseconds on GPU.
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- The "error term" (f(x⁽ⁱ⁾) − y⁽ⁱ⁾) stays the same across parameters; for each wⱼ, you multiply it by the corresponding feature xⱼ⁽ⁱ⁾.
- Side note, normal equation (an alternative to gradient descent): a method that solves for w and b in linear regression directly, without iterative gradient descent.
- This video introduces feature scaling, a technique that enables gradient descent to run much faster.
- Because the contours are so tall and skinny, gradient descent may bounce back and forth for a long time before it finally finds its way to the global minimum.
- If you rescale x1 and x2 and run gradient descent on the transformed data, the contours look more like circles and less tall and skinny.
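The rescaling described above is commonly done with z-score standardisation, sketched below; the feature values are illustrative (size in square feet, number of bedrooms), not taken from the source.

```python
import numpy as np

X = np.array([[2104.0, 5.0],    # size in square feet, number of bedrooms
              [1416.0, 3.0],
              [1534.0, 3.0],
              [852.0,  2.0]])

# Z-score scaling: each feature gets mean 0 and standard deviation 1, so the
# cost contours become closer to circles and gradient descent converges faster.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma
```

Remember to apply the same `mu` and `sigma` to any new example at prediction time, so training and inference see features on the same scale.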
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- How NumPy, BLAS, and GPU kernels actually execute computations in parallel.
- NumPy calls highly optimised BLAS/LAPACK libraries (OpenBLAS, Intel MKL) written in Fortran/C and hand-tuned for CPU cache architecture.
- On GPUs, operations like matrix multiply run as CUDA kernels: the GPU's thousands of cores each compute a small portion of the result in parallel.
- Gradient descent for 1,000 features and 100,000 training examples: loop version = 10⁸ multiply-add operations executed sequentially ≈ 100 seconds per iteration. Vectorised NumPy ≈ 0.1 seconds. GPU ≈ 0.001 seconds. Same math, 100,000× speedup.
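You can reproduce the shape of this comparison yourself; the sketch below uses much smaller sizes so the Python loop finishes quickly, and the exact timings will vary by machine (the source's figures are order-of-magnitude estimates).

```python
import time
import numpy as np

m, n = 2000, 100
rng = np.random.default_rng(3)
X = rng.standard_normal((m, n))
e = rng.standard_normal(m)

# Loop version: m*n scalar multiply-adds executed sequentially in Python.
t0 = time.perf_counter()
grad_loop = np.zeros(n)
for j in range(n):
    for i in range(m):
        grad_loop[j] += X[i, j] * e[i]
grad_loop /= m
t_loop = time.perf_counter() - t0

# Vectorised version: the same gradient in a single BLAS call.
t0 = time.perf_counter()
grad_vec = (X.T @ e) / m
t_vec = time.perf_counter() - t0

assert np.allclose(grad_loop, grad_vec)
print(f"loop: {t_loop:.4f}s  vectorised: {t_vec:.6f}s")
```

Identical results, and the gap between `t_loop` and `t_vec` only widens as m and n grow toward the sizes quoted above.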
- Because w1 is multiplied by a very large number (the size in square feet), even a small change to w1 has a large effect on the prediction.
- Because the contours are so tall and skinny, gradient descent may bounce back and forth for a long time before it finally finds its way to the global minimum.
- Side note, normal equation (an alternative to gradient descent): a method that solves for w and b in linear regression directly, without iterative gradient descent.
- Gradient for w: ∇w = (1/m) Xᵀ·e (matrix-vector multiply)
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.