Gradient Descent is the algorithm that trains virtually every ML model, from linear regression to GPT-4. Understanding it is non-negotiable for interviews.
The blind hiker analogy: Imagine you're blindfolded on a hilly landscape. You can't see the whole terrain. You can only feel the slope under your feet. Your goal: reach the lowest valley. Your strategy: at every step, feel which direction is downhill and take one step that way. Repeat until you can't go any lower.
In ML: the 'landscape' is the cost function J(w,b). The 'valley floor' is the minimum cost (best model). The 'slope' is the gradient (the vector of partial derivatives of J). Gradient descent is the algorithm that takes those downhill steps.
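The blind-hiker loop can be sketched in a few lines. This is a minimal illustration on a hypothetical one-dimensional cost J(w) = (w - 3)^2, where the 'slope' is just the ordinary derivative:

```python
# Minimal sketch: gradient descent on a 1-D "landscape" J(w) = (w - 3)**2.
# The derivative dJ/dw = 2*(w - 3) is the "slope under your feet".
def gradient_descent_1d(w0=0.0, alpha=0.1, steps=50):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)    # feel the slope at the current position
        w = w - alpha * grad  # take one step downhill
    return w

print(gradient_descent_1d())  # converges toward the minimum at w = 3
```

Each step moves w opposite to the sign of the slope, so w walks downhill regardless of which side of the valley it starts on.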
The update rule (memorise this):
- w := w - α × (∂J/∂w)
- b := b - α × (∂J/∂b)
Where α (alpha) = learning rate (step size). Both updates happen simultaneously using the same current values.
Critical rule: Update ALL parameters simultaneously. Compute all derivatives first using current values, then update them all at once. Updating w first and using the new w to compute b's derivative is a bug โ you'd be computing the wrong gradient.
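A sketch of one correct step for linear regression with the squared-error cost J(w,b) = (1/2m) Σ (w·x + b - y)². Note that both gradients are computed from the current (w, b) before either parameter changes:

```python
import numpy as np

# One gradient-descent step for linear regression (illustrative sketch).
# Both partial derivatives use the CURRENT w and b; only then do we update.
def gd_step(w, b, x, y, alpha):
    m = len(x)
    err = w * x + b - y        # residuals under the current parameters
    dJ_dw = (err @ x) / m      # ∂J/∂w, from current w and b
    dJ_db = err.sum() / m      # ∂J/∂b, from current w and b
    w_new = w - alpha * dJ_dw  # simultaneous update:
    b_new = b - alpha * dJ_db  # both use the old w, b
    return w_new, b_new

# Toy data generated from the line y = 2x + 1 (hypothetical example).
x = np.array([1.0, 2.0, 3.0])
y = 2 * x + 1
w, b = 0.0, 0.0
for _ in range(5000):
    w, b = gd_step(w, b, x, y, alpha=0.05)
print(round(w, 3), round(b, 3))  # approaches w = 2, b = 1
```

The buggy version would assign `w = w - alpha * dJ_dw` first and then compute `dJ_db` from the already-updated `w`, which is no longer the gradient of J at the point you started from.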
Three variants you must know:
- Batch GD: use all training data for each step; very stable but slow for large datasets
- Stochastic GD (SGD): use one random sample per step; fast but very noisy (zigzags)
- Mini-batch GD: use batches of 32-512 samples; industry standard, balances speed + stability + GPU parallelism
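All three variants share one loop; only the batch size differs. A hedged sketch on a tiny toy dataset (a real mini-batch would be 32-512 samples, but the mechanics are identical): `batch_size=len(x)` gives batch GD, `batch_size=1` gives SGD, anything in between is mini-batch.

```python
import numpy as np

# Illustrative training loop: the batch_size argument selects the variant.
def train(x, y, alpha=0.05, batch_size=2, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            j = idx[start:start + batch_size]  # indices for this batch
            err = w * x[j] + b - y[j]
            w, b = (w - alpha * (err @ x[j]) / len(j),   # simultaneous
                    b - alpha * err.sum() / len(j))      # update
    return w, b

# Toy data from the line y = 2x + 1 (hypothetical example).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
print(train(x, y))  # approaches (2, 1); smaller batches add noise per step
```

Smaller batches mean cheaper, noisier steps; larger batches mean smoother but more expensive steps, which is exactly the speed/stability tradeoff in the list above.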
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- Gradient descent is used all over the place in machine learning: not just for linear regression, but also for training some of the most advanced neural network (deep learning) models.
- Gradient descent applies to more general functions too, including cost functions for models with more than two parameters.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.