The learning rate (alpha) is the multiplier on every gradient step. In one dimension, the update size is alpha * |dJ/dw|, so alpha controls how aggressively parameters move at every iteration.
Three regimes:
- Too large: repeated overshoot across the valley. Cost oscillates or explodes.
- Too small: updates are numerically correct but painfully slow.
- Well tuned: cost decreases quickly at first, then smoothly flattens near convergence.
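The three regimes can be observed on a toy one-dimensional problem. A minimal sketch, where J(w) = w^2, the starting point, and the three alpha values are illustrative choices, not from the source:

```python
# Minimal 1-D gradient descent on J(w) = w^2 to illustrate the three
# regimes. Cost function and alpha values are illustrative assumptions.
def descend(alpha, w=5.0, steps=50):
    for _ in range(steps):
        grad = 2 * w          # dJ/dw for J(w) = w^2
        w = w - alpha * grad  # update size is alpha * |dJ/dw|
    return w

print(descend(0.001))  # too small: barely moves from the start at 5.0
print(descend(0.1))    # well tuned: lands close to the minimum at 0
print(descend(1.1))    # too large: |w| grows every step (diverges)
```

The divergent case overshoots the valley on each step: the update is larger than the distance to the minimum, so w alternates sign with growing magnitude.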
Operational tuning workflow:
- Run a short learning-rate sweep on log scale (1e-4, 1e-3, 1e-2, 1e-1, 1).
- Plot training loss vs. iteration for each candidate.
- Select the largest value that is stable and mostly monotonic.
- Re-check with validation loss to avoid overfitting-related misreads.
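The sweep workflow above can be sketched as a small harness. The names `lr_sweep` and `toy_train`, and the stability criterion (all losses finite and the curve ending lower than it starts), are illustrative assumptions, not a prescribed method:

```python
import math

def toy_train(alpha, steps=20):
    # Stand-in for a real short training run: gradient descent on w^2,
    # returning the training-loss curve.
    w, losses = 5.0, []
    for _ in range(steps):
        w -= alpha * 2 * w
        losses.append(w * w)
    return losses

def lr_sweep(train_short, candidates=(1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    """Return the largest candidate whose short run looks stable."""
    stable = []
    for alpha in candidates:
        losses = train_short(alpha)
        # "Stable and mostly monotonic": finite, and ends below its start.
        if all(math.isfinite(l) for l in losses) and losses[-1] < losses[0]:
            stable.append(alpha)
    return max(stable) if stable else None

print(lr_sweep(toy_train))  # 0.1
```

Here alpha = 1.0 is rejected because the loss never decreases, so the harness returns 0.1, the largest stable candidate.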
Production nuance: fixed alpha is often suboptimal across the full run. Common schedules include warm-up, step decay, cosine decay, and one-cycle policies. These allow larger early exploration and finer late-stage convergence.
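One way such a schedule might look, combining two of the policies named above: linear warm-up into cosine decay. The function name `lr_schedule` and all constants are assumptions for illustration, not recommendations:

```python
import math

def lr_schedule(step, total_steps, base_lr=0.1, warmup_steps=100):
    # Linear warm-up to base_lr, then cosine decay to zero.
    # Constants here are illustrative defaults only.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Early steps ramp up toward `base_lr` (larger exploration), and the cosine term then shrinks the rate smoothly toward zero for fine late-stage convergence.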
Failure modes frequently confused with bad learning rate:
- Poor feature scaling causing zigzag updates.
- Exploding gradients in deep models (needs clipping/normalisation).
- Noisy mini-batches causing temporary non-monotonic curves.
So when loss behaves badly, diagnose before changing alpha blindly.
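As one concrete fix from the list above, gradient-norm clipping can be sketched as follows; the helper name `clip_by_norm` and the threshold are illustrative, not a specific library's API:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    # Rescale the gradient vector so its L2 norm never exceeds max_norm,
    # leaving its direction unchanged. Guards against exploding gradients.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Monitoring the computed `norm` over training is itself a useful diagnostic: a sudden spike points to exploding gradients rather than a mis-set alpha.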
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- Now let's suppose that after some number of steps of gradient descent, your parameter W is over here, say equal to five.
- So this means that if you're already at a local minimum, gradient descent leaves W unchanged.
- So if your parameters have already brought you to a local minimum, then further gradient descent steps do absolutely nothing.
- So that's the gradient descent algorithm, you can use it to try to minimize any cost function J.
- Not just the mean squared error cost function that we're using for linear regression.
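The local-minimum points above can be checked numerically: at a minimum, dJ/dw = 0, so the update subtracts nothing and W stays put. A minimal sketch, assuming the illustrative cost J(w) = (w - 3)^2:

```python
def step(w, grad_fn, alpha=0.1):
    # One gradient descent update: w <- w - alpha * dJ/dw.
    return w - alpha * grad_fn(w)

def grad(w):
    return 2 * (w - 3)  # dJ/dw for J(w) = (w - 3)^2, minimum at w = 3

print(step(3.0, grad))  # 3.0: gradient is zero, so w is unchanged
```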
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- The most critical hyperparameter: too large diverges, too small barely moves.
- But because the learning rate is so small, the second step is also just minuscule.
- For the case where the learning rate is too small.
- This also explains why gradient descent can reach a local minimum, even with a fixed learning rate alpha.
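The fixed-alpha point can also be verified directly: as w approaches the minimum, the gradient shrinks, so the step size alpha * |dJ/dw| shrinks with it even though alpha never changes. A toy check on J(w) = w^2 (an illustrative choice):

```python
# Record step sizes of gradient descent on J(w) = w^2 with fixed alpha.
alpha, w = 0.1, 5.0
step_sizes = []
for _ in range(5):
    grad = 2 * w
    step_sizes.append(abs(alpha * grad))  # alpha * |dJ/dw| at this w
    w -= alpha * grad

# Steps shrink geometrically even though alpha is constant.
print(step_sizes)
```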
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.