Plot cost J on the y-axis against iteration number on the x-axis. This is the learning curve — the most important diagnostic tool in ML training.
What a healthy learning curve looks like: Cost decreases monotonically every iteration and eventually flattens asymptotically. The curve looks like a ski slope that levels off.
Diagnosing problems from the learning curve:
- Cost goes up → learning rate α is too large (overshooting) or there's a bug in the gradient computation
- Cost decreases but very slowly → α too small, or feature scaling needed
- Cost decreases then oscillates → α slightly too large
- Cost decreases then plateaus → converged (or stuck in local minimum for non-convex problems)
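The diagnostics above can be seen on a toy problem. This is a minimal sketch, assuming a made-up 1-D least-squares dataset and illustrative alpha values (none of these numbers come from the notes): recording cost J after each update produces the learning curve, and a too-large α makes the recorded costs rise instead of fall.

```python
# Sketch: record the cost at each iteration to get a learning curve.
# Toy 1-D least-squares problem; data and alpha values are illustrative.
def cost(w, xs, ys):
    # J(w) = (1/2m) * sum((w*x - y)^2)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def gradient_descent(alpha, iters=50, w=0.0):
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x for x in xs]            # true slope is 2
    history = []
    for _ in range(iters):
        # dJ/dw = (1/m) * sum((w*x - y) * x)
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= alpha * grad                  # gradient descent update
        history.append(cost(w, xs, ys))    # one point on the learning curve
    return history

healthy = gradient_descent(alpha=0.05)    # cost falls on every iteration
diverging = gradient_descent(alpha=0.3)   # alpha too large: cost blows up
```

Plotting `history` against iteration number gives the curve described above: the healthy run looks like the ski slope that levels off, while the diverging run climbs.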
Automatic convergence test: Stop when ΔJ < ε between consecutive iterations, where ε = 10⁻³ is a common threshold. In practice, watching the curve visually is often more informative than a fixed threshold.
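The convergence test can be sketched as a simple stopping rule. This is an illustration only: the toy objective J(w) = (w − 3)² and the starting point are assumptions, not from the source; the ε = 10⁻³ threshold is the one quoted above.

```python
# Automatic convergence test sketch: stop when the drop in J between
# consecutive iterations falls below epsilon.
# The objective J(w) = (w - 3)^2 is an illustrative assumption.
def run_until_converged(alpha=0.1, epsilon=1e-3, max_iters=10_000):
    w = 0.0
    prev_cost = (w - 3.0) ** 2
    for i in range(1, max_iters + 1):
        w -= alpha * 2.0 * (w - 3.0)       # gradient of (w-3)^2 is 2(w-3)
        current = (w - 3.0) ** 2
        if prev_cost - current < epsilon:  # delta-J below threshold: stop
            return w, i
        prev_cost = current
    return w, max_iters

w_final, iters_used = run_until_converged()
```

Note the rule compares consecutive costs, so it can also trigger on a shallow plateau; that is one reason the notes say watching the curve visually is often more informative.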
Andrew Ng's rule: if gradient descent is working, J should decrease after every single iteration. If it ever increases, something is wrong.
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- Concretely, if you plot the cost over a number of iterations and notice that the cost sometimes goes up and sometimes goes down, take that as a clear sign that gradient descent is not working properly.
- This happens because the update steps are moving the parameters further from the global minimum instead of closer, so the cost J rises.
- One debugging tip: with a small enough learning rate, a correct implementation of gradient descent should decrease the cost function on every single iteration.
- So if gradient descent isn't working, one useful step is to set α to a very small number and check whether the cost then decreases on every iteration.
- The upcoming optional lab shows how feature scaling is done in code, and how different choices of the learning rate α lead to better or worse training of your model.
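The small-α debugging tip above can be turned into a check. This is a sketch under assumed names: the helper, the quadratic cost, and the deliberately sign-flipped gradient are all hypothetical, made up to show how a tiny learning rate exposes a gradient bug.

```python
# Sketch of the small-alpha debugging tip: with a tiny learning rate,
# a correct gradient should make the cost fall on every iteration,
# while a buggy gradient (e.g. wrong sign) will not.
# The cost J(w) = (w - 1)^2 and both gradients are illustrative.
def decreases_every_iteration(grad_fn, alpha=1e-4, iters=100, w=5.0):
    cost = lambda w: (w - 1.0) ** 2
    prev = cost(w)
    for _ in range(iters):
        w -= alpha * grad_fn(w)
        if cost(w) >= prev:        # cost failed to decrease: flag it
            return False
        prev = cost(w)
    return True

correct_grad = lambda w: 2.0 * (w - 1.0)   # true dJ/dw
buggy_grad = lambda w: -2.0 * (w - 1.0)    # sign flipped: simulated bug

ok = decreases_every_iteration(correct_grad)
bad = decreases_every_iteration(buggy_grad)
```

With the correct gradient the check passes; with the sign-flipped gradient the very first update raises the cost, which is exactly the symptom the tip is designed to catch.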
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- The learning curve — how to tell when training is done and when it's broken.
- An erratic or rising cost curve could also be a sign of broken code, not only a poorly chosen learning rate.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.