The live demo is where optimisation becomes tangible. Starting from a clearly bad point (w=-0.1, b=900), each iteration performs the same loop: predict, compute loss, compute gradients, update parameters, and re-evaluate.
What the visuals teach:
- On the data plot, the line rotates/translates toward a realistic trend.
- On the contour plot, (w,b) follows a path from outer rings toward the centre.
- On the loss curve, J drops quickly early, then flattens near convergence.
Why early jumps are larger: far from the optimum, gradients are large, so alpha * gradient gives bigger updates; near the optimum, gradients shrink, so steps become small automatically.
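The loop described above (predict, compute loss, compute gradients, update, re-evaluate) can be sketched in a few lines of Python. Only the starting point (w=-0.1, b=900) comes from the demo; the dataset, learning rate, and step count below are invented for illustration.

```python
# Minimal batch-gradient-descent sketch for linear regression.
# Toy dataset (invented): points lying exactly on the line y = 200x + 100.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [300.0, 500.0, 700.0, 900.0]
m = len(xs)

w, b = -0.1, 900.0   # the demo's deliberately bad starting point
alpha = 0.05         # learning rate, chosen to suit this toy data
losses = []

for step in range(2000):
    # predict and accumulate gradients over all m examples
    errs = [(w * x + b) - y for x, y in zip(xs, ys)]
    dw = sum(e * x for e, x in zip(errs, xs)) / m
    db = sum(errs) / m
    # simultaneous update of both parameters
    w, b = w - alpha * dw, b - alpha * db
    losses.append(sum(e * e for e in errs) / (2 * m))
```

Comparing `losses[0] - losses[1]` with `losses[-2] - losses[-1]` shows the automatic step-size shrinkage: early iterations remove far more loss per step than late ones, with no schedule on alpha.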
How to read failure from the same visuals:
- Path bouncing across the valley with rising loss -> alpha too high.
- Path crawling with almost flat progress -> alpha too low.
- Path drifting to strange regions after initial improvement -> likely a data-scaling or gradient bug.
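The first two failure modes can be reproduced directly on the loss curve. This sketch runs the same update rule with a too-high and a too-low alpha on an invented toy dataset (all numbers are illustrative, not from the demo):

```python
# Reading learning-rate failure modes off the loss history (toy data, invented).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [300.0, 500.0, 700.0, 900.0]
m = len(xs)

def loss(w, b):
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def run(alpha, steps=10, w=-0.1, b=900.0):
    history = [loss(w, b)]
    for _ in range(steps):
        dw = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / m
        db = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / m
        w, b = w - alpha * dw, b - alpha * db
        history.append(loss(w, b))
    return history

too_high = run(alpha=0.3)    # loss blows up: the path bounces across the valley
too_low = run(alpha=1e-4)    # loss barely moves: the path crawls
```

With `alpha=0.3` the final loss exceeds the starting loss; with `alpha=1e-4` the loss falls, but by only a few percent after ten steps. Both signatures are visible on the loss curve long before training finishes.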
This topic should leave you with a debugging mindset: training is observable dynamics, not a black box call to fit().
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- You're about to: see the algorithm in action, understand convergence behavior visually, and strengthen intuition with experiments. Final intuition check: why does linear regression's cost function guarantee that gradient descent won't get stuck in a bad local minimum?
- And so that's gradient descent, and we're going to use it to fit a model to the housing data.
- To be more precise, this gradient descent process is called batch gradient descent.
- And batch gradient descent is looking at the entire batch of training examples at each update.
- And you'll also see a contour plot, seeing how the cost gets closer to the global minimum as gradient descent finds better and better values for the parameters w and b.
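The contour plot in that last bullet can be reproduced by evaluating the cost J(w, b) over a grid of parameter values. In this sketch the dataset is invented (points lying exactly on y = 200x + 100), so the bowl's global minimum is known in advance:

```python
# Evaluating the cost surface J(w, b) on a grid, as a contour plot would.
# Toy dataset (invented): y = 200x + 100 exactly, so the minimum is (200, 100).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [300.0, 500.0, 700.0, 900.0]
m = len(xs)

def J(w, b):
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

grid = [(w, b) for w in range(0, 401, 10) for b in range(-200, 401, 10)]
best = min(grid, key=lambda p: J(*p))
# the grid minimum coincides with the global minimum of this convex bowl
```

Because the squared-error cost for linear regression is a convex bowl, the grid search, the contour rings, and gradient descent all agree on a single global minimum.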
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- To be more precise, this gradient descent process is called batch gradient descent.
- The term batch gradient descent refers to the fact that on every step of gradient descent, we're looking at all of the training examples, instead of just a subset of the training data.
- Often w and b will both be initialized to 0, but for this demonstration, let's initialize w = -0.1 and b = 900.
- But we'll use batch gradient descent for linear regression.
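The "all examples vs. a subset" distinction above can be made concrete: the batch gradient averages the error over every training example and is deterministic, while a gradient computed on a sampled subset is only a noisy estimate. Everything below (dataset, subset size, helper name `grad`) is an illustrative assumption, not from the source.

```python
# Batch gradient (all examples) vs. a subset gradient (hypothetical mini-batch).
import random

# Toy dataset (invented): 20 points on the line y = 200x + 100.
xs = [float(i) for i in range(1, 21)]
ys = [200.0 * x + 100.0 for x in xs]

def grad(w, b, idx):
    """Average gradient of squared-error cost over the examples in idx."""
    n = len(idx)
    dw = sum(((w * xs[i] + b) - ys[i]) * xs[i] for i in idx) / n
    db = sum(((w * xs[i] + b) - ys[i]) for i in idx) / n
    return dw, db

# batch gradient descent: every example, same answer every call
batch_grad = grad(-0.1, 900.0, range(len(xs)))
# subset gradient: a random 4-example sample, different each call
sub_grad = grad(-0.1, 900.0, random.sample(range(len(xs)), 4))
```

The batch gradient is what the lesson's update rule uses at every step; the subset variant is the alternative the source is distinguishing it from, traded off as noise per step versus cost per step.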
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.