
Gradient Descent — Live Demo

Watching the algorithm actually run — the parameter trajectory toward the minimum.

Core Theory

The live demo is where optimisation becomes tangible. Starting from a clearly bad point (w=-0.1, b=900), each iteration performs the same loop: predict, compute loss, compute gradients, update parameters, and re-evaluate.
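That loop can be sketched directly in NumPy. This is a minimal sketch with an assumed toy dataset (y = 200x + 100) and an assumed learning rate, not the demo's actual data or code:

```python
import numpy as np

# Assumed toy dataset following y = 200x + 100 (not the demo's real data).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([300.0, 500.0, 700.0, 900.0])

w, b = -0.1, 900.0   # the demo's deliberately bad starting point
alpha = 0.1          # learning rate (assumed)

for _ in range(2000):
    y_hat = w * x + b                 # 1. predict
    err = y_hat - y
    loss = (err ** 2).mean() / 2      # 2. compute loss J (drives the loss curve)
    dw = (err * x).mean()             # 3. gradient dJ/dw
    db = err.mean()                   #    gradient dJ/db
    w, b = w - alpha * dw, b - alpha * db   # 4. simultaneous update

print(round(w, 1), round(b, 1))  # → 200.0 100.0 (the true slope and intercept)
```

Because `err` is computed once before either parameter changes, the update is simultaneous, matching the standard gradient descent definition.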

What the visuals teach:

  1. On the data plot, the line rotates/translates toward a realistic trend.
  2. On the contour plot, (w,b) follows a path from outer rings toward the centre.
  3. On the loss curve, J drops quickly early, then flattens near convergence.

Why early jumps are larger: far from optimum, gradients are larger, so alpha * gradient gives bigger updates. Near optimum, gradients shrink, so steps become small automatically.

How to read failure from the same visuals:

  • Path bouncing across valley with rising loss -> alpha too high.
  • Path crawling with almost flat progress -> alpha too low.
  • Path drifting to strange regions after initial improvement -> potential data scaling or gradient bug.
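The first two failure signatures are easy to reproduce numerically. A sketch under assumed conditions (toy dataset, step budget, and alpha values chosen for illustration):

```python
import numpy as np

def final_loss(alpha, steps=200):
    """Run batch gradient descent on assumed toy data; return the final cost J."""
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 200 * x + 100
    w, b = -0.1, 900.0
    for _ in range(steps):
        err = (w * x + b) - y
        w, b = w - alpha * (err * x).mean(), b - alpha * err.mean()
    return (((w * x + b) - y) ** 2).mean() / 2

print(final_loss(0.30) > final_loss(0.10))         # alpha too high: loss blows up → True
print(final_loss(0.001) > 100 * final_loss(0.10))  # alpha too low: barely moved → True
```

The same step budget produces a diverging loss when alpha is too high and an almost untouched loss when alpha is too low, which is exactly what the contour and loss plots show.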

This topic should leave you with a debugging mindset: training is observable dynamics, not a black box call to fit().

Deepening Notes

Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.

  • You're about to: see the algorithm in action, understand convergence behavior visually, and strengthen intuition with experiments. Final intuition check: why does linear regression's cost function guarantee that gradient descent won't get stuck in a bad local minimum?
  • And so that's gradient descent, and we're going to use this to fit a model to the housing data.
  • To be more precise, this gradient descent process is called batch gradient descent.
  • And batch gradient descent is looking at the entire batch of training examples at each update.
  • And you'll also see a contour plot, seeing how the cost gets closer to the global minimum as gradient descent finds better and better values for the parameters w and b.

Interview-Ready Deepening

Source-backed reinforcement: these points add detail beyond the demo's brief on-screen hints and emphasize production tradeoffs.

  • The term batch gradient descent refers to the fact that on every step of gradient descent, we're looking at all of the training examples, instead of just a subset of the training data.
  • Often w and b will both be initialized to 0, but for this demonstration, let's initialize w = -0.1 and b = 900.
  • We'll use batch gradient descent for linear regression.
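The "all of the training examples at every step" point is directly checkable: the batch gradient is exactly the mean of the m per-example gradients. A sketch with assumed toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # assumed toy data
y = np.array([300.0, 500.0, 700.0, 900.0])
w, b = -0.1, 900.0

# Batch gradient: one vectorised pass over ALL m examples.
err = (w * x + b) - y
dw_batch, db_batch = (err * x).mean(), err.mean()

# Equivalent view: average the m per-example gradients.
grads = [((w * xi + b - yi) * xi, w * xi + b - yi) for xi, yi in zip(x, y)]
dw_mean = sum(g[0] for g in grads) / len(x)
db_mean = sum(g[1] for g in grads) / len(x)

print(bool(np.isclose(dw_batch, dw_mean) and np.isclose(db_batch, db_mean)))  # True
```

Mini-batch and stochastic variants simply average over fewer examples per step, trading gradient accuracy for cheaper updates.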

Tradeoffs You Should Be Able to Explain

  • More expressive models improve fit but can reduce interpretability and raise overfitting risk.
  • Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
  • Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.

First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.

Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.


💡 Concrete Example

Demo trace: start at w=-0.1, b=900 (nonsensical line). After a few iterations, slope becomes positive and intercept drops, reducing systematic error. Midway, the path still moves noticeably but cost decreases less aggressively than at the start. Near the end, gradient is tiny and parameters barely move, signaling convergence. For x=1250 sq ft, final prediction is close to the observed market trend.
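This trace can be reproduced numerically. Everything below is assumed for illustration: x in thousands of square feet, prices in thousands of dollars, and data following roughly y = 200x + 100, not the demo's real dataset:

```python
import numpy as np

x = np.array([1.0, 1.5, 2.0, 2.5])   # sq ft in 1000s (assumed data)
y = 200 * x + 100                    # price in $1000s
w, b = -0.1, 900.0                   # nonsensical starting line
alpha = 0.2

losses = []
for _ in range(3000):
    err = (w * x + b) - y
    losses.append((err ** 2).mean() / 2)
    w, b = w - alpha * (err * x).mean(), b - alpha * err.mean()

early_drop = losses[0] - losses[100]      # steep early progress
late_drop = losses[2800] - losses[2900]   # near-zero progress at convergence
print(early_drop > 1000 * max(late_drop, 1e-12))  # True
print(round(w * 1.25 + b))  # prediction for 1250 sq ft → 350
```

The loss-difference comparison captures the fast-then-flat shape of the curve, and the final prediction lands on the underlying trend the data was generated from.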



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for Gradient Descent — Live Demo.
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.

💻 Code Walkthrough

Concept-to-code walkthrough checklist for this topic.

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.
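One way to apply step 2 of the checklist to this topic: map each conceptual step of the loop to one function. This is an illustrative structure with assumed names and toy data, not the demo's implementation:

```python
import numpy as np

def predict(w, b, x):                    # conceptual step: predict
    return w * x + b

def loss(w, b, x, y):                    # conceptual step: compute loss J
    return ((predict(w, b, x) - y) ** 2).mean() / 2

def gradients(w, b, x, y):               # conceptual step: compute gradients
    err = predict(w, b, x) - y
    return (err * x).mean(), err.mean()

def update(w, b, dw, db, alpha):         # conceptual step: simultaneous update
    return w - alpha * dw, b - alpha * db

# Tiny assumed dataset: y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b = 0.0, 0.0
for _ in range(1000):
    dw, db = gradients(w, b, x, y)
    w, b = update(w, b, dw, db, alpha=0.1)
print(round(w, 2), round(b, 2))  # close to the true values 2 and 0
```

With this structure, the tradeoff (step 3) is easy to phrase: `gradients` touches every example per call (batch GD), which is stable but expensive on large datasets; the matching failure mode is a too-large `alpha` in `update` causing divergence.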

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] Why do gradient descent steps naturally decrease in size as training progresses?
    The update is w := w - alpha * dJ/dw with a fixed alpha, so the step size is proportional to the gradient magnitude. Far from the minimum the gradient is large and steps are big; near the minimum the gradient shrinks toward zero, so steps shrink automatically without any schedule for alpha.
  • Q2[beginner] What visual patterns tell you gradient descent is working vs. broken?
    Working: the fitted line rotates toward the data trend, the contour path moves steadily from the outer rings to the centre, and the loss curve drops quickly then flattens. Broken: the path bounces across the valley with rising loss (alpha too high), crawls with almost flat loss (alpha too low), or drifts into strange regions after initial improvement (suspect a data-scaling or gradient bug).
  • Q3[intermediate] What is the difference between batch GD and mini-batch GD?
    Batch gradient descent computes the gradient over all m training examples at every step, giving smooth, deterministic updates whose per-step cost grows with the dataset. Mini-batch gradient descent uses a small subset per step, giving cheaper and more frequent but noisier updates, with batch size as the knob between the two regimes. This demo uses batch gradient descent for linear regression.
  • Q4[intermediate] If your contour path oscillates across the valley, what is your first intervention?
    Lower the learning rate: oscillation means alpha * gradient overshoots the minimum along the steep direction. If a lower alpha stops the oscillation but makes progress along the shallow direction painfully slow, the valley is badly conditioned and feature scaling is the follow-up fix.
  • Q5[expert] What logs would you keep in production to debug training stability?
    Per-iteration training and validation loss, gradient norms, parameter norms, the effective learning rate, and basic input statistics. Rising or NaN loss with exploding gradient norms points to a too-high learning rate or bad data; flat loss with tiny gradients points to a too-low rate; a loss jump that coincides with a shift in input statistics points to a data problem rather than an optimiser problem.
  • Q6[expert] How would you explain this in a production interview with tradeoffs?
    Trace the pipeline: define model f → define cost J → compute gradients → simultaneous parameter update → repeat. Then name the tradeoffs: batch size (update stability vs. cost per step), learning rate (speed vs. divergence risk), and stopping criterion (compute spent vs. fit quality). That shows you understand training as observable optimisation dynamics, not just a call to fit().
🏆 Senior answer angle
Use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.
