Machine Learning

Gradient Descent — Update Rule

The actual update equations — the math behind every gradient step.

Core Theory

This topic covers the actual mathematical update rule, in Andrew Ng's exact formulation:

On each step:

  • w := w − α × (∂J(w,b)/∂w)
  • b := b − α × (∂J(w,b)/∂b)

The := symbol is the assignment operator, not a mathematical equality: it means 'compute the right side, then store the result in the variable on the left'. Andrew Ng was careful to distinguish this from mathematical equality, which would be false here (w cannot equal w minus a nonzero step).
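As a sketch of how the update rule and the := assignment look in code (the toy cost J(w, b) = (w − 1)² + (b + 2)² and all constants here are my own illustrative choices, not from the course):

```python
# Toy cost J(w, b) = (w - 1)^2 + (b + 2)^2, minimum at (1, -2).
def dJ_dw(w, b):
    return 2 * (w - 1)      # partial derivative of the toy cost w.r.t. w

def dJ_db(w, b):
    return 2 * (b + 2)      # partial derivative of the toy cost w.r.t. b

w, b = 0.0, 0.0
alpha = 0.1                 # learning rate

for _ in range(200):
    # ":=" in the math becomes "=" in code: compute the right side,
    # then store it back into the variable on the left.
    temp_w = w - alpha * dJ_dw(w, b)
    temp_b = b - alpha * dJ_db(w, b)
    w, b = temp_w, temp_b   # simultaneous update

print(round(w, 3), round(b, 3))   # converges toward the minimum at (1, -2)
```

Note that the temporaries mirror the simultaneous-update recipe described below in this topic: both derivatives are evaluated before either parameter changes.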

Breaking down the formula:

  • α (alpha): the learning rate — controls how big each step is
  • ∂J/∂w: the partial derivative of the cost w.r.t. w — tells you the slope in the w direction
  • Subtract: because we want to go downhill (reduce J), we move against the gradient
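A quick numeric sanity check of the 'subtract' bullet, on a one-dimensional toy cost of my own choosing:

```python
# Moving *against* the gradient lowers the cost; moving with it raises it.
def J(w):
    return (w - 4.0) ** 2          # toy 1-D cost, minimum at w = 4

def dJ_dw(w):
    return 2 * (w - 4.0)           # its derivative

w, alpha = 0.0, 0.1
downhill = J(w - alpha * dJ_dw(w)) # subtract: step against the gradient
uphill   = J(w + alpha * dJ_dw(w)) # add: step with the gradient

print(downhill < J(w) < uphill)    # True: subtracting goes downhill
```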

Critical rule โ€” simultaneous update: You MUST compute both โˆ‚J/โˆ‚w and โˆ‚J/โˆ‚b using the CURRENT values of (w, b) first, then update both. Updating w first and using the new w to compute โˆ‚J/โˆ‚b is a bug โ€” you're computing the derivative at a different point.

Correct implementation:

  • temp_w = w − α × ∂J/∂w (computed with current w, b)
  • temp_b = b − α × ∂J/∂b (computed with current w, b)
  • w = temp_w
  • b = temp_b
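The four steps above can be sketched in Python and contrasted with the buggy sequential version; the toy cost J = w² + w·b + b² and its partials are my own illustrative choices:

```python
# Partials of the toy cost J(w, b) = w^2 + w*b + b^2.
def dJ_dw(w, b):
    return 2 * w + b

def dJ_db(w, b):
    return w + 2 * b

alpha, w0, b0 = 0.1, 1.0, 1.0

# Correct: both derivatives evaluated at the CURRENT (w, b).
temp_w = w0 - alpha * dJ_dw(w0, b0)
temp_b = b0 - alpha * dJ_db(w0, b0)
w_ok, b_ok = temp_w, temp_b

# Buggy: w is updated first, then its NEW value leaks into dJ/db.
w_bad = w0 - alpha * dJ_dw(w0, b0)
b_bad = b0 - alpha * dJ_db(w_bad, b0)   # derivative taken at a different point

print(b_ok, b_bad)   # the two versions disagree on b
```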

Deepening Notes

Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.

  • Before we move on, answer this: If you start gradient descent at two different starting points in a non-convex function, why might you end up at two different solutions?
  • Let's take a look at how you can actually implement the gradient descent algorithm.
  • Now, there's one more subtle detail about how to correctly implement gradient descent: you're going to update two parameters, w and b.
  • One important detail is that for gradient descent, you want to simultaneously update w and b, meaning you want to update both parameters at the same time.
  • When you hear someone talk about gradient descent, they almost always mean the version that performs a simultaneous update of the parameters.
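For the check-in question about non-convex functions, a small sketch (the double-well cost J(w) = w⁴ − 2w², with minima at w = ±1, is my own example) shows two starting points reaching two different minima:

```python
# Derivative of the non-convex toy cost J(w) = w^4 - 2*w^2.
def dJ_dw(w):
    return 4 * w ** 3 - 4 * w

def descend(w, alpha=0.01, steps=2000):
    # Plain gradient descent from a given starting point.
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

# Two different starting points end in two different local minima.
print(round(descend(-0.5), 3), round(descend(0.5), 3))   # -1.0 and 1.0
```

This is exactly why the starting point matters on non-convex surfaces: each basin of attraction pulls the descent toward its own minimum.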

Interview-Ready Deepening

Source-backed reinforcement: these points add detail beyond the brief UI hints and emphasize production tradeoffs.

  • The way that gradient descent is implemented in code, it actually turns out to be more natural to implement it the correct way, with simultaneous updates.
  • Here's the correct way to implement gradient descent, which does a simultaneous update.
  • In contrast, here is an incorrect implementation of gradient descent that does not do a simultaneous update.
  • On each step, the parameter w is updated to the old value of w minus α times the term ∂/∂w of the cost function J(w, b).
Tradeoffs You Should Be Able to Explain

  • More expressive models improve fit but can reduce interpretability and raise overfitting risk.
  • Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
  • Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.

First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.

Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.


💡 Concrete Example

Pseudocode implementing the simultaneous update correctly:

  dJ_dw = (1/m) × sum((f_wb(x[i]) − y[i]) × x[i])   for all i
  dJ_db = (1/m) × sum(f_wb(x[i]) − y[i])            for all i
  w = w − alpha × dJ_dw    # use dJ_dw computed BEFORE updating w
  b = b − alpha × dJ_db    # use dJ_db computed with the ORIGINAL (w, b)
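That pseudocode can be made runnable in plain Python; the tiny synthetic dataset, learning rate, and iteration count below are my own illustrative choices:

```python
# Linear-regression gradient descent on tiny synthetic data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * xi + 1.0 for xi in x]   # ground truth: w = 2, b = 1
w, b, alpha, m = 0.0, 0.0, 0.05, len(x)

for _ in range(5000):
    f_wb = [w * xi + b for xi in x]   # predictions at the CURRENT (w, b)
    dJ_dw = (1 / m) * sum((f - yi) * xi for f, yi, xi in zip(f_wb, y, x))
    dJ_db = (1 / m) * sum(f - yi for f, yi in zip(f_wb, y))
    w = w - alpha * dJ_dw   # both gradients were computed before either update,
    b = b - alpha * dJ_db   # so the update is effectively simultaneous

print(round(w, 2), round(b, 2))   # approaches (2.0, 1.0)
```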



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for Gradient Descent — Update Rule.
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.
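The concept drill and failure-mode lab can be previewed offline with a sketch like this (the toy cost J(w) = w² and the α values are my own choices): sweeping the learning rate shows the shift from convergence to divergence.

```python
# One gradient step on the toy cost J(w) = w^2, whose derivative is 2w.
def step(w, alpha):
    return w - alpha * 2 * w

for alpha in (0.1, 0.5, 1.1):       # small, aggressive, and too large
    w = 1.0
    for _ in range(20):
        w = step(w, alpha)
    label = "converges" if abs(w) < 1.0 else "diverges"
    print(alpha, round(w, 3), label)
```

The edge case to explain in the lab: once α exceeds the stability threshold, each step overshoots the minimum by more than it corrects, so |w| grows instead of shrinking.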

💻 Code Walkthrough

Concept-to-code walkthrough checklist for this topic.

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] Why must w and b be updated simultaneously in gradient descent?
    Both partial derivatives must be evaluated at the same point (w, b). If you update w first and then compute ∂J/∂b using the new w, that derivative is taken at a different point on the cost surface, so you are no longer running gradient descent as defined. The fix is to compute both derivatives with the current values, store the results in temporaries, and only then overwrite w and b.
  • Q2[beginner] What is the difference between := (assignment) and = (equality) in ML pseudocode?
    The := symbol means assignment: compute the right-hand side, then store the result in the variable on the left, as in w := w − α × ∂J/∂w. A mathematical equals sign asserts that both sides are the same, which would be false here (w cannot equal w minus a nonzero step). In code, = already means assignment, so the update is written directly as w = w - alpha * dJ_dw.
  • Q3[intermediate] What is the partial derivative โˆ‚J/โˆ‚w telling you?
    It is the slope of the cost surface J in the w direction, evaluated at the current (w, b). Its sign tells you which way is uphill: a positive derivative means increasing w raises J, so the subtraction in the update moves w down; a negative derivative moves w up. Its magnitude scales the step: steep regions produce large updates, and flat regions near a minimum produce small ones, which is why gradient descent naturally slows as it converges.
  • Q4[expert] How does vectorised implementation preserve simultaneous-update correctness?
    In a vectorized implementation, the entire gradient vector is computed from the current parameter vector before any element changes, and the update is applied as a single vector operation (w := w − α∇J). No parameter is modified while gradients are still being computed, so the simultaneous-update requirement is satisfied by construction; there is no opportunity for an updated component to leak into another component's derivative.
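A minimal sketch of the vectorized pattern (the toy quadratic J = w² + w·b + b² and its gradient are my own illustrative choices):

```python
# The whole gradient vector is built from the CURRENT parameters first,
# then one vector update is applied, so simultaneity is automatic.
def grad(params):
    w, b = params
    return [2 * w + b, w + 2 * b]   # gradient of J = w^2 + w*b + b^2

params, alpha = [1.0, 1.0], 0.1
g = grad(params)                    # evaluated once, at the current params
params = [p - alpha * gi for p, gi in zip(params, g)]   # one vector update

print([round(p, 3) for p in params])   # [0.7, 0.7]
```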
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    The simultaneous update rule is where beginners introduce implementation bugs. In PyTorch, parameter updates are naturally simultaneous because all gradients are computed (via backward()) before any optimizer.step() call, so the framework enforces correctness. In a from-scratch NumPy implementation, however, you must stage the updates in temporary variables (or use a single vectorized update). This distinction comes up in coding interviews.
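A framework-style sketch in plain Python (the tiny SGD class and toy cost below are my own illustration of the gradients-then-step pattern, not PyTorch's actual API):

```python
# All gradients are collected first; a separate step() then applies them,
# which is what makes framework updates simultaneous by construction.
class SGD:
    def __init__(self, params, lr):
        self.params, self.lr = params, lr

    def step(self, grads):
        # Every gradient in `grads` was computed before this call, so no
        # update can leak into another parameter's derivative.
        for name, g in grads.items():
            self.params[name] -= self.lr * g

params = {"w": 1.0, "b": 1.0}
grads = {"w": 2 * params["w"] + params["b"],   # dJ/dw of J = w^2 + w*b + b^2
         "b": params["w"] + 2 * params["b"]}   # dJ/db, both at current params
SGD(params, lr=0.1).step(grads)

print(params["w"], params["b"])   # both 0.7: same point used for both grads
```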
๐Ÿ† Senior answer angle โ€” click to reveal
Use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.

📚 Revision Flash Cards

Test yourself before moving on. Flip each card to check your understanding — great for quick revision before an interview.
