This topic gave the actual mathematical update rule. Andrew Ng's exact formulation:
On each step:
w := w − α × (∂J(w,b)/∂w)
b := b − α × (∂J(w,b)/∂b)
The := symbol is the assignment operator (not a mathematical equality). It means 'compute the right side, then store it in the variable on the left'. Andrew Ng was careful to distinguish this from mathematical equality.
Breaking down the formula:
- α (alpha): the learning rate – controls how big each step is
- ∂J/∂w: the partial derivative of the cost w.r.t. w – tells you the slope in the w direction
- Subtract: because we want to go downhill (reduce J), we move against the gradient
Critical rule – simultaneous update: You MUST compute both ∂J/∂w and ∂J/∂b using the CURRENT values of (w, b) first, then update both. Updating w first and using the new w to compute ∂J/∂b is a bug – you're computing the derivative at a different point.
Correct implementation:
- temp_w = w − α × ∂J/∂w (computed with current w, b)
- temp_b = b − α × ∂J/∂b (computed with current w, b)
- w = temp_w
- b = temp_b
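The four-step recipe above can be sketched in Python. The cost function J and its derivatives below are hypothetical stand-ins chosen for illustration (not from the course), but the temp_w/temp_b update pattern is exactly the one described:

```python
# Hypothetical illustrative cost: J(w, b) = (w - 3)**2 + (b + 1)**2,
# which has its minimum at (w, b) = (3, -1).

def dJ_dw(w, b):
    return 2 * (w - 3)   # partial derivative of J with respect to w

def dJ_db(w, b):
    return 2 * (b + 1)   # partial derivative of J with respect to b

def gradient_descent_step(w, b, alpha):
    # Simultaneous update: both derivatives are evaluated at the
    # CURRENT (w, b) before either parameter is overwritten.
    temp_w = w - alpha * dJ_dw(w, b)
    temp_b = b - alpha * dJ_db(w, b)
    return temp_w, temp_b

w, b = 0.0, 0.0
for _ in range(100):
    w, b = gradient_descent_step(w, b, alpha=0.1)
print(w, b)  # converges toward (3, -1)
```

Note that the tuple assignment `w, b = gradient_descent_step(...)` makes the simultaneous update natural in Python: both new values are computed before either variable changes.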
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- Before we move on, answer this: If you start gradient descent at two different starting points in a non-convex function, why might you end up at two different solutions?
- Let's take a look at how you can actually implement the gradient descent algorithm.
- Now, there's one more subtle detail about how to correctly implement gradient descent: you're going to update two parameters, w and b.
- One important detail is that for gradient descent, you want to simultaneously update w and b, meaning you want to update both parameters at the same time.
- When you hear someone talk about gradient descent, they always mean the version of gradient descent where you perform a simultaneous update of the parameters.
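The review question above (two starting points, two solutions) can be sketched with a hypothetical non-convex cost. The double-well function J(w) = w⁴ − 2w² below is an illustrative assumption, not from the source; it has two local minima, at w = −1 and w = +1:

```python
# Hypothetical double-well cost J(w) = w**4 - 2*w**2 with minima at w = -1, +1.

def dJ_dw(w):
    return 4 * w**3 - 4 * w  # derivative of the double-well cost

def descend(w, alpha=0.05, steps=200):
    # Plain gradient descent on a single parameter w.
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

w_pos = descend(0.5)   # starting right of the central hump settles near +1
w_neg = descend(-0.5)  # starting left of the hump settles near -1
print(w_pos, w_neg)
```

Because each run only ever moves downhill from where it starts, the two runs descend into different valleys: this is why, on a non-convex cost, the starting point determines which local minimum gradient descent finds.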
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- The actual update equations โ the math behind every gradient step.
- The way that gradient descent is implemented in code, it actually turns out to be more natural to implement it the correct way with simultaneous updates.
- This topic gave the actual mathematical update rule.
- Here's the correct way to implement gradient descent which does a simultaneous update.
- In contrast, here is an incorrect implementation of gradient descent that does not do a simultaneous update.
- Subtract: because we want to go downhill (reduce J), we move against the gradient
- On each step, w, the parameter, is updated to the old value of w minus alpha times this term: d/dw of the cost function J of (w, b).
- The := symbol is the assignment operator (not a mathematical equality).
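The contrast between the correct (simultaneous) and incorrect (sequential) implementations mentioned above can be made concrete. The cost J(w, b) = w² + wb + b² below is a hypothetical example chosen because its partial derivatives couple w and b, so the order of updates actually matters:

```python
# Hypothetical coupled cost: J(w, b) = w**2 + w*b + b**2.

def dJ_dw(w, b):
    return 2 * w + b

def dJ_db(w, b):
    return w + 2 * b

alpha = 0.1

# Correct: both derivatives are evaluated at the current (w, b) = (1, 1).
w, b = 1.0, 1.0
temp_w = w - alpha * dJ_dw(w, b)
temp_b = b - alpha * dJ_db(w, b)
correct = (temp_w, temp_b)

# Incorrect: w is overwritten first, so dJ_db sees the NEW w,
# i.e. the derivative is computed at a different point.
w_bad, b_bad = 1.0, 1.0
w_bad = w_bad - alpha * dJ_dw(w_bad, b_bad)
b_bad = b_bad - alpha * dJ_db(w_bad, b_bad)

print(correct, (w_bad, b_bad))  # the b components differ
```

Both versions usually still converge somewhere, but the sequential version is not gradient descent as defined: each half-step uses a gradient taken at a point the parameters no longer occupy.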
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.