
Regularisation — Math for Linear Regression

L2 penalty added to MSE; weight decay in the gradient update.

Core Theory

L2-regularized linear regression objective:

J(w,b) = (1/2m) * sum((ŷ_i - y_i)^2) + (lambda/2m) * sum(w_j^2)

Details that matter:

  • bias term b is usually excluded from penalty.
  • lambda term is normalized by m for scale consistency.
  • regularization acts on weights, not labels/features.
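The objective above can be sketched directly in numpy. This is a minimal illustration, not the course's code; the function and variable names are my own.

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """L2-regularized MSE: (1/2m)*sum((ŷ-y)^2) + (lam/2m)*sum(w^2)."""
    m = X.shape[0]
    y_hat = X @ w + b                            # predictions ŷ_i
    mse = np.sum((y_hat - y) ** 2) / (2 * m)     # squared-error term
    penalty = lam * np.sum(w ** 2) / (2 * m)     # bias b excluded from penalty
    return mse + penalty
```

Note how the penalty sums over `w` only, matching the bullet about the bias term.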

Weight update with decay:

w_j := w_j*(1 - alpha*lambda/m) - alpha*(1/m)*sum((ŷ_i-y_i)*x_ij)

The first factor is weight decay. Each step slightly shrinks coefficient magnitude before fitting residual structure.
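One update step can be written out as follows; a sketch under the same notation as above, with illustrative names.

```python
import numpy as np

def gradient_step(X, y, w, b, alpha, lam):
    """One step of regularized gradient descent (weight-decay form)."""
    m = X.shape[0]
    err = X @ w + b - y                   # residuals (ŷ_i - y_i)
    grad_w = (X.T @ err) / m              # (1/m) * sum((ŷ_i - y_i) * x_ij)
    grad_b = err.mean()                   # bias gradient: no penalty term
    w = w * (1 - alpha * lam / m) - alpha * grad_w   # shrink first, then fit
    b = b - alpha * grad_b
    return w, b
```

With zero residuals the gradient vanishes, so the step reduces to the pure decay factor (1 - alpha*lam/m), which makes the shrinkage easy to verify in isolation.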

Practical insight: if features are not standardized, regularization acts unevenly because coefficient scales are not comparable. Standardize first, then tune lambda.
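A minimal standardization step might look like this; the example values are made up to show the scale mismatch.

```python
import numpy as np

# Two features on wildly different scales: the second would be
# under-penalized without standardization.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 1500.0]])
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma   # each column now has mean 0 and std 1
```

After this, a single lambda penalizes every coefficient on a comparable scale.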

Operational check: monitor coefficient norms as lambda changes; exploding norms indicate weak regularization or unstable optimization settings.
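The norm-vs-lambda check can be run with a closed-form ridge fit; this is a sketch using the standard normal-equations solution, not a production monitoring tool.

```python
import numpy as np

def ridge_norms(X, y, lambdas):
    """Fit ridge regression per lambda and return the L2 norm of w each time."""
    n = X.shape[1]
    norms = []
    for lam in lambdas:
        # closed-form ridge solution: (X'X + lam*I)^{-1} X'y
        w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
        norms.append(np.linalg.norm(w))
    return norms
```

As lambda grows the norms should shrink monotonically; a flat or growing curve signals the weak-regularization or instability problems mentioned above.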

Deepening Notes

Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.

  • You now understand: overfitting, underfitting, bias vs. variance, regularization intuition, and the mathematical formulation of regularization. You are now moving from "student level" to "engineer level".
  • In this video, we'll figure out how to get gradient descent to work with regularized linear regression.
  • The first part is the usual squared error cost function, and now you have this additional regularization term, where Lambda is the regularization parameter, and you'd like to find parameters w and b that minimize the regularized cost function.
  • Let's take these definitions for the derivatives and put them back into the expression on the left to write out the gradient descent algorithm for regularized linear regression.
  • To implement gradient descent for regularized linear regression, this is what you would have your code do.

Interview-Ready Deepening

Source-backed reinforcement: these points add detail beyond the summary above and emphasize production tradeoffs.

  • With α=0.01, λ=1, m=100: decay factor = 1 − (0.01·1/100) = 1 − 0.0001 = 0.9999. Each step, w shrinks by 0.01% before the gradient update. Over 10,000 steps, this prevents w from growing unboundedly.
  • This is the update for linear regression before we had regularization, and this is the term we saw in Week 2 of this course.
  • In fact, the updates for a regularized linear regression look exactly the same, except that now the cost, J, is defined a bit differently.
  • This is why this expression is used to compute the gradient in regularized linear regression.
  • That's why the updated b remains the same as before, whereas the updated w changes because the regularization term causes us to try to shrink w_j.
  • Recall that f(x) for linear regression is defined as the dot product w·x plus b.
  • Here is the update for w_j, for j equals 1 through n, and here's the update for b.
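Putting the excerpts above together, a full training loop updating w_j for j = 1 through n and b each iteration might look like this (an illustrative sketch, not the course's implementation):

```python
import numpy as np

def train(X, y, alpha=0.01, lam=1.0, iters=1000):
    """Regularized linear regression via batch gradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y                                       # ŷ_i - y_i
        w = w * (1 - alpha * lam / m) - alpha * (X.T @ err) / m   # update w_j, j = 1..n
        b = b - alpha * err.mean()                                # b: no decay term
    return w, b
```

The two update lines are the simultaneous updates the transcript describes: w is shrunk then corrected by the gradient, while b gets only the gradient step.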

Tradeoffs You Should Be Able to Explain

  • More expressive models improve fit but can reduce interpretability and raise overfitting risk.
  • Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
  • Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.

First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.

Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.





💻 Code Walkthrough

Concept-to-code walkthrough checklist for this topic.

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] Write the regularised cost function for linear regression.
    J(w,b) = (1/2m) * sum((ŷ_i - y_i)^2) + (lambda/2m) * sum(w_j^2). The first term is the usual squared-error cost; the second is the L2 penalty, scaled by the regularization parameter lambda and normalized by m. The bias b is excluded from the penalty sum.
  • Q2[beginner] What is weight decay and how does it appear in the gradient update?
    Weight decay is the multiplicative factor (1 - alpha*lambda/m) applied to each w_j in the update w_j := w_j*(1 - alpha*lambda/m) - alpha*(1/m)*sum((ŷ_i - y_i)*x_ij). It is the L2 penalty's gradient contribution rewritten: every step slightly shrinks the weights before fitting the residual. With alpha=0.01, lambda=1, m=100, the factor is 0.9999, a 0.01% shrink per step.
  • Q3[intermediate] Why is the bias term b typically not regularised?
    b only shifts all predictions by a constant; it does not multiply any feature, so a large |b| does not produce the high-variance fits that regularization targets. Penalizing b would systematically bias predictions toward zero without reducing overfitting, so the penalty sum runs over the w_j only.
  • Q4[expert] Why should features be standardized before interpreting regularized coefficients?
    The L2 penalty charges every w_j equally, but a coefficient's magnitude depends on its feature's scale: a feature measured in thousands gets a tiny coefficient that is barely penalized, while one measured in fractions gets a large, heavily penalized coefficient. Standardizing first puts all coefficients on a comparable scale, so a single lambda regularizes them evenly and their magnitudes become interpretable.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    Weight decay = L2 regularisation, just written differently in the update rule. PyTorch optimisers expose a weight_decay parameter that adds λw to the gradient; for plain SGD this reproduces the coupling exactly (AdamW uses a decoupled variant). The mathematical equivalence: adding λ/2m Σwⱼ² to the cost function produces the (1 − α·λ/m) factor in the gradient update. Showing this equivalence demonstrates you can connect the mathematical formulation to the framework API.
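The equivalence claimed in Q5 can be checked numerically in a few lines; the alpha, lambda, and gradient values below are illustrative.

```python
import numpy as np

# Adding (lam/2m)*sum(w^2) to the cost contributes (lam/m)*w to the gradient.
# Folding that into the step gives the (1 - alpha*lam/m) weight-decay factor.
alpha, lam, m = 0.01, 1.0, 100
w = np.array([0.5, -1.2])
grad = np.array([0.3, 0.1])   # unregularized gradient (1/m)*sum(err*x)

via_penalty = w - alpha * (grad + (lam / m) * w)        # penalty in the gradient
via_decay = w * (1 - alpha * lam / m) - alpha * grad    # weight-decay form
assert np.allclose(via_penalty, via_decay)
```

Both forms produce identical updates, which is the whole point of the "written differently" claim.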
🏆 Senior answer angle: use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.
