
Regularisation — Math for Linear Regression

L2 penalty added to MSE; weight decay in the gradient update.

Core Theory

L2-regularized linear regression objective:

J(w,b) = (1/2m) * sum((ŷ_i - y_i)^2) + (lambda/2m) * sum(w_j^2)

Details that matter:

  • bias term b is usually excluded from penalty.
  • lambda term is normalized by m for scale consistency.
  • regularization acts on weights, not labels/features.
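The objective above can be sketched directly in numpy. This is a minimal illustration, not the course's code; the function and variable names are my own.

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """L2-regularized MSE: (1/2m)*sum((ŷ-y)^2) + (lam/2m)*sum(w^2)."""
    m = X.shape[0]
    y_hat = X @ w + b                            # predictions ŷ_i
    mse = np.sum((y_hat - y) ** 2) / (2 * m)     # squared-error term
    penalty = lam * np.sum(w ** 2) / (2 * m)     # bias b excluded from penalty
    return mse + penalty
```

Note how the penalty sums over `w` only, matching the bullet about the bias term.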

Weight update with decay:

w_j := w_j*(1 - alpha*lambda/m) - alpha*(1/m)*sum((ŷ_i-y_i)*x_ij)

The first factor is weight decay. Each step slightly shrinks coefficient magnitude before fitting residual structure.
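One update step can be written out as follows; a sketch under the same notation as above, with illustrative names.

```python
import numpy as np

def gradient_step(X, y, w, b, alpha, lam):
    """One step of regularized gradient descent (weight-decay form)."""
    m = X.shape[0]
    err = X @ w + b - y                   # residuals (ŷ_i - y_i)
    grad_w = (X.T @ err) / m              # (1/m) * sum((ŷ_i - y_i) * x_ij)
    grad_b = err.mean()                   # bias gradient: no penalty term
    w = w * (1 - alpha * lam / m) - alpha * grad_w   # shrink first, then fit
    b = b - alpha * grad_b
    return w, b
```

With zero residuals the gradient vanishes, so the step reduces to the pure decay factor (1 - alpha*lam/m), which makes the shrinkage easy to verify in isolation.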

Practical insight: if features are not standardized, regularization acts unevenly because coefficient scales are not comparable. Standardize first, then tune lambda.
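A minimal standardization step might look like this; the example values are made up to show the scale mismatch.

```python
import numpy as np

# Two features on wildly different scales: the second would be
# under-penalized without standardization.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 1500.0]])
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma   # each column now has mean 0 and std 1
```

After this, a single lambda penalizes every coefficient on a comparable scale.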

Operational check: monitor coefficient norms as lambda changes; exploding norms indicate weak regularization or unstable optimization settings.
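The norm-vs-lambda check can be run with a closed-form ridge fit; this is a sketch using the standard normal-equations solution, not a production monitoring tool.

```python
import numpy as np

def ridge_norms(X, y, lambdas):
    """Fit ridge regression per lambda and return the L2 norm of w each time."""
    n = X.shape[1]
    norms = []
    for lam in lambdas:
        # closed-form ridge solution: (X'X + lam*I)^{-1} X'y
        w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
        norms.append(np.linalg.norm(w))
    return norms
```

As lambda grows the norms should shrink monotonically; a flat or growing curve signals the weak-regularization or instability problems mentioned above.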

Deepening Notes

Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.

  • You now understand: overfitting, underfitting, bias vs. variance, regularization intuition, and the mathematical formulation of regularization. You are now moving from "student level" to "engineer level".
  • In this video, we'll figure out how to get gradient descent to work with regularized linear regression.
  • The first part is the usual squared error cost function, and now you have this additional regularization term, where Lambda is the regularization parameter, and you'd like to find parameters w and b that minimize the regularized cost function.
  • Let's take these definitions for the derivatives and put them back into the expression on the left to write out the gradient descent algorithm for regularized linear regression.
  • To implement gradient descent for regularized linear regression, this is what you would have your code do.

Interview-Ready Deepening

Source-backed reinforcement: these points add detail beyond the summary above and emphasize production tradeoffs.

  • With α=0.01, λ=1, m=100: decay factor = 1 − (0.01·1/100) = 1 − 0.0001 = 0.9999. Each step, w shrinks by 0.01% before the gradient update. Over 10,000 steps, this prevents w from growing unboundedly.
  • This is the update for linear regression before we had regularization, and this is the term we saw in Week 2 of this course.
  • In fact, the updates for a regularized linear regression look exactly the same, except that now the cost, J, is defined a bit differently.
  • This is why this expression is used to compute the gradient in regularized linear regression.
  • That's why the updated b remains the same as before, whereas the updated w changes because the regularization term causes us to try to shrink w_j.
  • Recall that f(x) for linear regression is defined as the dot product w·x plus b.
  • Here is the update for w_j, for j equals 1 through n, and here's the update for b.
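Putting the excerpts above together, a full training loop updating w_j for j = 1 through n and b each iteration might look like this (an illustrative sketch, not the course's implementation):

```python
import numpy as np

def train(X, y, alpha=0.01, lam=1.0, iters=1000):
    """Regularized linear regression via batch gradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y                                       # ŷ_i - y_i
        w = w * (1 - alpha * lam / m) - alpha * (X.T @ err) / m   # update w_j, j = 1..n
        b = b - alpha * err.mean()                                # b: no decay term
    return w, b
```

The two update lines are the simultaneous updates the transcript describes: w is shrunk then corrected by the gradient, while b gets only the gradient step.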

Tradeoffs You Should Be Able to Explain

  • More expressive models improve fit but can reduce interpretability and raise overfitting risk.
  • Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
  • Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.

First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.

Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.





💻 Code Walkthrough

Concept-to-code walkthrough checklist for this topic.

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] Write the regularised cost function for linear regression.
    J(w,b) = (1/2m) * sum((ŷ_i - y_i)^2) + (lambda/2m) * sum(w_j^2). The first term is the usual squared-error cost; the second is the L2 penalty, scaled by the regularization parameter lambda and normalized by m. The bias b is excluded from the penalty sum.
  • Q2[beginner] What is weight decay and how does it appear in the gradient update?
    Weight decay is the multiplicative factor (1 - alpha*lambda/m) applied to each w_j in the update w_j := w_j*(1 - alpha*lambda/m) - alpha*(1/m)*sum((ŷ_i - y_i)*x_ij). It is the L2 penalty's gradient contribution rewritten: every step slightly shrinks the weights before fitting the residual. With alpha=0.01, lambda=1, m=100, the factor is 0.9999, a 0.01% shrink per step.
  • Q3[intermediate] Why is the bias term b typically not regularised?
    b only shifts all predictions by a constant; it does not multiply any feature, so a large |b| does not produce the high-variance fits that regularization targets. Penalizing b would systematically bias predictions toward zero without reducing overfitting, so the penalty sum runs over the w_j only.
  • Q4[expert] Why should features be standardized before interpreting regularized coefficients?
    The L2 penalty charges every w_j equally, but a coefficient's magnitude depends on its feature's scale: a feature measured in thousands gets a tiny coefficient that is barely penalized, while one measured in fractions gets a large, heavily penalized coefficient. Standardizing first puts all coefficients on a comparable scale, so a single lambda regularizes them evenly and their magnitudes become interpretable.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    Weight decay = L2 regularisation, just written differently in the update rule. PyTorch optimisers expose a weight_decay parameter that adds λw to the gradient; for plain SGD this reproduces the coupling exactly (AdamW uses a decoupled variant). The mathematical equivalence: adding λ/2m Σwⱼ² to the cost function produces the (1 − α·λ/m) factor in the gradient update. Showing this equivalence demonstrates you can connect the mathematical formulation to the framework API.
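The equivalence claimed in Q5 can be checked numerically in a few lines; the alpha, lambda, and gradient values below are illustrative.

```python
import numpy as np

# Adding (lam/2m)*sum(w^2) to the cost contributes (lam/m)*w to the gradient.
# Folding that into the step gives the (1 - alpha*lam/m) weight-decay factor.
alpha, lam, m = 0.01, 1.0, 100
w = np.array([0.5, -1.2])
grad = np.array([0.3, 0.1])   # unregularized gradient (1/m)*sum(err*x)

via_penalty = w - alpha * (grad + (lam / m) * w)        # penalty in the gradient
via_decay = w * (1 - alpha * lam / m) - alpha * grad    # weight-decay form
assert np.allclose(via_penalty, via_decay)
```

Both forms produce identical updates, which is the whole point of the "written differently" claim.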
🏆 Senior answer angle: use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.
