This topic built intuition by temporarily simplifying the model to one parameter: set b=0 so f(x)=w·x. This is not because real models only have one parameter, but because reducing dimensionality lets you see the full optimisation landscape clearly.
Now sweep w across values:
- w near the best value gives low error and low cost.
- w too small underestimates outputs, so many residuals are negative and large in magnitude.
- w too large overestimates outputs, so many residuals are positive and large in magnitude.
When you plot w on the x-axis and J(w) on the y-axis, you get a U-shaped parabola. The bottom of this U is the parameter value that minimises training error.
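The sweep described above can be sketched in a few lines. The tiny dataset here is a hypothetical illustration (labels generated by y = x, so the best w is 1.0), not data from the source:

```python
# Sketch: sweep w over candidate values and compute the cost J(w)
# for the simplified model f(x) = w*x (b fixed at 0).
# Dataset is a hypothetical example, not from the source.

def cost_J(w, xs, ys):
    """Mean squared error cost: J(w) = (1/2m) * sum((w*x - y)^2)."""
    m = len(xs)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]  # generated by y = 1.0 * x, so the best w is 1.0

sweep = [w / 10 for w in range(0, 21)]       # w from 0.0 to 2.0
costs = [cost_J(w, xs, ys) for w in sweep]

best_w = sweep[costs.index(min(costs))]
print(best_w)  # w = 1.0 sits at the bottom of the U
```

Plotting `sweep` against `costs` reproduces the U-shaped curve: high cost at both ends, minimum at the true slope.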
Critical geometric insight: left side of the bowl has negative slope, right side has positive slope, and exactly at the minimum the slope is zero. This single picture explains why gradient descent works: the derivative sign always tells you which direction to move.
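This sign behaviour is easy to verify numerically. A minimal sketch, assuming the same hypothetical y = x dataset (minimum at w = 1):

```python
# Sketch: the derivative dJ/dw = (1/m) * sum((w*x - y) * x) for f(x) = w*x.
# Its sign tells gradient descent which way to move: negative left of the
# minimum (step right), positive right of it (step left), zero at the bottom.
# Dataset is hypothetical (y = x, so the minimum is at w = 1).

def dJ_dw(w, xs, ys):
    m = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / m

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(dJ_dw(0.5, xs, ys))  # negative: left side of the bowl
print(dJ_dw(1.5, xs, ys))  # positive: right side
print(dJ_dw(1.0, xs, ys))  # zero: at the minimum

# One gradient-descent step moves toward the minimum from either side:
w, lr = 0.5, 0.1
w_next = w - lr * dJ_dw(w, xs, ys)  # increases w, toward 1.0
```

Because the update subtracts the derivative, a negative slope pushes w up and a positive slope pushes it down, which is exactly the "always move downhill" behaviour the picture shows.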
Why this matters in production: for linear regression with MSE, J is convex (single global minimum), so optimisation is predictable. If your training still struggles, the issue is usually not local minima; it is usually one of: bad learning rate, poor feature scaling, data quality issues, or model mismatch (linear model for nonlinear signal).
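One of those failure modes, a bad learning rate, is easy to demonstrate even on this perfectly convex bowl. A sketch under the same hypothetical y = x dataset; the two rates are illustrative values, not recommendations:

```python
# Sketch: even on a convex bowl, a learning rate that is too large makes
# gradient descent oscillate and diverge, while a modest one converges.
# Dataset and learning rates are hypothetical illustrations.

def dJ_dw(w, xs, ys):
    m = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / m

def run(lr, steps=50, w=0.0):
    xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
    for _ in range(steps):
        w -= lr * dJ_dw(w, xs, ys)
    return w

print(run(lr=0.1))  # converges toward the optimum w = 1.0
print(run(lr=0.5))  # overshoots each step and diverges: |w| blows up
```

The point: when this model misbehaves, the loss surface is not the suspect. Divergence like the second run points at the learning rate (or unscaled features, which have the same effect on step sizes), not at local minima.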
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- In this video, we'll walk through one example to see how the cost function can be used to find the best parameters for your model.
- What the cost function J does is measure the difference between the model's predictions and the actual true values for y.
- In order to better visualize the cost function J, we work with a simplified version of the linear regression model.
- Now, using this simplified model, let's see how the cost function changes as you choose different values for the parameter w.
- Notice that because the cost function is a function of the parameter w, the horizontal axis is now labeled w and not x, and the vertical axis is now J and not y.
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- What the cost function looks like — and why the bowl shape matters.
- To see this visually: if b is set to 0, then f defines a straight line through the origin with slope w.
- In the cost function, the squared error for the second example is also 0 squared.
- This line, in combination with the training set, corresponds to a single point on the cost function graph at w = 0.5.
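The w = 0.5 point mentioned above can be computed directly. A worked sketch, assuming a hypothetical training set generated by y = x (so w = 1 is a perfect fit; the source video's actual data is not reproduced here):

```python
# Worked sketch: the line f(x) = 0.5 * x, together with a hypothetical
# y = x training set, maps to one point on the J(w) curve at w = 0.5.

def cost_J(w, xs, ys):
    m = len(xs)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]

# At w = 1 every prediction matches its label, so each squared error
# (including the second example's) is 0^2, and J(1) = 0.
print(cost_J(1.0, xs, ys))  # 0.0

# At w = 0.5 the line under-predicts every point, giving one spot on the U:
print(cost_J(0.5, xs, ys))  # (0.25 + 1.0 + 2.25) / 6, about 0.583
```

Each choice of w yields one fitted line on the (x, y) plot and exactly one point on the (w, J) plot; sweeping w traces out the full bowl.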
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimisation speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.