For logistic regression, the standard objective is binary cross-entropy (log loss), not MSE.
Why: MSE composed with the sigmoid yields a non-convex cost surface with flat regions that make optimization difficult; log loss is derived from maximum likelihood and gives stable, principled probability training.
Per-example loss:
- y=1 -> -log(ŷ)
- y=0 -> -log(1-ŷ)
Dataset objective: J=(1/m) * sum(-y*log(ŷ) - (1-y)*log(1-ŷ)).
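As a minimal sketch, the dataset objective above can be computed directly; the function name and the `eps` clipping are our own choices:

```python
import math

def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy: J = (1/m) * sum(-y*log(p) - (1-y)*log(1-p)).

    y_true: 0/1 labels; y_pred: predicted probabilities.
    eps clips predictions away from exact 0 or 1 to keep log finite.
    """
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / m

# Confident correct prediction -> tiny loss; confident wrong -> large loss.
print(bce_loss([1], [0.99]))  # small (~0.01)
print(bce_loss([1], [0.01]))  # large (~4.6)
```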
Intuition: confident wrong predictions are penalized extremely hard. This is why cross-entropy pushes models toward better calibration and sharper decision quality.
Implementation caution: direct sigmoid then log can hit numerical issues near 0 or 1. Production code usually uses logits-space losses (for example BCEWithLogitsLoss) for stability.
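One standard stable rearrangement works on the raw logit instead of the probability (libraries such as PyTorch's BCEWithLogitsLoss use an equivalent identity; the helper name here is our own):

```python
import math

def bce_with_logits(y, z):
    """Numerically stable BCE on a raw logit z (before the sigmoid).

    Algebraically equal to -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z)),
    rearranged so exp() never overflows and log() never sees exact 0.
    """
    return max(z, 0) - z * y + math.log1p(math.exp(-abs(z)))

# The naive sigmoid-then-log path fails for large |z|:
# sigmoid(100) rounds to 1.0, so log(1 - sigmoid(100)) = log(0) = -inf.
# The logits-space form stays finite and correct.
z = 100.0
print(bce_with_logits(1, z))  # ~0.0
print(bce_with_logits(0, z))  # ~100.0
```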
Evaluation reminder: optimizing log loss improves probabilistic quality, but operational success also depends on threshold-specific metrics (precision/recall/F1) aligned with business cost.
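To make the threshold point concrete, here is a small sketch (the toy data and function name are our own): the same probabilistic model yields different precision/recall tradeoffs depending on where you cut.

```python
def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, and F1 after applying a decision threshold.

    A model with good log loss may still need a non-0.5 threshold when
    false positives and false negatives carry different business costs.
    """
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.3]
print(threshold_metrics(y_true, y_prob, 0.5))   # stricter: higher precision
print(threshold_metrics(y_true, y_prob, 0.25))  # looser: higher recall
```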
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- In the last video you saw the loss function and the cost function for logistic regression.
- In this video you'll see a slightly simpler way to write out the loss and cost functions, so that the implementation can be a bit simpler when we get to gradient descent for fitting the parameters of a logistic regression model.
- Because we're still working on a binary classification problem, y is either zero or one.
- Using this simplified loss function, let's go back and write out the cost function for logistic regression.
- So with the simplified cost function, we're now ready to jump into applying gradient descent to logistic regression.
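Assuming the standard batch update rule, the gradient descent step the notes lead up to can be sketched on a toy 1-D problem (the data and names here are illustrative, not from the source):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=1000):
    """Batch gradient descent on the log-loss cost for 1-D logistic regression.

    With cross-entropy + sigmoid, the gradient takes the same simple form
    as in linear regression: (prediction - label) times the input.
    """
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        dw = db = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # dJ/dz for log loss through sigmoid
            dw += err * x / m
            db += err / m
        w -= lr * dw
        b -= lr * db
    return w, b

# Toy separable data: label is 1 when x > 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 2.0 + b))   # close to 1
print(sigmoid(w * -2.0 + b))  # close to 0
```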
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond the short on-screen hints and emphasize production tradeoffs.
- Why MSE creates non-convex surfaces for classification; introducing log loss.
- The cost function that pretty much everyone uses to train logistic regression.
- For logistic regression, the standard objective is binary cross-entropy (log loss), not MSE.
- Because y is either zero or one and cannot take on any value other than zero or one, we'll be able to come up with a simpler way to write this loss function.
- In the case of y equals 0, we also get back the original loss function as defined above.
- This cost function has the nice property that it is convex.
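The convexity claim above can be checked numerically. This sketch (helper names are our own) estimates curvature along one weight for a single training example: MSE after a sigmoid has curvature that flips sign, while log loss stays convex.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def second_diff(f, w, h=1e-3):
    """Finite-difference estimate of f''(w); a convex f keeps this >= 0."""
    return (f(w + h) - 2 * f(w) + f(w - h)) / h ** 2

# Cost along one weight for a single example (y=1, x=1, b=0):
mse = lambda w: (sigmoid(w) - 1) ** 2      # MSE composed with sigmoid
logloss = lambda w: -math.log(sigmoid(w))  # cross-entropy for y=1

# MSE's curvature is positive near 0 but negative far out -> non-convex.
print(second_diff(mse, 0.0) > 0, second_diff(mse, -5.0) < 0)
# Log loss has curvature sigmoid(w)*(1-sigmoid(w)) > 0 everywhere -> convex.
print(all(second_diff(logloss, w) > 0 for w in range(-5, 6)))
```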
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- More aggressive optimization settings (for example, larger learning rates) shorten training but can destabilize learning if the dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.