
Logistic Regression — Cost Function

Why MSE creates non-convex surfaces for classification; introducing log loss.

Core Theory

For logistic regression, the standard objective is binary cross-entropy (log loss), not MSE.

Why: composing MSE with the sigmoid produces a non-convex cost surface with many local minima, which makes gradient descent unreliable; log loss is derived from maximum likelihood and gives stable, principled probability training.

Per-example loss:

  • y=1 -> -log(ŷ)
  • y=0 -> -log(1-ŷ)

Dataset objective: J = (1/m) · Σᵢ [ −yᵢ·log(ŷᵢ) − (1−yᵢ)·log(1−ŷᵢ) ].
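
A minimal sketch of this objective in NumPy (the eps clipping and the example values are illustrative assumptions, not from the source):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """J = (1/m) * sum(-y*log(y_hat) - (1-y)*log(1-y_hat))"""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite at the boundaries
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, y_hat))  # ≈ 0.28
```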

Intuition: confident wrong predictions are penalized extremely hard, since the loss grows without bound as a confident prediction approaches the wrong label. This is why cross-entropy pushes models toward well-calibrated probabilities rather than merely correct labels.

Implementation caution: applying the sigmoid and then log directly can underflow or overflow when predictions are near 0 or 1. Production code usually computes the loss in logits space (for example, BCEWithLogitsLoss) for numerical stability.
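
A hedged sketch of that pattern in PyTorch (the logit and target values are invented for illustration): BCEWithLogitsLoss fuses the sigmoid into the loss so the computation stays in logits space.

```python
import torch
import torch.nn as nn

logits = torch.tensor([8.0, -12.0, 0.5])    # raw scores, before any sigmoid
targets = torch.tensor([1.0, 0.0, 1.0])

# Stable: sigmoid + log fused into one logits-space computation
stable = nn.BCEWithLogitsLoss()(logits, targets)

# Fragile: sigmoid can saturate to exactly 0 or 1 in float32,
# after which log() returns -inf and gradients turn into NaN
fragile = nn.BCELoss()(torch.sigmoid(logits), targets)

print(stable.item(), fragile.item())  # agree here; diverge at extreme logits
```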

Evaluation reminder: optimizing log loss improves probabilistic quality, but operational success also depends on threshold-specific metrics (precision/recall/F1) aligned with business cost.
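
A sketch of that two-sided evaluation with scikit-learn; the probabilities and the 0.5 threshold are illustrative assumptions:

```python
from sklearn.metrics import f1_score, log_loss, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.8, 0.3, 0.55, 0.4, 0.2, 0.6]  # predicted P(y=1)

print(log_loss(y_true, y_prob))           # probabilistic quality
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold set by business cost
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```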

Deepening Notes

Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.

  • In the last video you saw the loss function and the cost function for logistic regression.
  • In this video you'll see a slightly simpler way to write out the loss and cost functions, so that the implementation can be a bit simpler when we get to gradient descent for fitting the parameters of a logistic regression model.
  • Because we're still working on a binary classification problem, y is either zero or one.
  • Using this simplified loss function, let's go back and write out the cost function for logistic regression.
  • So with the simplified cost function, we're now ready to jump into applying gradient descent to logistic regression.
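
A quick sketch of the simplification those notes describe: because y is always 0 or 1, the two-case loss collapses into one expression. The check below (with illustrative values) confirms the two forms agree.

```python
import numpy as np

def case_loss(y, y_hat):
    # Original two-case definition
    return -np.log(y_hat) if y == 1 else -np.log(1 - y_hat)

def simplified_loss(y, y_hat):
    # Single expression: the y and (1 - y) factors switch the terms on and off
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

for y in (0, 1):
    for y_hat in (0.1, 0.5, 0.9):
        assert np.isclose(case_loss(y, y_hat), simplified_loss(y, y_hat))
print("two forms agree")
```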

Interview-Ready Deepening

Source-backed reinforcement: these points add detail beyond the summary above and emphasize production tradeoffs.

  • Binary cross-entropy is the cost function that pretty much everyone uses to train logistic regression, not MSE.
  • Because y is either zero or one and cannot take on any other value, the two-case loss can be written as a single, simpler expression.
  • Substituting y = 0 (or y = 1) into that single expression recovers the original case-based loss.
  • This cost function has the nice property that it is convex, so gradient descent does not get trapped in local minima.

Tradeoffs You Should Be Able to Explain

  • More expressive models improve fit but can reduce interpretability and raise overfitting risk.
  • Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
  • Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.

First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.

Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.


💡 Concrete Example

If y=1 (tumour is malignant) and model predicts ŷ=0.01 (99% confident it's benign): loss = −log(0.01) ≈ 4.6 (very high penalty). If model predicts ŷ=0.99: loss = −log(0.99) ≈ 0.01 (tiny penalty). Log loss harshly penalises confident wrong predictions.
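
A two-line check of that arithmetic (natural log, as used throughout):

```python
import numpy as np
print(-np.log(0.01), -np.log(0.99))  # ≈ 4.61 and ≈ 0.01
```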



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for Logistic Regression — Cost Function.
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.

💻 Code Walkthrough

Concept-to-code walkthrough checklist for this topic; a worked sketch applying it follows the list.

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.
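
A minimal end-to-end sketch applying this checklist to logistic regression with batch gradient descent. The synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Input/output contract: X is (m, n) features, y is (m,) labels in {0, 1}
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2. One line per conceptual step: linear score -> sigmoid -> gradient of log loss
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    y_hat = sigmoid(X @ w + b)
    w -= lr * (X.T @ (y_hat - y) / len(y))  # dJ/dw = (1/m) * X^T (ŷ - y)
    b -= lr * np.mean(y_hat - y)            # dJ/db = (1/m) * sum(ŷ - y)

# 3. Tradeoff: convex and simple, but only a linear decision boundary.
#    Failure mode: ŷ saturating at 0/1 makes log() overflow without clipping.
eps = 1e-12
y_hat = np.clip(sigmoid(X @ w + b), eps, 1 - eps)
J = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
print(f"final cost {J:.4f}, accuracy {np.mean((y_hat >= 0.5) == y):.2f}")
```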

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] Why can't we use MSE as the cost function for logistic regression?
    With a sigmoid hypothesis, squared error makes the cost surface non-convex, with many local minima where gradient descent can stall far from the optimum. Log loss restores a convex surface and, because it comes from maximum likelihood, directly optimizes the quality of the predicted probabilities.
  • Q2[beginner] What is log loss intuitively? What does it penalise most?
    It is the negative log-probability the model assigns to the true label: −log(ŷ) when y=1 and −log(1−ŷ) when y=0. It penalises confident wrong predictions hardest: predicting ŷ=0.01 for a truly malignant case (y=1) costs −log(0.01) ≈ 4.6, while predicting ŷ=0.99 costs only ≈ 0.01.
  • Q3[intermediate] Where does log loss come from mathematically?
    Log loss (cross-entropy) is derived from maximum likelihood estimation: we're maximising the probability that the training labels were generated by our model. Assuming independent Bernoulli labels with parameter ŷ, the negative mean log-likelihood of the training set is exactly the cost J above (see the derivation after this list).
  • Q4[expert] Why is BCEWithLogitsLoss preferred over sigmoid + BCELoss in production code?
    Applying the sigmoid first can saturate to exactly 0 or 1 in floating point, after which log() returns −inf and gradients become NaN. BCEWithLogitsLoss fuses the sigmoid and the log into a single logits-space computation using the log-sum-exp trick, so it stays stable even for extreme scores.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    Lead with the MLE derivation: log loss maximises the likelihood that the model generated the training labels, which gives it a solid probabilistic grounding, and it is convex for logistic regression. Knowing the MLE derivation separates senior answers from junior ones. Also: log loss is the standard classification objective in every framework; PyTorch's BCELoss, sklearn's log_loss, and TensorFlow's BinaryCrossentropy all implement the same formula.
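
For Q3, the derivation in full (standard MLE for Bernoulli labels, written in this page's notation): the likelihood of the training set is L = Πᵢ ŷᵢ^(yᵢ) · (1−ŷᵢ)^(1−yᵢ). Taking −(1/m)·log L term by term gives J = (1/m) · Σᵢ [ −yᵢ·log(ŷᵢ) − (1−yᵢ)·log(1−ŷᵢ) ], exactly the dataset objective above. Minimizing log loss is therefore maximizing likelihood.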
