Computers store numbers with limited precision (floating-point). Computing softmax in two steps (first computing activations a, then computing the loss from a) introduces numerical roundoff errors. These errors are small for logistic regression but can become significant for softmax.
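A quick way to see roundoff at work, using plain Python floats (an illustrative sketch; the specific numbers are just a demonstration):

```python
# Two mathematically equivalent ways to compute 2/10,000.
x1 = 2.0 / 10000.0                                   # computed directly
x2 = (1.0 + 1.0 / 10000.0) - (1.0 - 1.0 / 10000.0)   # computed via intermediate terms

print(f"{x1:.20f}")   # direct result
print(f"{x2:.20f}")   # slightly different: rounding in the intermediates leaks through

# The two results differ by a tiny amount, purely from intermediate rounding.
print(abs(x1 - x2))
```

The same effect inside an exponential-heavy loss like softmax is what from_logits=True lets TensorFlow avoid.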
The solution: let TensorFlow combine the activation and loss computation into one step, giving it freedom to rearrange terms for numerical stability. This is triggered by setting from_logits=True in the loss function and changing the output layer to use linear activation:
# Recommended (numerically stable):
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(10, activation='linear')  # outputs z values (logits)
])
model.compile(
    loss=SparseCategoricalCrossentropy(from_logits=True)
)
TensorFlow then computes the full expression internally in a numerically stable way. The downside: the output layer now produces raw z values ("logits"), not probabilities. To get probabilities for inference, apply softmax manually:
logits = model(x)
probs = tf.nn.softmax(logits)
The same pattern applies to binary classification: use linear output + BinaryCrossentropy(from_logits=True) instead of sigmoid output + BinaryCrossentropy.
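A minimal sketch of that binary setup (the hidden-layer size here is an illustrative assumption, not from the source):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

model = Sequential([
    Dense(25, activation='relu'),
    Dense(1, activation='linear')   # raw logit, not a probability
])
model.compile(loss=BinaryCrossentropy(from_logits=True))

# At inference time, map the logit back to a probability with sigmoid:
# prob = tf.math.sigmoid(model(x))
```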
Interview-Ready Deepening
Source-backed reinforcement: these points restate the key ideas in more detail and emphasize production tradeoffs.
- Use from_logits=True to avoid floating-point roundoff errors in both softmax and logistic (binary cross-entropy) loss.
- Computing softmax in two steps (activations a first, then the loss from a) introduces numerical roundoff errors; for logistic regression these are usually not that bad, but they can become significant for softmax.
- The pattern is triggered by setting from_logits=True in the loss function and switching the output layer to linear activation.
- A simple demonstration: compute 2/10,000 directly versus as (1 + 1/10,000) - (1 - 1/10,000) and print both to many decimal places; the results differ because of intermediate rounding.
- With from_logits=True, TensorFlow sees the whole loss expression in terms of z (for example, when y = 10 the loss is -log of a_10 written out in full), which gives it the ability to rearrange terms and compute the result in a numerically accurate way.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.
Numerical stability is an engineering detail that changes model quality in practice. The mathematically equivalent implementation is not always the computationally safest implementation because floating-point arithmetic can underflow or overflow when exponentials and near-canceling values appear.
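For example, a softmax that exponentiates large logits directly will overflow, while the standard max-subtraction trick keeps every exponent non-positive. A plain-Python sketch of that trick (not TensorFlow's actual implementation):

```python
import math

def softmax_stable(z):
    # Subtracting max(z) leaves the result mathematically unchanged
    # but keeps every exponent <= 0, so exp() cannot overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000) raises OverflowError, so exponentiating these logits
# directly would fail; the shifted version is exact:
print(softmax_stable([1000.0, 1000.0]))   # two equal logits -> equal probabilities
```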
Production rule: prefer logits-based loss APIs when the framework offers them. They give the library room to rearrange computations more safely than forcing it to materialize fragile intermediate probabilities first.
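As an illustration of why logits-based losses are safer, cross-entropy can be computed directly from logits via the log-sum-exp identity, -log softmax(z)[y] = logsumexp(z) - z[y], without ever materializing a near-zero probability. A plain-Python sketch of that identity (not the actual TensorFlow code):

```python
import math

def cross_entropy_from_logits(z, y):
    # -log softmax(z)[y] == logsumexp(z) - z[y]; shifting by max(z)
    # makes the logsumexp safe against overflow.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[y]

# With extreme logits, the probability of class 1 underflows to 0.0,
# so -log(p) computed from probabilities would blow up; the logits
# form returns the exact answer:
print(cross_entropy_from_logits([1000.0, 0.0], 1))   # 1000.0
```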