If feature magnitudes differ a lot (for example sqft around 1500 while bedrooms around 3), gradient descent gets badly conditioned: one direction is steep, another is flat, so updates zigzag.
Scaling normalises optimization geometry so one learning rate can work across dimensions.
Common scaling families:
- Min-max: x_scaled=(x-x_min)/(x_max-x_min), typically [0,1]
- Z-score standardisation: x_scaled=(x-mu)/sigma, mean 0 and std 1
- Robust scaling: center by median and scale by IQR for heavy outliers
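The three families above can be sketched in a few lines of numpy. This is a minimal sketch with helper names of my own choosing (not from the source); each function takes the statistics as explicit arguments to make the later leakage rule easy to follow.

```python
import numpy as np

def min_max_scale(x, x_min, x_max):
    """Map values into [0, 1] using training-set min/max."""
    return (x - x_min) / (x_max - x_min)

def z_score_scale(x, mu, sigma):
    """Center to mean 0 and scale to std 1 using training-set stats."""
    return (x - mu) / sigma

def robust_scale(x, median, iqr):
    """Center by median, scale by interquartile range (outlier-resistant)."""
    return (x - median) / iqr

# Example: sqft values sit on a much larger scale than bedroom counts
sqft = np.array([800.0, 1500.0, 2200.0, 3000.0])
scaled = min_max_scale(sqft, sqft.min(), sqft.max())   # now in [0, 1]
z = z_score_scale(sqft, sqft.mean(), sqft.std())       # now mean 0, std 1
```

Passing the stats in explicitly (rather than recomputing them inside each call) is what lets you fit them once on the training split and reuse them everywhere else.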
Choosing method: standardisation is the default for gradient-based models; min-max is useful when bounded ranges are required; robust scaling helps when extreme values dominate.
Leakage rule (non-negotiable): compute scaling statistics on training split only, then reuse those exact stats for validation, test, and inference traffic.
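The leakage rule reduces to one pattern: fit the statistics on the training split, then apply those frozen numbers to everything else. A minimal sketch with made-up data:

```python
import numpy as np

# Hypothetical split: the test rows were never seen when computing stats
train = np.array([1200.0, 1500.0, 1800.0, 2100.0])
test = np.array([900.0, 2500.0])

mu, sigma = train.mean(), train.std()   # computed on the train split ONLY

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma       # reuse the SAME train stats at test/inference time
```

Note that `test_scaled` is deliberately not re-centered: recomputing `mu` and `sigma` on the test split would leak information and make offline metrics unrepresentative of inference traffic.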
Operational rule: scaling parameters are model artifacts. Version them with model checkpoints so retraining and rollback stay reproducible.
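One lightweight way to treat scaling parameters as versioned artifacts is to serialize them next to the checkpoint under the same version tag. The filename scheme below is a hypothetical convention, not something prescribed by the source:

```python
import json
import os
import tempfile

# Hypothetical artifact: scaler stats saved beside the model checkpoint,
# sharing its version suffix so rollback restores both together
scaler_stats = {"feature": "sqft", "mu": 1650.0, "sigma": 335.41}

path = os.path.join(tempfile.gettempdir(), "scaler_v3.json")
with open(path, "w") as f:
    json.dump(scaler_stats, f)

# At inference or rollback time, reload the exact stats used in training
with open(path) as f:
    loaded = json.load(f)
```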
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- To calculate the mean normalization of x_1, first find the average (mean) of x_1 on your training set, and call this mean mu_1, written with the Greek letter mu.
- To implement Z-score normalization, you need to calculate something called the standard deviation of each feature.
- In this case, these values are around 100, which is pretty large compared to the other, already-scaled features, and this will cause gradient descent to run more slowly.
- With this little technique, you'll often be able to get gradient descent to run much faster.
- With or without feature scaling, when you run gradient descent, how can you check whether it is really working?
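The standard answer to that last question is a learning curve: record the cost J at every iteration and confirm it decreases. A toy 1-D linear regression sketch (the data and learning rate are illustrative, not from the source):

```python
import numpy as np

# Toy dataset: y = 2x + 1, so gradient descent should recover w=2, b=1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b, alpha = 0.0, 0.0, 0.05
costs = []
for _ in range(100):
    pred = w * x + b
    err = pred - y
    costs.append((err ** 2).mean() / 2)   # J(w, b): mean squared error / 2
    w -= alpha * (err * x).mean()         # partial derivative of J w.r.t. w
    b -= alpha * err.mean()               # partial derivative of J w.r.t. b

# With a well-chosen alpha, J decreases on every single iteration;
# if J ever increases, the learning rate is too large (or there is a bug)
```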
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond short-duration UI hints and emphasize production tradeoffs.
- Normalising features so gradient descent converges faster is a must-do step.
- If the features range from negative three to plus three or negative 0.3 to plus 0.3, all of these are completely okay.
- But if another feature, like x_3 here, ranges from negative 100 to plus 100, that is a very different range of values, and it is likely worth re-scaling it to something from around negative one to plus one.
- In this case, feature re-scaling will likely help.
- There's almost never any harm to carrying out feature re-scaling.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.