Manual z-score implementation:
- Compute μ = mean of each feature column across training examples: mu = np.mean(X_train, axis=0)
- Compute σ = standard deviation of each column: sigma = np.std(X_train, axis=0)
- Transform: X_scaled = (X - mu) / sigma
- Store μ and σ, and apply the same values to val/test/production data
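The steps above can be sketched end-to-end in NumPy; the array values and shapes below are made-up illustrations:

```python
import numpy as np

# Hypothetical training matrix: rows are examples, columns are features
X_train = np.array([[1500.0, 3.0],
                    [2000.0, 4.0],
                    [1000.0, 2.0]])

mu = np.mean(X_train, axis=0)      # per-feature mean
sigma = np.std(X_train, axis=0)    # per-feature standard deviation
X_scaled = (X_train - mu) / sigma  # z-score: zero mean, unit variance per column

# Reuse the SAME mu/sigma for new data (e.g., a validation row)
X_new = np.array([[1800.0, 3.0]])
X_new_scaled = (X_new - mu) / sigma
```

After scaling, each training column has mean 0 and standard deviation 1; the new row is mapped with the training statistics, not its own.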
sklearn's StandardScaler encapsulates this cleanly:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # computes and stores μ and σ
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses training μ/σ
The separation of fit and transform is the key design pattern — it enforces the rule that test data is never used to compute scaling parameters.
In production, save the fitted scaler with joblib.dump(scaler, 'scaler.pkl') alongside your model. Load it at inference time to scale incoming requests with the same parameters.
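A minimal persistence round-trip might look like the following; the file name and data are placeholders, and it assumes scikit-learn and joblib are installed:

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data
X_train = np.array([[1500.0, 3.0], [2000.0, 4.0], [1000.0, 2.0]])

scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, "scaler.pkl")       # save alongside the model artifact

# At inference time, reload and apply the *training* parameters
loaded = joblib.load("scaler.pkl")
request = np.array([[1800.0, 3.0]])     # hypothetical incoming request
request_scaled = loaded.transform(request)
```

The reloaded scaler carries the training μ and σ, so an incoming request is scaled exactly as the training data was.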
Deepening Notes
Source-backed reinforcement: these points are extracted from the session source note to strengthen your theory intuition.
- Recall that the job of gradient descent is to find parameters w and b that hopefully minimize the cost function J.
- What I'll often do is plot the cost function J, which is calculated on the training set, showing the value of J at each iteration of gradient descent.
- Notice that the horizontal axis is the number of iterations of gradient descent and not a parameter like w or b.
- Concretely, a point on the curve at iteration 100 means that after you've run gradient descent for 100 iterations, meaning 100 simultaneous updates of the parameters, you have some learned values for w and b.
- Likewise, the point at iteration 200 corresponds to the value of J for the parameters you got after 200 iterations of gradient descent.
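The points above can be made concrete with a small script that records J at every iteration; the data and learning rate here are invented for illustration:

```python
import numpy as np

# Tiny synthetic regression problem: y is exactly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b = 0.0, 0.0
alpha = 0.05          # learning rate (illustrative)
J_history = []

for _ in range(300):
    pred = w * x + b
    err = pred - y
    J_history.append(np.mean(err ** 2) / 2.0)  # cost J on the training set
    w -= alpha * np.mean(err * x)              # simultaneous updates of w and b
    b -= alpha * np.mean(err)

# J_history[100] is the cost after 100 updates, J_history[200] after 200;
# plotting J_history against the iteration index gives the learning curve,
# with iteration count (not w or b) on the horizontal axis.
```

If gradient descent is working, J_history decreases with the iteration index, which is exactly what the learning-curve plot shows.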
Interview-Ready Deepening
Source-backed reinforcement: these points add detail beyond the brief in-app hints and emphasize production tradeoffs.
- Coding z-score normalisation from scratch; using sklearn's StandardScaler.
- Without scaling: gradient descent on house price (sqft=1500, bedrooms=3) zigzags for 10,000 iterations. With z-score scaling (sqft_scaled≈0.5, bedrooms_scaled≈0.2), converges in 100 iterations. Same model, 100× fewer iterations.
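The convergence claim can be checked with a toy experiment; all numbers here (feature ranges, learning rates, iteration budget) are illustrative assumptions, not measurements from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(1000, 2000, size=50)
beds = rng.integers(1, 5, size=50).astype(float)
X = np.column_stack([sqft, beds])
y = 0.2 * sqft + 10.0 * beds        # exactly linear target, no noise

def run_gd(X, y, alpha, iters):
    """Plain batch gradient descent on MSE/2; returns the final cost."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / m
        b -= alpha * err.mean()
    err = X @ w + b - y
    return np.mean(err ** 2) / 2.0

# Raw features: the huge scale mismatch forces a tiny learning rate,
# so some parameter directions barely move in 100 iterations
cost_raw = run_gd(X, y, alpha=5e-7, iters=100)

# Z-score scaled features: a large learning rate is stable,
# and the same budget of 100 iterations gets close to the optimum
mu, sigma = X.mean(axis=0), X.std(axis=0)
cost_scaled = run_gd((X - mu) / sigma, y, alpha=0.1, iters=100)
```

The scaled run reaches a far lower cost in the same 100 iterations, which is the "same model, far fewer iterations" effect described above.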
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.