Forward propagation is the inference algorithm — the sequence of computations that turns input x into prediction ŷ. It proceeds layer by layer, left to right, which is why it's called "forward".
Step-by-step for a 3-layer network:
- Compute a[1]: apply layer 1's 25 units to input x. Each unit computes sigmoid(w·x + b). Output: vector of 25 activations.
- Compute a[2]: apply layer 2's 15 units to a[1]. Output: vector of 15 activations.
- Compute a[3]: apply the single output unit to a[2]. Output: scalar probability.
- Optional threshold: if a[3] ≥ 0.5, predict ŷ = 1 (same as logistic regression).
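The steps above can be sketched in NumPy. This is a minimal illustration only: the shapes match the 25-15-1 network described here, but the weights are random placeholders, not trained parameters, and the 64-feature input size is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_in, W, b):
    # One dense layer: each column of W holds one unit's weight vector w,
    # so every unit computes sigmoid(w . a_in + b) in parallel.
    return sigmoid(a_in @ W + b)

# Illustrative shapes for the 25-15-1 network (random, untrained parameters).
rng = np.random.default_rng(0)
n_x = 64                                 # assumed input size (e.g., 8x8 pixels)
W1, b1 = rng.normal(size=(n_x, 25)), np.zeros(25)
W2, b2 = rng.normal(size=(25, 15)), np.zeros(15)
W3, b3 = rng.normal(size=(15, 1)),  np.zeros(1)

def forward(x):
    a1 = dense(x,  W1, b1)   # layer 1: vector of 25 activations
    a2 = dense(a1, W2, b2)   # layer 2: vector of 15 activations
    a3 = dense(a2, W3, b3)   # output layer: scalar probability
    return a3

x = rng.normal(size=n_x)
p = forward(x)
yhat = 1 if p[0] >= 0.5 else 0           # optional thresholding step
```

Note that inference is just three function applications in sequence; the per-layer `dense` helper is the only moving part.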
The output of a neural network is also written as f(x) — consistent with how we wrote the logistic regression output in Course 1. The neural network is just a more expressive version of f(x).
Forward vs backward: forward propagation computes predictions. Backward propagation (backprop) — covered in Week 2 — computes gradients for training. If you already have trained parameters w and b (e.g., downloaded from the internet), you only need forward propagation for inference.
Typical architecture pattern: more units in earlier layers, fewer as you get deeper toward the output. This is a common and generally effective architecture choice.
Interview-Ready Deepening
These points reinforce the material above and add the detail you would be expected to articulate in interviews or production discussions.
- The algorithm for making predictions: computing activations left to right through all layers.
- Given these 64 input features, we're going to use the neural network with two hidden layers.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- More aggressive optimization (e.g., larger learning rates) can reduce training time but may destabilize training if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. Most expensive failures come from one of those three.
Forward propagation is a deterministic dataflow graph. Once the parameters are fixed, inference is just repeated function application from left to right. This is why inference services can be cached, benchmarked, and profiled like any other computation pipeline.
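Because fixed-parameter inference is a pure function, identical inputs always yield identical outputs, which is what makes result caching safe. A tiny sketch of that idea, with a single made-up layer and a dictionary cache (the memoization scheme here is illustrative, not a recommendation for a specific serving stack):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One fixed, illustrative layer: once W and b are frozen, forward() is pure.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x):
    return sigmoid(x @ W + b)

cache = {}

def cached_forward(x):
    # Deterministic dataflow: same bytes in -> same prediction out,
    # so the raw input bytes make a valid cache key.
    key = x.tobytes()
    if key not in cache:
        cache[key] = forward(x)
    return cache[key]

x = rng.normal(size=4)
first = cached_forward(x)    # computed
second = cached_forward(x)   # served from cache, bit-identical
```

The same determinism is what lets you benchmark and profile inference reproducibly.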
Production view: inference latency depends on layer count, width, activation cost, and batch size. The math topic here connects directly to serving design later: knowing where activations are computed tells you where memory, latency, and numerical issues show up in real systems.
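One way to make the cost intuition concrete is to count the multiply-adds a dense forward pass performs per example: each layer contributes (inputs × units) operations, so layer count and width dominate. A back-of-envelope sketch (the helper name and the 64-25-15-1 sizing are illustrative):

```python
def dense_flops(layer_sizes):
    """Rough multiply-add count per example for a fully connected network.

    layer_sizes lists the width of every layer, input first,
    e.g. [64, 25, 15, 1] for the network discussed above.
    Bias additions and activation-function cost are ignored.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 64*25 + 25*15 + 15*1 = 1990 multiply-adds per example
cost = dense_flops([64, 25, 15, 1])
```

Notice that the first (widest) layer accounts for most of the work — consistent with the earlier observation that earlier layers tend to have more units.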