Backpropagation is a leaky abstraction. Every engineer who skips the math eventually hits a wall they cannot reason past. This course builds the foundation from scratch — so that wall never stops you.
Andrew Ng's courses are excellent. fast.ai is practical. But both make a choice: they skip the hard parts to keep you moving. That choice leaves a gap. When something breaks at the gradient level, you are guessing, not reasoning.
PyTorch computes gradients automatically. That is its strength — and your weakness if you never derived them yourself. Autograd is not understanding. It is the absence of the need to understand, until suddenly you need to.
Dead ReLUs. Vanishing gradients. A model that trains fine but fails silently. Engineers who understand backpropagation see these coming. Those who don't can spend weeks wondering why their loss curve went flat.
You can explain what attention does. But can you explain why the dot product of queries and keys produces meaningful similarity scores? Real understanding goes one level deeper than most courses are willing to go.
| Elsewhere | Here |
|---|---|
| Derivation in the appendix | Derivation is the session |
| Framework hides the gradient | You compute the gradient yourself |
| "Trust the autograd" | You are the autograd, in NumPy |
| Intuition without the proof | Proof first, intuition follows |
| 15% course completion | Live. 30 people. Accountable |
| Knowledge expires with the API | Math does not have deprecation notices |
Not the Wikipedia answer. We start from a single neuron, understand what it is computing and why, then stack layers until the structure of a network stops feeling arbitrary and starts feeling inevitable.
These are the smallest possible neural networks. We derive the loss function, write gradient descent from scratch, and build the intuition for learning itself before a single hidden layer appears. Everything that follows builds on this foundation.
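To give a flavour of what "gradient descent from scratch" can look like, here is a minimal sketch. It assumes the smallest network is a single sigmoid neuron trained with binary cross-entropy; the function name `train_neuron`, the OR dataset, and the hyperparameters are illustrative choices, not the course's actual material.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, lr=0.5, steps=2000):
    """Fit one sigmoid neuron with plain gradient descent (hypothetical helper)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    eps = 1e-12  # keeps log() away from zero
    for _ in range(steps):
        p = sigmoid(X @ w + b)  # forward pass
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        losses.append(loss)
        # For sigmoid + cross-entropy, dLoss/dz simplifies to (p - y) / N.
        grad_z = (p - y) / len(y)
        w -= lr * (X.T @ grad_z)
        b -= lr * grad_z.sum()
    return w, b, losses

# Toy data: the logical OR function, which is linearly separable.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

w, b, losses = train_neuron(X, y)
preds = (sigmoid(X @ w + b) > 0.5).astype(float)
```

The one line worth deriving by hand is `grad_z = (p - y) / len(y)`: the sigmoid derivative cancels against the cross-entropy derivative, which is why the update looks so simple.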
We draw the computational graph. We trace exactly how a number flows forward through the network and how an error signal flows backward through the same graph in reverse. The chain rule stops being a formula and becomes a picture.
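As a rough illustration of that picture, the sketch below walks a tiny graph for the loss L = (w*x + b - y)^2, one node at a time, then walks it again in reverse. The graph and the names `forward` and `backward` are assumptions made for this example.

```python
def forward(w, x, b, y):
    """Forward pass: each line is one node of the graph."""
    u = w * x   # multiply node
    z = u + b   # add node
    e = z - y   # error node
    L = e * e   # square node
    return L, (x, e)  # cache only what the backward pass needs

def backward(cache):
    """Backward pass: same graph, reverse order, one local derivative per node."""
    x, e = cache
    dL_de = 2.0 * e      # d(e*e)/de
    dL_dz = dL_de * 1.0  # d(z - y)/dz
    dL_db = dL_dz * 1.0  # d(u + b)/db
    dL_du = dL_dz * 1.0  # d(u + b)/du
    dL_dw = dL_du * x    # d(w*x)/dw
    return dL_dw, dL_db

w, x, b, y = 2.0, 3.0, 1.0, 4.0
L, cache = forward(w, x, b, y)  # L = (2*3 + 1 - 4)^2 = 9.0
dw, db = backward(cache)        # dw = 18.0, db = 6.0

# Sanity check against a central finite difference.
h = 1e-5
dw_numeric = (forward(w + h, x, b, y)[0] - forward(w - h, x, b, y)[0]) / (2 * h)
```

Each backward line multiplies the incoming gradient by one local derivative, which is all the chain rule asks for.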
The centrepiece of the course. We derive the four fundamental equations of backprop line by line — no steps skipped, no "it can be shown that." Then we implement it in NumPy. Then we watch it train. This is the session most engineers say they wish they had years earlier.
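For orientation, the four equations are commonly written as below. The notation is an assumption, since the page does not state the course's own: C is the cost, z^l and a^l are the pre-activations and activations of layer l, sigma is the activation function, delta^l is the error at layer l, and L is the output layer.

```latex
\begin{aligned}
\delta^{L} &= \nabla_{a} C \odot \sigma'(z^{L}) \\
\delta^{l} &= \left( (W^{l+1})^{\top} \delta^{l+1} \right) \odot \sigma'(z^{l}) \\
\frac{\partial C}{\partial b^{l}_{j}} &= \delta^{l}_{j} \\
\frac{\partial C}{\partial w^{l}_{jk}} &= a^{l-1}_{k} \, \delta^{l}_{j}
\end{aligned}
```

The first equation starts the error at the output, the second carries it back one layer at a time, and the last two turn it into gradients for the biases and weights.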
Sequences introduce a new problem: gradients must travel through time. We derive backpropagation through time, understand exactly why gradients vanish over long sequences, and see how LSTMs mitigate the problem structurally, not as a magic trick but as a designed response to a specific mathematical failure.
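The vanishing effect can be seen in a few lines. This is a simplified sketch, assuming a scalar recurrence h_t = tanh(w * h_{t-1}) with no inputs; the function name `gradient_through_time` is hypothetical. By the chain rule, the gradient of h_T with respect to h_0 is a product of per-step factors w * (1 - h_t^2).

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Track |d h_t / d h_0| for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    history = []
    for _ in range(steps):
        h = math.tanh(w * h)
        # Local derivative of tanh(w * h_prev) with respect to h_prev.
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

history = gradient_through_time(w=0.9, steps=50)
```

Because 1 - h^2 never exceeds 1, every factor here has magnitude at most 0.9, so after 50 steps the gradient is bounded by 0.9^50, roughly 0.005. An LSTM's cell state is updated additively, and the corresponding factor along that path is the forget gate, which the network can learn to keep close to 1.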
Attention emerged because RNNs had a fundamental limitation. We arrive here having earned that context. Self-attention, positional encoding, multi-head attention and the full encoder architecture are derived from what you already know — not presented as a new black box but as a logical continuation of everything before it.
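A minimal single-head sketch in NumPy shows how little machinery is involved. It assumes the standard scaled dot-product formulation; the random weights and the shapes are illustrative, and masking, multiple heads and positional encoding are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # q . k = |q||k|cos(theta): large when a query and a key point the same way.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
```

Every output row is a weighted average of the value vectors, with weights set by how well that position's query aligns with each key.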
We move from understanding transformers to building one. Tokenisation, embedding layers, positional encodings, the full decoder stack — written from scratch in Python. No HuggingFace, no shortcuts. You will have a working language model architecture by the end of this module, and you will understand every parameter in it.
Architecture is only half the story. We implement the training loop, cross-entropy loss over a vocabulary, learning rate scheduling, and gradient clipping. We train a small character-level language model from scratch, watch it learn to generate text, and understand exactly why it does — and what limits it. The course ends with something that works, and a mind that understands why.
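Two of those pieces fit in a short NumPy sketch: cross-entropy over a vocabulary, with its gradient, and gradient clipping by global norm. The function names, the toy logits and the clipping threshold are assumptions for illustration, not the course's code.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy over a batch, plus its gradient with respect to the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), targets].mean()
    grad = np.exp(log_probs)            # softmax probabilities
    grad[np.arange(n), targets] -= 1.0  # softmax minus one-hot
    return loss, grad / n

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Uniform logits over a 5-symbol vocabulary: the loss should equal log(5).
logits = np.zeros((2, 5))
targets = np.array([1, 3])
loss, grad = cross_entropy(logits, targets)

clipped, norm_before = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

An untrained model that guesses uniformly scores log(vocabulary size), which gives a useful baseline to read a loss curve against.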
Early bird seats are limited. Price returns to €999 once the first 30 seats are filled.
Join the waitlist. Waitlist members get first access to seats and lock in the early bird price before enrolment opens to the public.
No spam. One email when seats open.