Start of training - randomly initialised
Embedding Layer
Attention Layer
Feed Forward Network (FFN) Layer
LayerNorm parameters
The training loop - how learning actually happens
Step 1 - Forward pass
Step 2 - Compute the loss
Step 3 - Backpropagation
Step 4 - Optimiser step
Repeat - millions/trillions of times
What each component learns
Token Embeddings - lexical meaning
Attention weights - relationships
LayerNorm - signal conditioning and stability
Putting it all together