My implementation of the original Transformer from the paper Attention Is All You Need.
I'm drawing heavily on The Annotated Transformer blog post, and I'm also reading Aleksa Gordić's implementation.
For the first few commits, there will be shameless copying and pasting. The intention is to do the following in this project:
- Understand the original transformer, piece by piece. I'll be copying and pasting, but diving deeply into the intent of each function/class.
- Get the transformer working, most likely through a translation task of some kind.
- Practice coding the entire thing from scratch from memory, using the final translation task as a "unit test" of sorts.
  - This will likely be done by coding individual components from scratch first.
- Once I can rewrite the entire program from memory three times over three days, I will consider my understanding complete.
The notes below are for anyone who wants to follow my logic of development rather than just be handed the end result. I left detailed notes in each file, recording what I understood as I understood it:
- Layer Normalization (LayerNorm.py)
- Sublayers (SublayerUnit.py); see the residual/LayerNorm sketch after this list
- Attention (attention.py); see the attention sketch after this list
- MultiHeadAttention (MultiHeadAttention.py)
- FeedForwardNetwork (FeedForwardNetwork.py)
- Positional Encoding (PositionalEncoding.py)
- Source Masking/Padding (notes added to attention.py)
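For the LayerNorm and sublayer items, here is a minimal sketch of the pre-norm residual wrapper used in The Annotated Transformer (the blog post applies the norm before the sublayer, a slight deviation from the post-norm arrangement in the original paper). The class and argument names are illustrative and are not necessarily the ones in SublayerUnit.py.

```python
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual connection around any sublayer, with layer norm applied first."""

    def __init__(self, size, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Pre-norm residual: x + Dropout(Sublayer(LayerNorm(x)))
        return x + self.dropout(sublayer(self.norm(x)))
```

The same wrapper is reused for both the attention and the feed-forward sublayers, e.g. `SublayerConnection(512)(x, lambda t: feed_forward(t))`.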
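To make the attention and masking items concrete, here is a minimal sketch of scaled dot-product attention with an optional padding mask, in the spirit of The Annotated Transformer. The function and variable names are illustrative and are not guaranteed to match what ends up in attention.py.

```python
import math
import torch


def scaled_dot_product_attention(query, key, value, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V, optionally ignoring masked positions.

    query, key, value: tensors of shape (batch, heads, seq_len, d_k)
    mask: broadcastable tensor; positions where mask == 0 are masked out.
    """
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Large negative value -> ~0 attention weight after the softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.matmul(attn, value), attn


# Illustrative shapes: batch of 2, 4 heads, sequence length 5, d_k = 16.
q = k = v = torch.randn(2, 4, 5, 16)
pad_mask = torch.ones(2, 1, 1, 5, dtype=torch.bool)  # no padded positions here
out, weights = scaled_dot_product_attention(q, k, v, mask=pad_mask)
print(out.shape, weights.shape)  # torch.Size([2, 4, 5, 16]) torch.Size([2, 4, 5, 5])
```

Padded source positions carry a 0 in the mask, so their attention weights collapse to (near) zero after the softmax.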
conda create -n <env name>
conda activate <env name>
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 torchtext==0.16.2 altair spacy -c pytorch
pip install pandas
pip install pytorch-lightning
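After the install, a quick sanity check; the expected version strings are simply the pins from the conda command above.

```python
import torch
import torchtext

print(torch.__version__)          # expect 2.1.2 (possibly with a build suffix)
print(torchtext.__version__)      # expect 0.16.2
print(torch.cuda.is_available())  # True only if a CUDA build and GPU are present
```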