Tips on Building and Training the Original GPT Model From Scratch

I recently spent a weekend building and training the first GPT model from scratch, almost exclusively consulting the original paper and PyTorch documentation. Unsurprisingly, implementing the model was more challenging than the paper implied.

The following are a few tips that helped me complete this project successfully. Several of them are applicable to general model building and training.

Efficiently Batch Data

The authors of GPT trained their model on the BookCorpus dataset. The Hugging Face implementation of this dataset, like many other text corpora, is divided into examples, each consisting of a single line of text. Tokenizing each line, padding it to the context window length, and then batching these lines initially appears to be a reasonable data processing approach.

Adopting this method, however, results in very long training times. The BookCorpus dataset has over 70 million lines, most of which are shorter than my context window size of 256. Assuming a batch size of 64, your model would need to perform over one million updates per epoch. On an RTX 2070 Super (curse the GPU shortage), training a small model with this data processing approach required over 30 hours per epoch.

I recommend instead tokenizing the lines in parallel (while preserving token and line order), concatenating the tokens into one long array, and then applying the batchify function suggested by the PyTorch documentation:

from torch import Tensor

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Split a 1-D token stream into bsz contiguous columns.

    Args:
        data: Tensor, shape [N]
        bsz: int, batch size
    Returns:
        Tensor of shape [N // bsz, bsz]
    """
    seq_len = data.size(0) // bsz
    # drop the trailing tokens that do not fit evenly into bsz columns
    data = data[: seq_len * bsz]
    # reshape so that each column is one contiguous slice of the stream,
    # as in the PyTorch language-modeling tutorial
    return data.view(bsz, seq_len).t().contiguous()

This alternative approach cuts each epoch down to only ~30,000 updates, or less than one hour of training time.
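For reference, here is a minimal sketch of that pipeline. The use of the Hugging Face datasets library and the GPT-2 tokenizer below is an assumption for illustration (the paper trained its own byte-pair-encoding vocabulary); any tokenizer that maps lines to token IDs works the same way:

import torch
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # illustrative choice
dataset = load_dataset("bookcorpus", split="train")

def tokenize(batch):
    # tokenize many lines at once; order within and across lines is preserved
    return {"ids": tokenizer(batch["text"])["input_ids"]}

# num_proc tokenizes shards in parallel while map() keeps example order
tokenized = dataset.map(tokenize, batched=True, num_proc=8, remove_columns=["text"])

# join all tokens into one long 1-D tensor (simple but slow flatten; fine as a
# sketch), then reuse batchify from above
flat_ids = torch.tensor(
    [tok for ids in tokenized["ids"] for tok in ids], dtype=torch.long
)
batches = batchify(flat_ids, bsz=64)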

Ensure Positional Encoding is Additive

Recall that in transformer models the positional encoding is added to the output of the embedding layer rather than replacing it. Forgetting this addition will cause your model to fail to converge.

The correct code (note the +=):

x = self.embedding(x)
x = self.dropout(x)
# add the positional encoding to the embeddings; do not overwrite them
x += self.pos_encode(x)
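For context, here is what a pos_encode module along these lines might look like. This is a sketch assuming GPT-style learned position embeddings (the GPT authors used learned rather than sinusoidal encodings) and the [seq_len, batch] layout produced by batchify above; the class name is illustrative:

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, context_len: int, d_model: int):
        super().__init__()
        self.pos_embedding = nn.Embedding(context_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape [seq_len, batch, d_model]; return one embedding per
        # position, shaped [seq_len, 1, d_model] so it broadcasts over batch
        positions = torch.arange(x.size(0), device=x.device)
        return self.pos_embedding(positions).unsqueeze(1)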

Initialize Embedding Layer Weights

PyTorch modules have default weight initialization methods that are generally adequate. For GPT training, however, the default initialization of the embedding layer can slow training significantly. As the GPT authors suggest, initialize it from a normal distribution with a small standard deviation:

nn.init.normal_(self.embedding.weight, std=0.02)

Furthermore, PyTorch's transformer modules have a known limitation: every block in an nn.TransformerDecoder starts with identical weights, because the module deep-copies the decoder layer you pass in. Be sure, therefore, to initialize the linear layers of each decoder block manually, again with a normal distribution:

for layer in self.decoder.layers:
    # re-initialize the feed-forward linear layers of each cloned block
    nn.init.normal_(layer.linear1.weight, std=0.02)
    nn.init.normal_(layer.linear2.weight, std=0.02)

Clip Gradients

Gradients can explode during training, though this is much less common for transformer models than for recurrent neural networks. You can largely prevent it by clipping gradients in your training script, after the backward pass and before the optimizer step:

torch.nn.utils.clip_grad_norm_(
	self.model.parameters(), 1.0
)

Keep Optimizer Simple

The GPT authors used an Adam optimizer with a learning rate scheduler featuring cosine annealing. While it may be tempting to immediately jump to using a scheduler, I recommend starting with a vanilla Adam or AdamW optimizer with default hyperparameters and a 2.5e-4 learning rate.
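For example (model here stands for your GPT module):

optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4)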

If, after extensive debugging and hyperparameter experimentation, your model still fails to converge, then turn to learning rate scheduling. Scheduling introduces additional hyperparameters that can themselves become a cause of convergence failure.
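If you do reach that point, PyTorch's built-in schedulers cover the cosine annealing used in the paper. A minimal sketch, where total_updates is an assumed variable holding the planned number of optimizer steps:

# cosine annealing over the full training run; a warmup phase (as in the
# paper) can be added with e.g. torch.optim.lr_scheduler.SequentialLR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_updates, eta_min=0.0
)

# in the training loop, after each optimizer step:
scheduler.step()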

Use 16-bit Floating Point Training

My RTX 2070 Super has only 8 GB of VRAM, but even with a better GPU, fitting a full-sized GPT model is no easy task. To make things easier and allow for larger model sizes, use 16-bit (mixed-precision) floating point training. PyTorch makes this pretty straightforward thanks to its automatic mixed precision (AMP) tooling and the GradScaler helper:

scaler = torch.cuda.amp.GradScaler()
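A minimal sketch of how this scaler fits into a training step, together with autocast and the gradient clipping from earlier (model, optimizer, criterion, and train_loader are assumed to already exist):

for inputs, targets in train_loader:
    optimizer.zero_grad()
    # run the forward pass and loss in 16-bit where it is numerically safe
    with torch.cuda.amp.autocast():
        logits = model(inputs)
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()
    # unscale before clipping so the threshold applies to the true gradients
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()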

I didn’t notice a significant difference in perplexity when comparing 32-bit vs 16-bit floating point training, ceteris paribus.

Thoroughly Track Experiments

With any reasonably sized language model taking at least a day to train, it is important to track your training progress thoroughly.

Beyond standard quantitative evaluation metrics such as loss and perplexity, it can be very useful to measure and visualize gradients and to track system stats such as CPU/GPU utilization and memory availability.

Luckily, many MLOps platforms these days can handle such tasks automatically. I personally used Weights & Biases and found it to be a joy to use.
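As a rough illustration of what this looks like with the wandb client (the project and metric names are placeholders, and num_epochs, train_one_epoch, and evaluate are hypothetical helpers):

import wandb

wandb.init(project="gpt-from-scratch")  # project name is a placeholder
# log gradient histograms for the model's parameters every 100 steps
wandb.watch(model, log="gradients", log_freq=100)

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model)  # hypothetical helper
    val_ppl = evaluate(model)            # hypothetical helper
    wandb.log({"train/loss": train_loss, "val/perplexity": val_ppl, "epoch": epoch})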

Finally, be sure to allow for qualitative model evaluation after every epoch by running your generative model on a small set of prompts. Apply different decoding strategies such as greedy decoding, beam search, top-k sampling with temperature, and nucleus (top-p) sampling.
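As one example, here is a sketch of top-k sampling with temperature, assuming a model that maps a 1-D tensor of token IDs to logits of shape [seq_len, vocab_size] (the function name and defaults are illustrative); greedy decoding is the special case of always taking the single most likely token:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_top_k(model, ids, context_len, max_new_tokens=50, k=40, temperature=0.8):
    for _ in range(max_new_tokens):
        # logits for the last position only, rescaled by the temperature
        logits = model(ids[-context_len:])[-1] / temperature
        topk_logits, topk_ids = torch.topk(logits, k)
        probs = F.softmax(topk_logits, dim=-1)
        # sample the next token from the k most likely candidates
        next_id = topk_ids[torch.multinomial(probs, num_samples=1)]
        ids = torch.cat([ids, next_id])
    return ids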

Useful Resources

I found the following resources helpful for verifying the correctness of my code:

  • Karpathy’s minGPT
  • PyTorch implementation of GPT (GitHub)

Published on November 20, 2022.