
Which Loss Function Do LLMs Use?
ChatGPT, Claude, and other LLMs are trained to predict the next word by minimizing a mathematical loss function. Their outputs can seem magical, but the underlying logic rests on mathematical principles that we will explore in this article.
The Foundation: Next-Token Prediction
Every large language model is trained on a simple task: predicting the next token. Given a sequence of words, the model must guess what comes next. Given “The cat sat on the,” for example, the model should assign a high probability to “mat” and a low probability to an unrelated word like “cloud”.
This approach, known as causal language modeling, transforms the complex challenge of understanding language into a prediction problem that can be solved through mathematical optimization.
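To make this concrete, here is a minimal sketch of next-token prediction using the Hugging Face transformers library and GPT-2. Both the library and the model are illustrative choices of mine, not something the article prescribes; any causal language model would behave the same way.

```python
# Minimal sketch: ask a small causal LM for its next-token distribution.
# GPT-2 and the transformers library are used purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>10s}  p = {p.item():.3f}")
```

Tokens like “ mat” or “ floor” should appear near the top of this list, while unrelated tokens receive vanishingly small probabilities.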
Cross-Entropy Loss
The standard loss function used throughout the industry is cross-entropy loss, also known as negative log-likelihood. This mathematical function measures how well the model’s predicted probability distribution matches the true distribution of tokens.
The formula for cross-entropy loss is:

$$\mathcal{L} = -\sum_{i} y_i \log(p_i)$$

- $y_i$: the true label (a one-hot vector: 1 for the correct token, 0 for all others).
- $p_i$: the predicted probability for token $i$.
- The sum runs over the vocabulary; the per-position losses are then accumulated over all tokens in the training batch.
For next-token prediction, $y_i$ is 0 for every wrong token, so all of those terms vanish and the loss simplifies to:

$$\mathcal{L} = -\log(p_{\text{correct}})$$

where $p_{\text{correct}}$ is the model’s predicted probability for the correct next token.
Since the logarithm of a number between 0 and 1 is negative, the negated result is always positive. Compare a model that predicts the correct token with a probability of 0.0001 against one that predicts it with a probability of 1 (using the natural log):

- $-\ln(0.0001) \approx 9.21$: the loss is large, so this prediction is penalized heavily.
- $-\ln(1) = 0$: the loss is at its minimum, so this prediction is rewarded.
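The same calculation is easy to reproduce in plain NumPy. The toy four-token vocabulary and the probability values below are invented for illustration; only the 0.0001 case mirrors the example above, and the natural log is assumed.

```python
# Sketch of the cross-entropy calculation for a single position,
# using a made-up four-token vocabulary where token 2 is correct.
import numpy as np

def cross_entropy(y_true: np.ndarray, p_pred: np.ndarray) -> float:
    """-sum_i y_i * log(p_i); with a one-hot y this is just -log(p_correct)."""
    return float(-np.sum(y_true * np.log(p_pred)))

y = np.array([0.0, 0.0, 1.0, 0.0])                 # one-hot true label

confident_right = np.array([0.01, 0.01, 0.97, 0.01])
confident_wrong = np.array([0.49, 0.49, 0.0001, 0.0199])

print(cross_entropy(y, confident_right))           # ~0.03  -> small loss
print(cross_entropy(y, confident_wrong))           # ~9.21  -> large loss
```

Because the one-hot label zeroes out every other term, only the probability assigned to the correct token matters at each position.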
Why Cross-Entropy Works So Well
Cross-entropy loss is particularly effective for language modeling because it:
- Penalizes confident wrong predictions heavily: If the model assigns very low probability to the correct token, the loss becomes very large
- Rewards accurate predictions: When the model correctly assigns high probability to the right token, the loss approaches zero
- Maintains differentiability: The smooth mathematical properties allow for effective gradient-based optimization
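A quick way to see the first two properties is to evaluate $-\log(p)$ across a range of probabilities. The probability values below are arbitrary illustrations, not figures from the article.

```python
# How the per-token loss -log(p) scales with the probability
# assigned to the correct token: confident mistakes cost far more.
import math

for p in (0.99, 0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"p(correct token) = {p:<6}  loss = {-math.log(p):.3f}")
```

The loss grows slowly near $p = 1$ but explodes as $p$ approaches 0, which is exactly the pressure that pushes the model toward confident, correct predictions.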
The Training Process in Action
During pre-training, the learning process follows a continuous cycle:
- Forward Pass: The model processes input text and outputs probability distributions over the entire vocabulary for each position
- Loss Calculation: Cross-entropy loss compares the predicted probabilities with the actual next tokens
- Backpropagation: Gradients of the loss flow backward through all model parameters
- Parameter Updates: An optimizer adjusts the model weights to minimize future loss
- Iteration: This process repeats across billions of tokens from diverse text sources
This cycle continues until the model develops sophisticated patterns for language understanding and generation.
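Putting the five steps together, here is a heavily simplified sketch of one training step in PyTorch. The tiny embedding-plus-linear model, the vocabulary size, and the random token batch are placeholders standing in for a real transformer and real text data.

```python
# Sketch of one pre-training step: forward pass, loss, backprop, update.
# The model and data here are toy stand-ins, not a real LLM setup.
import torch
import torch.nn as nn

vocab_size = 100
model = nn.Sequential(
    nn.Embedding(vocab_size, 32),
    nn.Linear(32, vocab_size),
)  # placeholder for a real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: targets are the input sequence shifted left by one token.
tokens = torch.randint(0, vocab_size, (4, 17))     # (batch, seq_len + 1)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# 1. Forward pass: logits over the vocabulary at every position.
logits = model(inputs)                             # (batch, seq, vocab)

# 2. Loss calculation: cross-entropy against the actual next tokens.
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

# 3-4. Backpropagation and parameter update.
optimizer.zero_grad()
loss.backward()
optimizer.step()

# 5. In real pre-training this loop repeats over billions of tokens.
print(f"loss: {loss.item():.3f}")
```

In practice the same loop runs across huge shuffled batches of text, with learning-rate schedules and other refinements, but the loss being minimized is the same cross-entropy described above.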
Related Topics to Investigate
- Label Smoothing
- Multi-token prediction (predicting several tokens ahead)
- Token-order objectives
- Masked language modeling components
- Direct Preference Optimization