
Which Loss Function Do LLMs Use?
ChatGPT, Claude, and other LLMs are trained to predict the next word by minimizing a mathematical loss function. Their outputs can seem magical, but the underlying logic rests on mathematical principles that we will explore in this article.
The Foundation: Next-Token Prediction
Every large language model is trained on a simple task: predicting the next token. Given a sequence of words, the model must guess what comes next. Given “The cat sat on the,” for example, the model should assign a high probability to “mat” and a low probability to an unrelated word like “cloud”.
This approach, known as causal language modeling, transforms the complex challenge of understanding language into a prediction problem that can be solved through mathematical optimization.
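To make this concrete, here is a minimal sketch of next-token prediction using the Hugging Face transformers library and GPT-2. Both the library and the model are illustrative choices of mine, not something the article prescribes; any causal language model would behave the same way.

```python
# Minimal sketch: ask a small causal LM for its next-token distribution.
# GPT-2 and the transformers library are used purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>10s}  p = {p.item():.3f}")
```

Tokens like “ mat” or “ floor” should appear near the top of this list, while unrelated tokens receive vanishingly small probabilities.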
Cross-Entropy Loss
The standard loss function used throughout the industry is cross-entropy loss, also known as negative log-likelihood. This mathematical function measures how well the model’s predicted probability distribution matches the true distribution of tokens.
The formula for cross-entropy loss is:

$$\mathcal{L} = -\sum_{i} y_i \log(p_i)$$

- $y_i$: the true label (a one-hot vector: 1 for the correct token, 0 for all others).
- $p_i$: the predicted probability for token $i$.
- The sum runs over the vocabulary; the per-position losses are then accumulated over all tokens in the training batch.
For next-token prediction, $y_i$ is 0 for every wrong token, so all of those terms vanish and the loss simplifies to:

$$\mathcal{L} = -\log(p_{\text{correct}})$$

where $p_{\text{correct}}$ is the model’s predicted probability for the correct next token.
Since the logarithm of a number between 0 and 1 is negative, the negated result is always positive. Compare a model that predicts the correct token with a probability of 0.0001 against one that predicts it with a probability of 1 (using the natural log):

- $-\ln(0.0001) \approx 9.21$: the loss is large, so this prediction is penalized heavily.
- $-\ln(1) = 0$: the loss is at its minimum, so this prediction is rewarded.
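The same calculation is easy to reproduce in plain NumPy. The toy four-token vocabulary and the probability values below are invented for illustration; only the 0.0001 case mirrors the example above, and the natural log is assumed.

```python
# Sketch of the cross-entropy calculation for a single position,
# using a made-up four-token vocabulary where token 2 is correct.
import numpy as np

def cross_entropy(y_true: np.ndarray, p_pred: np.ndarray) -> float:
    """-sum_i y_i * log(p_i); with a one-hot y this is just -log(p_correct)."""
    return float(-np.sum(y_true * np.log(p_pred)))

y = np.array([0.0, 0.0, 1.0, 0.0])                 # one-hot true label

confident_right = np.array([0.01, 0.01, 0.97, 0.01])
confident_wrong = np.array([0.49, 0.49, 0.0001, 0.0199])

print(cross_entropy(y, confident_right))           # ~0.03  -> small loss
print(cross_entropy(y, confident_wrong))           # ~9.21  -> large loss
```

Because the one-hot label zeroes out every other term, only the probability assigned to the correct token matters at each position.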
Why Cross-Entropy Works So Well
Cross-entropy loss is particularly effective for language modeling because it:
- Penalizes confident wrong predictions heavily: If the model assigns very low probability to the correct token, the loss becomes very large
- Rewards accurate predictions: When the model correctly assigns high probability to the right token, the loss approaches zero
- Maintains differentiability: The smooth mathematical properties allow for effective gradient-based optimization
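A quick way to see the first two properties is to evaluate $-\log(p)$ across a range of probabilities. The probability values below are arbitrary illustrations, not figures from the article.

```python
# How the per-token loss -log(p) scales with the probability
# assigned to the correct token: confident mistakes cost far more.
import math

for p in (0.99, 0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"p(correct token) = {p:<6}  loss = {-math.log(p):.3f}")
```

The loss grows slowly near $p = 1$ but explodes as $p$ approaches 0, which is exactly the pressure that pushes the model toward confident, correct predictions.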
The Training Process in Action
During pre-training, the learning process follows a continuous cycle:
- Forward Pass: The model processes input text and outputs probability distributions over the entire vocabulary for each position
- Loss Calculation: Cross-entropy loss compares the predicted probabilities with the actual next tokens
- Backpropagation: Gradients of the loss flow backward through all model parameters
- Parameter Updates: An optimizer adjusts the model weights to minimize future loss
- Iteration: This process repeats across billions of tokens from diverse text sources
This cycle continues until the model develops sophisticated patterns for language understanding and generation.
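Putting the five steps together, here is a heavily simplified sketch of one training step in PyTorch. The tiny embedding-plus-linear model, the vocabulary size, and the random token batch are placeholders standing in for a real transformer and real text data.

```python
# Sketch of one pre-training step: forward pass, loss, backprop, update.
# The model and data here are toy stand-ins, not a real LLM setup.
import torch
import torch.nn as nn

vocab_size = 100
model = nn.Sequential(
    nn.Embedding(vocab_size, 32),
    nn.Linear(32, vocab_size),
)  # placeholder for a real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: targets are the input sequence shifted left by one token.
tokens = torch.randint(0, vocab_size, (4, 17))     # (batch, seq_len + 1)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# 1. Forward pass: logits over the vocabulary at every position.
logits = model(inputs)                             # (batch, seq, vocab)

# 2. Loss calculation: cross-entropy against the actual next tokens.
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

# 3-4. Backpropagation and parameter update.
optimizer.zero_grad()
loss.backward()
optimizer.step()

# 5. In real pre-training this loop repeats over billions of tokens.
print(f"loss: {loss.item():.3f}")
```

In practice the same loop runs across huge shuffled batches of text, with learning-rate schedules and other refinements, but the loss being minimized is the same cross-entropy described above.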
Related Topics to Investigate
- Label Smoothing
- Multi-token prediction (predicting several tokens ahead)
- Token-order objectives
- Masked language modeling components
- Direct Preference Optimization