luminary.blog
by Oz Akan

Which Loss Function Do LLMs Use?

Exploring Cross-Entropy Loss in Large Language Models.

3 min read


ChatGPT, Claude, and other LLMs learn to predict the next token by minimizing a mathematical loss function. Their outputs can seem magical, but the underlying logic rests on mathematical principles that we will explore in this article.

The Foundation: Next-Token Prediction

Every large language model is trained on a simple task: predicting the next token. Given a sequence of words, the model must guess what comes next. For example, for “The cat sat on the,” the model should assign high probability to “mat” and low probability to unrelated words like “cloud”.

This approach, known as causal language modeling, transforms the complex challenge of understanding language into a prediction problem that can be solved through mathematical optimization.
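
A minimal Python sketch, with an invented toy vocabulary, shows how a causal-language-modeling training pair is built: the targets are simply the inputs shifted one position to the left.

```python
# Toy illustration of causal language modeling: at every position the model
# is trained to predict the token that comes next. The "tokenizer" here is
# just a made-up word-to-id mapping, not a real one.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "cloud": 5}
tokens = ["the", "cat", "sat", "on", "the", "mat"]
ids = [vocab[t] for t in tokens]

# Inputs are all tokens except the last; targets are the same sequence
# shifted left by one, so position i must predict token i + 1.
inputs = ids[:-1]    # the, cat, sat, on, the
targets = ids[1:]    # cat, sat, on, the, mat

for x, y in zip(inputs, targets):
    print(f"after token id {x} -> predict token id {y}")
```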

Cross-Entropy Loss

The standard loss function used throughout the industry is cross-entropy loss, also known as negative log-likelihood. This mathematical function measures how well the model’s predicted probability distribution matches the true distribution of tokens.

The formula for cross-entropy loss is:

\mathcal{L}_{CE} = -\sum_{i} y_i \log p_i

  • y_i: the true label (a one-hot vector: 1 for the correct token, 0 for all others).
  • p_i: the model’s predicted probability for token i.
  • The sum runs over every token i in the vocabulary; in practice the per-position losses are averaged over all positions in the training batch.
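
As a quick sanity check, the full sum can be evaluated by hand for a toy five-token vocabulary; the predicted distribution below is made up for illustration.

```python
import math

# Toy example: a vocabulary of 5 tokens, where the correct next token is index 2.
p = [0.10, 0.20, 0.50, 0.15, 0.05]   # model's predicted distribution (sums to 1)
y = [0, 0, 1, 0, 0]                  # one-hot true label

# Full cross-entropy sum over the vocabulary.
loss = -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p))
print(loss)             # 0.693..., i.e. -log(0.5)
```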

For next-token prediction, y_i is 0 for every incorrect token, so the sum collapses to a single term:

\mathcal{L} = -\log p_y

where p_y is the model’s predicted probability for the correct next token.

Since the log of a number between 0 and 1 is negative, the final result is a positive number. If the model predicted the correct token with a probability of 0.0001 versus 0.9, the loss function would return the values below (using the base-10 logarithm here; frameworks typically use the natural log, which only changes the scale):

  • \mathcal{L} = -\log_{10}(0.0001) = 4. The loss is large, so this prediction is penalized heavily.
  • \mathcal{L} = -\log_{10}(0.9) \approx 0.05. The loss is small, so this prediction is rewarded.
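
These numbers are easy to verify; the short check below reproduces the base-10 values above and also prints the natural-log values that deep-learning frameworks actually compute.

```python
import math

for p_correct in (0.0001, 0.9):
    loss_base10 = -math.log10(p_correct)   # matches the values quoted above
    loss_natural = -math.log(p_correct)    # what frameworks such as PyTorch use
    print(f"p = {p_correct}: -log10 = {loss_base10:.4f}, -ln = {loss_natural:.4f}")

# p = 0.0001: -log10 = 4.0000, -ln = 9.2103
# p = 0.9:    -log10 = 0.0458, -ln = 0.1054
```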

Why Cross-Entropy Works So Well

Cross-entropy loss is particularly effective for language modeling because it:

  • Penalizes confident wrong predictions heavily: If the model assigns very low probability to the correct token, the loss becomes very large
  • Rewards accurate predictions: When the model correctly assigns high probability to the right token, the loss approaches zero
  • Maintains differentiability: The smooth mathematical properties allow for effective gradient-based optimization
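
The last point can be made concrete. For a softmax output layer, the gradient of the cross-entropy loss with respect to the logits has the simple closed form softmax(z) - y, which is smooth and cheap to backpropagate. A small PyTorch sketch with arbitrary toy logits verifies this:

```python
import torch
import torch.nn.functional as F

# Toy logits over a 5-token vocabulary; the correct token is index 2.
logits = torch.tensor([1.0, 0.5, 2.0, -1.0, 0.2], requires_grad=True)
target = torch.tensor(2)

# Cross-entropy on raw logits (PyTorch applies log-softmax internally).
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

# The well-known closed form: gradient = softmax(logits) - one_hot(target).
expected = F.softmax(logits.detach(), dim=0) - F.one_hot(target, num_classes=5).float()
print(torch.allclose(logits.grad, expected))   # True
```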

The Training Process in Action

During pre-training, the learning process follows a continuous cycle:

  1. Forward Pass: The model processes input text and outputs probability distributions over the entire vocabulary for each position
  2. Loss Calculation: Cross-entropy loss compares the predicted probabilities with the actual next tokens
  3. Backpropagation: Gradients of the loss flow backward through all model parameters
  4. Parameter Updates: An optimizer adjusts the model weights to minimize future loss
  5. Iteration: This process repeats across billions of tokens from diverse text sources

This cycle continues until the model develops sophisticated patterns for language understanding and generation.
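
Putting the cycle together, here is a minimal PyTorch-style sketch of the loop. The tiny embedding-plus-linear “model”, random token data, and hyperparameters are placeholders rather than a real pre-training setup, but the loss plumbing is the same.

```python
import torch
import torch.nn as nn

vocab_size, seq_len, batch_size = 100, 16, 8

# Stand-in "language model": an embedding followed by a linear head.
# Real LLMs put a transformer in between, but the loss computation is identical.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Random token ids stand in for a batch of training text.
    batch = torch.randint(0, vocab_size, (batch_size, seq_len))
    inputs, targets = batch[:, :-1], batch[:, 1:]        # shift by one position

    logits = model(inputs)                               # forward pass: (batch, seq-1, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size),       # loss calculation
                   targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()                                      # backpropagation
    optimizer.step()                                     # parameter update
```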

Beyond Plain Cross-Entropy

While cross-entropy remains the core pre-training objective, modern training and fine-tuning pipelines often add or combine other objectives, for example:

  • Label Smoothing
  • Multi-token prediction (predicting several tokens ahead)
  • Token-order objectives
  • Masked language modeling components
  • Direct Preference Optimization
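
Of these, label smoothing is the simplest to show in code: instead of a hard one-hot target, a small amount of probability mass is spread across the other tokens, which PyTorch’s CrossEntropyLoss exposes through its label_smoothing argument. A brief sketch with toy logits:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.3]])   # toy scores over 5 tokens
target = torch.tensor([0])                             # correct token is index 0

hard = nn.CrossEntropyLoss()(logits, target)
smoothed = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)

# On a confident correct prediction the smoothed loss is larger,
# which discourages the model from becoming over-confident.
print(hard.item(), smoothed.item())
```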