
What is LLM Inference?
changelog: added details about tokenization
When you type a prompt into ChatGPT, Claude, or any other language model, what happens behind the scenes? Let’s talk about inference - the computational steps that transform your input into a response.
What Is LLM Inference?
LLM inference is the process by which a trained language model generates text predictions based on input. Think of it as the model’s “thinking” process - taking your prompt and producing a response one token (word or subword) at a time.
Unlike training, which teaches the model patterns from massive datasets, inference is about applying that learned knowledge to generate new text.
The Inference Process
Here’s how it works step by step:
1. Tokenization
Neural networks can’t process raw text; they work with numbers. Tokenization deterministically maps each word, partial word, or symbol to a number.
LLMs usually use Subword Tokenization like Byte Pair Encoding (BPE), Unigram, or WordPiece. Your input text gets broken down into tokens - smaller units that the model can process. For example:
- “Hello world” → [“Hello”, ” world”]
- “unhappiness” → [“un”, “happiness”]
- “GPT-4” → [“G”, “PT”, ”-”, “4”] (depending on the tokenizer)
During training, the tokenizer scans a massive text dataset to determine which subwords appear most frequently. These are added to the vocabulary. When processing input, the tokenizer greedily matches the longest subwords from the vocabulary. Subwords are mapped to integer IDs; these IDs are then used as input for the model.
In Byte Pair Encoding, for example, the text starts out as individual characters, and the algorithm repeatedly merges the most frequent adjacent pairs of tokens in the data. Over thousands of merges, common subwords like “ing”, “ation”, or “pre” emerge. For instance:
- Input: “Reading”
- Initial tokens: [“R”, “e”, “a”, “d”, “i”, “n”, “g”]
- After merges: [“Read”, “ing”]
Rare or new words are split into smaller pieces that do exist in the vocabulary:
- “Supercalifragilistic” → [“Super”, “cal”, “i”, “frag”, “il”, “istic”]
Models use reserved tokens for purposes such as:
- Start (<s>) or end (</s> or <eos>) of a sentence
- Padding (<pad>) to fill out sequences to a uniform length. For instance, if you have sentences of varying lengths, shorter sentences are padded with <pad> so that they can all be arranged into a uniform matrix for efficient computation in parallel.
- Unknown (<unk>) for truly unrecognized text. This is rare in modern tokenizers but if the tokenizer encounters a truly foreign or corrupted input, it falls back to this token.
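To make this concrete, here is a minimal sketch of tokenization using the Hugging Face `transformers` library and the GPT-2 BPE tokenizer (assuming both are available in your environment); the exact splits and IDs vary from tokenizer to tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # loads the GPT-2 BPE vocabulary

text = "Hello world, unhappiness"
tokens = tokenizer.tokenize(text)   # subword strings, e.g. ['Hello', 'Ġworld', ',', ...]
ids = tokenizer.encode(text)        # the integer IDs the model actually consumes

print(tokens)
print(ids)
print(tokenizer.decode(ids))        # decoding the IDs round-trips back to the text
```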
2. Embedding Lookup
Each token is converted into a numerical representation that combines two critical components:
Token Embeddings
Token embeddings convert discrete tokens into dense vector representations that capture semantic meaning:
- Lookup Process: Each input token gets mapped to its corresponding embedding vector from a learned embedding matrix
- Semantic Representation: These vectors encode the token’s meaning in high-dimensional space (e.g., 768, 1024, or 4096 dimensions)
- Context Processing: The model uses these embeddings to understand relationships between words
For example, tokens like “king” and “queen” would have similar embeddings because they share semantic properties.
Positional Embeddings
Positional embeddings solve a critical problem: transformer models (the GPT series, LLaMA, BERT) have no inherent sense of token order:
- Position Encoding: Each token position gets its own embedding vector
- Order Information: These embeddings tell the model where each token appears in the sequence
- Combined Representation: Token embeddings are added to positional embeddings before entering the transformer layers
There are several types of positional embeddings:
Absolute Positional Embeddings
- Fixed embeddings for each position (0, 1, 2, …)
- Used in original BERT and GPT models
- Simple but limited by training sequence length
Relative Positional Embeddings
- Encode relative distances between tokens
- More flexible for handling sequences longer than training data
- Used in models like T5 and some recent architectures
Rotary Position Embeddings (RoPE)1
- Rotate embedding vectors based on position
- Used in models like LLaMA and GPT-NeoX
- Better extrapolation to longer sequences
RoPE Deep Dive: The Rotation Magic
RoPE works by applying rotation matrices to embedding vectors based on their position. Think of it like a clock - each position gets a specific rotation angle.
How RoPE Works:
- Split the embedding: Take a token’s embedding vector and split it into pairs of dimensions
- Apply rotation: Each pair gets rotated by an angle that depends on the token’s position
- Different frequencies: Each dimension pair uses a different rotation frequency
Example: Let’s say we have a 4-dimensional embedding for the word “cat” at position 2:
Original embedding: [0.5, 0.3, 0.8, 0.2]
Split into pairs: [(0.5, 0.3), (0.8, 0.2)]
Position 2 rotation angles:
- Pair 1: θ₁ = 2 × 10000^(-0/4) = 2 radians
- Pair 2: θ₂ = 2 × 10000^(-2/4) = 2 × 0.01 = 0.02 radians
After rotation:
- Pair 1: (0.5×cos(2) - 0.3×sin(2), 0.5×sin(2) + 0.3×cos(2))
- Pair 2: (0.8×cos(0.02) - 0.2×sin(0.02), 0.8×sin(0.02) + 0.2×cos(0.02))
Why This Works:
- Relative positioning: The dot product between two RoPE embeddings naturally encodes their relative distance
- Length invariance: Works for sequences longer than training data because the rotation pattern continues
- Smooth interpolation: Positions between training positions get sensible intermediate rotations
Advantage over absolute embeddings: If you train on sequences up to 2048 tokens, absolute embeddings break at position 2049. RoPE just continues the rotation pattern, allowing models to handle much longer sequences during inference.
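Here is a small NumPy sketch of that rotation, under the same assumptions as the worked example above (pairs of dimensions, base 10000); it illustrates the idea rather than reproducing any particular model’s implementation:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate one token's embedding x (even length) by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-2.0 * np.arange(half) / d)   # one frequency per dimension pair
    theta = position * freqs                       # rotation angle for each pair
    x1, x2 = x[0::2], x[1::2]                      # split the vector into pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

# the 4-dimensional "cat" embedding at position 2 from the example above
print(rope(np.array([0.5, 0.3, 0.8, 0.2]), position=2))
```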
Why Both Matter
- Token embeddings provide the “what” (semantic content)
- Positional embeddings provide the “where” (sequence order)
- Together, they give the model complete information about each token’s meaning and position
Without positional embeddings, “The cat sat on the mat” and “mat the on sat cat The” would look identical to the model, making coherent text generation impossible.
During inference, the process looks like:
- Input Processing: “The cat sat” → [token_emb(“The”) + pos_emb(0), token_emb(“cat”) + pos_emb(1), token_emb(“sat”) + pos_emb(2)]
- Autoregressive Generation: When generating the next token, the model maintains positional information for the entire context
- Context Window: Positional embeddings constrain sequence length - with absolute positional embeddings the model can only handle positions it saw during training, while relative schemes and RoPE extrapolate further
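As a minimal sketch of this lookup-and-add step (random matrices and made-up token IDs stand in for a real model’s learned tables):

```python
import numpy as np

vocab_size, max_len, d_model = 50_000, 2048, 768    # illustrative sizes

# learned lookup tables; random here, trained jointly with the model in practice
token_embeddings = 0.02 * np.random.randn(vocab_size, d_model)
position_embeddings = 0.02 * np.random.randn(max_len, d_model)

token_ids = np.array([464, 3797, 3332])             # hypothetical IDs for "The cat sat"
positions = np.arange(len(token_ids))               # positions 0, 1, 2

# "what" + "where": token embedding plus (absolute) positional embedding
x = token_embeddings[token_ids] + position_embeddings[positions]
print(x.shape)                                      # (3, 768), one vector per input token
```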
3. Forward Pass
The embeddings flow through the model’s neural network layers:
- Attention mechanisms help the model focus on relevant parts of the input. For example, when translating a sentence or answering a question, the model can dynamically decide which words in the input are most important to consider at each step. This results in better context awareness and more accurate outputs.
- Feed-forward networks process and transform the information. They further transform the embedded representations, allowing the model to learn more complex features after the attention step.
- Layer normalization keeps the values stable. This technique standardizes the values (activations) within a layer to have zero mean and unit variance. It keeps the model’s internal values stable as they move through the network, which helps with training efficiency and prevents the values from growing too large or too small as the data passes through the many layers of a large model.
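The sketch below shows one pre-norm transformer block in plain NumPy, with single-head causal attention, a ReLU feed-forward network, and layer normalization; it is a simplified illustration, not a faithful reproduction of any specific architecture:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # standardize each token's activations to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def causal_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product attention with a causal mask
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores += np.triu(np.full(scores.shape, -1e9), k=1)   # block attention to future tokens
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2          # two-layer ReLU MLP

def transformer_block(x, p):
    # residual connections around attention and the feed-forward network
    x = x + causal_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

# tiny random example: 4 tokens, 16-dimensional model
d, seq = 16, 4
rng = np.random.default_rng(0)
p = {name: rng.normal(size=shape) for name, shape in
     [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
      ("W1", (d, 4 * d)), ("b1", (4 * d,)), ("W2", (4 * d, d)), ("b2", (d,))]}
print(transformer_block(rng.normal(size=(seq, d)), p).shape)   # (4, 16)
```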
4. Next Token Prediction
At the output layer, the model produces a probability distribution over its entire vocabulary - essentially ranking how likely each possible next token is given the context.
5. Token Selection
The model selects the next token using strategies like:
Greedy Decoding
Always pick the highest probability token.
Example: Given context “The weather is”
Probability distribution:
- nice: 0.4
- good: 0.3
- sunny: 0.15
- bad: 0.1
- terrible: 0.05
Greedy decoding always picks “nice” (highest probability).
Result: Deterministic but can be repetitive and boring.
Random Sampling
Randomly select based on probabilities, using temperature to control randomness.
Example: Same context “The weather is”
- Temperature = 1.0 (normal): Sample directly from probabilities
- Temperature = 0.1 (low): Makes distribution sharper, almost greedy
- Temperature = 2.0 (high): Flattens distribution, more random
With temperature 0.5 the distribution sharpens to roughly:
- nice: 0.56 (boosted)
- good: 0.32
- sunny: 0.08
- bad: 0.03
- terrible: 0.01
The sampler still usually picks “nice”, but will sometimes pick “good” or “sunny” for variety.
Top-k Sampling
Only consider the k most likely tokens, then sample from those.
Example: Top-k=3 with context “The weather is”
Original: nice (0.4), good (0.3), sunny (0.15), bad (0.1), terrible (0.05)
Top-k=3 keeps: nice (0.47), good (0.35), sunny (0.18) [renormalized over their combined 0.85]
Never picks “bad” or “terrible” - filters out unlikely options.
Top-p (Nucleus) Sampling 2
Include tokens until their cumulative probability reaches p.
Example: Top-p=0.7 with context “The weather is”
Cumulative probabilities:
- nice: 0.4
- nice + good: 0.7 ← reaches 0.7, so the nucleus stops here and “sunny”, “bad”, “terrible” are cut
Sample from: nice (0.57), good (0.43) [renormalized]
Adaptive vocabulary size - sometimes 2 tokens, sometimes 10+.
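The sketch below implements these four selection strategies in plain NumPy over the same five-token toy distribution (the vocabulary and probabilities are just the illustrative numbers from the examples above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["nice", "good", "sunny", "bad", "terrible"])
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])     # next-token distribution from the example

def greedy(p):
    return vocab[int(np.argmax(p))]                   # always the most likely token

def sample_with_temperature(p, temperature=1.0):
    logits = np.log(p) / temperature                  # T < 1 sharpens, T > 1 flattens
    q = np.exp(logits - logits.max())
    return rng.choice(vocab, p=q / q.sum())

def top_k(p, k=3):
    keep = np.argsort(p)[-k:]                         # indices of the k most likely tokens
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return rng.choice(vocab, p=q / q.sum())           # renormalize over the kept tokens

def top_p(p, threshold=0.7):
    order = np.argsort(p)[::-1]                       # most likely first
    cutoff = int(np.searchsorted(np.cumsum(p[order]), threshold)) + 1
    keep = order[:cutoff]                             # smallest set whose mass reaches the threshold
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return rng.choice(vocab, p=q / q.sum())

print(greedy(probs), sample_with_temperature(probs, 0.5), top_k(probs), top_p(probs))
```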
Real-World Impact
- Chatbots: Often use Top-p for natural conversation
- Code generation: Might use lower temperature + Top-k for precision
- Creative writing: Higher temperature for more surprising word choices
6. Iteration
The newly generated token is added to the context, and the process repeats until a stopping condition is met (end token, max length, etc.).
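Putting steps 3 through 6 together, the generation loop looks roughly like the sketch below, where model_forward is a stand-in for a real forward pass over the current context:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EOS_ID, MAX_NEW_TOKENS = 10, 2, 50

def model_forward(context_ids):
    # Stand-in for a real model: returns a probability distribution over the
    # vocabulary for the next token, given the full context so far.
    logits = rng.normal(size=VOCAB_SIZE)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt_ids):
    context = list(prompt_ids)
    for _ in range(MAX_NEW_TOKENS):
        probs = model_forward(context)                    # steps 3-4: forward pass + distribution
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))    # step 5: token selection (sampling)
        context.append(next_id)                           # step 6: extend the context and repeat
        if next_id == EOS_ID:                             # stop on the end-of-sequence token
            break
    return context

print(generate([5, 7, 1]))
```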
The Human Perspective
From your perspective, inference feels like a conversation. The model isn’t truly “understanding” in the human sense. It’s performing sophisticated pattern matching based on statistical relationships learned during training.
Each response is generated fresh - the model doesn’t have memory between separate conversations (unless explicitly provided context).
How Does the GPU Run Inference?
Efficiently serving Large Language Model (LLM) inference at scale requires smart batching strategies that make full use of computational resources while balancing throughput and latency. The three main batching techniques - static, dynamic, and continuous - differ in flexibility and in which production scenarios they suit.
1. Static Batching
Static batching waits for a fixed number of requests before processing them together in a single batch. All requests in the batch are run simultaneously through the LLM. The process begins only when the batch is full, regardless of waiting time.
Use Case:
- Best suited for offline or scheduled workloads, such as nightly batch jobs, bulk document processing, or analytics tasks where latency isn’t critical.
Pros:
- Maximizes GPU or TPU efficiency by filling computational capacity.
- Simple to implement; batching logic is straightforward.
Cons:
- Increases latency for the first requests in a batch—they must wait for the batch to fill.
- If requests in a batch have variable execution times (e.g., some short, some long), the entire batch waits for the longest-running request (the “straggler effect”), wasting resources.
2. Dynamic Batching
Dynamic batching collects requests as they arrive and processes a batch either when it reaches a maximum size or after a maximum wait time (batch window). It doesn’t insist on the batch being full: a partial batch can be processed if the timer expires, minimizing individual request wait time.
Use Case:
- Appropriate for production with unpredictable, bursty, or variable traffic, such as APIs serving live user queries.
- Useful for models where request completion times are relatively consistent (e.g., image generation models), and latency is a performance metric.
Implementation Parameters:
- Max batch size: Upper limit of requests in a batch.
- Max batch delay: Maximum time to wait before processing the current batch.
Pros:
- Provides balanced trade-off between throughput and latency.
- Reduces wait time for early-arriving requests compared to static batching.
- Helps enforce Service Level Agreements (SLAs) and optimize for real-time traffic patterns.
Cons:
- Batches may be partially full, so it does not always maximize hardware utilization.
- Suffers from the same “straggler effect”—short requests wait for the longest in the batch to finish.
- Introduces complexity in keeping track of timers and request queues.
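A rough sketch of the request-collection side of dynamic batching, using the two parameters above (the queue, sizes, and timings are illustrative, not any particular serving framework’s API):

```python
import queue
import time

MAX_BATCH_SIZE = 8        # max batch size: upper limit of requests in a batch
MAX_BATCH_DELAY = 0.05    # max batch delay (seconds): the batch window

request_queue: "queue.Queue[str]" = queue.Queue()

def collect_batch():
    """Return up to MAX_BATCH_SIZE requests, or a partial batch if the window expires."""
    batch = [request_queue.get()]                  # block until at least one request arrives
    deadline = time.monotonic() + MAX_BATCH_DELAY
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                  # timer expired: process the partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch                                   # caller runs these through the model together
```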
3. Continuous Batching (a.k.a. In-Flight or Iteration-Level Batching)
Continuous batching is designed specifically for LLMs, which generate output token-by-token, often with highly variable sequence lengths. After each inference token-generation step, finished sequences are immediately replaced by new requests, so the GPU is always busy. Batch composition changes dynamically at each iteration, not just at the start.
Mechanism:
- LLM serving frameworks (like vLLM, Hugging Face TGI, SGLang, TensorRT-LLM) operate at the token level, not the request level.
- As soon as a sequence finishes generating its output, a new request can be added to the batch in the next token-generation cycle.
- Results in a high and steady utilization of the GPU, regardless of variability in prompt lengths or completion times.
Use Case:
- Ideal for production LLM serving and APIs, especially where output sequence length varies (which is typical with user workloads).
- Enables low latency and high throughput in customer-facing applications.
Pros:
- Maximizes throughput and GPU occupancy, minimizing idle time substantially compared to static or dynamic batching.
- New requests aren’t blocked by slow, long-running generations in the batch.
- Can achieve substantial throughput gains (up to 23x) and significant latency reduction compared to naive or static batching.
Cons:
- More complex to implement: requires fine-grained control of token-level scheduling and dynamic resource management.
- Needs careful configuration of max batch size and anticipated input/output shapes.
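The toy scheduler below sketches the iteration-level idea: after every token step, finished sequences leave the batch and waiting requests are admitted, so slots stay occupied. Here step_fn stands in for “run one forward pass and append one token”; real frameworks like vLLM or TGI add KV-cache management and much more:

```python
from collections import deque

def continuous_batching_loop(requests, step_fn, max_batch_size=4):
    """Iteration-level scheduling: refill the batch after every token-generation step."""
    waiting = deque(requests)     # requests not yet admitted to the batch
    active = []                   # sequences currently being decoded
    while waiting or active:
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())          # fill any free slots immediately
        active = [seq for seq in (step_fn(seq) for seq in active)
                  if not seq["done"]]                 # one token step each; drop finished sequences

# Stand-in step: each request just has a token budget counted down to zero.
def fake_step(seq):
    seq["remaining"] -= 1
    seq["done"] = seq["remaining"] == 0
    return seq

continuous_batching_loop(
    [{"remaining": n, "done": False} for n in (3, 1, 5, 2, 4)], fake_step)
```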
Comparison Table
| Feature | Static Batching | Dynamic Batching | Continuous Batching |
| --- | --- | --- | --- |
| Batch Start | Fixed size, when full | Full or when timer elapses | After every token step (iteration) |
| Latency | High for first requests | Medium, tunable | Low, minimized |
| Throughput | High (for full batches) | Medium-High | Highest |
| Straggler Effect | Present | Present | Minimized |
| Use Case | Offline, batch jobs | Production APIs w/ similar workloads | Live LLM APIs, variable workloads |
| Implementation | Simple | Moderate | Complex |
| GPU Utilization | High if batch full | Balanced | Maximum |
- Static batching is excellent for predictable, non-interactive jobs.
- Dynamic batching addresses real-time needs by blending throughput with latency, but suffers when request completion times differ greatly.
- Continuous batching is essential for LLMs, especially in conversational, interactive, or open-ended query scenarios, due to its ability to utilize hardware optimally while minimizing per-request latency—even as generations vary widely in time required.
For modern GenAI web APIs and high-throughput inference workloads (such as with vLLM or Hugging Face TGI), continuous batching is now considered best practice due to its unmatched efficiency and responsiveness in real-world LLM deployments.
Key Challenges
Computational Cost: Inference requires significant compute resources, especially for large models. Each token generation involves processing the entire context through billions of parameters.
Context Windows: Models have limited context windows (e.g., 8K, 32K, or 128K tokens). Longer conversations may require truncation or specialized techniques.
Latency vs Quality: There’s often a trade-off between response speed and quality. Faster inference techniques may sacrifice some accuracy.
Optimization Techniques
KV Caching: Stores previously computed key-value pairs so they don’t have to be recomputed for each new token (a minimal sketch appears after this list).
Quantization: Reduces model precision (e.g., from 16-bit to 8-bit) to speed up inference while maintaining acceptable quality.
Model Pruning: Removes less important parameters to create smaller, faster models.
Speculative Decoding: Uses a smaller “draft” model to propose tokens that a larger model then validates.
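As a minimal sketch of the KV caching idea (single head, random weights, no batching): each decode step appends the new token’s key and value to the cache and attends over everything cached, instead of recomputing keys and values for the whole context:

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

k_cache, v_cache = [], []          # one cached key/value row per past token

def decode_step(x):
    """Attention for the newest token only, reusing cached K/V for earlier tokens."""
    k_cache.append(x @ Wk)         # compute K/V once for the new token...
    v_cache.append(x @ Wv)         # ...and keep them for all future steps
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # attention output for the newest position

for _ in range(5):                 # five decode steps; the cache grows by one row each time
    out = decode_step(rng.normal(size=d_model))
print(len(k_cache))                # 5 cached key vectors
```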