
What is LLM Inference?
changelog: added details about tokenization
When you type a prompt into ChatGPT, Claude, or any other language model, what happens behind the scenes? Let’s talk about inference - the computational steps that transform your input into a response.
What Is LLM Inference?
LLM inference is the process by which a trained language model generates text predictions based on input. Think of it as the model’s “thinking” process - taking your prompt and producing a response one token (word or subword) at a time.
Unlike training, which teaches the model patterns from massive datasets, inference is about applying that learned knowledge to generate new text.
The Inference Process
Here’s how it works step by step:
1. Tokenization
Neural networks can’t process raw text; they work with numbers. Tokenization deterministically maps each word, partial word, or symbol to a number.
LLMs usually use Subword Tokenization like Byte Pair Encoding (BPE), Unigram, or WordPiece. Your input text gets broken down into tokens - smaller units that the model can process. For example:
- “Hello world” → [“Hello”, ” world”]
- “unhappiness” → [“un”, “happiness”]
- “GPT-4” → [“G”, “PT”, ”-”, “4”] (depending on the tokenizer)
During training, the tokenizer scans a massive text dataset to determine which subwords appear most frequently. These are added to the vocabulary. When processing input, the tokenizer greedily matches the longest subwords from the vocabulary. Subwords are mapped to integer IDs; these IDs are then used as input for the model.
In Byte Pair Encoding, for example, the text starts out as individual characters, and the algorithm repeatedly merges the most frequent adjacent pairs of tokens in the data. Over thousands of merges, common subwords like “ing”, “ation”, or “pre” emerge. For instance:
- Input: “Reading”
- Initial tokens: [“R”, “e”, “a”, “d”, “i”, “n”, “g”]
- After merges: [“Read”, “ing”]
Rare or new words are split into smaller pieces that do exist in the vocabulary:
- “Supercalifragilistic” → [“Super”, “cal”, “i”, “frag”, “il”, “istic”]
Models use reserved tokens for purposes such as:
- Start (<s>) or end (</s> or <eos>) of a sentence
- Padding (<pad>) to fill out sequences to a uniform length. For instance, if you have sentences of varying lengths, shorter sentences are padded with <pad> so that they can all be arranged into a uniform matrix for efficient computation in parallel.
- Unknown (<unk>) for truly unrecognized text. This is rare in modern tokenizers but if the tokenizer encounters a truly foreign or corrupted input, it falls back to this token.
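To make this concrete, here is a minimal sketch of tokenization using the Hugging Face `transformers` library and the GPT-2 BPE tokenizer (assuming both are available in your environment); the exact splits and IDs vary from tokenizer to tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # loads the GPT-2 BPE vocabulary

text = "Hello world, unhappiness"
tokens = tokenizer.tokenize(text)   # subword strings, e.g. ['Hello', 'Ġworld', ',', ...]
ids = tokenizer.encode(text)        # the integer IDs the model actually consumes

print(tokens)
print(ids)
print(tokenizer.decode(ids))        # decoding the IDs round-trips back to the text
```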
2. Embedding Lookup
Each token is converted into a numerical representation that combines two critical components:
Token Embeddings
Token embeddings convert discrete tokens into dense vector representations that capture semantic meaning:
- Lookup Process: Each input token gets mapped to its corresponding embedding vector from a learned embedding matrix
- Semantic Representation: These vectors encode the token’s meaning in high-dimensional space (e.g., 768, 1024, or 4096 dimensions)
- Context Processing: The model uses these embeddings to understand relationships between words
For example, tokens like “king” and “queen” would have similar embeddings because they share semantic properties.
Positional Embeddings
Positional embeddings solve a critical problem: transformer models (the GPT series, LLaMA, BERT) have no inherent sense of token order:
- Position Encoding: Each token position gets its own embedding vector
- Order Information: These embeddings tell the model where each token appears in the sequence
- Combined Representation: Token embeddings are added to positional embeddings before entering the transformer layers
There are several types of positional embeddings:
Absolute Positional Embeddings
- Fixed embeddings for each position (0, 1, 2, …)
- Used in original BERT and GPT models
- Simple but limited by training sequence length
Relative Positional Embeddings
- Encode relative distances between tokens
- More flexible for handling sequences longer than training data
- Used in models like T5 and some recent architectures
Rotary Position Embeddings (RoPE)1
- Rotate embedding vectors based on position
- Used in models like LLaMA and GPT-NeoX
- Better extrapolation to longer sequences
RoPE Deep Dive: The Rotation Magic
RoPE works by applying rotation matrices to embedding vectors based on their position. Think of it like a clock - each position gets a specific rotation angle.
How RoPE Works:
- Split the embedding: Take a token’s embedding vector and split it into pairs of dimensions
- Apply rotation: Each pair gets rotated by an angle that depends on the token’s position
- Different frequencies: Each dimension pair uses a different rotation frequency
Example: Let’s say we have a 4-dimensional embedding for the word “cat” at position 2:
Original embedding: [0.5, 0.3, 0.8, 0.2]
Split into pairs: [(0.5, 0.3), (0.8, 0.2)]
Position 2 rotation angles:
- Pair 1: θ₁ = 2 × 10000^(-0/4) = 2 radians
- Pair 2: θ₂ = 2 × 10000^(-2/4) = 2 × 0.01 = 0.02 radians
After rotation:
- Pair 1: (0.5×cos(2) - 0.3×sin(2), 0.5×sin(2) + 0.3×cos(2))
- Pair 2: (0.8×cos(0.02) - 0.2×sin(0.02), 0.8×sin(0.02) + 0.2×cos(0.02))
Why This Works:
- Relative positioning: The dot product between two RoPE embeddings naturally encodes their relative distance
- Length invariance: Works for sequences longer than training data because the rotation pattern continues
- Smooth interpolation: Positions between training positions get sensible intermediate rotations
Advantage over absolute embeddings: If you train on sequences up to 2048 tokens, absolute embeddings break at position 2049. RoPE just continues the rotation pattern, allowing models to handle much longer sequences during inference.
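Here is a small NumPy sketch of that rotation, under the same assumptions as the worked example above (pairs of dimensions, base 10000); it illustrates the idea rather than reproducing any particular model’s implementation:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate one token's embedding x (even length) by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-2.0 * np.arange(half) / d)   # one frequency per dimension pair
    theta = position * freqs                       # rotation angle for each pair
    x1, x2 = x[0::2], x[1::2]                      # split the vector into pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

# the 4-dimensional "cat" embedding at position 2 from the example above
print(rope(np.array([0.5, 0.3, 0.8, 0.2]), position=2))
```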
Why Both Matter
- Token embeddings provide the “what” (semantic content)
- Positional embeddings provide the “where” (sequence order)
- Together, they give the model complete information about each token’s meaning and position
Without positional embeddings, “The cat sat on the mat” and “mat the on sat cat The” would look identical to the model, making coherent text generation impossible.
During inference, the process looks like:
- Input Processing: “The cat sat” → [token_emb(“The”) + pos_emb(0), token_emb(“cat”) + pos_emb(1), token_emb(“sat”) + pos_emb(2)]
- Autoregressive Generation: When generating the next token, the model maintains positional information for the entire context
- Context Window: Positional embeddings constrain sequence length - with absolute positional embeddings the model can only handle positions it saw during training, while relative schemes and RoPE extrapolate further
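As a minimal sketch of this lookup-and-add step (random matrices and made-up token IDs stand in for a real model’s learned tables):

```python
import numpy as np

vocab_size, max_len, d_model = 50_000, 2048, 768    # illustrative sizes

# learned lookup tables; random here, trained jointly with the model in practice
token_embeddings = 0.02 * np.random.randn(vocab_size, d_model)
position_embeddings = 0.02 * np.random.randn(max_len, d_model)

token_ids = np.array([464, 3797, 3332])             # hypothetical IDs for "The cat sat"
positions = np.arange(len(token_ids))               # positions 0, 1, 2

# "what" + "where": token embedding plus (absolute) positional embedding
x = token_embeddings[token_ids] + position_embeddings[positions]
print(x.shape)                                      # (3, 768), one vector per input token
```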
3. Forward Pass
The embeddings flow through the model’s neural network layers:
- Attention mechanisms help the model focus on relevant parts of the input. For example, when translating a sentence or answering a question, the model can dynamically decide which words in the input are most important to consider at each step. This results in better context awareness and more accurate outputs.
- Feed-forward networks process and transform the information. They further transform the embedded representations, allowing the model to learn more complex features after the attention step.
- Layer normalization keeps the values stable. This technique standardizes the values (activations) within a layer to have zero mean and unit variance. It keeps the model’s internal values stable as they move through the network, which helps with training efficiency and prevents the values from growing too large or too small as the data passes through the many layers of a large model.
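The sketch below shows one pre-norm transformer block in plain NumPy, with single-head causal attention, a ReLU feed-forward network, and layer normalization; it is a simplified illustration, not a faithful reproduction of any specific architecture:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # standardize each token's activations to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def causal_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product attention with a causal mask
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores += np.triu(np.full(scores.shape, -1e9), k=1)   # block attention to future tokens
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2          # two-layer ReLU MLP

def transformer_block(x, p):
    # residual connections around attention and the feed-forward network
    x = x + causal_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

# tiny random example: 4 tokens, 16-dimensional model
d, seq = 16, 4
rng = np.random.default_rng(0)
p = {name: rng.normal(size=shape) for name, shape in
     [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
      ("W1", (d, 4 * d)), ("b1", (4 * d,)), ("W2", (4 * d, d)), ("b2", (d,))]}
print(transformer_block(rng.normal(size=(seq, d)), p).shape)   # (4, 16)
```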
4. Next Token Prediction
At the output layer, the model produces a probability distribution over its entire vocabulary - essentially ranking how likely each possible next token is given the context.
5. Token Selection
The model selects the next token using strategies like:
Greedy Decoding
Always pick the highest probability token.
Example: Given context “The weather is”
Probability distribution:
- nice: 0.4
- good: 0.3
- sunny: 0.15
- bad: 0.1
- terrible: 0.05
Greedy decoding always picks “nice” (highest probability).
Result: Deterministic but can be repetitive and boring.
Random Sampling
Randomly select based on probabilities, using temperature to control randomness.
Example: Same context “The weather is”
- Temperature = 1.0 (normal): Sample directly from probabilities
- Temperature = 0.1 (low): Makes distribution sharper, almost greedy
- Temperature = 2.0 (high): Flattens distribution, more random
With temperature 0.5 the distribution sharpens to roughly:
- nice: 0.56 (boosted)
- good: 0.32
- sunny: 0.08
- bad: 0.03
- terrible: 0.01
The sampler still usually picks “nice”, but will sometimes pick “good” or “sunny” for variety.
Top-k Sampling
Only consider the k most likely tokens, then sample from those.
Example: Top-k=3 with context “The weather is”
Original: nice (0.4), good (0.3), sunny (0.15), bad (0.1), terrible (0.05)
Top-k=3 keeps: nice (0.47), good (0.35), sunny (0.18) [renormalized over their combined 0.85]
Never picks “bad” or “terrible” - filters out unlikely options.
Top-p (Nucleus) Sampling 2
Include tokens until their cumulative probability reaches p.
Example: Top-p=0.7 with context “The weather is”
Cumulative probabilities:
- nice: 0.4
- nice + good: 0.7 ← reaches 0.7, so the nucleus stops here and “sunny”, “bad”, “terrible” are cut
Sample from: nice (0.57), good (0.43) [renormalized]
Adaptive vocabulary size - sometimes 2 tokens, sometimes 10+.
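The sketch below implements these four selection strategies in plain NumPy over the same five-token toy distribution (the vocabulary and probabilities are just the illustrative numbers from the examples above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["nice", "good", "sunny", "bad", "terrible"])
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])     # next-token distribution from the example

def greedy(p):
    return vocab[int(np.argmax(p))]                   # always the most likely token

def sample_with_temperature(p, temperature=1.0):
    logits = np.log(p) / temperature                  # T < 1 sharpens, T > 1 flattens
    q = np.exp(logits - logits.max())
    return rng.choice(vocab, p=q / q.sum())

def top_k(p, k=3):
    keep = np.argsort(p)[-k:]                         # indices of the k most likely tokens
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return rng.choice(vocab, p=q / q.sum())           # renormalize over the kept tokens

def top_p(p, threshold=0.7):
    order = np.argsort(p)[::-1]                       # most likely first
    cutoff = int(np.searchsorted(np.cumsum(p[order]), threshold)) + 1
    keep = order[:cutoff]                             # smallest set whose mass reaches the threshold
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return rng.choice(vocab, p=q / q.sum())

print(greedy(probs), sample_with_temperature(probs, 0.5), top_k(probs), top_p(probs))
```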
Real-World Impact
- Chatbots: Often use Top-p for natural conversation
- Code generation: Might use lower temperature + Top-k for precision
- Creative writing: Higher temperature for more surprising word choices
6. Iteration
The newly generated token is added to the context, and the process repeats until a stopping condition is met (end token, max length, etc.).
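Putting steps 3 through 6 together, the generation loop looks roughly like the sketch below, where model_forward is a stand-in for a real forward pass over the current context:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EOS_ID, MAX_NEW_TOKENS = 10, 2, 50

def model_forward(context_ids):
    # Stand-in for a real model: returns a probability distribution over the
    # vocabulary for the next token, given the full context so far.
    logits = rng.normal(size=VOCAB_SIZE)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt_ids):
    context = list(prompt_ids)
    for _ in range(MAX_NEW_TOKENS):
        probs = model_forward(context)                    # steps 3-4: forward pass + distribution
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))    # step 5: token selection (sampling)
        context.append(next_id)                           # step 6: extend the context and repeat
        if next_id == EOS_ID:                             # stop on the end-of-sequence token
            break
    return context

print(generate([5, 7, 1]))
```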
The Human Perspective
From your perspective, inference feels like a conversation. The model isn’t truly “understanding” in the human sense. It’s performing sophisticated pattern matching based on statistical relationships learned during training.
Each response is generated fresh - the model doesn’t have memory between separate conversations (unless explicitly provided context).
How Does the GPU Run Inference?
Efficiently serving Large Language Model (LLM) inference at scale requires smart batching strategies that make full use of computational resources while balancing throughput and latency. The three main batching techniques - static, dynamic, and continuous - differ in flexibility and in which production scenarios they suit.
1. Static Batching
Static batching waits for a fixed number of requests before processing them together in a single batch. All requests in the batch are run simultaneously through the LLM. The process begins only when the batch is full, regardless of waiting time.
Use Case:
- Best suited for offline or scheduled workloads, such as nightly batch jobs, bulk document processing, or analytics tasks where latency isn’t critical.
Pros:
- Maximizes GPU or TPU efficiency by filling computational capacity.
- Simple to implement; batching logic is straightforward.
Cons:
- Increases latency for the first requests in a batch—they must wait for the batch to fill.
- If requests in a batch have variable execution times (e.g., some short, some long), the entire batch waits for the longest-running request (the “straggler effect”), wasting resources.
2. Dynamic Batching
Dynamic batching collects requests as they arrive and processes a batch either when it reaches a maximum size or after a maximum wait time (batch window). It doesn’t insist on the batch being full: a partial batch can be processed if the timer expires, minimizing individual request wait time.
Use Case:
- Appropriate for production with unpredictable, bursty, or variable traffic, such as APIs serving live user queries.
- Useful for models where request completion times are relatively consistent (e.g., image generation models), and latency is a performance metric.
Implementation Parameters:
- Max batch size: Upper limit of requests in a batch.
- Max batch delay: Maximum time to wait before processing the current batch.
Pros:
- Provides balanced trade-off between throughput and latency.
- Reduces wait time for early-arriving requests compared to static batching.
- Helps enforce Service Level Agreements (SLAs) and optimize for real-time traffic patterns.
Cons:
- Batches may be partially full, so it does not always maximize hardware utilization.
- Suffers from the same “straggler effect”—short requests wait for the longest in the batch to finish.
- Introduces complexity in keeping track of timers and request queues.
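A rough sketch of the request-collection side of dynamic batching, using the two parameters above (the queue, sizes, and timings are illustrative, not any particular serving framework’s API):

```python
import queue
import time

MAX_BATCH_SIZE = 8        # max batch size: upper limit of requests in a batch
MAX_BATCH_DELAY = 0.05    # max batch delay (seconds): the batch window

request_queue: "queue.Queue[str]" = queue.Queue()

def collect_batch():
    """Return up to MAX_BATCH_SIZE requests, or a partial batch if the window expires."""
    batch = [request_queue.get()]                  # block until at least one request arrives
    deadline = time.monotonic() + MAX_BATCH_DELAY
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                  # timer expired: process the partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch                                   # caller runs these through the model together
```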
3. Continuous Batching (a.k.a. In-Flight or Iteration-Level Batching)
Continuous batching is designed specifically for LLMs, which generate output token-by-token, often with highly variable sequence lengths. After each inference token-generation step, finished sequences are immediately replaced by new requests, so the GPU is always busy. Batch composition changes dynamically at each iteration, not just at the start.
Mechanism:
- LLM serving frameworks (like vLLM, Hugging Face TGI, SGLang, TensorRT-LLM) operate at the token level, not the request level.
- As soon as a sequence finishes generating its output, a new request can be added to the batch in the next token-generation cycle.
- Results in a high and steady utilization of the GPU, regardless of variability in prompt lengths or completion times.
Use Case:
- Ideal for production LLM serving and APIs, especially where output sequence length varies (which is typical with user workloads).
- Enables low latency and high throughput in customer-facing applications.
Pros:
- Maximizes throughput and GPU occupancy, minimizing idle time substantially compared to static or dynamic batching.
- New requests aren’t blocked by slow, long-running generations in the batch.
- Can achieve substantial throughput gains (up to 23x) and significant latency reduction compared to naive or static batching.
Cons:
- More complex to implement: requires fine-grained control of token-level scheduling and dynamic resource management.
- Needs careful configuration of max batch size and anticipated input/output shapes.
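The toy scheduler below sketches the iteration-level idea: after every token step, finished sequences leave the batch and waiting requests are admitted, so slots stay occupied. Here step_fn stands in for “run one forward pass and append one token”; real frameworks like vLLM or TGI add KV-cache management and much more:

```python
from collections import deque

def continuous_batching_loop(requests, step_fn, max_batch_size=4):
    """Iteration-level scheduling: refill the batch after every token-generation step."""
    waiting = deque(requests)     # requests not yet admitted to the batch
    active = []                   # sequences currently being decoded
    while waiting or active:
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())          # fill any free slots immediately
        active = [seq for seq in (step_fn(seq) for seq in active)
                  if not seq["done"]]                 # one token step each; drop finished sequences

# Stand-in step: each request just has a token budget counted down to zero.
def fake_step(seq):
    seq["remaining"] -= 1
    seq["done"] = seq["remaining"] == 0
    return seq

continuous_batching_loop(
    [{"remaining": n, "done": False} for n in (3, 1, 5, 2, 4)], fake_step)
```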
Comparison Table
| Feature | Static Batching | Dynamic Batching | Continuous Batching |
| --- | --- | --- | --- |
| Batch Start | Fixed size, when full | Full or when timer elapses | After every token step (iteration) |
| Latency | High for first requests | Medium, tunable | Low, minimized |
| Throughput | High (for full batches) | Medium-High | Highest |
| Straggler Effect | Present | Present | Minimized |
| Use Case | Offline, batch jobs | Production APIs w/ similar workloads | Live LLM APIs, variable workloads |
| Implementation | Simple | Moderate | Complex |
| GPU Utilization | High if batch full | Balanced | Maximum |
- Static batching is excellent for predictable, non-interactive jobs.
- Dynamic batching addresses real-time needs by blending throughput with latency, but suffers when request completion times differ greatly.
- Continuous batching is essential for LLMs, especially in conversational, interactive, or open-ended query scenarios, due to its ability to utilize hardware optimally while minimizing per-request latency—even as generations vary widely in time required.
For modern GenAI web APIs and high-throughput inference workloads (such as with vLLM or Hugging Face TGI), continuous batching is now considered best practice due to its unmatched efficiency and responsiveness in real-world LLM deployments.
Key Challenges
Computational Cost: Inference requires significant compute resources, especially for large models. Each token generation involves processing the entire context through billions of parameters.
Context Windows: Models have limited context windows (e.g., 8K, 32K, or 128K tokens). Longer conversations may require truncation or specialized techniques.
Latency vs Quality: There’s often a trade-off between response speed and quality. Faster inference techniques may sacrifice some accuracy.
Optimization Techniques
KV Caching: Stores previously computed key-value pairs so they don’t have to be recomputed for each new token (a minimal sketch appears after this list).
Quantization: Reduces model precision (e.g., from 16-bit to 8-bit) to speed up inference while maintaining acceptable quality.
Model Pruning: Removes less important parameters to create smaller, faster models.
Speculative Decoding: Uses a smaller “draft” model to propose tokens that a larger model then validates.
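As a minimal sketch of the KV caching idea (single head, random weights, no batching): each decode step appends the new token’s key and value to the cache and attends over everything cached, instead of recomputing keys and values for the whole context:

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

k_cache, v_cache = [], []          # one cached key/value row per past token

def decode_step(x):
    """Attention for the newest token only, reusing cached K/V for earlier tokens."""
    k_cache.append(x @ Wk)         # compute K/V once for the new token...
    v_cache.append(x @ Wv)         # ...and keep them for all future steps
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # attention output for the newest position

for _ in range(5):                 # five decode steps; the cache grows by one row each time
    out = decode_step(rng.normal(size=d_model))
print(len(k_cache))                # 5 cached key vectors
```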