luminary.blog
by Oz Akan

What are Positional Embeddings?

The mathematical technique that teaches AI models where each word sits in a sequence.


“the cat chased the mouse” versus “the mouse chased the cat.”

Same words, completely different meanings! This fundamental challenge—understanding word order—is what positional embeddings solve in modern AI language models. But why is this a problem to solve in the first place?

The Transformer’s Unique Challenge

When the groundbreaking “Attention Is All You Need” 1 paper introduced Transformers in 2017, it revolutionized how AI processes language. Unlike older models that read text word-by-word (like we do), Transformers process all words simultaneously. This parallel processing makes them incredibly fast and powerful, but it creates a problem: without sequential processing, how does the model know which word comes first, second, or last?

Think of it this way: if I showed you a jumbled collection of words: “am”, “therefore”, “I”, “think”, “I”, you’d need some way to know their proper order to understand the intended message. That’s exactly the challenge Transformers face.

Teaching AI About Word Order

The solution is simple but mathematically sophisticated. Before feeding text into a Transformer, we add special numerical patterns called “positional embeddings” to each word. These patterns encode each word’s position in the sequence, allowing the model to understand word order despite processing everything simultaneously.

Let’s see exactly how this works with a concrete example.

“Hello World” to the Rescue

Let’s trace through how a Transformer would process the simple phrase “Hello World” using positional embeddings. For clarity, I’ll use a small embedding dimension of 4 (real models use many more).

Step 1: Understanding the Components

First, we need two types of embeddings:

  • Word embeddings: Numerical representations capturing each word’s meaning. (check this article to learn more about word embeddings)
  • Position embeddings: Numerical patterns encoding each word’s location

Step 2: Creating Position Embeddings

Transformers use special sine and cosine functions to generate unique positional patterns. The formulas look intimidating at first:

  • For even dimensions: $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$
  • For odd dimensions: $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$

In these formulas:

  • $PE$ stands for positional embedding
  • $pos$ is the position of the word in the sequence (e.g., 0 for the first word, 1 for the second, etc.).
  • $i$ is the embedding dimension index.
  • $d_{model}$ is the total number of embedding dimensions, which is 4 in our example.
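
To make the formulas concrete, here is a minimal sketch of how they could be implemented in Python with NumPy. The function name and structure are my own, not taken from the paper:

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional embeddings.

    Assumes d_model is even, so sine/cosine dimensions come in pairs.
    """
    positions = np.arange(seq_len)[:, np.newaxis]         # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]   # the 2i values: 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)  # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe
```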

Let’s break this down with our example. For dimensions 0 and 1 the pair index is i = 0, so the divisor is 10000^0 = 1; for dimensions 2 and 3 it is i = 1, so the divisor is 10000^(2/4) = 100.

For “Hello” (position 0):

  • Dimension 0: sin(0/1) = 0.0
  • Dimension 1: cos(0/1) = 1.0
  • Dimension 2: sin(0/100) = 0.0
  • Dimension 3: cos(0/100) = 1.0
  • Position vector: [0.0, 1.0, 0.0, 1.0]

For “World” (position 1):

  • Dimension 0: sin(1/1) ≈ 0.841
  • Dimension 1: cos(1/1) ≈ 0.540
  • Dimension 2: sin(1/100) ≈ 0.010
  • Dimension 3: cos(1/100) ≈ 1.000
  • Position vector: [0.841, 0.540, 0.010, 1.000]
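
Plugging our two-word example into the sketch above reproduces these hand-computed vectors (rounded to three decimals):

```python
pe = sinusoidal_position_embeddings(seq_len=2, d_model=4)
print(pe.round(3))
# [[0.    1.    0.    1.   ]   <- "Hello" at position 0
#  [0.841 0.54  0.01  1.   ]]  <- "World" at position 1
```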

Step 3: Combining Word and Position Information

Now comes the magic. We simply add the position embedding to the word embedding element-by-element:

For “Hello”:

  • Word embedding: [0.1, 0.2, 0.3, 0.4] (hypothetical values)
  • Position embedding: [0.0, 1.0, 0.0, 1.0]
  • Combined: [0.1, 1.2, 0.3, 1.4]

For “World”:

  • Word embedding: [0.5, 0.6, 0.7, 0.8]
  • Position embedding: [0.841, 0.540, 0.010, 1.000]
  • Combined: [1.341, 1.140, 0.710, 1.800]

These combined vectors now contain both semantic meaning and positional information!
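
In code, this step is nothing more than an element-wise addition of two matrices of the same shape. A small sketch, reusing the positional-embedding function from earlier and the hypothetical word-embedding values above:

```python
import numpy as np

# Hypothetical word embeddings, one row per token
word_embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],  # "Hello"
    [0.5, 0.6, 0.7, 0.8],  # "World"
])

position_embeddings = sinusoidal_position_embeddings(seq_len=2, d_model=4)

# The Transformer's input is simply the element-wise sum
model_input = word_embeddings + position_embeddings
print(model_input.round(3))
# [[0.1   1.2   0.3   1.4  ]
#  [1.341 1.14  0.71  1.8  ]]
```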

The Genius Behind the Design

You might be wondering: why sine and cosine functions? Why that mysterious 10,000 in the formula? I certainly did. The design is actually quite brilliant:

1. Bounded Values

Sine and cosine always produce values between -1 and 1, preventing numerical instability as sequences get longer. This keeps the math well-behaved whether processing a word or a book.

2. Multiple Frequencies for Different Scales

Look back at our position calculations. Notice how dimensions 0 and 1 change rapidly between positions (dimension 0 jumps from 0.0 to 0.841), while dimensions 2 and 3 barely change (dimension 2 moves from 0.0 to just 0.010). This multi-scale pattern is crucial:

  • High-frequency dimensions (early ones) help the model recognize adjacent words
  • Low-frequency dimensions (later ones) capture long-range relationships

This gives the model the ability to “see” both local word patterns and document-wide structure.

3. The Magic of 10,000

The paper 1 says:

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

So it didn’t have to be exactly 10,000; the value simply spreads the frequencies across dimensions over a very wide range of wavelengths. With an embedding size of 512:

  • Early dimensions complete cycles every ~6 positions (capturing local patterns)
  • Middle dimensions cycle every few hundred positions
  • Final dimensions take tens of thousands of positions to cycle (capturing global structure)

This logarithmic spread ensures the model can understand relationships at every scale, from adjacent words to paragraph-level connections.
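
These cycle lengths follow directly from the formula: the wavelength of sine/cosine pair $i$ is $2\pi \cdot 10000^{2i/d_{model}}$. A quick check for $d_{model} = 512$ (the pair indices below are chosen just for illustration):

```python
import numpy as np

d_model = 512
for i in [0, 128, 255]:  # first, middle, and last sine/cosine pair
    wavelength = 2 * np.pi * 10000 ** (2 * i / d_model)
    print(f"pair {i:3d}: one full cycle every ~{wavelength:,.0f} positions")
# pair   0: one full cycle every ~6 positions
# pair 128: one full cycle every ~628 positions
# pair 255: one full cycle every ~60,611 positions
```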

4. Relative Position Understanding

Perhaps most cleverly, the sinusoidal pattern allows the model to reason about relative positions: for any fixed offset k, the embedding at position pos + k is a fixed linear transformation of the embedding at position pos, no matter where the two words sit in the sequence. This also means a model trained on 512-word sequences can, in principle, handle longer texts!
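
To see why, take a single sine/cosine pair with frequency $\omega_i = \frac{1}{10000^{2i/d_{model}}}$. The angle-addition identities give a rotation that maps the pair at position $pos$ to the pair at position $pos + k$, and this rotation depends only on the offset $k$ (a sketch of the paper’s argument in my own notation):

$$
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
$$

Because the rotation matrix does not depend on $pos$, attention can learn to look “k tokens away” without caring about absolute positions.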

Hello Parallel Processing

This solution enables Transformers to maintain their parallel processing advantage while understanding sequential structure. It’s a key reason why models like GPT, BERT, and their descendants can generate coherent text, answer questions contextually, and understand the relationships between words in a sentence.

One Last Thing

Now that we understand how LLMs think and generate sentences, I’m really curious about how we humans do the same. Sometimes you start a sentence and it takes a moment to find the words to finish it. Most of the time, though, the words appear one after another. For example, pause for a moment and start a sentence with the word “tomorrow”. Please, try. Did you say “Tomorrow, I will…”? I’m curious how we came up with “I” after the word “tomorrow”. I didn’t even think about it. The word “I” almost appeared by itself. We certainly don’t do sine/cosine operations in our brains, but then how do we store relations between words? Maybe once we solve the “AI” problem, it will help us solve the “I” problem.

Imagine an LLM that was never given any documents about how LLMs work. It would be as miserable as we are when it comes to understanding itself.

Footnotes

  1. Vaswani et al., “Attention Is All You Need” (2017). https://arxiv.org/abs/1706.03762