luminary.blog
by Oz Akan

Words, Tokens and Embeddings

How language models convert token IDs into meaningful vector representations that capture semantic relationships.



We know LLMs don’t understand words, so we first convert words into tokens and then tokens into embeddings. Tokens are integers that don’t map one-to-one to words, and embeddings are N-dimensional vectors. How does one map to the other? It can seem confusing.

Let’s walk through this step by step. You just learned about tokenization algorithms in the previous article. Now we need to understand what happens to those tokens inside the model.

1. The Big Picture

Here’s the complete pipeline:

  1. Text Input: “Hello world”
  2. Tokenization: [13225, 2375] (token IDs for GPT-4o) [1]
  3. Embedding Lookup: Each token ID becomes a dense vector
  4. Ready for Processing: The model can now work with these vectors

You start with English (or any other supported language), convert it into a numerical language (tokens), and then translate again into a vector language (embeddings) that the neural network actually speaks.
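Here is a minimal sketch of steps 1–2, assuming you have the tiktoken package installed and that the token IDs above came from GPT-4o’s tokenizer (o200k_base):

import tiktoken

# GPT-4o's tokenizer; encoding_for_model resolves it to o200k_base
enc = tiktoken.encoding_for_model("gpt-4o")

token_ids = enc.encode("Hello world")
print(token_ids)                             # e.g. [13225, 2375]
print([enc.decode([t]) for t in token_ids])  # the text piece behind each ID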

2. The Embedding Matrix

Every language model has an embedding matrix or embedding table. This is essentially a giant lookup table where:

  • Each row corresponds to a token ID
  • Each column represents a dimension in the embedding space
  • The intersection gives you the embedding value for that token and dimension

Example embedding matrix (simplified):

Token ID | Dimension 0 | Dimension 1 | Dimension 2
------------------------------------------------------
0 (pad) | 0.0 | 0.0 | 0.0
1 (hello) | 0.23 | -0.45 | 0.78
2 (world) | 0.12 | -0.33 | 0.89
3 (cat) | 0.67 | 0.21 | -0.12
... | ... | ... | ...
50256 | -0.34 | 0.56 | 0.23

When you have token ID 1, you simply look up row 1 and get the vector [0.23, -0.45, 0.78]. It’s a straightforward table lookup operation.

3. The Simple Embedding Lookup

The sample code below demonstrates the process:

# Simplified example
token_ids = [1, 2]  # "hello world"

embedding_matrix = [
    [0.0, 0.0, 0.0],      # token 0 (pad)
    [0.23, -0.45, 0.78],  # token 1 (hello)
    [0.12, -0.33, 0.89],  # token 2 (world)
    # ... more rows
]

# Lookup embeddings
hello_embedding = embedding_matrix[1]  # [0.23, -0.45, 0.78]
world_embedding = embedding_matrix[2]  # [0.12, -0.33, 0.89]

Each token ID acts as an index into the embedding matrix. No complex computation needed—just array indexing.
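In a real framework this lookup is a single embedding layer. Here is a minimal PyTorch sketch; the table size and values are made up for illustration:

import torch
import torch.nn as nn

# Toy embedding table: 10 possible token IDs, 3 dimensions per token.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

token_ids = torch.tensor([1, 2])   # "hello world" from the example above
vectors = embedding(token_ids)     # shape (2, 3): one row looked up per token ID
print(vectors)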

4. Real-World Scale

In actual models, these numbers get big fast:

  • BERT-base: 30,522 tokens × 768 dimensions = ~23 million parameters
  • Llama-7B: 32,000 tokens × 4,096 dimensions = ~131 million parameters
  • GPT-3: 50,257 tokens × 12,288 dimensions = ~617 million parameters just for embeddings
  • GPT-4: 100,256 tokens × 1,536/3,072/16,000 dimensions = up to ~1.6 billion (these are unofficial numbers; because of the mixture-of-experts architecture, different “experts” also use different dimensions)
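You can sanity-check these counts with a quick multiplication; the sketch below leaves GPT-4 out because its numbers are unofficial:

# Embedding parameters = vocabulary size x embedding dimension
models = {
    "BERT-base": (30_522, 768),
    "Llama-7B": (32_000, 4_096),
    "GPT-3": (50_257, 12_288),
}
for name, (vocab, dim) in models.items():
    print(f"{name}: {vocab * dim:,} embedding parameters")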

That’s why embeddings often represent a significant chunk of a model’s total parameters. And these are just the token embeddings; there are also position embeddings.

5. Position Embeddings

Token embeddings only tell you what the token is, but not where it appears in the sequence. That’s why most models also use position embeddings.

Two types of position information:

  1. Absolute Position: “This token is at position 5”
  2. Relative Position: “This token is 3 positions after that one”

Example with positions:

Input: "The cat sits"
Tokens: [283, 3797, 10762]
Positions: [0, 1, 2]
# Final embedding = token embedding + position embedding
the_final = token_embedding[283] + position_embedding[0]
cat_final = token_embedding[3797] + position_embedding[1]
sits_final = token_embedding[10762] + position_embedding[2]

This way, the model knows that “cat” appears before “sits”, which is crucial for understanding meaning.
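Here is a minimal PyTorch sketch of learned absolute position embeddings; the vocabulary size, dimension, and maximum sequence length are made up for illustration:

import torch
import torch.nn as nn

vocab_size, max_len, dim = 50_000, 512, 8   # toy sizes

token_embedding = nn.Embedding(vocab_size, dim)
position_embedding = nn.Embedding(max_len, dim)

token_ids = torch.tensor([283, 3797, 10762])   # "The cat sits"
positions = torch.arange(len(token_ids))       # [0, 1, 2]

# Final input vectors = what each token means + where it sits in the sequence
final = token_embedding(token_ids) + position_embedding(positions)
print(final.shape)   # torch.Size([3, 8])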

6. What are “Dense Vectors”?

Dense vector refers to a mathematical representation of a token (word or subword) using a list (or array) of numbers.

Dense means that most or all of the values in the vector are nonzero, as opposed to “sparse vectors,” which are mostly zeros and look like [0, 0, 0, 0, 1, 0, …, 0].

Each vector typically has hundreds or thousands of dimensions; for example, a 1,024-dimensional vector might look like: [0.13, −0.45, 0.99, …, 0.05]

If we had used one-hot encoding (which is a sparse vector) instead of a dense vector:

  • Vocabulary of 50,000 words = 50,000-dimensional vectors
  • Mostly zeros with a single 1
  • No semantic relationships captured

With dense embeddings:

  • Same vocabulary = 512-dimensional vectors (much smaller!)
  • Every dimension has a meaningful value
  • Similar words have similar vectors
  • Mathematical operations reveal relationships
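A small NumPy sketch of the contrast; the dense vectors below are invented just to show that related words can end up close together, while one-hot vectors are all equally unrelated:

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: 50,000 dimensions, a single 1. Every pair of words is orthogonal.
vocab_size = 50_000
cat_onehot, kitten_onehot = np.zeros(vocab_size), np.zeros(vocab_size)
cat_onehot[3], kitten_onehot[4] = 1.0, 1.0
print(cosine(cat_onehot, kitten_onehot))   # 0.0: no similarity captured

# Dense: a handful of meaningful values (made-up numbers).
cat = np.array([0.67, 0.21, -0.12])
kitten = np.array([0.60, 0.25, -0.10])
dog = np.array([-0.40, 0.80, 0.30])
print(cosine(cat, kitten))   # ~0.99: similar words, similar vectors
print(cosine(cat, dog))      # much lower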

7. Training the Embeddings

A good question to ask: how does the model learn these embedding values? There are two main approaches:

From Scratch
The embedding matrix starts with random values and gets updated during training. As the model learns to predict the next word, it adjusts embeddings so that tokens used in similar contexts get similar vectors and the overall task performance improves.

Pre-trained Embeddings
Some models start with embeddings from Word2Vec, GloVe, or FastText, then fine-tune them. This gives a head start since these embeddings already capture semantic relationships.
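Here is a rough PyTorch sketch of both approaches; the pretrained matrix is just a random placeholder standing in for real Word2Vec/GloVe/FastText vectors:

import torch
import torch.nn as nn

vocab_size, dim = 30_000, 300   # toy sizes

# From scratch: random initialization, learned along with the rest of the model.
scratch = nn.Embedding(vocab_size, dim)

# Pre-trained: load an existing matrix and fine-tune it (freeze=False keeps it trainable).
pretrained_matrix = torch.randn(vocab_size, dim)   # placeholder for real vectors
warm_start = nn.Embedding.from_pretrained(pretrained_matrix, freeze=False)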

8. Practical Example: GPT-4o Token Processing

Let’s trace through a real example:

Input: “Singularity is near”

Step 1 - Tokenization [1]:

"Singularity is near" → [54138, 72869, 382, 5862]

Step 2 - Embedding Lookup:

Token 54138 (Sing) → [0.016482863575220108, -0.002266584662720561, -0.0010731428628787398, ..., 0.0030743060633540154] # 1536 dimensions
Token 72869 (ularity) → [0.022945120930671692, -0.006187810562551022, 0.023258427157998085, ..., 0.020899411290884018]
Token 382 ( is) → [-0.019708732143044472, 0.005309739615768194, -0.0006801000563427806, ..., 0.029159288853406906]
Token 5862 ( near) → [-0.06510740518569946, -0.014064463786780834, 0.018161896616220474, ..., -0.005212091840803623]

Code if you want to replicate the result above:

import tiktoken
from openai import OpenAI

input_text = "Singularity is near"

# Use GPT-4o's tokenizer so the token IDs match the ones shown above.
encoding = tiktoken.encoding_for_model("gpt-4o")
tokens = encoding.encode(input_text)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))

client = OpenAI()
embeddings = []
for token in tokens:
    # Decode the token back to text to get its embedding.
    token_text = encoding.decode([token])
    response = client.embeddings.create(
        input=token_text,
        model="text-embedding-3-small"
    )
    embeddings.append(response.data[0].embedding)

for i, embedding in enumerate(embeddings):
    print(f"Embedding for token {tokens[i]} ('{encoding.decode([tokens[i]])}'):")
    first_three_and_last = embedding[:3] + ['...'] + [embedding[-1]]
    print(first_three_and_last)
    print("-" * 20)

Above we decoded each token back to text before requesting its embedding. We said earlier that the flow goes from words to tokens and then from tokens to embeddings; we pass text here because the OpenAI client.embeddings.create method takes text for its input parameter rather than GPT-4o token IDs. Keep in mind that text-embedding-3-small is a separate embedding model, so the vectors it returns illustrate the idea of an embedding lookup rather than reproducing GPT-4o’s internal embedding matrix.

Step 3 - Add Position Information:

Sing_final = Sing_token_emb + position_emb[0]
ularity_final = ularity_token_emb + position_emb[1]
is_final = is_token_emb + position_emb[2]
near_final = near_token_emb + position_emb[3]

Now the model has four vectors it can work with, each containing both semantic and positional information.

9. The Magic of Learned Representations

What’s fascinating is that through training, these embeddings develop meaningful patterns:

  • Semantic clusters: Words with similar meanings cluster together
  • Analogical relationships: “king - man + woman ≈ queen”
  • Syntactic patterns: Verbs, nouns, adjectives form their own neighborhoods

The model discovers these patterns automatically just by trying to predict the next token millions of times.
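A toy illustration of the analogy arithmetic, with invented 3-dimensional vectors; real models use hundreds of dimensions and the relationship is only approximate:

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors chosen so the "gender direction" is roughly consistent.
king = np.array([0.9, 0.8, 0.1])
man = np.array([0.5, 0.1, 0.1])
woman = np.array([0.5, 0.1, 0.9])
queen = np.array([0.9, 0.8, 0.9])

result = king - man + woman    # [0.9, 0.8, 0.9]
print(cosine(result, queen))   # ~1.0 in this toy space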

10. Memory and Efficiency Considerations

Token-to-embedding conversion is fast—it’s just array lookups. But storing the embedding matrix requires significant memory:

Memory calculation:

Memory = num_tokens × embedding_dimension × 4 bytes (float32)
For GPT-3:
50,257 × 12,288 × 4 = ~2.5 GB just for token embeddings!

This is why model optimization often focuses on:

  • Reducing vocabulary size (fewer tokens)
  • Lowering embedding dimensions
  • Using different data types (float16 instead of float32)
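A quick sketch of how those three levers change the footprint, using GPT-3’s embedding shape as the baseline:

# Embedding memory = vocabulary size x embedding dimension x bytes per value
def embedding_memory_gb(vocab, dim, bytes_per_value):
    return vocab * dim * bytes_per_value / 1e9

print(embedding_memory_gb(50_257, 12_288, 4))   # float32 baseline: ~2.5 GB
print(embedding_memory_gb(50_257, 12_288, 2))   # float16: ~1.2 GB
print(embedding_memory_gb(32_000, 12_288, 4))   # smaller vocabulary: ~1.6 GB
print(embedding_memory_gb(50_257, 4_096, 4))    # lower dimension: ~0.8 GB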

Summary

Token-to-embedding conversion is the crucial bridge between symbols and mathematics:

  1. Tokenization gives us integer IDs that represent subwords
  2. Embedding lookup converts these IDs to dense, meaningful vectors
  3. Position embeddings add crucial sequence information
  4. The result is rich numerical representations the model can process

This seemingly simple lookup operation is what allows language models to work with human language while operating in the mathematical world of neural networks. Without this conversion, there’s no way for the model to understand that “king” and “queen” are related, or that “cat” and “kitten” share meaning.

One Last Thing

Every word you type goes through this exact process, transforming from text into vectors that capture not just the symbols, but their meaning, relationships, and position in the grand mathematical space where AI does its thinking. Actually, AI does math without even understanding how it does it. I find it very similar to how we think. We still don’t understand many of the processes happening in our brains scientifically. Beyond that, we have no way of observing our own intellectual processes. So, when we say AI doesn’t actually think or understand, I am not sure what that really means other than us trying to impose our superiority. Maybe.

Footnotes

  1. https://platform.openai.com/tokenizer