luminary.blog
by Oz Akan

Embedding Selection for RAG Systems

At the heart of every effective RAG implementation lies a crucial decision: which embedding model to use.

10 min read


Introduction

Retrieval-Augmented Generation (RAG) has emerged as a critical approach for extending large language models beyond their training data. At the heart of every effective RAG implementation lies an important decision: which embedding model to use. This choice directly affects your system’s performance, costs, and scalability.

Why Embedding Selection Matters

Embeddings are the foundation of semantic search in RAG systems. These dense vector representations translate text into numerical values that capture meaning, allowing systems to find contextually relevant information by measuring vector similarity. However, not all embedding models are created equal.
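
To make this concrete, here is a minimal sketch that embeds two short texts and compares them with cosine similarity. It assumes the open-source sentence-transformers package and the all-MiniLM-L6-v2 model discussed later; any embedding model would slot in the same way.

```python
# Minimal sketch: embed two texts and compare them with cosine similarity.
# Assumes the sentence-transformers package; any embedding model works the same way.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

texts = ["How do I reset my password?", "Steps to recover account access"]
vectors = model.encode(texts)  # numpy array of shape (2, 384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[1]))
```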

“The quality of retrieval depends on how well content is chunked, indexed, and matched to user intent,” as industry experts note. Poor embedding selection can lead to retrieval noise or missing context that significantly degrades the quality of your AI’s responses.

Evidence-Based Selection Framework

To help you navigate this landscape, the framework below grounds embedding selection in both theoretical foundations and practical benchmarks.

1. Start with Benchmark Leaderboards

Begin your selection process by consulting established benchmarks:

  • Hugging Face MTEB (Massive Text Embedding Benchmark) provides comparative performance data across models on retrieval tasks (Blog post on MTEB).
  • Focus on metrics like NDCG@10, which measures ranking quality over the top ten results; a small worked example follows below.
  • For multimodal applications, emerging benchmarks like MME offer specialized insights.

These leaderboards provide an objective starting point, but they’re only the beginning of your journey.
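
For intuition on the NDCG@10 numbers these leaderboards report, here is a small, self-contained sketch of the metric; the relevance labels are hypothetical.

```python
# Self-contained illustration of NDCG@10, the ranking metric reported on MTEB-style
# leaderboards. `relevances` holds graded relevance labels in the order results were ranked.
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Normalized Discounted Cumulative Gain at rank k."""
    def dcg(scores: list[float]) -> float:
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical query: only the 2nd and 5th retrieved chunks were actually relevant.
print(ndcg_at_k([0, 1, 0, 0, 1, 0, 0, 0, 0, 0]))  # ~0.62
```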

2. Evaluate Domain Relevance

The most critical factor in embedding selection is how well the model aligns with your specific domain:

  • General vs. Domain-Specific: While general-purpose embeddings from providers like OpenAI or Cohere work well for broad topics, specialized domains (legal, medical, technical) often benefit from custom or fine-tuned embeddings
  • Vocabulary Coverage: Ensure the model adequately represents domain-specific terminology and concepts
  • Conceptual Resolution: Some domains require distinguishing between subtle semantic differences that general models might miss

Custom evaluation is critical: As industry data shows, models performing well on general benchmarks may underperform on niche datasets. Testing embeddings with a representative sample of your actual data is essential to identify mismatches in terminology or structure. (Check out this paper.)
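
A minimal version of such a spot-check might look like the sketch below, assuming sentence-transformers is available; the documents, queries, and relevance labels are placeholders for a representative sample of your own data.

```python
# Hedged sketch of a domain spot-check: retrieval recall@1 on a tiny labeled sample.
# The documents, queries, and labels are placeholders for a slice of your own corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = {
    "d1": "Section 409A governs deferred compensation arrangements.",
    "d2": "The patient presented with acute myocardial infarction.",
    "d3": "Kubernetes pods restart automatically when liveness probes fail.",
}
labeled_queries = [
    ("tax rules for deferred comp", "d1"),
    ("heart attack on admission", "d2"),
    ("why does my pod keep restarting", "d3"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in each candidate model here
doc_ids = list(docs)
doc_vecs = model.encode([docs[i] for i in doc_ids], normalize_embeddings=True)

hits = 0
for query, relevant_id in labeled_queries:
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    top_doc = doc_ids[int(np.argmax(doc_vecs @ q_vec))]  # top-1 by cosine similarity
    hits += top_doc == relevant_id

print(f"recall@1 = {hits / len(labeled_queries):.2f}")
```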

3. Balance Technical Specifications

Your selection must balance several technical factors:

| Factor | Considerations |
| --- | --- |
| Dimensions | Higher dimensions capture more nuance but increase storage and computational costs. Aim for 384–768 dimensions for most RAG use cases. |
| Window Size | Models have different context windows (how much text they can process at once). |
| Multilingual Support | Critical for international applications or multilingual content. |
| Latency | Generation speed affects both indexing time and query response time. |
| Cost | Open-source models can reduce expenses compared to proprietary APIs. |

Dimensions in Depth

  1. Nuance vs. Dimensionality

    • Higher-dimensional embeddings (e.g., 300–1000 dimensions) encode finer semantic relationships, as seen in models like BERT (768 dimensions) and OpenAI’s text-embedding-3-large (3072 dimensions).
    • However, excessively high dimensions (e.g., 1536–3072) risk overfitting and computational inefficiency without proportional gains in retrieval quality.
  2. Cost and Performance Trade-offs

    • Storage/Memory: Embeddings with 384 dimensions require ~1.5KB per vector, while 768-dimensional vectors need ~3KB. For a billion-scale database, this difference translates to terabytes of additional storage (see the quick calculation after this list).
    • Query Speed: Cosine similarity computations scale linearly with dimensions. For example, 768-dimensional vectors take ~2x longer to process than 384-dimensional ones.
    • Indexing Efficiency: High-dimensional vectors force approximate nearest neighbor (ANN) algorithms like HNSW to use more layers, increasing memory usage and latency.
  3. Recommended Range (384–768 Dimensions)

    • Empirical Benchmarks: Models like all-MiniLM-L6-v2 (384 dimensions) and UAE-Large-V1 (1024 dimensions) achieve strong retrieval performance while balancing speed and storage.
    • Industry Guidance: MongoDB and Weaviate recommend 384–768 dimensions for RAG, as this range captures sufficient semantic detail without excessive overhead.
    • Cost-Effectiveness: OpenAI’s text-embedding-3-small (512 dimensions) outperforms larger models in cost/performance ratios for RAG, costing $0.02 per million tokens vs. $0.13 for text-embedding-3-large.
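
The storage figures above are easy to sanity-check; the sketch below assumes uncompressed float32 vectors (4 bytes per dimension) and ignores index overhead.

```python
# Back-of-the-envelope check of the storage figures above, assuming uncompressed
# float32 vectors (4 bytes per dimension) and ignoring index overhead.
def storage_gb(dimensions: int, num_vectors: int, bytes_per_value: int = 4) -> float:
    return dimensions * bytes_per_value * num_vectors / 1e9

for dims in (384, 768, 1536, 3072):
    kb_per_vector = dims * 4 / 1024
    print(f"{dims:4d} dims: ~{kb_per_vector:.1f} KB/vector, "
          f"~{storage_gb(dims, 1_000_000_000):,.0f} GB per billion vectors")
```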

Performance Comparison

| Model | Dimensions | Retrieval Accuracy (NDCG@10) | Latency per Query | Storage per Million Vectors |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 58.3 | 12 ms | 0.38 GB |
| text-embedding-3-small | 512 | 62.0 | 15 ms | 0.51 GB |
| text-embedding-3-large | 3072 | 64.6 | 45 ms | 3.07 GB |

Practical Recommendations

  • Start Small: Begin with 384–512 dimensions (e.g., all-MiniLM-L6-v2) to establish a cost-effective baseline.
  • Scale Strategically: Increase to 768 dimensions only if retrieval quality is insufficient, and monitor latency/storage impacts.
  • Optimize Post-Processing: Use PCA or quantization to reduce dimensions without significant accuracy loss (e.g., 1024 → 512 dimensions).
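
As a rough illustration of the PCA option above, the sketch below (assuming scikit-learn, with random stand-in vectors in place of real embeddings) projects 1024-dimensional vectors down to 512 and re-normalizes them for cosine similarity.

```python
# Illustration of the PCA option above (scikit-learn assumed). Random vectors stand in
# for real 1024-dimensional embeddings; fit PCA once, then project documents and queries alike.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(10_000, 1024)).astype(np.float32)

pca = PCA(n_components=512)
reduced = pca.fit_transform(corpus_embeddings)  # shape: (10_000, 512)

# Re-normalize so cosine similarity stays meaningful after the projection.
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)

print(reduced.shape, f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```

On real embeddings (unlike the random stand-ins here), the retained variance is typically much higher, which is why modest reductions often cost little accuracy.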

4. Consider Dense vs. Sparse Embeddings

The embedding type should align with your content characteristics:

  • Dense embeddings (like OpenAI’s text-embedding-3-small) excel at general semantic search and understanding conceptual relationships
  • Sparse embeddings (like SPLADE) often perform better for domains with rare terms (such as medical jargon) by emphasizing keyword relevance

For many applications, a hybrid approach combining both dense and sparse techniques delivers the best results.
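
One simple way to combine the two signals is Reciprocal Rank Fusion; the sketch below merges a hypothetical dense ranking and a hypothetical sparse ranking without depending on any particular retrieval library.

```python
# Minimal hybrid-retrieval sketch: merge a dense (semantic) ranking and a sparse
# (keyword) ranking with Reciprocal Rank Fusion. The rankings below are hypothetical.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked lists of document IDs; larger k dampens the influence of any single list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_7", "doc_2", "doc_9", "doc_4"]   # e.g., cosine similarity over dense vectors
sparse_ranking = ["doc_2", "doc_5", "doc_7", "doc_1"]  # e.g., SPLADE or BM25 scores
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))  # doc_2 and doc_7 rise to the top
```

Because RRF only uses ranks, it sidesteps the problem that dense and sparse scores live on different scales.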

5. Optimize Your Chunking Strategy

Embedding selection cannot be separated from your chunking approach:

  • Chunk size: Smaller chunks (128–512 tokens) improve precision but may lose context
  • Overlap: Strategic overlap between chunks helps preserve contextual continuity
  • Document structure: Respecting natural document boundaries (paragraphs, sections) improves retrieval quality

Your chunking strategy should be developed in tandem with your embedding selection, as they directly impact each other.
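
For reference, a bare-bones fixed-size chunker with overlap might look like this; real pipelines typically count tokens with the embedding model's tokenizer rather than whitespace-separated words.

```python
# Bare-bones fixed-size chunker with overlap. Real pipelines usually count tokens with
# the embedding model's tokenizer; whitespace-separated words keep this sketch self-contained.
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into word-based chunks of `chunk_size`, carrying `overlap` words of context forward."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, max(len(words) - overlap, 1), step)]

document = "Retrieval-Augmented Generation combines retrieval with generation. " * 100
chunks = chunk_text(document, chunk_size=128, overlap=16)
print(f"{len(chunks)} chunks, {len(chunks[0].split())} words in the first chunk")
```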

Implementation Approaches

Based on these considerations, several implementation strategies emerge:

Single Embedding Approach

Using one embedding model for your entire corpus is the simplest approach. This works well when:

  • Your content is relatively homogeneous
  • You have limited computational resources
  • You need a straightforward implementation

Multi-Embedding Strategy

Using different embedding models for different content types or query categories can improve relevance, but it adds complexity:

  • Domain-specific embeddings for technical content
  • General embeddings for broad topics
  • Specialized embeddings for particular languages or formats
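
A lightweight router for this strategy can be as simple as a mapping from content type to model. The model identifiers below are illustrative assumptions, and some models (E5-style models, for instance) expect query/passage prefixes, so check each model's own loading instructions; vectors from different models are not comparable and must live in separate indexes.

```python
# Sketch of a content-type router for a multi-embedding setup. The model identifiers are
# illustrative; check each model's loading instructions before relying on them.
# Vectors from different models are not comparable, so each model needs its own index.
from sentence_transformers import SentenceTransformer

MODELS = {
    "general": "all-MiniLM-L6-v2",        # broad topics, low cost
    "multilingual": "BAAI/bge-m3",        # international content
    "technical": "intfloat/e5-large-v2",  # retrieval-heavy technical documents
}
_loaded: dict[str, SentenceTransformer] = {}

def embed(texts: list[str], content_type: str = "general"):
    """Embed texts with the model registered for the given content type (lazy-loaded)."""
    name = MODELS.get(content_type, MODELS["general"])
    if name not in _loaded:
        _loaded[name] = SentenceTransformer(name)
    return _loaded[name].encode(texts, normalize_embeddings=True)

vectors = embed(["¿Cómo configuro la autenticación?"], content_type="multilingual")
```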

Hybrid Retrieval Systems

Combining embedding-based retrieval with other methods:

  • Keyword search for handling rare terms or acronyms
  • Classification models for routing to appropriate knowledge bases
  • BM25 or other traditional IR methods as fallbacks

Evaluation Methods and Tools

Proper embedding selection requires systematic evaluation:

  • Ragas: An open-source framework for benchmarking retrieval quality
  • Vectorize: Streamlines experiments with chunking strategies and embedding models
  • Chunk Attribution Analysis: Measures how often retrieved chunks actually contribute to answers
  • A/B Testing: Comparing different embeddings on real user queries

Industry leaders recommend measuring not just retrieval accuracy, but also inference speed, storage requirements, and how performance scales with increasing data volume.
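
As a rough illustration of chunk attribution, the sketch below uses a purely lexical overlap heuristic; frameworks like Ragas measure attribution far more rigorously, so treat this only as a way to build intuition.

```python
# Purely lexical stand-in for chunk attribution analysis: what fraction of retrieved chunks
# share enough vocabulary with the final answer to have plausibly been used? Frameworks like
# Ragas measure this far more rigorously; this sketch only illustrates the idea.
def attribution_rate(retrieved_chunks: list[str], answer: str, min_overlap: float = 0.2) -> float:
    answer_words = set(answer.lower().split())
    used = 0
    for chunk in retrieved_chunks:
        chunk_words = set(chunk.lower().split())
        overlap = len(chunk_words & answer_words) / max(len(chunk_words), 1)
        used += overlap >= min_overlap
    return used / max(len(retrieved_chunks), 1)

chunks = [
    "Embeddings map text to vectors that capture meaning.",
    "Our office is closed on public holidays.",
]
answer = "Embeddings capture meaning by mapping text to vectors."
print(attribution_rate(chunks, answer))  # 0.5 -- only one of the two chunks contributed
```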

Common Embedding Models and Their Strengths

| Model | Strengths | Best For |
| --- | --- | --- |
| text-embedding-3-large (OpenAI) | High performance for general topics | Applications requiring nuanced understanding |
| text-embedding-3-small (OpenAI) | Good balance of performance and efficiency | Balanced systems with moderate complexity |
| e5-large (Microsoft) | Strong on information retrieval tasks | Research-oriented applications |
| all-MiniLM-L6-v2 (Sentence Transformers) | Lightweight with good performance | Cost-sensitive implementations |
| BGE-M3 (BAAI) | Strong multilingual performance | International applications |

Planning for Scalability

As your RAG system grows, embedding selection becomes even more critical:

  • Verify API rate limits for proprietary models and monitor costs as usage increases
  • For open-source models, optimize inference speed with quantization or hardware acceleration
  • Standardize embedding pipelines, using the same model for embedding queries and documents
  • Regularly update models to leverage improvements in the rapidly evolving field
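
A small wrapper can enforce the "same model for queries and documents" rule; the sketch below assumes sentence-transformers and records the model name so the index can be rebuilt consistently.

```python
# Sketch of a standardized embedding pipeline: one object owns the model choice, so
# documents and queries are always embedded with the same model and settings.
from sentence_transformers import SentenceTransformer

class EmbeddingPipeline:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name  # record this alongside the index for reproducibility
        self._model = SentenceTransformer(model_name)

    def embed_documents(self, texts: list[str]):
        return self._model.encode(texts, normalize_embeddings=True)

    def embed_query(self, text: str):
        # Same model and normalization as the documents; mismatches here silently degrade retrieval.
        return self._model.encode([text], normalize_embeddings=True)[0]

pipeline = EmbeddingPipeline()
doc_vectors = pipeline.embed_documents(["RAG retrieves supporting context before generation."])
query_vector = pipeline.embed_query("How does RAG use retrieval?")
```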

Conclusion

Embedding selection is both an art and a science, requiring careful consideration of technical specifications, domain relevance, and operational constraints. You can select embeddings that maximize your RAG system’s accuracy by combining benchmark results with domain-specific testing and systematic evaluation.


Glossary

Retrieval-Augmented Generation (RAG)
A technique that combines large language models with external data retrieval, allowing the model to access and use information beyond its training data to improve responses.

Embedding Model
A machine learning model that converts text or other data into dense vector representations (embeddings) that capture semantic meaning for tasks like search or classification.

Embeddings
Numerical vectors representing text, images, or other data, designed so that similar items are close together in vector space, enabling semantic search and comparison.

Semantic Search
A search method that uses the meaning (semantics) of queries and documents, rather than just keyword matching, often leveraging embeddings to find contextually relevant results.

Vector Similarity
A measure of how close two vectors (embeddings) are in space, commonly calculated using metrics like cosine similarity to determine semantic similarity between pieces of text.

Chunking
The process of splitting documents into smaller segments (chunks), such as paragraphs or sentences, to improve retrieval precision and context management in RAG systems.

Indexing
Organizing and storing embeddings or documents in a way that allows for efficient retrieval during search operations.

Benchmark Leaderboards
Publicly available rankings that compare the performance of different models on standardized tasks, providing objective data for model selection.

Hugging Face MTEB (Massive Text Embedding Benchmark)
A large-scale benchmark suite that evaluates and compares text embedding models across various retrieval and semantic tasks.

NDCG@10 (Normalized Discounted Cumulative Gain at 10)
A metric for evaluating ranking quality in information retrieval, measuring how well the top 10 results match the ideal order of relevance.

Multimodal Applications
Systems or tasks that process and integrate multiple data types (e.g., text, images, audio) simultaneously.

Domain-Specific Embeddings
Embeddings trained or fine-tuned on data from a particular field (e.g., medical, legal), capturing specialized vocabulary and concepts.

Vocabulary Coverage
The extent to which a model’s embeddings represent the terms and concepts relevant to a specific domain or dataset.

Conceptual Resolution
A model’s ability to distinguish between subtle differences in meaning or context within a domain.

Context Window
The maximum amount of text (measured in tokens) that a model can process at one time.

Multilingual Support
The capability of a model to handle and generate embeddings for content in multiple languages.

Latency
The time delay between submitting a query and receiving a response, influenced by model complexity and computational resources.

Open-Source Models
Machine learning models whose source code is freely available for use, modification, and distribution.

Dimensionality (Dimensions)
The number of values in an embedding vector; higher dimensions can capture more nuance but require more storage and computation.

Overfitting
A modeling issue where a model learns patterns specific to the training data, reducing its ability to generalize to new data.

Cosine Similarity
A metric that measures the cosine of the angle between two vectors, commonly used to assess the similarity of embeddings.

Approximate Nearest Neighbor (ANN) Algorithms
Algorithms designed to quickly find vectors in a large dataset that are closest to a query vector, often used in high-dimensional spaces.

HNSW (Hierarchical Navigable Small World)
A popular ANN algorithm that builds a graph structure to enable efficient similarity search in large-scale vector databases.

PCA (Principal Component Analysis)
A dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much variance as possible.

Quantization
A process that reduces the precision of model weights or embeddings to decrease storage and computation needs, often with minimal accuracy loss.

Dense Embeddings
Embeddings where most vector values are non-zero, capturing broad semantic relationships and suitable for general semantic search.

Sparse Embeddings
Embeddings where most vector values are zero, emphasizing the presence or absence of specific features or keywords, often improving performance for rare terms.

Hybrid Approach
Combining dense and sparse embeddings (or other methods) to leverage the strengths of both for improved retrieval performance.

Chunk Size
The number of tokens or words in each document segment (chunk) used for embedding and retrieval.

Overlap
The practice of including some repeated content between adjacent chunks to maintain contextual continuity.

Single Embedding Approach
Using one embedding model for all content, suitable for homogeneous data and simpler implementations.

Multi-Embedding Strategy
Using different embedding models for different types of content or queries to improve relevance and performance.

Hybrid Retrieval Systems
Combining embedding-based retrieval with traditional methods (like keyword search or BM25) or classification models for more robust information access.

BM25
A traditional ranking function used in information retrieval that scores documents based on term frequency and inverse document frequency.

Ragas
An open-source framework for benchmarking and evaluating the quality of retrieval in AI systems.

Vectorize
A tool or framework that streamlines experiments with different chunking strategies and embedding models.

Chunk Attribution Analysis
A method for measuring how often retrieved chunks actually contribute to the final answers in a retrieval system.

A/B Testing
An experimental method where two or more variants are compared to determine which performs better on real user queries.

Inference Speed
The time it takes for a model to generate embeddings or responses from input data.

text-embedding-3-large / text-embedding-3-small (OpenAI)
Specific OpenAI embedding models with different dimensionalities and performance/cost trade-offs.

e5-large (Microsoft)
A Microsoft model optimized for information retrieval tasks.

all-MiniLM-L6-v2 (Sentence Transformers)
A lightweight, efficient embedding model from the Sentence Transformers library, balancing performance and resource usage.

BGE-M3 (BAAI)
A multilingual embedding model from BAAI, designed for international applications.

API Rate Limits
Restrictions imposed by service providers on the number of API requests allowed within a certain time frame.

Inference
The process of using a trained model to generate predictions or embeddings from new input data.

Embedding Pipelines
The end-to-end workflow for generating, storing, and retrieving embeddings in an AI system.

Fine-Tuning
The process of further training a pre-trained model on domain-specific data to improve its performance in a particular context.