In most of the AI-powered knowledge management systems we have built, we need to process and retrieve information from diverse document types, ranging from academic papers to reports to webinar transcripts. This heterogeneity presents challenges for traditional single-vector embedding approaches, even with hybrid (keyword) retrieval. This article examines how multi-vector embedding strategies can improve retrieval effectiveness across varied data sources while maintaining semantic relevance and structural context.
The field of document retrieval and embedding strategies (in the context of RAG) continues to evolve rapidly. Earlier (circa 2022) discussions focused heavily on chunking techniques - the methods used to break documents into smaller, processable pieces. Approaches range from simple length-based splits (e.g. 1,000 characters or 200 words) to more sophisticated methods that consider document structure and semantic meaning. These include text-structured approaches that respect natural language boundaries, document-structured methods that preserve format-specific elements (like HTML or Markdown), and semantic-based splitting that considers content meaning.
Alongside chunking strategies, contextual retrieval has emerged as a promising direction for improving retrieval accuracy. Traditional RAG systems often struggle with context loss during chunking, potentially missing crucial information during retrieval. New approaches address this by maintaining or reconstructing context during the embedding process. For example, some systems now add contextual prefixes to chunks before embedding, helping preserve relationships between different parts of a document.
While these core challenges around retrieval and chunking continue to be actively researched, cutting-edge work is pushing into new territories. A recent pre-print (Jin et al., 2024) highlights the need to move beyond purely technical retrieval capabilities toward "preference-aligned RAG" - systems that better align with human preferences and expectations. This line of research explores challenges such as ensuring logical coherence when reasoning across multiple documents, providing precise citations, knowing when to abstain from answering, and handling conflicting information in retrieved documents.
The discussion around what was called ‘advanced RAG’ techniques (but likely now ‘basic’ considering advancements) highlights a fundamental tension in knowledge retrieval systems: the balance between granularity and context. As the field continues to mature, these foundational challenges are being tackled alongside newer questions about how to make RAG systems not just technically capable, but also aligned with how humans actually want to use them.
Back to ‘basics’, when dealing with diverse document collections, we have used two main approaches for handling structural context of document chunks in retrieval systems. The first approach focuses on standardization: using small, efficient models to classify and tag each document chunk with consistent metadata labels across the entire database. This method creates a homogeneous metadata space that enables precise filtering and segmentation, potentially improving retrieval accuracy through well-defined categorical boundaries. For example, a model could identify and tag all sections that represent "lessons learned," regardless of their original headings.
The alternative approach, which we explore in this article, embraces the inherent flexibility of vector embeddings to handle structural diversity. Rather than enforcing a rigid metadata structure, this method uses a second embedding to capture the contextual and structural nuances of each document chunk. This approach is particularly valuable when dealing with varied document types - from academic papers to webinar transcripts - where structural elements might carry different meanings or serve different purposes across documents. For instance, insights that function as "lessons learned" might appear in various sections, from "Results" to "Final Observations" to "Key Takeaways," and a vector-based approach can capture these semantic relationships more naturally.
Multi-vector embeddings represent different aspects of a document chunk through separate vector representations. This approach typically involves maintaining at least two distinct embeddings:

- a content embedding that captures the semantic meaning of the chunk text itself, and
- a metadata (structural) embedding that captures contextual information such as document title, section, and position within the document.
This separation allows for more nuanced retrieval operations that can leverage both content similarity and structural relevance independently or in combination.
For example, take this chunk of text from page 14 of a report:

```json
{
  "id": "chunk_123",
  "text": "Our research in Ghana revealed significant improvements in child nutrition when combining local ingredients with education programs.",
  "metadata": {
    "document_title": "Ghana Nutrition Program Evaluation 2023",
    "section": "Results and Discussion",
    "subsection": "Key Findings",
    "authors": ["Smith, J.", "Kumar, R."],
    "page": 14
  }
}
```

Traditionally, one would embed just the `text` field, or sometimes prepend the metadata to the text and embed the combined string (but see the Noise section below). Instead, we suggest creating one embedding for the text and a separate embedding for part of the metadata.
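Given a chunk record shaped like the example above, the two embedding inputs might be prepared like this - a minimal sketch in Python; the embedding model call itself is omitted, and the choice of metadata fields is an assumption to be tuned per corpus:

```python
# Sketch: prepare the two strings that feed the two separate embeddings.
# The embedding model (OpenAI, sentence-transformers, etc.) is out of scope here.

def embedding_inputs(chunk: dict) -> tuple[str, str]:
    """Return (text_input, metadata_input) for two separate embedding calls."""
    text_input = chunk["text"]
    md = chunk["metadata"]
    # Condense only the structurally meaningful fields into one short string.
    metadata_input = " | ".join([
        md["document_title"],
        md["section"],
        md.get("subsection", ""),
    ]).strip(" |")
    return text_input, metadata_input

chunk = {
    "id": "chunk_123",
    "text": "Our research in Ghana revealed significant improvements "
            "in child nutrition when combining local ingredients with "
            "education programs.",
    "metadata": {
        "document_title": "Ghana Nutrition Program Evaluation 2023",
        "section": "Results and Discussion",
        "subsection": "Key Findings",
    },
}

text_in, meta_in = embedding_inputs(chunk)
# meta_in -> "Ghana Nutrition Program Evaluation 2023 | Results and Discussion | Key Findings"
```

Each string is then passed independently to the embedding model, producing the two vectors stored per chunk.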
The underlying principle of multi-vector embeddings stems from the observation that different aspects of a document chunk (content and structure) occupy distinct semantic spaces. By separating these representations, we can score content similarity and structural relevance independently, combine them with tunable weights at query time, and match structural intent even when the chunk text itself does not use the query's wording.
The implementation of a multi-vector embedding system requires careful consideration of several key components: how each chunk's text and metadata are turned into separate embeddings, how both vectors are stored and indexed, how user queries are mapped to matching query-side vectors, and how the resulting similarity scores are combined at retrieval time.
When processing user queries, the system generates two distinct embeddings: one for the semantic content of the query itself, and one for the structural or contextual intent the query implies (for example, the kind of section the answer is likely to live in).
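A sketch of this query-side split, using a toy keyword heuristic in place of the LLM or classifier a production system would more likely use - the cue table here is purely illustrative:

```python
# Sketch: derive the two query-side embedding inputs.
# A real system might use an LLM or a small classifier; this toy
# heuristic just looks for structural cues in the query string.

STRUCTURAL_CUES = {
    "lessons learned": "section: lessons learned",
    "q&a": "section: question and answer",
    "recommendations": "section: recommendations",
}

def query_inputs(query: str) -> tuple[str, str]:
    """Return (content_input, structural_input) for the two query embeddings."""
    q = query.lower()
    structural = [label for cue, label in STRUCTURAL_CUES.items() if cue in q]
    # Fall back to the raw query when no structural cue is detected.
    return query, "; ".join(structural) or query

content_in, structural_in = query_inputs(
    "What are the lessons learned about nutrition in Ghana?"
)
# structural_in -> "section: lessons learned"
```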
The most straightforward implementation combines both vectors using a weighted scoring function:
final_score = w_t * text_similarity + w_m * metadata_similarity
Where `w_t` and `w_m` are the weights for the text and metadata similarities (e.g. 0.7 and 0.3), typically chosen so they sum to 1.
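A minimal sketch of this weighted combination in plain Python, using toy vectors and the same 0.7/0.3 weighting used elsewhere in this article:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def combined_score(text_sim: float, meta_sim: float,
                   w_t: float = 0.7, w_m: float = 0.3) -> float:
    """final_score = w_t * text_similarity + w_m * metadata_similarity"""
    return w_t * text_sim + w_m * meta_sim

# Toy 3-d vectors standing in for real embeddings.
text_sim = cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # identical -> 1.0
meta_sim = cosine([0.0, 1.0, 0.0], [0.0, 0.0, 1.0])  # orthogonal -> 0.0
score = combined_score(text_sim, meta_sim)           # 0.7 * 1.0 + 0.3 * 0.0
```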
In a vector-enabled database like PostgreSQL with pgvector, this can be implemented directly in SQL:
```sql
WITH combined_scores AS (
    SELECT
        id,
        text_content,
        -- pgvector's <=> returns cosine distance (lower is better),
        -- so convert to similarity before weighting
        0.7 * (1 - (text_embedding <=> query_text_vector)) +
        0.3 * (1 - (metadata_embedding <=> query_metadata_vector)) AS combined_score
    FROM documents
    WHERE text_embedding IS NOT NULL
      AND metadata_embedding IS NOT NULL
    ORDER BY combined_score DESC
    LIMIT 100
)
SELECT * FROM combined_scores
ORDER BY combined_score DESC;
```
This approach can be further enhanced using Reciprocal Rank Fusion (RRF), which helps balance the influence of different ranking signals when you add a keyword (`tsvector`) search:
```sql
WITH vector_scores AS (
    SELECT
        id,
        text_content,
        -- Combined vector score using weighted text and metadata embeddings
        -- (70%/30% weighting); <=> is cosine distance, so convert to similarity
        0.7 * (1 - (text_embedding <=> query_text_vector)) +
        0.3 * (1 - (metadata_embedding <=> query_metadata_vector)) AS vector_score
    FROM documents
    WHERE text_embedding IS NOT NULL
      AND metadata_embedding IS NOT NULL
),
vector_ranked AS (
    SELECT
        id,
        vector_score,
        ROW_NUMBER() OVER (ORDER BY vector_score DESC) AS rank_ix
    FROM vector_scores
    LIMIT 400 -- reasonable limit for performance
),
keyword_scores AS (
    SELECT
        id,
        ts_rank_cd(to_tsvector('english', text_content), query_tsquery, 32) AS keyword_score
    FROM documents
    WHERE to_tsvector('english', text_content) @@ query_tsquery
),
keyword_ranked AS (
    SELECT
        id,
        keyword_score,
        ROW_NUMBER() OVER (ORDER BY keyword_score DESC) AS rank_ix
    FROM keyword_scores
    LIMIT 400
),
combined_scores AS (
    SELECT
        COALESCE(v.id, k.id) AS id,
        (
            -- Using an rrf_k value of 60 (values of 30-60 are worth trialling)
            COALESCE(1.0 / (60 + v.rank_ix), 0.0) * 0.7 + -- vector weight
            COALESCE(1.0 / (60 + k.rank_ix), 0.0) * 0.3   -- keyword weight
        ) AS combined_score,
        v.vector_score,
        k.keyword_score
    FROM (
        SELECT DISTINCT id FROM (
            SELECT id FROM vector_ranked
            UNION
            SELECT id FROM keyword_ranked
        ) u
    ) all_ids
    LEFT JOIN vector_ranked v USING (id)
    LEFT JOIN keyword_ranked k USING (id)
)
SELECT
    d.*,
    c.combined_score,
    c.vector_score,
    c.keyword_score
FROM documents d
INNER JOIN combined_scores c ON d.id = c.id
ORDER BY c.combined_score DESC
LIMIT 20;
```
Here, we first compute a combined vector score using weighted text and metadata embeddings, then use RRF to combine this with traditional keyword search rankings. The COALESCE handles cases where a document might be found by one method but not the other.
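Outside the database, the same weighted RRF logic can be sketched in a few lines of Python - the rank lists and weights here are illustrative:

```python
# Sketch: weighted Reciprocal Rank Fusion over named, best-first rank lists.
# A document missing from one list simply contributes 0 from that signal,
# mirroring the COALESCE handling in the SQL version.

def rrf(rankings: dict[str, list[str]],
        weights: dict[str, float],
        k: int = 60) -> list[tuple[str, float]]:
    """Fuse several rank lists into one, best-first, (doc_id, score) list."""
    scores: dict[str, float] = {}
    for name, ranked_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fused = rrf(
    rankings={"vector": ["a", "b", "c"], "keyword": ["b", "d"]},
    weights={"vector": 0.7, "keyword": 0.3},
)
# "b" rises to the top because it appears in both lists
```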
Note: other sparse-vector or keyword search techniques may be preferable to `tsvector`, and the SQL above could be further optimized; it is shown here to illustrate the approach.
An alternative approach implements a two-step retrieval process: first use the metadata embedding to narrow the candidate set to structurally relevant chunks, then rank those candidates by text-embedding similarity.
This method proves particularly effective when queries have strong structural components or when performance optimization is crucial.
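A sketch of this two-step process over an in-memory list of chunks; a real system would use the database's vector index for the first step, and the candidate counts here are assumptions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def two_step_retrieve(chunks: list[dict],
                      query_text_vec: list[float],
                      query_meta_vec: list[float],
                      meta_candidates: int = 100,
                      top_k: int = 10) -> list[dict]:
    """Step 1: shortlist by metadata similarity; step 2: rank by text similarity."""
    shortlist = sorted(
        chunks,
        key=lambda c: cosine(c["metadata_embedding"], query_meta_vec),
        reverse=True,
    )[:meta_candidates]
    return sorted(
        shortlist,
        key=lambda c: cosine(c["text_embedding"], query_text_vec),
        reverse=True,
    )[:top_k]

# Toy 2-d vectors standing in for real embeddings.
chunks = [
    {"id": "a", "text_embedding": [1.0, 0.0], "metadata_embedding": [0.0, 1.0]},
    {"id": "b", "text_embedding": [0.0, 1.0], "metadata_embedding": [1.0, 0.0]},
]
results = two_step_retrieve(chunks, query_text_vec=[1.0, 0.0],
                            query_meta_vec=[0.0, 1.0])
```

Running the structural filter first keeps the expensive text-similarity ranking confined to a small candidate set, which is where the performance benefit comes from.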
Consider an evaluation report about nutrition programs in Ghana. A query like "What are the lessons learned about nutrition in Ghana?" might benefit from both vectors: the text embedding matches content about nutrition outcomes in Ghana, while the metadata embedding matches structurally relevant sections such as "Key Findings" or "Recommendations".
Even if the text doesn't explicitly mention "lessons learned," the metadata embedding helps surface relevant content from appropriate sections.
For time-stamped webinar transcripts, the text embedding captures what was said, while the metadata embedding captures where in the session it was said - segment type (presentation vs. Q&A), timestamp, and speaker.
This enables queries like "Find discussions about budget planning in the Q&A session" to leverage both content relevance and structural context.
To prevent signal dilution in metadata embeddings, consider these strategies: embed only the most salient fields, normalize values (consistent key names, lowercasing), and keep the overall representation short. A condensed metadata string might look like:

```
doc_type: study
title: nutritional programs ghana
section: lessons learned
author: smith
```
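One way to sketch this condensation in Python - the field whitelist and length cutoff are assumptions to adapt per collection:

```python
# Sketch: condense a metadata dict into the short, lowercase
# key/value form shown above, keeping only whitelisted fields
# and dropping overly long values.

MAX_VALUE_LEN = 60  # assumed cutoff; tune for your corpus
KEEP_FIELDS = ("doc_type", "title", "section", "author")

def condense_metadata(metadata: dict) -> str:
    """Build the compact metadata string that gets embedded."""
    lines = []
    for key in KEEP_FIELDS:
        value = str(metadata.get(key, "")).strip().lower()
        if value and len(value) <= MAX_VALUE_LEN:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)

condensed = condense_metadata({
    "doc_type": "Study",
    "title": "Nutritional Programs Ghana",
    "section": "Lessons Learned",
    "author": "Smith",
    "abstract": "A long, noisy field we deliberately exclude",
})
# doc_type: study
# title: nutritional programs ghana
# section: lessons learned
# author: smith
```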
Several factors influence system performance: the doubled storage and index footprint of maintaining two vectors per chunk, the cost of computing two similarity scores per candidate, and the dimensionality of the chosen embedding models.
Challenge: Long metadata fields can dilute important text or structural signals.
Solution approaches: restrict the metadata embedding to a few salient fields, truncate or summarize long values, and lower the metadata weight when metadata quality is uneven across the collection.
Challenge: Determining appropriate weights for different embedding types.
Solutions: start from a simple default (e.g. 0.7/0.3 in favour of text, as used above), evaluate against a representative set of queries with known relevant chunks, and adjust; weights can also be varied per query type when the structural intent is strong.
Key performance indicators include standard retrieval metrics on a labelled query set (e.g. recall@k and mean reciprocal rank), query latency, and the storage and index overhead of maintaining two vectors per chunk.
In experimental implementations, multi-vector retrieval should demonstrate better recall of structurally relevant content (such as the "lessons learned" example above) than single-vector baselines, at the cost of modest additional storage and compute.
Several areas for future development include query-adaptive weighting of the two signals, richer structural representations beyond a single metadata vector, and systematic evaluation across more heterogeneous document collections.
Multi-vector embeddings could provide a robust framework for handling heterogeneous document structures in knowledge retrieval systems. By separating content and structural representations, these systems can maintain semantic precision while leveraging structural context effectively. The approach offers particular value for organizations dealing with diverse document types and complex retrieval requirements.
Success in implementing such systems requires careful attention to metadata representation, query processing, and performance optimization. As the field continues to evolve, we expect to see further refinements in embedding techniques and retrieval strategies, leading to even more effective knowledge management solutions.