Knowledge Retrieval  |  December 25, 2024

Context matters: multi-vector embeddings for diverse data sources

Across the AI-powered knowledge management systems we have built, we frequently need to process and retrieve information from diverse document types, ranging from academic papers to reports to webinar transcripts. This heterogeneity presents unique challenges for traditional single-vector embedding approaches, even when combined with hybrid (keyword) retrieval. This article examines how multi-vector embedding strategies can enhance retrieval effectiveness across varied data sources while maintaining semantic relevance and structural context.

Current approaches and discussions

The field of document retrieval and embedding strategies (in the context of RAG) continues to evolve rapidly. Earlier (circa 2022) discussions focused heavily on chunking techniques - the methods used to break down documents into smaller, processable pieces. Approaches range from simple length-based splits (e.g. 1000 characters or 200 words) to more sophisticated methods that consider document structure and semantic meaning. These include text-structured approaches that respect natural language boundaries, document-structured methods that preserve format-specific elements (like HTML or Markdown), and semantic-based splitting that considers content meaning.

Alongside chunking strategies, contextual retrieval has emerged as a promising direction for improving retrieval accuracy. Traditional RAG systems often struggle with context loss during chunking, potentially missing crucial information during retrieval. New approaches address this by maintaining or reconstructing context during the embedding process. For example, some systems now add contextual prefixes to chunks before embedding, helping preserve relationships between different parts of a document.

While these core challenges around retrieval and chunking continue to be actively researched, cutting-edge work is pushing into new territories. A recent preprint (Jin et al., 2024) highlights the need to move beyond purely technical retrieval capabilities toward "preference-aligned RAG" - systems that better align with human preferences and expectations. This line of work explores challenges such as ensuring logical coherence when reasoning across multiple documents, providing precise citations, knowing when to abstain from answering, and handling conflicting information in retrieved documents.

The discussion around what were once called ‘advanced RAG’ techniques (many of which would now be considered ‘basic’ given recent advances) highlights a fundamental tension in knowledge retrieval systems: the balance between granularity and context. As the field matures, these foundational challenges are being tackled alongside newer questions about how to make RAG systems not just technically capable, but also aligned with how humans actually want to use them.

Approaches to structural context in document retrieval

Back to ‘basics’: when dealing with diverse document collections, we have used two main approaches for handling the structural context of document chunks in retrieval systems. The first approach focuses on standardization: using small, efficient models to classify and tag each document chunk with consistent metadata labels across the entire database. This method creates a homogeneous metadata space that enables precise filtering and segmentation, potentially improving retrieval accuracy through well-defined categorical boundaries. For example, a model could identify and tag all sections that represent "lessons learned," regardless of their original headings.

The alternative approach, which we explore in this article, embraces the inherent flexibility of vector embeddings to handle structural diversity. Rather than enforcing a rigid metadata structure, this method uses a second embedding to capture the contextual and structural nuances of each document chunk. This approach is particularly valuable when dealing with varied document types - from academic papers to webinar transcripts - where structural elements might carry different meanings or serve different purposes across documents. For instance, insights that function as "lessons learned" might appear in various sections, from "Results" to "Final Observations" to "Key Takeaways," and a vector-based approach can capture these semantic relationships more naturally.

Introduction to multi-vector embeddings

Multi-vector embeddings represent different aspects of a document chunk through separate vector representations. This approach typically involves maintaining at least two distinct embeddings:

  1. A content embedding that captures the semantic meaning of the actual text
  2. A metadata embedding that encodes structural and contextual information

This separation allows for more nuanced retrieval operations that can leverage both content similarity and structural relevance independently or in combination.

For example, take this chunk of text from page 14 of a report. Traditionally, one would embed just the text, or sometimes prepend the metadata to the text and embed the combined string (but see the Signal dilution section below). Instead, we suggest creating one embedding for the text and a separate embedding for selected metadata fields:

{ "id": "chunk_123", "text": "Our research in Ghana revealed significant improvements in child nutrition when combining local ingredients with education programs.", "metadata": { "document_title": "Ghana Nutrition Program Evaluation 2023", "section": "Results and Discussion", "subsection": "Key Findings", "authors": ["Smith, J.", "Kumar, R."], "page": 14 }, }

Theoretical foundations

The underlying principle of multi-vector embeddings stems from the observation that different aspects of a document chunk (content and structure) occupy distinct semantic spaces. By separating these representations, we can:

  1. Preserve the semantic purity of content embeddings
  2. Maintain structural context without diluting the primary content signals
  3. Enable flexible weighting schemes during retrieval operations

Implementation architecture

Vector storage and representation

The implementation of a multi-vector embedding system requires careful consideration of several key components:

  1. Primary text embedding
    • Represents the semantic content of the document chunk
    • Focuses on the actual information content
    • Typically uses standard embedding models optimized for semantic similarity
  2. Metadata embedding
    • Encodes structural and contextual information
    • Includes document hierarchy, section information, and other relevant metadata
    • May use the same or different embedding models as the primary text
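
As a hypothetical ingestion step (using the documents table sketched earlier, with $1 and $2 standing in for the two embeddings computed by the application), storing a chunk with both components might look like:

-- $1 = embedding of the chunk text
-- $2 = embedding of a concise metadata string, e.g. "title: ghana nutrition program evaluation 2023 section: key findings"
INSERT INTO documents (id, text_content, metadata, text_embedding, metadata_embedding)
VALUES (
    'chunk_123',
    'Our research in Ghana revealed significant improvements in child nutrition when combining local ingredients with education programs.',
    '{"document_title": "Ghana Nutrition Program Evaluation 2023", "section": "Results and Discussion", "subsection": "Key Findings", "authors": ["Smith, J.", "Kumar, R."], "page": 14}'::jsonb,
    $1,
    $2
);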

Query processing

When processing user queries, the system generates two distinct embeddings:

  1. A query text embedding that captures the semantic intent
  2. A query metadata embedding that encodes any structural or contextual hints

Retrieval strategies

Weighted combination approach

The most straightforward implementation combines both vectors using a weighted scoring function:

final_score = w_t * text_similarity + w_m * metadata_similarity

Where:

  • w_t represents the text similarity weight
  • w_m represents the metadata similarity weight
  • text_similarity and metadata_similarity are computed using cosine similarity or dot product

In a vector-enabled database like PostgreSQL with pgvector, this can be implemented directly in SQL:

-- query_text_vector and query_metadata_vector are parameters supplied by the application
WITH combined_scores AS (
    SELECT
        id,
        text_content,
        -- pgvector's <=> operator returns cosine distance, so 1 - distance gives cosine similarity
        0.7 * (1 - (text_embedding <=> query_text_vector)) +
        0.3 * (1 - (metadata_embedding <=> query_metadata_vector)) AS combined_score
    FROM documents
    WHERE text_embedding IS NOT NULL
      AND metadata_embedding IS NOT NULL
    ORDER BY combined_score DESC
    LIMIT 100
)
SELECT *
FROM combined_scores
ORDER BY combined_score DESC;

This approach can be further enhanced using Reciprocal Rank Fusion (RRF), which helps balance the influence of different ranking signals when you also include a keyword (tsvector) search:

-- query_text_vector, query_metadata_vector and query_tsquery are parameters supplied by the application
WITH vector_scores AS (
    SELECT
        id,
        text_content,
        -- Combined vector score using weighted text and metadata embeddings (70%/30% weighting);
        -- 1 - distance converts pgvector's cosine distance into a similarity
        0.7 * (1 - (text_embedding <=> query_text_vector)) +
        0.3 * (1 - (metadata_embedding <=> query_metadata_vector)) AS vector_score
    FROM documents
    WHERE text_embedding IS NOT NULL
      AND metadata_embedding IS NOT NULL
),
vector_ranked AS (
    SELECT
        id,
        vector_score,
        ROW_NUMBER() OVER (ORDER BY vector_score DESC) AS rank_ix
    FROM vector_scores
    ORDER BY vector_score DESC
    LIMIT 400  -- reasonable limit for performance
),
keyword_scores AS (
    SELECT
        id,
        ts_rank_cd(to_tsvector('english', text_content), query_tsquery, 32) AS keyword_score
    FROM documents
    WHERE to_tsvector('english', text_content) @@ query_tsquery
),
keyword_ranked AS (
    SELECT
        id,
        keyword_score,
        ROW_NUMBER() OVER (ORDER BY keyword_score DESC) AS rank_ix
    FROM keyword_scores
    ORDER BY keyword_score DESC
    LIMIT 400
),
combined_scores AS (
    SELECT
        COALESCE(v.id, k.id) AS id,
        (
            -- Using an rrf_k value of 60 (might want to trial 30-60)
            COALESCE(1.0 / (60 + v.rank_ix), 0.0) * 0.7 +  -- vector weight
            COALESCE(1.0 / (60 + k.rank_ix), 0.0) * 0.3    -- keyword weight
        ) AS combined_score,
        v.vector_score,
        k.keyword_score
    FROM (
        SELECT id FROM vector_ranked
        UNION
        SELECT id FROM keyword_ranked
    ) all_ids
    LEFT JOIN vector_ranked v USING (id)
    LEFT JOIN keyword_ranked k USING (id)
)
SELECT d.*, c.combined_score, c.vector_score, c.keyword_score
FROM documents d
INNER JOIN combined_scores c ON d.id = c.id
ORDER BY c.combined_score DESC
LIMIT 20;

Here, we first compute a combined vector score using weighted text and metadata embeddings, then use RRF to combine this with traditional keyword search rankings. The COALESCE handles cases where a document might be found by one method but not the other.

Note: Other sparse-vector or keyword search techniques may be preferable to tsvector, and the SQL above could be optimized further; it is shown here to illustrate the approach.

Tiered retrieval

An alternative approach implements a two-step retrieval process:

  1. Initial filtering or ranking based on one embedding type
  2. Subsequent re-ranking using the other embedding type

This method proves particularly effective when queries have strong structural components or when performance optimization is crucial.
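
A minimal SQL sketch of this two-step process (under the same assumptions as the earlier examples: pgvector, with query_text_vector and query_metadata_vector supplied by the application) first shortlists candidates on the metadata embedding and then re-ranks them on the text embedding:

WITH metadata_candidates AS (
    -- Step 1: coarse shortlist on structural/contextual relevance
    SELECT id, text_content, text_embedding
    FROM documents
    WHERE metadata_embedding IS NOT NULL
    ORDER BY metadata_embedding <=> query_metadata_vector  -- cosine distance, ascending
    LIMIT 200
)
-- Step 2: re-rank the shortlisted chunks on content similarity
SELECT id,
       text_content,
       1 - (text_embedding <=> query_text_vector) AS text_similarity
FROM metadata_candidates
ORDER BY text_embedding <=> query_text_vector
LIMIT 20;

Reversing the two steps (content first, then structural re-ranking) is equally valid and tends to be the better default when structural hints in the query are weak.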

Practical applications

Program report retrieval

Consider an evaluation report about nutrition programs in Ghana. A query like "What are the lessons learned about nutrition in Ghana?" might benefit from both vectors:

  1. Content embedding captures: "nutrition," "Ghana," program details
  2. Metadata embedding captures: "Lessons Learned" section heading

Even if the text doesn't explicitly mention "lessons learned," the metadata embedding helps surface relevant content from appropriate sections.

Webinar transcript analysis

For time-stamped webinar transcripts:

  1. Content embedding: captures the actual discussion content
  2. Metadata embedding: encodes webinar and segment titles (e.g. YouTube “chapters”)

This enables queries like "Find discussions about budget planning in the Q&A session" to leverage both content relevance and structural context.

Implementation considerations

Metadata representation

To prevent signal dilution in metadata embeddings, consider these strategies:

  1. Concise representation
    • Use shortened versions of lengthy titles
    • Extract key phrases from section headings
    • Maintain only essential structural information
  2. Structured formatting
doc_type: study
title: nutritional programs ghana
section: lessons learned
author: smith
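
One way to derive such a string is directly in SQL before embedding (a sketch that assumes the metadata is stored as JSONB with keys like those in the earlier example; doc_type is a hypothetical document-level field):

-- Build a concise metadata string per chunk; concat_ws skips missing (NULL) fields
SELECT id,
       concat_ws(E'\n',
           'doc_type: ' || (metadata->>'doc_type'),
           'title: '    || left(metadata->>'document_title', 80),     -- shorten long titles
           'section: '  || coalesce(metadata->>'subsection', metadata->>'section'),
           'author: '   || (metadata->'authors'->>0)                  -- first author only
       ) AS metadata_text
FROM documents;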

Performance optimization

Several factors influence system performance:

  1. Vector dimension trade-offs
    • Larger dimensions provide more semantic capacity
    • Smaller dimensions reduce computational overhead
    • Balance based on specific use case requirements
  2. Storage considerations
    • Multiple vectors increase storage requirements
    • Consider compression techniques for large-scale deployments
    • Evaluate vector database capabilities for multi-vector support
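
For pgvector specifically, each embedding column typically gets its own approximate-nearest-neighbour index (a sketch; HNSW is available in recent pgvector versions, with IVFFlat as an alternative):

-- One ANN index per embedding column, using cosine distance to match the <=> queries above
CREATE INDEX ON documents USING hnsw (text_embedding vector_cosine_ops);
CREATE INDEX ON documents USING hnsw (metadata_embedding vector_cosine_ops);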

Challenges and solutions

Signal dilution

Challenge: Long metadata fields can dilute important text or structural signals.

Solution approaches:

  1. Don’t prepend the metadata to the text
  2. Implement metadata summarization
  3. Use selective field embedding
  4. Apply field-specific weighting schemes

Query interpretation

Challenge: Determining appropriate weights for different embedding types.

Solutions:

  1. Dynamic weight adjustment based on query analysis
  2. User feedback incorporation
  3. A/B testing of different weighting schemes

Performance evaluation

Metrics

Key performance indicators include:

  1. Retrieval precision
  2. Recall at k
  3. Mean reciprocal rank
  4. Query latency
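
As an illustration of how recall at k and mean reciprocal rank might be computed, here is a sketch that assumes two hypothetical evaluation tables: eval_relevant(query_id, chunk_id) holding ground-truth relevant chunks per query, and eval_results(query_id, chunk_id, rank) holding the system's ranked output:

WITH per_query AS (
    SELECT g.query_id,
           -- share of a query's relevant chunks found in the top 10 results
           (count(*) FILTER (WHERE r.rank <= 10))::float / count(*) AS recall_at_10,
           -- reciprocal rank of the first relevant result (NULL if none retrieved)
           1.0 / min(r.rank) AS reciprocal_rank
    FROM eval_relevant g
    LEFT JOIN eval_results r USING (query_id, chunk_id)
    GROUP BY g.query_id
)
SELECT avg(recall_at_10)                 AS mean_recall_at_10,
       avg(coalesce(reciprocal_rank, 0)) AS mrr
FROM per_query;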

Expected results

In experimental implementations, multi-vector approaches should demonstrate:

  1. Improved precision for structure-dependent queries
  2. Better handling of implicit structural references
  3. More flexible query interpretation capabilities

Best practices

  1. Vector generation
    • Maintain consistent embedding models across similar field types
    • Implement robust error handling for embedding generation
    • Validate embedding quality through sampling
  2. Query processing
    • Implement query analysis to detect structural hints
    • Monitor query performance metrics
  3. System maintenance
    • Regular evaluation of embedding effectiveness
    • Periodic retraining or updating of embedding models
    • Continuous monitoring of retrieval quality

Future directions

Several areas for future development include:

  1. Dynamic weight optimization
    • Machine learning approaches for weight adjustment
    • Context-aware weighting schemes
    • User feedback incorporation
  2. Enhanced metadata representation
    • Hierarchical embedding structures
    • Field-specific embedding models
    • Improved summarization techniques
  3. Performance optimization
    • Improved vector compression methods
    • More efficient similarity computation
    • Better handling of sparse metadata

Conclusion

Multi-vector embeddings could provide a robust framework for handling heterogeneous document structures in knowledge retrieval systems. By separating content and structural representations, these systems can maintain semantic precision while leveraging structural context effectively. The approach offers particular value for organizations dealing with diverse document types and complex retrieval requirements.

Success in implementing such systems requires careful attention to metadata representation, query processing, and performance optimization. As the field continues to evolve, we expect to see further refinements in embedding techniques and retrieval strategies, leading to even more effective knowledge management solutions.