In most of the AI-powered knowledge management systems we have built, we need to process and retrieve information from diverse document types, ranging from academic papers to reports to webinar transcripts. This heterogeneity presents challenges for traditional single-vector embedding approaches, even with hybrid (keyword) retrieval. This article examines how multi-vector embedding strategies can improve retrieval effectiveness across varied data sources while maintaining semantic relevance and structural context.
The field of document retrieval and embedding strategies (in the context of RAG) continues to evolve rapidly. Earlier (circa 2022) discussions focused heavily on chunking techniques - the methods used to break documents into smaller, processable pieces. Approaches range from simple length-based splits (e.g. 1,000 characters or 200 words) to more sophisticated methods that consider document structure and semantic meaning. These include text-structured approaches that respect natural language boundaries, document-structured methods that preserve format-specific elements (like HTML or Markdown), and semantic-based splitting that considers content meaning.
Alongside chunking strategies, contextual retrieval has emerged as a promising direction for improving retrieval accuracy. Traditional RAG systems often struggle with context loss during chunking, potentially missing crucial information during retrieval. New approaches address this by maintaining or reconstructing context during the embedding process. For example, some systems now add contextual prefixes to chunks before embedding, helping preserve relationships between different parts of a document.
While these core challenges around retrieval and chunking continue to be actively researched, cutting-edge work is pushing into new territories. A recent pre-print (Jin et al., 2024) highlights the need to move beyond purely technical retrieval capabilities toward "preference-aligned RAG" - systems that better align with human preferences and expectations. This line of research explores challenges such as ensuring logical coherence when reasoning across multiple documents, providing precise citations, knowing when to abstain from answering, and handling conflicting information in retrieved documents.
The discussion around what was called ‘advanced RAG’ techniques (but likely now ‘basic’ considering advancements) highlights a fundamental tension in knowledge retrieval systems: the balance between granularity and context. As the field continues to mature, these foundational challenges are being tackled alongside newer questions about how to make RAG systems not just technically capable, but also aligned with how humans actually want to use them.
Back to ‘basics’, when dealing with diverse document collections, we have used two main approaches for handling structural context of document chunks in retrieval systems. The first approach focuses on standardization: using small, efficient models to classify and tag each document chunk with consistent metadata labels across the entire database. This method creates a homogeneous metadata space that enables precise filtering and segmentation, potentially improving retrieval accuracy through well-defined categorical boundaries. For example, a model could identify and tag all sections that represent "lessons learned," regardless of their original headings.
The alternative approach, which we explore in this article, embraces the inherent flexibility of vector embeddings to handle structural diversity. Rather than enforcing a rigid metadata structure, this method uses a second embedding to capture the contextual and structural nuances of each document chunk. This approach is particularly valuable when dealing with varied document types - from academic papers to webinar transcripts - where structural elements might carry different meanings or serve different purposes across documents. For instance, insights that function as "lessons learned" might appear in various sections, from "Results" to "Final Observations" to "Key Takeaways," and a vector-based approach can capture these semantic relationships more naturally.
Multi-vector embeddings represent different aspects of a document chunk through separate vector representations. This approach typically involves maintaining at least two distinct embeddings:

- a content embedding that captures the semantic meaning of the chunk text itself, and
- a metadata (structural) embedding that captures contextual information such as document title, section, and position within the document.
This separation allows for more nuanced retrieval operations that can leverage both content similarity and structural relevance independently or in combination.
For example, take this chunk of text from page 14 of a report:

```json
{
  "id": "chunk_123",
  "text": "Our research in Ghana revealed significant improvements in child nutrition when combining local ingredients with education programs.",
  "metadata": {
    "document_title": "Ghana Nutrition Program Evaluation 2023",
    "section": "Results and Discussion",
    "subsection": "Key Findings",
    "authors": ["Smith, J.", "Kumar, R."],
    "page": 14
  }
}
```

Traditionally, one would embed just the `text` field, or sometimes prepend the metadata to the text and embed the combined string (but see the Noise section below). Instead, we suggest creating one embedding for the text and a separate embedding for part of the metadata.
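Given a chunk record shaped like the example above, the two embedding inputs might be prepared like this - a minimal sketch in Python; the embedding model call itself is omitted, and the choice of metadata fields is an assumption to be tuned per corpus:

```python
# Sketch: prepare the two strings that feed the two separate embeddings.
# The embedding model (OpenAI, sentence-transformers, etc.) is out of scope here.

def embedding_inputs(chunk: dict) -> tuple[str, str]:
    """Return (text_input, metadata_input) for two separate embedding calls."""
    text_input = chunk["text"]
    md = chunk["metadata"]
    # Condense only the structurally meaningful fields into one short string.
    metadata_input = " | ".join([
        md["document_title"],
        md["section"],
        md.get("subsection", ""),
    ]).strip(" |")
    return text_input, metadata_input

chunk = {
    "id": "chunk_123",
    "text": "Our research in Ghana revealed significant improvements "
            "in child nutrition when combining local ingredients with "
            "education programs.",
    "metadata": {
        "document_title": "Ghana Nutrition Program Evaluation 2023",
        "section": "Results and Discussion",
        "subsection": "Key Findings",
    },
}

text_in, meta_in = embedding_inputs(chunk)
# meta_in -> "Ghana Nutrition Program Evaluation 2023 | Results and Discussion | Key Findings"
```

Each string is then passed independently to the embedding model, producing the two vectors stored per chunk.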
The underlying principle of multi-vector embeddings stems from the observation that different aspects of a document chunk (content and structure) occupy distinct semantic spaces. By separating these representations, we can score content similarity and structural relevance independently, combine them with tunable weights at query time, and match structural intent even when the chunk text itself does not use the query's wording.
The implementation of a multi-vector embedding system requires careful consideration of several key components: how each chunk's text and metadata are turned into separate embeddings, how both vectors are stored and indexed, how user queries are mapped to matching query-side vectors, and how the resulting similarity scores are combined at retrieval time.
When processing user queries, the system generates two distinct embeddings: one for the semantic content of the query itself, and one for the structural or contextual intent the query implies (for example, the kind of section the answer is likely to live in).
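A sketch of this query-side split, using a toy keyword heuristic in place of the LLM or classifier a production system would more likely use - the cue table here is purely illustrative:

```python
# Sketch: derive the two query-side embedding inputs.
# A real system might use an LLM or a small classifier; this toy
# heuristic just looks for structural cues in the query string.

STRUCTURAL_CUES = {
    "lessons learned": "section: lessons learned",
    "q&a": "section: question and answer",
    "recommendations": "section: recommendations",
}

def query_inputs(query: str) -> tuple[str, str]:
    """Return (content_input, structural_input) for the two query embeddings."""
    q = query.lower()
    structural = [label for cue, label in STRUCTURAL_CUES.items() if cue in q]
    # Fall back to the raw query when no structural cue is detected.
    return query, "; ".join(structural) or query

content_in, structural_in = query_inputs(
    "What are the lessons learned about nutrition in Ghana?"
)
# structural_in -> "section: lessons learned"
```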
The most straightforward implementation combines both vectors using a weighted scoring function:
final_score = w_t * text_similarity + w_m * metadata_similarity
Where `w_t` and `w_m` are the weights for the text and metadata similarities (e.g. 0.7 and 0.3), typically chosen so they sum to 1.
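A minimal sketch of this weighted combination in plain Python, using toy vectors and the same 0.7/0.3 weighting used elsewhere in this article:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def combined_score(text_sim: float, meta_sim: float,
                   w_t: float = 0.7, w_m: float = 0.3) -> float:
    """final_score = w_t * text_similarity + w_m * metadata_similarity"""
    return w_t * text_sim + w_m * meta_sim

# Toy 3-d vectors standing in for real embeddings.
text_sim = cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # identical -> 1.0
meta_sim = cosine([0.0, 1.0, 0.0], [0.0, 0.0, 1.0])  # orthogonal -> 0.0
score = combined_score(text_sim, meta_sim)           # 0.7 * 1.0 + 0.3 * 0.0
```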
In a vector-enabled database like PostgreSQL with pgvector, this can be implemented directly in SQL:
```sql
WITH combined_scores AS (
    SELECT
        id,
        text_content,
        -- pgvector's <=> returns cosine distance (lower is better),
        -- so convert to similarity before weighting
        0.7 * (1 - (text_embedding <=> query_text_vector)) +
        0.3 * (1 - (metadata_embedding <=> query_metadata_vector)) AS combined_score
    FROM documents
    WHERE text_embedding IS NOT NULL
      AND metadata_embedding IS NOT NULL
    ORDER BY combined_score DESC
    LIMIT 100
)
SELECT * FROM combined_scores
ORDER BY combined_score DESC;
```
This approach can be further enhanced using Reciprocal Rank Fusion (RRF), which helps balance the influence of different ranking signals when you add a keyword (`tsvector`) search:
```sql
WITH vector_scores AS (
    SELECT
        id,
        text_content,
        -- Combined vector score using weighted text and metadata embeddings
        -- (70%/30% weighting); <=> is cosine distance, so convert to similarity
        0.7 * (1 - (text_embedding <=> query_text_vector)) +
        0.3 * (1 - (metadata_embedding <=> query_metadata_vector)) AS vector_score
    FROM documents
    WHERE text_embedding IS NOT NULL
      AND metadata_embedding IS NOT NULL
),
vector_ranked AS (
    SELECT
        id,
        vector_score,
        ROW_NUMBER() OVER (ORDER BY vector_score DESC) AS rank_ix
    FROM vector_scores
    LIMIT 400 -- reasonable limit for performance
),
keyword_scores AS (
    SELECT
        id,
        ts_rank_cd(to_tsvector('english', text_content), query_tsquery, 32) AS keyword_score
    FROM documents
    WHERE to_tsvector('english', text_content) @@ query_tsquery
),
keyword_ranked AS (
    SELECT
        id,
        keyword_score,
        ROW_NUMBER() OVER (ORDER BY keyword_score DESC) AS rank_ix
    FROM keyword_scores
    LIMIT 400
),
combined_scores AS (
    SELECT
        COALESCE(v.id, k.id) AS id,
        (
            -- Using an rrf_k value of 60 (values of 30-60 are worth trialling)
            COALESCE(1.0 / (60 + v.rank_ix), 0.0) * 0.7 + -- vector weight
            COALESCE(1.0 / (60 + k.rank_ix), 0.0) * 0.3   -- keyword weight
        ) AS combined_score,
        v.vector_score,
        k.keyword_score
    FROM (
        SELECT DISTINCT id FROM (
            SELECT id FROM vector_ranked
            UNION
            SELECT id FROM keyword_ranked
        ) u
    ) all_ids
    LEFT JOIN vector_ranked v USING (id)
    LEFT JOIN keyword_ranked k USING (id)
)
SELECT
    d.*,
    c.combined_score,
    c.vector_score,
    c.keyword_score
FROM documents d
INNER JOIN combined_scores c ON d.id = c.id
ORDER BY c.combined_score DESC
LIMIT 20;
```
Here, we first compute a combined vector score using weighted text and metadata embeddings, then use RRF to combine this with traditional keyword search rankings. The COALESCE handles cases where a document might be found by one method but not the other.
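Outside the database, the same weighted RRF logic can be sketched in a few lines of Python - the rank lists and weights here are illustrative:

```python
# Sketch: weighted Reciprocal Rank Fusion over named, best-first rank lists.
# A document missing from one list simply contributes 0 from that signal,
# mirroring the COALESCE handling in the SQL version.

def rrf(rankings: dict[str, list[str]],
        weights: dict[str, float],
        k: int = 60) -> list[tuple[str, float]]:
    """Fuse several rank lists into one, best-first, (doc_id, score) list."""
    scores: dict[str, float] = {}
    for name, ranked_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fused = rrf(
    rankings={"vector": ["a", "b", "c"], "keyword": ["b", "d"]},
    weights={"vector": 0.7, "keyword": 0.3},
)
# "b" rises to the top because it appears in both lists
```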
Note: other sparse-vector or keyword search techniques may be preferable to `tsvector`, and the SQL above could be further optimized; it is shown here to illustrate the approach.
An alternative approach implements a two-step retrieval process: first use the metadata embedding to narrow the candidate set to structurally relevant chunks, then rank those candidates by text-embedding similarity.
This method proves particularly effective when queries have strong structural components or when performance optimization is crucial.
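A sketch of this two-step process over an in-memory list of chunks; a real system would use the database's vector index for the first step, and the candidate counts here are assumptions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def two_step_retrieve(chunks: list[dict],
                      query_text_vec: list[float],
                      query_meta_vec: list[float],
                      meta_candidates: int = 100,
                      top_k: int = 10) -> list[dict]:
    """Step 1: shortlist by metadata similarity; step 2: rank by text similarity."""
    shortlist = sorted(
        chunks,
        key=lambda c: cosine(c["metadata_embedding"], query_meta_vec),
        reverse=True,
    )[:meta_candidates]
    return sorted(
        shortlist,
        key=lambda c: cosine(c["text_embedding"], query_text_vec),
        reverse=True,
    )[:top_k]

# Toy 2-d vectors standing in for real embeddings.
chunks = [
    {"id": "a", "text_embedding": [1.0, 0.0], "metadata_embedding": [0.0, 1.0]},
    {"id": "b", "text_embedding": [0.0, 1.0], "metadata_embedding": [1.0, 0.0]},
]
results = two_step_retrieve(chunks, query_text_vec=[1.0, 0.0],
                            query_meta_vec=[0.0, 1.0])
```

Running the structural filter first keeps the expensive text-similarity ranking confined to a small candidate set, which is where the performance benefit comes from.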
Consider an evaluation report about nutrition programs in Ghana. A query like "What are the lessons learned about nutrition in Ghana?" might benefit from both vectors: the text embedding matches content about nutrition outcomes in Ghana, while the metadata embedding matches structurally relevant sections such as "Key Findings" or "Recommendations".
Even if the text doesn't explicitly mention "lessons learned," the metadata embedding helps surface relevant content from appropriate sections.
For time-stamped webinar transcripts, the text embedding captures what was said, while the metadata embedding captures where in the session it was said - segment type (presentation vs. Q&A), timestamp, and speaker.
This enables queries like "Find discussions about budget planning in the Q&A session" to leverage both content relevance and structural context.
To prevent signal dilution in metadata embeddings, consider these strategies: embed only the most salient fields, normalize values (consistent key names, lowercasing), and keep the overall representation short. A condensed metadata string might look like:

```
doc_type: study
title: nutritional programs ghana
section: lessons learned
author: smith
```
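One way to sketch this condensation in Python - the field whitelist and length cutoff are assumptions to adapt per collection:

```python
# Sketch: condense a metadata dict into the short, lowercase
# key/value form shown above, keeping only whitelisted fields
# and dropping overly long values.

MAX_VALUE_LEN = 60  # assumed cutoff; tune for your corpus
KEEP_FIELDS = ("doc_type", "title", "section", "author")

def condense_metadata(metadata: dict) -> str:
    """Build the compact metadata string that gets embedded."""
    lines = []
    for key in KEEP_FIELDS:
        value = str(metadata.get(key, "")).strip().lower()
        if value and len(value) <= MAX_VALUE_LEN:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)

condensed = condense_metadata({
    "doc_type": "Study",
    "title": "Nutritional Programs Ghana",
    "section": "Lessons Learned",
    "author": "Smith",
    "abstract": "A long, noisy field we deliberately exclude",
})
# doc_type: study
# title: nutritional programs ghana
# section: lessons learned
# author: smith
```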
Several factors influence system performance: the doubled storage and index footprint of maintaining two vectors per chunk, the cost of computing two similarity scores per candidate, and the dimensionality of the chosen embedding models.
Challenge: Long metadata fields can dilute important text or structural signals.
Solution approaches: restrict the metadata embedding to a few salient fields, truncate or summarize long values, and lower the metadata weight when metadata quality is uneven across the collection.
Challenge: Determining appropriate weights for different embedding types.
Solutions: start from a simple default (e.g. 0.7/0.3 in favour of text, as used above), evaluate against a representative set of queries with known relevant chunks, and adjust; weights can also be varied per query type when the structural intent is strong.
Key performance indicators include standard retrieval metrics on a labelled query set (e.g. recall@k and mean reciprocal rank), query latency, and the storage and index overhead of maintaining two vectors per chunk.
In experimental implementations, multi-vector retrieval should demonstrate better recall of structurally relevant content (such as the "lessons learned" example above) than single-vector baselines, at the cost of modest additional storage and compute.
Several areas for future development include query-adaptive weighting of the two signals, richer structural representations beyond a single metadata vector, and systematic evaluation across more heterogeneous document collections.
Multi-vector embeddings could provide a robust framework for handling heterogeneous document structures in knowledge retrieval systems. By separating content and structural representations, these systems can maintain semantic precision while leveraging structural context effectively. The approach offers particular value for organizations dealing with diverse document types and complex retrieval requirements.
Success in implementing such systems requires careful attention to metadata representation, query processing, and performance optimization. As the field continues to evolve, we expect to see further refinements in embedding techniques and retrieval strategies, leading to even more effective knowledge management solutions.