Evaluations  |  October 19, 2024

Evaluating knowledge retrieval systems: the essentials

When building applications powered by large language models (LLMs), particularly in knowledge retrieval systems using Retrieval-Augmented Generation (RAG), one of the biggest challenges for developers—especially those new to the field—is evaluation.

It’s easy to get excited when the model generates a well-written response, and this excitement often leads to pushing the product into production too soon. However, without proper evaluation, you risk discovering that the system’s performance falls apart when scrutinized by experts.

The thought of evaluation can be intimidating, filled with complex metrics like F1 scores and recall rates. But evaluations don’t have to be overwhelming. This article provides a simplified, practical approach to evaluating a typical RAG system—an approach that may not always follow the textbook definition and could make some machine learning engineers cringe, but it’s a crucial first step to keep your application on track. Skip this step, and a promising application can end up being discarded simply because it was never properly evaluated.

In this article, we will break down the two key parts of a knowledge retrieval workflow to evaluate: the retrieval process and the summarization or generation process.

The basic RAG workflow

A basic RAG workflow is straightforward: a user submits a query, the system retrieves relevant chunks of information from a knowledge base using a search mechanism, and the LLM generates a response based on these chunks. While this sounds simple, each part of this process presents its own opportunities for error. A system may produce a polished response that appears correct at first glance, but deeper analysis could reveal that the content is inaccurate or irrelevant.
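To make those moving parts concrete, here is a minimal sketch of the workflow in Python. The tiny in-memory knowledge base, the keyword-overlap retriever, and the call_llm placeholder are all illustrative assumptions; in a real system the retriever would typically be a vector or hybrid search, and call_llm would wrap whichever model API you use.

```python
# A minimal RAG sketch: retrieve chunks, then ask an LLM to answer from them.
# The knowledge base, the scoring, and call_llm() are illustrative placeholders.

KNOWLEDGE_BASE = [
    {"id": "kb-001", "text": "Scrub the ceramic pot filter gently with a soft brush and rinse with treated water."},
    {"id": "kb-002", "text": "Candle filters should be boiled monthly to remove biofilm."},
    {"id": "kb-003", "text": "Store household filters away from direct sunlight to prevent cracking."},
]

def retrieve(query: str, top_k: int = 2) -> list[dict]:
    """Toy keyword-overlap retriever; stands in for vector or hybrid search."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(chunk["text"].lower().split())), chunk)
        for chunk in KNOWLEDGE_BASE
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score > 0]

def call_llm(prompt: str) -> str:
    """Placeholder for your model API call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError("Wire this up to your LLM provider.")

def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```

Every piece of this pipeline, from how retrieve() scores chunks to how the prompt is assembled, is a place where quality can quietly degrade, which is exactly why we split the evaluation in two.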

For evaluation purposes, we’ll break the workflow into two main parts:

  1. Retrieval evaluation: Is the system retrieving the most relevant chunks of data?
  2. Generation evaluation: Is the LLM correctly generating responses, summaries or insights based on the retrieved data?

Part 1: Evaluating the retrieval process

The retrieval process is where the system pulls chunks of text from your knowledge base based on the user’s query. This step is critical because if irrelevant or misleading chunks are retrieved, even the best language model will generate poor responses. The retrieval process can fail for several reasons, such as inadequate search methods, poor chunking strategies, or ineffective re-ranking of results.

Retrieval challenges

Imagine you have a knowledge base related to water sanitation, and the user asks, “What is the best approach to maintaining a ceramic pot filter at the household level?” Ideally, the system should retrieve relevant instructions from a manual about ceramic pot filters. However, if the search mechanism is flawed, the system could return chunks about unrelated topics like water filtration in general or maintenance of other types of filters.

Here’s where things can go wrong:

  • Search methods: A basic keyword search might not capture the nuanced meaning of the user’s query, leading to irrelevant results.
  • Chunking: If your chunks are too small, key information could be split across multiple chunks. If they are too large, irrelevant details may be included, reducing the relevance of the retrieved chunks.
  • Re-ranking: If the system retrieves too many chunks (e.g., the top 50 results), some critical chunks could be buried too low in the list and discarded when the system narrows the results. Various re-ranking techniques exist to push the most relevant chunks back to the top.

Query rewriting and parallel query variations

Another key component in retrieval is query rewriting. This involves taking the user’s original query and generating multiple variations to help broaden the scope of the search. For example, the query “How to maintain a ceramic pot filter?” could be expanded into several variations:

  1. “Best practices for ceramic pot filter maintenance”
  2. “Household water filter upkeep”
  3. “Routine care for ceramic water filters”

While query rewriting can improve retrieval by casting a wider net, it can also introduce problems. If the rewritten queries stray too far from the user’s intent or introduce incorrect assumptions (e.g., pot and candle filters work differently!), the system might pull irrelevant or even misleading information. Evaluating this component is essential because a poorly executed query rewrite can undermine the entire retrieval process.
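Here is a minimal sketch of what such a component might look like, assuming an llm callable that wraps your model API (like the call_llm placeholder above). The prompt wording and the habit of always keeping the original query are illustrative choices, not a prescribed recipe.

```python
from typing import Callable

def rewrite_query(original_query: str, llm: Callable[[str], str], n_variations: int = 3) -> list[str]:
    """Ask an LLM for paraphrases of the query; always keep the original too."""
    prompt = (
        f"Rewrite the following question in {n_variations} different ways, one per line, "
        "without changing its meaning:\n" + original_query
    )
    raw = llm(prompt)
    variations = [line.lstrip("-•0123456789. ").strip() for line in raw.splitlines() if line.strip()]
    # Keep the original query so a bad rewrite can never lose the user's intent.
    return [original_query] + variations[:n_variations]
```

Each variation can then be run through retrieval and the results merged and de-duplicated before re-ranking.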

Strategies for evaluating retrieval

To evaluate the retrieval process, you don’t need to dive into the most complex metrics right away. Start simple:

  1. Compare search methods: Try different retrieval strategies, such as keyword-based vs. dense vector searches and hybrid (both). Which method pulls more relevant chunks for your use case? Consider also full-on filters, which involve categorizing or tagging your data and using a filtered search like technology=“household water filters” to slice through your data.

Example: Compare the results of a keyword search that looks for “maintenance” and “ceramic filter” against a dense vector search that focuses on the semantic meaning of the query.
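One way to make that comparison concrete is to run both retrievers over a handful of test queries and inspect what each one returns. In the sketch below, keyword_search and vector_search are assumed placeholders for whatever backends you actually use; the point is the comparison harness, not the implementations.

```python
# Run two retrieval strategies side by side over a few test queries and compare
# what each surfaces. keyword_search() and vector_search() are placeholders for
# your real backends (e.g. a BM25 index and an embedding index); each is assumed
# to return chunks as dicts with an "id" key.

TEST_QUERIES = [
    "How to maintain a ceramic pot filter at the household level?",
    "How often should a ceramic water filter be cleaned?",
]

def compare_retrievers(keyword_search, vector_search, top_k: int = 5) -> None:
    for query in TEST_QUERIES:
        kw_ids = {chunk["id"] for chunk in keyword_search(query, top_k)}
        vec_ids = {chunk["id"] for chunk in vector_search(query, top_k)}
        print(f"Query: {query}")
        print(f"  keyword only: {sorted(kw_ids - vec_ids)}")
        print(f"  vector only : {sorted(vec_ids - kw_ids)}")
        print(f"  both        : {sorted(kw_ids & vec_ids)}")
```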

  2. Test chunking techniques: Experiment with different chunk sizes—paragraph-level vs. full-page chunks. What combination gives the most relevant results?

Example: If you chunk data at the paragraph level, critical details might be spread out over multiple chunks, making it harder for the model to pull together a coherent answer. Testing full-page chunks could bring more context, but it also risks pulling in irrelevant information. There are dozens of chunking techniques, perplexity it! (that’s the new “Google it” expression)
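As a starting point, you can generate both chunkings from the same source text and see how each behaves under retrieval. The splitting rules below are deliberately naive sketches; real chunkers usually respect sentence boundaries, overlaps, and document structure.

```python
def chunk_by_paragraph(text: str) -> list[str]:
    """Small, focused chunks: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_by_window(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Larger, more contextual chunks: fixed-size character windows with overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Index each variant separately, run the same test queries against both indexes,
# and compare which variant surfaces the chunks your experts consider relevant.
```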

  3. Vary the number of retrieved results: Instead of pulling only the top 5 or 10 chunks, expand your retrieval window to include the top 20 or even top 50 results. Does a larger set of results improve the model’s final output?

Example: If you only use the top 5 results, you might miss relevant chunks that rank lower. Expanding to the top 20 results could give the model more data to work with, potentially improving the response. A common practice is to retrieve, say, 100 results and re-rank them with a cheap, fast model so the LLM doesn’t have to sift through everything itself.
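A rough sketch of that retrieve-wide-then-re-rank pattern, assuming a hypothetical score_relevance(query, chunk) function (a cross-encoder or a cheap LLM call, for instance) that returns higher numbers for more relevant chunks:

```python
def retrieve_and_rerank(query, search_fn, score_relevance, wide_k=100, final_k=10):
    """Cast a wide net with cheap retrieval, then re-rank so only the best chunks reach the LLM."""
    candidates = search_fn(query, wide_k)        # first pass: recall-oriented
    ranked = sorted(candidates,
                    key=lambda chunk: score_relevance(query, chunk),
                    reverse=True)                # second pass: precision-oriented
    return ranked[:final_k]                      # only these go into the prompt
```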

  4. Evaluate query rewriting: Check how well the query rewriting component works. Are the query variations bringing back more relevant information or causing confusion?

Example: Evaluate each query variation to see if it returns useful information or introduces irrelevant results. If query rewriting consistently leads to irrelevant chunks, it may need adjustment.
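One lightweight way to do this, assuming you have a few queries with hand-labeled relevant chunk IDs, is to run each variation through retrieval and check whether it surfaces those gold chunks or just adds noise:

```python
def audit_rewrites(original_query, variations, search_fn, gold_ids, top_k=10):
    """Report, for each query variation, how many hand-labeled relevant chunks it surfaces."""
    for query in [original_query, *variations]:
        retrieved_ids = {chunk["id"] for chunk in search_fn(query, top_k)}
        hits = len(retrieved_ids & gold_ids)
        noise = len(retrieved_ids - gold_ids)
        print(f"{query!r}: {hits}/{len(gold_ids)} gold chunks retrieved, {noise} other chunks")
```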

Evaluation metrics for retrieval

To measure retrieval effectiveness, you can use the following metrics:

  • Precision: The proportion of retrieved chunks that are relevant.
  • Recall: The proportion of relevant chunks that were retrieved.
  • F1 Score: A balance between precision and recall.

These metrics are easy to calculate and will give you a good sense of how well your system is retrieving relevant information. High recall but low precision means the system finds most of the relevant chunks but also pulls in plenty of irrelevant ones. High precision but low recall means that what it retrieves is mostly relevant, but it misses many relevant chunks and can leave out important information.
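If you have a small set of hand-labeled relevant chunk IDs per query, these three numbers take only a few lines of Python. The label format below (a set of relevant IDs per query) is an assumption; use whatever shape your labeling effort produces.

```python
def retrieval_scores(retrieved_ids: set[str], relevant_ids: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for one query, given hand-labeled relevant chunk IDs."""
    true_positives = len(retrieved_ids & relevant_ids)
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the system retrieved kb-001, kb-004, kb-007, but only kb-001 and kb-002 are relevant.
print(retrieval_scores({"kb-001", "kb-004", "kb-007"}, {"kb-001", "kb-002"}))
# -> precision 0.33, recall 0.5, F1 0.4
```

Averaging these scores over a few dozen representative queries is usually enough to tell whether a change to search, chunking, or re-ranking actually helped.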

Part 2: Evaluating the generation process

After retrieval, the LLM must generate a coherent and accurate response based on the chunks it has been given. This is where summarization comes into play. Even with perfect retrieval, the model can still fail to produce a useful response if it misinterprets the information or includes irrelevant data.

Generation challenges

In our example, after retrieving chunks of text on ceramic pot filters, the model must summarize or use the information in a way that directly answers the user’s query. However, several things can go wrong:

  • Including irrelevant details: The model might include unnecessary information about topics unrelated to filter maintenance, such as filter design.
  • Missing key information: Important steps in the maintenance process may be omitted if the model fails to prioritize relevant chunks.
  • Incoherent responses: The generated summary may be logically disjointed if the model doesn’t correctly stitch together information from different chunks.

Strategies for evaluating generation

To evaluate the summarization process, focus on four key criteria:

  1. Information completeness: Does the summary include all the key points from the retrieved chunks?

Example: The response should cover all necessary maintenance steps for the ceramic filter. If it leaves out key information, the summary is incomplete.

  2. Factual accuracy: Is the information presented in the summary correct?

Example: If the summary suggests using the wrong cleaning method, even though it’s well-written, it’s factually incorrect and could mislead users.

  3. Coherence: Is the summary easy to follow and logically structured?

Example: A good summary should flow logically from one point to the next, without jumping between unrelated topics.

  4. Query relevance: Does the summary answer the original question posed by the user?

Example: If the query is about ceramic filter maintenance, the summary should stick to that topic rather than discussing general sanitation practices.
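A common way to apply these four criteria at scale is to have a second LLM grade each response against a rubric and then spot-check those grades with human reviewers. The rubric prompt and the llm callable below are assumptions, and the scores are only as trustworthy as your spot-checks suggest.

```python
import json
from typing import Callable

CRITERIA = ["completeness", "factual_accuracy", "coherence", "query_relevance"]

def judge_response(query: str, chunks: list[str], response: str,
                   llm: Callable[[str], str]) -> dict:
    """Ask a grader LLM to score the response 1-5 on each criterion."""
    prompt = (
        "You are grading an answer produced by a retrieval-augmented system.\n\n"
        f"Question: {query}\n\nRetrieved context:\n" + "\n---\n".join(chunks) +
        f"\n\nAnswer:\n{response}\n\n"
        f"Score the answer from 1 to 5 on each of: {', '.join(CRITERIA)}. "
        "Reply with only a JSON object mapping each criterion to an integer."
    )
    # Note: real grader output may need validation; LLMs do not always return clean JSON.
    return json.loads(llm(prompt))
```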

Evaluation metrics for generation

To evaluate the quality of response, use similar metrics to those for retrieval:

  • Precision: The proportion of the summary that is relevant to the query.
  • Recall: How much of the important information from the retrieved chunks was included.
  • F1 Score: A balanced measure of both precision and recall.

In addition to these metrics, you should introduce human evaluation, where subject matter experts review the summaries for relevance and correctness. Automated evaluation tools that compare generated summaries against a gold standard can also be helpful, if a gold standard or ground truth exists!

Taking a step back: rethinking your retrieval strategy

Before getting too deep into your evaluation, it’s helpful to take a step back and reconsider whether your current retrieval strategy is truly the most effective approach for handling the complexity and specificity of your knowledge base.

As your application grows more complex, or the domain of knowledge you’re retrieving from becomes more specialized, you might find that your current pipeline, though effective, could benefit from an additional layer of “intelligence”. This is where the concept of an intention qualifier or query router comes into play. Rather than treating every query the same, a query router can assess the intent behind the query and direct it toward a specific retrieval process or knowledge subdomain that’s most relevant.

This approach, which involves setting up multiple tailored retrieval pipelines, can dramatically improve precision and quality without necessarily increasing the complexity of the system. By focusing on more specific parts of your knowledge base or employing customized search and re-ranking techniques, you can fine-tune your system for better results. This shift in logic, from a one-size-fits-all RAG approach to a more focused, multi-pipeline strategy, can be the key to unlocking higher quality responses. This starts to get into agents, or “agentic” workflows…something for another post.
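As a rough illustration of the idea, a router can be as simple as a single classification call that picks which retrieval pipeline to run. The intent labels, the llm placeholder, and the pipeline registry below are all hypothetical:

```python
from typing import Callable

# Hypothetical registry: each intent label maps to its own tuned retrieval pipeline.
PIPELINES: dict[str, Callable[[str], list]] = {
    "filter_maintenance": lambda q: [],  # placeholder: filtered search over maintenance manuals
    "water_quality": lambda q: [],       # placeholder: dense search over testing guidelines
    "general": lambda q: [],             # placeholder: the default, catch-all RAG pipeline
}

def route_query(query: str, llm: Callable[[str], str]) -> list:
    """Classify the query's intent, then hand it to the matching retrieval pipeline."""
    prompt = (
        f"Classify this question into one of {list(PIPELINES)}: {query}\n"
        "Reply with the label only."
    )
    intent = llm(prompt).strip()
    pipeline = PIPELINES.get(intent, PIPELINES["general"])
    return pipeline(query)
```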

Conclusion

Evaluating a knowledge retrieval system powered by LLMs, particularly one that uses RAG workflows, doesn’t have to be a daunting task. While it’s easy to be satisfied with a well-written output, proper evaluation ensures that your system is accurate, relevant, and capable of handling real-world queries.

Start with simple evaluations of the retrieval process by testing different search methods, chunking strategies, query rewriting, and re-ranking techniques. Next, move on to evaluating the generation process by assessing how well the model produces complete, accurate, and relevant responses. Remember, this isn’t a perfect or by-the-book approach (the books on LLM application evaluation are still all in draft!), but taking these first steps is crucial to making sure your application reaches its full potential and avoids being prematurely discarded due to poor evaluations.