December 5, 2024 | Gert Jan Spriensma

Pushing a RAG prototype to production

Over the past few months, we helped one of our customers bring a Retrieval-Augmented Generation (RAG) prototype into production. In this post, I'd like to share the key learnings and milestones from this two-month adventure.

Crafting the Benchmark Answer Set

One of the foundational elements of our project was establishing a robust Benchmark Answer Set. We initially believed that access to a large Q&A database would make this easy, but we quickly realized that choosing questions that captured the full range of nuances was anything but straightforward. Accurately identifying the right sources also proved complex, particularly when different sources discuss similar topics. To overcome this, we took an iterative approach: we improved the set progressively, guided by feedback from domain experts. That way, we didn't have to wait until the Benchmark Answer Set was complete; we could already start optimizing the solution with the limited set we had available.
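To make this concrete, here is a minimal sketch of how a benchmark entry can be represented; the field names and the example question are purely illustrative, not our customer's data.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One entry of the benchmark answer set (field names are illustrative)."""
    question: str                  # the user question, phrased as a domain expert would
    reference_answer: str          # the answer agreed upon with the domain experts
    relevant_source_ids: list[str] = field(default_factory=list)  # chunks/documents that support the answer

# A tiny, hypothetical example entry
benchmark = [
    BenchmarkItem(
        question="Which documents do I need to submit for a mortgage application?",
        reference_answer="A valid ID, proof of income, and a recent bank statement.",
        relevant_source_ids=["kb-1042", "kb-0877"],
    )
]
```

Keeping the relevant source IDs alongside the reference answer is what later lets you measure retrieval quality separately from answer quality.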

Building the Evaluation Framework

Setting up an evaluation framework itself is relatively straightforward. Fine-tuning the metrics so that they accurately reflect your goals, however, required quite a bit of effort and attention. Metrics like recall may appear simple at first glance, but the devil is in the details, and it's easy to slip up. We had to keep reassessing our methods to make sure they stayed aligned with what we actually wanted to measure. Although our focus was on directional trends rather than absolute figures, having precise measurements made the insights much more impactful. The primary metrics we used were Recall, Precision/Mean Reciprocal Rank (MRR), and an LLM-based correctness metric, which together painted a comprehensive picture of the system's performance.
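As a rough sketch, Recall@k and MRR can be computed per benchmark item as below (the LLM-based correctness metric is not shown); averaging over all items gives the numbers for one pipeline variant.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant sources that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```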

System Components: From Prototype to Beta

The transition from a prototype to a beta version involves integrating and refining multiple system components. If you have a prototype, you probably have the following components already set up:

• A basic pipeline for cleaning, chunking, and embedding (a minimal sketch follows this list)

• A retrieval system utilizing similarity searches

• A mechanism for generating answers
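For reference, here is a minimal sketch of the cleaning-and-chunking step; the chunk size and overlap are arbitrary example values, not what we used.

```python
import re

def clean(text: str) -> str:
    """Very light cleaning: collapse whitespace. Real pipelines also strip boilerplate, headers, etc."""
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split cleaned text into overlapping character windows (example values, not a recommendation)."""
    text = clean(text)
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```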

As we progressed to the beta phase, the system evolved to include more sophisticated layers (a rough wiring sketch follows the list):

• An advanced pipeline for enhanced data processing techniques

• A classification layer for smarter routing of queries

• A query expansion layer to dissect and broaden the scope of questions

• A reranker to synthesize and prioritize results

• A guardrails layer to validate answers and sources
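Roughly, these layers fit together as sketched below. The function bodies are placeholder stubs rather than our actual implementation; the sketch only shows the order in which the layers are applied.

```python
# Placeholder stubs for the layers above; in production each of these is a real component.
def classify(q): return "knowledge_question"
def expand(q): return [q, f"{q} (specific aspect)"]
def retrieve(q, top_k=20): return [{"id": f"{abs(hash(q)) % 1000}-{i}", "text": "..."} for i in range(3)]
def merge(result_lists): return [c for results in result_lists for c in results]  # flatten only; dedup is sketched further below
def rerank(q, chunks): return chunks
def generate(q, chunks): return "drafted answer"
def guardrails(draft, chunks): return {"answer": draft, "sources": [c["id"] for c in chunks]}

def answer(question: str) -> dict:
    if classify(question) != "knowledge_question":           # classification layer: route the query
        return {"answer": "I can only answer knowledge questions.", "sources": []}
    sub_queries = expand(question)                           # query expansion layer
    candidates = merge([retrieve(q) for q in sub_queries])   # retrieval per sub-query, then merge (and dedup in practice)
    top_chunks = rerank(question, candidates)[:5]            # reranker
    draft = generate(question, top_chunks)                   # answer generation
    return guardrails(draft, top_chunks)                     # guardrails: validate answer and sources
```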

Project Attitude: Detail-Oriented and Hands-On

Throughout the project, our success depended less on the metrics themselves and more on our ability to examine every aspect of the pipeline. This required a deep dive into the details—examining unanswered questions, overlooked sources, and constantly testing and tweaking small parts of the solution. It’s not just about having domain expertise; it’s about actively engaging with the data, recognizing patterns, and making informed adjustments to the many components that can be optimized.

Enhancing Semantic Match

The most crucial task is improving the semantic match between a question and the content from which its answer can be extracted. In our project, we plotted the embeddings in 2D and could clearly distinguish the most important content (blue) from the questions (red). Fortunately for us, there was already content in place that acted as a bridge between the questions and the main content. If you are not so fortunate and don't have that bridge, you can consider strategies like HyDE (Hypothetical Document Embeddings) or synthetic content generation to close the gap.

The 2D projection of our content (blue) and query (red) embeddings.
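If you do need to build that bridge yourself, HyDE boils down to embedding a hypothetical answer instead of the raw question. Here is a minimal sketch with the OpenAI Python client; the model names are example choices, not a recommendation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_embedding(question: str) -> list[float]:
    """Embed a hypothetical answer, which tends to sit closer to the content embeddings than the question does."""
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": f"Write a short, plausible answer to: {question}"}],
    ).choices[0].message.content
    return client.embeddings.create(
        model="text-embedding-3-large",
        input=draft,
    ).data[0].embedding
```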

With query expansion, you can target different elements of the question (for example, a specific aspect of your business) so that a single question touches multiple types of content in your database. The catch is that you then need to bring the results from the different queries together, merge them, and deduplicate them. And since you are now retrieving far more candidates, a reranking layer becomes important.
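A minimal sketch of that merge-and-deduplicate step, assuming each retrieved chunk carries an id and a similarity score:

```python
def merge_results(results_per_query: list[list[dict]]) -> list[dict]:
    """Merge retrieval results from multiple expanded queries, keeping the best score per chunk id."""
    best: dict[str, dict] = {}
    for results in results_per_query:
        for chunk in results:                      # chunk: {"id": ..., "text": ..., "score": ...}
            current = best.get(chunk["id"])
            if current is None or chunk["score"] > current["score"]:
                best[chunk["id"]] = chunk
    # Sort by similarity score; a reranker then re-orders this (much larger) candidate list.
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```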

A reranker is a slower but more powerful model (compared to an embedding model) that scores the similarity between the question and each piece of content. In our case, reranking proved very difficult: commercial offerings such as Cohere and Jina didn't work for us, largely because of our very specific domain and Dutch as the main language, which left us with larger and slower models. For more general applications, off-the-shelf rerankers would probably have worked well.
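For illustration, this is roughly what a cross-encoder reranking step looks like with sentence-transformers; the model name is just an example and, as noted above, off-the-shelf models may not hold up for a niche domain or for Dutch.

```python
from sentence_transformers import CrossEncoder

# Example English cross-encoder; a multilingual or fine-tuned model may be needed for other languages/domains.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    """Score each (question, chunk) pair with the cross-encoder and keep the best chunks."""
    scores = reranker.predict([(question, chunk["text"]) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```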

This part of the project shows that you cannot 'just' apply a best practice and expect great results; you have to test and figure out what works for your specific case. We expected reranking to be relatively trivial to implement, yet in the end most of the development time went into this layer, and the result was still not perfect.

And sometimes you need some luck

Our journey got a significant boost from the timely release of OpenAI's new embedding models: simply switching from text-embedding-ada-002 to text-embedding-3-large improved our recall by 10%-15%. We also fine-tuned an embedding model for our specific domain, but unfortunately the results were far from perfect, probably due to the language.
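The switch itself is a one-line change with the OpenAI Python client, although the whole corpus has to be re-embedded, since the two models produce incompatible vector spaces (and different dimensions). A minimal sketch:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    """Embed a batch of texts; swapping the model name is trivial, re-embedding the corpus is not."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```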

Stay tuned for the next post, where I'll delve into the second part of this two-month project: setting up guardrails, fine-tuning prompts, and preparing for a successful beta launch.

Come chat with us

Get in touch to find out what your data can do for you. Spoiler alert: it's a lot.

Contact Us