This is the fourth installment in a multi-part series on evaluating various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform. All the code and data used in this article is available here. We’ll be back in a bit with another comparison of more RAG tools!
I (and likely others) am curious to hear how you’ve been using the tool to optimize your RAG setups! Use Tonic Validate to score your RAG and visualize experiments, then post on X (@tonicfakedata) to let everyone know what you’re building, which RAG system you used, and which parameters you tweaked to improve your scores. Bonus points if you also include your charts from the UI. We’ll promote the best write-up and send you some Tonic swag as well.
Hello again! In this series, we’ve focused heavily on young companies building RAG tooling, but the big cloud providers also offer a host of RAG product suites. So, for this installment, I decided to evaluate Amazon Bedrock to see how some of its offerings perform at RAG. Amazon Bedrock provides base models in the text, embedding, and image modalities that anyone can use to build AI applications. For RAG specifically, I’ll be looking at the text and embedding models. Bedrock has text models from Anthropic, Cohere, AI21 Labs, Meta, and Amazon, image models from providers such as Stability AI, and embedding models from Amazon and Cohere. As you can see, there are a lot of models to choose from when building a RAG application on Bedrock. For this post, I’ll use Amazon Bedrock to compare a RAG system built on Amazon’s Titan models head to head with one built on Cohere’s models.
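Setting up the experiment consists of three main steps: setting up a Knowledge Base in Amazon Bedrock, choosing a text model to serve as the LLM, and writing the RAG system in Python.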
In the following sections you’ll see code for exactly how to implement each step. For now, I’ll summarize what each step consists of and show the code for a base class for implementing a simple RAG system.
To set up a Knowledge Base in Amazon Bedrock, you choose an S3 folder containing your data and the text embedding model you’d like to use. Bedrock handles chunking, embedding, and storing the data from that folder. Given a user query, you retrieve the relevant context from the Knowledge Base through an AWS API; the Knowledge Base handles the retrieval process itself. A good tutorial for setting up a Knowledge Base in Amazon Bedrock is found here. As usual, I used the collection of 212 essays from Paul Graham that has been used in the previous posts in this RAG evaluation series. You can read more about how this dataset is prepared here.
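As a rough sketch of that retrieval call (not the exact code from this experiment), the snippet below queries a Knowledge Base through boto3’s `bedrock-agent-runtime` client. The helper names, the default of five results, and the `us-east-1` region are assumptions for illustration; the Knowledge Base ID comes from your own AWS console.

```python
from typing import Any, List


def retrieve_context(
    client: Any, knowledge_base_id: str, query: str, top_k: int = 5
) -> List[str]:
    """Return the text of the top_k chunks the Knowledge Base finds for `query`.

    `client` is expected to be a boto3 "bedrock-agent-runtime" client.
    """
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    # Each retrieval result carries its chunk text under content.text.
    return [result["content"]["text"] for result in response["retrievalResults"]]


def make_kb_client(region_name: str = "us-east-1") -> Any:
    """Create the runtime client (requires boto3 and AWS credentials).

    The region is an assumption; use the one your Knowledge Base lives in.
    """
    import boto3  # imported lazily so retrieve_context can be tested offline

    return boto3.client("bedrock-agent-runtime", region_name=region_name)
```

Passing the client in as an argument keeps the helper easy to test with a stub, and lets one client be reused across many questions.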
Choosing a text model to serve as the LLM is easy: you just pick one from the list of models in the AWS console. You interact with the chosen model through the AWS API, identifying it by its model name (you’ll see specifics of this below). Not all models are created equal, so it may be prudent to try a couple of them and use Tonic Validate to understand the impact.
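To make “interacting through the AWS API via the name of the model” concrete, here is a minimal sketch that calls a Titan text model through a boto3 `bedrock-runtime` client. The specific model ID (`amazon.titan-text-express-v1`), the helper name, and the generation settings are assumptions for illustration; the post does not say which Titan text model or parameters were used.

```python
import json
from typing import Any

# Assumed choice of Titan text model; swap in whichever model ID you selected.
TITAN_TEXT_MODEL_ID = "amazon.titan-text-express-v1"


def invoke_titan_text(
    client: Any, prompt: str, max_tokens: int = 512, temperature: float = 0.0
) -> str:
    """Send `prompt` to a Titan text model and return the generated text.

    `client` is expected to be a boto3 "bedrock-runtime" client.
    """
    body = json.dumps(
        {
            "inputText": prompt,
            "textGenerationConfig": {
                "maxTokenCount": max_tokens,
                "temperature": temperature,
            },
        }
    )
    response = client.invoke_model(
        modelId=TITAN_TEXT_MODEL_ID,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream holding a JSON document.
    payload = json.loads(response["body"].read())
    return payload["results"][0]["outputText"]
```

Trying a different model usually means changing the model ID and, because each provider defines its own payload, the request and response fields as well.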
Writing the RAG system in Python consists of defining the logic that takes a user question, retrieves the relevant context for it from the Knowledge Base, and prompts the LLM with the question and the retrieved context to produce an answer. For this purpose, I used a simple abstract base class in Python.
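A minimal sketch of what such a base class might look like is below. The class name, method names, and prompt wording are placeholders of my own for illustration and may differ from the code in the repository; the idea is only that subclasses supply the retrieval and generation steps while the base class ties them together.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseRAG(ABC):
    """Hypothetical base class: subclasses supply retrieval and generation."""

    @abstractmethod
    def retrieve(self, question: str) -> List[str]:
        """Return context chunks relevant to the question."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Send the prompt to the LLM and return its text answer."""

    def build_prompt(self, question: str, context: List[str]) -> str:
        """Combine the retrieved chunks and the question into one prompt."""
        joined = "\n\n".join(context)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{joined}\n\n"
            f"Question: {question}\nAnswer:"
        )

    def answer(self, question: str) -> str:
        """Retrieve context for the question, then ask the LLM to answer it."""
        context = self.retrieve(question)
        return self.generate(self.build_prompt(question, context))
```

With this shape, a Titan-backed and a Cohere-backed system differ only in how they implement `retrieve` and `generate`, which keeps the comparison between them fair.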
The Cohere RAG system is set up the same way as the Titan one. In this case, the Cohere Embed English model is used as the embedding model and the Cohere Command model is used as the LLM. The RAG base class is implemented with calls to the Command model.
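For illustration, a call to Cohere Command through a boto3 `bedrock-runtime` client might look like the sketch below. The model ID (`cohere.command-text-v14`), the helper name, and the generation settings are assumptions rather than the values used in this experiment. Note that Command has its own request and response fields (`prompt` in, `generations` out).

```python
import json
from typing import Any

# Assumed Bedrock model ID for Cohere Command.
COHERE_COMMAND_MODEL_ID = "cohere.command-text-v14"


def invoke_cohere_command(
    client: Any, prompt: str, max_tokens: int = 512, temperature: float = 0.0
) -> str:
    """Send `prompt` to Cohere Command and return the generated text.

    `client` is expected to be a boto3 "bedrock-runtime" client.
    """
    body = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    )
    response = client.invoke_model(
        modelId=COHERE_COMMAND_MODEL_ID,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    # Command returns a list of generations; take the text of the first one.
    payload = json.loads(response["body"].read())
    return payload["generations"][0]["text"]
```

The embedding model does not appear in this code because it is chosen when the Knowledge Base is created, so the Cohere system simply points at a Knowledge Base built with Cohere Embed English.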
To run a more thorough analysis on these systems, I am going to use Tonic Validate’s Python SDK, which provides an easy way to score RAG systems on various metrics (you can read more about these in the GitHub repo). Here, I am going to use the answer similarity score, which measures how similar the LLM’s answer is to the correct answer for a given question. To run Tonic Validate, I created a benchmark of 55 question-answer pairs from a random selection of 30 Paul Graham essays. I can then run both RAG systems through all the questions, collect the RAG-facilitated LLM responses, and pass both the LLM’s answers and the ideal answers from the benchmark set to Tonic Validate. Using this data, Tonic Validate will automatically score the LLM responses, giving me a quantitative idea of how each RAG system is performing.
To get started, I loaded the questions and gathered the LLM’s answers using both RAG systems.
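A sketch of that step, under some assumptions: the benchmark is stored as a JSON list of `{"question": ..., "answer": ...}` objects (a hypothetical format), and each RAG system is exposed as a function from a question to an answer. The function and field names are illustrative, not taken from the repository.

```python
import json
from typing import Callable, Dict, List


def load_benchmark(path: str) -> List[Dict[str, str]]:
    """Load question-answer pairs from a JSON file.

    Assumed format: a list of {"question": ..., "answer": ...} objects.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def gather_answers(
    benchmark: List[Dict[str, str]],
    systems: Dict[str, Callable[[str], str]],
) -> List[Dict[str, str]]:
    """Ask every system every benchmark question.

    Returns one row per question holding the question, the reference
    answer, and each system's answer keyed by the system's name.
    """
    rows = []
    for item in benchmark:
        row = {"question": item["question"], "reference_answer": item["answer"]}
        for name, ask in systems.items():
            row[name] = ask(item["question"])
        rows.append(row)
    return rows
```

Keeping each system’s answer next to the reference answer in the same row makes it straightforward to hand both to a scorer afterwards.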
After the LLM’s answers are stored, I can pass them to Tonic Validate to score them.
After Tonic Validate is finished processing, I observed the following results:
Across the board, Cohere performed better, although Amazon Titan’s performance was also strong (especially considering the low scores of competitor systems like OpenAI Assistants). With a higher average and a higher minimum answer similarity score, Cohere’s RAG system provided a correct (or close to correct) answer more often than Amazon Titan’s system. Cohere’s lower standard deviation also means that its response quality was more consistent across the 55 tests I ran on both systems. The results are promising, but I encourage you to use Tonic Validate and replicate the experiment using your own models and benchmarks.
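If you replicate the experiment, the summary statistics discussed above (average, minimum, and standard deviation) can be computed from each system’s list of per-question scores with the standard library. This is a small sketch with a hypothetical helper name; it assumes you have already collected the scores as plain numbers.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Summarize per-question scores for one RAG system.

    Uses the sample standard deviation, so at least two scores are needed.
    """
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": stdev(scores),
    }
```

Calling `summarize` once per system gives directly comparable numbers: a higher mean and minimum point to better answers, and a lower standard deviation points to more consistent ones.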
While both systems performed well, Cohere is the winner here, performing better overall than Amazon Titan. It was super easy to set up these RAG systems using Knowledge Bases and the models provided by Amazon Bedrock. Amazon Bedrock’s Knowledge Bases completely manage the retrieval portion of RAG: chunking, embedding, storing, and retrieving your data. Amazon Bedrock also provides a plethora of models to choose from as the LLM in your RAG system. I can’t wait to try the other ones out!
All the code and data used in this article is available here. I’m curious to hear your take on Amazon Bedrock, building RAG systems and Tonic Validate! Reach out to me at joeferrara@tonic.ai to chat more.