This is the fifth installment in a multi-part series on evaluating various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform. All the code and data used in this article is available here. We’ll be back in a bit with another comparison of more RAG tools!
I (and likely others) am curious to hear how you’ve been using the tool to optimize your RAG setups! Use Tonic Validate to score your RAG system and visualize experiments, then let everyone know on X (@tonicfakedata) what you’re building, which RAG system you used, and which parameters you tweaked to improve your scores. Bonus points if you also include your charts from the UI. We’ll promote the best write-up and send you some Tonic swag as well.
Hello again! For this installment of the series, I decided to evaluate OpenAI’s RAG Assistant to see how it compares to Google’s Vertex Search and Conversation offering. You may recall that we have evaluated OpenAI’s RAG Assistant in the past, both here and here, and we were curious whether it has improved over time (read ahead to find out!). We haven’t reviewed Google’s RAG offering until now, so we are excited for the head-to-head evaluation.
Since both offerings we are evaluating today are end-to-end RAG solutions, the setup is simple. For each product, we upload the documents on which we test retrieval, generate answers via the built-in RAG system, and send the generated answers directly to Tonic Validate for scoring.
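The workflow above can be sketched as a small harness. This is a minimal illustration and not necessarily how the notebook does it: `QARecord`, `collect_answers`, and the `ask` callable are hypothetical names, and `ask` stands in for whichever assistant (Vertex or OpenAI) is being queried.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QARecord:
    """One benchmark item: the question, its reference answer, and the RAG answer."""
    question: str
    reference_answer: str
    llm_answer: str


def collect_answers(
    questions: List[str],
    reference_answers: List[str],
    ask: Callable[[str], str],
) -> List[QARecord]:
    """Ask the RAG system every question and pair each answer with its reference.

    `ask` is a hypothetical stand-in for a call to the built-in RAG system.
    """
    if len(questions) != len(reference_answers):
        raise ValueError("questions and reference_answers must be the same length")
    return [
        QARecord(question=q, reference_answer=ref, llm_answer=ask(q))
        for q, ref in zip(questions, reference_answers)
    ]
```

The resulting records hold exactly what the scoring step needs: the question, the reference answer, and the generated answer. Because the RAG system is passed in as a plain callable, the same loop can be reused for both products.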
Our test set consists of 212 Paul Graham essays, which you can find in our GitHub here.
For Google Vertex, you’ll need to create a new Vertex Application, upload the collection of 212 Paul Graham essays to Google Cloud Storage, and create a datastore in your Vertex application that references the Google Cloud Storage bucket containing the essays.
For OpenAI, things are somewhat more convoluted. The OpenAI RAG Assistant only supports up to 20 files at a time, so, just like in our previous blog posts, we combine all 212 essays into one large text file. The text file with all of the combined essays can be found here.
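Combining the essays takes only a few lines of Python. The sketch below is one way to do it, under some assumptions: `combine_essays` is a hypothetical helper, the essays are assumed to be `.txt` files in a single directory, and the `=== filename ===` header between essays is our own illustrative choice rather than the format of the combined file linked above.

```python
from pathlib import Path


def combine_essays(essay_dir: str, output_path: str) -> int:
    """Concatenate every .txt file in essay_dir into one file; return the essay count."""
    out = Path(output_path).resolve()
    # Sort for a deterministic order, and skip the output file itself in case
    # it lives in the same directory as the essays.
    essay_paths = sorted(
        p for p in Path(essay_dir).glob("*.txt") if p.resolve() != out
    )
    with out.open("w", encoding="utf-8") as combined:
        for path in essay_paths:
            # Assumed separator: a filename header keeps essay boundaries visible.
            combined.write(f"=== {path.name} ===\n")
            combined.write(path.read_text(encoding="utf-8").strip())
            combined.write("\n\n")
    return len(essay_paths)
```

Calling `combine_essays("essays", "all_essays.txt")` (both paths are placeholders) writes the merged file and returns the number of essays included, which is a quick way to confirm that all 212 made it in.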
Check out the notebook we provide to see how both assistants are set up and initialized.
Let’s start off by asking both assistants a simple question related to the essays. In siliconvalley.txt, Paul Graham discusses the ingredients necessary to turn a city into a tech hub, so let’s ask our assistants about it.
The question we ask is:
According to Paul Graham, you only need two kinds of people to create a technology hub: rich people and nerds [1]. To create a startup hub, you need to establish a first-rate research university in a place so nice that rich people wanted to live there [4]. The town would be hospitable to both groups you need: both founders and investors [4].
According to Paul Graham, several key components are necessary to create a technology hub:
Both services provide at least an OK answer that ultimately addresses the question. In our view, Google Vertex performs much better here because it gets to the right answer more directly and more quickly.
Tonic Validate gives Answer Similarity scores of 5/5 for Vertex (a perfect score) and 2/5 for OpenAI.
OK, let’s now look at how Vertex performs across a large set of questions and reference answers. The Q&A set we use can be found here. It is a set of ~50 questions that are relevant to Paul Graham’s essays.
Below is how Vertex performed on Tonic Validate’s Answer Similarity metric.
These are strong results, but let’s still analyze Vertex’s misses. There were 3 questions where Vertex scored a 0/5. The 3 questions are:
Vertex misunderstood this question and instead described Airbnb’s short-term goals during its time in Y Combinator.
Here, Vertex failed to find any relevant context in the essays and responded that it could not provide an answer. This is better than hallucinating but still a disappointment because the answer can be found in the essay gap.txt.
The answer can be found in equity.txt. In that essay Paul Graham talks about the perceived outcomes of an investment from the POV of both the entrepreneur and the investor. The question is asking about the entrepreneur whereas the answer provided by Vertex is from the POV of the investor. I’d like to point out that it is AWESOME that Tonic Validate was able to call out this answer as being incorrect given the subtlety involved.
Let’s move on to OpenAI’s performance. From the chart below we can see a few things. First, OpenAI has many more perfectly answered questions than Google Vertex (25 vs. 17). However, it also has four questions that scored a 0, versus three for Vertex.
Let’s analyze a few of the questions where OpenAI scored a 0/5.
In all three of the above questions, OpenAI responded that it couldn’t find the relevant context to answer the question. As with Vertex, this is better than hallucinating, but the answers can be found in the essays googles.txt, wtax.txt, and lwba.txt, respectively.
Here, the LLM answers by listing things that used to be constraints but no longer are, such as the lack of open source software, expensive hardware, and programming languages that have since improved.
But it never identifies the actual constraint, which according to Paul Graham is US immigration policy (see foundervisa.txt).
Alright, now for the main event. Below, we show the distribution and summary statistics of Tonic Validate scores for OpenAI’s RAG Assistant and Google’s Vertex Search and Conversation.
The results show that OpenAI is the winner, though it is a relatively close call: OpenAI has a mean score of 3.47 vs. 3.3 for Vertex. However, we should also point out that OpenAI’s scores show a bit more variability, with more scores at both ends of the spectrum.
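For readers who want to run the same kind of comparison on their own score lists, here is a minimal sketch using only the Python standard library. `summarize_scores` is a hypothetical helper, and the two score lists are made up purely for illustration; they are not the scores from this evaluation.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def summarize_scores(scores: List[float]) -> Dict[str, float]:
    """Summarize a list of 0-5 scores: count, mean, spread, and both extremes."""
    counts = Counter(scores)
    return {
        "n": len(scores),
        "mean": mean(scores),
        "std": pstdev(scores),  # population standard deviation
        "zeros": counts[0],     # answers scored 0/5
        "fives": counts[5],     # answers scored 5/5
    }


# Made-up scores, for illustration only (NOT the results reported above).
system_a = [0, 5, 5, 5, 5, 1]
system_b = [3, 3, 4, 3, 3, 4]

for name, scores in [("system_a", system_a), ("system_b", system_b)]:
    print(name, summarize_scores(scores))
```

Looking at the mean alongside the standard deviation and the counts of 0s and 5s makes the pattern described above easy to spot: one system can have a higher average while also having more scores at both extremes.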
Something we don’t include in our evaluation is throughput and the practicality of using the technology in real-world situations. I want to briefly mention that both solutions are a breeze to set up, but a few things make Vertex a more viable candidate for a production system. The first is throughput. Vertex answered all 55 questions in 1 or 2 minutes. OpenAI, on the other hand, took well over 30 minutes and had to be run multiple times because of intermittent failures. Additionally, the OpenAI RAG Assistant only supports up to 20 files, which is limiting.
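One common way to cope with intermittent failures like these is to wrap each request in a retry loop with exponential backoff, so a single failed call doesn't force a full rerun. The sketch below is a generic pattern and not what we did in this evaluation: `with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run `call`, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

A per-question call would then look like `with_retries(lambda: ask(question))`, where `ask` is a placeholder for the function that queries the assistant. Retrying on every exception is a simplification; in practice you would likely restrict it to errors known to be transient.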