This is the third installment in a multi-part series I am doing to evaluate various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform, and the open source tool tvalmetrics. All the code and data used in this article is available here. I’ll be back in a bit with another comparison of more RAG tools!
I’m curious to hear how you’ve been using the tool to optimize your RAG setups, and I suspect others are too! Use tvalmetrics to score your RAG and Tonic Validate to visualize experiments, then let everyone know on X (@tonicfakedata) what you’re building, which RAG system you used, and which parameters you tweaked to improve your scores. Bonus points if you also include your charts from the UI. We’ll promote the best write-up and send you some Tonic swag as well.
After last week’s post, one reader requested that we take a look at Haystack. I had never come across Haystack before, so I was curious. Judging by Haystack’s GitHub, it focuses on the whole LLM pipeline rather than on RAG alone. In particular, it includes features for fine-tuning, semantic search, and decision making alongside normal RAG capabilities. It even includes features for collecting user feedback so you can improve your models. At face value, a product like Haystack seems to lower the barrier to entry for building customized GPTs. However, these are hard problems to solve, and different projects take different approaches, each with its own tradeoffs and results. So it’s always good to test your setup before putting it into production to make sure it performs well. In this article, I am going to help you out by comparing Haystack to LangChain (another end-to-end RAG library) to see how both of them do, and I’ll give my opinion on which is best for production workloads.
To get started with Haystack, I have a collection of 212 essays from Paul Graham. You can read more about how I prepared this test set for the experiments in my earlier blog post, here. To ingest these documents into Haystack, I ran the following code:
This code sets up a document store to hold our embeddings. Then it adds all the essays to the document store and computes embeddings for them. After that, it configures the pipeline to search the embeddings and query GPT-4 Turbo.
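The flow described above (store the documents, embed them, retrieve the closest matches for a query, then hand them to the model) can be sketched without any library. To be clear, this is not Haystack’s API: `embed`, `ToyDocumentStore`, and `build_prompt` are hypothetical names, the three documents are made up, and a word-count vector stands in for a real embedding model so the example runs offline. The final call to GPT-4 Turbo is left out.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyDocumentStore:
    """Hypothetical in-memory store that keeps each document next to its embedding."""

    def __init__(self):
        self.items = []

    def write_documents(self, docs):
        # Embed each document once, at ingestion time.
        for doc in docs:
            self.items.append((doc, embed(doc)))

    def retrieve(self, query, top_k=2):
        # Rank stored documents by similarity to the query embedding.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:top_k]]


def build_prompt(question, contexts):
    """Assemble the text that would be sent to the LLM."""
    joined = "\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


store = ToyDocumentStore()
store.write_documents([
    "startups grow by making something people want",
    "lisp macros let programs write programs",
    "essays are a way to figure out what you think",
])
top = store.retrieve("what makes startups grow", top_k=1)
prompt = build_prompt("what makes startups grow", top)
```

In a real setup the count vectors would be replaced by model embeddings and `prompt` would be sent to the LLM, but the order of operations is the same.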
For LangChain, as with Haystack, I am going to set up a document store to hold the embeddings using the following code:
With the document store in place, I can set up the pipeline that searches it.
To start the comparison, let’s give both Haystack and LangChain an easy question about one of the essays:
Both systems gave the same answer to the question (which is the correct answer).
To run a more thorough analysis on these systems, I am going to use tvalmetrics, which provides an easy way to score RAG systems based on various metrics (you can read more about these in the GitHub repo). In our case, we are going to use the answer similarity score, which scores how similar the LLM’s answer is to the correct answer for a given question. For running tvalmetrics, I created a benchmark of 55 question-answer pairs from a random selection of 30 Paul Graham essays. I can then run both RAG systems through all the questions, collect the RAG-facilitated LLM responses, and pass both the LLM’s answers and the ideal answers from the benchmark set to tvalmetrics. Using these data, tvalmetrics will automatically score the LLM responses, giving me a quantitative idea of how each RAG system is performing.
To get started with this, I ran the following code to load the questions and gather the LLM’s answers using both RAG systems:
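As a rough illustration of this collection step, here is a minimal sketch. It assumes the benchmark is a JSON list of objects with `question` and `answer` keys and that each RAG system is wrapped in a function that takes a question and returns a string. The file layout and the names `load_benchmark`, `collect_answers`, `ask_haystack`, and `ask_langchain` are assumptions for illustration, and the two `ask_*` functions are stand-ins rather than real pipeline calls.

```python
import json


def load_benchmark(path):
    """Read question/answer pairs from a JSON file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def collect_answers(benchmark, systems):
    """Ask every system every question; return {system_name: [answer, ...]}."""
    answers = {name: [] for name in systems}
    for item in benchmark:
        for name, ask in systems.items():
            answers[name].append(ask(item["question"]))
    return answers


# Stand-ins for the real pipelines so the sketch runs on its own.
def ask_haystack(question):
    return f"haystack answer to: {question}"


def ask_langchain(question):
    return f"langchain answer to: {question}"


# Inline placeholder benchmark; in practice this would come from load_benchmark(...).
benchmark = [
    {"question": "Q1?", "answer": "A1"},
    {"question": "Q2?", "answer": "A2"},
]
systems = {"haystack": ask_haystack, "langchain": ask_langchain}
answers = collect_answers(benchmark, systems)
```

Keeping the answers in benchmark order means each one can later be paired with its reference answer by position.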
After the LLM’s answers are stored, I can pass them to tvalmetrics to score them.
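The shape of this scoring step can be sketched with a pluggable scorer. The word-overlap function below is only a placeholder that keeps the example runnable; it is not the tvalmetrics answer similarity score, and `toy_overlap_score`, `score_system`, and the sample questions are all hypothetical.

```python
def toy_overlap_score(question, reference_answer, llm_answer):
    """Placeholder metric: share of reference words that appear in the LLM answer.

    This is NOT the tvalmetrics answer similarity score; it only keeps the sketch runnable.
    """
    ref_words = set(reference_answer.lower().split())
    ans_words = set(llm_answer.lower().split())
    return len(ref_words & ans_words) / len(ref_words) if ref_words else 0.0


def score_system(benchmark, llm_answers, scorer):
    """Score one system's answers against the benchmark, one score per question."""
    return [
        scorer(item["question"], item["answer"], llm_answer)
        for item, llm_answer in zip(benchmark, llm_answers)
    ]


# Made-up benchmark items and answers, for illustration only.
benchmark = [
    {"question": "What color is the sky?", "answer": "blue"},
    {"question": "How many legs does a spider have?", "answer": "eight legs"},
]
llm_answers = ["The sky is blue", "A spider has eight eyes"]
scores = score_system(benchmark, llm_answers, toy_overlap_score)
```

Passing the scorer in as an argument means the same loop works unchanged once a real metric replaces the placeholder.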
After tvalmetrics finished processing, I observed the following results:
Across the board, Haystack performed better, although LangChain’s performance was also strong (especially considering the low scores of competitor systems like OpenAI Assistants). With a higher average and minimum answer similarity score, Haystack’s RAG system provided a correct (or close to correct) answer more often than LangChain’s system. Its lower standard deviation also means that response quality was more consistent across the 55 tests I ran on both systems. The results are promising, but I encourage you to use tvalmetrics and replicate the experiment using your own models and benchmarks.
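If you replicate the experiment, the three summary numbers discussed here (average, minimum, and standard deviation) are easy to compute from a list of per-question scores. The sketch below uses Python’s standard `statistics` module; the `summarize` helper is a hypothetical name, and the scores are made up for two hypothetical systems, not the results from this article.

```python
import statistics


def summarize(scores):
    """Return the mean, minimum, and sample standard deviation of a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "stdev": statistics.stdev(scores),
    }


# Made-up scores for two hypothetical systems; not the results from this article.
system_a = [4.0, 5.0, 4.0, 5.0]
system_b = [2.0, 5.0, 3.0, 5.0]

summary_a = summarize(system_a)
summary_b = summarize(system_b)
```

In this toy data, the first system has the higher mean and minimum and the smaller spread, which is the pattern that indicates more consistently good answers.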
While both systems performed well, Haystack is the winner here, performing better than LangChain overall. I also found that Haystack’s system was a lot easier to work with. Haystack’s documentation was drastically better than LangChain’s, and I would recommend using Haystack’s system in production for this reason. An exception to this is if you need to integrate RAG with a complex system like agents. In that case, LangChain’s integration with its agent framework makes it a much more attractive option, and in general LangChain is built for setups like that, where you are orchestrating many services across the whole stack. However, if you are using RAG to build or improve a simple chatbot, then you’ll be perfectly fine using Haystack.
All the code and data used in this article is available here. I’m curious to hear your take on OpenAI Assistants, GPTs, and tvalmetrics! Reach out to me at ethanp@tonic.ai.