Building a RAG system on Databricks with your unstructured data using Tonic Textual

Author
Prasad Kona
Author
Ethan P
September 20, 2024
Written by Prasad Kona, Lead Partner Solutions Architect at Databricks, and Ethan P., Software Engineer at Tonic.ai

Building a RAG system can be challenging. In addition to deployment and infrastructure challenges (e.g., scaling up your vector database), there are many tradeoffs and decisions to make for each component of your RAG stack.

One key decision is choosing between RAG and fine-tuning: RAG lets your system retrieve external information dynamically at query time, while fine-tuning adjusts a model for specific use cases based on existing knowledge.

However, the biggest challenge facing enterprises seeking to implement RAG is getting quality data to power the system in the first place. In today’s world, your organization’s data is probably spread across different formats (PDFs, Word documents, etc.) and locations. Colocating and converting your unstructured data into a common format ready for ingestion into a vector database is difficult and cumbersome to do manually, and harder still at scale.

Luckily, Databricks and Tonic.ai have partnered to simplify the process of connecting your enterprise unstructured data to AI systems. Databricks is well known for its unified data and AI platform, which integrates your company’s data (whether structured or unstructured) stored in a data lake with your existing infrastructure and accelerates your data-driven initiatives via open data access. By leveraging the Mosaic AI Agent Framework from Databricks, you can efficiently build, deploy, and scale RAG applications. Meanwhile, Tonic Textual handles the heavy lifting of converting your unstructured data (whether it’s in PDFs, Word documents, or other formats) into a standardized format that’s ready for use with the features of the Databricks platform. Textual also enriches the information from your documents with metadata that improves the quality and relevance of the responses generated by your RAG app, which helps reduce hallucinations and deliver a more accurate, trustworthy AI experience.

Together, Databricks and Tonic Textual remove the complexities of data preparation and integration, allowing your teams to focus on building high-quality RAG systems. In this demo, we will show you how to integrate Tonic Textual with the Mosaic AI Agent Framework to build a simple RAG system on the unstructured data in your data lake.

Overview

Diagram showing the process of parsing PDF files into chunked text and metadata with Tonic Textual, storing it in a Databricks catalog, and syncing with Databricks Vector Search for querying embeddings.

The first step in building our RAG system is to ingest the data. We start by feeding our data to Tonic Textual. The data in this example is all PDFs, but it can be any of the other formats supported by Textual (including docx, xlsx, and png). After Textual has parsed these files into a common format (Markdown) and has extracted metadata for each file, we store this information in a Databricks catalog, which syncs to Mosaic AI Vector Search for querying. The metadata generated by Textual in this example is a list of the entities in each chunk (e.g., company names, individual names, and phone numbers).

Flowchart showing how Mosaic AI Agent Framework processes a user's question by querying Tonic Textual for entities and using Vector Search to find relevant context, returning the answer to the user.

For the chatbot itself, the user asks a question to our RAG system running on the Mosaic AI Agent Framework. The Agent Framework queries Textual to find the entities in the user’s question. Using these entities, the RAG system filters for documents that include at least one of the entities mentioned in the question. This improves the quality and speed of the vector search, because it only searches over documents that are relevant to the question.
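The filtering idea can be pictured with a small, framework-free sketch. The chunk structure, the example data, and the `filter_chunks_by_entities` name are hypothetical, and in the real system the filter would be applied inside the vector search query rather than in Python. Falling back to all chunks when no entity is detected is an assumption of this sketch, not something the post specifies.

```python
def filter_chunks_by_entities(chunks, question_entities):
    """Keep chunks that mention at least one entity found in the question."""
    wanted = {entity.lower() for entity in question_entities}
    if not wanted:
        # Assumption: with no detected entities, search over everything.
        return list(chunks)
    return [
        chunk for chunk in chunks
        if wanted & {entity.lower() for entity in chunk["entities"]}
    ]


# Hypothetical chunks with the entity metadata attached during ingestion.
chunks = [
    {"id": 1, "text": "Acme reported record revenue.", "entities": ["Acme"]},
    {"id": 2, "text": "Globex opened a new office.", "entities": ["Globex"]},
    {"id": 3, "text": "Acme and Initech signed a deal.", "entities": ["Acme", "Initech"]},
]

candidates = filter_chunks_by_entities(chunks, ["acme"])
print([chunk["id"] for chunk in candidates])  # [1, 3]
```

Only the two chunks that mention the entity from the question remain as candidates for the similarity search.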

Requirements

To get started, make sure you have the following set up:

  1. A Databricks workspace
  2. A Tonic Textual account
  3. A Tonic Textual API Key
  4. An S3 bucket with your data

Getting started

First, we will create a notebook in our Databricks workspace. Once the notebook is created, we can install our dependencies.

Once the dependencies are installed, we will set up some variables containing our Databricks catalog name, catalog schema name, RAG model name, and vector search endpoint name.

With these variables set up, we can create the catalog and catalog schema.
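As a rough sketch of these two steps, the variables and the catalog setup might look like the following. All names are placeholders chosen for illustration, not the values used by the authors, and `run_setup` assumes it receives the `spark` session that Databricks notebooks provide.

```python
# Hypothetical names; replace them with values for your workspace.
CATALOG = "tonic_demo"
SCHEMA = "rag"
MODEL_NAME = "tonic_rag_chatbot"
VECTOR_SEARCH_ENDPOINT = "tonic_vs_endpoint"


def setup_statements(catalog, schema):
    """Return the SQL that creates the catalog and schema if they are missing."""
    return [
        f"CREATE CATALOG IF NOT EXISTS `{catalog}`",
        f"CREATE SCHEMA IF NOT EXISTS `{catalog}`.`{schema}`",
    ]


def run_setup(spark, catalog=CATALOG, schema=SCHEMA):
    """Execute the setup statements with the notebook's SparkSession."""
    for statement in setup_statements(catalog, schema):
        spark.sql(statement)


for statement in setup_statements(CATALOG, SCHEMA):
    print(statement)
```

In a notebook you would call `run_setup(spark)`. Building the statements separately makes them easy to inspect before anything is executed.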

We will also create the vector search endpoint, which will connect to our catalog in a later step and give our RAG system a way to query our data.
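One way to sketch the endpoint creation is a small helper that only creates the endpoint when it does not already exist. The helper name `ensure_endpoint` is hypothetical. It assumes `client` behaves like `VectorSearchClient` from the `databricks-vectorsearch` package, whose `list_endpoints()` returns a dictionary with an `endpoints` list.

```python
def ensure_endpoint(client, name):
    """Create a standard vector search endpoint unless it already exists.

    Returns True if the endpoint was created, False if it was already there.
    """
    listed = client.list_endpoints().get("endpoints", [])
    existing = {endpoint["name"] for endpoint in listed}
    if name in existing:
        return False
    client.create_endpoint(name=name, endpoint_type="STANDARD")
    return True
```

In a notebook, the client would typically come from `from databricks.vector_search.client import VectorSearchClient`, followed by `ensure_endpoint(VectorSearchClient(), "your_endpoint_name")`. Endpoint provisioning can take several minutes, so later steps may need to wait until it is online.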

Setting up Textual

To load our data into Databricks, we first need to set up Textual and connect it to an S3 bucket. First, go to https://textual.tonic.ai/ and log in. Then click “Create a Pipeline” to create a Textual Pipeline. A Textual Pipeline automatically parses files from S3, extracts relevant metadata from your documents, and standardizes the output for ingestion into your RAG system.

Screenshot of the Textual platform with a 'Create a Pipeline' button highlighted, used to process unstructured data from S3 or local files for RAG systems or LLM development.

Once you click “Create a Pipeline”, you’ll see options to configure your pipeline. Fill out the S3 bucket information to create the pipeline.

Screenshot of S3 bucket information form to create the pipeline

Once the pipeline is created, select the S3 input and output locations.

Screenshot of settings for S3 input and output locations

Finally, click “Run Pipeline” to process your files.

Screenshot of the Databricks pipeline setup in Textual, with the 'Run Pipeline' button highlighted, indicating the final step to process files from the S3 bucket into the output location.

Once your pipeline is done running, you can move to the next step of connecting Textual to Databricks. This is a live pipeline that can be re-run to refresh the data that gets loaded into Databricks. To keep costs down, Textual only processes new or modified files during each run.

Setting up Textual with Databricks

Now that Textual has processed our data, we can connect it to Databricks. First, we will define the catalogs where we will save our data.

Then we can pull our chunked data from Tonic Textual. Before doing this step, ensure that you have your Textual API key set as a secret called TONIC_TEXTUAL_API_KEY and that your Textual Pipeline ID is set as a secret called TONIC_TEXTUAL_PIPELINE_ID. To get the Pipeline ID, you can copy it from above the “Run Pipeline” button in the previous step.

In the code below, we connect to the pipeline in Textual using our API key. Then we specify the entities that we would like to collect as metadata. To see a complete list of entities Textual detects out-of-the-box, go to the Tonic Textual docs. We can now create a list containing the chunked text, the metadata, and the file location.
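The original code cell is not reproduced here, so the following is only a sketch of the final flattening step. It assumes the parsed files have already been fetched from Textual and converted to plain `(path, chunks)` pairs. The input shape, the `build_rows` name, and the selection of entity labels are all hypothetical; check the Textual docs for the exact SDK calls and the full list of entity types.

```python
# Assumed entity labels to keep as metadata; adjust to the types you need.
ENTITY_TYPES = ["ORGANIZATION", "NAME_GIVEN", "NAME_FAMILY", "PHONE_NUMBER"]


def build_rows(parsed_files, entity_types=ENTITY_TYPES):
    """Flatten (path, chunks) pairs into one row per chunk."""
    rows = []
    for path, file_chunks in parsed_files:
        for chunk in file_chunks:
            entities = [
                entity["text"]
                for entity in chunk["entities"]
                if entity["label"] in entity_types
            ]
            rows.append({"text": chunk["text"], "entities": entities, "source": path})
    return rows


# Hypothetical stand-in for what the pipeline returns for one file.
parsed_files = [
    ("s3://my-bucket/reports/a.pdf", [
        {
            "text": "Acme hired Jane in Paris.",
            "entities": [
                {"label": "ORGANIZATION", "text": "Acme"},
                {"label": "NAME_GIVEN", "text": "Jane"},
                {"label": "LOCATION_CITY", "text": "Paris"},
            ],
        },
    ]),
]

rows = build_rows(parsed_files)
print(rows[0]["entities"])  # ['Acme', 'Jane']
```

Each row carries the three pieces of information the paragraph above describes: the chunk text, the entity metadata, and the file location.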

Now we can create our dataframe with the data from Textual and then add a unique ID for each chunk.
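A minimal sketch of the ID step, assuming each row is a dictionary with `source` and `text` keys (a hypothetical shape). It derives a deterministic ID from the file location, the chunk's position within that file, and the chunk text, so re-running ingestion on unchanged files yields the same IDs. This is one possible scheme, not necessarily the one the authors used.

```python
import hashlib
from collections import defaultdict


def add_chunk_ids(rows):
    """Return copies of the rows with a deterministic 'id' field added."""
    positions = defaultdict(int)  # next chunk position for each source file
    with_ids = []
    for row in rows:
        position = positions[row["source"]]
        positions[row["source"]] += 1
        key = f"{row['source']}|{position}|{row['text']}".encode("utf-8")
        with_ids.append({**row, "id": hashlib.sha256(key).hexdigest()[:16]})
    return with_ids


rows = [
    {"source": "s3://my-bucket/a.pdf", "text": "Same text."},
    {"source": "s3://my-bucket/a.pdf", "text": "Same text."},
    {"source": "s3://my-bucket/b.pdf", "text": "Same text."},
]
rows_with_ids = add_chunk_ids(rows)
print(len({row["id"] for row in rows_with_ids}))  # 3
```

In a Databricks notebook, the resulting list could then be passed to `spark.createDataFrame` to obtain the dataframe. Including the position in the hash keeps IDs unique even when two chunks in one file contain identical text.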

Finally, we can write our parsed and enriched data to a Delta table and then sync it to Mosaic AI Vector Search, which will also automatically generate the embeddings.
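The sync step might be sketched as follows. The helper names are hypothetical, `client` is assumed to behave like `VectorSearchClient` from the `databricks-vectorsearch` package, and the embedding endpoint name is an assumption that should be replaced with a model serving endpoint available in your workspace. The sketch also assumes the chunk table has `id` and `text` columns.

```python
def enable_change_data_feed_sql(table_name):
    """Delta Sync indexes require Change Data Feed on the source table."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def index_arguments(endpoint_name, source_table, index_name,
                    embedding_endpoint="databricks-gte-large-en"):
    """Build the arguments for a Delta Sync index with managed embeddings."""
    return {
        "endpoint_name": endpoint_name,
        "index_name": index_name,
        "source_table_name": source_table,
        "pipeline_type": "TRIGGERED",          # sync on demand, not continuously
        "primary_key": "id",                   # unique chunk ID column
        "embedding_source_column": "text",     # column to embed
        "embedding_model_endpoint_name": embedding_endpoint,
    }


def create_index(client, endpoint_name, source_table, index_name):
    """Create the index; embeddings are computed by the named endpoint."""
    arguments = index_arguments(endpoint_name, source_table, index_name)
    return client.create_delta_sync_index(**arguments)


print(enable_change_data_feed_sql("tonic_demo.rag.chunks"))
```

In a notebook, the dataframe would first be written with something like `df.write.format("delta").mode("overwrite").saveAsTable(...)`, followed by the `ALTER TABLE` statement via `spark.sql`, and then `create_index`.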

Configuring our RAG chain

Now we will set up the RAG app itself using MLflow and its LangChain integration. With MLflow, we can serve our LangChain model via Databricks and automatically get in-depth logging for each step in the chain. MLflow also lets you run evaluations later on to measure performance. To set up MLflow, we need to save our Databricks configuration.

After this configuration is saved, we need to create a new notebook (which will be called Tonic_Chain). In the new notebook, install the required dependencies.

Then, in a new cell, restart Python to use the updated packages.

First, let’s set up some helper functions for LangChain along with loading the configuration for the model.

Then we will set up the integration with Tonic Textual. This function connects to Textual and extracts the organization metadata from a user’s query. We will use this metadata later to improve retrieval for the RAG system.
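As a sketch of the post-processing half of that function, the helper below picks organization names out of a list of detected entities. It assumes the Textual response has already been converted into dictionaries with `label` and `text` keys, and that organizations carry the label `ORGANIZATION`. Both are assumptions to check against the Textual SDK docs, and the call to Textual itself is not shown.

```python
def extract_organizations(entities):
    """Return unique organization names, in order of first appearance."""
    seen = set()
    organizations = []
    for entity in entities:
        if entity["label"] != "ORGANIZATION":
            continue
        name = entity["text"].strip()
        if name and name.lower() not in seen:
            seen.add(name.lower())
            organizations.append(name)
    return organizations


# Hypothetical entities detected in "What did Acme say about Globex, Jane?"
detected = [
    {"label": "ORGANIZATION", "text": "Acme"},
    {"label": "ORGANIZATION", "text": "Globex"},
    {"label": "NAME_GIVEN", "text": "Jane"},
    {"label": "ORGANIZATION", "text": "acme"},
]
print(extract_organizations(detected))  # ['Acme', 'Globex']
```

Deduplicating case-insensitively keeps the later metadata filter short when a question mentions the same company more than once.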

Now, let’s connect to Mosaic AI Vector Search. Note that our LangChain retriever (which finds the relevant data in our vector database) is configured to automatically filter out documents that don’t have metadata relevant to the user’s question.
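A hedged sketch of how the retriever's search arguments could be assembled from the detected entities. The function name, the `entities` column name, and the default `filters` key are assumptions. The exact filter keyword and the way a list-valued filter matches your metadata column depend on the LangChain integration version and the index type, so verify both against the Vector Search docs.

```python
def build_search_kwargs(organizations, k=3, column="entities", filter_key="filters"):
    """Build retriever search arguments, adding a metadata filter if possible."""
    search_kwargs = {"k": k}
    if organizations:
        # Assumption: a list value means "match any of these values".
        search_kwargs[filter_key] = {column: list(organizations)}
    return search_kwargs


print(build_search_kwargs(["Acme", "Globex"]))
# {'k': 3, 'filters': {'entities': ['Acme', 'Globex']}}
print(build_search_kwargs([]))
# {'k': 3}
```

The resulting dictionary would typically be handed to the vector store's `as_retriever(search_kwargs=...)`. Leaving the filter out when no entity is detected is a design choice of this sketch, so that such questions still retrieve context.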

Next, let’s set up the prompts for our LangChain chain.
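The authors' prompt is not reproduced here. As an illustration only, a grounded-answer prompt could be built like this with plain string formatting. The wording, the chunk shape, and the function names are hypothetical, and in the actual chain the equivalent would usually be a LangChain prompt template.

```python
PROMPT_TEMPLATE = (
    "You are an assistant that answers questions using only the context below.\n"
    "If the context does not contain the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def format_context(chunks):
    """Join retrieved chunks, tagging each with its source file."""
    return "\n\n".join(f"[{chunk['source']}] {chunk['text']}" for chunk in chunks)


def build_prompt(question, chunks):
    """Fill the template with the retrieved context and the user question."""
    return PROMPT_TEMPLATE.format(context=format_context(chunks), question=question)


example_chunks = [{"source": "a.pdf", "text": "Acme reported record revenue."}]
print(build_prompt("How did Acme perform?", example_chunks))
```

Tagging each chunk with its source makes it easier to ask the model to cite where an answer came from.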

Finally, we can set up the LangChain chain itself.
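To make the data flow of the chain concrete, here is a framework-free sketch in which each stage is an injected function. This is not LangChain code and not the authors' implementation. All names are hypothetical, and the stand-in stages in the usage example exist only so the sketch runs on its own.

```python
def make_chain(extract_entities, retrieve, build_prompt, call_model):
    """Compose the RAG steps: entities -> filtered retrieval -> prompt -> answer."""
    def chain(question):
        entities = extract_entities(question)      # e.g. via Tonic Textual
        chunks = retrieve(question, entities)      # e.g. via Vector Search
        prompt = build_prompt(question, chunks)    # fill the prompt template
        return call_model(prompt)                  # e.g. a served chat model
    return chain


# Stand-in stages for illustration only.
documents = [
    {"text": "Acme reported record revenue.", "entities": ["Acme"]},
    {"text": "Globex opened a new office.", "entities": ["Globex"]},
]

chain = make_chain(
    extract_entities=lambda q: [w.strip("?.,") for w in q.split() if w[0].isupper()],
    retrieve=lambda q, ents: [d for d in documents if set(ents) & set(d["entities"])],
    build_prompt=lambda q, chunks: " ".join(c["text"] for c in chunks) + " | " + q,
    call_model=lambda prompt: f"ANSWER({prompt})",
)
print(chain("how did Acme perform?"))
# ANSWER(Acme reported record revenue. | how did Acme perform?)
```

In the LangChain version, the same four stages are wired together as runnables, which is what lets MLflow log each step of the chain.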

Deploying the application

To deploy our application, go back to the original notebook and run the following code:
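The deployment cell is not reproduced here. The sketch below shows one common way to log, register, and deploy a chain with MLflow and the `databricks-agents` package. The helper names and the run name are hypothetical, the imports are deferred so the definitions load outside Databricks, and argument names may differ between library versions, so treat the calls as assumptions to verify against the current docs.

```python
def full_model_name(catalog, schema, model):
    """Unity Catalog models use a three-level name: catalog.schema.model."""
    return f"{catalog}.{schema}.{model}"


def log_and_deploy(chain_path, chain_config, model_name, input_example):
    """Log the chain notebook as an MLflow model, register it, and deploy it."""
    # Imported lazily: both packages are only needed when actually deploying.
    import mlflow
    from databricks import agents

    mlflow.set_registry_uri("databricks-uc")
    with mlflow.start_run(run_name="tonic_rag_chain"):
        logged = mlflow.langchain.log_model(
            lc_model=chain_path,          # path to the Tonic_Chain notebook
            model_config=chain_config,    # the configuration saved earlier
            artifact_path="chain",
            input_example=input_example,
        )
    registered = mlflow.register_model(model_uri=logged.model_uri, name=model_name)
    return agents.deploy(model_name, registered.version)


print(full_model_name("tonic_demo", "rag", "tonic_rag_chatbot"))
# tonic_demo.rag.tonic_rag_chatbot
```

The value returned by the deploy call describes the deployment, including where the app can be queried.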

After this is done running, we should have a functional app that you can query via the URL provided in the notebook output.

Conclusion

By using Tonic Textual and Databricks together, you can create a high-quality RAG app with your company’s data. Using Textual, you can ingest all your files in S3 into a common format for RAG. While not a focus of this post, Textual can also automatically detect and redact sensitive information in your unstructured data, allowing you to protect the privacy of the data used by your AI systems. Then, with the Mosaic AI Agent Framework, you can query the ingested data and deploy a RAG app on top of it. Together, the two increase your RAG system’s accuracy: Textual’s generated metadata filters out irrelevant documents in your vector database, so each query runs over a smaller, more relevant subset of the data.

To learn more about the Databricks Mosaic AI Agent Framework, check out the docs. To learn more about Tonic Textual, go to our website and sign up for our generous free trial.

Prasad Kona
Lead Partner Solutions Architect at Databricks
Prasad Kona is a respected advisor and thought leader with a proven track record in developing and implementing sophisticated data analytics and AI strategies for customers and partners. He currently works as a Senior Staff Solution Architect at Databricks, where he contributes to the growth of the Databricks technology partner ecosystem.
