
The challenges of preparing unstructured data for Generative AI

Joe Ferrara, PhD
August 28, 2024

    TL;DR: Extracting text from unstructured documents across a variety of file types can be a lot of work, requiring many different Python packages. Transforming the extracted text into a uniform format suitable for ingestion by Generative AI systems is even more work. Tonic Textual solves these problems with an easy, streamlined solution: connect Tonic Textual to your data source, let it do the data prep for you, and then access the extracted and formatted text with just a few lines of Python code.

    Picture this: you’re all set to take advantage of the huge Generative AI hype. You’ve been dreaming about creating an LLM Agent to make a particularly annoying business process more efficient, and you’re sure completing this project will help the company hit the big KPIs and OKRs leadership has set. So, you tell your Junior Data Scientist to pull all the unstructured data the company has related to the business process. You’re going to set up an experiment to determine whether it’s better to fine-tune an LLM or to use RAG for the LLM Agent to complete its tasks.

    Seems easy enough with all of the open-source frameworks, tools, and models available to help developers build AI tools. But there’s one catch: for AI models to deliver their promised benefits to enterprises, you have to give them your data in the proper format, a problem that, until now, hasn’t had an elegant solution.

    Dealing with the data

    Your Junior Data Scientist lets you know that he’s put all the relevant data into an S3 bucket for you. You finish a couple of meetings, and you’ve got a free hour to dig into the data before more meetings. You open up a Jupyter notebook on one screen and start scrolling through the S3 bucket on another. In the S3 bucket you’ve got pdf, jpeg, docx, csv, xlsx, and txt files. You’d like to start understanding what exactly is in these files, and perhaps do some light analysis and processing of the text data, before putting it into a training pipeline to fine-tune Llama 3.1 or chunking the data for RAG.

    Getting text data out of documents

    It’s easy enough to work with txt files in Python, and Pandas is a good solution for the csv and xlsx files. But what about docx, pdf, and jpeg files? A docx file is really just zipped XML with the text inside, so there must be a Python package that pulls the text out for you. Google says python-docx will work, so great.

    Now for the pdfs. You know the pdfs don’t have too many images, so a super fancy OCR model shouldn’t be needed for them. Google to the rescue again: PyMuPDF looks good.

    Lastly, the jpeg files. These are images with text in random places, much less text-heavy than the pdfs, so an OCR model built on an image neural network is probably the best way to extract this text. You decide to use one of the out-of-the-box models in keras-ocr for the jpeg files.

    An attempt using Python packages

    After pip-installing each of these Python packages, you grab an example file of each file type and start hammering away in the Jupyter notebook (probably starting with code from ChatGPT or Claude 3.5 if you’re cool) to extract the text from each file. After messing around for a while with your Python version and package dependencies to get each package working correctly on your laptop (tensorflow 2.15 is needed for keras-ocr, but tensorflow 2.15 doesn’t work with Python 3.12, ugh), you eventually get this meaty Python snippet working:
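    (What follows is a rough sketch of that snippet; the "./data" directory stands in for wherever you synced the S3 bucket locally, and extract_text is just an illustrative helper name.)

```python
import glob
import os

import fitz  # PyMuPDF
import keras_ocr
import pandas as pd
from docx import Document

# keras-ocr pipeline: a pretrained text detector + recognizer
ocr_pipeline = keras_ocr.pipeline.Pipeline()

def extract_text(path: str) -> str:
    """Pull raw text out of a single file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".txt":
        with open(path, encoding="utf-8") as f:
            return f.read()
    if ext == ".csv":
        return pd.read_csv(path).to_markdown(index=False)
    if ext == ".xlsx":
        return pd.read_excel(path).to_markdown(index=False)
    if ext == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if ext == ".pdf":
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if ext in (".jpg", ".jpeg"):
        image = keras_ocr.tools.read(path)
        # recognize() returns, per image, a list of (word, box) predictions
        predictions = ocr_pipeline.recognize([image])[0]
        return " ".join(word for word, _ in predictions)
    raise ValueError(f"Unsupported file type: {ext}")

# "./data" stands in for wherever you synced the S3 bucket locally
texts = {path: extract_text(path) for path in glob.glob("./data/*")}
```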

    This Python snippet will extract the raw text from each file type, but it loses all the formatting of the text, which is particularly important for the pdf and docx files. It also drops things like headers, footers, images, and page numbers in the pdf and docx files. The csv and xlsx files are converted to markdown to work with as strings, but what about tables that appear in the pdf files? Are they going to be in markdown? You have no idea. This code also took a while to write, and as currently written it won’t scale to the large volume of files your AI system will need to perform at a sufficient level.

    Before you know it, the Google Calendar alert comes in warning you that the next block of meetings is about to start. You know the 10x engineer you used to be could have done this a lot faster and would already be analyzing the text in the files instead of fiddling around trying to get the text out of them.

    You’re feeling a bit like:

    [Image: Ben Affleck smoking a cigarette, looking exhausted and disappointed]

    You were hoping to have the text from these files in a training pipeline or chunked for RAG already, but you haven’t even started doing analysis on the text.

    Tonic Textual to the rescue

    After you get out of your meetings, your Junior Data Scientist mentions a tool called Tonic Textual. It’s a privacy-focused platform designed to standardize and protect unstructured data for AI development and LLM training: it extracts the text from your different unstructured documents and prepares that text for downstream Generative AI tasks like fine-tuning and RAG. It even has its own state-of-the-art NER models that you can use to tag important information and redact private data that appears in your text.

    All that work writing Python to extract the text from the documents is replaced with creating a Pipeline in the Tonic Textual UI that points at your existing S3 bucket with all your unstructured documents and then clicking “Run Pipeline.” Running the pipeline extracts the text from all the documents and normalizes it into a uniform markdown format. It captures headers, footers, page numbers, tables, and formatting like italics, bold, and section titles. It even does a good job of extracting information from charts and tables in pdf files.

    After the pipeline is done running, this short Python snippet pulls the text from each of the files into your Jupyter notebook:
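    (A rough sketch using the tonic-textual Python SDK is below; the exact class and method names can differ by SDK version, so treat them as assumptions and check the Textual docs.)

```python
# pip install tonic-textual
# Assumes your Textual API key is available, e.g. via the TONIC_TEXTUAL_API_KEY
# environment variable. Class and method names follow the SDK's parse API but
# may vary by version -- check the Tonic Textual docs for your release.
from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("https://textual.tonic.ai")

# Grab the pipeline you created in the UI (here, simply the first one returned)
pipeline = textual.get_pipelines()[0]

# Each parsed file exposes its normalized text as markdown
md_strings = [parsed_file.get_markdown() for parsed_file in pipeline.enumerate_files()]
```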

    The text from the docx, pdf, jpg, xlsx, and csv files is all put into a markdown format. All of the tables, whether from pdfs or from csv files, are markdown tables. Using Tonic Textual has you feeling like:

    [Image: Ben Affleck carrying an order of Dunkin' Donuts with a smile on his face]

    You’re rolling now, ready to analyze the text and put it into a fine-tuning pipeline or RAG system. You don’t even mind that you’re staying late after hours to do it.

    For your fine-tuning task, you’ll use the code snippet from above to extract each document as a markdown string and fine-tune Llama 3.1 70B directly on those markdown strings, along the lines of the sketch below.
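    To make that concrete, here’s a minimal sketch of packaging those markdown strings into a Hugging Face Dataset that a standard fine-tuning script could consume (the single "text" column is a common convention for SFT scripts, not something Textual requires):

```python
from datasets import Dataset

# md_strings is the list of markdown documents pulled from the pipeline above
train_dataset = Dataset.from_dict({"text": md_strings})

# Most supervised fine-tuning scripts (e.g. ones built on TRL's SFTTrainer)
# can consume a dataset with a single "text" column directly.
train_dataset.save_to_disk("llama31_finetune_data")
```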

    For your RAG system, you’ll use the Tonic Textual SDK to chunk the documents, and then use your favorite embedding model to embed the chunks and load them into a vector database. Here are some additional things you can do using the Textual SDK.
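    In rough strokes, that flow looks like the sketch below; the naive character splitter is only a stand-in for the Textual SDK’s chunking, and sentence-transformers is one embedding model among many (both choices are assumptions, not requirements):

```python
from sentence_transformers import SentenceTransformer

# Stand-in for the Textual SDK's chunking; a real setup would use the SDK's
# chunker (or any splitter that respects the markdown structure).
def naive_chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# md_strings is the list of markdown documents pulled from the pipeline above
chunks = [chunk for md in md_strings for chunk in naive_chunk(md)]

# Any embedding model works here; all-MiniLM-L6-v2 is a small, common default
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)

# embeddings (one vector per chunk) are now ready to load into your vector
# database of choice alongside the chunk text and source-file metadata.
```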

    The takeaway

    Now you’re ready to evaluate your LLM Agent to see if it can really help you hit those KPIs and OKRs. If this idea doesn’t pan out, at least you didn’t have to struggle through the excruciating data preparation work required to get any LLM application off the ground. After all, the company hired you to work creatively with LLMs and run experiments, not to extract text from unstructured data and build your own data pipelines. Leave that to Tonic Textual. That Junior Data Scientist who told you about Tonic Textual probably deserves a raise.

    Connect with our team to learn more about leveraging Tonic Textual for your AI use cases, or sign up for an account today.

    Joe Ferrara, PhD
    Senior AI Scientist
    Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.
