Creating unstructured data pipelines for Retrieval Augmented Generation

Author
Janice Manwiller
August 30, 2024

    Our initial release of Tonic Textual focused on generating redacted versions of unstructured text and image files. That workflow suits companies that want to use their unstructured data safely in downstream workflows, including AI model training.

    Working closely with our early customers, we learned that data privacy is not the only obstacle. Preparing the data for use with generative AI tools is also a major impediment that slows time-to-value for enterprise AI use cases.

    We therefore expanded Tonic Textual’s functionality to serve that new use case alongside the first. You can take your unstructured data from raw to AI-ready in just a few minutes, while you ensure that sensitive data is protected.

    The Textual pipeline workflow is Textual’s newest capability. It takes the same types of source files and produces Markdown versions of your documents that you can import into a vector database for retrieval-augmented generation (RAG). Textual’s built-in redaction and synthesis features let you ensure that your RAG content does not include sensitive values. You can also use the entities that Textual detects to enrich your embeddings with additional information that helps to improve retrieval.

    About the pipeline process

    A Textual pipeline is a collection of files—including plain text files, Word documents, PDFs, Excel spreadsheets, images, and more. A pipeline can process either files that you upload from a local filesystem, or files and folders that you select from cloud storage.

    A diagram illustrating the Tonic Textual unstructured data pipeline

    After you create your pipeline and select your files, Textual produces the RAG-ready output in four steps. Textual:

    1. Extracts the raw text from the files.
    2. Detects the entities in the files. These are the same types of entities that are redacted or synthesized in the Textual redaction workflow.
    3. Converts the extracted text to Markdown.
    4. Generates JSON files that contain the detected entity list and the Markdown content.
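
The four steps above can be sketched in plain Python. This is a toy illustration and not Textual's implementation: the regex-based detector, the function names, and the JSON field names (`file`, `markdown`, `entities`, `label`, and so on) are all assumptions made for the example.

```python
import json
import re

# Hypothetical stand-in for entity detection. Textual uses its own
# detection models; these two regexes exist only to make the sketch run.
ENTITY_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_text(raw: bytes) -> str:
    """Step 1: extract raw text (here, just decode a plain text file)."""
    return raw.decode("utf-8")


def detect_entities(text: str) -> list[dict]:
    """Step 2: find entities; offsets refer to the extracted text."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"label": label, "text": match.group(),
                          "start": match.start(), "end": match.end()})
    return sorted(found, key=lambda e: e["start"])


def to_markdown(text: str, title: str) -> str:
    """Step 3: convert to Markdown (a title heading plus paragraphs)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join([f"# {title}", *paragraphs])


def process_file(name: str, raw: bytes) -> str:
    """Step 4: bundle the Markdown and the entity list into JSON."""
    text = extract_text(raw)
    entities = detect_entities(text)
    markdown = to_markdown(text, title=name)
    return json.dumps(
        {"file": name, "markdown": markdown, "entities": entities}, indent=2
    )


print(process_file("note.txt", b"Contact jane@example.com or call 555-123-4567 today."))
```

The real pipeline handles many more file types and entity types; the sketch only shows how the stages feed one another.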

    Viewing the file processing results in Textual

    For each processed file, the Textual application provides views of the results.

    Original file content

    The file details include the file content in both Markdown:

    File content shown in Markdown

    and in a rendered format:

    File content shown in a rendered format

    Detected entities in the file

    Textual also displays a version of the file text that highlights the detected entities in the file. For each detected entity, Textual also displays the entity type—names, identifiers, addresses, and so on.

    Display of the entity types identified in Tonic Textual

    JSON output

    Textual then provides the JSON output, which includes both the Markdown content and the list of detected entities.
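
To give a feel for working with output of this kind, here is a minimal sketch that loads one JSON document and groups the detected entity values by type, for example to attach them as metadata to your RAG content. The sample document and its field names (`markdown`, `entities`, `label`, `text`) are hypothetical; check the actual Textual output for the real schema.

```python
import json
from collections import defaultdict

# Hypothetical output for one file. The real Textual schema may differ;
# the field names here are assumptions for illustration only.
SAMPLE_JSON = """
{
  "markdown": "# Visit summary\\n\\nJane Doe was seen on 2024-08-30.",
  "entities": [
    {"label": "NAME", "text": "Jane Doe"},
    {"label": "DATE_TIME", "text": "2024-08-30"}
  ]
}
"""


def entities_by_type(document: dict) -> dict[str, list[str]]:
    """Group detected entity values under their entity type."""
    grouped = defaultdict(list)
    for entity in document["entities"]:
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


document = json.loads(SAMPLE_JSON)
metadata = entities_by_type(document)
print(metadata)  # {'NAME': ['Jane Doe'], 'DATE_TIME': ['2024-08-30']}
```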

    JSON output in Tonic Textual

    Tables and key-value pairs

    For PDFs and images, Textual also displays any tables and key-value pairs that are present in the file.

    Display of tables found in PDFs and images, in Tonic Textual
    Display of key-value pairs found in PDFs and images, in Tonic Textual

    Retrieving and using the results

    You can download the JSON files from Textual, or retrieve them directly from your cloud storage.

    You can also use the Textual Python SDK to retrieve pipelines and pipeline results.

    When you use the SDK to retrieve the processed text, you can also specify how to present each type of detected entity. For example, you can redact names and identifiers and synthesize addresses and datetime values. This helps to ensure that your RAG content does not contain sensitive data.
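
The idea of handling each entity type differently can be illustrated without the SDK. The sketch below is not the Textual API: the policy table, the entity labels, the fake-value pools, and the function names are all hypothetical. It redacts names and identifiers and swaps addresses and datetime values for consistent fake values.

```python
import hashlib

# Hypothetical per-type policy: redact names and identifiers,
# synthesize addresses and datetime values.
POLICY = {
    "NAME": "redact",
    "IDENTIFIER": "redact",
    "ADDRESS": "synthesize",
    "DATE_TIME": "synthesize",
}

# Tiny pools of fake values; a real synthesizer is far more sophisticated.
FAKE_VALUES = {
    "ADDRESS": ["12 Elm Street", "48 Oak Avenue", "7 Pine Road"],
    "DATE_TIME": ["2001-01-15", "2003-06-02", "2010-11-23"],
}


def replacement_for(label: str, value: str) -> str:
    """Return the text that replaces one detected entity."""
    action = POLICY.get(label, "keep")
    if action == "redact":
        return f"[{label}]"
    if action == "synthesize":
        pool = FAKE_VALUES[label]
        # Hash the original so the same input always maps to the same fake.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return pool[digest[0] % len(pool)]
    return value  # Entity types without a policy are left unchanged.


def apply_policy(text: str, entities: list[dict]) -> str:
    """Rewrite text using entity spans given as start/end offsets."""
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        new = replacement_for(ent["label"], text[ent["start"]:ent["end"]])
        text = text[:ent["start"]] + new + text[ent["end"]:]
    return text


text = "Jane Doe moved to 5 Main St."
entities = [
    {"label": "NAME", "start": 0, "end": 8},
    {"label": "ADDRESS", "start": 18, "end": 27},
]
print(apply_policy(text, entities))
```

Replacing from the last span backward is a deliberate choice: a replacement that changes the text length would otherwise shift the offsets of every entity after it.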

    Typical uses of the pipeline output include creating RAG chunks and adding the content to a vector retrieval system.
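
As one illustration of chunking and retrieval, the sketch below splits Markdown at headings and stores the chunks in a tiny in-memory index. Everything here is an assumption for demonstration: the bag-of-words `embed` function stands in for a real embedding model, and `InMemoryIndex` stands in for a real vector database. Neither is part of Textual.

```python
import math
import re
from collections import Counter


def chunk_markdown(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # Naive heading test: also fires on '#' lines inside code blocks.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Minimal stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows = []  # (embedding, chunk) pairs

    def add(self, chunk: str) -> None:
        self.rows.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


markdown = (
    "# Billing\n\nInvoices are sent monthly.\n\n"
    "# Support\n\nContact support by email."
)
index = InMemoryIndex()
for chunk in chunk_markdown(markdown):
    index.add(chunk)
print(index.search("how do I contact support"))
```

Heading-based chunking keeps each section's text together with its title, which tends to suit Markdown output. Fixed-size or overlapping chunks are equally valid choices depending on your data.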

    Recap

    The Textual pipeline workflow takes unstructured text and image files and produces Markdown-based content that you can use to populate a vector database for RAG.

    The pipeline scans its files for entities and generates JSON output that contains the generated Markdown and the list of detected entities.

    From Textual, you can view and download the results. You can also use the Textual Python SDK to retrieve pipelines and pipeline results. The SDK includes options to redact or synthesize the detected entities in the returned results.

    From here, it’s “choose your own AI adventure”: you decide how best to use the data for RAG, and there are many possibilities. Chunk and embed the content using the strategy that works best for your data and use case, and then load it into your preferred vector database through that database’s API.

    Connect with our team to learn more, or sign up for an account today.

    Janice Manwiller
    Principal Technical Writer
    Janice Manwiller is the Principal Technical Writer at Tonic.ai. She currently maintains the end-user documentation for all of the Tonic.ai products and has scripted and produced several Tonic video tutorials. She's spent most of her technical communication career designing and developing information for security-related products.
