For detailed instructions on how to set up and install the containers on Snowpark, see this GitHub repository. If you encounter any issues, file a GitHub issue and we’ll get back to you quickly.
Tonic Textual provides advanced Named Entity Recognition (NER) and synthetic replacement of sensitive free-text data. It is used to safely train AI models on sensitive private data, preventing data leakage through your AI models. Today, we are excited to announce that Tonic Textual is now available on the Snowflake Data Platform via Snowpark Container Services (SPCS). SPCS enables you to run containerized workloads directly within Snowflake, ensuring that your data doesn’t leave your Snowflake account for processing.
This means that you can take advantage of Textual’s state-of-the-art NER models without ever moving your data out of Snowflake’s secure environment. If you are storing unstructured text data in Snowflake and want to use this data for model training, fine-tuning, or RAG applications, Tonic Textual can help you do that safely at scale while maintaining data utility and compliance.
Architecture
Because of SPCS, all of the processing happens on Textual services running on Snowflake’s clusters, so your data never leaves the secure confines of Snowflake.
A basic example
For this example, we’ll set up a single-node compute pool in Snowpark. It’ll use a GPU_NV_S instance, which is Snowpark’s smallest and most cost-efficient GPU instance. It has a single NVIDIA A10G GPU with 24GB of GPU memory, along with 8 vCPUs and 32GB of non-GPU RAM. We’ll run our service on top of this compute pool and disable auto-scaling by setting the MIN_INSTANCES and MAX_INSTANCES counts to 1.
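The pool and scaling settings described above can be sketched as the Snowflake statements below, held in Python strings so they can be sent through whichever client you use. This is a minimal sketch, not the setup script from the repository: the names `textual_pool` and `textual_service` are hypothetical, and it assumes the node count is pinned on the compute pool with `MIN_NODES`/`MAX_NODES` while the instance count is pinned on the service with `MIN_INSTANCES`/`MAX_INSTANCES`.

```python
def compute_pool_ddl(pool_name: str, instance_family: str = "GPU_NV_S", nodes: int = 1) -> str:
    """Render a CREATE COMPUTE POOL statement pinned to a fixed node count."""
    if nodes < 1:
        raise ValueError("a compute pool needs at least one node")
    return (
        f"CREATE COMPUTE POOL IF NOT EXISTS {pool_name}\n"
        f"  MIN_NODES = {nodes}\n"
        f"  MAX_NODES = {nodes}\n"
        f"  INSTANCE_FAMILY = {instance_family};"
    )


def pin_service_instances_ddl(service_name: str, instances: int = 1) -> str:
    """Render an ALTER SERVICE statement that sets min and max instances to the same value."""
    if instances < 1:
        raise ValueError("a service needs at least one instance")
    return (
        f"ALTER SERVICE {service_name} "
        f"SET MIN_INSTANCES = {instances} MAX_INSTANCES = {instances};"
    )


# Hypothetical object names, for illustration only.
POOL_DDL = compute_pool_ddl("textual_pool")
SERVICE_DDL = pin_service_instances_ddl("textual_service")

print(POOL_DDL)
print(SERVICE_DDL)
```

Setting the minimum and maximum to the same value is what turns auto-scaling off: there is no range for Snowflake to scale within.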
Let’s start with a few simple text examples, which we’ll pass to Textual directly without loading any data into a table:
This returns a single result:
My name is [NAME_GIVEN_czg72], and [DATE_TIME_joVVM9] I am demo-ing [PRODUCT_uRLPiR3X], a software product created by [ORGANIZATION_QDeGw5]
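Each bracketed placeholder in that result pairs an entity label with a short token. If you want to audit what was redacted, a small parser can pull the labels back out. The sketch below assumes the placeholder shape `[LABEL_token]`, which is inferred only from the sample output above and is not a documented format; the helper name `find_placeholders` is hypothetical.

```python
import re
from collections import Counter

# Assumed shape, inferred from the sample output only: [<LABEL>_<token>],
# where the label is upper-case words joined by underscores and the token
# is alphanumeric. Backtracking makes the split happen at the last underscore.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)_([A-Za-z0-9]+)\]")


def find_placeholders(text: str) -> list[tuple[str, str]]:
    """Return (label, token) pairs for every placeholder found in the text."""
    return PLACEHOLDER.findall(text)


redacted = (
    "My name is [NAME_GIVEN_czg72], and [DATE_TIME_joVVM9] I am demo-ing "
    "[PRODUCT_uRLPiR3X], a software product created by [ORGANIZATION_QDeGw5]"
)

found = find_placeholders(redacted)
label_counts = Counter(label for label, _ in found)

print(found)
print(label_counts)
```

For the sample string, this finds four placeholders, one each for `NAME_GIVEN`, `DATE_TIME`, `PRODUCT`, and `ORGANIZATION`.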
We can see the original and redacted text side by side:
This will now disable DATE_TIME detection and synthesize names while redacting everything else Textual identifies as sensitive in the string. This configuration yields:
A longer example, using a table
Let’s create a toy Snowflake table that holds some conversational data. The following code will create a Snowflake table representing a transcription of a customer support conversation:
In this table, the conversation is broken up into multiple rows. Alternatively, each conversation could live entirely in a single row, or the data could be stored along with metadata in a JSON blob. Either way, Textual will support it!
This will return a single column of redacted snippets. We can take it a step further and create a new table (or perhaps even a materialized view). Because the sensitive information is removed, this table can be safely shared with lower environments for downstream use cases such as model training or analytics.
This query will give you an entirely new table of redacted data. Converting this to a view or materialized view is just as easy.
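To make the table-versus-view choice concrete, here is a minimal sketch that renders both statements from the same inputs. Everything named here is a placeholder rather than something taken from the post: `TEXTUAL_REDACT` stands in for whatever redaction function your installation exposes, and `conversations`, `conversations_redacted`, and `text` are hypothetical table and column names.

```python
def redacted_copy_sql(
    source_table: str,
    text_column: str,
    target: str,
    as_view: bool = False,
    redact_fn: str = "TEXTUAL_REDACT",  # hypothetical function name
) -> str:
    """Render a CREATE TABLE/VIEW ... AS SELECT that redacts one text column."""
    kind = "VIEW" if as_view else "TABLE"
    return (
        f"CREATE OR REPLACE {kind} {target} AS\n"
        f"SELECT {redact_fn}({text_column}) AS {text_column}\n"
        f"FROM {source_table};"
    )


# Hypothetical object names, for illustration only.
TABLE_SQL = redacted_copy_sql("conversations", "text", "conversations_redacted")
VIEW_SQL = redacted_copy_sql("conversations", "text", "conversations_redacted_v", as_view=True)

print(TABLE_SQL)
print(VIEW_SQL)
```

A table stores a one-time snapshot of the redacted text, whereas a view re-runs the redaction each time it is queried, so the table is usually the better fit when the copy is handed off to another environment.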
The Takeaway
Snowflake is widely known as one of the most secure cloud data stores on the market. Because of this, you trust Snowflake with your organization’s most sensitive data. By deploying Tonic Textual on Snowflake’s clusters using SPCS, your data stays in Snowflake’s secure confines, maintaining data security while still getting the benefits of Textual’s state-of-the-art NER models and synthetic data engine. The combination of SPCS and Tonic Textual makes it safe and easy to redact and synthesize text for training AI models without fear of data leakage.
Have sensitive text data on Snowflake? Reach out to us for access to Textual on Snowpark at: textual@tonic.ai.