Technical deep dive

How to mask data for testing in MongoDB

Author

Chiara Colombi

Author

April 26, 2022

Masking data for safe, compliant use in testing environments is not as straightforward as it seems, especially when using schemaless, unstructured databases like MongoDB. Personally identifiable information (PII) can be scattered throughout your document-based data in ways that are hard to predict—so hard that it simply isn’t a challenge most teams offering data de-identification solutions are willing to take on. (Spoiler alert: Tonic.ai isn’t “most teams.”)

Let’s explore how to mask data for testing in MongoDB, plus what makes this such a challenging nut to crack. Then, we’ll wrap up with a quick demonstration of how to mask MongoDB data using Tonic.

MongoDB’s Masked Data Challenges

MongoDB is a NoSQL database system that stores data as documents. Its document database system operates without structure or schema, and each copy can contain numerous types of data as the data level progresses.

MongoDB’s flexible storage system enables efficiency for scaling apps because it stores large amounts of data within clusters defined in millions of nodes. However, this document-based storage system presents significant challenges when de-identifying and masking data.

Challenge 1: Unstructured data

The first challenge is its unstructured nature. Since MongoDB is schemaless, each field in a data collection can represent any one of the various data types. Furthermore, this type can change with each document level. A field may exist as an integer in one document level and a string in another. This lack of consistency presents an obstacle with masking data and generating production-like data for testing.

Challenge 2: MongoDB’s JSON storage format

MongoDB’s JSON storage format poses another data masking challenge. The JSON format houses various forms of data, from names to license plate numbers and other types that are less easily quantifiable. These high-level nested document fields create complex hierarchies, which complicate the level of granularity needed to generate realistically represented test data.

Challenge 3: Time required to build

Even for a relational database management system (RDBMS), it takes a significant amount of time and resources to create an infrastructure capable of generating test data that perfectly mimics production data. MongoDB’s various formats and versions significantly increase that time. Your de-identification infrastructure requires generators that can track and mask each version and document format. Hardly a walk in the park.

Finding an Effective Data Masking Solution

Generating MongoDB data is challenging because an effective solution needs to:

Detect and locate PII across each document in the entire collection.
Mask the data according to type, even though key types may vary across document levels with the same key.
Provide complete visibility into your document collection to observe and check each document during the generation process.

Assume a sample analytics collection with multiple documents contains documents A and B with a plate_number key. However, the plate_number key in document A is an integer, while the one in document B is a string. An efficient solution should know that it needs to mask the integer plate_number key with an integer and the string key with a string.

How Tonic Masks MongoDB Data

Tonic enables aggregating document collections in MongoDB to de-identify sensitive information and generate realistic, useful document-based data while eliminating the risk of PII slipping through. The Tonic interface provides a comprehensive view of the entire data generation process.

Let’s take a look at how it works:

A schemaless data-capturing method

Tonic curbs the complexity of document storage databases by employing a schemaless data-capturing method. We create a hybrid document model representing the entirety of the documents in your collection, then transfer the model to lower environments — like your staging environment. After connecting, Tonic scans the source database and automatically creates this hybrid document while capturing all edge cases and leaving no room for PII leaks.

Granular NoSQL data

By mixing and matching our 50+ generators, the platform masks NoSQL data with a high degree of granularity to mirror the complexity of your data. Regardless of how unstructured or varied your document database is, Tonic can accommodate, masking several data types in a document — even within the same fields.

Then, Tonic’s user-friendly interface enables you to preview the newly masked data, giving you a comprehensive view of each document so you can refine your data along the way.

Consistent data across databases

Tonic’s cross-database support helps achieve consistent test data throughout your data ecosystem. By connecting natively to other database types like Redshift, PostgreSQL, and Oracle and matching input-to-output generated data types across these databases, Tonic produces realistic test data that preserves relationships across databases.

Additionally, organizations can use Tonic with Mongo as a NoSQL interface, freeing teams to mask data stored in homegrown, document-based DB solutions.

Now that we understand the value Tonic brings to the question of how to mask data for testing in MongoDB, let’s explore how to put it into action.

How to mask PII data in MongoDB with Tonic

Integrating with MongoDB is simple — Tonic can connect natively to MongoDB via an Atlas connection string.

First, create a destination database in MongoDB to store the generated data. We’ll call our database tonic_mongo_integration.

Then, log into the Tonic platform and create a workspace. The workspace stores your connection information, data jobs, and other relevant information.

Within your workspace configuration, enter your MongoDB Connection String for your source and your destination. These connection strings enable Tonic to grab the data from your source Mongo instance and then store masked data in your destination Mongo instance. Next, enter the name of your MongoDB Database that will hold the fake data.

‍

‍

Then, hop into the Privacy Hub in the left side menu. The Privacy Hub performs a scan of all the documents in your collection. It flags data that it identifies as sensitive and makes recommendations for which generators to apply to protect that data.

‍

You can apply generators directly within Privacy Hub or click into the Collection View in the left side menu to see your hybrid document in full.

In the Collection drop-down menu, select customers. This gives a preview of the collection’s fields. You can set the view mode as Single Document, which then shows each document in your collection.

‍

‍

Alternatively, you can view a Hybrid Document that shows all of the fields that appear across the documents in the collection.

‍

‍

Once you’ve applied the generators you need to safely and realistically mask your sensitive data, it’s time to generate. Click Generate Data in the upper right.

‍

‍

To view the status of your data generation, click Jobs in the left-hand side menu. We’re showing a Completed status, so let’s check our database and compare our MongoDB data.

‍

‍

Here’s the original data in our source DB. Note the various address fields.

‍

‍

And here’s the fake data in our destination DB. Notice the new, realistic address fields.

‍

‍

And there you have it! Real fake document-based data in MongoDB for all your testing needs.

How to mask data for testing in MongoDB with Tonic

Tonic’s unique data masking solution for MongoDB enables developers to test their products efficiently with high-quality, realistic document-based data.

But don’t take our word for it. Check out what eBay had to say on the eBayTech blog about Tonic’s role in creating high-quality NoSQL data for their staging environments. If you’re looking for a similar solution for your NoSQL data, let’s chat.

Want to make your data usable?

Unblock product innovation with safe, high-fidelity data de-identification and synthesis.

Book a demo

Chiara Colombi

Director of Product Marketing

A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.