MongoDB is a NoSQL database platform that uses collections of documents to store data rather than tables and rows like most traditional Relational Database Management Systems (RDBMS). It derives its name from the word 'humongous' — 'mongo' for short. It is an open source database with options for free, enterprise, or fully managed Atlas cloud licenses.
Development on MongoDB began as early as 2007 with plans to release a platform as a service (PaaS) product; however, the founding software company 10gen decided instead to pursue an open source model. In 2013, 10gen changed its name to MongoDB Inc. to unify the company with its flagship product. The company went public in 2017.
MongoDB was built with the intent to disrupt the database market by creating a platform that would ease the development process, scale faster, and offer greater agility than a standard RDBMS. Before MongoDB's inception, its founders — Dwight Merriman, Kevin P. Ryan, and Eliot Horowitz — were founders and engineers at DoubleClick. They were frustrated with the difficulty of using existing database platforms to develop the applications they needed. MongoDB was born from their desire to create something better.
As of this writing, MongoDB ranks first on db-engines.com among document stores and fifth among database management systems overall.
Being document-based, Mongo stores data in JSON-like documents of varying sizes that mimic how developers construct classes and objects. MongoDB's scalability can be attributed to its ability to define clusters with hundreds of nodes and millions of documents. Its agility results from intelligent indexing, sharding across multiple machines, and workload isolation with read-only secondary nodes.
While the ease of creating documents to store data in MongoDB is valuable for development, it creates significant challenges when you try to build realistic test data for Mongo. Unlike traditional RDBMS platforms with predefined schemas, MongoDB works with JSON-like documents that are self-contained, each with its own individual definition. In other words, it's schema-less: by default, nothing forces documents in the same collection to share a structure. The elements of each document can develop and change without conforming to the original documents, and their overall structure can vary. A field that contains a string in one document may contain an integer in another.
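To make the mixed-type problem concrete, here is a minimal Python sketch that surveys which value types appear under each field. It assumes the documents have already been loaded as plain dictionaries; the sample documents, field names, and the `field_types` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def field_types(documents):
    """Map each top-level field name to the set of type names seen for it."""
    seen = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return dict(seen)


# Hypothetical documents standing in for one collection.
sample_docs = [
    {"_id": 1, "name": "Ada", "zip": "30301"},
    {"_id": 2, "name": "Grace", "zip": 94105, "tags": ["vip"]},
]

types_by_field = field_types(sample_docs)

# Fields whose values do not share a single type across documents.
mixed = {field: types for field, types in types_by_field.items() if len(types) > 1}
print(mixed)
```

In this sample, only `zip` is reported as mixed, because it is stored as a string in one document and as an integer in the other. A real survey would also need to descend into nested fields.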
The JSON file format itself introduces its own level of complexity. JSON documents have great utility because they can be used to store many types of unstructured data from healthcare records to personal profiles to drug test results. Data of this type can come in the form of physician notes, job descriptions, customer ratings, and other formats that aren't easy to quantify and structure. What’s more, it is often in the form of nested arrays that create complex hierarchies. A high level of granularity is required to ensure data privacy when generating test data based on this data, whether through de-identification or synthesis. If that granularity isn’t achieved, the resulting test data will, at best, fail to accurately represent your production data and, at worst, leak PII into your lower environments.
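One way to reach that level of granularity is to flatten each document into a list of leaf paths, so every nested value can be inspected on its own. The sketch below is an illustration under assumptions, not a description of any particular tool: the `iter_paths` helper, the `[]` notation for array elements, and the sample record are all invented for this example.

```python
def iter_paths(value, prefix=""):
    """Yield (path, leaf_value) pairs for every leaf in a nested document."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{prefix}.{key}" if prefix else key
            yield from iter_paths(child, child_path)
    elif isinstance(value, list):
        # Array elements share one path, marked with "[]".
        for child in value:
            yield from iter_paths(child, f"{prefix}[]")
    else:
        yield prefix, value


# Hypothetical record with nested objects and arrays.
record = {
    "patient": {
        "name": "Jane Roe",
        "visits": [
            {"notes": "follow up in two weeks", "codes": ["A1", "B2"]},
        ],
    }
}

paths = list(iter_paths(record))
for path, leaf in paths:
    print(path, "->", leaf)
```

Each leaf gets an address such as `patient.visits[].notes`, which can then be reviewed for sensitive content no matter how deeply it is nested.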
A high degree of privacy paired with a high degree of utility is the gold standard when generating test data based on existing data. Already it can take days or weeks to build useful, safe test data in-house using a standard RDBMS. The variable nature of MongoDB's document-based data extends that in-house process considerably. It's the wild west out there, and you’d need to build a system capable of tracking every version and format of every document in your database to ensure that nothing is missed—a risky proposition.
It’s also worth noting that there aren’t many tools currently available for de-identifying and synthesizing data in MongoDB. This speaks to the challenges involved—challenges we’re gladly taking on.
Safely generating mock data in a document-based database like MongoDB requires best-in-class tools that can detect and locate PII across documents, mask the data according to its type (even when that type varies within the same field across different documents), and give you complete visibility so you can ensure no stone has been left unturned.
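As a rough illustration of masking by type, the Python sketch below replaces a sensitive value with a deterministic stand-in chosen from the value's runtime type, so a field that holds a string in one document and an integer in another is handled in both cases. This is a simplified sketch under stated assumptions, not Tonic's implementation: the `mask_value` and `mask_document` helpers, the salt, and the hard-coded list of sensitive field names are hypothetical, and hash-based replacement is only one of many possible masking strategies.

```python
import hashlib


def mask_value(value, salt="demo-salt"):
    """Return a deterministic stand-in that keeps the value's type."""
    digest = hashlib.sha256(f"{salt}:{value!r}".encode()).hexdigest()
    if isinstance(value, bool):
        return value  # bool is a subclass of int; left unchanged in this sketch
    if isinstance(value, int):
        width = len(str(abs(value)))
        return int(digest, 16) % (10 ** width)  # at most as many digits as the input
    if isinstance(value, str):
        return digest[: len(value)]  # hex text, same length for inputs up to 64 chars
    return value  # other types pass through untouched


def mask_document(doc, sensitive_fields):
    """Copy a document, masking the named fields at any nesting depth."""
    masked = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            masked[key] = mask_document(value, sensitive_fields)
        elif isinstance(value, list):
            items = []
            for item in value:
                if isinstance(item, dict):
                    items.append(mask_document(item, sensitive_fields))
                elif key in sensitive_fields:
                    items.append(mask_value(item))
                else:
                    items.append(item)
            masked[key] = items
        elif key in sensitive_fields:
            masked[key] = mask_value(value)
        else:
            masked[key] = value
    return masked


# Hypothetical documents: "zip" is a string in one and an integer in the other.
docs = [
    {"_id": 1, "plan": "basic", "contact": {"zip": "30301"}},
    {"_id": 2, "plan": "pro", "contact": {"zip": 94105}},
]
masked_docs = [mask_document(d, {"zip"}) for d in docs]
```

Because the replacement is derived from a hash of the input, the same input always maps to the same output, which keeps repeated values consistent across documents. A production tool would need much more than this, including automatic detection of sensitive fields rather than a hard-coded list.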
At Tonic, we provide an integrated, powerful solution for generating de-identified, realistic data for your test environments in MongoDB. For companies working with data that doesn't fit neatly into rows and columns, Tonic enables aggregating elements across documents to realistically anonymize sensitive information while providing a holistic view of all your data in all of its versions. Here are a few ways we accomplish this goal:
We’re proud to be leading the industry in offering de-identification of semi-structured data in document-based databases. Are you ready to start safely creating mock data that mimics your MongoDB production database? Check out a recording of our June launch webinar, which includes a demo of our Mongo integration. Or better yet, contact our team, and we'll show you the ropes live.