It’s hard to believe it but it’s been a year since we released the original version of Condenser, our open source database subsetting tool. Since then, Condenser has been deployed in a variety of situations, and we’ve learned a bunch along the way, specifically, what you need to make it work best for you. The culmination of this is the release of Condenser 2.0! Rejoice 🎉!
In this post, we’ll explore what we’ve learned about subsetting and the challenges our new release helps you overcome. If you’re new to subsetting and wondering how it could fit into your workflow, here are a few possibilities:
Condenser uses foreign key to traverse your database and grab only what’s necessary to extract the subset you’re looking for. In other words, if you have a database (🎂) that you’d like just a piece of (🍰), Condenser can make it for you.
Sweet, right? Sweet…but not actually a piece of cake.
Oh so many challenges. We explored the core questions in our original post announcing Condenser a year ago. Here, we’ll focus on the issues we encountered that led to the release of 2.0.
When it comes right down to it, this is the classic conundrum of communication. What you say isn’t always what your listener hears. Or, in our case, what we asked for in an algorithm wasn’t always what our program returned.
Condenser was originally released with an Iterative Subset algorithm. In short, Iterative Subset always starts at one end of a topological sort of your database’s tables and works its way across the sort to fill every table. It searches for the optimal subset iteratively by changing the rows in the starting table. Iterative Subset stops when some criteria has been met. This was purposefully vague because you can, in principle, provide almost any criteria. The main advantages of Iterative Subset are that it’s fast and you can express flexible termination criteria.
In practice, however, we have since found Direct Subset to be consistently better at creating subsets that match our expectations. Direct Subset works by starting from a set of initial targets. These are tables with subsetting goals, e.g. give me 5% of the
users table and 5% of the
vendors table. Direct subset begins by producing those subsets in your output database. Then it makes two passes on your database, ordered by a topological sort of the tables. In the first phase, it greedily includes rows from tables that are upstream of the initial targets. For example, the
events table is upstream of
users if there is a foreign key in
users. The second phase grabs all required keys given the greedily derived database.
To contrast the two: Direct Subset always starts with the initial targets, which can be anywhere in the topological sort, and Iterative Subset always starts at the beginning of the topological sort. Direct Subset always ends after two passes, and Iterative Subset continues iterating until meeting a desired subset criteria. Iterative Subset has flexible criteria for ending iteration, Direct Subset merely guarantees that the initial targets will be as specified.
After extensive testing, we found Direct Subset produced more practical subsets than Iterative Subset (usually by bringing in more data), though it was often slower (usually because it brought in more data). So for Condenser 2.0, we replaced Iterative Subset with Direct Subset.
If you want to know even moar about how these two approaches work, feel free to email us at email@example.com. I’m confident my enthusiasm for subsetting will outlast yours. 😇
Given that Condenser relies on foreign keys to create its subsets, how they are handled is fundamental to the subsetting process. Condenser 1.0 was limited in the kinds of foreign keys it supported, thereby limiting the tool’s capacity to subset a variety of databases. It also didn’t allow for much configuration in terms of foreign key constraints. AKA complex subsetting be damned. To lift these restrictions and greatly expand Condenser’s capabilities, we’ve made two major improvements in terms of foreign key handling.
config.jsonlets you specify additional foreign keys, which will be unioned with the foreign key constraints present in the database.
This one was a no brainer. Why offer an open source subsetting tool for only one of the most popular open source databases when, with just a little more effort, you can do it for two? Spread the open source love! In addition to Postgres, we’re happy to announce that Condenser 2.0 also supports MySQL.
Consider these the icing on the cake:
initial_targets. This is very useful when you want to pull out specific parts of a database, e.g. everything associated with
customer_id = 5, rather than simply shrinking a database.
upstream_filters. During subsetting’s greedy phase (read the description of Direct Subset for more info), the subsetter grabs as many rows as possible from upstream tables.
upstream_filtersgives you an opportunity to restrict how greedy it is. For example, suppose you want the subset “all users in Brazil, and the last 30 days of web traffic”. You might specify this with an initial target on
country = 'Brazil', and an upstream filter on
current_date - event_date < 30. You can specify an upstream filter on a table or on a column. When it’s on a column, it applies to all tables with that column.
keep_disconnected_tableshas been added to let you control what happens to tables that aren’t connected to any
initial_targets; this includes transitive connections.
The metaphor’s not dead yet, folks. Here’s your cherry on top. Condenser now ships with
example-config.json that gives examples of what you can do in
config.json, as well as a shell
README.md has been augmented with a bunch more on how to install and configure Condenser. With that said, no documentation is ever perfect. Feel free to contact us if you run into any problems.
If Condenser doesn’t quite cut it for you, Tonic also offers a commercial subsetter. The commercial subsetter has more advanced algorithms for subsetting very large databases, additional subsetting options, a UI, and job scheduling. For more information, contact us at firstname.lastname@example.org.
Over the last year, Condenser has been a great tool for us as well as our customers. This is what inspired these additional investments in Condenser. We hope it’s helpful for you too. So, happy birthday Condenser 🥳! And here’s to many more! 🥂