Data de-identification in the finance industry

Author

Janice Manwiller

August 30, 2024

Keeping customer data secure is vital for any organization. But for the finance industry, the stakes are particularly high.

As we know from news stories and personal experiences, leaked or stolen personal information can spell disaster—lost savings, fraudulent debt, and wrecked credit scores. And untangling that mess can take a very long time.

In this article, we'll talk about data de-identification in the context of the finance industry, and how a test data management platform can be a valuable solution for finance organizations.

What is data de-identification?

Before we discuss finance data de-identification in depth, let's start with a quick definition.

To quote our earlier guide to data de-identification:

"Data de-identification is any action taken to eliminate or modify personally identifiable information (PII) and sensitive personal data within datasets to safeguard individuals' privacy."

For example, one method to de-identify data is to strip out or obscure names, account numbers, or any other information that could identify a person or provide sensitive personal information about that person.
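As a minimal sketch of that idea, the following Python snippet strips out a name and obscures an account number. The record layout and field names are assumptions made for illustration, and keeping the last four digits is only one possible choice; it might still be too revealing for some use cases.

```python
import re

# Hypothetical customer record; the field names are assumptions for illustration.
record = {
    "name": "Jane Doe",
    "account_number": "1234567890",
    "balance": 2500.75,
    "note": "Customer called about account 1234567890.",
}

def redact_record(rec):
    """Return a copy of the record with direct identifiers removed or obscured."""
    out = dict(rec)
    # Strip out the name entirely.
    out.pop("name", None)
    # Obscure the account number, keeping only the last four digits.
    acct = out["account_number"]
    out["account_number"] = "*" * (len(acct) - 4) + acct[-4:]
    # Obscure long digit runs that appear in free text.
    out["note"] = re.sub(r"\d{6,}", "[REDACTED]", out["note"])
    return out

redacted = redact_record(record)
print(redacted)
```

Note that the free-text field needs attention too. Identifiers often hide in notes and comments, not only in the obviously sensitive columns.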

How the finance industry uses data de-identification

Data de-identification has many use cases in the finance industry. All of these use cases are driven by the need to keep personal data from being revealed, both now and in the future.

Application development and testing

Like many other organizations, finance organizations need to develop and test software systems. In the finance industry, these systems might include wire transfer systems to move customer money between accounts.

In addition to verifying the basic features and functions, the testing needs to ensure that the system does not enable any leakage or theft of personal data.

To do that, development and testing teams need access to realistic, high-quality data that does not contain sensitive information.

Data migration testing

Another testing use case for finance data de-identification involves migration of data between systems.

For example, a finance organization introduces a new cloud offering. They need to migrate their data from the current system to the cloud system.

But before they do the actual migration with real data, they can do dry runs with de-identified data, to ensure that the data makes it intact from the old system to the new system, and to ensure that the migration does not introduce the risk of data theft or leakage.
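One way to check that data "makes it intact" during a dry run is to compare a row count and an order-independent fingerprint of each table before and after the migration. The sketch below assumes that rows can be read from both systems as tuples. The helper name and the sample rows are hypothetical.

```python
import hashlib

def table_fingerprint(rows):
    """Return (row_count, digest) for row tuples, independent of row order."""
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined

# Hypothetical de-identified rows read from the old and new systems.
source_rows = [(1, "acct-001", 150.00), (2, "acct-002", 75.25)]
migrated_rows = [(2, "acct-002", 75.25), (1, "acct-001", 150.00)]  # same rows, new order
damaged_rows = [(1, "acct-001", 150.00), (2, "acct-002", 75.20)]   # one value changed

print(table_fingerprint(source_rows) == table_fingerprint(migrated_rows))  # True
print(table_fingerprint(source_rows) == table_fingerprint(damaged_rows))   # False
```

Because the dry run uses de-identified rows, the fingerprints and any mismatch reports can be shared with the migration team without exposing customer data.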

Analysis and fraud detection

Finance organizations need to understand the latest financial trends based on their customer activity. They also need to constantly be on the lookout for possible data leakage and fraud.

They need to continuously analyze data to identify trends and to detect anomalous activity.

The data for this type of analysis needs to be highly realistic, so that the patterns of activity match what is actually happening. But it also must be protected.

Well-constructed, de-identified data allows for accurate and secure analysis.

RAG and LLM construction

A more recent use case involves building large language models (LLMs) and retrieval-augmented generation (RAG) systems.

These systems can be used for data analysis, to answer questions about trends and activity.

They can also be used for chat-based support systems.

Producing a usable and useful RAG system or LLM requires large volumes of realistic data. But organizations cannot feed personal data into these systems. De-identified data is a must.

Why is data de-identification important for financial institutions?

Now that we've discussed how the finance industry might use de-identification, let's look at why finance data de-identification is so important.

Regulatory compliance

Finance organizations are subject to different sets of government regulations that determine how they must handle sensitive data.

If they do not comply with these regulations, they can be subject to severe fines and fees.

Finance data de-identification can help to ensure compliance with these regulations. Compliance can also boost customer confidence.

Consumer confidence

Customers trust finance organizations with their livelihoods—retirement savings, education funds, investment accounts.

To survive and thrive, finance organizations must earn and maintain that trust.

Customers must be confident that their personal data is protected strongly at all times—that an organization's secure processes and procedures make it unlikely that data is leaked or stolen.

And if something does happen, they need to know that the company has their back and will make things right as quickly as possible.

Innovation

Any organization, including a finance organization, needs to keep innovating and improving its systems.

Having de-identified data on hand for testing and analysis helps ensure that new and improved features and tools get out the door more quickly and securely.

Finance data de-identification: Useful methods

To meet these use cases, finance organizations most often rely on the following de-identification methods.

Data masking

Data masking protects sensitive data by replacing it with a non-sensitive substitute. Masking techniques include pseudonymization, anonymization, and scrambling.

These techniques are intended to provide the most useful data while maintaining privacy.

For finance organizations, with data that contains extremely sensitive values such as Social Security and credit card numbers, effective masking is paramount.
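To make pseudonymization concrete, here is a minimal sketch that replaces a Social Security number with a stable, SSN-shaped substitute derived from a keyed hash. The key name and the function are hypothetical, and this is one possible approach, not a description of how any particular product works. The output could coincide with a real SSN by chance, so a production approach would likely restrict the output range.

```python
import hashlib
import hmac

# Assumption: the key is managed separately from the data it protects.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize_ssn(ssn, key=SECRET_KEY):
    """Map an SSN to a stable, SSN-shaped pseudonym using a keyed hash."""
    digits = ssn.replace("-", "")
    mac = hmac.new(key, digits.encode("utf-8"), hashlib.sha256).hexdigest()
    # Reduce the hex digest to nine decimal digits.
    masked = f"{int(mac, 16) % 10**9:09d}"
    return f"{masked[:3]}-{masked[3:5]}-{masked[5:]}"

a = pseudonymize_ssn("123-45-6789")
b = pseudonymize_ssn("123-45-6789")
c = pseudonymize_ssn("987-65-4321")

print(a == b)  # the same input always produces the same pseudonym
print(a, c)
```

Using a keyed hash instead of a plain hash matters here. There are at most a billion nine-digit values, so an unkeyed hash could be reversed by trying every one of them.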

Data synthesis

Data synthesis is another de-identification technique. It is somewhat similar to masking in that its goal is to replace real-world sensitive values with realistic non-sensitive values.

Instead of using transformed records from the original production data, data synthesis creates completely new data that uses the same structure and statistics as the original data.

You can specify rules for the size of the synthesized data and the content of the records. For example, you might ask for a set of 100 transaction records, with equal numbers of deposits and withdrawals, all for amounts of less than $500.
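The example rule set above can be sketched in a few lines of Python. This is a deliberately simple, rule-based illustration. The function name and record fields are assumptions, and the amounts are drawn uniformly at random instead of being modeled on the statistics of real production data, which a full synthesis tool would do.

```python
import random

def synthesize_transactions(count=100, max_amount=500.00, seed=42):
    """Create `count` synthetic transactions: half deposits, half withdrawals,
    each with an amount strictly below `max_amount`."""
    if count % 2:
        raise ValueError("count must be even to split the types equally")
    rng = random.Random(seed)
    kinds = ["deposit"] * (count // 2) + ["withdrawal"] * (count // 2)
    rng.shuffle(kinds)
    max_cents = int(max_amount * 100) - 1  # largest value below the limit, in cents
    return [
        {"id": i + 1, "type": kind, "amount": rng.randint(1, max_cents) / 100}
        for i, kind in enumerate(kinds)
    ]

transactions = synthesize_transactions()
deposits = sum(1 for t in transactions if t["type"] == "deposit")
print(len(transactions), deposits)  # 100 50
```

None of these records comes from a real customer, so there is no original value to recover.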

Depending on the type of data to de-identify, synthesized data can be a more secure and appropriate approach, especially for categorical, continuous, and event series data. That said, it can run into limitations at scale.

Database subsetting

Database subsetting, also referred to as subsampling, means selecting a random or representative sample of data instead of using the entire dataset.

Subsetting does not in and of itself de-identify data. But for effective testing, a finance organization that has extremely large sets of data must be able to create functional subsets that capture all of the data scenarios from the larger database.

Finance data de-identification paired with subsetting offers the utmost in data privacy and developer efficiency.
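As a rough illustration of a functional subset, the sketch below samples a few customers and then pulls only the transactions that belong to them, so that no transaction in the subset points to a missing customer. The table contents and the function name are hypothetical, and a real subsetting tool has to follow many more relationships than this one.

```python
import random

# Hypothetical tables: 10 customers, each with 3 transactions.
customers = [{"customer_id": i} for i in range(1, 11)]
transactions = [
    {"txn_id": n, "customer_id": (n % 10) + 1, "amount": 10.0 * n}
    for n in range(1, 31)
]

def subset(customers, transactions, sample_size, seed=0):
    """Sample customers, then keep only the transactions that reference them."""
    rng = random.Random(seed)
    kept_customers = rng.sample(customers, sample_size)
    kept_ids = {c["customer_id"] for c in kept_customers}
    kept_transactions = [t for t in transactions if t["customer_id"] in kept_ids]
    return kept_customers, kept_transactions

sub_customers, sub_transactions = subset(customers, transactions, sample_size=3)
print(len(sub_customers), len(sub_transactions))  # 3 9
```

Sampling the parent table first and following the relationships outward is what keeps the subset usable. Sampling each table independently would leave orphaned rows.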

Access management

All data protection involves controlling access to that data.

Producing de-identified data requires someone to have access to production data, to configure the de-identification, and to verify that the de-identified data meets the intended use case.

Finance organizations need to make sure that access is limited to the bare minimum needed to produce the de-identified data, and that no sensitive data is leaked in the process.

Selecting the appropriate de-identification methods to use

When you start the process of creating a de-identified dataset, you need to consider the following questions:

How large is your source data, and how much de-identified data do you need?

These questions can help to determine whether you can de-identify the entire dataset or you need to carve out a subset.

For example, a de-identified dataset for basic software testing might not need to be quite as large as a de-identified dataset used to train an LLM or to analyze overall trends.

What types of values are in the source data?

Organizations must identify all of the values that need to be protected, to ensure that they are included in the finance data de-identification and are handled appropriately based on their data type.
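A first pass at finding sensitive values can be as simple as checking each column against a few known patterns. The sketch below is a hypothetical, regex-based illustration only. The patterns, threshold, and column names are assumptions, and a real sensitivity scan would use far more signals than value shape alone.

```python
import re

# Assumed patterns for a few common value shapes.
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "card_number": re.compile(r"\d{13,19}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

def scan_columns(rows, threshold=0.8):
    """Flag columns where at least `threshold` of non-empty values match a pattern."""
    findings = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [str(r[col]) for r in rows if r.get(col) not in (None, "")]
        if not values:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if hits / len(values) >= threshold:
                findings[col] = label
                break
    return findings

rows = [
    {"id": 1, "tax_id": "123-45-6789", "card": "4111111111111111", "contact": "a@example.com"},
    {"id": 2, "tax_id": "987-65-4321", "card": "5500005555555559", "contact": "b@example.org"},
]
print(scan_columns(rows))
```

Matching on the values instead of the column names helps catch sensitive data in columns with unhelpful names such as "tax_id" or "contact".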

Who needs to have access to the source and de-identified data?

Organizations must identify who needs to have access to the de-identified data, and ensure that users only have access to the data that they need.

They also must limit access to the source data that is being de-identified.

How realistic does the data need to be?

The answer helps to determine the type of masking to apply to values, to ensure that the de-identified data meets its intended use case.

Does the de-identified data need to maintain existing relationships? If so, then you need to make sure that primary and foreign key relationships aren't lost during de-identification.
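One simple way to keep key relationships intact is to build a single mapping from original keys to surrogate keys, and then apply that same mapping to every table that uses the key. The tables and the helper below are hypothetical, and the mapping itself would need to be protected as carefully as the source data.

```python
import random

# Hypothetical parent and child tables that share customer_id.
customers = [
    {"customer_id": 101, "segment": "retail"},
    {"customer_id": 102, "segment": "business"},
]
accounts = [
    {"account_id": 1, "customer_id": 101},
    {"account_id": 2, "customer_id": 101},
    {"account_id": 3, "customer_id": 102},
]

def build_key_map(ids, seed=7):
    """Assign each original key a unique random surrogate key."""
    rng = random.Random(seed)
    surrogates = rng.sample(range(100_000, 1_000_000), len(ids))
    return dict(zip(ids, surrogates))

key_map = build_key_map([c["customer_id"] for c in customers])

# Apply the same map to the primary key and to every foreign key.
masked_customers = [{**c, "customer_id": key_map[c["customer_id"]]} for c in customers]
masked_accounts = [{**a, "customer_id": key_map[a["customer_id"]]} for a in accounts]

masked_ids = {c["customer_id"] for c in masked_customers}
print(all(a["customer_id"] in masked_ids for a in masked_accounts))  # True
```

Because both tables go through the same map, a join between the masked tables returns the same pairings as a join between the originals.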

Do you need to keep the same proportions of values? If so, then you probably don't want to replace all of the values with a single constant or a null value.

Do the values need to have a valid format? For example, does a credit card number have to be a realistic value that isn't rejected when you use it to test a credit card transaction?
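For credit card numbers, a "valid format" usually includes passing the Luhn checksum, which many payment forms check before they accept a number. The sketch below generates a random number with a correct Luhn check digit. The function names are our own, and the default prefix of 4 is an assumption chosen for illustration. A number that passes the checksum is still not an issued card.

```python
import random

def _luhn_sum(digits, double_first):
    """Sum digits right to left, doubling every second digit."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if (i % 2 == 0) == double_first:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total

def luhn_check_digit(partial):
    """Compute the check digit for a number that lacks its final digit."""
    return str((10 - _luhn_sum(partial, double_first=True) % 10) % 10)

def luhn_is_valid(number):
    """Return True if the full number passes the Luhn checksum."""
    return _luhn_sum(number, double_first=False) % 10 == 0

def fake_card_number(prefix="4", length=16, seed=None):
    """Generate a random, Luhn-valid number with the given prefix and length."""
    rng = random.Random(seed)
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randint(0, 9)) for _ in range(body_len))
    return body + luhn_check_digit(body)

card = fake_card_number(seed=1)
print(card, luhn_is_valid(card))
```

A replacement value like this can travel through the same validation code paths as a real card number, which is the point of asking how realistic the data needs to be.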

Using Structural to de-identify financial data

Tonic Structural is a developer-first test data management solution that includes an array of capabilities that finance organizations can use for effective, realistic finance data de-identification, including:

Sensitivity scan: The Structural sensitivity scan identifies a wide range of sensitive value types, including financial information such as bank and credit card numbers. It also allows for custom sensitivity types to identify values that are specific to your organization.

Generators: A Structural generator performs a specific type of transformation on a value. It can produce realistic replacements of known types of values such as an SSN or a customer name. Structural also allows you to create custom variants of its generators, called generator presets, so that you can create custom configurations for specific use cases.

Consistency: Structural supports input-to-output consistency in many of the transformations that it performs. Self-consistency ensures that a value in the source data maps to the same transformed value in the destination data. You can also tie the value of one column to another column. Consistency increases the realism of the generated data.

Subsetting: Structural's subsetting feature allows you to create smaller subsets of de-identified data that maintain referential integrity. Subsets can be centered around specific types of data. For example, one subset might generate a set of transaction records (with related data), while another might focus on customer information. You can adjust the subset configuration to create larger or smaller subsets.

How else can financial institutions protect customer data?

De-identification is just one weapon in the finance industry data protection arsenal. Here are a few other basic processes that also contribute to data security:

Perform regular system audits

Audit the system regularly to verify that personal data has not been leaked or stolen, and to ensure that the data is not vulnerable to leakage or theft.

Encrypt data

Make sure that data that isn't de-identified is encrypted at all times.

Standardize best practices

Make sure that all of your employees know best practices for data protection, and make it easy for employees to follow those best practices.

Establish backup and recovery plans

Make sure to back up data regularly, and have a recovery plan in place to restore lost or destroyed data and to respond to incidents of theft or leakage.

The takeaway

Protecting the privacy and security of personal data is especially vital in the finance industry. Failing to protect that information can lead to disastrous consequences for organizations and their customers.

Finance organizations need access to realistic, de-identified data for a variety of use cases, including system testing, analysis, and fraud detection.

Structural's sensitivity scan identifies sensitive data, its generators de-identify that data, and its subsetting produces manageable chunks of realistic data. Together, they are vital tools for producing de-identified data for the finance industry.

To learn more about Structural de-identification and how to use it for financial data, connect with our team or start a free trial of Tonic Structural today.