Since 2018, organizations that handle the personal data of individuals living in the European Union (EU) have had to adapt their data handling practices to comply with the EU’s General Data Protection Regulation (GDPR). The regulation has served as a model for similar data privacy regulations around the world, including in Japan, Brazil, South Korea, South Africa, Turkey, the United Kingdom, and the United States. Organizations subject to this legislation must prioritize the privacy of individuals in their data handling practices, which has led many to rethink how they manage the massive amounts of data being collected from their customers, clients, and users.
As a data privacy company, we’ve helped hundreds of customers meet the requirements of GDPR by using synthetic data to de-identify sensitive information while retaining the utility of that data. By partnering with Tonic.ai, organizations are able to unlock the value of their production data while protecting customer privacy and trust and reducing the risk of costly compliance violations.
In the age of generative AI, data privacy has again come to the forefront of the conversation, no doubt because of the important role GDPR has played in establishing a mandate to protect individuals’ privacy. For organizations working with unstructured text data that contains PII, PHI, and other sensitive information, we built Tonic Textual, which leverages named entity recognition (NER) models to detect sensitive information in your data, to help address the regulatory requirements of working with that data for analytics and AI development.
How NER works: identifying sensitive information
Named entity recognition (NER) is a technology used to identify categories of data within unstructured text, such as the names of people, locations, organizations, times, and dates. NER models are more advanced than simple string matching: they identify entities based on their context, which lets them catch entities that are misspelled or extremely uncommon.
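To make the contrast with string matching concrete, here is a minimal Python sketch. It is not a NER model and it is not how Tonic Textual works internally: `exact_match` and `context_match` are hypothetical helpers, the name list is made up, and the single regular expression only stands in for the contextual cues that a trained model learns from data.

```python
import re

# Illustrative lookup list; a real list would never be complete.
KNOWN_NAMES = {"Jonathan", "Maria"}

def exact_match(text: str) -> list[str]:
    """Simple string matching: finds only names already on the list."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in KNOWN_NAMES]

# Toy contextual cue: a capitalized word right after "my name is" or "I am".
# A trained NER model learns many such cues instead of one fixed rule.
CONTEXT_CUE = re.compile(r"\b(?:[Mm]y name is|I am)\s+([A-Z][a-z]+)")

def context_match(text: str) -> list[str]:
    """Flag a word as a name because of the words around it."""
    return CONTEXT_CUE.findall(text)

sample = "Hello, my name is Jonathn and I need help with my order."
print(exact_match(sample))    # the misspelled name is not on the list
print(context_match(sample))  # the surrounding words still flag it
```

The lookup misses the misspelled name because it only knows exact strings, while the context rule flags it from the phrase that introduces it. A real model generalizes this idea across many entity types and phrasings.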
The practical applications of NER in compliance
NER has broad applications across many industries, and it can be used specifically to meet the compliance requirements of the GDPR and other privacy regulations. The table below outlines a number of examples.
These applications demonstrate the extensive applicability of NER in enhancing data privacy across different fields by providing a reliable mechanism for identifying and managing sensitive information.
The scope of GDPR: beyond traditional data
The General Data Protection Regulation (GDPR) is a European regulation that sets restrictions and obligations for how organizations handle personal data. Though it took effect in May 2018, the implications of the GDPR are still being discovered as courts in different EU countries interpret the applicability and scope of the regulation.
The GDPR provides broad protection of the Personal Information of individuals and is designed to give people in the EU more control over their personal data. It grants individuals rights such as the ability to access their data, have it corrected, and even have it erased under certain circumstances. It also requires companies to dispose of data they no longer need, to design systems with privacy in mind, and to meet a wide variety of other technical and legal obligations intended to protect Personal Information.
Though the GDPR only protects the Personal Information of data subjects in the EU, it is extraterritorial in scope and may apply to organizations outside of the EU in a variety of situations. The GDPR is also currently one of the strictest data protection standards, and many other countries and regulatory bodies have modeled their data protection regulations on it.
Navigating troublesome data types and formats
Complying with the GDPR is never simple, even with datasets that are well structured and straightforward. The material scope of the GDPR extends beyond data processed in databases or structured data stores and is generally interpreted to include documents both digital and hard-copy that are part of a filing system (see here and here). This includes files used in day-to-day operations, for example:
- Emails
- Documents that are kept in shared drives
- Documents kept on an employee’s workstation
- Documents in filing cabinets, binders, medical records, etc.
It also includes data that is no longer used, for example:
- Archive data
- Scanned files
- Backups
It also includes data of unknown provenance, like:
- Data that was acquired as part of a merger or acquisition
- Data that was migrated from an older system and lost its metadata, context, or structure
- Data created by employees as part of a project
- Incomplete or poorly documented datasets
All of these pose difficulties when trying to comply with the GDPR because they lack structure and indexing and span a wide variety of files and formats.
Utilizing Tonic Textual for GDPR compliance
Tonic Textual recognizes, redacts, and synthesizes sensitive entities in unstructured text data. This supports several GDPR compliance tasks that are typically expensive and time-consuming to perform manually. Specifically, Tonic Textual can be used to identify what personal information is being stored and processed and to minimize personal data that is no longer needed, while still retaining the utility of the data for business use.
Data cataloging with NER technology
The first step in determining what needs to be done to comply with the GDPR (or another privacy or data protection regulation) is determining what data is being processed. Named entity recognition can be used to quickly identify different categories of data within a dataset and determine what risks and obligations may be present, without the time-intensive process of manually reviewing files.
Tonic Textual can be used to identify what categories of data are being processed (emails, names, locations, nationalities, etc.) as well as how frequently these items appear in an unstructured dataset. This provides a quick assessment of what risk exists in a dataset and whether the dataset is worth retaining, de-identifying, or destroying.
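As a rough illustration of this kind of cataloging, the Python sketch below counts how often each category appears across a handful of documents. It does not use Tonic Textual or a NER model: `detect_entities` and `catalog` are hypothetical names, the labels are illustrative, and two regular expressions stand in for the detector so that the example runs on its own.

```python
import re
from collections import Counter

# Stand-in detectors. A real pipeline would call a NER model here instead.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def detect_entities(text):
    """Return (label, start, end) for every match, ordered by position."""
    spans = [
        (label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda span: span[1])

def catalog(documents):
    """Count how many times each entity category appears in the dataset."""
    counts = Counter()
    for text in documents:
        counts.update(label for label, _, _ in detect_entities(text))
    return counts

docs = [
    "Contact ana@example.com or call +44 20 7946 0958.",
    "Forwarded from li.wei@example.org, cc ana@example.com.",
    "Quarterly totals attached, no action needed.",
]
summary = catalog(docs)
for label, count in summary.most_common():
    print(label, count)
```

A summary like this is only a first pass, but it shows at a glance which categories dominate a dataset, which is the input needed for a retain, de-identify, or destroy decision.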
Data minimization and sensitive information masking
One of the core tenets of the GDPR is “data minimisation,” which applies to both the collection and storage of Personal Information. The GDPR requires that companies get rid of Personal Information when it is no longer required for its original purpose. Even so, the associated non-identifying information may still provide value to organizations. For example, an ecommerce company can’t justify keeping the customer support interactions of people whose accounts haven’t been active for years, but the interactions themselves are important data points that can help the company identify trends and improve business practices.
Tonic Textual can be used not only to identify personal information in a variety of documents but also to mask it, so that the remaining non-personal information, and whatever value it holds for the organization, can be retained. By removing non-essential personal information, organizations can safely leverage the data for business use without risking compliance violations and lost customer trust.
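The masking idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Tonic Textual's API: `mask` is a hypothetical function, the bracketed labels are made up, regular expressions stand in for the NER model that would normally supply the spans, and the sketch assumes detected spans do not overlap.

```python
import re

# Stand-in detectors; a NER model would supply these spans in practice.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def mask(text: str) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    spans = [
        (m.start(), m.end(), label)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

ticket = "Refund not received. Reach me at sam@example.com or 415 555 0132."
masked = mask(ticket)
print(masked)
# Refund not received. Reach me at [EMAIL_ADDRESS] or [PHONE_NUMBER].
```

The complaint itself ("Refund not received") survives for trend analysis, while the contact details that identify the customer are gone.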
The future of data management with Tonic Textual
Going forward, the landscape of data management is poised to become increasingly complex due to the exponential growth of digital information. In this evolving environment, NER will play a pivotal role in enhancing data privacy and compliance frameworks. By leveraging NER, organizations can automatically identify and classify sensitive personal data across vast and varied datasets, ensuring that this information is handled in accordance with ever-tightening global privacy regulations.
The GDPR has changed the way companies must manage and interact with their data. Since its inception, organizations have been seeking scalable ways to manage the vast amounts of data they collect. Tonic Textual provides proprietary NER technology that detects sensitive information in unstructured text data, along with masking capabilities that help organizations safely handle unstructured data types that would otherwise require significant manual review, cost, and effort.
Leveraging NER for enhanced data privacy
Advancements in NER technology, integrated with artificial intelligence and machine learning, are significantly improving the accuracy and efficiency of data processing. This integration is enabling real-time data analysis and protection, making it easier for organizations to stay ahead of potential data breaches. Additionally, as privacy laws continue to evolve, NER will become an indispensable tool in developing data governance strategies that are robust, scalable, and capable of protecting sensitive information in a privacy-centric world. Organizations that take a proactive approach to data management will not only comply with legal standards but also build trust with consumers by safeguarding their personal information against emerging threats.
To learn more about how Tonic Textual can help you address challenges with meeting GDPR requirements, connect with our team today.