data masking (original) (raw)

What is data masking?

Data masking is a method of creating a structurally similar but inauthentic version of an organization's data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required.

Although most organizations have stringent security controls in place to protect production data in storage and in business use, sometimes that same data element has been used for operations that are less secure. The issue is often compounded if these operations are outsourced and the organization has less control over the environment. In the wake of compliance legislation, most organizations are no longer comfortable exposing real data unnecessarily.

Data masking substitutes original values in a data set with randomized data using various data shuffling and manipulation techniques. The obfuscated data maintains the unique characteristics of the original data so that it yields the same results as the original data set.

How does data masking work?

Encryption is the best way to securely store and transfer sensitive data. Unfortunately, encrypted data is difficult to query and analyze. For example, you cannot filter users based on age if their data of birth is encrypted. As a result, organizations need another way to keep data secure and private when it is being used for research, development and testing.

Data masking, which is also called data sanitization, keeps sensitive information private by making it unrecognizable but still usable. This lets developers, researchers and analysts use a data set without exposing the data to any risk.

Data masking is different from encryption. Encrypted data can be decrypted and returned to its original state with the correct encryption key. With masked data, there is no algorithm to recover the original values. Masking generates a characteristically accurate but fictitious version of a data set that has zero value to hackers. It also cannot be reverse engineered, and statistical outputs cannot be used to identify individuals.

Like data encryption, not every data field needs to be masked, although some fields must be completely hidden.

Diagram of how an encryption key works

How a cryptographic key works when encrypting a message.

Why is data masking important?

Various data protection standards and regulations require that businesses and other organizations protect personally identifiable information, or PII, and protected health information and keep it confidential. These standards and regulations include the following:

These regulations and standards play an important role in establishing appropriate levels of data protection and preventing unauthorized users from accessing the data. However, they also make it challenging for companies that want to analyze or share their data with others. Data masking reduces the risks of sensitive data being exposed and lets enterprises comply with various standards and regulations while handling regulated data.

Data masking techniques

A variety of data management techniques can be used to mask or anonymize PII and other private and sensitive data depending on the data type. These masking methods include the following:

Scrambling

Scrambling randomly reorders alphanumeric characters to obscure the original content. For example, a customer complaint ticket number of 3429871 in a production environment could appear as 8840162 in a test environment after being scrambled. Although scrambling is easy to implement, it only works on certain types of data. Data obfuscated this way is not as secure as other techniques.

Substitution

This technique replaces the original data with another value from a supply of credible values. Lookup tables are often used to provide alternative values to the original, sensitive data. The values must pass rule constraints and preserve the original characteristics of the data.

It is harder to apply substitution than scrambling, but it can be applied to several data types and provides good security. For example, credit card numbers can be substituted with numbers that pass card provider validation rules.

Shuffling

Values within a column, such as user surnames, are shuffled to randomly reorder them. For example, if customer surnames are shuffled, the results look accurate but won't reveal any personal information. However, it is essential that the shuffling masking algorithm is kept secure so it cannot be used to reverse-engineer the data masking process.

Date aging

This method increases or decreases a date field by a specific date range. Again, the range value used must be kept secure.

Variance

A variance is applied to a number or date field. This approach is often used for masking financial and transaction value and date information. The variance algorithm modifies each number or date in a column by a random percentage of its real value. For instance, a column of employees' salaries could have a variance of plus or minus 5% applied to it. This would provide a reasonable disguise for the data while maintaining the range and distribution of salaries within existing limits.

Masking out

Masking out only scrambles part of a value and is commonly applied to credit card numbers where only the last four digits remain visible.

Nullifying

Nullifying replaces the real values in a data column with a null value, completely removing the data from view. Although this sort of deletion is simple to implement, the nullified column cannot be used in queries or analysis. As a result, it can degrade the integrity and quality of the data set for development and testing environments.

Examples of data masking techniques

Data masking takes different forms based on the technique used. See some examples of the various types.

Types of data masking

The process of masking data can be initiated in different ways depending on where and when the data is needed. Various types of data masking include the following:

Data masking challenges

There are several challenges involved with data masking:

Complicated. Data masking is not a simple, one-step activity. The data must be transformed to eliminate the risk of exposing sensitive information through inference attacks. At the same time, the system must maintain the complexity and unique characteristics of the original unmasked data, such as frequency distribution. This ensures that queries and analysis yield the same results as they would from the original.

Referential integrity. Masked data must maintain referential integrity across systems and databases. This is called deterministic data masking and ensures consistency of the masked output across databases and among servers, ensuring a value is always replaced with the same masked value. For example, the name "Helen" is always replaced by the value "Denise" wherever it occurs in the data being masked to preserve primary and foreign keys and relationships.

This may sound straightforward, but most databases are normalized for performance, and sensitive data gets stored in a variety of tables across one or more databases. Structured text, such as Extensible Markup Language (XML), can contain values that must be obfuscated, so they have to be parsed and masked.

Governance. For data used in software testing, the masking process must ensure data still matches internal rules governing the data attributes, such as zip codes or bank account numbers. If care is not taken here, the application will fail an update test because the data validation checks fail.

To overcome these challenges, database administrators must do a detailed review of the data being masked. Others involved with the data -- developers, testers, data scientists and security teams -- should contribute to the review. Input from these stakeholders ensures the appropriate masking techniques are used, data sources of valid replacement values are generated, referential integrity is maintained across all systems and the masked data maintains the characteristics of the original data.

Data masking best practices

Data masking is an important part of how organizations comply with privacy regulations and still are able to mine data for valuable insights. However, the ever-growing volume of structured and unstructured data that organizations collect has increased the complexity of data masking at scale. To remain compliant, it is essential to follow best practices:

Data masking use cases

Data masking is used in many industries to drive innovation and improve services. For example, the banking and finance industries use it to develop and test new systems as well as improve fraud detection algorithms.

While data masking is used for different reasons, the main driver is data security and personal data privacy.

List of best practices for preventing data breaches

Various data management best practices help prevent data breaches.

Many database vendors offer features and services that simplify data masking and integrate them with existing systems.

Find out more about keeping data safe beyond data masking in our comprehensive guide to data security.

This was last updated in September 2022

Continue Reading About data masking

Dig Deeper on Data security and privacy