More realistic fake data with DataMasque's Deterministic Masking

Ben Shaw, Head of Engineering

Feb 27, 2023

The most important property of masked data is that it is secure and irreversible: masked data can't be transformed back into the sensitive data from which it was generated.

The second most important property is arguably realism: masked data should be indistinguishable from real data. Developing and testing an application is extremely difficult when everyone has the first name Foo, the last name Test, and lives at 123 Fake Street, Anytown, USA!

How DataMasque creates realistic data

DataMasque provides many built-in seed files and specialty masking functions to preserve and generate many different types of data. With a few small tweaks to your ruleset, you can make the masked data look even better by using deterministic masking.

DataMasque can force consistency by hashing on a known value throughout your databases or files. What this boils down to (and we'll get into the specifics soon) is that wherever you saw John Doe in your sensitive data, you'll see Richard Roe in the masked data.

This means anonymised data is still meaningful. It's easy to tell that the Richard Roe who appears across different tables in the masked database is the same Richard Roe who appears in your masked files, even though you no longer know his real personal information.

The safety of DataMasque's deterministic masking

DataMasque has three levels of determinism:

No consistency. This is the default: all values are random and change every run.
Intra-run consistency. When one or more hash sources are specified, values derived from the same hash source are consistent within a single masking run. Values still change from run to run though.
Inter-run consistency. By specifying a run secret (a sort of "seed" for the randomness generator) together with hash_columns/hash_sources, data can be masked consistently across runs, even between databases and files. Another organisation could even mask data with consistent results, provided you have a way to securely supply them with the run secret.
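The three levels can be illustrated with a small Python sketch. This is not DataMasque's implementation: the HMAC-SHA256 construction, the pick_name and run_masking helpers, the name list and the example secret are all assumptions made purely for illustration.

```python
import hashlib
import hmac
import random
import secrets

# Hypothetical stand-in for a seed file of first names.
FIRST_NAMES = ["Richard", "Jane", "Alex", "Maria", "Sam", "Priya", "Tom", "Lena"]


def pick_name(hash_value, secret):
    """Use a keyed hash of the hash value to choose a replacement name."""
    digest = hmac.new(secret, str(hash_value).encode(), hashlib.sha256).digest()
    return FIRST_NAMES[int.from_bytes(digest[:8], "big") % len(FIRST_NAMES)]


def run_masking(user_ids, use_hash=False, run_secret=None):
    """Simulate one masking run over a list of user IDs (illustrative only)."""
    # Without an explicit run secret, each run draws a fresh 160-bit secret.
    secret = run_secret if run_secret is not None else secrets.token_bytes(20)
    if not use_hash:
        # No consistency: every value is chosen independently at random.
        return [random.choice(FIRST_NAMES) for _ in user_ids]
    # With hashing: the output depends only on (hash value, secret).
    return [pick_name(uid, secret) for uid in user_ids]


ids = [3, 7, 3]  # user 3 appears twice

# Intra-run consistency: repeated IDs agree within one run.
intra = run_masking(ids, use_hash=True)

# Inter-run consistency: a shared run secret makes separate runs agree.
shared = b"example-run-secret"  # assumed value, for illustration only
run_a = run_masking(ids, use_hash=True, run_secret=shared)
run_b = run_masking(ids, use_hash=True, run_secret=shared)
```

In this sketch, intra[0] and intra[2] always match because both come from user 3, while run_a and run_b match in full because they share a secret. Two runs without a shared secret would almost certainly disagree on at least some values.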

You might be wondering: how can deterministic masking be safe? If it's repeatable, is it really random? The answer is yes! Hash values are simply used as a random seed, along with the instance or run secret, which adds at least 160 bits of randomness.

And repeatable doesn't mean reversible either. An attacker would need to compromise many parts of the system and get access to a run secret (which is not stored) to even begin a replay attack – and that would take hundreds of years to execute.

Enabling consistency

For database or tabular file masking (CSV and other delimited text files, or Parquet), just add a hash_columns option as part of a column masking rule. For file masking on other types of files (like XML or JSON), the option is called hash_sources.

Here's an example of how to perform consistent masking across databases and files. It works provided the same hash values are accessible in both.

The database has a table with user_id and first_name columns (among others).

The files are JSON with this structure:

{
    "user_id": 3,
    "first_name": "John"
}

First, in the mask_table database-masking task, use user_id as one of the hash_columns:

# some of the ruleset is omitted for brevity
tasks:
  - type: mask_table
    table: customers
    key: user_id
    rules:
      - column: first_name
        hash_columns:
        - user_id
        masks:
        - type: from_file
          seed_file: DataMasque_firstNames_mixed.csv
          seed_column: firstname-mixed

Then, for the mask_file task, specify hash_sources with a json_path:

# some of the ruleset is omitted for brevity
tasks:
  - type: mask_file
    rules:
    - hash_sources:
      - json_path: ['user_id']
      masks:
        - type: json
          transforms:
          - path: ['first_name']
            masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed

To extract values from an XML document, set the xpath option instead of json_path. Values can also be extracted from XML and JSON documents stored inside a database column, by specifying xpath/json_path with hash_columns.

Note that since mask_table and mask_file tasks can't both be performed in the same run, the same run secret must be used for the database and file runs to get consistency.
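To see why sharing the run secret is enough, here is a conceptual Python sketch that masks a database row, a JSON document and an XML document in what would be separate runs. It does not use DataMasque itself: the masked_first_name helper, the HMAC-based selection, the name list and the secret value are assumptions for illustration only.

```python
import hashlib
import hmac
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a seed file of first names.
FIRST_NAMES = ["Richard", "Jane", "Alex", "Maria", "Sam", "Priya", "Tom", "Lena"]

RUN_SECRET = b"example-run-secret"  # assumed value, reused for every run


def masked_first_name(user_id, run_secret):
    """Choose a replacement name from a keyed hash of the user ID."""
    # str() normalises the ID so that 3 (JSON, database) and "3" (XML) agree.
    digest = hmac.new(run_secret, str(user_id).encode(), hashlib.sha256).digest()
    return FIRST_NAMES[int.from_bytes(digest[:8], "big") % len(FIRST_NAMES)]


# "Database run": a row represented as a dict.
row = {"user_id": 3, "first_name": "John"}
row["first_name"] = masked_first_name(row["user_id"], RUN_SECRET)

# "File run" on JSON: the hash value is read from the user_id key.
doc = json.loads('{"user_id": 3, "first_name": "John"}')
doc["first_name"] = masked_first_name(doc["user_id"], RUN_SECRET)

# "File run" on XML: the hash value is read from the user_id element.
root = ET.fromstring(
    "<user><user_id>3</user_id><first_name>John</first_name></user>"
)
root.find("first_name").text = masked_first_name(
    root.findtext("user_id"), RUN_SECRET
)

# All three sources now carry the same replacement name for user 3.
print(row["first_name"], doc["first_name"], root.findtext("first_name"))
```

The three printed names are identical because each one depends only on the user ID and the shared secret, not on where the record was stored.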

Hashing values on themselves

A value can also be hashed on itself. In lieu of a unique ID, you could hash on a composite "identifier" made up of first_name, last_name and birth_date, and use that as the source for consistently masking those same personal identifiers. Once again, this works for both databases and files.
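The composite-identifier idea can be sketched in Python as follows. Again, this is an illustration rather than DataMasque's implementation: the helper names, the separator character, the name lists, the date range and the secret are all assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical stand-ins for seed files.
FIRST_NAMES = ["Richard", "Jane", "Alex", "Maria", "Sam", "Priya"]
LAST_NAMES = ["Roe", "Smith", "Nguyen", "Garcia", "Patel", "Brown"]


def composite_key(record):
    """Join the identifying fields into one hash source."""
    # A separator avoids ambiguity between e.g. ("ab", "c") and ("a", "bc").
    return "\x1f".join(
        [record["first_name"], record["last_name"], record["birth_date"]]
    )


def keyed_int(key, field, secret):
    """Derive a field-specific integer from the composite key and secret."""
    message = f"{field}\x1f{key}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def mask_person(record, secret):
    """Mask the identifying fields, hashing on their ORIGINAL values."""
    key = composite_key(record)  # computed before any field is replaced
    days = keyed_int(key, "birth_date", secret) % (365 * 60)
    return {
        **record,
        "first_name": FIRST_NAMES[keyed_int(key, "first_name", secret) % len(FIRST_NAMES)],
        "last_name": LAST_NAMES[keyed_int(key, "last_name", secret) % len(LAST_NAMES)],
        "birth_date": (date(1940, 1, 1) + timedelta(days=days)).isoformat(),
    }


secret = b"example-run-secret"  # assumed value, for illustration only
customer = {"first_name": "John", "last_name": "Doe", "birth_date": "1980-05-17"}
invoice = {"first_name": "John", "last_name": "Doe", "birth_date": "1980-05-17",
           "amount": 42}

masked_customer = mask_person(customer, secret)
masked_invoice = mask_person(invoice, secret)
```

Both records describe the same person, so the masked name and birth date agree between them even though neither record contains a unique ID. The key must be built from the original values before any of them are replaced, otherwise the fields masked later would hash on already-masked data.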

Deterministic masking is just one of the ways DataMasque can help you generate more realistic masked data, and it only requires a couple of additions to a ruleset. For a more in-depth explanation, you can read the DataMasque deterministic masking documentation.

If you're not already protecting your data with DataMasque, you can get started right now with DataMasque on the AWS Marketplace, or contact DataMasque for Cohesity, on-prem or cloud environment support.
