The most important property of masked data is that it must be secure and irreversible. That is, masked data can't be transformed back into the sensitive data from which it was generated.
The second most important aspect is arguably realism: masked data should be indistinguishable from real data. Developing and testing an application is extremely difficult when everyone has the first name Foo, the last name Test, and lives at 123 Fake Street, Anytown, USA!
How DataMasque creates realistic data
DataMasque provides many built-in seed files and specialty masking functions to preserve and generate many different types of data. With a few small tweaks to your ruleset, you can make the masked data look even better by using deterministic masking.
DataMasque can force consistency by hashing on a known value throughout your databases or files. What this boils down to (and we'll get into the specifics soon) is that wherever you saw John Doe in your sensitive data, you'll see Richard Roe in the masked data.
This means anonymised data is still meaningful. It's easy to tell that Richard Roe across different tables in the masked database is the same Richard Roe in your masked files, even though you no longer know his real personal information.
The safety of DataMasque's deterministic masking
DataMasque has three levels of determinism:
No consistency: this is the default. All values are random and change every run.
Intra-run consistency: within one masking run, when one or more hash sources are specified, values based on the hash source will be consistent. Values still change every run though.
Inter-run consistency: by specifying a run secret (a sort of "seed" for the randomness generator) and using hash_columns/hash_sources, data can be masked consistently, even between databases and files. Another organisation could even mask data with consistent results, provided you have a way to securely supply them the run secret.
You might be wondering – how can deterministic masking be safe? If it's repeatable then is it really random? The answer is, yes! Hash values are simply used as a random seed, along with the instance or run secret which adds at least 160 bits of randomness.
And repeatable doesn't mean reversible either. An attacker would need to compromise many parts of the system and get access to a run secret (which is not stored) to even begin a replay attack – and that would take hundreds of years to execute.
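To make the idea concrete, here is a minimal Python sketch of seeding a random choice from a hash value plus a secret. It is not DataMasque's actual algorithm: the `mask_first_name` helper, the `NAMES` list standing in for a seed file, and the choice of HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac
import random
import secrets

# Hypothetical stand-in for a seed file of first names.
NAMES = ["Richard", "Alice", "Priya", "Mateo", "Chen", "Fatima", "Olga", "Kwame"]


def mask_first_name(hash_value, run_secret):
    """Pick a replacement name deterministically from (hash_value, run_secret)."""
    # Keyed hash: without the secret, the seed cannot be recomputed.
    digest = hmac.new(run_secret, str(hash_value).encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return rng.choice(NAMES)


# Intra-run consistency: a fresh secret per run, reused within that run.
run_secret = secrets.token_bytes(20)  # 20 bytes = 160 bits
assert mask_first_name(3, run_secret) == mask_first_name(3, run_secret)

# Inter-run consistency: the same supplied secret gives the same result in every run.
shared_secret = b"example-run-secret"  # illustrative only, not a strong secret
first_run = mask_first_name(3, shared_secret)
second_run = mask_first_name(3, shared_secret)
assert first_run == second_run
```

In this sketch the output only ever comes from the replacement list, so many different inputs share the same masked name and the masked value alone does not identify the original.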
Enabling consistency
For database or tabular file masking (CSV, Parquet and other delimited text files), just add a hash_columns option as part of a column masking rule. For file masking on other types of files (like XML or JSON), the option is called hash_sources.
Provided the same hash values are accessible in each, here's an example of how to perform consistent masking across databases and files.
The database has a table with user_id and first_name columns (among others).
The files are JSON with this structure:
{
  "user_id": 3,
  "first_name": "John"
}
First, in the mask_table database-masking task, use user_id as one of the hash_columns:
# some of the ruleset is omitted for brevity
tasks:
  - type: mask_table
    table: customers
    key: user_id
    rules:
      - column: first_name
        hash_columns:
          - user_id
        masks:
          - type: from_file
            seed_file: DataMasque_firstNames_mixed.csv
            seed_column: firstname-mixed
Then, for the mask_file task, specify hash_sources with a json_path.
# some of the ruleset is omitted for brevity
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - json_path: ['user_id']
        masks:
          - type: json
            transforms:
              - path: ['first_name']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed
To extract values from an XML document, the xpath option should be set instead of json_path. Extracting values from XML and JSON documents inside a database column is also possible, by specifying xpath/json_path with hash_columns.
Note that since mask_table and mask_file tasks can't both be performed in the same run, the same run secret must be used for the database and file runs to get consistency.
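After both runs have finished, a quick spot check can confirm that the two outputs agree. The Python sketch below is not part of DataMasque; the `find_inconsistencies` helper and the sample rows and documents are hypothetical, and it assumes you have already loaded the masked table rows as dictionaries and the masked JSON files as strings.

```python
import json


def find_inconsistencies(db_rows, json_documents):
    """Return user_ids whose masked first_name differs between the two sources."""
    names_by_id = {row["user_id"]: row["first_name"] for row in db_rows}
    mismatched = []
    for text in json_documents:
        doc = json.loads(text)
        expected = names_by_id.get(doc["user_id"])
        # Only compare IDs that are present in both sources.
        if expected is not None and expected != doc["first_name"]:
            mismatched.append(doc["user_id"])
    return mismatched


# Hypothetical masked output, for illustration only.
masked_rows = [
    {"user_id": 3, "first_name": "Richard"},
    {"user_id": 4, "first_name": "Alice"},
]
masked_files = [
    '{"user_id": 3, "first_name": "Richard"}',
    '{"user_id": 4, "first_name": "Alice"}',
]

print(find_inconsistencies(masked_rows, masked_files))  # prints []
```

A non-empty result would suggest that the two runs used different run secrets or different hash values.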
Hashing values on themselves
A value can also be hashed on itself. In lieu of a unique ID, you could hash on a composite "identifier" of first_name, last_name and birth_date, and use that as the source to consistently mask those same personal identifiers. Once again, this works for both databases and files.
Deterministic masking is just one of the ways DataMasque can help you generate more realistic masked data, and it only requires a couple of additions to a ruleset. For a more in-depth explanation, you can read the DataMasque deterministic masking documentation.
If you're not already protecting your data with DataMasque, you can get started right now with DataMasque on the AWS Marketplace, or contact DataMasque for Cohesity, on-prem or cloud environment support.