Generic Masks
Generic Masks are used for simple generation or transformation of most data types.
- Fixed value (
from_fixed
)
Outputs a fixed value - From column (
from_column
)
Outputs from another column (or table) - From file (
from_file
)
Generates values sourced from a CSV file - From Blob (
from_blob
)
Outputs a value that is the entire contents of a file - From format string (
from_format_string
)
Generates random values according to a format string - From choices (
from_choices
)
Generates a random value picked from a (optionally weighted) set of values. - Secure shuffle (
secure_shuffle
)
Securely shuffles a given column's values - From json path (
from_json_path
)
Masks apath
with the contents of anotherpath
in the same JSON document
Fixed value (from_fixed
)
A simple mask that replaces all column values with the same fixed value.
Parameters
value
(required): The value to replace all column values with. Can be any data type, but should match that of the column being masked. This value can be enclosed in quotation marks, which will convert the value a string, or entered without quotation marks.
Example
This example will replace all values in the name
column of the employees table with 'Alex'.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: ID
rules:
- column: name
masks:
- type: from_fixed
value: 'Alex'
Show result
Before | After |
|
|
---|
DataMasque also supports blank values for the from_fixed
mask type. To specify a blank entry, you can
enter the value as """
, ''
, or leave the field blank if there are no spaces.
version: '1.0'
tasks:
- type: mask_table
table: customers
key: id
rules:
column: name
masks:
- type: from_fixed
value: ''
From column (from_column
)
A mask to replace a column's value by copying the value from another database column (from the same table, or a joined table).
Parameters
source_column
(required): The name of the column to copy values from. The source column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.
Note:
from_column
is not supported formask_file
tasks, but is supported formask_tabular_file
tasks.
Example
This example will replace the values of the name
column in the employees
table with values
from the first_name
column of the users
table, joined on id
.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
joins:
- target_table: users
target_key: id
source_table: employees
source_key: user_id
rules:
- column: name
masks:
- type: from_column
source_column: users.first_name
Show result
Given the users
table:
id | first_name |
---|---|
24 | Kyle |
25 | Reid |
26 | Helen |
27 | Callum |
28 | Raymond |
29 | Vivien |
The following transformation will take place on the employees
table.
Before | After |
|
|
---|
From file (from_file
)
A mask to replace column values by selecting from a column (seed_column
) in a CSV formatted seed file (seed_file
).
The seed file must contain one or more columns with named column headers. See the Files documentation for
more information on uploading seed files.
If there are any blank values in the file, they will not be considered when randomly selecting a replacement value.
If you wish to have blank values randomly selected from your seed file, ensure they are enclosed with quotation marks. For example, ""
.
Potential replacement values from the seed file rows may be pre-filtered using the table_filter_column
and
seed_filter_column
parameters. This allows you to subset the available values for each row by matching a column value
from the table to a column value in the seed file. If filtering does not match any rows in the seed file,
fallback_masks
can be provided to generate the replacement value instead.
After filtering has been applied (optional), replacement values from the seed file are selected randomly by default.
To enable deterministic selection of replacement values, you may specify one or more hash_columns
for the masking rule.
See Deterministic masking for more details.
Parameters
seed_file
(required): The name of a user-provided CSV seed file to select values from (see Files guide).seed_column
(optional): The name of the column in the seed file that will provide replacement values. This property only needs to be specified for CSV files with multiple columns.table_filter_column
andseed_filter_column
(optional): If both attributes are provided, the replacement value will only be selected from CSV rows where the value for theseed_filter_column
in the CSV row is equal to the value for thetable_filter_column
column in the database row being masked. These options are only supported for use with database masking, or for tabular file masking. These will not be applicable for other file masking circumstances.fallback_masks
(optional): If there are no rows in the CSV that match thetable_filter_column
/seed_filter_column
condition, then this nested list of masks will be applied to the column value instead. If nofallback_masks
are provided and no match is found, then the column value will be replaced with aNULL
value.null_string
(optional): If set, any values in the seed file column that match this value will be treated asNULL
(instead of a string) when written to the database.value_on_missing
(optional): If the mask were to generate aNULL
value (due to filters not matching any rows or fromNULL
values in the seed file) then use this value (if set) instead.
Example
This example masks the street_name
column with a random street name from a seed file
(DataMasque_address_example.csv) while ensuring that the randomly selected street
is from the same city as the original. A fallback value of “Main Street” is used in the event that the database
row contains a city that does not exist in the CSV.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: street_name
masks:
- type: from_file
seed_file: DataMasque_address_example.csv
seed_column: street_name
table_filter_column: city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Show result
Given the following database table and using the DataMasque_address_example.csv seed file:
street_number | street_name | city |
---|---|---|
44 | Queen Street | Auckland |
57 | Oxford Street | London |
12 | Houston Street | New York |
5 | Fake Street | Fake City |
This mask could produce the following masked output:
street_number | street_same | city |
---|---|---|
44 | High Street | Auckland |
57 | Brick Lane | London |
12 | Maiden Lane | New York |
5 | Main Street | Fake City |
From blob (from_blob
)
Note: When used in a
mask_table
task, this mask type will be skipped in a dry run as it requires writing to the database.
A mask to replace column values or overwrite the entire file content with the entire contents of another file.
Important! The performance of
from_blob
is comparatively slower than other masking functions due to its need to write a significantly larger amount of data into the database. Consider runningfrom_blob
as its ownparallel
task.
from_blob
is not available for use inmask_tabular_file
tasks. This means you cannot use it to, for example, mask a Parquet column ofbinary
type.
from_blob
is not supported for Amazon Redshift databases.
from_blob
is not supported for IBM Db2 LUW databases.
The source file can be any file type. DataMasque includes some example binary files, as detailed in Built-in blob files). See the Files documentation for more information on uploading your own files.
Parameters
file_name
(required): The name of a user-provided file to retrieve blob data from (see Files guide).
Example
This database masking example masks the avatar
column which stores blob data and replaces all values of records that
have role
equal to admin
with the file admin_user.png
in the ruleset. Note that this example assumes the file
admin_user.png
has been uploaded to DataMasque.
version: "1.0"
name: from_blob
tasks:
- type: mask_table
table: users
key: id
rules:
- if:
- column: role
equals: "admin"
rules:
- column: avatar
masks:
- type: from_blob
file_name: "admin_user.png"
This file masking example masks .jpg
files by overwriting the content of those files with blob data from
replacement.jpg
, (assuming this file has been uploaded through the DataMasque web UI). All matched masked files will
have the same content after masking.
version: "1.0"
name: from_file
tasks:
- type: mask_file
include:
- glob: *.jpg
rules:
- masks:
- type: from_blob
file_name: "replacement.jpg"
From format string (from_format_string
)
A mask to generate values of a defined format. This is useful when all values of a column have the same format. DataMasque will generate values according to a format specified according to the format string syntax.
Parameters
value_format
(required): The format of the data to be generated, which must meet the requirements of the format string syntax.
version: '1.0'
tasks:
- type: mask_table
table: drivers
key: id
rules:
- column: licence_plate
masks:
- type: from_format_string
value_format: "{[A-Z],3}{[0-9],3}"
Show result
Before | After |
|
|
---|
Note
- In addition to what is possible with the formats for a
mask_unique_key
task there is more you can do with format strings with themask_table
task. For example, you can easily select one value from a small number of alternatives using the syntax for alternatives:{(EN|FR)}
- If you need to guarantee that generated values are unique within a column, consider using the
from_unique
mask type.
From choices (from_choices
)
A mask to replace column values by selecting from a list of choices or a dictionary of choices with weights specified.
Parameters
choices
(required): Can either be a list ofstrings
, or a list ofdictionaries
. If a list of dictionaries, each dictionary has a single key, the "choice" and value is the weight.Weights must be positive, and can be a mix of
float
andint
. Weights do not need to sum to a particular value, they are relative.If there are no weights assigned to any of the choices then all choices will have equal weights.
Examples
This example masks the department
column with a randomly selected department from the list of choices.
version: "1.0"
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: department
masks:
- type: from_choices
choices:
- "accounting"
- "sales"
- "research and development"
Show result
Before | After |
|
|
---|
This example masks the department
column with a randomly selected department from the list of choices based on the weights provided.
version: "1.0"
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: department
masks:
- type: from_choices
choices:
- "accounting": 10
- "sales": 15
- "research and development": 25
Show result
Before | After |
|
|
---|
Secure Shuffle (secure_shuffle
)
Note: When used in a
mask_table
task, this mask type will be skipped in a dry run as it requires writing to the database.
This mask is designed to "shuffle" a number of distinct values in a particular column.
It does not strictly rearrange the data, maintaining distribution and cardinality.
Instead, the distinct values in a particular column are used as random replacements
throughout the column.
This means that after masking a column with secure_shuffle
,
the distribution of values inside will be roughly uniform.
By default, null values are included in the column choices.
If include_nulls_in_choices
is set to false
, null values will not be included in the choices,
and thus, the masked column result will not include null values.
Parameters
include_nulls_in_choices
(optional): When set tofalse
,null
values will not be included as output choices. Defaults totrue
. Please see Working withnull
s below for more information.
Support and Restrictions
Currently, secure_shuffle
only supports Oracle databases.
Because secure_shuffle
works by shuffling data directly in the database,
it is unable to be combined with other masking functions.
For example, the output of secure_shuffle
can not be used in a chain
mask to apply
further masking functions.
secure_shuffle
may not be used as part of an if
expression to conditionally apply
secure_shuffle
or a different mask type to the same column.
If further masking operations are to be performed on shuffled data,
a second mask_table
task should be used to perform these operations after the shuffle
is completed.
Working with null
s
Since secure_shuffle
does not maintain distribution, after masking there may be many
more null
values in the column than there were prior to masking.
Using include_nulls_in_choices
,
along with the skip
option at
the column level can control how null
s are handled.
The following examples shows the effect of include_nulls_in_choices
and skip: null
in the final column values. The original values in the column are A
, B
, C
, and
null
.
Use all replacement values, including nulls
The default options for secure_shuffle
masks will include null
s in the replacement
value, as well as perform replace of nulls null
s.
Use all replacement values, including nulls
The default options for secure_shuffle
masks will include null
s in the replacement
value, as well as perform a replacement of null
s.
version: "1.0"
tasks:
- type: mask_table
table: table_name
key: id
rules:
- column: column_to_shuffle
masks:
- type: secure_shuffle
The following table shows source values and possible replacement values for this ruleset.
Source Value | Possible Replacement Values |
---|---|
A | A, B, C null |
B | A, B, C null |
C | A, B, C null |
null |
A, B, C null |
This ruleset means all values, including null
s, may be used as replacements.
Exclude null
s from replacement values
This ruleset uses include_nulls_in_choices: false
to exclude null
from being used as
a replacement value.
version: "1.0"
tasks:
- type: mask_table
table: table_name
key: id
rules:
- column: column_to_shuffle
masks:
- type: secure_shuffle
include_nulls_in_choices: false
The following table shows source values and possible replacement values for this ruleset.
Source Value | Possible Replacement Values |
---|---|
A | A, B, C |
B | A, B, C |
C | A, B, C |
null |
A, B, C |
This ruleset means that there will be no null
s in the output column.
Typically, this type of data cleansing is unwanted, so it is suggested to only use
include_nulls_in_choices: false
with skipping of null
at a column level,
as the next section demonstrates.
Retain null
values in the output
Combining include_nulls_in_choices: false
with skipping of null
values will mean that
null
values remain as null
in the output, while non-null values are shuffled.
The following ruleset achieves this:
version: "1.0"
tasks:
- type: mask_table
table: table_name
key: id
rules:
- column: column_to_shuffle
skip:
- null
masks:
- type: secure_shuffle
include_nulls_in_choices: false
The following table shows source values and possible replacement values for this ruleset.
Source Value | Possible Replacement Values |
---|---|
A | A, B, C |
B | A, B, C |
C | A, B, C |
null |
null |
Skip null
and include null
in replacement values
This last example is shown for completeness, but doesn't reflect a normal use case.
It skips null
values while also including null
in available outputs.
The means nulls
will be retained, as well as being used as a replacement value.
version: "1.0"
tasks:
- type: mask_table
table: table_name
key: id
rules:
- column: column_to_shuffle
skip:
- null
masks:
- type: secure_shuffle
include_nulls_in_choices: true
The following table shows source values and possible replacement values for this ruleset.
Source Value | Possible Replacement Values |
---|---|
A | A, B, C, null |
B | A, B, C, null |
C | A, B, C, null |
null |
null |
This ruleset is the least production-realistic as it will introduce many more nulls
than there were originally.
Example (nullable varchar)
This example securely 'shuffles' the department_name
column:
version: "1.0"
tasks:
- type: mask_table
table: departments
key: id
rules:
- column: department_name
masks:
- type: secure_shuffle
Show result
Before | After |
|
|
---|
From JSON path (from_json_path
)
A mask to replace data from one JSON path with data from another JSON path within the current masking JSON document. This mask allows masking of values, arrays, or objects.
Parameters
relative_source_path
(required): The path name from which to copy the data. It will be related to thepath
specified in the parenttransform
mask and must only reference a single JSON item, either a value, a list, or an object. An error will be raised if therelative_source_path
references multiple items. Example:relative_source_path: ['users', '*', 'name']
.
Note:
from_json_path
is only supported insidejson
mask. For each JSONtransform
, iffrom_json_path
is specified in the list ofmasks
,force_consistency
must be set tofalse
or omitted. Otherwise, an error will be raised.
Example
This example will replace the departure
data in the json_data
column of the flights
table with values from the destination
data.
version: '1.0'
tasks:
- type: mask_table
table: flights
key: id
rules:
- column: json_data
masks:
- type: json
transforms:
- path: ['departure']
masks:
- type: from_json_path
relative_source_path: ['..', 'destination']
Show result
Before | After |
|
|
---|