Ruleset YAML specification
- Ruleset YAML specification
- Database task types
- Serial tasks (
serial
) - File task types
- Masking Functions
- Definitions
- Advanced ruleset features
- Common regular expression patterns
- Date Parsing in DataMasque
- Determining what unique key masking to use
- Schema versioning
- Writing comments
Ruleset YAML specification
The ruleset YAML configuration provides instructions that DataMasque will follow when performing a masking run against a target database. Rulesets are comprised of one or more tasks, which can contain many different types of instructions. The most common use case is the application of tasks for masking sensitive data in database tables with masking rules.
Ruleset Properties
The following properties are specified at the top-level of a ruleset YAML file:
version
(required): The schema version used by this ruleset. The default value present in the editor when creating a new ruleset is generally the value that you should be using. See Schema Versioning for more information.name
(deprecated; will be removed in release 3.0.0): A unique name that will be used to refer to your ruleset. This field may only contain alphanumeric characters and underscores, and its length is limited to 64 characters.tasks
(required): A list of tasks to be performed in order on the target database. See Database Task Types for the comprehensive list of the available task types and their associated parameters.task_definitions
(optional): A list of task definitions to be referenced from the ruleset's list oftasks
through YAML anchors and aliases.rule_definitions
(optional): A list of rule definitions to be referenced from a list of rules through YAML anchors and aliases.mask_definitions
(optional): A list of mask definitions to be referenced from a list of masks through YAML anchors and aliases.skip_defaults
(optional): See Default values to skip.disable_warning_on_duplicate_rules
(optional) If multiple tasks in the same ruleset attempt to mask the same table or column DataMasque will show a warning. Set this totrue
to suppress the warning. Defaults tofalse
.disable_warning_on_glacier_storage
(optional): Only valid for file masking rulesets If a file is archived with Amazon S3 Glacier storage, DataMasque will show a warning that it is skipping the file. Set this totrue
to suppress the warning and skip the file silently. Defaults tofalse
.random_seed
(optional): Deprecated in favour of the Run secret option. See Freezing random values.
Example
The following ruleset provides an example for replacing the last_name
column of every
row in the users
table with the fixed value "redacted last name"
:
# My ruleset.yml
version: '1.0'
tasks:
- type: mask_table
table: users
key: user_id
rules:
- column: last_name
masks:
- type: from_fixed
value: 'redacted last name'
Note: The tasks must be indented using two spaces. For example:
tasks: - type: mask_table ^^
- type: mask_table
is indented two spaces from the parenttasks:
Important note on case-sensitivity: For all tasks except
run_sql
, database identifiers, such as table and column names, should be referenced as you would otherwise reference them in an SQL query. When masking a case-sensitive database, identifiers must be referenced in the ruleset using the correct case.To refer to a case-sensitive table or column in a database, the identifier must be enclosed in double quotation marks. However, when writing YAML, quotations are used to denote a string value, so any enclosing quotations are not considered as part of the value. As such, it is necessary to enclose the entire name - including double quotation marks - in an outer set of single quotation marks. For example:
# Case-sensitive table name; enclosed in both single and double quotations. table: '"CaseSensitiveTableName"'
To refer to a case-sensitive table in a schema, the schema name must also be enclosed in quotation marks if the schema name is case-sensitive. The entire combination of schema and table name must be enclosed in single quotation marks. For example:
# Case-sensitive schema and table name; enclosed in both single and double quotations. table: '"CaseSensitiveSchemaName"."CaseSensitiveTableName"'
If referencing a combination of table and column, you will need have quotation marks around both the table and column names within the surrounding quotation marks. For example:
# Case-sensitive table and column name; enclosed in both single and double quotations. column: '"CaseSensitiveTable"."CaseSensitiveColumn"'
Identifier names containing double quotation marks, backslashes, periods, and whitespace should always be enclosed in double quotation marks. Also, literal double quotation marks and backslashes must be preceded by a backslash:
# Case-sensitive table and column name containing special characters. column: '"Case\"Sensitive\\Table"."Case.Sensitive Table"'
Notes:
- Backslashes and single quotation marks are not supported in identifier names for Microsoft SQL Server (Linked Server) databases.
- For Redshift databases, DataMasque follows PostgreSQL's rules for case-sensitivity of quoted/unquoted identifiers if
enable_case_sensitive_identifier
istrue
, and treats all identifiers as case-insensitive ifenable_case_sensitive_identifier
isfalse
. See: https://docs.aws.amazon.com/redshift/latest/dg/renablecasesensitiveidentifier.html
Database task types
A ruleset consists of a list of tasks to be performed in sequence on
the target database. Tasks are performed serially from top to bottom,
but special serial
and parallel
tasks can be used to nest other tasks
within them for performance (parallelism) or dependency management.
Sensitive data discovery
The run_data_discovery
task type inspects the metadata of your database, searching for columns
which are likely to contain sensitive data. On completion, a report is generated containing a
summary of all identified columns, and their current masking coverage. It is recommended to include
a single run_data_discovery
task in your rulesets to help ensure complete masking coverage and
provide ongoing protection as new sensitive data is added to your database.
See the Sensitive Data Discovery guide for more information on this feature.
Note: The
run_data_discovery
task type is not currently supported for Amazon DynamoDB.
Parameters
This task type does not have any parameters.
Example
The following shows an example ruleset that will execute only the run_data_discovery
task
and no masking tasks. This example usage may be useful when starting a new ruleset from
scratch to determine a starting point for developing your masking rules.
The run_data_discovery
may also be included in a ruleset alongside other masking tasks to
provide continuous feedback on
the masking coverage provided by the ruleset.
version: '1.0'
tasks:
- type: run_data_discovery
Schema Discovery
The run_schema_discovery
task type inspects the metadata of your database, searching for schemas, tables, and columns
and can flag certain columns which are likely to contain sensitive data. On completion, a report is generated containing
a summary of all identified schemas, tables, columns, and relevant metadata of the data within the columns.
See the Schema Discovery guide for more information on this feature.
Note: The
run_schema_discovery
task type is not currently supported for Amazon DynamoDB.
Parameters
This task type does not have any parameters.
Example
The following shows an example ruleset that will execute only the run_schema_discovery
task
and no masking tasks. This example usage may be useful when starting a new ruleset from
scratch to determine a starting point for developing your masking rules.
version: '1.0'
tasks:
- type: run_schema_discovery
Table masks
Each mask_table
task specifies the masking rules to apply to a
database table, as well as any required joins and any conditionals needed to define which rows
should be masked. Masking rules and masks are applied sequentially in the order
they are listed. When multiple masks are combined in sequence, the output value
from each mask is passed as the input to the next mask in the sequence.
More detail about mask functions and rules is provided under Masking Overview section.
Note: While
mask_table
is suitable for the majority of generic masking requirements, many masking functions are not capable of masking unique keys or primary keys. To mask these columns, use thefrom_unique_imitate
mask function, or the special-purposemask_unique_key
task.
Parameters
Each task with type mask_table
is defined by the following parameters:
table
(required): The name of the table in the database. The table name can be prefixed with a schema name to reference a table in another schema. If the table or schema name are case-sensitive, you must enclose the name in double and single quotation marks in order to specify the casing of the name. For example,table: '"CaseSensitiveSchema"."CaseSensitiveTable"'
- For rulesets run against Amazon DynamoDB connections, the table name can be prefixed with the AWS region that
contains the table e.g.
<region>/<table-name>
(without angle brackets). Note that the region prefix is required if you have not created a.aws/credentials
file or environment variable with the default region or are running DataMasque on EC2. Amazon DynamoDB connections currently only support masking a single table in each ruleset.
- For rulesets run against Amazon DynamoDB connections, the table name can be prefixed with the AWS region that
contains the table e.g.
key
(required): One or more columns that identify each table row. Composite keys may be specified for thekey
parameter. For more details on using composite keys, see Composite keys- For Oracle databases it should always be
ROWID
(key: ROWID
). For more details, refer to Query optimisation. - For Microsoft SQL Server and PostgreSQL databases it is recommended to use the primary key, or any other unique key that is not modified during masking, for better performance.
- The key columns must not contain any NULL value.
- If the key column names are case-sensitive, you may enclose each key value in double
and single quotation marks in order to specify the casing of the key. For example,
key: '"Customer_ID"'
- If the columns specified for the key parameter cannot be used to uniquely identify rows, then the masked values will be the same for rows that have the same key value. Refer to the key and hash columns example in the Notes section for how to avoid producing duplicate masked values.
- Columns specified in the key cannot be masked in the task's
rules
. Amazon DynamoDB is an exception to this rule, allowing key columns to be masked with e.g. thefrom_unique
mask type.
- For Oracle databases it should always be
on_missing_column
(optional): The action to take if acolumn
in the ruleset is not found on the table. This option is only available for Amazon DynamoDB. If specified, must be one oferror
orskip
. Defaults toerror
, which means a masking run will fail if a column in the ruleset does not exist. If set toskip
then missing columns do not cause an error and other columns are masked as normal.use_calculated_bounds
(optional): When set totrue
(default) then the batch bounding keys are calculated using values in the database. If set tofalse
, then DataMasque generates these keys without querying the database, which may improve performance. However, this option is only supported for integer, non-composite keys. For more details, see Calculated or generated bounds.rules
(required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:column
(required): The name of the column to mask. For rulesets run against Amazon DynamoDB connections, a wildcard*
can be used to apply masks to columns not matched by other column's rules.masks
(required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask functions.hash_columns
(optional): A list of columns which will be used as
input to the Deterministic masking algorithm
for this rule. Ifhash_columns
is provided, all mask types that rely on randomisation become deterministic based on thehash_columns
column values. Values in the provided columns can be null.The hash column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.
Hash columns can also be specified by a list of dictionaries with the following keys:
column_name
(required): The name of the column which will be used as input as above.case_transform
(optional):upper
/lower
- Allows for case transforms on the values of the input, ensuring consistent hashed values irrespective of case.trim
(optional): If the value extracted from the column,json_path
, orxpath
contains leading or trailing white space, it can be trimmed by setting this option totrue
. Defaults tofalse
.json_path
(optional): If the column contains JSON data, the path to the value in the JSON data you wish to perform the hash on can be specified here, otherwise the hash will be performed on the entire column.xpath
(optional): If the column contains XML data, the Xpath to the value stored in the XML data you wish to perform the hash on can be specified here, otherwise the hash will be performed on the entire column.coerce_whole_numbers_to_int
(optional): When set totrue
, this option will coerce whole number float values in the specified column to integers before being used for hashing.
For more information on the
json_path
, please refer to the JSON documentation.
For more information on thexpath
, please refer to the XML documentation.
For more information on deterministic masking, please refer to the Deterministic masking.About
coerce_whole_numbers_to_int
When
coerce_whole_numbers_to_int
is enabled, whole number float values will be coerced to integers (e.g.1.0
to1
,4.0
to4
, and so on). This ensures consistent hashing between integer columns and columns where whole numbers are stored as floats/decimals.Numbers that aren't whole are not affected by this option. For example,
1.5
will still be treated as1.5
even when this option is enabled. Strings are also not affected.Example Behavior
Input Input Type Output with coerce_whole_numbers_to_int
set tofalse
Output with coerce_whole_numbers_to_int
set totrue
2.4 float/decimal 2.4 2.4 "3.0" string "3.0" "3.0" "3.1" string "3.1" "3.1" 2.0 float/decimal 2.0 2 5 integer 5 5 workers
(optional): The number of parallel processes to use for masking this table (defaults to 1). Each process will operate on a separate batch of rows from the table (batch size is a configurable run option). Increasingworkers
may decrease masking run times, dependent on database performance and the complexity (e.g. number of columns to mask) of the masking task. It is recommended to increase the number of workers if DataMasque connections to your target database spend more time processing queries than waiting for DataMasque (i.e. the "waiting for clients" time approximates DataMasque CPU time), which suggests DataMasque could efficiently use multiple workers to mask other batches while waiting for database responses. Additionally, it is recommended that the number of parallel processes multiplied by the number of workers assigned to each process does not exceed twice the number of CPUs available on your deployed instance.index_key_columns
(optional): Whenindex_key_columns
istrue
(the default setting), DataMasque will create an additional index on all key columns if there is no existing index contains all key columns. You may wish to disable the creation of a new index if you have an existing index on some of the key columns that will provide adequate performance.- Redshift databases do not support indexes, so this option has no effect for them.
- This option currently has no effect for Microsoft SQL Server (Linked Server) databases.
allow_masking_foreign_keys
(optional): To preserve referential integrity, an error will be raised if any of therules
masks a column that is a foreign key, as the parent of a foreign key should usually be masked withmask_unique_key
and cascaded to its foreign key(s) when possible. This error is disabled when this option is set totrue
. Defaults tofalse
.disable_warning_on_use_of_conditionals
(optional): By default, a warning is shown if conditional optionsskip
,if
/else
orwhere
are used, as this may mean not all rows are being masked (see Conditional Masking). To disable this warning, setdisable_warning_on_use_of_conditionals
totrue
. Defaults tofalse
.
version: '1.0'
tasks:
- type: mask_table
table: users
key:
- user_id
- first_name
rules:
- column: date_of_birth
hash_columns:
- first_name
masks:
- type: from_random_datetime
min: '1980-01-01'
max: '2000-01-01'
If case_transform
is specified under hash_columns
, then the value of the first_name
column will be transformed before
being used for deterministic masking, ensuring that case will not affect the consistency of the mask. In the example
below, where case_transform
is specified on the hash column first_name
, values such as "john" and "John" will both be
converted into "JOHN" before hashing, ensuring that the final masked value is consistent regardless of case.
version: '1.0'
tasks:
- type: mask_table
table: users
key:
- user_id
- first_name
rules:
- column: date_of_birth
hash_columns:
- column_name: first_name
case_transform: upper
masks:
- type: from_random_datetime
min: '1980-01-01'
max: '2000-01-01'
If trim
is specified as true
, then any whitespace in the first_name
column will be removed before the value is used
for deterministic masking. In this case, values such as " John", " John " and "John " will all be converted into "John"
before hashing, ensuring that the final masked value is consistent regardless of surrounding whitespace.
version: '1.0'
tasks:
- type: mask_table
table: users
key:
- user_id
- first_name
rules:
- column: date_of_birth
hash_columns:
- column_name: first_name
trim: true
masks:
- type: from_random_datetime
min: '1980-01-01'
max: '2000-01-01'
Composite keys
A composite key can be specified in following formats:
- A list of columns; for example,
key: ['invoice_id', 'product_id']
would be used to indicate a composite key consisting of a combination of two columns,invoice_id
andproduct_id
. - For Microsoft SQL Server and PostgreSQL, when specifying a composite key, the order of the keys listed must respect the original order as defined in the database.
- A multiline composite key. An example of a multiline composite key is shown below.
key:
- 'invoice_id'
- 'product_id'
Calculated or generated bounds
DataMasque's mask_table
processes a table's data in batches.
It fetches a specified number of rows, masks them, and writes them back to the database.
The number of rows is determined by the batch size.
By default, DataMasque calculates bounding keys for each batch using the values in the
key
column(s) specified in the mask_table
task.
For example, if a table's id
column has values ranging from 100,000 to 200,000 and the batch size is 50,000,
the batches would be as follows:
Batch Number | Start id |
End id |
No. Rows In Batch |
---|---|---|---|
1 | 100,000 | 149,999 | 50,000 |
2 | 150,000 | 199,999 | 50,000 |
3 | 200,000 | 200,000 | 1 |
This approach is called calculated bounds
and is the default method if use_calculated_bounds
is not specified or set to true
.
Calculated bounds work with non-integer columns and composite keys.
As long as the key is unique,
all batches (except the last one if the total number of rows is not divisible by the batch size),
will have the same number of rows as the batch size.
Generating bounding keys requires querying the database,
which can be time-consuming for large tables (e.g. 100,000,000 rows or more).
In such cases,
setting use_calculated_bounds
to false
allows DataMasque to generate bounding keys without querying the table.
This means batch sizes may vary if key values are non-contiguous or do not start at 0,
but no batch will exceed the specified batch size.
With use_calculated_bounds: false
,
DataMasque generates bounds by incrementally adding the batch size until the largest value in the key column is reached.
Here's how the batches would look for the same example:
Batch Number | Start id |
End id |
No. Rows In Batch |
---|---|---|---|
1 | 0 | 49,999 | 0 |
2 | 50,000 | 99,999 | 0 |
3 | 100,000 | 149,999 | 50,000 |
4 | 150,000 | 199,999 | 50,000 |
5 | 200,000 | 200,000 | 1 |
In this scenario, batch 1 is empty. DataMasque skips empty batches, but querying still occurs, which can be inefficient if there are many non-contiguous IDs.
The performance impact of enabling or disabling use_calculated_bounds
depends on your dataset.
Experimenting with use_calculated_bounds: false
is recommended to see if it improves masking speed.
use_calculated_bounds
may only be disabled if all the following conditions are true:
- A single
key
column is specified. - The column is of an integer* type.
- The column must contain no negative values.
0
is allowed. - The column must be a primary key or have a unique index/constraint.
- The maximum value in the column must be less than or equal to 2^32 (4,294,967,296).
- The column must not be nullable (e.g. should be created with
NOT NULL
constraint).
* Supported column types for all databases: INTEGER
, BIGINT
and SMALLINT
.
Additionally, Oracle also supports NUMBER
columns with a scale of 0
.
When to disable calculated bounds
The following table compares the effects of enabling or disabling use_calculated_bounds
:
use_calculated_bounds: true |
use_calculated_bounds: false |
|
---|---|---|
Supports non-integer keys | Yes | No |
Supports composite keys | Yes | No |
Time spent up front querying bounds keys | May have long delays for bigger tables | No time spent up front |
May waste time querying empty batches | No | Yes |
Batch sizes consistent | Yes, except for last batch | Not consistent for non-contiguous keys |
Requires an index* | No | Yes |
*If use_calculated_bounds
is true
and no index is present,
an index will be temporarily added for the task.
Setting use_calculated_bounds
to false
is intended to boost performance on large tables.
However, adding an index to such large tables may be time-consuming, thus making the overall task take longer.
Therefore, if no index exists, the task will fail instead of adding an index.
An index should be added prior to executing the mask_table
task,
either manually or with the use of a run_sql
task.
Supported Database Engines
Disabling calculated bounds is supported on:
- IBM Db2 LUW
- MariaDB
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
Example mask_table
ruleset
version: '1.0'
tasks:
- type: mask_table
table: users
key: user_id
rules:
- column: last_name
masks:
- type: from_fixed
value: "redacted last name"
Notes
- Index operations will be performed online (
ONLINE=ON
) on SQL Server editions that support this feature.- Certain data types cannot be used as
key
columns for specific database connections. See Unsupported Key Columns by Database for details.- While
rules
andmasks
should typically be provided as lists, they can also be specified as dictionaries that map arbitrary keys to rules/masks. For example:
... rules: last_name_rule: column: last_name masks: fixed_mask: type: from_fixed value: "redacted last name"
- Specifying
rules
ormasks
as a dictionary can allow you to override the rule/mask for a specific key when inheriting from a definition.- When masking a table, if a non-unique key is specified for the
mask_table
task alongsidehash_columns
, if the value of thehash_columns
is different for more than one row with same key value, the final masked values will arbitrarily depend on the order that update statements are executed. This can be avoided by including the targetedhash_columns
as part of a composite key for themask_table
task.
Mask a primary key or unique key
The mask_unique_key
task type can be used to mask the values in a primary key or unique key.
Masking of a primary key or unique key has the requirement that all masked values are unique, which
requires the use of this special-purpose task type.
The mask_unique_key
task type replaces all non-null rows of the target key with new, unique
values, generated in accordance with a user-specified format. The target primary or unique key
columns and associated foreign key columns are updated with these unique replacement values in a
single operation to maintain referential integrity.
Any foreign keys that reference a column to be updated will be disabled prior to masking in order to prevent referential integrity errors, and will be re-enabled after the completion of the task.
Notes:
- Each
mask_unique_key
task will mask the members of a single primary key or unique key constraint. Multiplemask_unique_key
tasks are required to mask multiple independent unique keys on a single table.- The
mask_unique_key
task can be used on tables with up to 200,000,000 non-null rows.- When applied to a composite key, replacement values are only generated for rows that contain a complete, non-null key. For any null or partially null rows, all columns of the target key will be set to
NULL
.- The
mask_unique_key
task must only be applied to columns which are in-fact unique (i.e. the target key columns have aPRIMARY KEY
orUNIQUE
constraint enforced). Unique keys that have multipleNULL
rows (e.g. using a filtered unique index in SQL Server) are allowed; such rows will not be modified by this task.- If unique key masking is attempted on a column that does not contain unique values, an error will be raised that a unique index could not be created during the maskuniquekey task and the masking run will fail.
- When masking a clustered index on SQL Server, the performance of
mask_unique_key
can be significantly improved by disabling all other indexes and constraints on the target table for the duration of the task. It is recommended to implement this in your ruleset usingrun_sql
tasks before and after themask_unique_key
task to disable and then re-enable these constraints.- The
mask_unique_key
task does not support SQL Server columns created with theIDENTITY
property, or Oracle / PostgreSQL columns created withGENERATED ALWAYS AS IDENTITY
.- Use of
mask_unique_key
for Amazon DynamoDB, Amazon Redshift or Microsoft SQL Server (Linked Server) databases is not currently supported in DataMasque, consider usingmask_table
with afrom_unique
mask instead.- Due to the random assignment of replacement values, it is possible (though generally rare) that a row may be assigned a masked value that is identical to its pre-masking value. In these cases, the masking is still effective, as an attacker will not be able to identify which rows' values were replaced with an identical value. However, if you need to guarantee that all masked values are different from their pre-masking values, you should use
min
andmax
parameters to ensure the range of possible output values from your format string does not overlap with the range of pre-masking values in your database.- Notes for Amazon DynamoDB: The
mask_unique_key
task type is not supported for Amazon DynamoDB. Thefrom_unique
mask should be used instead.- Notes for masking files: The
mask_unique_key
task type is not supported for file masking. Thefrom_unique
mask should be used instead.
Warning: The
mask_unique_key
task type must not be run in parallel with tasks that operate on any of the following:
- The target
table
of themask_unique_key
task.- Tables containing foreign keys that reference the
target_key
columns (either directly or through a chain of foreign keys).- Any tables specified in
additional_cascades
.
Parameters
Each task with type mask_unique_key
is defined by the following parameters:
table
(required): The name of the database table that contains the primary key or unique key to be masked.target_key
(required): A list of items defining each column that makes up the primary or unique key, and the format in which replacement values will be generated for that column. Composite keys can be masked by including multiple columns and formats in this list. Each item has the following attributes:column
(required): The name of the column to be masked.format
(optional): The format which will be used to generate replacement values for the column. See Format string syntax for details. Defaults to'{!int}'
.
additional_cascades
(optional): Use this parameter to propagate masked values to implied foreign keys of thetarget_key
. Implied foreign keys are dependencies that exist between tables but are not enforced by foreign key constraints, and hence are not defined in the database. Masked values will be cascaded to these columns; see Cascading of masked values for more details on how this works. Eachadditional_cascades
item has the following attributes:table
(required): The name of the table containing the cascade target columns, which have an implicit reference to thetarget_key
of this task. The table name can be prefixed with a schema name to cascade to a table in another schema.columns
(required): A list of column dictionaries - each describing the relationship between a column of the target key and a column on the cascade target table. Each column mapping item has the following attributes:source
(required): The name of a column in the target key from which masked values will be cascaded to the corresponding target column.target
(required): The name of a column on the cascade target table to which masked values from the source column will be cascaded.
update_foreign_keys
(optional): If set totrue
, foreign keys referencing the target columns will be updated in the same way as foreign keys referencing thetarget_key
. Defaults totrue
.
batch_size
(optional): To avoid excessive memory consumption when masking large tables, DataMasque generates replacement values in batches. This value controls the maximum number of unique values that are generated in a single batch. In general, the default of 50,000 will be acceptable for most use cases. The batch size used by DataMasque will be limited by the number of rows in the target table.allow_masking_foreign_keys
(optional): To preserve referential integrity, an error will be raised if any of thetarget_key
columns is a foreign key, as the parent of a foreign key should usually be masked withmask_unique_key
and cascaded to its foreign key(s) when possible. This error is disabled when this option is set totrue
. Defaults tofalse
.action_on_batch_size_exceeded
(optional): Specifies what DataMasque will do if the number of rows returned from the database exceeds the specified batch size, which can lead to excessive memory usage and performance degradation. This could happen if the key columns are non-unique or when querying corrupt tables. If specified, must be eitherwarning
orerror
. Defaults toerror
. Thewarning
option will cause DataMasque to log a warning in the run log and continue masking. Theerror
option will cause DataMasque to fail the masking run and log the error.disable_warning_on_missing_unique_index
(optional): A boolean which, when set totrue
, prevents the warning when a key doesn't have an adequate unique index and DataMasque creates one for the duration of a masking run being logged. Defaults tofalse
.
Note:
- When using
additional cascades
to propagate composite primary or unique key columns to implied foreign key columns, you need to specify all the columns in theadditional cascades
parameter that are corresponding to the referenced primary or unique key columns.
- Warning! Using
additional_cascades
to propagate partial columns on composite primary or unique key columns is not recommended. Not specifying all the columns that are corresponding to the referenced primary or unique key columns will cause data propagation to fail from the referenced composite primary or unique key columns to the composite implied foreign keys. In other words, partial cascades that reference a subset of the target key columns will result in data for omitted columns not being propagated to the target table, potentially resulting in inconsistent data between the two tables.- For Oracle databases, all columns of the
target_key
must be referenced assource
columns inadditional_cascades
. Partial cascades that reference a subset of the target key columns are disallowed. For example: A composite unique key in aPhoneNumbers
table which consists ofPrefix
andLineNumber
columns which are referenced byPhPrefix
andPhLineNumber
columns in theCalllogs
table but without foreign key constraint. Therefore, it is an implied foreign key that requires using theadditional_cascades
parameter to propagate the masked unique key values to ensure data integrity across the tables.A ruleset needs to be written to specify all corresponding implicit foreign key columns in the
additional_cascades
parameter as such that the masked unique keys will be propagated collectively to the foreign keys:
version: "1.0" tasks: - type: mask_unique_key table: PhoneNumbers target_key: - column: Prefix. #part of the composite unique key constraint format: "{!int, 1:150, pad}" - column: LineNumber #part of the composite unique key constraint format: "{!int, 50001:100000, pad}" additional_cascades: - table: CallLogs columns: # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly. - source: Prefix #UK a target: PhPrefix - source: LineNumber #UK a target: PhLineNumber
- However, if they are not composite foreign keys but individual foreign keys a ruleset needs to be written to propagate the masked unique keys individually:
version: "1.0" tasks: - type: mask_unique_key table: PhoneNumbers target_key: - column: Prefix. #has its own unique key constraint format: "{!int, 1:150, pad}" - column: LineNumber #has its own unique key constraint format: "{!int, 50001:100000, pad}" additional_cascades: # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly. - table: CallLogs columns: - source: Prefix #UK a target: PhPrefix - table: CallLogs columns: - source: LineNumber #UK b target: PhLineNumber
Example 1
The following example will mask the primary key column AccountNumber
of the Users
table with unique
replacement values. Another table, Transactions
, also has a column named AccountNumber
which
has a foreign key relationship to the AccountNumber
column of the Users
table.
Account numbers will be generated with at least 6 digits; the minimum value being 100,000
and the
maximum value growing as required depending on the number of rows in the table.
In this ruleset below, you only need to specify masking rules for the primary key column, AccountNumber
,
to be masked. You do not need to explicitly define the foreign key columns to propagate the replacement values
to in the ruleset. DataMasque will automatically detect primary key and foreign key relationships in the database
and propagate the replacement values to any related foreign key columns; in this case, the new values for
AccountNumber
in the Users
table are implicitly propagated to the AccountNumber
column in the Transactions
table.
version: '1.0'
tasks:
- type: mask_unique_key
table: Users
target_key:
- column: '"AccountNumber"'
format: '{!int,100000:}' # Account numbers will be generated with at least 6 digits
Show result
Users table |
|
---|---|
Before | After |
|
|
Transactions table |
|
---|---|
Before | After |
|
|
Example 2
The following example will mask a composite unique key of a PhoneNumbers
table. The composite key
consists of the following columns:
Prefix VARCHAR(3)
: containing a zero-padded integer in the range (1-150). e.g. 001, 002, etc.LineNumber VARCHAR(6)
: containing a zero-padded integer in the range (50,001-100,000).
Values matching these specific formats can be generated using the following format strings:
'{!int, 1:150, pad}'
: Generates integers between 1-150 (inclusive), zero-padded to a fixed width of 3 characters.'{!int, 50001:100000, pad}'
: Generates integers between 50,001-100,000 (inclusive), zero-padded to a fixed width of 6 characters.
Because both of these formats have an upper value bound, we must consider the maximum number of
unique composite values that are available in this space. Multiplying the number of values in the
two ranges (150 * 50,000
), we can determine that these two format strings will supply us with
7,500,000 rows of unique composite values. As a result, this task would fail if applied to a table
containing more than 7,500,000 (non-null) rows.
In addition to the PhoneNumbers
table, this schema also includes a CallLogs
table containing
the columns PhPrefix
and PhLineNumber
, which are references to the values in the Prefix
and
LineNumber
columns of the PhoneNumbers
table. However, due to specific requirements of this
schema, these references are not defined using a foreign key constraint. Without a foreign key,
DataMasque will not automatically propagate the replacement values generated for the PhoneNumbers
table to the CallLogs
table.
In order to ensure the referential integrity of these implicit
references is maintained during masking, this example uses additional_cascades
to instruct
DataMasque on how to cascade updated values to these columns. DataMasque will take the values of Prefix
and LineNumber
columns of the PhoneNumbers
table and propagate these values to the PhPrefix
and PhLineNumber
columns of the CallLogs
table.
version: '1.0'
tasks:
- type: mask_unique_key
table: PhoneNumbers
target_key:
- column: Prefix
format: "{!int, 1:150, pad}"
- column: LineNumber
format: "{!int, 50001:100000, pad}"
additional_cascades:
- table: CallLogs
columns:
- source: Prefix
target: PhPrefix
- source: LineNumber
target: PhLineNumber
Show result
PhoneNumbers table |
||
---|---|---|
Before | After |
|
|
CallLogs table |
||
Before | After |
|
|
Example 3
Consider a database with two tables and two schemas: Customers
in the Accounts
schema; and Transactions
in the
Sales
schema. The following example will mask the primary key column CustomerId
of the Accounts.Customers
table.
However, any changes made to this CustomerId
column must also be reflected on the Customer
column of the
Sales.Transactions
table. Due to specific requirements of this database, these cross-schema references are not defined
using a foreign key constraint. In order to maintain referential integrity, this example uses additional_cascades
to
instruct DataMasque how to cascade updated values to the relevant column present in the other schema.
The Customer ID consists of 3 letters, followed by a hyphen, then a 4-digit number. This will be constructed using a format string to ensure any values generated conform to the required standards.
version: "1.0"
tasks:
- type: mask_unique_key
table: Accounts.Customers
target_key:
- column: CustomerId
format: "{[a-z],3}-{[0-9],4}"
additional_cascades:
- table: Sales.Transactions
columns:
- source: CustomerId
target: Customer
Show result
Customers table |
|
---|---|
Before | After |
|
|
Transactions table |
|
---|---|
Before | After |
|
|
Format string syntax
The format string syntax used by DataMasque to generate unique replacement values currently supports format strings that are combination of alphanumeric characters and symbols. This can be used to generate key values that combined randomly generated portions combined with fixed formatting to generate a key that matches any format required in your key columns.
Format strings consist of variable components that are declared inside
braces {}
, as well as fixed components outside of braces.
Note: In order to guarantee the uniqueness of generated values, format strings used with
mask_unique_key
andfrom_unique
may contain at most one variable component whose length is not constant (e.g. a variable component for an unpadded random integer whose length will depend on the number of digits:{!int, 1:100}
).
Format strings can be constructed using character sets. Character are wrapped in braces and followed by a comma and a number to indicate how many characters in the set are to be generated.
Character Set | Description |
---|---|
[a-z] | Lower case alphabetical characters. |
[A-Z] | Upper case alphabetical characters. |
[0-9] | Numerical characters. |
[aeiou] | Any vowel. Characters can be individually specified without using a range. |
An example format strings is shown below.
format: "{[a-z],2}_{[A-Z],2}-{[a-zA-Z],3}#{[0-9],5}"
In this example, there are four sets of curly braces within the format string, each providing a set of characters followed by a number indicating how many of these characters to generate.
- The first braces specify
{[a-z],2}
, which will generate a string of 2 lower case alphabetical characters. - After this, there is an underscore outside the brackets, which means all generated values will have an underscore after the 2 alphabetical characters.
- The second braces specify
{[A-Z],2}
, which will generate string of 2 upper case alphabetical characters. - After this second variable, there is a hyphen
-
, meaning a hyphen will always be present after the 2 upper case values. - The third braces specify
{[a-zA-Z],3}
. which will generate a string of length 3 consisting of both lower case and upper case alphabetical characters. - After this, there is a hash
#
, meaning a hash character will always be placed after the third generated string. - The final braces specify
{[0-9],5}
, meaning a 5 digit numerical number is placed at the end of the string.
The following values are some example outputs that may be generated using the format string above:
ab_TJ-RaK#10496
pt_oq-TRu#49511
iu_QE-unT#67312
nd_UL-bES#97638
Example
In this example, we wish to mask a series of number plates.
A number plate consists of 3 alphabet characters, followed by 3 numerical digits. We wish to mask the first letter of each number plate with X, followed by 2 random alphabetical characters, followed by a hyphen and a 3-digit number.
A snippet of the table is shown below, where number_plate
is a unique key of the table.
car_registration
Table
number_plate | car_owner |
---|---|
AAA-111 | Anastasia |
BBB-222 | Bill |
CCC-333 | Chris |
DDD-444 | Judith |
EEE-444 | Gordon |
In this case, we will use the following string format.
format: "X{[A-Z],2}-{[0-9],3}"
We are generating a fixed value of X
, followed by 2 alphabetical characters, as defined by {[A-Z],2}
. After this,
there is a hyphen outside of braces, so a static value of a hyphen will always be generated. After this, 3 random
numerical digits are generated.
A ruleset utilising this string format in a mask_unique_key
task can be shown below.
version: "1.0"
tasks:
- type: mask_unique_key
table: car_registration
target_key:
- column: number_plate
format: "X{[A-Z],2}-{[0-9],3}"
Show result
Before | After |
|
|
---|
Integer string format syntax
The !int
operator can be used to generate integers. In its most basic use with no arguments, the format string
{!int}
will generate integers from one to infinity. Extra parameters can be added to set the output range or pad the
output.
Range (min:max
):
The range defines all possible integer values that may be generated by the integer generator. This
is an optional parameter, defaulting to 1:
(min=1, max=unbounded).
min
(optional): The minimum value which will be generated (inclusive). Defaults to1
.max
(optional): The maximum value which will be generated (inclusive). When this value is not specified the maximum value will be unbounded, meaning it will grow depending on the number of values required.
For example:
{!int, 5:}
: generate integers from5
(inclusive) to infinity.{!int, :100}
: generate integers from1
to100
(inclusive).{!int, 20:80}
: generate integers from20
to80
(inclusive).
Zero-pad (pad
):
pad
(optional): When specified, zero-padding will be applied to generated integers - resulting in fixed character width replacement values (e.g. 001, 002, …, 999). The zero-pad width is determined by the width of themax
value, and therefore is only a valid option when amax
value is specified.
For example:
{!int, :1000, pad}
: generate strings in the format0001
,0002
, etc., up to1000
.
Note: Always wrap format strings in either single or double quotes. Leaving format strings unquoted in the ruleset will result in invalid YAML due to the opening
{
character, which has reserved usage in YAML. See below for examples of a correctly quoted format string:
format: '{!int}' # Single quotation marks format: "{!int, pad}" # or double quotation marks
Hex string format syntax
Integers can be generated and output in a hexadecimal format by using the !hex
operator. The range
and pad
options
apply in the same was as for standard integer generation, however range values are interpreted as hexadecimal rather
than decimal.
For example:
{!hex}
: generate hex strings from1
to infinity, i.e.1
,2
, …,a
,b
, …,ff
,100
, etc.{!hex, 10:100, pad}
: generate hex strings from0x10
to0x100
(inclusive) with padding, i.e.010
,011
, …,0fe
,0ff
,100
.
UUID string format syntax
You may choose to generate values in the Universal Unique Identifier (UUID) format by declaring a !uuid
format string.
A UUID is a string of 32 hexadecimal digits (0 to 9, a to f), separated by hyphens. The number of characters per hyphen
are 8-4-4-4-12. An example UUID would be 12345678-90ab-cdef-1234-567890abcdef
.
In order to generate a UUID as the unique key, simply specify uuid in the format as shown below.
format: "{!uuid}"
You may also specify a prefix within the format string of up to 8 characters. This will ensure that the first characters
in the UUID are always static. For example, specifying format: "{!uuid,aaaa}"
will cause the first 4 characters of
every UUID generated by the ruleset to be a
.
Alternatives string format syntax
You can have DataMasque select one value from a set of alternatives for each generated value.
Such a segment can be specified by wrapping your set of pipe/|
-separated alternatives in parentheses ()
.
For example, {(EN|FR)}-{!int}
can be used to generate an integer prefixed by either EN
or FR
.
At least two alternatives must be specified.
Note: This should only be used with the
from_format_string
mask in amask_table
task as it will not satisfy the unique requirement for themask_unique_key
task.
Cascading of masked values
Referential integrity of data references to the target_key
of a mask_unique_key
task is
maintained by "cascading" the masked replacement values to each reference. DataMasque will perform
this cascade automatically for relationships defined by a database foreign key constraint. Masked
values can also be cascaded to columns that are not members of such a foreign key constraint by
using the additional_cascades
feature.
Specifically, foreign keys that meet the following criteria will be updated:
- Foreign keys of the
target_key
or anadditional_cascade
withupdate_foreign_keys: true
will be updated. - Only foreign keys in the same schema as the target table will be updated.
- Only foreign keys that are enabled will be updated. Foreign keys
that are present, but disabled at the time of masking will be
excluded. Specifically:
- In Oracle, foreign keys disabled by
ALTER TABLE some_table DISABLE CONSTRAINT some_constraint
are not updated. - In Microsoft SQL Server, foreign keys disabled by
ALTER TABLE some_table NOCHECK CONSTRAINT some_constraint
- Note that MySQL, MariaDB, PostgreSQL and Redshift do not have equivalent support for disabling foreign key constraints, so all discovered foreign keys are updated.
- In Oracle, foreign keys disabled by
- Foreign keys will be searched for recursively - i.e. Foreign keys that reference an already updated foreign key will also be updated.
- A foreign key that references a superset of the columns in its
referenced key or the original
target_key
will be updated, but the columns that do not reference thetarget_key
will be left unchanged.- For example, consider using
mask_unique_key
to mask a key columnuser_id
. If a composite foreign key (ref_user_id
,ref_group_id
) referencesuser_id
and another columngroup_id
, then theref_user_id
in that foreign key will be updated but theref_group_id
will not.
- For example, consider using
- A foreign key that references a subset of the columns in the
target_key
will NOT be updated, as row references may be ambiguous.- For example, consider using
mask_unique_key
to mask a composite key (user_id
,group_id
). If a foreign key referencesuser_id
orgroup_id
alone, then that foreign key will not be updated. - Note that if an
additional_cascade
withupdate_foreign_keys: true
uses the sametarget_key
column as asource_column
multiple times, then foreign keys of thatadditional_cascade
will still be updated if they reference each column in the originaltarget_key
at least once. - Additionally, if an
additional_cascade
withupdate_foreign_keys: true
references only a subset of thetarget_key
columns, then foreign keys of thatadditional_cascade
will be updated as long as they reference all the target columns in theadditional_cascade
.
- For example, consider using
If any cascade would update the same set or a subset of the columns of
another cascade, it will be marked as redundant and only the other
cascade will be performed. For example, consider using
mask_unique_key
to mask a key column user_id
where a foreign key
column ref_user_id
references user_id
and an additional_cascade
is specified to update both ref_user_id
and ref_user_id_2
from the
single source_column
of user_id
. In this situation, the foreign
key is redundant as it updates a subset of the columns in the
additional cascade, so only the update specified by the additional
cascade needs to be performed.
The following situations would result in an unresolvable set of cascades, and will cause DataMasque to report an error in the run log:
- Any cascade whose columns partially overlap those of another
cascade without either column set being a superset of, subset of,
or equal set to the other column set.
- For example, consider using
mask_unique_key
to mask a key columnuser_id
that is referenced by a composite foreign key (ref_user_id
,ref_group_id
) while anadditional_cascade
is specified to update columnsref_user_id
andref_role_id
. Because the foreign key andadditional_cascade
would both updateref_user_id
but would also update columns that each other cascade would not update (ref_group_id
andref_role_id
), they cannot be resolved and an error will be reported.
- For example, consider using
- Any cascade that would update a column from a different column in
the
target_key
than another cascade.- For example, consider using
mask_unique_key
to mask a composite key (user_id
,group_id
) that is referenced by a composite foreign key (ref_user_id
,ref_group_id
) while anadditional_cascade
is specified to updateref_user_id=group_id
andref_group_id=user_id
. Because the foreign key andadditional_cascade
would updateref_user_id
andref_group_id
from different source columns, they cannot be resolved and an error will be reported.
- For example, consider using
Any rows of a cascade target (child table) which contain values that are not present in the
target_key
(on the parent table) will have their cascade target
columns set to NULL
. This
situation may occur in one of the following cases:
The cascade target is a foreign key that has at some point been disabled, had values updated, then been re-enabled without being checked / validated. i.e.
- For Oracle the constraint was re-enabled using
NOVALIDATE
. - For Microsoft SQL Server, the constraint was re-enabled without using
WITH CHECK
. - For PostgreSQL, the constraint was dropped and recreated instead of being disabled and re-enabled.
- For Oracle the constraint was re-enabled using
The cascade target is an implicit reference without database constraints.
The behaviour is designed to ensure that no rows are left unmasked on the cascade target.
Important!
- The reported number of rows updated by
mask_unique_key
will include updates made by all cascades, and count each time a row is affected by a different update (i.e. multiple cascade updates on the same table).
Troubleshooting failed updates
Queries run by mask_unique_key
to update key columns may fail in the following circumstances:
- As mentioned above, when any column in a composite key contains a
null
value, all columns will be set tonull
. If any columns in the composite key have a not-null
constraint, then such an update query will fail. To update composite keys with a mix of nullable and non-nullable columns, consider updating either the nullable or non-nullable columns withmask_unique_key
and updating the remaining columns withmask_table
and thefrom_unique
mask type.
Build a temporary table
If you need to repeatedly join multiple tables when masking tables,
or you need to perform some custom SQL
transformations to column values, then you may wish to use the
build_temp_table
task type to create a temporary table which can then be
accessed via a join during a mask_table
task:
Parameters
table_name
(required): The name of the temporary table to create in the database. You will need to use this name when referencing this temporary table later (e.g. injoins
and masking rules). The table name can be prefixed with a schema name if the temporary table should be created in a schema other than the user's default schema.sql_select_statement
(this ORsql_select_file
required): A string containing aSELECT
statement to define the contents of the temporary table. To break the statement across multiple lines, you may use multi-line YAML syntax (|-
or>-
).sql_select_file
(this ORsql_select_statement
required): The name of a user uploaded SQL script file containing aSELECT
query to define the contents of the temporary table. See the Files guide for more information on uploading SQL script files. Use this parameter if you have a complex / long query, or you wish to share the same query between many rulesets.
Example (sql_select_statement
)
version: '1.0'
tasks:
- type: build_temp_table
table_name: my_temporary_table
sql_select_statement: >-
SELECT accounts.account_id, address.city
FROM accounts
INNER JOIN address
ON accounts.address_id = address.id
WHERE accounts.country = 'New Zealand';
- type: mask_table
table: my_temporary_table
key: id
rules:
- column: city
masks:
- type: from_file
seed_file: DataMasque_NZ_addresses.csv
seed_column: city
Example (sql_select_file
)
version: '1.0'
tasks:
- type: build_temp_table
table_name: my_temporary_table
sql_select_file: create_temp_table.sql
- type: mask_table
table: my_temporary_table
key: id
rules:
- column: city
masks:
- type: from_file
seed_file: DataMasque_NZ_addresses.csv
seed_column: city
Notes:
DataMasque will create the temporary tables before applying any masks, and delete them after all tables have been masked. DataMasque will also ensure the temporary tables do not already exist in the database (removing existing temporary tables with the same name if needed). The temporary tables you define will only be available for use in joins, and cannot be masked themselves.
For Microsoft SQL Server databases, temporary table names must begin with the
##
characters, as they will be created as "Global Temporary Tables" so that they are visible to all parallel masking connections. However, in YAML the#
character begins an inline comment, so the temporary table name must be wrapped in double or single quotes (e.g.table: '##my_temporary_table'
).- If using the temporary table name in any other tasks,
it must also be quoted to avoid the
#
character acting as a YAML comment. For example, in arun_sql
task, quote the entire SQL query:
sql: "CREATE INDEX IDX_name ON ##my_temporary_table(name);"
- If using the temporary table name in any other tasks,
it must also be quoted to avoid the
For Oracle and PostgreSQL databases, "temporary tables" are created as regular tables so that temporary tables are visible to all parallel masking connections.
The
build_temp_table
task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
Run SQL
Use the run_sql
task type when you need to run any native SQL DDL or DML statements or scripts at any point within a
ruleset.
You can supply SQL for DataMasque to execute either as a script file (see Files guide), or inline in the ruleset:
Parameters
sql
(this ORsql_file
required): An SQL script to be executed. For multi-line scripts, you may use the YAML block style syntax (|-
).sql_file
(this ORsql
required): The name of a user-provided file containing an SQL script to be executed (see Files guide). Use this parameter if you have large blocks of SQL to run, or scripts that you wish to share between many rulesets.
Example Microsoft SQL Server (sql
)
Note: This example uses Microsoft SQL Server specific syntax, as master has been specified as the target database.
version: '1.0'
tasks:
- type: run_sql
sql: |-
USE [master];
ALTER DATABASE eCommerce SET RECOVERY SIMPLE WITH NO_WAIT;
USE [eCommerce];
ALTER TABLE [SalesRecords].[Customer] DROP CONSTRAINT [FK_SALESRECORDS_CUSTOMER];
ALTER TABLE [Invoices].[Customer] DROP CONSTRAINT [FK_INVOICES_CUSTOMER];
Example MySQL (sql
)
version: '1.0'
tasks:
- type: run_sql
sql: |-
CREATE PROCEDURE dorepeat(p1 INT)
BEGIN
SET @x = 0;
REPEAT SET @x = @x + 1; UNTIL @x > p1 END REPEAT;
END;
CALL dorepeat(1000);
Example (sql_file
)
tasks:
- type: run_sql
sql_file: pre_script_1.sql
Notes:
- The
run_sql
task type executes in autocommit mode, and will exit on the first error encountered.- The
run_sql
task type does not run in dry run mode.- The
run_sql
task type is not currently supported for Microsoft SQL Server (Linked Server) databases.- The
run_sql
task type currently only works with one SQL query per task for IBM DB2 LUW databases.- For PostgreSQL, MySQL and MariaDB connections, the SQL interpreter considers a colon followed by letters or numbers as bound parameters. The colon can be escaped with a
\
before it. e.g. Attempting to insert JSON data:INSERT INTO table_name (column_name) VALUES('{"is_real" :true, "key":"value"}'::json)
run_sql
will interprettrue
as a bound parameter, to fix this you can escape the bound parameter by adding a backslash(\
) before the colon(:
) as follows:INSERT INTO table_name (column_name) VALUES('{"is_real" \:true, "key":"value"}'::json)
Notes for Oracle:
- It is recommended to test the execution of your SQL script with Oracle SQLPlus before use in a
run_sql
task.- DataMasque appends "WHENEVER SQLERROR EXIT SQL.SQLCODE" to the beginning of the SQL script, so the
run_sql
task will exit on the first error encountered.- Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed
run_sql
task.run_sql
does not use the schema specified in the connection configuration, instead it will default to the schema of the user. If you wish to change schema, specify it in the script withSET SCHEMA
.- When running multiple statements in a
run_sql
block, a/
character is required between each statement ifEND
is used in the transaction.
Notes for Microsoft SQL Server:
- It is recommended to test the execution of your SQL script with Microsoft
sqlcmd
or SQL Server Management Studio before use in arun_sql
task.- Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed
run_sql
task.
Notes for PostgreSQL:
- It is recommended to test the execution of your SQL script with
psql
before use in arun_sql
task.- A
run_sql
task will be executed with a simple query cycle, where statements are executed in a single transaction (unless explicit transaction control commands are included to force a different behaviour).
Notes for Redshift:
- Using multiple SQL statements in a single
run_sql
task type is not supported currently.
Notes for MySQL and MariaDB:
- It is recommended to test the execution of your SQL script with MySQL/MariaDB shell before use in a
run_sql
task.- Using
DELIMITER
to redefine the MySQL/MariaDB delimiter is not required.DELIMITER
is amysql
/mariadb
command-line client command and is not supported when defining stored programs inrun_sql
tasks.- Warning: Executing
run_sql
tasks with more than one statement may not raise errors upon failure. If the first statement executes correctly but subsequent statements fail, errors may not be raised.
Notes for Amazon DynamoDB:
- The
run_sql
task type is not supported for Amazon DynamoDB.
Notes for IBM DB2 LUW:
- On IBM DB2 LUW,
run_sql
tasks can only execute one SQL query per task.- You may have to split up SQL into multiple
run_sql
tasks to achieve desired results.
Truncate a table
Use the truncate_table
task type to specify tables to be truncated by
DataMasque. All rows will be deleted, but the table structure will be left in
place.
Parameters
table
(required): The name of the table to truncate. The table name can be prefixed with a schema name to reference a table in another schema.
Example
tasks:
- type: truncate_table
table: history_table
...
Notes:
- The
truncate_table
task type does not run in dry run mode.- The
truncate_table
task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.- The
truncate_table
task type does not properly truncate on IBM DB2 LUW databases, instead it deletes all rows in the table. See IBM Db2 LUW Limitations.
Parallel Tasks
Using the parallel
task type, you can specify a block of tasks to be executed
in parallel, spread across as many workers as are available.
Each parallel task distributes to a maximum of 10 sub-tasks. It is recommended to begin testing parallelisation with at most 4 tasks in parallel, then increase parallelisation if the database has more capacity.
Parallel tasks can be nested inside other serial/parallel tasks.
Parameters
tasks
(required): A set of tasks to perform in parallel.
Example
tasks:
- type: parallel
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: 'name'
masks:
- type: from_fixed
value: 'REDACTED'
- type: mask_table
table: customers
key: id
rules:
- column: 'address'
masks:
- type: from_fixed
value: 'REDACTED'
...
Warning: You should not mask the same table in multiple tasks (including
mask_table
,run_sql
tasks) in parallel, as this could result in data being incorrectly masked.Note: Currently, the
parallel
task type is not supported for either Amazon DynamoDB or file masking.
Serial Tasks
Although tasks are performed serially in the order they are listed in the ruleset by default, you can specify a block of tasks to be performed in serial within a parallel block. This is useful when a subset of parallelisable tasks have dependencies that mean they must be executed in sequence.
Serial tasks can be nested inside other serial/parallel tasks.
Parameters
tasks
(required): A set of tasks to perform in series.
Example
tasks:
- type: parallel
tasks:
- type: serial
tasks:
- type: run_sql
sql_file: pre_employees_script.sql
- type: mask_table
table: 'employees'
key: id
rules:
- column: 'name'
masks:
- type: from_fixed
value: 'REDACTED'
- type: mask_table
table: 'customers'
key: id
rules:
- column: 'address'
masks:
- type: from_fixed
value: 'REDACTED'
...
Note: Currently, the
serial
task type is not supported for file masking.
File task types
Each mask_file
or mask_tabular_file
task specifies the masking rules to apply to each file in the base directory and/or any subdirectories.
As well as any files/directories intended to be skipped or included, and any conditionals required to define which data to mask in the masking process.
Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value
from each mask is passed as the input to the next mask in the sequence.
Note If the source connection and destination connection are of the same type and have the same base directory, the files will be overwritten. The list of files is read at the start of the masking run, so new files added during the masking run will not be masked and will not be present in the destination.
A ruleset consists of a list of tasks to be performed in sequence on the target data source after the file is masked, it is then written to the selected data destination.
Sensitive data discovery
The run_file_data_discovery
task type inspects the metadata and optionally the contents of your files, searching for columns
which are likely to contain sensitive data. On completion, a report is generated containing a
summary of all identified columns, and their current masking coverage.
See the Sensitive Data Discovery guide for more information on this feature.
Parameters
in_data_discovery
(required): Enable In-Data discovery to examine the contents of discovered files for sensitive data matches. See In-Data Discovery.custom_keywords
(optional): Additional custom keywords and to facilitate sensitive data discovery.disable_built_in_keywords
(optional): Disable DataMasque's built-in discovery keywords. See Built-in data discovery keywords.disable_global_custom_keywords
(optional): Disable any global custom keywords configured on the settings page.disable_global_ignored_keywords
(optional): Disable any global ignored keywords configured in the settings page.ignored_keywords
(optional): Additional ignored keywords to facilitate sensitive data discovery.encoding
(optional): The encoding to use when reading files. Defaults toUTF-8
. Refer to Python Standard Encodings for a list of supported encodings.include
(optional): Specifies files to include in the discovery run. If not set, all files will be discovered (except those skipped by askip
rule).regex
(optional): Files that match this regular expression will be discovered.glob
(optional): Files that match this glob will be discovered.match_against
(optional): Choose if theinclude
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
recurse
(optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be discovered. Defaults tofalse
.skip
(optional): Specifies files to not include in the discovery run.regex
(optional): Files that match this regular expression will not be included for discovery.glob
(optional): Files that match this glob will not be included for discovery.match_against
(optional): Choose if theskip
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
Examples
The following shows an example ruleset that will execute only the run_file_data_discovery
task and no masking tasks.
This example usage may be useful to determine a starting point for developing your masking rules.
version: '1.0'
tasks:
- type: run_file_data_discovery
include: [{"glob": "*.csv"}]
in_data_discovery: {"enabled": False}
This example enables In-Data discovery:
version: '1.0'
tasks:
- type: run_file_data_discovery
include: [{"glob": "*.csv"}]
in_data_discovery: {"enabled": True}
Object file masks
Each mask_file
task defines masking operations to be performed on a file or set of files.
mask_file
tasks support the following file types:
- JSON
- NDJSON
- XML
- Apache Avro
In addition, any file type can be used with from_blob
,
which will replace the entire file with another file.
Note The file type will be determined solely by the file extension, not by its content.
Parameters
Each task with type mask_file
is defined by the following parameters:
rules
(required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:masks
(required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the file. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
recurse
(optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults tofalse
.workers
(optional): The number of parallel workers to use for this masking task. Defaults to 1.skip
(optional): Specifies files to not include in the masking run.regex
(optional): Files that match this regular expression will not be included for masking.glob
(optional): Files that match this glob will not be included for masking.match_against
(optional): Choose if theskip
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
include
(optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by askip
rule).regex
(optional): Files that match this regular expression will be masked.glob
(optional): Files that match this glob will be masked.match_against
(optional): Choose if theinclude
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
encoding
(optional): The encoding to use when reading and writing files. Defaults toUTF-8
. Refer to Python Standard Encodings for a list of supported encodings.
For more information about the ordering of
skip
andinclude
please refer to Include/Skip.
Note: Increasing the number of workers can reduce masking time by masking files in parallel, but extra workers will not mask different parts of the same file. This is useful if there are a large number of files to be masked, but will not reduce the amount of time required to mask a single large file. For files that can be split into multiple parts (for example, tabular files or multi-record files), manually chunking and recombining the files can improve masking speed.
Note:
regex
/glob
will match to the path from the base directory specified in the source connection, consider adding.*
(regex) or*
(glob) to the beginning of the expression for matching. For example: If the structure is /path1/path2/target_file.json, if the base directory is path1/ andrecurse: true
in the ruleset, theregex
/glob
will try to match path2/target_file.json. When including a path by specifying aglob
such as target_path/* therecurse
option needs to be set totrue
otherwise the included path won't be entered and the files it contains will not be masked.
Supported file types
In general, mask_file
has been designed to mask XML, JSON, NDJSON, and Apache Avro files. Each file is loaded as a string and passed to the
masks
. Therefore, to mask a JSON file, a json
mask would be implemented, like the following example:
version: "1.0"
tasks:
- type: mask_file
recurse: true
skip:
- regex: '.*\.json'
- glob: "input/*"
include:
- glob: "other_inputs/*.json"
rules:
- masks:
- type: json
transforms:
- path: ['name']
masks:
- type: from_fixed
value: REDACTED
This would replace the root name
attribute in the JSON with the text REDACTED.
Multirecord Files (NDJSON/Avro)
NDJSON (Newline Delimited JSON) (.ndjson
) files can be masked by specifying a json
mask in the ruleset which will be
applied to each line of JSON in the NDJSON file.
For NDJSON files, a similar ruleset can be specified which will be applied to each record of the file, rather than being applied to the entire file.
version: "1.0"
tasks:
- type: mask_file
rules:
- masks:
- type: json
transforms:
- path: ['name']
masks:
- type: from_fixed
value: REDACTED
This would replace the root name
attribute for each record in the NDJSON with the text REDACTED.
Apache Avro (.avro
) files can be masked in the same way as NDJSON files. When masking Avro files there also needs to be an
Avro Schema file (.avsc
) in the same directory and with the same name as the .avro
file, as the schema file is required to
write the masked .avro
file to ensure data type consistency.
For example, if a file is called data.avro
DataMasque will expect to find a schema file called data.avsc
in the same directory.
Note: The Avro Schema file (
.avsc
) will be copied unmasked into the output directory. Deterministic Masking is also supported for multirecord files, for more information please refer to Deterministic Masking with Files.
Similarly, for XML files, use an xml
mask:
version: "1.0"
tasks:
- type: mask_file
recurse: true
skip:
- regex: '^(.*).xml'
- glob: "input/*"
include:
- glob: "other_inputs/*.xml"
rules:
- masks:
- type: xml
transforms:
- path: 'User/Name'
node_transforms:
- type: text
masks:
- type: from_fixed
value: REDACTED
This would replace the content of the node(s) at User/Name
with the text REDACTED.
To mask other types of files, basic redaction is possible. For example, to replace the contents of every txt
file with
the text REDACTED:
version: "1.0"
tasks:
- type: mask_file
recurse: true
include:
- glob: "*.txt"
rules:
- masks:
- type: from_fixed
value: REDACTED
It is possible to use any mask that accepts text input (or no input), although their effectiveness will depend on the size and content of the input file.
Note also, that files that have not been processed will not be copied from the source to the destination. That is,
DataMasque will either load a file (based on skip
/include
rules), mask it, then copy it to the destination, or
it will ignore the file. Unmasked files will not be copied to the destination.
Tabular file masks
Each mask_tabular_file
task defines masking operations to be performed on a file or set of files.
mask_tabular_file
tasks support the following file types:
- Comma separated (CSV), Tab delimited (TBV), and other character-delimited files
- Parquet
- Fixed-width columns
Parameters
Each task with type mask_tabular_file
is defined by the following parameters:
rules
(required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:column
(required): A column within the tabular file intended for masking. This is the header row of the column (for CSVs) or name of column (for parquet files).masks
(required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask functions.
recurse
(optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults tofalse
.workers
(optional): The number of parallel workers to use for this masking task. Defaults to 1.skip
(optional): Specifies files to not include in the masking run.regex
(optional): Files that match this regular expression will not be included for masking.glob
(optional): Files that match this glob will not be included for masking.match_against
(optional): Choose if theskip
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
include
(optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by askip
rule).regex
(optional): Files that match this regular expression will be masked.glob
(optional): Files that match this glob will be masked.match_against
(optional): Choose if theinclude
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
encoding
(optional): The encoding to use when reading and writing files. Defaults toUTF-8
. Refer to Python Standard Encodings for a list of supported encodings.delimiter
(optional): The character that is used to delimit columns in character-delimited files (e.g.,
for CSVs or\t
for TDVs [Tab Delimited Values]). Defaults to,
.quoting
(optional): Controls how the masked file will be quoted. Valid values areall
,minimal
andnone
. Defaults tominimal
. See Quoting options.column_names
(optional): An array of string defining the names of the fixed-width columns or non-header character-delimited file and used to refer to them in masking rules. For masking fixed-width files, it is required iffixed_width_extension
is specified, and must match the length offixed_width_columns_indexes
.fixed_width_extension
(optional): The file extension that fixed-width files have. Not required if no fixed-width files are to be masked. Should not include a leading.
(e.g. specifytxt
not.txt
).fixed_width_columns_indexes
(optional): An array of two-element arrays of start and end indexes of the fixed width columns. Required iffixed_width_extension
is specified.fixed_width_column_names
(optional, deprecated): This option was renamed tocolumn_names
option. However, it is still supported for backward compatibility.fixed_width_too_wide_action
(optional): The action to take if masked data exceeds the width of the column, can be eithertruncate
to truncate the value to fit in the column orerror
to raise an error and stop the masking run. Defaults totruncate
.fixed_width_line_ending
(optional): The line ending to use when writing out the fixed width data. Will attempt to be detected from the input file, otherwise defaults to\n
.
For more information about:
- How tabular file types are detected, see Tabular File Type Detection.
- Parameters for fixed-width file masking, see Fixed Width File Masking Parameters.
- The ordering of
skip
andinclude
, please refer to Include/Skip. - Table
join
s are not supported in tabular file masking.
Note:
regex
/glob
will match to the path from the base directory specified in the source connection, consider adding.*
(regex) or*
(glob) to the beginning of the expression for matching. For example: If the structure is /path1/path2/target_file.json, if the base directory is path1/ andrecurse: true
in the ruleset, theregex
/glob
will try to match path2/target_file.json. When including a path by specifying aglob
such as target_path/* therecurse
option needs to be set totrue
otherwise the included path won't be entered and the files it contains will not be masked.
version: "1.0"
tasks:
- type: mask_tabular_file
recurse: true
skip:
- regex: '.*\.json'
- glob: "input/*"
include:
- glob: "other_inputs/*"
rules:
- column: name
masks:
- type: from_fixed
value: REDACTED
Tabular File Type Detection
DataMasque uses file extensions to determine how tabular files are loaded for masking. The extension comparison is not
case-sensitive. Files with extension parquet
are treated as Apache Parquet files. If
a fixed_width_extension
is specified (see Fixed Width File Masking Parameters
below) then the file is loaded as fixed-width. All other file types are loaded as character-delimited. If the
delimiter
parameter is unspecified then it defaults to ,
(i.e. CSV).
Note Delimiters must be a single character, and should be quoted with double quotes. For example:
delimiter: " "
for a space ordelimiter: ";"
for a semicolon. Tabs or other invisible delimiters should use the correct escape sequence with a backslash. For example,delimiter: "\t"
for a tab.
The skip
, include
and delimiter
options should be used together to specify the correct delimiter for each file
type that is to be masked.
Note CSV files require header columns for tabular masking as the header columns are used as column names during masking. CSV and fixed-width files are all string based files, therefore values should be cast to other types if being used with masks that require specific types (e.g.
numeric_bucket
). To do this use atypecast
mask, for more information please refer to Typecast
DataMasque will only attempt to load fixed width files if fixed_width_extension
is specified, and will treat any files
with this extension as fixed-width. See also Fixed Width File Masking Parameters.
Once files are loaded they are all masked in the same way, that is, rules are executed and applied on a per row/column basis regardless of the original source type. Data will be written back out in the same format as it was read.
Quoting options
Character delimited files (such as CSVs) handle special characters by quoting the field when written to a file. Special characters include the delimiter (e.g. ,
for CSV), double quotes ("
) or new line characters.
The field content is wrapped in double quotes, and double quotes are replaced by two double quotes.
For example, the text:
She said, suddenly, "Yes, that's right!"
would be written to a CSV file as:
"She said, suddenly, ""Yes, that's right!"""
This quoting will happen automatically if the field requires quoting, but the behavior can be modified based on the quoting
option.
all
: quote all fields.minimal
(default): only quote those fields which contain special characters such as delimiter, double quote or newlines.none
: never quote fields. If the delimiter occurs in the output data then an error will be raised.
Fixed Width File Masking Parameters
Masking of fixed-width files is only attempted if fixed_width_extension
is specified. If fixed_width_extension
is
present in the ruleset without fixed_width_columns_indexes
and column_names
, then an error will be
raised. However, it is valid to have fixed_width_columns_indexes
and column_names
missing if
fixed_width_extension
is also absent.
If fixed_width_extension
is set, then DataMasque will treat any files with that extension as fixed-width and load them
based on the other fixed-width options. To assist in explaining the rules, consider an example file called users.txt
.
This is the content:
Adam 2010-01-01 AAA-1111
Brenda 2010-01-01 EEE-5555
Charlie 2010-02-02 GGG-7777
It has 3 columns, the first containing a name, which is from index 0
to 8
. The second column contains a date and
spans from 8
to 19
. The final column contains a transaction ID and spans from index 19
to 27
.
Note these indexes are specified to be contiguous as some fixed-width formats require contiguous columns, therefore a
trailing space is included in the first and second columns. DataMasque automatically strips leading and trailing spaces
when the data is read. Contiguous columns are not required though, so the same result could be achieved with indexes
(0, 7)
, (8, 18)
and (19, 27)
. When non-contiguous columns are specified DataMasque inserts spaces in between
columns.
Since fixed-width files do not have column headers, the ruleset must also specify these. They can be any arbitrary
valid column identifier (i.e. alphanumeric string without special characters) and are used to identify the columns in
the masking rules. In this case they will be named name
, date
and transaction_id
.
Considering these rules will yield a ruleset like this:
version: "1.0"
tasks:
- type: mask_tabular_file
recurse: true
fixed_width_extension: txt
fixed_width_columns_indexes:
- [0, 8]
- [8, 19]
- [19, 27]
column_names:
- name
- date
- transaction_id
rules:
- column: name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: date
masks:
- type: from_random_date
min: '1950-01-01'
max: '2000-12-31'
- type: typecast
typecast_as: string
date_format: '%Y-%m-%d'
- column: transaction_id
masks:
- type: imitate
Note that when this ruleset is executed, DataMasque will still load any CSVs or Parquet files it encounters, however it
will use the standard loaders instead of applying the fixed width rules. In this case fixed-width rules will only be
used for txt
files.
This can be useful if CSV or Parquet files exist in the source with the same columns and need to be masked in the
same manner. If these files exist, and they shouldn't be masked, then skip
rules should be added to skip them.
Supported Parquet column data types
Scalar types
The following scalar Parquet column data types are supported for masking,
and likewise for use with seed_filter_column
in a from_file
mask, from_column
, if
, or hash_columns
.
- Null type:
null
(sometimes calledna
) - Integer types:
int8
,int16
,int32
,int64
- Unsigned integer types:
uint8
,uint16
,uint32
,uint64
- Floating-point types:
float
,double
- Boolean type:
bool
- String types:
string
,large_string
- Binary types:
binary
,large_binary
- Date and time types:
date32
,date64
,timestamp
,time32
,time64
,duration
Complex types
DataMasque supports the complex Parquet data types struct
, map
, and list
,
which are sometimes referred to as nested columns.
Note: Map keys must be strings. DataMasque does not support maps with keys of other data types. Map values, and the contents of lists and structs, can be of any supported type: either another struct, map, or list, or any of the supported scalar types listed above.
For masking struct
, map
, and list
columns, use the json
mask.
In the ruleset, specify the name of the column, and then the JSON path to the field to be masked,
treating list
s as JSON arrays, and struct
s and map
s as JSON objects.
For example, suppose you have the following Parquet column of map
type:
column name: staff_by_office
type: map<string, list<struct<first_name: string, surname: string>>>
i.e. each key in the map
is an office name,
and each value is a list of staff members' names, stored as a struct
with fields first_name
and surname
.
Then the following ruleset will mask the surname of all staff in the north
office:
version: "1.0"
tasks:
- type: mask_tabular_file
rules:
- column: staff_by_office
masks:
- type: json
transforms:
- path:
- north
- '*'
- surname
masks:
- type: from_fixed
value: REDACTED
You can also copy one complex column to another of the same type using the from_column
mask,
and use a value within a complex column as a hash_columns
source by specifying a json_path
as detailed above.
Complex columns cannot be used as the seed_filter_column
in from_file
masks.
Conditional masking for Parquet
Complex columns can be referenced in if
conditions. Specify the json_path
to the field to be compared.
Other data types
Attempts to mask or reference a column of type not mentioned above will cause the task to fail. However, you can still mask other columns in a file containing one or more columns with an unsupported data type.
Choosing files to mask with include
/skip
When specifying which files to include
or skip
for a mask_file
/mask_tabular_file
task, the order of checking
needs to be considered. The skip
rules are checked first, followed by the include
rules, which means skip
takes
precedence. Warning! If a file matches both the skip
and include
rules, that item will not be included in
the masking task.
Rules are applied with "or" logic, that is, as long as the file matches any rule in skip
or include
, it will
be skipped/included, respectively.
In comparisons, the file's path does not include a leading /
.
By default, include
and skip
rules are applied to the file's path, relative to the base_directory
. By specifying
match_against: filename
on the include
/skip
, the rule will be applied to just the file's name.
Filtering is performed only on files, not on directories.
If recurse
is true, a directory will always be entered into, even if it would be skipped based on filters.
If a directory has no files to be masked (they are all skipped, or none are included)
then an empty directory will not be created on the target.
glob
matching does not support the **
syntax for multi-level directory matching. If the depth of directories is
unknown, or more advanced path matching is required regex
filtering should be used.
These rules are explained in more detail below, with some examples and common file filtering scenarios.
include
/skip
examples and common scenarios
In general, glob
is used for simple matching, and regex
should be considered when more advanced rules are
required. Note that some of these examples use a mask_file
task and others use a mask_tabular_file
task, however
there is no difference in the way the file matching rules are executed between these different task types.
Mask all files with a certain extension
The ruleset will include all files that end with .xml
inside the base directory and any subdirectories.
version: "1.0"
tasks:
- type: mask_file
recurse: true
include:
- glob: "*.xml"
The recurse
option will enter every subdirectory. Since the match_against
option has not been specified the default
path
option is used, so the glob is applied against the file's path. However since a path like dir1/dir2/file.xml
and a filename like file.xml
both end with .xml
they both match the specified glob; in this case specifying
match_against: filename
would have no effect.
Mask all files with a specific start and end
This ruleset will mask any file that starts with user
and has a .json
extension. For example, user_1.json
,
users_file.json
, etc. These files can be anywhere inside the source. The match_against
option is set to filename
.
version: "1.0"
tasks:
- type: mask_file
recurse: true
include:
- regex: "^user.*\.json$"
match_against: filename
Explaining the entire regular expression syntax is beyond the scope of these examples, but since this pattern is quite common we will briefly explain how it works:
^
means the expression must match start of the path, it can't just be found somewhere in the middle.user
has no special meaning, so it matches the literaluser
part of the path..*
has two parts,.
means match any character and*
means match the preceding expression (the.
) any number of times. So together they mean match any number of any characters.\.
matches a literal.
. The\
escapes the.
so it doesn't behave as matching any character.json
has no special meaning, so it matches the literaljson
extension.$
means the expression must also match at the end of the path,.json
can't just appear somewhere in the middle.
match_against: filename
is specified so that the regex pattern does not match user
in a directory name.
For example, we do not want it to match the path user_data/file_1.json
, but that full path does satisfy the regex.
By specifying match_against: filename
, the regex is applied only to the filename part (file_1.json
) which does not match.
Notes:
- Regex patterns only match at the beginning of the path (or filename, if using
match_against: filename
). So the pattern above will not matchsome_user_data.json
as the worduser
is not at the start of the filename, even if the leading^
start anchor is omitted from the regex.- When using
match_against: filename
, aglob
orregex
with a/
in it will cause the pattern to never match anything, since filenames never contain a/
. DataMasque does not prevent you from using a/
in a pattern when matching against filename, so if there are files you expect to be included, but they are not, this could be the reason why.match_against: filename
has no effect when using onlyglob
patterns. Globs always match at the end of the path, and wildcards do not match across directory separators (/
). It follows that the above example can be rewritten more succinctly asglob: user*.json
with nomatch_against
parameter. However, were the regular expression more complex, it may not be expressible as a glob pattern.
Mask files of a certain type only in the base directory
To mask CSV files only in the base directory, disable the recurse
option.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- glob: "*.csv"
Without recurse: true
, subdirectories will not be entered into, so only .csv
files in the base directory will be
masked.
Mask files in nested directories of known depth
In this example, there is a directory layout like this:
users1.csv
users2.csv
staff/users1.csv
staff/users2.csv
staff/management/users1.csv
staff/management/users2.csv
staff/contractors/users1.csv
staff/contractors/users2.csv
(Assume there are also some files of other types that are not to be masked).
All users*.csv
files should be masked, except those in the base directory. Since we know the depth of the directories
(staff/
is one level; staff/management/
and staff/contractors/
are two levels) two globs can be added to match
these.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- glob: "staff/*.csv"
- glob: "staff/*/*.csv"
Since the matching is performed on the path, none of the files in the base directory match. Files in the staff/
directory match the first glob and files in staff/management/
or staff/contractors/
match the second. Since
include
use OR logic then as long as at least one include
rule matches a file will be included.
Note: Globs also match at the end of a more deeply-nested path. For example, with the above globs,
folder1/folder2/staff/file.csv
also matches. If you want to avoid this behaviour, use a regex starting with^staff/
to ensure that the pattern only matches a folder namedstaff
when it is in the base directory - see the following section.
Mask files in all nested directories
If the depth of the directories is unknown, or could be very deep, so that a lot of glob
rules need to be created,
then using regex
to apply regular expressions to the path can be a better option.
Following on from the previous example, CSV files might be nested inside a deep directory tree. The following ruleset
uses a regex to match any .csv
file inside staff/
or any of its child directories.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- regex: "^staff/.*\.csv$"
Any path that starts with staff/
and ends with .csv
will be matched by this regex.
Often you will be able to apply a slightly modified version of this regex, by just changing the start and end to match your particular paths and file extensions.
For more information on the regular expression syntax used by DataMasque, check the Python Regex Howto.
Mask all files with a given extension, except those with a certain prefix
In this example all .avro
files in the base directory will be masked, except the ones starting with users*
. This
means files like sales_1.avro
, data.avro
, orders-for-2020.avro
will be masked, but users-list.avro
and
users-exported.avro
would not be. It uses an include
rules to include all .avro
files and a skip
rule to skip
any starting with users
.
version: "1.0"
tasks:
- type: mask_file
include:
- glob: "*.avro"
skip:
- glob: "users*.avro"
match_against: filename
The file users-exported.avro
matches both *.avro
and users*.avro
, but since skip
takes precedence the file will
not be included. The file data.avro
only matches *.avro
so it will be included.
Mask files in all nested directories, excluding those with a given prefix
This example will include all .csv
files in the staff/
directory, except those that start with data
. For example,
it matches staff/sales.csv
and staff/management/users.csv
but not staff/data01.csv
or
staff/management/data-list.csv
.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- regex: "^staff/.*\.csv$"
skip:
- glob: "data*.csv"
match_against: filename
This ruleset show that glob
and regex
can be combined to produce flexible and simple matching rules.
Mask files that match a particular pattern.
This example includes XML files that start with users_
, then have at least one digit (the \d+
parameter). But since
it uses match_against: filename
it will not match against the path, therefore files inside a directory starting with
users_
won't match.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- regex: "^users_\d+.xml$"
match_against: filename
Some example results are:
users_5.xml
: matchusers_10.xml
: matchusers_list.xml
: no match (the filename doesn't match the regex)users_5/file.xml
: no match (the path matches the regex but comparison is performed against the filename)users_5/users_10.xml
: match (the filenameusers_10.xml
matches the regex)
Masking Functions
Masked data is generated by mask functions. Each mask_table
/mask_file
/mask_tabular_file
task specifies a set of rules that use these mask functions to
mask your data based on your configuration.
Full masking function details are available on the Masking Functions page.
Here is an outline of the masking functions available:
- Generic masks
- Fixed value (
from_fixed
)
Outputs a fixed value - From column (
from_column
)
Outputs from another column (or table) - From file (
from_file
)
Generates values sourced from a file - From blob (
from_blob
)
Outputs a value that is the entire contents of a file - From format string (
from_format_string
)
Generates random values according to a format string - From choices (
from_choices
)
Generates a random value picked from a (optionally weighted) set of values - Secure shuffle (
secure_shuffle
)
Securely shuffles a given column's values - From json path (
from_json_path
)
Masks apath
with the contents of anotherpath
in the same JSON document
- Fixed value (
- String masks
- Imitate (
imitate
)
Replaces characters with other characters from the same set (letters for letters, numbers for numbers) - Random text (
from_random_text
)
Generates random strings of letters - Transform case (
transform_case
)
Transforms the case of characters in a string - Substring (
take_substring
)
Extracts a substring from a column's value - Replace substring (
replace_substring
)
Applies masks to a specific portion of string values - Replace regular expression (
replace_regex
)
Applies masks to parts of string values that match a given regular expression - Substitute (
substitute
) (deprecated)
Deprecated - Useimitate
instead
- Imitate (
- Data Pattern Masks
- Credit Card (
credit_card
)
Replaces credit card values with random ones - Brazilian CPF (
brazilian_cpf
)
Replaces Brazilian CPF numbers with random ones - Social Security Number (
social_security_number
)
Replaces social security numbers with random ones
- Credit Card (
- Numeric masks
- Random Number (
from_random_number
)
Generates a random integer/decimal between two numbers - supports triangular or uniform distribution - Random Boolean (
from_random_boolean
)
Generates a random true/false or 1/0 value - Numeric Bucket (
numeric_bucket
)
Generates replacement numbers whilst retaining specified ranges
- Random Number (
- Date/time masks
- Random date/time (
from_random_datetime
)
Generates random dates and times - Random date (
from_random_date
)
Generates random dates (without time components) - Retain age (
retain_age
)
Transforms values into random date times that preserve the age (in years) - Retain date component (
retain_date_component
)
Transforms specific parts of a date to be random - Retain year (
retain_year
)
Transforms a date time randomly whilst keeping the same year
- Random date/time (
- Transformation masks
- Typecast (
typecast
)
Converts data to a different type - Do nothing (
do_nothing
)
Prevents specific data from being masked
- Typecast (
- Combination masks
- Concatenate (
concat
)
Combines the output of multiple mask operations together - Chain (
chain
)
Chains multiple mask operations, passing the output of one to the next
- Concatenate (
- Unique Masks
- From Unique (
from_unique
)
Generates random but unique strings or numbers, from a format string (databases only) - From Unique Imitate (
from_unique_imitate
)
Transforms strings or numbers to be random, retaining format and uniqueness
- From Unique (
- Document masks
- JSON (
json
)
Masks data inside JSON documents - XML (
xml
)
Masks data inside XML documents
- JSON (
Definitions
You can make use of YAML anchors and aliases to assign a label (anchor) to a block of configuration, and then reference that name (via aliases) to re-use the configuration throughout your ruleset, extending or overriding specific properties/keys in the configuration as needed.
You can define an anchor with &
, then refer to it using an alias, denoted with *
.
You can use overrides with the characters <<:
to add more values, or override existing ones on an alias.
Shown below is an example that sets 4 values with an anchor, uses 3 of those values with an alias, and overrides the 2nd value.
Both tasks share the values for type
, key
and rules
using the anchor &mask_table_opts
and the alias *mask_table_opts
. The value for table
is merged/overridden by <<
or merge operator.
version: '1.0'
tasks:
- &mask_table_opts
type: mask_table
table: details
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- <<: *mask_table_opts
table: customers
As you may commonly want to re-use the definition of a task, rule, or mask in multiple locations (often with
minor variations), DataMasque allows you to group such definitions under the task_definitions
,
rule_definitions
, and mask_definitions
configuration properties of a ruleset. The following subsections demonstrate
how to use anchors and aliases with such definitions.
Task definitions (task_definitions
)
The task_definitions
attribute of the ruleset may contain a list of task definitions to be referenced
from the ruleset's tasks
attribute through YAML anchors and aliases. Each task definition must follow
the same syntax as a task in the tasks
attribute (see Database Task Types for the full list of
available task types).
For example, the following ruleset has an anchor named &mask_table_customer_info
under task_definitions
that is inherited three times
in the tasks
list via the alias *mask_table_customer_info
. Each time the definition is re-used, the table
attribute is overridden:
version: '1.0'
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
tasks:
- <<: *mask_table_customer_info
table: details
- <<: *mask_table_customer_info
table: customers
- <<: *mask_table_customer_info
table: orders
This is equivalent to specifying the three mask_table
tasks separately with the same key
, type
, and rules
but
applied to different table
s, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: details
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
- type: mask_table
table: customers
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
- type: mask_table
table: orders
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
Rule definitions (rule_definitions
)
The rule_definitions
attribute of the ruleset may contain a list of rule definitions to be referenced
through YAML anchors and aliases. Each rule definition must follow the same syntax as a rule in a
mask_table
task's rules
.
For example, the following ruleset has an anchor named &postcode_rule
under rule_definitions
that is inherited twice
in a list of rules
via the alias *postcode_rule
. Each time the definition is re-used, the column
attribute is overridden:
version: '1.0'
rule_definitions:
- &postcode_rule
column: postcode
masks:
- type: from_random_number
max: 9999
tasks:
- type: mask_table
table: orders
key: id
rules:
- <<: *postcode_rule
column: target_postcode
- <<: *postcode_rule
column: destination_postcode
This is equivalent to specifying the two rules separately with the same masks
but
applied to different column
s, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_postcode
masks:
- type: from_random_number
max: 9999
- column: destination_postcode
masks:
- type: from_random_number
max: 9999
Mask definitions (mask_definitions
)
The mask_definitions
attribute of the ruleset may contain a list of mask definitions to be referenced
from a list of masks through YAML anchors and aliases. Each mask definition must follow
the same syntax as a mask in the masks
attribute of a rule (see Mask Functions for the full list of
available mask types).
For example, the following ruleset has an anchor named &street_mask
under mask_definitions
that is inherited twice
in a masks
list via the alias *street_mask
. Each time the definition is re-used, the table_filter_column
attribute is overridden:
version: '1.0'
mask_definitions:
- &street_mask
type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_street
masks:
- <<: *street_mask
table_filter_column: target_city
- column: destination_street
masks:
- <<: *street_mask
table_filter_column: destination_city
This is equivalent to specifying the two masks separately with identical configuration except for
different table_filter_column
values, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: orders
key: id
rules:
- column: target_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: target_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
- column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Nested definitions
It is possible to have nested definitions under a definition.
For example, a nested rule definition &customer_info_street_address
can be anchored under a task definition &mask_table_customer_info
.
version: '1.0'
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules:
- &customer_info_street_address
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
tasks:
- <<: *mask_table_customer_info
table: details
rules:
- <<: *customer_info_street_address
column: pickup_street
Inherit definitions with dictionaries
As it is not possible to use anchors and aliases to override individual items in a YAML list, DataMasque allows you to
alternatively specify lists of rules
, joins
, and masks
as dictionaries of arbitrary keys to their respective values.
A dictionary is represented in a simple key: value
form (the colon must be followed by a space):
# A mask rules example in dictionary with arbitrary keys
rules:
customer_info_postcode: *postcode_rule
customer_info_customer_id: *customer_id_rule
customer_info_destination_street: *destination_street_rule
The above dictionary is identical to following list while the rules are executed:
# A mask rules example in list
rules:
- *postcode_rule
- *customer_id_rule
- *destination_street_rule
The following example demonstrates how specifying rules
as a dictionary allows you to easily override
individual rules by their key using anchors and aliases.
By specifying rules
as a dictionary in task definition &mask_table_customer_info
and by using anchors and aliases to
reference the rules dictionary and individual rules, we were able to override the entire customer_info_postcode
rule in
the first task, and just the column
of the customer_info_customer_id
rule in the second task.
version: '1.0'
rule_definitions:
- &postcode_rule
column: postcode
masks:
- type: from_random_number
max: 9999
- &customer_id_rule
column: customer_id
masks:
- type: from_random_text
max: 10
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules: &customer_info_rules
customer_info_postcode: *postcode_rule
customer_info_customer_id: *customer_id_rule
customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
tasks:
- <<: *mask_table_customer_info
table: details
rules:
<<: *customer_info_rules
customer_info_postcode:
column: post
masks:
- type: from_fixed
value: 1234
- <<: *mask_table_customer_info
table: orders
rules:
<<: *customer_info_rules
customer_info_customer_id:
<<: *customer_id_rule
column: ext_customer_id
These changes can be seen in the equivalent expanded ruleset below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: details
key: id
rules:
rule0:
column: post
masks:
- type: from_random_number
max: 9999
rule1:
column: customer_id
masks:
- type: from_random_text
max: 10
rule2:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
- type: mask_table
table: orders
key: id
rules:
customer_info_postcode:
column: postcode
masks:
- type: from_random_number
max: 9999
customer_info_customer_id:
column: ext_customer_id
masks:
- type: from_random_text
max: 10
customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Note: Dictionaries can be used instead of lists in the following ruleset attributes:
- For
rules
, as an attribute of:
- Task type
mask_table
, including inside anif
condition (bothrules
andelse_rules
).- For
joins
, as an attribute of:
- Task type
mask_table
- For
masks
, as an attribute of:
- A rule in a set of
rules
- Mask type
replace_regex
- Mask type
replace_substring
- Mask type
concat
- Mask type
chain
Note: The order of items in above dictionaries in DataMasque ruleset will be respected, so the operations will be performed in the order you specify them.
Advanced ruleset features
Default values to skip
You can specify a list of skip_defaults
values at the ruleset level:
version: '1.0'
skip_defaults:
- null
- ''
- matches: 'administrator_.*'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: role
masks:
- type: from_fixed
value: 'customer'
These will be used as the skip
configuration for any masking rule that does not have skip
explicitly set.
Disabling skip_defaults
To disable the skip_defaults
for a particular column,
set the skip
option for that column to an empty array: skip: []
.
Example
role
column:- The
role
column does not have a specificskip
rule defined, so it will use theskip_defaults
specified at the top of the ruleset. This means that anynull
or empty string (''
) values in the role column will not be masked; they will be skipped over.
- The
details
column:- The
details
column, however, explicitly setsskip: []
, which overrides and disables theskip_defaults
for this column. As a result, even if the value isnull
or an empty string (''
), it will still be masked according to the rule provided (in this case, replaced with the stringREDACTED
).
- The
version: '1.0'
skip_defaults:
- null
- ''
tasks:
- type: mask_table
table: users
key: id
rules:
- column: role
masks:
- type: from_fixed
value: 'customer'
- column: details
skip: []
masks:
- type: from_fixed
value: 'REDACTED'
Summary
skip_defaults
: Defines which values (likenull
or''
) should not be masked by default.skip: []
: Overrides theskip_defaults
for a particular column, ensuring that every value in that column (even those normally skipped) will be masked.
Freezing random values
A run_secret
string can be provided as an option when starting a run. If you provide a run_secret
, then all
randomly chosen and generated values will be exactly the same for multiple runs with the same secret. This can be useful
for testing or retaining consistency across multiple masking runs. The run_secret
can consist of any sequence of at least 20
characters, however it is recommended to use a randomly generated string.
Notes:
- Output values will be frozen only across identical repeated masking runs. Variations in database content (such as inserted or deleted rows) or Run Options (i.e. Batch Size, Max rows) can still result in different outputs even when using the same
run_secret
. Additionally, using a non-unique key may result in non-deterministic results. To retain consistent masking results across variable data sources, use the Deterministic masking features.- As the parallelised unloading of Amazon Redshift data into S3 files cannot guarantee consistent ordering between runs, a fixed
run_secret
may still result in different outputs between masking runs on Amazon Redshift databases. To retain consistent masking results with Amazon Redshift, use the Deterministic masking features.
Deterministic masking
If you wish to make the masked values deterministic based on one or more input values, you can provide hash columns in a ruleset. DataMasque will use the hashed value(s) from the specified sources as one of the random generator inputs (along with the instance seed and the run secret). This allows for consistent output values to be generated even across different databases or inside documents.
For database or tabular file masking, this is done with the use of a hash_columns
argument for each column
being
masked. For file masking, the parameter is called hash_sources
and the hash used for the entire file.
DataMasque will use the hashed value(s) as one of the random generator inputs (along with the instance seed and the run secret). This allows for consistent output values to be generated even across different databases, inside documents (XML/JSON) or when masking files.
Changes to hashing values in 2.8.0: Prior to DataMasque 2.8.0, values to hash were converted to strings then hashed. This may have resulted in duplicate output values when hashing on
null
. From version 2.8.0, hashing occurs on binary representations of data. This means that hashed outputs from DataMasque 2.8.0 onwards will differ to those generated by prior versions.
Deterministic masking with databases or tabular files
Each hash_columns
item can either be just the name of the column, or can specify a json_path
or xpath
to fetch a
hash value from inside JSON or XML, respectively. For more information on json_path
or xpath
please
refer to the JSON documentation/XML documentation.
For JSON/XML hash_columns
the following logic is applied:
- If the value extracted from the
json_path
/xpath
isnull
the hash will be performed on thenull
value. - If the value does not exist at the
json_path
/xpath
the hash will be performed onnull
. - If the JSON/XML data cannot be decoded (is invalid) the hash will be performed on the entire column, even if
null
.
hash_columns
can also trim (strip) leading/trailing white space by setting the trim
parameter to true
.
Note Since
null
is the default fallback for invalid data, this may result in unexpected repeated masked values. Warnings will be added to the run logs when this fallback occurs. Bothjson_path
andxpath
optional parameters cannot be set for the same hash column. It is possible to hash a column on to itself (i.e.hash_columns
contains the currentcolumn
). If you hash a column on itself, and that is used as the hash for other columns, the mask on the hash column should come last in the rules. If not, that column will use the pre-masked value as the hash and other columns use the post-masked value, meaning different values will be generated in the same row for the same mask definition.
Example
This example will mask the date_of_birth
column with a date value that has been deterministically generated based
on the hash of date_of_birth
and first_name
column values combined with a one-off randomly generated secret for
this run.
For example, in every row where date_of_birth = '2000-01-01'
and first_name = 'Carl'
, the date_of_birth
will be
replaced with a deterministically generated value (e.g. 1999-03-03
). This same replacement value will be generated
for repeated rows that match the same date_of_birth
and first_name
values within this run. However, this value
will be different for each run. To retain consistency of deterministically generated values between multiple runs,
see Consistent masking.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
hash_columns:
- date_of_birth
- name
masks:
- type: from_random_date
min: '1980-01-01'
max: '2000-01-01'
An equivalent alternative version of YAML syntax is shown below:
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
hash_columns:
- column_name: date_of_birth
- column_name: name
masks:
- type: from_random_date
min: '1980-01-01'
max: '2000-01-01'
Show result
Before | After |
|
|
---|
Example with xpath
This example will mask the date_of_birth
column with a date value that has been deterministically generated based
on the hash of the date_of_birth
attribute from the following XML document column.
<Root>
<Info>
<Employee date_of_birth="2022-10-08">Steve</Employee>
<Employee date_of_birth="2022-10-08">Fred</Employee>
</Info>
</Root>
To get the date_of_birth
attribute from the XML document we need to specify the path with an xpath
. The following xpath
will get the date attribute from the first Employee
element of the XML document, Info/Employee[1]/@date_of_birth
.
If you wanted to perform the hash on a list of date_of_birth
attribute's values you can specify Info/Employee/@date_of_birth
.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: date_of_birth
hash_columns:
- column_name: xml_data
xpath: "Info/Employee/@date_of_birth"
masks:
- type: from_random_text
min: 5
max: 10
As shown in the results below, when the date_of_birth
attribute has the same values for both Employee
elements the
resulting masked date_of_birth
column contain the same values.
Show result
Before | After |
|
|
---|
Deterministic masking with files
File masking tasks (mask_file
) support deterministic generation of masked values which are based on input values. You
can provide a hash_sources
list in a rule, similar to the hash_columns
for database masking.
However, with hash_sources
you must specify additional parameters to fetch the hash.
Parameters
json_path
(optional): A path to the value in the JSON data to use for the hash.xpath
(optional): An Xpath to the value stored in the XML data to use for the hash.file_path
(optional): A boolean value, when set totrue
the hash will be performed on the file's path (relative to the base directory). Defaults tofalse
.case_transform
(optional):upper
/lower
- Apply a case transform on the input value, for consistent hash values irrespective of case.trim
(optional): If the value extracted from thejson_path
, orxpath
contains leading or trailing white space, it can be trimmed by setting this option totrue
. Defaults tofalse
.
Notes Each of the
json_path
,xpath
andfile_path
are shown as optional but exactly one of these needs to be specified. For more information onjson_path
orxpath
please refer to the JSON documentation/XML documentation.hash_sources
for multirecord files works the same as for object file, except thehash_sources
are applied to each record in the file instead of the entire file.When
hash_sources
are specified as part of ajson
transform
orxml
node_transform
:
- The
json_path
/xpath
should be a relative path from the current node.- If
hash_sources
are specified as part of the task and within thejson
/xml
mask:- The
hash_sources
will be overwritten with thehash_sources
specified in thejson
/xml
mask.file_path
is no longer a valid parameter.
Example with json_path
This example will mask a JSON file which contains the following JSON data:
{
"users": {
"name": "Richard",
"addresses": [
{
"type": "postal",
"city": "Nelson"
},
{
"type": "physical",
"city": "Napier"
}
]
}
}
We want to mask the city
of the users
, hashed with the name
of the users
. To do this we need
to specify a json_path
to the value we want to use for the hash ([users
,name
]) in the hash_sources
, as shown in
the ruleset below. The masked value we would want to replace the city
with, would be best if it was from one of the seed
files DataMasque provides.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- json_path: ['users', 'name']
masks:
- type: json
transforms:
- path: ['users','addresses', '*','city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
As shown in the results below, when the name
has the same values for both users
the resulting masked values for the city
is also the same.
Note Each row of the results table below represents a separate file.
Show result
Before | After |
|
|
---|
If case_transform
is specified, then the casing of the hash input values will not affect the consistency of the final
masked result.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- json_path: ['users', 'name']
case_transform: lower
masks:
- type: json
transforms:
- path: ['users','addresses', "*",'city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
Note Each row of the results table below represents a separate file.
Show result
Before | After |
|
|
---|
If trim
is specified as true
, then any whitespace around the hash input values will not affect the consistency of the final masked
result. In the example below, trim
has been applied to the hash_source
, and values in the hash_source
containing
whitespace have not affected the final masked result.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- json_path: ['users', 'name']
trim: true
masks:
- type: json
transforms:
- path: ['users','addresses', "*",'city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
Show result
Before | After |
|
|
---|
Example with xpath
This example will mask two different XML files but generate consistent masked data by using user id as a hash. The first contains a user's info.
<User id="8472">
<Name>Evelyn</Name>
<Address>
<City>Nelson</City>
</Address>
</User>
The second contains a list of user's log messages:
<Logs userId="8472">
<Message to="Evelyn" date="2022-08-09" type="welcome"/>
<Message to="Evelyn" date="2022-08-10" type="forgot_password"/>
</Logs>
In both cases, we want the user's name to be consistent, and this can be done by specifying the user's ID as the hash
source for each file. We'll create a ruleset with two mask_file
tasks, and use includes
to map the right task to the
correct file.
The first ruleset will mask the User/Name
text value, and the second will mask the to
attribute of each
Log/Message
element. The first ruleset will use the id
attribute on User
as the hash source, while the second will
use the userId
attribute on Logs
.
Here's the ruleset to accomplish this:
version: "1.0"
tasks:
- type: mask_file
include:
- glob: user_info_*.xml
rules:
- hash_sources:
- xpath: 'User/@id'
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<User />'
transforms:
- path: 'User/Name'
node_transforms:
- type: text
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
- type: mask_file
include:
- glob: user_logs_*.xml
rules:
- hash_sources:
- xpath: 'Logs/@userId'
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Logs />'
transforms:
- path: 'User/Name'
node_transforms:
- type: attribute
attributes: 'to'
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
The two output files would use consistent names. The first file:
<User id="8472">
<Name>Sarah</Name>
<Address>
<City>Nelson</City>
</Address>
</User>
And the second:
<Logs userId="8472">
<Message to="Sarah" date="2022-08-09" type="welcome"/>
<Message to="Sarah" date="2022-08-10" type="forgot_password"/>
</Logs>
Example with file_path
This example will just show an example ruleset to mask files based on the file path:
This applies to any mask_file
task.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- file_path: true
masks:
- type: json
transforms:
- path: ['users','addresses', '*','city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
With this ruleset the hashing will be performed on the file path, so each file with a different path with have differing masked values. This is useful if you want repeatable masking based on the name of a file.
Example with relative json_path
Referring to the document from the JSON path example above, DataMasque can generate a hash from a value that is relative to the current node being masked.
In the following example, the node being masked will be one of the ['users', 'addresses', '*','city']
text nodes.
The string ..
is used to navigate up through the document hierarchy (analogous to using cd ..
to move up a
directory when browsing a files in a terminal).
First we show the ruleset, then explain the hash source's json_path
.
version: "1.0"
tasks:
- type: mask_file
rules:
masks:
- type: json
transforms:
- path: ['users','addresses', '*','city']
hash_sources:
- json_path: ['..', '..', '..', 'name']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
Each transform in transforms
is applied to each city
found at path
. There is only one transform
in this example.
To fetch the name
of the user for that particular city
, to use in hashing, we must ascend up three nodes by using
the ..
operator three times:
- First ascending from the
city
text node to theaddress
object. - Then to the
addresses
array containing the list of addresses. - Finally, to the
user
object that contains theaddresses
array.
The current node is now the user
object. The final element in the json_path
(name
) can be used to fetch the name
of that user
.
In this way the city used in the from_file
mask is generated based on the hash of the name
of that user.
Example with relative xpath
Since the xml
mask uses standard XPaths for node traversal, normal relative XPath syntax may be applied,
as shown in this next example.
version: "1.0"
tasks:
- type: mask_file
include:
- glob: user_info_*.xml
rules:
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<User />'
transforms:
- path: '/User/Name'
node_transforms:
- type: text
hash_sources:
- xpath: '../@id'
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
- type: mask_file
include:
- glob: user_logs_*.xml
rules:
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Logs />'
transforms:
- path: '/User/Name'
node_transforms:
- type: attribute
attributes: 'to'
hash_sources:
- xpath: '../@userId'
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
The above example uses two hash_sources
with xpath
attributes. The first uses ../@id
to fetch the id
attribute
of the parent XML node (<User>
), and the second similarly uses ../@userId
to fetch the userId
attribute.
Common regular expression patterns
A regular expression (or "regex") is a sequence of characters that acts as a search pattern
to filter or select substrings of text strings. They can be used
in replace_regex
masks and matches
conditions
of if
and skip
blocks.
Note that it is best practice to wrap the regular expression in quotes to avoid
special characters being misinterpreted as YAML syntax:
version: '1.0'
tasks:
- type: mask_table
table: driversLicence
key: ID
rules:
- column: driversLicence
masks:
- type: replace_regex
regex: "[0-9]"
masks:
- type: from_fixed
value: "#"
Notes: DataMasque uses the Python regular expression syntax, which you can learn more about in the official tutorial.
Regular expressions allow you to match various characters by specifying a character or character set. Below are a list of characters sets that can be used in regular expressions.
Characters can be specified as character classes.
"[0-9]"
can be used to match any numerical character."[a-z]"
and"[A-Z]"
will match lower case and upper case alphabetical characters respectively.- Classes can also be combined;
"[a-zA-Z]"
will match any upper or lower case character. - You may also specify a list of characters to match within square brackets. If you wish to only match vowels,
you can use
"[aeiouAEIOU]"
. - If a caret character is added before any character within the character class, it will instead match something other
than the specified character.
"^[0-9]"
will match any character other than a numerical character.
The table below detail other ways to denote certain character type without the use of character classes.
Character | Description |
---|---|
\d | One digit character, ranging from 0 to 9. |
\w | One 'word' character; an ascii letter, a digit or an underscore. |
\s | One whitespace character. This can be a space, a line break, or a tab. |
\D | One character that is NOT a digit. |
\W | One character that is NOT a word character. |
\S | One character that is NOT a whitespace character. |
Note that when used in double-quoted strings in YAML, literal backslashes must be escaped by backslashes, for example,
"\\d"
will be interpreted as\d
. We recommend quoting YAML strings with single quotes to avoid the need for this escaping, as'\d'
will be interpreted as\d
.
You can also choose to add a quantifier to any character or character class, which will specify how many characters will be matched.
Quantifier | Description |
---|---|
+ | One or more characters will be matched. |
{n} | Exactly n characters will be matched. For example, \d{3} will match strings of exactly 3 numerical digits. |
{a,b} | Any string of characters between x and y length will be matched. For example, \w{1,3} will match a string of word characters between 1 and 3 length. |
{c,} | Any string of characters with c or more characters. For example, \d{2,} will match a string of numerals of length 2 or more. |
* | Any instance of the characters appearing zero or more times. |
? | Matches if the character or character class appears once, or no times. |
Below are some common regular expression patterns that can be used to match typical database strings.
Example Patterns
Phone Number
The pattern below will match a phone number consisting of an international code in parentheses followed by a 7-digit phone number that may be hyphenated after the 3rd digit.
'\(\+\d{1,4}\)\s?\d{3}-?\d{4}'
Some example of matching patterns are shown below:
- (+64)123-4567
- (+1234) 5678910
- (+61) 987-5432
Social Security Number
The pattern below will match a Social Security Number consisting of hyphen-separated digits:
'\d{3}-\d{2}-\d{4}'
Some example of matching patterns are shown below:
- 111-22-3333
- 987-65-4321
- 112-35-8132
Time
The pattern below will match a time consisting of 1 or 2 hour digits and 2 minute digits separated by a colon:
'[0-2]?\d:\d{2}'
Some example of matching patterns are shown below:
- 7:30
- 23:50
- 05:45
Email Address
The pattern below will match an email address consisting of an alphanumeric username and
an alphanumeric domain name separated by an @
symbol. For a more comprehensive email
regular expression, see: emailregex.com.
'[a-zA-Z0-9]+@[a-zA-Z0-9\.]+'
Some example of matching patterns are shown below:
- janedoe@gmail.com
- bobsmith@hotmail.com
- thomas@yahoo.com
DISCLAIMER: The above examples are simplistic and general examples to demonstrate how regular expressions can be constructed - you should ensure these regular expressions are appropriate for your particular data before using them in your masking rulesets.
Date Parsing in DataMasque
Often, rulesets need to specify a date format to convert text into a date. For example, using Date/Time masks, executing type casts or performing Conditional Masking.
Date parsing in DataMasque follows standard %
-based directives, as documented in
strftime() and strptime() Format Codes.
An important point to note is how years with and without centuries are handled.
Parsing years with/without centuries (%Y
vs. %y
)
Two directives are available to parse years:
%Y
to parse years with the century (e.g., 2020, 1999, 1867, etc.).%y
to parse years without the century (e.g., 20, 99, 67, etc.).
When parsing years without a century using %y
, the following interpretation applies:
- Years from 00 to 68 are interpreted as the years 2000 to 2068.
- Years from 69 to 99 are interpreted as the years 1969 to 1999.
While %y
provides a convenient way to parse two-digit years,
it introduces ambiguity in interpretation,
especially near the transition between centuries.
It is recommended to use %Y
whenever possible to avoid ambiguity,
and ensure consistent interpretation across platforms and environments.
Determining what unique key masking to use
Below are the suggestions for when to use the three different mask functions: mask_unique_key
, from_unique
, and from_unique_imitate
.
mask_unique_key
Use Case:
- Best for generating unique values and propagating these to both physical and logical foreign key (FK) columns.
Key Features:
- Allows specifying the data format.
- Generates unique values.
- Propagates to physical and logical FK columns.
- Disallows
hash_columns
on itself or other columns. - Ensures primary key (PK) columns are unique before masking.
- No guaranteed consistency across masking runs or databases.
Ideal Scenario:
- Suited for scenarios where maintaining FK relationships is crucial, and data format specification is required.
from_unique
Use Case:
- Ideal for generating unique values without the need for FK column propagation.
Key Features:
- Allows specifying the data format.
- Generates unique values.
- Disallows
hash_columns
on itself or other columns. - Does not propagate masked PK values to physical and logical FK columns. 5.No guaranteed consistency across masking runs or databases.
Ideal Scenario:
- Useful when propagation to FK columns is not needed, and there's a requirement for data format specification.
from_unique_imitate
Use Case:
- Used when consistent values across masking runs or databases are required, and FK relationships are needed to be maintained without specifying data format.
Key Features:
- Does not allow data format specification.
- Hashes on itself and generates unique values.
- Propagates masked PK values to physical and logical FK columns.
- Guarantees consistent values across runs or databases with the same
run_secret
.
Ideal Scenario:
- Suitable for cases where consistent masking is necessary across multiple runs or databases, maintaining FK relationships is crucial and data format specification is not a priority.
Schema versioning
Schema changes to the DataMasque ruleset specification are tracked using the version
field of
the ruleset. The version number consists of two fields in the format major.minor
. Minor version
increments reflect backwards-compatible changes to the schema, whereas major version increments
represent breaking changes which will require some form of migration from previous versions. Wherever
possible, DataMasque will handle such migrations for you automatically when you upgrade.
Each release of DataMasque only supports the most recent major ruleset version at the time of release. As such, the major schema version of your rulesets must equal the major version supported by your DataMasque release. The minor schema version of your rulesets must be equal to or less than the minor version supported by your DataMasque release.
The ruleset schema version supported by this release of DataMasque is "1.0".
Writing comments
While creating a ruleset, it is possible to write comments in the ruleset. A commented block is skipped during execution, and it helps to add description for specified ruleset block.
If you begin a line with #
(hash symbol), all text on that line will become a comment.
version: '1.0'
# This line will become a comment.
tasks:
- type: mask_table
If you place #
on a line, all text after that #
on that line will become a comment. Any text before it
will still be part of the ruleset.
version: '1.0'
tasks:
- type: mask_table # The name of this task type will not be affected by this comment.
In the DataMasque ruleset editor, the shortcut key combination for commenting ruleset blocks is CTRL + /
on Linux and Windows and ⌘ + / for Mac operating systems. If your cursor is on a line and this shortcut
is used, the entire line will be commented out. Highlighting multiple lines at once will cause all
highlighted lines to be commented out.