Ruleset YAML specification
- Ruleset YAML specification
- Database task types
- Masking files
- File task types
- Masking tables
- Definitions
- Advanced ruleset features
- Common regular expression patterns
Ruleset YAML specification
The ruleset YAML configuration provides instructions that DataMasque will follow when performing a masking run against a target database. Rulesets are comprised of one or more tasks, which can contain many different types of instructions. The most common use case is the application of tasks for masking sensitive data in database tables with masking rules.
Ruleset Properties
The following properties are specified at the top-level of a ruleset YAML file:
- version (required): The schema version used by this ruleset. The default value present in the editor when creating a new ruleset is generally the value that you should be using. See Schema Versioning for more information.
- name (deprecated; will be removed in release 3.0.0): A unique name that will be used to refer to your ruleset. This field may only contain alphanumeric characters and underscores, and its length is limited to 64 characters.
- tasks (required): A list of tasks to be performed in order on the target database. See Database Task Types for the comprehensive list of the available task types and their associated parameters.
- task_definitions (optional): A list of task definitions to be referenced from the ruleset's list of tasks through YAML anchors and aliases.
- rule_definitions (optional): A list of rule definitions to be referenced from a list of rules through YAML anchors and aliases.
- mask_definitions (optional): A list of mask definitions to be referenced from a list of masks through YAML anchors and aliases.
- skip_defaults (optional): See Default values to skip.
- random_seed (optional): Deprecated in favour of the Run secret option. See Freezing random values.
Example
The following ruleset provides an example of replacing the last_name column of every row in the users table with the fixed value "redacted last name":
# My ruleset.yml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'
Note: The tasks must be indented using two spaces. For example:

tasks:
  - type: mask_table
^^

Here, - type: mask_table is indented two spaces (marked ^^) from the parent tasks: line.
Important note on case-sensitivity: For all tasks except run_sql, database identifiers such as table and column names should be referenced as you would otherwise reference them in an SQL query. When masking a case-sensitive database, identifiers must be referenced in the ruleset using the correct case.

To refer to a case-sensitive table or column in a database, the identifier must be enclosed in double quotation marks. However, when writing YAML, quotation marks are used to denote a string value, so any enclosing quotation marks are not considered part of the value. As such, it is necessary to enclose the entire name, including the double quotation marks, in an outer set of single quotation marks. For example:
# Case-sensitive table name; enclosed in both single and double quotations.
table: '"CaseSensitiveTableName"'
To refer to a case-sensitive table in a schema, the schema name must also be enclosed in quotation marks if the schema name is case-sensitive. The entire combination of schema and table name must be enclosed in single quotation marks. For example:
# Case-sensitive schema and table name; enclosed in both single and double quotations.
table: '"CaseSensitiveSchemaName"."CaseSensitiveTableName"'
If referencing a combination of table and column, you will need to place quotation marks around both the table and column names within the surrounding single quotation marks. For example:
# Case-sensitive table and column name; enclosed in both single and double quotations.
column: '"CaseSensitiveTable"."CaseSensitiveColumn"'
Identifier names containing double quotation marks, backslashes, periods, and whitespace should always be enclosed in double quotation marks. Also, literal double quotation marks and backslashes must be preceded by a backslash:
# Case-sensitive table and column name containing special characters.
column: '"Case\"Sensitive\\Table"."Case.Sensitive Table"'
Notes:
- Backslashes and single quotation marks are not supported in identifier names for Microsoft SQL Server (Linked Server) databases.
- For Redshift databases, DataMasque follows PostgreSQL's rules for case-sensitivity of quoted/unquoted identifiers if enable_case_sensitive_identifier is true, and treats all identifiers as case-insensitive if enable_case_sensitive_identifier is false. See: https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html
Schema versioning
Schema changes to the DataMasque ruleset specification are tracked using the version
field of
the ruleset. The version number consists of two fields in the format major.minor
. Minor version
increments reflect backwards-compatible changes to the schema, whereas major version increments
represent breaking changes which will require some form of migration from previous versions. Wherever
possible, DataMasque will handle such migrations for you automatically when you upgrade.
Each release of DataMasque only supports the most recent major ruleset version at the time of release. As such, the major schema version of your rulesets must equal the major version supported by your DataMasque release. The minor schema version of your rulesets must be equal to or less than the minor version supported by your DataMasque release.
The ruleset schema version supported by this release of DataMasque is "1.0".
Writing comments
While creating a ruleset, you can include comments. Commented text is skipped during execution and helps to document the relevant ruleset blocks.
If you begin a line with # (hash symbol), all text on that line will become a comment.
version: '1.0'
# This line will become a comment.
tasks:
  - type: mask_table
If you place # on a line, all text after that # on that line will become a comment. Any text before it will still be part of the ruleset.
version: '1.0'
tasks:
  - type: mask_table # The name of this task type will not be affected by this comment.
In the DataMasque ruleset editor, the shortcut key combination for commenting ruleset blocks is CTRL + /
on Linux and Windows and ⌘ + / for Mac operating systems. If your cursor is on a line and this shortcut
is used, the entire line will be commented out. Highlighting multiple lines at once will cause all
highlighted lines to be commented out.
Database task types
A ruleset is comprised of a list of tasks to be performed in sequence on
the target database. Tasks are performed serially from top to bottom,
but special serial
and parallel
tasks can be used to nest other tasks
within them for performance (parallelism) or dependency management.
Sensitive data discovery
The run_data_discovery
task type inspects the metadata of your database, searching for columns
which are likely to contain sensitive data. On completion, a report is generated containing a
summary of all identified columns, and their current masking coverage. It is recommended to include
a single run_data_discovery
task in your rulesets to help ensure complete masking coverage and
provide ongoing protection as new sensitive data is added to your database.
See the Sensitive Data Discovery guide for more information on this feature.
Note: The
run_data_discovery
task type is not currently supported for Amazon DynamoDB.
Parameters
This task type does not have any parameters.
Example
The following shows an example ruleset that will execute only the run_data_discovery
task
and no masking tasks. This example usage may be useful when starting a new ruleset from
scratch to determine a starting point for developing your masking rules.
The run_data_discovery task may also be included in a ruleset alongside other masking tasks to provide continuous feedback on the masking coverage provided by the ruleset.
version: '1.0'
tasks:
  - type: run_data_discovery
Schema Discovery
The run_schema_discovery
task type inspects the metadata of your database, searching for schemas, tables, and columns
and can flag certain columns which are likely to contain sensitive data. On completion, a report is generated containing
a summary of all identified schemas, tables, columns, and relevant metadata of the data within the columns.
See the Schema Discovery guide for more information on this feature.
Note: The
run_schema_discovery
task type is not currently supported for Amazon DynamoDB.
Parameters
This task type does not have any parameters.
Example
The following shows an example ruleset that will execute only the run_schema_discovery
task
and no masking tasks. This example usage may be useful when starting a new ruleset from
scratch to determine a starting point for developing your masking rules.
version: '1.0'
tasks:
  - type: run_schema_discovery
Table masks
Each mask_table
task defines masking operations to be performed on a
database table. More detail is provided about these tasks under the Masking Tables section.
Parameters
Each task with type mask_table
is defined by the following parameters:
table (required): The name of the table in the database. The table name can be prefixed with a schema name to reference a table in another schema. If the table or schema name is case-sensitive, you must enclose the name in double and single quotation marks in order to specify the casing of the name. For example, table: '"CaseSensitiveSchema"."CaseSensitiveTable"'.
- For rulesets run against Amazon DynamoDB connections, the table name can be prefixed with the AWS region that contains the table, e.g. <region>/<table-name> (without angle brackets). Note that the region prefix is required if you have not created a .aws/credentials file or environment variable with the default region, or are running DataMasque on EC2. Amazon DynamoDB connections currently only support masking a single table in each ruleset.
key (required): One or more columns that identify each table row. Composite keys may be specified for the key parameter. For more details on using composite keys, see Composite keys.
- For Oracle databases it should always be ROWID (key: ROWID). DataMasque will implicitly use ROWID when ROWID is not specified. For more details, refer to Query optimisation.
- For Microsoft SQL Server and PostgreSQL databases it is recommended to use the primary key, or any other unique key that is not modified during masking, for better performance.
- The key columns must not contain any NULL values.
- If the key column names are case-sensitive, you may enclose each key value in double and single quotation marks in order to specify the casing of the key. For example, key: '"Customer_ID"'.
- If the columns specified for the key parameter cannot be used to uniquely identify rows, then the masked values will be the same for rows that have the same key value. Refer to the key and hash columns example in the Notes section for how to avoid producing duplicate masked values.
- Columns specified in the key cannot be masked in the task's rules. DynamoDB is an exception to this rule, allowing key columns to be masked with e.g. the from_unique mask type.
on_missing_column (optional): The action to take if a column in the ruleset is not found on the table. This option is only available for DynamoDB. If specified, it must be one of error or skip. Defaults to error, which means a masking run will fail if a column in the ruleset does not exist. If set to skip, missing columns do not cause an error and other columns are masked as normal.
rules (required): A list of masking rules (or a dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
- column (required): The name of the column to mask. For rulesets run against Amazon DynamoDB connections, a wildcard * can be used to apply masks to columns not matched by other columns' rules.
- masks (required): A list of masks (or a dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
- hash_columns (optional): A list of columns which will be used as input to the Deterministic masking algorithm for this rule. If hash_columns is provided, all mask types that rely on randomisation become deterministic based on the hash_columns column values. Values in the provided columns can be null. A hash column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema. Hash columns can also be specified as a list of dictionaries with the following keys:
  - column_name (required): The name of the column which will be used as input, as above.
  - case_transform (optional): upper/lower - Applies a case transform to the input values, for consistent hashed values irrespective of case. This is useful if values are stored with different cases in different tables, allowing for consistent hashing on those values. For example, email addresses could be stored as all lowercase in one table but mixed case in another.
  - json_path (optional): If the column contains JSON data, the path to the value in the JSON data you wish to hash can be specified here; otherwise the hash will be performed on the entire column. For more information on json_path, refer to the JSON documentation.
  - xpath (optional): If the column contains XML data, the XPath to the value in the XML data you wish to hash can be specified here; otherwise the hash will be performed on the entire column. For more information on xpath, refer to the XML documentation.

For more information on deterministic masking, refer to Deterministic masking.
workers (optional): The number of parallel processes to use for masking this table (defaults to 1). Each process will operate on a separate batch of rows from the table (batch size is a configurable run option). Increasing workers may decrease masking run times, depending on database performance and the complexity (e.g. number of columns to mask) of the masking task. It is recommended to increase the number of workers if DataMasque connections to your target database spend more time processing queries than waiting for DataMasque (i.e. the "waiting for clients" time approximates DataMasque CPU time), which suggests DataMasque could efficiently use multiple workers to mask other batches while waiting for database responses. Additionally, it is recommended that the number of parallel processes multiplied by the number of workers assigned to each process does not exceed twice the number of CPUs available on your deployed instance.
index_key_columns (optional): When index_key_columns is true (the default setting), DataMasque will create an additional index on all key columns if no existing index contains all key columns. You may wish to disable the creation of a new index if you have an existing index on some of the key columns that provides adequate performance.
- For Oracle databases, this option has no effect because ROWID is always used as the key.
- Redshift databases do not support indexes, so this option has no effect for them.
- This option currently has no effect for Microsoft SQL Server (Linked Server) databases.
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - user_id
      - first_name
    rules:
      - column: date_of_birth
        hash_columns:
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'
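The workers and index_key_columns parameters described above can be combined as in the following sketch (the table and column names are illustrative, and the worker count should be tuned to your database's performance):

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    workers: 4                # mask up to four batches of rows in parallel
    index_key_columns: false  # assumes an adequate index on user_id already exists
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'
```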
Composite keys
A composite key can be specified in the following formats:
- A list of columns; for example, key: ['invoice_id', 'product_id'] would be used to indicate a composite key consisting of a combination of two columns, invoice_id and product_id.
  - For Microsoft SQL Server and PostgreSQL, when specifying a composite key, the order of the keys listed must respect the original order as defined in the database.
- A multiline composite key. An example of a multiline composite key is shown below.

key:
  - 'invoice_id'
  - 'product_id'
Example mask_table ruleset
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: "redacted last name"
Notes
- Index operations will be performed online (ONLINE=ON) on SQL Server editions that support this feature.
- The following types cannot be used as key columns:
  - Microsoft SQL Server: datetime, time(7), datetime2(7), datetimeoffset(7)
  - PostgreSQL: real, double precision
- While rules and masks should typically be provided as lists, they can also be specified as dictionaries that map arbitrary keys to rules/masks. For example:
...
rules:
  last_name_rule:
    column: last_name
    masks:
      fixed_mask:
        type: from_fixed
        value: "redacted last name"
- Specifying rules or masks as a dictionary can allow you to override the rule/mask for a specific key when inheriting from a definition.
- When masking a table, if a non-unique key is specified for the mask_table task alongside hash_columns, and the value of the hash_columns is different for more than one row with the same key value, the final masked values will arbitrarily depend on the order in which update statements are executed. This can be avoided by including the targeted hash_columns as part of a composite key for the mask_table task.
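The note above about non-unique keys can be sketched in a ruleset like the following (the users table, region column, and other names are illustrative; the point is that the targeted hash column, first_name, is included in the composite key so repeated key values cannot produce order-dependent results):

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - region       # non-unique on its own
      - first_name   # targeted hash column, added to the composite key
    rules:
      - column: date_of_birth
        hash_columns:
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'
```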
Mask a primary key or unique key
The mask_unique_key
task type can be used to mask the values in a primary key or unique key.
Masking of a primary key or unique key requires that all masked values are unique, which calls for this special-purpose task type.
The mask_unique_key
task type replaces all non-null rows of the target key with new, unique
values, generated in accordance with a user-specified format. The target primary or unique key
columns and associated foreign key columns are updated with these unique replacement values in a
single operation to maintain referential integrity.
Any foreign keys that reference a column to be updated will be disabled prior to masking in order to prevent referential integrity errors, and will be re-enabled after the completion of the task.
Notes:
- Each mask_unique_key task will mask the members of a single primary key or unique key constraint. Multiple mask_unique_key tasks are required to mask multiple independent unique keys on a single table.
- The mask_unique_key task can be used on tables with up to 200,000,000 non-null rows.
- When applied to a composite key, replacement values are only generated for rows that contain a complete, non-null key. For any null or partially null rows, all columns of the target key will be set to NULL.
- The mask_unique_key task must only be applied to columns which are in fact unique (i.e. the target key columns have a PRIMARY KEY or UNIQUE constraint enforced). Unique keys that have multiple NULL rows (e.g. using a filtered unique index in SQL Server) are allowed; such rows will not be modified by this task.
- When masking a clustered index on SQL Server, the performance of mask_unique_key can be significantly improved by disabling all other indexes and constraints on the target table for the duration of the task. It is recommended to implement this in your ruleset using run_sql tasks before and after the mask_unique_key task to disable and then re-enable these constraints.
- The mask_unique_key task does not support SQL Server columns created with the IDENTITY property, or Oracle / PostgreSQL columns created with GENERATED ALWAYS AS IDENTITY.
- Use of mask_unique_key for Amazon DynamoDB, Amazon Redshift or Microsoft SQL Server (Linked Server) databases is not currently supported in DataMasque; consider using mask_table with a from_unique mask instead.
- Due to the random assignment of replacement values, it is possible (though generally rare) that a row may be assigned a masked value that is identical to its pre-masking value. In these cases, the masking is still effective, as an attacker will not be able to identify which rows' values were replaced with an identical value. However, if you need to guarantee that all masked values are different from their pre-masking values, you should use min and max parameters to ensure the range of possible output values from your format string does not overlap with the range of pre-masking values in your database.
- Notes for Amazon DynamoDB: The mask_unique_key task type is not supported for Amazon DynamoDB. The from_unique mask should be used instead.
- Notes for masking files: The mask_unique_key task type is not supported for file masking. The from_unique mask should be used instead.
Warning: The mask_unique_key task type must not be run in parallel with tasks that operate on any of the following:
- The target table of the mask_unique_key task.
- Tables containing foreign keys that reference the target_key columns (either directly or through a chain of foreign keys).
- Any tables specified in additional_cascades.
Parameters
Each task with type mask_unique_key
is defined by the following parameters:
table (required): The name of the database table that contains the primary key or unique key to be masked.
target_key (required): A list of items defining each column that makes up the primary or unique key, and the format in which replacement values will be generated for that column. Composite keys can be masked by including multiple columns and formats in this list. Each item has the following attributes:
- column (required): The name of the column to be masked.
- format (optional): The format which will be used to generate replacement values for the column. See Format string syntax for details. Defaults to '{!int}'.

additional_cascades (optional): Use this parameter to propagate masked values to implied foreign keys of the target_key. Implied foreign keys are dependencies that exist between tables but are not enforced by foreign key constraints, and hence are not defined in the database. Masked values will be cascaded to these columns; see Cascading of masked values for more details on how this works. Each additional_cascades item has the following attributes:
- table (required): The name of the table containing the cascade target columns, which have an implicit reference to the target_key of this task. The table name can be prefixed with a schema name to cascade to a table in another schema.
- columns (required): A list of column dictionaries, each describing the relationship between a column of the target key and a column on the cascade target table. Each column mapping item has the following attributes:
  - source (required): The name of a column in the target key from which masked values will be cascaded to the corresponding target column.
  - target (required): The name of a column on the cascade target table to which masked values from the source column will be cascaded.
- update_foreign_keys (optional): If set to true, foreign keys referencing the target columns will be updated in the same way as foreign keys referencing the target_key. Defaults to true.

batch_size (optional): To avoid excessive memory consumption when masking large tables, DataMasque generates replacement values in batches. This value controls the maximum number of unique values that are generated in a single batch. In general, the default of 50,000 will be acceptable for most use cases. The batch size used by DataMasque will be limited by the number of rows in the target table.
Note:
- When using additional_cascades to propagate composite primary or unique key columns to implied foreign key columns, you need to specify in the additional_cascades parameter all the columns that correspond to the referenced primary or unique key columns.
- Warning! Using additional_cascades to propagate only some of the columns of a composite primary or unique key is not recommended. Not specifying all the columns that correspond to the referenced primary or unique key columns will cause data propagation to fail from the referenced composite primary or unique key columns to the composite implied foreign keys. In other words, partial cascades that reference a subset of the target key columns will result in data for omitted columns not being propagated to the target table, potentially resulting in inconsistent data between the two tables.
- For Oracle databases, all columns of the target_key must be referenced as source columns in additional_cascades. Partial cascades that reference a subset of the target key columns are disallowed.

For example: a composite unique key in a PhoneNumbers table consists of the Prefix and LineNumber columns, which are referenced by the PhPrefix and PhLineNumber columns in the CallLogs table, but without a foreign key constraint. It is therefore an implied foreign key that requires using the additional_cascades parameter to propagate the masked unique key values, to ensure data integrity across the tables. A ruleset needs to specify all corresponding implicit foreign key columns in the additional_cascades parameter so that the masked unique keys will be propagated collectively to the foreign keys:
version: "1.0"
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix  # part of the composite unique key constraint
        format: "{!int, 1:150, pad}"
      - column: LineNumber  # part of the composite unique key constraint
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      - table: CallLogs
        columns:
          # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber
          # for data to propagate properly.
          - source: Prefix  # UK a
            target: PhPrefix
          - source: LineNumber  # UK a
            target: PhLineNumber
- However, if they are not composite foreign keys but individual foreign keys, a ruleset needs to be written to propagate the masked unique keys individually:
version: "1.0"
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix  # has its own unique key constraint
        format: "{!int, 1:150, pad}"
      - column: LineNumber  # has its own unique key constraint
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber
      # for data to propagate properly.
      - table: CallLogs
        columns:
          - source: Prefix  # UK a
            target: PhPrefix
      - table: CallLogs
        columns:
          - source: LineNumber  # UK b
            target: PhLineNumber
Example 1
The following example will mask the primary key column AccountNumber
of the Users
table with unique
replacement values. Another table, Transactions
, also has a column named AccountNumber
which
has a foreign key relationship to the AccountNumber
column of the Users
table.
Account numbers will be generated with at least 6 digits; the minimum value being 100,000
and the
maximum value growing as required depending on the number of rows in the table.
In the ruleset below, you only need to specify masking rules for the primary key column, AccountNumber
,
to be masked. You do not need to explicitly define the foreign key columns to propagate the replacement values
to in the ruleset. DataMasque will automatically detect primary key and foreign key relationships in the database
and propagate the replacement values to any related foreign key columns; in this case, the new values for
AccountNumber
in the Users
table are implicitly propagated to the AccountNumber
column in the Transactions
table.
version: '1.0'
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: '"AccountNumber"'
        format: '{!int,100000:}' # Account numbers will be generated with at least 6 digits
Example 2
The following example will mask a composite unique key of a PhoneNumbers
table. The composite key
consists of the following columns:
- Prefix VARCHAR(3): containing a zero-padded integer in the range 1-150, e.g. 001, 002, etc.
- LineNumber VARCHAR(6): containing a zero-padded integer in the range 50,001-100,000.
Values matching these specific formats can be generated using the following format strings:
- '{!int, 1:150, pad}': Generates integers between 1-150 (inclusive), zero-padded to a fixed width of 3 characters.
- '{!int, 50001:100000, pad}': Generates integers between 50,001-100,000 (inclusive), zero-padded to a fixed width of 6 characters.
Because both of these formats have an upper value bound, we must consider the maximum number of
unique composite values that are available in this space. Multiplying the number of values in the
two ranges (150 * 50,000
), we can determine that these two format strings will supply us with
7,500,000 rows of unique composite values. As a result, this task would fail if applied to a table
containing more than 7,500,000 (non-null) rows.
In addition to the PhoneNumbers
table, this schema also includes a CallLogs
table containing
the columns PhPrefix
and PhLineNumber
, which are references to the values in the Prefix
and
LineNumber
columns of the PhoneNumbers
table. However, due to specific requirements of this
schema, these references are not defined using a foreign key constraint. Without a foreign key,
DataMasque will not automatically propagate the replacement values generated for the PhoneNumbers
table to the CallLogs
table.
In order to ensure the referential integrity of these implicit
references is maintained during masking, this example uses additional_cascades
to instruct
DataMasque on how to cascade updated values to these columns. DataMasque will take the values of Prefix
and LineNumber
columns of the PhoneNumbers
table and propagate these values to the PhPrefix
and PhLineNumber
columns of the CallLogs
table.
version: '1.0'
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix
        format: "{!int, 1:150, pad}"
      - column: LineNumber
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      - table: CallLogs
        columns:
          - source: Prefix
            target: PhPrefix
          - source: LineNumber
            target: PhLineNumber
Example 3
Consider a database with two tables and two schemas: Customers
in the Accounts
schema; and Transactions
in the
Sales
schema. The following example will mask the primary key column CustomerId
of the Accounts.Customers
table.
However, any changes made to this CustomerId
column must also be reflected on the Customer
column of the
Sales.Transactions
table. Due to specific requirements of this database, these cross-schema references are not defined
using a foreign key constraint. In order to maintain referential integrity, this example uses additional_cascades
to
instruct DataMasque how to cascade updated values to the relevant column present in the other schema.
The Customer ID consists of 3 letters, followed by a hyphen, then a 4-digit number. This will be constructed using a format string to ensure any values generated conform to the required standards.
version: "1.0"
tasks:
  - type: mask_unique_key
    table: Accounts.Customers
    target_key:
      - column: CustomerId
        format: "{[a-z],3}-{[0-9],4}"
    additional_cascades:
      - table: Sales.Transactions
        columns:
          - source: CustomerId
            target: Customer
Format string syntax
The format string syntax used by DataMasque to generate unique replacement values currently supports format strings that are combinations of alphanumeric characters and symbols. This can be used to produce key values that combine randomly generated portions with fixed formatting, matching any format required in your key columns.
Format strings consist of variable components that are declared inside
braces {}
, as well as fixed components outside of braces.
Note: In order to guarantee the uniqueness of generated values, format strings used with
mask_unique_key
andfrom_unique
may contain at most one variable component whose length is not constant (e.g. a variable component for an unpadded random integer whose length will depend on the number of digits:{!int, 1:100}
).
Format strings can be constructed using character sets. Character are wrapped in braces and followed by a comma and a number to indicate how many characters in the set are to be generated.
Character Set | Description
---|---
[a-z] | Lower case alphabetical characters.
[A-Z] | Upper case alphabetical characters.
[0-9] | Numerical characters.
[aeiou] | Any vowel. Characters can be individually specified without using a range.
An example format string is shown below.
format: "{[a-z],2}_{[A-Z],2}-{[a-zA-Z],3}#{[0-9],5}"
In this example, there are four sets of curly braces within the format string, each providing a character set followed by a number indicating how many of those characters to generate.
- The first braces specify {[a-z],2}, which will generate a string of 2 lower case alphabetical characters.
- After this, there is an underscore outside the braces, so all generated values will have an underscore after the 2 lower case characters.
- The second braces specify {[A-Z],2}, which will generate a string of 2 upper case alphabetical characters.
- After this second variable, there is a hyphen -, meaning a hyphen will always be present after the 2 upper case characters.
- The third braces specify {[a-zA-Z],3}, which will generate a string of length 3 consisting of both lower case and upper case alphabetical characters.
- After this, there is a hash #, meaning a hash character will always be placed after the third generated string.
- The final braces specify {[0-9],5}, meaning a 5-digit number is placed at the end of the string.
The following values are some example outputs that may be generated using the format string above:
ab_TJ-RaK#10496
pt_oq-TRu#49511
iu_QE-unT#67312
nd_UL-bES#97638
Example
In this example, we wish to mask a series of number plates.
A number plate consists of 3 alphabetic characters, a hyphen, then 3 numerical digits. We wish to replace the first letter of each number plate with a fixed X, followed by 2 random alphabetical characters, followed by a hyphen and a 3-digit number.
A snippet of the table is shown below, where number_plate is a unique key of the table.
car_registration
Table
number_plate | car_owner |
---|---|
AAA-111 | Anastasia |
BBB-222 | Bill |
CCC-333 | Chris |
DDD-444 | Judith |
EEE-444 | Gordon |
In this case, we will use the following format string.
format: "X{[A-Z],2}-{[0-9],3}"
We generate a fixed value of X, followed by 2 upper case alphabetical characters, as defined by {[A-Z],2}. After this, there is a hyphen outside of braces, so a hyphen will always be generated. Finally, 3 random numerical digits are generated.
A ruleset utilising this format string in a mask_unique_key task is shown below.
version: "1.0"
tasks:
  - type: mask_unique_key
    table: car_registration
    target_key:
      - column: number_plate
        format: "X{[A-Z],2}-{[0-9],3}"
Show result
car_registration table: before and after masking.
Integer string format syntax
The !int
operator can be used to generate integers. In its most basic use with no arguments, the format string
{!int}
will generate integers from one to infinity. Extra parameters can be added to set the output range or pad the
output.
Range (min:max):
The range defines all possible integer values that may be generated by the integer generator. This is an optional parameter, defaulting to 1: (min=1, max=unbounded).
- min (optional): The minimum value which will be generated (inclusive). Defaults to 1.
- max (optional): The maximum value which will be generated (inclusive). When this value is not specified the maximum value will be unbounded, meaning it will grow depending on the number of values required.
For example:
- {!int, 5:}: generate integers from 5 (inclusive) to infinity.
- {!int, :100}: generate integers from 1 to 100 (inclusive).
- {!int, 20:80}: generate integers from 20 to 80 (inclusive).
Zero-pad (pad):
- pad (optional): When specified, zero-padding will be applied to generated integers, resulting in fixed character width replacement values (e.g. 001, 002, …, 999). The zero-pad width is determined by the width of the max value, and is therefore only a valid option when a max value is specified.
For example:
- {!int, :1000, pad}: generate strings in the format 0001, 0002, etc., up to 1000.
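As an illustrative sketch only (the table and column names here are hypothetical), the integer syntax above might be used in a mask_unique_key task like this:

```yaml
version: "1.0"
tasks:
  - type: mask_unique_key
    # Hypothetical table and column, for illustration only.
    table: invoices
    target_key:
      - column: invoice_number
        # Zero-padded integers between 1 and 9999: 0001, 0002, ..., 9999.
        format: "{!int, :9999, pad}"
```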
Note: Always wrap format strings in either single or double quotes. Leaving format strings unquoted in the ruleset will result in invalid YAML due to the opening
{
character, which has reserved usage in YAML. See below for examples of a correctly quoted format string:
format: '{!int}'      # Single quotation marks
format: "{!int, pad}" # or double quotation marks
Hex string format syntax
Integers can be generated and output in a hexadecimal format by using the !hex operator. The range and pad options apply in the same way as for standard integer generation; however, range values are interpreted as hexadecimal rather than decimal.
For example:
- {!hex}: generate hex strings from 1 to infinity, i.e. 1, 2, …, a, b, …, ff, 100, etc.
- {!hex, 10:100, pad}: generate hex strings from 0x10 to 0x100 (inclusive) with padding, i.e. 010, 011, …, 0fe, 0ff, 100.
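A hedged sketch of the hex syntax in a ruleset (the table and column names are hypothetical, and the range bounds are written in hexadecimal as described above):

```yaml
version: "1.0"
tasks:
  - type: mask_unique_key
    # Hypothetical table and column, for illustration only.
    table: devices
    target_key:
      - column: device_code
        # Padded hex strings from 0x10 to 0xfff: 010, 011, ..., fff.
        format: "{!hex, 10:fff, pad}"
```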
UUID string format syntax
You may choose to generate values in the Universally Unique Identifier (UUID) format by declaring a !uuid format string.
A UUID is a string of 32 hexadecimal digits (0 to 9, a to f), grouped 8-4-4-4-12 and separated by hyphens. An example UUID would be 12345678-90ab-cdef-1234-567890abcdef.
To generate a UUID as the unique key, simply specify !uuid in the format as shown below.
format: "{!uuid}"
You may also specify a prefix of up to 8 characters within the format string. This ensures that the first characters of every generated UUID are always static. For example, specifying format: "{!uuid,aaaa}" will cause the first 4 characters of every UUID generated by the ruleset to be aaaa.
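For example, a minimal sketch of a mask_unique_key task generating UUID keys (the table and column names are hypothetical):

```yaml
version: "1.0"
tasks:
  - type: mask_unique_key
    # Hypothetical table and column, for illustration only.
    table: sessions
    target_key:
      - column: session_uuid
        format: "{!uuid}"
```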
Alternatives string format syntax
You can have DataMasque select one value from a set of alternatives for each generated value. Such a segment is specified by wrapping a set of pipe (|) separated alternatives in parentheses (). For example, {(EN|FR)}-{!int} can be used to generate an integer prefixed by either EN or FR. At least two alternatives must be specified.
Note: This should only be used with the from_format_string mask in a mask_table task, as it will not satisfy the uniqueness requirement of the mask_unique_key task.
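A minimal sketch of alternatives in a mask_table task (the table and column names are hypothetical, and from_format_string is assumed to accept a format parameter like the one shown here):

```yaml
version: "1.0"
tasks:
  - type: mask_table
    # Hypothetical table and column, for illustration only.
    table: documents
    key: id
    rules:
      - column: locale_code
        masks:
          # Produces values such as EN-42 or FR-7; uniqueness is not guaranteed.
          - type: from_format_string
            format: "{(EN|FR)}-{!int}"
```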
Cascading of masked values
Referential integrity of data references to the target_key
of a mask_unique_key
task is
maintained by "cascading" the masked replacement values to each reference. DataMasque will perform
this cascade automatically for relationships defined by a database foreign key constraint. Masked
values can also be cascaded to columns that are not members of such a foreign key constraint by
using the additional_cascades
feature.
Specifically, foreign keys that meet the following criteria will be updated:
- Foreign keys of the target_key, or of an additional_cascade with update_foreign_keys: true, will be updated.
- Only foreign keys in the same schema as the target table will be updated.
- Only foreign keys that are enabled will be updated. Foreign keys that are present but disabled at the time of masking will be excluded. Specifically:
  - In Oracle, foreign keys disabled by ALTER TABLE some_table DISABLE CONSTRAINT some_constraint are not updated.
  - In Microsoft SQL Server, foreign keys disabled by ALTER TABLE some_table NOCHECK CONSTRAINT some_constraint are not updated.
  - Note that MySQL, PostgreSQL and Redshift do not have equivalent support for disabling foreign key constraints, so all discovered foreign keys are updated.
- Foreign keys will be searched for recursively, i.e. foreign keys that reference an already-updated foreign key will also be updated.
- A foreign key that references a superset of the columns in its referenced key or the original target_key will be updated, but the columns that do not reference the target_key will be left unchanged.
  - For example, consider using mask_unique_key to mask a key column user_id. If a composite foreign key (ref_user_id, ref_group_id) references user_id and another column group_id, then the ref_user_id in that foreign key will be updated but the ref_group_id will not.
- A foreign key that references a subset of the columns in the target_key will NOT be updated, as row references may be ambiguous.
  - For example, consider using mask_unique_key to mask a composite key (user_id, group_id). If a foreign key references user_id or group_id alone, then that foreign key will not be updated.
  - Note that if an additional_cascade with update_foreign_keys: true uses the same target_key column as a source_column multiple times, then foreign keys of that additional_cascade will still be updated if they reference each column in the original target_key at least once.
  - Additionally, if an additional_cascade with update_foreign_keys: true references only a subset of the target_key columns, then foreign keys of that additional_cascade will be updated as long as they reference all of the target columns in the additional_cascade.
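The criteria above can be sketched in a ruleset. This is an illustration only, with hypothetical table and column names, assuming update_foreign_keys is set per cascade:

```yaml
version: "1.0"
tasks:
  - type: mask_unique_key
    table: users
    target_key:
      - column: user_id
        format: "{!int, :999999, pad}"
    additional_cascades:
      - table: audit_log
        # Enabled foreign keys referencing this cascade's columns
        # will also be updated.
        update_foreign_keys: true
        columns:
          - source: user_id
            target: audit_user_id
```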
If any cascade would update the same set or a subset of the columns of another cascade, it will be marked as redundant and only the other cascade will be performed. For example, consider using mask_unique_key to mask a key column user_id where a foreign key column ref_user_id references user_id, and an additional_cascade is specified to update both ref_user_id and ref_user_id_2 from the single source_column of user_id. In this situation, the foreign key is redundant as it updates a subset of the columns in the additional cascade, so only the update specified by the additional cascade needs to be performed.
The following situations would result in an unresolvable set of cascades, and will cause DataMasque to report an error in the run log:
- Any cascade whose columns partially overlap those of another cascade, without either column set being a superset of, subset of, or equal to the other column set.
  - For example, consider using mask_unique_key to mask a key column user_id that is referenced by a composite foreign key (ref_user_id, ref_group_id) while an additional_cascade is specified to update columns ref_user_id and ref_role_id. Because the foreign key and additional_cascade would both update ref_user_id but would also update columns that the other cascade would not update (ref_group_id and ref_role_id), they cannot be resolved and an error will be reported.
- Any cascade that would update a column from a different column in the target_key than another cascade.
  - For example, consider using mask_unique_key to mask a composite key (user_id, group_id) that is referenced by a composite foreign key (ref_user_id, ref_group_id) while an additional_cascade is specified to update ref_user_id=group_id and ref_group_id=user_id. Because the foreign key and additional_cascade would update ref_user_id and ref_group_id from different source columns, they cannot be resolved and an error will be reported.
Any rows of a cascade target (child table) which contain values that are not present in the target_key (on the parent table) will have their cascade target columns set to NULL. This situation may occur in one of the following cases:
- The cascade target is a foreign key that has at some point been disabled, had values updated, then been re-enabled without being checked/validated, i.e.:
  - For Oracle, the constraint was re-enabled using NOVALIDATE.
  - For Microsoft SQL Server, the constraint was re-enabled without using WITH CHECK.
  - For PostgreSQL, the constraint was dropped and recreated instead of being disabled and re-enabled.
- The cascade target is an implicit reference without database constraints.
This behaviour is designed to ensure that no rows are left unmasked on the cascade target.
Important!
- The reported number of rows updated by
mask_unique_key
will include updates made by all cascades, and count each time a row is affected by a different update (i.e. multiple cascade updates on the same table).
Build a temporary table
If you need to repeatedly join multiple tables when masking tables,
or you need to perform some custom SQL
transformations to column values, then you may wish to use the
build_temp_table
task type to create a temporary table which can then be
accessed via a join during a mask_table
task:
Parameters
table_name
(required): The name of the temporary table to create in the database. You will need to use this name when referencing this temporary table later (e.g. injoins
and masking rules). The table name can be prefixed with a schema name if the temporary table should be created in a schema other than the user's default schema.sql_select_statement
(this ORsql_select_file
required): A string containing aSELECT
statement to define the contents of the temporary table. To break the statement across multiple lines, you may use multi-line YAML syntax (|-
or>-
).sql_select_file
(this ORsql_select_statement
required): The name of a user uploaded SQL script file containing aSELECT
query to define the contents of the temporary table. See the Files guide for more information on uploading SQL script files. Use this parameter if you have a complex / long query, or you wish to share the same query between many rulesets.
Example (sql_select_statement
)
version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_statement: >-
      SELECT accounts.account_id, address.city
      FROM accounts
      INNER JOIN address
      ON accounts.address_id = address.id
      WHERE accounts.country = 'New Zealand';
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city
Example (sql_select_file
)
version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_file: create_temp_table.sql
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city
Notes:
- DataMasque will create the temporary tables before applying any masks, and delete them after all tables have been masked. DataMasque will also ensure the temporary tables do not already exist in the database (removing existing temporary tables with the same name if needed). The temporary tables you define will only be available for use in joins, and cannot be masked themselves.
- For Microsoft SQL Server databases, temporary table names must begin with the
##
characters, as they will be created as 'Global Temporary Tables' so that they are visible to all parallel masking connections. However, in YAML the#
character begins an inline comment, so the temporary table name must be wrapped in double or single quotes (e.g.table: '##my_temporary_table'
).- For Oracle and PostgreSQL databases, 'temporary tables' are created as regular tables so that temporary tables are visible to all parallel masking connections.
- The
build_temp_table
task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
Run SQL
Use the run_sql
task type when you need to run any native SQL DDL or DML statements or scripts at any point within a
ruleset.
You can supply SQL for DataMasque to execute either as a script file (see Files guide), or inline in the ruleset:
Parameters
sql
(this ORsql_file
required): An SQL script to be executed. For multi-line scripts, you may use the YAML block style syntax (|-
).sql_file
(this ORsql
required): The name of a user-provided file containing an SQL script to be executed (see Files guide). Use this parameter if you have large blocks of SQL to run, or scripts that you wish to share between many rulesets.
Example Microsoft SQL Server (sql
)
Note: This example uses Microsoft SQL Server specific syntax, as master has been specified as the target database.
version: '1.0'
tasks:
  - type: run_sql
    sql: |-
      USE [master];
      ALTER DATABASE eCommerce SET RECOVERY SIMPLE WITH NO_WAIT;
      USE [eCommerce];
      ALTER TABLE [SalesRecords].[Customer] DROP CONSTRAINT [FK_SALESRECORDS_CUSTOMER];
      ALTER TABLE [Invoices].[Customer] DROP CONSTRAINT [FK_INVOICES_CUSTOMER];
Example (sql_file
)
tasks:
  - type: run_sql
    sql_file: pre_script_1.sql
Notes:
- The
run_sql
task type executes in autocommit mode, and will exit on the first error encountered.- The
run_sql
task type does not run in dry run mode.- The
run_sql
task type is not currently supported for Microsoft SQL Server (Linked Server) databases.- For PostgreSQL and MySQL connections, the SQL interpreter considers a colon followed by letters or numbers as bound parameters. The colon can be escaped with a
\
before it. e.g. Attempting to insert JSON data:INSERT INTO table_name (column_name) VALUES('{"is_real" :true, "key":"value"}'::json)
run_sql
will interprettrue
as a bound parameter, to fix this you can escape the bound parameter by adding a backslash(\
) before the colon(:
) as follows:INSERT INTO table_name (column_name) VALUES('{"is_real" \:true, "key":"value"}'::json)
Notes for Oracle:
- It is recommended to test the execution of your SQL script with Oracle SQLPlus before use in a
run_sql
task.- DataMasque appends "WHENEVER SQLERROR EXIT SQL.SQLCODE" to the beginning of the SQL script, so the
run_sql
task will exit on the first error encountered.- Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed
run_sql
task.run_sql
does not use the schema specified in the connection configuration, instead it will default to the schema of- the user. If you wish to change schema, specify it in the script with
SET SCHEMA
.
Notes for Microsoft SQL Server:
- It is recommended to test the execution of your SQL script with Microsoft
sqlcmd
or SQL Server Management Studio before use in arun_sql
task.- Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed
run_sql
task.
Notes for PostgreSQL:
- It is recommended to test the execution of your SQL script with
psql
before use in arun_sql
task.- A
run_sql
task will be executed with a simple query cycle, where statements are executed in a single transaction (unless explicit transaction control commands are included to force a different behaviour).
Notes for Redshift:
- Using multiple SQL statements in a single
run_sql
task type is not supported currently.
Notes for MySQL:
- It is recommended to test the execution of your SQL script with MySQL shell before use in a
run_sql
task.- Warning: Executing
run_sql
tasks with more than one statement may not raise errors upon failure. If the first statement executes correctly but subsequent statements fail, errors may not be raised.
Notes for Amazon DynamoDB:
- The
run_sql
task type is not supported for Amazon DynamoDB.
Truncate a table
Use the truncate_table
task type to specify tables to be truncated by
DataMasque. All rows will be deleted, but the table structure will be left in
place.
Parameters
table
(required): The name of the table to truncate. The table name can be prefixed with a schema name to reference a table in another schema.
Example
tasks:
  - type: truncate_table
    table: history_table
...
Notes:
- The
truncate_table
task type does not run in dry run mode.- The
truncate_table
task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
Parallel Tasks
Using the parallel
task type, you can specify a block of tasks to be executed
in parallel, spread across as many workers as are available.
Each parallel task distributes to a maximum of 10 sub-tasks. It is recommended to begin testing parallelisation with at most 4 tasks in parallel, then increase parallelisation if the database has more capacity.
Parallel tasks can be nested inside other serial/parallel tasks.
Parameters
tasks
(required): A set of tasks to perform in parallel.
Example
tasks:
  - type: parallel
    tasks:
      - type: mask_table
        table: employees
        key: id
        rules:
          - column: 'name'
            masks:
              - type: from_fixed
                value: 'REDACTED'
      - type: mask_table
        table: customers
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
...
Warning: You should not mask the same table in multiple tasks (including
mask_table
,run_sql
tasks) in parallel, as this could result in data being incorrectly masked.Note: The
parallel
task type is not currently supported for Amazon DynamoDB.
Serial Tasks
Although tasks are performed serially in the order they are listed in the ruleset by default, you can specify a block of tasks to be performed in serial within a parallel block. This is useful when a subset of parallelisable tasks have dependencies that mean they must be executed in sequence.
Serial tasks can be nested inside other serial/parallel tasks.
Parameters
tasks
(required): A set of tasks to perform in series.
Example
tasks:
  - type: parallel
    tasks:
      - type: serial
        tasks:
          - type: run_sql
            sql_file: pre_employees_script.sql
          - type: mask_table
            table: 'employees'
            key: id
            rules:
              - column: 'name'
                masks:
                  - type: from_fixed
                    value: 'REDACTED'
      - type: mask_table
        table: 'customers'
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
...
Masking files
Each mask_file or mask_tabular_file task specifies the masking rules to apply to each file in the base directory and/or any subdirectories, as well as any files/directories to be skipped or included, and any conditionals required to define which data to mask.
Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value
from each mask is passed as the input to the next mask in the sequence.
Note If the source connection and destination connection are of the same type and have the same base directory, the files will be overwritten. The list of files is read at the start of the masking run, so new files added during the masking run will not be masked and will not be present in the destination.
File task types
A ruleset consists of a list of tasks to be performed in sequence on the target data source. After each file is masked, it is written to the selected data destination.
Object file masks
Each mask_file
task defines masking operations to be performed on a file or set of files. More detail is provided
about these tasks under the Masking files section.
Parameters
Each task with type mask_file
is defined by the following parameters:
rules
(required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:masks
(required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the file. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
recurse
(optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults tofalse
.workers
(optional): The number of parallel workers to use for this masking task. Defaults to 1.skip
(optional): Specifies files to not include in the masking run.regex
(optional): Files that match this regular expression will not be included for masking.glob
(optional): Files that match this glob will not be included for masking.match_against
(optional): Choose if theskip
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
include
(optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by askip
rule).regex
(optional): Files that match this regular expression will be masked.glob
(optional): Files that match this glob will be masked.match_against
(optional): Choose if theinclude
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
encoding
(optional): The encoding to use when reading and writing files. Defaults toUTF-8
. Refer to Python Standard Encodings for a list of supported encodings.
For more information about the ordering of
skip
andinclude
please refer to Include/Skip.
Note: regex/glob will match against the path from the base directory specified in the source connection; consider adding .* (regex) or * (glob) to the beginning of the expression. For example, if the structure is /path1/path2/target_file.json, the base directory is path1/, and recurse: true is set in the ruleset, then the regex/glob will try to match path2/target_file.json. When including a path by specifying a glob such as target_path/*, the recurse option needs to be set to true, otherwise the included path won't be entered and the files it contains will not be masked.
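As a sketch, match_against can limit a skip rule to the filename rather than the full path (the file pattern here is hypothetical):

```yaml
version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      # Skip any file named legacy.json, in any directory.
      - glob: "legacy.json"
        match_against: filename
    rules:
      - masks:
          - type: from_fixed
            value: REDACTED
```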
Supported file types
In general, mask_file
has been designed to mask XML, JSON, NDJSON, and Apache Avro files. Each file is loaded as a string and passed to the
masks
. Therefore, to mask a JSON file, a json
mask would be implemented, like the following example:
version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      - regex: '.*\.json'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*.json"
    rules:
      - masks:
          - type: json
            transforms:
              - path: ['name']
                masks:
                  - type: from_fixed
                    value: REDACTED
This would replace the root name
attribute in the JSON with the text REDACTED.
Multirecord Files (NDJSON/Avro)
NDJSON (Newline Delimited JSON) (.ndjson
) files can be masked by specifying a json
mask in the ruleset which will be
applied to each line of JSON in the NDJSON file.
For NDJSON files, a similar ruleset can be specified which will be applied to each record of the file, rather than being applied to the entire file.
version: "1.0"
tasks:
  - type: mask_file
    rules:
      - masks:
          - type: json
            transforms:
              - path: ['name']
                masks:
                  - type: from_fixed
                    value: REDACTED
This would replace the root name
attribute for each record in the NDJSON with the text REDACTED.
Apache Avro (.avro
) files can be masked in the same way as NDJSON files. When masking Avro files there also needs to be an
Avro Schema file (.avsc
) in the same directory and with the same name as the .avro
file, as the schema file is required to
write the masked .avro
file to ensure data type consistency.
For example, if a file is called data.avro
DataMasque will expect to find a schema file called data.avsc
in the same directory.
Note: The Avro Schema file (
.avsc
) will be copied unmasked into the output directory. Deterministic Masking is also supported for multirecord files, for more information please refer to Deterministic Masking with Files.
Similarly, for XML files, use an xml
mask:
version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      - regex: '^(.*).xml'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*.xml"
    rules:
      - masks:
          - type: xml
            transforms:
              - path: 'User/Name'
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
This would replace the content of the node(s) at User/Name
with the text REDACTED.
To mask other types of files, basic redaction is possible. For example, to replace the contents of every txt
file with
the text REDACTED:
version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
      - glob: "*.txt"
    rules:
      - masks:
          - type: from_fixed
            value: REDACTED
It is possible to use any mask that accepts text input (or no input), although their effectiveness will depend on the size and content of the input file.
Note also, that files that have not been processed will not be copied from the source to the destination. That is,
DataMasque will either load a file (based on skip
/include
rules), mask it, then copy it to the destination, or
it will ignore the file. Unmasked files will not be copied to the destination.
Tabular file masks
Each mask_tabular_file
task defines masking operations to be performed on a file or set of files (CSV, Parquet or
fixed-width columns). More detail is provided about these tasks under the Masking files section.
Parameters
Each task with type mask_tabular_file
is defined by the following parameters:
rules
(required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:column
(required): A column within the tabular file intended for masking. This is the header row of the column (for CSVs) or name of column (for parquet files).masks
(required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
recurse
(optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults tofalse
.workers
(optional): The number of parallel workers to use for this masking task. Defaults to 1.skip
(optional): Specifies files to not include in the masking run.regex
(optional): Files that match this regular expression will not be included for masking.glob
(optional): Files that match this glob will not be included for masking.match_against
(optional): Choose if theskip
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
include
(optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by askip
rule).regex
(optional): Files that match this regular expression will be masked.glob
(optional): Files that match this glob will be masked.match_against
(optional): Choose if theinclude
rule is applied to the whole path or just the filename. Valid choices arepath
orfilename
. Defaults topath
.
encoding
(optional): The encoding to use when reading and writing files. Defaults toUTF-8
. Refer to Python Standard Encodings for a list of supported encodings.delimiter
(optional): The character that is used to delimit columns in character-delimited files (e.g.,
for CSVs or\t
for TDVs). Defaults to,
.fixed_width_extension
(optional): The file extension that fixed-width files have. Not required if no fixed-width files are to be masked. Should not include a leading.
(e.g. specifytxt
not.txt
).fixed_width_columns_indexes
(optional): An array of two-element arrays of start and end indexes of the fixed width columns. Required iffixed_width_extension
is specified.fixed_width_column_names
(optional): An array of string defining the names of the fixed-width columns and used to refer to them in masking rules. Required iffixed_width_extension
is specified, and must match the length offixed_width_columns_indexes
.fixed_width_too_wide_action
(optional): The action to take if masked data exceeds the width of the column, can be eithertruncate
to truncate the value to fit in the column orerror
to raise an error and stop the masking run. Defaults totruncate
.fixed_width_line_ending
(optional): The line ending to use when writing out the fixed-width data. DataMasque will attempt to detect it from the input file; otherwise it defaults to `\n`
.
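The `fixed_width_too_wide_action` behaviour can be illustrated with a small Python sketch (illustrative only, not DataMasque's implementation; the function name is an assumption):

```python
def fit_fixed_width(value: str, width: int, action: str = "truncate") -> str:
    """Fit a masked value into a fixed-width column (illustrative sketch)."""
    if len(value) > width:
        if action == "error":
            # mirrors fixed_width_too_wide_action: error
            raise ValueError(f"value {value!r} exceeds column width {width}")
        value = value[:width]  # fixed_width_too_wide_action: truncate (default)
    return value.ljust(width)  # pad short values with spaces
```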
For more information about:
- How tabular file types are detected, see Tabular File Type Detection.
- Parameters for fixed-width file masking, see Fixed Width File Masking Parameters.
- The ordering of `skip` and `include`, please refer to Include/Skip.

Note: table `join`s are not supported in tabular file masking.
Note: `regex`/`glob` patterns are matched against the path relative to the base directory specified in the source connection, so consider adding `.*` (regex) or `*` (glob) to the beginning of the expression. For example, given the file /path1/path2/target_file.json, a base directory of path1/ and `recurse: true` in the ruleset, the `regex`/`glob` will be matched against path2/target_file.json. When including a path with a `glob` such as target_path/*, the `recurse` option must be set to `true`, otherwise the included path won't be entered and the files it contains will not be masked.
version: "1.0"
tasks:
- type: mask_tabular_file
recurse: true
skip:
- regex: '.*\.json'
- glob: "input/*"
include:
- glob: "other_inputs/*"
rules:
- column: name
masks:
- type: from_fixed
value: REDACTED
Tabular File Type Detection
DataMasque uses file extensions to determine how tabular files are loaded for masking. The extension comparison is not
case-sensitive. Files with extension parquet
are treated as Apache Parquet files. If
a fixed_width_extension
is specified (see Fixed Width File Masking Parameters
below) then the file is loaded as fixed-width. All other file types are loaded as character-delimited. If the
delimiter
parameter is unspecified then it defaults to ,
(i.e. CSV).
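The detection order described above can be sketched in Python (an illustrative sketch only, not DataMasque's implementation; the function name is an assumption):

```python
def loader_for(filename, fixed_width_extension=None, delimiter=","):
    """Pick a loader from the file extension (comparison is case-insensitive)."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext == "parquet":
        return "parquet"  # Apache Parquet files
    if fixed_width_extension and ext == fixed_width_extension.lower():
        return "fixed-width"  # only when fixed_width_extension is configured
    # everything else is character-delimited; delimiter defaults to ","
    return f"delimited:{delimiter}"
```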
Note Delimiters must be a single character, and should be quoted with double quotes. For example:
delimiter: " "
for a space ordelimiter: ";"
for a semicolon. Tabs or other invisible delimiters should use the correct escape sequence with a backslash. For example,delimiter: "\t"
for a tab.
The skip
, include
and delimiter
options should be used together to specify the correct delimiter for each file
type that is to be masked.
Note: CSV files require a header row for tabular masking, as the header columns are used as column names during masking. CSV and fixed-width files are string-based, therefore values should be cast to other types if used with masks that require specific types (e.g.
numeric_bucket
). To do this use atypecast
mask, for more information please refer to Typecast
DataMasque will only attempt to load fixed width files if fixed_width_extension
is specified, and will treat any files
with this extension as fixed-width. See also Fixed Width File Masking Parameters.
Once files are loaded they are all masked in the same way, that is, rules are executed and applied on a per row/column basis regardless of the original source type. Data will be written back out in the same format as it was read.
Fixed Width File Masking Parameters
Masking of fixed-width files is only attempted if fixed_width_extension
is specified. If fixed_width_extension
is
present in the ruleset without fixed_width_columns_indexes
and fixed_width_column_names
, then an error will be
raised. However, it is valid to have fixed_width_columns_indexes
and fixed_width_column_names
missing if
fixed_width_extension
is also absent.
If fixed_width_extension
is set, then DataMasque will treat any files with that extension as fixed-width and load them
based on the other fixed-width options. To assist in explaining the rules, consider an example file called users.txt
.
This is the content:
Adam 2010-01-01 AAA-1111
Brenda 2010-01-01 EEE-5555
Charlie 2010-02-02 GGG-7777
It has 3 columns, the first containing a name, which is from index 0
to 8
. The second column contains a date and
spans from 8
to 19
. The final column contains a transaction ID and spans from index 19
to 27
.
Note these indexes are specified to be contiguous as some fixed-width formats require contiguous columns, therefore a
trailing space is included in the first and second columns. DataMasque automatically strips leading and trailing spaces
when the data is read. Contiguous columns are not required though, so the same result could be achieved with indexes
(0, 7)
, (8, 18)
and (19, 27)
. When non-contiguous columns are specified DataMasque inserts spaces in between
columns.
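The index semantics described above can be checked with a short Python sketch (illustrative only, not DataMasque code):

```python
# One line of users.txt, padded so the columns are contiguous
line = "Adam    2010-01-01 AAA-1111"
indexes = [(0, 8), (8, 19), (19, 27)]
names = ["name", "date", "transaction_id"]

# Slice each (start, end) range and strip the padding spaces,
# as DataMasque does when reading fixed-width data
row = {n: line[s:e].strip() for n, (s, e) in zip(names, indexes)}
```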
Since fixed-width files do not have column headers, the ruleset must also specify these. They can be any arbitrary
valid column identifier (i.e. alphanumeric string without special characters) and are used to identify the columns in
the masking rules. In this case they will be named name
, date
and transaction_id
.
Considering these rules will yield a ruleset like this:
version: "1.0"
tasks:
- type: mask_tabular_file
recurse: true
fixed_width_extension: txt
fixed_width_columns_indexes:
- [0, 8]
- [8, 19]
- [19, 27]
fixed_width_column_names:
- name
- date
- transaction_id
rules:
- column: name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: date
masks:
- type: from_random_date
min: '1950-01-01'
max: '2000-12-31'
- type: typecast
typecast_as: string
date_format: '%Y-%m-%d'
- column: transaction_id
masks:
- type: substitute
Note that when this ruleset is executed, DataMasque will still load any CSVs or Parquet files it encounters, however it
will use the standard loaders instead of applying the fixed width rules. In this case fixed-width rules will only be
used for txt
files.
This can be useful if CSV or Parquet files exist in the source with the same columns and need to be masked in the
same manner. If these files exist, and they shouldn't be masked, then skip
rules should be added to skip them.
Choosing files to mask with include
/skip
When specifying which files to include
or skip
for a mask_file
/mask_tabular_file
task, the order of checking
needs to be considered. The skip
rules are checked first, followed by the include
rules, which means skip
takes
precedence. Warning! If a file matches both the skip
and include
rules, that item will not be included in
the masking task.
Rules are applied with "or" logic, that is, as long as the file matches any rule in skip
or include
, it will
be skipped/included, respectively.
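The precedence and "or" logic can be sketched as follows (illustrative only; note that Python's fnmatch lets `*` cross directory separators, so this is a sketch of the ordering, not of DataMasque's exact glob semantics):

```python
import fnmatch

def is_selected(path, include_globs, skip_globs):
    # skip rules are checked first and take precedence
    if any(fnmatch.fnmatch(path, g) for g in skip_globs):
        return False
    # "or" logic: matching any include rule is enough;
    # with no include rules, all (non-skipped) files are masked
    if not include_globs:
        return True
    return any(fnmatch.fnmatch(path, g) for g in include_globs)
```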
In comparisons, the file's path does not include a leading /
.
By default, include
and skip
rules are applied to the file's path, relative to the base_directory
. By specifying
match_against: filename
on the include
/skip
, the rule will be applied to just the file's name.
Filtering is performed only on files, not on directories. A directory will always be entered into, even if it would be skipped based on filters. If a directory has no files to be masked (they are all skipped, or none are included) then an empty directory will not be created on the target.
glob
matching does not support the **
syntax for multi-level directory matching. If the depth of directories is
unknown, or more advanced path matching is required, regex
filtering should be used.
These rules are explained in more detail below, with some examples and common file filtering scenarios.
include
/skip
examples and common scenarios
In general, glob
is used for simple matching, and regex
should be considered when more advanced rules are
required. Note that some of these examples use a mask_file
task and others use a mask_tabular_file
task, however
there is no difference in the way the file matching rules are executed between these different task types.
Mask all files with a certain extension
The ruleset will include all files that end with .xml
inside the base directory and any subdirectories.
version: "1.0"
tasks:
- type: mask_file
recurse: true
include:
- glob: "*.xml"
The recurse
option will enter every subdirectory. Since the match_against
option has not been specified the default
path
option is used, so the glob is applied against the file's path. However since a path like dir1/dir2/file.xml
and a filename like file.xml
both end with .xml
they both match the specified glob; in this case specifying
match_against: filename
would have no effect.
Mask all files with a specific start and end
This ruleset will mask any file that starts with user
and has a .json
extension. For example, user_1.json
,
users_file.json
, etc. These files can be anywhere inside the source. The match_against
option is set to filename
.
version: "1.0"
tasks:
- type: mask_file
recurse: true
include:
- glob: "user*.json"
match_against: filename
match_against: filename
is specified because the glob will not match paths in subdirectories. For example, the path
dir1/dir2/user_1.json
does not match the glob user*.json
, but the filename on its own (user_1.json
) does.
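This path-versus-filename behaviour can be checked with Python's fnmatch module (a sketch of glob semantics, not DataMasque's code):

```python
import fnmatch
import os

path = "dir1/dir2/user_1.json"

# Matching against the whole path fails: the path starts with "dir1/"
matches_path = fnmatch.fnmatch(path, "user*.json")

# Matching against just the filename succeeds
matches_filename = fnmatch.fnmatch(os.path.basename(path), "user*.json")
```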
Note: When matching against filename, a
glob
with a/
in it is not valid since a filename should never contain a/
. DataMasque does not prevent you from using a/
in a glob when matching against filename, so if there are files you expect to be included but they are not, this could be the reason why.
Mask files of a certain type only in the base directory
To mask CSV files only in the base directory, disable the recurse
option.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- glob: "*.csv"
Without recurse: true
, subdirectories will not be entered into, so only .csv
files in the base directory will be
masked.
Mask files in nested directories of known depth
In this example, there is a directory layout like this:
users1.csv
users2.csv
staff/users1.csv
staff/users2.csv
staff/management/users1.csv
staff/management/users2.csv
staff/contractors/users1.csv
staff/contractors/users2.csv
(Assume there are also some files of other types that are not to be masked).
All users*.csv
files should be masked, except those in the base directory. Since we know the depth of the directories
(staff/
is one level; staff/management/
and staff/contractors/
are two levels) two globs can be added to match
these.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- glob: "staff/*.csv"
- glob: "staff/*/*.csv"
Since the matching is performed on the path, none of the files in the base directory match. Files in the staff/
directory match the first glob and files in staff/management/
or staff/contractors/
match the second. Since
include
use OR logic then as long as at least one include
rule matches a file will be included.
Mask files in all nested directories
If the depth of the directories is unknown, or could be very deep, so that a lot of glob
rules need to be created,
then using regex
to apply regular expressions to the path can be a better option.
Following on from the previous example, CSV files might be nested inside a deep directory tree. The following ruleset
uses a regex to match any .csv
file inside staff/
or any of its child directories.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- regex: "^staff/.*\.csv$"
Explaining the entire regular expression syntax is beyond the scope of these examples, but since this pattern is quite common we will briefly explain how it works:
^
means the expression must match the start of the path, it can't just be found somewhere in the middle.staff/
has no special meaning, so it matches the literalstaff/
part of the path..*
has two parts,.
means match any character and*
means match the preceding expression (the.
) any number of times. So together they mean match any number of any characters.\.
matches a literal.
. The\
escapes the.
so it doesn't behave as matching any character.csv
has no special meaning, so it matches the literalcsv
extension.$
means the expression must also match at the end of the path, so .csv
can't just appear somewhere in the middle.
Hence, any path that starts with staff/
and ends with .csv
will be matched by this regex.
Often you will be able to apply a slightly modified version of this regex, by just changing the start and end to match your particular paths and file extensions.
For more information on the regular expression syntax used by DataMasque, check the Python Regex Howto.
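The behaviour of this pattern can be verified with Python's re module (a quick sketch):

```python
import re

# Matches any path that starts with "staff/" and ends with ".csv"
pattern = re.compile(r"^staff/.*\.csv$")

paths = [
    "staff/sales.csv",             # match
    "staff/management/users.csv",  # match: .* spans subdirectories
    "users1.csv",                  # no match: not under staff/
    "staff/notes.txt",             # no match: wrong extension
]
matched = [p for p in paths if pattern.match(p)]
```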
Mask all files with a given extension, except those with a certain prefix
In this example all .avro
files in the base directory will be masked, except the ones starting with users*
. This
means files like sales_1.avro
, data.avro
, orders-for-2020.avro
will be masked, but users-list.avro
and
users-exported.avro
would not be. It uses an include
rules to include all .avro
files and a skip
rule to skip
any starting with users
.
version: "1.0"
tasks:
- type: mask_file
include:
- glob: "*.avro"
skip:
- glob: "users*.avro"
match_against: filename
The file users-exported.avro
matches both *.avro
and users*.avro
, but since skip
takes precedence the file will
not be included. The file data.avro
only matches *.avro
so it will be included.
Mask files in all nested directories, excluding those with a given prefix
This example will include all .csv
files in the staff/
directory, except those that start with data
. For example,
it matches staff/sales.csv
and staff/management/users.csv
but not staff/data01.csv
or
staff/management/data-list.csv
.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- regex: "^staff/.*\.csv$"
skip:
- glob: "data*.csv"
match_against: filename
This ruleset shows that glob
and regex
can be combined to produce flexible and simple matching rules.
Mask files that match a particular pattern
This example includes XML files that start with users_
, then have at least one digit (the \d+
pattern). But since
it uses match_against: filename
it will not match against the path, therefore files inside a directory starting with
users_
won't match.
version: "1.0"
tasks:
- type: mask_tabular_file
include:
- regex: "^users_\d+.xml$"
match_against: filename
Some example results are:
users_5.xml
: matchusers_10.xml
: matchusers_list.xml
: no match (the filename doesn't match the regex)users_5/file.xml
: no match (the filename file.xml does not match the regex)users_5/users_10.xml
: match (the filenameusers_10.xml
matches the regex)
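The example results above can be reproduced with a short Python sketch (the literal dot in the extension is escaped as `\.`):

```python
import os
import re

# match_against: filename, so compare against the basename only
pattern = re.compile(r"^users_\d+\.xml$")

paths = ["users_5.xml", "users_10.xml", "users_list.xml",
         "users_5/file.xml", "users_5/users_10.xml"]
results = {p: bool(pattern.match(os.path.basename(p))) for p in paths}
```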
Masking Tables
Each mask_table
task specifies the masking rules to apply to a
database table, as well as any required joins and any conditionals needed to define which rows
should be masked. Masking rules and masks are applied sequentially in the order
they are listed. When multiple masks are combined in sequence, the output value
from each mask is passed as the input to the next mask in the sequence.
Note: While
mask_table
is suitable for most generic masking requirements, it is not capable of masking unique keys or primary keys. Masking of such values requires the use of the special-purposemask_unique_key
task.
Selecting data to mask
DataMasque provides some advanced features for selecting additional data from the database for use in your masking rules.
Joining tables
When masking a table, you can specify a list of joins (or dictionary mapping keys to joins) that will join the rows of a target table to rows from one or more additional tables, providing you with the additional joined values to use in your masking rules.
Parameters
target_table
(required): The name of the new table you wish to join into the masking data. The target table can be prefixed with a schema name to reference a table in another schema.target_key
(required): The key ontarget_table
to use when performing the join. This can be specified as a single column name or a list of column names.source_table
(required): The name of the table you wish join thetarget_table
with. This could be the table being masked, or another table earlier in the list ofjoins
(allowing you to perform multi-step joins). The source table can be prefixed with a schema name to reference a table in another schema.source_key
(required): The key onsource_table
to use when performing the join. This can be specified as a single column name or a list of column names.
Example
In this example, we have two tables: Customers
and Membership
. We would like to mask the first_name
and
last_name
columns of the Customers
table, but only if the user's membership status is Active.
The `Customers` table contains data on customers, including their name and membership ID. The `Membership` table contains the status of each customer's membership: either Active or Inactive. The `membership_id` column of the `Customers` table has a foreign key relation with the `id` column of the `Membership` table.
Customers
Table
customer_id | first_name | last_name | membership_id |
---|---|---|---|
1 | Anastasia | Rose | 10001 |
2 | Bill | Jones | 10002 |
3 | Chris | Yang | 10003 |
4 | Judith | Taylor | 10004 |
5 | Gordon | Smith | 10005 |
Membership
Table
id | membership_status |
---|---|
10001 | Active |
10002 | Active |
10003 | Inactive |
10004 | Active |
10005 | Inactive |
In order to access the membership_status
column of the Membership
table, we need to define a join in our
ruleset from the Customers
table to the Membership
table.
version: "1.0"
tasks:
- type: mask_table
table: Customers
key: customer_id
joins:
- target_table: Membership
target_key: id
source_table: Customers
source_key: membership_id
rules:
- if:
- column: '"Membership".membership_status'
equals: Active
rules:
- column: first_name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: last_name
masks:
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
After performing the join, this will allow us to reference the membership_status
column of the Membership
table in our ruleset. In this example, we can reference the column with Membership.membership_status
. Using
this column, we can use Conditional Masking to only mask the rows of Customers
where the
status of the membership is 'Active'.
Note: To reference a column in a joined table, the name of the joined table must be added as a prefix to the column name.
The example below utilises the from_file
mask type detailed here to select a random
first name from the DataMasque_firstNames_mixed.csv
and a random last name from the
DataMasque_lastNames.csv
files that can be found on our Supplementary Files user guide. It will
first check if the membership_status
for the customer is 'Active', and if so, masks the two name columns; otherwise,
these columns are left unmasked.
version: "1.0"
tasks:
- type: mask_table
table: Customers
key: customer_id
joins:
- target_table: Membership
target_key: id
source_table: Customers
source_key: membership_id
rules:
- if:
- column: '"Membership".membership_status'
equals: Active
rules:
- column: first_name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: last_name
masks:
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
This example will produce the following results in the Customers
table. The customers with customer_id
3 and 5
are not masked, as the status of their membership is 'Inactive' in the joined Membership
table.
customer_id | first_name | last_name | membership_id |
---|---|---|---|
1 | Tia | Pallin | 10001 |
2 | Nikau | Koller | 10002 |
3 | Chris | Yang | 10003 |
4 | Anika | Thom | 10004 |
5 | Gordon | Smith | 10005 |
Note:
For Microsoft SQL Server (Linked Server),
joins
are not currently supported.For Microsoft SQL Server, when using a temporary table, the name of the temporary table must be wrapped in quotation marks, as the
#
symbol in the YAML editor denotes the beginning of a comment (e.g.target_table: '##my_temporary_table'
or'##my_temporary_table.column'
).To reference a temporary table column (e.g. the
table_filter_column
parameter of thefrom_file
mask type or as a part ofhash_columns
) you must prefix the column name with its table name (e.g.table.column
).Any column name specified without a table prefix is assumed to belong to the table being masked (as specified by the
table
parameter for the task). You cannot specify tables that belong to other schemas.
Conditional masking
You may wish to only apply masks to rows or values that meet some conditions. DataMasque has three different methods for conditionally applying masks to meet different use cases:
Use case | Mechanism |
---|---|
I want to restrict which rows are fetched for masking from the database table. | Where |
I want to apply certain masking rules to only a subset of rows. | If |
I want to skip applying masks to certain column values. | Skip |
Warning: Use of the conditional masking features 'where', 'skip', or 'if/else', may mean your masking rules are not applied to some database rows or values. It is recommended to verify the resulting output satisfies your masking requirements.
Where - restricting database fetches
To restrict which rows are fetched for masking from a database table,
you can specify a where
clause for a mask_table task:
version: "1.0"
tasks:
- type: mask_table
table: users
key: id
where: >-
"users"."role" <> 'administrator'
rules:
...
The where
clause can refer to any columns in the masked table. All columns must be
referenced using their table-qualified name (e.g. `Users.FirstName`). Ensure
you use appropriate quoting where required, for example if the identifier uses
a reserved word, starts with an illegal character, or is a case-sensitive
identifier.
Important!
- Any rows excluded by the
where
clause will not be masked.- The SQL you provide for the
where
clause will not be validated before execution, please take care when constructing your SQL.- The SQL you provide for the
where
clause should not end in a semicolon, as this will cause a masking error.- Any string values in the where clause must be quoted in single quotation marks.
- Joined tables cannot be referenced in the
where
clause currently.
Note for Amazon Redshift:
- Use of
where
clause for Amazon Redshift is not yet supported in DataMasque. This is in our roadmap and will be included in future releases.
If - conditional rules
You can choose to apply certain masking rules to only a subset of rows
within a table, while still allowing other masks to be applied to those
rows. This can be achieved through the use of if
-conditions in rules
lists.
DynamoDB note: Masking Amazon DynamoDB uses a
mask_table
task, so standardmask_table
condition parameters apply.
Example
In the following example, the last_name
of all users will be replaced with
'Smith'
, but the user's gender
will determine the mask applied to their
first_name
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: id
rules:
- column: last_name
masks:
- type: from_fixed
value: 'Smith'
- if:
- column: gender
equals: 'female'
rules:
- column: first_name
masks:
- type: from_fixed
value: 'Alice'
else_rules:
- if:
- column: gender
equals: 'male'
rules:
- column: first_name
masks:
- type: from_fixed
value: 'Bob'
else_rules:
- column: first_name
masks:
- type: from_fixed
value: 'Chris'
This example applies the same conditions in a mask_tabular_file
task.
version: "1.0"
name: basic_file_mask
tasks:
- type: mask_tabular_file
include:
- glob: "*.csv"
rules:
- if:
- column: gender
equals: 'female'
rules:
- column: first_name
masks:
- type: from_fixed
value: 'Alice'
else_rules:
- if:
- column: gender
equals: 'male'
rules:
- column: first_name
masks:
- type: from_fixed
value: 'Bob'
else_rules:
- column: first_name
masks:
- type: from_fixed
value: 'Chris'
Condition Parameters
if
(required): A list of conditions (see below) that must all evaluate astrue
for the nested list of rules to be applied to a row.rules
(required): A nested list of masking rules/nested-if
-conditions (or dictionary mapping labels to rules) that will only be applied to rows that meet the conditions defined underif
.else_rules
(optional): A nested list of masking rules/nested-if
-conditions (or dictionary mapping labels to rules) that will only be applied to rows that do NOT meet the conditions defined underif
.
A condition under if
can contain the following attributes.
There are 3 different parameters that can be used to select which data values the condition is applied to:
column
(required): The database column to check this condition against. The column name can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.json_path
(required withinmask_file
tasks): For JSON records, ajson_path
must be provided to extract the value to which the condition will be applied. Orxpath
(required withinmask_file
tasks): For XML records, anxpath
must be provided to extract the value to which the condition will be applied. The following conditions can be applied to the selected data values:equals
(optional): If specified, the condition will only evaluate astrue
if the value exactly equals the specified value. Data types are also checked (i.e.100
is not equal to"100"
). Data extracted from XML is treated as strings and the comparison value in the ruleset must be quoted.matches
(optional): If specified, the condition will only evaluate astrue
if the string of the value matches the specified regular expression. For more details on how to use regular expressions, see Common regular expression patterns.less_than
(optional): If specified, the condition will only evaluate astrue
if the value is a number or date/datetime and is less than the given value. If the specified value is numeric, the value from the data will be treated as a number.less_than_or_equal
(optional): If specified, the condition will only evaluate astrue
if the value is a number or date/datetime and is less than or equal to the given value. If the specified value is numeric, the value from the data will be treated as a number.greater_than
(optional): If specified, the condition will only evaluate astrue
if the value is a number or date/datetime and is greater than the given value. If the specified value is numeric, the value from the data will be treated as a number.greater_than_or_equal
(optional): If specified, the condition will only evaluate astrue
if the value is a number or date/datetime and is greater than or equal to the given value. If the specified value is numeric, the value from the data will be treated as a number.age_greater_than
(optional): If specified, the condition will only evaluate astrue
if the value is a date/datetime and the difference in years is greater than this value. If the value is not a date/datetime usedate_format
to try to parse it as a date/datetime. Ifdate_format
is not provided the date format%Y-%m-%d
will be used as a fallback. Can be combined withage_less_than
to enforce an age within a range.age_less_than
(optional): If specified, the condition will only evaluate astrue
if the value is a date/datetime and the difference in years is less than this value. If the value is not a date/datetime usedate_format
to try to parse it as a date/datetime. Ifdate_format
is not provided the date format%Y-%m-%d
will be used as a fallback. Can be combined withage_greater_than
to enforce an age within a range.disable_auto_date_parsing
(optional): By default, strings in rulesets that appear to be dates are attempted to be parsed into dates/datetimes. This can cause comparisons to unexpectedly fail due to type mismatches. Setdisable_auto_date_parsing
totrue
to disable this automatic parsing. This setting is overridden when specifying thedate_format
argument.
When performing conditions on values containing dates/datetimes the following parameter can be specified:
date_format
(optional): The format of the date/datetime stored. If specified, the condition will attempt to load the column value or the value at thejson_path
/xpath
as a datetime with the specified format. If the value cannot be parsed with thedate_format
an error will be raised.
Notes:
- If the comparison is any of
less_than
,less_than_or_equal
,greater_than
, orgreater_than_or_equal
:
now
orcurrent_date_time
can be specified as the value to use the current datetime at the time of masking.- Use a similar format to the date/datetime stored e.g. if the datetime contains hours:mins:seconds+time_zone, specify those in the value to compare against ("2012-12-12 10:30:00+00:00").
- If the date/datetime contains timezone information it will be compared to the specified date with UTC timezone. This also applies when using
now
orcurrent_date_time
.
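As a rough illustration of how an age comparison evaluates (illustrative only; the function name and the fixed "current" date are assumptions, and DataMasque's actual calculation may differ):

```python
from datetime import datetime

def age_greater_than(value, threshold, date_format="%Y-%m-%d"):
    # Parse the stored value using date_format (the documented
    # fallback format is "%Y-%m-%d")
    born = datetime.strptime(value, date_format)
    now = datetime(2024, 6, 1)  # fixed date so the sketch is reproducible
    # Whole years elapsed, accounting for whether the anniversary has passed
    years = now.year - born.year - ((now.month, now.day) < (born.month, born.day))
    return years > threshold
```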
An additional option can be specified for conditions within mask_file
tasks:
on_missing
(optional): Determines how to handle records where no value exists at a condition'sjson_path
orxpath
, or where the record is not valid JSON/XML and a condition specifies ajson_path
/xpath
:error
(default): Raise an error in the run log.skip
: Skip this rule for the current record/file.apply_if_rules
: Apply the rules defined inrules
that are normally applied when the condition is True.apply_else_rules
: Apply the rules defined inelse_rules
that are normally applied when the condition is False. Only permitted whenelse_rules
are provided.
Notes:
- When comparing values in an XML document,
equals
conditions comparing to non-string values will always be false since they are stored as a string in the XML document. Please make sureequals
conditions match exactly to what is contained in the document.- When a database column specified in an
if
does not exist, the masking run will always raise an error.on_missing
behaviour only applies when no value exists at a specified json_path
orxpath
or where the column/record contains invalid JSON/XML when ajson_path
/xpath
is specified.
Conditions can also be grouped with the logical operators or
, not
,
and and
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: user_id
rules:
- if:
- and:
- not:
- column: username
matches: 'customer_\w+'
- or:
- column: admin
equals: true
- column: role
equals: "admin"
rules:
- column: username
masks:
- type: from_fixed
value: "Bob"
Note: When using an
if
conditional in rulesets, final row counts will reflect the number of rows processed rather than the number of rows masked. This is due to the rows being filtered on the application side and so all rows fetched will be processed and added to the row count. Alternativelywhere
conditionals can be used in the ruleset which will provide an accurate row count of masked rows.
Skip - not masking specific values
A common use-case is to not apply masks to certain values, e.g. to leave
NULL
values or empty strings unchanged. You can choose to not mask
certain values in a column by specifying a number of values to skip
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: user_id
rules:
- column: username
skip:
- null
- ""
- matches: 'admin_\w+'
masks:
- type: from_fixed
value: "Bob"
Any column values that are exactly equal to any of the
string/numeric/null values in the skip
list will not be masked (data
types are also checked, i.e. 100
is not equal to "100"
).
Additionally, string column values matching a regular expression can be
skipped by specifying the skip value as matches: "my_regex"
.
For more details on how to use regular expressions,
see Common regular expression patterns.
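The skip semantics (exact, type-checked equality, plus optional regex matching) can be sketched as follows (illustrative only, not DataMasque code; whether matching is full or partial is an assumption here):

```python
import re

def should_skip(value, skip_list):
    for entry in skip_list:
        if isinstance(entry, dict) and "matches" in entry:
            # skip string values matching the regular expression
            if isinstance(value, str) and re.fullmatch(entry["matches"], value):
                return True
        elif value is entry or (type(value) is type(entry) and value == entry):
            # exact equality with type checking: 100 is not equal to "100"
            return True
    return False

# the skip list from the ruleset above: null, empty string, admin_ regex
skip_list = [None, "", {"matches": r"admin_\w+"}]
```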
Conditional masking for mask_file
tasks
Conditional masking for object files (as part of a mask_file task) differs from database masking. For
databases, conditions are evaluated per row, and affect the values (columns) in only that row.
For object files, the condition is applied per file and affects values for that file.
Multi-record object files (NDJSON or Avro) are more like databases in that the conditions are applied per record in the file and affect values for that record.
Masking tabular files (with mask_tabular_file
) applies rules per row in the file, again,
similar to masking a database table.
For the full list of parameters please refer to Condition Parameters
Notes:
- where is not supported for conditional masking for files, as where is used to restrict what is returned in database queries, and therefore it does not apply to files.
- The implementation of skip for files is different from databases. It is used in conjunction with include to choose which files to mask or not. Please refer to mask_file tasks and mask_tabular_file tasks.
- For tabular file masking, refer to the general If section.
Here are the use cases for each type of condition:
Use case | Mechanism |
---|---|
I want to apply certain masking rules to the values contained in the files. | If |
I want to skip certain files based on the file name. | Please refer to the skip option for mask_file tasks |
Below are some example rulesets for conditional file masking.
If - conditional rules (files)
Example
In the following example, the last_name
of all users will be replaced with
'Smith'
, but the user's gender
will determine the mask applied to their first_name
:
version: "1.0"
name: json_with_conditional
tasks:
- type: mask_file
rules:
- masks:
- type: json
transforms:
- path: ["last_name"]
on_missing: error
masks:
- type: from_fixed
value: "Stevens"
- if:
- json_path: ["gender"]
equals: "female"
on_missing: apply_if_rules
rules:
- masks:
- type: json
transforms:
- path: ["first_name"]
on_missing: error
masks:
- type: from_fixed
value: "Alice"
else_rules:
- if:
- json_path: ["gender"]
equals: "male"
on_missing: skip
rules:
- masks:
- type: json
transforms:
- path: ["first_name"]
on_missing: error
masks:
- type: from_fixed
value: "Bob"
else_rules:
- masks:
- type: json
transforms:
- path: ["first_name"]
on_missing: error
masks:
- type: from_fixed
value: "Jamie"
Similarly, with XML files
version: "1.0"
name: xml_with_conditional
tasks:
- type: mask_file
rules:
- masks:
- type: xml
transforms:
- path: "//Root/Employee/LastName"
on_missing: error
node_transforms:
- type: text
masks:
- type: from_fixed
value: Stevens
- if:
- xpath: "//Root/Employee/Gender/text()"
equals: "female"
on_missing: apply_if_rules
rules:
- masks:
- type: xml
transforms:
- path: "//Root/Employee/FirstName"
on_missing: error
node_transforms:
- type: text
masks:
- type: from_fixed
value: Alice
else_rules:
- if:
- xpath: "//Root/Employee/Gender/text()"
equals: "male"
on_missing: skip
rules:
- masks:
- type: xml
transforms:
- path: "//Root/Employee/FirstName"
on_missing: error
node_transforms:
- type: text
masks:
- type: from_fixed
value: Bob
else_rules:
- masks:
- type: xml
transforms:
- path: "//Root/Employee/FirstName"
on_missing: error
node_transforms:
- type: text
masks:
- type: from_fixed
value: Jamie
An example for tabular files can be found in the If section.
Mask types
Masks are the basic 'building-block' algorithms provided by DataMasque for generating and manipulating column values. Multiple masks can be combined in a list to create a pipeline of transformations on the data, or combined using combinator masks to build up more complex output values.
Parameters
Mask algorithms are defined by their type
parameter - this parameter is common
to (and required by) all masks:
type
(required) determines the type of mask, and therefore what other parameters can be specified.
Note: Masks operate by either manipulating the original column value, or by generating an entirely new value that replaces the original value. The latter can be referred to as a 'source' mask, as the mask is a source of new values. Such 'source' masks are indicated by the
from_
prefix on the masktype
.
Available mask types
Generic masks
Fixed value (from_fixed
)
A simple mask that replaces all column values with the same fixed value.
Parameters
value
(required): The value to replace all column values with. Can be any data type, but should match that of the column being masked. This value can be enclosed in quotation marks, which will convert the value to a string, or entered without quotation marks.
Example
This example will replace all values in the name
column of the employees table with 'Alex'.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: ID
rules:
- column: name
masks:
- type: from_fixed
value: 'Alex'
DataMasque also supports blank values for the from_fixed
mask type. To specify a blank entry, you can
enter the value as "", '', or leave the field blank if there are no spaces.
version: '1.0'
tasks:
- type: mask_table
table: customers
key: id
rules:
- column: name
masks:
- type: from_fixed
value: ''
From column (from_column
)
A simple mask to replace a column's value by copying the value from another database column (from the same table, or a joined table).
Parameters
source_column
(required): The name of the column to copy values from. The source column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.
Note:
from_column is not supported for mask_file tasks, but is supported for mask_tabular_file tasks.
Example
This example will replace the values of the name
column in the employees
table with values
from the first_name
column of the users
table, joined on id
.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
joins:
- target_table: users
target_key: id
source_table: employees
source_key: user_id
rules:
- column: name
masks:
- type: from_column
source_column: users.first_name
Show result
Given the users
table:
id | first_name |
---|---|
24 | Kyle |
25 | Reid |
26 | Helen |
27 | Callum |
28 | Raymond |
29 | Vivien |
The following transformation will take place on the employees
table.
From file (from_file
)
A mask to replace column values by selecting from a column (seed_column
) in a CSV formatted seed file (seed_file
).
The seed file must contain one or more columns with named column headers. See the Files documentation for
more information on uploading seed files.
If there are any blank values in the file, they will not be considered when randomly selecting a replacement value.
If you wish to have blank values randomly selected from your seed file, ensure they are enclosed with quotation marks. For example, ""
.
Potential replacement values from the seed file rows may be pre-filtered using the table_filter_column
and
seed_filter_column
parameters. This allows you to subset the available values for each row by matching a column value
from the table to a column value in the seed file. If filtering does not match any rows in the seed file,
fallback_masks
can be provided to generate the replacement value instead.
After filtering has been applied (optional), replacement values from the seed file are selected randomly by default.
To enable deterministic selection of replacement values, you may specify one or more hash_columns
for the masking rule.
See Deterministic masking for more details.
Parameters
seed_file
(required): The name of a user-provided CSV seed file to select values from (see Files guide).
seed_column
(optional): The name of the column in the seed file that will provide replacement values. This property only needs to be specified for CSV files with multiple columns.
table_filter_column and seed_filter_column
(optional): If both attributes are provided, the replacement value will only be selected from CSV rows where the value for the seed_filter_column in the CSV row is equal to the value for the table_filter_column column in the database row being masked.
fallback_masks
(optional): If there are no rows in the CSV that match the table_filter_column/seed_filter_column condition, then this nested list of masks will be applied to the column value instead. If no fallback_masks are provided and no match is found, then the column value will be replaced with a NULL value.
null_string
(optional): If set, any values in the seed file column that match this value will be treated as NULL (instead of a string) when written to the database.
Example
This example masks the street_name
column with a random street name from a seed file
(DataMasque_address_example.csv) while ensuring that the randomly selected street
is from the same city as the original. A fallback value of “Main Street” is used in the event that the database
row contains a city that does not exist in the CSV.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: street_name
masks:
- type: from_file
seed_file: DataMasque_address_example.csv
seed_column: street_name
table_filter_column: city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Show result
Given the following database table and using the DataMasque_address_example.csv seed file:
street_number | street_name | city |
---|---|---|
44 | Queen Street | Auckland |
57 | Oxford Street | London |
12 | Houston Street | New York |
5 | Fake Street | Fake City |
This mask could produce the following masked output:
street_number | street_name | city |
---|---|---|
44 | High Street | Auckland |
57 | Brick Lane | London |
12 | Maiden Lane | New York |
5 | Main Street | Fake City |
From format string (from_format_string
)
A simple mask to generate values of a defined format. This is useful when all values of a column have the same format. DataMasque will generate values according to a format specified using the format string syntax.
Parameters
value_format
(required): The format of the data to be generated, which must meet the requirements of the format string syntax.
version: '1.0'
tasks:
- type: mask_table
table: drivers
key: id
rules:
- column: licence_plate
masks:
- type: from_format_string
value_format: "{[A-Z],3}{[0-9],3}"
Note
- In addition to what is possible with the formats for a mask_unique_key task, there is more you can do with format strings in a mask_table task. For example, you can easily select one value from a small number of alternatives using the syntax for alternatives: {(EN|FR)}
- If you need to guarantee that generated values are unique within a column, consider using the
from_unique
mask type.
From unique (from_unique
)
A simple mask that generates string values that are guaranteed to be unique within the target column.
Note: To automatically cascade primary and unique key values to
foreign keys or mask composite keys, consider using
mask_unique_key
instead.
Parameters
format
(required): The format that will be used to generate values. See format string syntax for details on how to construct a format string.
version: '1.0'
tasks:
- type: mask_table
table: drivers
key: id
rules:
- column: licence_plate
masks:
- type: from_unique
format: "{[A-Z],3}{[0-9],3}"
Note
- All values produced by from_unique will be strings. To convert values to other data types, you can chain your from_unique mask with a typecast mask.
- Unlike from_format_string, from_unique ensures generated values are unique. This is achieved using the same underlying unique value generation procedure as mask_unique_key.
- The specified format must allow for a sufficient number of unique values to cover the full number of rows in the target table or file (rounded up to the nearest multiple of the batch_size for mask_table tasks), otherwise an error will be reported when executing the run.
- Using the from_unique mask type differs from the mask_unique_key task type in the following ways:
  - from_unique cannot perform any cascading of values to related columns, such as foreign keys.
  - from_unique cannot be used to guarantee joint uniqueness across the columns in a composite key, unless guaranteeing uniqueness within a single column within the composite key is sufficient for the target use case.
  - from_unique can be used to update unique columns at the same time as other columns in a mask_table task, without the need for a separate database update performed by a mask_unique_key task.
  - from_unique can be used to generate unique values in mask_tabular_file tasks or in mask_table tasks for databases that do not support mask_unique_key (such as DynamoDB).
- Using from_unique in a mask_table task where worker_count > 1 is only supported for Oracle and Amazon DynamoDB connections.
- from_unique cannot be used in the rules of a mask_file task.
- Like other mask types, from_unique cannot be used to update the key columns of a mask_table task (except for databases that allow mask_table to update key columns, such as DynamoDB).
- All from_unique masks within a given task will be produced from the same sequence of unique values, such that two from_unique masks with identical format will produce the same value for the same row. This can be useful to ensure values produced by identical from_unique masks in different if branches are jointly unique.
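As a worked example of the capacity requirement above, the format "{[A-Z],3}{[0-9],3}" can produce 26^3 x 10^3 distinct values. This sketch (with assumed row and batch counts) illustrates the kind of check implied:

```python
import math

# Capacity of the format "{[A-Z],3}{[0-9],3}": three uppercase letters
# followed by three digits.
capacity = 26 ** 3 * 10 ** 3
print(capacity)  # 17576000

# Hypothetical table: 1,250,000 rows masked in batches of 50,000.
row_count = 1_250_000
batch_size = 50_000
required = math.ceil(row_count / batch_size) * batch_size
print(capacity >= required)  # True: the format provides enough unique values
```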
Typecast (typecast
)
A simple mask to convert a value from one data type to another. This is particularly useful when a value is stored as one data type in a database, but needs to be transformed to a different data type to apply certain mask types within DataMasque.
If the typecast cannot be performed (e.g. because the value of the column is incompatible with the target data type), an error will be raised.
Parameters
typecast_as
(required): The data type you wish to convert the column value to. Must be one of: date, datetime, string, integer, float, or boolean.
date_format
(optional): When converting a string to a datetime, this format string can be provided to specify the expected format of the string. Conversely, when converting a datetime to a string, this format string can be provided to determine the format of the resulting string. Format strings must be valid Python date/time format strings.
Example
This example will replace every start_date
in the employees
table with a fixed date value of 2010-01-01
. The
fixed date value is specified as a string and typecast as a datetime before being written to the database.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: start_date
masks:
- type: from_fixed
value: '2010-01-01'
- type: typecast
typecast_as: datetime
date_format: '%Y-%m-%d'
Do nothing (do_nothing
)
A simple mask which does not mask the data; instead, it retains the original value. This is useful when there are certain items or elements of columns that are not intended to be masked.
For example, you can use it to retain certain known XML attributes and mask all the rest. For more information please refer to XML Documentation.
Manipulating date values as strings
If you wish to manipulate non-string values using their string representations, you can use the typecast
mask.
Using the pattern demonstrated below, you may convert the value to a string, apply some manipulations, and then convert
back to the original data type. For example, to truncate the day and month from a date while retaining the
year, you may do the following:
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
# Convert to string
- type: typecast
typecast_as: string
date_format: '%Y-%m-%d'
# Take the 'YYYY-' component and append a fixed value of
# '01-01' to create the value 'YYYY-01-01'
- type: concat
masks:
- type: replace_substring
start_index: 5
masks:
- type: from_fixed
value: '01-01'
# Convert back to original datatype
- type: typecast
typecast_as: datetime
date_format: '%Y-%m-%d'
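The same typecast-string-typecast pipeline can be sketched with plain Python date handling (illustration only; the string slicing mirrors the replace_substring step with start_index: 5):

```python
from datetime import datetime

original = datetime(1985, 7, 23)
as_string = original.strftime("%Y-%m-%d")       # typecast to string: "1985-07-23"
masked = as_string[:5] + "01-01"                # replace from index 5: "1985-01-01"
result = datetime.strptime(masked, "%Y-%m-%d")  # typecast back to datetime
print(result.date())  # 1985-01-01
```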
From choices (from_choices
)
A mask to replace column values by selecting from a list of choices or a dictionary of choices with weights specified.
Parameters
choices
(required): Can either be a list of strings, or a list of dictionaries. If a list of dictionaries, each dictionary has a single key (the choice) whose value is the weight.
Weights must be positive, and can be a mix of float and int. Weights do not need to sum to a particular value; they are relative.
If there are no weights assigned to any of the choices, then all choices will have equal weights.
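Relative weights behave like standard weighted random sampling; a minimal Python sketch (not DataMasque's implementation):

```python
import random

choices = {"accounting": 10, "sales": 15, "research and development": 25}

# Weights are relative: 25 out of a total of 50 means "research and
# development" is selected roughly half the time.
picked = random.choices(list(choices), weights=list(choices.values()), k=1000)
print(picked.count("research and development") / 1000)  # roughly 0.5
```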
Examples
This example masks the department
column with a randomly selected department from the list of choices.
version: "1.0"
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: department
masks:
- type: from_choices
choices:
- "accounting"
- "sales"
- "research and development"
This example masks the department
column with a randomly selected department from the list of choices based on the weights provided.
version: "1.0"
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: department
masks:
- type: from_choices
choices:
- "accounting": 10
- "sales": 15
- "research and development": 25
Combinator masks
Concatenate (concat
)
A simple mask to concatenate the outputs of multiple masks together into a single string.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) which will be evaluated and have their outputs concatenated into a single value. The original column value is provided as the input to each mask.
glue
(optional): If provided, this string will be inserted between the output of each concatenated mask. Defaults to an empty string. Useful for separating values with spaces or commas.
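The glue behaviour is equivalent to joining each mask's output with the glue string, as in this small Python sketch:

```python
# Outputs of two hypothetical masks, concatenated with glue " ":
parts = ["Mary", "Smith"]
print(" ".join(parts))  # Mary Smith
```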
Example
This example generates a full name by concatenating a user’s name
with
the fixed value ‘Smith’.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: concat
glue: " "
masks:
- type: from_column
source_column: name
- type: from_fixed
value: "Smith"
Chain (chain
)
A simple mask to chain other masks together in series. This mask is only useful in combination with concat
,
in the case where multiple masking operations need to be performed on one part of a concatenated mask.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) that will be applied in sequence to the input value.
Example
This example selects a random name from the DataMasque_firstNames_mixed.csv file, transforms it to uppercase, and then concatenates ‘Smith’ onto it to generate a random full name.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: concat
glue: ' '
masks:
- type: chain
masks:
- type: from_file
seed_column: firstname-mixed
seed_file: DataMasque_firstNames_mixed.csv
- type: transform_case
transform: uppercase
- type: from_fixed
value: 'Smith'
String masks
Substitute (substitute
)
Replace each character in a string with another random character from its same set. The character sets are:
- Uppercase letters (
A-Z
). - Lowercase letters (
a-z
). - Digits (
0-9
).
Characters not in these sets (such as punctuation and symbols) are not replaced.
This mask is designed to be easy to drop in place to mask values that must have a specific format, but whose value is not important. For example, it could be used to mask:
- Phone numbers (e.g.
+1 (555) 867-5309
to+2 (938) 123-8372
) - License plates (e.g.
BZF123
toLMA191
) - Bank accounts (e.g.
10-9282-9478563-00
to23-1840-6492817-01
) - Passport numbers (e.g.
FD194845
toCZ858584
)
and so on.
substitute
is a good, simple and safe default for many data types. However, it is not intended to generate perfect
replacements for columns that must have special rules. For example, if a value must always start with the letter C
,
followed by 6 random numbers and letters, then substitute
is not suitable as the C
might be replaced with another
letter.
The uppercase
, lowercase
and digits
arguments can be used to disable the replacement of each of these character
sets. No errors are raised if a character set is enabled but those characters are not in the string, for example, it's
safe to try to replace letters in a phone number field.
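A Python sketch of this per-character behaviour (illustration only, not DataMasque's implementation):

```python
import random
import string

def substitute(value, force_change=False):
    """Replace each letter or digit with a random character from its own
    set; punctuation and symbols pass through unchanged."""
    out = []
    for ch in value:
        for charset in (string.ascii_uppercase, string.ascii_lowercase,
                        string.digits):
            if ch in charset:
                pool = charset.replace(ch, "") if force_change else charset
                out.append(random.choice(pool))
                break
        else:
            out.append(ch)  # not in any character set: keep as-is
    return "".join(out)

print(substitute("+1 (555) 867-5309"))  # e.g. "+2 (938) 123-8372"
```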
Parameters
force_change
(optional): Since characters are chosen randomly, it is possible that a character might be randomly replaced with the same one (for example, A is chosen as a replacement for A). Set force_change to true to make sure the replacement character differs. Defaults to false. Note that this makes the output slightly less random, as the number of possible replacements is reduced by one.
uppercase
(optional): A boolean to enable or disable the replacement of uppercase characters. Defaults to true (uppercase characters will be replaced).
lowercase
(optional): A boolean to enable or disable the replacement of lowercase characters. Defaults to true (lowercase characters will be replaced).
digits
(optional): A boolean to enable or disable the replacement of digits. Defaults to true (digits will be replaced).
Example
This example will apply substitute
masks to the phone
, license_plate
and validation_code
.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: phone
masks:
- type: substitute
- column: license_plate
masks:
- type: substitute
- column: validation_code
masks:
- type: substitute
Random text (from_random_text
)
This mask replaces the column's value with randomly generated a-z characters.
Parameters
max
(required): The generated character string will be this length at maximum. The maximum length must be between 1 and 100.
min
(optional): The generated character string will be this length at minimum. If no value is supplied here, the generated string's length will always be equal to the max value.
case
(optional): The case (upper or lower) of the text generated. Mixed case will be generated if this field is left blank. Must be one of: upper, lower.
Example
This example replaces the values in the name
column with a random string of
lower case characters between 5 and 10 characters in length.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: from_random_text
min: 5
max: 10
case: lower
Transform case (transform_case
)
A simple mask to perform a transformation to the case/capitalisation of a string.
Parameters
transform
(required): The transformation to apply. Must be one of:uppercase
,lowercase
,capitalize_words
(capitalizes first letter of each word),capitalize_string
(capitalizes first letter only).
Example
This example will convert all values in the name
column into uppercase.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: transform_case
transform: uppercase
Substring (take_substring
)
A simple mask to select a substring from a column value. You may wish to use this to select or remove a subset of characters from the beginning, end, or middle of a string.
Parameters
start_index
(optional): The index of the first character to include in the selected substring, with 0 being the index of the first character in the string. Defaults to 0.
end_index
(optional): The index of the character immediately AFTER the selected substring (i.e. the end_index is exclusive). If omitted, the selection will continue until the end of the string.
Positive and negative indices can be used, i.e. the first character in a string
is at index 0
, the second character is at index 1
, the last character is at
index -1
, and the second-to-last character is at index -2
.
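These indices behave like Python string slicing (shown here purely to illustrate the index semantics, with end_index exclusive):

```python
name = "Jennifer"
print(name[0:3])   # "Jen": start_index 0, end_index 3 (exclusive)
print(name[-3:])   # "fer": start_index -3, no end_index
```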
Example
This example will return only the first 3 characters of each value in the name
column. The final result
will return the characters at positions 0, 1, and 2. This is because the end_index
is exclusive; the
characters starting from the end_index
value of 3 onwards are omitted from the final result.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: take_substring
start_index: 0
end_index: 3
Replace substring (replace_substring
)
A mask for transforming a selected substring of a string value. The transformation is defined by a nested sequence of masks. Matched substrings are transformed in-place, leaving the unmatched sections intact. For more complex use cases, replace_regex may be helpful.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) that define the transformation to apply to the selected substring. The selected substring is provided as the input to the first mask.
start_index
(optional): The index of the first character to include in the selected substring, with 0 being the index of the first character in the string. Defaults to 0.
end_index
(optional): The index of the character immediately AFTER the selected substring (i.e. the end_index is exclusive). If omitted, the selection will continue until the end of the string.
preserve_length
(optional): If set to true, then the output of the masks will be truncated or repeated until it has the same length as the original substring. This ensures the length of the entire string is unchanged. Defaults to false.
Positive and negative indices can be used, i.e. the first character in a string is at index 0, the second character is at index 1, the last character is at index -1, and the second-to-last character is at index -2.
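The in-place replacement can be sketched with Python slicing (a hypothetical helper, with a fixed replacement string standing in for the nested masks):

```python
def replace_substring(value, replacement, start_index=0, end_index=None,
                      preserve_length=False):
    """Transform only the selected slice, leaving the rest intact."""
    start, stop, _ = slice(start_index, end_index).indices(len(value))
    selected = value[start:stop]
    if preserve_length and replacement:
        # Repeat or truncate the replacement to the original slice length.
        replacement = (replacement * len(selected))[:len(selected)]
    return value[:start] + replacement + value[stop:]

print(replace_substring("Jennifer", "###", start_index=-3))  # Jenni###
```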
Example
This example will replace the last 3 characters of each value in the name
column with a # symbol.
The start_index
value of -3 indicates that the third to last character is the beginning of the
substring. Because the end_index
is not specified, all characters starting from the third to last
character of the string until the end of the string are masked. The final result will take the
characters at index position -3, -2 and -1, and replace those values with '#', leaving the rest
of the string unchanged.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: replace_substring
start_index: -3
masks:
- type: from_fixed
value: '###'
Replace regular expression (replace_regex
)
A mask for transforming sections of a string that match a certain regular expression. The transformation that is applied to each matched substring is defined by a nested sequence of masks. The matched substrings are transformed in-place, leaving the unmatched sections intact.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) defining the transformation to apply to each substring that matches the pattern specified in regex. The entire sequence of masks will be applied to each substring that is matched, with the matched value being provided as the input to the first mask.
regex
(required): The regular expression that will be used to search for substrings to mask. For more details on how to use regular expressions, see Common regular expression patterns.
preserve_length
(optional): If set to true, then each output of the masks will be truncated or repeated until it has the same length as the original matched substring. This ensures the length of the entire string is unchanged. Defaults to false.
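When the nested mask is a fixed value, the behaviour corresponds to a plain regex substitution, as in this Python sketch:

```python
import re

# Replace every digit with '#', leaving other characters intact.
print(re.sub(r"[0-9]", "#", "FD194845"))  # FD######
```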
Example
This example replaces all numeric characters in the driversLicence
column with #
.
Please note that it is also best practice to wrap the regular expression in quotes to avoid
special characters being misinterpreted as YAML syntax:
version: '1.0'
tasks:
- type: mask_table
table: '"DriversLicence"'
key: id
rules:
- column: driversLicence
masks:
- type: replace_regex
regex: '[0-9]'
masks:
- type: from_fixed
value: '#'
Credit card (credit_card
)
This mask provides a number of methods for masking and generating credit card numbers. Parameters can be set to control the prefix, PAN formatting, and Luhn validity of the generated numbers.
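For reference, the Luhn checksum that validate_luhn targets can be sketched as follows (a standard implementation of the algorithm, not DataMasque's code; the sample PAN is a well-known test number):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits of `number` pass the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True
print(luhn_valid("4111111111111112"))  # False
```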
Parameters
issuer_names
(optional): The generated card will have the specified issuer's prefix(es) and card lengths. If left empty, all card issuers can be used to generate the card number. Cannot be specified if preserve_prefix is true. Please refer to the list of issuers.
validate_luhn
(optional): Set to true by default; the generated card number will pass the Luhn checksum. Set to false to generate random credit card numbers instead, which slightly improves performance by skipping Luhn validation. Cannot be true with pan_format enabled.
pan_format
(optional): Set to false by default. When enabled, the digits between the first four and the last six digits will be marked out with 'X's. Cannot be true with validate_luhn enabled.
preserve_prefix
(optional): Set to false by default. When enabled, the generated card will keep the card issuer's prefix but generate new numbers; the length of the prefix will vary based on the issuer. Due to the increased accuracy of this option, which performs a lookup against a list of known issuers, it is not as fast as plain random generation. The actual speed difference will depend on the data to be masked. Cannot be true if issuer_names have been specified.
Example
This example generates credit card numbers that pass the Luhn checksum, with the card issuer set to either MasterCard, Visa, or American Express.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: customer_id
rules:
- column: credit_card_number
masks:
- type: credit_card
issuer_names:
- VISA
- MASTERCARD
- AMERICAN EXPRESS
validate_luhn: true
preserve_prefix: false
pan_format: false
This example generates credit card numbers that retain the original card prefix, with the middle digits of the card marked out with 'X'.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: customer_id
rules:
- column: credit_card_number
masks:
- type: credit_card
validate_luhn: false
preserve_prefix: true
pan_format: true
Card issuer names
These card issuer names can be used as arguments to the issuer_names
parameter. They are not case-sensitive.
Visa | Mastercard | American Express |
China T-Union | China Unionpay | Dankort |
Diners Club International | Diners Club United States & Canada | Discover Card |
Instapayment | Interpayment | JCB |
Lankapay | Maestro | Maestro UK |
MIR | NPS Pridnestrovie | Rupay |
Troy | Ukrcard | Verve |
Card prefixes
DataMasque contains a list of over 105,000 prefixes which are used when the preserve_prefix
parameter is set. If a
prefix is not found, then DataMasque falls back to preserving just the first digit.
A full list of prefixes can be found in the credit_card_prefixes.csv file (1.9MB CSV).
Imitate (imitate
) (deprecated)
The imitate
mask was renamed to substitute
in DataMasque 2.11. Unless noted in the changelog,
imitate
masks will continue to function with the same behaviour and options as the substitute
mask. Refer to
the substitute
mask documentation for usage details and configuration. Backwards
compatibility may be removed in a future DataMasque version, so imitate
should be replaced with substitute
in
rulesets, when possible.
Numeric masks
Random Number (from_random_number
)
This mask replaces the column value with a random number. The default number type is
integer. If decimal_places
is set greater than 0, a decimal number will be
generated.
Parameters
max
(required): The generated number is guaranteed to be less than or equal to this maximum value.
min
(optional): This is the minimum value for the generated number. Defaults to 0.
mode
(optional): If this parameter is supplied, the number will be chosen from a triangular distribution where 'mode' is the peak. Useful for street numbers etc. (e.g. with min and mode both set to 1 to generate lower street numbers more frequently).
decimal_places
(optional): Number of decimal places to generate. Defaults to 0.
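The mode parameter corresponds to a triangular distribution; Python's random.triangular illustrates the effect (values cluster near the mode):

```python
import random

# With min=1 and mode=1, lower numbers are generated more frequently;
# the expected value is (1 + 200 + 1) / 3, i.e. about 67 rather than 100.
samples = [round(random.triangular(1, 200, 1)) for _ in range(5)]
print(samples)  # five numbers in 1-200, skewed toward the low end
```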
Example (integer)
This example replaces the quantity
column in the products table with a random
integer between 1 and 200.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: quantity
masks:
- type: from_random_number
min: 1
max: 200
Example (decimal)
This example replaces the price
column in the products table with a random
number to 2 decimal places between 1 and 200.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: price
masks:
- type: from_random_number
min: 1
max: 200
decimal_places: 2
Random boolean (from_random_boolean
)
This mask replaces the column value with a random boolean value (1
/0
or
true
/false
). from_random_boolean
is effectively equivalent to a
from_random_number
mask with options max: 1
and min: 0
.
Note The return type is an integer which can be automatically cast to the appropriate boolean type in most databases. For file masking and DynamoDB, a literal 1 or 0 will be written to the file, which may not be the intended behaviour.
- If literal
true
/false
boolean values are required, please chainfrom_random_boolean
with atypecast
:
masks: - type: chain masks: - type: from_random_boolean - type: typecast typecast_as: boolean
- If
"true"
/"false"
strings are required please use thefrom_choices
mask with"true"
or"false"
values (note the quoting around the values). For more information please refer tofrom_choices
Example
This example replaces the active
column in the products table with random
boolean values.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: active
masks:
- type: from_random_boolean
Numeric Bucket (numeric_bucket
)
This mask generates random integers from within the same "bucket" (numeric range) as the value to be masked. The mask is set up using an array of integers that define the lower bound of each bucket.
`buckets` Intro
Buckets are a series of defined lower bounds from which replacement values are randomly generated.
For example, consider an `age` column whose values should be masked into certain age buckets: 1-17, 18-24, 25-64, 65+. With `buckets: [18, 25, 65]` specified, if a value from the column is `16`, the replacement value is randomly generated within the 1-17 range, rather than from the full numeric range.
Capping minimum and maximum values
The numeric bucket mask will generate numbers for the lowest bucket from 0 (or 1) to `<bucket> - 1`. Similarly, for the highest bucket, numbers will be generated in the range `<bucket>` to 2³¹ − 1. This default behaviour may cause undesired outputs which are outside the normal range for the data.
For example, consider a column containing ages. All ages currently in the column are 18 or older. Ages are to be retained in buckets 35 or younger, 36-65, 65+.
A ruleset could be defined with `buckets: [36, 65]`, but this would not function correctly. Ages 35 or lower could be masked to any value from 0-35, so the masked age could be below 18, which does not match the application rules. A similar problem exists with ages over 65: the masked value could be any value from 65 to 2³¹ − 1.
To solve this issue, lower and upper bounds should be specified. This requires an understanding of the existing data: choose a reasonable upper bound that your ages could take. A better ruleset definition would be `buckets: [18, 36, 65, 100]`, which limits the lower age to 18 and the upper age to 100.
Parameters
- `buckets` (required): A series of numbers representing the lower bounds of the buckets, including that number, e.g. `[5, 10, 15, 20]` would result in ranges 1-4, 5-9, 10-14, 15-19, 20+. Note: `buckets` should not contain duplicates and should be in ascending order.
- `force_change` (optional): Since replacement values are chosen randomly, it is possible that a number might be replaced with itself (for example, `45` is chosen as a replacement for `45`). Set `force_change` to `true` to ensure the replacement value is not the same as the original value. Note that this makes the output slightly less random, as the number of possible replacements is reduced by one.
- `include_zero` (optional): When set to `true`, lowers the first lower bound to include 0; otherwise the lowest bound will be 1.
- `scale_to` (optional): Provide a number that better represents the scale of your data in the target column, to which replacement values will be scaled. For example, if the scale of your data is 1000 (values are multiples of 1000), then 4583 is not an appropriate replacement for 2000. Set `scale_to` to `1000` so that replacement values will be multiples of `1000` (4583 -> 5000). `scale_to` must be a multiple of 10.

Note:
- If `buckets` were defined with a size of one (e.g. `buckets: [1, 3, 5]`) with the `force_change` parameter set to `true`, this would cause an infinite loop; instead, an error will be raised.
- If the target column is of a type with a smaller range than 0 to 2³¹ − 1 (smallint, tinyint, etc.), add a maximum value to the `buckets`, as replacement values could otherwise be larger than the column can store.
- `numeric_bucket` requires a numeric value as the input. A `typecast` mask can be used to convert any string values before masking with `numeric_bucket`. For more information please refer to Typecast.
Example
This example replaces the age
column in the users
table with random values from the specified buckets
.
version: '1.0'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: age
masks:
- type: numeric_bucket
buckets: [16,18,25,65,110]
force_change: true
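The `scale_to` parameter can be combined with `buckets` in the same way. A hedged sketch, assuming a hypothetical `salary` column holding values in multiples of 1000:

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: salary
        masks:
          - type: numeric_bucket
            buckets: [30000, 60000, 90000, 150000]
            scale_to: 1000   # replacements are rounded to multiples of 1000
```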
Date/time masks
Random date/time (from_random_datetime
)
This mask replaces the column value with a random datetime. Generated datetime values have a resolution of 1 second (the millisecond component will be 0).
Parameters
- `max` (required): The generated date/time is guaranteed to be less than this date/time, formatted according to ISO 8601. Alternatively, specify `now`/`current_date_time` to use the date and time at which the run executes.
- `min` (required): The minimum value for the generated date/time, formatted according to ISO 8601. Alternatively, specify `now`/`current_date_time` to use the date and time at which the run executes.
Note A
from_random_datetime
mask can be used in conjunction with a Typecast mask in order to make sure the format of the generated date/datetime value satisfies any format requirements.
Example
This example replaces the order_time
column with a random date time between
December 1 2019 00:00:00 UTC-0 and December 31 2019 05:30:00 UTC-0.
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: order_time
masks:
- type: from_random_datetime
min: '2019-12-01T00:00+00:00'
max: '2019-12-31T05:30+00:00'
Random date (from_random_date
)
This mask replaces the column value with a random date. Generated date values have a resolution of 1 day and do not include a time component.
Parameters
- `max` (required): The generated date is guaranteed to be less than this date, formatted according to ISO 8601. Alternatively, specify `now`/`current_date_time` to use the date and time at which the run executes.
- `min` (required): The minimum value for the generated date, formatted according to ISO 8601. Alternatively, specify `now`/`current_date_time` to use the date and time at which the run executes.
Note A
from_random_date
mask can be used in conjunction with a Typecast mask in order to make sure the format of the generated date value satisfies any format requirements.
Example
This example replaces the date_of_birth
column with a random date between
January 1 1950 and December 31 2000.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
- type: from_random_date
min: '1950-01-01'
max: '2000-12-31'
Note that this mask uses the ISO 8601 standard for `date` values. This allows ISO weeks to be used when formatting dates; ISO week formats are specified with a 'W' in front of the week number, optionally followed by the day of the week. Additionally, the day of the year can be specified as a three-digit number. See Date formats below.
Date formats
Format | Example | Equivalent |
---|---|---|
yyyy | 2022 | January 1, 2022 |
yyyy-mm | 2022-03 | March 1, 2022 |
yyyy-mm-dd | 2022-03-05 | March 5, 2022 |
yyyy-Www | 2022-W20 | Week 20 of 2022 (May 16, 2022) |
yyyy-Www-d | 2022-W20-4 | Day 4 of Week 20 of 2022 (May 19, 2022) |
yyyy-ddd | 2022-050 | Day 50 of 2022 (February 19, 2022) |
Retain age (retain_age
)
This mask is designed to be used on a column containing dates. It will mask by generating random dates in such a way that age (in years) is retained, calculated based on the date the mask is executed. This can be used to ensure that columns such as date of birth, transaction date, registration date or other creation dates are masked without conflicting with other rules that may rely on the age of the masked date.
Parameters
- `date_format` (optional): If the column value is read as a string type, it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to `"%Y-%m-%d"`.
- `day_of_year` (optional): For consistent masking, specify a `day_of_year` to calculate the ages on that day of the current year. This should be specified in the format `"<month>-<day>"`; for example, `"8-1"` for the 1st of August. The year will be set to the current year. If this option is omitted, ages will be calculated based on the current date.
Example
This example replaces the date_of_birth
column values while retaining the employee's age. The dates are based on the
masking run being executed on August 1st, 2022. The employees' ages are 50, 50, 49 and 22, respectively, and
are maintained even with the new dates of birth.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
- type: retain_age
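The optional parameters can be added to the mask above. This sketch (the date format and day are illustrative choices, not requirements) reads and writes dates in `dd/mm/yyyy` form and calculates ages as at the 1st of August of the current year, for consistent results between runs:

```yaml
masks:
  - type: retain_age
    date_format: '%d/%m/%Y'   # Python strftime/strptime format
    day_of_year: '8-1'        # calculate ages as at 1 August
```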
Note that the mask considers birthdays on the 29th of February to be celebrated on the 28th of February on non-leap-years.
If masking `datetime` values (rather than just `date`s), the time portion of the masked value will be `00:00:00` (midnight).
Retain date component (retain_date_component
)
This mask will generate a date in which up to two of the year, month, and day components are retained.
Parameters
- `date_format` (optional): If the column value is read as a string type, it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to `"%Y-%m-%d"`.
- `year` (this OR `month` OR `day` required): A `bool` which specifies whether the year component of the date should be retained. Defaults to `False`.
- `month` (this OR `year` OR `day` required): A `bool` which specifies whether the month component of the date should be retained. Defaults to `False`.
- `day` (this OR `year` OR `month` required): A `bool` which specifies whether the day component of the date should be retained. Defaults to `False`.
- `minimum_year` (optional): The minimum year that can be generated. Defaults to 100 years ago.
- `maximum_year` (optional): The maximum year that can be generated. Defaults to the current year.
- `force_change` (optional): Ensures the newly generated date is different from the input date, in case the same date is randomly generated.
Example
In this example the month of each date is retained, while the day and year will be randomly generated. The year will be a value between 1980 and 2005 (inclusive).
version: '1.0'
tasks:
- type: mask_table
table: customers
key: id
rules:
- column: date_of_birth
masks:
- type: retain_date_component
month: True
minimum_year: 1980
maximum_year: 2005
Note: In order for
retain_date_component
to consistently produce the same values between runs in different years, a fixedmaximum_year
must be specified.
Retain year (retain_year
)
This mask replaces the column value with a random date. The generated date will have the year component retained while randomising month and day.
Parameters
- `date_format` (optional): If the column value is read as a string type, it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to `"%Y-%m-%d"`.
- `force_change` (optional): Ensures the newly generated date is different from the input date, in case the same date is randomly generated.
Example
This example replaces the date_of_birth
column values with new dates while retaining their year component.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
- type: retain_year
force_change: True
Document masks
Document masks apply masking to subfields in documents that are stored in database columns, for example, JSON or XML documents.
JSON (json
)
This mask uses a query to locate and mask a value inside a JSON document. The rest of the JSON document is unchanged.
The path
is specified using a list of strings or integers which will be used when traversing the data to the values
intended to be masked; some examples of path
are covered in the next section.
Parameters
- `transforms` (required): A list of the transforms (replacements) to perform on the JSON document.
  - `path` (required): The path to locate the value to update.
  - `masks` (required): A list of masks to be performed (any of the valid Mask Types).
  - `on_null` (optional): A string to specify the action to take if the value is `null`. One of:
    - `skip` (default): Skip to the next transform; the document remains unchanged.
    - `error`: Raise an error and stop masking.
    - `mask`: Mask the `null` value as specified.
  - `on_missing` (optional): A string to specify the action to take if the value is not present (due to the document structure not matching the path).
    - `skip`: Skip to the next transform; the document remains unchanged.
    - `error` (default): Raise an error and stop masking.
  - `force_consistency` (optional): Keep consistency between replacements in the path. See the section JSON Example with `force_consistency` for details on behaviour. Defaults to `false`.
- `fallback_masks` (optional): Masks to perform if the data retrieved from the database is not valid JSON.

Note:
- If the `json` mask is provided a `null` value (e.g. from a SQL column), the value will remain `null`; `fallback_masks` will not be executed.
- When masking multiple values in the same JSON document, multiple `transforms` should be specified, instead of multiple table masks with a single transform each. This means the JSON column will only need to be serialized/deserialized once per row.
path
Intro
For the JSON:
{
"customer_details": {
"first_name": "Richard",
"last_name": "Willis"
},
"quantity": 18,
"products": ["product1", "product2"]
}
The following paths could be used to refer to particular values:
- `[customer_details, first_name]` refers to the value `"Richard"`
- `[customer_details, last_name]` refers to the value `"Willis"`
- `[quantity]` refers to the value `18`
- `[products]` refers to the product array `["product1", "product2"]`
- `[products, 0]` refers to the first value in the product array: `"product1"`
The wildcard operator *
can be used to apply masks to multiple items matching the query. This is useful if you don't
know how many elements will be in an array or object. For example, a JSON object with multiple people, each with
multiple addresses:
{
"users": [
{
"name": "Richard",
"addresses": [
{"type": "postal", "city": "Fairview"},
{"type": "physical", "city": "Riverside"}
]
},
{
"name": "Willis",
"addresses": [
{"type": "postal", "city": "Beachland"},
{"type": "physical", "city": "Bronson"}
]
}
]
}
The path [users, "*", name]
would mask the name
for every element in users
, regardless of how many there are.
Multiple wildcards can be used, too. The path [users, "*", addresses, "*", city]
would mask city
in all addresses
elements of all users
. Note that *
must always be quoted in YAML.
Note: Values in
path
are case-sensitive. They should not follow quoting rules for database columns (double quotation marks in an outer set of single quotation marks). Instead, normal YAML string-quoting rules apply.
Example
This example replaces the data at the path [customer_details, first_name]
of the json_data
column with a
fixed value REDACTED
. The on_null: mask
option is specified to mask the null
value as normal. The skip
option is
specified to skip that transform and continue masking on missing values (i.e. the structure does not match the path).
Note that this means the first_name
in the wrong location in the first row is not masked. In cases like this, it can
be safer to specify error
instead, so the masking run fails if data is not in the expected format.
In the second row where {"first_name": null}
, this value will be masked since we specified on_null: mask
.
Also note the use of `fallback_masks`. The last row did not have valid JSON data in it, so the fallback mask was used to replace it with an empty JSON object, which may help clean the data for further use.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: uid
rules:
- column: json_data
masks:
- type: json
transforms:
- path: [customer_details, first_name]
masks:
- type: from_fixed
value: "REDACTED"
on_null: mask
on_missing: skip
fallback_masks:
- type: from_fixed
value: "{}"
For arrays, all masks
will be applied to each value in the array.
For example:
{
"customer_details": {
"given_names": ["Richard", "Willis"]
}
}
The path [customer_details, given_names]
would return the value ["Richard", "Willis"]
and the masks would then
be performed on "Richard"
and "Willis"
separately. This means for most mask types, each value in the array would be
transformed into a new, different value. However, if you are using a mask that always returns the same value
(e.g. from_fixed
) all values would be transformed to the same new value.
Note:
- In all databases, the `json` mask supports masking of JSON data stored in text type columns (`VARCHAR`, `NVARCHAR` or `TEXT`).
- JSON-specific column types are also supported, for example, `JSON` in PostgreSQL and MySQL, or `JSONB` in PostgreSQL.
- Arrays, maps, and sets inside DynamoDB columns can also be masked with the `json` mask. Sets are treated like arrays, with items indexed according to their sorted order.
JSON Example with force_consistency
This example will illustrate the benefit of using the force_consistency
parameter on transforms.
Suppose you have a table with JSON data with the following structure:
{
"name": [
{
"use": "official",
"family": "Chalmers",
"given": ["Peter", "James"]
},
{
"use": "usual",
"given": ["Jim"]
},
{
"use": "maiden",
"family": "Windsor",
"given": ["Peter", "James"]
}
]
}
When masking the items at the path `[name, "*", given]`, it would be best to mask them with consistent values, i.e. the same masked names would appear in each of the `given` items after masking. To do this, set the `force_consistency` parameter of the relevant transform to `true`.
version: "1.0"
tasks:
- type: mask_table
table: dbo.json_test
key: id
rules:
- column: first_name
masks:
- type: json
transforms:
- path: ['name', '*', 'given']
masks:
- type: from_file
seed_file: DataMasque_firstNames_male.csv
seed_column: firstname-male
force_consistency: true
Without `force_consistency`, the output JSON would have all different names.
XML (xml
)
This mask will use a query to locate and mask a value inside an XML document. The rest of the XML document is unchanged.
An Xpath (path
) is used to define the path to the node to mask. Once the node has been located, one or more
node_transforms
can be applied to alter its content or attributes.
Note: The
xml
mask should only be used with trusted XML data. The parser includes support for entity expansion and external references which can potentially be exploited with malicious XML payloads.
Intro to transforms
and node_transforms
XML documents are made up of one or more elements. When referring to an element, this includes the start tag, end tag, attributes and content. For example, this element representing a log:
<Log date="2022-08-09" username="user@example.com">Account created</Log>
The element to mask is located using an Xpath expression. Once found, there are a few different parts of the element that can be masked, namely:
- its name (`Log`)
- its attributes (`date` and `username`)
- its text (`Account created`)
Each of these items are XML nodes.
When a masking run executes, each row from the database is fetched and passed to a masking function only once.
To apply masks on different elements in an XML document, the ruleset should define a list of transforms
, one for each
element that requires masking. In turn, a list of node_transforms
must be specified, one for each node of the
element that needs to be masked.
Specifying masking in this manner allows the masking run to be more efficient by querying for each element to be masked only once.
As an example, consider how to mask the Log
in the above example. The date
and username
attributes should be
redacted, along with the text content. This would require one transform to locate the Log
element, then three
node transforms: one for the date
attribute, another for the username
attribute, and the final to mask the text
of
the element.
The relevant portion of the YAML describing this transform would look like:
transforms:
- path: 'Log'
node_transforms:
- type: attribute
attributes: 'date'
masks:
- <list of masks>
- type: attribute
attributes: 'username'
masks:
- <list of masks>
- type: text
masks:
- <list of masks>
Note: This assumes the `Log` element is not the root element in the XML document. To select the root element, use `.` or an absolute Xpath (starting with `//`) as the path.
All XML values are read as strings, so a `typecast` mask is required if they are used in a mask that requires non-string values (e.g. `numeric_bucket`). XML also requires strings to be written, so masks that return non-string values (e.g. `from_random_number`, `from_random_boolean`, `numeric_bucket`) need to go through a `typecast` mask before being written. For more information on `typecast` please refer to the Typecast documentation. Below is an example with `from_random_number`.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
transforms:
- path: 'Log'
node_transforms:
- type: attribute
attributes: 'id'
masks:
- type: from_random_number
min: 1000
max: 9999
- type: typecast
typecast_as: 'string'
Consistency for multiple elements
Xpath expressions can match multiple elements. This XML document contains a UserLog
with multiple Log
s:
<Root>
<UserLog>
<Log date="2022-08-09" username="user@example.com">Account created</Log>
<Log date="2022-08-09" username="user@example.com">Logged in</Log>
<Log date="2022-08-09" username="user@example.com">Logged out</Log>
</UserLog>
</Root>
The root is called
Root
in these examples – the root node does not need to be namedRoot
.
The Xpath UserLog/Log
would match all three Log
elements. DataMasque can be configured to mask each of
the specified nodes with the same value, or as different values. For example, the text of each element could be masked
to the same value. Or, different masks can be applied to each located element. This is configured with the
force_consistency
option at the transform level. Setting this to true
will apply each node transform in the same way
to each element.
Xpath Relative Node
When evaluating an xpath expression, the root node is considered to be the current node when executing masking. Therefore, the root node should not be included when using relative xpaths.
Consider this example document:
<Root>
<UserLog>
<Log/>
</UserLog>
</Root>
To select the Log
node, the Xpath Root/UserLog/Log
is not valid, as Root
is the current node. Instead
UserLog/Log
should be used as the path is relative to Root
.
If using an absolute Xpath (i.e. an Xpath starting with //
) then the root node should be included. That is, the
Xpath //Root/UserLog/Log
and UserLog/Log
select the same node(s) in this case.
XPath with XML namespaces
When an XML document uses namespaces, the namespace prefix is not used when specifying the Xpath, but instead the namespace URI is included in curly braces {}
immediately before the element or attribute name. Note that you must include the namespace URI for each element or attribute in the path.
<Orders xmlns="http://example.com/api/"
xmlns:o="http://example.com/api/orders/">
<Order poNumber="55">
<OrderId>20</OrderId>
<o:Customer>
<o:CustomerId>10</o:CustomerId>
<o:State o:sentiment="good">Happy</o:State>
<State>NSW</State>
</o:Customer>
</Order>
</Orders>
Here's an example ruleset to mask the above XML document:
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- xpath: "/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId/text()"
masks:
- type: xml
transforms:
- path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId'
on_missing: error
node_transforms:
- type: text
masks:
- type: from_random_number
min: 50
max: 99
- path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order'
on_missing: error
node_transforms:
- type: attribute
attributes: 'poNumber'
masks:
- type: from_random_number
min: 50
max: 99
- path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/orders/}State'
on_missing: error
node_transforms:
- type: text
masks:
- type: from_choices
choices:
- Happy
- Sad
- Angry
- Anxious
- Excited
- type: attribute
attributes: '{http://example.com/api/orders/}sentiment'
masks:
- type: from_choices
choices:
- good
- bad
- excellent
- path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/}State'
on_missing: error
node_transforms:
- type: text
masks:
- type: from_choices
choices:
- ABC
- DEF
- JKL
Masking of unknown/extra attributes
There may be cases where XML elements sometimes have extra attributes that are not always known prior to masking. To
mask these, the extra_attribute_masks
option can be specified. This should contain a list of masks to apply to each
attribute that has not been masked using a defined node_transform
.
By default, each "extra" attribute value will have the masks applied to it separately. To force each of these values to
be the same, specify the force_extra_attribute_consistency: true
at the transform level. The extra_attribute_masks
will be
applied to the first extra attribute on the first node found, and the resulting value will be applied to all extra
attributes. Note that the order in which attributes are located is indeterminate and may not match the order they appear
in the XML.
Parameters
- `transforms` (required): A list of the transforms (replacements) to perform on the XML document.
  - `path` (required): The Xpath expression to locate the value to update.
  - `node_transforms` (required): A list of transforms to apply to the nodes of the element. The syntax of this object is shown in the `node_transforms` Parameters section below.
  - `on_missing` (optional): A string to specify the action to take if the element at the given `path` is not present (due to the document structure not matching the path).
    - `skip`: Skip to the next transform; the document is unchanged by this transform.
    - `error` (default): Raise an error and stop masking.
  - `force_consistency` (optional): Require each matching element to be masked to the same values. Defaults to `false`.
  - `extra_attribute_masks` (optional): A list of masks to apply to attributes not covered by a specific `node_transform`.
  - `force_extra_attribute_consistency` (optional): Force all "extra" attributes to be masked to the same value. Only applicable when using `extra_attribute_masks`. Defaults to `false`.
- `fallback_masks` (optional): Masks to perform if the data retrieved from the database is not valid XML.

Note: If the `xml` mask is provided a `null` value (e.g. from a SQL column), the value will remain `null`; `fallback_masks` will not be executed.
node_transforms
Parameters
`node_transforms` is a list of transforms to apply to the nodes of the found element(s).

- `type` (required): The type of node(s) of the current element to apply masking to. Must be one of:
  - `text`: The text value of the element (the content between the opening and closing tags).
  - `attribute`: Mask one or more attributes on the element.
  - `name`: Mask the name of the element itself.
- `masks` (required): A list of masks to be performed (any of the valid Mask Types).
- `attributes` (optional): Required when using the `attribute` type, and must not be present for other types. May be either a `string`, or an `array` of `strings`, specifying the attributes to apply `masks` to. To apply different masks to different attributes, use multiple `node_transforms`.
- `on_missing_attribute` (optional): A string to specify the action to take if an attribute is missing. Please see the section below on Missing XML Nodes for what constitutes a missing attribute.
  - `skip`: Skip to the next attribute (if masking multiple `attributes`) or, if there are no more attributes to be masked, to the next `node_transform`. The document is unchanged by this transform.
  - `mask`: Apply the masks, using a `null` value, then create the attribute.
  - `error` (default): Raise an error and stop masking.
- `on_null_text` (optional): A string to specify the action to take if the text of a node is `null` (missing). Please see the section below on Missing XML Nodes for what constitutes a missing node.
  - `skip` (default): Skip to the next `node_transform`. The document is unchanged by this transform.
  - `mask`: Apply the masks, using a `null` value, then create the text content.
  - `error`: Raise an error and stop masking.
Missing XML Nodes
The on_missing_attribute
or on_null_text
options can be used to change how missing values are treated.
- A text node is considered null if a tag is self-closing; for example, `<Transaction amount="23.94"/>`. It is also considered null if the element is empty; for example, `<Message to="user1" from="user2"></Message>`.
- An attribute is considered missing if it does not exist on the element. For example, the attribute `currency` is missing from this element: `<Transaction amount="23.94"/>`. An empty string attribute is not considered missing, and is instead masked as an empty string.
- `on_missing_attribute` and `on_null_text` do not apply to the `name` node type, as XML tags/elements must have a name.
Retaining known attributes and removing others
There may be some instances where you want to retain known attributes, but mask all others. In this case, you can combine the `do_nothing` mask with `extra_attribute_masks`. Any attributes you want to retain will be "masked" to their original value with `do_nothing`; DataMasque considers these to be masked and then applies the `extra_attribute_masks` to any other attributes.
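A minimal sketch of this combination (the element and attribute names are hypothetical): the `id` attribute is retained via `do_nothing`, while every other attribute on the matched `Log` elements is redacted by `extra_attribute_masks`:

```yaml
- type: xml
  transforms:
    - path: 'UserLog/Log'
      node_transforms:
        - type: attribute
          attributes: 'id'
          masks:
            - type: do_nothing   # keep this attribute's original value
      extra_attribute_masks:     # applied to all remaining attributes
        - type: from_fixed
          value: REDACTED
```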
Examples
This example contains one `transforms` item and three `node_transforms`. The `transforms` item specifies the path `UserLog/Log` of the `xml_data` column; optional parameters not specified will be set to their default values.

- The first of the `node_transforms` replaces the text at the path with a fixed value `REDACTED`; the `on_null_text: mask` option is specified to mask the `null` value as normal.
- The second masks the `username` attribute with a similar replacement by concatenating three `from_file` masks, plus a `transform_case` mask to make sure the replacements are all still lower case.
- The third masks the `date` attribute with a suitable replacement date using a `from_random_date` mask.
Also note the use of `fallback_masks`. The last row did not have valid XML data in it, so the fallback mask was used to replace it with an empty `<Root />` element, which may help clean the data for further use.
<Root>
<UserLog>
<Log date="2022-08-09" username="user@example.com">Account created</Log>
<Log date="2022-08-09" username="user@example.com"></Log>
<Log date="2022-08-09" username="user@example.com">Logged out</Log>
</UserLog>
</Root>
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Root />'
transforms:
- path: 'UserLog/Log'
node_transforms:
- type: text
masks:
- type: from_fixed
value: REDACTED
on_null_text: mask
- type: attribute
attributes:
- username
masks:
- type: concat
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
- type: from_file
seed_file: DataMasque_email_suffixes.csv
seed_column: email-suff
- type: transform_case
transform: lowercase
- type: attribute
attributes:
- date
masks:
- type: from_random_date
min: '2022-01-01'
max: '2022-12-31'
XML Example with force_consistency
This example will illustrate the benefit of using the force_consistency
parameter on transforms.
Suppose you have a table with XML data with the following structure:
<Root>
<UserLog>
<Log date="2022-08-09" username="user@example.com">Account created</Log>
<Log date="2022-08-09" username="user@example.com"></Log>
<Log date="2022-08-09" username="user@example.com">Logged out</Log>
</UserLog>
</Root>
When masking the date
and username
attributes at the path UserLog/Log
, it is best to mask them with
consistent values, i.e. the same masked values appear in each of the attributes after masking.
To do this, set the force_consistency
parameter of the relevant transform to true
.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Root />'
transforms:
- path: 'UserLog/Log'
force_consistency: true
node_transforms:
- type: text
masks:
- type: from_fixed
value: REDACTED
on_null_text: mask
- type: attribute
attributes:
- username
masks:
- type: concat
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
- type: from_file
seed_file: DataMasque_email_suffixes.csv
seed_column: email-suff
- type: transform_case
transform: lowercase
- type: attribute
attributes:
- date
masks:
- type: from_random_date
min: '2022-01-01'
max: '2022-12-31'
Without force_consistency
, the output XML would contain all different names, as shown in the first example.
Example with force_extra_attribute_consistency
This example illustrates the benefit of using the force_extra_attribute_consistency
parameter on transforms.
Suppose you have XML data with the following structure:
<Root>
<Info>
<Employee date="2022-10-08" given_name="billy_ferwagner" preferred_name="billy_ferwagner"></Employee>
<Employee date="2022-10-08" given_name="william_florista" preferred_name="william_florista"></Employee>
</Info>
</Root>
This time you want to mask the given_name
and preferred_name
attributes to the same value. To achieve this, set force_extra_attribute_consistency: true
and specify extra_attribute_masks
with the masks you want applied to the extra attributes. This generates a single
masked value from the specified masks and replaces the values of all extra attributes with that value.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: fallback
transforms:
- path: 'Info/Employee'
force_extra_attribute_consistency: true
extra_attribute_masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
on_null_text: mask
Without force_extra_attribute_consistency
, the output XML would mask the given_name
and preferred_name
attributes with different values.
Definitions
You can make use of YAML anchors and aliases to assign a label (anchor) to a block of configuration, and then reference that name (via aliases) to re-use the configuration throughout your ruleset, extending or overriding specific properties/keys in the configuration as needed.
You can define an anchor with &
, then refer to it using an alias, denoted with *
.
You can use the merge key <<:
to add values to an aliased block, or to override existing ones.
Shown below is an example that sets four values with an anchor, re-uses three of those values with an alias, and overrides the second value.
Both tasks share the values for type
, key
and rules
through the anchor &mask_table_opts
and the alias *mask_table_opts
. The value for table
is overridden via the <<
merge operator.
version: '1.0'
tasks:
- &mask_table_opts
type: mask_table
table: details
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- <<: *mask_table_opts
table: customers
As you may commonly want to re-use the definition of a task, rule, or mask in multiple locations (often with
minor variations), DataMasque allows you to group such definitions under the task_definitions
,
rule_definitions
, and mask_definitions
configuration properties of a ruleset. The following subsections demonstrate
how to use anchors and aliases with such definitions.
Task definitions (task_definitions
)
The task_definitions
attribute of the ruleset may contain a list of task definitions to be referenced
from the ruleset's tasks
attribute through YAML anchors and aliases. Each task definition must follow
the same syntax as a task in the tasks
attribute (see Database Task Types for the full list of
available task types).
For example, the following ruleset has an anchor named &mask_table_customer_info
under task_definitions
that is inherited three times
in the tasks
list via the alias *mask_table_customer_info
. Each time the definition is re-used, the table
attribute is overridden:
version: '1.0'
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
tasks:
- <<: *mask_table_customer_info
table: details
- <<: *mask_table_customer_info
table: customers
- <<: *mask_table_customer_info
table: orders
This is equivalent to specifying the three mask_table
tasks separately with the same key
, type
, and rules
but applied to different tables, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: details
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
- type: mask_table
table: customers
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
- type: mask_table
table: orders
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
Rule definitions (rule_definitions
)
The rule_definitions
attribute of the ruleset may contain a list of rule definitions to be referenced
through YAML anchors and aliases. Each rule definition must follow the same syntax as a rule in a
mask_table
task's rules
.
For example, the following ruleset has an anchor named &postcode_rule
under rule_definitions
that is inherited twice
in a list of rules
via the alias *postcode_rule
. Each time the definition is re-used, the column
attribute is overridden:
version: '1.0'
rule_definitions:
- &postcode_rule
column: postcode
masks:
- type: from_random_number
max: 9999
tasks:
- type: mask_table
table: orders
key: id
rules:
- <<: *postcode_rule
column: target_postcode
- <<: *postcode_rule
column: destination_postcode
This is equivalent to specifying the two rules separately with the same masks
but applied to different columns, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_postcode
masks:
- type: from_random_number
max: 9999
- column: destination_postcode
masks:
- type: from_random_number
max: 9999
Mask definitions (mask_definitions
)
The mask_definitions
attribute of the ruleset may contain a list of mask definitions to be referenced
from a list of masks through YAML anchors and aliases. Each mask definition must follow
the same syntax as a mask in the masks
attribute of a rule (see Mask Types for the full list of
available mask types).
For example, the following ruleset has an anchor named &street_mask
under mask_definitions
that is inherited twice
in a masks
list via the alias *street_mask
. Each time the definition is re-used, the table_filter_column
attribute is overridden:
version: '1.0'
mask_definitions:
- &street_mask
type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_street
masks:
- <<: *street_mask
table_filter_column: target_city
- column: destination_street
masks:
- <<: *street_mask
table_filter_column: destination_city
This is equivalent to specifying the two masks separately with identical configuration except for
different table_filter_column
values, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: target_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
- column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Nested definitions
Definitions can also be nested within other definitions.
For example, a nested rule definition &customer_info_street_address
can be anchored under the task definition &mask_table_customer_info
.
version: '1.0'
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules:
- &customer_info_street_address
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
tasks:
- <<: *mask_table_customer_info
table: details
rules:
- <<: *customer_info_street_address
column: pickup_street
Inherit definitions with dictionaries
As it is not possible to use anchors and aliases to override individual items in a YAML list, DataMasque allows you to
alternatively specify lists of rules
, joins
, and masks
as dictionaries of arbitrary keys to their respective values.
A dictionary is represented in a simple key: value
form (the colon must be followed by a space):
# A mask rules example in dictionary with arbitrary keys
rules:
customer_info_postcode: *postcode_rule
customer_info_customer_id: *customer_id_rule
customer_info_destination_street: *destination_street_rule
When the rules are executed, the above dictionary is treated identically to the following list:
# A mask rules example in list
rules:
- *postcode_rule
- *customer_id_rule
- *destination_street_rule
The following example demonstrates how specifying rules
as a dictionary allows you to easily override
individual rules by their key using anchors and aliases.
By specifying rules
as a dictionary in the task definition &mask_table_customer_info
, and by using anchors and aliases to
reference the rules dictionary and individual rules, we can override the entire customer_info_postcode
rule in
the first task, and just the column
of the customer_info_customer_id
rule in the second task.
version: '1.0'
rule_definitions:
- &postcode_rule
column: postcode
masks:
- type: from_random_number
max: 9999
- &customer_id_rule
column: customer_id
masks:
- type: from_random_text
max: 10
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules: &customer_info_rules
customer_info_postcode: *postcode_rule
customer_info_customer_id: *customer_id_rule
customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
tasks:
- <<: *mask_table_customer_info
table: details
rules:
<<: *customer_info_rules
customer_info_postcode:
column: post
masks:
- type: from_fixed
value: 1234
- <<: *mask_table_customer_info
table: orders
rules:
<<: *customer_info_rules
customer_info_customer_id:
<<: *customer_id_rule
column: ext_customer_id
These changes can be seen in the equivalent expanded ruleset below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: details
key: id
rules:
      customer_info_postcode:
        column: post
        masks:
          - type: from_fixed
            value: 1234
      customer_info_customer_id:
        column: customer_id
        masks:
          - type: from_random_text
            max: 10
      customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
- type: mask_table
table: orders
key: id
rules:
customer_info_postcode:
column: postcode
masks:
- type: from_random_number
max: 9999
customer_info_customer_id:
column: ext_customer_id
masks:
- type: from_random_text
max: 10
customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Note: Dictionaries can be used instead of lists in the following ruleset attributes:
- For rules
, as an attribute of:
  - Task type mask_table
, including inside an if
condition (both rules
and else_rules
).
- For joins
, as an attribute of:
  - Task type mask_table
- For masks
, as an attribute of:
  - A rule in a set of rules
  - Mask type replace_regex
  - Mask type replace_substring
  - Mask type concat
  - Mask type chain
Note: The order of items in these dictionaries is respected, so the operations will be performed in the order you specify them.
Advanced ruleset features
Default values to skip
You can specify a set of default skip values at the ruleset level with skip_defaults
:
version: '1.0'
skip_defaults:
- null
- ''
- matches: 'administrator_.*'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: role
masks:
- type: from_fixed
value: 'customer'
These will be used as the skip
configuration for any masking rule that does not have skip
explicitly set. You can
still force a masking rule to not skip any values by setting skip
to an empty list: skip: []
Freezing random values
A run_secret
string can be provided as an option when starting a run. If you provide a run_secret
, all
randomly chosen and generated values will be exactly the same across multiple runs with the same secret. This can be useful
for testing or for retaining consistency across multiple masking runs. The run_secret
can be any sequence of at least 20
characters; however, a randomly generated string is recommended.
Notes:
- Output values will be frozen only across identical repeated masking runs. Variations in database content (such as inserted or deleted rows) or Run Options (i.e. Batch Size, Max rows) can still result in different outputs even when using the same
run_secret
. Additionally, using a non-unique key may result in non-deterministic results. To retain consistent masking results across variable data sources, use the Deterministic masking features.- As the parallelised unloading of Amazon Redshift data into S3 files cannot guarantee consistent ordering between runs, a fixed
run_secret
may still result in different outputs between masking runs on Amazon Redshift databases. To retain consistent masking results with Amazon Redshift, use the Deterministic masking features.
Deterministic masking
If you wish to make the masked values deterministic based on one or more input values, you can provide hash sources in a ruleset. DataMasque will use the hashed value(s) from the specified sources as one of the random generator inputs (along with the instance seed and the run secret). This allows for consistent output values to be generated even across different databases or inside documents.
For database or tabular file masking, this is done with a hash_columns
argument for each column
being
masked. For file masking, the parameter is called hash_sources
and the hash applies to the entire file.
Changes to hashing values in 2.8.0: Prior to DataMasque 2.8.0, values to hash were converted to strings then hashed. This may have resulted in duplicate output values when hashing on
null
. From version 2.8.0, hashing occurs on binary representations of data. This means that hashed outputs from DataMasque 2.8.0 onwards will differ to those generated by prior versions.
Deterministic masking with databases or tabular files
Each hash_columns
item can either be just the name of the column, or can specify a json_path
or xpath
to fetch a
hash value from inside JSON or XML, respectively. For more information on json_path
or xpath
please
refer to the JSON documentation/XML documentation.
For JSON/XML hash_columns
the following logic is applied:
- If the value extracted from the
json_path
/xpath
isnull
the hash will be performed on thenull
value. - If the value does not exist at the
json_path
/xpath
the hash will be performed onnull
. - If the JSON/XML data cannot be decoded (is invalid) the hash will be performed on the entire column, even if
null
.
Note:
- Since null
is the default fallback for invalid data, this may result in unexpected repeated masked values. Warnings will be added to the run logs when this fallback occurs.
- The json_path
and xpath
optional parameters cannot both be set for the same hash column.
- It is possible to hash a column onto itself (i.e. hash_columns
contains the current column
). If you hash a column on itself, and that column is used as the hash for other columns, the mask on the hash column should come last in the rules. Otherwise, that column will use the pre-masked value as the hash while the other columns use the post-masked value, meaning different values will be generated in the same row for the same mask definition.
Example
This example masks the date_of_birth
column with a date value that is deterministically generated from
the hash of the date_of_birth
and name
column values, combined with a one-off randomly generated secret for
this run.
For example, in every row where date_of_birth = '2000-01-01'
and name = 'Carl'
, the date_of_birth
will be
replaced with a deterministically generated value (e.g. 1999-03-03
). The same replacement value is generated
for repeated rows that match the same date_of_birth
and name
values within this run. However, this value
will differ between runs. To retain consistency of deterministically generated values between multiple runs,
see Consistent masking.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
hash_columns:
- date_of_birth
- name
masks:
- type: from_random_date
min: '1980-01-01'
max: '2000-01-01'
An equivalent alternative version of YAML syntax is shown below:
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
hash_columns:
- column_name: date_of_birth
- column_name: name
masks:
- type: from_random_date
min: '1980-01-01'
max: '2000-01-01'
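Conceptually, hash_columns works like seeding a random generator with the run secret plus the hashed input values, so identical inputs yield identical outputs within a run. A minimal Python sketch of the idea (the function and hashing scheme here are illustrative assumptions, not DataMasque's actual implementation):

```python
import hashlib
import random

def deterministic_mask(run_secret, hash_values, choices):
    # Seed a PRNG from the run secret combined with the hash inputs,
    # then draw a replacement value from the available choices.
    digest = hashlib.sha256(
        "|".join([run_secret] + [str(v) for v in hash_values]).encode()
    ).hexdigest()
    return random.Random(digest).choice(choices)

dates = ["1999-03-03", "1985-07-12", "1992-11-30"]
# Rows with identical hash inputs receive identical replacements.
first = deterministic_mask("a-20-char-run-secret", ["2000-01-01", "Carl"], dates)
second = deterministic_mask("a-20-char-run-secret", ["2000-01-01", "Carl"], dates)
assert first == second
```

A different run secret reseeds the generator, which is why values change between runs unless the same secret is supplied.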
Example with xpath
This example will mask the date_of_birth
column with a date value that has been deterministically generated based
on the hash of the date_of_birth
attribute from the following XML document column.
<Root>
<Info>
<Employee date_of_birth="2022-10-08">Steve</Employee>
<Employee date_of_birth="2022-10-08">Fred</Employee>
</Info>
</Root>
To get the date_of_birth
attribute from the XML document, specify the path with an xpath
. The xpath Info/Employee[1]/@date_of_birth
retrieves the date_of_birth
attribute from the first Employee
element of the document.
To perform the hash on the list of all date_of_birth
attribute values, specify Info/Employee/@date_of_birth
instead.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: date_of_birth
hash_columns:
- column_name: xml_data
xpath: "Info/Employee/@date_of_birth"
masks:
- type: from_random_text
min: 5
max: 10
When the date_of_birth
attribute has the same value for both Employee
elements, the
resulting masked date_of_birth
column contains the same value.
Deterministic masking with files
File masking tasks (mask_file
) support deterministic generation of masked values which are based on input values. You
can provide a hash_sources
list in a rule, similar to the hash_columns
for database masking.
However, with hash_sources
you must specify additional parameters to fetch the hash.
Parameters
- json_path
(optional): A path to the value in the JSON data to use for the hash.
- xpath
(optional): An XPath to the value stored in the XML data to use for the hash.
- file_path
(optional): A boolean value; when set to true
, the hash will be performed on the file's path (relative to the base directory). Defaults to false
.
- case_transform
(optional): upper
/lower
- Apply a case transform on the input value, for consistent hash values irrespective of case.
Note:
- Each of json_path
, xpath
and file_path
is shown as optional, but exactly one of them must be specified. For more information on json_path
or xpath
please refer to the JSON documentation/XML documentation.
- hash_sources
for multi-record files works the same as for object files, except that the hash_sources
are applied to each record in the file instead of the entire file.
Example with json_path
This example will mask a JSON file which contains the following JSON data:
{
"users": {
"name": "Richard",
"addresses": [
{
"type": "postal",
"city": "Nelson"
},
{
"type": "physical",
"city": "Napier"
}
]
}
}
We want to mask the city
of each user, hashed with the user's name
. To do this, specify a json_path
to the value to use for the hash ([users
, name
]) in the hash_sources
, as shown in
the ruleset below. The replacement value for city
is best drawn from one of the seed files DataMasque provides.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- json_path: ['users', 'name']
masks:
- type: json
transforms:
- path: ['users','addresses', "*",'city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
When the name
has the same value, the resulting masked value for the city
is also the same, even across separate files.
Example with xpath
This example will mask two different XML files but generate consistent masked data by using user id as a hash. The first contains a user's info.
<User id="8472">
<Name>Evelyn</Name>
<Address>
<City>Nelson</City>
</Address>
</User>
The second contains a list of the user's log messages:
<Logs userId="8472">
<Message to="Evelyn" date="2022-08-09" type="welcome"/>
<Message to="Evelyn" date="2022-08-10" type="forgot_password"/>
</Logs>
In both cases, we want the user's name to be consistent, and this can be done by specifying the user's ID as the hash
source for each file. We'll create a ruleset with two mask_file
tasks, and use includes
to map the right task to the
correct file.
The first task will mask the User/Name
text value, and the second will mask the to
attribute of each
Logs/Message
element. The first task will use the id
attribute on User
as the hash source, while the second will
use the userId
attribute on Logs
.
Here's the ruleset to accomplish this:
version: "1.0"
tasks:
- type: mask_file
include:
- glob: user_info_*.xml
rules:
- hash_sources:
- xpath: 'User/@id'
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<User />'
transforms:
- path: 'User/Name'
node_transforms:
- type: text
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
- type: mask_file
include:
- glob: user_logs_*.xml
rules:
- hash_sources:
- xpath: 'Logs/@userId'
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Logs />'
transforms:
          - path: 'Logs/Message'
node_transforms:
- type: attribute
attributes: 'to'
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
The two output files would use consistent names. The first file:
<User id="8472">
<Name>Sarah</Name>
<Address>
<City>Nelson</City>
</Address>
</User>
And the second:
<Logs userId="8472">
<Message to="Sarah" date="2022-08-09" type="welcome"/>
<Message to="Sarah" date="2022-08-10" type="forgot_password"/>
</Logs>
Example with file_path
This example shows a ruleset that masks files based on the file path; the same approach applies to any mask_file
task:
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- file_path: true
masks:
- type: json
transforms:
- path: ['users','addresses', "*",'city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
With this ruleset the hashing is performed on the file path, so each file with a different path will have different masked values. This is useful if you want repeatable masking based on the name of a file.
Common regular expression patterns
A regular expression (or "regex") is a sequence of characters that acts as a search pattern
to filter or select substrings of text strings. They can be used
in replace_regex
masks and matches
conditions
of if
and skip
blocks.
Note that it is best practice to wrap the regular expression in quotes to avoid
special characters being misinterpreted as YAML syntax:
version: '1.0'
tasks:
- type: mask_table
table: driversLicence
key: ID
rules:
- column: driversLicence
masks:
- type: replace_regex
regex: "[0-9]"
masks:
- type: from_fixed
value: "#"
Note: DataMasque uses the Python regular expression syntax, which you can learn more about in the official tutorial.
Regular expressions allow you to match various characters by specifying a character or character set. Below is a list of character sets that can be used in regular expressions.
Characters can be specified as character classes.
"[0-9]"
can be used to match any numerical character."[a-z]"
and"[A-Z]"
will match lower case and upper case alphabetical characters respectively.- Classes can also be combined;
"[a-zA-Z]"
will match any upper or lower case character. - You may also specify a list of characters to match within square brackets. If you wish to only match vowels,
you can use
"[aeiouAEIOU]"
. - If a caret is added as the first character inside a character class, it will instead match any character other
than those specified.
"[^0-9]"
will match any character other than a numerical character.
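These character classes can be checked directly with Python's re module (the syntax DataMasque uses):

```python
import re

assert re.fullmatch(r"[0-9]", "7")          # any single digit
assert re.fullmatch(r"[a-z]", "q")          # lower case letter
assert re.fullmatch(r"[a-zA-Z]", "Q")       # combined classes
assert re.fullmatch(r"[aeiouAEIOU]", "e")   # explicit character list
assert re.fullmatch(r"[^0-9]", "x")         # negated class: not a digit
assert not re.fullmatch(r"[^0-9]", "5")
```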
The table below details other ways to denote certain character types without the use of character classes.
Character | Description |
---|---|
\d | One digit character, ranging from 0 to 9. |
\w | One 'word' character; an ascii letter, a digit or an underscore. |
\s | One whitespace character. This can be a space, a line break, or a tab. |
\D | One character that is NOT a digit. |
\W | One character that is NOT a word character. |
\S | One character that is NOT a whitespace character. |
Note that when used in double-quoted strings in YAML, literal backslashes must be escaped by backslashes, for example,
"\\d"
will be interpreted as\d
. We recommend quoting YAML strings with single quotes to avoid the need for this escaping, as'\d'
will be interpreted as\d
.
You can also choose to add a quantifier to any character or character class, which will specify how many characters will be matched.
Quantifier | Description |
---|---|
+ | One or more characters will be matched. |
{n} | Exactly n characters will be matched. For example, \d{3} will match strings of exactly 3 numerical digits. |
{a,b} | Any string of between a and b characters will be matched. For example, \w{1,3} will match a string of word characters between 1 and 3 in length. |
{c,} | Any string of characters with c or more characters. For example, \d{2,} will match a string of numerals of length 2 or more. |
* | Any instance of the characters appearing zero or more times. |
? | Matches if the character or character class appears once, or no times. |
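The quantifiers above behave as follows in Python's re module:

```python
import re

assert re.fullmatch(r"\d+", "12345")        # one or more digits
assert re.fullmatch(r"\d{3}", "123")        # exactly 3 digits
assert not re.fullmatch(r"\d{3}", "12")
assert re.fullmatch(r"\w{1,3}", "ab")       # between 1 and 3 word characters
assert re.fullmatch(r"\d{2,}", "123456")    # 2 or more digits
assert re.fullmatch(r"a*", "")              # zero or more occurrences
assert re.fullmatch(r"colou?r", "color")    # optional character
assert re.fullmatch(r"colou?r", "colour")
```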
Below are some common regular expression patterns that can be used to match typical database strings.
Example Patterns
Phone Number
The pattern below will match a phone number consisting of an international code in parentheses followed by a 7-digit phone number that may be hyphenated after the 3rd digit.
'\(\+\d{1,4}\)\s?\d{3}-?\d{4}'
Some examples of matching strings are shown below:
- (+64)123-4567
- (+1234) 5678910
- (+61) 987-5432
Social Security Number
The pattern below will match a Social Security Number consisting of hyphen-separated digits:
'\d{3}-\d{2}-\d{4}'
Some examples of matching strings are shown below:
- 111-22-3333
- 987-65-4321
- 112-35-8132
Time
The pattern below will match a time consisting of 1 or 2 hour digits and 2 minute digits separated by a colon:
'[0-2]?\d:\d{2}'
Some examples of matching strings are shown below:
- 7:30
- 23:50
- 05:45
Email Address
The pattern below will match an email address consisting of an alphanumeric username and
an alphanumeric domain name separated by an @
symbol. For a more comprehensive email
regular expression, see: emailregex.com.
'[a-zA-Z0-9]+@[a-zA-Z0-9\.]+'
Some examples of matching strings are shown below:
- janedoe@gmail.com
- bobsmith@hotmail.com
- thomas@yahoo.com
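Each of the example patterns above can be verified against its sample strings with Python's re module:

```python
import re

phone = r'\(\+\d{1,4}\)\s?\d{3}-?\d{4}'
ssn = r'\d{3}-\d{2}-\d{4}'
time = r'[0-2]?\d:\d{2}'
email = r'[a-zA-Z0-9]+@[a-zA-Z0-9\.]+'

# re.search finds the pattern anywhere in the string,
# which is how a masking filter would typically apply it.
assert re.search(phone, "(+64)123-4567")
assert re.search(phone, "(+1234) 5678910")
assert re.search(ssn, "111-22-3333")
assert re.search(time, "23:50")
assert re.search(email, "janedoe@gmail.com")
```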
DISCLAIMER: The above examples are simplistic and general examples to demonstrate how regular expressions can be constructed - you should ensure these regular expressions are appropriate for your particular data before using them in your masking rulesets.