Ruleset YAML specification
- Ruleset YAML specification
- Database task types
- Masking files
- File task types
- Masking tables
- Definitions
- Advanced ruleset features
- Common regular expression patterns
Ruleset YAML specification
The ruleset YAML configuration provides instructions that DataMasque will follow when performing a masking run against a target database. Rulesets are comprised of one or more tasks, which can contain many different types of instructions. The most common use case is the application of tasks for masking sensitive data in database tables with masking rules.
Ruleset Properties
The following properties are specified at the top-level of a ruleset YAML file:
- `version` (required): The schema version used by this ruleset. The default value present in the editor when creating a new ruleset is generally the value that you should be using. See Schema Versioning for more information.
- `name` (deprecated; will be removed in release 3.0.0): A unique name that will be used to refer to your ruleset. This field may only contain alphanumeric characters and underscores, and its length is limited to 64 characters.
- `tasks` (required): A list of tasks to be performed in order on the target database. See Database Task Types for the comprehensive list of the available task types and their associated parameters.
- `task_definitions` (optional): A list of task definitions to be referenced from the ruleset's list of `tasks` through YAML anchors and aliases.
- `rule_definitions` (optional): A list of rule definitions to be referenced from a list of rules through YAML anchors and aliases.
- `mask_definitions` (optional): A list of mask definitions to be referenced from a list of masks through YAML anchors and aliases.
- `skip_defaults` (optional): See Default values to skip.
- `random_seed` (optional): Deprecated in favour of the Run secret option. See Freezing random values.
Example
The following ruleset provides an example for replacing the `last_name` column of every row in the `users` table with the fixed value "redacted last name":
# My ruleset.yml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'
Note: The tasks must be indented using two spaces. For example:
tasks:
  - type: mask_table
^^
Here, `- type: mask_table` is indented two spaces from the parent `tasks:`.
Important note on case-sensitivity: For all tasks except `run_sql`, database identifiers, such as table and column names, should be referenced as you would otherwise reference them in an SQL query. When masking a case-sensitive database, identifiers must be referenced in the ruleset using the correct case.
To refer to a case-sensitive table or column in a database, the identifier must be enclosed in double quotation marks. However, when writing YAML, quotations are used to denote a string value, so any enclosing quotations are not considered as part of the value. As such, it is necessary to enclose the entire name - including double quotation marks - in an outer set of single quotation marks. For example:
# Case-sensitive table name; enclosed in both single and double quotations.
table: '"CaseSensitiveTableName"'
To refer to a case-sensitive table in a schema, the schema name must also be enclosed in quotation marks if the schema name is case-sensitive. The entire combination of schema and table name must be enclosed in single quotation marks. For example:
# Case-sensitive schema and table name; enclosed in both single and double quotations.
table: '"CaseSensitiveSchemaName"."CaseSensitiveTableName"'
If referencing a combination of table and column, you will need to have quotation marks around both the table and column names within the surrounding quotation marks. For example:
# Case-sensitive table and column name; enclosed in both single and double quotations.
column: '"CaseSensitiveTable"."CaseSensitiveColumn"'
Identifier names containing double quotation marks, backslashes, periods, and whitespace should always be enclosed in double quotation marks. Also, literal double quotation marks and backslashes must be preceded by a backslash:
# Case-sensitive table and column name containing special characters.
column: '"Case\"Sensitive\\Table"."Case.Sensitive Table"'
Note: Backslashes and single quotation marks are not supported in
identifier names for Microsoft SQL Server (Linked Server) databases.
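Putting these quoting rules together, a minimal sketch of a masking task against a case-sensitive schema, table, key and column might look like the following (the schema, table and column names here are illustrative, not from a real database):
tasks:
  - type: mask_table
    # Case-sensitive schema and table, each wrapped in double quotes inside single quotes.
    table: '"HR"."Employees"'
    key: '"EmployeeId"'
    rules:
      - column: '"LastName"'
        masks:
          - type: from_fixed
            value: 'REDACTED'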
Schema versioning
Schema changes to the DataMasque ruleset specification are tracked using the version
field of
the ruleset. The version number consists of two fields in the format major.minor
. Minor version
increments reflect backwards-compatible changes to the schema, whereas major version increments
represent breaking changes which will require some form of migration from previous versions. Wherever
possible, DataMasque will handle such migrations for you automatically when you upgrade.
Each release of DataMasque only supports the most recent major ruleset version at the time of release. As such, the major schema version of your rulesets must equal the major version supported by your DataMasque release. The minor schema version of your rulesets must be equal to or less than the minor version supported by your DataMasque release.
The ruleset schema version supported by this release of DataMasque is "1.0".
Writing comments
While creating a ruleset, it is possible to write comments in the ruleset. Commented blocks are skipped during execution and are useful for adding descriptions to ruleset blocks.
If you begin a line with #
(hash symbol), all text on that line will become a comment.
version: '1.0'
# This line will become a comment.
tasks:
  - type: mask_table
If you place #
on a line, all text after that #
on that line will become a comment. Any text before it
will still be part of the ruleset.
version: '1.0'
tasks:
  - type: mask_table # The name of this task type will not be affected by this comment.
In the DataMasque ruleset editor, the shortcut key combination for commenting ruleset blocks is CTRL + /
on Linux and Windows and ⌘ + / for Mac operating systems. If your cursor is on a line and this shortcut
is used, the entire line will be commented out. Highlighting multiple lines at once will cause all
highlighted lines to be commented out.
Database task types
A ruleset is comprised of a list of tasks to be performed in sequence on
the target database. Tasks are performed serially from top to bottom,
but special serial
and parallel
tasks can be used to nest other tasks
within them for performance (parallelism) or dependency management.
Sensitive data discovery
The run_data_discovery
task type inspects the metadata of your database, searching for columns
which are likely to contain sensitive data. On completion, a report is generated containing a
summary of all identified columns, and their current masking coverage. It is recommended to include
a single run_data_discovery
task in your rulesets to help ensure complete masking coverage and
provide ongoing protection as new sensitive data is added to your database.
See the Sensitive Data Discovery guide for more information on this feature.
Parameters
This task type does not have any parameters.
Example
The following shows an example ruleset that will execute only the run_data_discovery
task
and no masking tasks. This example usage may be useful when starting a new ruleset from
scratch to determine a starting point for developing your masking rules.
The run_data_discovery
may also be included in a ruleset alongside other masking tasks to
provide continuous feedback on
the masking coverage provided by the ruleset.
version: '1.0'
tasks:
  - type: run_data_discovery
Schema Discovery
The run_schema_discovery
task type inspects the metadata of your database, searching for schemas, tables, and columns
and can flag certain columns which are likely to contain sensitive data. On completion, a report is generated containing
a summary of all identified schemas, tables, columns, and relevant metadata of the data within the columns.
See the Schema Discovery guide for more information on this feature.
Parameters
This task type does not have any parameters.
Example
The following shows an example ruleset that will execute only the run_schema_discovery
task
and no masking tasks. This example usage may be useful when starting a new ruleset from
scratch to determine a starting point for developing your masking rules.
version: '1.0'
tasks:
  - type: run_schema_discovery
Table masks
Each mask_table
task defines masking operations to be performed on a
database table. More detail is provided about these tasks under the Masking
Tables section.
Parameters
Each task with type mask_table
is defined by the following parameters:
- `table` (required): The name of the table in the database. The table name can be prefixed with a schema name to reference a table in another schema. If the table or schema name is case-sensitive, you must enclose the name in double and single quotation marks in order to specify the casing of the name. For example, `table: '"CaseSensitiveSchema"."CaseSensitiveTable"'`.
- `key` (required): One or more columns that identify each table row. Composite keys may be specified for the `key` parameter. For more details on using composite keys, see Composite keys.
  - For Oracle databases it should always be `ROWID` (`key: ROWID`). DataMasque will implicitly use `ROWID` when `ROWID` is not specified. For more details, refer to Query optimisation.
  - For Microsoft SQL Server and PostgreSQL databases it is recommended to use the primary key, or any other unique key that is not modified during masking, for better performance. If a non-unique `key` is used then all rows with the same value for the `key` will have the same masked values. The key columns must not contain any NULL values. If the key is case-sensitive, you may enclose each key value in double and single quotation marks in order to specify the casing of the key. For example, `key: '"Customer_ID"'`.
  - If the columns specified for the `key` parameter cannot be used to uniquely identify rows, then the masked values will be the same for rows that have the same key value. Refer to the key and hash columns example in the Notes section for how to avoid producing duplicate masked values.
- `rules` (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
  - `column` (required): The name of the column to mask.
  - `masks` (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
  - `hash_columns` (optional): A list of columns which will be used as input to the Deterministic masking algorithm for this rule. If `hash_columns` is provided, all mask types that rely on randomisation become deterministic based on the `hash_columns` column values. Values in the provided columns can be null. The hash column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema. Hash columns can also be specified by a list of dictionaries with the following keys:
    - `column_name` (required): The name of the column which will be used as input, as above.
    - `case_transform` (optional): `upper`/`lower` - Applies a case transform to the input values, giving consistent hashed values irrespective of case. This is useful if values are stored with different cases in different tables, allowing for consistent hashing on those values. For example, email addresses could be stored as all lowercase in one table but mixed case in another.
    - `json_path` (optional): If the column contains JSON data, the path to the value in the JSON data you wish to perform the hash on can be specified here; otherwise the hash will be performed on the entire column.
    - `xpath` (optional): If the column contains XML data, the XPath to the value stored in the XML data you wish to perform the hash on can be specified here; otherwise the hash will be performed on the entire column.
    For more information on `json_path` please refer to the JSON documentation. For more information on `xpath` please refer to the XML documentation. For more information on deterministic masking please refer to Deterministic masking.
- `workers` (optional): The number of parallel processes to use for masking this table (defaults to 1). Each process will operate on a separate batch of rows from the table (batch size is a configurable run option). Increasing `workers` may decrease masking run times, dependent on database performance and the complexity (e.g. number of columns to mask) of the masking task. It is recommended to increase the number of workers if DataMasque connections to your target database spend more time processing queries than waiting for DataMasque (i.e. the "waiting for clients" time approximates DataMasque CPU time), which suggests DataMasque could efficiently use multiple workers to mask other batches while waiting for database responses. Additionally, it is recommended that the number of parallel processes multiplied by the number of workers assigned to each process does not exceed twice the number of CPUs available on your deployed instance.
- `index_key_columns` (optional): When `index_key_columns` is `true` (the default setting), DataMasque will create an additional index on all key columns if no existing index contains all key columns. You may wish to disable the creation of a new index if you have an existing index on some of the key columns that will provide adequate performance.
  - For Oracle databases, this option has no effect because `ROWID` is always used as the key.
  - Redshift databases do not support indexes, so this option has no effect for them.
  - This option currently has no effect for Microsoft SQL Server (Linked Server) databases.
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - user_id
      - first_name
    rules:
      - column: date_of_birth
        hash_columns:
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'
Composite keys
A composite key can be specified in the following formats:
- A list of columns; for example, `key: ['invoice_id', 'product_id']` would be used to indicate a composite key consisting of a combination of two columns, `invoice_id` and `product_id`.
- For Microsoft SQL Server and PostgreSQL, when specifying a composite key, the order of the keys listed must respect the original order as defined in the database.
- A multiline composite key. An example of a multiline composite key is shown below.
key:
  - 'invoice_id'
  - 'product_id'
Example mask_table ruleset
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: "redacted last name"
Notes
- Index operations will be performed online (`ONLINE=ON`) on SQL Server editions that support this feature.
- The following types cannot be used as `key` columns:
  - Microsoft SQL Server: `datetime`, `time(7)`, `datetime2(7)`, `datetimeoffset(7)`
  - PostgreSQL: `real`, `double precision`
- While `rules` and `masks` should typically be provided as lists, they can also be specified as dictionaries that map arbitrary keys to rules/masks. For example:
  ...
  rules:
    last_name_rule:
      column: last_name
      masks:
        fixed_mask:
          type: from_fixed
          value: "redacted last name"
- Specifying `rules` or `masks` as a dictionary can allow you to override the rule/mask for a specific key when inheriting from a definition.
- When masking a table, if a non-unique key is specified for the `mask_table` task alongside `hash_columns`, and the value of the `hash_columns` is different for more than one row with the same key value, the final masked values will arbitrarily depend on the order that update statements are executed. This can be avoided by including the targeted `hash_columns` as part of a composite key for the `mask_table` task, as shown in the sketch below.
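For example, a minimal sketch of this approach, based on the earlier example (the non-unique key column `user_code` is illustrative; `first_name` is both the hash column and part of the composite key, so rows sharing a `user_code` but with different first names are masked deterministically rather than in an arbitrary order):
tasks:
  - type: mask_table
    table: users
    key:
      - user_code     # non-unique key column
      - first_name    # targeted hash column, included in the composite key
    rules:
      - column: date_of_birth
        hash_columns:
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'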
Mask a primary key or unique key
The mask_unique_key
task type can be used to mask the values in a primary key or unique key.
Masking of a primary key or unique key has the requirement that all masked values are unique, which
requires the use of this special-purpose task type.
The mask_unique_key
task type replaces all non-null rows of the target key with new, unique
values, generated in accordance with a user-specified format. The target primary or unique key
columns and associated foreign key columns are updated with these unique replacement values in a
single operation to maintain referential integrity.
Notes:
- Each `mask_unique_key` task will mask the members of a single primary key or unique key constraint. Multiple `mask_unique_key` tasks are required to mask multiple unique keys on a single table.
- The `mask_unique_key` task can be used on tables with up to 200,000,000 non-null rows.
- DataMasque will only cascade to foreign keys that directly reference the target key. DataMasque does not currently support automatic cascading to any foreign keys beyond direct foreign key references.
- When applied to a composite key, replacement values are only generated for rows that contain a complete, non-null key. For any null or partially null rows, all columns of the target key will be set to `NULL`.
- The `mask_unique_key` task must only be applied to columns which are in fact unique (i.e. the target key columns have a `PRIMARY KEY` or `UNIQUE` constraint enforced). Unique keys that have multiple `NULL` rows (e.g. using a filtered unique index in SQL Server) are allowed; such rows will not be modified by this task.
- When masking a clustered index on SQL Server, the performance of `mask_unique_key` can be significantly improved by disabling all other indexes and constraints on the target table for the duration of the task. It is recommended to implement this in your ruleset using `run_sql` tasks before and after the `mask_unique_key` task to disable then re-enable these constraints.
- The `mask_unique_key` task cannot modify SQL Server columns created with the `IDENTITY` property, or Oracle / PostgreSQL columns created with `GENERATED ALWAYS AS IDENTITY`.
- Use of `mask_unique_key` for Amazon Redshift or Microsoft SQL Server (Linked Server) databases is not currently supported in DataMasque.
- Due to the random assignment of replacement values, it is possible (though generally rare) that a row may be assigned a masked value that is identical to its pre-masking value. In these cases, the masking is still effective, as an attacker will not be able to identify which rows' values were replaced with an identical value. However, if you need to guarantee that all masked values are different from their pre-masking values, you should use `min` and `max` parameters to ensure the range of possible output values from your format string does not overlap with the range of pre-masking values in your database, as in the sketch below.
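For instance, assuming the existing account numbers are all five digits or fewer (an illustrative assumption about the source data), setting the format string minimum to 100,000 guarantees no masked value can collide with a pre-masking value:
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: AccountNumber
        # All replacement values are >= 100000, so they cannot equal the
        # existing (assumed <= 99999) account numbers.
        format: '{!int,100000:}'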
Warning: The `mask_unique_key` task type must not be run in parallel with tasks that operate on any of the following:
- The target `table` of the `mask_unique_key` task.
- Tables containing foreign keys that reference the `target_key` columns.
- Any tables specified in `additional_cascades`.
Parameters
Each task with type `mask_unique_key` is defined by the following parameters:
- `table` (required): The name of the database table that contains the primary key or unique key to be masked.
- `target_key` (required): A list of items defining each column that makes up the primary or unique key, and the format in which replacement values will be generated for that column. Composite keys can be masked by including multiple columns and formats in this list. Each item has the following attributes:
  - `column` (required): The name of the column to be masked.
  - `format` (optional): The format which will be used to generate replacement values for the column. See Format string syntax for details. Defaults to `'{!int}'`.
- `additional_cascades` (optional): Use this parameter to propagate masked values to implied foreign key columns. Implied foreign keys are dependencies that exist between tables but are not enforced by foreign key constraints, and hence are not defined in the database. This is a list of implied foreign keys to the `target_key`; masked values will be cascaded to these columns. See Cascading of masked values for more details on how this works. Each `additional_cascades` item has the following attributes:
  - `table` (required): The name of the table containing the cascade target columns, which have an implicit reference to the `target_key` of this task. The table name can be prefixed with a schema name to cascade to a table in another schema.
  - `columns` (required): A list of column dictionaries, each describing the relationship between a column of the target key and a column on the cascade target table. Each column mapping item has the following attributes:
    - `source` (required): The name of a column in the target key from which masked values will be cascaded to the corresponding target column.
    - `target` (required): The name of a column on the cascade target table to which masked values from the source column will be cascaded.
- `batch_size` (optional): To avoid excessive memory consumption when masking large tables, DataMasque generates replacement values in batches. This value controls the maximum number of unique values that are generated in a single batch. In general, the default of 50,000 will be acceptable for most use cases.
Note:
- When using `additional_cascades` to propagate composite primary or unique key columns to implied foreign key columns, you need to specify all the columns in the `additional_cascades` parameter that correspond to the referenced primary or unique key columns.
  Warning! Not specifying all the columns that correspond to the referenced primary or unique key columns will cause data propagation to fail from the referenced composite primary or unique key columns to the composite implied foreign keys. In other words, partial cascades that reference a subset of the target key columns will result in data not being propagated to the target table, resulting in inconsistent data between the two tables.
  For example: a composite unique key in a PhoneNumbers table consists of the Prefix and LineNumber columns, which are referenced by the PhPrefix and PhLineNumber columns in the CallLogs table, but without a foreign key constraint. It is therefore an implied foreign key that requires using the `additional_cascades` parameter to propagate the masked unique key values to ensure data integrity across the tables.
  A ruleset needs to be written to specify all corresponding implicit foreign key columns in the `additional_cascades` parameter so that the masked unique keys will be propagated collectively to the foreign keys:

  version: "1.0"
  tasks:
    - type: mask_unique_key
      table: PhoneNumbers
      target_key:
        - column: Prefix          # part of the composite unique key constraint
          format: "{!int, 1:150, pad}"
        - column: LineNumber      # part of the composite unique key constraint
          format: "{!int, 50001:100000, pad}"
      additional_cascades:
        - table: CallLogs
          columns:
            # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly.
            - source: Prefix      # UK a
              target: PhPrefix
            - source: LineNumber  # UK a
              target: PhLineNumber

- However, if they are not composite foreign keys but individual foreign keys, a ruleset needs to be written to propagate the masked unique keys individually:

  version: "1.0"
  tasks:
    - type: mask_unique_key
      table: PhoneNumbers
      target_key:
        - column: Prefix          # has its own unique key constraint
          format: "{!int, 1:150, pad}"
        - column: LineNumber      # has its own unique key constraint
          format: "{!int, 50001:100000, pad}"
      additional_cascades:
        # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly.
        - table: CallLogs
          columns:
            - source: Prefix      # UK a
              target: PhPrefix
        - table: CallLogs
          columns:
            - source: LineNumber  # UK b
              target: PhLineNumber
Example 1
The following example will mask the primary key column AccountNumber
of the Users
table with unique
replacement values. Another table, Transactions
, also has a column named AccountNumber
which
has a foreign key relationship to the AccountNumber
column of the Users
table.
Account numbers will be generated with at least 6 digits; the minimum value being 100,000
and the
maximum value growing as required depending on the number of rows in the table.
In this ruleset below, you only need to specify masking rules for the primary key column, AccountNumber
,
to be masked. You do not need to explicitly define the foreign key columns to propagate the replacement values
to in the ruleset. DataMasque will automatically detect primary key and foreign key relationships in the database
and propagate the replacement values to any related foreign key columns; in this case, the new values for
AccountNumber
in the Users
table are implicitly propagated to the AccountNumber
column in the Transactions
table.
version: '1.0'
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: '"AccountNumber"'
        format: '{!int,100000:}' # Account numbers will be generated with at least 6 digits
Show result
(Users table and Transactions table, before and after masking; result rows not reproduced here.)
Example 2
The following example will mask a composite unique key of a PhoneNumbers
table. The composite key
consists of the following columns:
- `Prefix VARCHAR(3)`: containing a zero-padded integer in the range (1-150), e.g. 001, 002, etc.
- `LineNumber VARCHAR(6)`: containing a zero-padded integer in the range (50,001-100,000).
Values matching these specific formats can be generated using the following format strings:
- `'{!int, 1:150, pad}'`: Generates integers between 1-150 (inclusive), zero-padded to a fixed width of 3 characters.
- `'{!int, 50001:100000, pad}'`: Generates integers between 50,001-100,000 (inclusive), zero-padded to a fixed width of 6 characters.
Because both of these formats have an upper value bound, we must consider the maximum number of
unique composite values that are available in this space. Multiplying the number of values in the
two ranges (150 * 50,000
), we can determine that these two format strings will supply us with
7,500,000 rows of unique composite values. As a result, this task would fail if applied to a table
containing more than 7,500,000 (non-null) rows.
In addition to the PhoneNumbers
table, this schema also includes a CallLogs
table containing
the columns PhPrefix
and PhLineNumber
, which are references to the values in the Prefix
and
LineNumber
columns of the PhoneNumbers
table. However, due to specific requirements of this
schema, these references are not defined using a foreign key constraint. Without a foreign key,
DataMasque will not automatically propagate the replacement values generated for the PhoneNumbers
table to the CallLogs
table.
In order to ensure the referential integrity of these implicit
references is maintained during masking, this example uses additional_cascades
to instruct
DataMasque on how to cascade updated values to these columns. DataMasque will take the values of Prefix
and LineNumber
columns of the PhoneNumbers
table and propagate these values to the PhPrefix
and PhLineNumber
columns of the CallLogs
table.
version: '1.0'
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix
        format: "{!int, 1:150, pad}"
      - column: LineNumber
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      - table: CallLogs
        columns:
          - source: Prefix
            target: PhPrefix
          - source: LineNumber
            target: PhLineNumber
Show result
(PhoneNumbers table and CallLogs table, before and after masking; result rows not reproduced here.)
Example 3
Consider a database with two tables and two schemas: Customers
in the Accounts
schema; and Transactions
in the
Sales
schema. The following example will mask the primary key column CustomerId
of the Accounts.Customers
table.
However, any changes made to this CustomerId
column must also be reflected on the Customer
column of the
Sales.Transactions
table. Due to specific requirements of this database, these cross-schema references are not defined
using a foreign key constraint. In order to maintain referential integrity, this example uses additional_cascades
to
instruct DataMasque how to cascade updated values to the relevant column present in the other schema.
The Customer ID consists of 3 letters, followed by a hyphen, then a 4-digit number. This will be constructed using a format string to ensure any values generated conform to the required standards.
version: "1.0"
tasks:
- type: mask_unique_key
table: Accounts.Customers
target_key:
- column: CustomerId
format: "{[a-z],3}-{[0-9],4}"
additional_cascades:
- table: Sales.Transactions
columns:
- source: CustomerId
target: Customer
Show result
(Customers table and Transactions table, before and after masking; result rows not reproduced here.)
Format string syntax
The format string syntax used by DataMasque to generate unique replacement values currently supports format strings that are a combination of alphanumeric characters and symbols. This can be used to generate key values that combine randomly generated portions with fixed formatting, producing a key that matches any format required in your key columns.
Format strings consist of fixed values, as well as variable values that are declared within braces {}. Values within braces can be provided with a set of characters to use, followed by the number of characters to generate. Any values not declared within braces are fixed values.
Format strings can be constructed using character sets. Character sets are wrapped in braces and followed by a comma and a number to indicate how many characters from the set are to be generated.
Character Set | Description |
---|---|
[a-z] | Lower case alphabetical characters. |
[A-Z] | Upper case alphabetical characters. |
[0-9] | Numerical characters. |
[aeiou] | Any vowel. Characters can be individually specified without using a range. |
An example format string is shown below.
format: "{[a-z],2}_{[A-Z],2}-{[a-zA-Z],3}#{[0-9],5}"
In this example, there are four sets of curly braces within the format string, each providing a set of characters followed by a number indicating how many of these characters to generate.
- The first braces specify `{[a-z],2}`, which will generate a string of 2 lower case alphabetical characters.
- After this, there is an underscore outside the braces, which means all generated values will have an underscore after the 2 alphabetical characters.
- The second braces specify `{[A-Z],2}`, which will generate a string of 2 upper case alphabetical characters.
- After this second variable, there is a hyphen `-`, meaning a hyphen will always be present after the 2 upper case values.
- The third braces specify `{[a-zA-Z],3}`, which will generate a string of length 3 consisting of both lower case and upper case alphabetical characters.
- After this, there is a hash `#`, meaning a hash character will always be placed after the third generated string.
- The final braces specify `{[0-9],5}`, meaning a 5 digit number is placed at the end of the string.
The following values are some example outputs that may be generated using the format string above:
ab_TJ-RaK#10496
pt_oq-TRu#49511
iu_QE-unT#67312
nd_UL-bES#97638
Example
In this example, we wish to mask a series of number plates.
A number plate consists of 3 alphabet characters, followed by 3 numerical digits. We wish to mask the first letter of each number plate with X, followed by 2 random alphabetical characters, followed by a hyphen and a 3 digit number.
A snippet of the table is shown below, where number_plate
is a unique key of the table.
car_registration
Table
number_plate | car_owner |
---|---|
AAA-111 | Anastasia |
BBB-222 | Bill |
CCC-333 | Chris |
DDD-444 | Judith |
EEE-444 | Gordon |
In this case, we will use the following string format.
format: "X{[A-Z],2}-{[0-9],3}"
We are generating a fixed value of X
, followed by 2 alphabetical characters, as defined by {[A-Z],2}
. After this,
there is a hyphen outside of braces, so a static value of a hyphen will always be generated. After this, 3 random
numerical digits are generated.
A ruleset utilising this string format in a `mask_unique_key` task is shown below.
version: "1.0"
tasks:
- type: mask_unique_key
table: car_registration
target_key:
- column: number_plate
format: "X{[A-Z],2}-{[0-9],3}"
Show result
(car_registration table, before and after masking; result rows not reproduced here.)
Integer string format syntax
The !int
operator can be used to generate integers. In its most basic use with no arguments, the format string
{!int}
will generate integers from one to infinity. Extra parameters can be added to set the output range or pad the
output.
Range (`min:max`):
The range defines all possible integer values that may be generated by the integer generator. This is an optional parameter, defaulting to `1:` (min=1, max=unbounded).
- `min` (optional): The minimum value which will be generated (inclusive). Defaults to `1`.
- `max` (optional): The maximum value which will be generated (inclusive). When this value is not specified the maximum value will be unbounded, meaning it will grow depending on the number of values required.
For example:
- `{!int, 5:}`: generate integers from `5` (inclusive) to infinity.
- `{!int, :100}`: generate integers from `1` to `100` (inclusive).
- `{!int, 20:80}`: generate integers from `20` to `80` (inclusive).
Zero-pad (`pad`):
- `pad` (optional): When specified, zero-padding will be applied to generated integers - resulting in fixed character width replacement values (e.g. 001, 002, …, 999). The zero-pad width is determined by the width of the `max` value, and therefore is only a valid option when a `max` value is specified.
For example:
- `{!int, :1000, pad}`: generate strings in the format `0001`, `0002`, etc, up to `1000`.
Note: Always wrap format strings in either single or double quotes. Leaving format strings unquoted in the ruleset will result in invalid YAML due to the opening `{` character, which has reserved usage in YAML. See below for examples of a correctly quoted format string:
format: '{!int}'        # Single quotation marks
format: "{!int, pad}"   # or double quotation marks
Hex string format syntax
Integers can be generated and output in a hexadecimal format by using the `!hex` operator. The range and `pad` options apply in the same way as for standard integer generation, however range values are interpreted as hexadecimal rather than decimal.
For example:
- `{!hex}`: generate hex strings from `1` to infinity, i.e. `1`, `2`, …, `a`, `b`, …, `ff`, `100`, etc.
- `{!hex, 10:100, pad}`: generate hex strings from `0x10` to `0x100` (inclusive) with padding, i.e. `010`, `011`, …, `0fe`, `0ff`, `100`.
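As an illustration, a minimal sketch of a `mask_unique_key` task that regenerates a hexadecimal identifier column using this operator (the table and column names are hypothetical):
tasks:
  - type: mask_unique_key
    table: api_tokens
    target_key:
      - column: token_id
        # Zero-padded hex values from 0x1 to 0xffff, e.g. 0001, 00a3, ffff.
        format: '{!hex, 1:ffff, pad}'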
UUID string format syntax
You may choose to generate values in the Universal Unique Identifier (UUID) format by declaring a `!uuid` format string.
A UUID is a string of 32 hexadecimal digits (0 to 9, a to f), separated by hyphens into groups of 8-4-4-4-12 characters. An example UUID would be `12345678-90ab-cdef-1234-567890abcdef`.
In order to generate a UUID as the unique key, simply specify `!uuid` in the format as shown below.
format: "{!uuid}"
You may also specify a prefix within the format string of up to 8 characters. This will ensure that the first characters in the UUID are always static. For example, specifying `format: "{!uuid,aaaa}"` will cause the first 4 characters of every UUID generated by the ruleset to be `a`.
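With the `aaaa` prefix above, generated values would all take a shape like the following illustrative samples (the remaining digits are random):
aaaa5678-90ab-cdef-1234-567890abcdef
aaaa0f3c-18d2-4b6e-9a41-2c7de89b10f5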
Alternatives string format syntax
You can have DataMasque select one value from a set of alternatives for each generated value.
Such a segment can be specified by wrapping your set of pipe (`|`) separated alternatives in parentheses `()`.
For example, `{(EN|FR)}-{!int}` can be used to generate an integer prefixed by either `EN` or `FR`.
At least two alternatives must be specified.
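Illustrative sample values that this format string could produce:
EN-1
FR-2
EN-3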
Note: This should only be used with the `mask_table` task, as it will not satisfy the uniqueness requirement of the `mask_unique_key` task.
Cascading of masked values
Referential integrity of data references to the target_key
of a mask_unique_key
task is
maintained by "cascading" the masked replacement values to each reference. DataMasque will perform
this cascade automatically for relationships defined by a database foreign key constraint. Masked
values can also be cascaded to columns that are not members of such a foreign key constraint by
using the additional_cascades
feature.
Any rows of a cascade target (child table) which contain values that are not present in the
target_key
(on the parent table) will have their cascade target
columns set to NULL
. This
situation may occur in one of the following cases:
The cascade target is a foreign key that has at some point been disabled, had values updated, then been re-enabled without being checked / validated. i.e.
- For Oracle the constraint was re-enabled using
NOVALIDATE
. - For Microsoft SQL Server, the constraint was re-enabled without using
WITH CHECK
. - For PostgreSQL, the constraint was dropped and recreated instead of being disabled and re-enabled.
- For Oracle the constraint was re-enabled using
The cascade target is an implicit reference without database constraints.
The behaviour is designed to ensure that no rows are left unmasked on the cascade target.
Important!
- DataMasque will only automatically cascade to foreign keys that are enabled. Foreign keys that are present, but disabled at the time of masking, will be excluded.
- For Oracle databases, when defining `additional_cascades` for `mask_unique_key` tasks, all columns of the `target_key` must be referenced as `source` columns. Partial cascades that reference a subset of the target key columns are disallowed.
- For an example of `additional_cascades`, please refer to the notes under the Parameters section of Mask a primary key or unique key.
Build a temporary table
If you need to repeatedly join multiple tables when masking tables,
or you need to perform some custom SQL
transformations to column values, then you may wish to use the
build_temp_table
task type to create a temporary table which can then be
accessed via a join during a mask_table
task:
Parameters
- `table_name` (required): The name of the temporary table to create in the database. You will need to use this name when referencing this temporary table later (e.g. in `joins` and masking rules). The table name can be prefixed with a schema name if the temporary table should be created in a schema other than the user's default schema.
- `sql_select_statement` (this OR `sql_select_file` required): A string containing a `SELECT` statement to define the contents of the temporary table. To break the statement across multiple lines, you may use multi-line YAML syntax (`|-` or `>-`).
- `sql_select_file` (this OR `sql_select_statement` required): The name of a user uploaded SQL script file containing a `SELECT` query to define the contents of the temporary table. See the Files guide for more information on uploading SQL script files. Use this parameter if you have a complex / long query, or you wish to share the same query between many rulesets.
Example (`sql_select_statement`)
version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_statement: >-
      SELECT accounts.account_id, address.city
      FROM accounts
      INNER JOIN address
      ON accounts.address_id = address.id
      WHERE accounts.country = 'New Zealand';
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city
Example (`sql_select_file`)
version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_file: create_temp_table.sql
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city
Notes:
- DataMasque will create the temporary tables before applying any masks, and delete them after all tables have been masked. DataMasque will also ensure the temporary tables do not already exist in the database (removing existing temporary tables with the same name if needed). The temporary tables you define will only be available for use in joins, and cannot be masked themselves.
- For Microsoft SQL Server databases, temporary table names must begin with the `##` characters, as they will be created as 'Global Temporary Tables' so that they are visible to all parallel masking connections. However, in YAML the `#` character begins an inline comment, so the temporary table name must be wrapped in double or single quotes (e.g. `table: '##my_temporary_table'`).
- For Oracle and PostgreSQL databases, 'temporary tables' are created as regular tables so that they are visible to all parallel masking connections.
- The `build_temp_table` task type is not currently supported for Microsoft SQL Server (Linked Server) databases.
Run SQL
Use the run_sql
task type if you need to:
- Run SQL scripts to prepare the database for masking.
- Clean up after a masking run (e.g. disabling/enabling triggers).
- Run simple update operations.
You can supply SQL for DataMasque to execute either as a script file (see Files guide), or inline in the ruleset:
Parameters
- `sql` (this OR `sql_file` required): An SQL script to be executed. For multi-line scripts, you may use the YAML block style syntax (`|-`).
- `sql_file` (this OR `sql` required): The name of a user-provided file containing an SQL script to be executed (see Files guide). Use this parameter if you have large blocks of SQL to run, or scripts that you wish to share between many rulesets.
Example Microsoft SQL Server (`sql`)
Note: This example uses Microsoft SQL Server specific syntax, as master has been specified as the target database.
version: '1.0'
tasks:
  - type: run_sql
    sql: |-
      USE [master];
      ALTER DATABASE eCommerce SET RECOVERY SIMPLE WITH NO_WAIT;
      USE [eCommerce];
      ALTER TABLE [SalesRecords].[Customer] DROP CONSTRAINT [FK_SALESRECORDS_CUSTOMER];
      ALTER TABLE [Invoices].[Customer] DROP CONSTRAINT [FK_INVOICES_CUSTOMER];
Example (`sql_file`)
tasks:
  - type: run_sql
    sql_file: pre_script_1.sql
Notes:
- The `run_sql` task type executes in autocommit mode, and will exit on the first error encountered.
- The `run_sql` task type does not run in dry run mode.
- The `run_sql` task type is not currently supported for Microsoft SQL Server (Linked Server) databases.
- For PostgreSQL and MySQL connections, the SQL interpreter considers a colon followed by letters or numbers as a bound parameter. The colon can be escaped with a `\` before it. For example, when attempting to insert JSON data with `INSERT INTO table_name (column_name) VALUES('{"is_real" :true, "key":"value"}'::json)`, `run_sql` will interpret `true` as a bound parameter. To fix this, escape the bound parameter by adding a backslash (`\`) before the colon (`:`) as follows: `INSERT INTO table_name (column_name) VALUES('{"is_real" \:true, "key":"value"}'::json)`
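A minimal sketch of how the escaped statement above could appear inside a `run_sql` task (the table and column names are taken from the example above and are illustrative):
tasks:
  - type: run_sql
    sql: |-
      INSERT INTO table_name (column_name)
      VALUES('{"is_real" \:true, "key":"value"}'::json);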
Notes for Oracle:
- It is recommended to test the execution of your SQL script with Oracle SQLPlus before use in a `run_sql` task.
- DataMasque appends "WHENEVER SQLERROR EXIT SQL.SQLCODE" to the beginning of the SQL script, so the `run_sql` task will exit on the first error encountered.
- Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed `run_sql` task.
- `run_sql` does not use the schema specified in the connection configuration; instead it will default to the schema of the user. If you wish to change schema, specify it in the script with `SET SCHEMA`.
Notes for Microsoft SQL Server:
- It is recommended to test the execution of your SQL script with Microsoft `sqlcmd` or SQL Server Management Studio before use in a `run_sql` task.
- Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed `run_sql` task.
Notes for PostgreSQL:
- It is recommended to test the execution of your SQL script with `psql` before use in a `run_sql` task.
- A `run_sql` task will be executed with a simple query cycle, where statements are executed in a single transaction (unless explicit transaction control commands are included to force a different behaviour).
Notes for Redshift:
- Using multiple SQL statements in a single `run_sql` task type is not supported currently.
Notes for MySQL:
- It is recommended to test the execution of your SQL script with the MySQL shell before use in a `run_sql` task.
- Executing `run_sql` tasks with more than one statement may not raise errors upon failure. If the first statement executes correctly but subsequent statements fail, errors may not be raised.
Truncate a table
Use the truncate_table
task type to specify tables to be truncated by
DataMasque. All rows will be deleted, but the table structure will be left in
place.
Parameters
- `table` (required): The name of the table to truncate. The table name can be prefixed with a schema name to reference a table in another schema.
Example
tasks:
  - type: truncate_table
    table: history_table
...
Notes:
- The `truncate_table` task type does not run in dry run mode.
- The `truncate_table` task type is not currently supported for Microsoft SQL Server (Linked Server) databases.
Parallel Tasks
Using the parallel
task type, you can specify a block of tasks to be executed
in parallel, spread across as many workers as are available.
Each parallel task distributes to a maximum of 10 sub-tasks. It is recommended to begin testing parallelisation with at most 4 tasks in parallel, then increase parallelisation if the database has more capacity.
Parallel tasks can be nested inside other serial/parallel tasks.
Parameters
- `tasks` (required): A set of tasks to perform in parallel.
Example
tasks:
  - type: parallel
    tasks:
      - type: mask_table
        table: employees
        key: id
        rules:
          - column: 'name'
            masks:
              - type: from_fixed
                value: 'REDACTED'
      - type: mask_table
        table: customers
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
...
Warning: You should not mask the same table in multiple tasks (including `mask_table` and `run_sql` tasks) in parallel, as this could result in data being incorrectly masked.
Serial Tasks
Although tasks are performed serially in the order they are listed in the ruleset by default, you can specify a block of tasks to be performed in serial within a parallel block. This is useful when a subset of parallelisable tasks have dependencies that mean they must be executed in sequence.
Serial tasks can be nested inside other serial/parallel tasks.
Parameters
- `tasks` (required): A set of tasks to perform in series.
Example
tasks:
  - type: parallel
    tasks:
      - type: serial
        tasks:
          - type: run_sql
            sql_file: pre_employees_script.sql
          - type: mask_table
            table: 'employees'
            key: id
            rules:
              - column: 'name'
                masks:
                  - type: from_fixed
                    value: 'REDACTED'
      - type: mask_table
        table: 'customers'
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
...
Masking files
Each `mask_file` or `mask_tabular_file` task specifies the masking rules to apply to each file in the base directory and/or any subdirectories, as well as any files/directories intended to be skipped or included, and any conditions required to define which data to mask in the masking process.
Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value
from each mask is passed as the input to the next mask in the sequence.
Note If the source connection and destination connection are of the same type and have the same base directory, the files will be overwritten. The list of files is read at the start of the masking run, so new files added during the masking run will not be masked and will not be present in the destination.
File task types
A ruleset consists of a list of tasks to be performed in sequence on the target data source. After each file is masked, it is written to the selected data destination.
Object file masks
Each mask_file
task defines masking operations to be performed on a file or set of files. More detail is provided
about these tasks under the Masking files section.
Parameters
Each task with type `mask_file` is defined by the following parameters:
- `rules` (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the file. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
  - `masks` (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the file. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
- `recurse` (optional): A boolean value; when set to true, any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults to `false`.
- `workers` (optional): The number of parallel workers to use for this masking task. Defaults to 1.
- `skip` (optional): Specifies files/directories to leave alone and not mask.
  - `regex` (optional): Specifies a regex; files whose names match it will be skipped.
  - `glob` (optional): Specifies a glob; folders whose names match it will be skipped.
- `include` (optional): Specifies files/directories to include.
  - `regex` (optional): Specifies a regex; files whose names match it will be included in the masking run.
  - `glob` (optional): Specifies a glob; directories whose names match it will be included in the masking run.
- `encoding` (optional): The encoding to use when reading and writing files. Defaults to `UTF-8`. Refer to Python Standard Encodings for a list of supported encodings.
For more information about the ordering of `skip` and `include`, please refer to Include/Skip.
Note: `regex`/`glob` patterns are matched against the path relative to the base directory specified in the source connection; consider adding `.*` (regex) or `*` (glob) to the beginning of the expression for matching. For example: if the structure is /path1/path2/target_file.json, the base directory is path1/ and `recurse: true` is set in the ruleset, the `regex`/`glob` will try to match path2/target_file.json. When including a path by specifying a `glob` such as target_path/*, the `recurse` option needs to be set to `true`, otherwise the included path won't be entered and the files it contains will not be masked, as shown in the sketch below.
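For instance, a minimal sketch (the directory name target_path is illustrative) that enters a subdirectory and includes only the JSON files within it:
tasks:
  - type: mask_file
    recurse: true              # required so target_path/ is entered
    include:
      - glob: 'target_path/*.json'
    rules:
      - masks:
          - type: from_fixed
            value: REDACTED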
Supported file types
In general, mask_file
has been designed to mask JSON or XML files. Each file is loaded as a string and passed to the
masks
. Therefore, to mask a JSON file, a json
mask would be implemented, like the following example:
version: "1.0"
tasks:
- type: mask_file
recurse: true
skip:
- regex: '^(.*)2.json'
- glob: "input/*"
include:
- glob: "other_inputs/*.json"
rules:
- masks:
- type: json
transforms:
- path: ['name']
masks:
- type: from_fixed
value: REDACTED
This would replace the root name
attribute in the JSON with the text REDACTED.
Similarly, for XML files, use an xml
mask:
version: "1.0"
tasks:
- type: mask_file
recurse: true
skip:
- regex: '^(.*)2.xml'
- glob: "input/*"
include:
- glob: "other_inputs/*.xml"
rules:
- masks:
- type: xml
transforms:
- path: 'User/Name'
node_transforms:
- type: text
masks:
- type: from_fixed
value: REDACTED
This would replace the content of the node(s) at User/Name
with the text REDACTED.
To mask other types of files, basic redaction is possible. For example, to replace the contents of every txt
file with
the text REDACTED:
version: "1.0"
tasks:
- type: mask_file
recurse: true
include:
- glob: "*.txt"
rules:
- masks:
- type: from_fixed
value: REDACTED
It is possible to use any mask that accepts text input (or no input), although their effectiveness will depend on the size and content of the input file.
Note also, that files that have not been processed will not be copied from the source to the destination. That is,
DataMasque will either load a file (based on skip
/include
rules), mask it, then copy it to the destination, or
it will ignore the file. Unmasked files will not be copied to the destination.
Tabular file masks
Each mask_tabular_file
task defines masking operations to be performed on a file or set of files (CSV, Parquet or
fixed-width columns). More detail is provided about these tasks under the Masking files section.
Parameters
Each task with type `mask_tabular_file` is defined by the following parameters:
- `rules` (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the file. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
  - `column` (required): A column within the tabular file intended for masking. This is the header row of the column (for CSVs) or name of the column (for Parquet files).
  - `masks` (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
- `recurse` (optional): A boolean value; when set to true, any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults to `false`.
- `workers` (optional): The number of parallel workers to use for this masking task. Defaults to 1.
- `skip` (optional): Specifies files/directories to leave alone and not mask.
  - `regex` (optional): Specifies a regex; files whose names match it will be skipped.
  - `glob` (optional): Specifies a glob; directories whose names match it will be skipped.
- `include` (optional): Specifies files/directories to include.
  - `regex` (optional): Specifies a regex; files whose names match it will be included in the masking run.
  - `glob` (optional): Specifies a glob; directories whose names match it will be included in the masking run.
- `encoding` (optional): The encoding to use when reading and writing files. Defaults to `UTF-8`. Refer to Python Standard Encodings for a list of supported encodings.
- `fixed_width_extension` (optional): The file extension that fixed-width files have. Not required if no fixed-width files are to be masked. Should not include a leading `.` (e.g. specify `txt` not `.txt`).
- `fixed_width_columns_indexes` (optional): An array of two-element arrays of start and end indexes of the fixed-width columns. Required if `fixed_width_extension` is specified.
- `fixed_width_column_names` (optional): An array of strings defining the names of the fixed-width columns, used to refer to them in masking rules. Required if `fixed_width_extension` is specified, and must match the length of `fixed_width_columns_indexes`.
- `fixed_width_too_wide_action` (optional): The action to take if masked data exceeds the width of the column; can be either `truncate` to truncate the value to fit in the column, or `error` to raise an error and stop the masking run. Defaults to `truncate`.
- `fixed_width_line_ending` (optional): The line ending to use when writing out the fixed-width data. Will attempt to be detected from the input file, otherwise defaults to `\n`.
For more information about:
- How tabular file types are detected, see Tabular File Type Detection.
- Parameters for fixed-width file masking, see Fixed Width File Masking Parameters.
- The ordering of `skip` and `include`, please refer to Include/Skip.
- Table `join`s are not supported in tabular file masking.
Note: `regex`/`glob` patterns are matched against the path relative to the base directory specified in the source connection; consider adding `.*` (regex) or `*` (glob) to the beginning of the expression for matching. For example: if the structure is /path1/path2/target_file.json, the base directory is path1/ and `recurse: true` is set in the ruleset, the `regex`/`glob` will try to match path2/target_file.json. When including a path by specifying a `glob` such as target_path/*, the `recurse` option needs to be set to `true`, otherwise the included path won't be entered and the files it contains will not be masked.
version: "1.0"
tasks:
- type: mask_tabular_file
recurse: true
skip:
- regex: '^(.*)2.json'
- glob: "input/*"
include:
- glob: "other_inputs/*"
rules:
- column: name
masks:
- type: from_fixed
value: REDACTED
Tabular File Type Detection
DataMasque uses file extensions to determine how tabular files are loaded for masking. The detection is not
case-sensitive. Files with the extension csv
are treated as CSV files. Files with extension parquet
are treated as
Apache Parquet files.
Note: CSV files require header columns for tabular masking, as the header columns are used as column names during masking. CSV and fixed-width files are string based, therefore values should be cast to other types if they are used with masks that require specific types (e.g. numeric_bucket). To do this, use a typecast mask; for more information please refer to Typecast.
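As an illustration, here is a minimal sketch (the age column is hypothetical) that casts the string value read from a CSV to an integer before applying a numeric_bucket mask, then casts it back to a string for writing:
version: "1.0"
tasks:
  - type: mask_tabular_file
    recurse: true
    rules:
      - column: age
        masks:
          # Cast the CSV string value to an integer so numeric_bucket can operate on it
          - type: typecast
            typecast_as: integer
          - type: numeric_bucket
            buckets: [18, 36, 65]
          # Cast back to a string before writing to the file
          - type: typecast
            typecast_as: string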
DataMasque will only attempt to load fixed width files if fixed_width_extension
is specified, and will treat any files
with this extension as fixed-width. See also Fixed Width File Masking Parameters.
Once files are loaded they are all masked in the same way, that is, rules are executed and applied on a per row/column basis regardless of the original source type. Data will be written back out in the same format as it was read.
Fixed Width File Masking Parameters
Masking of fixed-width files is only attempted if fixed_width_extension
is specified. If fixed_width_extension
is
present in the ruleset without fixed_width_columns_indexes
and fixed_width_column_names
, then an error will be
raised. However, it is valid to have fixed_width_columns_indexes
and fixed_width_column_names
missing if
fixed_width_extension
is also absent.
If fixed_width_extension
is set, then DataMasque will treat any files with that extension as fixed-width and load them
based on the other fixed-width options. To assist in explaining the rules, consider an example file called users.txt
.
This is the content:
Adam 2010-01-01 AAA-1111
Brenda 2010-01-01 EEE-5555
Charlie 2010-02-02 GGG-7777
It has 3 columns, the first containing a name, which is from index 0
to 8
. The second column contains a date and
spans from 8
to 19
. The final column contains a transaction ID and spans from index 19
to 27
.
Note these indexes are specified to be contiguous as some fixed-width formats require contiguous columns, therefore a
trailing space is included in the first and second columns. DataMasque automatically strips leading and trailing spaces
when the data is read. Contiguous columns are not required though, so the same result could be achieved with indexes
(0, 7)
, (8, 18)
, (19, 27)
. When non-contiguous columns are specified DataMasque inserts spaces in between columns.
Since fixed-width files do not have column headers, the ruleset must also specify these. They can be any arbitrary
valid column identifier (i.e. alphanumeric string without special characters) and are used to identify the columns in
the masking rules. In this case they will be named name
, date
and transaction_id
.
Considering these rules will yield a ruleset like this:
version: "1.0"
tasks:
- type: mask_tabular_file
recurse: true
fixed_width_extension: txt
fixed_width_columns_indexes:
- [0, 8]
- [8, 19]
- [19, 27]
fixed_width_column_names:
- name
- date
- transaction_id
rules:
- column: name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: date
masks:
- type: from_random_date
min: '1950-01-01'
max: '2000-12-31'
- type: typecast
typecast_as: string
date_format: '%Y-%m-%d'
- column: transaction_id
masks:
- type: imitate
Note that when this ruleset is executed, DataMasque will still load any CSVs or Parquet files it encounters, however it
will use the standard loaders instead of applying the fixed width rules. In this case fixed-width rules will only be
used for txt
files.
This can be useful if CSV or Parquet files exist in the source with the same columns and need to be masked in the
same manner. If these files exist, and they shouldn't be masked, then skip
rules should be added to skip them.
Include/Skip
When specifying which files/directories to include
or skip
for a mask_file
/mask_tabular_file
task, the
order in which the lists are checked needs to be considered. The include
items are checked first, followed by the
skip
items, so if an item is present in both include and skip lists, that item will be included in the masking task.
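For example, in the minimal sketch below (the reports/ directory is hypothetical), files under reports/ are still masked even though the same glob appears in both lists, because the include items are checked first:
version: "1.0"
tasks:
  - type: mask_tabular_file
    recurse: true
    include:
      - glob: "reports/*"
    skip:
      - glob: "reports/*"
    rules:
      - column: name
        masks:
          - type: from_fixed
            value: REDACTED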
Masking Tables
Each mask_table
task specifies the masking rules to apply to a
database table, as well as any required joins and any conditionals needed to define which rows
should be masked. Masking rules and masks are applied sequentially in the order
they are listed. When multiple masks are combined in sequence, the output value
from each mask is passed as the input to the next mask in the sequence.
Note: While
mask_table
is suitable for most generic masking requirements, it is not capable of masking unique keys or primary keys. Masking of such values requires the use of the special-purposemask_unique_key
task.
Selecting data to mask
DataMasque provides some advanced features for selecting additional data from the database for use in your masking rules.
Joining tables
When masking a table, you can specify a list of joins (or dictionary mapping keys to joins) that will join the rows of a target table to rows from one or more additional tables, providing you with the additional joined values to use in your masking rules.
Parameters
target_table
(required): The name of the new table you wish to join into the masking data. The target table can be prefixed with a schema name to reference a table in another schema.target_key
(required): The key ontarget_table
to use when performing the join. This can be specified as a single column name or a list of column names.source_table
(required): The name of the table you wish join thetarget_table
with. This could be the table being masked, or another table earlier in the list ofjoins
(allowing you to perform multi-step joins). The source table can be prefixed with a schema name to reference a table in another schema.source_key
(required): The key onsource_table
to use when performing the join. This can be specified as a single column name or a list of column names.
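Both target_key and source_key may also be given as lists to join on a composite key. A minimal sketch (the org_id column is hypothetical):
version: "1.0"
tasks:
  - type: mask_table
    table: Customers
    key: customer_id
    joins:
      # Join on two columns; both key lists must be the same length
      - target_table: Membership
        target_key: [org_id, id]
        source_table: Customers
        source_key: [org_id, membership_id]
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'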
Example
In this example, we have two tables: Customers
and Membership
. We would like to mask the first_name
and
last_name
columns of the Customers
table, but only if the user's membership status is Active.
The Customers
table contains data on customers, including their name and membership ID. The Membership
table
contains the status of the membership of each Customer: either Active or Inactive. The membership_id
column
of the Customers table
has a foreign key relation with the id
column of the Membership
table.
Customers
Table
customer_id | first_name | last_name | membership_id |
---|---|---|---|
1 | Anastasia | Rose | 10001 |
2 | Bill | Jones | 10002 |
3 | Chris | Yang | 10003 |
4 | Judith | Taylor | 10004 |
5 | Gordon | Smith | 10005 |
Membership
Table
id | membership_status |
---|---|
10000 | Active |
10001 | Active |
10002 | Active |
10003 | Inactive |
10004 | Active |
In order to access the membership_status
column of the Membership
table, we need to define a join in our
ruleset from the Customers
table to the Membership
table.
version: "1.0"
tasks:
- type: mask_table
table: Customers
key: customer_id
joins:
- target_table: Membership
target_key: id
source_table: Customers
source_key: membership_id
rules:
- if:
- column: '"Membership".membership_status'
equals: Active
rules:
- column: first_name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: last_name
masks:
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
After performing the join, this will allow us to reference the membership_status
column of the Membership
table in our ruleset. In this example, we can reference the column with Membership.membership_status
. Using
this column, we can use Conditional Masking to only mask the rows of Customers
where the
status of the membership is 'Active'.
Note: To reference a column in a joined table, the table name of the joined table must be added as a prefix to the column name.
The example below utilises the from_file
mask type detailed here to select a random
first name from the DataMasque_firstNames_mixed.csv
and a random last name from the
DataMasque_lastNames.csv
files that can be found in our Supplementary Files user guide. It will
first check if the membership_status
for the customer is 'Active', and if so, masks the two name columns; otherwise,
these columns are left unmasked.
version: "1.0"
tasks:
- type: mask_table
table: Customers
key: customer_id
joins:
- target_table: Membership
target_key: id
source_table: Customers
source_key: membership_id
rules:
- if:
- column: '"Membership".membership_status'
equals: Active
rules:
- column: first_name
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- column: last_name
masks:
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
This example will produce the following results in the Customers
table. The customer with customer_id 3 is not masked, as the status of their membership is 'Inactive' in the joined Membership table; the customer with customer_id 5 is not masked because their membership_id has no matching row in the Membership table.
customer_id | first_name | last_name | membership_id |
---|---|---|---|
1 | Tia | Pallin | 10001 |
2 | Nikau | Koller | 10002 |
3 | Chris | Yang | 10003 |
4 | Anika | Thom | 10004 |
5 | Gordon | Smith | 10005 |
Note:
- For Microsoft SQL Server (Linked Server), joins are not currently supported.
- For Microsoft SQL Server, when using a temporary table, the name of the temporary table must be wrapped in quotation marks, as the # symbol in the YAML editor denotes the beginning of a comment (e.g. target_table: '##my_temporary_table' or '##my_temporary_table.column').
- To reference a temporary table column (e.g. the table_filter_column parameter of the from_file mask type or as a part of hash_columns) you must prefix the column name with its table name (e.g. table.column).
- Any column name specified without a table prefix is assumed to belong to the table being masked (as specified by the table parameter for the task). You cannot specify tables that belong to other schemas.
Conditional masking
You may wish to only apply masks to rows or values that meet some conditions. DataMasque has three different methods for conditionally applying masks to meet different use cases:
Use case | Mechanism |
---|---|
I want to restrict which rows are fetched for masking from the database table. | Where |
I want to apply certain masking rules to only a subset of rows. | If |
I want to skip applying masks to certain column values. | Skip |
Warning: Use of the conditional masking features 'where', 'skip', or 'if/else', may mean your masking rules are not applied to some database rows or values. It is recommended to verify the resulting output satisfies your masking requirements.
Where - restricting database fetches
To restrict which rows are fetched for masking from a database table,
you can specify a where
clause for a masked_table
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: id
where: >-
"users"."role" <> 'administrator'
rules:
...
The where
clause can refer to any columns in the masked table. All columns must be referenced
using their table-qualified name (e.g.
Users.FirstName
). Ensure appropriate quoting is used where required, for example
if the identifier uses a reserved word, starts with an illegal character, or is
a case-sensitive identifier.
Important!
- Any rows excluded by the
where
clause will not be masked.- The SQL you provide for the
where
clause will not be validated before execution, please take care when constructing your SQL.- The SQL you provide for the
where
clause should not end in a semicolon, as this will cause a masking error.- Any string in the where clause variables must be quoted in single quotation marks.
- Joined tables cannot be referenced in the
where
clause currently.
Note for Amazon Redshift:
- Use of
where
clause for Amazon Redshift is not yet supported in DataMasque. This is in our roadmap and will be included in future releases.
If - conditional rules
You can choose to apply certain masking rules to only a subset of rows
within a table, while still allowing other masks to be applied to those
rows. This can be achieved through the use of if
-conditions in rules
lists.
Example
In the following example, the last_name
of all users will be replaced with
'Smith'
, but the user's gender
will determine the mask applied to their
first_name
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: id
rules:
- column: last_name
masks:
- type: from_fixed
value: 'Smith'
- if:
- column: gender
equals: 'female'
rules:
- column: first_name
masks:
- type: from_fixed
value: 'Alice'
else_rules:
- if:
- column: gender
equals: 'male'
rules:
- column: first_name
masks:
- type: from_fixed
value: 'Bob'
else_rules:
- column: first_name
masks:
- type: from_fixed
value: 'Chris'
Parameters:
if
(required): A list of conditions (see below) that must all evaluate astrue
for the nested list of rules to be applied to a row.rules
(required): A nested list of masking rules/nested-if
-conditions (or dictionary mapping labels to rules) that will only be applied to rows that meet the conditions defined underif
.else_rules
(optional): A nested list of masking rules/nested-if
-conditions (or dictionary mapping labels to rules) that will only be applied to rows that do NOT meet the conditions defined underif
.
A condition under if
can contain the following attributes:
column
(required): The database column to check this condition against. The column name can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.equals
(optional): If specified, the condition will only evaluate astrue
if the column value exactly equals the specified value. Data types are also checked (i.e.100
is not equal to"100"
).matches
(optional): If specified, the condition will only evaluate astrue
if the string of the column value matches the specified regular expression. For more details on how to use regular expressions, see Common regular expression patterns.less_than
(optional): If specified, the condition will only evaluate astrue
if the column value is a number or date/time and is less than the given value.less_than_or_equal
(optional): If specified, the condition will only evaluate astrue
if the column value is a number or date/time and is less than or equal to the given value.greater_than
(optional): If specified, the condition will only evaluate astrue
if the column value is a number or date/time and is greater than the given value.greater_than_or_equal
(optional): If specified, the condition will only evaluate astrue
if the column value is a number or date/time and is greater than or equal to the given value.
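For example, a minimal sketch (the age and email columns are hypothetical) that only masks the email of users younger than 18:
version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - if:
          - column: age
            less_than: 18
        rules:
          - column: email
            masks:
              - type: from_fixed
                value: 'redacted@example.com'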
Conditions can also be grouped with the logical operators or
, not
,
and and
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: user_id
rules:
- if:
- and:
- not:
- column: username
matches: "customer_.\w+"
- or:
- column: admin
equals: true
- column: role
equals: "admin"
rules:
- column: username
masks:
- type: from_fixed
value: "Bob"
Note: When using an
if
conditional in rulesets, final row counts will reflect the number of rows processed rather than the number of rows masked. This is due to the rows being filtered on the application side and so all rows fetched will be processed and added to the row count. Alternatively, where
conditionals can be used in the ruleset which will provide an accurate row count of masked rows.
Skip - not masking specific values
A common use-case is to not apply masks to certain values, e.g. to leave
NULL
values or empty strings unchanged. You can choose to not mask
certain values in a column by specifying a number of values to skip
:
version: "1.0"
tasks:
- type: mask_table
table: users
key: user_id
rules:
- column: username
skip:
- null
- ""
- matches: "admin_.\w+"
masks:
- type: from_fixed
value: "Bob"
Any column values that are exactly equal to any of the
string/numeric/null values in the skip
list will not be masked (data
types are also checked, i.e. 100
is not equal to "100"
).
Additionally, string column values matching a regular expression can be
skipped by specifying the skip value as matches: "my_regex"
.
For more details on how to use regular expressions,
see Common regular expression patterns.
Mask types
Masks are the basic 'building-block' algorithms provided by DataMasque for generating and manipulating column values. Multiple masks can be combined in a list to create a pipeline of transformations on the data, or combined using combinator masks to build up more complex output values.
Parameters
Mask algorithms are defined by their type
parameter - this parameter is common
to (and required by) all masks:
type
(required) determines the type of mask, and therefore what other parameters can be specified.
Note: Masks operate by either manipulating the original column value, or by generating an entirely new value that replaces the original value. The latter can be referred to as a 'source' mask, as the mask is a source of new values. Such 'source' masks are indicated by the
from_
prefix on the masktype
.
Available mask types
Generic masks
Fixed value (from_fixed
)
A simple mask that replaces all column values with the same fixed value.
Parameters
value
(required): The value to replace all column values with. Can be any data type, but should match that of the column being masked. This value can be enclosed in quotation marks, which will convert the value to a string, or entered without quotation marks.
Example
This example will replace all values in the name
column of the employees table with 'Alex'.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: ID
rules:
- column: name
masks:
- type: from_fixed
value: 'Alex'
Show result
DataMasque also supports blank values for the from_fixed
mask type. To specify a blank entry, you can
enter the value as """
, ''
, or leave the field blank if there are no spaces.
version: '1.0'
tasks:
- type: mask_table
table: customers
key: id
rules:
- column: name
masks:
- type: from_fixed
value: ''
From column (from_column
)
A simple mask to replace a column's value by copying the value from another database column (from the same table, or a joined table).
Parameters
source_column
(required): The name of the column to copy values from. The source column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.
Note:
from_column
is not supported for mask_file
tasks, but is supported for mask_tabular_file
tasks.
Example
This example will replace the values of the name
column in the employees
table with values
from the first_name
column of the users
table, joined on id
.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
joins:
- target_table: users
target_key: id
source_table: employees
source_key: user_id
rules:
- column: name
masks:
- type: from_column
source_column: users.first_name
Show result
Given the users
table:
id | first_name |
---|---|
24 | Kyle |
25 | Reid |
26 | Helen |
27 | Callum |
28 | Raymond |
29 | Vivien |
The following transformation will take place on the employees
table.
From file (from_file
)
A mask to replace column values by selecting from a column (seed_column
) in a CSV formatted seed file (seed_file
).
The seed file must contain one or more columns with named column headers. See the Files documentation for
more information on uploading seed files.
If there are any blank values in the file, they will not be considered when randomly selecting a replacement value.
If you wish to have blank values randomly selected from your seed file, ensure they are enclosed with quotation marks. For example, ""
.
Potential replacement values from the seed file rows may be pre-filtered using the table_filter_column
and
seed_filter_column
parameters. This allows you to subset the available values for each row by matching a column value
from the table to a column value in the seed file. If filtering does not match any rows in the seed file,
fallback_masks
can be provided to generate the replacement value instead.
After filtering has been applied (optional), replacement values from the seed file are selected randomly by default.
To enable deterministic selection of replacement values, you may specify one or more hash_columns
for the masking rule.
See Deterministic masking for more details.
Parameters
seed_file
(required): The name of a user-provided CSV seed file to select values from (see Files guide).seed_column
(optional): The name of the column in the seed file that will provide replacement values. This property only needs to be specified for CSV files with multiple columns.table_filter_column
andseed_filter_column
(optional): If both attributes are provided, the replacement value will only be selected from CSV rows where the value for theseed_filter_column
in the CSV row is equal to the value for thetable_filter_column
column in the database row being masked.fallback_masks
(optional): If there are no rows in the CSV that match thetable_filter_column
/seed_filter_column
condition, then this nested list of masks will be applied to the column value instead. If nofallback_masks
are provided and no match is found, then the column value will be replaced with aNULL
value.null_string
(optional): If set, any values in the seed file column that match this value will be treated asNULL
(instead of a string) when written to the database.
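For example, a minimal sketch (the seed file my_names_with_nulls.csv, its names column and the middle_name column are all hypothetical) in which any seed value equal to the literal string NULL is written to the database as a real NULL:
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: middle_name
        masks:
          - type: from_file
            seed_file: my_names_with_nulls.csv
            seed_column: names
            null_string: 'NULL'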
Example
This example masks the street_name
column with a random street name from a seed file
(DataMasque_address_example.csv) while ensuring that the randomly selected street
is from the same city as the original. A fallback value of “Main Street” is used in the event that the database
row contains a city that does not exist in the CSV.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: street_name
masks:
- type: from_file
seed_file: DataMasque_address_example.csv
seed_column: street_name
table_filter_column: city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Show result
Given the following database table and using the DataMasque_address_example.csv seed file:
street_number | street_name | city |
---|---|---|
44 | Queen Street | Auckland |
57 | Oxford Street | London |
12 | Houston Street | New York |
5 | Fake Street | Fake City |
This mask could produce the following masked output:
street_number | street_name | city |
---|---|---|
44 | High Street | Auckland |
57 | Brick Lane | London |
12 | Maiden Lane | New York |
5 | Main Street | Fake City |
From format string (from_format_string
)
A simple mask to generate values of a defined format. This is useful when all values of a column have the same format. DataMasque will generate values according to a format specified using the format string syntax.
Parameters
value_format
(required): The format of the data to be generated, which must meet the requirements of the format string syntax.
version: '1.0'
tasks:
- type: mask_table
table: drivers
key: id
rules:
- column: licence_plate
masks:
- type: from_format_string
value_format: "{[A-Z],3}{[0-9],3}"
Show result
Note
In addition to what is possible with the formats for a mask_unique_key
task, format strings offer further capabilities when used with the mask_table
task.
For example, you can easily select one value from a small number of alternatives using the syntax for alternatives: {(EN|FR)}
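For instance, a minimal sketch (the preferred_language column is hypothetical) that picks one of two language codes and appends three random digits:
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: preferred_language
        masks:
          - type: from_format_string
            value_format: "{(EN|FR)}-{[0-9],3}"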
Typecast (typecast
)
A simple mask to convert a value from one data type to another. This is particularly useful when a value is stored as one data type in a database, but needs to be transformed to a different data type to apply certain mask types within DataMasque.
If the typecast cannot be performed (e.g. because the value of the column is incompatible with the target data type), an error will be raised.
Parameters
typecast_as
(required): The data type you wish to convert the column value to. Must be one of:date
,datetime
,string
,integer
, orfloat
.date_format
(optional): When converting a string to a datetime, this format string can be provided to specify the expected format of the string. Conversely, when converting a datetime to a string, this format string can be provided to determine the format of the resulting string. Format strings must be valid Python date/time format strings.
Example
This example will replace every start_date
in the employees
table with a fixed date value of 2010-01-01
. The
fixed date value is specified as a string and typecast as a datetime before being written to the database.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: start_date
masks:
- type: from_fixed
value: '2010-01-01'
- type: typecast
typecast_as: datetime
date_format: '%Y-%m-%d'
Do nothing (do_nothing
)
A simple mask which does not mask the data; instead, it retains the original value unchanged. This is useful when there are certain items or elements of columns that are not intended to be masked.
For example, you can use it to retain certain known XML attributes and mask all the rest. For more information please refer to XML Documentation.
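As a minimal sketch, do_nothing can simply be listed as the only mask for a column that should be left unchanged (the internal_code column is hypothetical):
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: internal_code
        masks:
          - type: do_nothing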
Manipulating date values as strings
If you wish to manipulate non-string values using their string representations, you can use the typecast
mask.
Using the pattern demonstrated below, you may convert the value to a string, apply some manipulations, and then convert
back to the original data type. For example, to truncate the day and month from a date while retaining the
year, you may do the following:
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
# Convert to string
- type: typecast
typecast_as: string
date_format: '%Y-%m-%d'
# Take the 'YYYY-' component and append a fixed value of
# '01-01' to create the value 'YYYY-01-01'
- type: concat
masks:
- type: replace_substring
start_index: 5
masks:
- type: from_fixed
value: '01-01'
# Convert back to original datatype
- type: typecast
typecast_as: datetime
date_format: '%Y-%m-%d'
From choices (from_choices
)
A mask to replace column values by selecting from a list of choices or a dictionary of choices with weights specified.
Parameters
choices
(required): Can either be a list of strings, or a list of dictionaries. If a list of dictionaries, each dictionary has a single key (the choice) whose value is the weight. Weights must be positive, and can be a mix of float and int. Weights do not need to sum to a particular value; they are relative. If no weights are assigned to any of the choices, all choices have equal weight.
Examples
This example masks the department
column with a randomly selected department from the list of choices.
version: "1.0"
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: department
masks:
- type: from_choices
choices:
- "accounting"
- "sales"
- "research and development"
Show result
This example masks the department
column with a randomly selected department from the list of choices based on the weights provided.
version: "1.0"
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: department
masks:
- type: from_choices
choices:
- "accounting": 10
- "sales": 15
- "research and development": 25
Show result
Combinator masks
Concatenate (concat
)
A simple mask to concatenate the outputs of multiple masks together into a single string.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) which will be evaluated and have their outputs concatenated into a single value. The original column value is provided as the input to each mask.glue
(optional): If provided, this string will be inserted between the output of each concatenated mask. Defaults to an empty string. Useful for separating values with spaces or commas.
Example
This example generates a full name by concatenating a user’s name
with
the fixed value ‘Smith’.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: concat
glue: " "
masks:
- type: from_column
source_column: name
- type: from_fixed
value: "Smith"
Show result
Chain (chain
)
A simple mask to chain other masks together in series. This mask is only useful in combination with concat
,
in the case where multiple masking operations need to be performed on one part of a concatenated mask.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) that will be applied in sequence to the input value.
Example
This example selects a random name from the DataMasque_firstNames_mixed.csv file, transforms it to uppercase, and then concatenates ‘Smith’ onto it to generate a random full name.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: concat
glue: ' '
masks:
- type: chain
masks:
- type: from_file
seed_column: firstname-mixed
seed_file: DataMasque_firstNames_mixed.csv
- type: transform_case
transform: uppercase
- type: from_fixed
value: 'Smith'
Show result
String masks
Imitate (imitate
)
Replace each character in a string with another random character from its same set. The character sets are:
- Uppercase letters (
A-Z
). - Lowercase letters (
a-z
). - Digits (
0-9
).
Characters not in these sets (such as punctuation and symbols) are not replaced.
This mask is designed to be easy to drop in place to mask values that must have a specific format, but whose value is not important. For example, it could be used to mask:
- Phone numbers (e.g.
+1 (555) 867-5309
to+2 (938) 123-8372
) - License plates (e.g.
BZF123
toLMA191
) - Bank accounts (e.g.
10-9282-9478563-00
to23-1840-6492817-01
) - Passport numbers (e.g.
FD194845
toCZ858584
)
and so on.
imitate
is a good, simple and safe default for many data types. However, it is not intended to generate perfect
replacements for columns that must have special rules. For example, if a value must always start with the letter C
,
followed by 6 random numbers and letters, then imitate
is not suitable as the C
might be replaced with another
letter.
The uppercase
, lowercase
and digits
arguments can be used to disable the replacement of each of these character
sets. No errors are raised if a character set is enabled but those characters are not in the string, for example, it's
safe to try to replace letters in a phone number field.
Parameters
force_change
(optional): Since characters are chosen randomly, it is possible that a character might be randomly replaced with the same one (for example,A
is chosen as a replacement forA
). Setforce_change
totrue
to make sure the replacement character differs. Defaults tofalse
. Note that this makes the output slightly less random as the number of possible replacements is reduced by one.uppercase
(optional): A boolean to enable or disable the replacement of uppercase characters. Defaults totrue
(uppercase characters will be replaced).lowercase
(optional): A boolean to enable or disable the replacement of lowercase characters. Defaults totrue
(lowercase characters will be replaced).digits
(optional): A boolean to enable or disable the replacement of digits. Defaults totrue
(digits will be replaced).
Example
This example will apply imitate
masks to the phone
, license_plate
and validation_code
.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: phone
masks:
- type: imitate
- column: license_plate
masks:
- type: imitate
- column: validation_code
masks:
- type: imitate
Show result
Random text (from_random_text
)
This mask replaces the column's value with randomly generated a-z characters.
Parameters
max
(required): The generated character string will be this length at maximum. The maximum length must be between 1 and 100.min
(optional): The generated character string will be this length at minimum. If no value is supplied here, the generated string's length will always be equal to themax
value.case
(optional): The case (upper or lower) of the text generated. Mixed case will be generated if this field is left blank. Must be one of:upper
,lower
Example
This example replaces the values in the name
column with a random string of
lower case characters between 5 and 10 characters in length.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: from_random_text
min: 5
max: 10
case: lower
Show result
Transform case (transform_case
)
A simple mask to perform a transformation to the case/capitalisation of a string.
Parameters
transform
(required): The transformation to apply. Must be one of:uppercase
,lowercase
,capitalize_words
(capitalizes first letter of each word),capitalize_string
(capitalizes first letter only).
Example
This example will convert all values in the name
column into uppercase.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: transform_case
transform: uppercase
Show result
Substring (take_substring
)
A simple mask to select a substring from a column value. You may wish to use this to select or remove a subset of characters from the beginning, end, or middle of a string.
Parameters
start_index
(optional): The index of the first character to include in the selected substring, with 0 being the index of the first character in the string. Defaults to0
.end_index
(optional): The index of the character immediately AFTER the selected substring (i.e. theend_index
is exclusive). If omitted, the selection will continue until the end of the string.
Positive and negative indices can be used, i.e. the first character in a string
is at index 0
, the second character is at index 1
, the last character is at
index -1
, and the second-to-last character is at index -2
.
Example
This example will return only the first 3 characters of each value in the name
column. The final result
will return the characters at positions 0, 1, and 2. This is because the end_index
is exclusive; the
characters starting from the end_index
value of 3 onwards are omitted from the final result.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: take_substring
start_index: 0
end_index: 3
Show result
Replace substring (replace_substring
)
A mask for transforming a selected substring of a string value. The transformation is defined by a nested sequence of masks. Matched substrings are transformed in-place, leaving the unmatched sections intact. For more complex use cases, replace_regex may be helpful.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) that define the transformation to apply to the selected substring. The selected substring is provided as the input to the first mask.start_index
(optional): The index of the first character to include in the selected substring, with 0 being the index of the first character in the string. Defaults to0
.end_index
(optional): The index of the character immediately AFTER the selected substring (i.e. theend_index
is exclusive). If omitted, the selection will continue until the end of the string.preserve_length
(optional): If set totrue
, then the output of themasks
will be truncated or repeated until it has the same length as the original substring. This ensures the length of the entire string is unchanged. Defaults tofalse
.
Positive and negative indices can be used, i.e. the first character in a string is at index 0, the second character is at index 1, the last character is at index -1, and the second-to-last character is at index -2.
Example
This example will replace the last 3 characters of each value in the name
column with a # symbol.
The start_index
value of -3 indicates that the third to last character is the beginning of the
substring. Because the end_index
is not specified, all characters starting from the third to last
character of the string until the end of the string are masked. The final result will take the
characters at index position -3, -2 and -1, and replace those values with '#', leaving the rest
of the string unchanged.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: name
masks:
- type: replace_substring
start_index: -3
masks:
- type: from_fixed
value: '###'
Show result
Replace regular expression (replace_regex
)
A mask for transforming sections of a string that match a certain regular expression. The transformation that is applied to each matched substring is defined by a nested sequence of masks. The matched substrings are transformed in-place, leaving the unmatched sections intact.
Parameters
masks
(required): A list of masks (or dictionary mapping keys to masks) defining the transformation to apply to each substring that matches the pattern specified inregex
. The entire sequence of masks will be applied to each substring that is matched, with the matched value being provided as the input to the first mask.regex
(required): The regular expression that will be used to search for substrings to mask. For more details on how to use regular expressions, see Common regular expression patterns.preserve_length
(optional): If set totrue
, then each output of themasks
will be truncated or repeated until it has the same length as the original matched substring. This ensures the length of the entire string is unchanged. Defaults tofalse
.
Example
This example replaces all numeric characters in the driversLicence
column with #
.
Please note that it is also best practice to wrap the regular expression in quotes to avoid
special characters being misinterpreted as YAML syntax:
version: '1.0'
tasks:
- type: mask_table
table: '"DriversLicence"'
key: id
rules:
- column: driversLicence
masks:
- type: replace_regex
regex: '[0-9]'
masks:
- type: from_fixed
value: '#'
Show result
Credit card (credit_card
)
This mask provides a number of methods for masking and generating credit card numbers. Parameters can be set to control the prefix, PAN formatting and luhn-validity of the generated numbers.
Parameters
issuer_names
(optional): The generated card will have the specified issuer's prefix(es) and card lengths. If left empty all card issuers can be used to generate the card number. Cannot be specified ifpreserve_prefix
istrue
. Please refer to the list of issuers.validate_luhn
(optional): Set totrue
by default, the generated card number will pass the luhn checksum. Set tofalse
to generate random credit cards instead, which slightly improves performance by skipping luhn validation. Cannot betrue
withpan_format
enabled.pan_format
(optional): Set tofalse
by default, the digits between the fourth and last six digits will be marked out with 'X's. Cannot betrue
withvalidate_luhn
enabled.preserve_prefix
(optional): Set tofalse
by default, the generated card will keep the card issuer's prefix but generate new numbers. The length of the prefix will vary based on the issuer. Due to the increased accuracy of enabling this option, which performs a lookup against a list of known issuers, this is not as fast as plain random generation. Actual speed difference will depend on the data to be masked. Cannot betrue
ifissuer_names
have been specified.
Example
This example generates credit card numbers that pass the luhn checksum, with card issuer set to either MasterCard, Visa, or American Express.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: customer_id
rules:
- column: credit_card_number
masks:
- type: credit_card
issuer_names:
- VISA
- MASTERCARD
- AMERICAN EXPRESS
validate_luhn: true
preserve_prefix: false
pan_format: false
Show result
This example generates credit card numbers that retain the original card prefix and the middle numbers in the card will be marked out with 'X'.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: customer_id
rules:
- column: credit_card_number
masks:
- type: credit_card
validate_luhn: false
preserve_prefix: true
pan_format: true
Show result
Card issuer names
These card issuer names can be used as arguments to the issuer_names
parameter. They are not case-sensitive.
Visa | Mastercard | American Express |
China T-Union | China Unionpay | Dankort |
Diners Club International | Diners Club United States & Canada | Discover Card |
Instapayment | Interpayment | JCB |
Lankapay | Maestro | Maestro UK |
MIR | NPS Pridnestrovie | Rupay |
Troy | Ukrcard | Verve |
Card prefixes
DataMasque contains a list of over 105,000 prefixes which are used when the preserve_prefix
parameter is set. If a
prefix is not found, then DataMasque falls back to preserving just the first digit.
A full list of prefixes can be found in the credit_card_prefixes.csv file (1.9MB CSV).
Numeric masks
Random Number (from_random_number
)
This mask replaces the column value with a random number. The default number type is
integer. If decimal_places
is set greater than 0, a decimal number will be
generated.
Parameters
max
(required): The generated number is guaranteed to be less than or equal to this maximum value.min
(optional): This is the minimum value for the generated number. Defaults to0
.mode
(optional): If this parameter is supplied, the number will be chosen from a triangular distribution where 'mode' is the peak. Useful for street numbers etc. (e.g. withmin
andmode
both set to1
to generate lower street numbers more frequently).decimal_places
(optional): Number of decimal places to generate. Defaults to 0.
Example (integer)
This example replaces the quantity
column in the products table with a random
integer between 1 and 200.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: quantity
masks:
- type: from_random_number
min: 1
max: 200
Show result
Example (decimal)
This example replaces the price
column in the products table with a random
number to 2 decimal places between 1 and 200.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: price
masks:
- type: from_random_number
min: 1
max: 200
decimal_places: 2
Show result
Random boolean (from_random_boolean
)
This mask replaces the column value with a random boolean value (1
/0
or
true
/false
). from_random_boolean
is effectively equivalent to a
from_random_number
mask with options max: 1
and min: 0
.
Note: The return type is an integer which is automatically cast to boolean in databases. For file masking, a literal 1 or 0 will be written to the file, which may not be the intended behaviour. If literal "true"/"false" strings are required, please use the from_choices mask with "true" or "false" values (note the quoting around the values). For more information please refer to From Choices.
Example
This example replaces the active
column in the products table with random
boolean values.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: active
masks:
- type: from_random_boolean
Show result
Numeric Bucket (numeric_bucket
)
This mask generates random integers from within the same "bucket" (numeric range) as the value to be masked. The mask is set up using an array of integers that define the lower bound of each bucket.
buckets
Intro
Buckets
are a series of defined lower bounds from which replacement values are randomly generated.
For example, consider an age column whose values should be masked into certain age buckets: 1-17, 18-24, 25-64, 65+. With buckets: [18,25,65] specified, if a value from the column is 16, the replacement value is randomly generated within the 1-17 range, rather than being any number.
Capping minimum and maximum values
The numeric bucket mask will generate numbers for the lowest bucket from 0/1 to <bucket> - 1
. Similarly, for the highest
bucket, numbers will be generated in the range bucket
to 2^31 - 1. This default behaviour may cause undesired
outputs which are outside the normal range for the data.
For example, consider a column containing ages. All ages currently in the column are 18 or older. Ages are to be retained in buckets 35 or younger, 36-65, 65+.
A ruleset could be defined with buckets: [36, 65]
, but this would not function correctly. Ages 35 or lower could be masked
to any value from 0-35, therefore the masked age could be below 18, which does not match the application rules. A similar
problem exists with ages over 65: the masked value could be any value from 65 to 2^31 - 1.
To solve this issue, lower and upper bounds should be specified. This requires an understanding of the existing data. Choosing
the upper bound would mean selecting a reasonable value that your ages could have. A better ruleset definition could be: buckets: [18, 36, 65, 100]
.
This would limit the lower age to 18 and upper age to 100.
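A minimal sketch of the bounded ruleset described above (assuming an age column on a users table):
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: age
        masks:
          - type: numeric_bucket
            # Lower bound 18 and upper bound 100 keep masked ages in a realistic range
            buckets: [18, 36, 65, 100]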
Parameters
buckets
(required): A series of numbers representing the lower bounds of the buckets (each bound is inclusive), e.g. [5, 10, 15, 20]
would result in ranges 1-4, 5-9, 10-14, 15-19, 20+. Note: buckets
should not contain duplicates and should be in ascending order.force_change
(optional): Since the replacement values are chosen randomly, it is possible that the number might be randomly replaced with the same one (for example,45
is chosen as a replacement for45
). Setforce_change
totrue
to ensure the replacement value is not the same as the original value. Note that this makes the output slightly less random as the number of possible replacements is reduced by one.include_zero
(optional): When set totrue
, will lower the first lower bound to include 0, otherwise the lowest bound will be 1.scale_to
(optional): Provide a number that better represents the scale of your data in the target column to which replacement values will be scaled. e.g., If the scale of your data is 1000 (values are multiples of 1000), then 4583 is not an appropriate replacement for 2000. Setscale_to
to1000
so the replacement values will be multiples of1000
(4583 -> 5000).scale_to
must be a multiple of 10.
Note: If
buckets
were defined with a size of one (e.g.buckets: [1,3,5]
) with theforce_change
parameter set totrue
this would cause an infinite loop; instead, an error will be raised. If the target column is of a type with a smaller range than 0 to 2^31 - 1 (smallint, tinyint, etc.), add a maximum value to the buckets,
as replacement values could be larger than the column can store. numeric_bucket
requires a numeric value as the input. A typecast
mask can be used to convert any string values before masking with numeric_bucket
. For more information please refer to Typecast.
Example
This example replaces the age
column in the users
table with random values from the specified buckets
.
version: '1.0'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: age
masks:
- type: numeric_bucket
buckets: [16,18,25,65,110]
force_change: true
Show result
Date/time masks
Random date/time (from_random_datetime
)
This mask replaces the column value with a random datetime. Generated datetime values have a resolution of 1 second (the millisecond component will be 0).
Parameters
max
(required): The generated date/time is guaranteed to be less than this date/time, formatted according to ISO 8601.min
(required): This is the minimum value for the generated date/time, formatted according to ISO 8601.
Note A
from_random_datetime
mask can be used in conjunction with a Typecast mask in order to make sure the format of the generated date/datetime value satisfies any format requirements.
Example
This example replaces the order_time
column with a random date time between
December 1 2019 00:00:00 UTC-0 and December 31 2019 05:30:00 UTC-0.
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: order_time
masks:
- type: from_random_datetime
min: '2019-12-01T00:00+00:00'
max: '2019-12-31T05:30+00:00'
Show result
Random date (from_random_date
)
This mask replaces the column value with a random date. Generated date values have a resolution of 1 day and do not include a time component.
Parameters
max
(required): The generated date/time is guaranteed to be less than this date/time, formatted according to ISO 8601.min
(required): This is the minimum value for the generated date/time, formatted according to ISO 8601.
Note A
from_random_date
mask can be used in conjunction with a Typecast mask in order to make sure the format of the generated date value satisfies any format requirements.
Example
This example replaces the date_of_birth
column with a random date between
January 1 1950 and December 31 2000.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
- type: from_random_date
min: '1950-01-01'
max: '2000-12-31'
Show result
Note that this mask uses the ISO-8601 standard for
date
values. This allows ISO weeks to be used when formatting dates; ISO week formats are specified with a 'W' in front of the week number, optionally followed by the day of the week. Additionally, the day of the year can be specified as a three-digit number. See Date formats below.
Date formats
Format | Example | Equivalent |
---|---|---|
yyyy | 2022 | January 1, 2022 |
yyyy-mm | 2022-03 | March 1, 2022 |
yyyy-mm-dd | 2022-03-05 | March 5, 2022 |
yyyy-Www | 2022-W20 | Week 20 of 2022 (May 16, 2022) |
yyyy-Www-d | 2022-W20-4 | Day 4 of Week 20 of 2022 (May 19, 2022) |
yyyy-ddd | 2022-050 | Day 50 of 2022 (February 19, 2022) |
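For example, a minimal sketch (the orders table and dispatch_date column are hypothetical, and it assumes these ISO formats are accepted for min and max as the table above suggests) using an ISO week date for the minimum and a day-of-year date for the maximum:
version: '1.0'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: dispatch_date
        masks:
          - type: from_random_date
            min: '2022-W01'   # Week 1 of 2022
            max: '2022-180'   # Day 180 of 2022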
Retain age (retain_age
)
This mask is designed to be used on a column containing dates. It will mask by generating random dates in such a way that age (in years) is retained, calculated based on the date the mask is executed. This can be used to ensure that columns such as date of birth, transaction date, registration date or other creation dates are masked without conflicting with other rules that may rely on the age of the masked date.
Parameters
date_format
(optional): If the column value is read as a string type, then it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to"%Y-%m-%d"
.day_of_year
(optional): For consistent masking, specify aday_of_year
to calculate the ages on that day of the current year. This should be specified in the format"<month>-<day>"
; for example,"8-1"
for the 1st of August. The year will be set to the current year. If this option is omitted, then ages will be calculated based on the current date.
Example
This example replaces the date_of_birth
column values while retaining the employee's age. The dates are based on the
masking run being executed on August 1st, 2022. The employees' ages are 50, 50, 49 and 22, respectively, and
are maintained even with the new dates of birth.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
- type: retain_age
Show result
Note that the mask considers birthdays on the 29th of February to be celebrated on the 28th of February on non-leap-years.
If masking datetime values (rather than just date values) then the time portion of the masked value will be 00:00:00 (midnight).
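To keep the age calculation stable across runs, the day_of_year parameter described above can be set. A minimal sketch that pins the calculation to the 1st of August:
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: retain_age
            day_of_year: "8-1"   # Calculate ages as at 1 August of the current year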
Retain date component (retain_date_component
)
This mask will generate a date where up to two of the year, month, or day components will be retained.
Parameters
date_format
(optional): If the column value is read as a string type, then it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to"%Y-%m-%d"
.year
(this ORmonth
ORday
required): This is abool
which specifies whether the year component of the date should be retained. Defaults toFalse
.month
(this ORyear
ORday
required): This is abool
which specifies whether the month component of the date should be retained. Defaults toFalse
.day
(this ORyear
ORmonth
required): This is abool
which specifies whether the day component of the date should be retained. Defaults toFalse
.minimum_year
(optional): This is the minimum year that can be generated. Defaults to 100 years ago.maximum_year
(optional): This is the maximum year that can be generated. Defaults to current year.force_change
(optional): Ensures the newly generated date is different from the input date, in case the same date happens to be randomly generated.
Example
In this example the month of each date is retained, while the day and year will be randomly generated. The year will be a value between 1980 and 2005 (inclusive).
version: '1.0'
tasks:
- type: mask_table
table: customers
key: id
rules:
- column: date_of_birth
masks:
- type: retain_date_component
month: True
minimum_year: 1980
maximum_year: 2005
Show result
Retain year (retain_year
)
This mask replaces the column value with a random date. The generated date will have the year component retained while randomising month and day.
Parameters
date_format
(optional): If the column value is read as a string type, then it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to"%Y-%m-%d"
.force_change
(optional): Ensures the newly generated date is different from the input date, in case the same date happens to be randomly generated.
Example
This example replaces the date_of_birth
column values with new dates while retaining their year component.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
masks:
- type: retain_year
force_change: True
Show result
Document masks
Document masks apply masking to subfields in documents that are stored in database columns, for example, JSON or XML documents.
JSON (json
)
This mask will use a query to locate and mask a value inside a JSON document. The rest of the JSON document is unchanged.
The path
is specified using a list of strings or integers which will be used when traversing the data to the values
intended to be masked; some examples of path
are covered in the next section.
Parameters
transforms
(required): A list of the transforms (replacements) to perform on the JSON document.path
(required): The path to locate the value to update.masks
(required): A list of masks to be performed (Any of the valid Mask Types).on_null
(optional): A string to specify the action to take if the value isnull
. One of:skip
(default): Skip to the next transform, the document remains unchanged.error
: Raise an error and stop masking.mask
: Mask thenull
value as specified.
on_missing
(optional): A string to specify the action to take if the value is not present (due to the document structure not matching the path).skip
(default): Skip to the next transform, the document remains unchanged.error
: Raise an error and stop masking.
force_consistency
(optional): Keep consistency between replacements in the path. See the section JSON Example withforce_consistency
for details on behaviour. Defaults tofalse
.
fallback_masks
(optional): Mask to perform if the data retrieved from the database is not valid JSON.
When masking multiple values in the same JSON document, multiple
transforms
should be specified, instead of multiple table masks with a single transform each. This means that the JSON column will only need to be serialized/deserialized once per row.
path
Intro
For the JSON:
{
"customer_details": {
"first_name": "Richard",
"last_name": "Willis"
},
"quantity": 18,
"products": ["product1", "product2"]
}
The path [customer_details, first_name]
would refer to the value "Richard"
, [customer_details, last_name]
would
be "Willis"
. [quantity]
gives the integer value 18
, ["products"]
would refer to the array
["product1", "product2"]
.
The wildcard operator *
can be used to apply masks to multiple items matching the query. This is useful if you don't
know how many elements will be in an array or object. For example, a JSON object with multiple people, each with
multiple addresses:
{
"users": [
{
"name": "Richard",
"addresses": [
{"type": "postal", "city": "Fairview"},
{"type": "physical", "city": "Riverside"}
]
},
{
"name": "Willis",
"addresses": [
{"type": "postal", "city": "Beachland"},
{"type": "physical", "city": "Bronson"}
]
}
]
}
The path [users, "*", name]
would mask the name
for every element in users
, regardless of how many there are.
Multiple wildcards can be used, too. The path [users, "*", addresses, "*", city]
would mask city
in all addresses
elements of all users
. Note that *
must always be quoted in YAML.
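For instance, a minimal sketch (assuming the JSON above is stored in a json_data column of a customers table) that masks every city under every user with a fixed value:
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: uid
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              # Wildcards cover every user and every address, however many there are
              - path: [users, "*", addresses, "*", city]
                masks:
                  - type: from_fixed
                    value: "REDACTED"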
Example
This example replaces the data at the path [customer_details, first_name]
of the json_data
column with a
fixed value REDACTED
. The on_null: mask
option is specified to mask the null
value as normal. The skip
option is
specified to skip that transform and continue masking on missing values (i.e. the structure does not match the path).
Note that this means the first_name
in the wrong location in the first row is not masked. In cases like this, it can
be safer to specify error
instead, so the masking run fails if data is not in the expected format.
In the second row where {"first_name": null}
, this value will be masked since we specified on_null: mask
.
Also note the use of fallback_masks
. The last row did not contain valid JSON data, so the fallback mask was used to replace it with an empty JSON object, which may help clean the data for further use.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: uid
rules:
- column: json_data
masks:
- type: json
transforms:
- path: [customer_details, first_name]
masks:
- type: from_fixed
value: "REDACTED"
on_null: mask
on_missing: skip
fallback_masks:
- type: from_fixed
value: "{}"
For arrays, all masks
will be applied to each value in the array.
For example:
{
"customer_details": {
"given_names": ["Richard", "Willis"]
}
}
The path [customer_details, given_names]
would return the value ["Richard", "Willis"]
and the masks would then
be performed on "Richard"
and "Willis"
separately. This means for most mask types, each value in the array would be
transformed into a new, different value. However, if you are using a mask that always returns the same value
(e.g. from_fixed
) all values would be transformed to the same new value.
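As an illustration, a sketch that masks each entry of given_names from a first-names seed file (each element in the array receives its own replacement):
- type: json
  transforms:
    - path: [customer_details, given_names]
      masks:
        - type: from_file
          seed_file: DataMasque_firstNames_mixed.csv
          seed_column: firstname-mixed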
Note: For MSSQL, ORACLE, and REDSHIFT databases, JSON data can be stored in text data types (VARCHAR, NVARCHAR, TEXT). For PostgreSQL databases, JSON data can also be stored in JSON or JSONB data types.
JSON Example with force_consistency
This example will illustrate the benefit of using the force_consistency
parameter on transforms.
Suppose you have a table with JSON data with the following structure:
{
"name": [
{
"use": "official",
"family": "Chalmers",
"given": ["Peter", "James"]
},
{
"use": "usual",
"given": ["Jim"]
},
{
"use": "maiden",
"family": "Windsor",
"given": ["Peter", "James"]
}
]
}
When masking the items at the path ['name', '*', 'given'], it would be best to mask them with consistent values, i.e. the same masked names would
appear in each of the given
items after masking. To do this, set the force_consistency
parameter of the relevant
transform to true
.
version: "1.0"
tasks:
- type: mask_table
table: dbo.json_test
key: id
rules:
- column: first_name
masks:
- type: json
transforms:
- path: ['name', '*', 'given']
masks:
- type: from_file
seed_file: DataMasque_firstNames_male.csv
seed_column: firstname-male
force_consistency: true
Without force_consistency, the output JSON would contain all different names.
XML (xml)
This mask will use a query to locate and mask a value inside an XML document. The rest of the XML document is unchanged.
An Xpath (path
) is used to define the path to the node to mask. Once the node has been located, one or more
node_transforms
can be applied to alter its content or attributes.
Note: The
xml
mask should only be used with trusted XML data. The parser includes support for entity expansion and external references which can potentially be exploited with malicious XML payloads.
Intro to transforms
and node_transforms
XML documents are made up of one or more elements. When referring to an element, this includes the start tag, end tag, attributes and content. For example, this element representing a log:
<Log date="2022-08-09" username="user@example.com">Account created</Log>
The element to mask is located using an Xpath expression. Once found, there are a few different parts of the element that can be masked, namely:
- its name (Log)
- its attributes (date and username)
- its text (Account created)
Each of these items is an XML node.
When a masking run executes, each row from the database is fetched and passed to a masking function only once.
To apply masks on different elements in an XML document, the ruleset should define a list of transforms
, one for each
element that requires masking. In turn, a list of node_transforms
must be specified, one for each node of the
element that needs to be masked.
Specifying masking in this manner allows the masking run to be more efficient by querying for each element to be masked only once.
As an example, consider how to mask the Log
in the above example. The date
and username
attributes should be
redacted, along with the text content. This would require one transform to locate the Log
element, then three
node transforms: one for the date
attribute, another for the username
attribute, and the final to mask the text
of
the element.
The relevant portion of the YAML describing this transform would look like:
transforms:
  - path: 'Log'
    node_transforms:
      - type: attribute
        attributes: 'date'
        masks:
          - <list of masks>
      - type: attribute
        attributes: 'username'
        masks:
          - <list of masks>
      - type: text
        masks:
          - <list of masks>
Note: This is assuming the Log element is not the root element in the XML document. To get the root element, use . or an absolute Xpath (starting with //) as the path. All XML values are read as strings, which will require a typecast mask if they are used in a mask that requires non-string values (e.g. numeric_bucket). XML also requires strings to be written, so masks that return non-string values (e.g. from_random_number, from_random_boolean, numeric_bucket) need to go through a typecast mask before being written. For more information on typecast please refer to the Typecast documentation. Below is an example with from_random_number.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
transforms:
- path: 'Log'
node_transforms:
- type: attribute
attributes: 'id'
masks:
- type: from_random_number
min: 1000
max: 9999
- type: typecast
typecast_as: 'string'
Consistency for multiple elements
Xpath expressions can match multiple elements. This XML document contains a UserLog with multiple Log elements:
<Root>
<UserLog>
<Log date="2022-08-09" username="user@example.com">Account created</Log>
<Log date="2022-08-09" username="user@example.com">Logged in</Log>
<Log date="2022-08-09" username="user@example.com">Logged out</Log>
</UserLog>
</Root>
The root is called Root in these examples – the root node does not need to be named Root.
The Xpath UserLog/Log
would match all three Log
elements. DataMasque can be configured to mask each of
the specified nodes with the same value, or as different values. For example, the text of each element could be masked
to the same value. Or, different masks can be applied to each located element. This is configured with the
force_consistency
option at the transform level. Setting this to true
will apply each node transform in the same way
to each element.
Xpath Relative Node
When evaluating an xpath expression, the root node is considered to be the current node when executing masking. Therefore, the root node should not be included when using relative xpaths.
Consider this example document:
<Root>
<UserLog>
<Log/>
</UserLog>
</Root>
To select the Log
node, the Xpath Root/UserLog/Log
is not valid, as Root
is the current node. Instead
UserLog/Log
should be used as the path is relative to Root
.
If using an absolute Xpath (i.e. an Xpath starting with //
) then the root node should be included. That is, the
Xpath //Root/UserLog/Log
and UserLog/Log
select the same node(s) in this case.
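For the document above, a sketch of a transform selecting the Log elements (the node transforms are left as placeholders):
transforms:
  # Relative Xpath: the root <Root> element is the current node, so it is omitted.
  - path: 'UserLog/Log'
    node_transforms:
      - <list of node transforms>
  # The absolute Xpath '//Root/UserLog/Log' would select the same nodes.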
Masking of unknown/extra attributes
There may be cases where XML elements sometimes have extra attributes that are not always known prior to masking. To
mask these, the extra_attribute_masks
option can be specified. This should contain a list of masks to apply to each
attribute that has not been masked using a defined node_transform
.
By default, each "extra" attribute value will have the masks applied to it separately. To force each of these values to
be the same, specify the force_extra_attribute_consistency: true
at the transform level. The extra_attribute_masks
will be
applied to the first extra attribute on the first node found, and the resulting value will be applied to all extra
attributes. Note that the order in which attributes are located is indeterminate and may not match the order they appear
in the XML.
Parameters
- transforms (required): A list of the transforms (replacements) to perform on the XML document. Each transform has:
  - path (required): The Xpath expression to locate the value to update.
  - node_transforms (required): A list of transforms to apply to the nodes on the element. The syntax of this object is shown in the node_transforms Parameters section below.
  - on_missing (optional): A string to specify the action to take if the element at the given path is not present (due to the document structure not matching the path). One of:
    - skip (default): Skip to the next transform; the document is unchanged by this transform.
    - error: Raise an error and stop masking.
  - force_consistency (optional): Require each matching element to be masked to the same values. Defaults to false.
  - extra_attribute_masks (optional): A list of masks to apply to attributes not covered by a specific node_transform.
  - force_extra_attribute_consistency (optional): Force all "extra" attributes to be masked to the same value. Only applicable when using extra_attribute_masks. Defaults to false.
- fallback_masks (optional): Masks to perform if the data retrieved from the database is not valid XML.
node_transforms Parameters
node_transforms is a list of transforms to apply to the nodes of the found element(s). Each item has the following parameters:
- type (required): The type of node(s) of the current element to apply masking to. Must be one of:
  - text: The text value of the element (the content between the opening and closing tags).
  - attribute: Mask one or more attribute(s) on the element.
  - name: Mask the name of the element itself.
- masks (required): A list of masks to be performed (any of the valid Mask Types).
- attributes (optional): This option is required when using the attribute type, and must not be present for other types. May either be a string, or an array of strings, which specify the attributes to apply masks to. To apply different masks to different attributes, use multiple node_transforms.
- on_missing_attribute (optional): A string to specify the action to take if an attribute is missing. Please see the section below on Missing XML Nodes for what constitutes a missing attribute. One of:
  - skip (default): Skip to the next attribute (if masking multiple attributes) or, if there are no attributes to be masked, to the next node_transform. The document is unchanged by this transform.
  - mask: Apply the masks, using a null value, then create the attribute.
  - error: Raise an error and stop masking.
- on_null_text (optional): A string to specify the action to take if the text of a node is null (missing). Please see the section below on Missing XML Nodes for what constitutes a missing node. One of:
  - skip (default): Skip to the next node_transform. The document is unchanged by this transform.
  - mask: Apply the masks, using a null value, then create the text content.
  - error: Raise an error and stop masking.
Missing XML Nodes
The on_missing_attribute or on_null_text options can be used to change how missing values are treated.
- A text node is considered null if a tag is self-closing; for example, <Transaction amount="23.94"/>. It is also considered null if the element is empty; for example, <Message to="user1" from="user2"></Message>.
- An attribute is considered missing if it does not exist on the element. For example, the attribute currency is missing from this element: <Transaction amount="23.94"/>. An empty string attribute is not considered missing, and is instead masked as an empty string (see the sketch after this list).
- on_missing_attribute and on_null_text do not apply to the name node type, as XML tags/elements must have a name.
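As an illustration, this hypothetical node_transform creates the missing currency attribute on the Transaction element above; the fixed value 'USD' is an assumption for the sketch:
node_transforms:
  - type: attribute
    attributes: 'currency'
    # The attribute does not exist, so the masks run on a null value and the
    # attribute is created with the result.
    on_missing_attribute: mask
    masks:
      - type: from_fixed
        value: 'USD'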
Retaining known attributes and removing others
There may be some instances where you want to retain known attributes, but mask all others. In this case, you can
combine the do_nothing
mask with the extra_attribute_masks
. Any attributes you want to retain will be "masked" to their
original value with do_nothing
; DataMasque considers these to be masked and then applies the extra_attribute_masks to any other attributes.
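A minimal sketch of this approach, assuming an Employee element like the one in the later example where only the date attribute should keep its original value:
transforms:
  - path: 'Info/Employee'
    node_transforms:
      # "Mask" date to its original value so DataMasque treats it as already masked.
      - type: attribute
        attributes: 'date'
        masks:
          - type: do_nothing
    # Every attribute not covered by a node_transform is replaced by these masks.
    extra_attribute_masks:
      - type: from_file
        seed_file: DataMasque_firstNames_mixed.csv
        seed_column: firstname-mixed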
Examples
This example contains one transforms item and three node_transforms. The transforms item specifies the path UserLog/Log of the xml_data column; any optional parameters not specified are set to their default values.
- The first of the node_transforms replaces the text at the path with a fixed value REDACTED; the on_null_text: mask option is specified to mask the null value as normal.
- The second masks the username attribute with a similar replacement by concatenating three from_file masks and a transform_case mask to make sure the replacements are all still lower case.
- The third masks the date attribute with a suitable replacement date using a from_random_date mask.
Also note the use of fallback_masks
. The last row did not contain valid XML data, so the fallback mask was used to replace it with an empty <Root /> element, which may help clean the data for further use.
<Root>
<UserLog>
<Log date="2022-08-09" username="user@example.com">Account created</Log>
<Log date="2022-08-09" username="user@example.com"></Log>
<Log date="2022-08-09" username="user@example.com">Logged out</Log>
</UserLog>
</Root>
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Root />'
transforms:
- path: 'UserLog/Log'
node_transforms:
- type: text
masks:
- type: from_fixed
value: REDACTED
on_null_text: mask
- type: attribute
attributes:
- username
masks:
- type: concat
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
- type: from_file
seed_file: DataMasque_email_suffixes.csv
seed_column: email-suff
- type: transform_case
transform: lowercase
- type: attribute
attributes:
- date
masks:
- type: from_random_date
min: '2022-01-01'
max: '2022-12-31'
XML Example with force_consistency
This example will illustrate the benefit of using the force_consistency
parameter on transforms.
Suppose you have a table with XML data with the following structure:
<Root>
<UserLog>
<Log date="2022-08-09" username="user@example.com">Account created</Log>
<Log date="2022-08-09" username="user@example.com"></Log>
<Log date="2022-08-09" username="user@example.com">Logged out</Log>
</UserLog>
</Root>
When masking the date
and username
attributes at the path UserLog/Log
, it would be best to mask them with
consistent values i.e. the same masked values would appear in each of the attributes after masking.
To do this, set the force_consistency
parameter of the relevant transform to true
.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: '<Root />'
transforms:
- path: 'UserLog/Log'
force_consistency: true
node_transforms:
- type: text
masks:
- type: from_fixed
value: REDACTED
on_null_text: mask
- type: attribute
attributes:
- username
masks:
- type: concat
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- type: from_file
seed_file: DataMasque_lastNames.csv
seed_column: lastnames
- type: from_file
seed_file: DataMasque_email_suffixes.csv
seed_column: email-suff
- type: transform_case
transform: lowercase
- type: attribute
attributes:
- date
masks:
- type: from_random_date
min: '2022-01-01'
max: '2022-12-31'
Without force_consistency, the output XML would contain all different names, as in the first example above.
Example with force_extra_attribute_consistency
This example illustrates the benefit of using the force_extra_attribute_consistency parameter on transforms.
Suppose you have XML data with the following structure:
<Root>
<Info>
<Employee date="2022-10-08" given_name="billy_ferwagner" preferred_name="billy_ferwagner"></Employee>
<Employee date="2022-10-08" given_name="william_florista" preferred_name="william_florista"></Employee>
</Info>
</Root>
But this time you want to mask the given_name and preferred_name attributes to the same value. To achieve this, set force_extra_attribute_consistency: true and specify extra_attribute_masks with the masks you want performed on the extra attributes (you can still add node_transforms for any attributes, such as date, that should be masked individually). A single masked value will be generated from the specified masks and used to replace the values of all extra attributes.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: xml_data
masks:
- type: xml
fallback_masks:
- type: from_fixed
value: fallback
transforms:
- path: 'Info/Employee'
force_extra_attribute_consistency: true
extra_attribute_masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
on_null_text: mask
Without force_extra_attribute_consistency, the given_name and preferred_name attributes would be masked to different values.
Definitions
You can make use of YAML anchors and aliases to assign a label (anchor) to a block of configuration, and then reference that name (via aliases) to re-use the configuration throughout your ruleset, extending or overriding specific properties/keys in the configuration as needed.
You can define an anchor with &
, then refer to it using an alias, denoted with *
.
You can use overrides with the characters <<:
to add more values, or override existing ones on an alias.
Shown below is an example that sets four values with an anchor, re-uses three of those values via an alias, and overrides the second value (table).
Both tasks share the values for type
, key
and rules
using the anchor &mask_table_opts
and the alias *mask_table_opts
. The value for table is overridden using the << merge operator.
version: '1.0'
tasks:
- &mask_table_opts
type: mask_table
table: details
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- <<: *mask_table_opts
table: customers
As you may commonly want to re-use the definition of a task, rule, or mask in multiple locations (often with
minor variations), DataMasque allows you to group such definitions under the task_definitions
,
rule_definitions
, and mask_definitions
configuration properties of a ruleset. The following subsections demonstrate
how to use anchors and aliases with such definitions.
Task definitions (task_definitions
)
The task_definitions
attribute of the ruleset may contain a list of task definitions to be referenced
from the ruleset's tasks
attribute through YAML anchors and aliases. Each task definition must follow
the same syntax as a task in the tasks
attribute (see Database Task Types for the full list of
available task types).
For example, the following ruleset has an anchor named &mask_table_customer_info
under task_definitions
that is inherited three times
in the tasks
list via the alias *mask_table_customer_info
. Each time the definition is re-used, the table
attribute is overridden:
version: '1.0'
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
tasks:
- <<: *mask_table_customer_info
table: details
- <<: *mask_table_customer_info
table: customers
- <<: *mask_table_customer_info
table: orders
This is equivalent to specifying the three mask_table
tasks separately with the same key
, type
, and rules
but
applied to different tables, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: details
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
- type: mask_table
table: customers
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
- type: mask_table
table: orders
key: id
rules:
- column: customer_id
masks:
- type: from_random_text
max: 10
- column: postcode
masks:
- type: from_random_number
max: 9999
Rule definitions (rule_definitions
)
The rule_definitions
attribute of the ruleset may contain a list of rule definitions to be referenced
through YAML anchors and aliases. Each rule definition must follow the same syntax as a rule in a
mask_table
task's rules
.
For example, the following ruleset has an anchor named &postcode_rule
under rule_definitions
that is inherited twice
in a list of rules
via the alias *postcode_rule
. Each time the definition is re-used, the column
attribute is overridden:
version: '1.0'
rule_definitions:
- &postcode_rule
column: postcode
masks:
- type: from_random_number
max: 9999
tasks:
- type: mask_table
table: orders
key: id
rules:
- <<: *postcode_rule
column: target_postcode
- <<: *postcode_rule
column: destination_postcode
This is equivalent to specifying the two rules separately with the same masks
but
applied to different columns, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_postcode
masks:
- type: from_random_number
max: 9999
- column: destination_postcode
masks:
- type: from_random_number
max: 9999
Mask definitions (mask_definitions
)
The mask_definitions
attribute of the ruleset may contain a list of mask definitions to be referenced
from a list of masks through YAML anchors and aliases. Each mask definition must follow
the same syntax as a mask in the masks
attribute of a rule (see Mask Types for the full list of
available mask types).
For example, the following ruleset has an anchor named &street_mask
under mask_definitions
that is inherited twice
in a masks
list via the alias *street_mask
. Each time the definition is re-used, the table_filter_column
attribute is overridden:
version: '1.0'
mask_definitions:
- &street_mask
type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_street
masks:
- <<: *street_mask
table_filter_column: target_city
- column: destination_street
masks:
- <<: *street_mask
table_filter_column: destination_city
This is equivalent to specifying the two masks separately with identical configuration except for
different table_filter_column
values, as below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: orders
key: id
rules:
- column: target_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: target_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
- column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Nested definitions
It is possible to have nested definitions under a definition.
For example, a nested rule definition &customer_info_street_address
can be anchored under a task definition &mask_table_customer_info
.
version: '1.0'
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules:
- &customer_info_street_address
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
tasks:
- <<: *mask_table_customer_info
table: details
rules:
- <<: *customer_info_street_address
column: pickup_street
Inherit definitions with dictionaries
As it is not possible to use anchors and aliases to override individual items in a YAML list, DataMasque allows you to
alternatively specify lists of rules
, joins
, and masks
as dictionaries of arbitrary keys to their respective values.
A dictionary is represented in a simple key: value
form (the colon must be followed by a space):
# A mask rules example in dictionary with arbitrary keys
rules:
customer_info_postcode: *postcode_rule
customer_info_customer_id: *customer_id_rule
customer_info_destination_street: *destination_street_rule
The above dictionary is identical to the following list when the rules are executed:
# A mask rules example in list
rules:
- *postcode_rule
- *customer_id_rule
- *destination_street_rule
The following example demonstrates how specifying rules
as a dictionary allows you to easily override
individual rules by their key using anchors and aliases.
By specifying rules
as a dictionary in task definition &mask_table_customer_info
and by using anchors and aliases to
reference the rules dictionary and individual rules, we were able to override the entire customer_info_postcode
rule in
the first task, and just the column
of the customer_info_customer_id
rule in the second task.
version: '1.0'
rule_definitions:
- &postcode_rule
column: postcode
masks:
- type: from_random_number
max: 9999
- &customer_id_rule
column: customer_id
masks:
- type: from_random_text
max: 10
task_definitions:
- &mask_table_customer_info
type: mask_table
# PLACEHOLDER is a placeholder table name to be merged/overridden
# when anchor `&mask_table_customer_info` is referred in alias
table: PLACEHOLDER
key: id
rules: &customer_info_rules
customer_info_postcode: *postcode_rule
customer_info_customer_id: *customer_id_rule
customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
tasks:
- <<: *mask_table_customer_info
table: details
rules:
<<: *customer_info_rules
customer_info_postcode:
column: post
masks:
- type: from_fixed
value: 1234
- <<: *mask_table_customer_info
table: orders
rules:
<<: *customer_info_rules
customer_info_customer_id:
<<: *customer_id_rule
column: ext_customer_id
These changes can be seen in the equivalent expanded ruleset below:
Show equivalent ruleset without anchors and aliases
version: '1.0'
tasks:
- type: mask_table
table: details
key: id
rules:
rule0:
column: post
masks:
          - type: from_fixed
            value: 1234
rule1:
column: customer_id
masks:
- type: from_random_text
max: 10
rule2:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
- type: mask_table
table: orders
key: id
rules:
customer_info_postcode:
column: postcode
masks:
- type: from_random_number
max: 9999
customer_info_customer_id:
column: ext_customer_id
masks:
- type: from_random_text
max: 10
customer_info_destination_street:
column: destination_street
masks:
- type: from_file
seed_file: DataMasque_address-example.csv
seed_column: street_name
table_filter_column: destination_city
seed_filter_column: city
fallback_masks:
- type: from_fixed
value: 'Main Street'
Note: Dictionaries can be used instead of lists in the following ruleset attributes:
- For rules, as an attribute of:
  - Task type mask_table, including inside an if condition (both rules and else_rules).
- For joins, as an attribute of:
  - Task type mask_table.
- For masks, as an attribute of:
  - A rule in a set of rules.
  - Mask type replace_regex.
  - Mask type replace_substring.
  - Mask type concat.
  - Mask type chain.
Note: The order of items in the above dictionaries will be respected, so the operations will be performed in the order you specify them.
Advanced ruleset features
Default values to skip
You can specify a set of skip_defaults values at the ruleset level:
version: '1.0'
skip_defaults:
- null
- ''
- matches: 'administrator_.*'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: role
masks:
- type: from_fixed
value: 'customer'
These will be used as the skip
configuration for any masking rule that does not have skip
explicitly set. You can
still force a masking rule to not skip any values by setting skip to an empty list (skip: []), as in the sketch below.
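For example, a sketch of a rule that overrides the ruleset-level defaults above so that every value, including nulls and empty strings, is masked:
- column: role
  # An empty skip list overrides skip_defaults for this rule only.
  skip: []
  masks:
    - type: from_fixed
      value: 'customer'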
Freezing random values
A run_secret
string can be provided as an option when starting a run. If you provide a run_secret
, then all
randomly chosen and generated values will be exactly the same for multiple runs with the same secret. This can be useful
for testing or retaining consistency across multiple masking runs. The run_secret
can consist of any sequence of at least 20
characters; however, it is recommended to use a randomly generated string.
Notes:
- Output values will be frozen only across identical repeated masking runs. Variations in database content (such as inserted or deleted rows) or Run Options (i.e. Batch Size, Max rows) can still result in different outputs even when using the same
run_secret
. Additionally, using a non-unique key may result in non-deterministic results. To retain consistent masking results across variable data sources, use the Deterministic masking features.- As the parallelised unloading of Amazon Redshift data into S3 files cannot guarantee consistent ordering between runs, a fixed
run_secret
may still result in different outputs between masking runs on Amazon Redshift databases. To retain consistent masking results with Amazon Redshift, use the Deterministic masking features.
Deterministic masking
If you wish to make the masked values deterministic based on one or more input values, you can provide hash sources in a ruleset. DataMasque will use the hashed value(s) from the specified sources as one of the random generator inputs (along with the instance seed and the run secret). This allows consistent output values to be generated even across different databases, inside documents (XML/JSON), or when masking files.
For database or tabular file masking, this is done with a hash_columns argument for each column being masked. For file masking, the parameter is called hash_sources and the hash is used for the entire file.
Changes to hashing values in 2.8.0: Prior to DataMasque 2.8.0, values to hash were converted to strings then hashed. This may have resulted in duplicate output values when hashing on
null
. From version 2.8.0, hashing occurs on binary representations of data. This means that hashed outputs from DataMasque 2.8.0 onwards will differ from those generated by prior versions.
Deterministic masking with databases or tabular files
Each hash_columns
item can either be just the name of the column, or can specify a json_path
or xpath
to fetch a
hash value from inside JSON or XML, respectively. For more information on json_path
or xpath
please
refer to the JSON documentation/XML documentation.
For JSON/XML hash_columns
the following logic is applied:
- If the value extracted from the
json_path
/xpath
isnull
the hash will be performed on thenull
value. - If the value does not exist at the
json_path
/xpath
the hash will be performed onnull
. - If the JSON/XML data cannot be decoded (is invalid) the hash will be performed on the entire column, even if
null
.
Note: Since null is the default fallback for invalid data, this may result in unexpected repeated masked values. Warnings will be added to the run logs when this fallback occurs. The json_path and xpath optional parameters cannot both be set for the same hash column. It is possible to hash a column on to itself (i.e. hash_columns contains the current column). If you hash a column on itself, and that column is used as the hash for other columns, the mask on the hash column should come last in the rules. If not, that column will use the pre-masked value as the hash while the other columns use the post-masked value, meaning different values will be generated in the same row for the same mask definition.
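A sketch of that ordering, where last_name is hashed on itself and is also the hash source for first_name (the table, key, and seed files follow the earlier examples):
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      # first_name hashes on last_name; because the last_name rule runs later,
      # this hash still uses the original (pre-masked) last_name value.
      - column: first_name
        hash_columns:
          - last_name
        masks:
          - type: from_file
            seed_file: DataMasque_firstNames_mixed.csv
            seed_column: firstname-mixed
      # last_name hashes on itself; placing its rule last keeps every rule in
      # the row hashing on the same pre-masked value.
      - column: last_name
        hash_columns:
          - last_name
        masks:
          - type: from_file
            seed_file: DataMasque_lastNames.csv
            seed_column: lastnames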
Example
This example will mask the date_of_birth
column with a date value that has been deterministically generated based
on the hash of date_of_birth
and first_name
column values combined with a one-off randomly generated secret for
this run.
For example, in every row where date_of_birth = '2000-01-01'
and first_name = 'Carl'
, the date_of_birth
will be
replaced with a deterministically generated value (e.g. 1999-03-03
). This same replacement value will be generated
for repeated rows that match the same date_of_birth
and first_name
values within this run. However, this value
will be different for each run. To retain consistency of deterministically generated values between multiple runs,
see Consistent masking.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
hash_columns:
- date_of_birth
          - first_name
masks:
- type: from_random_date
min: '1980-01-01'
max: '2000-01-01'
An equivalent alternative version of YAML syntax is shown below:
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: date_of_birth
hash_columns:
- column_name: date_of_birth
          - column_name: first_name
masks:
- type: from_random_date
min: '1980-01-01'
max: '2000-01-01'
Example with xpath
This example will mask the date_of_birth
column with a date value that has been deterministically generated based
on the hash of the date_of_birth
attribute from the following XML document column.
<Root>
<Info>
<Employee date_of_birth="2022-10-08">Steve</Employee>
<Employee date_of_birth="2022-10-08">Fred</Employee>
</Info>
</Root>
To get the date_of_birth
attribute from the XML document we need to specify the path with an xpath
. The following xpath
will get the date attribute from the first Employee
element of the XML document, Info/Employee[1]/@date_of_birth
.
If you want to perform the hash on the list of all date_of_birth attribute values, you can specify Info/Employee/@date_of_birth, as in the example below.
version: "1.0"
tasks:
- type: mask_table
table: xml_test
key: id
rules:
- column: date_of_birth
hash_columns:
- column_name: xml_data
xpath: "Info/Employee/@date_of_birth"
masks:
- type: from_random_text
min: 5
max: 10
When the date_of_birth attribute has the same value for both Employee elements, the resulting masked date_of_birth column contains the same value for those rows.
Deterministic masking with files
File masking tasks (mask_file
) support deterministic generation of masked values which are based on input values. You
can provide a hash_sources
list in a rule, similar to the hash_columns
for database masking.
However, with hash_sources
you must specify additional parameters to fetch the hash.
Parameters
- json_path (optional): A path to the value in the JSON data to use for the hash.
- xpath (optional): An Xpath to the value stored in the XML data to use for the hash.
- file_path (optional): A boolean value; when set to true the hash will be performed on the file's path (relative to the base directory). Defaults to false.
- case_transform (optional): upper/lower - Apply a case transform on the input value, for consistent hash values irrespective of case. A sketch using case_transform is shown below.
Note: Each of json_path, xpath and file_path is shown as optional, but exactly one of them must be specified. For more information on json_path or xpath please refer to the JSON documentation/XML documentation.
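As an illustration, a hash_sources fragment using case_transform so that differently-cased inputs produce the same hash value (the json_path matches the example that follows; the masks are left as a placeholder):
rules:
  - hash_sources:
      - json_path: ['users', 'name']
        # "Richard" and "RICHARD" now produce the same hash value.
        case_transform: lower
    masks:
      - <list of masks>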
Example with json_path
This example will mask a JSON file which contains the following JSON data:
{
"users": {
"name": "Richard",
"addresses": [
{
"type": "postal",
"city": "Nelson"
},
{
"type": "physical",
"city": "Napier"
}
]
}
}
We want to mask the city
of the users
, hashed with the name
of the users
. To do this we need
to specify a json_path
to the value we want to use for the hash ([users
,name
]) in the hash_sources
, as shown in
the ruleset below. The replacement value for city is best drawn from one of the seed files DataMasque provides.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- json_path: ['users', 'name']
masks:
- type: json
transforms:
- path: ['users','addresses', "*",'city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
When the name value is the same, the resulting masked values for city are also the same, even across separate files.
Example with xpath
This example will mask two different XML files but generate consistent masked data by using user id as a hash. The first contains a user's info.
<User id="8472">
<Name>Evelyn</Name>
<Address>
<City>Nelson</City>
</Address>
</User>
The second contains a list of user's log messages:
<Logs userId="8472">
<Message to="Evelyn" date="2022-08-09" type="welcome"/>
<Message to="Evelyn" date="2022-08-10" type="forgot_password"/>
</Logs>
In both cases, we want the user's name to be consistent, and this can be done by specifying the user's ID as the hash
source for each file. We'll create a ruleset with two mask_file
tasks, and use includes
to map the right task to the
correct file.
The first task will mask the User/Name text value, and the second will mask the to attribute of each Logs/Message element. The first task will use the id attribute on User as the hash source, while the second will use the userId attribute on Logs.
Here's the ruleset to accomplish this:
version: "1.0"
tasks:
  - type: mask_file
    includes:
      - glob: user_info_*.xml
    rules:
      - hash_sources:
          - xpath: 'User/@id'
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<User />'
            transforms:
              - path: 'User/Name'
                node_transforms:
                  - type: text
                    masks:
                      - type: from_file
                        seed_file: DataMasque_firstNames_mixed.csv
                        seed_column: firstname-mixed
  - type: mask_file
    includes:
      - glob: user_logs_*.xml
    rules:
      - hash_sources:
          - xpath: 'Logs/@userId'
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Logs />'
            transforms:
              - path: 'Logs/Message'
                node_transforms:
                  - type: attribute
                    attributes: 'to'
                    masks:
                      - type: from_file
                        seed_file: DataMasque_firstNames_mixed.csv
                        seed_column: firstname-mixed
The two output files would use consistent names. The first file:
<User id="8472">
<Name>Sarah</Name>
<Address>
<City>Nelson</City>
</Address>
</User>
And the second:
<Logs userId="8472">
<Message to="Sarah" date="2022-08-09" type="welcome"/>
<Message to="Sarah" date="2022-08-10" type="forgot_password"/>
</Logs>
Example with file_path
This example shows a ruleset that hashes on the file path; it applies to any mask_file task.
version: "1.0"
tasks:
- type: mask_file
rules:
- hash_sources:
- file_path: true
masks:
- type: json
transforms:
- path: ['users','addresses', "*",'city']
masks:
- type: from_file
seed_file: DataMasque_mixed_countries_addresses.csv
seed_column: city
With this ruleset the hashing will be performed on the file path, so each file with a different path will have differing masked values. This is useful if you want repeatable masking based on the name of a file.
Common regular expression patterns
A regular expression (or "regex") is a sequence of characters that acts as a search pattern
to filter or select substrings of text strings. They can be used
in replace_regex
masks and matches
conditions
of if
and skip
blocks.
Note that it is best practice to wrap the regular expression in quotes to avoid
special characters being misinterpreted as YAML syntax:
version: '1.0'
tasks:
- type: mask_table
table: driversLicence
key: ID
rules:
- column: driversLicence
masks:
- type: replace_regex
regex: "[0-9]"
masks:
- type: from_fixed
value: "#"
Notes: DataMasque uses the Python regular expression syntax, which you can learn more about in the official tutorial.
Regular expressions allow you to match various characters by specifying a character or character set. Below is a list of character sets that can be used in regular expressions.
Characters can be specified as character classes.
- "[0-9]" can be used to match any numerical character. "[a-z]" and "[A-Z]" will match lower case and upper case alphabetical characters respectively.
- Classes can also be combined; "[a-zA-Z]" will match any upper or lower case character.
- You may also specify a list of characters to match within square brackets. If you wish to only match vowels, you can use "[aeiouAEIOU]".
- If a caret character is added as the first character inside the character class, it will instead match any character other than those specified; "[^0-9]" will match any character other than a numerical character.
The table below details other ways to denote certain character types without the use of character classes.
Character | Description |
---|---|
\d | One digit character, ranging from 0 to 9. |
\w | One 'word' character; an ascii letter, a digit or an underscore. |
\s | One whitespace character. This can be a space, a line break, or a tab. |
\D | One character that is NOT a digit. |
\W | One character that is NOT a word character. |
\S | One character that is NOT a whitespace character. |
You can also choose to add a quantifier to any character or character class, which will specify how many characters will be matched.
Quantifier | Description |
---|---|
+ | One or more characters will be matched. |
{n} | Exactly n characters will be matched. For example, \d{3} will match strings of exactly 3 numerical digits. |
{a,b} | Any string of characters between a and b in length will be matched. For example, \w{1,3} will match a string of word characters between 1 and 3 characters long. |
{c,} | Any string of characters with c or more characters. For example, \d{2,} will match a string of numerals of length 2 or more. |
* | Any instance of the characters appearing zero or more times. |
? | Matches if the character or character class appears once, or no times. |
Below are some common regular expression patterns that can be used to match typical database strings.
Example Patterns
Phone Number
The pattern below will match a phone number consisting of an international code in parentheses followed by a 7-digit phone number that may be hyphenated after the 3rd digit.
"\(\+\d{1,4}\)\s?\d{3}-?\d{4}"
Some examples of matching strings are shown below:
- (+64)123-4567
- (+1234) 5678910
- (+61) 987-5432
Social Security Number
The pattern below will match a Social Security Number consisting of hyphen-separated digits:
"\d{3}-\d{2}-\d{4}"
Some examples of matching strings are shown below:
- 111-22-3333
- 987-65-4321
- 112-35-8132
Time
The pattern below will match a time consisting of 1 or 2 hour digits and 2 minute digits separated by a colon:
"[0-2]?\d:\d{2}"
Some examples of matching strings are shown below:
- 7:30
- 23:50
- 05:45
Email Address
The pattern below will match an email address consisting of an alphanumeric username and
an alphanumeric domain name separated by an @
symbol. For a more comprehensive email
regular expression, see: emailregex.com.
"[a-zA-Z0-9]+@[a-zA-Z0-9\.]+"
Some examples of matching strings are shown below:
- janedoe@gmail.com
- bobsmith@hotmail.com
- thomas@yahoo.com
DISCLAIMER: The above examples are simplistic and general examples to demonstrate how regular expressions can be constructed - you should ensure these regular expressions are appropriate for your particular data before using them in your masking rulesets.