Ruleset YAML specification

The ruleset YAML configuration provides the instructions that DataMasque follows when performing a masking run against a target database. A ruleset consists of one or more tasks, which can contain many different types of instructions. The most common use case is a task that applies masking rules to sensitive data in database tables.

Ruleset Properties

The following properties are specified at the top-level of a ruleset YAML file:

  • version (required): The schema version used by this ruleset. The default value present in the editor when creating a new ruleset is generally the value that you should be using. See Schema Versioning for more information.
  • name (deprecated; will be removed in release 3.0.0): A unique name that will be used to refer to your ruleset. This field may only contain alphanumeric characters and underscores, and its length is limited to 64 characters.
  • tasks (required): A list of tasks to be performed in order on the target database. See Database Task Types for the comprehensive list of the available task types and their associated parameters.
  • task_definitions (optional): A list of task definitions to be referenced from the ruleset's list of tasks through YAML anchors and aliases.
  • rule_definitions (optional): A list of rule definitions to be referenced from a list of rules through YAML anchors and aliases.
  • mask_definitions (optional): A list of mask definitions to be referenced from a list of masks through YAML anchors and aliases.
  • skip_defaults (optional): See Default values to skip.
  • random_seed (optional): Deprecated in favour of the Run secret option. See Freezing random values.
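
As a minimal sketch of how a definition property might be combined with YAML anchors and aliases: the anchor name redact_last_name is hypothetical, and the exact shape of a definition entry is an assumption here, since this section only states that definitions are referenced through anchors and aliases.

```yaml
version: '1.0'
mask_definitions:
  # '&redact_last_name' is a YAML anchor; the name is hypothetical.
  - &redact_last_name
    type: from_fixed
    value: 'redacted last name'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          # '*redact_last_name' is a YAML alias that reuses the anchored mask.
          - *redact_last_name
```

Defining the mask once and aliasing it keeps repeated masks consistent if the same replacement is needed in several rules.
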
Example

The following ruleset provides an example for replacing the last_name column of every row in the users table with the fixed value "redacted last name":

# My ruleset.yml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'

Note: The tasks must be indented using two spaces. For example:

tasks:
  - type: mask_table
^^

- type: mask_table is indented two spaces from the parent tasks:

Important note on case-sensitivity: For all tasks except run_sql, database identifiers, such as table and column names, should be referenced as you would otherwise reference them in an SQL query. When masking a case-sensitive database, identifiers must be referenced in the ruleset using the correct case.

To refer to a case-sensitive table or column in a database, the identifier must be enclosed in double quotation marks. However, when writing YAML, quotations are used to denote a string value, so any enclosing quotations are not considered as part of the value. As such, it is necessary to enclose the entire name - including double quotation marks - in an outer set of single quotation marks. For example:

# Case-sensitive table name; enclosed in both single and double quotations.
table: '"CaseSensitiveTableName"'

To refer to a case-sensitive table in a schema, the schema name must also be enclosed in quotation marks if the schema name is case-sensitive. The entire combination of schema and table name must be enclosed in single quotation marks. For example:

# Case-sensitive schema and table name; enclosed in both single and double quotations.
table: '"CaseSensitiveSchemaName"."CaseSensitiveTableName"'

If referencing a combination of table and column, you will need to have quotation marks around both the table and column names within the surrounding single quotation marks. For example:

# Case-sensitive table and column name; enclosed in both single and double quotations.
column: '"CaseSensitiveTable"."CaseSensitiveColumn"'

Identifier names containing double quotation marks, backslashes, periods, and whitespace should always be enclosed in double quotation marks. Also, literal double quotation marks and backslashes must be preceded by a backslash:

# Case-sensitive table and column name containing special characters.
column: '"Case\"Sensitive\\Table"."Case.Sensitive Table"'
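
The quoting rules above can be put together in a single task. The following is a sketch only; the schema, table, and column names (Sales, CustomerAccounts, Customer_ID, LastName) are hypothetical.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    # Schema and table are each wrapped in double quotation marks;
    # the whole value is then wrapped in single quotation marks for YAML.
    table: '"Sales"."CustomerAccounts"'
    key: '"Customer_ID"'
    rules:
      - column: '"LastName"'
        masks:
          - type: from_fixed
            value: 'redacted last name'
```

Without the outer single quotation marks, YAML would treat the double quotation marks as string delimiters and drop them from the value.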

Notes:

  • Backslashes and single quotation marks are not supported in identifier names for Microsoft SQL Server (Linked Server) databases.
  • For Redshift databases, DataMasque follows PostgreSQL's rules for case-sensitivity of quoted/unquoted identifiers if enable_case_sensitive_identifier is true, and treats all identifiers as case-insensitive if enable_case_sensitive_identifier is false. See:

https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html

Schema versioning

Schema changes to the DataMasque ruleset specification are tracked using the version field of the ruleset. The version number consists of two fields in the format major.minor. Minor version increments reflect backwards-compatible changes to the schema, whereas major version increments represent breaking changes which will require some form of migration from previous versions. Wherever possible, DataMasque will handle such migrations for you automatically when you upgrade.

Each release of DataMasque only supports the most recent major ruleset version at the time of release. As such, the major schema version of your rulesets must equal the major version supported by your DataMasque release. The minor schema version of your rulesets must be equal to or less than the minor version supported by your DataMasque release.

The ruleset schema version supported by this release of DataMasque is "1.0".


Writing comments

You can write comments in a ruleset to describe what a particular block does. Commented text is skipped during execution.

If you begin a line with # (hash symbol), all text on that line will become a comment.

version: '1.0'
# This line will become a comment.
tasks:
  - type: mask_table

If you place # on a line, all text after that # on that line will become a comment. Any text before it will still be part of the ruleset.

version: '1.0'
tasks:
  - type: mask_table # The name of this task type will not be affected by this comment.

In the DataMasque ruleset editor, the shortcut key combination for commenting ruleset blocks is CTRL + / on Linux and Windows and ⌘ + / for Mac operating systems. If your cursor is on a line and this shortcut is used, the entire line will be commented out. Highlighting multiple lines at once will cause all highlighted lines to be commented out.

Database task types

A ruleset is comprised of a list of tasks to be performed in sequence on the target database. Tasks are performed serially from top to bottom, but special serial and parallel tasks can be used to nest other tasks within them for performance (parallelism) or dependency management.
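
As a rough sketch of nesting, the ruleset below wraps two mask_table tasks in a parallel task. The nested tasks key is an assumption, since this section does not show the parameters of the serial and parallel task types, and the customers table is hypothetical.

```yaml
version: '1.0'
tasks:
  # Assumed syntax: a parallel task holding a nested list of tasks.
  - type: parallel
    tasks:
      - type: mask_table
        table: users
        key: user_id
        rules:
          - column: last_name
            masks:
              - type: from_fixed
                value: 'redacted last name'
      - type: mask_table
        table: customers   # hypothetical table
        key: customer_id
        rules:
          - column: last_name
            masks:
              - type: from_fixed
                value: 'redacted last name'
```

The two nested tasks operate on different tables, so neither depends on the other having finished first.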

Sensitive data discovery

The run_data_discovery task type inspects the metadata of your database, searching for columns which are likely to contain sensitive data. On completion, a report is generated containing a summary of all identified columns, and their current masking coverage. It is recommended to include a single run_data_discovery task in your rulesets to help ensure complete masking coverage and provide ongoing protection as new sensitive data is added to your database.

See the Sensitive Data Discovery guide for more information on this feature.

Note: The run_data_discovery task type is not currently supported for Amazon DynamoDB.

Parameters

This task type does not have any parameters.

Example

The following shows an example ruleset that will execute only the run_data_discovery task and no masking tasks. This example usage may be useful when starting a new ruleset from scratch to determine a starting point for developing your masking rules.

The run_data_discovery task may also be included in a ruleset alongside other masking tasks to provide continuous feedback on the masking coverage provided by the ruleset.

version: '1.0'
tasks:
  - type: run_data_discovery
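
For the second usage described above, a ruleset that pairs discovery with a masking task might look like the following sketch. Listing run_data_discovery first is an arbitrary choice here; this section does not prescribe a position for it.

```yaml
version: '1.0'
tasks:
  # Reports likely-sensitive columns and their current masking coverage.
  - type: run_data_discovery
  # A regular masking task in the same ruleset.
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'
```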

Schema Discovery

The run_schema_discovery task type inspects the metadata of your database, searching for schemas, tables, and columns and can flag certain columns which are likely to contain sensitive data. On completion, a report is generated containing a summary of all identified schemas, tables, columns, and relevant metadata of the data within the columns.

See the Schema Discovery guide for more information on this feature.

Note: The run_schema_discovery task type is not currently supported for Amazon DynamoDB.

Parameters

This task type does not have any parameters.

Example

The following shows an example ruleset that will execute only the run_schema_discovery task and no masking tasks. This example usage may be useful when starting a new ruleset from scratch to determine a starting point for developing your masking rules.

version: '1.0'
tasks:
  - type: run_schema_discovery

Table masks

Each mask_table task defines masking operations to be performed on a database table. More detail is provided about these tasks under the Masking Tables section.

Parameters

Each task with type mask_table is defined by the following parameters:

  • table (required): The name of the table in the database. The table name can be prefixed with a schema name to reference a table in another schema. If the table or schema name is case-sensitive, you must enclose the name in double and single quotation marks in order to specify the casing of the name. For example, table: '"CaseSensitiveSchema"."CaseSensitiveTable"'

    • For rulesets run against Amazon DynamoDB connections, the table name can be prefixed with the AWS region that contains the table e.g. <region>/<table-name> (without angle brackets). Note that the region prefix is required if you have not created a .aws/credentials file or environment variable with the default region or are running DataMasque on EC2. Amazon DynamoDB connections currently only support masking a single table in each ruleset.
  • key (required): One or more columns that identify each table row. Composite keys may be specified for the key parameter. For more details on using composite keys, see Composite keys.

    • For Oracle databases it should always be ROWID (key: ROWID). DataMasque will implicitly use ROWID when ROWID is not specified. For more details, refer to Query optimisation.
    • For Microsoft SQL Server and PostgreSQL databases it is recommended to use the primary key, or any other unique key that is not modified during masking, for better performance.
    • The key columns must not contain any NULL value.
    • If the key column names are case-sensitive, you may enclose each key value in double and single quotation marks in order to specify the casing of the key. For example, key: '"Customer_ID"'
    • If the columns specified for the key parameter cannot be used to uniquely identify rows, then the masked values will be the same for rows that have the same key value. Refer to the key and hash columns example in the Notes section for how to avoid producing duplicate masked values.
    • Columns specified in the key cannot be masked in the task's rules. DynamoDB is an exception to this rule, allowing key columns to be masked with e.g. the from_unique mask type.
  • on_missing_column (optional): The action to take if a column in the ruleset is not found on the table. This option is only available for DynamoDB. If specified, must be one of error or skip. Defaults to error, which means a masking run will fail if a column in the ruleset does not exist. If set to skip then missing columns do not cause an error and other columns are masked as normal.

  • rules (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:

    • column (required): The name of the column to mask. For rulesets run against Amazon DynamoDB connections, a wildcard * can be used to apply masks to columns not matched by other columns' rules.
    • masks (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
    • hash_columns (optional): A list of columns which will be used as input to the Deterministic masking algorithm for this rule. If hash_columns is provided, all mask types that rely on randomisation become deterministic based on the hash_columns column values. Values in the provided columns can be null. The hash column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema. Hash columns can also be specified by a list of dictionaries with the following keys:
      • column_name (required): The name of the column which will be used as input as above.
      • case_transform (optional): upper/lower - Applies a case transform to the input values, giving consistent hashed values irrespective of case. This is useful if values are stored with different cases in different tables. For example, email addresses could be stored as all lowercase in one table but mixed case in another.
      • json_path (optional): If the column contains JSON data, the path to the value in the JSON data you wish to perform the hash on can be specified here, otherwise the hash will be performed on the entire column.
      • xpath (optional): If the column contains XML data, the XPath to the value in the XML data on which you wish to perform the hash can be specified here; otherwise the hash will be performed on the entire column.

    For more information on json_path, refer to the JSON documentation. For more information on xpath, refer to the XML documentation. For more information on deterministic masking, refer to the Deterministic masking section.
  • workers (optional): The number of parallel processes to use for masking this table (defaults to 1). Each process will operate on a separate batch of rows from the table (batch size is a configurable run option). Increasing workers may decrease masking run times, dependent on database performance and the complexity (e.g. number of columns to mask) of the masking task. It is recommended to increase the number of workers if DataMasque connections to your target database spend more time processing queries than waiting for DataMasque (i.e. the "waiting for clients" time approximates DataMasque CPU time), which suggests DataMasque could efficiently use multiple workers to mask other batches while waiting for database responses. Additionally, it is recommended that the number of parallel processes multiplied by the number of workers assigned to each process does not exceed twice the number of CPUs available on your deployed instance.

  • index_key_columns (optional): When index_key_columns is true (the default setting), DataMasque will create an additional index on all key columns if no existing index contains all key columns. You may wish to disable the creation of a new index if you have an existing index on some of the key columns that will provide adequate performance.

    • For Oracle databases, this option has no effect because ROWID is always used as the key.
    • Redshift databases do not support indexes, so this option has no effect for them.
    • This option currently has no effect for Microsoft SQL Server (Linked Server) databases.
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - user_id
      - first_name
    rules:
      - column: date_of_birth
        hash_columns:
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'
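
The optional parameters described above can be combined in one task. The following sketch uses the dictionary form of hash_columns together with workers and index_key_columns; the email column and the chosen values (4 workers, index creation disabled) are illustrative assumptions rather than recommendations.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    workers: 4                 # four parallel processes instead of the default of 1
    index_key_columns: false   # do not create an extra index on the key columns
    rules:
      - column: date_of_birth
        hash_columns:
          # Dictionary form: hash on the upper-cased value of a hypothetical email column.
          - column_name: email
            case_transform: upper
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'
```

With case_transform set, rows whose email values differ only in case should produce the same masked date_of_birth.
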
Composite keys

A composite key can be specified in either of the following formats:

  • A list of columns on a single line; for example, key: ['invoice_id', 'product_id'] indicates a composite key consisting of a combination of two columns, invoice_id and product_id.
  • A multiline composite key. An example of a multiline composite key is shown below.
  key:
    - 'invoice_id'
    - 'product_id'

For Microsoft SQL Server and PostgreSQL, the columns of a composite key must be listed in the same order as they are defined in the database.

Example mask_table ruleset
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
        - type: from_fixed
          value: "redacted last name"
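
For Amazon DynamoDB connections, the DynamoDB-specific options described in the parameters above might be combined as in the sketch below. The region us-east-1 and the table and column names are hypothetical, and how the wildcard rule interacts with key columns is not described in this section.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: us-east-1/users     # <region>/<table-name>; both parts are hypothetical
    key: user_id
    on_missing_column: skip    # missing columns are skipped instead of failing the run
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'
      - column: '*'            # wildcard: columns not matched by another rule
        masks:
          - type: from_fixed
            value: 'redacted'
```

The asterisk is quoted because an unquoted * starts an alias in YAML.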

Notes

  1. Index operations will be performed online (ONLINE=ON) on SQL Server editions that support this feature.
  2. The following types cannot be used as key columns:
    • Microsoft SQL Server
      • datetime
      • time(7)
      • datetime2(7)
      • datetimeoffset(7)
    • PostgreSQL
      • real
      • double precision
  3. While rules and masks should typically be provided as lists, they can also be specified as dictionaries that map arbitrary keys to rules/masks. For example:
...
rules:
  last_name_rule:
    column: last_name
    masks:
      fixed_mask:
        type: from_fixed
        value: "redacted last name"
     Specifying rules or masks as a dictionary allows you to override the rule/mask for a specific key when inheriting from a definition.
  4. If a non-unique key is specified for a mask_table task alongside hash_columns, and rows that share the same key value have different hash_columns values, the final masked values will depend arbitrarily on the order in which update statements are executed. This can be avoided by including the targeted hash_columns as part of a composite key for the mask_table task.

Mask a primary key or unique key

The mask_unique_key task type can be used to mask the values in a primary key or unique key. Masking of a primary key or unique key has the requirement that all masked values are unique, which requires the use of this special-purpose task type.

The mask_unique_key task type replaces all non-null rows of the target key with new, unique values, generated in accordance with a user-specified format. The target primary or unique key columns and associated foreign key columns are updated with these unique replacement values in a single operation to maintain referential integrity.

Any foreign keys that reference a column to be updated will be disabled prior to masking in order to prevent referential integrity errors, and will be re-enabled after the completion of the task.

Notes:

  • Each mask_unique_key task will mask the members of a single primary key or unique key constraint. Multiple mask_unique_key tasks are required to mask multiple independent unique keys on a single table.
  • The mask_unique_key task can be used on tables with up to 200,000,000 non-null rows.
  • When applied to a composite key, replacement values are only generated for rows that contain a complete, non-null key. For any null or partially null rows, all columns of the target key will be set to NULL.
  • The mask_unique_key task must only be applied to columns which are in fact unique (i.e. the target key columns have a PRIMARY KEY or UNIQUE constraint enforced). Unique keys that have multiple NULL rows (e.g. using a filtered unique index in SQL Server) are allowed; such rows will not be modified by this task.
  • When masking a clustered index on SQL Server, the performance of mask_unique_key can be significantly improved by disabling all other indexes and constraints on the target table for the duration of the task. It is recommended to implement this in your ruleset using run_sql tasks before and after the mask_unique_key task to disable and then re-enable these constraints.
  • The mask_unique_key task does not support SQL Server columns created with the IDENTITY property, or Oracle / PostgreSQL columns created with GENERATED ALWAYS AS IDENTITY.
  • Use of mask_unique_key for Amazon DynamoDB, Amazon Redshift or Microsoft SQL Server (Linked Server) databases is not currently supported in DataMasque. Consider using mask_table with a from_unique mask instead.
  • Due to the random assignment of replacement values, it is possible (though generally rare) that a row may be assigned a masked value that is identical to its pre-masking value. In these cases, the masking is still effective, as an attacker will not be able to identify which rows' values were replaced with an identical value. However, if you need to guarantee that all masked values are different from their pre-masking values, you should use min and max parameters to ensure the range of possible output values from your format string does not overlap with the range of pre-masking values in your database.
  • Notes for Amazon DynamoDB: The mask_unique_key task type is not supported for Amazon DynamoDB. The from_unique mask should be used instead.
  • Notes for masking files: The mask_unique_key task type is not supported for file masking. The from_unique mask should be used instead.
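
The SQL Server recommendation above (disabling other indexes around a mask_unique_key task) could be sketched as follows. This assumes that run_sql accepts its statement in a sql parameter, which this section does not document; the index name IX_Users_Email is hypothetical.

```yaml
version: '1.0'
tasks:
  # Assumption: run_sql takes the statement in a 'sql' parameter.
  - type: run_sql
    sql: 'ALTER INDEX IX_Users_Email ON Users DISABLE;'
  - type: mask_unique_key
    table: Users
    target_key:
      - column: AccountNumber
        format: '{!int,100000:}'
  # Rebuilding a disabled index re-enables it in SQL Server.
  - type: run_sql
    sql: 'ALTER INDEX IX_Users_Email ON Users REBUILD;'
```

Because top-level tasks run serially from top to bottom, the index is disabled before the key is masked and rebuilt afterwards.
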

Warning: The mask_unique_key task type must not be run in parallel with tasks that operate on any of the following:

  • The target table of the mask_unique_key task.
  • Tables containing foreign keys that reference the target_key columns (either directly or through a chain of foreign keys).
  • Any tables specified in additional_cascades.
Parameters

Each task with type mask_unique_key is defined by the following parameters:

  • table (required): The name of the database table that contains the primary key or unique key to be masked.
  • target_key (required): A list of items defining each column that makes up the primary or unique key, and the format in which replacement values will be generated for that column. Composite keys can be masked by including multiple columns and formats in this list. Each item has the following attributes:
    • column (required): The name of the column to be masked.
    • format (optional): The format which will be used to generate replacement values for the column. See Format string syntax for details. Defaults to '{!int}'.
  • additional_cascades (optional): Use this parameter to propagate masked values to implied foreign keys of the target_key. Implied foreign keys are dependencies that exist between tables but are not enforced by foreign key constraints, and hence are not defined in the database. Masked values will be cascaded to these columns; see Cascading of masked values for more details on how this works. Each additional_cascades item has the following attributes:
    • table (required): The name of the table containing the cascade target columns, which have an implicit reference to the target_key of this task. The table name can be prefixed with a schema name to cascade to a table in another schema.
    • columns (required): A list of column dictionaries - each describing the relationship between a column of the target key and a column on the cascade target table. Each column mapping item has the following attributes:
      • source (required): The name of a column in the target key from which masked values will be cascaded to the corresponding target column.
      • target (required): The name of a column on the cascade target table to which masked values from the source column will be cascaded.
    • update_foreign_keys (optional): If set to true, foreign keys referencing the target columns will be updated in the same way as foreign keys referencing the target_key. Defaults to true.
  • batch_size (optional): To avoid excessive memory consumption when masking large tables, DataMasque generates replacement values in batches. This value controls the maximum number of unique values that are generated in a single batch. In general, the default of 50,000 will be acceptable for most use cases. The batch size used by DataMasque will be limited by the number of rows in the target table.
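
As a sketch of the optional parameters above, the task below omits format (so the documented default '{!int}' applies), lowers batch_size, and turns off update_foreign_keys for one cascade. The AuditLog table, its AccountNo column, and the value 20000 are hypothetical.

```yaml
version: '1.0'
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: AccountNumber   # no format given, so the default '{!int}' is used
    batch_size: 20000           # generate at most 20,000 unique values per batch
    additional_cascades:
      - table: AuditLog         # hypothetical table with an implied foreign key
        columns:
          - source: AccountNumber
            target: AccountNo
        # Do not update foreign keys that reference AuditLog.AccountNo.
        update_foreign_keys: false
```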

Note:

  • When using additional_cascades to propagate a composite primary or unique key to implied foreign key columns, you need to specify in the additional_cascades parameter every column that corresponds to a column of the referenced primary or unique key.
    • Warning: Using additional_cascades to propagate only some of the columns of a composite primary or unique key is not recommended. Partial cascades that reference a subset of the target key columns will result in data for the omitted columns not being propagated to the target table, potentially resulting in inconsistent data between the two tables.
    • For Oracle databases, all columns of the target_key must be referenced as source columns in additional_cascades. Partial cascades that reference a subset of the target key columns are disallowed.

For example, consider a composite unique key in a PhoneNumbers table that consists of the Prefix and LineNumber columns. These columns are referenced by the PhPrefix and PhLineNumber columns in the CallLogs table, but without a foreign key constraint. This is therefore an implied foreign key, and the additional_cascades parameter is required to propagate the masked unique key values and maintain data integrity across the tables.

[Diagram: PhoneNumbers and CallLogs tables linked by an implicit foreign key]

The ruleset must specify all corresponding implicit foreign key columns in the additional_cascades parameter so that the masked unique keys are propagated together to the foreign keys:

version: "1.0"
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix  # part of the composite unique key constraint
        format: "{!int, 1:150, pad}"
      - column: LineNumber  # part of the composite unique key constraint
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      - table: CallLogs
        columns:
          # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly.
          - source: Prefix  # UK a
            target: PhPrefix
          - source: LineNumber  # UK a
            target: PhLineNumber
  • However, if the implied foreign keys are individual keys rather than a composite key, the ruleset needs to propagate the masked unique keys individually:

[Diagram: PhoneNumbers and CallLogs tables linked by implicit foreign keys]

version: "1.0"
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix  # has its own unique key constraint
        format: "{!int, 1:150, pad}"
      - column: LineNumber  # has its own unique key constraint
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly.
      - table: CallLogs
        columns:
          - source: Prefix  # UK a
            target: PhPrefix
      - table: CallLogs
        columns:
          - source: LineNumber  # UK b
            target: PhLineNumber
Example 1

The following example will mask the primary key column AccountNumber of the Users table with unique replacement values. Another table, Transactions, also has a column named AccountNumber which has a foreign key relationship to the AccountNumber column of the Users table.

[Diagram: Users and Transactions tables]

Account numbers will be generated with at least 6 digits; the minimum value being 100,000 and the maximum value growing as required depending on the number of rows in the table.

In the ruleset below, you only need to specify masking rules for the primary key column to be masked, AccountNumber. You do not need to explicitly define the foreign key columns that the replacement values should be propagated to. DataMasque will automatically detect primary key and foreign key relationships in the database and propagate the replacement values to any related foreign key columns; in this case, the new values for AccountNumber in the Users table are implicitly propagated to the AccountNumber column in the Transactions table.

version: '1.0'
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: '"AccountNumber"'
        format: '{!int,100000:}' # Account numbers will be generated with at least 6 digits

Result

Users table

Before:

AccountNumber  Name
100001         Adam
100002         Brenda
100003         Charlie
100004         Diane
100005         Esther
100006         Francis

After:

AccountNumber  Name
106512         Adam
109983         Brenda
161822         Charlie
189413         Diane
122576         Esther
197698         Francis

Transactions table

Before:

TransactionDate  AccountNumber
2021-01-01       100001
2021-01-02       100002
2021-01-03       100004
2021-01-04       100004
2021-01-05       100003
2021-01-06       100001
2021-01-07       100006
2021-01-08       100005

After:

TransactionDate  AccountNumber
2021-01-01       106512
2021-01-02       109983
2021-01-03       189413
2021-01-04       189413
2021-01-05       161822
2021-01-06       106512
2021-01-07       197698
2021-01-08       122576
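
If masked values must never coincide with their pre-masking values, the note on min and max suggests choosing a range that does not overlap the existing data. As a sketch, assuming every existing account number is below 200,000 (as in the result shown above), the minimum could be raised:

```yaml
version: '1.0'
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: '"AccountNumber"'
        # Minimum raised above the existing values (100001 to 100006 in the result above),
        # so no replacement value can equal a pre-masking value.
        format: '{!int,200000:}'
```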

Example 2

The following example will mask a composite unique key of a PhoneNumbers table. The composite key consists of the following columns:

  • Prefix VARCHAR(3): containing a zero-padded integer in the range (1-150). e.g. 001, 002, etc.
  • LineNumber VARCHAR(6): containing a zero-padded integer in the range (50,001-100,000).

Values matching these specific formats can be generated using the following format strings:

  • '{!int, 1:150, pad}': Generates integers between 1-150 (inclusive), zero-padded to a fixed width of 3 characters.
  • '{!int, 50001:100000, pad}': Generates integers between 50,001-100,000 (inclusive), zero-padded to a fixed width of 6 characters.

Because both of these formats have an upper value bound, we must consider the maximum number of unique composite values that are available in this space. Multiplying the number of values in the two ranges (150 * 50,000), we can determine that these two format strings will supply us with 7,500,000 rows of unique composite values. As a result, this task would fail if applied to a table containing more than 7,500,000 (non-null) rows.
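
The capacity calculation can be annotated directly on the target_key fragment. This is only a restatement of the arithmetic above as comments, not additional configuration:

```yaml
target_key:
  - column: Prefix
    format: "{!int, 1:150, pad}"         # 150 - 1 + 1 = 150 possible values
  - column: LineNumber
    format: "{!int, 50001:100000, pad}"  # 100000 - 50001 + 1 = 50,000 possible values
# Unique (Prefix, LineNumber) pairs available: 150 * 50,000 = 7,500,000
```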

In addition to the PhoneNumbers table, this schema also includes a CallLogs table containing the columns PhPrefix and PhLineNumber, which are references to the values in the Prefix and LineNumber columns of the PhoneNumbers table. However, due to specific requirements of this schema, these references are not defined using a foreign key constraint. Without a foreign key, DataMasque will not automatically propagate the replacement values generated for the PhoneNumbers table to the CallLogs table.

[Diagram: PhoneNumbers and CallLogs tables]

In order to ensure the referential integrity of these implicit references is maintained during masking, this example uses additional_cascades to instruct DataMasque on how to cascade updated values to these columns. DataMasque will take the values of Prefix and LineNumber columns of the PhoneNumbers table and propagate these values to the PhPrefix and PhLineNumber columns of the CallLogs table.

version: '1.0'
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix
        format: "{!int, 1:150, pad}"
      - column: LineNumber
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      - table: CallLogs
        columns:
          - source: Prefix
            target: PhPrefix
          - source: LineNumber
            target: PhLineNumber

Result

PhoneNumbers table

Before:

id  Prefix  LineNumber
1   056     069875
2   140     093150
3   056     051531
4   114     089796
5   079     073665

After:

id  Prefix  LineNumber
1   012     050165
2   140     079463
3   115     096452
4   003     050010
5   105     065498

CallLogs table

Before:

PhPrefix  PhLineNumber  Timestamp
056       069875        1620110987503
056       069875        1620111698775
114       089796        1620113762010
056       069875        1620125649941
079       073665        1620125789210

After:

PhPrefix  PhLineNumber  Timestamp
012       050165        1620110987503
012       050165        1620111698775
003       050010        1620113762010
012       050165        1620125649941
105       065498        1620125789210

Example 3

Consider a database with two tables and two schemas: Customers in the Accounts schema; and Transactions in the Sales schema. The following example will mask the primary key column CustomerId of the Accounts.Customers table. However, any changes made to this CustomerId column must also be reflected on the Customer column of the Sales.Transactions table. Due to specific requirements of this database, these cross-schema references are not defined using a foreign key constraint. In order to maintain referential integrity, this example uses additional_cascades to instruct DataMasque how to cascade updated values to the relevant column present in the other schema.

The Customer ID consists of 3 letters, followed by a hyphen, then a 4-digit number. This will be constructed using a format string to ensure any values generated conform to the required standards.

version: "1.0"
tasks:
  - type: mask_unique_key
    table: Accounts.Customers
    target_key:
      - column: CustomerId
        format: "{[A-Z],3}-{[0-9],4}"
    additional_cascades:
      - table: Sales.Transactions
        columns:
          - source: CustomerId
            target: Customer

Result

Customers table (before)
CustomerId Name
AAA-1111 Adam
BBB-2222 Brenda
CCC-3333 Charlie
DDD-4444 Diane
EEE-5555 Esther

Customers table (after)
CustomerId Name
ACG-1435 Adam
BOD-0493 Brenda
CAG-6845 Charlie
HUM-5941 Diane
SNC-9341 Esther

Transactions table (before)
TransactionDate Customer
2021-01-01 AAA-1111
2021-01-02 EEE-5555
2021-01-03 DDD-4444
2021-01-04 BBB-2222
2021-01-05 AAA-1111
2021-01-06 CCC-3333
2021-01-07 DDD-4444
2021-01-08 EEE-5555

Transactions table (after)
TransactionDate Customer
2021-01-01 ACG-1435
2021-01-02 SNC-9341
2021-01-03 HUM-5941
2021-01-04 BOD-0493
2021-01-05 ACG-1435
2021-01-06 CAG-6845
2021-01-07 HUM-5941
2021-01-08 SNC-9341

Format string syntax

The format string syntax used by DataMasque to generate unique replacement values currently supports format strings that are a combination of alphanumeric characters and symbols. This can be used to generate key values that combine randomly generated portions with fixed formatting, producing keys that match any format required in your key columns.

Format strings consist of variable components that are declared inside braces {}, as well as fixed components outside of braces.

Note: In order to guarantee the uniqueness of generated values, format strings used with mask_unique_key and from_unique may contain at most one variable component whose length is not constant (e.g. a variable component for an unpadded random integer whose length will depend on the number of digits: {!int, 1:100}).

Format strings can be constructed using character sets. A character set is wrapped in braces and followed by a comma and a number indicating how many characters from the set are to be generated.

Character Set Description
[a-z] Lower case alphabetical characters.
[A-Z] Upper case alphabetical characters.
[0-9] Numerical characters.
[aeiou] Any vowel. Characters can be individually specified without using a range.

An example format string is shown below.

format: "{[a-z],2}_{[A-Z],2}-{[a-zA-Z],3}#{[0-9],5}"

In this example, there are four sets of curly braces within the format string, each providing a set of characters followed by a number indicating how many of these characters to generate.

  • The first braces specify {[a-z],2}, which will generate a string of 2 lower case alphabetical characters.
  • After this, there is an underscore outside the braces, which means all generated values will have an underscore after the 2 lower case characters.
  • The second braces specify {[A-Z],2}, which will generate a string of 2 upper case alphabetical characters.
  • After this second variable, there is a hyphen -, meaning a hyphen will always be present after the 2 upper case characters.
  • The third braces specify {[a-zA-Z],3}, which will generate a string of length 3 consisting of both lower case and upper case alphabetical characters.
  • After this, there is a hash #, meaning a hash character will always be placed after the third generated string.
  • The final braces specify {[0-9],5}, meaning a 5-digit number is placed at the end of the string.

The following values are some example outputs that may be generated using the format string above:

ab_TJ-RaK#10496 pt_OQ-TRu#49511 iu_QE-unT#67312 nd_UL-bES#97638
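To make the character-set syntax concrete, the following Python sketch expands a format string made of {[set],n} components into a random sample and counts how many distinct values the format can express. It is an illustration only: the names `COMPONENT`, `expand_charset`, `sample` and `capacity` are hypothetical, the sketch handles only character-set components (not !int, !hex, !uuid or alternatives), and unlike DataMasque it makes no attempt to guarantee uniqueness.

```python
import random
import re

# Matches one character-set component such as {[a-zA-Z],3}.
COMPONENT = re.compile(r"\{\[([^\]]+)\],\s*(\d+)\}")

def expand_charset(spec: str) -> str:
    """Expand a set like 'a-zA-Z' or 'aeiou' into its individual characters."""
    chars = []
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # A range such as a-z: include every character between the bounds.
            chars.extend(chr(c) for c in range(ord(spec[i]), ord(spec[i + 2]) + 1))
            i += 3
        else:
            chars.append(spec[i])
            i += 1
    return "".join(chars)

def sample(format_string: str, rng: random.Random) -> str:
    """Fill each variable component with random characters; fixed text is kept."""
    def fill(match):
        charset = expand_charset(match.group(1))
        return "".join(rng.choice(charset) for _ in range(int(match.group(2))))
    return COMPONENT.sub(fill, format_string)

def capacity(format_string: str) -> int:
    """Count the distinct values the character-set components can express."""
    total = 1
    for spec, length in COMPONENT.findall(format_string):
        total *= len(expand_charset(spec)) ** int(length)
    return total

fmt = "{[a-z],2}_{[A-Z],2}-{[a-zA-Z],3}#{[0-9],5}"
print(sample(fmt, random.Random()))
print(capacity(fmt))
```

The capacity is the product of the set size raised to the repeat count for each component, which is the same reasoning used for bounded integer ranges.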

Example

In this example, we wish to mask a series of number plates.

A number plate consists of 3 alphabetical characters, followed by a hyphen and 3 numerical digits. We wish to replace each number plate with the fixed letter X, followed by 2 random upper case alphabetical characters, followed by a hyphen and a 3-digit number.

A snippet of the table is shown below, where number_plate is a unique key of the table.

car_registration Table

number_plate car_owner
AAA-111 Anastasia
BBB-222 Bill
CCC-333 Chris
DDD-444 Judith
EEE-555 Gordon

In this case, we will use the following format string.

format: "X{[A-Z],2}-{[0-9],3}"

We are generating a fixed value of X, followed by 2 alphabetical characters, as defined by {[A-Z],2}. After this, there is a hyphen outside of braces, so a static value of a hyphen will always be generated. After this, 3 random numerical digits are generated.
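Because every component of this format has a fixed size, the number of distinct plates it can express is bounded, just like the composite integer key discussed earlier. The arithmetic can be sketched in Python (illustrative only; the variable names are hypothetical):

```python
# Distinct values expressible by the format "X{[A-Z],2}-{[0-9],3}".
letter_combinations = 26 ** 2   # {[A-Z],2}: two upper case letters
digit_combinations = 10 ** 3    # {[0-9],3}: three digits

# The leading X and the hyphen are fixed, so they add no variety.
plate_capacity = letter_combinations * digit_combinations
print(plate_capacity)  # 676000
```

By the same reasoning as for bounded integer ranges, this format would presumably be unable to supply unique values for a table with more than 676,000 non-null plates.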

A ruleset utilising this format string in a mask_unique_key task is shown below.

version: "1.0"
tasks:
    - type: mask_unique_key
      table: car_registration
      target_key:
        - column: number_plate
          format: "X{[A-Z],2}-{[0-9],3}"

Result

Before
number_plate car_owner
AAA-111 Anastasia
BBB-222 Bill
CCC-333 Chris
DDD-444 Judith
EEE-555 Gordon

After
number_plate car_owner
X Anastasia
XAT-475 Bill
XIB-015 Chris
XPQ-164 Judith
XUR-431 Gordon

Integer string format syntax

The !int operator can be used to generate integers. In its most basic use with no arguments, the format string {!int} will generate integers from one to infinity. Extra parameters can be added to set the output range or pad the output.

Range (min:max):

The range defines all possible integer values that may be generated by the integer generator. This is an optional parameter, defaulting to 1: (min=1, max=unbounded).

  • min (optional): The minimum value which will be generated (inclusive). Defaults to 1.
  • max (optional): The maximum value which will be generated (inclusive). When this value is not specified the maximum value will be unbounded, meaning it will grow depending on the number of values required.

For example:

  • {!int, 5:}: generate integers from 5 (inclusive) to infinity.
  • {!int, :100}: generate integers from 1 to 100 (inclusive).
  • {!int, 20:80}: generate integers from 20 to 80 (inclusive).

Zero-pad (pad):

  • pad (optional): When specified, zero-padding will be applied to generated integers - resulting in fixed character width replacement values (e.g. 001, 002, …, 999). The zero-pad width is determined by the width of the max value, and therefore is only a valid option when a max value is specified.

For example:

  • {!int, :1000, pad}: generate strings in the format 0001, 0002, etc, up to 1000.
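The range and pad rules can be summarised with a small Python generator. This is a sketch under stated assumptions, not DataMasque's implementation: `int_values` is a hypothetical name, and the values are enumerated in ascending order purely for readability (the order in which DataMasque assigns values is not described here).

```python
from itertools import count, islice
from typing import Iterator, Optional

def int_values(lo: int = 1, hi: Optional[int] = None, pad: bool = False) -> Iterator[str]:
    """Enumerate the values described by {!int, lo:hi[, pad]}."""
    if pad and hi is None:
        # The pad width comes from the max value, so a max is required.
        raise ValueError("pad is only valid when a max value is specified")
    width = len(str(hi)) if pad else 0
    numbers = count(lo) if hi is None else range(lo, hi + 1)
    for n in numbers:
        yield str(n).zfill(width)

print(list(islice(int_values(5), 3)))                   # ['5', '6', '7']
print(list(islice(int_values(hi=1000, pad=True), 2)))   # ['0001', '0002']
print(list(int_values(20, 80))[-1])                     # 80
```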

Note: Always wrap format strings in either single or double quotes. Leaving format strings unquoted in the ruleset will result in invalid YAML due to the opening { character, which has reserved usage in YAML. See below for examples of a correctly quoted format string:

format: '{!int}'        # Single quotation marks
format: "{!int, :999, pad}"   # or double quotation marks

Hex string format syntax

Integers can be generated and output in a hexadecimal format by using the !hex operator. The range and pad options apply in the same way as for standard integer generation; however, range values are interpreted as hexadecimal rather than decimal.

For example:

  • {!hex}: generate hex strings from 1 to infinity, i.e. 1, 2, …, a, b, …, ff, 100, etc.
  • {!hex, 10:100, pad}: generate hex strings from 0x10 to 0x100 (inclusive) with padding, i.e. 010, 011, …, 0fe, 0ff, 100.
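The padded hex example can be checked with a short Python sketch. The function name `hex_values` is hypothetical, and the ascending order is an assumption made for illustration; the point is that the bounds are parsed as hexadecimal and the pad width follows the width of the max value.

```python
from typing import List

def hex_values(lo: str, hi: str, pad: bool = False) -> List[str]:
    """Enumerate the values described by {!hex, lo:hi[, pad]}."""
    low, high = int(lo, 16), int(hi, 16)       # bounds are hexadecimal
    width = len(format(high, "x")) if pad else 0
    return [format(n, "x").zfill(width) for n in range(low, high + 1)]

values = hex_values("10", "100", pad=True)
print(values[:2], values[-3:])  # ['010', '011'] ['0fe', '0ff', '100']
print(len(values))              # 241
```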

UUID string format syntax

You may choose to generate values in the Universally Unique Identifier (UUID) format by declaring a !uuid format string. A UUID is a string of 32 hexadecimal digits (0 to 9, a to f), separated by hyphens into five groups of 8-4-4-4-12 characters. An example UUID would be 12345678-90ab-cdef-1234-567890abcdef.

In order to generate a UUID as the unique key, specify !uuid in the format as shown below.

format: "{!uuid}"

You may also specify a prefix within the format string of up to 8 characters. This will ensure that the first characters in the UUID are always static. For example, specifying format: "{!uuid,aaaa}" will cause the first 4 characters of every UUID generated by the ruleset to be a.
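The effect of a prefix on the shape of the value can be sketched in Python using the standard uuid module. This is an illustration only: `uuid_with_prefix` is a hypothetical helper, it uses random version-4 UUIDs as a stand-in, and it does not reproduce whatever mechanism DataMasque uses to guarantee uniqueness.

```python
import uuid

def uuid_with_prefix(prefix: str = "") -> str:
    """Return a UUID-shaped string whose leading characters are fixed."""
    if len(prefix) > 8:
        raise ValueError("prefix may be at most 8 characters")
    random_uuid = str(uuid.uuid4())
    # A prefix of up to 8 characters fits inside the first group,
    # so the 8-4-4-4-12 layout is preserved.
    return prefix + random_uuid[len(prefix):]

print(uuid_with_prefix("aaaa"))
```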

Alternatives string format syntax

You can have DataMasque select one value from a set of alternatives for each generated value. Such a segment can be specified by wrapping your set of pipe/|-separated alternatives in parentheses (). For example, {(EN|FR)}-{!int} can be used to generate an integer prefixed by either EN or FR. At least two alternatives must be specified.

Note: This should only be used with the from_format_string mask in a mask_table task, as it will not satisfy the uniqueness requirement of the mask_unique_key task.

Cascading of masked values

Referential integrity of data references to the target_key of a mask_unique_key task is maintained by "cascading" the masked replacement values to each reference. DataMasque will perform this cascade automatically for relationships defined by a database foreign key constraint. Masked values can also be cascaded to columns that are not members of such a foreign key constraint by using the additional_cascades feature.

Specifically, foreign keys that meet the following criteria will be updated:

  1. Foreign keys of the target_key or an additional_cascade with update_foreign_keys: true will be updated.
  2. Only foreign keys in the same schema as the target table will be updated.
  3. Only foreign keys that are enabled will be updated. Foreign keys that are present, but disabled at the time of masking will be excluded. Specifically:
    1. In Oracle, foreign keys disabled by ALTER TABLE some_table DISABLE CONSTRAINT some_constraint are not updated.
    2. In Microsoft SQL Server, foreign keys disabled by ALTER TABLE some_table NOCHECK CONSTRAINT some_constraint are not updated.
    3. Note that MySQL, PostgreSQL and Redshift do not have equivalent support for disabling foreign key constraints, so all discovered foreign keys are updated.
  4. Foreign keys will be searched for recursively, i.e. foreign keys that reference an already updated foreign key will also be updated.
  5. A foreign key that references a superset of the columns in its referenced key or the original target_key will be updated, but the columns that do not reference the target_key will be left unchanged.
    • For example, consider using mask_unique_key to mask a key column user_id. If a composite foreign key (ref_user_id, ref_group_id) references user_id and another column group_id, then the ref_user_id in that foreign key will be updated but the ref_group_id will not.
  6. A foreign key that references a subset of the columns in the target_key will NOT be updated, as row references may be ambiguous.
    • For example, consider using mask_unique_key to mask a composite key (user_id, group_id). If a foreign key references user_id or group_id alone, then that foreign key will not be updated.
    • Note that if an additional_cascade with update_foreign_keys: true uses the same target_key column as a source_column multiple times, then foreign keys of that additional_cascade will still be updated if they reference each column in the original target_key at least once.
    • Additionally, if an additional_cascade with update_foreign_keys: true references only a subset of the target_key columns, then foreign keys of that additional_cascade will be updated as long as they reference all of the target columns in the additional_cascade.

If any cascade would update the same set or a subset of the columns of another cascade, it will be marked as redundant and only the other cascade will be performed. For example, consider using mask_unique_key to mask a key column user_id where a foreign key column ref_user_id references user_id and an additional_cascade is specified to update both ref_user_id and ref_user_id_2 from the single source_column of user_id. In this situation, the foreign key is redundant as it updates a subset of the columns in the additional cascade, so only the update specified by the additional cascade needs to be performed.
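The redundancy rule is essentially a subset test on the columns each cascade would update. A minimal Python sketch of that reading follows; `is_redundant` and the frozenset representation of a cascade are assumptions made for illustration, not DataMasque internals.

```python
from typing import FrozenSet

def is_redundant(cascade: FrozenSet[str], other: FrozenSet[str]) -> bool:
    """A cascade is redundant if it updates the same set, or a subset,
    of the columns updated by another cascade."""
    return cascade <= other

# The example above: a foreign key on ref_user_id, plus an additional
# cascade that updates both ref_user_id and ref_user_id_2.
foreign_key = frozenset({"ref_user_id"})
additional_cascade = frozenset({"ref_user_id", "ref_user_id_2"})

print(is_redundant(foreign_key, additional_cascade))   # True
print(is_redundant(additional_cascade, foreign_key))   # False
```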

The following situations would result in an unresolvable set of cascades, and will cause DataMasque to report an error in the run log:

  1. Any cascade whose columns partially overlap those of another cascade without either column set being a superset of, subset of, or equal set to the other column set.
    • For example, consider using mask_unique_key to mask a key column user_id that is referenced by a composite foreign key (ref_user_id, ref_group_id) while an additional_cascade is specified to update columns ref_user_id and ref_role_id. Because the foreign key and additional_cascade would both update ref_user_id but would also update columns that each other cascade would not update (ref_group_id and ref_role_id), they cannot be resolved and an error will be reported.
  2. Any cascade that would update a column from a different column in the target_key than another cascade.
    • For example, consider using mask_unique_key to mask a composite key (user_id, group_id) that is referenced by a composite foreign key (ref_user_id, ref_group_id) while an additional_cascade is specified to update ref_user_id=group_id and ref_group_id=user_id. Because the foreign key and additional_cascade would update ref_user_id and ref_group_id from different source columns, they cannot be resolved and an error will be reported.
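Both unresolvable situations can be expressed as checks on a mapping from each updated (target) column to the target_key column it is populated from. The Python sketch below is one possible reading of the two rules; `find_conflict` and the dictionary representation are hypothetical, and in the first example the source of ref_role_id is assumed to be user_id since it is the only target_key column.

```python
from typing import Dict, Optional

def find_conflict(a: Dict[str, str], b: Dict[str, str]) -> Optional[str]:
    """Each cascade maps target column -> source column in the target_key.
    Returns a description of the conflict, or None if the pair resolves."""
    shared = a.keys() & b.keys()
    if any(a[column] != b[column] for column in shared):
        return "different source column"
    if shared and not (a.keys() <= b.keys() or b.keys() <= a.keys()):
        return "partial overlap"
    return None

# Case 1: the column sets overlap on ref_user_id only.
fk_1 = {"ref_user_id": "user_id", "ref_group_id": "group_id"}
cascade_1 = {"ref_user_id": "user_id", "ref_role_id": "user_id"}
print(find_conflict(fk_1, cascade_1))  # partial overlap

# Case 2: the same columns are updated from swapped source columns.
fk_2 = {"ref_user_id": "user_id", "ref_group_id": "group_id"}
cascade_2 = {"ref_user_id": "group_id", "ref_group_id": "user_id"}
print(find_conflict(fk_2, cascade_2))  # different source column
```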

Any rows of a cascade target (child table) which contain values that are not present in the target_key (on the parent table) will have their cascade target columns set to NULL. This situation may occur in one of the following cases:

  • The cascade target is a foreign key that has at some point been disabled, had values updated, then been re-enabled without being checked / validated. i.e.

    • For Oracle the constraint was re-enabled using NOVALIDATE.
    • For Microsoft SQL Server, the constraint was re-enabled without using WITH CHECK.
    • For PostgreSQL, the constraint was dropped and recreated instead of being disabled and re-enabled.
  • The cascade target is an implicit reference without database constraints.

The behaviour is designed to ensure that no rows are left unmasked on the cascade target.

Important!

  • The reported number of rows updated by mask_unique_key will include updates made by all cascades, and count each time a row is affected by a different update (i.e. multiple cascade updates on the same table).

Build a temporary table

If you need to repeatedly join multiple tables when masking tables, or you need to perform some custom SQL transformations to column values, then you may wish to use the build_temp_table task type to create a temporary table which can then be accessed via a join during a mask_table task:

Parameters
  • table_name (required): The name of the temporary table to create in the database. You will need to use this name when referencing this temporary table later (e.g. in joins and masking rules). The table name can be prefixed with a schema name if the temporary table should be created in a schema other than the user's default schema.
  • sql_select_statement (this OR sql_select_file required): A string containing a SELECT statement to define the contents of the temporary table. To break the statement across multiple lines, you may use multi-line YAML syntax (|- or >-).
  • sql_select_file (this OR sql_select_statement required): The name of a user uploaded SQL script file containing a SELECT query to define the contents of the temporary table. See the Files guide for more information on uploading SQL script files. Use this parameter if you have a complex / long query, or you wish to share the same query between many rulesets.
Example (sql_select_statement)
version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_statement: >-
        SELECT accounts.account_id, address.city
        FROM accounts
        INNER JOIN address
        ON accounts.address_id = address.id
        WHERE accounts.country = 'New Zealand';
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city
Example (sql_select_file)
version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_file: create_temp_table.sql
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city

Notes:

  • DataMasque will create the temporary tables before applying any masks, and delete them after all tables have been masked. DataMasque will also ensure the temporary tables do not already exist in the database (removing existing temporary tables with the same name if needed). The temporary tables you define will only be available for use in joins, and cannot be masked themselves.
  • For Microsoft SQL Server databases, temporary table names must begin with the ## characters, as they will be created as 'Global Temporary Tables' so that they are visible to all parallel masking connections. However, in YAML the # character begins an inline comment, so the temporary table name must be wrapped in double or single quotes (e.g. table_name: '##my_temporary_table').
  • For Oracle and PostgreSQL databases, 'temporary tables' are created as regular tables so that temporary tables are visible to all parallel masking connections.
  • The build_temp_table task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.

Run SQL

Use the run_sql task type when you need to run any native SQL DDL or DML statements or scripts at any point within a ruleset.

You can supply SQL for DataMasque to execute either as a script file (see Files guide), or inline in the ruleset:

Parameters
  • sql (this OR sql_file required): An SQL script to be executed. For multi-line scripts, you may use the YAML block style syntax (|-).
  • sql_file (this OR sql required): The name of a user-provided file containing an SQL script to be executed (see Files guide). Use this parameter if you have large blocks of SQL to run, or scripts that you wish to share between many rulesets.
Example Microsoft SQL Server (sql)

Note: This example uses Microsoft SQL Server specific syntax, as master has been specified as the target database.

version: '1.0'
tasks:
  - type: run_sql
    sql: |-
      USE [master];
      ALTER DATABASE eCommerce SET RECOVERY SIMPLE WITH NO_WAIT;
      USE [eCommerce];
      ALTER TABLE [SalesRecords].[Customer] DROP CONSTRAINT [FK_SALESRECORDS_CUSTOMER];
      ALTER TABLE [Invoices].[Customer] DROP CONSTRAINT [FK_INVOICES_CUSTOMER];
Example (sql_file)
tasks:
  - type: run_sql
    sql_file: pre_script_1.sql

Notes:

  • The run_sql task type executes in autocommit mode, and will exit on the first error encountered.
  • The run_sql task type does not run in dry run mode.
  • The run_sql task type is not currently supported for Microsoft SQL Server (Linked Server) databases.
  • For PostgreSQL and MySQL connections, the SQL interpreter treats a colon followed by letters or numbers as a bound parameter. The colon can be escaped by placing a backslash (\) before it. For example, when attempting to insert JSON data with INSERT INTO table_name (column_name) VALUES('{"is_real" :true, "key":"value"}'::json), run_sql will interpret :true as a bound parameter. To fix this, escape the colon with a backslash: INSERT INTO table_name (column_name) VALUES('{"is_real" \:true, "key":"value"}'::json)
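If a script contains many such colons, the escaping could be applied mechanically before the SQL is placed in a ruleset. The Python sketch below is a hypothetical pre-processing helper, not a DataMasque feature: `escape_colons` adds a backslash before any single colon that is directly followed by a letter or digit, and leaves :: casts and already escaped colons alone, matching the corrected statement shown in the note above. It would also escape any colon you genuinely intend as a bound parameter, so treat it as a starting point only.

```python
import re

# A colon followed by a letter or digit, not preceded by ':' or '\'.
BOUND_PARAMETER_COLON = re.compile(r"(?<![:\\]):(?=[A-Za-z0-9])")

def escape_colons(sql: str) -> str:
    """Prefix colons that would be read as bound parameters with a backslash."""
    return BOUND_PARAMETER_COLON.sub(lambda match: "\\:", sql)

statement = r"""INSERT INTO table_name (column_name) VALUES('{"is_real" :true, "key":"value"}'::json)"""
print(escape_colons(statement))
```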

Notes for Oracle:

  • It is recommended to test the execution of your SQL script with Oracle SQLPlus before use in a run_sql task.
  • DataMasque prepends "WHENEVER SQLERROR EXIT SQL.SQLCODE" to the SQL script, so the run_sql task will exit on the first error encountered.
  • Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed run_sql task.
  • run_sql does not use the schema specified in the connection configuration; instead it will default to the schema of the user. If you wish to change schema, specify it in the script with SET SCHEMA.

Notes for Microsoft SQL Server:

  • It is recommended to test the execution of your SQL script with Microsoft sqlcmd or SQL Server Management Studio before use in a run_sql task.
  • Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed run_sql task.

Notes for PostgreSQL:

  • It is recommended to test the execution of your SQL script with psql before use in a run_sql task.
  • A run_sql task will be executed with a simple query cycle, where statements are executed in a single transaction (unless explicit transaction control commands are included to force a different behaviour).

Notes for Redshift:

  • Using multiple SQL statements in a single run_sql task type is not supported currently.

Notes for MySQL:

  • It is recommended to test the execution of your SQL script with MySQL shell before use in a run_sql task.
  • Warning: Executing run_sql tasks with more than one statement may not raise errors upon failure. If the first statement executes correctly but subsequent statements fail, errors may not be raised.

Notes for Amazon DynamoDB:

  • The run_sql task type is not supported for Amazon DynamoDB.

Truncate a table

Use the truncate_table task type to specify tables to be truncated by DataMasque. All rows will be deleted, but the table structure will be left in place.

Parameters
  • table (required): The name of the table to truncate. The table name can be prefixed with a schema name to reference a table in another schema.
Example
tasks:
  - type: truncate_table
    table: history_table
  ...

Notes:

  • The truncate_table task type does not run in dry run mode.
  • The truncate_table task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.

Parallel Tasks

Using the parallel task type, you can specify a block of tasks to be executed in parallel, spread across as many workers as are available.

Each parallel task distributes to a maximum of 10 sub-tasks. It is recommended to begin testing parallelisation with at most 4 tasks in parallel, then increase parallelisation if the database has more capacity.

Parallel tasks can be nested inside other serial/parallel tasks.

Parameters
  • tasks (required): A set of tasks to perform in parallel.
Example
tasks:
  - type: parallel
    tasks:
      - type: mask_table
        table: employees
        key: id
        rules:
          - column: 'name'
            masks:
              - type: from_fixed
                value: 'REDACTED'
      - type: mask_table
        table: customers
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
 ...

Warning: You should not mask the same table in multiple tasks (including mask_table and run_sql tasks) in parallel, as this could result in data being incorrectly masked.

Note: The parallel task type is not currently supported for Amazon DynamoDB.


Serial Tasks

Although tasks are performed serially in the order they are listed in the ruleset by default, you can specify a block of tasks to be performed in serial within a parallel block. This is useful when a subset of parallelisable tasks have dependencies that mean they must be executed in sequence.

Serial tasks can be nested inside other serial/parallel tasks.

Parameters
  • tasks (required): A set of tasks to perform in series.
Example
tasks:
  - type: parallel
    tasks:
      - type: serial
        tasks:
          - type: run_sql
            sql_file: pre_employees_script.sql
          - type: mask_table
            table: 'employees'
            key: id
            rules:
              - column: 'name'
                masks:
                  - type: from_fixed
                    value: 'REDACTED'
      - type: mask_table
        table: 'customers'
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
 ...

Masking files

Each mask_file or mask_tabular_file task specifies the masking rules to apply to each file in the base directory and/or any subdirectories, as well as any files/directories to be skipped or included, and any conditionals that define which data to mask. Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value from each mask is passed as the input to the next mask in the sequence.

Note: If the source connection and destination connection are of the same type and have the same base directory, the files will be overwritten. The list of files is read at the start of the masking run, so new files added during the masking run will not be masked and will not be present in the destination.


File task types

A ruleset consists of a list of tasks to be performed in sequence on the target data source. After each file is masked, it is written to the selected data destination.

Object file masks

Each mask_file task defines masking operations to be performed on a file or set of files. More detail is provided about these tasks under the Masking files section.

Parameters

Each task with type mask_file is defined by the following parameters:

  • rules (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the file. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
    • masks (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the file. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
  • recurse (optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults to false.
  • workers (optional): The number of parallel workers to use for this masking task. Defaults to 1.
  • skip (optional): Specifies files to not include in the masking run.
    • regex (optional): Files that match this regular expression will not be included for masking.
    • glob (optional): Files that match this glob will not be included for masking.
    • match_against (optional): Choose if the skip rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • include (optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by a skip rule).
    • regex (optional): Files that match this regular expression will be masked.
    • glob (optional): Files that match this glob will be masked.
    • match_against (optional): Choose if the include rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • encoding (optional): The encoding to use when reading and writing files. Defaults to UTF-8. Refer to Python Standard Encodings for a list of supported encodings.

For more information about the ordering of skip and include please refer to Include/Skip.

Note: The regex/glob is matched against the file path relative to the base directory specified in the source connection, so consider adding .* (regex) or * (glob) to the beginning of the expression. For example, if the file structure is /path1/path2/target_file.json, the base directory is path1/ and recurse: true is set in the ruleset, the regex/glob will be matched against path2/target_file.json. When including a path by specifying a glob such as target_path/*, the recurse option must be set to true; otherwise the included path won't be entered and the files it contains will not be masked.
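The effect of matching against the relative path can be illustrated in Python. This sketch makes two assumptions that are not confirmed by this page: that the regex is anchored at the start of the path (as with Python's re.match), and that Python's fnmatch is a reasonable stand-in for the glob matcher. The helper name `relative_to_base` is hypothetical.

```python
import re
from fnmatch import fnmatch
from pathlib import PurePosixPath

def relative_to_base(full_path: str, base_dir: str) -> str:
    """Return the path that include/skip patterns would be compared with."""
    return str(PurePosixPath(full_path).relative_to(base_dir))

candidate = relative_to_base("/path1/path2/target_file.json", "/path1")
print(candidate)  # path2/target_file.json

# Without a leading wildcard, the pattern does not reach past path2/.
print(re.match(r"target_file\.json", candidate) is not None)    # False
print(fnmatch(candidate, "target_file.json"))                   # False

# With .* (regex) or * (glob) at the beginning, the file is matched.
print(re.match(r".*target_file\.json", candidate) is not None)  # True
print(fnmatch(candidate, "*target_file.json"))                  # True

# With match_against: filename, only the final component is compared.
print(PurePosixPath(candidate).name)  # target_file.json
```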

Supported file types

In general, mask_file has been designed to mask XML, JSON, NDJSON, and Apache Avro files. Each file is loaded as a string and passed to the masks. Therefore, to mask a JSON file, use a json mask, as in the following example:

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      - regex: '.*\.json'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*.json"
    rules:
      - masks:
          - type: json
            transforms:
            - path: ['name']
              masks:
                - type: from_fixed
                  value: REDACTED

This would replace the root name attribute in the JSON with the text REDACTED.

Multirecord Files (NDJSON/Avro)

NDJSON (Newline Delimited JSON) (.ndjson) files can be masked by specifying a json mask in the ruleset. The mask is applied to each line (record) of JSON in the NDJSON file, rather than to the entire file. For example:

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - masks:
        - type: json
          transforms:
            - path: ['name']
              masks:
                - type: from_fixed
                  value: REDACTED

This would replace the root name attribute for each record in the NDJSON with the text REDACTED.

Apache Avro (.avro) files can be masked in the same way as NDJSON files. When masking Avro files there also needs to be an Avro Schema file (.avsc) in the same directory and with the same name as the .avro file, as the schema file is required to write the masked .avro file to ensure data type consistency. For example, if a file is called data.avro DataMasque will expect to find a schema file called data.avsc in the same directory.
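The naming convention for the schema file amounts to swapping the file extension. A minimal Python sketch (the helper name `expected_schema_path` and the exports/ directory are hypothetical):

```python
from pathlib import PurePosixPath

def expected_schema_path(avro_path: str) -> str:
    """Return the .avsc path expected alongside a given .avro file."""
    return str(PurePosixPath(avro_path).with_suffix(".avsc"))

print(expected_schema_path("exports/data.avro"))  # exports/data.avsc
```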

Note: The Avro Schema file (.avsc) will be copied unmasked into the output directory. Deterministic Masking is also supported for multirecord files; for more information, please refer to Deterministic Masking with Files.

Similarly, for XML files, use an xml mask:

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      - regex: '^(.*)\.xml'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*.xml"
    rules:
      - masks:
          - type: xml
            transforms:
            - path: 'User/Name'
              node_transforms:
                - type: text
                  masks:
                    - type: from_fixed
                      value: REDACTED

This would replace the content of the node(s) at User/Name with the text REDACTED.

To mask other types of files, basic redaction is possible. For example, to replace the contents of every txt file with the text REDACTED:

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
      - glob: "*.txt"
    rules:
      - masks:
          - type: from_fixed
            value: REDACTED

It is possible to use any mask that accepts text input (or no input), although their effectiveness will depend on the size and content of the input file.

Note also that files that are not processed are not copied from the source to the destination. DataMasque either loads a file (based on the skip/include rules), masks it and writes it to the destination, or ignores the file entirely.


Tabular file masks

Each mask_tabular_file task defines masking operations to be performed on a file or set of files (CSV, Parquet or fixed-width columns). More detail is provided about these tasks under the Masking files section.

Parameters

Each task with type mask_tabular_file is defined by the following parameters:

  • rules (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
    • column (required): A column within the tabular file intended for masking. This is the column's header (for CSVs) or the column's name (for Parquet files).
    • masks (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
  • recurse (optional): A boolean value. When set to true, any folders in the Data Source will be recursed into and the files they contain will also be masked. Defaults to false.
  • workers (optional): The number of parallel workers to use for this masking task. Defaults to 1.
  • skip (optional): Specifies files to not include in the masking run.
    • regex (optional): Files that match this regular expression will not be included for masking.
    • glob (optional): Files that match this glob will not be included for masking.
    • match_against (optional): Choose if the skip rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • include (optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by a skip rule).
    • regex (optional): Files that match this regular expression will be masked.
    • glob (optional): Files that match this glob will be masked.
    • match_against (optional): Choose if the include rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • encoding (optional): The encoding to use when reading and writing files. Defaults to UTF-8. Refer to Python Standard Encodings for a list of supported encodings.
  • delimiter (optional): The character that is used to delimit columns in character-delimited files (e.g. , for CSVs or \t for TDVs). Defaults to ,.
  • fixed_width_extension (optional): The file extension that fixed-width files have. Not required if no fixed-width files are to be masked. Should not include a leading . (e.g. specify txt not .txt).
  • fixed_width_columns_indexes (optional): An array of two-element arrays of start and end indexes of the fixed width columns. Required if fixed_width_extension is specified.
  • fixed_width_column_names (optional): An array of strings defining the names of the fixed-width columns; these names are used to refer to the columns in masking rules. Required if fixed_width_extension is specified, and must match the length of fixed_width_columns_indexes.
  • fixed_width_too_wide_action (optional): The action to take if masked data exceeds the width of the column, can be either truncate to truncate the value to fit in the column or error to raise an error and stop the masking run. Defaults to truncate.
  • fixed_width_line_ending (optional): The line ending to use when writing out the fixed-width data. DataMasque attempts to detect it from the input file; otherwise it defaults to \n.


Note: regex/glob patterns are matched against the path relative to the base directory specified in the source connection, so consider adding .* (regex) or * (glob) to the beginning of the expression. For example, if the file is at /path1/path2/target_file.json, the base directory is path1/ and recurse: true is set in the ruleset, the regex/glob will be matched against path2/target_file.json. When including a path by specifying a glob such as target_path/*, the recurse option must be set to true; otherwise the included path won't be entered and the files it contains will not be masked.
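
The effect of matching against the relative path can be sketched with Python's re module, which uses the same regular expression syntax. The two patterns below are hypothetical examples; the sketch only shows why a pattern anchored to the bare filename fails once the file sits in a subdirectory, and why a leading .* resolves this.

```python
import re
from pathlib import PurePosixPath

base_directory = PurePosixPath("/path1")
full_path = PurePosixPath("/path1/path2/target_file.json")

# The path that include/skip patterns are compared with (no leading "/").
relative = str(full_path.relative_to(base_directory))

anchored = re.compile(r"^target_file\.json$")    # only matches a file in the base directory
prefixed = re.compile(r"^.*target_file\.json$")  # tolerates any leading directories

print(relative)
print(bool(anchored.search(relative)))
print(bool(prefixed.search(relative)))
```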

version: "1.0"
tasks:
  - type: mask_tabular_file
    recurse: true
    skip:
      - regex: '.*\.json'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*"
    rules:
      - column: name
        masks:
        - type: from_fixed
          value: REDACTED

Tabular File Type Detection

DataMasque uses file extensions to determine how tabular files are loaded for masking. The extension comparison is not case-sensitive. Files with the extension parquet are treated as Apache Parquet files. If a fixed_width_extension is specified (see Fixed Width File Masking Parameters below), files with that extension are loaded as fixed-width. All other file types are loaded as character-delimited. If the delimiter parameter is unspecified, it defaults to , (i.e. CSV).

Note: Delimiters must be a single character, and should be quoted with double quotes. For example: delimiter: " " for a space or delimiter: ";" for a semicolon. Tabs or other invisible delimiters should use the correct escape sequence with a backslash. For example, delimiter: "\t" for a tab.

The skip, include and delimiter options should be used together to specify the correct delimiter for each file type that is to be masked.

Note: CSV files require a header row for tabular masking, as the header values are used as column names during masking. CSV and fixed-width files are string-based, so values should be cast to other types before being used with masks that require specific types (e.g. numeric_bucket). To do this, use a typecast mask; for more information, refer to Typecast.
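
The following Python sketch illustrates why a cast is needed, using the standard csv module as a stand-in for a character-delimited loader. The sample data is hypothetical. The header row supplies the column names, and every value arrives as a string until it is explicitly converted.

```python
import csv
import io

csv_text = "name,age\nAlice,34\nBob,27\n"

# The header row provides the column names used to address each value.
rows = list(csv.DictReader(io.StringIO(csv_text)))

raw_age = rows[0]["age"]
print(type(raw_age).__name__)  # values read from a CSV are strings

# A numeric operation needs an explicit cast first.
numeric_age = int(raw_age)
print(numeric_age + 1)
```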

DataMasque will only attempt to load fixed width files if fixed_width_extension is specified, and will treat any files with this extension as fixed-width. See also Fixed Width File Masking Parameters.

Once files are loaded they are all masked in the same way, that is, rules are executed and applied on a per row/column basis regardless of the original source type. Data will be written back out in the same format as it was read.

Fixed Width File Masking Parameters

Masking of fixed-width files is only attempted if fixed_width_extension is specified. If fixed_width_extension is present in the ruleset without fixed_width_columns_indexes and fixed_width_column_names, then an error will be raised. However, it is valid to have fixed_width_columns_indexes and fixed_width_column_names missing if fixed_width_extension is also absent.

If fixed_width_extension is set, then DataMasque will treat any files with that extension as fixed-width and load them based on the other fixed-width options. To assist in explaining the rules, consider an example file called users.txt. This is the content:

Adam    2010-01-01 AAA-1111
Brenda  2010-01-01 EEE-5555
Charlie 2010-02-02 GGG-7777

It has 3 columns, the first containing a name, which is from index 0 to 8. The second column contains a date and spans from 8 to 19. The final column contains a transaction ID and spans from index 19 to 27.

Note these indexes are specified to be contiguous as some fixed-width formats require contiguous columns, therefore a trailing space is included in the first and second columns. DataMasque automatically strips leading and trailing spaces when the data is read. Contiguous columns are not required though, so the same result could be achieved with indexes (0, 7), (8, 18) and (19, 27). When non-contiguous columns are specified DataMasque inserts spaces in between columns.
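
The index pairs can be checked with a short Python sketch. It assumes that the end index is exclusive (as implied by the alternative indexes above) and that a single space separates the date from the transaction ID, so that the ID begins at index 19. The helper split_row is hypothetical and only mimics slicing followed by stripping of spaces.

```python
rows = [
    "Adam    2010-01-01 AAA-1111",
    "Brenda  2010-01-01 EEE-5555",
    "Charlie 2010-02-02 GGG-7777",
]

contiguous = [(0, 8), (8, 19), (19, 27)]
non_contiguous = [(0, 7), (8, 18), (19, 27)]


def split_row(row, indexes):
    """Slice a fixed-width row and strip the padding from each value."""
    return [row[start:end].strip() for start, end in indexes]


for row in rows:
    # Both index sets yield the same values once padding is stripped.
    print(split_row(row, contiguous), split_row(row, non_contiguous))
```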

Since fixed-width files do not have column headers, the ruleset must also specify these. They can be any arbitrary valid column identifier (i.e. alphanumeric string without special characters) and are used to identify the columns in the masking rules. In this case they will be named name, date and transaction_id.

Applying these rules yields a ruleset like this:

version: "1.0"
tasks:
  - type: mask_tabular_file
    recurse: true
    fixed_width_extension: txt
    fixed_width_columns_indexes:
    - [0, 8]
    - [8, 19]
    - [19, 27]
    fixed_width_column_names:
    - name
    - date
    - transaction_id
    rules:
      - column: name
        masks:
        - type: from_file
          seed_file: DataMasque_firstNames_mixed.csv
          seed_column: firstname-mixed
      - column: date
        masks:
          - type: from_random_date
            min: '1950-01-01'
            max: '2000-12-31'
          - type: typecast
            typecast_as: string
            date_format: '%Y-%m-%d'
      - column: transaction_id
        masks:
          - type: substitute

Note that when this ruleset is executed, DataMasque will still load any CSVs or Parquet files it encounters, however it will use the standard loaders instead of applying the fixed width rules. In this case fixed-width rules will only be used for txt files.

This can be useful if CSV or Parquet files exist in the source with the same columns and need to be masked in the same manner. If these files exist, and they shouldn't be masked, then skip rules should be added to skip them.

Choosing files to mask with include/skip

When specifying which files to include or skip for a mask_file/mask_tabular_file task, the order of checking needs to be considered. The skip rules are checked first, followed by the include rules, which means skip takes precedence. Warning! If a file matches both the skip and include rules, that item will not be included in the masking task.

Rules are applied with "or" logic, that is, as long as the file matches any rule in skip or include, it will be skipped/included, respectively.

In comparisons, the file's path does not include a leading /.

By default, include and skip rules are applied to the file's path, relative to the base_directory. By specifying match_against: filename on the include/skip, the rule will be applied to just the file's name.
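
The order of checks described above can be sketched in Python. This is a simplified model rather than DataMasque's code: the FileRule class and should_mask function are hypothetical, only regex rules are modelled, and patterns are written with explicit ^ and $ anchors so the result does not depend on how the pattern is searched.

```python
import posixpath
import re
from dataclasses import dataclass


@dataclass
class FileRule:
    regex: str
    match_against: str = "path"  # "path" (default) or "filename"

    def matches(self, path: str) -> bool:
        # Compare against the whole relative path or just the final component.
        subject = posixpath.basename(path) if self.match_against == "filename" else path
        return re.search(self.regex, subject) is not None


def should_mask(path, include=(), skip=()):
    if any(rule.matches(path) for rule in skip):  # skip is checked first and wins
        return False
    if not include:  # no include rules: everything not skipped is masked
        return True
    return any(rule.matches(path) for rule in include)  # "or" logic


include = [FileRule(r"^staff/.*\.csv$")]
skip = [FileRule(r"^data.*\.csv$", match_against="filename")]

for path in ["staff/sales.csv", "staff/management/data-list.csv", "users.csv"]:
    print(path, should_mask(path, include, skip))
```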

Filtering is performed only on files, not on directories. A directory will always be entered into, even if it would be skipped based on filters. If a directory has no files to be masked (they are all skipped, or none are included) then an empty directory will not be created on the target.

glob matching does not support the ** syntax for multi-level directory matching. If the depth of directories is unknown, or more advanced path matching is required regex filtering should be used.

These rules are explained in more detail below, with some examples and common file filtering scenarios.

include/skip examples and common scenarios

In general, glob is used for simple matching, and regex should be considered when more advanced rules are required. Note that some of these examples use a mask_file task and others use a mask_tabular_file task, however there is no difference in the way the file matching rules are executed between these different task types.

Mask all files with a certain extension

The ruleset will include all files that end with .xml inside the base directory and any subdirectories.

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
    - glob: "*.xml"

The recurse option will enter every subdirectory. Since the match_against option has not been specified the default path option is used, so the glob is applied against the file's path. However since a path like dir1/dir2/file.xml and a filename like file.xml both end with .xml they both match the specified glob; in this case specifying match_against: filename would have no effect.

Mask all files with a specific start and end

This ruleset will mask any file that starts with user and has a .json extension. For example, user_1.json, users_file.json, etc. These files can be anywhere inside the source. The match_against option is set to filename.

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
    - glob: "user*.json"
      match_against: filename

match_against: filename is specified because the glob will not match paths in subdirectories. For example, the path dir1/dir2/user_1.json does not match the glob user*.json, but the filename on its own (user_1.json) does.

Note: When matching against filename, a glob with a / in it is not valid since a filename should never contain a /. DataMasque does not prevent you from using a / in a glob when matching against filename so if there are files you expect to be included, but they are not, this could be the reason why.

Mask files of a certain type only in the base directory

To mask CSV files only in the base directory, disable the recurse option.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - glob: "*.csv"

Without recurse: true, subdirectories will not be entered into, so only .csv files in the base directory will be masked.

Mask files in nested directories of known depth

In this example, there is a directory layout like this:

  • users1.csv
  • users2.csv
  • staff/users1.csv
  • staff/users2.csv
  • staff/management/users1.csv
  • staff/management/users2.csv
  • staff/contractors/users1.csv
  • staff/contractors/users2.csv

(Assume there are also some files of other types that are not to be masked).

All users*.csv files should be masked, except those in the base directory. Since we know the depth of the directories (staff/ is one level; staff/management/ and staff/contractors/ are two levels) two globs can be added to match these.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - glob: "staff/*.csv"
    - glob: "staff/*/*.csv"

Since the matching is performed on the path, none of the files in the base directory match. Files in the staff/ directory match the first glob and files in staff/management/ or staff/contractors/ match the second. Since include rules use OR logic, a file is included as long as at least one include rule matches it.

Mask files in all nested directories

If the depth of the directories is unknown, or could be very deep, so that a lot of glob rules need to be created, then using regex to apply regular expressions to the path can be a better option.

Following on from the previous example, CSV files might be nested inside a deep directory tree. The following ruleset uses a regex to match any .csv file inside staff/ or any of its child directories.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - regex: '^staff/.*\.csv$'

Explaining the entire regular expression syntax is beyond the scope of these examples, but since this pattern is quite common we will briefly explain how it works:

  • ^ means the expression must match at the start of the path; it can't just be found somewhere in the middle.
  • staff/ has no special meaning, so it matches the literal staff/ part of the path.
  • .* has two parts, . means match any character and * means match the preceding expression (the .) any number of times. So together they mean match any number of any characters.
  • \. matches a literal .. The \ escapes the . so it doesn't behave as matching any character.
  • csv has no special meaning, so it matches the literal csv extension.
  • $ means the expression must match at the end of the path; .csv can't just appear somewhere in the middle.

Hence, any path that starts with staff/ and ends with .csv will be matched by this regex.
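
Since DataMasque uses Python's regular expression syntax, the pattern can be tried out with Python's re module before it is added to a ruleset. The paths below are hypothetical examples.

```python
import re

pattern = re.compile(r"^staff/.*\.csv$")

paths = [
    "staff/users1.csv",             # directly inside staff/
    "staff/management/users2.csv",  # nested below staff/
    "users1.csv",                   # does not start with staff/
    "old/staff/users1.csv",         # staff/ is not at the start of the path
    "staff/users1.csv.bak",         # .csv is not at the end of the path
    "staff/notes_csv",              # no literal "." before csv
]

matched = [p for p in paths if pattern.search(p)]
print(matched)
```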

Often you will be able to apply a slightly modified version of this regex, by just changing the start and end to match your particular paths and file extensions.

For more information on the regular expression syntax used by DataMasque, check the Python Regex Howto.

Mask all files with a given extension, except those with a certain prefix

In this example, all .avro files in the base directory will be masked, except those starting with users. This means files like sales_1.avro, data.avro and orders-for-2020.avro will be masked, but users-list.avro and users-exported.avro will not be. It uses an include rule to include all .avro files and a skip rule to skip any starting with users.

version: "1.0"
tasks:
  - type: mask_file
    include:
    - glob: "*.avro"
    skip:
    - glob: "users*.avro"
      match_against: filename

The file users-exported.avro matches both *.avro and users*.avro, but since skip takes precedence the file will not be included. The file data.avro only matches *.avro so it will be included.

Mask files in all nested directories, excluding those with a given prefix

This example will include all .csv files in the staff/ directory, except those that start with data. For example, it matches staff/sales.csv and staff/management/users.csv but not staff/data01.csv or staff/management/data-list.csv.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - regex: '^staff/.*\.csv$'
    skip:
    - glob: "data*.csv"
      match_against: filename

This ruleset shows that glob and regex can be combined to produce flexible and simple matching rules.

Mask files that match a particular pattern

This example includes XML files that start with users_, then have at least one digit (the \d+ parameter). But since it uses match_against: filename it will not match against the path, therefore files inside a directory starting with users_ won't match.

version: "1.0"
tasks:
  - type: mask_file
    include:
    - regex: '^users_\d+\.xml$'
      match_against: filename

Some example results are:

  • users_5.xml: match
  • users_10.xml: match
  • users_list.xml: no match (the filename doesn't match the regex)
  • users_5/file.xml: no match (the directory name users_5 fits the start of the regex, but the comparison is performed against the filename file.xml)
  • users_5/users_10.xml: match (the filename users_10.xml matches the regex)
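
These results can be reproduced with a short Python sketch that applies the pattern to the final path component only, which is how match_against: filename is described above. The sketch uses a raw string with the . before xml escaped so that it only matches a literal dot.

```python
import posixpath
import re

pattern = re.compile(r"^users_\d+\.xml$")

paths = [
    "users_5.xml",
    "users_10.xml",
    "users_list.xml",
    "users_5/file.xml",
    "users_5/users_10.xml",
]

# match_against: filename compares only the last component of the path.
results = {p: bool(pattern.search(posixpath.basename(p))) for p in paths}

for path, is_match in results.items():
    print(path, "match" if is_match else "no match")
```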

Masking Tables

Each mask_table task specifies the masking rules to apply to a database table, as well as any required joins and any conditionals needed to define which rows should be masked. Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value from each mask is passed as the input to the next mask in the sequence.

Note: While mask_table is suitable for most generic masking requirements, it is not capable of masking unique keys or primary keys. Masking of such values requires the use of the special-purpose mask_unique_key task.

Selecting data to mask

DataMasque provides some advanced features for selecting additional data from the database for use in your masking rules.

Joining tables

When masking a table, you can specify a list of joins (or dictionary mapping keys to joins) that will join the rows of a target table to rows from one or more additional tables, providing you with the additional joined values to use in your masking rules.

Parameters
  • target_table (required): The name of the new table you wish to join into the masking data. The target table can be prefixed with a schema name to reference a table in another schema.
  • target_key (required): The key on target_table to use when performing the join. This can be specified as a single column name or a list of column names.
  • source_table (required): The name of the table you wish to join the target_table with. This could be the table being masked, or another table earlier in the list of joins (allowing you to perform multi-step joins). The source table can be prefixed with a schema name to reference a table in another schema.
  • source_key (required): The key on source_table to use when performing the join. This can be specified as a single column name or a list of column names.

Example

In this example, we have two tables: Customers and Membership. We would like to mask the first_name and last_name columns of the Customers table, but only if the customer's membership status is Active.

The Customers table contains data on customers, including their name and membership ID. The Membership table contains the status of the membership of each customer: either Active or Inactive. The membership_id column of the Customers table has a foreign key relation with the id column of the Membership table.


Customers Table

customer_id  first_name  last_name  membership_id
1            Anastasia   Rose       10001
2            Bill        Jones      10002
3            Chris       Yang       10003
4            Judith      Taylor     10004
5            Gordon      Smith      10005


Membership Table

id     membership_status
10001  Active
10002  Active
10003  Inactive
10004  Active
10005  Inactive

Customers Membership diagram

In order to access the membership_status column of the Membership table, we need to define a join in our ruleset from the Customers table to the Membership table.

version: "1.0"
tasks:
  - type: mask_table
    table: Customers
    key: customer_id
    joins:
      - target_table: Membership
        target_key: id
        source_table: Customers
        source_key: membership_id
    rules:
      - if:
        - column: '"Membership".membership_status'
          equals: Active
        rules:
          - column: first_name
            masks:
              - type: from_file
                seed_file: DataMasque_firstNames_mixed.csv
                seed_column: firstname-mixed
          - column: last_name
            masks:
              - type: from_file
                seed_file: DataMasque_lastNames.csv
                seed_column: lastnames

After performing the join, this will allow us to reference the membership_status column of the Membership table in our ruleset. In this example, we can reference the column with Membership.membership_status. Using this column, we can use Conditional Masking to only mask the rows of Customers where the status of the membership is 'Active'.

Note: To reference a column in a joined table, the name of the joined table must be added as a prefix to the column name.

The example above utilises the from_file mask type detailed here to select a random first name from the DataMasque_firstNames_mixed.csv file and a random last name from the DataMasque_lastNames.csv file, both of which can be found on our Supplementary Files user guide. It first checks whether the membership_status for the customer is 'Active' and, if so, masks the two name columns; otherwise, these columns are left unmasked.

This example will produce the following results in the Customers table. The customers with customer_id 3 and 5 are not masked, as the status of their membership is 'Inactive' in the joined Membership table.

customer_id  first_name  last_name  membership_id
1            Tia         Pallin     10001
2            Nikau       Koller     10002
3            Chris       Yang       10003
4            Anika       Thom       10004
5            Gordon      Smith      10005
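
The selection logic behind this result can be sketched in plain Python. The sketch assumes the Membership ids run from 10001 to 10005 so that every customer has a matching row, with customers 3 and 5 being Inactive as stated above. The dictionaries are hypothetical stand-ins for the two tables, and the sketch only identifies which rows qualify for masking; it does not generate replacement names.

```python
customers = [
    {"customer_id": 1, "first_name": "Anastasia", "membership_id": 10001},
    {"customer_id": 2, "first_name": "Bill", "membership_id": 10002},
    {"customer_id": 3, "first_name": "Chris", "membership_id": 10003},
    {"customer_id": 4, "first_name": "Judith", "membership_id": 10004},
    {"customer_id": 5, "first_name": "Gordon", "membership_id": 10005},
]

# Membership.id -> Membership.membership_status
membership = {
    10001: "Active",
    10002: "Active",
    10003: "Inactive",
    10004: "Active",
    10005: "Inactive",
}

# Join each customer to its membership row, then apply the "equals: Active" condition.
masked_ids = [
    c["customer_id"]
    for c in customers
    if membership.get(c["membership_id"]) == "Active"
]
print(masked_ids)
```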

Note:

  • For Microsoft SQL Server (Linked Server), joins are not currently supported.

  • For Microsoft SQL Server, when using a temporary table, the name of the temporary table must be wrapped in quotation marks, as the # symbol in the YAML editor denotes the beginning of a comment (e.g. target_table: '##my_temporary_table' or '##my_temporary_table.column').

  • To reference a temporary table column (e.g. the table_filter_column parameter of the from_file mask type or as a part of hash_columns) you must prefix the column name with its table name (e.g. table.column).

  • Any column name specified without a table prefix is assumed to belong to the table being masked (as specified by the table parameter for the task). You cannot specify tables that belong to other schemas.


Conditional masking

You may wish to only apply masks to rows or values that meet some conditions. DataMasque has three different methods for conditionally applying masks to meet different use cases:

Use case                                                                          Mechanism
I want to restrict which rows are fetched for masking from the database table.   Where
I want to apply certain masking rules to only a subset of rows.                   If
I want to skip applying masks to certain column values.                           Skip

Warning: Use of the conditional masking features 'where', 'skip', or 'if/else', may mean your masking rules are not applied to some database rows or values. It is recommended to verify the resulting output satisfies your masking requirements.

Where - restricting database fetches

To restrict which rows are fetched for masking from a database table, you can specify a where clause for a mask_table task:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: id
    where: >-
      "users"."role" <> 'administrator'
    rules:
      ...

The where clause can refer to any columns in the masked table or joined tables. All columns must be referenced using their table-qualified name (e.g. Users.FirstName). Be sure to quote identifiers where required, for example if the identifier uses a reserved word, starts with an illegal character, or is case-sensitive.

Important!

  • Any rows excluded by the where clause will not be masked.
  • The SQL you provide for the where clause will not be validated before execution, so take care when constructing your SQL.
  • The SQL you provide for the where clause should not end in a semicolon, as this will cause a masking error.
  • Any string literal in the where clause must be enclosed in single quotation marks.
  • Joined tables cannot be referenced in the where clause currently.
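
The effect of a where clause can be previewed before a masking run by executing the same condition in a SELECT statement. The sketch below uses Python's built-in sqlite3 module purely as a stand-in database; the table contents are hypothetical, and quoting rules may differ on your target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, role TEXT)")
conn.executemany(
    "INSERT INTO users (id, role) VALUES (?, ?)",
    [(1, "administrator"), (2, "editor"), (3, "viewer")],
)

# Table-qualified column, string literal in single quotes, no trailing semicolon.
where = "\"users\".\"role\" <> 'administrator'"

rows = conn.execute(
    f'SELECT "users"."id" FROM users WHERE {where} ORDER BY "users"."id"'
).fetchall()
print(rows)  # only these rows would be fetched for masking
```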

Note for Amazon Redshift:

  • Use of where clause for Amazon Redshift is not yet supported in DataMasque. This is in our roadmap and will be included in future releases.

If - conditional rules

You can choose to apply certain masking rules to only a subset of rows within a table, while still allowing other masks to be applied to those rows. This can be achieved through the use of if-conditions in rules lists.

DynamoDB note: Masking Amazon DynamoDB uses a mask_table task, so standard mask_table condition parameters apply.

Example

In the following example, the last_name of all users will be replaced with 'Smith', but the user's gender will determine the mask applied to their first_name:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: last_name
        masks:
        - type: from_fixed
          value: 'Smith'
      - if:
          - column: gender
            equals: 'female'
        rules:
          - column: first_name
            masks:
            - type: from_fixed
              value: 'Alice'
        else_rules:
          - if:
              - column: gender
                equals: 'male'
            rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Bob'
            else_rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Chris'

This example applies the same conditions in a mask_tabular_file task.

version: "1.0"
name: basic_file_mask
tasks:
  - type: mask_tabular_file
    include:
      - glob: "*.csv"
    rules:
      - if:
        - column: gender
          equals: 'female'
        rules:
          - column: first_name
            masks:
            - type: from_fixed
              value: 'Alice'
        else_rules:
          - if:
              - column: gender
                equals: 'male'
            rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Bob'
            else_rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Chris'

Condition Parameters
  • if (required): A list of conditions (see below) that must all evaluate as true for the nested list of rules to be applied to a row.
  • rules (required): A nested list of masking rules/nested-if-conditions (or dictionary mapping labels to rules) that will only be applied to rows that meet the conditions defined under if.
  • else_rules (optional): A nested list of masking rules/nested-if-conditions (or dictionary mapping labels to rules) that will only be applied to rows that do NOT meet the conditions defined under if.
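
The way if, rules and else_rules nest can be sketched in Python. This is a simplified model, not DataMasque's engine: the apply_rules function is hypothetical, only the equals condition is modelled, and each leaf rule is reduced to a fixed replacement value.

```python
def apply_rules(row, rules):
    """Apply a list of rules (plain or conditional) to a single row."""
    for rule in rules:
        if "if" in rule:
            # Every condition under 'if' must hold for 'rules' to apply.
            met = all(row.get(c["column"]) == c["equals"] for c in rule["if"])
            branch = rule["rules"] if met else rule.get("else_rules", [])
            apply_rules(row, branch)
        else:
            row[rule["column"]] = rule["value"]  # simplified from_fixed mask
    return row


ruleset = [
    {"column": "last_name", "value": "Smith"},
    {
        "if": [{"column": "gender", "equals": "female"}],
        "rules": [{"column": "first_name", "value": "Alice"}],
        "else_rules": [
            {
                "if": [{"column": "gender", "equals": "male"}],
                "rules": [{"column": "first_name", "value": "Bob"}],
                "else_rules": [{"column": "first_name", "value": "Chris"}],
            }
        ],
    },
]

for gender in ["female", "male", "unspecified"]:
    row = {"first_name": "X", "last_name": "Y", "gender": gender}
    print(apply_rules(row, ruleset))
```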

A condition under if can contain the following attributes.

There are 3 different parameters that can be used to select which data values the condition is applied to:

  • column (required): The database column to check this condition against. The column name can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.
  • json_path (required within mask_file tasks): For JSON records, a json_path must be provided to extract the value to which the condition will be applied.
  • xpath (required within mask_file tasks): For XML records, an xpath must be provided to extract the value to which the condition will be applied.

The following conditions can be applied to the selected data values:

  • equals (optional): If specified, the condition will only evaluate as true if the value exactly equals the specified value. Data types are also checked (i.e. 100 is not equal to "100"). Data extracted from XML is treated as strings and the comparison value in the ruleset must be quoted.
  • matches (optional): If specified, the condition will only evaluate as true if the string of the value matches the specified regular expression. For more details on how to use regular expressions, see Common regular expression patterns.
  • less_than (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is less than the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • less_than_or_equal (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is less than or equal to the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • greater_than (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is greater than the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • greater_than_or_equal (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is greater than or equal to the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • age_greater_than (optional): If specified, the condition will only evaluate as true if the value is a date/datetime and the difference in years is greater than this value. If the value is not a date/datetime use date_format to try to parse it as a date/datetime. If date_format is not provided the date format %Y-%m-%d will be used as a fallback. Can be combined with age_less_than to enforce an age within a range.
  • age_less_than (optional): If specified, the condition will only evaluate as true if the value is a date/datetime and the difference in years is less than this value. If the value is not a date/datetime use date_format to try to parse it as a date/datetime. If date_format is not provided the date format %Y-%m-%d will be used as a fallback. Can be combined with age_greater_than to enforce an age within a range.
  • disable_auto_date_parsing (optional): By default, DataMasque attempts to parse strings in rulesets that look like dates into dates/datetimes. This can cause comparisons to fail unexpectedly due to type mismatches. Set disable_auto_date_parsing to true to disable this automatic parsing. This setting is overridden when the date_format argument is specified.

When performing conditions on values containing dates/datetimes the following parameter can be specified:

  • date_format (optional): The format of the date/datetime stored. If specified, the condition will attempt to load the column value or the value at the json_path/xpath as a datetime with the specified format. If the value cannot be parsed with the date_format an error will be raised.

Notes:

  • If the comparison is any of less_than, less_than_or_equal, greater_than, or greater_than_or_equal:
    • now or current_date_time can be specified as the value to use the current datetime at the time of masking.
    • Use a similar format to the date/datetime stored e.g. if the datetime contains hours:mins:seconds+time_zone, specify those in the value to compare against ("2012-12-12 10:30:00+00:00").
  • If the date/datetime contains timezone information it will be compared to the specified date with UTC timezone. This also applies when using now or current_date_time.
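
As a minimal sketch of the age conditions described above, the following ruleset masks username only for rows where date_of_birth corresponds to an age greater than 18 and less than 65. The users table and its column names are hypothetical, and placing age_greater_than and age_less_than in the same condition entry is an assumption about how the two are combined.

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: users              # hypothetical table
    key: user_id
    rules:
      - if:
        # date_of_birth is assumed to hold strings such as "1990-05-17"
        - column: date_of_birth
          age_greater_than: 18
          age_less_than: 65
          date_format: "%Y-%m-%d"
        rules:
          - column: username
            masks:
              - type: from_fixed
                value: "Bob"
```

If a stored value cannot be parsed with the given date_format, the run raises an error, as noted for date_format above.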

An additional option can be specified for conditions within mask_file tasks:

  • on_missing (optional): Determines how to handle records where no value exists at a condition's json_path or xpath, or where the record is not valid JSON/XML and a condition specifies a json_path/xpath:
    • error (default): Raise an error in the run log.
    • skip: Skip this rule for the current record/file.
    • apply_if_rules: Apply the rules defined in rules that are normally applied when the condition is True.
    • apply_else_rules: Apply the rules defined in else_rules that are normally applied when the condition is False. Only permitted when else_rules are provided.

Notes:

  • When comparing values in an XML document, equals conditions that compare against non-string values will always be false, since values are stored as strings in the XML document. Make sure equals conditions exactly match what is contained in the document.
  • When a database column specified in an if does not exist, the masking run will always raise an error. on_missing behaviour only applies when no value exists at a specified json_path or xpath, or when the column/record contains invalid JSON/XML and a json_path/xpath is specified.

Conditions can also be grouped with the logical operators or, not, and and:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
    - if:
      - and:
        - not:
          - column: username
            matches: 'customer_\w+'
        - or:
          - column: admin
            equals: true
          - column: role
            equals: "admin"
      rules:
        - column: username
          masks:
          - type: from_fixed
            value: "Bob"

Note: When using an if conditional in rulesets, final row counts will reflect the number of rows processed rather than the number of rows masked. This is because rows are filtered on the application side, so all rows fetched are processed and added to the row count. Alternatively, where conditionals can be used in the ruleset, which will provide an accurate row count of masked rows.


Skip - not masking specific values

A common use-case is to not apply masks to certain values, e.g. to leave NULL values or empty strings unchanged. You can choose to not mask certain values in a column by specifying a number of values to skip:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: username
        skip:
         - null
         - ""
         - matches: 'admin_\w+'
        masks:
          - type: from_fixed
            value: "Bob"

Any column values that are exactly equal to any of the string/numeric/null values in the skip list will not be masked (data types are also checked, i.e. 100 is not equal to "100"). Additionally, string column values matching a regular expression can be skipped by specifying the skip value as matches: "my_regex". For more details on how to use regular expressions, see Common regular expression patterns.
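
To illustrate the type checking, here is a small sketch in which a hypothetical balance column is left unchanged when it holds NULL or the number 0. Because data types are compared, a row storing the string "0" would not be skipped. The accounts table, its columns, and the replacement value are illustrative assumptions.

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: accounts           # hypothetical table
    key: account_id
    rules:
      - column: balance
        skip:
          - null
          - 0                 # numeric zero; the string "0" is not equal to this
        masks:
          - type: from_fixed
            value: 100
```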


Conditional masking for mask_file tasks

Conditional masking for object files (as part of a mask_file task) differs from database masking. For databases, conditions are evaluated per row, and affect the values (columns) in only that row.

For object files, the condition is applied per file and affects values for that file.

Multi-record object files (NDJSON or Avro) are more like databases in that the conditions are applied per record in the file and affect values for that record.

Masking tabular files (with mask_tabular_file) applies rules per row in the file, again, similar to masking a database table.

For the full list of parameters, please refer to Condition Parameters.

Note: where is not supported for conditional masking of files, because where restricts what is returned by database queries and therefore does not apply to files.

The implementation of skip for files is different from databases. It is used in conjunction with include to choose which files to mask or not. Please refer to mask_file tasks and mask_tabular_file tasks.

For tabular file masking, refer to the general If section.

Here are the use cases for each type of condition:

Use case | Mechanism
I want to apply certain masking rules to the values contained in the files. | If
I want to skip certain files based on the file name. | Please refer to the skip option for mask_file tasks

Below are some example rulesets for conditional file masking.

If - conditional rules (files)
Example

In the following example, the last_name of all users will be replaced with 'Stevens', but the user's gender will determine the mask applied to their first_name:

version: "1.0"
name: json_with_conditional
tasks:
  - type: mask_file
    rules:
      - masks:
        - type: json
          transforms:
            - path: ["last_name"]
              on_missing: error
              masks:
                - type: from_fixed
                  value: "Stevens"
      - if:
        - json_path: ["gender"]
          equals: "female"
        on_missing: apply_if_rules
        rules:
          - masks:
            - type: json
              transforms:
                - path: ["first_name"]
                  on_missing: error
                  masks:
                    - type: from_fixed
                      value: "Alice"
        else_rules:
          - if:
            - json_path: ["gender"]
              equals: "male"
            on_missing: skip
            rules:
              - masks:
                - type: json
                  transforms:
                    - path: ["first_name"]
                      on_missing: error
                      masks:
                        - type: from_fixed
                          value: "Bob"
            else_rules:
              - masks:
                - type: json
                  transforms:
                    - path: ["first_name"]
                      on_missing: error
                      masks:
                        - type: from_fixed
                          value: "Jamie"

Similarly, with XML files:

version: "1.0"
name: xml_with_conditional
tasks:
  - type: mask_file
    rules:
      - masks:
        - type: xml
          transforms:
            - path: "//Root/Employee/LastName"
              on_missing: error
              node_transforms:
                - type: text
                  masks:
                    - type: from_fixed
                      value: Stevens
      - if:
        - xpath: "//Root/Employee/Gender/text()"
          equals: "female"
        on_missing: apply_if_rules
        rules:
          - masks:
            - type: xml
              transforms:
                - path: "//Root/Employee/FirstName"
                  on_missing: error
                  node_transforms:
                    - type: text
                      masks:
                        - type: from_fixed
                          value: Alice
        else_rules:
          - if:
            - xpath: "//Root/Employee/Gender/text()"
              equals: "male"
            on_missing: skip
            rules:
              - masks:
                - type: xml
                  transforms:
                    - path: "//Root/Employee/FirstName"
                      on_missing: error
                      node_transforms:
                        - type: text
                          masks:
                            - type: from_fixed
                              value: Bob
            else_rules:
              - masks:
                - type: xml
                  transforms:
                    - path: "//Root/Employee/FirstName"
                      on_missing: error
                      node_transforms:
                        - type: text
                          masks:
                            - type: from_fixed
                              value: Jamie

An example for tabular files can be found in the If section.


Mask types

Masks are the basic 'building-block' algorithms provided by DataMasque for generating and manipulating column values. Multiple masks can be combined in a list to create a pipeline of transformations on the data, or combined using combinator masks to build up more complex output values.

Parameters

Mask algorithms are defined by their type parameter - this parameter is common to (and required by) all masks:

  • type (required) determines the type of mask, and therefore what other parameters can be specified.

Note: Masks operate by either manipulating the original column value, or by generating an entirely new value that replaces the original value. The latter can be referred to as a 'source' mask, as the mask is a source of new values. Such 'source' masks are indicated by the from_ prefix on the mask type.

Available mask types

Generic masks

Fixed value (from_fixed)

A simple mask that replaces all column values with the same fixed value.

Parameters
  • value (required): The value to replace all column values with. Can be any data type, but should match that of the column being masked. This value can be enclosed in quotation marks, which will convert the value to a string, or entered without quotation marks.
Example

This example will replace all values in the name column of the employees table with 'Alex'.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: ID
    rules:
      - column: name
        masks:
          - type: from_fixed
            value: 'Alex'

name (before) | name (after)
Bill | Alex
Chris | Alex
Anastasia | Alex
Judith | Alex
Gordon | Alex
Joel | Alex

DataMasque also supports blank values for the from_fixed mask type. To specify a blank entry, you can enter the value as "", '', or leave the field blank if there are no spaces.

version: '1.0'
tasks:
  - type: mask_table
    table: customers
    key: id
    rules:
      - column: name
        masks:
          - type: from_fixed
            value: ''

From column (from_column)

A simple mask to replace a column's value by copying the value from another database column (from the same table, or a joined table).

Parameters
  • source_column (required): The name of the column to copy values from. The source column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.

Note: from_column is not supported for mask_file tasks, but is supported for mask_tabular_file tasks.

Example

This example will replace the values of the name column in the employees table with values from the first_name column of the users table, joined on id.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    joins:
      - target_table: users
        target_key: id
        source_table: employees
        source_key: user_id
    rules:
      - column: name
        masks:
          - type: from_column
            source_column: users.first_name

Given the users table:

id | first_name
24 | Kyle
25 | Reid
26 | Helen
27 | Callum
28 | Raymond
29 | Vivien

The following transformation will take place on the employees table.

user_id | name (before) | name (after)
26 | Bill | Helen
25 | Chris | Reid
29 | Anastasia | Vivien
24 | Judith | Kyle
27 | Gordon | Callum
28 | Joel | Raymond


From file (from_file)

A mask to replace column values by selecting from a column (seed_column) in a CSV formatted seed file (seed_file). The seed file must contain one or more columns with named column headers. See the Files documentation for more information on uploading seed files.

If there are any blank values in the file, they will not be considered when randomly selecting a replacement value. If you wish to have blank values randomly selected from your seed file, ensure they are enclosed with quotation marks. For example, "".

Potential replacement values from the seed file rows may be pre-filtered using the table_filter_column and seed_filter_column parameters. This allows you to subset the available values for each row by matching a column value from the table to a column value in the seed file. If filtering does not match any rows in the seed file, fallback_masks can be provided to generate the replacement value instead.

After filtering has been applied (optional), replacement values from the seed file are selected randomly by default. To enable deterministic selection of replacement values, you may specify one or more hash_columns for the masking rule. See Deterministic masking for more details.

Parameters
  • seed_file (required): The name of a user-provided CSV seed file to select values from (see Files guide).
  • seed_column (optional): The name of the column in the seed file that will provide replacement values. This property only needs to be specified for CSV files with multiple columns.
  • table_filter_column and seed_filter_column (optional): If both attributes are provided, the replacement value will only be selected from CSV rows where the value for the seed_filter_column in the CSV row is equal to the value for the table_filter_column column in the database row being masked.
  • fallback_masks (optional): If there are no rows in the CSV that match the table_filter_column/seed_filter_column condition, then this nested list of masks will be applied to the column value instead. If no fallback_masks are provided and no match is found, then the column value will be replaced with a NULL value.
  • null_string (optional): If set, any values in the seed file column that match this value will be treated as NULL (instead of a string) when written to the database.
Example

This example masks the street_name column with a random street name from a seed file (DataMasque_address_example.csv) while ensuring that the randomly selected street is from the same city as the original. A fallback value of “Main Street” is used in the event that the database row contains a city that does not exist in the CSV.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: street_name
        masks:
          - type: from_file
            seed_file: DataMasque_address_example.csv
            seed_column: street_name
            table_filter_column: city
            seed_filter_column: city
            fallback_masks:
              - type: from_fixed
                value: 'Main Street'

Given the following database table and using the DataMasque_address_example.csv seed file:

street_number | street_name | city
44 | Queen Street | Auckland
57 | Oxford Street | London
12 | Houston Street | New York
5 | Fake Street | Fake City

This mask could produce the following masked output:

street_number | street_name | city
44 | High Street | Auckland
57 | Brick Lane | London
12 | Maiden Lane | New York
5 | Main Street | Fake City
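
The null_string parameter is not used in the example above. A minimal sketch follows, assuming the seed file marks missing street names with the literal text NULL; that marker is an illustrative assumption about the file's contents, not something the seed file is known to contain.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: street_name
        masks:
          - type: from_file
            seed_file: DataMasque_address_example.csv
            seed_column: street_name
            # Seed values equal to 'NULL' are written to the database as NULL
            # rather than as the four-character string (marker is assumed).
            null_string: 'NULL'
```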


From format string (from_format_string)

A simple mask to generate values of a defined format. This is useful when all values of a column have the same format. DataMasque will generate values that follow a format written in the format string syntax.

Parameters
  • value_format (required): The format of the data to be generated, which must meet the requirements of the format string syntax.
Example

This example replaces each value in the licence_plate column of the drivers table with three random uppercase letters followed by three random digits.

version: '1.0'
tasks:
  - type: mask_table
    table: drivers
    key: id
    rules:
      - column: licence_plate
        masks:
          - type: from_format_string
            value_format: "{[A-Z],3}{[0-9],3}"

licence_plate (before) | licence_plate (after)
ABC123 | VTC799
DEF456 | GEQ249
GHI789 | HWQ462
JKL246 | INY664
MNP813 | AKZ548
RTU579 | SCL654

Note

  • Format strings in a mask_table task support more than the formats available to a mask_unique_key task. For example, you can select one value from a small number of alternatives using the syntax for alternatives: {(EN|FR)}
  • If you need to guarantee that generated values are unique within a column, consider using the from_unique mask type.
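
The alternatives syntax mentioned in the note can be sketched as follows. The licence_region column is hypothetical, and combining an alternatives group with a character-class group in a single format string is an assumption based on the two forms shown in this section.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: drivers
    key: id
    rules:
      - column: licence_region      # hypothetical column
        masks:
          - type: from_format_string
            # Either EN or FR, followed by four digits
            value_format: "{(EN|FR)}{[0-9],4}"
```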

From unique (from_unique)

A simple mask that generates string values that are guaranteed to be unique within the target column.

Note: To automatically cascade primary and unique key values to foreign keys or mask composite keys, consider using mask_unique_key instead.

Parameters
  • format (required): The format that will be used to generate values. See format string syntax for details on how to construct a format string.
Example

This example replaces each value in the licence_plate column of the drivers table with a unique value made up of three uppercase letters followed by three digits.

version: '1.0'
tasks:
  - type: mask_table
    table: drivers
    key: id
    rules:
      - column: licence_plate
        masks:
          - type: from_unique
            format: "{[A-Z],3}{[0-9],3}"

licence_plate (before) | licence_plate (after)
ABC123 | ABT675
DEF456 | AAT808
GHI789 | AAB476
JKL246 | AAV509
MNP813 | ABH457
RTU579 | ABO157

Note

  • All values produced by from_unique will be strings. To convert values to other data types, you can chain your from_unique mask with a typecast mask.
  • Unlike from_format_string, from_unique ensures generated values are unique. This is achieved using the same underlying unique value generation procedure as mask_unique_key.
  • The specified format must allow for a sufficient number of unique values to cover the full number of rows in the target table or file (rounded up to the nearest multiple of the batch_size for mask_table tasks), otherwise an error will be reported when executing the run.
  • Using the from_unique mask type differs from the mask_unique_key task type in the following ways:
    • from_unique cannot perform any cascading of values to related columns, such as foreign keys.
    • from_unique cannot be used to guarantee joint uniqueness across the columns in a composite key, unless guaranteeing uniqueness within a single column within the composite key is sufficient for the target use case.
    • from_unique can be used to update unique columns at the same time as other columns in a mask_table task, without the need for a separate database update performed by a mask_unique_key task.
    • from_unique can be used to generate unique values in mask_tabular_file tasks or in mask_table tasks for databases that do not support mask_unique_key (such as DynamoDB).
  • Using from_unique in a mask_table task where worker_count > 1 is only supported for Oracle and Amazon DynamoDB connections.
  • from_unique cannot be used in the rules of a mask_file task.
  • Like other mask types, from_unique cannot be used to update the key columns of a mask_table task (except for databases that allow mask_table to update key columns, such as DynamoDB).
  • All from_unique masks within a given task will be produced from the same sequence of unique values, such that two from_unique masks with identical format will produce the same value for the same row. This can be useful to ensure values produced by identical from_unique masks in different if branches are jointly unique.
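
Because from_unique always produces strings, a numeric column needs a typecast step after it, as the first note suggests. The following is a minimal sketch of that chain; the licence_number column is hypothetical and assumed to be an integer column.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: drivers
    key: id
    rules:
      - column: licence_number      # hypothetical integer column
        masks:
          # Generates unique six-digit strings
          - type: from_unique
            format: "{[0-9],6}"
          # Converts each generated string to an integer
          - type: typecast
            typecast_as: integer
```

Leading zeros are lost in the conversion, but because every generated string has the same length, distinct strings still map to distinct integers. This format allows one million values, so the table (rounded up to the batch size) must not exceed that.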

Typecast (typecast)

A simple mask to convert a value from one data type to another. This is particularly useful when a value is stored as one data type in a database, but needs to be transformed to a different data type to apply certain mask types within DataMasque.

If the typecast cannot be performed (e.g. because the value of the column is incompatible with the target data type), an error will be raised.

Parameters
  • typecast_as (required): The data type you wish to convert the column value to. Must be one of: date, datetime, string, integer, float, or boolean.
  • date_format (optional): When converting a string to a datetime, this format string can be provided to specify the expected format of the string. Conversely, when converting a datetime to a string, this format string can be provided to determine the format of the resulting string. Format strings must be valid Python date/time format strings.
Example

This example will replace every start_date in the employees table with a fixed date value of 2010-01-01. The fixed date value is specified as a string and typecast as a datetime before being written to the database.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: start_date
        masks:
          - type: from_fixed
            value: '2010-01-01'
          - type: typecast
            typecast_as: datetime
            date_format: '%Y-%m-%d'

Do nothing (do_nothing)

A simple mask that does not mask the data; instead, it keeps the original value. This is useful when there are certain items or elements of columns that are not intended to be masked.

For example, you can use it to retain certain known XML attributes and mask all the rest. For more information please refer to XML Documentation.
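
As a minimal sketch for a database table, do_nothing can make an "unchanged" branch explicit in a conditional rule. The users table and its columns are hypothetical, and the sketch assumes do_nothing needs no parameters other than type.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: users                  # hypothetical table
    key: user_id
    rules:
      - if:
        - column: role
          equals: "admin"
        rules:
          - column: username
            masks:
              - type: do_nothing  # admin usernames keep their original value
        else_rules:
          - column: username
            masks:
              - type: from_fixed
                value: "Bob"
```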


Manipulating date values as strings

If you wish to manipulate non-string values using their string representations, you can use the typecast mask. Using the pattern demonstrated below, you may convert the value to a string, apply some manipulations, and then convert back to the original data type. For example, to truncate the day and month from a date while retaining the year, you may do the following:

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        masks:
          # Convert to string
          - type: typecast
            typecast_as: string
            date_format: '%Y-%m-%d'
          # Take the 'YYYY-' component and append a fixed value of
          # '01-01' to create the value 'YYYY-01-01'
          - type: concat
            masks:
              - type: replace_substring
                start_index: 5
                masks:
                - type: from_fixed
                  value: '01-01'
          # Convert back to original datatype
          - type: typecast
            typecast_as: datetime
            date_format: '%Y-%m-%d'

From choices (from_choices)

A mask to replace column values by selecting from a list of choices or a dictionary of choices with weights specified.

Parameters
  • choices (required): Can be either a list of strings or a list of dictionaries. In a list of dictionaries, each dictionary has a single key (the choice) whose value is the weight.

    Weights must be positive, and can be a mix of float and int. Weights do not need to sum to a particular value, they are relative.

    If there are no weights assigned to any of the choices then all choices will have equal weights.

Examples

This example masks the department column with a randomly selected department from the list of choices.

version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: department
        masks:
          - type: from_choices
            choices:
              - "accounting"
              - "sales"
              - "research and development"

department (before) | department (after)
employee | research and development
employee | accounting
employee | sales
employee | research and development
employee | accounting
employee | sales
employee | research and development
employee | sales
employee | accounting
employee | research and development

This example masks the department column with a randomly selected department from the list of choices based on the weights provided.

version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: department
        masks:
          - type: from_choices
            choices:
              - "accounting": 10
              - "sales": 15
              - "research and development": 25

department (before) | department (after)
sales | research and development
sales | accounting
sales | sales
sales | research and development
sales | research and development
sales | sales
sales | research and development
sales | sales
sales | accounting
sales | research and development


Combinator masks

Concatenate (concat)

A simple mask to concatenate the outputs of multiple masks together into a single string.

Parameters
  • masks (required): A list of masks (or dictionary mapping keys to masks) which will be evaluated and have their outputs concatenated into a single value. The original column value is provided as the input to each mask.
  • glue (optional): If provided, this string will be inserted between the output of each concatenated mask. Defaults to an empty string. Useful for separating values with spaces or commas.
Example

This example generates a full name by concatenating a user’s name with the fixed value ‘Smith’.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: name
        masks:
          - type: concat
            glue: " "
            masks:
              - type: from_column
                source_column: name
              - type: from_fixed
                value: "Smith"

name (before) | name (after)
Bill | Bill Smith
Chris | Chris Smith
Anastasia | Anastasia Smith
Judith | Judith Smith
Gordon | Gordon Smith
Joel | Joel Smith


Chain (chain)

A simple mask to chain other masks together in series. This mask is only useful in combination with concat, in the case where multiple masking operations need to be performed on one part of a concatenated mask.

Parameters
  • masks (required): A list of masks (or dictionary mapping keys to masks) that will be applied in sequence to the input value.
Example

This example selects a random name from the DataMasque_firstNames_mixed.csv file, transforms it to uppercase, and then concatenates ‘Smith’ onto it to generate a random full name.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: name
        masks:
          - type: concat
            glue: ' '
            masks:
              - type: chain
                masks:
                  - type: from_file
                    seed_column: firstname-mixed
                    seed_file: DataMasque_firstNames_mixed.csv
                  - type: transform_case
                    transform: uppercase
              - type: from_fixed
                value: 'Smith'

name (before) | name (after)
Bill | STEVE Smith
Chris | JAMES Smith
Anastasia | JENNIFER Smith
Judith | FRANK Smith
Gordon | CHARLIE Smith
Joel | CALEB Smith


String masks

Substitute (substitute)

Replace each character in a string with another random character from the same character set. The character sets are:

  • Uppercase letters (A-Z).
  • Lowercase letters (a-z).
  • Digits (0-9).

Characters not in these sets (such as punctuation and symbols) are not replaced.

This mask is designed to be easy to drop in place to mask values that must have a specific format, but whose value is not important. For example, it could be used to mask:

  • Phone numbers (e.g. +1 (555) 867-5309 to +2 (938) 123-8372)
  • License plates (e.g. BZF123 to LMA191)
  • Bank accounts (e.g. 10-9282-9478563-00 to 23-1840-6492817-01)
  • Passport numbers (e.g. FD194845 to CZ858584)

and so on.

substitute is a good, simple and safe default for many data types. However, it is not intended to generate perfect replacements for columns that must have special rules. For example, if a value must always start with the letter C, followed by 6 random numbers and letters, then substitute is not suitable as the C might be replaced with another letter.

The uppercase, lowercase and digits arguments can be used to disable the replacement of each of these character sets. No errors are raised if a character set is enabled but the string contains no characters from it; for example, it is safe to try to replace letters in a phone number field.

Parameters
  • force_change (optional): Since characters are chosen randomly, it is possible that a character might be randomly replaced with the same one (for example, A is chosen as a replacement for A). Set force_change to true to make sure the replacement character differs. Defaults to false. Note that this makes the output slightly less random as the number of possible replacements is reduced by one.
  • uppercase (optional): A boolean to enable or disable the replacement of uppercase characters. Defaults to true (uppercase characters will be replaced).
  • lowercase (optional): A boolean to enable or disable the replacement of lowercase characters. Defaults to true (lowercase characters will be replaced).
  • digits (optional): A boolean to enable or disable the replacement of digits. Defaults to true (digits will be replaced).
Example

This example will apply substitute masks to the phone, license_plate and validation_code columns.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: phone
        masks:
          - type: substitute
      - column: license_plate
        masks:
          - type: substitute
      - column: validation_code
        masks:
          - type: substitute


Before After
phone license_plate validation_code
(09) 8198822BA981aFec9-LIZN7
+64 (21) 0917762GL174877HG8-bbA9
1-800 GET-MASQUECDF345Lm85-gC5D
phone license_plate validation_code
(29) 01691548BV912bZwh0-NCZY9
+91 (45) 54173964XP916501MV0-kqC7
2-975 JDV-PLASHELCU788Ys04-wL9V
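
The example above relies on the default settings. As a sketch of the optional parameters, the following variant keeps the digits of license_plate unchanged while replacing its letters, and forces every letter and digit of validation_code to differ from the original. The column names are reused from the example above.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: license_plate
        masks:
          - type: substitute
            digits: false         # letters are replaced; digits are kept
      - column: validation_code
        masks:
          - type: substitute
            force_change: true    # a character is never replaced with itself
```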


Random text (from_random_text)

This mask replaces the column's value with randomly generated a-z characters.

Parameters
  • max (required): The generated character string will be this length at maximum. The maximum length must be between 1 and 100.
  • min (optional): The generated character string will be this length at minimum. If no value is supplied here, the generated string's length will always be equal to the max value.
  • case (optional): The case (upper or lower) of the text generated. Mixed case will be generated if this field is left blank. Must be one of: upper, lower.
Example

This example replaces the values in the name column with a random string of lower case characters between 5 and 10 characters in length.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: name
        masks:
          - type: from_random_text
            min: 5
            max: 10
            case: lower

name (before) | name (after)
Bill | fjggrw
Chris | bjoquazqit
Anastasia | pljfrey
Judith | sdnbomx
Gordon | wpoieut
Joel | yptrf


Transform case (transform_case)

A simple mask to perform a transformation to the case/capitalisation of a string.

Parameters
  • transform (required): The transformation to apply. Must be one of: uppercase, lowercase, capitalize_words (capitalizes first letter of each word), capitalize_string (capitalizes first letter only).
Example

This example will convert all values in the name column into uppercase.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: name
        masks:
          - type: transform_case
            transform: uppercase

Before       After
name         name
Bill         BILL
Chris        CHRIS
Anastasia    ANASTASIA
Judith       JUDITH
Gordon       GORDON
Joel         JOEL


Substring (take_substring)

A simple mask to select a substring from a column value. You may wish to use this to select or remove a subset of characters from the beginning, end, or middle of a string.

Parameters
  • start_index (optional): The index of the first character to include in the selected substring, with 0 being the index of the first character in the string. Defaults to 0.
  • end_index (optional): The index of the character immediately AFTER the selected substring (i.e. the end_index is exclusive). If omitted, the selection will continue until the end of the string.

Positive and negative indices can be used, i.e. the first character in a string is at index 0, the second character is at index 1, the last character is at index -1, and the second-to-last character is at index -2.
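
As a minimal sketch of negative indexing, the following ruleset keeps only the last four characters of each value by setting start_index to -4 and omitting end_index. The accounts table and account_number column are hypothetical names chosen for illustration.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: accounts              # hypothetical table name
    key: id
    rules:
      - column: account_number   # hypothetical column name
        masks:
          - type: take_substring
            start_index: -4      # begin at the fourth-to-last character
            # end_index is omitted, so the selection runs to the end of the string
```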

Example

This example returns only the first 3 characters of each value in the name column, i.e. the characters at indices 0, 1, and 2. Because end_index is exclusive, the character at index 3 and everything after it are omitted from the final result.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: name
        masks:
          - type: take_substring
            start_index: 0
            end_index: 3

Before       After
name         name
Bill         Bil
Chris        Chr
Anastasia    Ana
Judith       Jud
Gordon       Gor
Joel         Joe


Replace substring (replace_substring)

A mask for transforming a selected substring of a string value. The transformation is defined by a nested sequence of masks. Matched substrings are transformed in-place, leaving the unmatched sections intact. For more complex use cases, replace_regex may be helpful.

Parameters
  • masks (required): A list of masks (or dictionary mapping keys to masks) that define the transformation to apply to the selected substring. The selected substring is provided as the input to the first mask.
  • start_index (optional): The index of the first character to include in the selected substring, with 0 being the index of the first character in the string. Defaults to 0.
  • end_index (optional): The index of the character immediately AFTER the selected substring (i.e. the end_index is exclusive). If omitted, the selection will continue until the end of the string.
  • preserve_length (optional): If set to true, then the output of the masks will be truncated or repeated until it has the same length as the original substring. This ensures the length of the entire string is unchanged. Defaults to false.

Positive and negative indices can be used, i.e. the first character in a string is at index 0, the second character is at index 1, the last character is at index -1, and the second-to-last character is at index -2.
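
The following sketch combines a negative end_index with preserve_length. It masks everything except the last four characters of each value; because preserve_length is true, the single '#' produced by from_fixed is repeated until it matches the length of the selected substring. The accounts table and account_number column are hypothetical names.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: accounts              # hypothetical table name
    key: id
    rules:
      - column: account_number   # hypothetical column name
        masks:
          - type: replace_substring
            end_index: -4          # exclusive: the last four characters are left intact
            preserve_length: true  # repeat '#' to the length of the selected substring
            masks:
              - type: from_fixed
                value: '#'
```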

Example

This example will replace the last 3 characters of each value in the name column with ###. The start_index value of -3 indicates that the third-to-last character is the beginning of the substring. Because end_index is not specified, the selection runs from the third-to-last character until the end of the string. The characters at index positions -3, -2, and -1 are therefore replaced by the fixed value '###', leaving the rest of the string unchanged.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: name
        masks:
          - type: replace_substring
            start_index: -3
            masks:
              - type: from_fixed
                value: '###'

Before       After
name         name
Bill         B###
Chris        Ch###
Anastasia    Anasta###
Judith       Jud###
Gordon       Gor###
Joel         J###


Replace regular expression (replace_regex)

A mask for transforming sections of a string that match a certain regular expression. The transformation that is applied to each matched substring is defined by a nested sequence of masks. The matched substrings are transformed in-place, leaving the unmatched sections intact.

Parameters
  • masks (required): A list of masks (or dictionary mapping keys to masks) defining the transformation to apply to each substring that matches the pattern specified in regex. The entire sequence of masks will be applied to each substring that is matched, with the matched value being provided as the input to the first mask.
  • regex (required): The regular expression that will be used to search for substrings to mask. For more details on how to use regular expressions, see Common regular expression patterns.
  • preserve_length (optional): If set to true, then each output of the masks will be truncated or repeated until it has the same length as the original matched substring. This ensures the length of the entire string is unchanged. Defaults to false.
Example

This example replaces each numeric character in the driversLicence column with #. Note that it is best practice to wrap the regular expression in quotes, to avoid special characters being misinterpreted as YAML syntax:

version: '1.0'
tasks:
  - type: mask_table
    table: '"DriversLicence"'
    key: id
    rules:
      - column: driversLicence
        masks:
          - type: replace_regex
            regex: '[0-9]'
            masks:
              - type: from_fixed
                value: '#'

Before            After
driversLicence    driversLicence
AB123456          AB######
CD987654          CD######
EF135790          EF######
GH246802          GH######
IJ112358          IJ######
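
As a further sketch of nesting a different mask inside replace_regex, the ruleset below targets the letters rather than the digits. Each run of uppercase letters is passed to a nested from_random_text mask, and preserve_length keeps each replacement the same length as the run it replaces. It reuses the table and column from the example above; the regular expression is an illustrative choice.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: '"DriversLicence"'
    key: id
    rules:
      - column: driversLicence
        masks:
          - type: replace_regex
            regex: '[A-Z]+'        # each run of uppercase letters is one match
            preserve_length: true  # keep each replacement as long as its match
            masks:
              - type: from_random_text
                max: 2
                case: upper
```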


Credit card (credit_card)

This mask provides a number of methods for masking and generating credit card numbers. Parameters can be set to control the prefix, PAN formatting, and Luhn validity of the generated numbers.

Parameters
  • issuer_names (optional): The generated card will have the specified issuers' prefix(es) and card lengths. If left empty, any card issuer may be used to generate the card number. Cannot be specified if preserve_prefix is true. Please refer to the list of issuers.
  • validate_luhn (optional): When true, the generated card number will pass the Luhn checksum. Set to false to generate random card numbers instead, which slightly improves performance by skipping Luhn validation. Cannot be true when pan_format is enabled. Defaults to true.
  • pan_format (optional): When true, the digits between the first four and the last six digits will be marked out with 'X's. Cannot be true when validate_luhn is enabled. Defaults to false.
  • preserve_prefix (optional): When true, the generated card will keep the original card issuer's prefix but generate new numbers for the remainder. The length of the prefix will vary based on the issuer. Because this option performs a lookup against a list of known issuers, it is more accurate but not as fast as plain random generation; the actual speed difference will depend on the data to be masked. Cannot be true if issuer_names have been specified. Defaults to false.
Example

This example generates credit card numbers that pass the Luhn checksum, with the card issuer set to either MasterCard, Visa, or American Express.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            issuer_names:
              - VISA
              - MASTERCARD
              - AMERICAN EXPRESS
            validate_luhn: true
            preserve_prefix: false
            pan_format: false

Before                After
credit_card_number    credit_card_number
4988418614189936      371006478248634
4429545392235346      5220082637809691
5208475828392947      4284336225480232

This example generates credit card numbers that retain the original card prefix, with the middle digits of each card number marked out with 'X'.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            validate_luhn: false
            preserve_prefix: true
            pan_format: true

Before                After
credit_card_number    credit_card_number
371006478248634       3781XXXXX248626
4429545392235346      4259XXXXXX809342
5208475828392947      52784XXXXXX480232
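
The parameter constraints above also allow preserve_prefix to be combined with Luhn validation, provided issuer_names is omitted and pan_format is left disabled. A minimal sketch of that combination, reusing the table and column from the examples above:

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            preserve_prefix: true   # keep the original issuer prefix
            validate_luhn: true     # the default, shown explicitly here
            # issuer_names is omitted: it cannot be combined with preserve_prefix
            # pan_format is omitted: it cannot be true with validate_luhn enabled
```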

Card issuer names

These card issuer names can be used as arguments to the issuer_names parameter. They are not case-sensitive.

Visa                         Mastercard                            American Express
China T-Union                China Unionpay                        Dankort
Diners Club International    Diners Club United States & Canada    Discover Card
Instapayment                 Interpayment                          JCB
Lankapay                     Maestro                               Maestro UK
MIR                          NPS Pridnestrovie                     Rupay
Troy                         Ukrcard                               Verce

Card prefixes

DataMasque contains a list of over 105,000 prefixes which are used when the preserve_prefix parameter is set. If a prefix is not found, then DataMasque falls back to preserving just the first digit.

A full list of prefixes can be found in the credit_card_prefixes.csv file (1.9MB CSV).


Imitate (imitate) (deprecated)

The imitate mask was renamed to substitute in DataMasque 2.11. Unless noted in the changelog, imitate masks will continue to function with the same behaviour and options as the substitute mask. Refer to the substitute mask documentation for usage details and configuration. Backwards compatibility may be removed in a future DataMasque version, so imitate should be replaced with substitute in rulesets, when possible.


Numeric masks

Random Number (from_random_number)

This mask replaces the column value with a random number. The default number type is integer. If decimal_places is set greater than 0, a decimal number will be generated.

Parameters
  • max (required): The generated number is guaranteed to be less than or equal to this maximum value.
  • min (optional): This is the minimum value for the generated number. Defaults to 0.
  • mode (optional): If this parameter is supplied, the number will be chosen from a triangular distribution where 'mode' is the peak. Useful for street numbers etc. (e.g. with min and mode both set to 1 to generate lower street numbers more frequently).
  • decimal_places (optional): Number of decimal places to generate. Defaults to 0.
Example (integer)

This example replaces the quantity column in the products table with a random integer between 1 and 200.

version: '1.0'
tasks:
  - type: mask_table
    table: products
    key: id
    rules:
      - column: quantity
        masks:
          - type: from_random_number
            min: 1
            max: 200

Before      After
quantity    quantity
30          152
40          108
60          159
100         84
150         88
120         176

Example (decimal)

This example replaces the price column in the products table with a random number to 2 decimal places between 1 and 200.

version: '1.0'
tasks:
  - type: mask_table
    table: products
    key: id
    rules:
      - column: price
        masks:
          - type: from_random_number
            min: 1
            max: 200
            decimal_places: 2

Before    After
price     price
10.00     101.52
20.10     30.89
50.01     44.23
80.99     93.01
150.80    6.11
199.50    190.90
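
The mode parameter can be sketched with the street number case mentioned in the parameter description. With min and mode both set to 1, values are drawn from a triangular distribution that peaks at 1, so lower numbers are generated more frequently. The addresses table, street_number column, and max value of 500 are hypothetical choices for illustration.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: addresses            # hypothetical table name
    key: id
    rules:
      - column: street_number   # hypothetical column name
        masks:
          - type: from_random_number
            min: 1
            mode: 1    # peak of the triangular distribution
            max: 500   # illustrative upper limit
```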


Random boolean (from_random_boolean)

This mask replaces the column value with a random boolean value (1/0 or true/false). from_random_boolean is effectively equivalent to a from_random_number mask with options max: 1 and min: 0.

Note: The return type is an integer, which can be automatically cast to the appropriate boolean type in most databases. For file masking and DynamoDB, a literal 1 or 0 will be written, which may not be the intended behaviour.

  • If literal true/false boolean values are required, please chain from_random_boolean with a typecast:
masks:
  - type: chain
    masks:
      - type: from_random_boolean
      - type: typecast
        typecast_as: boolean
  • If "true"/"false" strings are required please use the from_choices mask with "true" or "false" values (note the quoting around the values). For more information please refer to from_choices
Example

This example replaces the active column in the products table with random boolean values.

version: '1.0'
tasks:
  - type: mask_table
    table: products
    key: id
    rules:
      - column: active
        masks:
          - type: from_random_boolean

Before    After
active    active
true      true
true      false
false     false
false     true
true      false
false     true


Numeric Bucket (numeric_bucket)

This mask generates random integers from within the same "bucket" (numeric range) as the value to be masked. The mask is set up using an array of integers that define the lower bound of each bucket.

buckets Intro

Buckets are a series of defined lower bounds from which replacement values are randomly generated. For example, suppose an age column should be masked within the age buckets 1-17, 18-24, 25-64, and 65+. This is specified as buckets: [18,25,65]. If a value from the column is 16, the replacement value is randomly generated within the 1-17 range, rather than being any number.

Capping minimum and maximum values

The numeric bucket mask will generate numbers for the lowest bucket from 0 or 1 (depending on include_zero) up to the first bucket value minus 1. Similarly, for the highest bucket, numbers will be generated in the range from the last bucket value to 2^31 - 1. This default behaviour may cause undesired outputs which are outside the normal range for the data.

For example, consider a column containing ages. All ages currently in the column are 18 or older. Ages are to be retained in the buckets 35 or younger, 36-64, and 65+.

A ruleset could be defined with buckets: [36, 65], but this would not function correctly. Ages 35 or lower could be masked to any value from 0-35, so the masked age could be below 18, which does not match the application rules. A similar problem exists with ages of 65 and over: the masked value could be any value from 65 to 2^31 - 1.

To solve this issue, lower and upper bounds should be specified. This requires an understanding of the existing data. Choosing the upper bound would mean selecting a reasonable value that your ages could have. A better ruleset definition could be: buckets: [18, 36, 65, 100]. This would limit the lower age to 18 and upper age to 100.
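
The capped bucket definition described above could be written as the following ruleset. The users table and age column are assumed names for illustration.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: users   # assumed table name
    key: id
    rules:
      - column: age
        masks:
          - type: numeric_bucket
            # 18 caps the lowest age and 100 caps the highest,
            # giving the buckets 18-35, 36-64 and 65-99 for the existing data
            buckets: [18, 36, 65, 100]
```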

Parameters
  • buckets (required): A series of numbers representing the inclusive lower bounds of the buckets, e.g. [5, 10, 15, 20] would result in the ranges 1-4, 5-9, 10-14, 15-19, and 20+. Note: buckets should not contain duplicates and should be in ascending order.
  • force_change (optional): Since the replacement values are chosen randomly, it is possible that the number might be randomly replaced with the same one (for example, 45 is chosen as a replacement for 45). Set force_change to true to ensure the replacement value is not the same as the original value. Note that this makes the output slightly less random as the number of possible replacements is reduced by one.
  • include_zero (optional): When set to true, will lower the first lower bound to include 0, otherwise the lowest bound will be 1.
  • scale_to (optional): Provide a number that better represents the scale of your data in the target column to which replacement values will be scaled. e.g., If the scale of your data is 1000 (values are multiples of 1000), then 4583 is not an appropriate replacement for 2000. Set scale_to to 1000 so the replacement values will be multiples of 1000 (4583 -> 5000). scale_to must be a multiple of 10.

Note:

  • If buckets were defined with a size of one (e.g. buckets: [1,3,5]) and the force_change parameter set to true, this would cause an infinite loop. Instead, an error will be raised.
  • If the target column is of a type with a smaller range than 0 to 2^31 - 1 (smallint, tinyint, etc.), add a maximum value to the buckets, as replacement values could otherwise be larger than the column can store.
  • numeric_bucket requires a numeric value as the input. A typecast mask can be used to convert any string values before masking with numeric_bucket. For more information please refer to Typecast.

Example

This example replaces the age column in the users table with random values from the specified buckets.

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: age
        masks:
          - type: numeric_bucket
            buckets: [16,18,25,65,110]
            force_change: true

Before    After
age       age
18        22
24        23
36        55
16        17
43        56
65        100
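
The scale_to parameter is not used in the example above. As a sketch, assuming a hypothetical salary column in an employees table whose values are multiples of 1000, the following ruleset keeps replacement values at the same scale. The bucket boundaries are illustrative, with the first and last values acting as caps as described in Capping minimum and maximum values.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: employees   # hypothetical table name
    key: id
    rules:
      - column: salary # hypothetical column name
        masks:
          - type: numeric_bucket
            buckets: [20000, 50000, 100000, 250000]  # illustrative boundaries
            scale_to: 1000   # replacement values will be multiples of 1000
```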


Date/time masks

Random date/time (from_random_datetime)

This mask replaces the column value with a random datetime. Generated datetime values have a resolution of 1 second (the millisecond component will be 0).

Parameters
  • max (required): The generated date/time is guaranteed to be less than this date/time. The value must be formatted according to ISO 8601, or be now/current_date_time, meaning the date and time at which the run executes.
  • min (required): The minimum value for the generated date/time. The value must be formatted according to ISO 8601, or be now/current_date_time, meaning the date and time at which the run executes.

Note: A from_random_datetime mask can be used in conjunction with a Typecast mask in order to make sure the format of the generated date/datetime value satisfies any format requirements.

Example

This example replaces the order_time column with a random date/time between December 1 2019 00:00:00 UTC and December 31 2019 05:30:00 UTC.

version: '1.0'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: order_time
        masks:
          - type: from_random_datetime
            min: '2019-12-01T00:00+00:00'
            max: '2019-12-31T05:30+00:00'

Before                       After
order_time                   order_time
2019-12-07 19:50:12+00:00    2019-12-23 20:02:50+00:00
2019-12-10 22:33:00+00:00    2019-12-09 11:32:00+00:00
1966-12-22 07:09:01+00:00    2019-12-12 21:31:41+00:00
2019-12-27 09:23:23+00:00    2019-12-06 12:07:18+00:00
2019-12-29 12:06:50+00:00    2019-12-02 05:28:45+00:00
2019-12-30 08:40:34+00:00    2019-12-20 08:40:34+00:00
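
The max parameter can also be tied to the time of the masking run. The sketch below assumes that the keyword is written as now (the parameter description lists it as now/current_date_time), and generates date/times between an illustrative fixed start and the moment the run executes.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: order_time
        masks:
          - type: from_random_datetime
            min: '2020-01-01T00:00+00:00'  # illustrative lower bound
            max: now   # assumed spelling of the run-time keyword
```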


Random date (from_random_date)

This mask replaces the column value with a random date. Generated date values have a resolution of 1 day and do not include a time component.

Parameters
  • max (required): The generated date is guaranteed to be less than this date. The value must be formatted according to ISO 8601, or be now/current_date_time, meaning the date and time at which the run executes.
  • min (required): The minimum value for the generated date. The value must be formatted according to ISO 8601, or be now/current_date_time, meaning the date and time at which the run executes.

Note: A from_random_date mask can be used in conjunction with a Typecast mask in order to make sure the format of the generated date value satisfies any format requirements.

Example

This example replaces the date_of_birth column with a random date between January 1 1950 and December 31 2000.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: from_random_date
            min: '1950-01-01'
            max: '2000-12-31'

Before           After
date_of_birth    date_of_birth
1959-03-07       1997-08-25
1965-11-10       1970-10-09
1966-06-22       1995-01-18
1979-02-27       1973-04-18
1990-05-31       1961-02-26
1999-07-31       1992-04-30

Note that this mask uses the ISO 8601 standard for date values. This allows ISO weeks to be used when specifying dates: ISO week formats are specified with a 'W' in front of the week number, optionally followed by the day of the week. Additionally, the day of the year can be specified as a three-digit number. See Date formats below.

Date formats
Format        Example       Equivalent
yyyy          2022          January 1, 2022
yyyy-mm       2022-03       March 1, 2022
yyyy-mm-dd    2022-03-05    March 5, 2022
yyyy-Www      2022-W20      Week 20 of 2022 (May 16, 2022)
yyyy-Www-d    2022-W20-4    Day 4 of Week 20 of 2022 (May 19, 2022)
yyyy-ddd      2022-050      Day 50 of 2022 (February 19, 2022)
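
As a sketch of how these formats might be used, the ruleset below assumes that min and max accept any of the formats listed above. Going by the table, min: '2022' is equivalent to January 1, 2022 and max: '2022-W20-4' is equivalent to May 19, 2022. The start_date column is a hypothetical name.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: start_date   # hypothetical column name
        masks:
          - type: from_random_date
            min: '2022'         # year only: January 1, 2022
            max: '2022-W20-4'   # ISO week date: day 4 of week 20 (May 19, 2022)
```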

Retain age (retain_age)

This mask is designed to be used on a column containing dates. It will mask by generating random dates in such a way that age (in years) is retained, calculated based on the date the mask is executed. This can be used to ensure that columns such as date of birth, transaction date, registration date or other creation dates are masked without conflicting with other rules that may rely on the age of the masked date.

Parameters
  • date_format (optional): If the column value is read as a string type, then it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to "%Y-%m-%d".
  • day_of_year (optional): For consistent masking, specify a day_of_year to calculate the ages on that day of the current year. This should be specified in the format "<month>-<day>"; for example, "8-1" for the 1st of August. The year will be set to the current year. If this option is omitted, then ages will be calculated based on the current date.
Example

This example replaces the date_of_birth column values while retaining each employee's age. The dates are based on the masking run being executed on August 1st, 2022. The employees' ages are 50, 50, 49 and 22, respectively, and are maintained even with the new dates of birth.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: retain_age

Before           After
date_of_birth    date_of_birth
1972-07-31       1972-05-30
1972-08-01       1971-08-10
1972-08-02       1973-04-21
2000-02-29       1999-10-20

Note that the mask considers birthdays on the 29th of February to be celebrated on the 28th of February on non-leap-years.

If masking datetime values (rather than just dates) then the time portion of the masked value will be 00:00:00 (midnight).
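
Both optional parameters can be sketched together. The ruleset below assumes the date_of_birth column stores dates as strings in day/month/year form, and fixes the age calculation to the 1st of August so that results do not depend on the day the run is executed. The date format shown is an illustrative choice.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: retain_age
            date_format: '%d/%m/%Y'  # assumed string format, e.g. 31/07/1972
            day_of_year: '8-1'       # calculate ages as of the 1st of August
```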


Retain date component (retain_date_component)

This mask will generate a date in which up to two of the year, month, and day components are retained.

Parameters
  • date_format (optional): If the column value is read as a string type, then it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to "%Y-%m-%d".
  • year (this OR month OR day required): This is a bool which specifies whether the year component of the date should be retained. Defaults to False.
  • month (this OR year OR day required): This is a bool which specifies whether the month component of the date should be retained. Defaults to False.
  • day (this OR year OR month required): This is a bool which specifies whether the day component of the date should be retained. Defaults to False.
  • minimum_year (optional): This is the minimum year that can be generated. Defaults to 100 years ago.
  • maximum_year (optional): This is the maximum year that can be generated. Defaults to the current year.
  • force_change (optional): Ensures the newly generated date is different to the input date, in the event that the same date is randomly generated.
Example

In this example the month of each date is retained, while the day and year will be randomly generated. The year will be a value between 1980 and 2005 (inclusive).

version: '1.0'
tasks:
  - type: mask_table
    table: customers
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: retain_date_component
            month: True
            minimum_year: 1980
            maximum_year: 2005

Before           After
date_of_birth    date_of_birth
1972-07-31       1998-07-21
2000-08-01       2004-08-15
2010-10-02       1990-10-27
1996-02-29       2000-02-12

Note: In order for retain_date_component to consistently produce the same values between runs in different years, a fixed maximum_year must be specified.
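
Since up to two components can be retained, a further sketch keeps both the year and the month and randomises only the day. force_change is set so that the generated date differs from the input. The table and column names are reused from the example above.

```yaml
version: '1.0'
tasks:
  - type: mask_table
    table: customers
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: retain_date_component
            year: true          # keep the original year
            month: true         # keep the original month
            force_change: true  # ensure the new date differs from the input
```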


Retain year (retain_year)

This mask replaces the column value with a random date. The generated date will have the year component retained while randomising month and day.

Parameters
  • date_format (optional): If the column value is read as a string type, then it will be converted to a date using this format. The new randomly generated date will also be written back to the database in the same format. Format strings must be valid Python date/time format strings. Defaults to "%Y-%m-%d".
  • force_change (optional): Ensures the newly generated date is different to the input date, in the event that the same date is randomly generated.
Example

This example replaces the date_of_birth column values with new dates while retaining their year component.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        masks:
          - type: retain_year
            force_change: True

Before           After
date_of_birth    date_of_birth
1972-07-31       1972-01-30
1984-08-01       1984-11-10
1996-08-02       1996-06-28
2000-02-29       2000-02-13


Document masks

Document masks apply masking to subfields in documents that are stored in database columns, for example, JSON or XML documents.

JSON (json)

This mask uses a path to locate and mask a value inside a JSON document. The rest of the JSON document is unchanged. The path is specified as a list of strings or integers that is used to traverse the data to the values to be masked; some examples of path are covered in the next section.

Parameters
  • transforms (required): A list of the transforms (replacements) to perform on the JSON document.
    • path (required): The path to locate the value to update.
    • masks (required): A list of masks to be performed (Any of the valid Mask Types).
    • on_null (optional): A string to specify the action to take if the value is null. One of:
      • skip (default): Skip to the next transform, the document remains unchanged.
      • error: Raise an error and stop masking.
      • mask: Mask the null value as specified.
    • on_missing (optional): A string to specify the action to take if the value is not present (due to the document structure not matching the path).
      • skip: Skip to the next transform, the document remains unchanged.
      • error (default): Raise an error and stop masking.
    • force_consistency (optional): Keep consistency between replacements in the path. See the section JSON Example with force_consistency for details on behaviour. Defaults to false.
  • fallback_masks (optional): Mask to perform if the data retrieved from the database is not valid JSON.

If the json mask is provided a null value (e.g. from a SQL column), the value will remain null. fallback_masks will not be executed.

When masking multiple values in the same JSON document, multiple transforms should be specified, instead of multiple table masks with a single transform each. This means that the JSON column will only need to be serialized/deserialized once per row.

path Intro

For the JSON:

{
    "customer_details": {
        "first_name": "Richard",
        "last_name": "Willis"
    },
    "quantity": 18,
    "products": ["product1", "product2"]
}

The following paths could be used to refer to particular values:

  • [customer_details, first_name] refers to the value "Richard"
  • [customer_details, last_name] refers to the value "Willis"
  • [quantity] refers to the value 18
  • [products] refers to the product array ["product1", "product2"]
  • [products, 0] refers to the first value in the product array: "product1"

The wildcard operator * can be used to apply masks to multiple items matching the query. This is useful if you don't know how many elements will be in an array or object. For example, a JSON object with multiple people, each with multiple addresses:

{
  "users": [
    {
      "name": "Richard",
      "addresses": [
        {"type": "postal", "city": "Fairview"},
        {"type": "physical", "city": "Riverside"}
      ]
    },
    {
      "name": "Willis",
      "addresses": [
        {"type": "postal", "city": "Beachland"},
        {"type": "physical", "city": "Bronson"}
      ]
    }
  ]
}

The path [users, "*", name] would mask the name for every element in users, regardless of how many there are. Multiple wildcards can be used, too. The path [users, "*", addresses, "*", city] would mask city in all addresses elements of all users. Note that * must always be quoted in YAML.

Note: Values in path are case-sensitive. They should not follow quoting rules for database columns (double quotation marks in an outer set of single quotation marks). Instead, normal YAML string-quoting rules apply.
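
The wildcard paths above could be used in a ruleset like the following sketch, which masks every user name and every city within a single json mask by listing two transforms. The customers table, uid key, and json_data column are assumed names, and from_fixed is used only to keep the example short.

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: customers   # assumed table name
    key: uid
    rules:
      - column: json_data   # assumed column holding the JSON document
        masks:
          - type: json
            transforms:
              # name of every element in users
              - path: [users, "*", name]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
              # city of every address of every user
              - path: [users, "*", addresses, "*", city]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
```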

Example

This example replaces the data at the path [customer_details, first_name] of the json_data column with the fixed value REDACTED. The on_null: mask option is specified so that null values are masked as normal. The on_missing: skip option is specified so that, when a value is missing (i.e. the document structure does not match the path), the transform is skipped and masking continues.

Note that this means the first_name in the wrong location in the first row is not masked. In cases like this, it can be safer to specify error instead, so the masking run fails if data is not in the expected format. In the second row where {"first_name": null}, this value will be masked since we specified on_null: mask.

Also note the use of fallback_masks. The last row did not contain valid JSON, so the fallback mask was used to replace it with an empty JSON object, which may help clean the data for further use.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: uid
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: [customer_details, first_name]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
                on_null: mask
                on_missing: skip
            fallback_masks:
              - type: from_fixed
                value: "{}"

Before                                          After
json_data                                       json_data
{"c":{"first_name":"Sam"}}                      {"c":{"first_name":"Sam"}}
{"customer_details":{"first_name": null}}       {"customer_details":{"first_name": "REDACTED"}}
{"customer_details":{"first_name": "Harry"}}    {"customer_details":{"first_name": "REDACTED"}}
{"customer_details":{"first_name": "Sally"}}    {"customer_details":{"first_name": "REDACTED"}}
NOT_VALID_JSON                                  {}

For arrays, all masks will be applied to each value in the array. For example:

{
  "customer_details": {
    "given_names": ["Richard", "Willis"]
  }
}

The path [customer_details, given_names] would return the value ["Richard", "Willis"] and the masks would then be performed on "Richard" and "Willis" separately. This means for most mask types, each value in the array would be transformed into a new, different value. However, if you are using a mask that always returns the same value (e.g. from_fixed) all values would be transformed to the same new value.

Note:

  • In all databases, the json mask supports masking of JSON data stored in text type columns (VARCHAR, NVARCHAR or TEXT).
  • JSON-specific column types are also supported, for example, JSON in PostgreSQL and MySQL, or JSONB in PostgreSQL.
  • Arrays, maps, and sets inside DynamoDB columns can also be masked with the json mask. Sets are treated like arrays, with items indexed according to their sorted order.
JSON Example with force_consistency

This example will illustrate the benefit of using the force_consistency parameter on transforms. Suppose you have a table with JSON data with the following structure:

{
    "name": [
        {
            "use": "official",
            "family": "Chalmers",
            "given": ["Peter", "James"]
        },
        {
            "use": "usual",
            "given": ["Jim"]
        },
        {
            "use": "maiden",
            "family": "Windsor",
            "given": ["Peter", "James"]
        }
    ]
}

When masking the items at the path ['name', '*', 'given'], it is best to mask them consistently, i.e. so that the same masked names appear in each of the given arrays after masking. To do this, set the force_consistency parameter of the relevant transform to true.

version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
                force_consistency: true

Result (json_data column):

Before:
{"name":[{"use":"official","family":"Chalmers","given":["Peter","James"]},{"use":"usual","given":["Jim"]},{"use":"maiden","family":"Windsor","given":["Peter","James"]}]}
{"name":[{"use":"official","family":"Stevenson","given":["Todd","Carl"]},{"use":"usual","given":["Todd"]},{"use":"maiden","family":"Pallin","given":["Todd","Carl"]}]}
{"name":[{"use":"official","family":"Radgen","given":["John","Neil"]},{"use":"usual","given":["John"]},{"use":"maiden","family":"Hoppstadter","given":["John","Neil"]}]}
{"name":[{"use":"official","family":"Baulenas","given":["Eric","Miguel"]},{"use":"usual","given":["Eric"]},{"use":"maiden","family":"Ville","given":["Eric","Miguel"]}]}
{"name":[{"use":"official","family":"Asurmendi","given":["James","Bryan"]},{"use":"usual","given":["James"]},{"use":"maiden","family":"Gotsch","given":["James","Bryan"]}]}

After:
{"name":[{"use":"official","family":"Chalmers","given":["Claude","Dennis"]},{"use":"usual","given":["Claude"]},{"use":"maiden","family":"Windsor","given":["Claude","Dennis"]}]}
{"name":[{"use":"official","family":"Stevenson","given":["Zackery","Scot"]},{"use":"usual","given":["Zackery"]},{"use":"maiden","family":"Pallin","given":["Zackery","Scot"]}]}
{"name":[{"use":"official","family":"Radgen","given":["Joshua","Brandon"]},{"use":"usual","given":["Joshua"]},{"use":"maiden","family":"Hoppstadter","given":["Joshua","Brandon"]}]}
{"name":[{"use":"official","family":"Baulenas","given":["Andrew","Tanner"]},{"use":"usual","given":["Andrew"]},{"use":"maiden","family":"Ville","given":["Andrew","Tanner"]}]}
{"name":[{"use":"official","family":"Asurmendi","given":["Antonio","James"]},{"use":"usual","given":["Antonio"]},{"use":"maiden","family":"Gotsch","given":["Antonio","James"]}]}

Without force_consistency, every given name in the output JSON would be masked to a different value. An example is shown below:

Result (json_data column):

Before:
{"name":[{"use":"official","family":"Chalmers","given":["Peter","James"]},{"use":"usual","given":["Jim"]},{"use":"maiden","family":"Windsor","given":["Peter","James"]}]}
{"name":[{"use":"official","family":"Stevenson","given":["Todd","Carl"]},{"use":"usual","given":["Todd"]},{"use":"maiden","family":"Pallin","given":["Todd","Carl"]}]}
{"name":[{"use":"official","family":"Radgen","given":["John","Neil"]},{"use":"usual","given":["John"]},{"use":"maiden","family":"Hoppstadter","given":["John","Neil"]}]}
{"name":[{"use":"official","family":"Baulenas","given":["Eric","Miguel"]},{"use":"usual","given":["Eric"]},{"use":"maiden","family":"Ville","given":["Eric","Miguel"]}]}
{"name":[{"use":"official","family":"Asurmendi","given":["James","Bryan"]},{"use":"usual","given":["James"]},{"use":"maiden","family":"Gotsch","given":["James","Bryan"]}]}

After:
{"name":[{"use":"official","family":"Chalmers","given":["Parker","Joseph"]},{"use":"usual","given":["Mark"]},{"use":"maiden","family":"Windsor","given":["Jeffrey","Richard"]}]}
{"name":[{"use":"official","family":"Stevenson","given":["Dale","Sebastian"]},{"use":"usual","given":["Christopher"]},{"use":"maiden","family":"Pallin","given":["Johnathan","Bracken"]}]}
{"name":[{"use":"official","family":"Radgen","given":["Tyler","Robert"]},{"use":"usual","given":["Micheal"]},{"use":"maiden","family":"Hoppstadter","given":["Herbert","Ashton"]}]}
{"name":[{"use":"official","family":"Baulenas","given":["Artie","Alfred"]},{"use":"usual","given":["Pedro"]},{"use":"maiden","family":"Ville","given":["Henderson","Bryan"]}]}
{"name":[{"use":"official","family":"Asurmendi","given":["Benjamin","Michael"]},{"use":"usual","given":["Philip"]},{"use":"maiden","family":"Gotsch","given":["Kendrick","John"]}]}


XML (xml)

This mask will use a query to locate and mask a value inside an XML document. The rest of the XML document is unchanged. An Xpath (path) is used to define the path to the node to mask. Once the node has been located, one or more node_transforms can be applied to alter its content or attributes.

Note: The xml mask should only be used with trusted XML data. The parser includes support for entity expansion and external references which can potentially be exploited with malicious XML payloads.

Intro to transforms and node_transforms

XML documents are made up of one or more elements. When referring to an element, this includes the start tag, end tag, attributes and content. For example, this element representing a log:

<Log date="2022-08-09" username="user@example.com">Account created</Log>

The element to mask is located using an Xpath expression. Once found, there are a few different parts of the element that can be masked, namely:

  • its name (Log)
  • its attributes (date and username)
  • its text (Account created)

Each of these items is an XML node.
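
To make the three kinds of node concrete, here is how Python's standard xml.etree.ElementTree module exposes them for the Log element above. This only illustrates the XML terminology; it makes no claim about how DataMasque parses documents.

```python
import xml.etree.ElementTree as ET

log = ET.fromstring(
    '<Log date="2022-08-09" username="user@example.com">Account created</Log>'
)

name = log.tag                 # the element's name
attributes = dict(log.attrib)  # its attributes, as a name -> value mapping
text = log.text                # its text content
```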

When a masking run executes, each row from the database is fetched and passed to a masking function only once. To apply masks on different elements in an XML document, the ruleset should define a list of transforms, one for each element that requires masking. In turn, a list of node_transforms must be specified, one for each node of the element that needs to be masked.

Specifying masking in this manner allows the masking run to be more efficient by querying for each element to be masked only once.

As an example, consider how to mask the Log in the above example. The date and username attributes should be redacted, along with the text content. This would require one transform to locate the Log element, then three node transforms: one for the date attribute, another for the username attribute, and a final one to mask the text of the element.

The relevant portion of the YAML describing this transform would look like:

transforms:
  - path: 'Log'
    node_transforms:
      - type: attribute
        attributes: 'date'
        masks:
          - <list of masks>
      - type: attribute
        attributes: 'username'
        masks:
          - <list of masks>
      - type: text
        masks:
           - <list of masks>

Note: This assumes the Log element is not the root element of the XML document. To select the root element, use . or an absolute Xpath (starting with //) as the path.

All XML values are read as strings, so a typecast mask is required before any mask that needs a non-string input (e.g. numeric_bucket). XML values must also be written as strings, so masks that return non-string values (e.g. from_random_number, from_random_boolean, numeric_bucket) must be followed by a typecast mask before the value is written. For more information on typecast, please refer to the Typecast documentation. Below is an example with from_random_number.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            transforms:
              - path: 'Log'
                node_transforms:
                - type: attribute
                  attributes: 'id'
                  masks:
                    - type: from_random_number
                      min: 1000
                      max: 9999
                    - type: typecast
                      typecast_as: 'string'
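
The need for the two casts can be seen with any XML library. The sketch below uses Python's standard xml.etree.ElementTree as an analogy only; the rounding step is a hypothetical stand-in for a numeric mask and is not how numeric_bucket is defined.

```python
import xml.etree.ElementTree as ET

log = ET.fromstring('<Log id="1234">Account created</Log>')

raw_id = log.get("id")                 # attribute values are always read as str
rounded = (int(raw_id) // 100) * 100   # numeric work needs an explicit cast first
log.set("id", str(rounded))            # cast back to str before writing the value

serialized = ET.tostring(log, encoding="unicode")
```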

Consistency for multiple elements

Xpath expressions can match multiple elements. This XML document contains a UserLog with multiple Logs:

<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com">Logged in</Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>

The root is called Root in these examples – the root node does not need to be named Root.

The Xpath UserLog/Log would match all three Log elements. DataMasque can be configured to mask each of the specified nodes with the same value, or as different values. For example, the text of each element could be masked to the same value. Or, different masks can be applied to each located element. This is configured with the force_consistency option at the transform level. Setting this to true will apply each node transform in the same way to each element.
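
The difference between the two settings can be sketched in Python as follows. This is one plausible reading of the description above, not DataMasque's code: mask_logs and counter_values are hypothetical helpers, and the generated values exist only to make the effect visible.

```python
import itertools
import xml.etree.ElementTree as ET

XML = """<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com">Logged in</Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>"""


def mask_logs(xml_text, make_value, force_consistency):
    """Mask the text of every matched Log element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    logs = root.findall("UserLog/Log")  # the path matches all three elements
    shared = make_value() if force_consistency else None
    for log in logs:
        # Either reuse one generated value or generate a new one per element.
        log.text = shared if force_consistency else make_value()
    return [log.text for log in logs]


def counter_values():
    """Return a generator function producing event-1, event-2, ..."""
    counter = itertools.count(1)
    return lambda: f"event-{next(counter)}"


consistent = mask_logs(XML, counter_values(), force_consistency=True)
independent = mask_logs(XML, counter_values(), force_consistency=False)
```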

Xpath Relative Node

When evaluating an Xpath expression, the root node is treated as the current node. Therefore, the root node should not be included in relative Xpaths.

Consider this example document:

<Root>
    <UserLog>
        <Log/>
    </UserLog>
</Root>

To select the Log node, the Xpath Root/UserLog/Log is not valid, as Root is the current node. Instead UserLog/Log should be used as the path is relative to Root.

If using an absolute Xpath (i.e. an Xpath starting with //), the root node should be included. That is, the Xpaths //Root/UserLog/Log and UserLog/Log select the same node(s) in this case.
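
The relative-path rule behaves the same way in Python's standard xml.etree.ElementTree, which makes for a quick way to check a path against a sample document. This is an analogy only; ElementTree does not accept absolute paths on an element, so only the relative forms are shown.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<Root><UserLog><Log/></UserLog></Root>")

# Relative to Root: finds the Log element.
relative_hits = root.findall("UserLog/Log")

# Looks for a Root child inside Root, which does not exist: no match.
wrong_hits = root.findall("Root/UserLog/Log")
```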

XPath with XML namespaces

When an XML document uses namespaces, the namespace prefix is not used when specifying the Xpath, but instead the namespace URI is included in curly braces {} immediately before the element or attribute name. Note that you must include the namespace URI for each element or attribute in the path.

<Orders xmlns="http://example.com/api/"
        xmlns:o="http://example.com/api/orders/">
  <Order poNumber="55">
    <OrderId>20</OrderId>
    <o:Customer>
        <o:CustomerId>10</o:CustomerId>
        <o:State o:sentiment="good">Happy</o:State>
        <State>NSW</State>
    </o:Customer>
  </Order>
</Orders>

Here's an example ruleset to mask the above XML document:

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
        - xpath: "/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId/text()"
        masks:
        - type: xml
          transforms:
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId'
            on_missing: error
            node_transforms:
            - type: text
              masks:
              - type: from_random_number
                min: 50
                max: 99
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order'
            on_missing: error
            node_transforms:
            - type: attribute
              attributes: 'poNumber'
              masks:
              - type: from_random_number
                min: 50
                max: 99
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/orders/}State'
            on_missing: error
            node_transforms:
              - type: text
                masks:
                - type: from_choices
                  choices:
                  - Happy
                  - Sad
                  - Angry
                  - Anxious
                  - Excited
              - type: attribute
                attributes: '{http://example.com/api/orders/}sentiment'
                masks:
                - type: from_choices
                  choices:
                  - good
                  - bad
                  - excellent
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/}State'
            on_missing: error
            node_transforms:
              - type: text
                masks:
                - type: from_choices
                  choices:
                  - ABC
                  - DEF
                  - JKL
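
The {namespace URI}name notation used in these paths is also understood by Python's standard xml.etree.ElementTree, so a path can be tried against a sample document before it goes into a ruleset. The sketch below is an illustration under that assumption: the helper q is hypothetical, and the paths are written relative to the Orders root element rather than in the absolute form used in the ruleset above.

```python
import xml.etree.ElementTree as ET

API = "http://example.com/api/"
ORDERS = "http://example.com/api/orders/"


def q(uri, name):
    """Build a namespace-qualified name such as {http://example.com/api/}Order."""
    return "{" + uri + "}" + name


doc = ET.fromstring(
    """<Orders xmlns="http://example.com/api/"
        xmlns:o="http://example.com/api/orders/">
  <Order poNumber="55">
    <OrderId>20</OrderId>
    <o:Customer>
        <o:CustomerId>10</o:CustomerId>
        <o:State o:sentiment="good">Happy</o:State>
        <State>NSW</State>
    </o:Customer>
  </Order>
</Orders>"""
)

order = doc.find(q(API, "Order"))
order_id = order.find(q(API, "OrderId"))
customer = order.find(q(ORDERS, "Customer"))
prefixed_state = customer.find(q(ORDERS, "State"))  # the o:State element
default_state = customer.find(q(API, "State"))      # State in the default namespace
sentiment = prefixed_state.get(q(ORDERS, "sentiment"))
po_number = order.get("poNumber")  # an unprefixed attribute has no namespace
```

Note how the two State elements are told apart purely by their namespace URI, which is why each step of the path must carry one.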

Masking of unknown/extra attributes

There may be cases where XML elements sometimes have extra attributes that are not always known prior to masking. To mask these, the extra_attribute_masks option can be specified. This should contain a list of masks to apply to each attribute that has not been masked using a defined node_transform.

By default, the masks are applied to each "extra" attribute value separately. To force these values to be the same, specify force_extra_attribute_consistency: true at the transform level. The extra_attribute_masks will then be applied to the first extra attribute on the first node found, and the resulting value will be applied to all extra attributes. Note that the order in which attributes are located is indeterminate and may not match the order they appear in the XML.

Parameters

  • transforms (required): A list of the transforms (replacements) to perform on the XML document.
    • path (required): The Xpath expression to locate the value to update.
    • node_transforms (required): A list of transforms to apply to the nodes on the element. The syntax of this object is shown in the node_transforms Parameters section below.
    • on_missing (optional): A string to specify the action to take if the element at the given path is not present (due to the document structure not matching the path).
      • skip: Skip to the next transform; the document is unchanged by this transform.
      • error (default): Raise an error and stop masking.
    • force_consistency (optional): Require each matching element to be masked to the same values. Defaults to false.
    • extra_attribute_masks (optional): A list of masks to apply for attributes not covered by a specific node_transform.
    • force_extra_attribute_consistency (optional): Force all "extra" attributes to be masked to the same value. Only applicable when using extra_attribute_masks. Defaults to false.
  • fallback_masks (optional): Masks to perform if the data retrieved from the database is not valid XML.

If the xml mask is provided a null value (e.g. from a SQL column), the value will remain null. fallback_masks will not be executed.

node_transforms Parameters

node_transforms is a list of transforms to apply to the nodes of the found element(s).

  • type (required): The type of node(s) of the current element to apply masking to. Must be one of:
    • text: The text value of the element (the content between the opening and closing tags).
    • attribute: Mask one or more attribute(s) on the element.
    • name: Mask the name of the element itself.
  • masks (required): A list of masks to be performed (Any of the valid Mask Types).
  • attributes (optional): This option is required when using the attribute type, and must not be present for other types. May either be a string, or an array of strings, which specify the attributes to apply masks to. To apply different masks to different attributes, use multiple node_transforms.
  • on_missing_attribute (optional): A string to specify the action to take if an attribute is missing. Please see the section below on Missing XML Nodes, to see what constitutes a missing attribute.
    • skip: Skip to the next attribute (if masking multiple attributes) or, if there are no more attributes to be masked, to the next node_transform. The document is unchanged by this transform.
    • mask: Apply the masks, using a null value, then create the text content or attribute.
    • error (default): Raise an error and stop masking.
  • on_null_text (optional): A string to specify the action to take if the text of a node is null (missing). Please see the section below on Missing XML Nodes, to see what constitutes a missing node.
    • skip (default): Skip to the next node_transform. The document is unchanged by this transform.
    • mask: Apply the masks, using a null value, then create the text content or attribute.
    • error: Raise an error and stop masking.

Missing XML Nodes

The on_missing_attribute or on_null_text options can be used to change how missing values are treated.

  • A text node is considered null if a tag is self-closing. For example, <Transaction amount="23.94"/>. It is also considered null if the element is empty; for example, <Message to="user1" from="user2"></Message>.
  • An attribute is considered missing if it does not exist on the element. For example, the attribute currency is missing from this element: <Transaction amount="23.94"/>. An empty string attribute is not considered missing, and instead is just masked as an empty string.
  • on_missing_attribute and on_null_text do not apply to the name node type, as XML tags/elements must have a name.
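
The same distinctions can be observed with Python's standard xml.etree.ElementTree, which happens to report them in a comparable way. The snippet illustrates the terminology only, using the elements from the list above plus one variant with an empty currency attribute added for contrast.

```python
import xml.etree.ElementTree as ET

self_closing = ET.fromstring('<Transaction amount="23.94"/>')
empty = ET.fromstring('<Message to="user1" from="user2"></Message>')
blank_attr = ET.fromstring('<Transaction amount="23.94" currency=""/>')

null_text_self_closing = self_closing.text       # None: no text node
null_text_empty = empty.text                     # None: element is empty
missing_currency = self_closing.get("currency")  # None: attribute does not exist
empty_currency = blank_attr.get("currency")      # "": present but empty
```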

Retaining known attributes and removing others

There may be some instances where you want to retain known attributes, but mask all others. In this case, you can combine the do_nothing mask with extra_attribute_masks. Any attributes you want to retain will be "masked" to their original value with do_nothing; DataMasque considers these to be masked and then applies the extra_attribute_masks to any other attributes.
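
The effect can be pictured with the short Python sketch below. It is an approximation of the behaviour described above, not DataMasque's implementation: mask_extra_attributes is a hypothetical helper, the handled set plays the role of attributes covered by a do_nothing node_transform, and the sample Employee element is used purely for illustration.

```python
import xml.etree.ElementTree as ET


def mask_extra_attributes(element, handled, extra_mask):
    """Mask every attribute whose name is not in `handled` (hypothetical helper)."""
    for attr_name, value in list(element.attrib.items()):
        if attr_name in handled:
            continue  # treated as already masked, so its value is retained
        element.set(attr_name, extra_mask(value))


employee = ET.fromstring(
    '<Employee date="2022-10-08" given_name="billy_ferwagner" '
    'preferred_name="billy_ferwagner"/>'
)
mask_extra_attributes(employee, handled={"date"}, extra_mask=lambda _value: "REDACTED")
```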

Examples

This example contains one transform with three node_transforms. The transform specifies the path UserLog/Log in the xml_data column; optional parameters that are not specified take their default values.

  • The first of the node_transforms replaces the text at the path with a fixed value REDACTED, the on_null_text: mask option is specified to mask the null value as normal.
  • The second will mask the username attribute to a similar replacement by concatenating 3 from_file masks and a transform_case mask to make sure the replacements are all still lower case.
  • The third will mask the date attribute with a suitable replacement date with a from_random_date mask.

Also note the use of fallback_masks. The last row did not contain valid XML, so the fallback mask replaced it with an empty <Root /> element, which may help clean the data for further use.

<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Root />'
            transforms:
              - path: 'UserLog/Log'
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
                    on_null_text: mask
                  - type: attribute
                    attributes:
                        - username
                    masks:
                      - type: concat
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
                          - type: from_file
                            seed_file: DataMasque_email_suffixes.csv
                            seed_column: email-suff
                      - type: transform_case
                        transform: lowercase
                  - type: attribute
                    attributes:
                        - date
                    masks:
                      - type: from_random_date
                        min: '2022-01-01'
                        max: '2022-12-31'

Result (xml_data column; only the text content of each row is shown):

Before:
Hello there!
Account created, Logged out
Account created, Logged out
Account created, Logged out
NOT_VALID_XML

After:
Hello there!
REDACTED, REDACTED, REDACTED
REDACTED, REDACTED, REDACTED
REDACTED, REDACTED, REDACTED
<Root />

XML Example with force_consistency

This example will illustrate the benefit of using the force_consistency parameter on transforms. Suppose you have a table with XML data with the following structure:

<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>

When masking the date and username attributes at the path UserLog/Log, it would be best to mask them with consistent values i.e. the same masked values would appear in each of the attributes after masking. To do this, set the force_consistency parameter of the relevant transform to true.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Root />'
            transforms:
              - path: 'UserLog/Log'
                force_consistency: true
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
                    on_null_text: mask
                  - type: attribute
                    attributes:
                        - username
                    masks:
                      - type: concat
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
                          - type: from_file
                            seed_file: DataMasque_email_suffixes.csv
                            seed_column: email-suff
                      - type: transform_case
                        transform: lowercase
                  - type: attribute
                    attributes:
                        - date
                    masks:
                      - type: from_random_date
                        min: '2022-01-01'
                        max: '2022-12-31'

Result (xml_data column; only the text content of each row is shown):

Before:
Account created, Logged out
Account created, Logged out
Account created, Logged out
Account created, Logged out
Account created, Logged out

After:
REDACTED, REDACTED, REDACTED
REDACTED, REDACTED, REDACTED
REDACTED, REDACTED, REDACTED
REDACTED, REDACTED, REDACTED
REDACTED, REDACTED, REDACTED

Without force_consistency, the attributes of each Log element would be masked to different values, as in the first example.

Example with force_extra_attribute_consistency

This example will illustrate the benefit of using the force_extra_attribute_consistency parameter on transforms. Suppose you have XML data with the following structure:

<Root>
    <Info>
        <Employee date="2022-10-08" given_name="billy_ferwagner" preferred_name="billy_ferwagner"></Employee>
        <Employee date="2022-10-08" given_name="william_florista" preferred_name="william_florista"></Employee>
    </Info>
</Root>

Suppose you want the given_name and preferred_name attributes to be masked to the same value. To achieve this, define node_transforms for any attributes that should be masked individually (e.g. the date attribute), set force_extra_attribute_consistency: true, and specify extra_attribute_masks with the masks to apply to the remaining attributes. A single masked value is generated from those masks and written to every extra attribute.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: fallback
            transforms:
              - path: 'Info/Employee'
                force_extra_attribute_consistency: true
                extra_attribute_masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed
                    on_null_text: mask


Without force_extra_attribute_consistency, the given_name and preferred_name attributes would each be masked to a different value.


Definitions

You can make use of YAML anchors and aliases to assign a label (anchor) to a block of configuration, and then reference that name (via aliases) to re-use the configuration throughout your ruleset, extending or overriding specific properties/keys in the configuration as needed.

You can define an anchor with &, then refer to it using an alias, denoted with *. You can use overrides with the characters <<: to add more values, or override existing ones on an alias.

Shown below is an example that sets 4 values with an anchor, uses 3 of those values with an alias, and overrides the 2nd value. Both tasks share the values for type, key and rules through the anchor &mask_table_opts and the alias *mask_table_opts. The value for table is overridden using the << merge key.

version: '1.0'
tasks:
  - &mask_table_opts
    type: mask_table
    table: details
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
  - <<: *mask_table_opts
    table: customers
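
Once the YAML is loaded, the merge behaves much like a shallow dictionary merge in which explicitly written keys win. The Python sketch below mirrors the example above with plain dictionaries; it illustrates the YAML merge semantics and is not DataMasque code.

```python
# The mapping labelled with the anchor &mask_table_opts.
mask_table_opts = {
    "type": "mask_table",
    "table": "details",
    "key": "id",
    "rules": [
        {"column": "customer_id", "masks": [{"type": "from_random_text", "max": 10}]}
    ],
}

# Equivalent of `<<: *mask_table_opts` followed by `table: customers`:
# keys written next to the merge override the keys that were merged in.
second_task = {**mask_table_opts, "table": "customers"}

tasks = [mask_table_opts, second_task]
```

Because the merge is shallow, overriding a key such as rules replaces the whole list rather than combining it with the anchored one.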

As you may commonly want to re-use the definition of a task, rule, or mask in multiple locations (often with minor variations), DataMasque allows you to group such definitions under the task_definitions, rule_definitions, and mask_definitions configuration properties of a ruleset. The following subsections demonstrate how to use anchors and aliases with such definitions.

Task definitions (task_definitions)

The task_definitions attribute of the ruleset may contain a list of task definitions to be referenced from the ruleset's tasks attribute through YAML anchors and aliases. Each task definition must follow the same syntax as a task in the tasks attribute (see Database Task Types for the full list of available task types).

For example, the following ruleset has an anchor named &mask_table_customer_info under task_definitions that is inherited three times in the tasks list via the alias *mask_table_customer_info. Each time the definition is re-used, the table attribute is overridden:

version: '1.0'
task_definitions:
  - &mask_table_customer_info
    type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when the anchor `&mask_table_customer_info` is referenced via an alias
    table: PLACEHOLDER
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999

tasks:
  - <<: *mask_table_customer_info
    table: details
  - <<: *mask_table_customer_info
    table: customers
  - <<: *mask_table_customer_info
    table: orders

This is equivalent to specifying the three mask_table tasks separately with the same key, type, and rules but applied to different tables, as below:

version: '1.0'
tasks:
  - type: mask_table
    table: details
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999
  - type: mask_table
    table: customers
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999

Rule definitions (rule_definitions)

The rule_definitions attribute of the ruleset may contain a list of rule definitions to be referenced through YAML anchors and aliases. Each rule definition must follow the same syntax as a rule in a mask_table task's rules.

For example, the following ruleset has an anchor named &postcode_rule under rule_definitions that is inherited twice in a list of rules via the alias *postcode_rule. Each time the definition is re-used, the column attribute is overridden:

version: '1.0'
rule_definitions:
  - &postcode_rule
    column: postcode
    masks:
      - type: from_random_number
        max: 9999

tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - <<: *postcode_rule
        column: target_postcode
      - <<: *postcode_rule
        column: destination_postcode

This is equivalent to specifying the two rules separately with the same masks but applied to different columns, as below:

version: '1.0'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: target_postcode
        masks:
          - type: from_random_number
            max: 9999
      - column: destination_postcode
        masks:
          - type: from_random_number
            max: 9999

Mask definitions (mask_definitions)

The mask_definitions attribute of the ruleset may contain a list of mask definitions to be referenced from a list of masks through YAML anchors and aliases. Each mask definition must follow the same syntax as a mask in the masks attribute of a rule (see Mask Types for the full list of available mask types).

For example, the following ruleset has an anchor named &street_mask under mask_definitions that is inherited twice in a masks list via the alias *street_mask. Each time the definition is re-used, the table_filter_column attribute is overridden:

version: '1.0'
mask_definitions:
  - &street_mask
    type: from_file
    seed_file: DataMasque_address-example.csv
    seed_column: street_name
    table_filter_column: city
    seed_filter_column: city
    fallback_masks:
      - type: from_fixed
        value: 'Main Street'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: target_street
        masks:
          - <<: *street_mask
            table_filter_column: target_city
      - column: destination_street
        masks:
          - <<: *street_mask
            table_filter_column: destination_city

This is equivalent to specifying the two masks separately with identical configuration except for different table_filter_column values, as below:

version: '1.0'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: target_street
        masks:
          - type: from_file
            seed_file: DataMasque_address-example.csv
            seed_column: street_name
            table_filter_column: target_city
            seed_filter_column: city
            fallback_masks:
              - type: from_fixed
                value: 'Main Street'
      - column: destination_street
        masks:
          - type: from_file
            seed_file: DataMasque_address-example.csv
            seed_column: street_name
            table_filter_column: destination_city
            seed_filter_column: city
            fallback_masks:
              - type: from_fixed
                value: 'Main Street'
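
The override behaviour of the merge key (<<) can be modelled with plain Python dictionaries. This is only a sketch of the YAML semantics, not DataMasque code: the merge helper is a hypothetical name, and the dictionary simply mirrors the &street_mask definition above.

```python
# Plain-dict model of the YAML merge key (<<): keys written next to the
# alias win over keys inherited from the anchored definition.
street_mask = {
    "type": "from_file",
    "seed_file": "DataMasque_address-example.csv",
    "seed_column": "street_name",
    "table_filter_column": "city",
    "seed_filter_column": "city",
    "fallback_masks": [{"type": "from_fixed", "value": "Main Street"}],
}


def merge(definition, **overrides):
    """Hypothetical helper: copy `definition`, replacing top-level keys."""
    return {**definition, **overrides}


target_mask = merge(street_mask, table_filter_column="target_city")
destination_mask = merge(street_mask, table_filter_column="destination_city")

print(target_mask["table_filter_column"])       # target_city
print(destination_mask["table_filter_column"])  # destination_city
print(street_mask["table_filter_column"])       # city (definition is unchanged)
```

Only the overridden key differs between the two masks; every other key is inherited from the definition.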

Nested definitions

It is possible to have nested definitions under a definition. For example, a nested rule definition &customer_info_street_address can be anchored under a task definition &mask_table_customer_info.

version: '1.0'
task_definitions:
  - &mask_table_customer_info
    type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when anchor `&mask_table_customer_info` is referred in alias
    table: PLACEHOLDER
    key: id
    rules:
      - &customer_info_street_address
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
tasks:
  - <<: *mask_table_customer_info
    table: details
    rules:
      - <<: *customer_info_street_address
        column: pickup_street

Inherit definitions with dictionaries

As it is not possible to use anchors and aliases to override individual items in a YAML list, DataMasque allows you to alternatively specify lists of rules, joins, and masks as dictionaries of arbitrary keys to their respective values.

A dictionary is represented in a simple key: value form (the colon must be followed by a space):

# A mask rules example in dictionary with arbitrary keys
rules:
  customer_info_postcode: *postcode_rule
  customer_info_customer_id: *customer_id_rule
  customer_info_destination_street: *destination_street_rule

The above dictionary is equivalent to the following list when the rules are executed:

# A mask rules example in list
rules:
  - *postcode_rule
  - *customer_id_rule
  - *destination_street_rule

The following example demonstrates how specifying rules as a dictionary allows you to easily override individual rules by their key using anchors and aliases.

In the ruleset below, rules is specified as a dictionary in the task definition &mask_table_customer_info, and anchors and aliases are used to reference both the rules dictionary and individual rules. This allows the first task to override the entire customer_info_postcode rule, and the second task to override just the column of the customer_info_customer_id rule.

version: '1.0'

rule_definitions:
  - &postcode_rule
    column: postcode
    masks:
      - type: from_random_number
        max: 9999
  - &customer_id_rule
    column: customer_id
    masks:
      - type: from_random_text
        max: 10

task_definitions:
  - &mask_table_customer_info
    type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when anchor `&mask_table_customer_info` is referred in alias
    table: PLACEHOLDER
    key: id
    rules: &customer_info_rules
      customer_info_postcode: *postcode_rule
      customer_info_customer_id: *customer_id_rule
      customer_info_destination_street:
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
          table_filter_column: destination_city
          seed_filter_column: city
          fallback_masks:
            - type: from_fixed
              value: 'Main Street'
tasks:
  - <<: *mask_table_customer_info
    table: details
    rules:
      <<: *customer_info_rules
      customer_info_postcode:
        column: post
        masks:
          - type: from_fixed
            value: 1234
  - <<: *mask_table_customer_info
    table: orders
    rules:
      <<: *customer_info_rules
      customer_info_customer_id:
        <<: *customer_id_rule
        column: ext_customer_id

These changes can be seen in the equivalent expanded ruleset below:


version: '1.0'
tasks:
  - type: mask_table
    table: details
    key: id
    rules:
      customer_info_postcode:
        column: post
        masks:
          - type: from_fixed
            value: 1234
      customer_info_customer_id:
        column: customer_id
        masks:
          - type: from_random_text
            max: 10
      customer_info_destination_street:
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
          table_filter_column: destination_city
          seed_filter_column: city
          fallback_masks:
            - type: from_fixed
              value: 'Main Street'
  - type: mask_table
    table: orders
    key: id
    rules:
      customer_info_postcode:
        column: postcode
        masks:
          - type: from_random_number
            max: 9999
      customer_info_customer_id:
        column: ext_customer_id
        masks:
          - type: from_random_text
            max: 10
      customer_info_destination_street:
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
          table_filter_column: destination_city
          seed_filter_column: city
          fallback_masks:
            - type: from_fixed
              value: 'Main Street'

Note: Dictionaries can be used instead of lists for the rules, joins, and masks attributes of a ruleset.

Note: The order of items in these dictionaries is respected, so the operations will be performed in the order you specify them.
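
As an illustration of overriding a single rule by its key, the sketch below models the rules dictionary of the second task (table orders) with Python dictionaries. It is a model of the merge semantics only; the street rule is abbreviated, and the ordering shown is that of Python dictionaries, which is assumed here to mirror the behaviour described in the note above.

```python
postcode_rule = {"column": "postcode",
                 "masks": [{"type": "from_random_number", "max": 9999}]}
customer_id_rule = {"column": "customer_id",
                    "masks": [{"type": "from_random_text", "max": 10}]}
street_rule = {"column": "destination_street",
               "masks": [{"type": "from_file"}]}  # abbreviated for the sketch

# Counterpart of `rules: &customer_info_rules` in the task definition.
customer_info_rules = {
    "customer_info_postcode": postcode_rule,
    "customer_info_customer_id": customer_id_rule,
    "customer_info_destination_street": street_rule,
}

# Second task: inherit every rule, then override one field of one rule.
orders_rules = {
    **customer_info_rules,
    "customer_info_customer_id": {**customer_id_rule, "column": "ext_customer_id"},
}

print([rule["column"] for rule in orders_rules.values()])
# ['postcode', 'ext_customer_id', 'destination_street']
print(customer_id_rule["column"])  # customer_id (the definition is unchanged)
```

With a list, the only way to change the second rule would be to restate the whole list; with keys, the override targets one entry and leaves the others inherited.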

Advanced ruleset features

Default values to skip

You can specify a set of skip_defaults values at the ruleset level:

version: '1.0'
skip_defaults:
  - null
  - ''
  - matches: 'administrator_.*'
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: role
        masks:
        - type: from_fixed
          value: 'customer'

These will be used as the skip configuration for any masking rule that does not have skip explicitly set. You can still force a masking rule to not skip any values by setting skip to an empty list: skip: []
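
The precedence between a rule's own skip and the ruleset-level skip_defaults can be sketched as follows. The function name effective_skip and the dictionary representation of a rule are assumptions made for illustration; they are not part of DataMasque.

```python
def effective_skip(rule, skip_defaults):
    """Return the skip configuration a rule would use (illustrative model)."""
    # An explicit `skip` key always wins, even when it is an empty list.
    if "skip" in rule:
        return rule["skip"]
    return skip_defaults


skip_defaults = [None, "", {"matches": "administrator_.*"}]

inherits = {"column": "role"}
opts_out = {"column": "role", "skip": []}

print(effective_skip(inherits, skip_defaults) == skip_defaults)  # True
print(effective_skip(opts_out, skip_defaults))                   # []
```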


Freezing random values

A run_secret string can be provided as an option when starting a run. If you provide a run_secret, then all randomly chosen and generated values will be exactly the same for multiple runs with the same secret. This can be useful for testing or retaining consistency across multiple masking runs. The run_secret can consist of any sequence of at least 20 characters; however, it is recommended to use a randomly generated string.

Notes:

  • Output values will be frozen only across identical repeated masking runs. Variations in database content (such as inserted or deleted rows) or Run Options (e.g. Batch Size, Max rows) can still result in different outputs even when using the same run_secret. Additionally, using a non-unique key may result in non-deterministic results. To retain consistent masking results across variable data sources, use the Deterministic masking features.
  • As the parallelised unloading of Amazon Redshift data into S3 files cannot guarantee consistent ordering between runs, a fixed run_secret may still result in different outputs between masking runs on Amazon Redshift databases. To retain consistent masking results with Amazon Redshift, use the Deterministic masking features.
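
The first note can be illustrated with a seeded random generator. This is a conceptual sketch only, assuming a generator that is seeded from the secret and consumed row by row; it does not describe how DataMasque actually generates values, and mask_rows and RUN_SECRET are hypothetical names.

```python
import random

RUN_SECRET = "example-run-secret-0123456789"  # hypothetical, at least 20 characters


def mask_rows(rows, run_secret):
    """Assign one random value per row, drawing from a generator seeded by the secret."""
    rng = random.Random(run_secret)
    return {row: rng.randint(0, 9999) for row in rows}


first = mask_rows(["a", "b", "c"], RUN_SECRET)
repeat = mask_rows(["a", "b", "c"], RUN_SECRET)
shifted = mask_rows(["x", "a", "b", "c"], RUN_SECRET)  # one row inserted first

print(first == repeat)             # True: identical runs give identical values
print(shifted["x"] == first["a"])  # True: the new row takes the first draw,
                                   # so every later row receives a shifted value
```

Because the values depend on the order in which rows are processed, a secret alone cannot keep outputs stable when the data changes.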

Deterministic masking

If you wish to make the masked values deterministic based on one or more input values, you can provide hash sources in a ruleset. DataMasque will use the hashed value(s) from the specified sources as one of the random generator inputs (along with the instance seed and the run secret). This allows for consistent output values to be generated even across different databases, inside documents (XML/JSON), or when masking files.

For database or tabular file masking, this is done with a hash_columns argument for each column being masked. For file masking, the parameter is called hash_sources and the hash is used for the entire file.

Changes to hashing values in 2.8.0: Prior to DataMasque 2.8.0, values to hash were converted to strings then hashed. This may have resulted in duplicate output values when hashing on null. From version 2.8.0, hashing occurs on binary representations of data. This means that hashed outputs from DataMasque 2.8.0 onwards will differ from those generated by prior versions.
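
The general idea of deriving a masked value from hashed inputs can be sketched in a few lines. Everything here is an assumption made for illustration: the use of SHA-256, the way values are encoded, and the function name deterministic_date are not taken from DataMasque, and the instance seed is left out for brevity.

```python
import hashlib
import random
from datetime import date, timedelta


def deterministic_date(hash_values, run_secret, start, end):
    """Pick a date in [start, end] that depends only on the inputs and the secret."""
    digest = hashlib.sha256(run_secret.encode("utf-8"))
    for value in hash_values:
        if value is None:
            digest.update(b"\x00")  # distinct tag so null never collides with text
        else:
            digest.update(b"\x01" + str(value).encode("utf-8") + b"\x00")
    rng = random.Random(digest.digest())
    return start + timedelta(days=rng.randrange((end - start).days + 1))


secret = "example-run-secret-0123456789"  # hypothetical run secret
low, high = date(1980, 1, 1), date(2000, 1, 1)

a = deterministic_date(["2000-01-01", "Carl"], secret, low, high)
b = deterministic_date(["2000-01-01", "Carl"], secret, low, high)
print(a == b)            # True: same inputs and secret give the same output
print(low <= a <= high)  # True
```

Unlike a generator consumed row by row, the output here does not depend on row order, which is why hashed inputs remain consistent across different data sources.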

Deterministic masking with databases or tabular files

Each hash_columns item can either be just the name of the column, or can specify a json_path or xpath to fetch a hash value from inside JSON or XML, respectively. For more information on json_path or xpath please refer to the JSON documentation/XML documentation. For JSON/XML hash_columns the following logic is applied:

  • If the value extracted from the json_path/xpath is null the hash will be performed on the null value.
  • If the value does not exist at the json_path/xpath the hash will be performed on null.
  • If the JSON/XML data cannot be decoded (is invalid) the hash will be performed on the entire column, even if null.

Notes:

  • Since null is the default fallback for invalid data, this may result in unexpected repeated masked values. Warnings will be added to the run logs when this fallback occurs.
  • The json_path and xpath optional parameters cannot both be set for the same hash column.
  • It is possible to hash a column on itself (i.e. hash_columns contains the current column). If you hash a column on itself, and that column is also used as the hash for other columns, the mask on the hash column should come last in the rules. If not, that column will use the pre-masked value as the hash while other columns use the post-masked value, meaning different values will be generated in the same row for the same mask definition.
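
The three fallback rules listed above for JSON hash columns can be expressed as a small function. This is a sketch of the documented behaviour only; json_hash_input is a hypothetical name, the path is assumed to be a list of keys and indexes, and None stands in for null.

```python
import json


def json_hash_input(column_value, json_path):
    """Return the value that would be hashed for a JSON hash column (illustrative)."""
    try:
        node = json.loads(column_value)
    except (TypeError, ValueError):
        # Invalid JSON (or a null column): hash the entire column value.
        return column_value
    for key in json_path:
        try:
            node = node[key]
        except (KeyError, IndexError, TypeError):
            # Nothing exists at the path: hash null.
            return None
    # A value found at the path is hashed as-is, including null.
    return node


print(json_hash_input('{"a": {"b": 5}}', ["a", "b"]))     # 5
print(json_hash_input('{"a": {"b": null}}', ["a", "b"]))  # None
print(json_hash_input('{"a": {}}', ["a", "b"]))           # None
print(json_hash_input("not json", ["a", "b"]))            # not json
```

The second and third calls return the same hash input, which is how repeated masked values can arise unexpectedly.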

Example

This example will mask the date_of_birth column with a date value that has been deterministically generated based on the hash of date_of_birth and first_name column values combined with a one-off randomly generated secret for this run.

For example, in every row where date_of_birth = '2000-01-01' and first_name = 'Carl', the date_of_birth will be replaced with a deterministically generated value (e.g. 1999-03-03). This same replacement value will be generated for repeated rows that match the same date_of_birth and first_name values within this run. However, this value will be different for each run. To retain consistency of deterministically generated values between multiple runs, see Consistent masking.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        hash_columns:
          - date_of_birth
          - first_name
        masks:
          - type: from_random_date
            min: '1980-01-01'
            max: '2000-01-01'

An equivalent alternative version of YAML syntax is shown below:

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        hash_columns:
          - column_name: date_of_birth
          - column_name: first_name
        masks:
          - type: from_random_date
            min: '1980-01-01'
            max: '2000-01-01'

Before:

date_of_birth  first_name
2000-01-01     Carl
1965-11-10     Ria
2000-01-01     Carl
2000-01-01     Jose
1990-05-31     Thomas
1999-07-31     Nicole

After:

date_of_birth  first_name
1999-03-03     Carl
1986-05-16     Ria
1999-03-03     Carl
1991-03-24     Jose
1971-05-20     Thomas
1962-12-11     Nicole

Example with xpath

This example will mask the date_of_birth column with a date value that has been deterministically generated based on the hash of the date_of_birth attribute from the following XML document column.

<Root>
    <Info>
        <Employee date_of_birth="2022-10-08">Steve</Employee>
        <Employee date_of_birth="2022-10-08">Fred</Employee>
    </Info>
</Root>

To get the date_of_birth attribute from the XML document, we need to specify its path with an xpath. The xpath Info/Employee[1]/@date_of_birth gets the date_of_birth attribute from the first Employee element of the XML document. To perform the hash on the list of date_of_birth attribute values from all Employee elements, specify Info/Employee/@date_of_birth instead.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: date_of_birth
        hash_columns:
          - column_name: xml_data
            xpath: "Info/Employee/@date_of_birth"
        masks:
          - type: from_random_date
            min: '1980-01-01'
            max: '2000-01-01'
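
To see which values the two xpaths select, the lookup can be reproduced with Python's standard library. Note that xml.etree.ElementTree does not support the trailing /@attribute step, so this sketch selects the elements and then reads the attribute with get(); it only illustrates the selection and says nothing about how DataMasque evaluates xpaths.

```python
import xml.etree.ElementTree as ET

xml_data = """<Root>
    <Info>
        <Employee date_of_birth="2022-10-08">Steve</Employee>
        <Employee date_of_birth="2022-10-08">Fred</Employee>
    </Info>
</Root>"""

root = ET.fromstring(xml_data)

# Counterpart of Info/Employee[1]/@date_of_birth (first Employee only).
first_value = root.find("Info/Employee[1]").get("date_of_birth")

# Counterpart of Info/Employee/@date_of_birth (every Employee).
all_values = [e.get("date_of_birth") for e in root.findall("Info/Employee")]

print(first_value)  # 2022-10-08
print(all_values)   # ['2022-10-08', '2022-10-08']
```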

As shown in the results below, rows whose Employee elements have the same date_of_birth attribute values produce the same masked value in the date_of_birth column.

Before:

xml_data   date_of_birth
SteveFred  1989-01-09
SteveFred  1990-07-14
SteveFred  1991-11-08
SteveFred  1997-03-14
SteveFred  1984-02-02

After:

xml_data   date_of_birth
SteveFred  1992-08-07
SteveFred  1993-11-14
SteveFred  1992-08-07
SteveFred  1998-05-12
SteveFred  1992-08-07

Deterministic masking with files

File masking tasks (mask_file) support deterministic generation of masked values which are based on input values. You can provide a hash_sources list in a rule, similar to the hash_columns for database masking. However, with hash_sources you must specify additional parameters to fetch the hash.

Parameters

  • json_path (optional): A path to the value in the JSON data to use for the hash.
  • xpath (optional): An Xpath to the value stored in the XML data to use for the hash.
  • file_path (optional): A boolean value. When set to true, the hash will be performed on the file's path (relative to the base directory). Defaults to false.
  • case_transform (optional): upper/lower - Apply a case transform on the input value, for consistent hash values irrespective of case.

Notes:

  • Each of json_path, xpath and file_path is shown as optional, but exactly one of them must be specified. For more information on json_path or xpath, please refer to the JSON documentation/XML documentation.
  • hash_sources for multirecord files works the same as for object files, except that the hash_sources are applied to each record in the file instead of the entire file.
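
The "exactly one of json_path, xpath, file_path" constraint and the effect of case_transform can be sketched as below. Both function names are hypothetical, and the code models the rules described in this section rather than DataMasque's own validation.

```python
def selected_source(hash_source):
    """Return the single selector set on a hash source, or raise if it is not exactly one."""
    selectors = [key for key in ("json_path", "xpath", "file_path")
                 if hash_source.get(key)]
    if len(selectors) != 1:
        raise ValueError("exactly one of json_path, xpath, file_path is required")
    return selectors[0]


def apply_case_transform(value, case_transform=None):
    """Normalise case before hashing so 'Evelyn' and 'EVELYN' hash alike."""
    if case_transform == "upper":
        return value.upper()
    if case_transform == "lower":
        return value.lower()
    return value


print(selected_source({"json_path": ["users", "name"]}))       # json_path
print(selected_source({"file_path": True}))                    # file_path
print(apply_case_transform("Evelyn", "upper")
      == apply_case_transform("EVELYN", "upper"))              # True
```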

Example with json_path

This example will mask a JSON file which contains the following JSON data:

{
  "users": {
    "name": "Richard",
    "addresses": [
      {
        "type": "postal",
        "city": "Nelson"
      },
      {
        "type": "physical",
        "city": "Napier"
      }
    ]
  }
}

We want to mask the city of each of the user's addresses, hashed with the name of the user. To do this, we specify a json_path to the value we want to use for the hash (['users', 'name']) in hash_sources, as shown in the ruleset below. The replacement city values are taken from one of the seed files DataMasque provides.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - json_path: ['users', 'name']
        masks:
        - type: json
          transforms:
          - path: ['users','addresses', "*",'city']
            masks:
            - type: from_file
              seed_file: DataMasque_mixed_countries_addresses.csv
              seed_column: city

As shown in the results below, files in which the user has the same name produce the same masked values for the city.

Note Each row of the results table below represents a separate file.

Before (JSON data):

{"users": {"name": "Evelyn","addresses": [{"type": "postal","city": "Boronia"},{"type": "physical","city": "Cramphorne"}]}}
{"users": {"name": "Sarah","addresses": [{"type": "postal","city": "Muckleford"},{"type": "physical","city": "Bandy Creek"}]}}
{"users": {"name": "Bob","addresses": [{"type": "postal","city": "Doncaster"},{"type": "physical","city": "Shenton"}]}}
{"users": {"name": "Sarah","addresses": [{"type": "postal","city": "Woronora"},{"type": "physical","city": "Lee Point"}]}}
{"users": {"name": "Evelyn","addresses": [{"type": "postal","city": "Nelson"},{"type": "physical","city": "Napier"}]}}

After (JSON data):

{"users":{"name":"Evelyn","addresses":[{"type":"postal","city":"Stamford"},{"type":"physical","city":"Stamford"}]}}
{"users":{"name":"Sarah","addresses":[{"type":"postal","city":"Canning Vale"},{"type":"physical","city":"Canning Vale"}]}}
{"users":{"name":"Bob","addresses":[{"type":"postal","city":"Bolivia"},{"type":"physical","city":"Bolivia"}]}}
{"users":{"name":"Sarah","addresses":[{"type":"postal","city":"Canning Vale"},{"type":"physical","city":"Canning Vale"}]}}
{"users":{"name":"Evelyn","addresses":[{"type":"postal","city":"Stamford"},{"type":"physical","city":"Stamford"}]}}

Example with xpath

This example will mask two different XML files but generate consistent masked data by using user id as a hash. The first contains a user's info.

<User id="8472">
    <Name>Evelyn</Name>
    <Address>
        <City>Nelson</City>
    </Address>
</User>

The second contains a list of the user's log messages:

<Logs userId="8472">
    <Message to="Evelyn" date="2022-08-09" type="welcome"/>
    <Message to="Evelyn" date="2022-08-10" type="forgot_password"/>
</Logs>

In both cases, we want the user's name to be consistent, and this can be done by specifying the user's ID as the hash source for each file. We'll create a ruleset with two mask_file tasks, and use include to map the right task to the correct file.

The first task will mask the User/Name text value, and the second will mask the to attribute of each Logs/Message element. The first task will use the id attribute on User as the hash source, while the second will use the userId attribute on Logs.

Here's the ruleset to accomplish this:

version: "1.0"
tasks:
  - type: mask_file
    include:
    - glob: user_info_*.xml
    rules:
      - hash_sources:
        - xpath: 'User/@id'
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<User />'
            transforms:
              - path: 'User/Name'
                node_transforms:
                - type: text
                  masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
  - type: mask_file
    include:
    - glob: user_logs_*.xml
    rules:
      - hash_sources:
        - xpath: 'Logs/@userId'
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Logs />'
            transforms:
              - path: 'Logs/Message'
                node_transforms:
                - type: attribute
                  attributes: 'to'
                  masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv

The two output files would use consistent names. The first file:

<User id="8472">
    <Name>Sarah</Name>
    <Address>
        <City>Nelson</City>
    </Address>
</User>

And the second:

<Logs userId="8472">
    <Message to="Sarah" date="2022-08-09" type="welcome"/>
    <Message to="Sarah" date="2022-08-10" type="forgot_password"/>
</Logs>

Example with file_path

The following example ruleset masks files using the file path as the hash source. This applies to any mask_file task.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - file_path: true
        masks:
        - type: json
          transforms:
          - path: ['users','addresses', "*",'city']
            masks:
            - type: from_file
              seed_file: DataMasque_mixed_countries_addresses.csv
              seed_column: city

With this ruleset, the hashing will be performed on the file path, so files with different paths will have differing masked values. This is useful if you want repeatable masking based on the name of a file.


Common regular expression patterns

A regular expression (or "regex") is a sequence of characters that acts as a search pattern to filter or select substrings of text strings. They can be used in replace_regex masks and matches conditions of if and skip blocks. Note that it is best practice to wrap the regular expression in quotes to avoid special characters being misinterpreted as YAML syntax:

version: '1.0'
tasks:
  - type: mask_table
    table: driversLicence
    key: ID
    rules:
      - column: driversLicence
        masks:
          - type: replace_regex
            regex: "[0-9]"
            masks:
              - type: from_fixed
                value: "#"

Note: DataMasque uses the Python regular expression syntax, which you can learn more about in the official tutorial.
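
Since the syntax is Python's, the effect of the pattern in the ruleset above can be previewed with Python's re module. This sketch only reproduces the substitution (each match of "[0-9]" replaced with the fixed value "#"); the sample value is made up, and the code is not how DataMasque applies the mask internally.

```python
import re


def mask_digits(value):
    """Replace every match of [0-9] with '#', as the replace_regex rule describes."""
    return re.sub(r"[0-9]", "#", value)


print(mask_digits("DL-4821-XK9"))  # DL-####-XK#
```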

Regular expressions allow you to match various characters by specifying a character or character set. Characters can be specified as character classes, such as those listed below.

  • "[0-9]" can be used to match any numerical character.
  • "[a-z]" and "[A-Z]" will match lower case and upper case alphabetical characters respectively.
  • Classes can also be combined; "[a-zA-Z]" will match any upper or lower case character.
  • You may also specify a list of characters to match within square brackets. If you wish to only match vowels, you can use "[aeiouAEIOU]".
  • If a caret character is added as the first character inside the square brackets, the class will instead match any character other than those specified. "[^0-9]" will match any character other than a numerical character.
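
The character classes in this list can be tried out directly in Python. The sample text is made up for illustration. The last two lines show that a caret inside the brackets negates the class, while a caret before the brackets anchors the match to the start of the string.

```python
import re

text = "Order 66, Aisle b7"

print(re.findall(r"[0-9]", text))        # ['6', '6', '7']
print(re.findall(r"[A-Z]", text))        # ['O', 'A']
print(re.findall(r"[aeiouAEIOU]", text)) # ['O', 'e', 'A', 'i', 'e']

# Caret inside the brackets: match anything that is NOT a digit.
print(re.sub(r"[^0-9]", "", text))       # 667

# Caret outside the brackets: "a digit at the start of the string".
print(re.match(r"^[0-9]", text))         # None
```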

The table below details other ways to denote certain character types without the use of character classes.

Character Description
\d One digit character, ranging from 0 to 9.
\w One 'word' character; an ascii letter, a digit or an underscore.
\s One whitespace character. This can be a space, a line break, or a tab.
\D One character that is NOT a digit.
\W One character that is NOT a word character.
\S One character that is NOT a whitespace character.
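
A short Python check of the shorthand classes in the table, using a made-up sample string that contains a space and a tab:

```python
import re

sample = "id_42 \tok"

print(re.findall(r"\d", sample))   # ['4', '2']
print(re.findall(r"\s", sample))   # [' ', '\t']
print(re.findall(r"\w+", sample))  # ['id_42', 'ok']
print(re.findall(r"\W", sample))   # [' ', '\t']
print(re.sub(r"\D", "", sample))   # 42
```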

Note that when used in double-quoted strings in YAML, literal backslashes must be escaped by backslashes, for example, "\\d" will be interpreted as \d. We recommend quoting YAML strings with single quotes to avoid the need for this escaping, as '\d' will be interpreted as \d.

You can also choose to add a quantifier to any character or character class, which will specify how many characters will be matched.

Quantifier Description
+ One or more characters will be matched.
{n} Exactly n characters will be matched. For example, \d{3} will match strings of exactly 3 numerical digits.
{a,b} Any string of characters between a and b in length will be matched. For example, \w{1,3} will match a string of word characters between 1 and 3 in length.
{c,} Any string of characters with c or more characters. For example, \d{2,} will match a string of numerals of length 2 or more.
* Any instance of the characters appearing zero or more times.
? Matches if the character or character class appears once, or no times.
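
Each quantifier in the table can be checked with re.fullmatch, which succeeds only when the whole string fits the pattern. The helper name matches is an illustrative choice.

```python
import re


def matches(pattern, value):
    """True when the entire value fits the pattern."""
    return re.fullmatch(pattern, value) is not None


print(matches(r"\d{3}", "123"), matches(r"\d{3}", "1234"))     # True False
print(matches(r"\w{1,3}", "ab"), matches(r"\w{1,3}", "abcd"))  # True False
print(matches(r"\d{2,}", "2024"), matches(r"\d{2,}", "7"))     # True False
print(matches(r"\d+", ""), matches(r"\d*", ""))                # False True
print(matches(r"-?\d", "-5"), matches(r"-?\d", "5"))           # True True
```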

Below are some common regular expression patterns that can be used to match typical database strings.

Example Patterns

Phone Number

The pattern below will match a phone number consisting of an international code in parentheses followed by a 7-digit phone number that may be hyphenated after the 3rd digit.

'\(\+\d{1,4}\)\s?\d{3}-?\d{4}'

Some examples of matching strings are shown below:

  • (+64)123-4567
  • (+1234) 5678910
  • (+61) 987-5432
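
The phone number pattern can be checked against these examples in Python. The two non-matching strings are made up to show what the pattern rejects.

```python
import re

PHONE = re.compile(r"\(\+\d{1,4}\)\s?\d{3}-?\d{4}")

for value in ["(+64)123-4567", "(+1234) 5678910", "(+61) 987-5432"]:
    print(value, PHONE.fullmatch(value) is not None)  # True for all three

print(PHONE.fullmatch("+64 123-4567"))   # None: the code is not in parentheses
print(PHONE.fullmatch("(+64)12-34567"))  # None: hyphen after the 2nd digit
```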

Social Security Number

The pattern below will match a Social Security Number consisting of hyphen-separated digits:

'\d{3}-\d{2}-\d{4}'

Some examples of matching strings are shown below:

  • 111-22-3333
  • 987-65-4321
  • 112-35-8132
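
The sketch below applies the Social Security Number pattern inside a longer string, and shows one caveat: without boundaries the pattern also matches inside a longer run of digits. The \b variant is a suggestion for illustration, not something prescribed by DataMasque, and the sample strings are made up.

```python
import re

SSN = r"\d{3}-\d{2}-\d{4}"
SSN_BOUNDED = r"\b\d{3}-\d{2}-\d{4}\b"  # assumed stricter variant

print(re.sub(SSN, "###-##-####", "SSN 111-22-3333 on file"))
# SSN ###-##-#### on file

print(re.search(SSN, "9111-22-33334").group())   # 111-22-3333
print(re.search(SSN_BOUNDED, "9111-22-33334"))   # None
```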

Time

The pattern below will match a time consisting of 1 or 2 hour digits and 2 minute digits separated by a colon:

'[0-2]?\d:\d{2}'

Some examples of matching strings are shown below:

  • 7:30
  • 23:50
  • 05:45
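
Because the time pattern only checks digit counts, it also accepts values that are not valid times. The sketch below demonstrates this and compares it with a stricter 24-hour pattern; the stricter pattern is an assumption added for illustration and may not suit every data format.

```python
import re

TIME = r"[0-2]?\d:\d{2}"
TIME_STRICT = r"(?:[01]?\d|2[0-3]):[0-5]\d"  # assumed 24-hour variant

for value in ["7:30", "23:50", "05:45", "29:99"]:
    simple = re.fullmatch(TIME, value) is not None
    strict = re.fullmatch(TIME_STRICT, value) is not None
    print(value, simple, strict)
# 7:30 True True
# 23:50 True True
# 05:45 True True
# 29:99 True False
```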

Email Address

The pattern below will match an email address consisting of an alphanumeric username and an alphanumeric domain name separated by an @ symbol. For a more comprehensive email regular expression, see: emailregex.com.

'[a-zA-Z0-9]+@[a-zA-Z0-9\.]+'

Some examples of matching strings are shown below:

  • janedoe@gmail.com
  • bobsmith@hotmail.com
  • thomas@yahoo.com
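
The limits of the simple email pattern are easy to see in Python. The addresses other than the examples listed here are made up: a username containing a dot is only partially matched, and a domain made entirely of dots is accepted.

```python
import re

EMAIL = r"[a-zA-Z0-9]+@[a-zA-Z0-9\.]+"

print(re.fullmatch(EMAIL, "janedoe@gmail.com") is not None)  # True
print(re.fullmatch(EMAIL, "jane.doe@gmail.com"))             # None
print(re.search(EMAIL, "jane.doe@gmail.com").group())        # doe@gmail.com
print(re.fullmatch(EMAIL, "bob@...") is not None)            # True
```

In a replace_regex mask, the third case would leave the "jane." prefix unmasked, which is worth checking against real data.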

DISCLAIMER: The above examples are simplistic and general examples to demonstrate how regular expressions can be constructed - you should ensure these regular expressions are appropriate for your particular data before using them in your masking rulesets.