DataMasque Portal

Ruleset YAML specification

Ruleset YAML specification

The ruleset YAML configuration provides instructions that DataMasque will follow when performing a masking run against a target database. Rulesets are comprised of one or more tasks, which can contain many different types of instructions. The most common use case is the application of tasks for masking sensitive data in database tables with masking rules.

Ruleset Properties

The following properties are specified at the top-level of a ruleset YAML file:

  • version (required): The schema version used by this ruleset. The default value present in the editor when creating a new ruleset is generally the value that you should be using. See Schema Versioning for more information.
  • name (deprecated; will be removed in release 3.0.0): A unique name that will be used to refer to your ruleset. This field may only contain alphanumeric characters and underscores, and its length is limited to 64 characters.
  • tasks (required): A list of tasks to be performed in order on the target database. See Database Task Types for the comprehensive list of the available task types and their associated parameters.
  • task_definitions (optional): A list of task definitions to be referenced from the ruleset's list of tasks through YAML anchors and aliases.
  • rule_definitions (optional): A list of rule definitions to be referenced from a list of rules through YAML anchors and aliases.
  • mask_definitions (optional): A list of mask definitions to be referenced from a list of masks through YAML anchors and aliases.
  • skip_defaults (optional): See Default values to skip.
  • disable_warning_on_duplicate_rules (optional) If multiple tasks in the same ruleset attempt to mask the same table or column DataMasque will show a warning. Set this to true to suppress the warning. Defaults to false.
  • disable_warning_on_glacier_storage (optional): Only valid for file masking rulesets If a file is archived with Amazon S3 Glacier storage, DataMasque will show a warning that it is skipping the file. Set this to true to suppress the warning and skip the file silently. Defaults to false.
  • random_seed (optional): Deprecated in favour of the Run secret option. See Freezing random values.

Example

The following ruleset provides an example for replacing the last_name column of every row in the users table with the fixed value "redacted last name":

# My ruleset.yml
version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'

Note: The tasks must be indented using two spaces. For example:

tasks:
  - type: mask_table
^^

- type: mask_table is indented two spaces from the parent tasks:

Important note on case-sensitivity: For all tasks except run_sql, database identifiers, such as table and column names, should be referenced as you would otherwise reference them in an SQL query. When masking a case-sensitive database, identifiers must be referenced in the ruleset using the correct case.

To refer to a case-sensitive table or column in a database, the identifier must be enclosed in double quotation marks. However, when writing YAML, quotations are used to denote a string value, so any enclosing quotations are not considered as part of the value. As such, it is necessary to enclose the entire name - including double quotation marks - in an outer set of single quotation marks. For example:

# Case-sensitive table name; enclosed in both single and double quotations.
table: '"CaseSensitiveTableName"'

To refer to a case-sensitive table in a schema, the schema name must also be enclosed in quotation marks if the schema name is case-sensitive. The entire combination of schema and table name must be enclosed in single quotation marks. For example:

# Case-sensitive schema and table name; enclosed in both single and double quotations.
table: '"CaseSensitiveSchemaName"."CaseSensitiveTableName"'

If referencing a combination of table and column, you will need have quotation marks around both the table and column names within the surrounding quotation marks. For example:

# Case-sensitive table and column name; enclosed in both single and double quotations.
column: '"CaseSensitiveTable"."CaseSensitiveColumn"'

Identifier names containing double quotation marks, backslashes, periods, and whitespace should always be enclosed in double quotation marks. Also, literal double quotation marks and backslashes must be preceded by a backslash:

# Case-sensitive table and column name containing special characters.
column: '"Case\"Sensitive\\Table"."Case.Sensitive Table"'

Notes:

  • Backslashes and single quotation marks are not supported in identifier names for Microsoft SQL Server (Linked Server) databases.
  • For Redshift databases, DataMasque follows PostgreSQL's rules for case-sensitivity of quoted/unquoted identifiers if enable_case_sensitive_identifier is true, and treats all identifiers as case-insensitive if enable_case_sensitive_identifier is false. See: https://docs.aws.amazon.com/redshift/latest/dg/renablecasesensitiveidentifier.html

Database task types

A ruleset consists of a list of tasks to be performed in sequence on the target database. Tasks are performed serially from top to bottom, but special serial and parallel tasks can be used to nest other tasks within them for performance (parallelism) or dependency management.

Sensitive data discovery

The run_data_discovery task type inspects the metadata of your database, searching for columns which are likely to contain sensitive data. On completion, a report is generated containing a summary of all identified columns, and their current masking coverage. It is recommended to include a single run_data_discovery task in your rulesets to help ensure complete masking coverage and provide ongoing protection as new sensitive data is added to your database.

See the Sensitive Data Discovery guide for more information on this feature.

Note: The run_data_discovery task type is not currently supported for Amazon DynamoDB.

Parameters

This task type does not have any parameters.

Example

The following shows an example ruleset that will execute only the run_data_discovery task and no masking tasks. This example usage may be useful when starting a new ruleset from scratch to determine a starting point for developing your masking rules.

The run_data_discovery may also be included in a ruleset alongside other masking tasks to provide continuous feedback on the masking coverage provided by the ruleset.

version: '1.0'
tasks:
  - type: run_data_discovery

Schema Discovery

The run_schema_discovery task type inspects the metadata of your database, searching for schemas, tables, and columns and can flag certain columns which are likely to contain sensitive data. On completion, a report is generated containing a summary of all identified schemas, tables, columns, and relevant metadata of the data within the columns.

See the Schema Discovery guide for more information on this feature.

Note: The run_schema_discovery task type is not currently supported for Amazon DynamoDB.

Parameters

This task type does not have any parameters.

Example

The following shows an example ruleset that will execute only the run_schema_discovery task and no masking tasks. This example usage may be useful when starting a new ruleset from scratch to determine a starting point for developing your masking rules.

version: '1.0'
tasks:
  - type: run_schema_discovery

Table masks

Each mask_table task specifies the masking rules to apply to a database table, as well as any required joins and any conditionals needed to define which rows should be masked. Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value from each mask is passed as the input to the next mask in the sequence.

More detail about mask functions and rules is provided under Masking Overview section.

Note: While mask_table is suitable for the majority of generic masking requirements, many masking functions are not capable of masking unique keys or primary keys. To mask these columns, use the from_unique_imitate mask function, or the special-purpose mask_unique_key task.

Parameters

Each task with type mask_table is defined by the following parameters:

  • table (required): The name of the table in the database. The table name can be prefixed with a schema name to reference a table in another schema. If the table or schema name are case-sensitive, you must enclose the name in double and single quotation marks in order to specify the casing of the name. For example, table: '"CaseSensitiveSchema"."CaseSensitiveTable"'

    • For rulesets run against Amazon DynamoDB connections, the table name can be prefixed with the AWS region that contains the table e.g. <region>/<table-name> (without angle brackets). Note that the region prefix is required if you have not created a .aws/credentials file or environment variable with the default region or are running DataMasque on EC2. Amazon DynamoDB connections currently only support masking a single table in each ruleset.
  • key (required): One or more columns that identify each table row. Composite keys may be specified for the key parameter. For more details on using composite keys, see Composite keys

    • For Oracle databases it should always be ROWID (key: ROWID). For more details, refer to Query optimisation.
    • For Microsoft SQL Server and PostgreSQL databases it is recommended to use the primary key, or any other unique key that is not modified during masking, for better performance.
    • The key columns must not contain any NULL value.
    • If the key column names are case-sensitive, you may enclose each key value in double and single quotation marks in order to specify the casing of the key. For example, key: '"Customer_ID"'
    • If the columns specified for the key parameter cannot be used to uniquely identify rows, then the masked values will be the same for rows that have the same key value. Refer to the key and hash columns example in the Notes section for how to avoid producing duplicate masked values.
    • Columns specified in the key cannot be masked in the task's rules. Amazon DynamoDB is an exception to this rule, allowing key columns to be masked with e.g. the from_unique mask type.
  • on_missing_column (optional): The action to take if a column in the ruleset is not found on the table. This option is only available for Amazon DynamoDB. If specified, must be one of error or skip. Defaults to error, which means a masking run will fail if a column in the ruleset does not exist. If set to skip then missing columns do not cause an error and other columns are masked as normal.

  • use_calculated_bounds (optional): When set to true (default) then the batch bounding keys are calculated using values in the database. If set to false, then DataMasque generates these keys without querying the database, which may improve performance. However, this option is only supported for integer, non-composite keys. For more details, see Calculated or generated bounds.

    rules (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:

    • column (required): The name of the column to mask. For rulesets run against Amazon DynamoDB connections, a wildcard * can be used to apply masks to columns not matched by other column's rules.

    • masks (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask functions.

    • hash_columns (optional): A list of columns which will be used as
      input to the Deterministic masking algorithm
      for this rule. If hash_columns is provided, all mask types that rely on randomisation become deterministic based on the hash_columns column values. Values in the provided columns can be null.

      The hash column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.

      Hash columns can also be specified by a list of dictionaries with the following keys:

      • column_name (required): The name of the column which will be used as input as above.
      • case_transform (optional): upper/lower - Allows for case transforms on the values of the input, ensuring consistent hashed values irrespective of case.
      • trim (optional): If the value extracted from the column, json_path, or xpath contains leading or trailing white space, it can be trimmed by setting this option to true. Defaults to false.
      • json_path (optional): If the column contains JSON data, the path to the value in the JSON data you wish to perform the hash on can be specified here, otherwise the hash will be performed on the entire column.
      • xpath (optional): If the column contains XML data, the Xpath to the value stored in the XML data you wish to perform the hash on can be specified here, otherwise the hash will be performed on the entire column.
      • coerce_whole_numbers_to_int (optional): When set to true, this option will coerce whole number float values in the specified column to integers before being used for hashing.

    For more information on the json_path, please refer to the JSON documentation.
    For more information on the xpath, please refer to the XML documentation.
    For more information on deterministic masking, please refer to the Deterministic masking.

    About coerce_whole_numbers_to_int

    When coerce_whole_numbers_to_int is enabled, whole number float values will be coerced to integers (e.g. 1.0 to 1, 4.0 to 4, and so on). This ensures consistent hashing between integer columns and columns where whole numbers are stored as floats/decimals.

    Numbers that aren't whole are not affected by this option. For example, 1.5 will still be treated as 1.5 even when this option is enabled. Strings are also not affected.

    Example Behavior
    Input Input Type Output with coerce_whole_numbers_to_int set to false Output with coerce_whole_numbers_to_int set to true
    2.4 float/decimal 2.4 2.4
    "3.0" string "3.0" "3.0"
    "3.1" string "3.1" "3.1"
    2.0 float/decimal 2.0 2
    5 integer 5 5
  • workers (optional): The number of parallel processes to use for masking this table (defaults to 1). Each process will operate on a separate batch of rows from the table (batch size is a configurable run option). Increasing workers may decrease masking run times, dependent on database performance and the complexity (e.g. number of columns to mask) of the masking task. It is recommended to increase the number of workers if DataMasque connections to your target database spend more time processing queries than waiting for DataMasque (i.e. the "waiting for clients" time approximates DataMasque CPU time), which suggests DataMasque could efficiently use multiple workers to mask other batches while waiting for database responses. Additionally, it is recommended that the number of parallel processes multiplied by the number of workers assigned to each process does not exceed twice the number of CPUs available on your deployed instance.

  • index_key_columns (optional): When index_key_columns is true (the default setting), DataMasque will create an additional index on all key columns if there is no existing index contains all key columns. You may wish to disable the creation of a new index if you have an existing index on some of the key columns that will provide adequate performance.

    • Redshift databases do not support indexes, so this option has no effect for them.
    • This option currently has no effect for Microsoft SQL Server (Linked Server) databases.
  • allow_masking_foreign_keys (optional): To preserve referential integrity, an error will be raised if any of the rules masks a column that is a foreign key, as the parent of a foreign key should usually be masked with mask_unique_key and cascaded to its foreign key(s) when possible. This error is disabled when this option is set to true. Defaults to false.

  • disable_warning_on_use_of_conditionals (optional): By default, a warning is shown if conditional options skip, if/else or where are used, as this may mean not all rows are being masked (see Conditional Masking). To disable this warning, set disable_warning_on_use_of_conditionals to true. Defaults to false.

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - user_id
      - first_name
    rules:
      - column: date_of_birth
        hash_columns:
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'

If case_transform is specified under hash_columns, then the value of the first_name column will be transformed before being used for deterministic masking, ensuring that case will not affect the consistency of the mask. In the example below, where case_transform is specified on the hash column first_name, values such as "john" and "John" will both be converted into "JOHN" before hashing, ensuring that the final masked value is consistent regardless of case.

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - user_id
      - first_name
    rules:
      - column: date_of_birth
        hash_columns:
          - column_name: first_name
            case_transform: upper
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'

If trim is specified as true, then any whitespace in the first_name column will be removed before the value is used for deterministic masking. In this case, values such as " John", " John " and "John " will all be converted into "John" before hashing, ensuring that the final masked value is consistent regardless of surrounding whitespace.

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key:
      - user_id
      - first_name
    rules:
      - column: date_of_birth
        hash_columns:
          - column_name: first_name
            trim: true
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'

Composite keys

A composite key can be specified in following formats:

  • A list of columns; for example, key: ['invoice_id', 'product_id'] would be used to indicate a composite key consisting of a combination of two columns, invoice_id and product_id.
  • For Microsoft SQL Server and PostgreSQL, when specifying a composite key, the order of the keys listed must respect the original order as defined in the database.
  • A multiline composite key. An example of a multiline composite key is shown below.
  key:
    - 'invoice_id'
    - 'product_id'

Calculated or generated bounds

DataMasque's mask_table processes a table's data in batches. It fetches a specified number of rows, masks them, and writes them back to the database. The number of rows is determined by the batch size.

By default, DataMasque calculates bounding keys for each batch using the values in the key column(s) specified in the mask_table task. For example, if a table's id column has values ranging from 100,000 to 200,000 and the batch size is 50,000, the batches would be as follows:

Batch Number Start id End id No. Rows In Batch
1 100,000 149,999 50,000
2 150,000 199,999 50,000
3 200,000 200,000 1

This approach is called calculated bounds and is the default method if use_calculated_bounds is not specified or set to true. Calculated bounds work with non-integer columns and composite keys. As long as the key is unique, all batches (except the last one if the total number of rows is not divisible by the batch size), will have the same number of rows as the batch size.

Generating bounding keys requires querying the database, which can be time-consuming for large tables (e.g. 100,000,000 rows or more). In such cases, setting use_calculated_bounds to false allows DataMasque to generate bounding keys without querying the table. This means batch sizes may vary if key values are non-contiguous or do not start at 0, but no batch will exceed the specified batch size.

With use_calculated_bounds: false, DataMasque generates bounds by incrementally adding the batch size until the largest value in the key column is reached. Here's how the batches would look for the same example:

Batch Number Start id End id No. Rows In Batch
1 0 49,999 0
2 50,000 99,999 0
3 100,000 149,999 50,000
4 150,000 199,999 50,000
5 200,000 200,000 1

In this scenario, batch 1 is empty. DataMasque skips empty batches, but querying still occurs, which can be inefficient if there are many non-contiguous IDs.

The performance impact of enabling or disabling use_calculated_bounds depends on your dataset. Experimenting with use_calculated_bounds: false is recommended to see if it improves masking speed.

use_calculated_bounds may only be disabled if all the following conditions are true:

  • A single key column is specified.
  • The column is of an integer* type.
  • The column must contain no negative values. 0 is allowed.
  • The column must be a primary key or have a unique index/constraint.
  • The maximum value in the column must be less than or equal to 2^32 (4,294,967,296).
  • The column must not be nullable (e.g. should be created with NOT NULL constraint).

* Supported column types for all databases: INTEGER, BIGINT and SMALLINT. Additionally, Oracle also supports NUMBER columns with a scale of 0.

When to disable calculated bounds

The following table compares the effects of enabling or disabling use_calculated_bounds:

use_calculated_bounds: true use_calculated_bounds: false
Supports non-integer keys Yes No
Supports composite keys Yes No
Time spent up front querying bounds keys May have long delays for bigger tables No time spent up front
May waste time querying empty batches No Yes
Batch sizes consistent Yes, except for last batch Not consistent for non-contiguous keys
Requires an index* No Yes

*If use_calculated_bounds is true and no index is present, an index will be temporarily added for the task. Setting use_calculated_bounds to false is intended to boost performance on large tables. However, adding an index to such large tables may be time-consuming, thus making the overall task take longer. Therefore, if no index exists, the task will fail instead of adding an index. An index should be added prior to executing the mask_table task, either manually or with the use of a run_sql task.

Supported Database Engines

Disabling calculated bounds is supported on:

  • IBM Db2 LUW
  • MariaDB
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL

Example mask_table ruleset

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: last_name
        masks:
        - type: from_fixed
          value: "redacted last name"

Notes

  1. Index operations will be performed online (ONLINE=ON) on SQL Server editions that support this feature.
  2. Certain data types cannot be used as key columns for specific database connections. See Unsupported Key Columns by Database for details.
  3. While rules and masks should typically be provided as lists, they can also be specified as dictionaries that map arbitrary keys to rules/masks. For example:
...
rules:
  last_name_rule:
    column: last_name
    masks:
      fixed_mask:
        type: from_fixed
        value: "redacted last name"
  1. Specifying rules or masks as a dictionary can allow you to override the rule/mask for a specific key when inheriting from a definition.
  2. When masking a table, if a non-unique key is specified for the mask_table task alongside hash_columns, if the value of the hash_columns is different for more than one row with same key value, the final masked values will arbitrarily depend on the order that update statements are executed. This can be avoided by including the targeted hash_columns as part of a composite key for the mask_table task.

Mask a primary key or unique key

The mask_unique_key task type can be used to mask the values in a primary key or unique key. Masking of a primary key or unique key has the requirement that all masked values are unique, which requires the use of this special-purpose task type.

The mask_unique_key task type replaces all non-null rows of the target key with new, unique values, generated in accordance with a user-specified format. The target primary or unique key columns and associated foreign key columns are updated with these unique replacement values in a single operation to maintain referential integrity.

Any foreign keys that reference a column to be updated will be disabled prior to masking in order to prevent referential integrity errors, and will be re-enabled after the completion of the task.

Notes:

  • Each mask_unique_key task will mask the members of a single primary key or unique key constraint. Multiple mask_unique_key tasks are required to mask multiple independent unique keys on a single table.
  • The mask_unique_key task can be used on tables with up to 200,000,000 non-null rows.
  • When applied to a composite key, replacement values are only generated for rows that contain a complete, non-null key. For any null or partially null rows, all columns of the target key will be set to NULL.
  • The mask_unique_key task must only be applied to columns which are in-fact unique (i.e. the target key columns have a PRIMARY KEY or UNIQUE constraint enforced). Unique keys that have multiple NULL rows (e.g. using a filtered unique index in SQL Server) are allowed; such rows will not be modified by this task.
  • If unique key masking is attempted on a column that does not contain unique values, an error will be raised that a unique index could not be created during the maskuniquekey task and the masking run will fail.
  • When masking a clustered index on SQL Server, the performance of mask_unique_key can be significantly improved by disabling all other indexes and constraints on the target table for the duration of the task. It is recommended to implement this in your ruleset using run_sql tasks before and after the mask_unique_key task to disable and then re-enable these constraints.
  • The mask_unique_key task does not support SQL Server columns created with the IDENTITY property, or Oracle / PostgreSQL columns created with GENERATED ALWAYS AS IDENTITY.
  • Use of mask_unique_key for Amazon DynamoDB, Amazon Redshift or Microsoft SQL Server (Linked Server) databases is not currently supported in DataMasque, consider using mask_table with a from_unique mask instead.
  • Due to the random assignment of replacement values, it is possible (though generally rare) that a row may be assigned a masked value that is identical to its pre-masking value. In these cases, the masking is still effective, as an attacker will not be able to identify which rows' values were replaced with an identical value. However, if you need to guarantee that all masked values are different from their pre-masking values, you should use min and max parameters to ensure the range of possible output values from your format string does not overlap with the range of pre-masking values in your database.
  • Notes for Amazon DynamoDB: The mask_unique_key task type is not supported for Amazon DynamoDB. The from_unique mask should be used instead.
  • Notes for masking files: The mask_unique_key task type is not supported for file masking. The from_unique mask should be used instead.

Warning: The mask_unique_key task type must not be run in parallel with tasks that operate on any of the following:

  • The target table of the mask_unique_key task.
  • Tables containing foreign keys that reference the target_key columns (either directly or through a chain of foreign keys).
  • Any tables specified in additional_cascades.

Parameters

Each task with type mask_unique_key is defined by the following parameters:

  • table (required): The name of the database table that contains the primary key or unique key to be masked.
  • target_key (required): A list of items defining each column that makes up the primary or unique key, and the format in which replacement values will be generated for that column. Composite keys can be masked by including multiple columns and formats in this list. Each item has the following attributes:
    • column (required): The name of the column to be masked.
    • format (optional): The format which will be used to generate replacement values for the column. See Format string syntax for details. Defaults to '{!int}'.
  • additional_cascades (optional): Use this parameter to propagate masked values to implied foreign keys of the target_key. Implied foreign keys are dependencies that exist between tables but are not enforced by foreign key constraints, and hence are not defined in the database. Masked values will be cascaded to these columns; see Cascading of masked values for more details on how this works. Each additional_cascades item has the following attributes:
    • table (required): The name of the table containing the cascade target columns, which have an implicit reference to the target_key of this task. The table name can be prefixed with a schema name to cascade to a table in another schema.
    • columns (required): A list of column dictionaries - each describing the relationship between a column of the target key and a column on the cascade target table. Each column mapping item has the following attributes:
      • source (required): The name of a column in the target key from which masked values will be cascaded to the corresponding target column.
      • target (required): The name of a column on the cascade target table to which masked values from the source column will be cascaded.
    • update_foreign_keys (optional): If set to true, foreign keys referencing the target columns will be updated in the same way as foreign keys referencing the target_key. Defaults to true.
  • batch_size (optional): To avoid excessive memory consumption when masking large tables, DataMasque generates replacement values in batches. This value controls the maximum number of unique values that are generated in a single batch. In general, the default of 50,000 will be acceptable for most use cases. The batch size used by DataMasque will be limited by the number of rows in the target table.
  • allow_masking_foreign_keys (optional): To preserve referential integrity, an error will be raised if any of the target_key columns is a foreign key, as the parent of a foreign key should usually be masked with mask_unique_key and cascaded to its foreign key(s) when possible. This error is disabled when this option is set to true. Defaults to false.
  • action_on_batch_size_exceeded (optional): Specifies what DataMasque will do if the number of rows returned from the database exceeds the specified batch size, which can lead to excessive memory usage and performance degradation. This could happen if the key columns are non-unique or when querying corrupt tables. If specified, must be either warning or error. Defaults to error. The warning option will cause DataMasque to log a warning in the run log and continue masking. The error option will cause DataMasque to fail the masking run and log the error.
  • disable_warning_on_missing_unique_index (optional): A boolean which, when set to true, prevents the warning when a key doesn't have an adequate unique index and DataMasque creates one for the duration of a masking run being logged. Defaults to false.

Note:

  • When using additional cascades to propagate composite primary or unique key columns to implied foreign key columns, you need to specify all the columns in the additional cascades parameter that are corresponding to the referenced primary or unique key columns.
    • Warning! Using additional_cascades to propagate partial columns on composite primary or unique key columns is not recommended. Not specifying all the columns that are corresponding to the referenced primary or unique key columns will cause data propagation to fail from the referenced composite primary or unique key columns to the composite implied foreign keys. In other words, partial cascades that reference a subset of the target key columns will result in data for omitted columns not being propagated to the target table, potentially resulting in inconsistent data between the two tables.
    • For Oracle databases, all columns of the target_key must be referenced as source columns in additional_cascades. Partial cascades that reference a subset of the target key columns are disallowed. For example: A composite unique key in a PhoneNumbers table which consists of Prefix and LineNumber columns which are referenced by PhPrefix and PhLineNumber columns in the Calllogs table but without foreign key constraint. Therefore, it is an implied foreign key that requires using the additional_cascades parameter to propagate the masked unique key values to ensure data integrity across the tables.

PhoneNumber CallLogs implicit fk table diagram

A ruleset needs to be written to specify all corresponding implicit foreign key columns in the additional_cascades parameter as such that the masked unique keys will be propagated collectively to the foreign keys:

 version: "1.0"
 tasks:
   - type: mask_unique_key
     table: PhoneNumbers
     target_key:
       - column: Prefix. #part of the composite unique key constraint
         format: "{!int, 1:150, pad}"
       - column: LineNumber #part of the composite unique key constraint
         format: "{!int, 50001:100000, pad}"
     additional_cascades:
       - table: CallLogs
         columns:
           # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly.
           - source: Prefix #UK a
             target: PhPrefix
           - source: LineNumber #UK a
             target: PhLineNumber
  • However, if they are not composite foreign keys but individual foreign keys a ruleset needs to be written to propagate the masked unique keys individually:

PhoneNumber CallLogs implicit fk table diagram

version: "1.0"
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix. #has its own unique key constraint
        format: "{!int, 1:150, pad}"
      - column: LineNumber #has its own unique key constraint
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      # Need to include both Prefix/PhPrefix and LineNumber/PhLineNumber for data to propagate properly.
      - table: CallLogs
        columns:
        - source: Prefix #UK a
          target: PhPrefix
      - table: CallLogs
        columns:
        - source: LineNumber #UK b
          target: PhLineNumber

Example 1

The following example will mask the primary key column AccountNumber of the Users table with unique replacement values. Another table, Transactions, also has a column named AccountNumber which has a foreign key relationship to the AccountNumber column of the Users table.

Users transaction table diagram

Account numbers will be generated with at least 6 digits; the minimum value being 100,000 and the maximum value growing as required depending on the number of rows in the table.

In this ruleset below, you only need to specify masking rules for the primary key column, AccountNumber, to be masked. You do not need to explicitly define the foreign key columns to propagate the replacement values to in the ruleset. DataMasque will automatically detect primary key and foreign key relationships in the database and propagate the replacement values to any related foreign key columns; in this case, the new values for AccountNumber in the Users table are implicitly propagated to the AccountNumber column in the Transactions table.

version: '1.0'
tasks:
  - type: mask_unique_key
    table: Users
    target_key:
      - column: '"AccountNumber"'
        format: '{!int,100000:}' # Account numbers will be generated with at least 6 digits

Show result

Users table
Before After
AccountNumber Name
100001Adam
100002Brenda
100003Charlie
100004Diane
100005Esther
100006Francis
AccountNumber Name
106512Adam
109983Brenda
161822Charlie
189413Diane
122576Esther
197698Francis
Transactions table
Before After
TransactionDate AccountNumber
2021-01-01100001
2021-01-02100002
2021-01-03100004
2021-01-04100004
2021-01-05100003
2021-01-06100001
2021-01-07100006
2021-01-08100005
TransactionDate AccountNumber
2021-01-01106512
2021-01-02109983
2021-01-03189413
2021-01-04189413
2021-01-05161822
2021-01-06106512
2021-01-07197698
2021-01-08122576

Example 2

The following example will mask a composite unique key of a PhoneNumbers table. The composite key consists of the following columns:

  • Prefix VARCHAR(3): containing a zero-padded integer in the range (1-150). e.g. 001, 002, etc.
  • LineNumber VARCHAR(6): containing a zero-padded integer in the range (50,001-100,000).

Values matching these specific formats can be generated using the following format strings:

  • '{!int, 1:150, pad}': Generates integers between 1-150 (inclusive), zero-padded to a fixed width of 3 characters.
  • '{!int, 50001:100000, pad}': Generates integers between 50,001-100,000 (inclusive), zero-padded to a fixed width of 6 characters.

Because both of these formats have an upper value bound, we must consider the maximum number of unique composite values that are available in this space. Multiplying the number of values in the two ranges (150 * 50,000), we can determine that these two format strings will supply us with 7,500,000 rows of unique composite values. As a result, this task would fail if applied to a table containing more than 7,500,000 (non-null) rows.

In addition to the PhoneNumbers table, this schema also includes a CallLogs table containing the columns PhPrefix and PhLineNumber, which are references to the values in the Prefix and LineNumber columns of the PhoneNumbers table. However, due to specific requirements of this schema, these references are not defined using a foreign key constraint. Without a foreign key, DataMasque will not automatically propagate the replacement values generated for the PhoneNumbers table to the CallLogs table.

PhoneNumbers CallLogs table diagram

In order to ensure the referential integrity of these implicit references is maintained during masking, this example uses additional_cascades to instruct DataMasque on how to cascade updated values to these columns. DataMasque will take the values of Prefix and LineNumber columns of the PhoneNumbers table and propagate these values to the PhPrefix and PhLineNumber columns of the CallLogs table.

version: '1.0'
tasks:
  - type: mask_unique_key
    table: PhoneNumbers
    target_key:
      - column: Prefix
        format: "{!int, 1:150, pad}"
      - column: LineNumber
        format: "{!int, 50001:100000, pad}"
    additional_cascades:
      - table: CallLogs
        columns:
          - source: Prefix
            target: PhPrefix
          - source: LineNumber
            target: PhLineNumber

Show result

PhoneNumbers table
Before After
id Prefix LineNumber
1056069875
2140093150
3056051531
4114089796
5079073665
id Prefix LineNumber
1012050165
2140079463
3115096452
4003050010
5105065498
CallLogs table
Before After
PhPrefix PhLineNumber Timestamp
0560698751620110987503
0560698751620111698775
1140897961620113762010
0560698751620125649941
0790736651620125789210
PhPrefix PhLineNumber Timestamp
0120501651620110987503
0120501651620111698775
0030500101620113762010
0120501651620125649941
1050654981620125789210

Example 3

Consider a database with two tables and two schemas: Customers in the Accounts schema; and Transactions in the Sales schema. The following example will mask the primary key column CustomerId of the Accounts.Customers table. However, any changes made to this CustomerId column must also be reflected on the Customer column of the Sales.Transactions table. Due to specific requirements of this database, these cross-schema references are not defined using a foreign key constraint. In order to maintain referential integrity, this example uses additional_cascades to instruct DataMasque how to cascade updated values to the relevant column present in the other schema.

The Customer ID consists of 3 letters, followed by a hyphen, then a 4-digit number. This will be constructed using a format string to ensure any values generated conform to the required standards.

version: "1.0"
tasks:
  - type: mask_unique_key
    table: Accounts.Customers
    target_key:
      - column: CustomerId
        format: "{[a-z],3}-{[0-9],4}"
    additional_cascades:
      - table: Sales.Transactions
        columns:
          - source: CustomerId
            target: Customer

Show result

Customers table
Before After
CustomerId Name
AAA-1111Adam
BBB-2222Brenda
CCC-3333Charlie
DDD-4444Diane
EEE=5555Esther
CustomerId Name
ACG-1435Adam
BOD-0493Brenda
CAG-6845Charlie
HUM-5941Diane
SNC=9341Esther
Transactions table
Before After
TransactionDate Customer
2021-01-01AAA-1111
2021-01-02EEE=5555
2021-01-03DDD-4444
2021-01-04BBB-2222
2021-01-05AAA-1111
2021-01-06CCC-3333
2021-01-07DDD-4444
2021-01-08EEE=5555
TransactionDate Customer
2021-01-01ACG-1435
2021-01-02SNC=9341
2021-01-03HUM-5941
2021-01-04BOD-0493
2021-01-05ACG-1435
2021-01-06CAG-6845
2021-01-07HUM-5941
2021-01-08SNC=9341

Format string syntax

The format string syntax used by DataMasque to generate unique replacement values currently supports format strings that are combination of alphanumeric characters and symbols. This can be used to generate key values that combined randomly generated portions combined with fixed formatting to generate a key that matches any format required in your key columns.

Format strings consist of variable components that are declared inside braces {}, as well as fixed components outside of braces.

Note: In order to guarantee the uniqueness of generated values, format strings used with mask_unique_key and from_unique may contain at most one variable component whose length is not constant (e.g. a variable component for an unpadded random integer whose length will depend on the number of digits: {!int, 1:100}).

Format strings can be constructed using character sets. Character are wrapped in braces and followed by a comma and a number to indicate how many characters in the set are to be generated.

Character Set Description
[a-z] Lower case alphabetical characters.
[A-Z] Upper case alphabetical characters.
[0-9] Numerical characters.
[aeiou] Any vowel. Characters can be individually specified without using a range.

An example format strings is shown below.

format: "{[a-z],2}_{[A-Z],2}-{[a-zA-Z],3}#{[0-9],5}"

In this example, there are four sets of curly braces within the format string, each providing a set of characters followed by a number indicating how many of these characters to generate.

  • The first braces specify {[a-z],2}, which will generate a string of 2 lower case alphabetical characters.
  • After this, there is an underscore outside the brackets, which means all generated values will have an underscore after the 2 alphabetical characters.
  • The second braces specify {[A-Z],2}, which will generate string of 2 upper case alphabetical characters.
  • After this second variable, there is a hyphen -, meaning a hyphen will always be present after the 2 upper case values.
  • The third braces specify {[a-zA-Z],3}. which will generate a string of length 3 consisting of both lower case and upper case alphabetical characters.
  • After this, there is a hash #, meaning a hash character will always be placed after the third generated string.
  • The final braces specify {[0-9],5}, meaning a 5 digit numerical number is placed at the end of the string.

The following values are some example outputs that may be generated using the format string above:

ab_TJ-RaK#10496 pt_oq-TRu#49511 iu_QE-unT#67312 nd_UL-bES#97638

Example

In this example, we wish to mask a series of number plates.

A number plate consists of 3 alphabet characters, followed by 3 numerical digits. We wish to mask the first letter of each number plate with X, followed by 2 random alphabetical characters, followed by a hyphen and a 3-digit number.

A snippet of the table is shown below, where number_plate is a unique key of the table.

car_registration Table

number_plate car_owner
AAA-111 Anastasia
BBB-222 Bill
CCC-333 Chris
DDD-444 Judith
EEE-444 Gordon

In this case, we will use the following string format.

format: "X{[A-Z],2}-{[0-9],3}"

We are generating a fixed value of X, followed by 2 alphabetical characters, as defined by {[A-Z],2}. After this, there is a hyphen outside of braces, so a static value of a hyphen will always be generated. After this, 3 random numerical digits are generated.

A ruleset utilising this string format in a mask_unique_key task can be shown below.

version: "1.0"
tasks:
    - type: mask_unique_key
      table: car_registration
      target_key:
        - column: number_plate
          format: "X{[A-Z],2}-{[0-9],3}"

Show result

Before After
number_plate car_owner
AAA-111 Anastasia
BBB-222 Bill
CCC-333 Chris
DDD-444 Judith
EEE-555 Gordon
number_plate car_owner
X Anastasia
XAT-475 Bill
XIB-015 Chris
XPQ-164 Judith
XUR-431 Gordon

Integer string format syntax

The !int operator can be used to generate integers. In its most basic use with no arguments, the format string {!int} will generate integers from one to infinity. Extra parameters can be added to set the output range or pad the output.

Range (min:max):

The range defines all possible integer values that may be generated by the integer generator. This is an optional parameter, defaulting to 1: (min=1, max=unbounded).

  • min (optional): The minimum value which will be generated (inclusive). Defaults to 1.
  • max (optional): The maximum value which will be generated (inclusive). When this value is not specified the maximum value will be unbounded, meaning it will grow depending on the number of values required.

For example:

  • {!int, 5:}: generate integers from 5 (inclusive) to infinity.
  • {!int, :100}: generate integers from 1 to 100 (inclusive).
  • {!int, 20:80}: generate integers from 20 to 80 (inclusive).

Zero-pad (pad):

  • pad (optional): When specified, zero-padding will be applied to generated integers - resulting in fixed character width replacement values (e.g. 001, 002, …, 999). The zero-pad width is determined by the width of the max value, and therefore is only a valid option when a max value is specified.

For example:

  • {!int, :1000, pad}: generate strings in the format 0001, 0002, etc., up to 1000.

Note: Always wrap format strings in either single or double quotes. Leaving format strings unquoted in the ruleset will result in invalid YAML due to the opening { character, which has reserved usage in YAML. See below for examples of a correctly quoted format string:

format: '{!int}'        # Single quotation marks
format: "{!int, pad}"   # or double quotation marks

Hex string format syntax

Integers can be generated and output in a hexadecimal format by using the !hex operator. The range and pad options apply in the same was as for standard integer generation, however range values are interpreted as hexadecimal rather than decimal.

For example:

  • {!hex}: generate hex strings from 1 to infinity, i.e. 1, 2, …, a, b, …, ff, 100, etc.
  • {!hex, 10:100, pad}: generate hex strings from 0x10 to 0x100 (inclusive) with padding, i.e. 010, 011, …, 0fe, 0ff, 100.

UUID string format syntax

You may choose to generate values in the Universal Unique Identifier (UUID) format by declaring a !uuid format string. A UUID is a string of 32 hexadecimal digits (0 to 9, a to f), separated by hyphens. The number of characters per hyphen are 8-4-4-4-12. An example UUID would be 12345678-90ab-cdef-1234-567890abcdef.

In order to generate a UUID as the unique key, simply specify uuid in the format as shown below.

format: "{!uuid}"

You may also specify a prefix within the format string of up to 8 characters. This will ensure that the first characters in the UUID are always static. For example, specifying format: "{!uuid,aaaa}" will cause the first 4 characters of every UUID generated by the ruleset to be a.

Alternatives string format syntax

You can have DataMasque select one value from a set of alternatives for each generated value. Such a segment can be specified by wrapping your set of pipe/|-separated alternatives in parentheses (). For example, {(EN|FR)}-{!int} can be used to generate an integer prefixed by either EN or FR. At least two alternatives must be specified.

Note: This should only be used with the from_format_string mask in a mask_table task as it will not satisfy the unique requirement for the mask_unique_key task.

Cascading of masked values

Referential integrity of data references to the target_key of a mask_unique_key task is maintained by "cascading" the masked replacement values to each reference. DataMasque will perform this cascade automatically for relationships defined by a database foreign key constraint. Masked values can also be cascaded to columns that are not members of such a foreign key constraint by using the additional_cascades feature.

Specifically, foreign keys that meet the following criteria will be updated:

  1. Foreign keys of the target_key or an additional_cascade with update_foreign_keys: true will be updated.
  2. Only foreign keys in the same schema as the target table will be updated.
  3. Only foreign keys that are enabled will be updated. Foreign keys that are present, but disabled at the time of masking will be excluded. Specifically:
    1. In Oracle, foreign keys disabled by ALTER TABLE some_table DISABLE CONSTRAINT some_constraint are not updated.
    2. In Microsoft SQL Server, foreign keys disabled by ALTER TABLE some_table NOCHECK CONSTRAINT some_constraint
    3. Note that MySQL, MariaDB, PostgreSQL and Redshift do not have equivalent support for disabling foreign key constraints, so all discovered foreign keys are updated.
  4. Foreign keys will be searched for recursively - i.e. Foreign keys that reference an already updated foreign key will also be updated.
  5. A foreign key that references a superset of the columns in its referenced key or the original target_key will be updated, but the columns that do not reference the target_key will be left unchanged.
    • For example, consider using mask_unique_key to mask a key column user_id. If a composite foreign key (ref_user_id, ref_group_id) references user_id and another column group_id, then the ref_user_id in that foreign key will be updated but the ref_group_id will not.
  6. A foreign key that references a subset of the columns in the target_key will NOT be updated, as row references may be ambiguous.
    • For example, consider using mask_unique_key to mask a composite key (user_id, group_id). If a foreign key references user_id or group_id alone, then that foreign key will not be updated.
    • Note that if an additional_cascade with update_foreign_keys: true uses the same target_key column as a source_column multiple times, then foreign keys of that additional_cascade will still be updated if they reference each column in the original target_key at least once.
    • Additionally, if an additional_cascade with update_foreign_keys: true references only a subset of the target_key columns, then foreign keys of that additional_cascade will be updated as long as they reference all the target columns in the additional_cascade.

If any cascade would update the same set or a subset of the columns of another cascade, it will be marked as redundant and only the other cascade will be performed. For example, consider using mask_unique_key to mask a key column user_id where a foreign key column ref_user_id references user_id and an additional_cascade is specified to update both ref_user_id and ref_user_id_2 from the single source_column of user_id. In this situation, the foreign key is redundant as it updates a subset of the columns in the additional cascade, so only the update specified by the additional cascade needs to be performed.

The following situations would result in an unresolvable set of cascades, and will cause DataMasque to report an error in the run log:

  1. Any cascade whose columns partially overlap those of another cascade without either column set being a superset of, subset of, or equal set to the other column set.
    • For example, consider using mask_unique_key to mask a key column user_id that is referenced by a composite foreign key (ref_user_id, ref_group_id) while an additional_cascade is specified to update columns ref_user_id and ref_role_id. Because the foreign key and additional_cascade would both update ref_user_id but would also update columns that each other cascade would not update (ref_group_id and ref_role_id), they cannot be resolved and an error will be reported.
  2. Any cascade that would update a column from a different column in the target_key than another cascade.
    • For example, consider using mask_unique_key to mask a composite key (user_id, group_id) that is referenced by a composite foreign key (ref_user_id, ref_group_id) while an additional_cascade is specified to update ref_user_id=group_id and ref_group_id=user_id. Because the foreign key and additional_cascade would update ref_user_id and ref_group_id from different source columns, they cannot be resolved and an error will be reported.

Any rows of a cascade target (child table) which contain values that are not present in the target_key (on the parent table) will have their cascade target columns set to NULL. This situation may occur in one of the following cases:

  • The cascade target is a foreign key that has at some point been disabled, had values updated, then been re-enabled without being checked / validated. i.e.

    • For Oracle the constraint was re-enabled using NOVALIDATE.
    • For Microsoft SQL Server, the constraint was re-enabled without using WITH CHECK.
    • For PostgreSQL, the constraint was dropped and recreated instead of being disabled and re-enabled.
  • The cascade target is an implicit reference without database constraints.

The behaviour is designed to ensure that no rows are left unmasked on the cascade target.

Important!

  • The reported number of rows updated by mask_unique_key will include updates made by all cascades, and count each time a row is affected by a different update (i.e. multiple cascade updates on the same table).

Troubleshooting failed updates

Queries run by mask_unique_key to update key columns may fail in the following circumstances:

  • As mentioned above, when any column in a composite key contains a null value, all columns will be set to null. If any columns in the composite key have a not-null constraint, then such an update query will fail. To update composite keys with a mix of nullable and non-nullable columns, consider updating either the nullable or non-nullable columns with mask_unique_key and updating the remaining columns with mask_table and the from_unique mask type.

Build a temporary table

If you need to repeatedly join multiple tables when masking tables, or you need to perform some custom SQL transformations to column values, then you may wish to use the build_temp_table task type to create a temporary table which can then be accessed via a join during a mask_table task:

Parameters

  • table_name (required): The name of the temporary table to create in the database. You will need to use this name when referencing this temporary table later (e.g. in joins and masking rules). The table name can be prefixed with a schema name if the temporary table should be created in a schema other than the user's default schema.
  • sql_select_statement (this OR sql_select_file required): A string containing a SELECT statement to define the contents of the temporary table. To break the statement across multiple lines, you may use multi-line YAML syntax (|- or >-).
  • sql_select_file (this OR sql_select_statement required): The name of a user uploaded SQL script file containing a SELECT query to define the contents of the temporary table. See the Files guide for more information on uploading SQL script files. Use this parameter if you have a complex / long query, or you wish to share the same query between many rulesets.

Example (sql_select_statement)

version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_statement: >-
        SELECT accounts.account_id, address.city
        FROM accounts
        INNER JOIN address
        ON accounts.address_id = address.id
        WHERE accounts.country = 'New Zealand';
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city

Example (sql_select_file)

version: '1.0'
tasks:
  - type: build_temp_table
    table_name: my_temporary_table
    sql_select_file: create_temp_table.sql
  - type: mask_table
    table: my_temporary_table
    key: id
    rules:
      - column: city
        masks:
          - type: from_file
            seed_file: DataMasque_NZ_addresses.csv
            seed_column: city

Notes:

  • DataMasque will create the temporary tables before applying any masks, and delete them after all tables have been masked. DataMasque will also ensure the temporary tables do not already exist in the database (removing existing temporary tables with the same name if needed). The temporary tables you define will only be available for use in joins, and cannot be masked themselves.

  • For Microsoft SQL Server databases, temporary table names must begin with the ## characters, as they will be created as "Global Temporary Tables" so that they are visible to all parallel masking connections. However, in YAML the # character begins an inline comment, so the temporary table name must be wrapped in double or single quotes (e.g. table: '##my_temporary_table').

    • If using the temporary table name in any other tasks, it must also be quoted to avoid the # character acting as a YAML comment. For example, in a run_sql task, quote the entire SQL query:
      sql: "CREATE INDEX IDX_name ON ##my_temporary_table(name);"
  • For Oracle and PostgreSQL databases, "temporary tables" are created as regular tables so that temporary tables are visible to all parallel masking connections.

  • The build_temp_table task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.


Run SQL

Use the run_sql task type when you need to run any native SQL DDL or DML statements or scripts at any point within a ruleset.

You can supply SQL for DataMasque to execute either as a script file (see Files guide), or inline in the ruleset:

Parameters

  • sql (this OR sql_file required): An SQL script to be executed. For multi-line scripts, you may use the YAML block style syntax (|-).
  • sql_file (this OR sql required): The name of a user-provided file containing an SQL script to be executed (see Files guide). Use this parameter if you have large blocks of SQL to run, or scripts that you wish to share between many rulesets.

Example Microsoft SQL Server (sql)

Note: This example uses Microsoft SQL Server specific syntax, as master has been specified as the target database.

version: '1.0'
tasks:
  - type: run_sql
    sql: |-
      USE [master];
      ALTER DATABASE eCommerce SET RECOVERY SIMPLE WITH NO_WAIT;
      USE [eCommerce];
      ALTER TABLE [SalesRecords].[Customer] DROP CONSTRAINT [FK_SALESRECORDS_CUSTOMER];
      ALTER TABLE [Invoices].[Customer] DROP CONSTRAINT [FK_INVOICES_CUSTOMER];

Example MySQL (sql)

version: '1.0'
tasks:
  - type: run_sql
    sql: |-
      CREATE PROCEDURE dorepeat(p1 INT)
      BEGIN
        SET @x = 0;
        REPEAT SET @x = @x + 1; UNTIL @x > p1 END REPEAT;
      END;
      CALL dorepeat(1000);

Example (sql_file)

tasks:
  - type: run_sql
    sql_file: pre_script_1.sql

Notes:

  • The run_sql task type executes in autocommit mode, and will exit on the first error encountered.
  • The run_sql task type does not run in dry run mode.
  • The run_sql task type is not currently supported for Microsoft SQL Server (Linked Server) databases.
  • The run_sql task type currently only works with one SQL query per task for IBM DB2 LUW databases.
  • For PostgreSQL, MySQL and MariaDB connections, the SQL interpreter considers a colon followed by letters or numbers as bound parameters. The colon can be escaped with a \ before it. e.g. Attempting to insert JSON data: INSERT INTO table_name (column_name) VALUES('{"is_real" :true, "key":"value"}'::json) run_sql will interpret true as a bound parameter, to fix this you can escape the bound parameter by adding a backslash(\) before the colon(:) as follows: INSERT INTO table_name (column_name) VALUES('{"is_real" \:true, "key":"value"}'::json)

Notes for Oracle:

  • It is recommended to test the execution of your SQL script with Oracle SQLPlus before use in a run_sql task.
  • DataMasque appends "WHENEVER SQLERROR EXIT SQL.SQLCODE" to the beginning of the SQL script, so the run_sql task will exit on the first error encountered.
  • Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed run_sql task.
  • run_sql does not use the schema specified in the connection configuration, instead it will default to the schema of the user. If you wish to change schema, specify it in the script with SET SCHEMA.
  • When running multiple statements in a run_sql block, a / character is required between each statement if END is used in the transaction.

Notes for Microsoft SQL Server:

  • It is recommended to test the execution of your SQL script with Microsoft sqlcmd or SQL Server Management Studio before use in a run_sql task.
  • Even if the database raises errors while executing the SQL script, some statements may still have finished executing successfully. It is recommended to check the database on any failed run_sql task.

Notes for PostgreSQL:

  • It is recommended to test the execution of your SQL script with psql before use in a run_sql task.
  • A run_sql task will be executed with a simple query cycle, where statements are executed in a single transaction (unless explicit transaction control commands are included to force a different behaviour).

Notes for Redshift:

  • Using multiple SQL statements in a single run_sql task type is not supported currently.

Notes for MySQL and MariaDB:

  • It is recommended to test the execution of your SQL script with MySQL/MariaDB shell before use in a run_sql task.
  • Using DELIMITER to redefine the MySQL/MariaDB delimiter is not required. DELIMITER is a mysql/mariadb command-line client command and is not supported when defining stored programs in run_sql tasks.
  • Warning: Executing run_sql tasks with more than one statement may not raise errors upon failure. If the first statement executes correctly but subsequent statements fail, errors may not be raised.

Notes for Amazon DynamoDB:

  • The run_sql task type is not supported for Amazon DynamoDB.

Notes for IBM DB2 LUW:

  • On IBM DB2 LUW, run_sql tasks can only execute one SQL query per task.
  • You may have to split up SQL into multiple run_sql tasks to achieve desired results.

Truncate a table

Use the truncate_table task type to specify tables to be truncated by DataMasque. All rows will be deleted, but the table structure will be left in place.

Parameters

  • table (required): The name of the table to truncate. The table name can be prefixed with a schema name to reference a table in another schema.

Example

tasks:
  - type: truncate_table
    table: history_table
  ...

Notes:

  • The truncate_table task type does not run in dry run mode.
  • The truncate_table task type is not currently supported for Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
  • The truncate_table task type does not properly truncate on IBM DB2 LUW databases, instead it deletes all rows in the table. See IBM Db2 LUW Limitations.

Parallel Tasks

Using the parallel task type, you can specify a block of tasks to be executed in parallel, spread across as many workers as are available.

Each parallel task distributes to a maximum of 10 sub-tasks. It is recommended to begin testing parallelisation with at most 4 tasks in parallel, then increase parallelisation if the database has more capacity.

Parallel tasks can be nested inside other serial/parallel tasks.

Parameters

  • tasks (required): A set of tasks to perform in parallel.

Example

tasks:
  - type: parallel
    tasks:
      - type: mask_table
        table: employees
        key: id
        rules:
          - column: 'name'
            masks:
              - type: from_fixed
                value: 'REDACTED'
      - type: mask_table
        table: customers
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
 ...

Warning: You should not mask the same table in multiple tasks (including mask_table, run_sql tasks) in parallel, as this could result in data being incorrectly masked.

Note: Currently, the parallel task type is not supported for either Amazon DynamoDB or file masking.

Serial Tasks

Although tasks are performed serially in the order they are listed in the ruleset by default, you can specify a block of tasks to be performed in serial within a parallel block. This is useful when a subset of parallelisable tasks have dependencies that mean they must be executed in sequence.

Serial tasks can be nested inside other serial/parallel tasks.

Parameters

  • tasks (required): A set of tasks to perform in series.

Example

tasks:
  - type: parallel
    tasks:
      - type: serial
        tasks:
          - type: run_sql
            sql_file: pre_employees_script.sql
          - type: mask_table
            table: 'employees'
            key: id
            rules:
              - column: 'name'
                masks:
                  - type: from_fixed
                    value: 'REDACTED'
      - type: mask_table
        table: 'customers'
        key: id
        rules:
          - column: 'address'
            masks:
              - type: from_fixed
                value: 'REDACTED'
 ...

Note: Currently, the serial task type is not supported for file masking.

File task types

Each mask_file or mask_tabular_file task specifies the masking rules to apply to each file in the base directory and/or any subdirectories. As well as any files/directories intended to be skipped or included, and any conditionals required to define which data to mask in the masking process. Masking rules and masks are applied sequentially in the order they are listed. When multiple masks are combined in sequence, the output value from each mask is passed as the input to the next mask in the sequence.

Note If the source connection and destination connection are of the same type and have the same base directory, the files will be overwritten. The list of files is read at the start of the masking run, so new files added during the masking run will not be masked and will not be present in the destination.


A ruleset consists of a list of tasks to be performed in sequence on the target data source after the file is masked, it is then written to the selected data destination.


Sensitive data discovery

The run_file_data_discovery task type inspects the metadata and optionally the contents of your files, searching for columns which are likely to contain sensitive data. On completion, a report is generated containing a summary of all identified columns, and their current masking coverage.

See the Sensitive Data Discovery guide for more information on this feature.

Parameters

  • in_data_discovery (required): Enable In-Data discovery to examine the contents of discovered files for sensitive data matches. See In-Data Discovery.
  • custom_keywords (optional): Additional custom keywords and to facilitate sensitive data discovery.
  • disable_built_in_keywords (optional): Disable DataMasque's built-in discovery keywords. See Built-in data discovery keywords.
  • disable_global_custom_keywords (optional): Disable any global custom keywords configured on the settings page.
  • disable_global_ignored_keywords (optional): Disable any global ignored keywords configured in the settings page.
  • ignored_keywords (optional): Additional ignored keywords to facilitate sensitive data discovery.
  • encoding (optional): The encoding to use when reading files. Defaults to UTF-8. Refer to Python Standard Encodings for a list of supported encodings.
  • include (optional): Specifies files to include in the discovery run. If not set, all files will be discovered (except those skipped by a skip rule).
    • regex (optional): Files that match this regular expression will be discovered.
    • glob (optional): Files that match this glob will be discovered.
    • match_against (optional): Choose if the include rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • recurse (optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be discovered. Defaults to false.
  • skip (optional): Specifies files to not include in the discovery run.
    • regex (optional): Files that match this regular expression will not be included for discovery.
    • glob (optional): Files that match this glob will not be included for discovery.
    • match_against (optional): Choose if the skip rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.

Examples

The following shows an example ruleset that will execute only the run_file_data_discovery task and no masking tasks. This example usage may be useful to determine a starting point for developing your masking rules.

version: '1.0'
tasks:
  - type: run_file_data_discovery
    include: [{"glob": "*.csv"}]
    in_data_discovery: {"enabled": False}

This example enables In-Data discovery:

version: '1.0'
tasks:
  - type: run_file_data_discovery
    include: [{"glob": "*.csv"}]
    in_data_discovery: {"enabled": True}

Object file masks

Each mask_file task defines masking operations to be performed on a file or set of files. mask_file tasks support the following file types:

  • JSON
  • NDJSON
  • XML
  • Apache Avro

In addition, any file type can be used with from_blob, which will replace the entire file with another file.

Note The file type will be determined solely by the file extension, not by its content.

Parameters

Each task with type mask_file is defined by the following parameters:

  • rules (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
    • masks (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the file. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask types.
  • recurse (optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults to false.
  • workers (optional): The number of parallel workers to use for this masking task. Defaults to 1.
  • skip (optional): Specifies files to not include in the masking run.
    • regex (optional): Files that match this regular expression will not be included for masking.
    • glob (optional): Files that match this glob will not be included for masking.
    • match_against (optional): Choose if the skip rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • include (optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by a skip rule).
    • regex (optional): Files that match this regular expression will be masked.
    • glob (optional): Files that match this glob will be masked.
    • match_against (optional): Choose if the include rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • encoding (optional): The encoding to use when reading and writing files. Defaults to UTF-8. Refer to Python Standard Encodings for a list of supported encodings.

For more information about the ordering of skip and include please refer to Include/Skip.

Note: Increasing the number of workers can reduce masking time by masking files in parallel, but extra workers will not mask different parts of the same file. This is useful if there are a large number of files to be masked, but will not reduce the amount of time required to mask a single large file. For files that can be split into multiple parts (for example, tabular files or multi-record files), manually chunking and recombining the files can improve masking speed.

Note: regex/glob will match to the path from the base directory specified in the source connection, consider adding .*(regex) or *(glob) to the beginning of the expression for matching. For example: If the structure is /path1/path2/target_file.json, if the base directory is path1/ and recurse: true in the ruleset, the regex/glob will try to match path2/target_file.json. When including a path by specifying a glob such as target_path/* the recurse option needs to be set to true otherwise the included path won't be entered and the files it contains will not be masked.

Supported file types

In general, mask_file has been designed to mask XML, JSON, NDJSON, and Apache Avro files. Each file is loaded as a string and passed to the masks. Therefore, to mask a JSON file, a json mask would be implemented, like the following example:

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      - regex: '.*\.json'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*.json"
    rules:
      - masks:
          - type: json
            transforms:
            - path: ['name']
              masks:
                - type: from_fixed
                  value: REDACTED

This would replace the root name attribute in the JSON with the text REDACTED.

Multirecord Files (NDJSON/Avro)

NDJSON (Newline Delimited JSON) (.ndjson) files can be masked by specifying a json mask in the ruleset which will be applied to each line of JSON in the NDJSON file.

For NDJSON files, a similar ruleset can be specified which will be applied to each record of the file, rather than being applied to the entire file.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - masks:
        - type: json
          transforms:
            - path: ['name']
              masks:
                - type: from_fixed
                  value: REDACTED

This would replace the root name attribute for each record in the NDJSON with the text REDACTED.

Apache Avro (.avro) files can be masked in the same way as NDJSON files. When masking Avro files there also needs to be an Avro Schema file (.avsc) in the same directory and with the same name as the .avro file, as the schema file is required to write the masked .avro file to ensure data type consistency. For example, if a file is called data.avro DataMasque will expect to find a schema file called data.avsc in the same directory.

Note: The Avro Schema file (.avsc) will be copied unmasked into the output directory. Deterministic Masking is also supported for multirecord files, for more information please refer to Deterministic Masking with Files.

Similarly, for XML files, use an xml mask:

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    skip:
      - regex: '^(.*).xml'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*.xml"
    rules:
      - masks:
          - type: xml
            transforms:
            - path: 'User/Name'
              node_transforms:
                - type: text
                  masks:
                    - type: from_fixed
                      value: REDACTED

This would replace the content of the node(s) at User/Name with the text REDACTED.

To mask other types of files, basic redaction is possible. For example, to replace the contents of every txt file with the text REDACTED:

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
      - glob: "*.txt"
    rules:
      - masks:
          - type: from_fixed
            value: REDACTED

It is possible to use any mask that accepts text input (or no input), although their effectiveness will depend on the size and content of the input file.

Note also, that files that have not been processed will not be copied from the source to the destination. That is, DataMasque will either load a file (based on skip/include rules), mask it, then copy it to the destination, or it will ignore the file. Unmasked files will not be copied to the destination.


Tabular file masks

Each mask_tabular_file task defines masking operations to be performed on a file or set of files. mask_tabular_file tasks support the following file types:

  • Comma separated (CSV), Tab delimited (TBV), and other character-delimited files
  • Parquet
  • Fixed-width columns

Parameters

Each task with type mask_tabular_file is defined by the following parameters:

  • rules (required): A list of masking rules (or dictionary mapping arbitrary keys to rules) to apply to the table. Rules are applied sequentially according to the order they are listed. Regular (non-conditional) masking rules are defined by the following attributes:
    • column (required): A column within the tabular file intended for masking. This is the header row of the column (for CSVs) or name of column (for parquet files).
    • masks (required): A list of masks (or dictionary mapping arbitrary keys to masks) to apply to the column. Masks are applied sequentially according to the order they are listed, so the output value from each mask is passed to the next mask in the list. Each type of mask has a different set of required attributes, as described for each type in Mask functions.
  • recurse (optional): A boolean value, when set to true any folders in the Data Source will be recursed into and the files contained will also be masked. Defaults to false.
  • workers (optional): The number of parallel workers to use for this masking task. Defaults to 1.
  • skip (optional): Specifies files to not include in the masking run.
    • regex (optional): Files that match this regular expression will not be included for masking.
    • glob (optional): Files that match this glob will not be included for masking.
    • match_against (optional): Choose if the skip rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • include (optional): Specifies files to include in the masking run. If not set, all files will be masked (except those skipped by a skip rule).
    • regex (optional): Files that match this regular expression will be masked.
    • glob (optional): Files that match this glob will be masked.
    • match_against (optional): Choose if the include rule is applied to the whole path or just the filename. Valid choices are path or filename. Defaults to path.
  • encoding (optional): The encoding to use when reading and writing files. Defaults to UTF-8. Refer to Python Standard Encodings for a list of supported encodings.
  • delimiter (optional): The character that is used to delimit columns in character-delimited files (e.g. , for CSVs or \t for TDVs [Tab Delimited Values]). Defaults to ,.
  • quoting (optional): Controls how the masked file will be quoted. Valid values are all,minimal and none. Defaults to minimal. See Quoting options.
  • column_names (optional): An array of string defining the names of the fixed-width columns or non-header character-delimited file and used to refer to them in masking rules. For masking fixed-width files, it is required if fixed_width_extension is specified, and must match the length of fixed_width_columns_indexes.
  • fixed_width_extension (optional): The file extension that fixed-width files have. Not required if no fixed-width files are to be masked. Should not include a leading . (e.g. specify txt not .txt).
  • fixed_width_columns_indexes (optional): An array of two-element arrays of start and end indexes of the fixed width columns. Required if fixed_width_extension is specified.
  • fixed_width_column_names (optional, deprecated): This option was renamed to column_names option. However, it is still supported for backward compatibility.
  • fixed_width_too_wide_action (optional): The action to take if masked data exceeds the width of the column, can be either truncate to truncate the value to fit in the column or error to raise an error and stop the masking run. Defaults to truncate.
  • fixed_width_line_ending (optional): The line ending to use when writing out the fixed width data. Will attempt to be detected from the input file, otherwise defaults to \n.

For more information about:

Note: regex/glob will match to the path from the base directory specified in the source connection, consider adding .*(regex) or *(glob) to the beginning of the expression for matching. For example: If the structure is /path1/path2/target_file.json, if the base directory is path1/ and recurse: true in the ruleset, the regex/glob will try to match path2/target_file.json. When including a path by specifying a glob such as target_path/* the recurse option needs to be set to true otherwise the included path won't be entered and the files it contains will not be masked.

version: "1.0"
tasks:
  - type: mask_tabular_file
    recurse: true
    skip:
      - regex: '.*\.json'
      - glob: "input/*"
    include:
      - glob: "other_inputs/*"
    rules:
      - column: name
        masks:
        - type: from_fixed
          value: REDACTED

Tabular File Type Detection

DataMasque uses file extensions to determine how tabular files are loaded for masking. The extension comparison is not case-sensitive. Files with extension parquet are treated as Apache Parquet files. If a fixed_width_extension is specified (see Fixed Width File Masking Parameters below) then the file is loaded as fixed-width. All other file types are loaded as character-delimited. If the delimiter parameter is unspecified then it defaults to , (i.e. CSV).

Note Delimiters must be a single character, and should be quoted with double quotes. For example: delimiter: " " for a space or delimiter: ";" for a semicolon. Tabs or other invisible delimiters should use the correct escape sequence with a backslash. For example, delimiter: "\t" for a tab.

The skip, include and delimiter options should be used together to specify the correct delimiter for each file type that is to be masked.

Note CSV files require header columns for tabular masking as the header columns are used as column names during masking. CSV and fixed-width files are all string based files, therefore values should be cast to other types if being used with masks that require specific types (e.g. numeric_bucket). To do this use a typecast mask, for more information please refer to Typecast

DataMasque will only attempt to load fixed width files if fixed_width_extension is specified, and will treat any files with this extension as fixed-width. See also Fixed Width File Masking Parameters.

Once files are loaded they are all masked in the same way, that is, rules are executed and applied on a per row/column basis regardless of the original source type. Data will be written back out in the same format as it was read.

Quoting options

Character delimited files (such as CSVs) handle special characters by quoting the field when written to a file. Special characters include the delimiter (e.g. , for CSV), double quotes (") or new line characters.

The field content is wrapped in double quotes, and double quotes are replaced by two double quotes.

For example, the text:

She said, suddenly, "Yes, that's right!"

would be written to a CSV file as:

"She said, suddenly, ""Yes, that's right!"""

This quoting will happen automatically if the field requires quoting, but the behavior can be modified based on the quoting option.

  • all: quote all fields.
  • minimal (default): only quote those fields which contain special characters such as delimiter, double quote or newlines.
  • none: never quote fields. If the delimiter occurs in the output data then an error will be raised.

Fixed Width File Masking Parameters

Masking of fixed-width files is only attempted if fixed_width_extension is specified. If fixed_width_extension is present in the ruleset without fixed_width_columns_indexes and column_names, then an error will be raised. However, it is valid to have fixed_width_columns_indexes and column_names missing if fixed_width_extension is also absent.

If fixed_width_extension is set, then DataMasque will treat any files with that extension as fixed-width and load them based on the other fixed-width options. To assist in explaining the rules, consider an example file called users.txt. This is the content:

Adam    2010-01-01  AAA-1111
Brenda  2010-01-01  EEE-5555
Charlie 2010-02-02  GGG-7777

It has 3 columns, the first containing a name, which is from index 0 to 8. The second column contains a date and spans from 8 to 19. The final column contains a transaction ID and spans from index 19 to 27.

Note these indexes are specified to be contiguous as some fixed-width formats require contiguous columns, therefore a trailing space is included in the first and second columns. DataMasque automatically strips leading and trailing spaces when the data is read. Contiguous columns are not required though, so the same result could be achieved with indexes (0, 7), (8, 18) and (19, 27). When non-contiguous columns are specified DataMasque inserts spaces in between columns.

Since fixed-width files do not have column headers, the ruleset must also specify these. They can be any arbitrary valid column identifier (i.e. alphanumeric string without special characters) and are used to identify the columns in the masking rules. In this case they will be named name, date and transaction_id.

Considering these rules will yield a ruleset like this:

version: "1.0"
tasks:
  - type: mask_tabular_file
    recurse: true
    fixed_width_extension: txt
    fixed_width_columns_indexes:
      - [0, 8]
      - [8, 19]
      - [19, 27]
    column_names:
      - name
      - date
      - transaction_id
    rules:
      - column: name
        masks:
          - type: from_file
            seed_file: DataMasque_firstNames_mixed.csv
            seed_column: firstname-mixed
      - column: date
        masks:
          - type: from_random_date
            min: '1950-01-01'
            max: '2000-12-31'
          - type: typecast
            typecast_as: string
            date_format: '%Y-%m-%d'
      - column: transaction_id
        masks:
          - type: imitate

Note that when this ruleset is executed, DataMasque will still load any CSVs or Parquet files it encounters, however it will use the standard loaders instead of applying the fixed width rules. In this case fixed-width rules will only be used for txt files.

This can be useful if CSV or Parquet files exist in the source with the same columns and need to be masked in the same manner. If these files exist, and they shouldn't be masked, then skip rules should be added to skip them.

Supported Parquet column data types

Scalar types

The following scalar Parquet column data types are supported for masking, and likewise for use with seed_filter_column in a from_file mask, from_column, if, or hash_columns.

  • Null type: null (sometimes called na)
  • Integer types: int8, int16, int32, int64
  • Unsigned integer types: uint8, uint16, uint32, uint64
  • Floating-point types: float, double
  • Boolean type: bool
  • String types: string, large_string
  • Binary types: binary, large_binary
  • Date and time types: date32, date64, timestamp, time32, time64, duration
Complex types

DataMasque supports the complex Parquet data types struct, map, and list, which are sometimes referred to as nested columns.

Note: Map keys must be strings. DataMasque does not support maps with keys of other data types. Map values, and the contents of lists and structs, can be of any supported type: either another struct, map, or list, or any of the supported scalar types listed above.

For masking struct, map, and list columns, use the json mask. In the ruleset, specify the name of the column, and then the JSON path to the field to be masked, treating lists as JSON arrays, and structs and maps as JSON objects. For example, suppose you have the following Parquet column of map type:

column name: staff_by_office
type: map<string, list<struct<first_name: string, surname: string>>>

i.e. each key in the map is an office name, and each value is a list of staff members' names, stored as a struct with fields first_name and surname.

Then the following ruleset will mask the surname of all staff in the north office:

version: "1.0"
tasks:
  - type: mask_tabular_file
    rules:
      - column: staff_by_office
        masks:
          - type: json
            transforms:
              - path:
                  - north
                  - '*'
                  - surname
                masks:
                  - type: from_fixed
                    value: REDACTED

You can also copy one complex column to another of the same type using the from_column mask, and use a value within a complex column as a hash_columns source by specifying a json_path as detailed above.

Complex columns cannot be used as the seed_filter_column in from_file masks.

Conditional masking for Parquet

Complex columns can be referenced in if conditions. Specify the json_path to the field to be compared.

Other data types

Attempts to mask or reference a column of type not mentioned above will cause the task to fail. However, you can still mask other columns in a file containing one or more columns with an unsupported data type.

Choosing files to mask with include/skip

When specifying which files to include or skip for a mask_file/mask_tabular_file task, the order of checking needs to be considered. The skip rules are checked first, followed by the include rules, which means skip takes precedence. Warning! If a file matches both the skip and include rules, that item will not be included in the masking task.

Rules are applied with "or" logic, that is, as long as the file matches any rule in skip or include, it will be skipped/included, respectively.

In comparisons, the file's path does not include a leading /.

By default, include and skip rules are applied to the file's path, relative to the base_directory. By specifying match_against: filename on the include/skip, the rule will be applied to just the file's name.

Filtering is performed only on files, not on directories. If recurse is true, a directory will always be entered into, even if it would be skipped based on filters. If a directory has no files to be masked (they are all skipped, or none are included) then an empty directory will not be created on the target.

glob matching does not support the ** syntax for multi-level directory matching. If the depth of directories is unknown, or more advanced path matching is required regex filtering should be used.

These rules are explained in more detail below, with some examples and common file filtering scenarios.

include/skip examples and common scenarios

In general, glob is used for simple matching, and regex should be considered when more advanced rules are required. Note that some of these examples use a mask_file task and others use a mask_tabular_file task, however there is no difference in the way the file matching rules are executed between these different task types.

Mask all files with a certain extension

The ruleset will include all files that end with .xml inside the base directory and any subdirectories.

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
    - glob: "*.xml"

The recurse option will enter every subdirectory. Since the match_against option has not been specified the default path option is used, so the glob is applied against the file's path. However since a path like dir1/dir2/file.xml and a filename like file.xml both end with .xml they both match the specified glob; in this case specifying match_against: filename would have no effect.

Mask all files with a specific start and end

This ruleset will mask any file that starts with user and has a .json extension. For example, user_1.json, users_file.json, etc. These files can be anywhere inside the source. The match_against option is set to filename.

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
    - regex: "^user.*\.json$"
      match_against: filename

Explaining the entire regular expression syntax is beyond the scope of these examples, but since this pattern is quite common we will briefly explain how it works:

  • ^ means the expression must match start of the path, it can't just be found somewhere in the middle.
  • user has no special meaning, so it matches the literal user part of the path.
  • .* has two parts, . means match any character and * means match the preceding expression (the .) any number of times. So together they mean match any number of any characters.
  • \. matches a literal .. The \ escapes the . so it doesn't behave as matching any character.
  • json has no special meaning, so it matches the literal json extension.
  • $ means the expression must also match at the end of the path, .json can't just appear somewhere in the middle.

match_against: filename is specified so that the regex pattern does not match user in a directory name. For example, we do not want it to match the path user_data/file_1.json, but that full path does satisfy the regex. By specifying match_against: filename, the regex is applied only to the filename part (file_1.json) which does not match.

Notes:

  • Regex patterns only match at the beginning of the path (or filename, if using match_against: filename). So the pattern above will not match some_user_data.json as the word user is not at the start of the filename, even if the leading ^ start anchor is omitted from the regex.
  • When using match_against: filename, a glob or regex with a / in it will cause the pattern to never match anything, since filenames never contain a /. DataMasque does not prevent you from using a / in a pattern when matching against filename, so if there are files you expect to be included, but they are not, this could be the reason why.
  • match_against: filename has no effect when using only glob patterns. Globs always match at the end of the path, and wildcards do not match across directory separators (/). It follows that the above example can be rewritten more succinctly as glob: user*.json with no match_against parameter. However, were the regular expression more complex, it may not be expressible as a glob pattern.
Mask files of a certain type only in the base directory

To mask CSV files only in the base directory, disable the recurse option.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - glob: "*.csv"

Without recurse: true, subdirectories will not be entered into, so only .csv files in the base directory will be masked.

Mask files in nested directories of known depth

In this example, there is a directory layout like this:

  • users1.csv
  • users2.csv
  • staff/users1.csv
  • staff/users2.csv
  • staff/management/users1.csv
  • staff/management/users2.csv
  • staff/contractors/users1.csv
  • staff/contractors/users2.csv

(Assume there are also some files of other types that are not to be masked).

All users*.csv files should be masked, except those in the base directory. Since we know the depth of the directories (staff/ is one level; staff/management/ and staff/contractors/ are two levels) two globs can be added to match these.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - glob: "staff/*.csv"
    - glob: "staff/*/*.csv"

Since the matching is performed on the path, none of the files in the base directory match. Files in the staff/ directory match the first glob and files in staff/management/ or staff/contractors/ match the second. Since include use OR logic then as long as at least one include rule matches a file will be included.

Note: Globs also match at the end of a more deeply-nested path. For example, with the above globs, folder1/folder2/staff/file.csv also matches. If you want to avoid this behaviour, use a regex starting with ^staff/ to ensure that the pattern only matches a folder named staff when it is in the base directory - see the following section.

Mask files in all nested directories

If the depth of the directories is unknown, or could be very deep, so that a lot of glob rules need to be created, then using regex to apply regular expressions to the path can be a better option.

Following on from the previous example, CSV files might be nested inside a deep directory tree. The following ruleset uses a regex to match any .csv file inside staff/ or any of its child directories.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - regex: "^staff/.*\.csv$"

Any path that starts with staff/ and ends with .csv will be matched by this regex.

Often you will be able to apply a slightly modified version of this regex, by just changing the start and end to match your particular paths and file extensions.

For more information on the regular expression syntax used by DataMasque, check the Python Regex Howto.

Mask all files with a given extension, except those with a certain prefix

In this example all .avro files in the base directory will be masked, except the ones starting with users*. This means files like sales_1.avro, data.avro, orders-for-2020.avro will be masked, but users-list.avro and users-exported.avro would not be. It uses an include rules to include all .avro files and a skip rule to skip any starting with users.

version: "1.0"
tasks:
  - type: mask_file
    include:
    - glob: "*.avro"
    skip:
    - glob: "users*.avro"
      match_against: filename

The file users-exported.avro matches both *.avro and users*.avro, but since skip takes precedence the file will not be included. The file data.avro only matches *.avro so it will be included.

Mask files in all nested directories, excluding those with a given prefix

This example will include all .csv files in the staff/ directory, except those that start with data. For example, it matches staff/sales.csv and staff/management/users.csv but not staff/data01.csv or staff/management/data-list.csv.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - regex: "^staff/.*\.csv$"
    skip:
    - glob: "data*.csv"
      match_against: filename

This ruleset show that glob and regex can be combined to produce flexible and simple matching rules.

Mask files that match a particular pattern.

This example includes XML files that start with users_, then have at least one digit (the \d+ parameter). But since it uses match_against: filename it will not match against the path, therefore files inside a directory starting with users_ won't match.

version: "1.0"
tasks:
  - type: mask_tabular_file
    include:
    - regex: "^users_\d+.xml$"
      match_against: filename

Some example results are:

  • users_5.xml: match
  • users_10.xml: match
  • users_list.xml: no match (the filename doesn't match the regex)
  • users_5/file.xml: no match (the path matches the regex but comparison is performed against the filename)
  • users_5/users_10.xml: match (the filename users_10.xml matches the regex)

Masking Functions

Masked data is generated by mask functions. Each mask_table/mask_file/mask_tabular_file task specifies a set of rules that use these mask functions to mask your data based on your configuration.

Full masking function details are available on the Masking Functions page.

Here is an outline of the masking functions available:

Definitions

You can make use of YAML anchors and aliases to assign a label (anchor) to a block of configuration, and then reference that name (via aliases) to re-use the configuration throughout your ruleset, extending or overriding specific properties/keys in the configuration as needed.

You can define an anchor with &, then refer to it using an alias, denoted with *. You can use overrides with the characters <<: to add more values, or override existing ones on an alias.

Shown below is an example that sets 4 values with an anchor, uses 3 of those values with an alias, and overrides the 2nd value. Both tasks share the values for type, key and rules using the anchor &mask_table_opts and the alias *mask_table_opts. The value for table is merged/overridden by << or merge operator.

version: '1.0'
tasks:
  - &mask_table_opts
    type: mask_table
    table: details
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
  - <<: *mask_table_opts
    table: customers

As you may commonly want to re-use the definition of a task, rule, or mask in multiple locations (often with minor variations), DataMasque allows you to group such definitions under the task_definitions, rule_definitions, and mask_definitions configuration properties of a ruleset. The following subsections demonstrate how to use anchors and aliases with such definitions.

Task definitions (task_definitions)

The task_definitions attribute of the ruleset may contain a list of task definitions to be referenced from the ruleset's tasks attribute through YAML anchors and aliases. Each task definition must follow the same syntax as a task in the tasks attribute (see Database Task Types for the full list of available task types).

For example, the following ruleset has an anchor named &mask_table_customer_info under task_definitions that is inherited three times in the tasks list via the alias *mask_table_customer_info. Each time the definition is re-used, the table attribute is overridden:

version: '1.0'
task_definitions:
  - &mask_table_customer_info
    type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when anchor `&mask_table_customer_info` is referred in alias
    table: PLACEHOLDER
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999

tasks:
  - <<: *mask_table_customer_info
    table: details
  - <<: *mask_table_customer_info
    table: customers
  - <<: *mask_table_customer_info
    table: orders

This is equivalent to specifying the three mask_table tasks separately with the same key, type, and rules but applied to different tables, as below:

Show equivalent ruleset without anchors and aliases

version: '1.0'
tasks:
  - type: mask_table
    table: details
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999
  - type: mask_table
    table: customers
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: customer_id
        masks:
          - type: from_random_text
            max: 10
      - column: postcode
        masks:
          - type: from_random_number
            max: 9999

Rule definitions (rule_definitions)

The rule_definitions attribute of the ruleset may contain a list of rule definitions to be referenced through YAML anchors and aliases. Each rule definition must follow the same syntax as a rule in a mask_table task's rules.

For example, the following ruleset has an anchor named &postcode_rule under rule_definitions that is inherited twice in a list of rules via the alias *postcode_rule. Each time the definition is re-used, the column attribute is overridden:

version: '1.0'
rule_definitions:
  - &postcode_rule
    column: postcode
    masks:
      - type: from_random_number
        max: 9999

tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - <<: *postcode_rule
        column: target_postcode
      - <<: *postcode_rule
        column: destination_postcode

This is equivalent to specifying the two rules separately with the same masks but applied to different columns, as below:

Show equivalent ruleset without anchors and aliases

version: '1.0'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: target_postcode
        masks:
          - type: from_random_number
            max: 9999
      - column: destination_postcode
        masks:
          - type: from_random_number
            max: 9999

Mask definitions (mask_definitions)

The mask_definitions attribute of the ruleset may contain a list of mask definitions to be referenced from a list of masks through YAML anchors and aliases. Each mask definition must follow the same syntax as a mask in the masks attribute of a rule (see Mask Functions for the full list of available mask types).

For example, the following ruleset has an anchor named &street_mask under mask_definitions that is inherited twice in a masks list via the alias *street_mask. Each time the definition is re-used, the table_filter_column attribute is overridden:

version: '1.0'
mask_definitions:
  - &street_mask
    type: from_file
    seed_file: DataMasque_address-example.csv
    seed_column: street_name
    table_filter_column: city
    seed_filter_column: city
    fallback_masks:
      - type: from_fixed
        value: 'Main Street'
tasks:
  - type: mask_table
    table: orders
    key: id
    rules:
      - column: target_street
        masks:
          - <<: *street_mask
            table_filter_column: target_city
      - column: destination_street
        masks:
          - <<: *street_mask
            table_filter_column: destination_city

This is equivalent to specifying the two masks separately with identical configuration except for different table_filter_column values, as below:

Show equivalent ruleset without anchors and aliases

version: '1.0'
tasks:
  - type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when anchor `&mask_table_customer_info` is referred in alias
    table: orders
    key: id
    rules:
      - column: target_street
        masks:
          - type: from_file
            seed_file: DataMasque_address-example.csv
            seed_column: street_name
            table_filter_column: target_city
            seed_filter_column: city
            fallback_masks:
              - type: from_fixed
                value: 'Main Street'
      - column: destination_street
        masks:
          - type: from_file
            seed_file: DataMasque_address-example.csv
            seed_column: street_name
            table_filter_column: destination_city
            seed_filter_column: city
            fallback_masks:
              - type: from_fixed
                value: 'Main Street'

Nested definitions

It is possible to have nested definitions under a definition. For example, a nested rule definition &customer_info_street_address can be anchored under a task definition &mask_table_customer_info.

version: '1.0'
task_definitions:
  - &mask_table_customer_info
    type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when anchor `&mask_table_customer_info` is referred in alias
    table: PLACEHOLDER
    key: id
    rules:
      - &customer_info_street_address
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
tasks:
  - <<: *mask_table_customer_info
    table: details
    rules:
      - <<: *customer_info_street_address
        column: pickup_street

Inherit definitions with dictionaries

As it is not possible to use anchors and aliases to override individual items in a YAML list, DataMasque allows you to alternatively specify lists of rules, joins, and masks as dictionaries of arbitrary keys to their respective values.

A dictionary is represented in a simple key: value form (the colon must be followed by a space):

# A mask rules example in dictionary with arbitrary keys
rules:
  customer_info_postcode: *postcode_rule
  customer_info_customer_id: *customer_id_rule
  customer_info_destination_street: *destination_street_rule

The above dictionary is identical to following list while the rules are executed:

# A mask rules example in list
rules:
  - *postcode_rule
  - *customer_id_rule
  - *destination_street_rule

The following example demonstrates how specifying rules as a dictionary allows you to easily override individual rules by their key using anchors and aliases.

By specifying rules as a dictionary in task definition &mask_table_customer_info and by using anchors and aliases to reference the rules dictionary and individual rules, we were able to override the entire customer_info_postcode rule in the first task, and just the column of the customer_info_customer_id rule in the second task.

version: '1.0'

rule_definitions:
  - &postcode_rule
    column: postcode
    masks:
      - type: from_random_number
        max: 9999
  - &customer_id_rule
    column: customer_id
    masks:
      - type: from_random_text
        max: 10

task_definitions:
  - &mask_table_customer_info
    type: mask_table
    # PLACEHOLDER is a placeholder table name to be merged/overridden
    # when anchor `&mask_table_customer_info` is referred in alias
    table: PLACEHOLDER
    key: id
    rules: &customer_info_rules
      customer_info_postcode: *postcode_rule
      customer_info_customer_id: *customer_id_rule
      customer_info_destination_street:
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
          table_filter_column: destination_city
          seed_filter_column: city
          fallback_masks:
            - type: from_fixed
              value: 'Main Street'
tasks:
  - <<: *mask_table_customer_info
    table: details
    rules:
      <<: *customer_info_rules
      customer_info_postcode:
        column: post
        masks:
          - type: from_fixed
            value: 1234
  - <<: *mask_table_customer_info
    table: orders
    rules:
      <<: *customer_info_rules
      customer_info_customer_id:
        <<: *customer_id_rule
        column: ext_customer_id

These changes can be seen in the equivalent expanded ruleset below:

Show equivalent ruleset without anchors and aliases

version: '1.0'
tasks:
  - type: mask_table
    table: details
    key: id
    rules:
      rule0:
        column: post
        masks:
          - type: from_random_number
            max: 9999
      rule1:
        column: customer_id
        masks:
          - type: from_random_text
            max: 10
      rule2:
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
          table_filter_column: destination_city
          seed_filter_column: city
          fallback_masks:
            - type: from_fixed
              value: 'Main Street'
  - type: mask_table
    table: orders
    key: id
    rules:
      customer_info_postcode:
        column: postcode
        masks:
          - type: from_random_number
            max: 9999
      customer_info_customer_id:
        column: ext_customer_id
        masks:
          - type: from_random_text
            max: 10
      customer_info_destination_street:
        column: destination_street
        masks:
        - type: from_file
          seed_file: DataMasque_address-example.csv
          seed_column: street_name
          table_filter_column: destination_city
          seed_filter_column: city
          fallback_masks:
            - type: from_fixed
              value: 'Main Street'

Note: Dictionaries can be used instead of lists in the following ruleset attributes:

Note: The order of items in above dictionaries in DataMasque ruleset will be respected, so the operations will be performed in the order you specify them.

Advanced ruleset features

Default values to skip

You can specify a list of skip_defaults values at the ruleset level:

version: '1.0'
skip_defaults:
  - null
  - ''
  - matches: 'administrator_.*'
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: role
        masks:
        - type: from_fixed
          value: 'customer'

These will be used as the skip configuration for any masking rule that does not have skip explicitly set.

Disabling skip_defaults

To disable the skip_defaults for a particular column, set the skip option for that column to an empty array: skip: [].

Example
  • role column:
    • The role column does not have a specific skip rule defined, so it will use the skip_defaults specified at the top of the ruleset. This means that any null or empty string ('') values in the role column will not be masked; they will be skipped over.
  • details column:
    • The details column, however, explicitly sets skip: [], which overrides and disables the skip_defaults for this column. As a result, even if the value is null or an empty string (''), it will still be masked according to the rule provided (in this case, replaced with the string REDACTED).
version: '1.0'
skip_defaults:
  - null
  - ''
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: role
        masks:
        - type: from_fixed
          value: 'customer'
      - column: details
        skip: []
        masks:
        - type: from_fixed
          value: 'REDACTED'
Summary
  • skip_defaults: Defines which values (like null or '') should not be masked by default.
  • skip: []: Overrides the skip_defaults for a particular column, ensuring that every value in that column (even those normally skipped) will be masked.

Freezing random values

A run_secret string can be provided as an option when starting a run. If you provide a run_secret, then all randomly chosen and generated values will be exactly the same for multiple runs with the same secret. This can be useful for testing or retaining consistency across multiple masking runs. The run_secret can consist of any sequence of at least 20 characters, however it is recommended to use a randomly generated string.

Notes:

  • Output values will be frozen only across identical repeated masking runs. Variations in database content (such as inserted or deleted rows) or Run Options (i.e. Batch Size, Max rows) can still result in different outputs even when using the same run_secret. Additionally, using a non-unique key may result in non-deterministic results. To retain consistent masking results across variable data sources, use the Deterministic masking features.
  • As the parallelised unloading of Amazon Redshift data into S3 files cannot guarantee consistent ordering between runs, a fixed run_secret may still result in different outputs between masking runs on Amazon Redshift databases. To retain consistent masking results with Amazon Redshift, use the Deterministic masking features.

Deterministic masking

If you wish to make the masked values deterministic based on one or more input values, you can provide hash columns in a ruleset. DataMasque will use the hashed value(s) from the specified sources as one of the random generator inputs (along with the instance seed and the run secret). This allows for consistent output values to be generated even across different databases or inside documents.

For database or tabular file masking, this is done with the use of a hash_columns argument for each column being masked. For file masking, the parameter is called hash_sources and the hash used for the entire file.

DataMasque will use the hashed value(s) as one of the random generator inputs (along with the instance seed and the run secret). This allows for consistent output values to be generated even across different databases, inside documents (XML/JSON) or when masking files.

Changes to hashing values in 2.8.0: Prior to DataMasque 2.8.0, values to hash were converted to strings then hashed. This may have resulted in duplicate output values when hashing on null. From version 2.8.0, hashing occurs on binary representations of data. This means that hashed outputs from DataMasque 2.8.0 onwards will differ to those generated by prior versions.

Deterministic masking with databases or tabular files

Each hash_columns item can either be just the name of the column, or can specify a json_path or xpath to fetch a hash value from inside JSON or XML, respectively. For more information on json_path or xpath please refer to the JSON documentation/XML documentation. For JSON/XML hash_columns the following logic is applied:

  • If the value extracted from the json_path/xpath is null the hash will be performed on the null value.
  • If the value does not exist at the json_path/xpath the hash will be performed on null.
  • If the JSON/XML data cannot be decoded (is invalid) the hash will be performed on the entire column, even if null.

hash_columns can also trim (strip) leading/trailing white space by setting the trim parameter to true.

Note Since null is the default fallback for invalid data, this may result in unexpected repeated masked values. Warnings will be added to the run logs when this fallback occurs. Both json_path and xpath optional parameters cannot be set for the same hash column. It is possible to hash a column on to itself (i.e. hash_columns contains the current column). If you hash a column on itself, and that is used as the hash for other columns, the mask on the hash column should come last in the rules. If not, that column will use the pre-masked value as the hash and other columns use the post-masked value, meaning different values will be generated in the same row for the same mask definition.

Example

This example will mask the date_of_birth column with a date value that has been deterministically generated based on the hash of date_of_birth and first_name column values combined with a one-off randomly generated secret for this run.

For example, in every row where date_of_birth = '2000-01-01' and first_name = 'Carl', the date_of_birth will be replaced with a deterministically generated value (e.g. 1999-03-03). This same replacement value will be generated for repeated rows that match the same date_of_birth and first_name values within this run. However, this value will be different for each run. To retain consistency of deterministically generated values between multiple runs, see Consistent masking.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        hash_columns:
          - date_of_birth
          - name
        masks:
          - type: from_random_date
            min: '1980-01-01'
            max: '2000-01-01'

An equivalent alternative version of YAML syntax is shown below:

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: date_of_birth
        hash_columns:
          - column_name: date_of_birth
          - column_name: name
        masks:
          - type: from_random_date
            min: '1980-01-01'
            max: '2000-01-01'

Show result

Before After
date_of_birth first_name
2000-01-01 Carl
1965-11-10 Ria
2000-01-01 Carl
2000-01-01 Jose
1990-05-31 Thomas
1999-07-31 Nicole
date_of_birth first_name
1999-03-03 Carl
1986-05-16 Ria
1999-03-03 Carl
1991-03-24 Jose
1971-05-20 Thomas
1962-12-11 Nicole

Example with xpath

This example will mask the date_of_birth column with a date value that has been deterministically generated based on the hash of the date_of_birth attribute from the following XML document column.

<Root>
    <Info>
        <Employee date_of_birth="2022-10-08">Steve</Employee>
        <Employee date_of_birth="2022-10-08">Fred</Employee>
    </Info>
</Root>

To get the date_of_birth attribute from the XML document we need to specify the path with an xpath. The following xpath will get the date attribute from the first Employee element of the XML document, Info/Employee[1]/@date_of_birth. If you wanted to perform the hash on a list of date_of_birth attribute's values you can specify Info/Employee/@date_of_birth.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: date_of_birth
        hash_columns:
          -  column_name: xml_data
             xpath: "Info/Employee/@date_of_birth"
        masks:
          - type: from_random_text
            min: 5
            max: 10

As shown in the results below, when the date_of_birth attribute has the same values for both Employee elements the resulting masked date_of_birth column contain the same values.

Show result

Before After
xml_data date_of_birth
<Root>
    <Info>
      <Employee date_of_birth="2000-10-08">Steve</Employee>
      <Employee date_of_birth="2022-10-08">Fred</Employee>
    </Info>
</Root>
1989-01-09
<Root>
  <Info>
    <Employee date_of_birth="2001-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1990-07-14
<Root>
  <Info>
    <Employee date_of_birth="2002-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1991-11-08
<Root>
  <Info>
    <Employee date_of_birth="2003-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1997-03-14
<Root>
  <Info>
    <Employee date_of_birth="2004-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1984-02-02
xml_data date_of_birth
<Root>
  <Info>
    <Employee date_of_birth="2000-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1992-08-07
<Root>
  <Info>
    <Employee date_of_birth="2001-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1993-11-14
<Root>
  <Info>
    <Employee date_of_birth="2000-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1992-08-07
<Root>
  <Info>
    <Employee date_of_birth="2000-10-08">Steve</Employee>
    <Employee date_of_birth="2012-10-08">Fred</Employee>
  </Info>
</Root>
1998-05-12
<Root>
  <Info>
    <Employee date_of_birth="2000-10-08">Steve</Employee>
    <Employee date_of_birth="2022-10-08">Fred</Employee>
  </Info>
</Root>
1992-08-07

Deterministic masking with files

File masking tasks (mask_file) support deterministic generation of masked values which are based on input values. You can provide a hash_sources list in a rule, similar to the hash_columns for database masking. However, with hash_sources you must specify additional parameters to fetch the hash.

Parameters

  • json_path (optional): A path to the value in the JSON data to use for the hash.
  • xpath (optional): An Xpath to the value stored in the XML data to use for the hash.
  • file_path (optional): A boolean value, when set to true the hash will be performed on the file's path (relative to the base directory). Defaults to false.
  • case_transform (optional): upper/lower - Apply a case transform on the input value, for consistent hash values irrespective of case.
  • trim (optional): If the value extracted from the json_path, or xpath contains leading or trailing white space, it can be trimmed by setting this option to true. Defaults to false.

Notes Each of the json_path, xpath and file_path are shown as optional but exactly one of these needs to be specified. For more information on json_path or xpath please refer to the JSON documentation/XML documentation. hash_sources for multirecord files works the same as for object file, except the hash_sources are applied to each record in the file instead of the entire file.

When hash_sources are specified as part of a json transform or xml node_transform:

  • The json_path/xpath should be a relative path from the current node.
  • If hash_sources are specified as part of the task and within the json/xml mask:
  • The hash_sources will be overwritten with the hash_sources specified in the json/xml mask.
  • file_path is no longer a valid parameter.

Example with json_path

This example will mask a JSON file which contains the following JSON data:

{
  "users": {
    "name": "Richard",
    "addresses": [
      {
        "type": "postal",
        "city": "Nelson"
      },
      {
        "type": "physical",
        "city": "Napier"
      }
    ]
  }
}

We want to mask the city of the users, hashed with the name of the users. To do this we need to specify a json_path to the value we want to use for the hash ([users,name]) in the hash_sources, as shown in the ruleset below. The masked value we would want to replace the city with, would be best if it was from one of the seed files DataMasque provides.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - json_path: ['users', 'name']
        masks:
          - type: json
            transforms:
              - path: ['users','addresses', '*','city']
                masks:
                  - type: from_file
                    seed_file: DataMasque_mixed_countries_addresses.csv
                    seed_column: city

As shown in the results below, when the name has the same values for both users the resulting masked values for the city is also the same.

Note Each row of the results table below represents a separate file.

Show result

Before After
JSON data
{"users":
  {"name": "Evelyn","addresses": [
    {"type": "postal","city": "Boronia"},
    {"type": "physical","city": "Cramphorne"}
  ]}}
{"users":
  {"name": "Sarah","addresses": [
    {"type": "postal","city": "Muckleford"},
    {"type": "physical","city": "Bandy Creek"}
  ]}}
{"users":
  {"name": "Bob","addresses": [
    {"type": "postal","city": "Doncaster"},
    {"type": "physical","city": "Shenton"}
  ]}}
{"users":
  {"name": "Sarah","addresses": [
    {"type": "postal","city": "Woronora"},
    {"type": "physical","city": "Lee Point"}
  ]}}
{"users":
  {"name": "Evelyn","addresses": [
    {"type": "postal","city": "Nelson"},
    {"type": "physical","city": "Napier"}
  ]}}
JSON data
{"users":
  {"name":"Evelyn","addresses":[
    {"type":"postal","city":"Stamford"},
    {"type":"physical","city":"Stamford"}
  ]}}
{"users":
  {"name":"Sarah","addresses":[
    {"type":"postal","city":"Canning Vale"},
    {"type":"physical","city":"Canning Vale"}
  ]}}
{"users":
  {"name":"Bob","addresses":[
    {"type":"postal","city":"Bolivia"},
    {"type":"physical","city":"Bolivia"}
  ]}}
{"users":
  {"name":"Sarah","addresses":[
    {"type":"postal","city":"Canning Vale"},
    {"type":"physical","city":"Canning Vale"}
  ]}}
{"users":
  {"name":"Evelyn","addresses":[
    {"type":"postal","city":"Stamford"},
    {"type":"physical","city":"Stamford"}
  ]}}

If case_transform is specified, then the casing of the hash input values will not affect the consistency of the final masked result.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - json_path: ['users', 'name']
            case_transform: lower
        masks:
        - type: json
          transforms:
          - path: ['users','addresses', "*",'city']
            masks:
            - type: from_file
              seed_file: DataMasque_mixed_countries_addresses.csv
              seed_column: city

Note Each row of the results table below represents a separate file.

Show result

Before After
JSON data
{"users":
  {"name": "EVELYN","addresses": [
    {"type": "postal","city": "Boronia"},
    {"type": "physical","city": "Cramphorne"}
  ]}}
{"users":
  {"name": "SARAH","addresses": [
    {"type": "postal","city": "Muckleford"},
    {"type": "physical","city": "Bandy Creek"}
  ]}}
{"users":
  {"name": "Bob","addresses": [
    {"type": "postal","city": "Doncaster"},
    {"type": "physical","city": "Shenton"}
  ]}}
{"users":
  {"name": "Sarah","addresses": [
    {"type": "postal","city": "Woronora"},
    {"type": "physical","city": "Lee Point"}
  ]}}
{"users":
  {"name": "evelyn","addresses": [
    {"type": "postal","city": "Nelson"},
    {"type": "physical","city": "Napier"}
  ]}}
JSON data
{"users":
  {"name":"EVELYN","addresses":[
    {"type":"postal","city":"Stamford"},
    {"type":"physical","city":"Stamford"}
  ]}}
{"users":
  {"name":"SARAH","addresses":[
    {"type":"postal","city":"Canning Vale"},
    {"type":"physical","city":"Canning Vale"}
  ]}}
{"users":
  {"name":"Bob","addresses":[
    {"type":"postal","city":"Bolivia"},
    {"type":"physical","city":"Bolivia"}
  ]}}
{"users":
  {"name":"Sarah","addresses":[
    {"type":"postal","city":"Canning Vale"},
    {"type":"physical","city":"Canning Vale"}
  ]}}
{"users":
  {"name":"evelyn","addresses":[
    {"type":"postal","city":"Stamford"},
    {"type":"physical","city":"Stamford"}
  ]}}

If trim is specified as true, then any whitespace around the hash input values will not affect the consistency of the final masked result. In the example below, trim has been applied to the hash_source, and values in the hash_source containing whitespace have not affected the final masked result.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - json_path: ['users', 'name']
            trim: true
        masks:
        - type: json
          transforms:
          - path: ['users','addresses', "*",'city']
            masks:
            - type: from_file
              seed_file: DataMasque_mixed_countries_addresses.csv
              seed_column: city

Show result

Before After
JSON data
{"users":
  {"name": "Evelyn  ","addresses": [
    {"type": "postal","city": "Boronia"},
    {"type": "physical","city": "Cramphorne"}
  ]}}
{"users":
  {"name": " Sarah ","addresses": [
    {"type": "postal","city": "Muckleford"},
    {"type": "physical","city": "Bandy Creek"}
  ]}}
{"users":
  {"name": "Bob","addresses": [
    {"type": "postal","city": "Doncaster"},
    {"type": "physical","city": "Shenton"}
  ]}}
{"users":
  {"name": "Sarah  ","addresses": [
    {"type": "postal","city": "Woronora"},
    {"type": "physical","city": "Lee Point"}
  ]}}
{"users":
  {"name": "  Evelyn","addresses": [
    {"type": "postal","city": "Nelson"},
    {"type": "physical","city": "Napier"}
  ]}}
JSON data
{"users":
  {"name":"Evelyn  ","addresses":[
    {"type":"postal","city":"Stamford"},
    {"type":"physical","city":"Stamford"}
  ]}}
{"users":
  {"name":" Sarah ","addresses":[
    {"type":"postal","city":"Canning Vale"},
    {"type":"physical","city":"Canning Vale"}
  ]}}
{"users":
  {"name":"Bob","addresses":[
    {"type":"postal","city":"Bolivia"},
    {"type":"physical","city":"Bolivia"}
  ]}}
{"users":
  {"name":"Sarah  ","addresses":[
    {"type":"postal","city":"Canning Vale"},
    {"type":"physical","city":"Canning Vale"}
  ]}}
{"users":
  {"name":"  Evelyn","addresses":[
    {"type":"postal","city":"Stamford"},
    {"type":"physical","city":"Stamford"}
  ]}}

Example with xpath

This example will mask two different XML files but generate consistent masked data by using user id as a hash. The first contains a user's info.

<User id="8472">
    <Name>Evelyn</Name>
    <Address>
        <City>Nelson</City>
    </Address>
</User>

The second contains a list of user's log messages:

<Logs userId="8472">
    <Message to="Evelyn" date="2022-08-09" type="welcome"/>
    <Message to="Evelyn" date="2022-08-10" type="forgot_password"/>
</Logs>

In both cases, we want the user's name to be consistent, and this can be done by specifying the user's ID as the hash source for each file. We'll create a ruleset with two mask_file tasks, and use includes to map the right task to the correct file.

The first ruleset will mask the User/Name text value, and the second will mask the to attribute of each Log/Message element. The first ruleset will use the id attribute on User as the hash source, while the second will use the userId attribute on Logs.

Here's the ruleset to accomplish this:

version: "1.0"
tasks:
  - type: mask_file
    include:
      - glob: user_info_*.xml
    rules:
      - hash_sources:
        - xpath: 'User/@id'
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<User />'
            transforms:
              - path: 'User/Name'
                node_transforms:
                - type: text
                  masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
  - type: mask_file
    include:
      - glob: user_logs_*.xml
    rules:
      - hash_sources:
        - xpath: 'Logs/@userId'
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Logs />'
            transforms:
              - path: 'User/Name'
                node_transforms:
                - type: attribute
                  attributes: 'to'
                  masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv

The two output files would use consistent names. The first file:

<User id="8472">
    <Name>Sarah</Name>
    <Address>
        <City>Nelson</City>
    </Address>
</User>

And the second:

<Logs userId="8472">
    <Message to="Sarah" date="2022-08-09" type="welcome"/>
    <Message to="Sarah" date="2022-08-10" type="forgot_password"/>
</Logs>

Example with file_path

This example will just show an example ruleset to mask files based on the file path: This applies to any mask_file task.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - file_path: true
        masks:
          - type: json
            transforms:
              - path: ['users','addresses', '*','city']
                masks:
                  - type: from_file
                    seed_file: DataMasque_mixed_countries_addresses.csv
                    seed_column: city

With this ruleset the hashing will be performed on the file path, so each file with a different path with have differing masked values. This is useful if you want repeatable masking based on the name of a file.

Example with relative json_path

Referring to the document from the JSON path example above, DataMasque can generate a hash from a value that is relative to the current node being masked.

In the following example, the node being masked will be one of the ['users', 'addresses', '*','city'] text nodes.

The string .. is used to navigate up through the document hierarchy (analogous to using cd .. to move up a directory when browsing a files in a terminal).

First we show the ruleset, then explain the hash source's json_path.

version: "1.0"
tasks:
  - type: mask_file
    rules:
      masks:
        - type: json
          transforms:
            - path: ['users','addresses', '*','city']
              hash_sources:
                - json_path: ['..', '..', '..', 'name']
              masks:
                - type: from_file
                  seed_file: DataMasque_mixed_countries_addresses.csv
                  seed_column: city

Each transform in transforms is applied to each city found at path. There is only one transform in this example.

To fetch the name of the user for that particular city, to use in hashing, we must ascend up three nodes by using the .. operator three times:

  1. First ascending from the city text node to the address object.
  2. Then to the addresses array containing the list of addresses.
  3. Finally, to the user object that contains the addresses array.

The current node is now the user object. The final element in the json_path (name) can be used to fetch the name of that user.

In this way the city used in the from_file mask is generated based on the hash of the name of that user.

Example with relative xpath

Since the xml mask uses standard XPaths for node traversal, normal relative XPath syntax may be applied, as shown in this next example.

version: "1.0"
tasks:
  - type: mask_file
    include:
      - glob: user_info_*.xml
    rules:
      masks:
        - type: xml
          fallback_masks:
            - type: from_fixed
              value: '<User />'
          transforms:
            - path: '/User/Name'
              node_transforms:
                - type: text
                  hash_sources:
                    - xpath: '../@id'
                  masks:
                    - type: from_file
                      seed_file: DataMasque_firstNames_mixed.csv
  - type: mask_file
    include:
      - glob: user_logs_*.xml
    rules:
      masks:
        - type: xml
          fallback_masks:
            - type: from_fixed
              value: '<Logs />'
          transforms:
            - path: '/User/Name'
              node_transforms:
                - type: attribute
                  attributes: 'to'
                  hash_sources:
                    - xpath: '../@userId'
                  masks:
                    - type: from_file
                      seed_file: DataMasque_firstNames_mixed.csv

The above example uses two hash_sources with xpath attributes. The first uses ../@id to fetch the id attribute of the parent XML node (<User>), and the second similarly uses ../@userId to fetch the userId attribute.

Common regular expression patterns

A regular expression (or "regex") is a sequence of characters that acts as a search pattern to filter or select substrings of text strings. They can be used in replace_regex masks and matches conditions of if and skip blocks. Note that it is best practice to wrap the regular expression in quotes to avoid special characters being misinterpreted as YAML syntax:

version: '1.0'
tasks:
  - type: mask_table
    table: driversLicence
    key: ID
    rules:
      - column: driversLicence
        masks:
          - type: replace_regex
            regex: "[0-9]"
            masks:
              - type: from_fixed
                value: "#"

Notes: DataMasque uses the Python regular expression syntax, which you can learn more about in the official tutorial.

Regular expressions allow you to match various characters by specifying a character or character set. Below are a list of characters sets that can be used in regular expressions.

Characters can be specified as character classes.

  • "[0-9]" can be used to match any numerical character.
  • "[a-z]" and "[A-Z]" will match lower case and upper case alphabetical characters respectively.
  • Classes can also be combined; "[a-zA-Z]" will match any upper or lower case character.
  • You may also specify a list of characters to match within square brackets. If you wish to only match vowels, you can use "[aeiouAEIOU]".
  • If a caret character is added before any character within the character class, it will instead match something other than the specified character. "^[0-9]" will match any character other than a numerical character.

The table below detail other ways to denote certain character type without the use of character classes.

Character Description
\d One digit character, ranging from 0 to 9.
\w One 'word' character; an ascii letter, a digit or an underscore.
\s One whitespace character. This can be a space, a line break, or a tab.
\D One character that is NOT a digit.
\W One character that is NOT a word character.
\S One character that is NOT a whitespace character.

Note that when used in double-quoted strings in YAML, literal backslashes must be escaped by backslashes, for example, "\\d" will be interpreted as \d. We recommend quoting YAML strings with single quotes to avoid the need for this escaping, as '\d' will be interpreted as \d.

You can also choose to add a quantifier to any character or character class, which will specify how many characters will be matched.

Quantifier Description
+ One or more characters will be matched.
{n} Exactly n characters will be matched. For example, \d{3} will match strings of exactly 3 numerical digits.
{a,b} Any string of characters between x and y length will be matched. For example, \w{1,3} will match a string of word characters between 1 and 3 length.
{c,} Any string of characters with c or more characters. For example, \d{2,} will match a string of numerals of length 2 or more.
* Any instance of the characters appearing zero or more times.
? Matches if the character or character class appears once, or no times.

Below are some common regular expression patterns that can be used to match typical database strings.

Example Patterns

Phone Number

The pattern below will match a phone number consisting of an international code in parentheses followed by a 7-digit phone number that may be hyphenated after the 3rd digit.

'\(\+\d{1,4}\)\s?\d{3}-?\d{4}'

Some example of matching patterns are shown below:

  • (+64)123-4567
  • (+1234) 5678910
  • (+61) 987-5432

Social Security Number

The pattern below will match a Social Security Number consisting of hyphen-separated digits:

'\d{3}-\d{2}-\d{4}'

Some example of matching patterns are shown below:

  • 111-22-3333
  • 987-65-4321
  • 112-35-8132

Time

The pattern below will match a time consisting of 1 or 2 hour digits and 2 minute digits separated by a colon:

'[0-2]?\d:\d{2}'

Some example of matching patterns are shown below:

  • 7:30
  • 23:50
  • 05:45

Email Address

The pattern below will match an email address consisting of an alphanumeric username and an alphanumeric domain name separated by an @ symbol. For a more comprehensive email regular expression, see: emailregex.com.

'[a-zA-Z0-9]+@[a-zA-Z0-9\.]+'

Some example of matching patterns are shown below:

  • janedoe@gmail.com
  • bobsmith@hotmail.com
  • thomas@yahoo.com

DISCLAIMER: The above examples are simplistic and general examples to demonstrate how regular expressions can be constructed - you should ensure these regular expressions are appropriate for your particular data before using them in your masking rulesets.

Date Parsing in DataMasque

Often, rulesets need to specify a date format to convert text into a date. For example, using Date/Time masks, executing type casts or performing Conditional Masking.

Date parsing in DataMasque follows standard %-based directives, as documented in strftime() and strptime() Format Codes.

An important point to note is how years with and without centuries are handled.

Parsing years with/without centuries (%Y vs. %y)

Two directives are available to parse years:

  • %Y to parse years with the century (e.g., 2020, 1999, 1867, etc.).
  • %y to parse years without the century (e.g., 20, 99, 67, etc.).

When parsing years without a century using %y, the following interpretation applies:

  • Years from 00 to 68 are interpreted as the years 2000 to 2068.
  • Years from 69 to 99 are interpreted as the years 1969 to 1999.

While %y provides a convenient way to parse two-digit years, it introduces ambiguity in interpretation, especially near the transition between centuries.

It is recommended to use %Y whenever possible to avoid ambiguity, and ensure consistent interpretation across platforms and environments.

Determining what unique key masking to use

Below are the suggestions for when to use the three different mask functions: mask_unique_key, from_unique, and from_unique_imitate.

mask_unique_key

Use Case:

  • Best for generating unique values and propagating these to both physical and logical foreign key (FK) columns.

Key Features:

  1. Allows specifying the data format.
  2. Generates unique values.
  3. Propagates to physical and logical FK columns.
  4. Disallows hash_columns on itself or other columns.
  5. Ensures primary key (PK) columns are unique before masking.
  6. No guaranteed consistency across masking runs or databases.

Ideal Scenario:

  • Suited for scenarios where maintaining FK relationships is crucial, and data format specification is required.

from_unique

Use Case:

  • Ideal for generating unique values without the need for FK column propagation.

Key Features:

  1. Allows specifying the data format.
  2. Generates unique values.
  3. Disallows hash_columns on itself or other columns.
  4. Does not propagate masked PK values to physical and logical FK columns. 5.No guaranteed consistency across masking runs or databases.

Ideal Scenario:

  • Useful when propagation to FK columns is not needed, and there's a requirement for data format specification.

from_unique_imitate

Use Case:

  • Used when consistent values across masking runs or databases are required, and FK relationships are needed to be maintained without specifying data format.

Key Features:

  1. Does not allow data format specification.
  2. Hashes on itself and generates unique values.
  3. Propagates masked PK values to physical and logical FK columns.
  4. Guarantees consistent values across runs or databases with the same run_secret.

Ideal Scenario:

  • Suitable for cases where consistent masking is necessary across multiple runs or databases, maintaining FK relationships is crucial and data format specification is not a priority.

Schema versioning

Schema changes to the DataMasque ruleset specification are tracked using the version field of the ruleset. The version number consists of two fields in the format major.minor. Minor version increments reflect backwards-compatible changes to the schema, whereas major version increments represent breaking changes which will require some form of migration from previous versions. Wherever possible, DataMasque will handle such migrations for you automatically when you upgrade.

Each release of DataMasque only supports the most recent major ruleset version at the time of release. As such, the major schema version of your rulesets must equal the major version supported by your DataMasque release. The minor schema version of your rulesets must be equal to or less than the minor version supported by your DataMasque release.

The ruleset schema version supported by this release of DataMasque is "1.0".

Writing comments

While creating a ruleset, it is possible to write comments in the ruleset. A commented block is skipped during execution, and it helps to add description for specified ruleset block.

If you begin a line with # (hash symbol), all text on that line will become a comment.

version: '1.0'
# This line will become a comment.
tasks:
  - type: mask_table

If you place # on a line, all text after that # on that line will become a comment. Any text before it will still be part of the ruleset.

version: '1.0'
tasks:
  - type: mask_table # The name of this task type will not be affected by this comment.

In the DataMasque ruleset editor, the shortcut key combination for commenting ruleset blocks is CTRL + / on Linux and Windows and ⌘ + / for Mac operating systems. If your cursor is on a line and this shortcut is used, the entire line will be commented out. Highlighting multiple lines at once will cause all highlighted lines to be commented out.