DataMasque Portal

Generic Masks

Generic Masks are used for simple generation or transformation of most data types.

Fixed value (from_fixed)

A simple mask that replaces all column values with the same fixed value.

Parameters

  • value (required): The value to replace all column values with. Can be any data type, but should match that of the column being masked. This value can be enclosed in quotation marks, which will convert the value a string, or entered without quotation marks.

Example

This example will replace all values in the name column of the employees table with 'Alex'.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: ID
    rules:
      - column: name
        masks:
          - type: from_fixed
            value: 'Alex'

Show result

Before After
name
Bill
Chris
Anastasia
Judith
Gordon
Joel
name
Alex
Alex
Alex
Alex
Alex
Alex

DataMasque also supports blank values for the from_fixed mask type. To specify a blank entry, you can enter the value as """, '', or leave the field blank if there are no spaces.

version: '1.0'
tasks:
  - type: mask_table
    table: customers
    key: id
    rules:
      column: name
      masks:
      - type: from_fixed
        value: ''

From column (from_column)

A mask to replace a column's value by copying the value from another database column (from the same table, or a joined table).

Parameters

  • source_column (required): The name of the column to copy values from. The source column can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.

Note: from_column is not supported for mask_file tasks, but is supported for mask_tabular_file tasks.

Example

This example will replace the values of the name column in the employees table with values from the first_name column of the users table, joined on id.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    joins:
      - target_table: users
        target_key: id
        source_table: employees
        source_key: user_id
    rules:
      - column: name
        masks:
          - type: from_column
            source_column: users.first_name

Show result

Given the users table:

id first_name
24 Kyle
25 Reid
26 Helen
27 Callum
28 Raymond
29 Vivien

The following transformation will take place on the employees table.

Before After
user_id name
26Bill
25Chris
29Anastasia
24Judith
27Gordon
28Joel
user_id name
26Helen
25Reid
29Vivien
24Kyle
27Callum
28Raymond


From file (from_file)

A mask to replace column values by selecting from a column (seed_column) in a CSV formatted seed file (seed_file). The seed file must contain one or more columns with named column headers. See the Files documentation for more information on uploading seed files.

If there are any blank values in the file, they will not be considered when randomly selecting a replacement value. If you wish to have blank values randomly selected from your seed file, ensure they are enclosed with quotation marks. For example, "".

Potential replacement values from the seed file rows may be pre-filtered using the table_filter_column and seed_filter_column parameters. This allows you to subset the available values for each row by matching a column value from the table to a column value in the seed file. If filtering does not match any rows in the seed file, fallback_masks can be provided to generate the replacement value instead.

After filtering has been applied (optional), replacement values from the seed file are selected randomly by default. To enable deterministic selection of replacement values, you may specify one or more hash_columns for the masking rule. See Deterministic masking for more details.

Parameters

  • seed_file (required): The name of a user-provided CSV seed file to select values from (see Files guide).
  • seed_column (optional): The name of the column in the seed file that will provide replacement values. This property only needs to be specified for CSV files with multiple columns.
  • table_filter_column and seed_filter_column (optional): If both attributes are provided, the replacement value will only be selected from CSV rows where the value for the seed_filter_column in the CSV row is equal to the value for the table_filter_column column in the database row being masked. These options are only supported for use with database masking, or for tabular file masking. These will not be applicable for other file masking circumstances.
  • fallback_masks (optional): If there are no rows in the CSV that match the table_filter_column/seed_filter_column condition, then this nested list of masks will be applied to the column value instead. If no fallback_masks are provided and no match is found, then the column value will be replaced with a NULL value.
  • null_string (optional): If set, any values in the seed file column that match this value will be treated as NULL (instead of a string) when written to the database.
  • value_on_missing (optional): If the mask were to generate a NULL value (due to filters not matching any rows or from NULL values in the seed file) then use this value (if set) instead.

Example

This example masks the street_name column with a random street name from a seed file (DataMasque_address_example.csv) while ensuring that the randomly selected street is from the same city as the original. A fallback value of “Main Street” is used in the event that the database row contains a city that does not exist in the CSV.

version: '1.0'
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: street_name
        masks:
          - type: from_file
            seed_file: DataMasque_address_example.csv
            seed_column: street_name
            table_filter_column: city
            seed_filter_column: city
            fallback_masks:
              - type: from_fixed
                value: 'Main Street'

Show result

Given the following database table and using the DataMasque_address_example.csv seed file:

street_number street_name city
44 Queen Street Auckland
57 Oxford Street London
12 Houston Street New York
5 Fake Street Fake City

This mask could produce the following masked output:

street_number street_same city
44 High Street Auckland
57 Brick Lane London
12 Maiden Lane New York
5 Main Street Fake City


From blob (from_blob)

Note: When used in a mask_table task, this mask type will be skipped in a dry run as it requires writing to the database.

A mask to replace column values or overwrite the entire file content with the entire contents of another file.

Important! The performance of from_blob is comparatively slower than other masking functions due to its need to write a significantly larger amount of data into the database. Consider running from_blob as its own parallel task.

from_blob is not available for use in mask_tabular_file tasks. This means you cannot use it to, for example, mask a Parquet column of binary type.

from_blob is not supported for Amazon Redshift databases.

from_blob is not supported for IBM Db2 LUW databases.

The source file can be any file type. DataMasque includes some example binary files, as detailed in Built-in blob files). See the Files documentation for more information on uploading your own files.

Parameters

  • file_name (required): The name of a user-provided file to retrieve blob data from (see Files guide).

Example

This database masking example masks the avatar column which stores blob data and replaces all values of records that have role equal to admin with the file admin_user.png in the ruleset. Note that this example assumes the file admin_user.png has been uploaded to DataMasque.

version: "1.0"
name: from_blob
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - if:
        - column: role
          equals: "admin"
        rules:
          - column: avatar
            masks:
              - type: from_blob
                file_name: "admin_user.png"

This file masking example masks .jpg files by overwriting the content of those files with blob data from replacement.jpg, (assuming this file has been uploaded through the DataMasque web UI). All matched masked files will have the same content after masking.

version: "1.0"
name: from_file
tasks:
  - type: mask_file
    include:
      - glob: *.jpg
    rules:
      - masks:
        - type: from_blob
          file_name: "replacement.jpg"

From format string (from_format_string)

A mask to generate values of a defined format. This is useful when all values of a column have the same format. DataMasque will generate values according to a format specified according to the format string syntax.

Parameters

  • value_format (required): The format of the data to be generated, which must meet the requirements of the format string syntax.
version: '1.0'
tasks:
  - type: mask_table
    table: drivers
    key: id
    rules:
      - column: licence_plate
        masks:
          - type: from_format_string
            value_format: "{[A-Z],3}{[0-9],3}"

Show result

Before After
licence_plate
ABC123
DEF456
GHI789
JKL246
MNP813
RTU579
licence_plate
VTC799
GEQ249
HWQ462
INY664
AKZ548
SCL654

Note

  • In addition to what is possible with the formats for a mask_unique_key task there is more you can do with format strings with the mask_table task. For example, you can easily select one value from a small number of alternatives using the syntax for alternatives: {(EN|FR)}
  • If you need to guarantee that generated values are unique within a column, consider using the from_unique mask type.

From choices (from_choices)

A mask to replace column values by selecting from a list of choices or a dictionary of choices with weights specified.

Parameters

  • choices (required): Can either be a list of strings, or a list of dictionaries. If a list of dictionaries, each dictionary has a single key, the "choice" and value is the weight.

    Weights must be positive, and can be a mix of float and int. Weights do not need to sum to a particular value, they are relative.

    If there are no weights assigned to any of the choices then all choices will have equal weights.

Examples

This example masks the department column with a randomly selected department from the list of choices.

version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: department
        masks:
          - type: from_choices
            choices:
              - "accounting"
              - "sales"
              - "research and development"

Show result

Before After
department
employee
employee
employee
employee
employee
employee
employee
employee
employee
employee
department
research and development
accounting
sales
research and development
accounting
sales
research and development
sales
accounting
research and development

This example masks the department column with a randomly selected department from the list of choices based on the weights provided.

version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: id
    rules:
      - column: department
        masks:
          - type: from_choices
            choices:
              - "accounting": 10
              - "sales": 15
              - "research and development": 25

Show result

Before After
department
sales
sales
sales
sales
sales
sales
sales
sales
sales
sales
department
research and development
accounting
sales
research and development
research and development
sales
research and development
sales
accounting
research and development

Secure Shuffle (secure_shuffle)

Note: When used in a mask_table task, this mask type will be skipped in a dry run as it requires writing to the database.

This mask is designed to "shuffle" a number of distinct values in a particular column. It does not strictly rearrange the data, maintaining distribution and cardinality. Instead, the distinct values in a particular column are used as random replacements throughout the column. This means that after masking a column with secure_shuffle, the distribution of values inside will be roughly uniform.

By default, null values are included in the column choices. If include_nulls_in_choices is set to false, null values will not be included in the choices, and thus, the masked column result will not include null values.

Parameters

  • include_nulls_in_choices (optional): When set to false, null values will not be included as output choices. Defaults to true. Please see Working with nulls below for more information.

Support and Restrictions

Currently, secure_shuffle only supports Oracle databases.

Because secure_shuffle works by shuffling data directly in the database, it is unable to be combined with other masking functions. For example, the output of secure_shuffle can not be used in a chain mask to apply further masking functions.

secure_shuffle may not be used as part of an if expression to conditionally apply secure_shuffle or a different mask type to the same column.

If further masking operations are to be performed on shuffled data, a second mask_table task should be used to perform these operations after the shuffle is completed.

Working with nulls

Since secure_shuffle does not maintain distribution, after masking there may be many more null values in the column than there were prior to masking.

Using include_nulls_in_choices, along with the skip option at the column level can control how nulls are handled.

The following examples shows the effect of include_nulls_in_choices and skip: null in the final column values. The original values in the column are A, B, C, and null.

Use all replacement values, including nulls

The default options for secure_shuffle masks will include nulls in the replacement value, as well as perform replace of nulls nulls.

Use all replacement values, including nulls

The default options for secure_shuffle masks will include nulls in the replacement value, as well as perform a replacement of nulls.

version: "1.0"
tasks:
  - type: mask_table
    table: table_name
    key: id
    rules:
      - column: column_to_shuffle
        masks:
          - type: secure_shuffle

The following table shows source values and possible replacement values for this ruleset.

Source Value Possible Replacement Values
A A, B, C null
B A, B, C null
C A, B, C null
null A, B, C null

This ruleset means all values, including nulls, may be used as replacements.

Exclude nulls from replacement values

This ruleset uses include_nulls_in_choices: false to exclude null from being used as a replacement value.

version: "1.0"
tasks:
  - type: mask_table
    table: table_name
    key: id
    rules:
      - column: column_to_shuffle
        masks:
          - type: secure_shuffle
            include_nulls_in_choices: false

The following table shows source values and possible replacement values for this ruleset.

Source Value Possible Replacement Values
A A, B, C
B A, B, C
C A, B, C
null A, B, C

This ruleset means that there will be no nulls in the output column. Typically, this type of data cleansing is unwanted, so it is suggested to only use include_nulls_in_choices: false with skipping of null at a column level, as the next section demonstrates.

Retain null values in the output

Combining include_nulls_in_choices: false with skipping of null values will mean that null values remain as null in the output, while non-null values are shuffled. The following ruleset achieves this:

version: "1.0"
tasks:
  - type: mask_table
    table: table_name
    key: id
    rules:
      - column: column_to_shuffle
        skip:
        - null
        masks:
          - type: secure_shuffle
            include_nulls_in_choices: false

The following table shows source values and possible replacement values for this ruleset.

Source Value Possible Replacement Values
A A, B, C
B A, B, C
C A, B, C
null null

Skip null and include null in replacement values

This last example is shown for completeness, but doesn't reflect a normal use case. It skips null values while also including null in available outputs. The means nulls will be retained, as well as being used as a replacement value.

version: "1.0"
tasks:
  - type: mask_table
    table: table_name
    key: id
    rules:
      - column: column_to_shuffle
        skip:
        - null
        masks:
          - type: secure_shuffle
            include_nulls_in_choices: true

The following table shows source values and possible replacement values for this ruleset.

Source Value Possible Replacement Values
A A, B, C, null
B A, B, C, null
C A, B, C, null
null null

This ruleset is the least production-realistic as it will introduce many more nulls than there were originally.

Example (nullable varchar)

This example securely 'shuffles' the department_name column:

version: "1.0"
tasks:
  - type: mask_table
    table: departments
    key: id
    rules:
      - column: department_name
        masks:
          - type: secure_shuffle

Show result

Before After
department_name
Software
null
HR
Finance
null
HR
department_name
Finance
Software
DATAMASQUE_NULL_SENTINEL
Finance
HR
Software


From JSON path (from_json_path)

A mask to replace data from one JSON path with data from another JSON path within the current masking JSON document. This mask allows masking of values, arrays, or objects.

Parameters

  • relative_source_path (required): The path name from which to copy the data. It will be related to the path specified in the parent transform mask and must only reference a single JSON item, either a value, a list, or an object. An error will be raised if the relative_source_path references multiple items. Example: relative_source_path: ['users', '*', 'name'].

Note: from_json_path is only supported inside json mask. For each JSON transform, if from_json_path is specified in the list of masks, force_consistency must be set to false or omitted. Otherwise, an error will be raised.

Example

This example will replace the departure data in the json_data column of the flights table with values from the destination data.

version: '1.0'
tasks:
  - type: mask_table
    table: flights
    key: id
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: ['departure']
                masks:
                  - type: from_json_path
                    relative_source_path: ['..', 'destination']

Show result

Before After
json_data
{"departure": "Willowbrook", "destination": ["Harborview", "Cedarwood"]}
{"departure": "Oceanview", "destination": {"name": "Oceanview"}}
{"departure": [], "destination": ["Pinecrest"]}
json_data
{"departure": ["Harborview", "Cedarwood"], "destination": ["Harborview", "Cedarwood"]}
{"departure": {"name": "Oceanview"}}, "destination": {"name": "Oceanview"}}
{"departure": ["Pinecrest"], "destination": ["Pinecrest"]}