DataMasque Portal

Conditional Masking

You may wish to only apply masks to rows or values that meet some conditions. DataMasque has three different methods for conditionally applying masks to meet different use cases:

Use case Mechanism
I want to restrict which rows are fetched for masking from the database table. Where
I want to apply certain masking rules to only a subset of rows or values. If
I want to skip applying masks to certain column values. Skip

Warning: Use of the conditional masking features where, skip, or if/else, may mean your masking rules are not applied to some database rows or values. It is recommended to verify the resulting output satisfies your masking requirements.

Where - restricting database fetches

To restrict which rows are fetched for masking from a database table, you can specify a where clause for a masked_table:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: id
    where: >-
      "users"."role" <> 'administrator'
    rules:
      ...

The where clause can refer to any columns in the masked table or joined tables. All columns must be referenced using their table-qualified name (e.g. Users.FirstName). Ensure to use appropriate quoting as required. For example, if the identifier uses a reserved word, starts with an illegal character, or is a case-sensitive identifier.

Important!

  • Any rows excluded by the where clause will not be masked.
  • The SQL you provide for the where clause will not be validated before execution, please take care when constructing your SQL.
  • The SQL you provide for the where clause should not end in a semicolon, as this will cause a masking error.
  • Any string in the where clause variables must be quoted in single quotation marks.
  • Joined tables cannot be referenced in the where clause currently.
  • Using where to filter based on the system's current date/time, e.g. SYSDATE in Oracle or NOW() in PostgreSQL, is likely to cause a masking run failure. This is because the number of rows that match the where clause will probably vary during the run.

Note for Amazon Redshift:

  • Use of where clause for Amazon Redshift is not yet supported in DataMasque. This is in our roadmap and will be included in future releases.

If - conditional rules

You can choose to apply certain masking rules to only a subset of rows within a table, while still allowing other masks to be applied to those rows. This can be achieved through the use of if-conditions in rules lists.

Amazon DynamoDB note: Masking Amazon DynamoDB uses a mask_table task, so standard mask_table condition parameters apply.

Note: If a column is being masked in a from_blob mask, an if condition against that column can only check if the value in that column is null or non-null. This is because DataMasque only records if blob columns are null or not; the exact blob data is not loaded, since it could take up a lot of RAM.

Example

In the following example, the last_name of all users will be replaced with 'Smith', but the user's gender will determine the mask applied to their first_name:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: last_name
        masks:
        - type: from_fixed
          value: 'Smith'
      - if:
          - column: gender
            equals: 'female'
        rules:
          - column: first_name
            masks:
            - type: from_fixed
              value: 'Alice'
        else_rules:
          - if:
              - column: gender
                equals: 'male'
            rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Bob'
            else_rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Chris'

This example applies the same conditions in a mask_tabular_file task.

version: "1.0"
name: basic_file_mask
tasks:
  - type: mask_tabular_file
    include:
      - glob: "*.csv"
    rules:
      - if:
        - column: gender
          equals: 'female'
        rules:
          - column: first_name
            masks:
            - type: from_fixed
              value: 'Alice'
        else_rules:
          - if:
              - column: gender
                equals: 'male'
            rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Bob'
            else_rules:
              - column: first_name
                masks:
                  - type: from_fixed
                    value: 'Chris'

Condition Parameters

Structure of an if condition

  • if (required): A list of conditions (see below) that must all evaluate as true for the nested list of rules to be applied to a row.
  • rules (required): A nested list of masking rules/nested-if-conditions (or dictionary mapping labels to rules) that will only be applied to rows that meet the conditions defined under if.
  • else_rules (optional): A nested list of masking rules/nested-if-conditions (or dictionary mapping labels to rules) that will only be applied to rows that do NOT meet the conditions defined under if.

Selecting the value to be checked by the condition

There are 3 different parameters that can be used to select which data values the condition is applied to:

  • column (required): The database column to check this condition against. The column name can be prefixed with a table name to reference a column in another table, and that table name can be prefixed with a schema name to reference a table in another schema.
  • json_path: For JSON records, a json_path must be provided to extract the value to which the condition will be applied.
  • xpath: For XML records, an xpath must be provided to extract the value to which the condition will be applied.

Notes:

  • It is not valid to specify both json_path and xpath in a single condition. Specify only one or the other.
  • One of json_path or xpath must be specified for conditional masking in mask_file tasks.
  • For databases and tabular files, DataMasque decodes the JSON or XML data (as appropriate) within the selected column and uses the value at json_path or xpath to evaluate the condition. As such, the json_path or xpath must not contain wildcards as it must resolve to exactly one value.
  • For Parquet files, if conditionals with json_path can be used both for columns of string type containing JSON data and for columns of struct, map, or list type containing any data. maps and structs are interpreted as if they were JSON objects, and lists as if they were JSON arrays.
Handling missing values

An additional option can be specified for conditions that use json_path or xpath:

  • on_missing (optional): Determines how to handle records where no value exists at a condition's json_path or xpath, or where the record is not valid JSON/XML:
    • error (default): Raise an error in the run log.
    • skip: Skip this rule for the current row, record or file.
    • apply_if_rules: Apply the rules defined in rules that are normally applied when the condition is True.
    • apply_else_rules: Apply the rules defined in else_rules that are normally applied when the condition is False. Only permitted when else_rules are provided.

Note: When a database column or tabular file column specified in an if does not exist, the masking run will always raise an error. on_missing behaviour only applies when no value exists at a specified json_path or xpath or where the column/record contains invalid JSON/XML when a json_path/xpath is specified.

The following example demonstrates how to use json_path and on_missing for a column containing JSON data. If the first entry in the phone_numbers list is an empty string, then it will be set to "12345678", otherwise it will be set to "22222222". If for any particular row the phone_numbers list does not exist, or is empty, then that row will not be masked.

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - if:
          - column: contacts
            json_path:
              - phone_numbers
              - 0
            equals: ""
        on_missing: skip
        rules:
          - column: contacts
            masks:
            - type: json
              transforms:
                - path:
                    - phone_numbers
                    - 0
                  masks:
                    - type: from_fixed
                      value: "12345678"
        else_rules:
          - column: contacts
            masks:
            - type: json
              transforms:
                - path:
                    - phone_numbers
                    - 0
                  masks:
                    - type: from_fixed
                      value: "22222222"

Comparison types

The following comparisons can be applied to the selected data value:

  • equals (optional): If specified, the condition will only evaluate as true if the value exactly equals the specified value. Data types are also checked (i.e. 100 is not equal to "100").
  • matches (optional): If specified, the condition will only evaluate as true if the string of the value matches the specified regular expression. For more details on how to use regular expressions, see Common regular expression patterns.
  • less_than (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is less than the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • less_than_or_equal (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is less than or equal to the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • greater_than (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is greater than the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • greater_than_or_equal (optional): If specified, the condition will only evaluate as true if the value is a number or date/datetime and is greater than or equal to the given value. If the specified value is numeric, the value from the data will be treated as a number.
  • age_greater_than (optional): If specified, the condition will only evaluate as true if the value is a date/datetime and the difference in years is greater than this value. If the value is not a date/datetime use date_format to try to parse it as a date/datetime. If date_format is not provided the date format %Y-%m-%d will be used as a fallback. Can be combined with age_less_than to enforce an age within a range.
  • age_less_than (optional): If specified, the condition will only evaluate as true if the value is a date/datetime and the difference in years is less than this value. If the value is not a date/datetime use date_format to try to parse it as a date/datetime. If date_format is not provided the date format %Y-%m-%d will be used as a fallback. Can be combined with age_greater_than to enforce an age within a range.
  • disable_auto_date_parsing (optional): By default, strings in rulesets that appear to be dates are attempted to be parsed into dates/datetimes. This can cause comparisons to unexpectedly fail due to type mismatches. Set disable_auto_date_parsing to true to disable this automatic parsing. This setting is overridden when specifying the date_format argument.

When performing conditions on values containing dates/datetimes the following parameter can be specified:

  • date_format (optional): The format of the date/datetime stored. If specified, the condition will attempt to load the column value or the value at the json_path/xpath as a datetime with the specified format. If the value cannot be parsed with the date_format an error will be raised. See Date Parsing in DataMasque for more information.

Notes:

  • If the comparison is any of less_than,less_than_or_equal,greater_than, or greater_than_or_equal:
    • now or current_date_time can be specified as the value to use the current datetime at the time of masking.
    • Use a similar format to the date/datetime stored e.g. if the datetime contains hours:mins:seconds+time_zone, specify those in the value to compare against ("2012-12-12 10:30:00+00:00").
  • If the date or datetime in the row or field includes timezone information, dates/datetimes specified in ruleset conditions default to UTC.
  • Similarly, now or current_date_time values are understood to be in UTC.
  • If the date or datetime in the row or field does not include timezone information, ensure you specify a value without a timezone in the ruleset to avoid masking errors.
  • When comparing values in an XML document:
    • If the value in the ruleset is numeric (for example 100 or -3.14), DataMasque attempts to convert the XML data to a number for comparison. For example, 100.0 unquoted matches the value "100" from XML as they are both converted to the same number 100.
    • If you want string rather than numeric comparison, quote the value in the ruleset. In the above example, if the comparison value in the ruleset is quoted as "100.0", this is no longer a string match against "100".
    • All other data extracted from XML is treated as strings and so the comparison value in the ruleset must be quoted if necessary. For example, without quotes the value false in YAML is interpreted as a boolean, which does not match the string "false" in the XML data.

Logical operators

Conditions can also be grouped with the logical operators or, not, and and:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
    - if:
      - and:
        - not:
          - column: username
            matches: 'customer_\w+'
        - or:
          - column: admin
            equals: true
          - column: role
            equals: "admin"
      rules:
        - column: username
          masks:
          - type: from_fixed
            value: "Bob"

Note: When using an if conditional in rulesets, final row counts will reflect the number of rows processed rather than the number of rows masked. This is due to the rows being filtered on the application side and so all rows fetched will be processed and added to the row count. Alternatively where conditionals can be used in the ruleset which will provide an accurate row count of masked rows.

Skip - not masking specific values

A common use-case is to not apply masks to certain values, e.g. to leave NULL values or empty strings unchanged. You can choose to not mask certain values in a column by specifying a number of values to skip:

version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: username
        skip:
         - null
         - ""
         - matches: 'admin_\w+'
        masks:
          - type: from_fixed
            value: "Bob"

Any column values that are exactly equal to any of the string/numeric/null values in the skip list will not be masked (data types are also checked, i.e. 100 is not equal to "100"). Additionally, string column values matching a regular expression can be skipped by specifying the skip value as matches: "my_regex". For more details on how to use regular expressions, see Common regular expression patterns.

Conditional masking for files

Conditional masking for object files (as part of a mask_table) differs to database masking. For databases, conditions are evaluated per row, and affect the values (columns) in only that row.

For object files, the condition is applied per file and affects values for that file.

Multi-record object files (NDJSON or Avro) are more like databases in that the conditions are applied per record in the file and affect values for that record.

Masking tabular files (with mask_tabular_file) applies rules per row in the file, again, similar to masking a database table.

For the full list of parameters please refer to Condition Parameters

Notes: where is not supported for conditional masking for files as where is used to restrict what is returned in database queries, and therefore it does not apply to files.

The implementation of skip for files is different from databases. It is used in conjunction with include to choose which files to mask or not. Please refer to mask_file tasks and mask_tabular_file tasks.

For tabular file masking, refer to the general If section.

Here are the use cases for each type of condition:

Use case Mechanism
I want to apply certain masking rules to the values contained in the files. If
I want to skip certain files based on the file name. Please refer to the skip option for mask_file tasks

Below are some example rulesets for conditional file masking.

If - conditional rules (files)

Example

In the following example, the last_name of all users will be replaced with 'Smith', but the user's gender will determine the mask applied to their first_name:

version: "1.0"
name: json_with_conditional
tasks:
  - type: mask_file
    rules:
      - masks:
        - type: json
          transforms:
            - path: ["last_name"]
              on_missing: error
              masks:
                - type: from_fixed
                  value: "Stevens"
      - if:
        - json_path: ["gender"]
          equals: "female"
        on_missing: apply_if_rules
        rules:
          - masks:
            - type: json
              transforms:
                - path: ["first_name"]
                  on_missing: error
                  masks:
                    - type: from_fixed
                      value: "Alice"
        else_rules:
          - if:
            - json_path: ["gender"]
              equals: "male"
            on_missing: skip
            rules:
              - masks:
                - type: json
                  transforms:
                    - path: ["first_name"]
                      on_missing: error
                      masks:
                        - type: from_fixed
                          value: "Bob"
            else_rules:
              - masks:
                - type: json
                  transforms:
                    - path: ["first_name"]
                      on_missing: error
                      masks:
                        - type: from_fixed
                          value: "Jamie"

Similarly, with XML files

version: "1.0"
name: xml_with_conditional
tasks:
  - type: mask_file
    rules:
      - masks:
        - type: xml
          transforms:
            - path: "//Root/Employee/LastName"
              on_missing: error
              node_transforms:
                - type: text
                  masks:
                    - type: from_fixed
                      value: Stevens
      - if:
        - xpath: "//Root/Employee/Gender/text()"
          equals: "female"
        on_missing: apply_if_rules
        rules:
          - masks:
            - type: xml
              transforms:
                - path: "//Root/Employee/FirstName"
                  on_missing: error
                  node_transforms:
                    - type: text
                      masks:
                        - type: from_fixed
                          value: Alice
        else_rules:
          - if:
            - xpath: "//Root/Employee/Gender/text()"
              equals: "male"
            on_missing: skip
            rules:
              - masks:
                - type: xml
                  transforms:
                    - path: "//Root/Employee/FirstName"
                      on_missing: error
                      node_transforms:
                        - type: text
                          masks:
                            - type: from_fixed
                              value: Bob
            else_rules:
              - masks:
                - type: xml
                  transforms:
                    - path: "//Root/Employee/FirstName"
                      on_missing: error
                      node_transforms:
                        - type: text
                          masks:
                            - type: from_fixed
                              value: Jamie

An example for tabular files can be found in the If section.