DataMasque Portal

Document Masks

Document masks can mask more complex relationships inside JSON or XML objects.

JSON (json)

This mask uses a query to locate and mask a value inside a JSON document. The rest of the JSON document is unchanged. The path is specified as a list of strings or integers, which are used to traverse the data to the values to be masked; some examples of path are covered in the path Intro section below. A JSON mask will return the same type of data that it received; for example, masking text formatted as JSON will return text, while masking a JSON-encoded column or file will return a JSON-encoded value.

Parameters
  • transforms (required): A list of the transforms (replacements) to perform on the JSON document.
    • path (required): The path to locate the value to update.
    • masks (required): A list of masks to be performed (Any of the valid Mask Types).
    • on_null (optional): A string to specify the action to take if the value is null. One of:
      • skip (default): Skip to the next transform, the document remains unchanged.
      • error: Raise an error and stop masking.
      • mask: Mask the null value as specified.
    • on_missing (optional): A string to specify the action to take if the value is not present (due to the document structure not matching the path).
      • skip: Skip to the next transform, the document remains unchanged.
      • error (default): Raise an error and stop masking.
    • force_consistency (optional): Keep consistency between replacements in the path. See the section JSON Example with force_consistency for details on behaviour. Defaults to false.
    • hash_sources (optional): A list of relative paths to values to be used as hash_sources to ensure consistent masking for JSON data with the same structure.
    • advance_hash (optional): A boolean. When set to true, the seed value used for hashing is incremented, giving a repeatable sequence of masked values (rather than one repeated value) when the value being hashed is the same. The seed is generated from hash_columns when using deterministic masking for databases or tabular files, or from hash_sources when using deterministic masking for files. Defaults to false.
  • fallback_masks (optional): Mask to perform if the data retrieved from the database is not valid JSON.

If the json mask is provided a null value (e.g. from a SQL column), the value will remain null. fallback_masks will not be executed.

When masking multiple values in the same JSON document, multiple transforms should be specified in a single json mask, instead of multiple table masks with a single transform each. This means that the JSON column will only need to be serialized/deserialized once per row.

path Intro

A JSON path is a list of path components (strings or integers) used to traverse a JSON document.

The path examples below refer to the following JSON document; it describes an order with some customer details, a quantity, and a list of products.

{
    "customer_details": {
        "first_name": "Richard",
        "last_name": "Willis"
    },
    "quantity": 18,
    "products": ["product1", "product2"]
}

The following paths can be used to refer to particular values:

  • ["customer_details"] refers to the customer details object, {"first_name": "Richard", "last_name": "Willis"}
  • ["customer_details", "first_name"] refers to the value "Richard"
  • ["customer_details", "last_name"] refers to the value "Willis"
  • ["quantity"] refers to the value 18
  • ["products"] refers to the products array ["product1", "product2"]
  • ["products", 0] refers to the first value in the products array, "product1"
  • ["products", 1] refers to the second value in the products array, "product2"
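
As a sketch of how these paths appear in a ruleset, the fragment below masks both name fields of the order document in a single json mask, using one transform per value. The table name orders, the key order_id and the column order_json are hypothetical; the mask type and its options are taken from the examples elsewhere on this page.

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: orders            # hypothetical table name
    key: order_id            # hypothetical key column
    rules:
      - column: order_json   # hypothetical column holding the order document
        masks:
          - type: json
            transforms:
              # One transform per value, all inside the same json mask.
              - path: ["customer_details", "first_name"]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
              - path: ["customer_details", "last_name"]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
```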

Quoting numbers in paths

Numeric components of paths that refer to indexes of an array should not be quoted. Quoting is required if numbers refer to the key of an object that is a numeric string.

For example, in this JSON document users are stored in an object with string keys.

{
  "users": {
    "0": "Richard",
    "1": "Willis"
  }
}

The user "Richard" can be accessed with the path ["users", "0"].

Compare this to the following example, which stores users in an array.

{
  "users": ["Richard", "Willis"]
}

In this case, "Richard" should be accessed with the path ["users", 0], where 0 is unquoted as it refers to an array index.
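
The distinction can be sketched as two alternative transforms, where only the quoting of the final path component differs. Only one of the two would be used for a given document shape; the from_fixed mask is used purely for illustration.

```yaml
transforms:
  # For {"users": {"0": "Richard", "1": "Willis"}}
  - path: ["users", "0"]   # quoted: "0" is the key of an object
    masks:
      - type: from_fixed
        value: "REDACTED"

  # For {"users": ["Richard", "Willis"]}
  - path: ["users", 0]     # unquoted: 0 is an array index
    masks:
      - type: from_fixed
        value: "REDACTED"
```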

Working with repeated elements of unknown length

The wildcard operator * can be used to apply masks to multiple items matching the query. This is useful if you don't know how many elements will be in an array or object. For example, a JSON object with multiple people, each with multiple addresses:

{
  "users": [
    {
      "name": "Richard",
      "addresses": [
        {"type": "postal", "city": "Fairview"},
        {"type": "physical", "city": "Riverside"}
      ]
    },
    {
      "name": "Willis",
      "addresses": [
        {"type": "postal", "city": "Beachland"},
        {"type": "physical", "city": "Bronson"}
      ]
    }
  ]
}

The path ["users", "*", "name"] would mask the name for every element in users, regardless of how many there are. Multiple wildcards can be used, too. The path ["users", "*", "addresses", "*", "city"] would mask city in all addresses elements of all users. Note that * must always be quoted in YAML.

Wildcards and numeric indexes may be used together. For example, to mask only the city of the first address, leaving the cities of all other addresses unmasked, use the path ["users", "*", "addresses", 0, "city"]. Since the "*" is used after "users", this would still apply to all users.

Note: Values in path are case-sensitive. They should not follow quoting rules for database columns (double quotation marks in an outer set of single quotation marks). Instead, normal YAML string-quoting rules apply.
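
A minimal sketch of the two wildcard paths described above, written as transforms against the users document. The from_fixed mask is an illustrative choice; any valid mask type could be used instead.

```yaml
transforms:
  # Mask the name of every user, however many there are.
  - path: ["users", "*", "name"]
    masks:
      - type: from_fixed
        value: "REDACTED"

  # Mask the city of only the first address of every user.
  # The wildcard is quoted because an unquoted * is the YAML alias indicator.
  - path: ["users", "*", "addresses", 0, "city"]
    masks:
      - type: from_fixed
        value: "REDACTED"
```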

Example

This example replaces the data at the path [customer_details, first_name] of the json_data column with the fixed value REDACTED. The on_null: mask option is specified so that a null value is masked as normal. The on_missing: skip option is specified so that, when the value is missing (i.e. the structure does not match the path), that transform is skipped and masking continues.

Note that this means the first_name in the wrong location in the first row is not masked. In cases like this, it can be safer to specify on_missing: error instead, so the masking run fails if data is not in the expected format. In the second row, where first_name is null, the value is masked because on_null: mask is specified.

Also note the use of fallback_masks. The last row did not have valid JSON data in it, so the fallback mask was used to replace it with an empty JSON object, which may help clean the data for further use.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: uid
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: [customer_details, first_name]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
                on_null: mask
                on_missing: skip
            fallback_masks:
              - type: from_fixed
                value: "{}"

Before (json_data):

{"c":{"first_name":"Sam"}}
{"customer_details":{"first_name": null}}
{"customer_details":{"first_name": "Harry"}}
{"customer_details":{"first_name": "Sally"}}
NOT_VALID_JSON

After (json_data):

{"c":{"first_name":"Sam"}}
{"customer_details":{"first_name": "REDACTED"}}
{"customer_details":{"first_name": "REDACTED"}}
{"customer_details":{"first_name": "REDACTED"}}
{}

For arrays, all masks will be applied to each value in the array. For example:

{
  "customer_details": {
    "given_names": ["Richard", "Willis"]
  }
}

The path [customer_details, given_names] would return the value ["Richard", "Willis"] and the masks would then be performed on "Richard" and "Willis" separately. This means for most mask functions, each value in the array would be transformed into a new, different value. However, if you are using a mask that always returns the same value (e.g. from_fixed) all values would be transformed to the same new value.
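
As an illustrative sketch, the transform below targets the given_names array directly. It reuses the from_file mask and seed file from the force_consistency example further down this page; because the mask is applied to each array value separately, "Richard" and "Willis" would typically be replaced by two different names.

```yaml
transforms:
  - path: [customer_details, given_names]
    masks:
      # Applied once to "Richard" and once to "Willis".
      - type: from_file
        seed_file: DataMasque_firstNames_male.csv
        seed_column: firstname-male
```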

Note:

  • In all databases, the json mask supports masking of JSON data stored in text-type columns (VARCHAR, NVARCHAR or TEXT).
  • JSON-specific column types are also supported, for example, JSON in PostgreSQL and MySQL, or JSONB in PostgreSQL.
  • Arrays, maps, and sets inside Amazon DynamoDB columns can also be masked with the json mask. Sets are treated like arrays, with items indexed according to their sorted order.

JSON Example with force_consistency

This example will illustrate the benefit of using the force_consistency parameter on transforms. Suppose you have a table with JSON data with the following structure:

{
    "name": [
        {
            "use": "official",
            "family": "Chalmers",
            "given": ["Peter", "James"]
        },
        {
            "use": "usual",
            "given": ["Jim"]
        },
        {
            "use": "maiden",
            "family": "Windsor",
            "given": ["Peter", "James"]
        }
    ]
}

When masking the items at the path ['name', '*', 'given'], it would be best to mask them with consistent values, i.e. the same masked names would appear in each of the given items after masking. To do this, set the force_consistency parameter of the relevant transform to true.

version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
                force_consistency: true

Before (json_data):
{"name":[
    {"use":"official","family":"Chalmers","given":["Peter","James"]},
    {"use":"usual","given":["Jim"]},
    {"use":"maiden","family":"Windsor","given":["Peter","James"]}
]}
{"name":[
    {"use":"official","family":"Stevenson","given":["Todd","Carl"]},
    {"use":"usual","given":["Todd"]},
    {"use":"maiden","family":"Pallin","given":["Todd","Carl"]}
]}
{"name":[
    {"use":"official","family":"Radgen","given":["John","Neil"]},
    {"use":"usual","given":["John"]},
    {"use":"maiden","family":"Hoppstadter","given":["John","Neil"]}
]}
{"name":[
    {"use":"official","family":"Baulenas","given":["Eric","Miguel"]},
    {"use":"usual","given":["Eric"]},
    {"use":"maiden","family":"Ville","given":["Eric","Miguel"]}
]}
{"name":[
    {"use":"official","family":"Asurmendi","given":["James","Bryan"]},
    {"use":"usual","given":["James"]},
    {"use":"maiden","family":"Gotsch","given":["James","Bryan"]}
]}

After (json_data):
{"name":[
    {"use":"official","family":"Chalmers","given":["Claude","Dennis"]},
    {"use":"usual","given":["Claude"]},
    {"use":"maiden","family":"Windsor","given":["Claude","Dennis"]}
]}
{"name":[
    {"use":"official","family":"Stevenson","given":["Zackery","Scot"]},
    {"use":"usual","given":["Zackery"]},
    {"use":"maiden","family":"Pallin","given":["Zackery","Scot"]}
]}
{"name":[
    {"use":"official","family":"Radgen","given":["Joshua","Brandon"]},
    {"use":"usual","given":["Joshua"]},
    {"use":"maiden","family":"Hoppstadter","given":["Joshua","Brandon"]}
]}
{"name":[
    {"use":"official","family":"Baulenas","given":["Andrew","Tanner"]},
    {"use":"usual","given":["Andrew"]},
    {"use":"maiden","family":"Ville","given":["Andrew","Tanner"]}
]}
{"name":[
    {"use":"official","family":"Asurmendi","given":["Antonio","James"]},
    {"use":"usual","given":["Antonio"]},
    {"use":"maiden","family":"Gotsch","given":["Antonio","James"]}
]}

Without force_consistency, the given names in the output JSON would all be different. An example is shown below:

Before (json_data):
{"name":[
    {"use":"official","family":"Chalmers","given":["Peter","James"]},
    {"use":"usual","given":["Jim"]},
    {"use":"maiden","family":"Windsor","given":["Peter","James"]}
]}
{"name":[
    {"use":"official","family":"Stevenson","given":["Todd","Carl"]},
    {"use":"usual","given":["Todd"]},
    {"use":"maiden","family":"Pallin","given":["Todd","Carl"]}
]}
{"name":[
    {"use":"official","family":"Radgen","given":["John","Neil"]},
    {"use":"usual","given":["John"]},
    {"use":"maiden","family":"Hoppstadter","given":["John","Neil"]}
]}
{"name":[
    {"use":"official","family":"Baulenas","given":["Eric","Miguel"]},
    {"use":"usual","given":["Eric"]},
    {"use":"maiden","family":"Ville","given":["Eric","Miguel"]}
]}
{"name":[
    {"use":"official","family":"Asurmendi","given":["James","Bryan"]},
    {"use":"usual","given":["James"]},
    {"use":"maiden","family":"Gotsch","given":["James","Bryan"]}
]}

After (json_data):
{"name":[
    {"use":"official","family":"Chalmers","given":["Parker","Joseph"]},
    {"use":"usual","given":["Mark"]},
    {"use":"maiden","family":"Windsor","given":["Jeffrey","Richard"]}
]}
{"name":[
    {"use":"official","family":"Stevenson","given":["Dale","Sebastian"]},
    {"use":"usual","given":["Christopher"]},
    {"use":"maiden","family":"Pallin","given":["Johnathan","Bracken"]}
]}
{"name":[
    {"use":"official","family":"Radgen","given":["Tyler","Robert"]},
    {"use":"usual","given":["Micheal"]},
    {"use":"maiden","family":"Hoppstadter","given":["Herbert","Ashton"]}
]}
{"name":[
    {"use":"official","family":"Baulenas","given":["Artie","Alfred"]},
    {"use":"usual","given":["Pedro"]},
    {"use":"maiden","family":"Ville","given":["Henderson","Bryan"]}
]}
{"name":[
    {"use":"official","family":"Asurmendi","given":["Benjamin","Michael"]},
    {"use":"usual","given":["Philip"]},
    {"use":"maiden","family":"Gotsch","given":["Kendrick","John"]}
]}

JSON Example with advance_hash

This example demonstrates the benefit of using the advance_hash parameter with deterministic masking. Suppose you have JSON data with the same structure as the previous example, but the masked values should be deterministic, based on the id column in the table. To get deterministic behaviour, either hash_columns or hash_sources must be specified; in this case the id column is used as the hash source, as the JSON data is stored in a database. A wildcard is now included at the end of the path, as each name should be replaced with a different name; without it, each item in the list would be the same due to deterministic masking.

version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - hash_sources:
          - column_name: id
        column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given', '*']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male

Before (id, json_data):
1
{"name":[
    {"use":"official","family":"Chalmers","given":["Peter","James"]},
    {"use":"usual","given":["Jim"]},
    {"use":"maiden","family":"Windsor","given":["Peter","James"]}
]}
2
{"name":[
    {"use":"official","family":"Stevenson","given":["Todd","Carl"]},
    {"use":"usual","given":["Todd"]},
    {"use":"maiden","family":"Pallin","given":["Todd","Carl"]}
]}
3
{"name":[
    {"use":"official","family":"Radgen","given":["John","Neil"]},
    {"use":"usual","given":["John"]},
    {"use":"maiden","family":"Hoppstadter","given":["John","Neil"]}
]}
4
{"name":[
    {"use":"official","family":"Baulenas","given":["Eric","Miguel"]},
    {"use":"usual","given":["Eric"]},
    {"use":"maiden","family":"Ville","given":["Eric","Miguel"]}
]}
5
{"name":[
    {"use":"official","family":"Asurmendi","given":["James","Bryan"]},
    {"use":"usual","given":["James"]},
    {"use":"maiden","family":"Gotsch","given":["James","Bryan"]}
]}

After (id, json_data):
1
{"name":[
    {"use":"official","family":"Chalmers","given":["James","James"]},
    {"use":"usual","given":["James"]},
    {"use":"maiden","family":"Windsor","given":["James","James"]}
]}
2
{"name":[
    {"use":"official","family":"Stevenson","given":["Libby","Libby"]},
    {"use":"usual","given":["Libby"]},
    {"use":"maiden","family":"Pallin","given":["Libby","Libby"]}
]}
3
{"name":[
    {"use":"official","family":"Radgen","given":["Matthew","Matthew"]},
    {"use":"usual","given":["Matthew"]},
    {"use":"maiden","family":"Hoppstadter","given":["Matthew","Matthew"]}
]}
4
{"name":[
    {"use":"official","family":"Baulenas","given":["Paul","Paul"]},
    {"use":"usual","given":["Paul"]},
    {"use":"maiden","family":"Ville","given":["Paul","Paul"]}
]}
5
{"name":[
    {"use":"official","family":"Asurmendi","given":["Steven","Steven"]},
    {"use":"usual","given":["Steven"]},
    {"use":"maiden","family":"Gotsch","given":["Steven","Steven"]}
]}

However, when using deterministic masking (with hash_columns/hash_sources specified), all masked values for the path ['name', '*', 'given', '*'] will be the same. If advance_hash is enabled, the masked values will be a repeatable sequence rather than a single repeated value. The next example shows the difference with advance_hash: true specified.

version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - hash_sources:
          - column_name: id
        column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given', '*']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
                advance_hash: true

Before (id, json_data):
1
{"name":[
    {"use":"official","family":"Chalmers","given":["Peter","James"]},
    {"use":"usual","given":["Jim"]},
    {"use":"maiden","family":"Windsor","given":["Peter","James"]}
]}
2
{"name":[
    {"use":"official","family":"Stevenson","given":["Todd","Carl"]},
    {"use":"usual","given":["Todd"]},
    {"use":"maiden","family":"Pallin","given":["Todd","Carl"]}
]}
3
{"name":[
    {"use":"official","family":"Radgen","given":["John","Neil"]},
    {"use":"usual","given":["John"]},
    {"use":"maiden","family":"Hoppstadter","given":["John","Neil"]}
]}
4
{"name":[
    {"use":"official","family":"Baulenas","given":["Eric","Miguel"]},
    {"use":"usual","given":["Eric"]},
    {"use":"maiden","family":"Ville","given":["Eric","Miguel"]}
]}
5
{"name":[
    {"use":"official","family":"Asurmendi","given":["James","Bryan"]},
    {"use":"usual","given":["James"]},
    {"use":"maiden","family":"Gotsch","given":["James","Bryan"]}
]}

After (id, json_data):
1
{"name":[
    {"use":"official","family":"Chalmers","given":["James","Madelyn"]},
    {"use":"usual","given":["Henry"]},
    {"use":"maiden","family":"Windsor","given":["Jack","Cynthia"]}
]}
2
{"name":[
    {"use":"official","family":"Stevenson","given":["Libby","Christopher"]},
    {"use":"usual","given":["Clerance"]},
    {"use":"maiden","family":"Pallin","given":["Lois","Ariadne"]}
]}
3
{"name":[
    {"use":"official","family":"Radgen","given":["Matthew","Darrell"]},
    {"use":"usual","given":["Joy"]},
    {"use":"maiden","family":"Hoppstadter","given":["James","Scott"]}
]}
4
{"name":[
    {"use":"official","family":"Baulenas","given":["Paul","Thelma"]},
    {"use":"usual","given":["Jon"]},
    {"use":"maiden","family":"Ville","given":["Jessica","Felix"]}
]}
5
{"name":[
    {"use":"official","family":"Asurmendi","given":["Steven","Jordan"]},
    {"use":"usual","given":["Paola"]},
    {"use":"maiden","family":"Gotsch","given":["Uriel","Nicholas"]}
]}

Notice that the first masked value in each row is the same in both examples. This is because the masking is deterministic: the sequence of masked values is repeatable for the same value from hash_columns/hash_sources.
The same idea can be used for file masking, but instead of specifying hash_columns, hash_sources should be used with a json_path for the value to be used for the hashing.
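
A hedged sketch of the file-masking variant follows. It assumes that json_path takes the same list form as a json mask path, and that each file contains a top-level id field to hash on; both are assumptions for illustration and are not confirmed by this page. The mask_file rule layout follows the XML namespace example later on this page.

```yaml
version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          # Assumption: json_path uses the same list syntax as path.
          # "id" is a hypothetical field in each JSON file.
          - json_path: ["id"]
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given', '*']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
                advance_hash: true
```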


XML (xml)

This mask will use a query to locate and mask a value inside an XML document. The rest of the XML document is unchanged. An XPath (path) is used to define the path to the node to mask. Once the node has been located, one or more node_transforms can be applied to alter its content or attributes.

Note: The xml mask should only be used with trusted XML data. The parser includes support for entity expansion and external references which can potentially be exploited with malicious XML payloads.

Note: XML declarations are preserved when the XML document is masked, except where the declaration contains the standalone parameter but no encoding parameter. In this case, encoding='UTF-8' will be added to the declaration of the XML document.

Intro to transforms and node_transforms

XML documents are made up of one or more elements. When referring to an element, this includes the start tag, end tag, attributes and content. For example, this element representing a log:

<Log date="2022-08-09" username="user@example.com">Account created</Log>

The element to mask is located using an XPath expression. Once found, there are a few different parts of the element that can be masked, namely:

  • its name (Log)
  • its attributes (date and username)
  • its text (Account created)

Each of these items is an XML node.

When a masking run executes, each row from the database is fetched and passed to a masking function only once. To apply masks on different elements in an XML document, the ruleset should define a list of transforms, one for each element that requires masking. In turn, a list of node_transforms must be specified, one for each node of the element that needs to be masked.

Specifying masking in this manner allows the masking run to be more efficient by querying for each element to be masked only once.

As an example, consider how to mask the Log in the above example. The date and username attributes should be redacted, along with the text content. This would require one transform to locate the Log element, then three node transforms: one for the date attribute, another for the username attribute, and a final one to mask the text of the element.

The relevant portion of the YAML describing this transform would look like:

transforms:
  - path: 'Log'
    node_transforms:
      - type: attribute
        attributes: 'date'
        masks:
          - <list of masks>
      - type: attribute
        attributes: 'username'
        masks:
          - <list of masks>
      - type: text
        masks:
          - <list of masks>

Note: This assumes the Log element is not the root element of the XML document. To select the root element, use . or an absolute XPath (starting with //) as the path.

All XML values are read as strings, so a typecast mask is required before they are used in a mask that requires non-string values (e.g. numeric_bucket). XML values must also be written as strings, so the output of masks that return non-string values (e.g. from_random_number, from_random_boolean, numeric_bucket) needs to go through a typecast mask before being written. For more information on typecast, please refer to the Typecast documentation. Below is an example with from_random_number.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            transforms:
              - path: 'Log'
                node_transforms:
                - type: attribute
                  attributes: 'id'
                  masks:
                    - type: from_random_number
                      min: 1000
                      max: 9999
                    - type: typecast
                      typecast_as: 'string'

Consistency for multiple elements

XPath expressions can match multiple elements. This XML document contains a UserLog with multiple Logs:

<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com">Logged in</Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>

The root is called Root in these examples – the root node does not need to be named Root.

The XPath UserLog/Log would match all three Log elements. DataMasque can be configured to mask each of the specified nodes with the same value, or with different values. For example, the text of each element could be masked to the same value. Or, different masks can be applied to each located element. This is configured with the force_consistency option at the transform level. Setting this to true will apply each node transform in the same way to each element.
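
As a sketch of the consistent case, the transform below matches all three Log elements and, with force_consistency: true, masks the text of each to the same value. The choice values are illustrative only.

```yaml
transforms:
  - path: 'UserLog/Log'
    force_consistency: true   # every matched Log element is masked the same way
    node_transforms:
      - type: text
        masks:
          - type: from_choices
            choices:           # illustrative values
              - Event A
              - Event B
              - Event C
```

Removing force_consistency (or setting it to false, the default) would let each Log element receive its own value.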

XPath Relative Node

During masking, XPath expressions are evaluated with the root node as the current node. Therefore, the root node should not be included in relative XPaths.

Consider this example document:

<Root>
    <UserLog>
        <Log/>
    </UserLog>
</Root>

To select the Log node, the XPath Root/UserLog/Log is not valid, as Root is the current node. Instead, UserLog/Log should be used, as the path is relative to Root.

If using an absolute XPath (i.e. an XPath starting with //) then the root node should be included. That is, the XPaths //Root/UserLog/Log and UserLog/Log select the same node(s) in this case.

XPath with XML namespaces

When an XML document uses namespaces, the namespace prefix is not used when specifying the XPath; instead, the namespace URI is included in curly braces {} immediately before the element or attribute name. Note that you must include the namespace URI for each element or attribute in the path.

<Orders xmlns="http://example.com/api/"
        xmlns:o="http://example.com/api/orders/">
  <Order poNumber="55">
    <OrderId>20</OrderId>
    <o:Customer>
        <o:CustomerId>10</o:CustomerId>
        <o:State o:sentiment="good">Happy</o:State>
        <State>NSW</State>
    </o:Customer>
  </Order>
</Orders>

Here's an example ruleset to mask the above XML document:

version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
        - xpath: "/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId/text()"
        masks:
        - type: xml
          transforms:
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId'
            on_missing: error
            node_transforms:
            - type: text
              masks:
              - type: from_random_number
                min: 50
                max: 99
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order'
            on_missing: error
            node_transforms:
            - type: attribute
              attributes: 'poNumber'
              masks:
              - type: from_random_number
                min: 50
                max: 99
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/orders/}State'
            on_missing: error
            node_transforms:
              - type: text
                masks:
                - type: from_choices
                  choices:
                  - Happy
                  - Sad
                  - Angry
                  - Anxious
                  - Excited
              - type: attribute
                attributes: '{http://example.com/api/orders/}sentiment'
                masks:
                - type: from_choices
                  choices:
                  - good
                  - bad
                  - excellent
          - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/}State'
            on_missing: error
            node_transforms:
              - type: text
                masks:
                - type: from_choices
                  choices:
                  - ABC
                  - DEF
                  - JKL

Masking of unknown/extra attributes

There may be cases where XML elements sometimes have extra attributes that are not always known prior to masking. To mask these, the extra_attribute_masks option can be specified. This should contain a list of masks to apply to each attribute that has not been masked using a defined node_transform.

By default, each "extra" attribute value will have the masks applied to it separately. To force each of these values to be the same, specify force_extra_attribute_consistency: true at the transform level. The extra_attribute_masks will be applied to the first extra attribute on the first node found, and the resulting value will be applied to all extra attributes. Note that the order in which attributes are located is indeterminate and may not match the order they appear in the XML.
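
A minimal sketch of these two options, reusing the Log element from earlier on this page. The username attribute is masked by an explicit node_transform; any other attribute, such as date or one not known in advance, is handled by extra_attribute_masks. The masked values are illustrative.

```yaml
transforms:
  - path: 'UserLog/Log'
    node_transforms:
      - type: attribute
        attributes: 'username'
        masks:
          - type: from_fixed
            value: 'masked@example.com'
    # Applied to every attribute not covered by a node_transform above.
    extra_attribute_masks:
      - type: from_choices
        choices:               # illustrative values
          - alpha
          - beta
          - gamma
    # Optional: give all extra attributes one shared masked value.
    force_extra_attribute_consistency: true
```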

Parameters
  • transforms (required): A list of the transforms (replacements) to perform on the XML document.
    • path (required): The XPath expression to locate the value to update.
    • node_transforms (required): A list of transforms to apply to the nodes on the element. The syntax of this object is shown in the node_transforms Parameters section below.
    • on_missing (optional): A string to specify the action to take if the element at the given path is not present (due to the document structure not matching the path).
      • skip: Skip to the next transform, the document is unchanged by this transform.
      • error (default): Raise an error and stop masking.
    • force_consistency (optional): Require each matching element to be masked to the same values. Defaults to false.
    • extra_attribute_masks (optional): A list of masks to apply for attributes not covered by a specific node_transform.
    • force_extra_attribute_consistency (optional): Force all "extra" attributes to be masked to the same value. Only applicable when using extra_attribute_masks. Defaults to false.
  • fallback_masks (optional): Mask to perform if the data retrieved from the database is not valid XML.

If the xml mask is provided a null value (e.g. from a SQL column), the value will remain null. fallback_masks will not be executed.

node_transforms Parameters

node_transforms is a list of transforms to apply to the nodes of the found element(s).

  • type (required): The type of node(s) of the current element to apply masking to. Must be one of:
    • text: The text value of the element (the content between the opening and closing tags).
    • attribute: Mask one or more attribute(s) on the element.
    • name: Mask the name of the element itself.
  • masks (required): A list of masks to be performed (Any of the valid Mask Types).
  • attributes (optional): This option is required when using the attribute type, and must not be present for other types. May either be a string, or an array of strings, which specify the attributes to apply masks to. To apply different masks to different attributes, use multiple node_transforms.
  • on_missing_attribute (optional): A string to specify the action to take if an attribute is missing. Please see the section below on Missing XML Nodes, to see what constitutes a missing attribute.
    • skip: Skip to the next attribute (if masking multiple attributes) or, if there are no attributes to be masked, to the next node_transform. The document is unchanged by this transform.
    • mask: Apply the masks, using a null value, then create the text content or attribute.
    • error (default): Raise an error and stop masking.
  • on_null_text (optional): A string to specify the action to take if the text of a node is null (missing). Please see the section below on Missing XML Nodes, to see what constitutes a missing node.
    • skip (default): Skip to the next node_transform. The document is unchanged by this transform.
    • mask: Apply the masks, using a null value, then create the text content or attribute.
    • error: Raise an error and stop masking.
  • hash_sources (optional): A list of relative paths to values to be used as hash_sources to ensure consistent masking for XML with the same structure.
    • xpath: A relative path from the current node to the node which will be used as a hash source for the mask.
  • advance_hash (optional): A boolean which, when set to true, increments the seed value used for hashing, so that a repeatable sequence of masked values is produced when the value being hashed is the same. The seed is generated from hash_columns (deterministic masking for databases or tabular files) or from hash_sources (deterministic masking for files). Defaults to false.

Missing XML Nodes

The on_missing_attribute and on_null_text options can be used to change how missing values are treated.

  • A text node is considered null if a tag is self-closing. For example, <Transaction amount="23.94"/>. It is also considered null if the element is empty; for example, <Message to="user1" from="user2"></Message>.
  • An attribute is considered missing if it does not exist on the element. For example, the attribute currency is missing from this element: <Transaction amount="23.94"/>. An empty string attribute is not considered missing, and instead is just masked as an empty string.
  • Neither on_missing_attribute nor on_null_text applies to the name node type, as XML elements must always have a name.
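
As a sketch of how these options combine, the fragment below shows only the xml mask portion of a ruleset. It assumes a hypothetical document whose Transaction elements sit at the path Transactions/Transaction; the path and the fixed values NZD and NO_DESCRIPTION are illustrative assumptions, not taken from a real schema. Because mask is selected for both options, an element without a currency attribute would have one created, and a self-closing or empty element would be given text content.

```yaml
# Fragment of a ruleset: only the xml mask is shown.
# The path and the fixed values are hypothetical.
- type: xml
  transforms:
    - path: 'Transactions/Transaction'
      node_transforms:
        # currency may be absent: mask a null value and create the attribute.
        - type: attribute
          attributes:
            - currency
          on_missing_attribute: mask
          masks:
            - type: from_fixed
              value: 'NZD'
        # Self-closing or empty elements have null text: mask it instead of skipping.
        - type: text
          on_null_text: mask
          masks:
            - type: from_fixed
              value: 'NO_DESCRIPTION'
```

With skip in place of mask, such elements would be left unchanged by the transform; with error, masking would stop.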
Retaining known attributes and removing others

There may be some instances where you want to retain known attributes, but mask all others. In this case, you can combine the do_nothing mask with extra_attribute_masks. Any attributes you want to retain are "masked" to their original value with do_nothing; DataMasque considers these to be masked and then applies the extra_attribute_masks to any other attributes.
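
A minimal sketch of this pattern, showing only the xml mask portion of a ruleset. It assumes elements at a hypothetical path Info/Employee with a date attribute that should be retained; the REDACTED replacement value is an arbitrary choice for illustration.

```yaml
# Fragment of a ruleset: only the xml mask is shown.
# The path, attribute name and replacement value are hypothetical.
- type: xml
  transforms:
    - path: 'Info/Employee'
      node_transforms:
        # "Mask" date to its original value so it counts as already masked.
        - type: attribute
          attributes:
            - date
          masks:
            - type: do_nothing
      # Applied to every attribute not covered by node_transforms above.
      extra_attribute_masks:
        - type: from_fixed
          value: REDACTED
```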

Examples

This example contains one transforms item with three node_transforms. The transforms item specifies the path UserLog/Log within the xml_data column; optional parameters that are not specified take their default values.

  • The first of the node_transforms replaces the text at the path with the fixed value REDACTED. The on_null_text: mask option is specified so that null text is masked in the same way.
  • The second masks the username attribute with a similar-looking replacement by concatenating three from_file masks, followed by a transform_case mask so that the replacement is entirely lower case.
  • The third masks the date attribute with a replacement date generated by a from_random_date mask.

Also note the use of fallback_masks. The last row did not contain valid XML, so the fallback mask was used to replace it with an empty <Root /> element, which may help clean the data for further use.

<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>
version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Root />'
            transforms:
              - path: 'UserLog/Log'
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
                    on_null_text: mask
                  - type: attribute
                    attributes:
                        - username
                    masks:
                      - type: concat
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
                          - type: from_file
                            seed_file: DataMasque_email_suffixes.csv
                            seed_column: email-suff
                      - type: transform_case
                        transform: lowercase
                  - type: attribute
                    attributes:
                        - date
                    masks:
                      - type: from_random_date
                        min: '2022-01-01'
                        max: '2022-12-31'

Result

Before (xml_data):
<Root>
    <Message>Hello there!</Message>
</Root>
<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>
NOT_VALID_XML
After (xml_data):
<Root>
    <Message>Hello there!</Message>
</Root>
<Root>
    <UserLog>
        <Log date="2022-02-14" username="levimendoza@t-online.de">REDACTED</Log>
        <Log date="2022-07-11" username="judikarlin@zonnet.nl">REDACTED</Log>
        <Log date="2022-08-10" username="albertguanoluisa@mac.com">REDACTED</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-09-17" username="cliffordvooth@wanadoo.fr">REDACTED</Log>
        <Log date="2022-09-11" username="stevengrunhage@bigpond.net.au">REDACTED</Log>
        <Log date="2022-11-27" username="brianmolodykh@uol.com.br">REDACTED</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-09-05" username="brasoneverz@telenet.be">REDACTED</Log>
        <Log date="2022-02-06" username="jasonrodelbronn@comcast.net">REDACTED</Log>
        <Log date="2022-01-28" username="michaelding@outlook.com">REDACTED</Log>
    </UserLog>
</Root>
<Root />

XML Example with force_consistency

This example will illustrate the benefit of using the force_consistency parameter on transforms. Suppose you have a table with XML data with the following structure:

<Root>
    <UserLog>
        <Log date="2022-08-09" username="user@example.com">Account created</Log>
        <Log date="2022-08-09" username="user@example.com"></Log>
        <Log date="2022-08-09" username="user@example.com">Logged out</Log>
    </UserLog>
</Root>

When masking the date and username attributes at the path UserLog/Log, you may want the values to be consistent, i.e. every matched Log element in a document receives the same masked date and the same masked username. To do this, set the force_consistency parameter of the relevant transform to true.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Root />'
            transforms:
              - path: 'UserLog/Log'
                force_consistency: true
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
                    on_null_text: mask
                  - type: attribute
                    attributes:
                        - username
                    masks:
                      - type: concat
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
                          - type: from_file
                            seed_file: DataMasque_email_suffixes.csv
                            seed_column: email-suff
                      - type: transform_case
                        transform: lowercase
                  - type: attribute
                    attributes:
                        - date
                    masks:
                      - type: from_random_date
                        min: '2022-01-01'
                        max: '2022-12-31'

Result

Before (xml_data):
<Root>
    <UserLog>
        <Log date="2022-07-07" username="edwardalcarazo@yahoo.co.id">Account created</Log>
        <Log date="2022-07-07" username="edwardalcarazo@yahoo.co.id"></Log>
        <Log date="2022-07-07" username="edwardalcarazo@yahoo.co.id">Logged out</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-02-14" username="zamyramummertz@shaw.c">Account created</Log>
        <Log date="2022-02-14" username="zamyramummertz@shaw.c"></Log>
        <Log date="2022-02-14" username="zamyramummertz@shaw.c">Logged out</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-01-12" username="colbydahmane@free.fr">Account created</Log>
        <Log date="2022-01-12" username="colbydahmane@free.fr"></Log>
        <Log date="2022-01-12" username="colbydahmane@free.fr">Logged out</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-11-08" username="williamzieglmeister@laposte.net">Account created</Log>
        <Log date="2022-11-08" username="williamzieglmeister@laposte.net"></Log>
        <Log date="2022-11-08" username="williamzieglmeister@laposte.net">Logged out</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-07-19" username="patricianurhamitov@alice.it">Account created</Log>
        <Log date="2022-07-19" username="patricianurhamitov@alice.it"></Log>
        <Log date="2022-07-19" username="patricianurhamitov@alice.it">Logged out</Log>
    </UserLog>
</Root>
After (xml_data):
<Root>
    <UserLog>
        <Log date="2022-10-08" username="billyferwagner@msn.com">REDACTED</Log>
        <Log date="2022-10-08" username="billyferwagner@msn.com">REDACTED</Log>
        <Log date="2022-10-08" username="billyferwagner@msn.com">REDACTED</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-11-19" username="williamflorista@juno.com">REDACTED</Log>
        <Log date="2022-11-19" username="williamflorista@juno.com">REDACTED</Log>
        <Log date="2022-11-19" username="williamflorista@juno.com">REDACTED</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-09-03" username="christinecauqui@mail.com">REDACTED</Log>
        <Log date="2022-09-03" username="christinecauqui@mail.com">REDACTED</Log>
        <Log date="2022-09-03" username="christinecauqui@mail.com">REDACTED</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-01-24" username="jamestroyas@hotmail.it">REDACTED</Log>
        <Log date="2022-01-24" username="jamestroyas@hotmail.it">REDACTED</Log>
        <Log date="2022-01-24" username="jamestroyas@hotmail.it">REDACTED</Log>
    </UserLog>
</Root>
<Root>
    <UserLog>
        <Log date="2022-04-23" username="williejones@telenet.be">REDACTED</Log>
        <Log date="2022-04-23" username="williejones@telenet.be">REDACTED</Log>
        <Log date="2022-04-23" username="williejones@telenet.be">REDACTED</Log>
    </UserLog>
</Root>

Without force_consistency, each Log element in the output XML would receive a different username and date, as shown in the first example.

Example with force_extra_attribute_consistency

This example will illustrate the benefit of using the force_extra_attribute_consistency parameter on transforms. Suppose you have XML data with the following structure:

<Root>
    <Info>
        <Employee date="2022-10-08" given_name="billy_ferwagner" preferred_name="billy_ferwagner"></Employee>
        <Employee date="2022-10-08" given_name="william_florista" preferred_name="william_florista"></Employee>
    </Info>
</Root>

This time, suppose you want to mask the given_name and preferred_name attributes to the same value. To achieve this, add node_transforms for any attributes that should be handled separately (e.g. the date attribute), set force_extra_attribute_consistency: true, and specify extra_attribute_masks with the masks to be performed on the remaining ("extra") attributes. A single masked value is generated from the specified masks and used as the value of every extra attribute on the element.

version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: fallback
            transforms:
              - path: 'Info/Employee'
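                node_transforms:
                  # Retain date unchanged so that it is not treated as an "extra" attribute.
                  - type: attribute
                    attributes:
                      - date
                    masks:
                      - type: do_nothing
                force_extra_attribute_consistency: true
                # One value is generated and applied to all remaining attributes.
                extra_attribute_masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed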

Result

Before (xml_data):
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Charlotte" preferred_name="Charlotte" />
        <Employee date="2022-10-08" given_name="Wilbur" preferred_name="Wilbur" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Gerald" preferred_name="Gerald" />
        <Employee date="2022-10-08" given_name="Gary" preferred_name="Gary" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Carrol" preferred_name="Carrol" />
        <Employee date="2022-10-08" given_name="Johnny" preferred_name="Johnny" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Alexandria" preferred_name="Alexandria" />
        <Employee date="2022-10-08" given_name="Eddie" preferred_name="Eddie" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Matthew" preferred_name="Matthew" />
        <Employee date="2022-10-08" given_name="Julio" preferred_name="Julio" />
    </Info>
</Root>
After (xml_data):
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Nancy" preferred_name="Nancy" />
        <Employee date="2022-10-08" given_name="Arthur" preferred_name="Arthur" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Jack" preferred_name="Jack" />
        <Employee date="2022-10-08" given_name="Richard" preferred_name="Richard" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Joyce" preferred_name="Joyce" />
        <Employee date="2022-10-08" given_name="Ryan" preferred_name="Ryan" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Phillip" preferred_name="Phillip" />
        <Employee date="2022-10-08" given_name="Verdon" preferred_name="Verdon" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Richard" preferred_name="Richard" />
        <Employee date="2022-10-08" given_name="Christopher" preferred_name="Christopher" />
    </Info>
</Root>

Without force_extra_attribute_consistency, the given_name and preferred_name attributes would be masked to different values, as shown below.

Result

Before (xml_data):
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Charlotte" preferred_name="Charlotte" />
        <Employee date="2022-10-08" given_name="Wilbur" preferred_name="Wilbur" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Gerald" preferred_name="Gerald" />
        <Employee date="2022-10-08" given_name="Gary" preferred_name="Gary" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Carrol" preferred_name="Carrol" />
        <Employee date="2022-10-08" given_name="Johnny" preferred_name="Johnny" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Alexandria" preferred_name="Alexandria" />
        <Employee date="2022-10-08" given_name="Eddie" preferred_name="Eddie" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Matthew" preferred_name="Matthew" />
        <Employee date="2022-10-08" given_name="Julio" preferred_name="Julio" />
    </Info>
</Root>
After (xml_data):
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Braidon" preferred_name="Mark" />
        <Employee date="2022-10-08" given_name="Edmund" preferred_name="Rishi" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Maud" preferred_name="Kathy" />
        <Employee date="2022-10-08" given_name="Thomas" preferred_name="Damon" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Harshil" preferred_name="Kathy" />
        <Employee date="2022-10-08" given_name="Donald" preferred_name="Liliana" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Michael" preferred_name="Nicolas" />
        <Employee date="2022-10-08" given_name="Maanav" preferred_name="Rebeca" />
    </Info>
</Root>
<Root>
    <Info>
        <Employee date="2022-10-08" given_name="Thomas" preferred_name="Ashley" />
        <Employee date="2022-10-08" given_name="Angela" preferred_name="Alan" />
    </Info>
</Root>

XML Example with advance_hash

This example will demonstrate the benefit of using the advance_hash parameter on node transforms when using Deterministic Masking. Suppose you have a table which contains XML data with the following structure:

<Group>
    <Members>
        <Names>
            <Name>James Chalmers</Name>
            <Name>Libby Stevenson</Name>
            <Name>Matthew Radgen</Name>
        </Names>
    </Members>
</Group>

When a path retrieves multiple nodes (in this example, the path is //Group/Members/Names/Name), you may want each node to be masked with a different value while still ensuring deterministic masking based on the id column in the table. To get the deterministic behaviour, either hash_columns or hash_sources must be specified; in this case hash_columns is used, as the XML data is stored in a table.

version: "1.0"
name: xml_advance_hash
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        hash_columns:
          - column_name: id
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Group />'
            transforms:
              - path: "//Group/Members/Names/Name"
                node_transforms:
                  - type: text
                    masks:
                      - type: concat
                        glue: " "
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames

Result

Before (columns: id, xml_data):
1
<Group>
    <Members>
        <Names>
            <Name>Neal Jollet</Name>
            <Name>Dale Shields</Name>
            <Name>Scott Feinman</Name>
        </Names>
    </Members>
</Group>
2
<Group>
    <Members>
        <Names>
            <Name>Bette Williams</Name>
            <Name>Weldon Shelting</Name>
            <Name>Richard Mayer</Name>
        </Names>
    </Members>
</Group>
3
<Group>
    <Members>
        <Names>
            <Name>Craig Cooke</Name>
            <Name>Jean Kempfer</Name>
            <Name>Victor Robu</Name>
        </Names>
    </Members>
</Group>
4
<Group>
    <Members>
        <Names>
            <Name>Davey Newstead</Name>
            <Name>Rusty Barnett</Name>
            <Name>Joshua Werrett</Name>
        </Names>
    </Members>
</Group>
5
<Group>
    <Members>
        <Names>
            <Name>Leslie Elvins</Name>
            <Name>Stewart Arnett</Name>
            <Name>Perseus Levitt</Name>
        </Names>
    </Members>
</Group>
After (columns: id, xml_data):
1
<Group>
    <Members>
        <Names>
            <Name>James Seshu</Name>
            <Name>James Seshu</Name>
            <Name>James Seshu</Name>
        </Names>
    </Members>
</Group>
2
<Group>
    <Members>
        <Names>
            <Name>Libby Shaw</Name>
            <Name>Libby Shaw</Name>
            <Name>Libby Shaw</Name>
        </Names>
    </Members>
</Group>
3
<Group>
    <Members>
        <Names>
            <Name>Matthew Vesa</Name>
            <Name>Matthew Vesa</Name>
            <Name>Matthew Vesa</Name>
        </Names>
    </Members>
</Group>
4
<Group>
    <Members>
        <Names>
            <Name>Paul Sevriens</Name>
            <Name>Paul Sevriens</Name>
            <Name>Paul Sevriens</Name>
        </Names>
    </Members>
</Group>
5
<Group>
    <Members>
        <Names>
            <Name>Steven Robinson</Name>
            <Name>Steven Robinson</Name>
            <Name>Steven Robinson</Name>
        </Names>
    </Members>
</Group>

However, with deterministic masking (hash_columns or hash_sources specified) and advance_hash left at its default, all masked values for the path //Group/Members/Names/Name within a row are the same, as in the result above. If advance_hash is enabled, the masked values form a repeatable sequence rather than a single repeated value. The next example shows the difference with advance_hash: true specified.

version: "1.0"
name: xml_advance_hash
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        hash_columns:
          - column_name: id
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Group />'
            transforms:
              - path: "//Group/Members/Names/Name"
                force_consistency: false
                node_transforms:
                  - type: text
                    advance_hash: true
                    masks:
                      - type: concat
                        glue: " "
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames

Result

Before (columns: id, xml_data):
1
<Group>
    <Members>
        <Names>
            <Name>Neal Jollet</Name>
            <Name>Dale Shields</Name>
            <Name>Scott Feinman</Name>
        </Names>
    </Members>
</Group>
2
<Group>
    <Members>
        <Names>
            <Name>Bette Williams</Name>
            <Name>Weldon Shelting</Name>
            <Name>Richard Mayer</Name>
        </Names>
    </Members>
</Group>
3
<Group>
    <Members>
        <Names>
            <Name>Craig Cooke</Name>
            <Name>Jean Kempfer</Name>
            <Name>Victor Robu</Name>
        </Names>
    </Members>
</Group>
4
<Group>
    <Members>
        <Names>
            <Name>Davey Newstead</Name>
            <Name>Rusty Barnett</Name>
            <Name>Joshua Werrett</Name>
        </Names>
    </Members>
</Group>
5
<Group>
    <Members>
        <Names>
            <Name>Leslie Elvins</Name>
            <Name>Stewart Arnett</Name>
            <Name>Perseus Levitt</Name>
        </Names>
    </Members>
</Group>
After (columns: id, xml_data):
1
<Group>
    <Members>
        <Names>
            <Name>James Seshu</Name>
            <Name>Madelyn Hirzhman</Name>
            <Name>Henry Valbonesi</Name>
        </Names>
    </Members>
</Group>
2
<Group>
    <Members>
        <Names>
            <Name>Libby Shaw</Name>
            <Name>Matthew Vesa</Name>
            <Name>Clerance Larcegui</Name>
        </Names>
    </Members>
</Group>
3
<Group>
    <Members>
        <Names>
            <Name>Matthew Vesa</Name>
            <Name>Darrell Hemi</Name>
            <Name>Joy Eberts</Name>
        </Names>
    </Members>
</Group>
4
<Group>
    <Members>
        <Names>
            <Name>Paul Sevriens</Name>
            <Name>Thelma Castellnou</Name>
            <Name>Jon Colomes</Name>
        </Names>
    </Members>
</Group>
5
<Group>
    <Members>
        <Names>
            <Name>Steven Robinson</Name>
            <Name>Jordan Klaass</Name>
            <Name>Paola Rull</Name>
        </Names>
    </Members>
</Group>

Notice that the first masked value in each row is the same in both examples. This is because the masking is deterministic: the sequence of masked values is repeatable for the same value from hash_columns or hash_sources.
The same idea can be used for file masking, but instead of specifying hash_columns, use hash_sources with an xpath to the value to be used for hashing.
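
As a rough sketch of supplying a hash source in the file-masking variant, the fragment below shows only the xml mask. It assumes a hypothetical document in which each Member element has a name attribute and a MemberId child element, and it assumes that a plain child-element name is accepted as the relative xpath. Neither the document structure nor the element names come from the examples above.

```yaml
# Fragment of a ruleset: only the xml mask is shown.
# Members/Member, name and MemberId are hypothetical names.
- type: xml
  transforms:
    - path: 'Members/Member'
      node_transforms:
        - type: attribute
          attributes:
            - name
          # Hash on the MemberId child of the current Member element, so the
          # same MemberId is intended to yield the same masked name.
          hash_sources:
            - xpath: 'MemberId'
          masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
```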