Document Masks
Document masks can mask more complex relationships inside JSON or XML objects.
- JSON Masking: Masks data inside JSON documents
- XML Masking: Masks data inside XML documents
JSON (json)
This mask uses a query to locate and mask a value inside a JSON document. The rest of the JSON document is unchanged.
The path is specified as a list of strings or integers, which are followed when traversing the data to reach the values to be masked; some examples of path are covered in the next section. A JSON mask returns the same type of data that it received; for example, masking text formatted as JSON will return text, while masking a JSON encoded column or file will return a JSON encoded value.
Parameters
- transforms (required): A list of the transforms (replacements) to perform on the JSON document.
  - path (required): The path to locate the value to update.
  - masks (required): A list of masks to be performed (any of the valid Mask Types).
  - on_null (optional): A string to specify the action to take if the value is null. One of:
    - skip (default): Skip to the next transform; the document remains unchanged.
    - error: Raise an error and stop masking.
    - mask: Mask the null value as specified.
  - on_missing (optional): A string to specify the action to take if the value is not present (due to the document structure not matching the path). One of:
    - skip: Skip to the next transform; the document remains unchanged.
    - error (default): Raise an error and stop masking.
  - force_consistency (optional): Keep consistency between replacements in the path. See the section JSON Example with force_consistency for details on behaviour. Defaults to false.
  - hash_sources (optional): A list of relative paths to values to be used as hash_sources to ensure consistent masking for JSON data with the same structure.
  - advance_hash (optional): A boolean which, when set to true, will increment the seed value for hashing. The seed is generated when specifying hash_columns (deterministic masking for databases or tabular files) or hash_sources (deterministic masking for files). This allows for a repeatable sequence of masked values when the value the hashing is performed on is the same. Defaults to false.
- fallback_masks (optional): Mask to perform if the data retrieved from the database is not valid JSON.
If the json mask is provided a null value (e.g. from a SQL column), the value will remain null. fallback_masks will not be executed.
When masking multiple values in the same JSON document, multiple transforms should be specified, instead of multiple table masks with a single transform each. This means that the JSON column will only need to be serialized/deserialized once per row.
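The on_null and on_missing options described above can be pictured with a short Python sketch. This is only an illustration of the documented behaviour, not DataMasque's implementation; the helper name apply_transform and the mask callable are hypothetical, and wildcards are not handled.

```python
import copy

_MISSING = object()  # sentinel: the path does not exist in the document


def apply_transform(doc, path, mask, on_null="skip", on_missing="error"):
    """Return a copy of doc with the value at path masked (hypothetical helper)."""
    doc = copy.deepcopy(doc)
    parent, current = None, doc
    for component in path:
        parent = current
        try:
            if not isinstance(parent, (dict, list)):
                raise TypeError("cannot descend into a scalar value")
            current = parent[component]
        except (KeyError, IndexError, TypeError):
            current = _MISSING
            break
    if current is _MISSING:
        if on_missing == "error":
            raise ValueError(f"path {path!r} not found")
        return doc  # on_missing: skip leaves the document unchanged
    if current is None:
        if on_null == "error":
            raise ValueError(f"value at {path!r} is null")
        if on_null == "skip":
            return doc  # on_null: skip leaves the document unchanged
    parent[path[-1]] = mask(current)
    return doc


def redact(_value):
    # Stand-in for a from_fixed mask with value "REDACTED".
    return "REDACTED"


row = {"customer_details": {"first_name": None}}
print(apply_transform(row, ["customer_details", "first_name"], redact, on_null="mask"))
```

The defaults in the sketch mirror the documented ones: a null value is skipped unless on_null says otherwise, while a missing path raises an error unless on_missing is set to skip.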
path Intro
A JSON path is a list of path components (strings or integers) used to traverse a JSON document.
The path examples below make reference to the following JSON document; it describes an order with some customer details, a quantity, and a list of products.
{
  "customer_details": {
    "first_name": "Richard",
    "last_name": "Willis"
  },
  "quantity": 18,
  "products": ["product1", "product2"]
}
The following paths can be used to refer to particular values:
- ["customer_details"] refers to the customer details object, {"first_name": "Richard", "last_name": "Willis"}
- ["customer_details", "first_name"] refers to the value "Richard"
- ["customer_details", "last_name"] refers to the value "Willis"
- ["quantity"] refers to the value 18
- ["products"] refers to the products array, ["product1", "product2"]
- ["products", 0] refers to the first value in the products array, "product1"
- ["products", 1] refers to the second value in the products array, "product2"
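To make the traversal concrete, the following Python sketch follows a path one component at a time through the order document above. The function name resolve_path is hypothetical and the code only mirrors the lookups listed above; it is not how DataMasque itself is implemented.

```python
import json

ORDER = json.loads("""
{
  "customer_details": {"first_name": "Richard", "last_name": "Willis"},
  "quantity": 18,
  "products": ["product1", "product2"]
}
""")


def resolve_path(document, path):
    """Follow each path component in turn: strings pick object keys, integers pick array items."""
    current = document
    for component in path:
        current = current[component]
    return current


print(resolve_path(ORDER, ["customer_details", "first_name"]))  # Richard
print(resolve_path(ORDER, ["products", 1]))  # product2
```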
Quoting numbers in paths
Numeric components of paths that refer to indexes of an array should not be quoted. Quoting is required if numbers refer to the key of an object that is a numeric string.
For example, in this JSON document users are stored in an object with string keys.
{
  "users": {
    "0": "Richard",
    "1": "Willis"
  }
}
The user "Richard" can be accessed with the path ["users", "0"].
Compare this to the following example, which stores users in an array.
{
  "users": ["Richard", "Willis"]
}
In this case, "Richard" should be accessed with the path ["users", 0], where 0 is unquoted as it refers to an array index.
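Python's own JSON handling draws the same distinction, which makes it a convenient analogy. The sketch below assumes nothing about DataMasque internals; it only shows that a string component addresses an object key while an integer component addresses an array index, and that mixing them up fails.

```python
import json

users_by_key = json.loads('{"users": {"0": "Richard", "1": "Willis"}}')
users_in_array = json.loads('{"users": ["Richard", "Willis"]}')


def lookup(document, path):
    """Return the value at path, or None if a component has the wrong type or is absent."""
    current = document
    for component in path:
        try:
            current = current[component]
        except (KeyError, IndexError, TypeError):
            return None
    return current


print(lookup(users_by_key, ["users", "0"]))   # Richard (quoted: object key)
print(lookup(users_in_array, ["users", 0]))   # Richard (unquoted: array index)
print(lookup(users_by_key, ["users", 0]))     # None (no integer key 0 in the object)
print(lookup(users_in_array, ["users", "0"]))  # None (arrays cannot be indexed by a string)
```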
Working with repeated elements of unknown length
The wildcard operator * can be used to apply masks to multiple items matching the query. This is useful if you don't know how many elements will be in an array or object. For example, a JSON object with multiple people, each with multiple addresses:
{
  "users": [
    {
      "name": "Richard",
      "addresses": [
        {"type": "postal", "city": "Fairview"},
        {"type": "physical", "city": "Riverside"}
      ]
    },
    {
      "name": "Willis",
      "addresses": [
        {"type": "postal", "city": "Beachland"},
        {"type": "physical", "city": "Bronson"}
      ]
    }
  ]
}
The path ["users", "*", "name"] would mask the name for every element in users, regardless of how many there are.
Multiple wildcards can be used, too. The path ["users", "*", "addresses", "*", "city"] would mask city in all addresses elements of all users.
Note that * must always be quoted in YAML.
Wildcards and numeric indexes may be used together. For example, to mask only the city of the first address, leaving the cities of all other addresses unmasked, use the path ["users", "*", "addresses", 0, "city"]. Since the "*" is used after "users", this would still apply to all users.
Note: Values in path are case-sensitive. They should not follow quoting rules for database columns (double quotation marks in an outer set of single quotation marks). Instead, normal YAML string-quoting rules apply.
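One way to reason about wildcard paths is to expand them into the concrete paths they match. The Python sketch below does this for the users document above. The function expand_wildcards is hypothetical and is meant only as a mental model; in particular, silently skipping branches that do not match is an assumption of the sketch, not documented behaviour.

```python
USERS = {
    "users": [
        {"name": "Richard", "addresses": [
            {"type": "postal", "city": "Fairview"},
            {"type": "physical", "city": "Riverside"},
        ]},
        {"name": "Willis", "addresses": [
            {"type": "postal", "city": "Beachland"},
            {"type": "physical", "city": "Bronson"},
        ]},
    ]
}


def expand_wildcards(document, path):
    """Yield every concrete path (without "*") in document that matches path."""
    def walk(node, remaining, prefix):
        if not remaining:
            yield prefix
            return
        if not isinstance(node, (dict, list)):
            return  # cannot descend into a scalar
        head, rest = remaining[0], remaining[1:]
        if head == "*":
            keys = range(len(node)) if isinstance(node, list) else list(node.keys())
        else:
            keys = [head]
        for key in keys:
            try:
                child = node[key]
            except (KeyError, IndexError, TypeError):
                continue  # assumption: non-matching branches are skipped
            yield from walk(child, rest, prefix + [key])

    yield from walk(document, list(path), [])


for concrete in expand_wildcards(USERS, ["users", "*", "addresses", 0, "city"]):
    print(concrete)
```

With the document above, the mixed path expands to one concrete path per user, each pointing at the city of that user's first address.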
Example
This example replaces the data at the path [customer_details, first_name] of the json_data column with the fixed value REDACTED. The on_null: mask option is specified to mask the null value as normal. The on_missing: skip option is specified to skip that transform and continue masking when the value is missing (i.e. the structure does not match the path).
Note that this means the first_name in the wrong location in the first row is not masked. In cases like this, it can be safer to specify error instead, so the masking run fails if data is not in the expected format.
In the second row, where the value is {"first_name": null}, this value will be masked since we specified on_null: mask.
Also note the use of fallback_masks. The last row did not have valid JSON data in it, so the fallback mask was used to replace it with an empty JSON object, which may help clean the data for further use.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: uid
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: [customer_details, first_name]
                masks:
                  - type: from_fixed
                    value: "REDACTED"
                on_null: mask
                on_missing: skip
            fallback_masks:
              - type: from_fixed
                value: "{}"
For arrays, all masks will be applied to each value in the array.
For example:
{
  "customer_details": {
    "given_names": ["Richard", "Willis"]
  }
}
The path [customer_details, given_names] would return the value ["Richard", "Willis"] and the masks would then be performed on "Richard" and "Willis" separately. This means that for most mask functions, each value in the array would be transformed into a new, different value. However, if you are using a mask that always returns the same value (e.g. from_fixed), all values would be transformed to the same new value.
Note:
- In all databases, the json mask supports masking of JSON data stored in text type columns (VARCHAR, NVARCHAR or TEXT).
- JSON specific column types are also supported, for example, JSON in PostgreSQL and MySQL, or JSONB in PostgreSQL.
- Arrays, maps, and sets inside Amazon DynamoDB columns can also be masked with the json mask. Sets are treated like arrays, with items indexed according to their sorted order.
JSON Example with force_consistency
This example illustrates the benefit of using the force_consistency parameter on transforms.
Suppose you have a table with JSON data with the following structure:
{
  "name": [
    {
      "use": "official",
      "family": "Chalmers",
      "given": ["Peter", "James"]
    },
    {
      "use": "usual",
      "given": ["Jim"]
    },
    {
      "use": "maiden",
      "family": "Windsor",
      "given": ["Peter", "James"]
    }
  ]
}
When masking the items at the path ['name', '*', 'given'], it would be best to mask them with consistent values, i.e. the same masked names would appear in each of the given items after masking. To do this, set the force_consistency parameter of the relevant transform to true.
version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
                force_consistency: true
Without force_consistency, the names in the output JSON would all be masked to different values.
JSON Example with advance_hash
This example demonstrates the benefit of using the advance_hash parameter when using deterministic masking.
Suppose you have JSON data with the same structure as the previous example, but the masked values should be deterministic, based on the id column in the table.
In order to get the deterministic behaviour, either hash_columns or hash_sources will need to be specified. In this case hash_columns will be specified, as the JSON data is stored in a database.
A wildcard is now included at the end of the path because each name should be replaced with a different name; without it, each item in the list would be the same due to deterministic masking.
version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - hash_columns:
          - column_name: id
        column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given', '*']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
However, when using deterministic masking (with hash_columns/hash_sources specified), all masked values for the path ['name', '*', 'given', '*'] will be the same. If advance_hash is enabled, masked values will be a repeatable sequence rather than a single repeated value. This next example shows the difference with advance_hash: true specified.
version: "1.0"
tasks:
  - type: mask_table
    table: dbo.json_test
    key: id
    rules:
      - hash_columns:
          - column_name: id
        column: json_data
        masks:
          - type: json
            transforms:
              - path: ['name', '*', 'given', '*']
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_male.csv
                    seed_column: firstname-male
                advance_hash: true
Notice that the first masked value is the same in both examples. This is because the masking is deterministic: the sequence of masked values is repeatable for the same value from the hash_columns/hash_sources.
The same idea can be used for file masking, but instead of specifying hash_columns, hash_sources should be used with a json_path for the value to be used for the hashing.
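The effect of advance_hash can be modelled with a toy Python sketch. Everything here is an assumption made for illustration: the seed derivation via SHA-256, the NAMES list standing in for a seed file, and the function names are hypothetical and do not describe DataMasque's actual hashing. The sketch only reproduces the documented outcome: one repeated value without advance_hash, and a repeatable sequence that starts with that same value when it is enabled.

```python
import hashlib

# Hypothetical stand-in for the rows of a seed file.
NAMES = ["Alex", "Blake", "Casey", "Drew", "Emery"]


def seed_from(hash_value):
    """Derive an integer seed from the hashed column value (illustrative only)."""
    digest = hashlib.sha256(str(hash_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def mask_names(values, hash_value, advance_hash=False):
    """Pick a replacement per value; advancing the seed gives a sequence instead of one repeated pick."""
    seed = seed_from(hash_value)
    masked = []
    for position, _original in enumerate(values):
        offset = position if advance_hash else 0
        masked.append(NAMES[(seed + offset) % len(NAMES)])
    return masked


given = ["Peter", "James"]
print(mask_names(given, hash_value=1))                     # same name twice
print(mask_names(given, hash_value=1, advance_hash=True))  # two different names, same first name as above
```

Running either call again with the same hash_value gives the same result, which is the property deterministic masking relies on.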
XML (xml)
This mask uses a query to locate and mask a value inside an XML document. The rest of the XML document is unchanged.
An Xpath (path) is used to define the path to the node to mask. Once the node has been located, one or more node_transforms can be applied to alter its content or attributes.
Note: The xml mask should only be used with trusted XML data. The parser includes support for entity expansion and external references, which can potentially be exploited with malicious XML payloads.
Note: XML declarations are preserved when the XML document is masked, except where the declaration contains the standalone parameter and no encoding parameter. In this case encoding='UTF-8' will be added to the declaration of the XML document.
Intro to transforms and node_transforms
XML documents are made up of one or more elements. When referring to an element, this includes the start tag, end tag, attributes and content. For example, this element representing a log:
<Log date="2022-08-09" username="user@example.com">Account created</Log>
The element to mask is located using an Xpath expression. Once found, there are a few different parts of the element that can be masked, namely:
- its name (Log)
- its attributes (date and username)
- its text (Account created)
Each of these items is an XML node.
When a masking run executes, each row from the database is fetched and passed to a masking function only once. To apply masks on different elements in an XML document, the ruleset should define a list of transforms, one for each element that requires masking. In turn, a list of node_transforms must be specified, one for each node of the element that needs to be masked.
Specifying masking in this manner allows the masking run to be more efficient by querying for each element to be masked only once.
As an example, consider how to mask the Log in the above example. The date and username attributes should be redacted, along with the text content. This would require one transform to locate the Log element, then three node transforms: one for the date attribute, another for the username attribute, and a final one to mask the text of the element.
The relevant portion of the YAML describing this transform would look like:
transforms:
  - path: 'Log'
    node_transforms:
      - type: attribute
        attributes: 'date'
        masks:
          - <list of masks>
      - type: attribute
        attributes: 'username'
        masks:
          - <list of masks>
      - type: text
        masks:
          - <list of masks>
Note: This assumes the Log element is not the root element in the XML document. To get the root element, use . or an absolute Xpath (starting with //) as the path.
All XML values are read as strings, so a typecast mask is required if they are used in a mask that requires non-string values (e.g. numeric_bucket). XML also requires strings to be written, so masks that return non-string values (e.g. from_random_number, from_random_boolean, numeric_bucket) need to go through a typecast mask before being written. For more information on typecast, please refer to the Typecast documentation. Below is an example with from_random_number.
version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            transforms:
              - path: 'Log'
                node_transforms:
                  - type: attribute
                    attributes: 'id'
                    masks:
                      - type: from_random_number
                        min: 1000
                        max: 9999
                      - type: typecast
                        typecast_as: 'string'
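The split between locating an element once and then altering several of its nodes can be sketched with Python's standard xml.etree.ElementTree module. This is an analogy only and makes no claim about the parser DataMasque uses; the function mask_log and the sample document are hypothetical. It also shows the same string requirement described above: the random number is converted with str() before it is written to the attribute.

```python
import random
import xml.etree.ElementTree as ET

document = ET.fromstring(
    '<Root><Log id="17" date="2022-08-09" username="user@example.com">'
    "Account created</Log></Root>"
)


def mask_log(root, rng):
    """Locate each Log element once, then mask its attribute and text nodes."""
    for log in root.findall("Log"):  # the "transform": one query per element
        # The "node transforms": one change per node of the located element.
        log.set("id", str(rng.randint(1000, 9999)))  # numbers must be written as strings
        log.set("date", "REDACTED")
        log.set("username", "REDACTED")
        log.text = "REDACTED"
    return root


masked = mask_log(document, random.Random(0))
print(ET.tostring(masked, encoding="unicode"))
```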
Consistency for multiple elements
Xpath expressions can match multiple elements. This XML document contains a UserLog with multiple Logs:
<Root>
  <UserLog>
    <Log date="2022-08-09" username="user@example.com">Account created</Log>
    <Log date="2022-08-09" username="user@example.com">Logged in</Log>
    <Log date="2022-08-09" username="user@example.com">Logged out</Log>
  </UserLog>
</Root>
The root is called Root in these examples; the root node does not need to be named Root.
The Xpath UserLog/Log would match all three Log elements. DataMasque can be configured to mask each of the specified nodes with the same value, or with different values. For example, the text of each element could be masked to the same value. Or, different masks can be applied to each located element. This is configured with the force_consistency option at the transform level. Setting this to true will apply each node transform in the same way to each element.
Xpath Relative Node
When evaluating an xpath expression, the root node is considered to be the current node when executing masking. Therefore, the root node should not be included when using relative xpaths.
Consider this example document:
<Root>
  <UserLog>
    <Log/>
  </UserLog>
</Root>
To select the Log node, the Xpath Root/UserLog/Log is not valid, as Root is the current node. Instead, UserLog/Log should be used, as the path is relative to Root.
If using an absolute Xpath (i.e. an Xpath starting with //) then the root node should be included. That is, the Xpath //Root/UserLog/Log and UserLog/Log select the same node(s) in this case.
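Python's standard xml.etree.ElementTree module happens to treat relative paths the same way, so it can be used to check the rule on the example document. This is an analogy rather than a description of DataMasque's parser.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<Root><UserLog><Log/></UserLog></Root>")

# The root element is the current node, so the path starts at its children.
relative_matches = root.findall("UserLog/Log")

# Including the root's own name looks for a child called Root, which does not exist.
wrong_matches = root.findall("Root/UserLog/Log")

print(len(relative_matches))  # 1
print(len(wrong_matches))     # 0
```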
XPath with XML namespaces
When an XML document uses namespaces, the namespace prefix is not used when specifying the Xpath. Instead, the namespace URI is included in curly braces {} immediately before the element or attribute name. Note that you must include the namespace URI for each element or attribute in the path.
<Orders xmlns="http://example.com/api/"
        xmlns:o="http://example.com/api/orders/">
  <Order poNumber="55">
    <OrderId>20</OrderId>
    <o:Customer>
      <o:CustomerId>10</o:CustomerId>
      <o:State o:sentiment="good">Happy</o:State>
      <State>NSW</State>
    </o:Customer>
  </Order>
</Orders>
Here's an example ruleset to mask the above XML document:
version: "1.0"
tasks:
  - type: mask_file
    rules:
      - hash_sources:
          - xpath: "/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId/text()"
        masks:
          - type: xml
            transforms:
              - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/}OrderId'
                on_missing: error
                node_transforms:
                  - type: text
                    masks:
                      - type: from_random_number
                        min: 50
                        max: 99
              - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order'
                on_missing: error
                node_transforms:
                  - type: attribute
                    attributes: 'poNumber'
                    masks:
                      - type: from_random_number
                        min: 50
                        max: 99
              - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/orders/}State'
                on_missing: error
                node_transforms:
                  - type: text
                    masks:
                      - type: from_choices
                        choices:
                          - Happy
                          - Sad
                          - Angry
                          - Anxious
                          - Excited
                  - type: attribute
                    attributes: '{http://example.com/api/orders/}sentiment'
                    masks:
                      - type: from_choices
                        choices:
                          - good
                          - bad
                          - excellent
              - path: '/{http://example.com/api/}Orders/{http://example.com/api/}Order/{http://example.com/api/orders/}Customer/{http://example.com/api/}State'
                on_missing: error
                node_transforms:
                  - type: text
                    masks:
                      - type: from_choices
                        choices:
                          - ABC
                          - DEF
                          - JKL
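The {namespace URI}name notation used in the paths above is also what Python's standard xml.etree.ElementTree module uses, so it offers a quick way to check which URI belongs to which element or attribute before writing a ruleset. The sketch below is an analogy under that assumption; the helper q is hypothetical. Note that the unprefixed poNumber attribute has no namespace, while o:sentiment does.

```python
import xml.etree.ElementTree as ET

API = "http://example.com/api/"
ORDERS = "http://example.com/api/orders/"

root = ET.fromstring(
    '<Orders xmlns="http://example.com/api/" xmlns:o="http://example.com/api/orders/">'
    '<Order poNumber="55"><OrderId>20</OrderId>'
    "<o:Customer><o:CustomerId>10</o:CustomerId>"
    '<o:State o:sentiment="good">Happy</o:State><State>NSW</State>'
    "</o:Customer></Order></Orders>"
)


def q(namespace, name):
    """Build a namespace-qualified name in {uri}name form."""
    return "{" + namespace + "}" + name


order = root.find(q(API, "Order"))
order_id = root.find(q(API, "Order") + "/" + q(API, "OrderId"))
customer_path = q(API, "Order") + "/" + q(ORDERS, "Customer")
mood = root.find(customer_path + "/" + q(ORDERS, "State"))
region = root.find(customer_path + "/" + q(API, "State"))

print(order.get("poNumber"), order_id.text)       # 55 20
print(mood.text, mood.get(q(ORDERS, "sentiment")))  # Happy good
print(region.text)                                  # NSW
```

The paths here are relative to the root element, which is why Orders itself does not appear in them.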
Masking of unknown/extra attributes
There may be cases where XML elements sometimes have extra attributes that are not always known prior to masking. To mask these, the extra_attribute_masks option can be specified. This should contain a list of masks to apply to each attribute that has not been masked using a defined node_transform.
By default, each "extra" attribute value will have the masks applied to it separately. To force each of these values to be the same, specify force_extra_attribute_consistency: true at the transform level. The extra_attribute_masks will be applied to the first extra attribute on the first node found, and the resulting value will be applied to all extra attributes. Note that the order in which attributes are located is indeterminate and may not match the order they appear in the XML.
Parameters
- transforms (required): A list of the transforms (replacements) to perform on the XML document.
  - path (required): The Xpath expression to locate the value to update.
  - node_transforms (required): A list of transforms to apply to the nodes on the element. The syntax of this object is shown in the node_transforms Parameters section below.
  - on_missing (optional): A string to specify the action to take if the element that the given path refers to is not present (due to the document structure not matching the path).
    - skip: Skip to the next transform; the document is unchanged by this transform.
    - error (default): Raise an error and stop masking.
  - force_consistency (optional): Require each matching element to be masked to the same values. Defaults to false.
  - extra_attribute_masks (optional): A list of masks to apply for attributes not covered by a specific node_transform.
  - force_extra_attribute_consistency (optional): Force all "extra" attributes to be masked to the same value. Only applicable when using extra_attribute_masks. Defaults to false.
- fallback_masks (optional): Mask to perform if the data retrieved from the database is not valid XML.
If the xml mask is provided a null value (e.g. from a SQL column), the value will remain null. fallback_masks will not be executed.
node_transforms Parameters
node_transforms is a list of transforms to apply to the nodes of the found element(s).
- type (required): The type of node(s) of the current element to apply masking to. Must be one of:
  - text: The text value of the element (the content between the opening and closing tags).
  - attribute: Mask one or more attribute(s) on the element.
  - name: Mask the name of the element itself.
- masks (required): A list of masks to be performed (any of the valid Mask Types).
- attributes (optional): This option is required when using the attribute type, and must not be present for other types. May be either a string or an array of strings, which specify the attributes to apply masks to. To apply different masks to different attributes, use multiple node_transforms.
- on_missing_attribute (optional): A string to specify the action to take if an attribute is missing. Please see the section below on Missing XML Nodes to see what constitutes a missing attribute.
  - skip: Skip to the next attribute (if masking multiple attributes) or, if there are no further attributes to be masked, to the next node_transform. The document is unchanged by this transform.
  - mask: Apply the masks, using a null value, then create the text content or attribute.
  - error (default): Raise an error and stop masking.
- on_null_text (optional): A string to specify the action to take if the text of a node is null (missing). Please see the section below on Missing XML Nodes to see what constitutes a missing node.
  - skip (default): Skip to the next node_transform. The document is unchanged by this transform.
  - mask: Apply the masks, using a null value, then create the text content or attribute.
  - error: Raise an error and stop masking.
- hash_sources (optional): A list of relative paths to values to be used as hash_sources to ensure consistent masking for XML with the same structure.
  - xpath: A relative path from the current node to the node which will be used as a hash source for the mask.
- advance_hash (optional): A boolean which, when set to true, will increment the seed value for hashing. The seed is generated when specifying hash_columns (deterministic masking for databases or tabular files) or hash_sources (deterministic masking for files). This allows for a repeatable sequence of masked values when the value the hashing is performed on is the same. Defaults to false.
Missing XML Nodes
The on_missing_attribute or on_null_text options can be used to change how missing values are treated.
- A text node is considered null if a tag is self-closing. For example, <Transaction amount="23.94"/>. It is also considered null if the element is empty; for example, <Message to="user1" from="user2"></Message>.
- An attribute is considered missing if it does not exist on the element. For example, the attribute currency is missing from this element: <Transaction amount="23.94"/>. An empty string attribute is not considered missing, and instead is just masked as an empty string.
- on_missing_attribute and on_null_text do not apply to the name node type, as XML tags/elements must have a name.
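These definitions line up with how Python's standard xml.etree.ElementTree module reports text and attributes, which gives a simple way to predict whether a node will count as null or missing. The helper names below are hypothetical, the third sample element is made up for the sketch, and the comparison is an analogy rather than a statement about DataMasque's parser.

```python
import xml.etree.ElementTree as ET

self_closing = ET.fromstring('<Transaction amount="23.94"/>')
empty = ET.fromstring('<Message to="user1" from="user2"></Message>')
blank_attribute = ET.fromstring('<Transaction amount="23.94" currency=""/>')


def text_is_null(element):
    """Self-closing and empty elements both have no text node."""
    return element.text is None


def attribute_is_missing(element, name):
    """An attribute is missing only if it is absent; an empty string still counts as present."""
    return name not in element.attrib


print(text_is_null(self_closing), text_is_null(empty))      # True True
print(attribute_is_missing(self_closing, "currency"))       # True
print(attribute_is_missing(blank_attribute, "currency"))    # False
```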
Retaining known attributes and removing others
There may be some instances where you want to retain known attributes, but mask all others. In this case, you can combine the do_nothing mask with the extra_attribute_masks. Any attributes you want to retain will be "masked" to their original value with do_nothing; DataMasque considers these to be masked and then applies the extra_attribute_masks to any other attributes.
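The interaction between retained attributes, extra attributes, and the consistency option can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above for a single element, not DataMasque's implementation: the function mask_attributes, its parameters, and the sample values are hypothetical, and it uses dictionary order where the real attribute order is documented as indeterminate.

```python
_UNSET = object()  # sentinel: no shared value has been generated yet


def mask_attributes(attributes, retained, mask, force_consistency=False):
    """Keep retained attributes as-is and apply mask to every other ("extra") attribute."""
    result = {}
    shared = _UNSET
    for name, value in attributes.items():
        if name in retained:
            result[name] = value  # behaves like a do_nothing node transform
        elif force_consistency:
            if shared is _UNSET:
                shared = mask(value)  # generated once, from the first extra attribute
            result[name] = shared
        else:
            result[name] = mask(value)  # each extra attribute masked separately
    return result


employee = {"date": "2022-10-08", "given_name": "billy", "preferred_name": "bill"}
print(mask_attributes(employee, retained={"date"}, mask=str.upper))
print(mask_attributes(employee, retained={"date"}, mask=str.upper, force_consistency=True))
```

In the second call both extra attributes receive the value produced for the first one, which is the effect force_extra_attribute_consistency is described as having.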
Examples
This example contains one item in transforms and three node_transforms. The transforms item specifies the path UserLog/Log of the xml_data column; optional parameters that are not specified are set to their default values.
- The first of the node_transforms replaces the text at the path with the fixed value REDACTED. The on_null_text: mask option is specified to mask the null value as normal.
- The second masks the username attribute to a similar replacement by concatenating three from_file masks, followed by a transform_case mask to make sure the replacements are all still lower case.
- The third masks the date attribute with a suitable replacement date using a from_random_date mask.
Also note the use of fallback_masks. The last row did not have valid XML data in it, so the fallback mask was used to replace it with an empty <Root /> element, which may help clean the data for further use.
<Root>
  <UserLog>
    <Log date="2022-08-09" username="user@example.com">Account created</Log>
    <Log date="2022-08-09" username="user@example.com"></Log>
    <Log date="2022-08-09" username="user@example.com">Logged out</Log>
  </UserLog>
</Root>
version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Root />'
            transforms:
              - path: 'UserLog/Log'
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
                    on_null_text: mask
                  - type: attribute
                    attributes:
                      - username
                    masks:
                      - type: concat
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
                          - type: from_file
                            seed_file: DataMasque_email_suffixes.csv
                            seed_column: email-suff
                      - type: transform_case
                        transform: lowercase
                  - type: attribute
                    attributes:
                      - date
                    masks:
                      - type: from_random_date
                        min: '2022-01-01'
                        max: '2022-12-31'
XML Example with force_consistency
This example illustrates the benefit of using the force_consistency parameter on transforms.
Suppose you have a table with XML data with the following structure:
<Root>
  <UserLog>
    <Log date="2022-08-09" username="user@example.com">Account created</Log>
    <Log date="2022-08-09" username="user@example.com"></Log>
    <Log date="2022-08-09" username="user@example.com">Logged out</Log>
  </UserLog>
</Root>
When masking the date and username attributes at the path UserLog/Log, it would be best to mask them with consistent values, i.e. the same masked values would appear in each of the attributes after masking. To do this, set the force_consistency parameter of the relevant transform to true.
version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Root />'
            transforms:
              - path: 'UserLog/Log'
                force_consistency: true
                node_transforms:
                  - type: text
                    masks:
                      - type: from_fixed
                        value: REDACTED
                    on_null_text: mask
                  - type: attribute
                    attributes:
                      - username
                    masks:
                      - type: concat
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
                          - type: from_file
                            seed_file: DataMasque_email_suffixes.csv
                            seed_column: email-suff
                      - type: transform_case
                        transform: lowercase
                  - type: attribute
                    attributes:
                      - date
                    masks:
                      - type: from_random_date
                        min: '2022-01-01'
                        max: '2022-12-31'
Without force_consistency, each Log element in the output XML would be masked to different values, as in the first example.
Example with force_extra_attribute_consistency
This example illustrates the benefit of using the force_extra_attribute_consistency parameter on transforms.
Suppose you have XML data with the following structure:
<Root>
  <Info>
    <Employee date="2022-10-08" given_name="billy_ferwagner" preferred_name="billy_ferwagner"></Employee>
    <Employee date="2022-10-08" given_name="william_florista" preferred_name="william_florista"></Employee>
  </Info>
</Root>
This time, suppose you want to mask the given_name and preferred_name attributes to the same value. To achieve this, add node_transforms for any attributes you want to mask individually (e.g. the date attribute), set force_extra_attribute_consistency: true, and specify extra_attribute_masks with the masks you want to be performed on the extra attributes. This will generate a masked value from the specified masks and replace the values of all extra attributes with that masked value.
version: "1.0"
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: fallback
            transforms:
              - path: 'Info/Employee'
                node_transforms:
                  - type: text
                    masks:
                      - type: do_nothing
                    on_null_text: mask
                force_extra_attribute_consistency: true
                extra_attribute_masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed
Without force_extra_attribute_consistency, the output XML would mask the given_name and preferred_name attributes differently.
XML Example with advance_hash
This example demonstrates the benefit of using the advance_hash parameter on node transforms when using Deterministic Masking.
Suppose you have a table which contains XML data with the following structure:
<Group>
  <Members>
    <Names>
      <Name>James Chalmers</Name>
      <Name>Libby Stevenson</Name>
      <Name>Matthew Radgen</Name>
    </Names>
  </Members>
</Group>
When masking a path which will retrieve multiple nodes (in this example the path is //Group/Members/Names/Name), it would be best to mask them with different values while still ensuring deterministic masking based on the id column in the table.
In order to get the deterministic behaviour, either hash_columns or hash_sources will need to be specified. In this case hash_columns will be used, as the XML data is stored in a table.
version: "1.0"
name: xml_advance_hash
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        hash_columns:
          - column_name: id
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Group />'
            transforms:
              - path: "//Group/Members/Names/Name"
                node_transforms:
                  - type: text
                    masks:
                      - type: concat
                        glue: " "
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
However, when using deterministic masking (with hash_columns/hash_sources specified), all masked values for the path //Group/Members/Names/Name will be the same. If advance_hash is enabled, masked values will be a repeatable sequence rather than a single repeated value. This next example shows the difference with advance_hash: true specified.
version: "1.0"
name: xml_advance_hash
tasks:
  - type: mask_table
    table: xml_test
    key: id
    rules:
      - column: xml_data
        hash_columns:
          - column_name: id
        masks:
          - type: xml
            fallback_masks:
              - type: from_fixed
                value: '<Group />'
            transforms:
              - path: "//Group/Members/Names/Name"
                force_consistency: false
                node_transforms:
                  - type: text
                    advance_hash: true
                    masks:
                      - type: concat
                        glue: " "
                        masks:
                          - type: from_file
                            seed_file: DataMasque_firstNames_mixed.csv
                            seed_column: firstname-mixed
                          - type: from_file
                            seed_file: DataMasque_lastNames.csv
                            seed_column: lastnames
Notice that the first masked value is the same in both examples. This is because the masking is deterministic: the sequence of masked values is repeatable for the same value from the hash_columns/hash_sources.
The same idea can be used for file masking, but instead of specifying hash_columns, hash_sources should be used with an xpath for the value to be used for the hashing.