In-Flight Masking Rulesets

This document explains how to configure DataMasque's in-flight masking using rulesets. Rulesets define what data should be masked and how the masking should be performed. You'll learn how to structure rulesets, control masking behavior through various settings, and see examples of masking different types of data. The guide starts with simple examples and builds up to more complex scenarios like JSON document masking and hash-based consistent masking.

What are rulesets?
version
rules
hash_sources

What are rulesets?

Rulesets are YAML definitions that specify how input values should be masked. Each ruleset contains one or more rules that define masking operations to be applied.

This document introduces ruleset concepts gradually, building on previous examples. It assumes you have read the basic setup and use guide, and have seen how data is masked with a ruleset.

`version`

Required

A schema version is required in your ruleset. It must be a quoted string, not a number: YAML parses an unquoted 1.0 as a floating-point number.

This is not valid:

version: 1.0

This is valid:

version: "1.0"

Currently, the in-flight ruleset schema is version 1.0.

`rules`

Required

rules is required in a ruleset. It is an array of rules, each of which contains a masks list of mask definitions. The full list of masks is available in the masking functions documentation.

Unsupported or Partially Supported Masks

In-flight masking supports all DataMasque masks except for:

from_column
from_unique
secure_shuffle
from_blob

In addition, the following masks are only partially supported:

from_file: The table_filter_column and seed_filter_column parameters are not supported, as they only apply to filtering against database columns.

Simple Rule

This ruleset was shown in the basic setup guide. It generates a random first name.

version: "1.0"
rules:
- masks:
  - type: "from_file"
    seed_file: "DataMasque_firstNames_mixed.csv"

Example input and output

Note: Examples in this document omit any fields that do not affect the behaviour - for example, logs and metadata in responses. Refer to the API documentation on Masking Request Detail Object and Masking Response Detail Object for the full object schema.

Request:

{
  "data": ["Darcy", "Molly", "Evelyn"]
}

Response:

{
  "data": ["Salma", "Emmie", "Salma"]
}

Chaining Multiple Masks

Where multiple masks are defined in a ruleset, they are applied sequentially - the output of one mask becomes the input to the next mask.

This next example performs an uppercase transform on a randomly selected first name.

version: "1.0"
rules:
- masks:
  - type: "from_file"
    seed_file: "DataMasque_firstNames_mixed.csv"
  - type: transform_case
    transform: uppercase

Example input and output

Request:

{
  "data": ["Darcy", "Molly", "Evelyn"]
}

Response:

{
  "data": ["TINA", "LANA", "LEONA"]
}

Note: Since hashing has not been used, the output values are randomly generated on each request.

Mask Chaining Limitations

While masks can be chained together, some combinations may not work as expected:

Masks that generate values (like from_file) ignore their input values entirely. Chaining two from_file masks will only use the output from the last one:

# The first from_file is effectively ignored
rules:
- masks:
  - type: "from_file"
    seed_file: "DataMasque_firstNames_mixed.csv" # Output ignored
  - type: "from_file"
    seed_file: "DataMasque_lastNames.csv" # Only this output is used
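
To combine the outputs of two value-generating masks instead of having the second replace the first, a concat mask (used in the Full JSON Example later in this document) is one option. The following is a minimal sketch, assuming concat is also accepted directly in a rule's masks list; this document only shows it nested inside a json transform.

```yaml
version: "1.0"
rules:
- masks:
  # Assumption: concat can be used as a top-level mask in a rule.
  - type: concat
    masks:
      # Each sub-mask generates its own value; concat joins the results.
      - type: from_file
        seed_file: DataMasque_firstNames_mixed.csv
        seed_column: firstname-mixed
      - type: from_file
        seed_file: DataMasque_lastNames.csv
        seed_column: lastnames
```

Judging by the email values in the Full JSON Example response, the parts are likely joined without a separator, giving a first name immediately followed by a last name.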

At this point, we've covered:

Basic ruleset structure with versions and rules.
Simple value generation with from_file.
Sequential mask chaining and its limitations.

Next we will look at masking values within JSON documents, allowing you to selectively mask fields while preserving document structure.

Advanced JSON Rules

As well as masking single values, in-flight masking can mask multiple values at once when they are contained in a JSON document.

This is achieved by using a json mask, which applies sub-rules to specified paths in a JSON document.

Before investigating the ruleset YAML, we'll look at the documents to be masked, and explain what masks will be applied.

Consider a JSON document containing user information, where we want to mask personal details while preserving structure:

{
  "id": 1, 
  "first_name": "Alice", 
  "last_name": "Apples", 
  "email": "alice.apples@gmail.com", 
  "identifier": "AAA111"
}

In this example, the first_name and last_name will be masked with from_file masks. These rules are sub-masks of a json mask.

This is the ruleset:

version: "1.0"
rules:
  - masks:
    - type: json
      transforms:
        - path: [first_name]
          hash_sources:
            - json_path: [.., id]
          masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
        - path: [last_name]
          hash_sources:
            - json_path: [.., id]
          masks:
            - type: from_file
              seed_file: DataMasque_lastNames.csv
              seed_column: lastnames

Note that each transform includes hash_sources configured to use the document's id field. This ensures consistent masking across systems for the same user ID. It overrides any hash_sources set at the top level of the ruleset.

We will expand on this ruleset later to mask other elements.

Example input and output

In this example, multiple data values will be sent at once.

Request:

{
  "data": [
    {
      "id": 1,
      "first_name": "Alice",
      "last_name": "Apples",
      "email": "alice.apples@gmail.com",
      "identifier": "AAA111"
    },
    {
      "id": 2,
      "first_name": "Bob",
      "last_name": "Boris",
      "email": "bob.boris@gmail.com",
      "identifier": "BXZ888"
    }
  ]
}

Response:

{
  "data": [
    {
      "id": 1,
      "first_name": "Verena",
      "last_name": "Grazina",
      "email": "alice.apples@gmail.com",
      "identifier": "AAA111"
    },
    {
      "id": 2,
      "first_name": "Joanny",
      "last_name": "Gildenpfennig",
      "email": "bob.boris@gmail.com",
      "identifier": "BXZ888"
    }
  ]
}

Notice that the json mask is applied to each element of the input data array individually; the data array as a whole is not passed to the mask.

Full JSON Example

The final JSON example extends the previous ruleset to mask the entire document. It uses:

Email masking that combines three random values using a concat mask:
1. A random first name
2. A random last name
3. A random email suffix (e.g. @example.com)
Identifier masking using from_unique_imitate
The same first name and last name masking as before

All fields use the document's id as a hash source, ensuring consistent masking across requests for the same user.

version: "1.0"
rules:
  - masks:
    - type: json
      transforms:
        - path: [email]
          hash_sources:
            - json_path: [.., id]
          masks:
            - type: chain
              masks:
                - type: concat
                  masks:
                    - type: from_file
                      seed_file: DataMasque_firstNames_mixed.csv
                      seed_column: firstname-mixed
                    - type: from_file
                      seed_file: DataMasque_lastNames.csv
                      seed_column: lastnames
                    - type: from_file
                      seed_file: DataMasque_fake_email_suffixes.csv
                      seed_column: email-suff
        - path: [identifier]
          hash_sources:
            - json_path: [.., id]
          masks:
            - type: from_unique_imitate
        - path: [first_name]
          hash_sources:
            - json_path: [.., id]
          masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
        - path: [last_name]
          hash_sources:
            - json_path: [.., id]
          masks:
            - type: from_file
              seed_file: DataMasque_lastNames.csv
              seed_column: lastnames

Request:

{
  "data": [
    {
      "id": 1,
      "first_name": "Alice",
      "last_name": "Apples",
      "email": "alice.apples@gmail.com",
      "identifier": "AAA111"
    },
    {
      "id": 2,
      "first_name": "Bob",
      "last_name": "Boris",
      "email": "bob.boris@gmail.com",
      "identifier": "BXZ888"
    }
  ]
}

Response:

{
  "data": [
    {
      "id": 1,
      "first_name": "Verena",
      "last_name": "Grazina",
      "email": "VerenaGrazina@baptiste.org",
      "identifier": "DJZ228"
    },
    {
      "id": 2,
      "first_name": "Joanny",
      "last_name": "Gildenpfennig",
      "email": "JoannyGildenpfennig@matrix.net",
      "identifier": "CDC439"
    }
  ]
}

`hash_sources`

Optional

Hash sources determine what values are used to generate consistent masking results. They can be specified at two levels:

At the ruleset level (covered in this section)
Within individual transforms (as seen in the JSON examples)

Examples of both have been seen in the basic setup guide.

This section looks at the various methods of specifying hash sources at the ruleset level.

hash_sources are specified as a list at the top level of the ruleset. When multiple hash sources are listed, the hash values are fetched and concatenated in order to build the final hash seed. This means that if the order of hash_sources changes, the hash seed will also change.
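
As an illustration of this order sensitivity, the sketch below lists two json_path hash sources at the ruleset level. The region field is hypothetical and is not part of this document's example data; it only stands in for a second value to hash on.

```yaml
version: "1.0"
hash_sources:
  # Hash values are fetched and concatenated in the order listed here.
  - json_path: [region]   # hypothetical field, for illustration only
  - json_path: [id]
rules:
  - masks:
    - type: json
      transforms:
        - path: [first_name]
          masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
```

Swapping the two hash_sources entries would build a different hash seed, so the same documents would be expected to receive different masked values.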

Each hash source is an object that must contain exactly one of these primary properties:

Path-based sources:

json_path: A JSON path query (specified as a list) used to fetch a hash value from an element in data.
xpath: An XPath string used to fetch a hash value from an element in data.

Request-based sources:

source: Either from_request or self.
- from_request: Fetch the hash value from a hash_values array included in the request.
- self: Use the entire data element being masked as the hash value.

Note: from_request and self may only be specified at the top level of the ruleset, and not as part of a json or xml mask.

The following extra properties control how the hash value is transformed after it is fetched. All are optional; if none are specified, the hash value is used exactly as fetched.

case_transform (optional, enum): Convert the hash value to lower- or upper-case. One of:
- lower: Convert the value to lower case.
- upper: Convert the value to upper case.
trim_whitespace (optional, boolean): If true, trim whitespace from the start and end of the hash value. Defaults to false (no trim is performed).
coerce_whole_numbers_to_int (optional, boolean): If true, whole number float or decimal values will be transformed to integers. For example, 1.0 would be converted to 1. This is useful if IDs are not stored as integers, but are whole numbers. Even if this value is true, non-whole-numbers remain as floats. For example, 1.5 stays as 1.5. Defaults to false (conversion is not performed).
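
The properties above are set on the hash source object itself, alongside its primary property. The following is a minimal sketch, assuming the transform properties are accepted on a from_request source in the same way they are described for hash sources in general.

```yaml
version: "1.0"
hash_sources:
  - source: from_request
    # With this set, hash values of 1 and 1.0 are treated as the same value.
    coerce_whole_numbers_to_int: true
rules:
- masks:
  - type: "from_file"
    seed_file: "DataMasque_firstNames_mixed.csv"
```

With this ruleset, a request whose hash_values are [1, 1.0, 1.5] would be expected to give the first two data elements the same masked value, while the third is hashed separately because 1.5 stays a float.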

Before explaining the hash source types in more detail, it is important to know when to specify hash_sources at a ruleset or mask level.

Why are `hash_sources` inside `json` or `xml` masks?

When masking values inside JSON or XML documents, sometimes the hash source can only be located relative to the node being masked.

For example, consider this JSON data where each element in data is itself an array of objects:

{
  "data": [
    [
      {
        "id": 1,
        "first_name": "Alice"
      },
      {
        "id": 2,
        "first_name": "Bob"
      }
    ]
  ]
}

When masking first_name, we need to use the corresponding id from the same object as the hash value. This relationship can only be expressed using relative paths within the json mask.

The same principle applies to XML data. Consider this example:

{
  "data": [
    "<data><person id=\"1\"><first_name>Alice</first_name></person><person id=\"2\"><first_name>Bob</first_name></person></data>"
  ]
}

Here, each first_name should be hashed using its parent node's id attribute. Again, this relationship requires relative path handling.

In contrast, specifying hash_sources at a ruleset level means these relative relationships cannot be expressed, as ruleset-level hash sources can only reference fixed paths.
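
For the JSON case above, a relative hash source inside the json mask might look like the sketch below. The [.., id] relative path is the same form used in the earlier JSON examples. The "*" path segment for matching every object in the array is an assumption not shown in this document; check the json mask documentation for the exact array syntax.

```yaml
version: "1.0"
rules:
  - masks:
    - type: json
      transforms:
        # Assumption: "*" matches every object in the array.
        - path: ["*", first_name]
          hash_sources:
            # Relative path: step up from first_name to its parent object, then read id.
            - json_path: [.., id]
          masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
```

Because the hash source is resolved relative to each first_name node, Alice would be hashed on id 1 and Bob on id 2, even though both objects sit inside the same data element.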

We'll now look at the different hash source types in more detail, with examples.

`json_path` hash source type

The json_path hash source type should be used when masking simple JSON documents that contain just a single value to mask and a value to use for hashing. As discussed in the previous section, masking multiple values in a single JSON document may not give consistent results with ruleset-level hashing.

This ruleset masks the first_name in a JSON document by hashing on the id included in the document.

version: "1.0"
rules:
  - masks:
    - type: json
      transforms:
        - path: [first_name]
          masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
hash_sources:
  - json_path: [id]

Example request:

{  
  "data": [
    {
      "id": 1,
      "first_name": "Alice"
    },
    {
      "id": 2,
      "first_name": "Bob"
    },
    {
      "id": 1,
      "first_name": "Charles"
    }
  ]
}

Example response:

{  
  "data": [
    {
      "id": 1,
      "first_name": "Verena"
    },
    {
      "id": 2,
      "first_name": "Joanny"
    },
    {
      "id": 1,
      "first_name": "Verena"
    }
  ]
}

Notice that because the first and third elements have the same id, their replacement first_name values are the same.

`xpath` hash source type

The xpath hash source type should be used when masking simple XML documents that contain just a single value to mask and a value to use for hashing. As discussed in the previous section, masking multiple values in a single XML document may not give consistent results with ruleset-level hashing.

This ruleset masks the first_name element in an XML document by hashing on the id included in the document.

version: "1.0"
rules:
  - masks:
    - type: xml
      transforms:
        - path: '/user/first_name'
          node_transforms:
          - type: text
            masks:
            - type: from_file
              seed_file: DataMasque_firstNames_mixed.csv
              seed_column: firstname-mixed
hash_sources:
  - xpath: '/user/@id'

Example request:

{  
  "data": [
    "<user id=\"1\"><first_name>Alice</first_name></user>",
    "<user id=\"2\"><first_name>Bob</first_name></user>",
    "<user id=\"1\"><first_name>Charles</first_name></user>"
  ]
}

Example response:

{  
  "data": [
    "<user id=\"1\"><first_name>Verena</first_name></user>",
    "<user id=\"2\"><first_name>Joanny</first_name></user>",
    "<user id=\"1\"><first_name>Verena</first_name></user>"
  ]
}

Notice that because the first and third elements have the same id attribute, their replacement first_name values are the same.

`from_request` hash source type

This example of using the from_request hash source type is repeated from the basic setup guide.

Using source: from_request allows hash values to be specified in the request, separate from the data itself. It has two modes of operation:

hash_values is an array containing the same number of elements as data. The hash values map one-to-one to the data values, in the same order.
hash_values is a single value (string, number, or object). The same hash value is used for each data element.

These two modes are applied automatically based on the hash_values type in the request. The mode is not specified in the ruleset.

This ruleset selects a random first name using a from_file mask. It hashes on values in the request.

version: "1.0"
hash_sources:
  - source: from_request
rules:
- masks:
  - type: "from_file"
    seed_file: "DataMasque_firstNames_mixed.csv"

The first request example uses one hash value per data value:

{  
  "data": ["Alice", "Bob", "Charles"],
  "hash_values": [1, 2, 1]
}

Example response:

{
  "data": ["Verena", "Joanny", "Verena"]
}

Again, repeated hash values lead to repeated output values.

This second request example uses just a single hash value, so all the outputs are the same:

{  
  "data": ["Alice", "Bob", "Charles"],
  "hash_values": 2
}

Example response:

{
  "data": ["Joanny", "Joanny", "Joanny"]
}

`self` hash source type

The final type of hash source is self, which uses the entire data element being masked as the hash source. The self hash source is particularly useful when:

You want deterministic masking without managing external hash values.
You need consistency when the same data appears in different systems or databases.
You're masking values where the input itself can serve as a stable identifier.

For example, when masking test data across multiple environments, using self ensures that each value maps consistently to its masked equivalent without needing to maintain separate hash values or identifiers.

Note: If you need to maintain unique one-to-one relationships between input and masked values, the from_unique_imitate mask type should be used instead of self hashing.

This example ruleset masks first names using self hashing, meaning the same input name will always produce the same masked output.

version: "1.0"
hash_sources:
  - source: self
rules:
- masks:
  - type: "from_file"
    seed_file: "DataMasque_firstNames_mixed.csv"

The following example demonstrates how self hashing maintains consistency when values are repeated:

{  
  "data": ["Alice", "Bob", "Alice"]
}

Example response:

{  
  "data": ["Bonnie", "Nicholas", "Bonnie"]
}
