DataMasque Portal

Performance Optimisation

This document describes how to use multiple processes and parallel execution to reduce the time it takes to perform masking runs.

Masking using parallelism and workers

When performing a masking run that uses both parallelism and multiple workers, it is recommended that the number of parallel tasks multiplied by the number of workers assigned to each task does not exceed twice the number of CPUs available on your deployed instance. For example, on a 4-CPU deployment with 2 worker processes on each parallel task, it is recommended to run a maximum of 4 parallel tasks (4 × 2 = 8 processes, which is twice the 4 CPUs). DataMasque can perform at most 10 parallel tasks simultaneously.
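The sizing guideline above can be worked through as a short calculation. The following is a minimal sketch in Python. The function name `max_parallel_tasks` and its parameters are hypothetical and not part of DataMasque. It only encodes the recommendation stated above (parallel tasks multiplied by workers per task should not exceed twice the CPU count) together with the limit of 10 parallel tasks.

```python
# Illustrative helper only; not part of DataMasque.
MAX_PARALLEL_TASKS = 10  # limit on simultaneous parallel tasks stated above


def max_parallel_tasks(cpus: int, workers_per_task: int) -> int:
    """Return the recommended maximum number of parallel tasks.

    Applies the guideline parallel_tasks * workers_per_task <= 2 * cpus,
    capped at the limit of 10 simultaneous parallel tasks.
    """
    if cpus < 1 or workers_per_task < 1:
        raise ValueError("cpus and workers_per_task must be at least 1")
    recommended = (2 * cpus) // workers_per_task
    # Assumption for this sketch: always report at least one task, even when
    # the workers of a single task already exceed the guideline.
    return max(1, min(recommended, MAX_PARALLEL_TASKS))


if __name__ == "__main__":
    # The 4-CPU example from the text: 2 workers per task.
    print(max_parallel_tasks(cpus=4, workers_per_task=2))
```

Treat the result as a starting point for tuning and confirm it by monitoring CPU usage on the DataMasque instance.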

Masking a table with parallel processes

To improve masking performance on a single table, you can enable parallelism, which allows multiple processes to mask the same table simultaneously. To do this, set workers to a value greater than 1.

For best performance, it is also recommended to increase the batch size along with the number of workers. Increasing the batch size increases the number of rows that are fetched, masked, and updated in a single operation. This reduces database operation overhead at the cost of increased memory usage by DataMasque. More details on batch size can be found in the Run Options guide.

When using multiple workers (workers > 1), each worker process operates on a separate batch of rows, and the worker processes run simultaneously. Because more rows of the table are masked at any given time, this can reduce masking run time.

Note: Increasing the number of workers can increase CPU usage on the DataMasque instance. It is recommended to monitor CPU usage and increase the instance's CPU resources accordingly.

In the ruleset below, workers: 4 is specified, so 4 worker processes are used to mask the users table simultaneously.

version: "1.0"
tasks:
  - type: mask_table
    table: users
    workers: 4
    key: id
    rules:
      - column: last_name
        masks:
        - type: from_fixed
          value: 'redacted last name'

The following diagram describes how multiple worker processes work in the example ruleset above.

[Diagram: Multiple workers]

Notes:

  • The number of rows in each batch is set by the batch size parameter.
  • As each worker finishes masking a batch of rows, it moves on to the next unmasked batch of rows.
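The behaviour described in the notes above can be modelled with a small Python sketch. This is a conceptual illustration only, not DataMasque's implementation: the function names are hypothetical, `mask_batch` is a stand-in for the real fetch, mask, and update step, and a thread pool is used in place of separate worker processes to keep the example short.

```python
# Conceptual model only; this is not how DataMasque is implemented internally.
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(row_ids, batch_size):
    """Split row identifiers into consecutive batches of at most batch_size."""
    return [row_ids[i:i + batch_size] for i in range(0, len(row_ids), batch_size)]


def mask_batch(batch):
    """Stand-in for fetching, masking, and updating one batch of rows."""
    return {row_id: "redacted last name" for row_id in batch}


def mask_table(row_ids, workers, batch_size):
    """Mask all rows using a pool of workers that each take one batch at a time."""
    masked = {}
    batches = split_into_batches(row_ids, batch_size)
    # The pool hands the next pending batch to whichever worker becomes free.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(mask_batch, batches):
            masked.update(result)
    return masked


if __name__ == "__main__":
    # 10 rows, 4 workers, batches of 3 rows: four batches in total.
    masked_rows = mask_table(list(range(1, 11)), workers=4, batch_size=3)
    print(len(masked_rows))
```

In this model, a larger batch size means fewer and larger batches, which mirrors the trade-off between per-operation overhead and memory usage described earlier.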

Performing tasks in parallel

With the parallel task type, DataMasque uses multiple processes to mask multiple tables at once. Parallel tasks can reduce the time needed to mask a database compared to masking individual tables sequentially.

The example ruleset below shows how mask_table tasks can be set up to run in parallel.

In this ruleset, 3 tables are masked simultaneously in each parallel task block. Once the first 3 tables are masked, the next parallel task block is executed, and so on until all 3 parallel task blocks are complete and all tables in the ruleset are masked.

version: "1.0"
tasks:
  - type: parallel
    tasks:
      - type: mask_table
        table: table_1
        ...
      - type: mask_table
        table: table_2
        ...
      - type: mask_table
        table: table_3
        ...
  - type: parallel
    tasks:
      - type: mask_table
        table: table_4
        ...
      - type: mask_table
        table: table_5
        ...
      - type: mask_table
        table: table_6
        ...
  - type: parallel
    tasks:
      - type: mask_table
        table: table_7
        ...
      - type: mask_table
        table: table_8
        ...
      - type: mask_table
        table: table_9
        ...

Note: mask_unique_key tasks cannot be run in parallel.

The following diagram describes how parallel execution works in the example ruleset shown above.

[Diagram: Parallel tasks]
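When parallel task blocks are combined with workers, it can help to estimate how many worker processes are active at once and compare that with the guideline of twice the CPU count. The following Python sketch does this under stated assumptions: the ruleset has already been loaded into a dictionary (for example with a YAML parser), a task without a workers value is assumed to use 1 worker, and nested parallel blocks are not handled. The function names are hypothetical and not part of DataMasque.

```python
# Illustrative check only; function names and defaults are assumptions.


def peak_worker_processes(ruleset: dict) -> int:
    """Return the largest number of worker processes active in any top-level task."""
    peak = 0
    for task in ruleset.get("tasks", []):
        if task.get("type") == "parallel":
            # Tasks inside a parallel block run together, so their workers add up.
            active = sum(sub.get("workers", 1) for sub in task.get("tasks", []))
        else:
            # Assumption: a task without a workers value uses a single worker.
            active = task.get("workers", 1)
        peak = max(peak, active)
    return peak


def within_guideline(ruleset: dict, cpus: int) -> bool:
    """Check the guideline that active processes should not exceed 2 * cpus."""
    return peak_worker_processes(ruleset) <= 2 * cpus


if __name__ == "__main__":
    example = {
        "version": "1.0",
        "tasks": [
            {
                "type": "parallel",
                "tasks": [
                    {"type": "mask_table", "table": "table_1", "workers": 2},
                    {"type": "mask_table", "table": "table_2", "workers": 2},
                    {"type": "mask_table", "table": "table_3", "workers": 2},
                ],
            },
        ],
    }
    print(peak_worker_processes(example), within_guideline(example, cpus=4))
```

Because parallel task blocks execute one after another, only the busiest block determines the peak, which is why the sketch takes the maximum across blocks instead of a total.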