Performance Optimisation
This document provides guidelines for optimising masking performance. It explains how to use multiple processes so that masking work runs in parallel and completes sooner.
- Masking using parallelism and workers
- Masking a table with multiple workers
- Performing tasks in parallel
Masking using parallelism and workers
The total number of worker processes should not exceed twice the number of CPUs available to the DataMasque instance. For example, if your virtual machine has four CPUs, the total number of worker processes should not exceed eight.
The total number of worker processes is equal to the number of parallel tasks times the number of workers per task.
For example:
1 task in parallel * 2 workers = 2 worker processes
or
2 tasks in parallel * 2 workers = 4 worker processes
DataMasque can run a maximum of 10 parallel tasks simultaneously; however, each task may have multiple workers, so the total number of worker processes can exceed 10.
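The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical helpers and are not part of DataMasque.

```python
def total_worker_processes(parallel_tasks: int, workers_per_task: int) -> int:
    """Total worker processes = parallel tasks x workers per task."""
    return parallel_tasks * workers_per_task


def within_cpu_guideline(parallel_tasks: int, workers_per_task: int, cpus: int) -> bool:
    """True if the total stays within the 'twice the number of CPUs' guideline."""
    return total_worker_processes(parallel_tasks, workers_per_task) <= 2 * cpus


if __name__ == "__main__":
    cpus = 4  # the four-CPU virtual machine used as the example in this guide
    for tasks, workers in [(1, 2), (2, 2), (3, 4)]:
        total = total_worker_processes(tasks, workers)
        verdict = "ok" if within_cpu_guideline(tasks, workers, cpus) else "exceeds guideline"
        print(f"{tasks} task(s) x {workers} worker(s) = {total} worker processes: {verdict}")
```

With four CPUs, the first two combinations stay within the limit of eight worker processes, while three tasks with four workers each (twelve worker processes) do not.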
Masking a table with multiple workers
To improve masking performance on a single table, you can enable parallelism, which allows multiple processes to work together simultaneously to mask a single table. To do this, specify a number of workers greater than 1 for a task.
For optimal performance, it is also recommended to increase the batch size when increasing the number of workers. Increasing the batch size increases the number of rows that are fetched, masked, and updated in a single operation. This reduces database operation overhead at the cost of increased memory usage by DataMasque. More details on batch size can be found in the Database Run Options guide.
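To see why a larger batch size reduces database overhead, it can help to count how many fetch, mask, and update operations a table needs. The sketch below is a simplified model that assumes one operation per batch; the row count and batch sizes are made-up values, not DataMasque defaults.

```python
def batch_count(total_rows: int, batch_size: int) -> int:
    """Number of batches needed to cover every row (ceiling division)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (total_rows + batch_size - 1) // batch_size


if __name__ == "__main__":
    total_rows = 1_000_000  # hypothetical table size
    for batch_size in (1_000, 10_000, 50_000):  # hypothetical batch sizes
        print(f"batch size {batch_size}: {batch_count(total_rows, batch_size)} batches")
```

Fewer batches mean fewer round trips to the database, but each batch holds more rows in memory at once, and that memory is multiplied by the number of workers processing batches at the same time.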
When using multiple workers (workers > 1), each worker process operates on a separate batch of rows, and these worker processes run simultaneously. Because more rows of the table are masked at once, this can reduce masking run time and so improve performance.
Note: Increasing the number of workers will increase the amount of memory used (as well as CPU consumption). It is recommended to monitor resource usage when using parallelism.
In the ruleset specification below, workers: 4 is specified; therefore, four worker processes will be used to mask the users table simultaneously.
version: "1.0"
tasks:
  - type: mask_table
    table: users
    workers: 4
    key: id
    rules:
      - column: last_name
        masks:
          - type: from_fixed
            value: 'redacted last name'
The following diagram describes how multiple worker processes work in the example ruleset above.
Notes:
- The number of rows in each buffer is set by the batch size parameter/run option.
- As each worker finishes masking a batch of rows, it will move on to the next unmasked batch of rows.
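The behaviour described in these notes can be illustrated with a toy model in Python. This is not DataMasque's implementation: threads stand in for worker processes, an in-memory list stands in for the table, and the function name simulate_mask_table is hypothetical. It only shows the pattern of workers repeatedly taking the next unmasked batch until none remain.

```python
import queue
import threading


def simulate_mask_table(rows, batch_size, workers):
    """Toy model: workers pull batches from a shared queue until it is empty."""
    batches = queue.Queue()
    for start in range(0, len(rows), batch_size):
        batches.put((start, rows[start:start + batch_size]))

    masked = [None] * len(rows)
    batches_per_worker = [0] * workers

    def worker(worker_id):
        while True:
            try:
                start, batch = batches.get_nowait()
            except queue.Empty:
                return  # no unmasked batches left, so this worker stops
            for offset, row in enumerate(batch):
                # Apply the fixed replacement value from the example ruleset.
                masked[start + offset] = {**row, "last_name": "redacted last name"}
            batches_per_worker[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return masked, batches_per_worker


if __name__ == "__main__":
    users = [{"id": i, "last_name": f"name{i}"} for i in range(10)]
    masked, counts = simulate_mask_table(users, batch_size=3, workers=4)
    print(f"{sum(counts)} batches masked; batches per worker: {counts}")
```

How the batches are distributed between workers varies from run to run, which mirrors the note above: whichever worker finishes first picks up the next batch.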
Performing tasks in parallel
When using the parallel task type, DataMasque performs masking using multiple processes, which allows masking to run in parallel across multiple tables at once. Parallel tasks can reduce the time needed to mask a database when compared to masking individual tables sequentially.
Below is an example ruleset showing how mask_table tasks can be set up to run in parallel. Three tables are masked simultaneously in each parallel task block. Once the first three tables are masked, the next parallel task block is executed, until finally all three parallel task blocks are complete and all tables in the ruleset are masked.
version: "1.0"
tasks:
  - type: parallel
    tasks:
      - type: mask_table
        table: table_1
        ...
      - type: mask_table
        table: table_2
        ...
      - type: mask_table
        table: table_3
        ...
  - type: parallel
    tasks:
      - type: mask_table
        table: table_4
        ...
      - type: mask_table
        table: table_5
        ...
      - type: mask_table
        table: table_6
        ...
  - type: parallel
    tasks:
      - type: mask_table
        table: table_7
        ...
      - type: mask_table
        table: table_8
        ...
      - type: mask_table
        table: table_9
        ...
Note: mask_unique_key tasks are not allowed to be run in parallel.
The following diagram describes how parallel execution works in the example ruleset shown above.
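The guidelines in this document can be combined into a rough pre-flight check. The sketch below works on a ruleset that has already been loaded into a plain Python dictionary and is not a DataMasque feature; check_ruleset is a hypothetical helper. It assumes that a task without a workers setting uses one worker, that the limit of 10 parallel tasks applies to each parallel block, and that blocks are not nested.

```python
MAX_PARALLEL_TASKS = 10  # limit on simultaneous parallel tasks stated in this guide


def check_ruleset(ruleset: dict, cpus: int) -> list:
    """Return a list of guideline problems found in each parallel block."""
    problems = []
    for index, task in enumerate(ruleset.get("tasks", [])):
        if task.get("type") != "parallel":
            continue
        subtasks = task.get("tasks", [])
        if len(subtasks) > MAX_PARALLEL_TASKS:
            problems.append(f"block {index}: more than {MAX_PARALLEL_TASKS} parallel tasks")
        if any(sub.get("type") == "mask_unique_key" for sub in subtasks):
            problems.append(f"block {index}: mask_unique_key cannot run in parallel")
        # Assumption: a task with no 'workers' key uses a single worker.
        worker_processes = sum(sub.get("workers", 1) for sub in subtasks)
        if worker_processes > 2 * cpus:
            problems.append(
                f"block {index}: {worker_processes} worker processes exceed 2 x {cpus} CPUs"
            )
    return problems


if __name__ == "__main__":
    ruleset = {
        "version": "1.0",
        "tasks": [
            {
                "type": "parallel",
                "tasks": [
                    {"type": "mask_table", "table": f"table_{n}", "workers": 4}
                    for n in (1, 2, 3)
                ],
            }
        ],
    }
    for problem in check_ruleset(ruleset, cpus=4):
        print(problem)
```

Because parallel blocks run one after another, the worker total is checked per block rather than across the whole ruleset. In the usage example, three tasks with four workers each give twelve worker processes, which exceeds the guideline of eight for a four-CPU machine.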