DataMasque Portal

Masking runs

Overview

A masking run is the process of applying a masking Ruleset to a database or file Connection. Runs are created from the Database Masking or File Masking dashboards. To automate the masking process, runs can also be triggered via the API (see Best Practices).

Create database masking run

A new database masking run can be configured with the following steps:

  1. Navigate to the Database Masking dashboard.
  2. Select a connection from the list of available connections.
  3. Select a ruleset from the list of available rulesets.
  4. Set additional run options (detailed below).
  5. Click PREVIEW RUN or PREVIEW DRY RUN*.
  6. After configuring the run, you will be taken to the Preview and confirm run screen.

*Dry Run allows you to test your rulesets without modifying the database. When a dry run is executed, DataMasque performs every operation as usual except:

  • the final UPDATE operation of mask_table tasks, which would otherwise write the masked value to the database
  • the value generation and subsequent UPDATE operation for mask_unique_key tasks
  • the truncate_table operation
  • the run_sql operation
  • certain mask_table masking tasks (from_blob, from_unique_imitate, secure_shuffle), as these require temporary alterations to the database schema.

Database run options

Run options are displayed on the Run Options section of the Database Masking Dashboard.

Database run options

The following options are available:

  • Batch Size1 (default: 50,000): The maximum number of rows that will be fetched, masked, and updated in a single operation by DataMasque. Larger batch sizes reduce database operation overhead, but a batch size that is too large may exhaust DataMasque's memory. This value does not affect the total number of rows masked. The maximum allowed batch size is 50,000.
    Note: The Batch Size parameter is not applicable to Amazon Redshift and Amazon DynamoDB masking runs. It is only available for database masking runs.
  • Max rows3,4 (default: unset): The maximum number of rows that will be masked by each mask_table task. May be used to speed up test iterations when developing rulesets.
    Warning: If a table contains more rows than the value specified here, the remaining rows will contain unmasked data.
    Note: The Max rows parameter is not applicable to Amazon Redshift masking runs. It is only available for database masking runs.
  • Run secret (default: a secure random value): The run secret is used in the random generation of masked values. Please see the Run Secret Options below.
  • Continue on failure (default: false): If there is a task failure and this option is false, DataMasque will skip all remaining non-started tasks. If this option is true, DataMasque will continue performing other tasks even if a task fails. Setting this option to true can be useful when testing or debugging your masking ruleset, to identify as many failures as possible in each run.
  • Notify me when this run completes (default: false): Email the current user when the job completes. The DataMasque instance must have SMTP configured.
  • Disable instance secret (default: false): If this option is set to true, DataMasque will exclude its instance-specific secret and generate masked values based solely on the run secret. You may wish to disable the instance secret in order to achieve consistent masking across DataMasque instances. However, with the instance secret disabled, any DataMasque instance using the same run_secret could replicate your data masking.
  • Override connection's Server Side Encryption settings (default: false): Only valid for Amazon DynamoDB connections. If checked, table SSE key settings may be overridden on a per-table basis. Click the gear icon to view the settings. See the Specifying SSE Options documentation for DynamoDB for more information about these settings.
  • Max file size2 (default: see description): The max file size is only applicable to Amazon Redshift and Amazon DynamoDB databases. For Redshift it sets the MAXFILESIZE of the UNLOAD command. For DynamoDB it determines the maximum pre-compression size of batch files split from the original files exported by DynamoDB to S3. In both cases, this determines the maximum file size in MB of records that will be masked at once by a single DataMasque worker. The max file size can be set to any integer value between 5 MB and 1000 MB, and defaults to 10 MB for Redshift and 100 MB for DynamoDB (Redshift files are stored in the compact Parquet format, which expands to a greater extent when loaded into DataMasque's memory). Only available for database masking runs.
  • Enable Diagnostic Logging (default: false): With diagnostic logging enabled, extra information (columns, constraints, foreign keys and indexes) is captured to assist with diagnosing problems with masking runs. Because of the extra information collected, masking runs may take longer to execute with this option enabled. Memory information of the main process is output to the run log in the format Memory: T:31.19GB / F:0.26GB / A:10.60GB, where T, F, and A stand for Total, Free and Available memory respectively. Memory usage of workers is also captured, in the format Memory for PID(58): 355.17MB.

Notes:

1 Batch Size applies to database types other than Amazon Redshift and Amazon DynamoDB.

2 Max file size only applies to Amazon Redshift and Amazon DynamoDB databases.

3 Max rows does not apply to mask_unique_key tasks.

4 Use of the Max rows run option for Amazon Redshift is not yet supported in DataMasque. This is on our roadmap and will be added in a future release.

Create file masking run

A new file masking run can be configured with the following steps:

  1. Navigate to the File Masking dashboard.
  2. Select a source connection from the list of available connections (files will be read from here).
  3. Select a ruleset from the list of available rulesets.
  4. Select a destination connection from the list of available connections (files will be written to here).
  5. Set additional run options (detailed below).
  6. Click PREVIEW RUN or PREVIEW DRY RUN*.
  7. After configuring the run, you will be taken to the Preview and confirm run screen.

*Dry Run allows you to test your rulesets without modifying the files. When a dry run is executed, DataMasque downloads files from the source, applies include/skip rules and masks them, but does not upload the masked data to the destination. This lets you see which files would be masked and whether there are any errors in your ruleset.

File run options

Run options are displayed on the Run Options section of the File Masking Dashboard.

File run options

The following options are available:

  • Run secret (default: a secure random value): The run secret is used in the random generation of masked values. Please see the Run Secret Options below.
  • Continue on failure (default: false): If there is a task failure and this option is false, DataMasque will skip all remaining non-started tasks. If this option is true, DataMasque will continue performing other tasks even if a task fails. Setting this option to true can be useful when testing or debugging your masking ruleset, to identify as many failures as possible in each run.
  • Notify me when this run completes (default: false): Email the current user when the job completes. The DataMasque instance must have SMTP configured.
  • Disable instance secret (default: false): If this option is set to true, DataMasque will exclude its instance-specific secret and generate masked values based solely on the run secret. You may wish to disable the instance secret in order to achieve consistent masking across DataMasque instances. However, with the instance secret disabled, any DataMasque instance using the same run_secret could replicate your data masking.
  • Enable Diagnostic Logging (default: false): With diagnostic logging enabled, extra memory information is captured to assist with diagnosing problems with masking runs. Because of the extra information collected, masking runs may take longer to execute with this option enabled. Memory information of the main process is output to the run log in the format Memory: T:31.19GB / F:0.26GB / A:10.60GB, where T, F, and A stand for Total, Free and Available memory respectively. Memory usage of workers is also captured, in the format Memory for PID(58): 355.17MB.

Run Secret Options

A run secret is a critical component that controls the consistency of masked data generation. When the same run secret is used across multiple runs, DataMasque will generate consistent masked values, ensuring reproducibility when needed.

Note: To generate consistent masked values across multiple DataMasque instances, the Instance Secret should be disabled. Hashing may also be used. Please refer to the Consistent Masking guide for more information.

Configuration Options

There are three options for configuring the run secret.

1. Securely Generated Random Run Secret

  • Automatically generates a cryptographically secure random value for each run.
  • Best choice when you want unique masked data for each run.
  • No configuration needed - simply select this option.
  • Each run will produce different masked values.

2. Specify Your Own Run Secret

  • Allows you to input a custom run secret.
  • Requirements:
    • Minimum length: 20 characters.
    • Maximum length: 128 characters.
    • Can include letters, numbers, and special characters.
  • Use this option when you need reproducible masked data across multiple runs (one way to generate a compliant value is sketched below).
  • On the same DataMasque instance, the same run secret will generate the same masked values for repeated runs.
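
For example, a compliant custom run secret could be generated with a short script such as the sketch below. The character set and 64-character length are illustrative choices, not DataMasque requirements; any value that meets the stated length and character rules will work:

    import secrets
    import string

    # One possible way to generate a custom run secret that meets the stated
    # requirements (20-128 characters; letters, numbers, special characters).
    # The alphabet and length below are illustrative choices.
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*-_"
    run_secret = "".join(secrets.choice(alphabet) for _ in range(64))
    print(run_secret)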

3. Load from AWS Secrets Manager

Retrieves the run secret from AWS Secrets Manager. This is ideal for:

  • Team environments where secret management is centralised.
  • Maintaining consistent masking across different environments.
  • Meeting security compliance requirements, by limiting secret access to specific IAM roles.

The secret selection is a combo box, which allows you to select from a dropdown of available secrets or to enter a secret ARN directly.

The following AWS IAM permissions are used:

  • secretsmanager:GetSecretValue is required to read the selected secret at masking time.
  • secretsmanager:ListSecrets is optional, and only required to populate the dropdown list.
    • Without the ListSecrets permission, you can still manually specify any ARN that your role can access with the secretsmanager:GetSecretValue permission.
    • The expected ARN format is arn:aws:secretsmanager:<region>:<account-id>:secret:secret-name

Click the refresh icon to the right of the ARN selection to reload the list of run secret ARNs from AWS.

Note: Random Seed Priority: If a random_seed value is set in the ruleset and a run secret has also been provided, the random_seed takes precedence.

Preview and confirm run

This screen shows the preview of the configured run. Check the run parameters here before proceeding with execution via the START RUN button. After the run has been started, you will be redirected to the Run logs page where you can monitor the run output and progress.

You can view a curl command for starting an equivalently configured run using the DataMasque API by clicking the VIEW RUN COMMAND button. For more information, see the Best Practices guide and the API Reference.

Preview run

Run logs

The Run Logs screen displays a log of all historic runs, their statuses, and their individual log outputs. To access the Run Logs screen, choose the Run logs item from the main menu.

Run details

When a run is selected in the Run Logs panel, its details and log history are displayed in the Masking Run panel. While a run is still being executed, its log output will be streamed for continuous feedback on the run progress.

The following information is logged at the start of a run:

  • The target database type, e.g. Oracle
  • The version of the target database (where available), e.g.
    Oracle Database 12c Standard Edition Release 12.2.0.1.0 - 64bit Production
  • The 64-character SHA256 hash of the ruleset content

The run options used by this run can be found on the first log line.

Run logs

Database masking job summary

On completion of a database masking run, the run log will display the following information:

  • Masking run status: The final status of the run on completion. This will indicate whether the run completed successfully or failed.
  • Started at: The time at which the masking run was started.
  • Finished at: The time at which the run completed, failed or was cancelled.
  • Total time: The total time taken for the run.
  • Total tables masked: The total number of tables that were successfully masked.
  • Total columns masked: The total number of columns that were successfully masked.
  • Total rows masked: The total number of rows that were successfully masked.

An example of a successful database masking run log can be seen below:

Successful database run log

When a run fails, if continue_on_failure was not enabled, the total tables, columns and rows masked will be displayed as 0. While some rows or columns may have been masked, it cannot be determined which particular rows or columns were masked on a failed run, so all tables should be considered unmasked.

Failed database run log

If continue_on_failure was enabled for the run, the tables, columns and rows will instead reflect the total numbers of tables, columns and rows masked in successful mask_table and mask_unique_key tasks.

File masking job summary

On completion of a file masking run, the run log will display the following information:

  • Masking run status: The final status of the run on completion. This will indicate whether the run was completed successfully or failed.
  • Started at: The time at which the masking run was started.
  • Finished at: The time at which the run completed, failed or was cancelled.
  • Total time: The total time taken for the run.
  • Total files masked: The total number of files that were successfully masked. On failed runs, this will reflect the number of files that were masked by successful file masking tasks.

An example of a successful file masking run log can be seen below:

Successful file run log

An example of a failed file masking run log can be seen below:

Failed file run log

Run Report

The run report is a CSV file that holds more detailed information about the run that was performed.

Currently, only runs with a run_file_data_discovery task will produce a run report. The report contains details specifically about the files that were discovered, any files that were skipped, and any files that caused an error.

The run report will be generated within the datamasque_admin-server_1 container, under the directory /files/user/file_task_reports/<YYYY>/<MM>/, with the file name <run-id>.csv.
The directory structure /<YYYY>/<MM> corresponds to the year and month in which the run started.
For example, the report for the run with run ID 123, that started on April 1st, 2024, would be located at /files/user/file_task_reports/2024/04/123.csv.
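
For illustration, the snippet below builds the expected report path from a run ID and start date, following the convention described above (a minimal sketch; run ID 123 and the April 2024 start date are just example values):

    from datetime import datetime

    # Minimal sketch: build the expected report path inside the
    # datamasque_admin-server_1 container for a given run ID and start time.
    def run_report_path(run_id: int, started_at: datetime) -> str:
        return (
            "/files/user/file_task_reports/"
            f"{started_at:%Y}/{started_at:%m}/{run_id}.csv"
        )

    print(run_report_path(123, datetime(2024, 4, 1)))
    # -> /files/user/file_task_reports/2024/04/123.csv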

The report will contain the following information about the run_file_data_discovery task performed:

  • path: The path of the file that was discovered.
  • file_size: The size of the file.
  • file_type: The file type discovered if the extension is a supported file type: avro, avsc, csv, json, ndjson, parquet, xml.
  • file_name: The name of the file.
  • file_extension: The file extension.
  • skip_reason: The reason the file was skipped, detailed below.

The following are the reasons and explanations for files to be skipped during a run_file_data_discovery task:

  • File type unsupported by data discovery: The file is not a type supported by file discovery. Supported types are: CSV, JSON, NDJSON, Parquet.
  • File type could not be determined: DataMasque was unable to determine the file type. Most likely the file has no extension.
  • File contains a column with unsupported data type: For Parquet files, not all data types are supported by file discovery yet. See Supported Parquet Column data types for more information.
  • Matched a skip filter: The file was skipped because it matched a skip filter in the ruleset.
  • Did not match any include filter: The file did not match any include filter in the ruleset.
  • An error occurred while running data discovery: An unexpected error occurred during the discovery process.

There are several ways you can retrieve the run report for a file data discovery run:

  • On the Run Logs page, select the run and then click the DOWNLOAD RUN REPORT button at the bottom of the page.

  • Make a GET request to the endpoint /api/runs/{id}/run-report/ (a sketch of this approach appears after this list).

  • SSH into the machine running DataMasque and copy the report using the following command:

    docker cp datamasque_admin-server_1:/files/user/file_task_reports/<YYYY>/<MM>/<run-id>.csv <path-on-local-filesystem>
    

    Replace <YYYY>, <MM>, <run-id>, and <path-on-local-filesystem> as appropriate.
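
As a rough sketch of the API approach mentioned above, the following downloads a run report with Python's requests library. The base URL, API token, and Token authentication header are assumptions rather than documented values; substitute whatever your DataMasque instance and authentication scheme require:

    import requests

    # Hypothetical values: replace with your instance URL, API token, and run ID.
    BASE_URL = "https://datamasque.example.com"
    API_TOKEN = "your-api-token"
    RUN_ID = 123

    # The endpoint path is documented above; the authentication header shown
    # here is an assumption - use the scheme your instance is configured with.
    response = requests.get(
        f"{BASE_URL}/api/runs/{RUN_ID}/run-report/",
        headers={"Authorization": f"Token {API_TOKEN}"},
        timeout=60,
    )
    response.raise_for_status()

    with open(f"run-report-{RUN_ID}.csv", "wb") as report_file:
        report_file.write(response.content)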

Connection and ruleset snapshots

Snapshots of the connection and ruleset are kept for every masking run, maintaining an historical record of the exact configuration that was used for each run. Connection and ruleset snapshots can be viewed for the selected masking run by clicking on the connection or ruleset name displayed on the run detail panel.

Run detail snapshot

A modal window will open to display the snapshots, as captured at the time of masking run creation. The snapshot status indicates whether the current connection or ruleset configuration has been changed since this snapshot was taken. Clicking the edit link will allow you to edit the current connection or ruleset corresponding to the displayed snapshot.

Run log connection snapshot   Run log ruleset snapshot

There are three different statuses that may be displayed for a snapshot:

  • current: The details have not been changed since the run was performed.
  • modified: The details have been changed since the run was performed. The details shown no longer reflect the current state of the connection or ruleset.
  • deleted: The connection or ruleset no longer exists. If this is the case, there will be no option to edit.

Downloading a run log

While hovering over a run in the Run Logs panel, a download button will be shown. Clicking on this button will start a download containing the logs for this masking run.

Run log download

Downloading a sensitive data discovery report

When a run_data_discovery task is included in the masking ruleset, the resulting report for each run can be downloaded by clicking either the shield icon on the run row in the Run Logs list, or the Discovery Report chip on the Masking run detail panel. The report will be downloaded in CSV format and may be opened in a text editor or spreadsheet viewer such as Microsoft Excel. See Sensitive Data Discovery for more details.

Sensitive data discovery report download


Cancelling a run

If you wish to cancel a run, you may do so with the following steps:

  1. Select the run you wish to cancel from the list in the 'Run Logs' panel.
  2. Click the CANCEL RUN button at the bottom of the screen.
  3. After clicking YES on the confirmation dialog, the run status will be updated to cancelling and DataMasque will proceed to stop the run's in-progress masking tasks.
  4. Once all the run's tasks have been stopped, the run status will be updated to cancelled.

Note: During task cancellation, DataMasque will send an explicit request to the database to cancel any running queries. In most cases (if the query is in an interruptible phase), the database will catch this and stop the query immediately. If the query is in a non-interruptible phase, the database will still complete that phase before the query is terminated.

Run History

When a masking run completes, regardless of success or failure, DataMasque will attempt to record the result of the run: in the database being masked for database masking runs, or in a file in the base directory of the destination connection for file masking runs.

History is only written for masking runs that actually mask data (assuming there is data to mask, no errors occur, and so on). For example, dry runs or schema/data discovery runs will not be recorded in the history.

Database masking runs: Run History table

The table DATAMASQUE_RUN_HISTORY is created in the default schema for the user performing the masking, if it does not already exist. This table acts as a record of masking runs and helps verify that a masking run actually occurred. It does not guarantee that any amount of data has actually been masked.

The format of the table is as follows:

  • completion_time (DATETIME or TIMESTAMP): Time the run completed, in UTC. Example: 2024-01-15 15:23:11
  • dm_version (VARCHAR(80)): The DataMasque version. Example: 2.18.0
  • status (VARCHAR(15)): The result of the masking run: one of finished, finished_with_warnings, failed, or cancelled
  • run_id (NUMBER or INT): The ID of the run (as used in the API or run logs). Example: 156
  • ruleset_uuid (VARCHAR(36)): A unique identifier for each ruleset. Example: 516aef12-98ce-5192-a9c8-154a8c9b8d12
  • ruleset_content_sha256 (VARCHAR(64)): A hash of the contents of the ruleset used for masking. This hash is logged at the start of a run log. Example: a 64-character SHA256 hash
  • connection_snapshot (VARCHAR(150)): Summary of the database connection, including type, username and host. Does not include the password. Example: MySQL - mysql_user@mysql-host:3306
  • run_hash (VARCHAR(64)): A hash to compare with the hash of the masking run. Example: a 64-character SHA256 hash
  • metadata (see note below): Extra information specific to each masking run. Currently unused.

Note: the metadata column is stored as a CLOB for Oracle and DB2 LUW, JSONB for Postgres, JSON for MySQL/MariaDB, and NVARCHAR(MAX) for MSSQL.

Note: Only MariaDB, MySQL, MSSQL, PostgreSQL, Oracle, and DB2 LUW databases support the run history table.

A new row is added for every masking run, assuming the ruleset is valid and a run finishes.

Creation of and insertion into the DATAMASQUE_RUN_HISTORY table are automatic and will not cause problems if any part of the process fails. For example, if the user for the database connection of the masking run does not have the access required to create a table, this operation will fail silently and masking will not be affected.

The DATAMASQUE_RUN_HISTORY table is automatically excluded from any masking runs, schema discovery, and so on.
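
If you want to review past runs, the history table can be queried like any other table. The sketch below is a minimal example that assumes an open DB-API 2.0 connection to the masked database; the column names come from the table description above, and you may need to adjust quoting or schema qualification for your platform:

    def print_run_history(conn) -> None:
        # `conn` is assumed to be an open DB-API 2.0 connection to the masked
        # database (for example from psycopg2, pyodbc, or oracledb).
        cursor = conn.cursor()
        cursor.execute(
            "SELECT run_id, status, completion_time, ruleset_content_sha256 "
            "FROM DATAMASQUE_RUN_HISTORY "
            "ORDER BY completion_time DESC"
        )
        for run_id, status, completed_at, ruleset_hash in cursor.fetchall():
            print(run_id, status, completed_at, ruleset_hash)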

File masking runs: Run History file

DataMasque writes the run history to a file .datamasque_run_history.ndjson in the base directory of the destination connection. The format of the file is NDJSON, with one line per history entry. Each history entry is a JSON object with the following fields:

  • run_history_version (string): Version of the run history format used for this entry. Currently always "1.0"
  • completion_time (string): Time the run completed, in UTC. Example: "2024-01-15 15:23:11"
  • dm_version (string): The DataMasque version. Example: "2.20.0"
  • status (string): The result of the masking run: one of "finished", "finished_with_warnings", "failed", or "cancelled"
  • run_id (int): The ID of the run (as used in the API or run logs). Example: 156
  • ruleset_uuid (string): A unique identifier for each ruleset. Example: "516aef12-98ce-5192-a9c8-154a8c9b8d12"
  • ruleset_content_sha256 (string): A hash of the contents of the ruleset used for masking. This hash is logged at the start of a run log. Example: a 64-character SHA256 hash
  • connection_snapshot (string): Summary of the file connection, including type and base directory. Does not include any credentials. Example: "AWS S3 File Connection - path my/base/directory in bucket mybucket"
  • metadata (object or null): Extra information specific to each masking run. Currently unused.

As with the database history table, creation and modification of the file are automatic, and the masking run will not fail should there be an error writing the file.

The .datamasque_run_history.ndjson file is automatically excluded from any masking runs or data discovery, even if matched by a glob statement such as glob: *.ndjson.
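
Because each history entry is a single JSON object per line, the file is easy to inspect programmatically. A minimal sketch, assuming a local copy of the file in the current working directory:

    import json

    # Minimal sketch: read the NDJSON run history (one JSON object per line)
    # and print a short summary of each recorded run.
    with open(".datamasque_run_history.ndjson", "r", encoding="utf-8") as history_file:
        entries = [json.loads(line) for line in history_file if line.strip()]

    for entry in entries:
        print(entry["run_id"], entry["status"], entry["completion_time"])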

Verification of the ruleset hash in the run history

Because the ruleset content itself is not captured in the history table or file, manual validation is required: hash the ruleset content and compare the result against the stored ruleset_content_sha256 value.

To ensure a matching result, make sure the ruleset content is encoded as UTF-8 and includes any leading or trailing spaces, then hash the content with the SHA256 algorithm. The result should match the hash stored in the DATAMASQUE_RUN_HISTORY table or the .datamasque_run_history.ndjson file.
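
A minimal sketch of this check, assuming the ruleset content has been saved locally as ruleset.yaml (a placeholder name) exactly as it was uploaded:

    import hashlib

    # Minimal sketch: hash a local copy of the ruleset content and print the
    # SHA256 digest for comparison with ruleset_content_sha256.
    # "ruleset.yaml" is a placeholder file name.
    with open("ruleset.yaml", "rb") as ruleset_file:
        digest = hashlib.sha256(ruleset_file.read()).hexdigest()

    print(digest)

Reading the file in binary mode hashes the bytes exactly as stored, which for a UTF-8 encoded ruleset is equivalent to hashing the UTF-8 encoding of its content, with no whitespace trimmed.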

Simultaneous runs

To the same database

Simultaneous runs to the same database are not currently supported in DataMasque, as simultaneous masking can result in data being incorrectly masked. When there is a masking run in the status of queued, running or cancelling, subsequent masking runs to the same database connection cannot be scheduled.

If you wish to mask multiple tables in the same database simultaneously, it is recommended to use parallel tasks in a single ruleset.

To different databases

Simultaneous runs to different databases are supported in DataMasque.

For file connections

Simultaneous runs to the same source or destination file connection are not supported.