Ruleset Generation for Large Schemas

Introduction
Overview
Detailed steps
Notes

Introduction

As of v2.22, DataMasque can discover database schemas of almost any size and generate rulesets for them. However, to avoid overloading the user's Web browser, discovery results in the UI are now limited to 1,000 columns. Where the discovery result contains more than 1,000 total columns across all the databases/schemas and tables that were discovered, you must use the following API-based approach instead.

Overview

The ruleset generation process for large schemas is:

As for normal schema discovery, use the Ruleset Generator page to configure discovery options and start a discovery run.
Once the discovery run is complete, make an API request to download the discovery report, which is a CSV file.
Edit the CSV to specify which column(s) should be included in the rulesets.
Make an API request to upload the CSV with column selections to DataMasque.
Make an API request to poll the status of the ruleset generation.
When ruleset generation is complete, download a ZIP file of the rulesets.
Review and edit the rulesets, then upload them to DataMasque.

Each generated ruleset contains masking tasks for one or more tables. The ruleset's filename indicates the date it was generated and the range of table names covered by the ruleset, for example 2024-09-15 TABLE_00098 - TABLE_00193.yml. If a ruleset contains only a single table, it is named with just that table: 2024-09-15 TABLE_00204.yml.

For more information about the API calls involved, refer to the API Reference. Each API call below is linked to the relevant part of the API Reference page.

Detailed steps

Obtain the connection ID for the connection. You can find this by going to the Database Masking page, locating the connection in the list, and clicking the Edit button. The connection ID is displayed at the top of the page.

The connection ID has the UUID4 format, for example 7bd5ee8d-aa33-4fc3-b8b5-791e7c9759a5. Use the Copy to clipboard button to copy the connection ID to clipboard.
Start discovery as usual by using the UI. Navigate to the Ruleset Generator page, choose the connection, configure the discovery settings (target schemas/databases, keywords and In-Data Discovery settings), and click Run Discovery or Rerun Discovery.
Once discovery has started, navigate to the Run Logs page.
The discovery run is in progress. You can identify it by the connection you selected, and its ruleset name of $auto_schema_discovery$ . Note the run ID.
View the progress of the discovery by selecting this run and viewing the run logs.
Wait for the discovery run to complete successfully.
Authenticate with the DataMasque API as a mask builder or admin user.
Download the CSV report by making a GET request to https://<DataMasque URL>/api/runs/<run ID>/db-discovery-results/report/, where the run ID (an integer) is the one you noted in step 4. For example, the following curl command downloads the report for run 123 and saves it to report.csv:

curl -k -H "Authorization: Token <token>" -o report.csv \
    https://<DataMasque URL>/api/runs/123/db-discovery-results/report/

Note: The schema discovery report downloadable in the DataMasque Web UI does not contain the Selected column. Be sure to download the report using the API call above.

When you have the CSV, open it in a text editor, Excel, or some other suitable application. The CSV is sorted alphabetically by schema, table and column name. The leftmost column of the CSV is titled Selected and indicates which database columns you wish to generate a ruleset for.

To select a column, change the Selected value to any of 1, y, yes, or true (case-insensitive).
To exclude a column, leave the Selected value blank, or set it to 0, n, no, or false (case-insensitive).

Note: Any columns which are child columns of foreign keys cannot be selected. If you select any such columns in your CSV, DataMasque ignores these selections. To mask such columns, mask the corresponding parent column of the foreign key, and DataMasque will then automatically update the child column.

Note: Selecting thousands of columns manually can be rather tedious and error-prone. The filtering and auto-fill features in Excel or similar spreadsheet software can help speed the process up. You can also select columns using a script. For example, the Unix command sed 's/^,/y,/g' report.csv >selected_report.csv copies the report.csv file to selected_report.csv and marks all columns as selected.

Warning: When editing the CSV, retain the ordering of the rows. The order in which tasks are generated is based on the order of the CSV rows, so reordering can cause the tasks for one table to be spread across multiple rulesets. Reordering the rows can also negatively impact performance. Provided the order is retained, it is otherwise safe to delete unwanted rows (but not the header row) from the CSV.

Make a POST request to https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/from-csv, providing as form data a target_size_bytes field which specifies the approximate size of each generated ruleset (default is 500KB), and a csv_file field containing the CSV content.

The connection ID is the one you noted in step 1.

An example curl command is:

curl -k -H "Authorization: Token <token>" -X POST \
    -F 'connection=<connection ID>` \
    -F 'target_size_bytes=500000' \
    -F 'csv_file=@selected_report.csv'
    https://<DataMasque URL>/api/generate-ruleset-from-csv/

Note: The maximum upload size is 250MB. If the CSV exceeds this size, you can optionally split it into multiple CSVs (the header row must be present in each), then compress the CSV(s) into a ZIP file and upload that.

Make periodic GET requests to https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/. DataMasque returns JSON data indicating the status of the ruleset generation process and any error message. The status is running while ruleset generation is in progress, and failed if there is an error. When the status changes to finished, the ruleset generation is complete.

Below is a sample response showing that ruleset generation is still in progress:

{
  "connection": "7bd5ee8d-aa33-4fc3-b8b5-791e7c9759a5",
  "generated_ruleset": null,
  "status": "running",
  "error_message": null,
  "status_message": "Saved ruleset \"2024-09-15 TABLE_00098 - TABLE_00193.yml\"",
  "last_updated": "2024-09-15T10:03:22.956011Z"
}

Make a GET request to https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/download-rulesets/ to download the rulesets as a ZIP file. For example, the following command saves the output to rulesets.zip:

curl -k -H "Authorization: Token <token>" -o rulesets.zip \
    https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/download-rulesets/

Move the downloaded ZIP file to a suitable directory and unzip it. You can now review and edit the rulesets using a text editor, then upload them to DataMasque using either the API or the Upload Ruleset button in the YAML editor.

Upload ruleset

Notes

Incompatibility with discovery results from previous DataMasque versions

To support very large schemas, the internals of the database schema discovery process were changed significantly in v2.22. As such, it is not possible to use the above method to generate a ruleset unless schema discovery has been run on the connection using DataMasque v2.22 or later. If your discovery results are from an earlier version of DataMasque and are too large to display in the UI, you will need to rerun schema discovery.

Ruleset size configuration

As part of the API request to start generating rulesets, the target_size_bytes option specifies a target maximum size for each ruleset, for example 500kB. Each of the generated rulesets in the final ZIP file are then approximately this size. Limiting the size of each ruleset makes them easier to edit and avoids causing performance issues in text editor programs or in DataMasque's YAML editor.

The length of a ruleset scales very close to linearly with the file size. As a rough guide, a ruleset that masks 1,000 columns contains approximately 4,500 lines of YAML and has a file size of approximately 125KB.

Performance

Discovering a large schema can take several hours, particularly when using In-Data Discovery. The time taken is influenced by the DataMasque VM's CPU core speed, the target database's processing power, and the latency of DataMasque's connection to the target database.

Ruleset generation is much faster, on the order of 20-30 minutes per million columns on a fast VM. It is done entirely on the DataMasque VM, so the limiting factors are CPU core speed and disk I/O on the VM.

DataMasque only uses one CPU core for both discovery and ruleset generation, so increasing the number of cores will not improve performance.

Tools

The DataMasque team can provide you with scripts for bulk-upload and bulk-execution of rulesets. Alternatively, you can write your own using the POST /rulesets/ and POST /runs/ REST APIs.