Ruleset Generation for Large Schemas
Introduction
As of v2.22, DataMasque can discover database schemas of almost any size and generate rulesets for them. However, to avoid overloading the user's Web browser, discovery results in the UI are now limited to 1,000 columns. Where the discovery result contains more than 1,000 total columns across all the databases/schemas and tables that were discovered, you must use the following API-based approach instead.
Overview
The ruleset generation process for large schemas is:
- As for normal schema discovery, use the Ruleset Generator page to configure discovery options and start a discovery run.
- Once the discovery run is complete, make an API request to download the discovery report, which is a CSV file.
- Edit the CSV to specify which column(s) should be included in the rulesets.
- Make an API request to upload the CSV with column selections to DataMasque.
- Make an API request to poll the status of the ruleset generation.
- When ruleset generation is complete, download a ZIP file of the rulesets.
- Review and edit the rulesets, then upload them to DataMasque.
Each generated ruleset contains masking tasks for one or more tables.
The ruleset's filename indicates the date it was generated,
the number of the ruleset (starting at 1),
and the range of table names covered by the ruleset,
for example 20240915_012_TABLE_00098_to_TABLE_00193.yml
.
If a ruleset contains only a single table, it is named with just that table: 20240915_017_TABLE_00204.yml
.
Long table names are truncated in the ruleset filename, as ruleset names are limited to 64 characters.
For more information about the API calls involved, refer to the API Reference. Each API call below is linked to the relevant part of the API Reference page.
Detailed steps
Obtain the connection ID for the connection. You can find this by going to the Database Masking page, locating the connection in the list, and clicking the Edit button. The connection ID is displayed at the top of the page.
The connection ID has the UUID4 format, for example
7bd5ee8d-aa33-4fc3-b8b5-791e7c9759a5
. Use the Copy to clipboard button to copy the connection ID to clipboard.Start discovery as usual by using the UI. Navigate to the Ruleset Generator page, choose the connection, configure the discovery settings (target schemas/databases, keywords and In-Data Discovery settings), and click Run Discovery or Rerun Discovery.
Once discovery has started, navigate to the Run Logs page.
The discovery run is in progress. You can identify it by the connection you selected, and its ruleset name of
$auto_schema_discovery$
. Note the run ID.View the progress of the discovery by selecting this run and viewing the run logs.
Wait for the discovery run to complete successfully.
Authenticate with the DataMasque API as a mask builder or admin user to obtain a user token for use in API requests.
Note: A user token is required to use the discovery and ruleset generation API endpoints. API tokens do not have sufficient permissions.
Below is an example curl
command and response showing how to authenticate.
The key
in the response is the token
to pass in the Authorization
header in subsequent requests below.
It is a 40-character hexadecimal string.
curl -k -H "Content-Type: application/json" -X POST \
-d '{"username": "<user>", "password": "<password>"}' \
https://<DataMasque URL>/api/auth/token/login/
{
"id": 322,
"key": "7e390082eb4728abdf5dc04ef899b8ebc2b790a8",
"client_ip": "10.220.2.1",
...
"date_time_expires":"2024-09-26T14:19:31.571550Z"
}
- Download the CSV report by making a
GET
request tohttps://<DataMasque URL>/api/runs/<run ID>/db-discovery-results/report/
, where the run ID (an integer) is the one you noted in step 4. For example, the followingcurl
command downloads the report for run123
and saves it toreport.csv
:
curl -k -H "Authorization: Token <token>" -o report.csv \
https://<DataMasque URL>/api/runs/123/db-discovery-results/report/
Note: The schema discovery report downloadable in the DataMasque Web UI does not contain the
Selected
column. Be sure to download the report using the API call above.
- When you have the CSV, open it in a text editor, Excel, or some other suitable application.
The CSV is sorted alphabetically by schema, table and column name.
The leftmost column of the CSV is titled
Selected
and indicates which database columns you wish to generate a ruleset for.
- To select a column, change the
Selected
value toyes
(case-insensitive). - To exclude a column, leave the
Selected
value blank, or set it tono
(case-insensitive).
Note: Any columns which are child columns of foreign keys cannot be selected. If you select any such columns in your CSV, DataMasque ignores these selections. To mask such columns, mask the corresponding parent column of the foreign key, and DataMasque will then automatically update the child column.
Note: Selecting thousands of columns manually can be rather tedious and error-prone. The filtering and auto-fill features in Excel or similar spreadsheet software can help speed the process up. You can also select columns using a script. For example, the Unix command
sed 's/^,/y,/g' report.csv >selected_report.csv
copies thereport.csv
file toselected_report.csv
and marks all columns as selected.Warning: When editing the CSV, retain the ordering of the rows. The CSV is sorted by schema name, then table name. The order in which tasks are generated is based on the order of the CSV rows, so reordering can cause the tasks for one table to be spread across multiple rulesets. Reordering the rows can also negatively impact performance. Provided the order is retained, it is otherwise safe to delete unwanted rows (but not the header row) from the CSV.
Make a
POST
request tohttps://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/from-csv/
, providing as form data acsv_or_zip_file
field containing the CSV content, and an optionaltarget_size_bytes
field which specifies the approximate size of each generated ruleset (default is 500KB),The connection ID is the one you noted in step 1.
An example
curl
command is:
curl -k -H "Authorization: Token <token>" -X POST \
-F 'csv_or_zip_file=@selected_report.csv' \
-F 'target_size_bytes=500000' \
https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/from-csv/
Note: The maximum upload size is 250MB. If the CSV exceeds this size, you can optionally split it into multiple CSVs (the header row must be present in each), then compress the CSV(s) into a ZIP file and upload that.
- Make periodic
GET
requests tohttps://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/
. DataMasque returns JSON data indicating the status of the ruleset generation process and any error message. The status isrunning
while ruleset generation is in progress, andfailed
if there is an error. When the status changes tofinished
, the ruleset generation is complete.
Below is a sample curl
command and response showing that ruleset generation is still in progress:
curl -k -H "Authorization: Token <token>" https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/
{
"connection": "7bd5ee8d-aa33-4fc3-b8b5-791e7c9759a5",
"generated_ruleset": null,
"status": "running",
"error_message": null,
"status_message": "Saved ruleset \"2024-09-15 TABLE_00098 - TABLE_00193.yml\"",
"last_updated": "2024-09-15T10:03:22.956011Z"
}
- Make a
GET
request tohttps://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/download-rulesets/
to download the rulesets as a ZIP file. For example, the following command saves the output torulesets.zip
:
curl -k -H "Authorization: Token <token>" -o rulesets.zip \
https://<DataMasque URL>/api/async-generate-ruleset/<connection ID>/download-rulesets/
- Move the downloaded ZIP file to a suitable directory and unzip it. You can now review and edit the rulesets using a text editor, then upload them to DataMasque using either the API or the Upload Ruleset button in the YAML editor.
Notes
Incompatibility with discovery results from previous DataMasque versions
To support very large schemas, the internals of the database schema discovery process were changed significantly in v2.22. As such, it is not possible to use the above method to generate a ruleset unless schema discovery has been run on the connection using DataMasque v2.22 or later. If your discovery results are from an earlier version of DataMasque and are too large to display in the UI, you will need to rerun schema discovery.
Ruleset size configuration
As part of the API request to start generating rulesets,
the target_size_bytes
option specifies a target maximum size for each ruleset, for example 500kB.
Each of the generated rulesets in the final ZIP file are then approximately this size.
Limiting the size of each ruleset makes them easier to edit
and avoids causing performance issues in text editor programs or in DataMasque's YAML editor.
The length of a ruleset scales very close to linearly with the file size. As a rough guide, a ruleset that masks 1,000 columns contains approximately 4,500 lines of YAML and has a file size of approximately 125KB.
Performance
Discovering a large schema can take several hours, particularly when using In-Data Discovery. The time taken is influenced by the DataMasque VM's CPU core speed, the target database's processing power, and the latency of DataMasque's connection to the target database.
Ruleset generation is much faster, on the order of 20-30 minutes per million columns on a fast VM. It is done entirely on the DataMasque VM, so the limiting factors are CPU core speed and disk I/O on the VM.
DataMasque only uses one CPU core for both discovery and ruleset generation, so increasing the number of cores will not improve performance.
Lifetime of results
Discovery results are stored per run, but only the most recent results for any given connection can be used to generate a ruleset. As such, if you perform two discovery runs on the same connection, you can only generate a ruleset from the results of the second run.
Ruleset generation results are stored per connection, and only the most recent generated ruleset ZIP is available from the API.
- If you delete a connection object, generated rulesets for that connection will be lost.
- If you clone a connection object, generated rulesets from the original connection are not copied to the new connection.
- If you perform two ruleset generations for a particular connection, you can only download the second generated ruleset ZIP.
You can use the DELETE /api/runs/{id}/db-discovery-results/
API
to delete discovery results for a run,
and the DELETE /api/async-generate-ruleset/{connection_id}/
API
to delete the most recent generated ruleset ZIP file for a connection.
This should only be a last resort in very low disk space situations.
Tools
The DataMasque team can provide you with scripts for bulk-upload and bulk-execution of rulesets.
Alternatively, you can write your own using the POST /rulesets/
and POST /runs/
REST APIs.