Ruleset Generation
Overview
DataMasque provides a Ruleset Generator, which generates a YAML ruleset for masking the database tables on a connection.
The Ruleset Generator runs a run_schema_discovery task in the background to discover database tables.
Navigate to the Ruleset Generator page to utilise this functionality.
Note: The schema discovery feature does not currently support AWS Redshift.
Methodology
For more information about the methodology behind the run_schema_discovery task, see the Schema Discovery guide.
Ruleset Generator
An existing connection can be selected from the dropdown box. If a run_schema_discovery task has already been run, the table will populate with the data from the most recent run; otherwise, click the "Run Discovery" button to create a new run with the run_schema_discovery task. Additional Custom Data Classification and Ignored keywords can be added for this run. For more information, please refer to the Additional Keywords section.
Once the run is completed the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report will be downloaded as a CSV similar to the Sensitive Data Discovery report.
The CSV report contains the following columns:
Table schema | The schema of the table discovered. |
Table name | The name of the table discovered. |
Column name | The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords if keyword matches are selected. |
Data Type | The column data type specified in the database metadata. |
Constraint | Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present. |
Max Length | If the column is a text field, this contains the max length of the column. Otherwise, this value is an empty string. |
Numeric Precision | If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string. |
Numeric Scale | If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string. |
Reason for flag | Description of the pattern which caused the column to be flagged for sensitive data. |
Foreign Keys | A list of any foreign keys that reference this column, described in the following pattern: (fk_name, referenced_column). |
Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords). |
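As an illustration of working with the downloaded report, the following minimal sketch filters a report for columns carrying a given data classification. It assumes the CSV header names match the column names listed above; the sample rows and the `columns_with_classification` helper are hypothetical and not part of DataMasque.

```python
import csv
import io

# Hypothetical report excerpt; a real report contains all the columns listed above.
SAMPLE_REPORT = """Table schema,Table name,Column name,Data Type,Data classifications
public,customers,email,varchar,PII
public,customers,card_number,varchar,"PII, PCI"
public,orders,order_total,numeric,
"""


def columns_with_classification(report_file, classification):
    """Return (schema, table, column) tuples whose classifications include `classification`."""
    matches = []
    for row in csv.DictReader(report_file):
        # The classifications cell is a comma-separated list, e.g. "PII, PCI".
        tags = [tag.strip() for tag in row["Data classifications"].split(",") if tag.strip()]
        if classification in tags:
            matches.append((row["Table schema"], row["Table name"], row["Column name"]))
    return matches


pci_columns = columns_with_classification(io.StringIO(SAMPLE_REPORT), "PCI")
print(pci_columns)  # [('public', 'customers', 'card_number')]
```

To use this against a real download, pass an open file object instead of the `io.StringIO` wrapper, and adjust the header names if they differ in your report.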
The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.
After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.
Note: When a mask_table task is generated, the key for the task is selected through the following process:
- The primary key is used when there is a primary key constraint on the target table.
- When there is no primary key, a unique key will be used instead.
- If there are multiple unique keys, some of which are composite unique keys, the key with the fewest columns will be selected.
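The key selection process described in this note can be sketched roughly as follows. This is an illustrative approximation only, not DataMasque's implementation: the `select_key` function and its inputs are hypothetical, and the behaviour when two unique keys have the same number of columns is not specified by this guide (the sketch simply takes the first).

```python
from typing import List, Optional, Sequence


def select_key(
    primary_key: Sequence[str],
    unique_keys: Sequence[Sequence[str]],
) -> Optional[List[str]]:
    """Pick key columns: primary key first, else the unique key with the fewest columns."""
    if primary_key:
        return list(primary_key)
    if unique_keys:
        # min() returns the first of several equally short keys (an assumption, see above).
        return list(min(unique_keys, key=len))
    # No primary or unique key is available on the table.
    return None


print(select_key(["id"], [["email"]]))                         # ['id']
print(select_key([], [["tenant_id", "email"], ["username"]]))  # ['username']
print(select_key([], []))                                      # None
```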
Keywords
Built-in Keywords
Built-in keywords can be enabled or disabled. Disabling them only stops columns from being classified as PII, PHI or PCI, and stops the reasons for those flags from being reported.
Additional Keywords
Additional Keywords can be configured for a run_schema_discovery task run on a connection.
A modal will be opened in which keywords can be added manually to the list, or a file with additional keywords can be uploaded.
The global keywords set on the Settings page can also be included by switching on the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles.
Custom Data Classification Keywords
Additional Custom Data Classification Keywords can be configured through "Configure Custom Keywords".
- DataMasque will convert each keyword to a regular expression pattern that will match column names with
(space).
- Column names that match the Custom Data Classification keywords will be reported and tagged with the data classification "Custom" in Schema Discovery reports.
- Only spaces and alphanumeric characters are allowed for Custom Data Classification keywords.
Ignored Keywords
Additional Ignored Keywords can be configured through "Configure Ignored Keywords".
- Ignored keywords will only ignore exact matches of a column name, which allows you to exclude specific column names from Schema Discovery reports.
- Only spaces, underscores, hyphens and alphanumeric characters are allowed for Ignored keywords.
- Column names matched to ignored keywords will be excluded from the Schema Discovery report.
For more information about keywords please refer to:
Generated YAML Ruleset
After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button.
This will automatically generate a ruleset containing mask_table or mask_unique_key tasks for those columns.
The ruleset is generated as follows:
Column names are first matched against the Built-in Keywords using the same method as Sensitive Data Discovery. If no match is found, a ruleset based on the data type of the column is selected.
The type of the task will also be mapped according to the column name or the data type of the column. Further modifications of the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.
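The "keyword first, data type second" ordering described above can be sketched as below. This is a simplified illustration under stated assumptions: the `choose_rule_basis` function and the keyword list are hypothetical, and a case-insensitive substring test stands in for the actual Sensitive Data Discovery matching, which this guide does not detail.

```python
from typing import Iterable, Tuple


def choose_rule_basis(
    column_name: str,
    data_type: str,
    builtin_keywords: Iterable[str],
) -> Tuple[str, str]:
    """Return ("keyword", match) if a keyword matches, else ("data_type", data_type)."""
    normalised = column_name.lower()
    for keyword in builtin_keywords:
        # Assumption: substring matching is a placeholder for the real keyword matching.
        if keyword.lower() in normalised:
            return ("keyword", keyword)
    # No keyword matched, so fall back to the column's data type.
    return ("data_type", data_type)


keywords = ["email", "phone"]  # hypothetical keyword list
print(choose_rule_basis("customer_email", "varchar", keywords))  # ('keyword', 'email')
print(choose_rule_basis("order_total", "numeric", keywords))     # ('data_type', 'numeric')
```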
JSON Columns
Any json or jsonb type columns detected by the Ruleset Generator will be masked with a from_fixed mask with the value {} (empty JSON object/dictionary). This provides a safe default by effectively blanking out any JSON columns.
For proper masking of JSON columns, please use a json mask instead. The json mask can traverse a JSON document and update individual elements while retaining its structure.
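To illustrate the difference between the two approaches, the following conceptual sketch contrasts replacing a whole JSON value with {} against updating a single element in place. It uses only Python's standard json module and does not use DataMasque's mask syntax; the function names, the sample document and the path format are all hypothetical.

```python
import json


def blank_json(_document: str) -> str:
    """Mimic the generated default: replace the whole value with an empty JSON object."""
    return "{}"


def mask_json_path(document: str, path: list, replacement) -> str:
    """Replace the element reached by following `path` (non-empty list of keys/indexes)."""
    data = json.loads(document)
    node = data
    for step in path[:-1]:
        node = node[step]  # walk down to the parent of the target element
    node[path[-1]] = replacement
    return json.dumps(data)


original = '{"name": "Alice", "address": {"city": "Auckland", "postcode": "1010"}}'
print(blank_json(original))
# {}
print(mask_json_path(original, ["address", "postcode"], "0000"))
# {"name": "Alice", "address": {"city": "Auckland", "postcode": "0000"}}
```

The first call discards all structure, which is safe but loses information; the second keeps every other element intact, which is the behaviour the json mask is described as providing.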
Ruleset Editor
The Ruleset Editor is the same as the editor used when creating/editing a ruleset.
See the Ruleset Editor guide for more information on this feature.