Ruleset Generation
- Overview
- Methodology
- Ruleset Generator
- Keywords
- Generated YAML Ruleset
- Troubleshooting Generated Rulesets
- Ruleset Editor
Overview
DataMasque provides the Ruleset Generator, which can be used to generate a YAML ruleset to mask database tables on a connection.
The Ruleset Generator runs a run_schema_discovery task in the background to discover database tables.
Navigate to the Ruleset Generator page to utilise this functionality.
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server).
Methodology
For more information about the methodology behind the run_schema_discovery task, see the Schema Discovery guide.
Ruleset Generator
An existing connection can be selected from the dropdown box. If a run_schema_discovery task has been run, the table will populate with the most recent run data; otherwise, click the "Run Discovery" button to create a new run with the run_schema_discovery task. Additional Custom Data Classification and Ignored keywords can be added for this run; for more information, please refer to the Additional Keywords section.
Schemas
By default, schema discovery will run against the schema configured on the database connection or, if none is configured there, the database user's default schema.
Alternatively, you can specify the schemas to discover by clicking the Configure schemas button and entering the schema names, or uploading them from a CSV file.
Notes:
- MySQL doesn't have the concept of a schema; instead, it uses databases to represent this concept (a grouping of tables). When a MySQL database connection is selected, the word "schema" in the UI will be replaced by "database" to reflect this.
- Schema (or database, for MySQL) names must be complete matches and are case-sensitive. Partial matches and wildcards are not supported. For example, entering myschema will match only myschema, not mySCHEMA nor myschema_1.
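The matching rule above can be illustrated with a minimal Python sketch. It is not DataMasque code; the function name select_schemas and the sample schema names are hypothetical, and it only shows what "complete, case-sensitive match" means in practice.

```python
def select_schemas(requested, available):
    """Return the available schema names that exactly equal a requested name.

    Comparison is a plain string equality test: no case folding,
    no prefix matching, and no wildcard expansion.
    """
    requested_set = set(requested)
    return [name for name in available if name in requested_set]


# Hypothetical schema names mirroring the example in the note above.
available = ["myschema", "mySCHEMA", "myschema_1"]
matched = select_schemas(["myschema"], available)
print(matched)
```

Only the exact name is selected; the differently cased and suffixed names are left out.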
Schema discovery results
Once the run is completed, the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report will be downloaded as a CSV similar to the Sensitive Data Discovery report.
The CSV report contains the following columns:
Table schema | The schema of the table discovered. |
Table name | The name of the table discovered. |
Column name | The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords if keyword matches are selected. |
Data Type | The column data type specified in the database metadata. |
Constraint | Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present. |
Max Length | If the column is a text field, this contains the max length of the column. Otherwise, this value is an empty string. |
Numeric Precision | If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string. |
Numeric Scale | If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string. |
Reason for flag | Description of pattern which caused the column to be flagged for sensitive data. |
Foreign Keys | A list of any foreign keys that reference this column, each described in the pattern (fk_name, referenced_column). |
Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords). |
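As an illustration of working with the downloaded report, the following Python sketch lists the columns flagged with a given classification. It assumes the CSV header names match the column names listed above exactly, which should be checked against a real report; the sample data and the function name columns_with_classification are hypothetical.

```python
import csv
import io

# Hypothetical sample data; a real report would be opened from the
# downloaded file, e.g. open("report.csv", newline="").
SAMPLE_REPORT = """Table schema,Table name,Column name,Data classifications
public,customers,email,PII
public,customers,card_number,"PCI, PII"
public,orders,notes,
"""


def columns_with_classification(report_file, classification):
    """Return (schema, table, column) tuples flagged with the classification."""
    matches = []
    for row in csv.DictReader(report_file):
        # The classifications cell is a comma-separated list.
        labels = [label.strip() for label in row["Data classifications"].split(",")]
        if classification in labels:
            matches.append(
                (row["Table schema"], row["Table name"], row["Column name"])
            )
    return matches


pii_columns = columns_with_classification(io.StringIO(SAMPLE_REPORT), "PII")
print(pii_columns)
```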
The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.
After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.
Notes:
- Foreign key columns cannot be selected in the user interface, as they should only be updated as the result of masking the columns they reference.
Keywords
Built-in Keywords
Built-in keywords can be enabled or disabled. Disabling them will only stop the classification of columns as PII, PHI or PCI, and the reasons for those flags.
Additional Keywords
Additional keywords can be configured for a run_schema_discovery task run on a connection.
A modal will be opened in which keywords can be added manually to the list, or a CSV file with additional keywords can be uploaded. The format and interpretation of additional custom data classification keywords and ignored keywords entered on the ruleset generator page is exactly the same as for the global keywords - see the links below.
The global keywords set on the Settings page will also be included if the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles are switched on.
For more information about keywords please refer to:
Generated YAML Ruleset
After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button.
This will automatically generate a ruleset containing mask_table or mask_unique_key tasks for those columns.
The generation of the ruleset is as follows:
- DataMasque will generate mask_unique_key tasks for as many selected unique columns (unique keys, primary keys, and foreign key targets) as possible.
  - If one or more columns appear in multiple composite unique or primary keys or foreign key targets, a mask_unique_key task will only be generated for one of those column sets in order to not break uniqueness or referential integrity. If one of those column sets is a subset of another, the subset will always be masked by mask_unique_key in order to guarantee the uniqueness of both the subset and superset.
  - mask_unique_key tasks will only be generated for connections that support them.
- mask_table tasks will be generated for the remaining selected columns.
  - Unique columns (unique keys, primary keys, and foreign key targets) not masked by mask_unique_key will be masked by from_unique masks with formats appropriate for column data types. Unique columns masked with mask_table will be listed in a documentation block at the top of the generated YAML ruleset.
  - For other columns, the column names are first matched to the Built-in Keywords using the same method as sensitive data discovery in order to select an appropriate mask. If a match is not found, then an appropriate mask type is selected based on the column's data type.
  - The key for the mask_table task is selected from the following options (in order of precedence):
    - The table's primary key, if it is not to be masked by from_unique.
    - The unique key with the fewest columns that is not to be masked by from_unique.
    - The column or set of columns targeted by a foreign key with the fewest columns that is not to be masked by from_unique. While the target of a foreign key is not guaranteed to be unique for all connections (e.g. MySQL), it is expected to be sufficiently unique to act as the key for mask_table.
    - If none of the options above can be selected, the key will be set to REPLACE_ME.
Further modifications to the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.
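The order of precedence used to select the mask_table key can be sketched in Python as follows. This is an illustration of the documented rules only, not DataMasque's implementation: the TableKeys class and choose_key function are hypothetical, and the sketch assumes that a key counts as "masked by from_unique" if any of its columns is.

```python
from dataclasses import dataclass, field

PLACEHOLDER = "REPLACE_ME"


@dataclass
class TableKeys:
    """Hypothetical container for the key metadata of one table."""
    primary_key: tuple = ()
    unique_keys: list = field(default_factory=list)
    foreign_key_targets: list = field(default_factory=list)


def choose_key(keys, masked_with_from_unique):
    """Pick a mask_table key following the documented order of precedence."""

    def usable(columns):
        # Assumption: a key is unusable if any of its columns is
        # masked with from_unique.
        return bool(columns) and not set(columns) & masked_with_from_unique

    # 1. The primary key, if usable.
    if usable(keys.primary_key):
        return keys.primary_key
    # 2. The usable unique key with the fewest columns, then
    # 3. the usable foreign key target with the fewest columns.
    for candidates in (keys.unique_keys, keys.foreign_key_targets):
        usable_candidates = [c for c in candidates if usable(c)]
        if usable_candidates:
            return min(usable_candidates, key=len)
    # 4. Nothing suitable: leave a placeholder for the user to replace.
    return PLACEHOLDER


keys = TableKeys(
    primary_key=("id",),
    unique_keys=[("first_name", "last_name"), ("email",)],
    foreign_key_targets=[("code",)],
)
print(choose_key(keys, {"id"}))
```

When the placeholder is returned, the generated ruleset needs a manual edit before it can run, which is why REPLACE_ME is worth searching for in the Ruleset Editor.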
Notes:
- In certain circumstances, the generated ruleset may not mask all selected columns, such as:
  - Columns where no masking approach can be determined that would not break referential integrity for one or more foreign keys.
- In certain circumstances, additional columns that were not selected may also be masked, such as:
  - Foreign keys referencing masked columns.
  - Unselected columns in groups of jointly unique columns where at least one column is selected, including composite unique keys, primary keys, and the targets of foreign keys.
- In both of the above cases, the columns that could not be masked or were additionally masked will be listed in a documentation block at the top of the generated YAML ruleset.
JSON Columns
Any json or jsonb type columns detected by the Ruleset Generator will be masked with a from_fixed mask with the value {} (empty JSON object/dictionary). This provides a safe default by effectively blanking out any JSON columns.
For proper masking of JSON columns, please use a json mask instead. The json mask can traverse a JSON document and update individual elements while retaining its structure.
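The difference between the two approaches can be illustrated in plain Python. This sketch does not use DataMasque or its json mask; blank_document and mask_path are hypothetical helpers that only show the effect of blanking a document versus replacing a single element while keeping the structure.

```python
import json


def blank_document(_raw):
    """Mimic the generated default: replace the whole document with {}."""
    return "{}"


def mask_path(raw, path, replacement):
    """Replace one nested value, leaving the rest of the document intact."""
    document = json.loads(raw)
    node = document
    # Walk down to the parent of the element to replace.
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = replacement
    return json.dumps(document)


original = '{"customer": {"name": "Ann", "tier": "gold"}, "active": true}'
blanked = blank_document(original)
masked = mask_path(original, ["customer", "name"], "REDACTED")
print(blanked)
print(masked)
```

Blanking is safe but discards non-sensitive fields such as tier and active, whereas targeted masking keeps them available to downstream users of the masked data.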
Troubleshooting Generated Rulesets
"Unique requirement for specified target_key could not be validated"
If a mask_unique_key task fails because the target_key could not be guaranteed to be unique, it could be because the target_key is referenced by a foreign key, which is assumed to indicate a unique key.
To mask a non-unique set of columns that is referenced by a foreign key, while maintaining referential integrity, you should manually construct a masking ruleset to:
1. Use a run_sql task to disable any constraints of any foreign keys to be updated.
2. Use a run_sql task to create duplicates of all the referenced key columns (in the same table).
3. Use a mask_table task to mask the referenced key columns. Applying a from_unique mask type to at least one column can be done to guarantee the key is unique.
4. Use a mask_table task to update each foreign key of the referenced key:
   - Specify a join between the foreign key's table and the referenced key's table based on the foreign key columns and the duplicate columns created in step 2.
   - Use from_column mask types to copy values from the key columns masked in step 3 into the foreign key columns.
5. Use a run_sql task to re-enable the foreign key constraints disabled in step 1.
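The effect of the five steps can be demonstrated outside DataMasque with an in-memory SQLite database. The sketch below uses Python's sqlite3 module and plain SQL in place of the run_sql and mask_table tasks; the table and column names are hypothetical. SQLite requires the referenced column to be unique, so the example declares it UNIQUE, but the sequence of steps is the same for a non-unique referenced key.

```python
import sqlite3

# Autocommit mode so the PRAGMA statements take effect immediately.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (code TEXT UNIQUE, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_code TEXT REFERENCES customers (code)
    );
    INSERT INTO customers VALUES ('A1', 'Ann'), ('B2', 'Bob');
    INSERT INTO orders VALUES (1, 'A1'), (2, 'B2'), (3, 'A1');
""")

# Step 1: disable foreign key enforcement.
conn.execute("PRAGMA foreign_keys = OFF")

# Step 2: duplicate the referenced key column in the same table.
conn.execute("ALTER TABLE customers ADD COLUMN code_original TEXT")
conn.execute("UPDATE customers SET code_original = code")

# Step 3: mask the referenced key with unique values
# (this stands in for a from_unique mask).
conn.execute("UPDATE customers SET code = 'C' || rowid")

# Step 4: join on the duplicate column and copy the masked key
# into the foreign key column (this stands in for from_column).
conn.execute("""
    UPDATE orders
    SET customer_code = (
        SELECT c.code FROM customers AS c
        WHERE c.code_original = orders.customer_code
    )
""")

# Step 5: re-enable enforcement and check for dangling references.
conn.execute("PRAGMA foreign_keys = ON")
orders = conn.execute(
    "SELECT id, customer_code FROM orders ORDER BY id"
).fetchall()
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(orders, violations)
```

The duplicate column is what makes step 4 possible: once the referenced key has been masked, the original values are the only remaining link between parent and child rows.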
Failures to satisfy primary or unique key constraints
If a mask_table task fails to satisfy a primary or unique key constraint on a column that is masked with from_unique, it could be due to one of the following issues:
- The range of values generated by from_unique overlaps with the range of existing values in the column, resulting in duplicate values mid-masking. You should configure from_unique to generate values that do not overlap with the existing contents of the column, or add run_sql tasks to disable unique constraints during masking and re-enable them after.
- The key of the mask_table task is not a column or set of columns containing unique values (e.g. it is a non-unique set of columns referenced by a foreign key, which the ruleset generator assumes will typically be unique). You should change the key to a column or set of columns that is guaranteed to contain only unique values.
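The first issue, duplicates appearing mid-masking, can be illustrated with a small Python simulation. It is a simplified model that assumes rows are updated one at a time, which may not reflect how DataMasque or a given database applies updates; the function name first_collision is hypothetical.

```python
def first_collision(existing, replacements):
    """Apply replacements one row at a time.

    Returns the index of the first row whose new value is still held
    by another row (a unique constraint violation), or None if every
    update succeeds.
    """
    current = list(existing)
    for i, new_value in enumerate(replacements):
        others = current[:i] + current[i + 1:]
        if new_value in others:
            return i
        current[i] = new_value
    return None


existing_ids = [1, 2, 3]
# Overlapping range: the first new value (2) is still held by the second row.
print(first_collision(existing_ids, [2, 3, 4]))
# Non-overlapping range: no row ever collides with an unmasked value.
print(first_collision(existing_ids, [101, 102, 103]))
```

Note that the overlapping replacement values are unique among themselves; the failure comes only from colliding with values that have not been masked yet.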
Ruleset Editor
The Ruleset Editor is the same as the editor used when creating/editing a ruleset. See the Ruleset Editor guide for more information on this feature.