Ruleset Generation

Overview
Methodology
Ruleset Generator
Keywords
- Additional Keywords
Generated YAML Ruleset
Troubleshooting Generated Rulesets
Ruleset Editor

Overview

DataMasque provides a Ruleset Generator that can be used to generate a YAML ruleset for masking the database tables on a connection. The Ruleset Generator runs a run_schema_discovery task in the background to discover database tables. Navigate to the Ruleset Generator page to utilise this functionality.

Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server).

Methodology

For more information about the methodology behind the run_schema_discovery task, see the Schema Discovery guide.

Ruleset Generator

An existing connection can be selected from the dropdown box. If a run_schema_discovery task has already been run, the table will populate with the data from the most recent run; otherwise, click the "Run Discovery" button to start a new run_schema_discovery run. Additional Custom Data Classification and Ignored keywords can be added for this run. For more information, refer to the Additional Keywords section.

Schemas

By default, schema discovery will run against the schema configured on the database connection - or if none is configured there, then the database user's default schema. Alternatively, you can specify the schemas to discover by clicking on the Configure schemas button and entering the schema names, or uploading them from a CSV file.

Notes:

MySQL doesn't have the concept of a schema; instead, it uses databases to represent this concept (a grouping of tables). When a MySQL database connection is selected, the word "schema" in the UI will be replaced by "database" to reflect this.

Schema (or database, for MySQL) names must be complete matches and are case-sensitive. Partial matches and wildcards are not supported. For example, entering myschema will match only myschema, not mySCHEMA nor myschema_1.
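The matching rule above can be illustrated with a minimal Python sketch. The function name and sample data are hypothetical and are not part of DataMasque; the sketch only demonstrates exact, case-sensitive comparison.

```python
def matching_schemas(requested, available):
    """Return the available schema names that exactly match a requested name."""
    wanted = set(requested)
    # Plain string equality: case-sensitive, no wildcards, no prefix matching.
    return [name for name in available if name in wanted]


available = ["myschema", "mySCHEMA", "myschema_1"]
print(matching_schemas(["myschema"], available))  # ['myschema']
```

Under this rule, an entry such as my* is treated as a literal name and matches nothing in the sample list.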

Schema discovery results

Once the run is completed the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report will be downloaded as a CSV similar to the Sensitive Data Discovery report.

The CSV report contains the following columns:


*Table schema*	The schema of the table discovered.
*Table name*	The name of the table discovered.
*Column name*	The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords if keyword matches are selected.
*Data Type*	The column data type specified in the database metadata.
*Constraint*	Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present.
*Max Length*	If the column is a text field, this contains the max length of the column. Otherwise, this value is an empty string.
*Numeric Precision*	If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string.
*Numeric Scale*	If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string.
*Reason for flag*	Description of pattern which caused the column to be flagged for sensitive data.
*Foreign Keys*	A list of any foreign keys that reference this column, each described in the pattern (fk_name, referenced_column).
*Data classifications*	A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords).
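As an illustration of working with the downloaded report, the following Python sketch lists the columns flagged with a given classification. It assumes the CSV header row uses exactly the column names listed above; the sample rows and the function name are hypothetical, and a real report contains more columns.

```python
import csv
import io

# Hypothetical sample; header names are assumed to match the column list above.
SAMPLE_REPORT = """\
Table schema,Table name,Column name,Data Type,Data classifications
public,customers,email,varchar,PII
public,customers,card_number,varchar,"PII,PCI"
public,orders,total,numeric,
"""


def columns_with_classification(report_file, classification):
    """Yield (schema, table, column) for rows flagged with the given classification."""
    for row in csv.DictReader(report_file):
        # The classifications cell is a comma-separated list such as "PII,PCI".
        labels = {c.strip() for c in row["Data classifications"].split(",") if c.strip()}
        if classification in labels:
            yield row["Table schema"], row["Table name"], row["Column name"]


pci_columns = list(columns_with_classification(io.StringIO(SAMPLE_REPORT), "PCI"))
print(pci_columns)  # [('public', 'customers', 'card_number')]
```

To read a downloaded file instead of the inline sample, pass a file object opened with newline="" as recommended by the csv module.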

The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.

After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.

Notes:

Foreign key columns cannot be selected in the user interface, as they should only be updated as the result of masking the columns they reference.

Keywords

Built-in Keywords

Built-in keywords can be enabled or disabled. Disabling them only stops the classification of columns as PII, PHI or PCI, along with the reasons for those flags.

Additional Keywords

Additional keywords can be configured for a run_schema_discovery task run on a connection.

A modal will be opened in which keywords can be added manually to the list, or a CSV file with additional keywords can be uploaded. The format and interpretation of additional custom data classification keywords and ignored keywords entered on the ruleset generator page is exactly the same as for the global keywords - see the links below.

The global keywords set on the Settings page will also be included if the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles are switched on.

For more information about keywords please refer to:

Generated YAML Ruleset

After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button. This will automatically generate a ruleset containing mask_table or mask_unique_key tasks for those columns.

The generation of the ruleset is as follows:

DataMasque will generate mask_unique_key tasks for as many selected unique columns (unique keys, primary keys, and foreign key targets) as possible.
- If one or more columns appear in multiple composite unique or primary keys or foreign key targets, a mask_unique_key task will only be generated for one of those column sets in order to not break uniqueness or referential integrity. If one of those column sets is a subset of another, the subset will always be masked by mask_unique_key in order to guarantee the uniqueness of both the subset and superset.
- mask_unique_key tasks will only be generated for connections that support them.
mask_table tasks will be generated for the remaining selected columns.
- Unique columns (unique keys, primary keys, and foreign key targets) not masked by mask_unique_key will be masked by from_unique masks with formats appropriate for column data types. Unique columns masked with mask_table will be listed in a documentation block at the top of the generated YAML ruleset.
- For other columns, the column names are first matched to the Built-in Keywords using the same method as sensitive data discovery in order to select an appropriate mask. If a match is not found, then an appropriate mask type is selected based on the column's data type.
- The key for the mask_table task is selected from the following options (in order of precedence):
  - The table's primary key, if it is not to be masked by from_unique.
  - The unique key with the fewest columns that is not to be masked by from_unique.
  - The column or set of columns targeted by a foreign key with the fewest columns that is not to be masked by from_unique. While the target of a foreign key is not guaranteed to be unique for all connections (e.g. MySQL), it is expected to be sufficiently unique to act as the key for mask_table.
  - If none of the options above can be selected, the key will be set to REPLACE_ME.
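The order of precedence for the mask_table key can be sketched in Python as follows. This is an illustration of the rules listed above, not DataMasque's implementation: the TableKeys class and function name are hypothetical, and the sketch assumes a candidate key is rejected if any one of its columns is masked by from_unique.

```python
from dataclasses import dataclass, field


@dataclass
class TableKeys:
    """Hypothetical summary of one table's key metadata."""
    primary_key: tuple = ()
    unique_keys: list = field(default_factory=list)          # list of column tuples
    foreign_key_targets: list = field(default_factory=list)  # list of column tuples


def choose_mask_table_key(keys, from_unique_columns):
    """Pick the mask_table key following the precedence listed above."""
    def usable(columns):
        # Assumption: a candidate is usable only if none of its columns
        # is masked by from_unique.
        return bool(columns) and not set(columns) & set(from_unique_columns)

    if usable(keys.primary_key):
        return keys.primary_key
    # Unique keys take precedence over foreign key targets.
    for candidates in (keys.unique_keys, keys.foreign_key_targets):
        eligible = [c for c in candidates if usable(c)]
        if eligible:
            return min(eligible, key=len)  # fewest columns wins
    return ("REPLACE_ME",)


keys = TableKeys(primary_key=("id",), unique_keys=[("email",), ("first", "last")])
print(choose_mask_table_key(keys, from_unique_columns={"id"}))  # ('email',)
```

In the usage example, the primary key is skipped because id is masked by from_unique, so the single-column unique key is chosen over the two-column one.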

Further modifications to the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.

Notes:

In certain circumstances, the generated ruleset may not mask all selected columns, such as:

Columns where no masking approach can be determined that would not break referential integrity for one or more foreign keys

In certain circumstances, additional columns that were not selected may also be masked, such as:

Foreign keys referencing masked columns

Unselected columns in groups of jointly unique columns where at least one column is selected, including: composite unique keys, primary keys, and the targets of foreign keys.

In both of the above cases, the columns that could not be masked or were additionally masked will be listed in a documentation block at the top of the generated YAML ruleset.

JSON Columns

Any json or jsonb type columns detected by the Ruleset Generator will be masked with a from_fixed mask with the value {} (empty JSON object/dictionary). This provides a safe default by effectively blanking out any JSON columns.

For proper masking of JSON columns, please use a json mask instead. The json mask can traverse a JSON document and update individual elements while retaining its structure.
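The difference between the generated default and structure-preserving masking can be pictured with a small Python sketch. This is a conceptual illustration only: it does not show the json mask's ruleset syntax or implementation, and the function names and sample document are hypothetical.

```python
import json


def blank_json(_document):
    """Mimic the generated default: replace the whole document with an empty object."""
    return "{}"


def mask_json_path(document, path, replacement):
    """Replace one nested value, leaving the rest of the document intact."""
    data = json.loads(document)
    node = data
    for key in path[:-1]:
        node = node[key]  # walk down to the parent of the target element
    node[path[-1]] = replacement
    return json.dumps(data)


original = '{"customer": {"name": "Ada", "tier": "gold"}}'
print(blank_json(original))
print(mask_json_path(original, ["customer", "name"], "REDACTED"))
```

The first call discards every element, while the second changes only the name element and keeps the tier element and the overall structure.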

Troubleshooting Generated Rulesets

"Unique requirement for specified target_key could not be validated"

If a mask_unique_key task fails because the target_key could not be guaranteed to be unique, it could be because the target_key is referenced by a foreign key, which is assumed to indicate a unique key.

To mask a non-unique set of columns that is referenced by a foreign key, while maintaining referential integrity, you should manually construct a masking ruleset to:

1. Use a run_sql task to disable any constraints of any foreign keys to be updated.
2. Use a run_sql task to create duplicates of all the referenced key columns (in the same table).
3. Use a mask_table task to mask the referenced key columns. Apply a from_unique mask type to at least one column to guarantee the key is unique.
4. Use a mask_table task to update each foreign key of the referenced key:
   - Specify a join between the foreign key's table and the referenced key's table based on the foreign key columns and the duplicate columns created in step 2.
   - Use from_column mask types to copy values from the key columns masked in step 3 into the foreign key columns.
5. Use a run_sql task to re-enable the foreign key constraints disabled in step 1.
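The data flow of this procedure can be sketched with Python's built-in sqlite3 module standing in for the target database. This is an assumption-laden illustration rather than a DataMasque ruleset: the table and column names are hypothetical, the constraint steps (disabling and re-enabling) are omitted because their syntax is database-specific and the sample schema declares no constraints, and the masking is a simple value-to-token replacement instead of a from_unique mask.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (code TEXT, label TEXT);  -- code is NOT unique
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_code TEXT);
    INSERT INTO parent VALUES ('A', 'x'), ('A', 'y'), ('B', 'z');
    INSERT INTO child VALUES (1, 'A'), (2, 'B');
""")

# Duplicate the referenced key column in the same table.
conn.execute("ALTER TABLE parent ADD COLUMN code_orig TEXT")
conn.execute("UPDATE parent SET code_orig = code")

# Mask the referenced key column (stand-in for a mask_table task).
# Rows sharing an original value receive the same replacement token.
distinct = [r[0] for r in conn.execute("SELECT DISTINCT code FROM parent ORDER BY code")]
for i, value in enumerate(distinct, start=1):
    conn.execute("UPDATE parent SET code = ? WHERE code_orig = ?", (f"K{i:03d}", value))

# Copy the masked values into the foreign key column by joining on the
# duplicate column, which still holds the original values.
conn.execute("""
    UPDATE child
    SET parent_code = (
        SELECT p.code FROM parent p WHERE p.code_orig = child.parent_code LIMIT 1
    )
    WHERE parent_code IN (SELECT code_orig FROM parent)
""")

print(conn.execute("SELECT id, parent_code FROM child ORDER BY id").fetchall())
# [(1, 'K001'), (2, 'K002')]
```

The duplicate column is what makes the join possible: once the referenced column has been masked, it no longer matches the unmasked values held in the foreign key column.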

Failures to satisfy primary or unique key constraints

If a mask_table task fails to satisfy a primary or unique key constraint on a column that is masked with from_unique, it could be due to one of the following issues:

The range of values generated by from_unique overlaps with the range of existing values in the column, resulting in duplicate values mid-masking. You should configure from_unique to generate values that do not overlap with the existing contents of the column, or add run_sql tasks to disable unique constraints during masking and re-enable them after.
The key of the mask_table task is not a column or set of columns containing unique values (e.g. it is a non-unique set of columns referenced by a foreign key, which the ruleset generator assumes will typically be unique). You should change the key to a column or set of columns that is guaranteed to contain only unique values.
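The first issue can be checked ahead of a run with a sketch like the one below. It models the generated values as a sequential integer range, which is an assumption made for illustration; the actual values depend on how from_unique is configured. The function name and sample data are hypothetical.

```python
def overlapping_values(existing, start, count):
    """Return generated values that collide with values already in the column."""
    generated = range(start, start + count)
    return sorted(set(existing) & set(generated))


existing_ids = [1, 2, 3, 4, 5]
print(overlapping_values(existing_ids, start=3, count=5))     # [3, 4, 5]
print(overlapping_values(existing_ids, start=1000, count=5))  # []
```

An empty result suggests the generated range is clear of the existing contents, so rows can be updated one at a time without temporarily duplicating a value.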

Ruleset Editor

The Ruleset Editor is the same as the editor used when creating/editing a ruleset. See the Ruleset Editor guide for more information on this feature.