Ruleset Generation

Overview
Ruleset Generator
Keywords
- Additional Keywords
Generated YAML Ruleset
Troubleshooting Generated Rulesets
Ruleset Editor

Overview

Use DataMasque's Ruleset Generator to generate a prototype YAML ruleset to mask database tables on a connection.

The Ruleset Generator runs a run_schema_discovery task in the background to discover database tables. Navigate to the Ruleset Generator page to utilise this functionality.

For more information about the methodology behind schema discovery, see the Schema Discovery guide. More information on the run_schema_discovery task can be found on the Ruleset Specification page.

Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server).

Caution: Be aware of the memory usage limitations.

Ruleset Generator

Select an existing connection from the dropdown box. If no run_schema_discovery task has previously been run against this connection (either from the ruleset generator page or by including a run_schema_discovery task in a ruleset), click the Run Discovery button to create a new run with the run_schema_discovery task. Otherwise, the table will populate with the most recent run data. You can click the Rerun Discovery button to run schema discovery again, which will overwrite the previous run's results.

Additional Custom Data Classification and Ignored keywords can be added for schema discovery. For more information please refer to the Additional Keywords section.

Schemas

By default, schema discovery will run against the schema configured on the database connection - or if none is configured there, then the database user's default schema. Alternatively, you can specify the schemas to discover by clicking on the Configure schemas button and entering the schema names, or uploading them from a CSV file.

Notes:

MySQL and MariaDB don't have the concept of a schema; instead, they use databases to represent this concept (a grouping of tables). When a MySQL or MariaDB database connection is selected, the word "schema" in the UI will be replaced by "database" to reflect this.

Schema (or database, for MySQL/MariaDB) names must be complete matches and are case-sensitive. Partial matches and wildcards are not supported. For example, entering myschema will match only myschema, not mySCHEMA nor myschema_1.

In-data discovery

The toggle switch marked In-data discovery allows you to enable or disable in-data discovery for this schema discovery run. When in-data discovery finds a column that may contain sensitive information, the column will be marked as such in the results (under Flagged by). When generating a ruleset, DataMasque will suggest suitable mask types for the columns based on the type of data that in-data discovery thinks the column contains.

Schema discovery results

Once the run is completed the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report will be downloaded as a CSV similar to the Sensitive Data Discovery report.

The CSV report contains the following columns:


*Table schema*	The schema of the table discovered.
*Table name*	The name of the table discovered.
*Column name*	The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords if keyword matches are selected.
*Data Type*	The column data type specified in the database metadata.
*Constraint*	Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present.
*Max Length*	If the column is a text field, this contains the max length of the column, otherwise empty string.
*Numeric Precision*	If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string.
*Numeric Scale*	If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string.
*Max Length*	If the column is a text field, this contains the max length of the column.
*Reason for flag*	Description of pattern which caused the column to be flagged for sensitive data.
*Flagged by*	Whether the column was flagged for sensitive data through in-data discovery or through the standard sensitive data discovery / keyword matching process.
*Foreign Keys*	A list of any foreign keys reference this column, described in the following pattern (fk_name, referenced_column).
*Data classifications*	A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords).

The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.

After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.

Notes:

Foreign key columns cannot be selected in the user interface, as they should only be updated as the result of masking the columns they reference.

Keywords

Built-in Keywords

Built-in keywords can be enabled or disabled, this will only stop the classification of the columns relating to PII, PHI or PCI and the reasons for those flags.

Additional Keywords

Additional keywords can be configured for a run_schema_discovery task run on a connection.

A modal will be opened in which keywords can be added manually to the list, or a CSV file with additional keywords can be uploaded. The format and interpretation of additional custom data classification keywords and ignored keywords entered on the ruleset generator page is exactly the same as for the global keywords - see the links below.

The global keywords set on the Settings page will also be included if the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles are toggled on.

For more information about keywords please refer to:

Generated YAML Ruleset

After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button. This will automatically generate a ruleset containing mask_table or mask_unique_key tasks for those columns.

The generation of the ruleset is as follows:

DataMasque will generate mask_unique_key tasks for as many selected unique columns (unique keys, primary keys, and foreign key targets) as possible.
- If one or more columns appear in multiple composite unique or primary keys or foreign key targets, a mask_unique_key task will only be generated for one of those column sets in order to not break uniqueness or referential integrity. If one of those column is a subset of another, the subset will always be masked by mask_unique_key in order to guarantee the uniqueness of both the subset and superset.
- mask_unique_key tasks will only be generated for connections that support them.
mask_table tasks will be generated for the remaining selected columns.
- Unique columns (unique keys, primary keys, and foreign key targets) not masked by mask_unique_key will be masked by from_unique masks with formats appropriate for column data types. Unique columns masked with mask_table will be listed in a documentation block at the top of the generated YAML ruleset.
- For other columns, the column names are first matched to the Built-in Keywords using the same method as sensitive data discovery in order to select an appropriate mask. If a match is not found, then an appropriate mask type is selected based on the column's data type.
The key of a mask_table task is a column name, or list of column names. It determines which rows will be masked with the same value. To maximise the security of the masked data, ideally no two rows would be deliberately masked to the same value and hence every row needs a different value in the key column (or a different combination of values across all key columns, where the ruleset specifies more than one). The following describes how the ruleset generator chooses the key column(s) to try to achieve this.

For Oracle databases, the key for the mask_table task is always generated as ROWID (a unique number that Oracle gives to each row). For other databases, the key for the mask_table task is selected from the following options (in order of precedence):
- The table's primary key, if it is not to be masked by from_unique.
- The unique key with the fewest columns, where none of those columns are to be masked by from_unique.
- The columns participating in the smallest (in terms of number of columns) unique index on the table, where none of those columns are to be masked by from_unique.
- The column or set of columns targeted by a foreign key with the fewest columns where none of those columns are to be masked by from_unique. While the target of a foreign key is not guaranteed to be unique for all connections (e.g. MySQL) it is expected to be sufficiently unique to act as the key for mask_table.
- Any column(s) that are flagged as Identification / Identifiers by sensitive data discovery, and are not selected for masking.
- The first three column(s) alphabetically by name amongst those not selected for masking. The columns are selected as a set; the ruleset generator will always select exactly three columns in this case.
The ruleset generator adds a comment above every key in a mask_table task explaining which key column(s) was/were chosen and why. Where the key may not be unique, a warning note is added to this effect. Be sure to review the ruleset generator's choice of key column(s) to ensure the key is (sufficiently) unique for the table in question.

When the ruleset generator cannot select any key columns from the above list of options, it adds a warning note to the ruleset with a commented-out key of REPLACE_ME. You will need to uncomment this line and replace the REPLACE_ME placeholder text with the name of a column, or a list of columns, suitable for use as a key.

Further modifications to the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.

Notes:

In certain circumstances, the generated ruleset may not mask all selected columns, such as:

Columns where no masking approach can be determined that would not break referential integrity for one or more foreign keys

In certain circumstances, additional columns that were not selected may also be masked, such as:

Foreign keys referencing masked columns

Unselected columns in groups of jointly unique columns where at least one column is selected, including: composite unique keys, primary keys, and the targets of foreign keys.

In both of the above cases, the columns that could not be masked or were additionally masked will be listed in a documentation block at the top of the generated YAML ruleset.

JSON Columns

Any json or jsonb type columns detected by the Ruleset Generator will be masked with a from_fixed mask with the value {} (empty JSON object/dictionary). This provides a safe default by effectively blanking out any JSON columns.

For proper masking of JSON columns, please use a json mask instead. The json mask can traverse a JSON document and update individual elements while retaining its structure.

Troubleshooting Generated Rulesets

"Unique requirement for specified target_key could not be validated"

If a mask_unique_key task fails because the target_key could not be guaranteed to be unique, it could be because the target_key is referenced by a foreign key, which is assumed to indicate a unique key.

To mask a non-unique set of columns that is referenced by a foreign key, while maintaining referential integrity, you should manually construct a masking ruleset to:

Use a run_sql task to disable any constraints of any foreign keys to be updated.
Use a run_sql task to create duplicates of all the referenced key columns (in the same table).
Use a mask_table task to mask the referenced key columns. Applying a from_unique mask type to at least one column can be done to guarantee the key is unique.
Use a mask_table task to update each foreign key of the referenced key:
- Specify a join between the foreign key's table and the referenced key's table based on the foreign key columns and the duplicate columns create in step 2.
- Use from_column mask types to copy values from the key columns masked in step 3 into the foreign key columns.
Use a run_sql task to re-enable the foreign key constraints disabled in step 1.

Failures to satisfy primary or unique key constraints

If a mask_table task fails to satisfy a primary or unique key column that is masked with from_unique, it could be due to one of the following issues:

The range of values generated by from_unique overlaps with the range of existing values in the column, resulting in duplicate values mid-masking. You should configure from_unique to generate values that do not overlap with the existing contents of the column, or add run_sql tasks to disable unique constraints during masking and re-enable them after.
The key of the mask_table task is not a column or set of columns containing unique values (e.g. it is a non-unique set of columns referenced by a foreign key, which the ruleset generator assumes will typically be unique). You should change the key to a column or set of columns that is guaranteed to contain only unique values.

Ruleset Editor

The Ruleset Editor is the same as the editor used when creating/editing a ruleset. See the Ruleset Editor guide for more information on this feature.