Ruleset Generator
- Overview
- Ruleset Generator
- Keywords
- Generated YAML Ruleset
- Troubleshooting Generated Rulesets
- Ruleset Editor
Overview
Use DataMasque's Ruleset Generator to generate a prototype YAML ruleset to mask database tables on a connection.
The Ruleset Generator runs a run_schema_discovery
task in the background to discover database tables.
Navigate to the Ruleset Generator page to utilise this functionality.
For more information about the methodology behind schema discovery, see the Schema Discovery guide.
More information on the run_schema_discovery
task can be found
on the Ruleset Specification page.
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server).
Caution: Be aware of the memory usage limitations.
Ruleset Generator
Select an existing connection from the dropdown box.
If no run_schema_discovery
task has previously been run against this connection
(either from the ruleset generator page or by including a run_schema_discovery
task in a ruleset),
click the Run Discovery button to create a new run with the run_schema_discovery
task.
Otherwise, the table will populate with the most recent run data.
You can click the Rerun Discovery button to run schema discovery again,
which will overwrite the previous run's results.
Additional Custom Data Classification and Ignored keywords can be added for schema discovery. For more information please refer to the Additional Keywords section.
Schemas
By default, schema discovery will run against the schema
configured on the database connection -
or if none is configured there, then the database user's default schema.
Alternatively, you can specify the schemas to discover
by clicking on the Configure schemas
button
and entering the schema names, or uploading them from a CSV file.
Notes:
- MySQL and MariaDB don't have the concept of a schema; instead, they use databases to represent this concept (a grouping of tables). When a MySQL or MariaDB database connection is selected, the word "schema" in the UI will be replaced by "database" to reflect this.
- Schema (or database, for MySQL/MariaDB) names must be complete matches and are case-sensitive. Partial matches and wildcards are not supported. For example, entering
myschema
will match onlymyschema
, notmySCHEMA
normyschema_1
.
In-data discovery
The toggle switch marked In-data discovery
allows you to enable or disable in-data discovery
for this schema discovery run.
When in-data discovery finds a column that may contain sensitive information,
the column will be marked as such in the results (under Flagged by
).
When generating a ruleset, DataMasque will suggest suitable mask types for the columns
based on the type of data that in-data discovery thinks the column contains.
Schema discovery results
Once the run is completed the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report will be downloaded as a CSV similar to the Sensitive Data Discovery report.
The CSV report contains the following columns:
Table schema | The schema of the table discovered. |
Table name | The name of the table discovered. |
Column name | The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords if keyword matches are selected. |
Data Type | The column data type specified in the database metadata. |
Constraint | Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present. |
Foreign Keys | A list of any foreign keys reference this column, described in the following pattern (fk_name, referenced_column). |
Max Length | If the column is a text field, this contains the max length of the column, otherwise empty string. |
Numeric Precision | If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string. |
Numeric Scale | If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string. |
Max Length | If the column is a text field, this contains the max length of the column. |
Reason for flag | Description of pattern which caused the column to be flagged for sensitive data. |
Flagged by | Whether the column was flagged for sensitive data through in-data discovery or through the standard sensitive data discovery / keyword matching process. |
Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords). |
The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.
After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.
Notes:
- Foreign key columns cannot be selected in the user interface, as they should only be updated as the result of masking the columns they reference.
Keywords
Built-in Keywords
Built-in keywords can be enabled or disabled, this will only stop the classification of the columns relating to PII, PHI or PCI and the reasons for those flags.
Additional Keywords
Additional keywords can be configured for a run_schema_discovery
task run on a connection.
A modal will be opened in which keywords can be added manually to the list, or a CSV file with additional keywords can be uploaded. The format and interpretation of additional custom data classification keywords and ignored keywords entered on the ruleset generator page is exactly the same as for the global keywords - see the links below.
The global keywords set on the Settings page will also be included if the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles are toggled on.
For more information about keywords please refer to:
Generated YAML Ruleset
After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button.
This will automatically generate a ruleset containing mask_table
or mask_unique_key
tasks for those columns.
The generation of the ruleset is as follows:
DataMasque will generate
mask_unique_key
tasks for as many selected unique columns (unique keys, primary keys, and foreign key targets) as possible.- If one or more columns appear in multiple composite unique or
primary keys or foreign key targets, a
mask_unique_key
task will only be generated for one of those column sets in order to not break uniqueness or referential integrity. If one of those column is a subset of another, the subset will always be masked bymask_unique_key
in order to guarantee the uniqueness of both the subset and superset. mask_unique_key
tasks will only be generated for connections that support them.
- If one or more columns appear in multiple composite unique or
primary keys or foreign key targets, a
mask_table
tasks will be generated for the remaining selected columns.- Unique columns (unique keys, primary keys, and foreign key
targets) not masked by
mask_unique_key
will be masked byfrom_unique
masks with formats appropriate for column data types. Unique columns masked withmask_table
will be listed in a documentation block at the top of the generated YAML ruleset. - If a ruleset contains any
mask_table
tasks, askip_defaults
option to exclude masking ofnull
values and empty strings will be added to the ruleset. This can be removed, or overridden in individual tasks to mask these blank values in those tasks. - For other columns, the column names are first matched to the Built-in Keywords using the same method as sensitive data discovery in order to select an appropriate mask. If a match is not found, then an appropriate mask type is selected based on the column's data type.
The
key
of amask_table
task is a column name, or list of column names. It determines which rows will be masked with the same value. To maximise the security of the masked data, ideally no two rows would be deliberately masked to the same value and hence every row needs a different value in thekey
column (or a different combination of values across allkey
columns, where the ruleset specifies more than one). The following describes how the ruleset generator chooses thekey
column(s) to try to achieve this.For Oracle databases, the
key
for themask_table
task is always generated asROWID
(a unique number that Oracle gives to each row). For other databases, thekey
for themask_table
task is selected from the following options (in order of precedence):- The table's primary key, if it is not to be masked by
from_unique
. - The unique key with the fewest columns, where none of those columns are to be masked by
from_unique
. - The columns participating in the smallest (in terms of number of columns) unique index on the table,
where none of those columns are to be masked by
from_unique
. - The column or set of columns targeted by a foreign key with the fewest columns
where none of those columns are to be masked by
from_unique
. While the target of a foreign key is not guaranteed to be unique for all connections (e.g. MySQL) it is expected to be sufficiently unique to act as the key formask_table
. - Any column(s) that are flagged as
Identification
/Identifiers
by sensitive data discovery, and are not selected for masking. - The first three column(s) alphabetically by name amongst those not selected for masking. The columns are selected as a set; the ruleset generator will always select exactly three columns in this case.
The ruleset generator adds a comment above every
key
in amask_table
task explaining which key column(s) was/were chosen and why. Where the key may not be unique, a warning note is added to this effect. Be sure to review the ruleset generator's choice ofkey
column(s) to ensure the key is (sufficiently) unique for the table in question.When the ruleset generator cannot select any
key
columns from the above list of options, it adds a warning note to the ruleset with a commented-outkey
ofREPLACE_ME
. You will need to uncomment this line and replace theREPLACE_ME
placeholder text with the name of a column, or a list of columns, suitable for use as a key.- Unique columns (unique keys, primary keys, and foreign key
targets) not masked by
Further modifications to the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.
Notes:
- In certain circumstances, the generated ruleset may not mask all selected columns, such as:
- Columns where no masking approach can be determined that would not break referential integrity for one or more foreign keys
- In certain circumstances, additional columns that were not selected may also be masked, such as:
- Foreign keys referencing masked columns
- Unselected columns in groups of jointly unique columns where at least one column is selected, including: composite unique keys, primary keys, and the targets of foreign keys.
- In both of the above cases, the columns that could not be masked or were additionally masked will be listed in a documentation block at the top of the generated YAML ruleset.
JSON Columns
Any json
or jsonb
type columns detected by the Ruleset Generator will be masked with a
from_fixed
mask with the value {}
(empty JSON
object/dictionary). This provides a safe default by effectively blanking out any JSON columns.
For proper masking of JSON columns, please use a json
mask instead. The json
mask can traverse a JSON document and update individual elements while retaining its structure.
Troubleshooting Schema Discovery
If, after running discovery in the Ruleset Generator, an error or warning appears and results are not displayed, refer to the following troubleshooting guidance. You can also click on the link in the message to view the run log for the discovery run, which might contain more details about why discovery failed.
"Schema discovery failed"
- Check that DataMasque is licensed.
- Test the connection from the Database Masking page, and fix any connection issues such as the password being incorrect.
- Check there is no masking run against the target database currently in progress.
- Ensure the database user configured on the connection has sufficient database privileges.
"Schema discovery on connection successful, but no tables were discovered"
- Check the schema(s) or database(s) specified for discovery. These are case-sensitive.
- If there are no schemas or databases explicitly specified in the options, then check the default schema or database specified on the connection. This is also case-sensitive.
- Check the database user has access to the target schema(s) or database(s).
"Data discovery results for are too large to display"
This occurs when the discovery results contain more than 1,000 columns. The following workarounds are available:
- Download the discovery report to view the results.
- Re-run discovery using ignore keywords to exclude some columns from the results.
- Use the API to generate a ruleset for a large database schema.
Troubleshooting Generated Rulesets
"Unique requirement for specified target_key could not be validated"
If a mask_unique_key
task fails because the target_key
could not
be guaranteed to be unique, it could be because the target_key
is
referenced by a foreign key, which is assumed to indicate a unique
key.
To mask a non-unique set of columns that is referenced by a foreign key, while maintaining referential integrity, you should manually construct a masking ruleset to:
- Use a
run_sql
task to disable any constraints of any foreign keys to be updated. - Use a
run_sql
task to create duplicates of all the referenced key columns (in the same table). - Use a
mask_table
task to mask the referenced key columns. Applying afrom_unique
mask type to at least one column can be done to guarantee the key is unique. - Use a
mask_table
task to update each foreign key of the referenced key:- Specify a
join
between the foreign key's table and the referenced key's table based on the foreign key columns and the duplicate columns create in step 2. - Use
from_column
mask types to copy values from the key columns masked in step 3 into the foreign key columns.
- Specify a
- Use a
run_sql
task to re-enable the foreign key constraints disabled in step 1.
Failures to satisfy primary or unique key constraints
If a mask_table
task fails to satisfy a primary or unique key column
that is masked with from_unique
, it could be due to one of the
following issues:
- The range of values generated by
from_unique
overlaps with the range of existing values in the column, resulting in duplicate values mid-masking. You should configurefrom_unique
to generate values that do not overlap with the existing contents of the column, or addrun_sql
tasks to disable unique constraints during masking and re-enable them after. - The
key
of themask_table
task is not a column or set of columns containing unique values (e.g. it is a non-unique set of columns referenced by a foreign key, which the ruleset generator assumes will typically be unique). You should change thekey
to a column or set of columns that is guaranteed to contain only unique values.
Ruleset Editor
The Ruleset Editor is the same as the editor used when creating/editing a ruleset. See the Ruleset Editor guide for more information on this feature.