File Ruleset Generator
Overview
Use DataMasque's File Ruleset Generator to produce a YAML ruleset for masking files within a connection.
The File Ruleset Generator executes a run_file_data_discovery task in the background to identify files to mask.
Navigate to the File Ruleset Generator page to use this feature.
For details on DataMasque's sensitive data discovery methodology, see the Sensitive Data Discovery guide.
More on the run_file_data_discovery task can be found here.
File Ruleset Generator
Select a connection from the File Connection dropdown.
If you don't have a connection configured yet, you can create one on the File Masking page.
Ensure the connection is configured to be used as Source or Source and Destination.
If this is the first time a run_file_data_discovery task has been run against this connection, click the Run Discovery button to start a new discovery run.
Otherwise, the results will show the latest run data.
Click the Re-run Discovery button to redo schema discovery. This will overwrite the results of the previous run.
In-data discovery
Configure In-data discovery via the button.
You can set the Rows to sample and add Custom Patterns for profiling. Each Custom Pattern consists of a Name and a Regex Pattern used for matching.
You can also set Non-Sensitive Patterns (also regexes).
Any column or field containing data that matches a Non-Sensitive Pattern will be reported in the discovery results as Custom Non-Sensitive.
A minimum of 10% of the data in a particular column or field must match a Custom Pattern for it to take effect.
For Non-Sensitive Patterns, every non-null row or field value must match the pattern for the column or field to be flagged as Custom Non-Sensitive.
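The two thresholds above can be illustrated with a minimal Python sketch. This is not DataMasque's implementation: the function names are hypothetical, and two details are assumptions made for the example (the regex is applied with a search rather than a full match, and null values are excluded before the 10% ratio is computed).

```python
import re

# Stated in the text: at least 10% of the data must match a Custom Pattern.
CUSTOM_PATTERN_THRESHOLD = 0.10


def _non_null(values):
    """Drop null values from the sample (an assumption for the 10% ratio)."""
    return [v for v in values if v is not None]


def matches_custom_pattern(values, regex):
    """Return True if at least 10% of the sampled values match the regex."""
    sample = _non_null(values)
    if not sample:
        return False
    hits = sum(1 for v in sample if re.search(regex, str(v)))
    return hits / len(sample) >= CUSTOM_PATTERN_THRESHOLD


def matches_non_sensitive_pattern(values, regex):
    """Return True only if every non-null value matches the regex."""
    sample = _non_null(values)
    return bool(sample) and all(re.search(regex, str(v)) for v in sample)


# One matching value out of nine non-null values is about 11%.
codes = ["EMP-001"] + ["n/a"] * 8 + [None]
print(matches_custom_pattern(codes, r"^EMP-\d{3}$"))         # True
print(matches_non_sensitive_pattern(codes, r"^EMP-\d{3}$"))  # False
```

The asymmetry is the point of the sketch: a Custom Pattern only needs a small share of matching values, while a Non-Sensitive Pattern is defeated by a single non-matching value.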
When In-data discovery identifies content that may contain sensitive information, it is marked accordingly in the Flagged By column of the results.
During ruleset generation, DataMasque will recommend the appropriate masks for files based on the discovered data types.
Custom Keywords
Toggle Built-in keywords or configure custom keywords and ignored keywords for a specific run_file_data_discovery task.
Global keywords and ignored keywords configured on the Settings page are also included if the Include global custom data classification keywords or Include global data discovery ignored keywords options are checked. Additional custom keywords and ignored keywords use the same format and interpretation as global keywords; refer to Global Custom Data Classification Keywords or Global Ignored Keywords for more details.
Number of workers
You can customize the number of workers to speed up the discovery task, at the expense of using more RAM and CPU resources on the machine running DataMasque. Each worker discovers one file at a time, so for example, with four workers you can discover four files in parallel.
By default, discovery runs with 1 worker. Leave the Number of workers input blank to use this default, or specify a value between 1 and 32.
Caution: Use of too many workers, particularly when discovering large JSON files, may exhaust all available RAM and cause the machine running DataMasque to become unresponsive.
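The worker model described above (one file per worker at a time, 1 worker by default, at most 32) can be pictured with a short Python sketch. It is only an illustration of the concept using the standard library; discover_file and discover_all are hypothetical names, and nothing here reflects how DataMasque schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_WORKERS = 1   # stated default
MAX_WORKERS = 32      # stated upper bound


def discover_file(path):
    """Hypothetical stand-in for the per-file discovery work."""
    return path, "discovered"


def discover_all(paths, workers=None):
    """Process files in parallel; each worker handles one file at a time."""
    if workers is None:  # mirrors leaving the input blank
        workers = DEFAULT_WORKERS
    if not 1 <= workers <= MAX_WORKERS:
        raise ValueError("workers must be between 1 and 32")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(discover_file, paths))


# With four workers, up to four files are in progress at once.
print(discover_all(["a.csv", "b.json", "c.parquet", "d.ndjson"], workers=4))
```

Because every in-flight file occupies memory at the same time, peak RAM use grows roughly with the number of workers, which is why the caution above applies to large files.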
File Discovery Results
File discovery results are categorized by file type and by files sharing the same data structures or columns. These are displayed in groups, enabling selection of files and columns / JSON paths for masking.
By default, only sensitive groups are displayed. This can be altered using the dropdown box to Show all groups.
Note: If there are no sensitive results, then all results are shown and the selection options are disabled.
Each group should be reviewed by clicking to expand the contents.
The checkbox in the group header selects all files, while expanding the group allows you to select individual files. Columns / JSON paths can be selected all at once using the checkbox in the table header row, or individually from the table. Results can be further filtered using the search box. By default, only sensitive results are displayed; to see all results, change the dropdown box to Show all column names or Show all JSON paths.
When collapsing a group, a green tick will be displayed for valid selections, or a red warning if an invalid selection was made. Groups must have both files and columns selected to be valid.
Once each group has been reviewed and the selections are valid, proceed to generate a ruleset by clicking the Generate Ruleset button.
Generated Ruleset
After schema discovery has been run and the files and columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the Generate Ruleset button.
This will automatically generate a ruleset containing relevant mask_file or mask_tabular_file tasks for the selected files and rules for each selected column. Preview, download, or further edit the ruleset as needed.
Once the ruleset is saved, you can use it in a file masking task. See Create file masking run.
Ruleset Editor
The Ruleset Editor is the same as the editor used when creating or editing a ruleset manually. See the Ruleset Editor guide for details.
Download Report
Download a JSON report of discovery results by clicking the Download Report button after a run. The report contains a list of group objects with the following fields:
id: Unique ID for this group.
connection: an object containing the name and id of the connection.
files: a list of file objects containing path, file_type, delimiter and encoding.
file_type: the file type for this group.
results: a list of result objects.
Result Object:
locator: The column name or JSON path.
matches: The list of match objects describing the match categories and how the match was flagged. If matches is empty then no sensitive data was discovered for this result. See Sensitive Data Discovery for more information.
data_types: The list of data types for the column name or JSON path.
Example:
[
  {
    "id": 1,
    "connection": {
      "id": "3146e366-9067-4a22-a0c1-d620a2754c0e",
      "name": "aws_s3_src"
    },
    "files": [
      {
        "path": "example.parquet",
        "delimiter": "",
        "encoding": "",
        "file_type": "parquet"
      }
    ],
    "file_type": "parquet",
    "results": [
      {
        "locator": "Date",
        "matches": [], <-- no sensitive matches
        "data_types": [
          "date"
        ]
      },
      {
        "locator": "Name",
        "matches": [
          {
            "label": "name",
            "categories": [
              "PCI",
              "PHI",
              "PII"
            ],
            "flagged_by": "In-Data Discovery",
            "description": "Full Names"
          }
        ],
        "data_types": [
          "str"
        ]
      },
      ...
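A downloaded report can be post-processed with a few lines of Python, for instance to list only the locators that were flagged. The sketch below relies solely on the fields documented above (id, results, locator, matches, label); the function name sensitive_locators is hypothetical, and the inline sample is a trimmed copy of the example. The annotation and the trailing ellipsis in the example above are for illustration only and would not appear in valid JSON.

```python
import json


def sensitive_locators(report):
    """Map each group id to its flagged locators and their match labels."""
    found = {}
    for group in report:
        hits = [
            (result["locator"], [match["label"] for match in result["matches"]])
            for result in group["results"]
            if result["matches"]  # an empty list means nothing sensitive was found
        ]
        if hits:
            found[group["id"]] = hits
    return found


# Trimmed version of the example report, parsed from a JSON string.
sample = json.loads("""
[{"id": 1, "file_type": "parquet", "results": [
  {"locator": "Date", "matches": [], "data_types": ["date"]},
  {"locator": "Name",
   "matches": [{"label": "name", "categories": ["PCI", "PHI", "PII"],
                "flagged_by": "In-Data Discovery", "description": "Full Names"}],
   "data_types": ["str"]}
]}]
""")

print(sensitive_locators(sample))  # {1: [('Name', ['name'])]}
```

To use it on a real report, load the downloaded file with json.load instead of the inline string.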
Troubleshooting
"My file did not appear in any of the groups"
Try changing the dropdown to Show all groups.
By default, only sensitive results are displayed, and files will be hidden if no sensitive data was discovered. If you don't have In-data discovery enabled, try enabling it to get more accurate results.
Is your file a supported file type?
File data discovery currently supports only .json, .ndjson, .parquet, and .csv formats. DataMasque determines a file's format by its file extension, not its content.
Does your Parquet file contain one or more columns of unsupported type, such as maps with keys that aren't strings?
File data discovery currently only supports certain data types for Parquet columns and any nested columns in lists, structs or maps. The run log for a file data discovery run will display a warning about being unable to discover any files that contain columns with an unsupported data type.
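To pre-check a list of paths against the supported extensions listed above, a small Python helper along these lines may be useful. It is a sketch, not part of DataMasque: the function name is hypothetical, and treating extensions case-insensitively is an assumption that the text does not confirm.

```python
from pathlib import PurePosixPath

# Extensions listed in the text as supported by file data discovery.
SUPPORTED_EXTENSIONS = {".json", ".ndjson", ".parquet", ".csv"}


def has_supported_extension(path):
    """Check only the final extension, since format is decided by extension."""
    # Lowercasing is an assumption; the text does not say whether case matters.
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


for candidate in ["data/example.parquet", "export.csv.gz", "notes.txt"]:
    print(candidate, has_supported_extension(candidate))
```

Note that a compressed file such as export.csv.gz fails the check because only the last extension is considered.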
If none of the above steps help, check the Run Logs and the Run Report for more detailed information about the run that was performed.
- Run Log: navigate to the Run Logs page and locate a run with a matching Source Connection and with Ruleset set to $auto_file_data_discovery$.
- Run Report: see the documentation here for steps to download the run report.
"My column / field did not appear in the results"
Check that you have Show all column names (or Show all JSON paths) selected.
It could be a list or object type in JSON/NDJSON; only scalar fields are discovered.
It could be misnamed in the CSV or Parquet file.
"My column or field is identified wrongly / not identified"
If the name matches an ignored keyword then the column won't be reported at all, even if it's sensitive.
If the data matches an In-data discovery Non-Sensitive Pattern then the data will be reported as Custom Non-Sensitive.
If the name doesn't match any metadata keywords, you'll need to use In-data discovery (IDD); check that IDD is enabled and the settings are right.
The IDD sample size might be too small. For example, if your column or field contains multiple categories of data and the sample only catches one, the other categories will not be found. In addition, only the first discovered type is reported.
Custom keywords might be affecting the results.
Discovery is always a best-guess effort; always review the generated ruleset.