Sensitive Data Discovery
- Overview
- Methodology
- JSON Sensitive Data Discovery
- In-Data Discovery
- Reporting and notifications
- Keywords configuration
- Built-in data discovery keywords
- Appendix: In-data discovery rules
Overview
DataMasque can be configured to automatically discover sensitive data in your databases during
masking. When a masking ruleset contains the special purpose
run_data_discovery
task type, DataMasque
will inspect the database metadata and generate a discovery report
for each masking run. Additionally, each user will receive
email notifications of any new, unmasked sensitive data
that has been discovered by DataMasque, providing ongoing protection against new sensitive data
being added to your schemas over time.
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
Methodology
To perform sensitive data discovery, DataMasque uses regular expressions (regex) to scan the metadata of the target database. When a database column is identified as sensitive, DataMasque will compare it with the masking rules specified in the ruleset to determine the masking coverage for the column.
DataMasque comes with over 90 built-in keywords to help discover various types of sensitive data (account numbers, addresses, etc.) in your database. These built-in keywords are global keywords, used for sensitive data discovery by default and for schema discovery when enabled. Each pattern is classified into one or more categories. The included data classification categories are:
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Payment Card Information (PCI)
JSON Sensitive Data Discovery
Sensitive information contained within JSON data can be discovered through the JSON Sensitive Data Discovery feature available through the YAML Ruleset Editor
Here you can enter JSON data into the text field and run the discovery feature to identify the lowest level sensitive keys which could contain sensitive information. The sensitive items are evaluated using the built-in data discovery keywords.
Once finished it will return an automatically generated rule to mask the sensitive key's values contained within the specified JSON data, this rule can then be copied or inserted into the ruleset on the YAML Editor.
Example
This example will show the benefit of generating the rule through the JSON Mask Generator. Suppose a column contains the following JSON data:
{
"customers": [
{
"primary": {
"name": "Foo",
"credit card": 123456789
},
"secondary":
{
"name": "Bar",
"credit card":987654321
}
}
]
}
This data can be entered into the JSON Data field of the JSON Mask Generator and a rule will be created to mask the sensitive keys (name and credit card) as shown below. This can then be copied into the ruleset under the relevant task.
type: json
transforms:
- path:
- customers
- '*'
- primary
- name
masks:
- type: from_fixed
value: redacted
on_null: skip
on_missing: skip
force_consistency: false
- path:
- customers
- '*'
- primary
- credit card
masks:
- type: chain
masks:
- type: credit_card
validate_luhn: true
pan_format: false
preserve_prefix: false
- type: take_substring
start_index: 0
end_index: 9
on_null: skip
on_missing: skip
force_consistency: false
- path:
- customers
- '*'
- secondary
- name
masks:
- type: from_fixed
value: redacted
on_null: skip
on_missing: skip
force_consistency: false
- path:
- customers
- '*'
- secondary
- credit card
masks:
- type: chain
masks:
- type: credit_card
validate_luhn: true
pan_format: false
preserve_prefix: false
- type: take_substring
start_index: 0
end_index: 9
on_null: skip
on_missing: skip
force_consistency: false
In-Data Discovery
Note: This feature is in preview.
In-data discovery is an optional extra part of sensitive data discovery.
It examines the contents of database tables or files and matches the data
against a list of built-in rules and custom rules
to identify content that likely contain sensitive data,
even if there is no metadata match for any of the built-in or configured sensitive data keywords.
For example, it can find credit card numbers stored in a column named pcd
(for payment card details
) or column19
.
For databases, in-data discovery is currently only available as part of schema discovery;
it is not supported for sensitive data discovery.
You can enable it on the ruleset generator and file ruleset generator pages.
It is supported on all database platforms that support schema discovery
(Postgres, Oracle, MySQL, MariaDB, Microsoft SQL Server and Amazon Redshift). Supported file types are .json
, .ndjson
, .parquet
and .csv
.
In-data discovery samples 1,000 random rows (or the entire table, if fewer than 1,000 rows) and attempts to match the content of those rows against its matching rules. To avoid false positives, a certain minimum percentage of the data within a column has to match a rule for the column to be flagged as sensitive data. If a column's data matches multiple rules, only the first match will be returned in the sensitive data discovery or schema discovery report. The list below gives the rules in priority order, highest priority first.
Configuring Options for In-Data Discovery
Enabling In-Data Discovery can be done from the ruleset generator page or on the file ruleset generator page by clicking the Configure In-Data Discovery button.
The options that can be configured are:
- Rows to sample: This option determines the number of rows used for profiling.
- Custom Patterns: These are user-provided rules used for profiling.
- Name: The name of the rule that, if data potentially matches, will be displayed in the
Reason for Flag
column. - Rule: A regex pattern for the data you want to match against.
- Name: The name of the rule that, if data potentially matches, will be displayed in the
- Non-Sensitive Patterns: These are user-provided rules to flag particular data as non-sensitive. Each rule is a regex pattern.
Custom and Non-Sensitive Patterns
If at least 10% of the sampled rows from the column match a Custom Pattern,
the column will be flagged with the name of the rule and Custom
in the Data Classification
column.
An imitate
mask will be generated from columns flagged by custom rules;
this can be changed later in the Ruleset YAML editor as needed.
If every sampled row from the column (excluding those with null
values) matches a Non-Sensitive Pattern,
the column will appear in discovery results marked as Custom Non-Sensitive
, as shown below.
Non-Sensitive Patterns have priority over built-in rules and Custom Patterns.
When a starting a discovery run, if a given regex is invalid, the modal will re-open with an error message underneath the invalid pattern.
Reporting and notifications
Per-run data discovery report
After a run_data_discovery
task has
completed, the corresponding data discovery report can be downloaded alongside the
run logs of the masking run:
The sensitive data discovery report will be downloaded in CSV format and may be opened in a text editor or spreadsheet viewer such as Microsoft Excel. The report contains information to assist you in discovering and masking the sensitive data in your database. Every column that has been identified as potentially containing sensitive data is included in the report, along with a classification of the data and an indication of whether the masking ruleset contains a rule targeting the matched column.
The CSV report includes the following columns:
Table schema | The schema of the table containing a sensitive data match. |
Table name | The name of the table containing a sensitive data match. |
Column name | The name of the column which has matched against a commonly used sensitive data identifier. |
Reason for flag | Description of pattern which caused the column to be flagged for sensitive data. |
Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), and PCI (Payment Card Information). |
Covered by ruleset | A boolean value (True/False) indicating whether the masking ruleset contains a rule to target the identified column. |
Note for Oracle: Sensitive data discovery reports will only cover the tables owned by the user or schema as defined in Connection. Schema will take precedence over user.
Note for Microsoft SQL Server: Sensitive data discovery reports will only cover the tables owned by the user's default schema.
Note for PostgreSQL: Sensitive data discovery reports will only cover the visible tables in the current user search path.
A table is said to be visible if its containing schema is in the search path and no table of the same name appears earlier in the search path.
Notification of new sensitive data
DataMasque users will receive email notifications1 of newly detected, unmasked sensitive data in your databases. This feature provides ongoing protection against new sensitive data making its way into your databases over time.
Disabling notifications
Each user can opt out of receiving email updates by navigating to the My Account page and disabling the Notify me when sensitive data is found option of the Edit Account form:
When notifications are sent
Each user that has enables this option will receive daily notification emails when new, unmasked sensitive data is detected on any Connection that has been masked in the previous 24 hours. If there is nothing to report, no notifications will be sent.
Notifications will be sent, given that in the previous 24 hours a data discovery task has been run:
- On a new Connection that contains unmasked sensitive data.
- As part of a ruleset in which a masking rule that was previously protecting sensitive data has been removed.
- On an existing Connection which has had one or more columns containing sensitive data added to the database.
Notes:
- New sensitive data is only detected during masking runs that include a
run_data_discovery
task. Include this task in all masking runs to receive ongoing protection.- This feature requires that SMTP has been configured, allowing DataMasque to send outbound email.
Keywords configuration
DataMasque can be configured with additional custom keywords and ignored keywords to facilitate sensitive data discovery. Both custom keywords and ignored keywords are case-insensitive. These can be configured from the Settings page.
Global Custom Data Classification Keywords
Schema, table and column names are matched to Global Custom Data Classification keywords in addition to the built-in data discovery keywords. Two formats are supported for custom keywords:
- Keyword format:
- The matching behaviour of the keyword will depend on the number of dot-separated segments:
- If no dot is present, the keyword will be compared to the column name.
- If a single dot is present, the keyword will be compared to
the schema and table name:
schema.table
. - If two dots are present, the keyword will be compared to the
schema, table, and column name:
schema.table.column
. - The keyword will be matched case-insensitively against the data-dictionary representations of schema/table/column names.
- The keyword will still match if the name contains additional characters preceding/following a substring that matches the corresponding segment of the keyword.
- Spaces in a keyword will match space, underscore, and hyphen
delimiter characters, and will also match in the absence of such a
delimiter character. For example, the space-separated keyword
credit card number
would match columns such ascredit card number
,creditcardnumber
,creditcard_number
, andcredit-card number
. - Within each dot-separated segment, you can use the
*
wildcard to match any number of characters. For example, you can match:- all columns in the
audit
table in theaccounts
schema by specifyingaccounts.audit.*
- columns in any table in the
accounts
schema, where the column name contains the stringsaudit
andentry
separated by an unknown number of characters, by specifyingaccounts.*.audit*entry
.
- all columns in the
- Only alphanumeric characters, spaces, underscores, hyphens, dots, asterisk wildcards, and escaping backslashes are allowed.
- Regular expression format:
- A keyword prefixed with
regex:
will be treated as a regular expression over the stringschema_name.table_name.column_name
. For example,regex:.*\.[a-z]+log\..*user$
matches:- columns with names ending in
user
(.*user$
) - in tables whose names consist of a series of letters ending in
log
, such asauditlog
([a-z]+log
) - in any schema (
.*
).
- columns with names ending in
- The regular expression should start immediately after the
regex:
prefix. Do not include a space after the colon. - The match is a partial match.
For example,
regex:login
matcheslogin
anywhere in the full string (so any part of the schema, table, or column name). Use the anchors^
at the start and$
at the end if you want to restrict the match to a full match, such asregex:^myschema\.auth\.login_date$
to match precisely the schemamyschema
, tableauth
, and columnlogin_date
. - To match the separator dots between the schema, table and column names,
be sure to escape the dot with a backslash (
\.
), as an unescaped dot in a regular expression matches any character. - The regex will be matched case-insensitively against the data-dictionary representations of schema/table/column names.
- For more details on regular expressions, see: Common regular expression patterns.
- A keyword prefixed with
Column names containing dots and/or backslashes:
- When not using the
regex:
format, literal dots (.
) and backslashes (\
) in names should be escaped with a preceding backslash. For example, to match a column namedcreditcard.1
whose name contains a dot, specifycreditcard\.1
.- When using the
regex:
format, any backslashes or dots in schema/table/column data-dictionary names will be prefixed by a backslash in the string to be matched by the regex. For example, to match a column namedcreditcard.1
, you need to specifyregex:creditcard\\\.1
, with an escaped backslash\\
to match the added backslash and an escaped dot\.
to match the dot.
Column names that match the Global Custom Data Classification keywords will be reported and tagged with the data classification "Custom" in Sensitive Data Discovery reports.
Global Ignored keywords
Global Ignored keywords will only ignore exact matches of a schema, table and/or column name,
which allows you to exclude specific columns from Sensitive Data Discovery reports.
For example, with the ignored keyword p_id
, columns named p_id
will be ignored,
and not identified as sensitive data nor included in the discovery report.
The format and behaviour of ignored keywords is entirely the same as for custom keywords, except for the following differences:
- When not using the regular expression format, a full exact match is required.
If the schema, column or table name contains additional text before and/or after the keyword,
the keyword will not match.
For example, the ignored keyword
credit_card
will not match a column namedcredit_card_number
. - Spaces are considered part of the keyword and do not match characters other than a space.
For example, the ignored keyword
credit card number
will only match a column with that exact name (case-insensitive).
The *
wildcard is still supported, so in the first example above,
credit_card*
can be used to match credit_card_number
.
The requirement for full matches for ignored keywords makes the *
wildcard more useful.
For example, you can ignore:
- all columns whose names start with
credit
by specifyingcredit*
- all columns whose names end with
login
in tables whose names start withuser
by specifying*.user*.*login
.
Built-in data discovery keywords
DataMasque uses regular expressions to search for columns which may contain the following sensitive information:
Category PII
- Name / first name / middle name / last name / surname
- Mail / email
- Date of birth / DoB
- SSN / Social Security Number
- Address
- Post code
- Country / state / city / suburb / zip code
- Phone number
- Fax number
- Insurance number
- Passport number
- Driver license number
- Gender
- Age
- Vehicle identification number / VIN
- Login
- Job position / role / title
- Workspace / company
- NRIC / Identity Card Number
- IC number
- ID number
- IRD number / Inland Revenue Department number
- NINO
- Unique taxpayer reference / UTR
- Identity / identification / ID
- Tax number
- Internet protocol (IP) address
- Media access control (MAC) address
- Licence plate
- Licence number
- Certificate number
- Identifiers / serial number
Category PCI
- Credit / payment / debit card
- Credit / payment / debit number
- Account number
- Security code
- Expiry date
- Name / first name / middle name / last name / surname
- PIN / Personal identification numbers
- CVV / Card Verification Value
- Address
- Post code
- Country / state / city/ suburb / zip code
- Phone number
- Fax number
Category PHI
- PHI number
- NHI number
- Medical record number
- Insurance number
- Internet protocol (IP) address
- Name / first name / middle name / last name / surname
- Health plan beneficiary number
- Identifiers / ID
- Identifying number / code
- Serial number
- Licence plate
- Licence number
- Certificate number
- Address
- Post code
- Country / state / city/ suburb / zip code
- Phone number
- Fax number
Appendix: In-data discovery rules
The list of rules is built-in and currently cannot be changed or customized.
DataMasque can detect the following data in columns.
Credit Card Numbers
- How it matches: Checks the card prefix against the built-in list of valid card prefixes and card lengths, and uses the Luhn checksum algorithm. A list of valid prefixes can be found in the credit_card_prefixes.csv file.
- Threshold: 12% of sampled data should match this check.
- Minimum Column Length: 12 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: credit_card
pan_format: false
pan_character: '#'
generate_luhn: true
apply_weighting: false
generate_card_number: true
on_null: skip
on_invalid: mask
output_format_choice: retained
Luhn Checksum
- How it matches: Uses the Luhn checksum algorithm.
- Threshold: 12% of sampled data should match this check.
- Minimum Column Length: 12 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_unique_imitate
skip_letters: false
skip_digits: false
checksum: luhn
on_invalid: error
Brazilian CPF Numbers
- How it matches: Uses the Brazilian CPF checksum algorithm.
- Mask for a match: Full Brazilian CPF numbers are typically preserved.
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 11 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: brazilian_cpf
on_null: skip
on_invalid: mask
output_format_choice: retained
Social Security Numbers
- How it matches: Uses the social security number pattern rules,
as documented for the
social_security_number
mask. - Threshold: 45% of sampled data should match this check.
- Minimum Column Length: 9 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: social_security_number
on_null: skip
on_invalid: mask
output_format_choice: retained
Partial Credit Card Numbers
- How it matches: Uses a regular expression.
- Pattern to match:
^(\d{6}(?:X{6}|X{5}\d|X{4}\d{2}|X{3}\d{3}|X{2}\d{4}|X{1}\d{5}|#{6}|#{5}\d|#{4}\d{2}|#{3}\d{3}|#{2}\d{4}|#{1}\d{5}))$
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 15 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: imitate
IP Addresses
- How it matches: Uses a regular expression.
- Pattern to match:
\b(?:\d{1,3}\.){3}\d{1,3}\b
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 15 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
masks:
- type: from_random_number
max: 255
min: 0
- type: from_random_number
max: 255
min: 0
- type: from_random_number
max: 255
min: 0
- type: from_random_number
max: 255
min: 0
glue: .
MAC Addresses
- How it matches: Uses a regular expression.
- Pattern to match:
^(([0-9A-Fa-f]{2}[-]?){5}([0-9A-Fa-f]{2})|([0-9A-Fa-f]{2}[:]?){5}([0-9A-Fa-f]{2}))$
- Pattern to exclude:
\d{12}
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 12 characters.
- Maximum Column Length: No maximum length.
Examples
Value | Match |
---|---|
aa:aa:aa:aa:aa:aa | Yes |
A1-B2-C3-D4-E5-F6 | Yes |
1a2345665432 | Yes |
123456765432 | No - all digits |
00-1A-2B:3C-4D-5E | No - inconsistent separators |
12:34:56:78:90:GF | No - contains non-hex character 'G' |
Generated Mask Definition
- type: chain
masks:
- type: imitate
force_change: false
uppercase: false
lowercase: false
digits: true
- type: replace_regex
regex: '[a-f]'
masks:
- type: from_format_string
value_format: '{[a-f],1}'
preserve_length: true
- type: replace_regex
regex: '[A-F]'
masks:
- type: from_format_string
value_format: '{[A-F],1}'
preserve_length: true
US States
- How it matches: Matches column data in
state_long
against the built-inDataMasque_US_States.csv
seed file. - Threshold: 60% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
seed_file: DataMasque_US_States.csv
seed_column: state_long
US States Short
- How it matches: Matches column data in
state_short
against the built-inDataMasque_US_States.csv
seed file. - Threshold: 80% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
seed_file: DataMasque_US_States.csv
seed_column: state_short
Email Addresses
- How it matches: Uses a regular expression.
- Pattern to match:
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,24}$
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- type: from_file
seed_file: DataMasque_lastNames_v2.csv
seed_column: lastnames
- type: from_file
seed_file: DataMasque_fake_email_suffixes.csv
seed_column: email-suff
Gender
- How it matches: Uses a regular expression.
- Pattern to match:
^(f|m|male|female)$
. Match words case-insensitive. - Threshold: 70% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_fixed
value: REDACTED
First Names
- How it matches: Matches column data against the built-in
DataMasque_firstNames_mixed.csv
seed file. The match is case-insensitive. - Threshold: 40% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
Last Names
- How it matches: Matches column data against the built-in
DataMasque_lastNames_v2.csv
seed file. The match is case-insensitive. - Threshold: 25% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
seed_file: DataMasque_lastNames_v2.csv
seed_column: lastnames
Street Addresses
- How it matches: Matches column data against the built-in
DataMasque_street_names_full.csv
amdDataMasque_street_types_all.csv
seed files. The match is case-insensitive. Also requires that the data starts with a digit. - Threshold: 60% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
masks:
- type: from_file
seed_file: DataMasque_NZ_addresses_real.csv
seed_column: street_number
- type: from_file
seed_file: DataMasque_NZ_addresses_real.csv
seed_column: street_name
glue: ' '
Full Names
- How it matches: Matches column data against the built-in
DataMasque_firstNames_mixed.csv
andDataMasque_lastNames_v2.csv
seed files. The match is case-insensitive. Also requires that the data starts with a letter (A-Z
ora-z
). - Threshold: 60% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
masks:
- type: from_file
seed_file: DataMasque_firstNames_mixed.csv
seed_column: firstname-mixed
- type: from_file
seed_file: DataMasque_lastNames_v2.csv
seed_column: lastnames
glue: ' '
Phone Number
- How it matches: Uses a regular expression.
- Pattern to match:
^(?=(?:\D*\d){7,15}$)[+]?[(]?[0-9]{1,3}[)]?[-\s\.]?[0-9]{1,4}([- \.\s]?[0-9]{2,4}){1,3}$
. - Threshold: 20% of sampled data should match this check.
- Minimum Column Length: 7 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: imitate