Sensitive Data Discovery
- Overview
- Methodology
- JSON Sensitive Data Discovery
- In-Data Discovery
- Reporting and notifications
- Keywords configuration
- Built-in data discovery keywords
- Appendix: In-data discovery rules
Overview
DataMasque can be configured to automatically discover sensitive data in your databases during masking. When a masking ruleset contains the special-purpose `run_data_discovery` task type, DataMasque will inspect the database metadata and generate a discovery report for each masking run. Additionally, each user will receive email notifications of any new, unmasked sensitive data that has been discovered by DataMasque, providing ongoing protection against new sensitive data being added to your schemas over time.
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
Methodology
To perform sensitive data discovery, DataMasque uses regular expressions (regex) to scan the metadata of the target database. When a database column is identified as sensitive, DataMasque will compare it with the masking rules specified in the ruleset to determine the masking coverage for the column.
DataMasque comes with over 90 built-in keywords to help discover various types of sensitive data (account numbers, addresses, etc.) in your database. These built-in keywords are global keywords, used for sensitive data discovery by default and for schema discovery when enabled. Each pattern is classified into one or more categories. The included data classification categories are:
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Payment Card Information (PCI)
JSON Sensitive Data Discovery
Sensitive information contained within JSON data can be discovered through the JSON Sensitive Data Discovery feature, available through the YAML Ruleset Editor.
Here you can enter JSON data into the text field and run the discovery feature to identify the lowest-level keys that could contain sensitive information. The items are evaluated using the built-in data discovery keywords.
Once finished, the feature returns an automatically generated rule to mask the values of the sensitive keys within the specified JSON data. This rule can then be copied or inserted into the ruleset in the YAML Editor.
Example
This example will show the benefit of generating the rule through the JSON Mask Generator. Suppose a column contains the following JSON data:
{
  "customers": [
    {
      "primary": {
        "name": "Foo",
        "credit card": 123456789
      },
      "secondary": {
        "name": "Bar",
        "credit card": 987654321
      }
    }
  ]
}
This data can be entered into the JSON Data field of the JSON Mask Generator and a rule will be created to mask the sensitive keys (name and credit card) as shown below. This can then be copied into the ruleset under the relevant task.
type: json
transforms:
  - path:
      - customers
      - '*'
      - primary
      - name
    masks:
      - type: from_fixed
        value: redacted
    on_null: skip
    on_missing: skip
    force_consistency: false
  - path:
      - customers
      - '*'
      - primary
      - credit card
    masks:
      - type: chain
        masks:
          - type: credit_card
            validate_luhn: true
            pan_format: false
            preserve_prefix: false
          - type: take_substring
            start_index: 0
            end_index: 9
    on_null: skip
    on_missing: skip
    force_consistency: false
  - path:
      - customers
      - '*'
      - secondary
      - name
    masks:
      - type: from_fixed
        value: redacted
    on_null: skip
    on_missing: skip
    force_consistency: false
  - path:
      - customers
      - '*'
      - secondary
      - credit card
    masks:
      - type: chain
        masks:
          - type: credit_card
            validate_luhn: true
            pan_format: false
            preserve_prefix: false
          - type: take_substring
            start_index: 0
            end_index: 9
    on_null: skip
    on_missing: skip
    force_consistency: false
In-Data Discovery
Note: This feature is in preview.
In-data discovery is an optional extra part of sensitive data discovery. It examines the contents of database tables or files and matches the data against a list of built-in rules and custom rules to identify content that likely contains sensitive data, even if there is no metadata match for any of the built-in or configured sensitive data keywords. For example, it can find credit card numbers stored in a column named `pcd` (for payment card details) or `column19`.
For databases, in-data discovery is currently only available as part of schema discovery; it is not supported for sensitive data discovery. You can enable it on the ruleset generator and file ruleset generator pages. It is supported on all database platforms that support schema discovery (Postgres, Oracle, MySQL, MariaDB, Microsoft SQL Server and Amazon Redshift). Supported file types are `.json`, `.ndjson`, `.parquet` and `.csv`.
In-data discovery samples 1,000 random rows (or the entire table, if fewer than 1,000 rows) and attempts to match the content of those rows against its matching rules. To avoid false positives, a certain minimum percentage of the data within a column has to match a rule for the column to be flagged as sensitive data. If a column's data matches multiple rules, only the first match will be returned in the sensitive data discovery or schema discovery report. The appendix below lists the rules in priority order, highest priority first.
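A minimal sketch of this sampling-and-threshold logic, assuming two illustrative rules (the built-in rule list, its thresholds, and its exact matching behaviour are internal to DataMasque):

```python
import random
import re

# Example rules in priority order: (name, pattern, minimum match fraction).
# These two are modelled on rules described in the appendix; the real list
# is built in and cannot be inspected or changed.
RULES = [
    ("Email Addresses",
     re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,24}$"), 0.10),
    ("Gender",
     re.compile(r"^(f|m|male|female)$", re.IGNORECASE), 0.70),
]

def profile_column(values, sample_size=1000):
    """Return the name of the first rule whose threshold is met, else None."""
    sample = values if len(values) <= sample_size else random.sample(values, sample_size)
    non_null = [v for v in sample if v is not None]
    for name, pattern, threshold in RULES:
        matches = sum(1 for v in non_null if pattern.match(v))
        if non_null and matches / len(non_null) >= threshold:
            return name  # only the first (highest-priority) match is reported
    return None
```

For example, a column where 2 of 10 sampled values are email addresses clears the 10% email threshold and is flagged, even though most values do not match.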
Configuring Options for In-Data Discovery
In-data discovery can be enabled from the ruleset generator page or the file ruleset generator page by clicking the Configure In-Data Discovery button.
The options that can be configured are:
- Rows to sample: This option determines the number of rows used for profiling.
- Custom Patterns: These are user-provided rules used for profiling. Each rule has two parts:
  - Name: The name of the rule which, if data potentially matches, will be displayed in the Reason for Flag column.
  - Rule: A regex pattern for the data you want to match against.
- Non-Sensitive Patterns: These are user-provided rules to flag particular data as non-sensitive. Each rule is a regex pattern.
Custom and Non-Sensitive Patterns
If at least 10% of the sampled rows from the column match a Custom Pattern, the column will be flagged with the name of the rule and Custom in the Data Classification column. An `imitate` mask will be generated for columns flagged by custom rules; this can be changed later in the Ruleset YAML editor as needed.
If every sampled row from the column (excluding those with null values) matches a Non-Sensitive Pattern, the column will appear in discovery results marked as Custom Non-Sensitive. Non-Sensitive Patterns have priority over built-in rules and Custom Patterns.
When starting a discovery run, if a given regex is invalid, the modal will re-open with an error message underneath the invalid pattern.
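The precedence described above can be sketched as follows: a Non-Sensitive Pattern that matches every non-null sampled value wins outright, otherwise a Custom Pattern needs only a 10% match rate. The function below is an illustration of the documented behaviour, not DataMasque's implementation.

```python
import re

def classify(sample, custom_patterns, non_sensitive_patterns):
    """Classify a column sample using custom and non-sensitive regex rules."""
    non_null = [v for v in sample if v is not None]
    # Non-Sensitive Patterns take priority: every non-null value must match.
    for pattern in non_sensitive_patterns:
        if non_null and all(re.search(pattern, v) for v in non_null):
            return "Custom Non-Sensitive"
    # Custom Patterns: at least 10% of sampled values must match.
    for name, pattern in custom_patterns:
        matches = sum(1 for v in non_null if re.search(pattern, v))
        if non_null and matches / len(non_null) >= 0.10:
            return name  # flagged; an imitate mask would be generated
    return None
```

Note that a column matching both rule types is reported as non-sensitive, since the Non-Sensitive Pattern is evaluated first.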
Reporting and notifications
Per-run data discovery report
After a `run_data_discovery` task has completed, the corresponding data discovery report can be downloaded alongside the run logs of the masking run.
The sensitive data discovery report will be downloaded in CSV format and may be opened in a text editor or spreadsheet viewer such as Microsoft Excel. The report contains information to assist you in discovering and masking the sensitive data in your database. Every column that has been identified as potentially containing sensitive data is included in the report, along with a classification of the data and an indication of whether the masking ruleset contains a rule targeting the matched column.
The CSV report includes the following columns:
Column | Description |
---|---|
Table schema | The schema of the table containing a sensitive data match. |
Table name | The name of the table containing a sensitive data match. |
Column name | The name of the column which has matched against a commonly used sensitive data identifier. |
Reason for flag | Description of the pattern which caused the column to be flagged for sensitive data. |
Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), and PCI (Payment Card Information). |
Covered by ruleset | A boolean value (True/False) indicating whether the masking ruleset contains a rule to target the identified column. |
Note for Oracle: Sensitive data discovery reports will only cover the tables owned by the user or schema as defined in Connection. Schema will take precedence over user.
Note for Microsoft SQL Server: Sensitive data discovery reports will only cover the tables owned by the user's default schema.
Note for PostgreSQL: Sensitive data discovery reports will only cover the visible tables in the current user search path.
A table is said to be visible if its containing schema is in the search path and no table of the same name appears earlier in the search path.
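As an example of working with the report, a short script can filter for flagged columns that are not yet covered by the ruleset. The header names follow the table above; the sample rows are invented for illustration.

```python
import csv
import io

# A tiny, invented sample of a sensitive data discovery report.
REPORT = """\
Table schema,Table name,Column name,Reason for flag,Data classifications,Covered by ruleset
public,customers,email,email,PII,True
public,customers,card_no,credit card,"PII, PCI",False
"""

def uncovered_columns(report_text):
    """List schema.table.column for rows not covered by the ruleset."""
    reader = csv.DictReader(io.StringIO(report_text))
    return [
        f"{row['Table schema']}.{row['Table name']}.{row['Column name']}"
        for row in reader
        if row["Covered by ruleset"] == "False"
    ]
```

Running `uncovered_columns(REPORT)` on the sample above reports only `public.customers.card_no`, the flagged column with no masking rule.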
Notification of new sensitive data
DataMasque users will receive email notifications of newly detected, unmasked sensitive data in your databases. This feature provides ongoing protection against new sensitive data making its way into your databases over time.
Disabling notifications
Each user can opt out of receiving email updates by navigating to the My Account page and disabling the Notify me when sensitive data is found option of the Edit Account form:
When notifications are sent
Each user that has enabled this option will receive daily notification emails when new, unmasked sensitive data is detected on any Connection that has been masked in the previous 24 hours. If there is nothing to report, no notifications will be sent.
Notifications will be sent if, in the previous 24 hours, a data discovery task has been run:
- On a new Connection that contains unmasked sensitive data.
- As part of a ruleset in which a masking rule that was previously protecting sensitive data has been removed.
- On an existing Connection which has had one or more columns containing sensitive data added to the database.
Notes:
- New sensitive data is only detected during masking runs that include a `run_data_discovery` task. Include this task in all masking runs to receive ongoing protection.
- This feature requires that SMTP has been configured, allowing DataMasque to send outbound email.
Keywords configuration
DataMasque can be configured with additional custom keywords and ignored keywords to facilitate sensitive data discovery. Both custom keywords and ignored keywords are case-insensitive. These can be configured from the Settings page.
Global Custom Data Classification Keywords
Schema, table and column names are matched to Global Custom Data Classification keywords in addition to the built-in data discovery keywords. Two formats are supported for custom keywords:
- Keyword format:
  - The matching behaviour of the keyword depends on the number of dot-separated segments:
    - If no dot is present, the keyword will be compared to the column name.
    - If a single dot is present, the keyword will be compared to the schema and table name: `schema.table`.
    - If two dots are present, the keyword will be compared to the schema, table, and column name: `schema.table.column`.
  - The keyword will be matched case-insensitively against the data-dictionary representations of schema/table/column names.
  - The keyword will still match if the name contains additional characters preceding/following a substring that matches the corresponding segment of the keyword.
  - Spaces in a keyword will match space, underscore, and hyphen delimiter characters, and will also match in the absence of such a delimiter character. For example, the space-separated keyword `credit card number` would match columns such as `credit card number`, `creditcardnumber`, `creditcard_number`, and `credit-card number`.
  - Within each dot-separated segment, you can use the `*` wildcard to match any number of characters. For example, you can match:
    - all columns in the `audit` table in the `accounts` schema by specifying `accounts.audit.*`
    - columns in any table in the `accounts` schema, where the column name contains the strings `audit` and `entry` separated by an unknown number of characters, by specifying `accounts.*.audit*entry`.
  - Only alphanumeric characters, spaces, underscores, hyphens, dots, asterisk wildcards, and escaping backslashes are allowed.
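The keyword rules above can be sketched as a translation into a regex. This is a hypothetical interpretation of the documented behaviour, not DataMasque's actual implementation; in particular, treating `*` as matching any characters within a single segment is an assumption.

```python
import re

def keyword_to_regex(keyword):
    """Translate a (non-regex) keyword into a case-insensitive regex."""
    parts = []
    for ch in keyword:
        if ch == " ":
            parts.append(r"[ _-]?")   # space matches a delimiter, or nothing
        elif ch == "*":
            parts.append(r"[^.]*")    # wildcard: any characters within a segment
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def keyword_matches(keyword, schema, table, column):
    """Pick the comparison target from the keyword's dot count, then search."""
    dots = keyword.count(".")
    target = {0: column, 1: f"{schema}.{table}", 2: f"{schema}.{table}.{column}"}[dots]
    # Substring match: extra characters before/after the keyword still match.
    return bool(keyword_to_regex(keyword).search(target))
```

With this sketch, `credit card number` matches `creditcard_number`, and `accounts.audit.*` matches every column of `accounts.audit`.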
- Regular expression format:
  - A keyword prefixed with `regex:` will be treated as a regular expression over the string `schema_name.table_name.column_name`. For example, `regex:.*\.[a-z]+log\..*user$` matches:
    - columns with names ending in `user` (`.*user$`)
    - in tables whose names consist of a series of letters ending in `log`, such as `auditlog` (`[a-z]+log`)
    - in any schema (`.*`).
  - The regular expression should start immediately after the `regex:` prefix. Do not include a space after the colon.
  - The match is a partial match. For example, `regex:login` matches `login` anywhere in the full string (so any part of the schema, table, or column name). Use the anchors `^` at the start and `$` at the end if you want to restrict the match to a full match, such as `regex:^myschema\.auth\.login_date$` to match precisely the schema `myschema`, table `auth`, and column `login_date`.
  - To match the separator dots between the schema, table and column names, be sure to escape the dot with a backslash (`\.`), as an unescaped dot in a regular expression matches any character.
  - The regex will be matched case-insensitively against the data-dictionary representations of schema/table/column names.
  - For more details on regular expressions, see: Common regular expression patterns.
Column names containing dots and/or backslashes:

- When not using the `regex:` format, literal dots (`.`) and backslashes (`\`) in names should be escaped with a preceding backslash. For example, to match a column named `creditcard.1` whose name contains a dot, specify `creditcard\.1`.
- When using the `regex:` format, any backslashes or dots in schema/table/column data-dictionary names will be prefixed by a backslash in the string to be matched by the regex. For example, to match a column named `creditcard.1`, you need to specify `regex:creditcard\\\.1`, with an escaped backslash (`\\`) to match the added backslash and an escaped dot (`\.`) to match the dot.
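The escaping behaviour for the `regex:` format can be illustrated with a small sketch. The `build_match_target` helper is hypothetical; it shows how a name containing a literal dot is backslash-prefixed before the keyword regex is applied.

```python
import re

def build_match_target(schema, table, column):
    """Build the schema.table.column string, escaping literal dots/backslashes."""
    esc = lambda name: name.replace("\\", "\\\\").replace(".", r"\.")
    return f"{esc(schema)}.{esc(table)}.{esc(column)}"

# The column's literal dot becomes "\." in the match target ...
target = build_match_target("sales", "orders", "creditcard.1")

# ... so the regex keyword must escape both the added backslash and the dot,
# exactly as regex:creditcard\\\.1 does in the example above.
pattern = re.compile(r"creditcard\\\.1", re.IGNORECASE)
```

The separator dots remain unescaped in the target, which is why `\.` in a keyword matches them while a plain `.` would match any character.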
Column names that match the Global Custom Data Classification keywords will be reported and tagged with the data classification "Custom" in Sensitive Data Discovery reports.
Global Ignored keywords
Global Ignored keywords will only ignore exact matches of a schema, table and/or column name, which allows you to exclude specific columns from Sensitive Data Discovery reports. For example, with the ignored keyword `p_id`, columns named `p_id` will be ignored, and neither identified as sensitive data nor included in the discovery report.
The format and behaviour of ignored keywords is the same as for custom keywords, except for the following differences:

- When not using the regular expression format, a full exact match is required. If the schema, column or table name contains additional text before and/or after the keyword, the keyword will not match. For example, the ignored keyword `credit_card` will not match a column named `credit_card_number`.
- Spaces are considered part of the keyword and do not match characters other than a space. For example, the ignored keyword `credit card number` will only match a column with that exact name (case-insensitive).

The `*` wildcard is still supported, so in the first example above, `credit_card*` can be used to match `credit_card_number`.

The requirement for full matches for ignored keywords makes the `*` wildcard more useful. For example, you can ignore:

- all columns whose names start with `credit` by specifying `credit*`
- all columns whose names end with `login` in tables whose names start with `user` by specifying `*.user*.*login`.
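The contrast between the two keyword types can be sketched as substring versus full matching. This is an illustration of the documented behaviour under the same assumptions as earlier sketches, not the actual implementation.

```python
import re

def to_regex(keyword):
    """Translate a keyword, treating `*` as a within-segment wildcard."""
    return "".join(r"[^.]*" if ch == "*" else re.escape(ch) for ch in keyword)

def custom_matches(keyword, column):
    """Custom keywords match as substrings (extra characters allowed)."""
    return bool(re.search(to_regex(keyword), column, re.IGNORECASE))

def ignored_matches(keyword, column):
    """Ignored keywords require a full, exact (case-insensitive) match."""
    return bool(re.fullmatch(to_regex(keyword), column, re.IGNORECASE))
```

So `credit_card` flags `credit_card_number` as a custom keyword but does not ignore it, while `credit_card*` ignores it.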
Built-in data discovery keywords
DataMasque uses regular expressions to search for columns which may contain the following sensitive information:
Category PII
- Name / first name / middle name / last name / surname
- Mail / email
- Date of birth / DoB
- SSN / Social Security Number
- Address
- Post code
- Country / state / city / zip code
- Phone number
- Fax number
- Insurance number
- Passport number
- Driver license number
- Gender
- Age
- Vehicle identification number / VIN
- Login
- Job position / role / title
- Workspace / company
- NRIC / Identity Card Number
- IC number
- ID number
- IRD number / Inland Revenue Department number
- NINO
- Unique taxpayer reference / UTR
- Identity / identification / ID
- Tax number
- Internet protocol (IP) address
- Media access control (MAC) address
- Licence plate
- Licence number
- Certificate number
- Identifiers / serial number
Category PCI
- Credit / payment / debit card
- Credit / payment / debit number
- Account number
- Security code
- Expiry date
- Name / first name / middle name / last name / surname
- PIN / Personal identification numbers
- CVV / Card Verification Value
- Address
- Post code
- Country / state / city / zip code
- Phone number
- Fax number
Category PHI
- PHI number
- NHI number
- Medical record number
- Insurance number
- Internet protocol (IP) address
- Name / first name / middle name / last name / surname
- Health plan beneficiary number
- Identifiers / ID
- Identifying number / code
- Serial number
- Licence plate
- Licence number
- Certificate number
- Address
- Post code
- Country / state / city / zip code
- Phone number
- Fax number
Appendix: In-data discovery rules
The list of rules is built-in and currently cannot be changed or customized.
DataMasque can detect the following data in columns.
Credit Card Numbers
- How it matches: Checks the card prefix against the built-in list of valid card prefixes and card lengths, and uses the Luhn checksum algorithm. A list of valid prefixes can be found in the credit_card_prefixes.csv file.
- Threshold: 12% of sampled data should match this check.
- Minimum Column Length: 12 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: credit_card
  pan_format: false
  pan_character: '#'
  generate_luhn: true
  apply_weighting: false
  generate_card_number: true
  on_null: skip
  on_invalid: mask
  output_format_choice: retained
Luhn Checksum
- How it matches: Uses the Luhn checksum algorithm.
- Threshold: 12% of sampled data should match this check.
- Minimum Column Length: 12 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_unique_imitate
  skip_letters: false
  skip_digits: false
  checksum: luhn
  on_invalid: error
Brazilian CPF Numbers
- How it matches: Uses the Brazilian CPF checksum algorithm.
- Mask for a match: Full Brazilian CPF numbers are typically preserved.
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 11 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: brazilian_cpf
  on_null: skip
  on_invalid: mask
  output_format_choice: retained
Social Security Numbers
- How it matches: Uses the social security number pattern rules, as documented for the `social_security_number` mask.
- Threshold: 45% of sampled data should match this check.
- Minimum Column Length: 9 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: social_security_number
  on_null: skip
  on_invalid: mask
  output_format_choice: retained
Partial Credit Card Numbers
- How it matches: Uses a regular expression.
- Pattern to match: `^(\d{6}(?:X{6}|X{5}\d|X{4}\d{2}|X{3}\d{3}|X{2}\d{4}|X{1}\d{5}|#{6}|#{5}\d|#{4}\d{2}|#{3}\d{3}|#{2}\d{4}|#{1}\d{5}))$`
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 15 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: imitate
IP Addresses
- How it matches: Uses a regular expression.
- Pattern to match: `\b(?:\d{1,3}\.){3}\d{1,3}\b`
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 15 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
  masks:
    - type: from_random_number
      max: 255
      min: 0
    - type: from_random_number
      max: 255
      min: 0
    - type: from_random_number
      max: 255
      min: 0
    - type: from_random_number
      max: 255
      min: 0
  glue: '.'
MAC Addresses
- How it matches: Uses a regular expression.
- Pattern to match: `^(([0-9A-Fa-f]{2}[-]?){5}([0-9A-Fa-f]{2})|([0-9A-Fa-f]{2}[:]?){5}([0-9A-Fa-f]{2}))$`
- Pattern to exclude: `\d{12}`
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: 12 characters.
- Maximum Column Length: No maximum length.
Examples
Value | Match |
---|---|
aa:aa:aa:aa:aa:aa | Yes |
A1-B2-C3-D4-E5-F6 | Yes |
1a2345665432 | Yes |
123456765432 | No - all digits |
00-1A-2B:3C-4D-5E | No - inconsistent separators |
12:34:56:78:90:GF | No - contains non-hex character 'G' |
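The match and exclude patterns can be checked against the example values above. The `looks_like_mac` helper is a hypothetical combination of the two patterns, shown only to verify the table.

```python
import re

# The match and exclude patterns quoted above, applied to the example values.
MAC = re.compile(
    r"^(([0-9A-Fa-f]{2}[-]?){5}([0-9A-Fa-f]{2})"
    r"|([0-9A-Fa-f]{2}[:]?){5}([0-9A-Fa-f]{2}))$"
)
ALL_DIGITS = re.compile(r"\d{12}")

def looks_like_mac(value):
    """Match the MAC pattern, excluding values that are twelve plain digits."""
    return bool(MAC.match(value)) and not ALL_DIGITS.fullmatch(value)
```

Each alternation in the pattern permits only one separator style (hyphens or colons, each optional), which is why `00-1A-2B:3C-4D-5E` is rejected for inconsistent separators.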
Generated Mask Definition
- type: chain
  masks:
    - type: imitate
      force_change: false
      uppercase: false
      lowercase: false
      digits: true
    - type: replace_regex
      regex: '[a-f]'
      masks:
        - type: from_format_string
          value_format: '{[a-f],1}'
          preserve_length: true
    - type: replace_regex
      regex: '[A-F]'
      masks:
        - type: from_format_string
          value_format: '{[A-F],1}'
          preserve_length: true
US States
- How it matches: Matches column data against the `state_long` column of the built-in `DataMasque_US_States.csv` seed file.
- Threshold: 60% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
  seed_file: DataMasque_US_States.csv
  seed_column: state_long
US States Short
- How it matches: Matches column data against the `state_short` column of the built-in `DataMasque_US_States.csv` seed file.
- Threshold: 80% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
  seed_file: DataMasque_US_States.csv
  seed_column: state_short
Email Addresses
- How it matches: Uses a regular expression.
- Pattern to match: `^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,24}$`
- Threshold: 10% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
  masks:
    - type: from_file
      seed_file: DataMasque_firstNames_mixed.csv
      seed_column: firstname-mixed
    - type: from_file
      seed_file: DataMasque_lastNames_v2.csv
      seed_column: lastnames
    - type: from_file
      seed_file: DataMasque_fake_email_suffixes.csv
      seed_column: email-suff
Gender
- How it matches: Uses a regular expression.
- Pattern to match: `^(f|m|male|female)$`. The match is case-insensitive.
- Threshold: 70% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_fixed
  value: REDACTED
First Names
- How it matches: Matches column data against the built-in `DataMasque_firstNames_mixed.csv` seed file. The match is case-insensitive.
- Threshold: 40% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
  seed_file: DataMasque_firstNames_mixed.csv
  seed_column: firstname-mixed
Last Names
- How it matches: Matches column data against the built-in `DataMasque_lastNames_v2.csv` seed file. The match is case-insensitive.
- Threshold: 25% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: from_file
  seed_file: DataMasque_lastNames_v2.csv
  seed_column: lastnames
Street Addresses
- How it matches: Matches column data against the built-in `DataMasque_street_names_full.csv` and `DataMasque_street_types_all.csv` seed files. The match is case-insensitive. Also requires that the data starts with a digit.
- Threshold: 60% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
  masks:
    - type: from_file
      seed_file: DataMasque_NZ_addresses_real.csv
      seed_column: street_number
    - type: from_file
      seed_file: DataMasque_NZ_addresses_real.csv
      seed_column: street_name
  glue: ' '
Full Names
- How it matches: Matches column data against the built-in `DataMasque_firstNames_mixed.csv` and `DataMasque_lastNames_v2.csv` seed files. The match is case-insensitive. Also requires that the data starts with a letter (`A-Z` or `a-z`).
- Threshold: 60% of sampled data should match this check.
- Minimum Column Length: No minimum length.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: concat
  masks:
    - type: from_file
      seed_file: DataMasque_firstNames_mixed.csv
      seed_column: firstname-mixed
    - type: from_file
      seed_file: DataMasque_lastNames_v2.csv
      seed_column: lastnames
  glue: ' '
Phone Number
- How it matches: Uses a regular expression.
- Pattern to match: `^(?=(?:\D*\d){7,15}$)[+]?[(]?[0-9]{1,3}[)]?[-\s\.]?[0-9]{1,4}([- \.\s]?[0-9]{2,4}){1,3}$`
- Threshold: 20% of sampled data should match this check.
- Minimum Column Length: 7 characters.
- Maximum Column Length: No maximum length.
Generated Mask Definition
- type: imitate