Sensitive Data Discovery

Overview
Methodology
JSON Sensitive Data Discovery
Reporting and notifications
- Per-run data discovery report
- Notification of new sensitive data
Keywords configuration
- Global Custom Data Classification Keywords
- Global Ignored keywords
Appendix: Built-in data discovery keywords

Overview

DataMasque can be configured to automatically discover sensitive data in your databases during masking. When a masking ruleset contains the special-purpose run_data_discovery task type, DataMasque inspects the database metadata and generates a discovery report for each masking run. Additionally, each user receives email notifications of any new, unmasked sensitive data that DataMasque discovers, providing ongoing protection against new sensitive data being added to your schemas over time.

Note: The schema discovery feature does not currently support Microsoft SQL Server (Linked Server) databases.

Methodology

To perform sensitive data discovery, DataMasque uses regular expressions (regex) to scan the metadata of the target database. When a database column is identified as sensitive, DataMasque will compare it with the masking rules specified in the ruleset to determine the masking coverage for the column.

DataMasque comes with over 90 built-in keywords to help discover various types of sensitive data (account numbers, addresses, etc.) in your database. These built-in keywords are global keywords, used for sensitive data discovery by default and for schema discovery when enabled. Each pattern is classified into one or more categories. The included data classification categories are:

Personally Identifiable Information (PII)
Personal Health Information (PHI)
Payment Card Information (PCI)
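
The matching and coverage check described above can be sketched in a few lines of Python. This is a minimal illustration only, not DataMasque's implementation: the three patterns in KEYWORDS, the discover function, and the sample column names are all hypothetical, and the real built-in patterns are not reproduced here.

```python
import re

# Hypothetical keyword table: regex pattern -> data classifications.
# These patterns are illustrative stand-ins, not the built-in keywords.
KEYWORDS = {
    r"e?mail": {"PII"},
    r"credit[ _-]?card": {"PCI"},
    r"medical[ _-]?record[ _-]?number": {"PHI"},
}


def discover(columns, masked_columns):
    """Return one finding per column whose name matches a keyword pattern."""
    findings = []
    for column in columns:
        classes = set()
        for pattern, labels in KEYWORDS.items():
            if re.search(pattern, column, flags=re.IGNORECASE):
                classes |= labels
        if classes:
            findings.append({
                "column": column,
                "classifications": sorted(classes),
                # Coverage: does the ruleset already target this column?
                "covered": column in masked_columns,
            })
    return findings


findings = discover(
    columns=["id", "Email", "credit_card", "created_at"],
    masked_columns={"Email"},
)
for finding in findings:
    print(finding)
```

Only column names are inspected in this sketch, which mirrors the metadata-based approach: no row data is read.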

JSON Sensitive Data Discovery

Sensitive information contained within JSON data can be discovered through the JSON Sensitive Data Discovery feature, available through the YAML Ruleset Editor.

Here you can enter JSON data into the text field and run the discovery feature to identify the lowest-level keys which could contain sensitive information. The keys are evaluated using the built-in data discovery keywords.

Once finished, it will return an automatically generated rule to mask the values of the sensitive keys contained within the specified JSON data. This rule can then be copied or inserted into the ruleset in the YAML Editor.

Example

This example will show the benefit of generating the rule through the JSON Mask Generator. Suppose a column contains the following JSON data:

{
  "customers": [
    {
      "primary": {
        "name": "Foo",
        "credit card": 123456789
      },
      "secondary": {
        "name": "Bar",
        "credit card": 987654321
      }
    }
  ]
}

This data can be entered into the JSON Data field of the JSON Mask Generator and a rule will be created to mask the sensitive keys (name and credit card) as shown below. This can then be copied into the ruleset under the relevant task.

type: json
transforms:
  - path:
      - customers
      - '*'
      - primary
      - name
    masks:
      - type: from_fixed
        value: redacted
    on_null: skip
    on_missing: skip
    force_consistency: false
  - path:
      - customers
      - '*'
      - primary
      - credit card
    masks:
      - type: chain
        masks:
          - type: credit_card
            validate_luhn: true
            pan_format: false
            preserve_prefix: false
          - type: take_substring
            start_index: 0
            end_index: 9
    on_null: skip
    on_missing: skip
    force_consistency: false
  - path:
      - customers
      - '*'
      - secondary
      - name
    masks:
      - type: from_fixed
        value: redacted
    on_null: skip
    on_missing: skip
    force_consistency: false
  - path:
      - customers
      - '*'
      - secondary
      - credit card
    masks:
      - type: chain
        masks:
          - type: credit_card
            validate_luhn: true
            pan_format: false
            preserve_prefix: false
          - type: take_substring
            start_index: 0
            end_index: 9
    on_null: skip
    on_missing: skip
    force_consistency: false
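
To show where the four path entries in the generated rule come from, the following Python sketch walks the example JSON and collects the path to each lowest-level sensitive key, using '*' for array elements. It is an approximation under stated assumptions: the SENSITIVE pattern is a hypothetical stand-in for the built-in keywords, sensitive_paths is an illustrative helper, and the choice of masks is not modelled.

```python
import json
import re

# Hypothetical stand-in for the built-in data discovery keywords.
SENSITIVE = re.compile(r"name|credit[ _-]?card", re.IGNORECASE)


def sensitive_paths(node, prefix=()):
    """Yield the path to every non-container value whose key matches SENSITIVE."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = prefix + (key,)
            if isinstance(value, (dict, list)):
                # Not a lowest-level key: keep descending.
                yield from sensitive_paths(value, path)
            elif SENSITIVE.search(key):
                yield path
    elif isinstance(node, list):
        for item in node:
            # All array elements share a single wildcard path segment.
            yield from sensitive_paths(item, prefix + ("*",))


doc = json.loads("""
{"customers": [{"primary": {"name": "Foo", "credit card": 123456789},
                "secondary": {"name": "Bar", "credit card": 987654321}}]}
""")

# dict.fromkeys removes duplicates (from repeated array items) and keeps order.
paths = list(dict.fromkeys(sensitive_paths(doc)))
for path in paths:
    print(list(path))
```

Each printed list corresponds to one path entry in the rule above.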

Reporting and notifications

Per-run data discovery report

After a run_data_discovery task has completed, the corresponding data discovery report can be downloaded alongside the run logs of the masking run.

The sensitive data discovery report will be downloaded in CSV format and may be opened in a text editor or spreadsheet viewer such as Microsoft Excel. The report contains information to assist you in discovering and masking the sensitive data in your database. Every column that has been identified as potentially containing sensitive data is included in the report, along with a classification of the data and an indication of whether the masking ruleset contains a rule targeting the matched column.

The CSV report contains the following columns:


*Table schema*	The schema of the table containing a sensitive data match.
*Table name*	The name of the table containing a sensitive data match.
*Column name*	The name of the column which has matched against a commonly used sensitive data identifier.
*Reason for flag*	Description of pattern which caused the column to be flagged for sensitive data.
*Data classifications*	A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), and PCI (Payment Card Information).
*Covered by ruleset*	A boolean value (True/False) indicating whether the masking ruleset contains a rule to target the identified column.
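
Because the report is plain CSV, it can be post-processed with a short script, for example to list every flagged column that the ruleset does not yet cover. The sketch below assumes the header row uses the column names exactly as listed above; the sample rows, including the "Reason for flag" text, are invented for illustration, and uncovered_columns is a hypothetical helper.

```python
import csv
import io

# Invented sample standing in for a downloaded report. The exact header
# spelling in a real report file is an assumption.
SAMPLE_REPORT = """\
Table schema,Table name,Column name,Reason for flag,Data classifications,Covered by ruleset
public,customers,email,Matched keyword: email,PII,True
public,customers,credit_card,Matched keyword: credit card,PCI,False
public,patients,last_name,Matched keyword: last name,"PII, PCI, PHI",False
"""


def uncovered_columns(report_file):
    """Return (schema, table, column, classifications) for unmasked matches."""
    rows = []
    for row in csv.DictReader(report_file):
        if row["Covered by ruleset"].strip().lower() == "false":
            classes = [c.strip() for c in row["Data classifications"].split(",")]
            rows.append((row["Table schema"], row["Table name"],
                         row["Column name"], classes))
    return rows


uncovered = uncovered_columns(io.StringIO(SAMPLE_REPORT))
for schema, table, column, classes in uncovered:
    print(f"{schema}.{table}.{column}: {', '.join(classes)}")
```

To run this against a real report, pass an open file object instead of the in-memory sample.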

Note for Oracle: Sensitive data discovery reports will only cover the tables owned by the user or schema defined in the Connection. If both are defined, the schema takes precedence over the user.

Note for Microsoft SQL Server: Sensitive data discovery reports will only cover the tables owned by the user's default schema.

Note for PostgreSQL: Sensitive data discovery reports will only cover the visible tables in the current user search path.

A table is said to be visible if its containing schema is in the search path and no table of the same name appears earlier in the search path.

Notification of new sensitive data

DataMasque users will receive email notifications¹ of newly detected, unmasked sensitive data in your databases. This feature provides ongoing protection against new sensitive data making its way into your databases over time.

Disabling notifications

Each user can opt out of receiving email updates by navigating to the My Account page and disabling the Notify me when sensitive data is found option of the Edit Account form:

Data discovery notification opt-in

When notifications are sent

Each user who has enabled this option will receive daily notification emails when new, unmasked sensitive data is detected on any Connection that has been masked in the previous 24 hours. If there is nothing to report, no notifications will be sent.

Notifications are sent when, in the previous 24 hours, a data discovery task has been run:

- On a new Connection that contains unmasked sensitive data.
- As part of a ruleset from which a masking rule that was previously protecting sensitive data has been removed.
- On an existing Connection whose database has had one or more columns containing sensitive data added.
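
One way to reason about the three cases above is as a comparison between the previous and current discovery results for a Connection. The Python sketch below is only a mental model: how DataMasque actually tracks previously reported columns is not described here, and newly_unmasked together with its dictionary inputs is hypothetical.

```python
def newly_unmasked(previous, current):
    """Return sensitive columns that are unmasked now and were not reported before.

    Both arguments map a column identifier to True when a masking rule covers it.
    An empty `previous` models a Connection that has never been scanned.
    """
    result = []
    for column, covered in current.items():
        already_reported = column in previous and not previous[column]
        if not covered and not already_reported:
            result.append(column)
    return sorted(result)


previous = {
    "customers.email": True,
    "customers.phone": True,
    "customers.ssn": False,   # was already unmasked in the previous run
}
current = {
    "customers.email": True,
    "customers.phone": False,  # masking rule removed from the ruleset
    "customers.ssn": False,
    "customers.dob": False,    # column added to the database
}

print(newly_unmasked(previous, current))  # existing Connection
print(newly_unmasked({}, current))        # new Connection
```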

Notes:

New sensitive data is only detected during masking runs that include a run_data_discovery task. Include this task in all masking runs to receive ongoing protection.

This feature requires that SMTP has been configured, allowing DataMasque to send outbound email.

Keywords configuration

DataMasque can be configured with additional custom keywords and ignored keywords to facilitate sensitive data discovery. Both custom keywords and ignored keywords are case-insensitive. These can be configured from the Settings page.

Global Custom Data Classification Keywords

Column names are matched to Global Custom Data Classification keywords in addition to the built-in data discovery keywords.

- The - (dash) and _ (underscore) characters are also supported in keywords.
- DataMasque will convert each keyword to a regular expression pattern in which the (space) between words is optional. For example, columns named credit card number, creditcardnumber and creditcard number will all be matched by the space-separated keyword phrase credit card number.
- Column names that match the Global Custom Data Classification keywords will be reported and tagged with the data classification "Custom" in Sensitive Data Discovery reports.
- Wildcards are also supported by using the * character. For example, you can discover all columns in a given table by specifying schema_name.table_name.*
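
The keyword-to-pattern conversion can be approximated as follows. This is a sketch under assumptions, not the product's actual conversion: keyword_to_pattern is a hypothetical helper, it treats space, dash and underscore as interchangeable optional separators, it matches anywhere in the column name, and it does not handle the * wildcard.

```python
import re


def keyword_to_pattern(keyword):
    """Compile a keyword phrase into a case-insensitive, separator-tolerant regex."""
    words = re.split(r"[ _-]+", keyword.strip())
    # Between words, allow one space, dash or underscore, or no separator at all.
    body = r"[ _-]?".join(re.escape(word) for word in words)
    return re.compile(body, re.IGNORECASE)


pattern = keyword_to_pattern("credit card number")
names = ["credit card number", "creditcardnumber", "creditcard number", "order total"]
matches = {name: bool(pattern.search(name)) for name in names}
for name, matched in matches.items():
    print(f"{name!r}: {matched}")
```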

Global Ignored keywords

Global Ignored keywords will only ignore exact matches of a column name, which allows you to exclude specific column names from Sensitive Data Discovery reports. Only spaces, underscores, hyphens and alphanumeric characters are allowed for Global Ignored keywords.

For example, with the ignored keyword p_id, columns named p_id will be ignored and will no longer be identified as sensitive data.

Wildcards are also supported by using the * character. For example, you can ignore all columns in a given table by specifying schema_name.table_name.*
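
The exact-match and wildcard behaviour can be pictured with the following Python sketch. It is illustrative only: is_ignored is a hypothetical helper, and comparing keywords that contain a dot against the qualified schema.table.column name is an assumption made for this example.

```python
from fnmatch import fnmatchcase


def is_ignored(schema, table, column, ignored_keywords):
    """Return True when a column is excluded by any ignored keyword."""
    qualified = f"{schema}.{table}.{column}".lower()
    for keyword in ignored_keywords:
        keyword = keyword.lower()  # keywords are case-insensitive
        # Assumption: dotted keywords target the qualified name,
        # plain keywords target the bare column name.
        target = qualified if "." in keyword else column.lower()
        # Without a * in the keyword, this is an exact comparison.
        if fnmatchcase(target, keyword):
            return True
    return False


ignored = ["p_id", "sales.orders.*"]
print(is_ignored("public", "people", "P_ID", ignored))
print(is_ignored("public", "people", "parent_p_id", ignored))
print(is_ignored("sales", "orders", "notes", ignored))
print(is_ignored("sales", "orders_archive", "notes", ignored))
```

Unlike custom keywords, a plain ignored keyword such as p_id does not match longer names that merely contain it.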

Appendix: Built-in data discovery keywords

DataMasque uses regular expressions to search for columns which may contain the following sensitive information:

Category PII
- Name / first name / middle name / last name / surname / fName / mName / lName
- Fax number
- Mail / email
- Date of birth / DoB
- SSN / Social Security Number
- Address
- Post code
- Phone
- Insurance number
- Passport number
- Driver license number
- Country / state / city / zip code
- Gender
- Age
- Vehicle identification number / VIN
- Login
- Media access control / MAC
- Job position / role / title
- Workspace / company
- NRIC / Identity Card Number
- IC number
- ID number
- IRD number / Inland Revenue Department number
- NINO
- Unique taxpayer reference / UTR
- Identity / identification / tax number / ID
- Internet protocol address / IP address
- Licence plate
- Licence number
- Certificate number
- Identifiers / serial number
Category PCI
- Credit / payment / debit card
- Credit / payment / debit number
- Account number
- Security code
- Expiry date
- Name / first name / middle name / last name / surname / fName / mName / lName
- PIN / Personal identification numbers
- CVV / Card Verification Value
- Address
- Country / state / city / zip code
Category PHI
- PHI number
- NHI number
- Medical record number
- Insurance number
- Internet protocol address / IP address
- Name / first name / middle name / last name / surname / fName / mName / lName
- Health plan beneficiary number
- Identifiers / serial number
- Identifying number / code
- Licence plate
- Licence number
- Certificate number
- Address
- Country / state / city / zip code