Schema Discovery

Overview
Methodology
Reporting
- Per-run data discovery report

Overview

DataMasque can be configured to automatically discover data in your databases during masking. When a masking ruleset contains the special-purpose run_schema_discovery task type, DataMasque inspects the database metadata and generates a discovery report for each masking run. These discovery reports are accessible via the Ruleset Generator page.

Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.

Methodology

To perform schema discovery, DataMasque identifies all tables and columns under the default schema of a connection. Each column is then checked for potentially sensitive information and flagged accordingly.

You can also specify a list of schemas to discover instead of the default schema. See the ruleset generator documentation for more information.

DataMasque comes with over 90 built-in keywords to help discover various types of sensitive data (account numbers, addresses, etc.) in your database. Each keyword is classified into one or more categories. The included data classification categories are:

- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Payment Card Information (PCI)
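
The keyword-based flagging described above can be illustrated with a minimal Python sketch. This is not DataMasque's implementation: the `KEYWORDS` table, the `normalise` and `classify_column` helpers, and the substring-matching rule are all assumptions made for illustration, and the keywords shown are not the product's built-in list.

```python
import re

# Hypothetical keyword table: these entries are illustrative only and are
# not DataMasque's actual built-in keyword list.
KEYWORDS = {
    "account_number": {"PII", "PCI"},
    "address": {"PII"},
    "diagnosis": {"PHI"},
    "card_number": {"PCI"},
}


def normalise(name: str) -> str:
    """Lower-case a column name and collapse separators to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def classify_column(column_name: str) -> dict[str, set[str]]:
    """Return {matched keyword: categories} for a single column name."""
    normalised = normalise(column_name)
    return {
        keyword: categories
        for keyword, categories in KEYWORDS.items()
        if keyword in normalised  # assumed rule: simple substring match
    }
```

In this sketch a keyword can carry more than one category, which mirrors the statement that each keyword is classified into one or more categories. A column whose name matches no keyword yields an empty result and would not be flagged.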

Reporting

Per-run data discovery report

After a run_schema_discovery task has completed, the results can then be viewed in the ruleset generator.

The ruleset generator will display the following columns:


*Table schema*	The schema of the table containing a sensitive data match.
*Table name*	The name of the table containing a sensitive data match.
*Column name*	The name of the column that matched a commonly used sensitive data identifier.
*Constraint*	Whether the column is a Primary or Unique key. The columns that make up the constraint are listed in parentheses.
*Data Type*	The data type of the column in the database metadata.
*Max Length*	If the column is a text field, the maximum length of the column; otherwise an empty string.
*Numeric Precision*	If the column is a numeric field, the numeric precision of the column; otherwise an empty string.
*Reason for flag*	Description of the pattern that caused the column to be flagged as sensitive data.
*Data classifications*	A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (user-specified custom keywords).
*Foreign Keys*	List of foreign keys relating to the column, with details of each foreign key (FK_Name, Referenced Column).

Note:

Schema discovery reports include primary and unique keys, including composite primary and unique keys.

To find them, DataMasque queries the relevant system tables by constraint name to retrieve all columns under each constraint, in order.

This information is needed to populate the task key when generating mask_table tasks.
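
As a rough illustration of why column order matters for composite keys, the sketch below groups constraint rows into ordered column lists. The row shape (constraint name, column name, ordinal position) is an assumption loosely modelled on what views such as `information_schema.key_column_usage` expose; the `group_key_columns` helper and the sample constraint names are hypothetical and do not reflect the queries DataMasque actually runs.

```python
from collections import defaultdict


def group_key_columns(rows):
    """Group (constraint_name, column_name, ordinal_position) rows into
    {constraint_name: [column names in constraint order]}."""
    grouped = defaultdict(list)
    for constraint_name, column_name, position in rows:
        grouped[constraint_name].append((position, column_name))
    return {
        # Sorting by ordinal position restores the declared column order
        name: [column for _, column in sorted(members)]
        for name, members in grouped.items()
    }


# Hypothetical rows, deliberately out of order
rows = [
    ("pk_order_items", "line_no", 2),
    ("pk_order_items", "order_id", 1),
    ("uq_customers_email", "email", 1),
]
keys = group_key_columns(rows)
```

Here `keys["pk_order_items"]` lists `order_id` before `line_no`, regardless of the order in which the rows were returned.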

Note for Oracle: Schema discovery reports will only cover the tables owned by the user or schema defined in the connection. The schema takes precedence over the user.

Note for Microsoft SQL Server: Schema discovery reports will only cover the tables owned by the user's default schema.

Note for PostgreSQL:

Schema discovery reports will only cover the tables visible in the current user's search path.

A table is said to be visible if its containing schema is in the search path and no table of the same name appears earlier in the search path.
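
The visibility rule can be sketched in a few lines of Python. This is a simplified model for illustration only: the `visible_tables` helper and the sample schema names are hypothetical, and the sketch ignores implicit search path entries such as `pg_catalog`.

```python
def visible_tables(search_path, tables_by_schema):
    """Map each visible table name to the schema it resolves to."""
    visible = {}
    for schema in search_path:
        for table in tables_by_schema.get(schema, []):
            # The first schema in the search path that has the name wins;
            # later tables with the same name are shadowed.
            visible.setdefault(table, schema)
    return visible


# Hypothetical database layout
tables_by_schema = {
    "app": ["users"],
    "public": ["users", "orders"],
    "archive": ["old_orders"],
}
result = visible_tables(["app", "public"], tables_by_schema)
```

In this example `public.users` is shadowed by `app.users`, and `archive.old_orders` is not visible because `archive` is not in the search path, so neither would be covered by a discovery report under this model.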

Note for Redshift:

Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. See Defining table constraints for additional information about how Amazon Redshift uses constraints.