Schema Discovery
Overview
DataMasque can be configured to automatically discover data in your databases during masking. When a masking ruleset contains the special-purpose `run_schema_discovery` task type, DataMasque will inspect the database metadata and generate a discovery report for each masking run. The discovery reports produced are accessible via the Ruleset Generator page.
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
Methodology
To perform schema discovery, DataMasque discovers all tables and columns under the default schema of a connection. Once all columns have been identified, they are checked for potentially sensitive information and flagged accordingly.
Instead of the default schema, you can also specify a list of schemas to discover. See the ruleset generator documentation for more information.
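As a rough illustration of the metadata inspection involved, the following query lists every column under one schema using the standard `information_schema` views. This is a sketch only, not DataMasque's internal query; the schema name `public` is an assumed PostgreSQL default.

```sql
-- Illustrative sketch: list all tables and columns under one schema,
-- similar in spirit to the metadata pass that schema discovery performs.
-- 'public' is an assumed default schema; adjust for your database.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```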
DataMasque comes with over 90 built-in keywords to help discover various types of sensitive data (account numbers, addresses, etc.) in your database. Each pattern is classified into one or more categories. The included data classification categories are:
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Payment Card Information (PCI)
Limits
Running schema discovery against very large databases can exhaust the RAM of the computer or virtual machine running DataMasque. This can cause the operating system to terminate DataMasque, meaning schema discovery will fail to record any results.
The following provides an approximate guide to database size limits for schema discovery, based on an Ubuntu VM where DataMasque is the only application running and DataMasque is not running any other tasks at the same time as schema discovery.
Total columns
As of v2.22, the number of total columns that can be discovered is very high (several million). Discovery has been successfully tested on a database with over 2 million total columns. Aside from any practical limitations of executing discovery and ruleset generation on large database schemas, the primary limiting factor is the disk space required to store the results, the discovery report CSV, and any generated rulesets. The total required disk space is approximately 1KB per column; for example, a schema with 2 million columns requires roughly 2GB.
Discovery results accumulate in DataMasque's internal database; rerunning discovery on a connection does not delete the previous set of results. If you need to delete previous discovery results to free up disk space, you can use the `DELETE /runs/` API endpoint.
Ruleset generation on databases with over 1,000 total columns can only be done using the API. Refer to Discovery and Ruleset Generation for Large Schemas for more details.
In-Data Discovery samples
If you enable In-Data Discovery (IDD), then for each table in the schema(s) (databases for MySQL/MariaDB) targeted for discovery, DataMasque will load a sample of data from that table into memory. Samples of data to be analysed by IDD are loaded into memory one table at a time, so only the table with the largest data sample (in bytes) determines peak memory usage.
The amount of data to be analysed depends on the number of sample rows you specify in the IDD settings. The default is a sample of 1,000 rows.
To calculate the memory used by the sample data for IDD, follow these steps.
- Consider all the tables in the target schema(s) (databases for MySQL/MariaDB), looking for columns that potentially contain large amounts of data, such as `blob`, `text` or large `varchar` columns. Smaller data types such as `int` or `bool` use minimal amounts of memory and need not be considered here.
- For each table, determine an estimate for the maximum possible sample size for that table, based on your chosen number of sample rows.
Hint: For a rough estimate, you can multiply the maximum data size of the large column(s) by the number of sample rows. For example, a column of type `text` has a maximum size of 64KB, so a table with two `text` columns and one `varchar(32768)` (i.e. 32KB) column would have a maximum size of 160KB per row, or approximately 160MB across 1,000 rows.
A query like `SELECT sum(l) FROM (SELECT length(my_big_column) AS l FROM my_table ORDER BY l DESC LIMIT 1000) AS temp` can get you a more precise estimate for the sample size. Replace the `1000` in the `LIMIT` clause with your chosen number of sample rows for IDD. The result of this query will be in bytes. The above query is for a single column; if there is more than one column that could contain large amounts of data, repeat the query for each such column in the table and add the results to get an estimate of the sample size for that table (see the sketch after this list for a combined variant).
If you have a lot of rows in the table, this query could be very slow. In this case you might want to try a random sample query (for example using `TABLESAMPLE SYSTEM` on Postgres, omitting the `ORDER BY` clause).
- Take the largest of these per-table estimates.
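The following sketches expand on the hint in step 2. The table name `my_table` and columns `notes_text` and `attachment_blob` are hypothetical placeholders; adjust the names and the `LIMIT` to match your schema and chosen sample-row count.

```sql
-- Combined variant of the estimate query: sum the lengths of all large
-- columns per row, then total the 1,000 largest rows. coalesce() treats
-- NULL values as zero-length so they don't void the per-row total.
SELECT sum(l) AS sample_bytes
FROM (
    SELECT coalesce(length(notes_text), 0)
         + coalesce(length(attachment_blob), 0) AS l
    FROM my_table
    ORDER BY l DESC
    LIMIT 1000
) AS temp;
```

On very large tables, a random-sample variant avoids sorting the whole table, at the cost of giving an average-based rather than worst-case figure:

```sql
-- Faster Postgres variant: TABLESAMPLE SYSTEM (1) reads roughly 1% of the
-- table's pages instead of scanning everything. Tune the percentage and
-- LIMIT to suit your table size and chosen sample-row count.
SELECT sum(l) AS approx_sample_bytes
FROM (
    SELECT coalesce(length(notes_text), 0) AS l
    FROM my_table TABLESAMPLE SYSTEM (1)
    LIMIT 1000
) AS temp;
```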
If the largest table estimate is less than the Maximum IDD sample size value in the table below then you should be able to use IDD without any out-of-memory issues.
| Machine RAM | Maximum IDD sample size |
|---|---|
| 8GB | 2GB |
| 16GB | 8GB |
| 32GB | 20GB |
Note that use of IDD does not impact the maximum total number of columns DataMasque can discover on a given machine.
Working around memory limits
If you encounter out-of-memory issues when running the Ruleset Generator, try the following:
- If you are running schema discovery against multiple schemas (databases for MySQL/MariaDB), try running schema discovery against each schema separately, then optionally combine the rulesets later.
- Use ignored keywords to filter out some columns you don't need to mask.
Reporting
Per-run data discovery report
After a `run_schema_discovery` task has completed, the results can be viewed in the ruleset generator.
The ruleset generator will display the following columns:
| Column | Description |
|---|---|
| Table schema | The schema of the table discovered in the connection. |
| Table name | The name of the table discovered. |
| Column name | The name of the column that matched a commonly used sensitive data identifier. |
| Constraint | Whether the column is part of a primary or unique key. The columns covered by the constraint are listed in parentheses. |
| Data Type | The data type of the column in the database metadata. |
| Max Length | If the column is a text field, the maximum length of the column; otherwise empty. |
| Numeric Precision | If the column is a numeric field, the numeric precision of the column; otherwise empty. |
| Reason for flag | Description of the pattern that caused the column to be flagged as sensitive data. |
| Flagged by | Whether the column was flagged through in-data discovery or through the standard sensitive data discovery / keyword matching process. |
| Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (user-specified custom keywords). |
| Foreign Keys | List of foreign keys relating to the column, with details of each foreign key (FK_Name, Referenced Column). |
Note:
- Schema discovery reports will discover primary and unique keys, including composite primary and unique keys.
- This is done by querying the relevant system tables with the constraint name to get all columns under the constraint, in order.
- This is needed for the task `key` when generating `mask_table` tasks.
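For reference, a query along these lines retrieves the ordered columns of a constraint using the standard `information_schema` views. It is illustrative only, not necessarily the exact query DataMasque runs; `my_table` is a hypothetical placeholder.

```sql
-- Illustrative only: list the columns of a table's primary key in order,
-- using standard information_schema views (PostgreSQL, SQL Server, MySQL).
SELECT kcu.constraint_name, kcu.column_name, kcu.ordinal_position
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON kcu.constraint_name = tc.constraint_name
 AND kcu.constraint_schema = tc.constraint_schema
 AND kcu.table_name = tc.table_name
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_name = 'my_table'   -- placeholder: replace with your table
ORDER BY kcu.ordinal_position;
```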
Note for Oracle: Schema discovery reports will only cover the tables owned by the user or schema as defined in the Connection settings. The schema takes precedence over the user.
Note for Microsoft SQL Server: Schema discovery reports will only cover the tables owned by the user's default schema.
Note for PostgreSQL:
- Schema discovery reports will only cover the tables visible in the current user's search path.
- A table is visible if its containing schema is in the search path and no table of the same name appears earlier in the search path.
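To check visibility on PostgreSQL, you can inspect the search path and test a specific table, for example (the table name `my_table` is a hypothetical placeholder):

```sql
-- Show the schemas on the current search path.
SHOW search_path;

-- Returns true if 'my_table' is visible on the current search path.
SELECT pg_table_is_visible('my_table'::regclass);
```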
Note for Redshift:
- Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. See Defining table constraints for additional information about how Amazon Redshift uses constraints.