DataMasque can be configured to automatically discover data in your databases during
masking. When a masking ruleset contains the special purpose
run_schema_discovery task type, DataMasque
will inspect the database metadata and generate a discovery report
for each masking run. Discovery reports produced are accessible via the Ruleset Generator page.
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server) databases.
To perform schema discovery, DataMasque discovers all tables and columns under the default schema of a connection. Once all columns have been identified they are then checked for potential sensitive information then flagged appropriately.
Instead of the default schema, you can also specify a list of schemas to discover. See the ruleset generator documentation for more information.
DataMasque comes with over 90 built-in keywords to help discover various types of sensitive data (account numbers, addresses, etc.) in your database. Each pattern is classified into one or more categories. The included data classification categories are:
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Payment Card Information (PCI)
Running schema discovery against very large databases can exhaust the RAM of the computer or virtual machine running DataMasque. This can cause the operating system to terminate DataMasque, meaning schema discovery will fail to record any results.
The following provides an approximate guide to database size limits for schema discovery, based on an Ubuntu VM where DataMasque is the only application running and DataMasque is not running any other tasks at the same time as schema discovery.
The table below states the approximate limit for the number of columns across all tables in the schema(s) (or databases for MySQL/MariaDB) targeted for discovery. For example, if your schema has 1,000 tables each with 30 columns, that is 30,000 total columns.
|Maximum total columns
In-Data Discovery samples
If you enable In-Data Discovery (IDD), then for each table in the schema(s) (databases for MySQL/MariaDB) targeted for discovery, DataMasque will load a sample of data from that table into memory. Samples of data to be analysed by IDD are loaded into memory one table at a time, so only the table that has the largest (in bytes) data sample matters.
The amount of data to be analysed depends on the number of sample rows you specify in the IDD settings. The default is a sample of 1,000 rows.
To calculate the memory used by the sample data for IDD, follow these steps.
- Consider all the tables in the target schema(s) (databases for MySQL/MariaDB),
looking for columns that potentially contain large amounts of data, such as
varcharcolumns. Smaller data types such as
booluse minimal amounts of memory and need not be considered here.
- For each table, determine an estimate for the maximum possible sample size for that table, based on your chosen number of sample rows.
Hint: For a rough estimate, you can multiply the maximum data size of the large column(s) by the number of sample rows. For example, a column of type
texthas a maximum size of 64KB, so a table with two
textcolumns and one
varchar(32768)(i.e. 32KB) column would have a maximum size of 160KB per row, or approximately 160MB across 1,000 rows.
A query like
SELECT sum(l) FROM (SELECT length(my_big_column) AS l FROM my_table ORDER BY l DESC LIMIT 1000) AS tempcan get you a more precise estimate for the sample size. Replace the
LIMITclause with your chosen number of sample rows for IDD. The result of this query will be in bytes.
The above query is for a single column; if there is more than one column that could contain large amounts of data, repeat the query for each such column in the table and add the results to get an estimate of the sample size for that table.
If you have a lot of rows in the table, this query could be very slow. In this case you might want to try a random sample query (for example using
TABLESAMPLE systemon Postgres, omitting the
- Take the largest of these per-table estimates.
If the largest table estimate is less than the Maximum IDD sample size value in the table below then you should be able to use IDD without any out-of-memory issues.
|Maximum IDD sample size
Note that use of IDD does not impact the maximum total number of columns DataMasque can discover on a given machine.
Working around memory limits
If you encounter out-of-memory issues when running the Ruleset Generator, try the following:
- If you are running schema discovery against multiple schemas (databases for MySQL/MariaDB), try running schema discovery against each schema separately, then optionally combine the rulesets later.
- Use ignored keywords to filter out some columns you don't need to mask.
Per-run data discovery report
The ruleset generator will display the following columns:
|The schema of the table discovered in the connection.
|The name of the table discovered.
|The name of the column which has matched against a commonly used sensitive data identifier.
|Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present.
|The data type of the column in the database metadata.
|If the column is a text field, this contains the max length of the column, otherwise empty string.
|If the column is a numeric field, this contains the numeric precision of the column, otherwise empty string.
|Reason for flag
|Description of pattern which caused the column to be flagged for sensitive data.
|Whether the column was flagged for sensitive data through in-data discovery or through the standard sensitive data discovery / keyword matching process.
|A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords).
|List of foreign keys relating to the column, with information of the foreign key (FK_Name, Referenced Column).
- Schema discovery reports will discover primary and unique keys including composite primary and unique keys.
- This is done by querying the relevant system tables with the constraint name to get all columns under the constraint in order.
- This is needed for the task
Note for Oracle: Schema discovery reports will only cover the tables owned by the user or schema as defined in Connection. Schema will take precedence over user.
Note for Microsoft SQL Server: Schema discovery reports will only cover the tables owned by the user's default schema.
Note for PostgreSQL:
- Schema discovery reports will only cover the visible tables in the current user search path.
- A table is said to be visible if it's containing schema is in the search path and no table of the same name appears earlier in the search path.
Note for Redshift:
- Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. See Defining table constraints for additional information about how Amazon Redshift uses constraints.