Masking Tutorial
- 1. Introduction
- 2. Creating a connection
- 3. Uploading a seed file
- 4. Creating a ruleset
- 5. Starting a masking run
- 6. Next steps
1. Introduction
This tutorial will guide you through the process of:
- Configuring a Connection to a target database
- Building a simple Ruleset to mask user details
- Executing a masking Run against the target database
Prerequisites
To complete this tutorial, you need a target database server (either Oracle or Microsoft SQL Server) with an empty schema already created. Download and run one of the following scripts to initialise the schema with a users table and some sample user data.
This tutorial uses a seed file tutorial_names.csv as the data source for replacement name values. Download this file now for use later on.
Masking strategy
After running the database init script from the previous section, you will have a users table with four columns. The table below describes the strategy that will be used to mask each of these columns:
Column | Strategy |
---|---|
user_id | N/A (non-sensitive data) |
date_of_birth | Replace with a randomly generated date. |
first_name | Replace with a random first name chosen from the seed file. |
last_name | Replace with a random last name chosen from the seed file. |
2. Creating a connection
After logging in to DataMasque, you will be taken to the Dashboard:
Click the button on the Connections panel. You will be taken to the 'Add Connection' form:
Complete the form with the parameters and credentials for the database that you prepared in the Prerequisites section. Choose a meaningful connection name so that you can identify the target database at a glance. In this example, the connection is named datamasque_tutorial. If you are unsure what value to use for any of the parameters, refer to the Connections reference.
Click the TEST CONNECTION button to validate that DataMasque can connect to the target database. Once you have confirmed that your connection works, click SAVE AND EXIT to complete the connection setup. Your new connection will now be available in the Connections list on the Dashboard.
3. Uploading a seed file
Seed files provide datasets of replacement values for DataMasque to use when masking. Seed files must be CSV formatted and include a header row. In this tutorial, the file tutorial_names.csv will be used to provide DataMasque with a dataset of replacement names for the users table. The file contains two columns: first_name and last_name.
Open the sidebar navigation menu by clicking the menu icon at the top left of the screen and navigate to the Files page. A list of all available files will be displayed:
Click the button to open the file upload dialog. Now click Browse and locate the tutorial_names.csv file that you have downloaded previously. You may also provide a short description for your file. Click SUBMIT to complete the file upload:
The tutorial_names.csv file will appear in the files list:
Return to the Dashboard using the sidebar navigation menu.
4. Creating a ruleset
A ruleset is the configuration that defines the tasks and masking logic that will be applied by DataMasque to a target database during a masking run. Rulesets are created and edited using the ruleset editor, and are written in DataMasque's YAML-based ruleset configuration language. A complete reference is available in the Ruleset Specification user guide.
To create a new ruleset, click the button on the Rulesets panel of the Dashboard. You will be taken to the Ruleset Generator. The Ruleset Generator can automatically generate a ruleset from your database's schema, and is the recommended way to get started with DataMasque.
To create a new empty ruleset, without using the generator, click Skip to YAML Editor, which will take you to the Ruleset Editor (shown below).
General setup
Replace the ruleset name with a descriptive name for the ruleset. This name will be used to identify the ruleset from the Dashboard. In this example, the name user_table_mask is used.
The target database has a single table that requires masking, so the ruleset will contain a single task of type mask_table.
Update the value of table to match the users table created by the database init script in the Prerequisites section.
The mask_table task type also requires the name of a key column which uniquely identifies each row in the database table. Multiple column names may be provided in an array to form a composite key. On the users table, each row is uniquely identified by the user_id column:
version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: REPLACE_ME
        masks:
          - type: from_fixed
            value: REPLACE_ME
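The users table needs only a single key column, but the composite-key form mentioned above might look like the following minimal sketch. The table order_items and the columns order_id and line_number are hypothetical and are not part of this tutorial's schema. The sketch also assumes that the array is written as a standard YAML sequence. The rule is left as the same placeholder used in the template.

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: order_items      # hypothetical table, not created by the tutorial init script
    key:                    # composite key: both columns together identify a row
      - order_id            # hypothetical column
      - line_number         # hypothetical column
    rules:
      - column: REPLACE_ME
        masks:
          - type: from_fixed
            value: REPLACE_ME
```

If in doubt about the accepted key syntax, check the Ruleset Specification user guide before relying on this form.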
Masking rules
date_of_birth
The first masking rule of this ruleset will be applied to the date_of_birth column. Replace the placeholder value for column with date_of_birth.
The desired strategy for masking the date_of_birth column is to replace all values with a new randomly generated date. This can be achieved with the from_random_date mask. Replace the placeholder from_fixed mask type with from_random_date. The from_random_date mask requires min and max parameters, which take the place of the placeholder value parameter. In this example, each user's date_of_birth will be a randomly chosen date between 1st January 1950 and 31st December 2000.
version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: date_of_birth
        masks:
          - type: from_random_date
            min: '1950-01-01'
            max: '2000-12-31'
first_name
The desired strategy for masking the first_name column is to replace all values with a new name chosen randomly from the tutorial_names.csv file uploaded previously.
To achieve this, we will add a second masking rule to the ruleset targeting the first_name column. The from_file mask type is used to randomly choose replacement values from a seed file. The seed_file parameter specifies the name of the seed file to use as the data source, and the seed_column parameter specifies the name of the column within that seed file (as determined by the CSV header row) from which values will be sourced:
version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: date_of_birth
        masks:
          - type: from_random_date
            min: '1950-01-01'
            max: '2000-12-31'
      - column: first_name
        masks:
          - type: from_file
            seed_file: tutorial_names.csv
            seed_column: first_name
last_name
The strategy for masking the last_name column is nearly identical to that for the first_name column. Add a third masking rule to the ruleset to use the last_name column from the same tutorial_names.csv seed file as a data source for randomly chosen last names:
version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: date_of_birth
        masks:
          - type: from_random_date
            min: '1950-01-01'
            max: '2000-12-31'
      - column: first_name
        masks:
          - type: from_file
            seed_file: tutorial_names.csv
            seed_column: first_name
      - column: last_name
        masks:
          - type: from_file
            seed_file: tutorial_names.csv
            seed_column: last_name
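This tutorial needs only one task because a single table requires masking. As a rough sketch, and assuming the tasks list accepts further mask_table entries written in the same form, a second table could be covered by appending another task. The customers table and its customer_id and surname columns are hypothetical and do not exist in the tutorial schema. The users task is shortened to one rule here for brevity.

```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: first_name
        masks:
          - type: from_file
            seed_file: tutorial_names.csv
            seed_column: first_name
  # Hypothetical second task: "customers", "customer_id" and "surname" are assumed names
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: surname
        masks:
          - type: from_file
            seed_file: tutorial_names.csv
            seed_column: last_name
```

The ruleset built in this tutorial does not need this second task, so leave it out when following the remaining steps.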
You have just finished building your first ruleset. Click SAVE AND EXIT to save the ruleset and return to the Dashboard:
5. Starting a masking run
We will now apply data masking to the target database. From the Dashboard, select the Connection and Ruleset created previously. Run options may be left at their default values. Click the PREVIEW RUN button, which will take you to a confirmation screen for the masking run:
After verifying that the run configuration is correct, click the START RUN button to start your masking run:
You will be taken to the Run Logs page, where you can monitor the progress of the masking run from a stream of log messages from the masking worker. The run status will update to 'finished' on completion of masking.
Congratulations! You have successfully masked your first database with DataMasque. Try querying the users table to verify that the values have been masked as you expected.
6. Next steps
- Familiarise yourself with the Ruleset Specification guide to learn how to implement more complex data masking strategies.
- Review the DataMasque Best Practices guide for some tips on getting the most from DataMasque.