DataMasque Portal

Best Practices

Database configuration

The performance of DataMasque masking is largely dependent on the performance of the target database. A poorly tuned target database will have a direct negative impact on the execution time of your masking runs. To achieve optimal masking performance, it is important to tune the target database and tables - as you would for application transaction workloads.

The following sections include recommendations for improving the performance of your target databases during masking.

Transaction logging

Data masking is a destructive operation and is performed on copies of production data. Consequently, database recovery is not a critical operational requirement on masking targets. It is therefore recommended to disable transaction logging at the start of data masking runs to improve SQL execution by minimising database I/O.

  • For Oracle databases, enable NOARCHIVELOG mode.
  • For Microsoft SQL databases, enable SIMPLE recovery mode.

Warning: After masking is complete, any transaction logs on the target database should be manually purged. Lingering transaction logs may contain unmasked data in plain text.

Tip: Use a run_sql task before your mask_table tasks to manage the logging configuration of your target database.

Query optimisation

For Oracle databases, use ROWID as the key for mask_table tasks. ROWID values are the fastest way to lookup rows. However, you should not delete or reinsert rows to the same table during masking runs when using ROWID as key, as this may cause Oracle to reassign ROWID values.

Sensitive data discovery

To ensure complete, ongoing protection of sensitive data in your databases, it is recommended to make use of the Sensitive Data Discovery features provided by DataMasque. Include a run_data_discovery task in every masking ruleset, and ensure that appropriate users have opted-in to be notified when new sensitive data is discovered by DataMasque.

Encrypted connections

It is recommended to use encrypted database connections with DataMasque whenever possible. For more information on configuring encryption for your database connections, see the Connections reference guide.

Network and I/O

DataMasque uses remote database connections to support databases running in on-prem and public cloud environments. This means that masking requires data to be transmitted across networks between the database server and DataMasque.

Slow networks with numerous hops between the database server and the DataMasque server can degrade performance. For best performance it is recommended to co-locate your DataMasque instance on the same network as your target databases.

Ways to improve network throughput include:

  • Use a minimum 1 Gigabit Ethernet (1GbE) connection between DataMasque and target database servers.
  • Ensure latency and hops from DataMasque to target database servers are as low as possible.

Health monitoring

DataMasque provides a health check API endpoint to facilitate simple integration into your existing application monitoring tooling. To learn more about using this endpoint, see the API documentation for authentication and the health check API endpoint.

Data protection

For dedicated DataMasque instances deployed on virtual machines, it is recommended to take regular virtual machine snapshots/backups.

It is also recommended to periodically save copies of your Run Logs, as well as Ruleset and Connection configurations.

Adequate security measures should also be in place to protect the DataMasque server and backed-up files. Knowledge of masking rulesets may aid an attacker's attempts to derive information from masked data, and certain secrets may enable an attacker to reproduce deterministic masking operations (See: Freezing random values).

Consistent masking

While the most secure approach to data masking (and the default behaviour within DataMasque) is to generate masked values independently in each masking run, some use cases will require consistent masking across multiple runs (facilitated by deterministic masking features). If you wish to make the masked values deterministic from the same input values, and for the same deterministic values to be generated across multiple masking runs, you can specify a run_secret in the run options while using a hash_columns list in a rule.

Important! It is critical to store run_secret securely and rotate it as frequently as possible as it is used to influence the internal hashing algorithm to produce repeatable and consistent masked values.

By default, DataMasque ties this deterministic behaviour to a specific DataMasque application instance, so the exact same ruleset and run_secret will yield different deterministic values if transferred to a different DataMasque instance, making it more difficult to identify masked values for a given hash input. The admin user may rotate the instance secret that determines the masking behaviour for the DataMasque instance from the Settings page. After rotating the instance secret, subsequent deterministic masking will be based on a new instance secret, and existing rulesets will produce different masking results. For added security, the instance secret cannot be viewed or reverted to a previous value, so please ensure you no longer require consistency with earlier masking runs before rotating the instance secret.

Example

This example ruleset will mask the date_of_birth column with a date value that has been deterministically generated based on the hash of date_of_birth and first_name column values combined with the run's run_secret.

For example, in every row where date_of_birth = '2000-01-01' and first_name = 'Carl', the date_of_birth will be replaced with a deterministically generated value such as 1986-05-11. This same replacement value will be generated during each repeated masking run, including masking across different data sources (such as Oracle and Microsoft SQL Server databases).

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: user_id
    rules:
      - column: date_of_birth
        hash_columns:
          - date_of_birth
          - first_name
        masks:
          - type: from_random_datetime
            min: '1980-01-01'
            max: '2000-01-01'

Show result

Before After
date_of_birth first_name
2000-01-01 Carl
1965-11-10 Ria
2000-01-01 Carl
2000-01-01 Jose
1990-05-31 Thomas
1999-07-31 Nicole
date_of_birth first_name
1986-05-11 Carl
1970-10-09 Ria
1986-05-11 Carl
1973-04-18 Jose
1961-02-26 Thomas
1992-04-30 Nicole

Different use cases will require consistency of masking across different scopes (e.g. across databases, across time periods, across DataMasque instances). We recommend the following approaches to address various consistent masking requirements:

We recommend the following approaches to achieving masking consistency across databases, time, and DataMasque instances:

  • For consistent masking across multiple masking runs (which may target different databases), specify the same run_secret in the options of each run. If you are starting runs via the API, you can leave the run_secret unset to allow it to be automatically generated, and then use the value returned in the API response when starting subsequent masking runs. You should discard the run_secret as soon as you have started all of the runs that you require to be consistent, as prolonged storage of the run_secret risks it being acquired by an attacker.
  • For consistent masking over a longer period of time (e.g. all masking runs over the course of a week), you may need to store the run_secret for re-use when starting each of these runs. For these cases, the run_secret should be stored securely and rotated as frequently as your use case allows.
  • For consistent masking across DataMasque instances, you can disable the instance secret in the options for a specific run. Runs performed on different DataMasque instances (of the same version) with the instance secret disabled and with the same run_secret will have consistent masking. However, disabling the instance secret will mean that an attacker with access to the masking ruleset and run_secret could potentially reproduce your masking results on a different DataMasque instance. Therefore, you should ensure the run_secret is shared securely between the DataMasque instances and ensure that the run_secret is regularly rotated.

Host security

It is recommended to deploy DataMasque on a dedicated VM/server and apply the standard security practices of your organisation to the host VM/server. Such best practices include, but are not limited to:

  • Use of a network firewall
  • Strict access control to the host OS
  • Regular OS security patching
  • Host filesystem encryption
  • Intrusion detection
  • Virus scanning

Automation

Masking runs can be triggered programmatically via the DataMasque API.

curl commands

The Preview and Confirm Run screen provides an example curl command that can be used to execute a masking run with the same configuration. To view the command, click the View Run Command button.

Note: curl will not accept DataMasque's default self-signed SSL certificate unless it is manually installed on your client machine. In the case where you do not have your own SSL certificate installed, you will need to check the 'Disable SSL validation' checkbox to include the --insecure flag in the generated curl command.

Preview run

Authentication

The DataMasque API uses token-based authentication. You can get your token from the My Account screen:

Account Information

You must include the token in the Authorization header of all API requests:

GET /runs/123
Authorization: Token <your-api-token>
Accept: application/json

Endpoints

The following API endpoints are available:

Method URL Description
GET /runs/ Get a list of all masking runs
POST /runs/ Create and start a masking run
GET /runs/{id}/ Get the current status of a masking run
GET /runs/{id}/logs/ Get the logs for a masking run
POST /runs/{id}/cancel/ Cancel a masking run

For more detailed information on these endpoints please refer to the API Reference.

UUIDs

Runs can be triggered with the API using any existing connection and ruleset specified by UUID. You can find the UUID for any connection or ruleset by opening it for editing from the Dashboard. The UUID is located at the top of the page:

Connection UUID

Ruleset UUID

Example script

Assuming the corresponding connection and ruleset have been configured in DataMasque, the following script will schedule a masking run and wait until it reaches a finalised status.

Environment Variables
DATAMASQUE_HOST The URL of your DataMasque instance (including port)
DATAMASQUE_API_TOKEN Your DataMasque API token (see API Reference)
CONNECTION_UUID The UUID of the database connection that masking will be applied to
RULESET_UUID The UUID of the masking ruleset that will be used in the masking run
RUN_ID=$(curl $DATAMASQUE_HOST'/api/runs/' \
    -H 'Accept: application/json, text/plain, */*' \
    -H 'Authorization: Token '$DATAMASQUE_API_TOKEN \
    -H 'Content-Type: application/json;charset=UTF-8' \
    --data '{"name":"api_masking_run","connection":"'$CONNECTION_UUID'","ruleset":"'$RULESET_UUID'","options":{"dry_run":false,"batch_size":"50000"}}' \
    --compressed -k | jq -r '.id')
echo $(date -Iseconds): RUN_ID $RUN_ID created!
STATUS=''
spin='-\|/'
index=0
while [[ $STATUS != 'finished' &&  $STATUS != 'failed' &&  $STATUS != 'cancelled']]; do
  STATUS=$(curl -s $DATAMASQUE_HOST'/api/runs/'$RUN_ID'/'\
        -H 'Accept: application/json, text/plain, */*'\
        -H 'Authorization: Token '$DATAMASQUE_API_TOKEN\
        -k\
        | jq -r '.status');
  index=$(( ($index+1) %4 ))
  printf "\r${spin:$index:1}"
  sleep 1;
done
printf "\r$(date -Iseconds): RUN_ID $RUN_ID $STATUS!"

Or alternatively use following Python script (with requests library) to schedule a masking run and wait until it reaches a finalised status:

import requests
import json
import datetime
import sys
import time

DATAMASQUE_HOST = 'https://datamasque-app-host-name'
DATAMASQUE_API_TOKEN = 'datamasque-api-token'
CONNECTION_UUID = 'connection-uuid'
RULESET_UUID = 'ruleset-uuid'


def spinning_cursor():
    while True:
        for cursor in '|/-\\':
            yield cursor


run_payload = json.dumps({
  "name": "api_masking_run",
  "connection": CONNECTION_UUID,
  "ruleset": RULESET_UUID,
  "options": {
    "batch_size": "50000",
    "dry_run": False
  }
})
headers = {
  'Authorization': 'Token ' + DATAMASQUE_API_TOKEN,
  'Content-Type': 'application/json'
}
response = requests.request("POST", DATAMASQUE_HOST + '/api/runs/', headers=headers, data=run_payload)
response.raise_for_status()
run_id = response.json()['id']
print(datetime.datetime.now(), 'RUN ID {} created'.format(run_id))

status = ''
spinner = spinning_cursor()
while status not in ['finished', 'failed', 'cancelled']:
    sys.stdout.write(next(spinner))
    sys.stdout.flush()
    time.sleep(1)
    sys.stdout.write('\b')    
    response = requests.request("GET", '{}/api/runs/{}/'.format(DATAMASQUE_HOST, run_id), headers=headers)
    response.raise_for_status()
    status = response.json()['status']

print(datetime.datetime.now(), 'RUN ID {} {}!'.format(run_id, status))