Unique Masks
Unique masks are masks that will ensure every value is replaced with a unique value for that column. For other masks see all mask functions.
- From Unique (
from_unique
)
Generates random but unique strings or numbers, from a format string (databases only) - From Unique Imitate (
from_unique_imitate
)
Transforms strings or numbers to be random, retaining format and uniqueness
From unique (from_unique
)
A mask that generates string values that are guaranteed to be unique within the target column.
Note: To automatically cascade primary and unique key values to
foreign keys or mask composite keys, consider using
mask_unique_key
instead.
Parameters
format
(required): The format that will be used to generate values. See format string syntax for details on how to construct a format string.
version: '1.0'
tasks:
- type: mask_table
table: drivers
key: id
rules:
- column: licence_plate
masks:
- type: from_unique
format: "{[A-Z],3}{[0-9],3}"
Show result
Before | After |
|
|
---|
Note
- All values produced by
from_unique
will be strings. To convert values to other data types, you canchain
yourfrom_unique
mask with atypecast
mask. - Unlike
from_format_string
,from_unique
ensures generated values are unique. This is achieved using the same underlying unique value generation procedure asmask_unique_key
. - The specified
format
must allow for a sufficient number of unique values to cover the full number of rows in the target table or file (rounded up to the nearest multiple of thebatch_size
formask_table
tasks), otherwise an error will be reported when executing the run. - Using the
from_unique
mask type differs from themask_unique_key
task type in the following ways:from_unique
cannot perform any cascading of values to related columns, such as foreign keys.from_unique
cannot be used to guarantee joint uniqueness across the columns in a composite key, unless guaranteeing uniqueness within a single column within the composite key is sufficient for the target use case.from_unique
can be used to update unique columns at the same time as other columns in amask_table
task, without the need for a separate database update performed by amask_unique_key
task.from_unique
can be used to generate unique values inmask_tabular_file
tasks or inmask_table
tasks for databases that do not supportmask_unique_key
(such as Amazon DynamoDB).
- Using
from_unique
in amask_table
task whereworker_count
> 1 is only supported for Oracle and Amazon DynamoDB connections. from_unique
cannot be used in the rules of amask_file
task.- Like other mask functions,
from_unique
cannot be used to update thekey
columns of amask_table
task (except for databases that allowmask_table
to update key columns, such as Amazon DynamoDB). - All
from_unique
masks within a given task will be produced from the same sequence of unique values, such that twofrom_unique
masks with identicalformat
will produce the same value for the same row. This can be useful to ensure values produced by identicalfrom_unique
masks in differentif
branches are jointly unique.
From Unique Imitate (from_unique_imitate
)
The from_unique_imitate
mask is very similar to the imitate mask:
- alphabetical characters
a-z
andA-Z
are replaced by other alphabetical characters of the same case - digits are replaced by other digits
- all other characters (whitespace, symbols such as
%
, and so on) are left as-is.
However, there are the following differences:
- Most notably,
from_unique_imitate
can also be used on primary keys and columns that have aUNIQUE
constraint. Indeed, it is designed specifically for data that must take unique values per row. - The masked values are guaranteed to be consistent for the same given input, and different for different inputs.
(The specific values generated will vary across runs unless you control the seeding.)
- For example, if a column has a
UNIQUE
constraint and the existing data satisfies that constraint, the resulting masked values will all be unique. - If a column contains the same value in every row, the content of the column after masking will also have the same value in every row.
- For example, if a column has a
The from_unique_imitate
mask will always create a different result to its input,
as if it has an implicit force_change: true
.
The mask may occasionally preserve individual letters and digits in the same position,
but it is always guaranteed that at least one (and very likely almost all) of the alphanumeric characters will be different.
For example, ABC-123
may be masked to BYC-457
.
This mask can only be used on columns of integer
or string (char
/ varchar
) type.
Null values will always be left as null.
Important!
from_unique_imitate
does not supportIDENTITY
columns in Oracle or Microsoft SQL Server.Leading zeroes
- All zeroes appearing before any other digit in a value will be preserved. For example, when masking the value
ABC-00123-0044
, the mask will not change the two zeroes before123
but can change the two zeroes before44
. A sample result for this value might beZQD-00157-3498
.- Similarly, no masked value will be created that includes a zero as the first digit character. For example, when masking
A%123-456
the character immediately after the%
will never be a zero, though the other digits may be zero. A sample output value might beA439-007
. Likewise, when masking the integer value12345
, the result will not be5892
as if there was a leading zero; it will always be of five digits in length, i.e. be between10000
and99999
inclusive.- Outside of the rules above, the mask can produce a sequence of digits that is all zero. For example,
M109D732
may be masked toA543G000
.
You can apply a from_unique_imitate
mask to a primary key column
or a column that is used as a foreign key in another table.
References will be updated automatically.
Composite primary keys are supported.
Parameters
skip_letters
(optional): A boolean to enable or disable the skipping alphabetical characters from being masked. Defaults tofalse
(alphabetical characters will not be replaced).skip_digits
(optional): A boolean to enable or disable the skipping of digits from being masked. Defaults tofalse
(digits will not be replaced).checksum
(optional): A string to specify an algorithm to use to generate unique valid replacements. Options:brazilian_cpf
.
Invalid Parameter Combination:
Due to no masking occurring if bothskip_digits
&skip_letters
istrue
, this combination is invalid and a run will be prevented if this ruleset is specified.
Values Requiring Checksums
When the output values must satisfy a checksum, specify the name of the checksum as the checksum
parameter.
Unique values will be generated that satisfy that checksum algorithm.
The available options for checksum
are listed below.
For each checksum, the input value must contain a certain number of digits 0-9
and no letters A-Z
or a-z
.
Other non-letter characters that are used for formatting are retained in the output.
The replacement value will conform to the checksum algorithm, even if the input did not.
brazilian_cpf
Use this checksum
type to generate values that satisfy the Brazilian CPF (Cadastro de Pessoas Físicas) number
checksum.
The input must contain 11 digits. The validity of the input CPF number is not checked.
Input Example | Description | Output Example | Output Description |
---|---|---|---|
298.056.372-20 | Valid, formatted CPF | 886.972.870-65 | Valid CPF with formatting retained |
2980,5637,220 | Valid CPF, with other formatting | 8869,7287,065 | Valid CPF with formatting retained |
29805637220 | Valid CPF, digits only | 88697287065 | Valid CPF, digits only |
298.056.372-29 | 11-digit, formatted number, that is not a CPF | 886.972.870-65 | Valid CPF with formatting retained |
29805637229 | 11-digit number that is not a CPF | 88697287065 | Valid CPF, digits only |
298056372 | 9-digit number | – | No output, error is raised and masking stops due to invalid length |
298A056B372C20 | String with letters | – | No output, error is raised and masking stops due to invalid characters |
Warning: for performance, this mask only performs basic validation on the input value. That is, it checks that the input contains 11 digits and no letters. Multiple 11-digit numbers that share the same first nine digits will mask to the same output, even if the last two digits differ in the input. This will only happen if one or more of the values are not valid CPF numbers. For example, the number
372.116.366-45
(valid) and the number372.116.366-99
(invalid) will give the same output.This will cause a collision, and a database column that only contained unique values will subsequently contain duplicates.
For this reason, the
checksum: brazilian_cpf
parameter should only be used when the input data is known to only contain valid CPF numbers.
luhn
Use this checksum
type to generate values that satisfy the Luhn checksum algorithm.
The most common use of this is for credit card or other payment card numbers;
however, this checksum
can be used for any numbering system that utilizes the Luhn checksum algorithm,
for example mobile phone IMEI numbers.
The input must contain between 12 and 19 digits (inclusive). The validity of the input number is not checked.
Input Example | Description | Output Example | Output Description |
---|---|---|---|
4111 1111 1111 1111 | Valid, formatted card number | 2260 5651 2623 0906 | Number that satisfies the Luhn checksum, with formatting retained |
2980,5637/2204 | Number with other formatting | 8869,7287/0655 | Number that satisfies the Luhn checksum, with formatting retained |
4111111111111111 | Valid card number, digits only | 2260565126230906 | Number that satisfies the Luhn checksum, digits only |
1234 1234 5678 5678 | Formatted card number that does not satisfy the Luhn algorithm | 2260 5651 2623 0906 | Number that satisfies the Luhn checksum, with formatting retained |
298056372 | 9-digit number | – | No output, error is raised and masking stops due to invalid length |
298A056B372C20 | String with letters | – | No output, error is raised and masking stops due to invalid characters |
Warning: for performance, this mask only performs basic validation on the input value. That is, it checks that the input contains 12-19 digits and no letters. Numbers of the same length that share the same non-checksum digits (all but the last one) will mask to the same output, even if the last digit differs in the input. This will only happen if one or more of the values do not satisfy the Luhn algorithm. For example, the number
4111 1111 1111 1111
(valid) and the number4111 1111 1111 1118
(invalid) will give the same output.This will cause a collision, and a database column that only contained unique values will subsequently contain duplicates.
For this reason, the
checksum: luhn
parameter should only be used when the input data is known to only contain numbers that satisfy the Luhn algorithm, such as credit card numbers.
Example
This example will apply from_unique_imitate
masks to the vehicle_id
, license_plate
and validation_code
columns.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: vehicle_id
masks:
- type: from_unique_imitate
- column: license_plate
masks:
- type: from_unique_imitate
- column: validation_code
masks:
- type: from_unique_imitate
Show result
Before | After |
|
|
---|
RedShift Limitations
Due to the current method in which we mask with RedShift, please note the current limitations:
PRIMARY KEY
constrained columns will be transformed into columns which areUNIQUE
constrained instead ofPRIMARY KEY
constrained.NOT NULL
constrained columns will lose theirNOT NULL
constraint. However, this is not just limited tofrom_unique_imitate
masking.