Data Pattern Masks
Data pattern masks are used to mask very specific patterns of data.
- Credit Card (credit_card): Replaces credit card values with random ones
- Brazilian CPF (brazilian_cpf): Replaces Brazilian CPF numbers with random ones
- Social security number (social_security_number): Replaces social security numbers with random ones
Credit card (credit_card)
This mask provides a number of methods for masking and generating credit card numbers. Parameters can be set to control the prefix, PAN formatting and Luhn checksum validity of the generated numbers.
There are three modes of operation of this mask.
- Card numbers can be replaced with generated numbers (generate_card_number set to true).
- Card numbers can have the middle digits obscured (using a # character by default), leaving just the first 6 and last 4 digits readable (pan_format set to true).
- Both the above modes can be combined (by setting both parameters to true), which will generate a card number and obscure the middle digits.
Please note that at least one of generate_card_number or pan_format must be true. If they are both false then the masking run will fail as no masking would occur.
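The PAN formatting mode can be illustrated with a short Python sketch. This is not DataMasque's implementation; the function name apply_pan_format is hypothetical, and the choice to leave separator characters untouched while counting only digits is an assumption made for illustration.

```python
def apply_pan_format(card_number: str, pan_character: str = "#") -> str:
    """Conceal every digit between the first six and the last four.

    Assumption: non-digit characters (such as separators) are left in place
    and only digits are counted when locating the middle section.
    """
    if len(pan_character) != 1:
        raise ValueError("pan_character must be a single character string")
    total_digits = sum(ch.isdigit() for ch in card_number)
    seen = 0
    out = []
    for ch in card_number:
        if ch.isdigit():
            seen += 1
            # Digits 7 through (total - 4) form the concealed middle section.
            if 6 < seen <= total_digits - 4:
                ch = pan_character
        out.append(ch)
    return "".join(out)
```

For a 16 digit number this keeps the first six and last four digits readable and replaces the six digits in between.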
Parameters
- generate_card_number (optional): If true, new credit card numbers will be generated. Set to false to skip generating card numbers (which improves performance) when only the pan_format argument is to be used. generate_card_number defaults to true.
- pan_format (optional): If true, mask the card number by replacing the digits between the first six and last four with pan_character. pan_format defaults to false.
- pan_character (optional): The character to use to conceal credit card digits, if pan_format is true. Must be a single character string. Defaults to #.
- generate_luhn (optional): If true, the generated card number will pass the Luhn checksum. Set to false to generate random card numbers instead, which slightly improves performance by skipping Luhn digit generation. generate_luhn defaults to the opposite of pan_format, or true if pan_format is not set.
- retain_prefix_length (optional): The number of digits of the input card's prefix to retain, or automatic to automatically determine the length of the prefix from the issuer. See Retaining Prefixes below. By default, no prefix is retained (i.e. the entire credit card number is generated randomly).
- issuer_names (optional): The generated card will have the specified issuer's prefix(es) and card lengths. If left empty, all card issuers can be used to generate the card number. Not valid to use if retain_prefix_length is specified. Please refer to the list of issuers under Card issuer names below.
- apply_weighting (optional): If true, randomly select prefixes based on the actual popularity of prefixes. This increases the accuracy of generated data but slightly decreases performance. See Random Selection and Weightings below. apply_weighting defaults to false.
- on_null (optional): A string to specify the action to take if the value is null. One of:
  - skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
  - mask: Overwrite the null value with a generated credit card number.
  - error: Raise an error and stop masking.
- on_invalid (optional): A string to specify the action to take if the value is an invalid credit card number. One of:
  - mask (default): Always overwrite without validating the credit card number. If an input value is not a valid credit card number the imitate mask will be used to replace the digits.
  - skip: Skip to the next value, the value remains unchanged.
  - error: Raise an error and stop masking.
- output_format_choice (optional): A string to specify the desired output format. One of:
  - retained (default): Detect the input format (also based on the input value type, i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
  - numeric: Always return just the digits, as long as there are only digits in the masked value.
- segment_separators (optional): An array of characters to allow as separators when validating credit card numbers. See Validating Card Numbers below. segment_separators defaults to [" ", "/", "-"].
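The retained output format described above (replacing each digit of the original value with the generated digits, in order) could be sketched as follows. The function name apply_retained_format is hypothetical and the length check is an assumption; the sketch only illustrates the idea and is not DataMasque's code.

```python
def apply_retained_format(original: str, generated_digits: str) -> str:
    """Place generated digits into the original value's layout, in order."""
    digit_slots = sum(ch.isdigit() for ch in original)
    if digit_slots != len(generated_digits):
        # Assumption: the generated number has as many digits as the input.
        raise ValueError("generated number must have as many digits as the input")
    replacements = iter(generated_digits)
    # Every digit is swapped for the next generated digit; other characters stay.
    return "".join(next(replacements) if ch.isdigit() else ch for ch in original)
```

A dash separated input therefore produces a dash separated output with the same segment lengths.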
Invalid Parameter Combinations
Some combinations of parameters are invalid as they would be redundant or cause no masking to occur. These combinations will cause an error and the masking run will fail.
- generate_card_number and pan_format cannot both be false, since no masking would occur. Both may be true, however, which means card numbers will be generated and then have PAN formatting applied.
- Using retain_prefix_length with pan_format only (i.e. generate_card_number is false) is invalid, as there is no reason to retain a prefix when not generating the card number.
- generate_luhn and pan_format cannot both be true. It is redundant to generate the Luhn digit when the middle characters will be unknown in the output.
- A list of issuer_names cannot be provided when retain_prefix_length is specified, as this may create an unresolvable scenario if trying to retain the prefix of a credit card number that is not in the list of specified issuers.
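As a rough pre-flight check, the four invalid combinations could be expressed in Python as below. The function check_credit_card_params and its error messages are hypothetical; only the parameter names and their documented defaults come from this page.

```python
def check_credit_card_params(params: dict) -> list[str]:
    """Return a message for each invalid parameter combination found."""
    generate = params.get("generate_card_number", True)
    pan = params.get("pan_format", False)
    # generate_luhn defaults to the opposite of pan_format.
    luhn = params.get("generate_luhn", not pan)
    retain = params.get("retain_prefix_length")
    issuers = params.get("issuer_names") or []

    errors = []
    if not generate and not pan:
        errors.append("generate_card_number and pan_format are both false")
    if retain is not None and not generate:
        errors.append("retain_prefix_length used without generate_card_number")
    if luhn and pan:
        errors.append("generate_luhn and pan_format are both true")
    if issuers and retain is not None:
        errors.append("issuer_names used together with retain_prefix_length")
    return errors
```

An empty dictionary (all defaults) returns no errors, since generate_card_number defaults to true and pan_format to false.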
Retaining Prefixes
When generating card numbers there are three options for retaining the prefix of the input credit card number. The first is to not retain the prefix at all, which means the entire credit card number will be randomly generated. This is the default behaviour if retain_prefix_length is omitted from the ruleset.
The second option is to specify a number of digits to retain. For example, to retain the first 4 digits of each input credit card, use the following ruleset.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: 4
            generate_card_number: true
If retain_prefix_length is more than half of the length of a credit card that is encountered when masking, an error will be raised. For example, if retain_prefix_length is 7 and a credit card number of 14 or fewer digits is found, this will cause an error and the masking run will stop.
Finally, the credit_card mask can be configured to automatically retain the prefix of the issuer, by specifying automatic for retain_prefix_length. The length of the prefix will depend on the issuer and card length. The longest matching prefix will be retained. For example, the prefixes 62 and 622126 both exist. The card number 623… would retain just the 62 prefix, whereas a card number 6221264… would retain the 622126 prefix: it matches both prefixes, so the longest one is selected.
If no prefixes match a card number, then the mask will fall back to just retaining the first digit.
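The longest-match rule can be sketched in a few lines of Python. The function name retained_prefix and the use of a plain set of prefix strings are assumptions for illustration; how DataMasque stores and searches its prefix list is not described here.

```python
def retained_prefix(card_digits: str, known_prefixes: set[str]) -> str:
    """Return the longest known prefix of card_digits, else the first digit."""
    longest = max((len(prefix) for prefix in known_prefixes), default=0)
    # Try the longest candidate first so that 622126 wins over 62.
    for length in range(min(longest, len(card_digits)), 0, -1):
        candidate = card_digits[:length]
        if candidate in known_prefixes:
            return candidate
    # No known prefix matched: fall back to retaining just the first digit.
    return card_digits[:1]
```

With the prefixes 62 and 622126 from the example above, a number starting 623 retains 62 and a number starting 6221264 retains 622126.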
This next ruleset shows how to use automatic prefix retaining.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: automatic
            generate_card_number: true
The retain_prefix_length parameter is not valid if a list of issuer_names is provided.
DataMasque contains a list of over 105,000 prefixes which are used when the retain_prefix_length: automatic parameter is set. If a prefix is not found, then DataMasque falls back to preserving just the first digit.
A full list of prefixes can be found in the credit_card_prefixes.csv file (1.9MB CSV).
Card issuer names
These card issuer names can be used as arguments to the issuer_names parameter. They are not case-sensitive.
Visa | Mastercard | American Express |
China T-Union | China Unionpay | Dankort |
Diners Club International | Diners Club United States & Canada | Discover Card |
Instapayment | Interpayment | JCB |
Lankapay | Maestro | Maestro UK |
MIR | NPS Pridnestrovie | Rupay |
Troy | Ukrcard | Verce |
Random Selection and Weightings
When not choosing to retain the prefix, one will be randomly selected so that generated card numbers have a valid issuer prefix.
Note that the following rules apply after issuers have been filtered down to those specified in the issuer_names parameter (if provided). If this parameter is not provided then any issuer may be selected.
If apply_weighting is enabled, the credit_card mask uses weighting to select a prefix: first based on the popularity of the issuer, and then based on the count of credit card numbers inside that prefix.
The approximate weighting of issuers is as shown (note that weightings are approximate and apply relative to each other, so they do not add to 100% exactly).
Issuer | Approximate Weighting |
---|---|
Visa | 53% |
Mastercard | 33% |
Discover Card | 8% |
American Express | 8% |
Other Issuers | 0.1% each |
Once an issuer has been chosen, a prefix is selected weighted on the length of cards. For example, a prefix of length one for a 16 digit card has 10^15 combinations, whereas a prefix of length one for a 14 digit card has 10^13 combinations. Therefore, the 16 digit card number is one hundred times more likely to be chosen than the 14 digit card.
If apply_weighting is false then each issuer, and all prefixes for the chosen issuer, are equally likely to be chosen.
Once a prefix and length have been chosen, a random card number is generated with this prefix and length. If generate_luhn is true, the generated card number will pass the Luhn checksum.
Applying weightings will produce a more realistic distribution of generated card numbers, at the cost of a slight performance penalty.
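The length-based weighting can be worked through in Python. This is a sketch only: combination_count and choose_prefix are hypothetical names, and the candidate list is made-up illustration data rather than entries from the real prefix file.

```python
import random


def combination_count(prefix: str, card_length: int) -> int:
    """Number of card numbers that share this prefix at this length."""
    return 10 ** (card_length - len(prefix))


def choose_prefix(candidates: list[tuple[str, int]], rng: random.Random) -> tuple[str, int]:
    """Pick a (prefix, card_length) pair weighted by its combination count."""
    weights = [combination_count(prefix, length) for prefix, length in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Illustrative candidates: a one digit prefix on a 16 digit and a 14 digit card.
candidates = [("4", 16), ("3", 14)]
ratio = combination_count("4", 16) // combination_count("3", 14)
picked = choose_prefix(candidates, random.Random(0))
```

Here ratio compares the combination counts of the two candidates, which is the relative likelihood that the weighting gives them.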
Validating Card Numbers
The credit_card mask can be configured to perform different actions when a value that is not a credit card number is encountered. A value is considered a valid credit card number if:
- It is of integer or string type.
- It is between 12 and 19 characters long (inclusive).
- The digits satisfy the Luhn algorithm.
Note that a null value is not covered by these rules, as it is evaluated based on the on_null parameter (see Handling Nulls below).
For string types, validation takes into account the segment_separators argument. Take, for example, the card number 4111111111111111, which is valid (correct length and passes the Luhn checksum). Since segment_separators includes - (dash) by default, the value 4111-1111-1111-1111 would also be considered valid. However, the value 4111_1111_1111_1111 would not be considered a valid credit card number, since _ (underscore) is not in the list of segment_separators.
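The validation rules above can be sketched in Python as follows. The function names are hypothetical, and measuring the length after separators have been removed is an assumption; the Luhn check itself is the standard algorithm.

```python
DEFAULT_SEPARATORS = (" ", "/", "-")


def passes_luhn(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        value = int(ch)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_valid_card_number(value, segment_separators=DEFAULT_SEPARATORS) -> bool:
    """Apply the type, length and Luhn rules to an integer or string value."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        return False
    text = str(value)
    for separator in segment_separators:
        text = text.replace(separator, "")
    # Assumption: the 12 to 19 length rule is applied after removing separators.
    if not (text.isascii() and text.isdigit()) or not 12 <= len(text) <= 19:
        return False
    return passes_luhn(text)
```

Passing a custom segment_separators sequence replaces the defaults, mirroring the behaviour of the parameter.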
Specifying the segment_separators parameter replaces the existing list of segment separators. This means that _ can't be added by itself as a separator; the existing separators must also be specified. For example:
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            segment_separators:
              - "_"
              - " "
              - "/"
              - "-"
When on_invalid is skip, invalid values will be returned as is, i.e. no masking will occur on these values.
When on_invalid is error, the masking run will stop and an error will be displayed in the run log.
When on_invalid is mask, the value will be masked with the imitate mask, which is configured to replace digits in the value. Note that the imitate mask can only be applied to strings. A value that is not a valid credit card number falls back to the imitate mask, but if that value is also not a string, imitate cannot be applied, which causes an error and halts the masking run. Therefore, setting on_invalid to mask is not a foolproof way of masking any type of data.
Handling Nulls
When on_null is skip, null values will be returned as is, i.e. the null value will be retained.
When on_null is error, the masking run will stop and an error will be displayed in the run log.
When on_null is mask, a credit card number will be generated based on the behaviour described in Random Selection and Weightings above. However, if retain_prefix_length is set then an error will be raised and the masking run will fail, as a null value has no prefix.
Example
This example generates credit card numbers that pass the Luhn checksum, with the card issuer set to one of Mastercard, Visa, or American Express.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            issuer_names:
              - VISA
              - MASTERCARD
              - AMERICAN EXPRESS
            generate_luhn: true
            pan_format: false
This example generates credit card numbers that retain the original card prefix.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: automatic
            generate_card_number: true
This example does not generate card numbers; it just applies PAN formatting. The output_format_choice parameter is set to numeric, so the input format is not retained. Instead, the output is normalized to just the (concealed) numbers.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            pan_format: true
            output_format_choice: numeric
Brazilian CPF (brazilian_cpf)
This mask provides a method for masking Brazilian CPF numbers using random (but valid) CPF numbers. Parameters can be adjusted to determine which actions to take based on the input value and the desired output.
The conventional format for CPF numbers is XXX.XXX.XXX-XX. This format can be retained, or if the input doesn't adhere to this format, it can be standardised to this format. Alternatively, the output can be forced to display only the digits.
Parameters
- on_null (optional): A string to specify the action to take if the value is null. One of:
  - skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
  - mask: Overwrite the null value with a generated CPF number.
  - error: Raise an error and stop masking.
- on_invalid (optional): A string to specify the action to take if the value is an invalid CPF number. One of:
  - mask (default): Always overwrite without validating the generated CPF number. If an input value is not a valid CPF number the imitate mask will be used to replace the digits.
  - skip: Skip to the next value, the value remains unchanged.
  - error: Raise an error and stop masking.
- output_format_choice (optional): A string to specify the desired output format. One of:
  - retained (default): Detect the input format (also based on the input value type, i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
  - formatted: Always return with the conventional formatting (XXX.XXX.XXX-XX) if the masked value is of correct length (11).
  - numeric: Always return just the digits, as long as there are only digits in the masked value.
Notes: In the cases where the digits are replaced by the imitate mask:
- If output_format_choice: formatted, the value must be of correct length (11) to be formatted. If not, an error will be raised.
- If output_format_choice: numeric and there are non-numeric characters, an error will be raised.
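The formatted output choice can be illustrated with a small Python sketch. The function name format_cpf is hypothetical, and collecting the digits from the masked value before checking the length is an assumption about how the length rule is applied.

```python
def format_cpf(masked_value: str) -> str:
    """Lay out an 11 digit CPF in the conventional XXX.XXX.XXX-XX format."""
    digits = "".join(ch for ch in masked_value if ch.isdigit())
    if len(digits) != 11:
        # Mirrors the documented behaviour: a wrong length is an error.
        raise ValueError("a CPF number must contain exactly 11 digits")
    return f"{digits[:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9:]}"
```

A value that is already in the conventional format passes through unchanged, since its digits are re-laid out in the same positions.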
Example
In this example, CPF numbers are generated to replace the values in the tax_number column of the employees table.
The numbers are forced into the standardised format XXX.XXX.XXX-XX, and both null and invalid values are replaced for consistency.
version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: employee_id
    rules:
      - column: tax_number
        masks:
          - type: brazilian_cpf
            on_null: mask
            on_invalid: mask
            output_format_choice: formatted
Social security number (social_security_number)
This mask provides a method for masking social security numbers using random (but valid) social security numbers. Parameters can be adjusted to determine which actions to take based on the input value and the desired output.
The conventional format for social security numbers is XXX-XX-XXXX. This format can be retained, or if the input doesn't adhere to this format, it can be standardised to this format. Alternatively, the output can be forced to display only the digits.
A value is considered a valid social security number if:
- It is of integer or string type.
- It consists of three groups of digits.
- Group 1 (area number) comprises 3 digits, the value must be in the range of 001 to 899 and should not be 666.
- Group 2 (group number) comprises 2 digits, the value must be in the range of 01 to 99.
- Group 3 (serial number) comprises 4 digits, the value must be in the range of 0001 to 9999.
- These three groups can be written together or separated by either a hyphen (-) or a space.
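The rules above translate fairly directly into a Python check. This is a sketch under assumptions: the function name is_valid_ssn is hypothetical, integers are zero-padded to nine digits before checking, and the two separators are not required to match each other.

```python
import re

# Three digit groups, each optionally separated by a hyphen or a space.
_SSN_PATTERN = re.compile(r"(\d{3})[- ]?(\d{2})[- ]?(\d{4})", re.ASCII)


def is_valid_ssn(value) -> bool:
    """Check the type, grouping and range rules for a social security number."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        return False
    # Assumption: integer inputs have lost leading zeros, so pad to 9 digits.
    text = f"{value:09d}" if isinstance(value, int) else value
    match = _SSN_PATTERN.fullmatch(text)
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    return (
        1 <= area <= 899
        and area != 666
        and 1 <= group <= 99
        and 1 <= serial <= 9999
    )
```

Values with an area number of 000, 666 or 900 and above, a group number of 00, or a serial number of 0000 are rejected.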
Parameters
- on_null (optional): A string to specify the action to take if the value is null. One of:
  - skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
  - mask: Overwrite the null value with a generated social security number.
  - error: Raise an error and stop masking.
- on_invalid (optional): A string to specify the action to take if the value is an invalid social security number. One of:
  - mask (default): Always overwrite without validating the generated social security number. If an input value is not a valid social security number the imitate mask will be used to replace the digits.
  - skip: Skip to the next value, the value remains unchanged.
  - error: Raise an error and stop masking.
- output_format_choice (optional): A string to specify the desired output format. One of:
  - retained (default): Detect the input format (also based on the input value type, i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
  - formatted: Always return with the conventional formatting (XXX-XX-XXXX) if the masked value is of correct length (9).
  - numeric: Always return just the digits, as long as there are only digits in the masked value.
Example
In this example, social security numbers are generated to replace the values in the social_number column of the employees table.
The numbers are forced into the standardised format XXX-XX-XXXX, and both null and invalid values are replaced for consistency.
version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: employee_id
    rules:
      - column: social_number
        masks:
          - type: social_security_number
            on_null: mask
            on_invalid: mask
            output_format_choice: formatted