DataMasque Portal

Data Pattern Masks

Data pattern masks are used to mask very specific patterns of data.

Credit card (credit_card)

This mask provides a number of methods for masking and generating credit card numbers. Parameters can be set to control the prefix, PAN formatting and Luhn checksum validity of the generated numbers.

There are three modes of operation of this mask.

  1. Card numbers can be replaced with generated numbers (generate_card_number set to true).
  2. Card numbers can have the middle digits obscured (using a # character by default), leaving just the first 6 and last 4 digits readable (pan_format set to true).
  3. Both the above modes can be combined (by setting both parameters to true), which will generate a card number and obscure the middle digits.

Please note that at least one of generate_card_number or pan_format must be true. If they are both false then the masking run will fail as no masking would occur.
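The PAN formatting described above can be illustrated with a short Python sketch. This is not DataMasque's implementation; the function name pan_format_digits is hypothetical, and the sketch assumes the input is already a plain string of digits.

```python
def pan_format_digits(digits: str, pan_character: str = "#") -> str:
    """Conceal every digit between the first six and the last four.

    Hypothetical helper for illustration only; assumes `digits` holds
    nothing but digits and is longer than ten characters.
    """
    if len(pan_character) != 1:
        raise ValueError("pan_character must be a single character")
    if not digits.isdigit() or len(digits) <= 10:
        raise ValueError("expected a digit string longer than 10 characters")
    concealed = len(digits) - 10  # everything except first 6 and last 4
    return digits[:6] + pan_character * concealed + digits[-4:]


print(pan_format_digits("5208475828392947"))
print(pan_format_digits("4429545392235346", pan_character="*"))
```

The concealed section grows or shrinks with the card length, so a 15 digit card has five concealed characters and a 16 digit card has six.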

Parameters

  • generate_card_number (optional): If true, new credit card numbers will be generated. Set to false to not generate card numbers (which improves performance), if the pan_format argument is to be used. generate_card_number defaults to true.
  • pan_format (optional): If true, mask the card number by replacing the digits between the first six and last four with pan_character. pan_format defaults to false.
  • pan_character (optional): The character to use to conceal credit card digits, if pan_format is true. Must be a single character string. Defaults to #.
  • generate_luhn (optional): If true the generated card number will pass the Luhn checksum. Set to false to generate random credit cards instead, which slightly improves performance by skipping Luhn digit generation. generate_luhn defaults to the opposite of pan_format, or true if pan_format is not set.
  • retain_prefix_length (optional): The number of digits of the input card's prefix to retain, or automatic to automatically determine the length of the prefix from the issuer. See Retaining Prefixes below. By default, no prefix is retained (i.e. the entire credit card number is generated randomly).
  • issuer_names (optional): The generated card will have the specified issuer's prefix(es) and card lengths. If left empty all card issuers can be used to generate the card number. Not valid to use if retain_prefix_length is specified. Please refer to the list of issuers.
  • apply_weighting (optional): If true, randomly select prefixes based on the actual popularity of prefixes. This increases the accuracy of generated data but slightly decreases performance. See Random Selection and Weightings below. apply_weighting defaults to false.
  • on_null (optional): A string to specify the action to take if the value is null. One of:
    • skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
    • mask: Overwrite the null value with a generated credit card number.
    • error: Raise an error and stop masking.
  • on_invalid (optional): A string to specify the action to take if the value is an invalid credit card number. One of:
    • mask (default): Always overwrite without validating the credit card number. If an input value is not a valid credit card number the imitate mask will be used to replace the digits.
    • skip: Skip to the next value, the value remains unchanged.
    • error: Raise an error and stop masking.
  • output_format_choice (optional): A string to specify the desired output format. One of:
    • retained (default): Detect the input format (also based on the input value type i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
    • numeric: Always return just the digits, as long as there are only digits in the masked value.
  • segment_separators (optional): An array of characters to allow as separators when validating credit card numbers. See Validating Card Numbers below. segment_separators defaults to [" ", "/", "-"].
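As a rough illustration of the retained output format, the following Python sketch substitutes generated digits into the original value in order, leaving every non-digit character where it was. The function name apply_retained_format is hypothetical, and the sketch assumes the generated number has exactly as many digits as the original.

```python
def apply_retained_format(original: str, generated_digits: str) -> str:
    """Replace each digit of `original` with the next generated digit.

    Hypothetical helper for illustration; separators and any other
    non-digit characters are copied through unchanged.
    """
    digit_count = sum(ch.isdigit() for ch in original)
    if digit_count != len(generated_digits):
        raise ValueError("generated value must have one digit per original digit")
    replacements = iter(generated_digits)
    return "".join(next(replacements) if ch.isdigit() else ch for ch in original)


print(apply_retained_format("4429 5453 9223 5346", "5220082637809691"))
```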

Invalid Parameter Combinations

Some combinations of parameters are invalid as they would be redundant or cause no masking to occur. These combinations will cause an error and the masking run will fail.

generate_card_number and pan_format can not both be false, since no masking would occur. Both may be true, however, which will mean card numbers will be generated and then have PAN formatting applied.

Using retain_prefix_length with pan_format only (i.e. generate_card_number is false) is invalid as there is no reason to try to retain a prefix when not generating the card number.

generate_luhn and pan_format can not both be true. It is redundant to try to generate the Luhn digit when the middle characters will be unknown in the output.

A list of issuer_names cannot be provided when retain_prefix_length is specified, as this may create an unresolvable scenario if trying to retain the prefix of a credit card number that is not in the list of specified issuers.

Retaining Prefixes

When generating card numbers there are three options for retaining the prefix of the input credit card number. The first is to not retain the prefix at all, which means the entire credit card number will be randomly generated. This is the default behaviour, if retain_prefix_length is omitted from the ruleset.

The second option is to specify a number of digits to retain. For example, to retain the first 4 digits of each input credit card, use the following ruleset.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: 4
            generate_card_number: true

If retain_prefix_length is more than half of the length of a credit card that is encountered when masking, an error will be raised. For example, if retain_prefix_length is 7 and a credit card number of 14 or fewer digits is found, this will cause an error and the masking run will stop.

Finally, the credit_card mask can be configured to automatically retain the prefix of the issuer, by specifying automatic for retain_prefix_length. The length of the prefix will depend on the issuer and card length. When more than one prefix matches, the longest is retained. For example, the prefixes 62 and 622126 both exist. The card number 623… would retain just the 62 prefix, whereas the card number 6221264… matches both prefixes and would retain the longer one, 622126.

If no prefixes match a card number, then the mask will fall back to just retaining the first digit.
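The longest-match rule and the single-digit fallback can be sketched in Python as follows. The two-entry prefix set is a hypothetical stand-in for the real prefix list, and the function name retained_prefix is an assumption made for this example.

```python
# Hypothetical stand-in for the real prefix list, using the two
# prefixes mentioned in the text above.
KNOWN_PREFIXES = {"62", "622126"}


def retained_prefix(card_digits: str, prefixes=KNOWN_PREFIXES) -> str:
    """Return the longest known prefix that the card number starts with.

    Falls back to the first digit when no known prefix matches.
    """
    matches = [p for p in prefixes if card_digits.startswith(p)]
    if matches:
        return max(matches, key=len)
    return card_digits[:1]


print(retained_prefix("6231234567890123"))  # matches only 62
print(retained_prefix("6221264567890123"))  # matches 62 and 622126
print(retained_prefix("9999999999999999"))  # no match, first digit only
```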

This next ruleset shows how to use automatic prefix retaining.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: automatic
            generate_card_number: true

The retain_prefix_length parameter is not valid if a list of issuer_names is provided.

DataMasque contains a list of over 105,000 prefixes which are used when the retain_prefix_length: automatic parameter is set. If a prefix is not found, then DataMasque falls back to preserving just the first digit.

A full list of prefixes can be found in the credit_card_prefixes.csv file (1.9MB CSV).

Card issuer names

These card issuer names can be used as arguments to the issuer_names parameter. They are not case-sensitive.

  • Visa
  • Mastercard
  • American Express
  • China T-Union
  • China Unionpay
  • Dankort
  • Diners Club International
  • Diners Club United States & Canada
  • Discover Card
  • Instapayment
  • Interpayment
  • JCB
  • Lankapay
  • Maestro
  • Maestro UK
  • MIR
  • NPS Pridnestrovie
  • Rupay
  • Troy
  • Ukrcard
  • Verce

Random Selection and Weightings

When not choosing to retain the prefix, one will be randomly selected so that generated card numbers have a valid issuer prefix.

Note that the following rules apply after issuers have been filtered down to those specified in the issuer_names parameter (if provided). If this parameter is not provided then the selection is made from all issuers.

If apply_weighting is enabled, the credit_card mask uses weighting to select a prefix: first based on the popularity of the issuer, and then based on the count of credit card numbers inside that prefix.

The approximate weighting of issuers is as shown (note that weightings are approximate and apply relative to each other, so they do not add to 100% exactly).

Issuer              Approximate Weighting
Visa                53%
Mastercard          33%
Discover Card       8%
American Express    8%
Other Issuers       0.1% each

Once an issuer has been chosen, a prefix is selected weighted on the length of cards. For example, a prefix of length one for a 16 digit card has 10^15 combinations, whereas a prefix of length one for a 14 digit card has 10^13 combinations. Therefore, the 16 digit card number is one hundred times more likely to be chosen than the 14 digit card.
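The arithmetic behind this length-based weighting can be checked with a small Python sketch. The function names are hypothetical, and the count simply treats every digit after the prefix as free, which is an assumption that mirrors the figures quoted above rather than a description of DataMasque's internals.

```python
import random


def combinations_for(card_length: int, prefix_length: int) -> int:
    """Count the digit strings that can follow a prefix of the given length."""
    return 10 ** (card_length - prefix_length)


def pick_card_length(lengths, prefix_length, rng):
    """Choose a card length with probability proportional to its count."""
    weights = [combinations_for(n, prefix_length) for n in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]


# Each extra digit multiplies the number of combinations by ten.
print(combinations_for(16, 1) // combinations_for(15, 1))  # 10
print(combinations_for(16, 1) // combinations_for(14, 1))  # 100
print(pick_card_length([14, 16], 1, random.Random()))
```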

If apply_weighting is false then each issuer, and all prefixes for the chosen issuer, are equally likely to be chosen.

Once a prefix and length have been chosen, a random card number is generated with this prefix and length. If generate_luhn is true, the generated card number will pass the Luhn checksum.
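A minimal Python sketch of this last step is shown below, assuming the standard Luhn algorithm. The names luhn_check_digit, passes_luhn and generate_card are hypothetical and are not part of DataMasque.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes `partial` + digit pass the Luhn checksum."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled first.
    for index, ch in enumerate(reversed(partial)):
        digit = int(ch)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def passes_luhn(number: str) -> bool:
    """Check the final digit against the Luhn digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def generate_card(prefix: str, length: int, generate_luhn: bool = True, rng=None) -> str:
    """Fill the digits after `prefix` randomly, optionally ending in a Luhn digit."""
    rng = rng or random.Random()
    random_digits = length - len(prefix) - (1 if generate_luhn else 0)
    if random_digits < 0:
        raise ValueError("prefix is too long for the requested card length")
    partial = prefix + "".join(rng.choice("0123456789") for _ in range(random_digits))
    if generate_luhn:
        return partial + luhn_check_digit(partial)
    return partial


card = generate_card("4", 16)
print(card, passes_luhn(card))
```

Skipping the check digit (generate_luhn set to false in this sketch) saves one pass over the digits, which is consistent with the small performance difference mentioned for the generate_luhn parameter.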

Applying weightings will produce a more realistic distribution of generated card numbers, at the cost of a slight performance penalty.

Validating Card Numbers

The credit_card mask can be configured to perform different actions when a value that is not a credit card number is encountered. A value is considered a valid credit card number if:

  • It is of integer or string type.
  • It is between 12 and 19 characters long (inclusive).
  • The digits satisfy the Luhn algorithm.

Note that a null value is not covered by these rules, as it is evaluated based on the on_null parameter (see Handling Nulls below).

For string types, validation takes into account the segment_separators argument. Take, for example, the card number 4111111111111111, which is valid (correct length and passes the Luhn checksum). Since segment_separators includes - (dash) by default, the value 4111-1111-1111-1111 would also be considered valid. However, the value 4111_1111_1111_1111 would not be considered a valid credit card number, since _ (underscore) is not in the list of segment_separators.
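The validation rules can be approximated with the Python sketch below. It is an illustration rather than DataMasque's own logic: the function names are hypothetical, and it assumes that separators are removed before the length is checked.

```python
DEFAULT_SEPARATORS = (" ", "/", "-")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_valid_card(value, segment_separators=DEFAULT_SEPARATORS) -> bool:
    """Apply the type, length and Luhn rules to an integer or string value."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        return False
    if isinstance(value, int) and value < 0:
        return False
    text = str(value)
    # Assumption: allowed separators are stripped before counting digits.
    for separator in segment_separators:
        text = text.replace(separator, "")
    return text.isdigit() and 12 <= len(text) <= 19 and luhn_valid(text)


print(is_valid_card("4111-1111-1111-1111"))
print(is_valid_card("4111_1111_1111_1111"))
print(is_valid_card("4111_1111_1111_1111", ["_", " ", "/", "-"]))
```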

Specifying the segment_separators parameter replaces the existing list of segment separators. This means that _ can't be added by itself as a separator; the existing separators must also be specified. For example:

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            segment_separators:
              - "_"
              - " "
              - "/"
              - "-"

When on_invalid is skip, invalid values will be returned as is, i.e. no masking will occur on these values.

When on_invalid is error, the masking run will stop and an error will be displayed in the run log.

When on_invalid is mask, the value will be masked with the imitate mask, which is configured to replace digits in the value. Note that the imitate mask can only be applied to strings. A value that is not a valid credit card number falls back to the imitate mask, but if that value is also not a string, the imitate mask raises an error and the masking run halts. Therefore, setting on_invalid to mask is not a foolproof way of masking any type of data.
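The fallback behaviour can be pictured with the following Python sketch. It is only a stand-in for the imitate mask, written under the assumption that digits are replaced with random digits and all other characters are kept; the function name replace_digits is hypothetical.

```python
import random


def replace_digits(value, rng=None):
    """Swap each digit for a random digit, leaving other characters alone.

    Stand-in for the fallback described above; raises TypeError for
    non-string input, mirroring the string-only restriction.
    """
    if not isinstance(value, str):
        raise TypeError("digit replacement can only be applied to strings")
    rng = rng or random.Random()
    return "".join(rng.choice("0123456789") if ch.isdigit() else ch for ch in value)


print(replace_digits("CARD-1234-UNKNOWN"))
```

In this sketch a value such as a floating point number fails the card validation and then also fails the fallback, which is the situation the paragraph above warns about.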

Handling Nulls

When on_null is skip, null values will be returned as is, i.e. the null value will be retained.

When on_null is error, the masking run will stop and an error will be displayed in the run log.

When on_null is mask, a credit card number will be generated based on the behaviour described in Random Selection and Weightings above. However, if retain_prefix_length is specified then an error will be raised and the masking run will fail, as a null value has no prefix.

Example

This example generates credit card numbers that pass the Luhn checksum, with card issuer set to either MasterCard, Visa, or American Express.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            issuer_names:
              - VISA
              - MASTERCARD
              - AMERICAN EXPRESS
            generate_luhn: true
            pan_format: false

Result

Before (credit_card_number)    After (credit_card_number)
4988418614189936               371006478248634
4429545392235346               5220082637809691
5208475828392947               4284336225480232

This example generates credit card numbers that retain the original card prefix.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: automatic
            generate_card_number: true

Result

Before (credit_card_number)    After (credit_card_number)
371006478248634                3781#####248626
4429545392235346               4259######809342
5208475828392947               52784######480232

This example does not generate card numbers; it only applies PAN formatting. The output_format_choice parameter is set to numeric, so the input format is not retained. Instead, the output is normalized to just the (concealed) numbers.

version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            generate_card_number: false
            pan_format: true
            output_format_choice: numeric

Result

Before (credit_card_number)    After (credit_card_number)
3710-0647-8248-634             371006#####8634
4429 5453 9223 5346            442954######5346
5208475828392947               520847######2947


Brazilian CPF (brazilian_cpf)

This mask provides a method for masking Brazilian CPF numbers using random (but valid) CPF numbers. Parameters can be adjusted to determine which actions to take based on the input value and the desired output.

The conventional format for CPF numbers is XXX.XXX.XXX-XX. This format can be retained, or if the input doesn't adhere to this format, it can be standardised to this format. Alternatively, the input can be forced to display only the digits.
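For readers unfamiliar with CPF numbers, the sketch below shows the conventional formatting together with the standard CPF check-digit calculation (two mod 11 check digits computed from the first nine digits). The function names are hypothetical, and whether DataMasque applies any rules beyond this checksum is not stated here.

```python
def cpf_check_digits(base: str) -> str:
    """Compute the two CPF check digits for a nine digit base number."""
    if len(base) != 9 or not base.isdigit():
        raise ValueError("base must be exactly nine digits")
    digits = [int(ch) for ch in base]
    for _ in range(2):
        # Weights run from len(digits) + 1 down to 2 (10..2, then 11..2).
        weights = range(len(digits) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return f"{digits[9]}{digits[10]}"


def format_cpf(digits: str) -> str:
    """Render eleven digits in the conventional XXX.XXX.XXX-XX layout."""
    if len(digits) != 11 or not digits.isdigit():
        raise ValueError("a CPF number must have exactly eleven digits")
    return f"{digits[:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9:]}"


base = "280012389"
print(format_cpf(base + cpf_check_digits(base)))
```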

Parameters

  • on_null (optional): A string to specify the action to take if the value is null. One of:
    • skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
    • mask: Overwrite the null value with a generated CPF number.
    • error: Raise an error and stop masking.
  • on_invalid (optional): A string to specify the action to take if the value is an invalid CPF number. One of:
    • mask (default): Always overwrite without validating the generated CPF number. If an input value is not a valid CPF number the imitate mask will be used to replace the digits.
    • skip: Skip to the next value, the value remains unchanged.
    • error: Raise an error and stop masking.
  • output_format_choice (optional): A string to specify the desired output format. One of:
    • retained (default): Detect the input format (also based on the input value type i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
    • formatted: Always return with the conventional formatting (XXX.XXX.XXX-XX) if the masked value is of correct length (11).
    • numeric: Always return just the digits, as long as there are only digits in the masked value.

Notes: In the cases where the digits are replaced by the imitate mask:

  • If output_format_choice: formatted, the value must be of correct length (11) to be formatted. If not, an error will be raised.
  • If output_format_choice: numeric and there are non-numeric characters an error will be raised.

Example

In this example, CPF numbers are generated to replace the values in the tax_number column of the employees table. The numbers are forced into the standardised format XXX.XXX.XXX-XX, ensuring that both null and invalid values are replaced for consistency.

version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: employee_id
    rules:
      - column: tax_number
        masks:
          - type: brazilian_cpf
            on_null: mask
            on_invalid: mask
            output_format_choice: formatted

Result

Before (tax_number)    After (tax_number)
280.012.389-38         674.644.623-94
null                   894.101.120-52
74000886967            411.526.176-56
11111111111            088.603.186-96
608.763.852-00         722.798.298-00


Social security number (social_security_number)

This mask provides a method for masking social security numbers using random (but valid) social security numbers. Parameters can be adjusted to determine which actions to take based on the input value and the desired output.

The conventional format for social security numbers is XXX-XX-XXXX. This format can be retained, or if the input doesn't adhere to this format, it can be standardised to this format. Alternatively, the input can be forced to display only the digits.

A value is considered a valid social security number if:

  • It is of integer or string type.
  • It consists of three groups of digits.
  • Group 1 (area number) comprises 3 digits, the value must be in the range of 001 to 899 and should not be 666.
  • Group 2 (group number) comprises 2 digits, the value must be in the range of 01 to 99.
  • Group 3 (serial number) comprises 4 digits, the value must be in the range of 0001 to 9999.
  • These three groups can be written together or separated by either a hyphen (-) or a space.
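The rules listed above can be expressed as a short Python check. This is a sketch rather than DataMasque's implementation: the name is_valid_ssn is hypothetical, a hyphen or space is accepted independently at each separator position, and integer values are converted with str(), so integers that have lost leading zeros are not handled.

```python
import re

# Three digit groups, each optionally separated by a hyphen or a space.
_SSN_PATTERN = re.compile(r"([0-9]{3})[- ]?([0-9]{2})[- ]?([0-9]{4})")


def is_valid_ssn(value) -> bool:
    """Check an integer or string value against the rules listed above."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        return False
    match = _SSN_PATTERN.fullmatch(str(value))
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    return (
        1 <= area <= 899
        and area != 666
        and 1 <= group <= 99
        and 1 <= serial <= 9999
    )


for candidate in ["345-66-5463", "123 45 6789", "999-99-9999", "666-12-3456"]:
    print(candidate, is_valid_ssn(candidate))
```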

Parameters

  • on_null (optional): A string to specify the action to take if the value is null. One of:
    • skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
    • mask: Overwrite the null value with a generated social security number.
    • error: Raise an error and stop masking.
  • on_invalid (optional): A string to specify the action to take if the value is an invalid social security number. One of:
    • mask (default): Always overwrite without validating the generated social security number. If an input value is not a valid social security number the imitate mask will be used to replace the digits.
    • skip: Skip to the next value, the value remains unchanged.
    • error: Raise an error and stop masking.
  • output_format_choice (optional): A string to specify the desired output format. One of:
    • retained (default): Detect the input format (also based on the input value type i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
    • formatted: Always return with the conventional formatting (XXX-XX-XXXX) if the masked value is of correct length (9).
    • numeric: Always return just the digits, as long as there are only digits in the masked value.

Example

In this example, social security numbers are generated to replace the values in the social_number column of the employees table. The numbers are forced into the standardised format XXX-XX-XXXX, ensuring that both null and invalid values are replaced for consistency.

version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: employee_id
    rules:
      - column: social_number
        masks:
          - type: social_security_number
            on_null: mask
            on_invalid: mask
            output_format_choice: formatted

Result

Before (social_number)    After (social_number)
345-66-5463               874-34-4623
null                      654-97-4756
111111111                 036-93-3245
123 45 6789               567-12-0041
999-99-9999               457-17-6816