Numeric Masks

Numeric masks work specifically with number data types.

Random Number
Generates a random integer/decimal between two numbers - supports triangular or uniform distribution
Random Boolean
Generates a random true/false or 1/0 value
Numeric Bucket
Generates replacement numbers whilst retaining specified ranges

Random Number (`from_random_number`)

This mask replaces the column value with a random number. The default number type is integer. If decimal_places is set greater than 0, a decimal number will be generated.

Note: For compatibility with some databases, the generated value is actually of string type. Most database engines automatically convert the string to the appropriate numeric type for the column being masked, so no typecast is required. For file masking however, if you need a numeric value, use a typecast mask with typecast_as set to integer, float, or decimal as appropriate.

Parameters

max (required): The generated number is guaranteed to be less than or equal to this maximum value.
min (optional): This is the minimum value for the generated number. Defaults to 0.
mode (optional): If this parameter is supplied, the number will be chosen from a triangular distribution where 'mode' is the peak. Useful for street numbers etc. (e.g. with min and mode both set to 1 to generate lower street numbers more frequently).
decimal_places (optional): Number of decimal places to generate. Defaults to 0.

Example (integer)

This example replaces the quantity column in the products table with a random integer between 1 and 200.

version: '1.0'
tasks:
  - type: mask_table
    table: products
    key: id
    rules:
      - column: quantity
        masks:
          - type: from_random_number
            min: 1
            max: 200

Show result

Before

After

quantity
30
40
60
100
150
120

quantity
152
108
159
84
88
176

Example (decimal)

This example replaces the price column in the products table with a random number to 2 decimal places between 1 and 200.

version: '1.0'
tasks:
  - type: mask_table
    table: products
    key: id
    rules:
      - column: price
        masks:
          - type: from_random_number
            min: 1
            max: 200
            decimal_places: 2

Show result

Before

After

price
10.00
20.10
50.01
80.99
150.80
199.50

price
101.52
30.89
44.23
93.01
6.11
190.90

Random boolean (`from_random_boolean`)

This mask replaces the column value with a random boolean value (1/0 or true/false). from_random_boolean is effectively equivalent to a from_random_number mask with options max: 1 and min: 0.

Note The return type is an integer which can be automatically cast to the appropriate boolean type in most databases. For file masking and Amazon DynamoDB, literal 1 or 0 will be written to the file, which may not be the intended behaviour.

If literal true/false boolean values are required, please chain from_random_boolean with a typecast:
masks:
  - type: chain
    masks:
      - type: from_random_boolean
      - type: typecast
        typecast_as: boolean
If "true"/"false" strings are required please use the from_choices mask with "true" or "false" values (note the quoting around the values). For more information please refer to from_choices

Example

This example replaces the active column in the products table with random boolean values.

version: '1.0'
tasks:
  - type: mask_table
    table: products
    key: id
    rules:
      - column: active
        masks:
          - type: from_random_boolean

Show result

Before

After

active
true
true
false
false
true
false

active
true
false
false
true
false
true

Numeric Bucket (`numeric_bucket`)

This mask generates random integers from within the same "bucket" (numeric range) as the value to be masked. The mask is set up using an array of integers that define the lower bound of each bucket.

`buckets` Intro

Buckets are a series of defined lower bounds from which replacement values are randomly generated. For example, an age column that should mask values into certain age buckets: 1-17, 18-24, 25-64, 65+. buckets: [18,25,65] are specified, so if a value from the column is 16, the replacement value is randomly generated within the 1-17 range, as opposed to any number.

Capping minimum and maximum values

The numeric bucket mask will generate numbers for the lowest bucket from 0/1 to <bucket> - 1. Similarly, for the highest bucket, numbers will be generated in the range bucket to 2³¹ - 1. This default behaviour may cause undesired outputs which are outside the normal range for the data.

For example, consider a column containing ages. All ages currently in the column are 18 or older. Ages are to be retained in buckets 35 or younger, 36-65, 65+.

A ruleset could be defined with buckets: [36, 65], but this would not function correctly. Ages 35 or lower could be masked to any value from 0-35, therefore the masked age could be below 18, which does not match the application rules. A similar problem exists with ages over 65, the masked value could be any value from 65 to 2³¹ - 1.

To solve this issue, lower and upper bounds should be specified. This requires an understanding of the existing data. Choosing the upper bound would mean selecting a reasonable value that your ages could have. A better ruleset definition could be: buckets: [18, 36, 65, 100]. This would limit the lower age to 18 and upper age to 100.

Parameters

buckets (required): A series of numbers representing the lower bounds of the buckets, including the number e.g. [5, 10, 15 20] would result in ranges 1-4, 5-9, 10-14, 15-19, 20+. Note: buckets should not contain duplicates and should be in ascending order.
force_change (optional): Since the replacement values are chosen randomly, it is possible that the number might be randomly replaced with the same one (for example, 45 is chosen as a replacement for 45). Set force_change to true to ensure the replacement value is not the same as the original value. Note that this makes the output slightly less random as the number of possible replacements is reduced by one.
include_zero (optional): When set to true, will lower the first lower bound to include 0, otherwise the lowest bound will be 1.
scale_to (optional): Provide a number that better represents the scale of your data in the target column to which replacement values will be scaled. e.g., If the scale of your data is 1000 (values are multiples of 1000), then 4583 is not an appropriate replacement for 2000. Set scale_to to 1000 so the replacement values will be multiples of 1000 (4583 -> 5000). scale_to must be a multiple of 10, and the values in buckets must also be multiples of scale_to.

Note: If buckets were defined with a size of one (e.g. buckets: [1,3,5]) with the force_change parameter set to true this would cause an infinite loop. Instead, an error will be raised. If the target column is of a type with a smaller range than 0 - 2³¹-1 (smallint, tinyint, etc.), add a maximum value to the buckets as replacement values could be larger than the column can store. numeric_bucket requires a numeric value as the input. A typecast mask can be used to convert any string values before masking with numeric_bucket. For more information please refer to Typecast.

Example

This example replaces the age column in the users table with random values from the specified buckets.

version: '1.0'
tasks:
  - type: mask_table
    table: users
    key: id
    rules:
      - column: age
        masks:
          - type: numeric_bucket
            buckets: [16,18,25,65,110]
            force_change: true

Show result

Before

After

age
18
24
36
16
43
65

age
22
23
55
17
56
100

Numeric Masks

Random Number (from_random_number)

Parameters

Example (integer)

Example (decimal)

Random boolean (from_random_boolean)

Example

Numeric Bucket (numeric_bucket)

buckets Intro

Capping minimum and maximum values

Parameters

Example

Random Number (`from_random_number`)

Random boolean (`from_random_boolean`)

Numeric Bucket (`numeric_bucket`)

`buckets` Intro