Numeric Masks
Numeric masks work specifically with number data types.
- Random Number
Generates a random integer/decimal between two numbers - supports triangular or uniform distribution - Random Boolean
Generates a random true/false or 1/0 value - Numeric Bucket
Generates replacement numbers whilst retaining specified ranges
Random Number (from_random_number
)
This mask replaces the column value with a random number. The default number type is
integer. If decimal_places
is set greater than 0, a decimal number will be
generated.
Note: For compatibility with some databases, the generated value is actually of string type. Most database engines automatically convert the string to the appropriate numeric type for the column being masked, so no typecast is required. For file masking however, if you need a numeric value, use a typecast mask with
typecast_as
set tointeger
,float
, ordecimal
as appropriate.
Parameters
max
(required): The generated number is guaranteed to be less than or equal to this maximum value.min
(optional): This is the minimum value for the generated number. Defaults to0
.mode
(optional): If this parameter is supplied, the number will be chosen from a triangular distribution where 'mode' is the peak. Useful for street numbers etc. (e.g. withmin
andmode
both set to1
to generate lower street numbers more frequently).decimal_places
(optional): Number of decimal places to generate. Defaults to 0.
Example (integer)
This example replaces the quantity
column in the products table with a random
integer between 1 and 200.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: quantity
masks:
- type: from_random_number
min: 1
max: 200
Show result
Before | After |
|
|
---|
Example (decimal)
This example replaces the price
column in the products table with a random
number to 2 decimal places between 1 and 200.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: price
masks:
- type: from_random_number
min: 1
max: 200
decimal_places: 2
Show result
Before | After |
|
|
---|
Random boolean (from_random_boolean
)
This mask replaces the column value with a random boolean value (1
/0
or
true
/false
). from_random_boolean
is effectively equivalent to a
from_random_number
mask with options max: 1
and min: 0
.
Note The return type is an integer which can be automatically cast to the appropriate boolean type in most databases. For file masking and Amazon DynamoDB, literal
1
or0
will be written to the file, which may not be the intended behaviour.
- If literal
true
/false
boolean values are required, please chainfrom_random_boolean
with atypecast
:
masks: - type: chain masks: - type: from_random_boolean - type: typecast typecast_as: boolean
- If
"true"
/"false"
strings are required please use thefrom_choices
mask with"true"
or"false"
values (note the quoting around the values). For more information please refer tofrom_choices
Example
This example replaces the active
column in the products table with random
boolean values.
version: '1.0'
tasks:
- type: mask_table
table: products
key: id
rules:
- column: active
masks:
- type: from_random_boolean
Show result
Before | After |
|
|
---|
Numeric Bucket (numeric_bucket
)
This mask generates random integers from within the same "bucket" (numeric range) as the value to be masked. The mask is set up using an array of integers that define the lower bound of each bucket.
buckets
Intro
Buckets
are a series of defined lower bounds from which replacement values are randomly generated.
For example, an age
column that should mask values into certain age buckets
: 1-17, 18-24, 25-64, 65+. buckets: [18,25,65]
are specified, so if a value from the column is 16
, the replacement value is randomly generated within the 1-17
range,
as opposed to any number.
Capping minimum and maximum values
The numeric bucket mask will generate numbers for the lowest bucket from 0/1 to <bucket> - 1
. Similarly, for the highest
bucket, numbers will be generated in the range bucket
to 231 - 1. This default behaviour may cause undesired
outputs which are outside the normal range for the data.
For example, consider a column containing ages. All ages currently in the column are 18 or older. Ages are to be retained in buckets 35 or younger, 36-65, 65+.
A ruleset could be defined with buckets: [36, 65]
, but this would not function correctly. Ages 35 or lower could be masked
to any value from 0-35, therefore the masked age could be below 18, which does not match the application rules. A similar
problem exists with ages over 65, the masked value could be any value from 65 to 231 - 1.
To solve this issue, lower and upper bounds should be specified. This requires an understanding of the existing data. Choosing
the upper bound would mean selecting a reasonable value that your ages could have. A better ruleset definition could be: buckets: [18, 36, 65, 100]
.
This would limit the lower age to 18 and upper age to 100.
Parameters
buckets
(required): A series of numbers representing the lower bounds of the buckets, including the number e.g.[5, 10, 15 20]
would result in ranges 1-4, 5-9, 10-14, 15-19, 20+. Note:buckets
should not contain duplicates and should be in ascending order.force_change
(optional): Since the replacement values are chosen randomly, it is possible that the number might be randomly replaced with the same one (for example,45
is chosen as a replacement for45
). Setforce_change
totrue
to ensure the replacement value is not the same as the original value. Note that this makes the output slightly less random as the number of possible replacements is reduced by one.include_zero
(optional): When set totrue
, will lower the first lower bound to include 0, otherwise the lowest bound will be 1.scale_to
(optional): Provide a number that better represents the scale of your data in the target column to which replacement values will be scaled. e.g., If the scale of your data is 1000 (values are multiples of 1000), then 4583 is not an appropriate replacement for 2000. Setscale_to
to1000
so the replacement values will be multiples of1000
(4583 -> 5000).scale_to
must be a multiple of 10, and the values inbuckets
must also be multiples ofscale_to
.
Note: If
buckets
were defined with a size of one (e.g.buckets: [1,3,5]
) with theforce_change
parameter set totrue
this would cause an infinite loop. Instead, an error will be raised. If the target column is of a type with a smaller range than 0 - 231-1 (smallint, tinyint, etc.), add a maximum value to thebuckets
as replacement values could be larger than the column can store.numeric_bucket
requires a numeric value as the input. Atypecast
mask can be used to convert any string values before masking withnumeric_bucket
. For more information please refer to Typecast.
Example
This example replaces the age
column in the users
table with random values from the specified buckets
.
version: '1.0'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: age
masks:
- type: numeric_bucket
buckets: [16,18,25,65,110]
force_change: true
Show result
Before | After |
|
|
---|