vignettes/introduction.Rmd
introduction.Rmd
This package implements methods to provide perturbation for
statistical tables. The implementation if greatly inspired by the the
paper Methodology for the Automatic Confidentialisation of
Statistical Outputs from Remote Servers at the Australian Bureau of
Statistics (Thompson, Broadfoot, Elazar). This approach however was
generalized and a new specification on how record keys are specified and
the lookup-tables are defined is used. This package makes usage of
perturbation tables that can be defined in the ptable
package. This document describes the usage of version 1.0.2
of the package.
In the cellKey package it is possible to pertub
multidimensional count and magnitude tables. Functionality to generate
suitable record keys is provided in function
ck_generate_rkeys()
. Using sha1-checksums on the input
datasets, we can make sure that the same record keys are generated
whenever the same input data set is used for which record keys should be
generated. However, it is also possible of course to use already
pre-generated record keys available as variable in the microdata
set.
The package allows to make use of sampling weights. Thus, both
weighted and unweighted count and magnitude tables can be perturbed. The
hierarchical structure of the variables spanning the table can be
arbitrarily complex. To generate these hierarchies, the
cellKey packages depends on functionality available in
package sdcHierarchies.
Finally, auxiliary methods are provided that allow to extract valuable
information from objects created by the main function of this package
perturbTable()
. For example mod_counts()
returns a data object showing for each cell the perturbation value and
where it was found in the lookup table.
What is left is of course identifying bugs and issues and also to optimize performance of the package as well as to include some real world examples, eg. census tables.
We now show the capabilities of the cellKey package by running an example that
library(cellKey)
packageVersion("cellKey")
## [1] '1.0.2'
The first step in this approach is to generate a statistical table,
which can be achieved using ck_setup
. This function
generates an object that contains all the information required to
perturb count- (and optionally continuously scaled) variables. There are
however a few inputs that ck_setup()
requires and that need
to be generated beforehand. These inputs are:
x
: a data.frame
or data.table
containing microdatarkey
: defines where to find or how to compute record
keys. It is possible to either specify a scalar character name or a
number. If a number is specified, record keys are internally created by
sampling from a uniform distribution between 0 and 1 with the random
numbers rounded to the specified value of digits. If a name is
specified, this name is interpreted as a variable name within
x
already containing record keys.dims
: a named list where each list element contains a
hierarchy created with functionality from package sdcHierarchies
and each name refers to a variable in x
w
: either NULL
(no weights) or the name of
a variable in x
that contains weightscountvars
: an optional character vector of variables in
x
holding counts. In any case a special count variable
total
is internally generated that is 1
for
each row of x
numvars
: an optional character vector of variables in
x
holding numerical variables that can later be
perturbed.We continue to show how to generate the required inputs. The first step is to prepare the inputdata.
The testdata set we are using in this example contains information on person-level along with sampling weights as well as some categorical and continuously scaled variables. We also compute some binary variables for which we also want to create perturbed tables.
dat <- ck_create_testdata()
dat <- dat[, c("sex", "age", "savings", "income", "sampling_weight")]
dat[, cnt_highincome := ifelse(income >= 9000, 1, 0)]
We are now adding record keys (which will be referenced later) with 7
digits to the data set using function ck_generate_rkeys
as
shown below:
dat$rkeys <- ck_generate_rkeys(dat = dat, nr_digits = 7)
print(head(dat))
## sex age savings income sampling_weight cnt_highincome rkeys
## 1: male age_group3 12 5780 57 0 0.6705526
## 2: female age_group3 28 2530 63 0 0.0357946
## 3: male age_group1 550 6920 35 0 0.1339016
## 4: male age_group1 870 7960 81 0 0.9849310
## 5: male age_group4 20 9030 93 1 0.2941448
## 6: female age_group3 102 3290 75 0 0.7070011
To ensure that the same record keys are computed each and every time
for the same data set, a seed based on the sha1-hash of the input
dataset is computed by default in ck_generate_rkeys
. This
seed is used before sampling the record keys and may be overwritten
using the seed
argument.
The goal of this introduction is to create a perturbed table of
counts of variables sex
by age
for all
observations as well as for the subgroups given by
cnt_highincome
that are non-zero. We also want to create
perturbed tables of continously scaled variables savings
and income
also giving the hierarchical structure defined
by sex
by age
.
It is required to define hierarchies for each of the classifying
variables of the desired table. There are two ways how the hierarchical
structure of these variables (including sub-totals) can be specified.
One way is to use "@; value
format as (also) used in sdcTable.
We would suggest however, to use the second alternative which is using
functionality from package sdcHierarchies.
In this vignette, we only show the preferred way to generate hierarchies
for categorical variables age
and sex
below:
dim_sex <- hier_create(root = "Total", nodes = c("male", "female"))
hier_display(dim_sex)
## Total
## ├─male
## └─female
For variable age
the process is very much the same:
dim_age <- hier_create(root = "Total", nodes = paste0("age_group", 1:6))
hier_display(dim_age)
## Total
## ├─age_group1
## ├─age_group2
## ├─age_group3
## ├─age_group4
## ├─age_group5
## └─age_group6
The idea of sdcHierarchies
is to create a tree object using hier_create()
and then add
(using hier_add()
), delete (with
hier_delete()
) or rename (using hier_rename()
)
elements from this hierarchical structure. For more examples have a look
at the package vignette of the package which can be accessed with
sdcHierarchies::hier_vignette()
. sdcHierarchies
also contains a shiny-based app that can be called with
hier_app()
which allows to interactively change and modify
a hierarchy and allows to convert tree- and data.frame based inputs into
various formats. For more information, have a look at
?hier_app
and the other help-files of the package.
After all dimensions have been specified, these inputs must then be
combined into a named list. In this object, the list-names refer to
variable names of the input data set and the elements the data objects
that hold the hierarchy specification itself. This is why the list
elements of dims
in this example have to be named
sex
and age
as the specification refers to
variables age
and sex
in input set
dat
.
dims <- list(sex = dim_sex, age = dim_age)
We now have prepared the inputs and can define a generic statistical
table using ck_setup()
as shown below:
tab <- ck_setup(
x = dat,
rkey = "rkeys",
dims = dims,
w = "sampling_weight",
countvars = "cnt_highincome",
numvars = c("income", "savings"))
## computing contributing indices | rawdata <--> table; this might take a while
ck_setup()
returns a R6
class object that
contains not only the relevant data but also all available methods.
Thus, it is not required to assign the results of such methods to new
objects, instead, the object itself is automatically updated.
These objects also have a custom print method, showing some general information about the object:
print(tab)
## ── Table Information ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ✔ 21 cells in 2 dimensions ('sex', 'age')
## ✔ weights: yes
## ── Tabulated / Perturbed countvars ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ☐ 'total'
## ☐ 'cnt_highincome'
## ── Tabulated / Perturbed numvars ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ☐ 'income'
## ☐ 'savings'
The next task is to define parameters that are used to perturb count
variables which can be achieved with ck_params_cnts
. This
function requires as input the result of either
pt_create_pParams
, pt_create_pTable
or
create_cnt_ptable
from the ptable
package. Please refer also to the documentation of this package for
information on the required parameters. In this example we are going to
use - amongst others - exemplary ptables that can are provided by the
ptable
-pkg for demonstration purposes:
# two different perturbation parameter sets from the ptable-pkg
# an example ptable provided directly
ptab1 <- ptable::pt_ex_cnts()
# creating a ptable by specifying parameters
para2 <- ptable::create_cnt_ptable(
D = 8, V = 3, js = 2, pstay = 0.5,
optim = 1, mono = TRUE)
We then need to create the required inputs for the cellKey package.
p_cnts1 <- ck_params_cnts(ptab = ptab1)
p_cnts2 <- ck_params_cnts(ptab = para2)
ck_params_cnts()
returns objects that can be used as
inputs in method params_cnts_set()
. In argument
v
one may specify count variables for which the supplied
perturbation parameters should be used. If v
is not
specified, the perturbation parameters are used for all count
variables.
# use `p_cnts1` for variable "total" (which always exists)
tab$params_cnts_set(val = p_cnts1, v = "total")
## --> setting perturbation parameters for variable 'total'
# use `p_cnts2` for "cnt_highincome"
tab$params_cnts_set(val = p_cnts2, v = "cnt_highincome")
## --> setting perturbation parameters for variable 'cnt_highincome'
It is therefore entirely possible to use different parameter sets for
different variables. Modifying perturbation parameters for some
variables is easy, too. It is only required to apply the
params_cnts_set()
-method again which will replace any
previously defined parameters.
Setting and defining perturbation parameters for continuous variables
works similarily. The required functions are
ck_params_num()
to create input objects that can be set
with the params_nums_set
method. Please note that it is
possibly by specifying the path
argument in both
ck_params_nums()
and ck_params_cnts()
to save
the parameters additionally as yaml-file. Using
ck_read_yaml()
, these files can later be imported again.
This feature is useful for re-using parameter settings.
The underlying framework on how to perturb continuous tables differs
from the proposed method from ABS
. One possible approach is
based on a “flex function”. This approach (which is described
in deliverable D4.2
in the project perturbative
confidentiality methods allows to apply different magnitude of noise
to larger and smaller cells. Users can define the required parameters
for the flex-approach with function ck_flexparams()
. The
required inputs are:
fp
: the flexpoint defining at which point should the
underlying noise coefficient function reach its desired maximum (which
is defined by the first element of p
)p
: numeric vector of length 2
with
p[1] > p[2]
where both elements specify a percentage.
The first value refers to the desired maximum perturbation percentage
for small cells (depending on fp) while the second element refers to the
desired maximum perturbation percentage for large cells.epsilon
: a numeric vector in descending order with all
values in [0; 1]
and with the first element forced to equal
1
. The length of this parameter must correspond with the
number of top_k
specified in ck_params_nums()
(which will be discussed later).
# parameters for the flex-function
p_flex <- ck_flexparams(
fp = 1000,
p = c(0.3, 0.03),
epsilon = c(1, 0.5, 0.2))
In the cellKey
package it is possible to select the underlying data that form the base
for the perturbation differently. In ck_params_nums()
the
specific approach can be selected in argument type
. The
valid choices for this argument are:
"top_contr"
: the k
largest contributions
to each cell are used in the perturbation procedure with the number
k
required to be specified in argument
top_k
"mean"
: weighted cellmeans are used as starting
points"range"
: the difference between largest and smallest
unweighted contributions for each cell are used as base for the
perturbation procedure"sum"
: weighted cellvalues are used as starting points
for the perturbationAnother, more basic approach, is to use a constant perturbation
magnitude for all cells, independent on their (weighted) values. The
required parameters can be defined with ck_simpleparams()
as shown below:
# parameters for the simple approach
p_simple <- ck_simpleparams(
p = 0.05,
epsilon = 1)
In this appraoch it is only required to specify a single percentage
value p
and - as in the case for the flex function - a
vector of epsilons that are used in the case when
top_k > 1
.
Further important parameters for ck_params_nums()
are:
mu_c
: an extra amount of perturbation applied to
sensitive cells (restricted to the first of top_k
noise
components). In the following example we demonstrate how to identify
sensitive cells for numeric variables.same_key
: a logical value specifying if the original
cell key (TRUE
) should be used for the lookup of the
largest contributor of a cell or if a perturbation of the cellkey itself
(FALSE
) should take place.use_zero_rkeys
: a logical value defining if record keys
of units not contributing to a specific numeric variables should be used
(TRUE
) or ignored (FALSE
) when cell keys are
computed.A very important parameter is ptab
which actually holds
the perturbation tables in which perturbation values are looked up. This
input can be specified differently in the case when numeric variables
should be perturbed. It can be either an object derived from
ptable::pt_create_pTable(..., table = "nums")
in the most
simple case. More advanced is to supply a named list, where the allowed
names are shown below and each element must be the output of
ptable::pt_create_pTable(..., table = "nums")
.
"all"
: this ptable will be used for all cells; if
specified, list-elements named "even"
or "odd"
are ignored"even"
: this perturbation table will be used to look up
perturbation values for cells with an even number of contributors"odd"
: will be used to look up perturbation values for
cells with an odd number of contributors"small_cells"
: if specified, this ptable will be used
to extract perturbation values for very small cellsPlease note, that if the goal is to have different perturbation
tables for cells with an even/odd number of contributors, both
"even"
or "odd"
must be available in the input
list. In the chunk below we create four different perturbation tables.
For details on the parameters, please look at the documentation of the
ptable
package, especially in ptable::create_num_ptable()
.
# same ptable for all cells except for very small ones
ex_ptab1 <- ptable::pt_ex_nums(parity = TRUE, separation = TRUE)
We can now use these tables to finally create objects containing all
the required information to create perturbed magnitude tables using
ck_params_nums
. In the first case we want the same
perturbation table (ptab_all
) for cells with an even/odd
number of contributors but want to use ptab_sc
for very
small cells.
p_nums1 <- ck_params_nums(
type = "top_contr",
top_k = 3,
ptab = ex_ptab1,
mult_params = p_flex,
mu_c = 2,
same_key = FALSE,
use_zero_rkeys = TRUE)
The second input we generate should use different ptables for cells
with an even/odd number of contributing units (ptab_even
and ptab_odd
) but should not use a specific perturbation
table for very small cells.
ex_ptab2 <- ptable::pt_ex_nums(parity = FALSE, separation = FALSE)
As above, we need to use ck_params_nums()
to compute
suitable inputs.
p_nums2 <- ck_params_nums(
type = "mean",
ptab = ex_ptab2,
mult_params = p_simple,
mu_c = 1.5,
same_key = FALSE,
use_zero_rkeys = TRUE)
The package internally computes the separation point that is used for
very small cells in case this is required. Details on this can also be
found in deliverable D4.2
.
Now we can attach the results from ck_params_nums()
to
numeric variables using the params_nums_set()
-method as
shown below:
tab$params_nums_set(v = "income", val = p_nums1)
## --> setting perturbation parameters for variable 'income'
tab$params_nums_set(v = "savings", val = p_nums1)
## --> setting perturbation parameters for variable 'savings'
In order to make use of parameter mu_c
that allows ab
add extra amount of protection to sensitive cells, one may identify
sensitive cells according to some rules. The following methods to
identify sensitive cells are implemented:
supp_p()
: identify sensitive cells based on
p%-rulesupp_pp()
: identify sensitive cells based on
pq%-rulesupp_nk()
: identify sensitive cells based on
nk-dominance rulesupp_freq()
: identify sensitive cells based on minimal
frequencies for (weighted) number of contributorssupp_val()
: identify sensitive cells based on
(weighted) cell valuessupp_cells()
: identify sensitive cells based on their
“names”We now want to set all cells for variable income
as
sensitive to which less than 15
units contribute.
tab$supp_freq(v = "income", n = 15, weighted = FALSE)
## freq-rule: 3 new sensitive cells (incl. duplicates) found (total: 3)
To set specific cells independent on values but their names, one may
use the $supp_cells()
-method. This cell requires a
data.frame
as input that contains a column for each
dimensional variable specified. Each row of this input is considered as
a cell where NAs
are used as placeholders that match any
characteristic of the relevant variable. Using the
data.frame
inp
show below, the programm would
suppress the following cells:
female
x age_group1
male
x age_group3
male
x any age group available in the data
inp <- data.frame(
"sex" = c("female", "male", "male"),
"age" = c("age_group1", "age_group3", NA)
)
It is now possible to finally perturbed variables using the
perturb()
-method. As tab
- the object created
with ck_setup()
- already contains all possible data, the
only required input is the name of a variable that should be
perturbed.
v
: a character vector specifying one or more variables
that should be perturbed.
tab$perturb(v = "total")
## Count variable 'total' was perturbed.
After this call, object tab
is updated and contains now
also perturbed values for variable total
. We note that no
explicit assignment is required. The following code shows that we also
can perturb cnt- and numerical variables in one single call.
tab$perturb(v = c("cnt_highincome", "savings", "income"))
## Count variable 'cnt_highincome' was perturbed.
## Numeric variable 'savings' was perturbed.
## Numeric variable 'income' was perturbed.
A data.table
containing original and perturbed values
can now be extracted using the freqtab()
- and
numtab()
methods as discussed next.
Applying the freqtab()
-method to one ore more already
perturbed variables returns a data.table
that contains for
each table cell the unpertubed and perturbed (weighted and/or
unweighted) counts. This function has the following arguments:
v
: one or more variable names of already perturbed
count variablespath
: if not NULL
, a (relative or
absolute) path to which the resulting output table should be written. A
csv
file will be generated and .csv
will be
appended to the value provided.The method returns a data.table
with all combinations of
the dimensional variables in the first \(n\) columns and after those the following
columns:
vname
: name of the perturbed variableuwc
: unweighted countswc
: weighted countspuwc
: perturbed unweighted countspwc
: perturbed weighted counts
tab$freqtab(v = c("total", "cnt_highincome"))
## sex age vname uwc wc puwc pwc
## 1: Total Total total 4580 275982 4578 275861.4838
## 2: Total age_group1 total 1969 119446 1968 119385.3367
## 3: Total age_group2 total 1143 68235 1144 68294.6982
## 4: Total age_group3 total 864 52074 863 52013.7292
## 5: Total age_group4 total 423 25034 422 24974.8180
## 6: Total age_group5 total 168 10226 168 10226.0000
## 7: Total age_group6 total 13 967 12 892.6154
## 8: male Total total 2296 138348 2296 138348.0000
## 9: male age_group1 total 1015 61597 1017 61718.3734
## 10: male age_group2 total 571 33355 572 33413.4151
## 11: male age_group3 total 424 25776 424 25776.0000
## 12: male age_group4 total 195 11749 194 11688.7487
## 13: male age_group5 total 84 5349 84 5349.0000
## 14: male age_group6 total 7 522 7 522.0000
## 15: female Total total 2284 137634 2284 137634.0000
## 16: female age_group1 total 954 57849 953 57788.3616
## 17: female age_group2 total 572 34880 570 34758.0420
## 18: female age_group3 total 440 26298 440 26298.0000
## 19: female age_group4 total 228 13285 226 13168.4649
## 20: female age_group5 total 84 4877 86 4993.1190
## 21: female age_group6 total 6 445 6 445.0000
## 22: Total Total cnt_highincome 445 27091 443 26969.2427
## 23: Total age_group1 cnt_highincome 192 11811 192 11811.0000
## 24: Total age_group2 cnt_highincome 123 7279 120 7101.4634
## 25: Total age_group3 cnt_highincome 82 5136 82 5136.0000
## 26: Total age_group4 cnt_highincome 34 2133 36 2258.4706
## 27: Total age_group5 cnt_highincome 14 732 13 679.7143
## 28: Total age_group6 cnt_highincome 0 0 0 0.0000
## 29: male Total cnt_highincome 219 13013 219 13013.0000
## 30: male age_group1 cnt_highincome 90 5414 94 5654.6222
## 31: male age_group2 cnt_highincome 66 3743 66 3743.0000
## 32: male age_group3 cnt_highincome 41 2486 41 2486.0000
## 33: male age_group4 cnt_highincome 15 961 17 1089.1333
## 34: male age_group5 cnt_highincome 7 409 7 409.0000
## 35: male age_group6 cnt_highincome 0 0 0 0.0000
## 36: female Total cnt_highincome 226 14078 227 14140.2920
## 37: female age_group1 cnt_highincome 102 6397 102 6397.0000
## 38: female age_group2 cnt_highincome 57 3536 58 3598.0351
## 39: female age_group3 cnt_highincome 41 2650 42 2714.6341
## 40: female age_group4 cnt_highincome 19 1172 15 925.2632
## 41: female age_group5 cnt_highincome 7 323 7 323.0000
## 42: female age_group6 cnt_highincome 0 0 0 0.0000
## sex age vname uwc wc puwc pwc
Using the numtab()
-method allows to extract results for
continuous variables. The required inputs are the same as for the
freqtab()
-method and the output returns a
data.table
with all combinations of the dimensional
variables in the first \(n\) columns
and the following additional columns:
vname
: name of the perturbed variableuws
: unweighted sumws
: weighted cellsumpws
: perturbed weighted sum of the given cellWe now have a look at the results of the variables
savings
and income
that we already have
perturbed.
tab$numtab(v = c("savings", "income"))
## sex age vname uws ws pws
## 1: Total Total savings 2273532 138166346 138166924.9
## 2: Total age_group1 savings 982386 60171300 60172051.2
## 3: Total age_group2 savings 552336 33312736 33311653.4
## 4: Total age_group3 savings 437101 26261183 26264415.9
## 5: Total age_group4 savings 214661 12937274 12935793.6
## 6: Total age_group5 savings 80451 4988523 4984467.5
## 7: Total age_group6 savings 6597 495330 496738.1
## 8: male Total savings 1159816 70528606 70525105.2
## 9: male age_group1 savings 517660 31738679 31741936.3
## 10: male age_group2 savings 280923 16655873 16668562.8
## 11: male age_group3 savings 214970 13142789 13148709.2
## 12: male age_group4 savings 99420 6029194 6040481.6
## 13: male age_group5 savings 43233 2703643 2699243.4
## 14: male age_group6 savings 3610 258428 256360.6
## 15: female Total savings 1113716 67637740 67633186.4
## 16: female age_group1 savings 464726 28432621 28431887.0
## 17: female age_group2 savings 271413 16656863 16656779.8
## 18: female age_group3 savings 222131 13118394 13117084.5
## 19: female age_group4 savings 115241 6908080 6908527.0
## 20: female age_group5 savings 37218 2284880 2285759.5
## 21: female age_group6 savings 2987 236902 231994.3
## 22: Total Total income 22952978 1383445632 1383452773.6
## 23: Total age_group1 income 9810547 595005901 595007699.0
## 24: Total age_group2 income 5692119 340608263 340599838.0
## 25: Total age_group3 income 4406946 266041766 266036662.8
## 26: Total age_group4 income 2133543 125501283 125497838.2
## 27: Total age_group5 income 848151 51585356 51565854.5
## 28: Total age_group6 income 61672 4703063 4765960.5
## 29: male Total income 11262049 673726952 673689706.5
## 30: male age_group1 income 4877164 293752206 293784579.0
## 31: male age_group2 income 2811379 163198627 163261260.8
## 32: male age_group3 income 2168169 130620890 130663591.6
## 33: male age_group4 income 978510 58254840 58295283.6
## 34: male age_group5 income 393134 25456993 25425184.6
## 35: male age_group6 income 33693 2443396 2440015.0
## 36: female Total income 11690929 709718680 709672641.3
## 37: female age_group1 income 4933383 301253695 301249538.7
## 38: female age_group2 income 2880740 177409636 177407183.1
## 39: female age_group3 income 2238777 135420876 135380210.5
## 40: female age_group4 income 1155033 67246443 67254774.8
## 41: female age_group5 income 455017 26128363 26142013.3
## 42: female age_group6 income 27979 2259667 2189399.8
## sex age vname uws ws pws
Method measures_cnts()
allows to compute information
loss measures for perturbed count variables. Its application is as
simple as:
tab$measures_cnts(v = "total", exclude_zeros = TRUE)
## $overview
## noise cnt pct
## 1: -2 2 0.0952381
## 2: -1 2 0.0952381
## 3: 0 8 0.3809524
## 4: 1 6 0.2857143
## 5: 2 3 0.1428571
##
## $measures
## what d1 d2 d3
## 1: Min 0.000 0.000 0.000
## 2: Q10 0.000 0.000 0.000
## 3: Q20 0.000 0.000 0.000
## 4: Q30 0.000 0.000 0.000
## 5: Q40 1.000 0.000 0.011
## 6: Mean 0.857 0.006 0.026
## 7: Median 1.000 0.001 0.015
## 8: Q60 1.000 0.001 0.017
## 9: Q70 1.000 0.002 0.024
## 10: Q80 2.000 0.003 0.036
## 11: Q90 2.000 0.009 0.066
## 12: Q95 2.000 0.024 0.108
## 13: Q99 2.000 0.066 0.135
## 14: Max 2.000 0.077 0.141
##
## $cumdistr_d1
## cat cnt pct
## 1: 0 8 0.3809524
## 2: 1 16 0.7619048
## 3: 2 21 1.0000000
##
## $cumdistr_d2
## cat cnt pct
## 1: [0,0.02] 19 0.9047619
## 2: (0.02,0.05] 20 0.9523810
## 3: (0.05,0.1] 21 1.0000000
## 4: (0.1,0.2] 21 1.0000000
## 5: (0.2,0.3] 21 1.0000000
## 6: (0.3,0.4] 21 1.0000000
## 7: (0.4,0.5] 21 1.0000000
## 8: (0.5,Inf] 21 1.0000000
##
## $cumdistr_d3
## cat cnt pct
## 1: [0,0.02] 13 0.6190476
## 2: (0.02,0.05] 18 0.8571429
## 3: (0.05,0.1] 19 0.9047619
## 4: (0.1,0.2] 21 1.0000000
## 5: (0.2,0.3] 21 1.0000000
## 6: (0.3,0.4] 21 1.0000000
## 7: (0.4,0.5] 21 1.0000000
## 8: (0.5,Inf] 21 1.0000000
##
## $false_zero
## [1] 0
##
## $false_nonzero
## [1] 0
##
## $exclude_zeros
## [1] TRUE
This function returns a list with several utility measures that are
now discussed. For further information have a look at
?ck_cnt_measures
as the same set of measures can also be
computed for two vectors containing original and perturbed values.
overview
: a data.table
with the following
three columns:
measures
: a data.table
containing measures
of the distribution of three different distances between original and
perturbed values of the unweighted counts. Column what
specifies the computed measure. The three distances considered are:
d1
: absolute distance between original and masked
valuesd2
: relative absolute distance between original and
masked valuesd3
: absolute distance between square-roots of original
and perturbed valuescumdistr_d1
, cumdistr_d2
and
cumdistr_d3
: for each distance d1, d2 and d3, a data.table
with the following three columns:
cat
: a specific value (for d1) or interval (for
distances d2 and d3)cnt
: number of records smaller or equal the value in
column cat for the given distancepct
: proportion of records smaller or equal the value
in column cat for the selected distancefalse_zero
: number of cells that were perturbed to
zerofalse_nonzero
: number of cells that were initially zero
but have been perturbed to a number different from zeroIf argument exclude_zeros
is TRUE
(the
default setting), empty cells are excluded when computing distance-based
measures d1
, d2
and d3
.
Finally we note that there are print
and
summary
methods implemented for objects created with from
ck_setup()
which can be used as shown below:
The print()
-method shows the dimension of the table as
well as the variables that are (or possibly can be) perturbed. It is
also displayed if the table was constructed using weights.
tab$print() # same as (print(tab))
## ── Table Information ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ✔ 21 cells in 2 dimensions ('sex', 'age')
## ✔ weights: yes
## ── Tabulated / Perturbed countvars ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ☒ 'total' (perturbed)
## ☒ 'cnt_highincome' (perturbed)
## ── Tabulated / Perturbed numvars ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ☒ 'income' (perturbed)
## ☒ 'savings' (perturbed)
The summary
-method shows some utility statistics for
already perturbed variables as shown below:
tab$summary()
## ┌──────────────────────────────────────────────┐
## │Utility measures for perturbed count variables│
## └──────────────────────────────────────────────┘
## ── Distribution statistics of perturbations ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## countvar Min Q10 Q20 Q30 Q40 Mean Median Q60 Q70 Q80 Q90 Q95 Q99 Max
## 1: total -2 -2 -1 -1 -1 -0.286 0 0 0 0 1 2 2.0 2
## 2: cnt_highincome -4 -2 0 0 0 0.048 0 0 0 1 2 2 3.6 4
##
## ── Distance-based measures ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ✔ Variable: 'total'
##
## what d1 d2 d3
## 1: Min 0.000 0.000 0.000
## 2: Q10 0.000 0.000 0.000
## 3: Q20 0.000 0.000 0.000
## 4: Q30 0.000 0.000 0.000
## 5: Q40 1.000 0.000 0.011
## 6: Mean 0.857 0.006 0.026
## 7: Median 1.000 0.001 0.015
## 8: Q60 1.000 0.001 0.017
## 9: Q70 1.000 0.002 0.024
## 10: Q80 2.000 0.003 0.036
## 11: Q90 2.000 0.009 0.066
## 12: Q95 2.000 0.024 0.108
## 13: Q99 2.000 0.066 0.135
## 14: Max 2.000 0.077 0.141
##
## ✔ Variable: 'cnt_highincome'
##
## what d1 d2 d3
## 1: Min 0.000 0.000 0.000
## 2: Q10 0.000 0.000 0.000
## 3: Q20 0.000 0.000 0.000
## 4: Q30 0.000 0.000 0.000
## 5: Q40 0.000 0.000 0.000
## 6: Mean 1.167 0.033 0.089
## 7: Median 1.000 0.004 0.040
## 8: Q60 1.000 0.019 0.068
## 9: Q70 1.900 0.024 0.130
## 10: Q80 2.000 0.053 0.156
## 11: Q90 3.300 0.090 0.221
## 12: Q95 4.000 0.145 0.285
## 13: Q99 4.000 0.197 0.446
## 14: Max 4.000 0.211 0.486
##
## ┌──────────────────────────────────────────────────┐
## │Utility measures for perturbed numerical variables│
## └──────────────────────────────────────────────────┘
## ── Distribution statistics of perturbations ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## vname Min Q10 Q20 Q30 Q40 Mean Median Q60 Q70 Q80 Q90 Q95 Q99 Max
## 1: income -70267.150 -40665.472 -31808.440 -8425.046 -4156.270 -24.711 -3381.04 1798.040 8331.760 32372.963 42701.575 62633.80 62844.78 62897.53
## 2: savings -4907.719 -4399.562 -3500.797 -1480.407 -1082.593 584.672 -83.23 578.866 879.508 3232.861 5920.215 11287.64 12409.40 12689.84
The package is “work in progress” and therefore, suggestions and/or bugreports are welcome. Please feel free to file an issue at our issue tracker or contribute to the package by filing a pull request against the master branch.