Calculate information loss after targeted record swapping using both the original and the swapped micro data. Information loss is calculated on table counts defined by the parameter `table_vars`, using either the implemented information loss measures, i.e. absolute deviation, relative absolute deviation and absolute deviation of square roots, or a custom metric. See details below.
infoLoss(
data,
data_swapped,
table_vars,
metric = c("absD", "relabsD", "abssqrtD"),
custom_metric = NULL,
hid = NULL,
probs = sort(c(seq(0, 1, by = 0.1), 0.95, 0.99)),
quantvals = c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf),
apply_quantvals = c("relabsD", "abssqrtD"),
exclude_zeros = FALSE,
only_inner_cells = FALSE
)
`data`: original micro data set; must be either a `data.table` or `data.frame`.
`data_swapped`: micro data set after targeted record swapping was applied; must be either a `data.table` or `data.frame`.
`table_vars`: column names present in both `data` and `data_swapped`. Defines the variables over which a (multidimensional) frequency table is constructed. Information loss is then calculated by applying the metrics in `metric` and `custom_metric` to the cell counts and margin counts of the tables built from `data` and `data_swapped`.
`metric`: character vector containing one or more of the already implemented metrics: "absD", "relabsD" and/or "abssqrtD".
`custom_metric`: function or (named) list of functions. Functions defined here must be of the form `fun(x,y,...)` where `x` and `y` expect numeric values of the same length. The output of these functions must be a numeric vector of the same length as `x` and `y`; see the sketch after this argument list.
`hid`: `NULL` or character containing the household id in `data` and `data_swapped`. If not `NULL`, frequencies reflect the number of households, otherwise the number of persons.
`probs`: numeric vector containing values in the interval [0,1].
`quantvals`: optional numeric vector which defines the groups used for the cumulative outputs. It is applied to the result `m` of each information loss metric as `cut(m, breaks = quantvals, include.lowest = TRUE)`; see also the return values.
`apply_quantvals`: character vector defining for which metrics' output `quantvals` should be applied.
`exclude_zeros`: `TRUE` or `FALSE`; if `TRUE`, cells which are zero in the frequency table built from `data_swapped` are ignored.
`only_inner_cells`: `TRUE` or `FALSE`; if `TRUE`, only the inner cells of the frequency table defined by `table_vars` are compared. Otherwise all table margins are calculated as well.
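The built-in metrics correspond to simple cell-wise comparisons, and a custom metric must follow the same `fun(x, y, ...)` form. The following sketch is purely illustrative: the `*_sketch` functions only reflect the presumed definitions behind "absD", "relabsD" and "abssqrtD" (absolute deviation, relative absolute deviation, absolute deviation of square roots), and `maxScaledD` is a hypothetical example of a valid custom metric, not part of the package.

# presumed cell-wise definitions of the built-in metrics (illustration only)
absD_sketch     <- function(x, y) abs(x - y)
relabsD_sketch  <- function(x, y) abs(x - y) / x
abssqrtD_sketch <- function(x, y) abs(sqrt(x) - sqrt(y))

# a custom metric takes two numeric vectors of equal length and
# returns a numeric vector of the same length
maxScaledD <- function(x, y) abs(x - y) / pmax(x, y, 1)

# grouping as applied via `quantvals` on a metric result m
m <- relabsD_sketch(c(10, 20, 5), c(12, 20, 9))
cut(m, breaks = c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf), include.lowest = TRUE)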
Returns a list containing:
* `cellvalues`: `data.table` showing, in long format, the frequency count of each table cell for `data` (`count_o`) and `data_swapped` (`count_s`).
* `overview`: `data.table` containing the distribution of the `noise` in number of cells and percentage. The `noise` is calculated as the difference between the cell values of the frequency tables generated from the original and the swapped data.
* `measures`: `data.table` containing the quantiles and mean (column `what`) of the distribution of the information loss metrics applied to each table cell. The quantiles are defined by the parameter `probs`.
* `cumdistr*`: `data.table` containing the cumulative distribution of the information loss metrics. The distribution is shown in number of cells (`cnt`) and percentage (`pct`). Column `cat` shows all unique values of the information loss metric or the grouping defined by `quantvals`.
* `false_zero`: number of table cells which are non-zero when using `data` and zero when using `data_swapped`.
* `false_nonzero`: number of table cells which are zero when using `data` and non-zero when using `data_swapped`.
* `exclude_zeros`: value passed to `exclude_zeros` when calling the function.
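A quick way to see which of these elements a concrete result contains is to inspect the returned list directly. The snippet below assumes `iloss` holds the output of one of the `infoLoss()` calls shown in the examples further down:

names(iloss)       # available elements, including the cumdistr* tables
iloss$cellvalues   # cell counts: count_o (original) vs. count_s (swapped)
iloss$false_zero   # cells non-zero in `data` but zero in `data_swapped`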
First, frequency tables are built from both `data` and `data_swapped` using the variables defined in `table_vars`. By default all table margins are calculated as well, see the parameter `only_inner_cells = FALSE`. After that, the information loss metrics defined in either `metric` or `custom_metric` are applied to each cell of both frequency tables. This is done in the sense of `metric(x,y)`, where `metric` is the information loss function, `x` a cell count from the table created from `data` and `y` the same cell count from the table created from `data_swapped`. One or more custom metrics can be supplied through the parameter `custom_metric`, see also the examples.
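Conceptually the comparison boils down to building the same frequency table twice and applying each metric cell by cell. The minimal sketch below illustrates this idea with base R `table()` and the presumed definition of "absD"; it is not the package's internal implementation and uses made-up toy data.

# toy data: one categorical variable before and after a hypothetical swap
orig <- data.frame(region = c("A", "A", "B", "B", "B", "C"))
swap <- data.frame(region = c("A", "B", "B", "B", "C", "C"))

# inner cells of the one-dimensional frequency table
tab_o <- table(orig$region)
tab_s <- table(swap$region)

# cell-wise absolute deviation, analogous to metric(x, y) with metric = "absD"
abs(as.numeric(tab_o) - as.numeric(tab_s))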
# generate dummy data
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- createDat( nhid )
# define parameters for swapping
k_anonymity <- 1
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1","nuts2")
carry_along <- c("nuts3","lau2")
risk_variables <- c("ageGroup","national")
hid <- "hid"
# # apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier,
# similar = similar, swaprate = swaprate,
# k_anonymity = k_anonymity,
# risk_variables = risk_variables,
# carry_along = carry_along,
# return_swapped_id = TRUE,
# seed=seed)
#
#
# # calculate information loss
# # for the table nuts2 x national
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
# table_vars = c("nuts2","national"))
# iloss$measures # distribution of information loss measures
# iloss$false_zero # no false zeros
# iloss$false_nonzero # no false non-zeros
#
# # frequency tables of households across
# # nuts2 x hincome
#
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
# table_vars = c("nuts2","hincome"),
# hid = "hid")
# iloss$measures
#
# # define custom metric
# squareD <- function(x,y){
# (x-y)^2
# }
#
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
# table_vars = c("nuts2","national"),
# custom_metric = list(squareD=squareD))
# iloss$measures # includes custom loss as well
#
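# # a further illustrative call (not part of the original examples):
# # custom grouping for the cumulative output via `quantvals`,
# # applied to "relabsD" only, while ignoring zero cells in the
# # swapped data
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#                   table_vars = c("nuts2","national"),
#                   quantvals = c(0, 0.01, 0.05, 0.1, Inf),
#                   apply_quantvals = c("relabsD"),
#                   exclude_zeros = TRUE)
# names(iloss)  # inspect which cumdistr* tables were produced
#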