Calculate information loss after targeted record swapping using both the original and the swapped micro data. Information loss will be calculated on table counts defined by parameter `table_vars` using either implemented information loss measures like absolute deviaton, relative absolute deviation and absolute deviation of square roots or custom metric, See details below.

infoLoss(
  data,
  data_swapped,
  table_vars,
  metric = c("absD", "relabsD", "abssqrtD"),
  custom_metric = NULL,
  hid = NULL,
  probs = sort(c(seq(0, 1, by = 0.1), 0.95, 0.99)),
  quantvals = c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf),
  apply_quantvals = c("relabsD", "abssqrtD"),
  exclude_zeros = FALSE,
  only_inner_cells = FALSE
)

Arguments

data

original micro data set, must be either a `data.table` or `data.frame`.

data_swapped

micro data set after targeted record swapping was applied. Must be either a `data.table` or `data.frame`.

table_vars

column names in both `data` and `data_swapped`. Defines the variables over which a (multidimensional) frequency table is constructed. Information loss is then calculated by applying the metric in `metric` and `custom_merics` over the cell-counts and margin counts of the table from `data` and `data_swapped`.

metric

character vector containing one or more of the already implemented metrices: "absD","relabsD" and/or "abssqrtD".

custom_metric

function or (named) list of functions. Functions defined here must be of the form `fun(x,y,...)` where `x` and `y` expect numeric values of the same length. The output of these functions must be a numeric vector of the same length as `x` and `y`.

hid

`NULL` or character containing household id in `data` and `data_swapped`. If not `NULL` frequencies will reflect number of households, otherwise frequencies will reflect number of persons.

probs

numeric vector containing values in the inervall [0,1].

quantvals

optional numeric vector which defines the groups used for the cumulative outputs. Is applied on the results `m` from each information loss metric as `cut(m,breaks=quantvals,include.lowest=TRUE)`, see also return values.

apply_quantvals

character vector defining for the output of which metrices `quantvals` should be applied to.

exclude_zeros

`TRUE` or `FALSE`, if `TRUE` 0 cells in the frequency table using `data_swapped` will be ignored.

only_inner_cells

`TRUE` or `FALSE`, if `TRUE` only inner cells of the frequency table defined by `table_vars` will be compared. Otherwise also all tables margins will bei calculated.

Value

Returns a list containing:

* `cellvalues`: `data.table` showing in a long format for each table cell the frequency counts for `data` ~ `count_o` and `data_swapped` ~ `count_s`. * `overview`: `data.table` containing the disribution of the `noise` in number of cells and percentage. The `noise` ist calculated as the difference between the cell values of the frequency table generated from the original and swapped data * `measures`: `data.table` containing the quantiles and mean (column `waht`) of the distribution of the information loss metrices applied on each table cell. The quantiles are defined by parameter `probs`. * `cumdistr\*`: `data.table` containing the cumulative distribution of the information loss metrices. Distribution is shown in number of cells (`cnt`) and percentage (`pct`). Column `cat` shows all unique values of the information loss metric or the grouping defined by `quantvals`. * `false_zero`: number of table cells which are non-zero when using `data` and zero when using `data_swapped`. * `false_nonzero`: number of table cells which are zero when using `data` and non-zero when using `data_swapped`. * `exclude_zeros`: value passed to `exclude_zero` when calling the function.

Details

First frequency tables are build from both `data` and `data_swapped` using the variables defined in `table_vars`. By default also all table margins will be calculated, see parameter `only_inner_cells = FALSE`. After that the information loss metrices defined in either `metric` or `custom_metric` are applied on each of the table cells from both frequency tables. This is done in the sense of `metric(x,y)` where `metric` is the information loss, `x` a cell from the table created from `data` and `y` the same cell from the table created from `data_swapped`. One or more custom metrices can be applied using the parameter `custom_metric`, see also examples.

Examples

# generate dummy data 
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- createDat( nhid )

# define paramters for swapping
k_anonymity <- 1
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1","nuts2")
carry_along <- c("nuts3","lau2")
risk_variables <- c("ageGroup","national")
hid <- "hid"

# # apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier,
#                     similar = similar, swaprate = swaprate,
#                     k_anonymity = k_anonymity,
#                     risk_variables = risk_variables,
#                     carry_along = carry_along,
#                     return_swapped_id = TRUE,
#                     seed=seed)
# 
# 
# # calculate informationn loss
# # for the table nuts2 x national
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
#                   table_vars = c("nuts2","national"))
# iloss$measures # distribution of information loss measures
# iloss$false_zero # no false zeros
# iloss$false_nonzero # no false non-zeros
# 
# # frequency tables of households accross
# # nuts2 x hincome
# 
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
 #                  table_vars = c("nuts2","hincome"),
#                   hid = "hid")
# iloss$measures  
# 
# # define custom metric
# squareD <- function(x,y){
#   (x-y)^2
# }
# 
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
#                  table_vars = c("nuts2","national"),
#                  custom_metric = list(squareD=squareD))
# iloss$measures # includes custom loss as well
#