Calculate information loss after targeted record swapping using both the original and the swapped micro data. Information loss is calculated on table counts defined by parameter table_vars, using either the implemented information loss measures (absolute deviation, relative absolute deviation and absolute deviation of square roots) or a custom metric; see details below.

infoLoss(
data,
data_swapped,
table_vars,
metric = c("absD", "relabsD", "abssqrtD"),
custom_metric = NULL,
hid = NULL,
probs = sort(c(seq(0, 1, by = 0.1), 0.95, 0.99)),
quantvals = c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf),
apply_quantvals = c("relabsD", "abssqrtD"),
exclude_zeros = FALSE,
only_inner_cells = FALSE
)

## Arguments

data

original micro data set, must be either a data.table or data.frame.

data_swapped

micro data set after targeted record swapping was applied. Must be either a data.table or data.frame.

table_vars

column names in both data and data_swapped. Defines the variables over which a (multidimensional) frequency table is constructed. Information loss is then calculated by applying the metrics in metric and custom_metric to the cell counts and margin counts of the tables from data and data_swapped.

metric

character vector containing one or more of the already implemented metrics: "absD", "relabsD" and/or "abssqrtD".

custom_metric

function or (named) list of functions. Functions defined here must be of the form fun(x,y,...) where x and y expect numeric values of the same length. The output of these functions must be a numeric vector of the same length as x and y.

hid

NULL or character containing the name of the household id column in data and data_swapped. If not NULL, frequencies will reflect the number of households; otherwise frequencies will reflect the number of persons.

probs

numeric vector containing values in the interval [0,1]. Defines the quantiles reported in the measures output, see also return values.

quantvals

optional numeric vector which defines the groups used for the cumulative outputs. It is applied to the results m from each information loss metric as cut(m,breaks=quantvals,include.lowest=TRUE), see also return values.
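The grouping can be tried out in base R. The sketch below applies the same cut() call to a made-up vector m of metric values (the values are illustrative and not output of infoLoss()) and then accumulates cell counts and percentages per group. How infoLoss() builds its cumulative tables internally is not shown here; this only illustrates the grouping step.

```r
# hypothetical metric values for six table cells (not produced by infoLoss)
m <- c(0, 0.01, 0.04, 0.15, 0.35, 0.8)
quantvals <- c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf)

# same call as described for the quantvals argument
grp <- cut(m, breaks = quantvals, include.lowest = TRUE)

# cumulative number of cells and percentage per group
cnt <- cumsum(table(grp))
pct <- cnt / length(m) * 100
data.frame(cat = names(cnt), cnt = as.integer(cnt), pct = as.numeric(pct))
```

With include.lowest = TRUE a metric value of exactly 0 falls into the first group instead of becoming NA.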

apply_quantvals

character vector naming the metrics whose output quantvals should be applied to.

exclude_zeros

TRUE or FALSE, if TRUE 0 cells in the frequency table using data_swapped will be ignored.

only_inner_cells

TRUE or FALSE, if TRUE only inner cells of the frequency table defined by table_vars will be compared. Otherwise all table margins will be calculated as well.

## Value

Returns a list containing:

* cellvalues: data.table showing, in long format, the frequency counts for each table cell, with count_o for data and count_s for data_swapped.
* overview: data.table containing the distribution of the noise in number of cells and percentage. The noise is calculated as the difference between the cell values of the frequency tables generated from the original and swapped data.
* measures: data.table containing the quantiles and mean (column what) of the distribution of the information loss metrics applied on each table cell. The quantiles are defined by parameter probs.
* cumdistr\*: data.table containing the cumulative distribution of the information loss metrics. Distribution is shown in number of cells (cnt) and percentage (pct). Column cat shows all unique values of the information loss metric or the grouping defined by quantvals.
* false_zero: number of table cells which are non-zero when using data and zero when using data_swapped.
* false_nonzero: number of table cells which are zero when using data and non-zero when using data_swapped.
* exclude_zeros: value passed to exclude_zeros when calling the function.
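
To make the false_zero, false_nonzero and measures entries concrete, here is a minimal base R sketch on hypothetical cell counts. The vectors count_o and count_s are made up for illustration, and the absolute deviation is used as a stand-in metric; the actual return value of infoLoss() is a data.table and may be laid out differently.

```r
# hypothetical cell counts from the original (count_o) and swapped (count_s) tables
count_o <- c(2, 1, 1, 2, 0, 4)
count_s <- c(4, 0, 0, 3, 1, 4)

# cells that are non-zero in the original but zero after swapping
false_zero <- sum(count_o != 0 & count_s == 0)
# cells that are zero in the original but non-zero after swapping
false_nonzero <- sum(count_o == 0 & count_s != 0)

# stand-in metric: absolute deviation per cell
m <- abs(count_o - count_s)

# quantiles at the default probs plus the mean
probs <- sort(c(seq(0, 1, by = 0.1), 0.95, 0.99))
c(quantile(m, probs = probs), mean = mean(m))
```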

## Details

First, frequency tables are built from both data and data_swapped using the variables defined in table_vars. By default all table margins are calculated as well, see parameter only_inner_cells = FALSE. After that, the information loss metrics defined in either metric or custom_metric are applied to each of the table cells from both frequency tables. This is done in the sense of metric(x,y), where metric is the information loss, x a cell from the table created from data and y the same cell from the table created from data_swapped. One or more custom metrics can be applied using the parameter custom_metric, see also examples.
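
The procedure can be sketched in base R on a toy data set. Everything in the block is illustrative: the data, the column names region and nat, and in particular the formulas for absD, relabsD and abssqrtD, which are inferred from the metric names (relabsD is assumed to be relative to the original count) and are not taken from the package source.

```r
# toy micro data; column names and values are made up for illustration
orig <- data.frame(region = c("A", "A", "A", "B", "B", "B"),
                   nat    = c("x", "x", "y", "x", "y", "y"),
                   stringsAsFactors = FALSE)
swapped <- orig
swapped$region[c(3, 4)] <- c("B", "A")  # pretend records 3 and 4 swapped regions

# frequency tables including all margins (cf. only_inner_cells = FALSE)
tab_o <- addmargins(table(orig))
tab_s <- addmargins(table(swapped))

x <- as.vector(tab_o)  # counts from the original data
y <- as.vector(tab_s)  # counts from the swapped data

# assumed definitions, inferred from the metric names
absD     <- function(x, y) abs(x - y)
relabsD  <- function(x, y) abs(x - y) / x   # assumption: relative to original count
abssqrtD <- function(x, y) abs(sqrt(x) - sqrt(y))

# a custom metric of the form fun(x, y, ...)
squareD <- function(x, y, ...) (x - y)^2

data.frame(count_o = x, count_s = y,
           absD = absD(x, y), relabsD = relabsD(x, y),
           abssqrtD = abssqrtD(x, y), squareD = squareD(x, y))
```

In this toy case only the four inner cells change, while all margins stay the same because the swap moves records between regions without altering the totals. Under the assumed relabsD formula, a cell with an original count of 0 would produce NaN or Inf, so such cells need separate handling.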

## Examples

# generate dummy data
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- createDat( nhid )

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1","nuts2")
carry_along <- c("nuts3","lau2")
risk_variables <- c("ageGroup","national")
hid <- "hid"

# # apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier,
#                     similar = similar, swaprate = swaprate,
#                     k_anonymity = k_anonymity,
#                     risk_variables = risk_variables,
#                     carry_along = carry_along,
#                     return_swapped_id = TRUE,
#                     seed=seed)
#
#
# # calculate information loss
# # for the table nuts2 x national
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
#                   table_vars = c("nuts2","national"))
# iloss$measures # distribution of information loss measures
# iloss$false_zero # no false zeros
# iloss$false_nonzero # no false non-zeros
#
# # frequency tables of households across
# # nuts2 x hincome
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
#                   table_vars = c("nuts2","hincome"),
#                   hid = "hid")
# iloss$measures
#
# # define custom metric
# squareD <- function(x,y){
#   (x-y)^2
# }
#
# iloss <- infoLoss(data=dat, data_swapped = dat_s,
#                  table_vars = c("nuts2","national"),
#                  custom_metric = list(squareD=squareD))
# iloss$measures # includes custom loss as well
#