Measures IL_correl() and IL_variables() were proposed by Andrzej Mlodak and are (theoretically) bounded between 0 and 1.

IL_correl(x, xm)

# S3 method for il_correl
print(x, digits = 3, ...)

IL_variables(x, xm)

# S3 method for il_variables
print(x, digits = 3, ...)

## Arguments

x

an object coercible to a data.frame representing the original dataset

xm

an object coercible to a data.frame representing the perturbed, modified dataset

digits

number of digits used for rounding when displaying results

...

additional parameters for the print methods; currently ignored

## Value

the corresponding information-loss measure

## Details

• IL_correl(): an information-loss measure that can be applied to the numerically scaled variables common to x and xm. It is based on the diagonal entries of the inverse correlation matrices of the original and perturbed data.

• IL_variables(): for the variables common to x and xm, the individual distance functions depend on the class of the variable; specifically, they differ for numeric variables, ordered factors and character/factor variables. The individual distances are summed and scaled by n * m, with n being the number of records and m the number of (common) variables.
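
The "sum the class-specific distances, then scale by n * m" step can be illustrated with a small base-R sketch. The three distance functions below are placeholder assumptions chosen only for illustration (range-scaled absolute difference, scaled rank difference, 0/1 mismatch); they are not taken from the package or from the reference, and `dist_one()` and `il_sketch()` are hypothetical names.

```r
# Hypothetical per-record distances, chosen for illustration only;
# these are NOT the distance functions used by IL_variables().
dist_one <- function(a, b) {
  if (is.numeric(a)) {
    # numeric: absolute difference scaled by the range of the original values
    rng <- diff(range(a))
    if (rng == 0) return(as.numeric(a != b))
    pmin(abs(a - b) / rng, 1)
  } else if (is.ordered(a)) {
    # ordered factor: difference in level position, scaled to [0, 1]
    abs(as.integer(a) - as.integer(b)) / max(nlevels(a) - 1, 1)
  } else {
    # character / unordered factor: 0 if unchanged, 1 otherwise
    as.numeric(as.character(a) != as.character(b))
  }
}

il_sketch <- function(x, xm) {
  common <- intersect(names(x), names(xm))
  n <- nrow(x)          # number of records
  m <- length(common)   # number of common variables
  # sum of the record-level distances for each common variable
  per_var <- vapply(common, function(v) sum(dist_one(x[[v]], xm[[v]])), numeric(1))
  list(overall = sum(per_var) / (n * m), by_variable = per_var / n)
}

x <- data.frame(
  num = c(1, 2, 3, 4),
  ord = ordered(c("A", "B", "C", "A"), levels = c("A", "B", "C")),
  cat = factor(c("u", "v", "u", "v"))
)
xm <- x
xm$num[1] <- 4    # |1 - 4| / 3 = 1
xm$ord[2] <- "C"  # B -> C is one step out of two: 0.5
xm$cat[3] <- "v"  # mismatch: 1
res <- il_sketch(x, xm)
res$overall       # (1 + 0.5 + 1) / (4 * 3)
```

With this scaling the overall loss equals the mean of the per-variable losses, which is consistent with the printed IL_variables() output in the Examples section (the four variable losses there average to 0.223).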

Details can be found in the references below.

The implementation of IL_correl() differs slightly from the original proposal in Mlodak, A. (2020): the constant multiplier is 1 / sqrt(2) instead of 1/2, for better efficiency and interpretability of the measure.
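
One way to see why a 1 / sqrt(2) multiplier keeps such a measure within [0, 1] is the following base-R sketch. It assumes that the diagonal of each inverse correlation matrix is first scaled to unit Euclidean length and that the loss is the Euclidean distance between the two scaled vectors; that normalisation is an assumption made here for illustration and is not stated above, and `il_correl_sketch()` is a hypothetical name, not the package function.

```r
# Sketch under stated assumptions; not the sdcMicro implementation.
il_correl_sketch <- function(x, xm) {
  # numeric variables present in both datasets
  num_x  <- names(x)[vapply(x, is.numeric, logical(1))]
  num_xm <- names(xm)[vapply(xm, is.numeric, logical(1))]
  common <- intersect(num_x, num_xm)
  # diagonal of the inverse correlation matrix, scaled to unit length (assumed step)
  unit_diag <- function(d) {
    v <- diag(solve(cor(d[, common, drop = FALSE])))
    v / sqrt(sum(v^2))
  }
  # distance between the two unit vectors, multiplied by 1 / sqrt(2)
  sqrt(sum((unit_diag(x) - unit_diag(xm))^2)) / sqrt(2)
}

set.seed(1)
x <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
x$c <- x$a + x$c  # make two variables correlated
xm <- x
xm[] <- lapply(x, function(col) col + rnorm(length(col), sd = 2))  # perturb
loss <- il_correl_sketch(x, xm)
```

Under these assumptions both scaled vectors have strictly positive entries, so their inner product is non-negative and their distance cannot exceed sqrt(2); dividing by sqrt(2) therefore maps the result into [0, 1], with 0 for identical correlation structure.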

## Author

Bernhard Meindl bernhard.meindl@statistik.gv.at

## Examples

library(sdcMicro)
data("Tarragona", package = "sdcMicro")
res1 <- addNoise(obj = Tarragona, variables = colnames(Tarragona), noise = 100)
IL_correl(x = as.data.frame(res1$x), xm = as.data.frame(res1$xm))
#> Number of records (x):  834  | Number of records (xm):  834
#> Number of common numeric variables:  13
#> Overall information loss:  0.473

res2 <- addNoise(obj = Tarragona, variables = colnames(Tarragona), noise = 25)
IL_correl(x = as.data.frame(res2$x), xm = as.data.frame(res2$xm))
#> Number of records (x):  834  | Number of records (xm):  834
#> Number of common numeric variables:  13
#> Overall information loss:  0.23

# creating test-inputs
n <- 150
x <- xm <- data.frame(
  v1 = factor(sample(letters[1:5], n, replace = TRUE), levels = letters[1:5]),
  v2 = rnorm(n),
  v3 = runif(n),
  v4 = ordered(sample(LETTERS[1:3], n, replace = TRUE), levels = c("A", "B", "C"))
)
xm$v1[1:5] <- "a"
xm$v2 <- rnorm(n, mean = 5)
xm$v4[1:5] <- "A"
IL_variables(x, xm)
#> Number of records:  150
#> Number of variables:  4
#> Overall information loss:  0.223
#> Individual information losses for variables:
#>  variable  loss
#>        v1 0.020
#>        v2 0.859
#>        v3 0.000
#>        v4 0.013