Utility measures for perturbed counts — ck_cnt

This function computes utility/information loss measures from two numeric vectors: the original counts and the perturbed counts.

ck_cnt_measures(orig, pert, exclude_zeros = TRUE)

Arguments

orig: a numeric vector holding original values
pert: a numeric vector holding perturbed values
exclude_zeros: a scalar logical value; if TRUE (the default), only cells with counts > 0 are used when computing distances d1, d2 and d3. If this argument is FALSE, the complete vector is used.
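
The effect of exclude_zeros can be sketched in base R. This is an illustration only, not the package's implementation. It assumes that a cell is dropped whenever its original or its perturbed count is zero, which is an inference from the example output on this page (9 of the 12 cells enter the distance tables when exclude_zeros = TRUE). The helper name keep_cells is hypothetical.

```r
orig <- c(1:10, 0, 0)
pert <- orig
pert[c(1, 5, 7)] <- c(0, 6, 9)

# hypothetical helper: TRUE for cells that enter the distance computations
keep_cells <- function(orig, pert, exclude_zeros = TRUE) {
  if (exclude_zeros) {
    # assumption: drop cells that are zero before or after perturbation
    orig != 0 & pert != 0
  } else {
    rep(TRUE, length(orig))
  }
}

n_kept_excl <- sum(keep_cells(orig, pert, exclude_zeros = TRUE))
n_kept_all <- sum(keep_cells(orig, pert, exclude_zeros = FALSE))
print(c(excluded_zeros = n_kept_excl, all_cells = n_kept_all))
```

Under this assumption the first cell (original count 1, perturbed to 0) is dropped together with the two empty cells, so 9 cells remain.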

Value

a list containing the following elements:

overview: a data.table with the following three columns:
- noise: amount of noise computed as orig - pert
- cnt: number of cells perturbed with the value given in column noise
- pct: proportion of cells perturbed with the value given in column noise
measures: a data.table containing measures of the distribution of three different distances between original and perturbed values of the unweighted counts. Column what specifies the computed measure. The three distances considered are:
- d1: absolute distance between original and perturbed values
- d2: relative absolute distance between original and perturbed values (the absolute distance divided by the original value)
- d3: absolute distance between square-roots of original and perturbed values
cumdistr_d1, cumdistr_d2 and cumdistr_d3: for each distance d1, d2 and d3, a data.table with the following three columns:
- cat: a specific value (for d1) or interval (for distances d2 and d3)
- cnt: number of cells whose distance is smaller than or equal to the value in column cat
- pct: proportion of cells whose distance is smaller than or equal to the value in column cat
false_zero: number of cells with a non-zero original count that were perturbed to zero
false_nonzero: number of cells that were initially zero but have been perturbed to a number different from zero
exclude_zeros: whether empty cells were excluded from the computation
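
The quantities described above can be recomputed by hand in base R. The following is a minimal sketch, not the package's code: it works on all cells (as with exclude_zeros = FALSE), and it assumes that unchanged cells get a relative distance d2 of 0, which avoids 0/0 for empty cells. All variable names are illustrative.

```r
orig <- c(1:10, 0, 0)
pert <- orig
pert[c(1, 5, 7)] <- c(0, 6, 9)

# overview: frequency of each noise value (orig - pert)
noise_tab <- table(orig - pert)

# d1: absolute distance
d1 <- abs(orig - pert)
# d2: absolute distance relative to the original value
# (assumption: unchanged cells get 0, so empty cells do not yield 0/0)
d2 <- ifelse(d1 == 0, 0, d1 / orig)
# d3: absolute distance between square roots
d3 <- abs(sqrt(orig) - sqrt(pert))

means <- round(c(d1 = mean(d1), d2 = mean(d2), d3 = mean(d3)), 3)
print(means)
#>    d1    d2    d3
#> 0.333 0.124 0.131

# cumulative counts for d1: number of cells with d1 <= 0, 1, 2
cum_d1 <- sapply(0:2, function(v) sum(d1 <= v))
print(cum_d1)
#> [1]  9 11 12

# cells perturbed to zero, and empty cells perturbed away from zero
false_zero <- sum(orig != 0 & pert == 0)
false_nonzero <- sum(orig == 0 & pert != 0)
```

For these vectors the means and the cumulative counts agree with the Mean row of measures and with cumdistr_d1 in the exclude_zeros = FALSE example on this page.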

Examples

orig <- c(1:10, 0, 0)
pert <- orig; pert[c(1, 5, 7)] <- c(0, 6, 9)

# ignore empty cells when computing measures `d1`, `d2`, `d3`
ck_cnt_measures(orig = orig, pert = pert, exclude_zeros = TRUE)
#> $overview
#>     noise   cnt        pct
#>    <fctr> <int>      <num>
#> 1:     -2     1 0.08333333
#> 2:     -1     1 0.08333333
#> 3:      0     9 0.75000000
#> 4:      1     1 0.08333333
#> 
#> $measures
#>       what    d1    d2    d3
#>     <char> <num> <num> <num>
#>  1:    Min 0.000 0.000 0.000
#>  2:    Q10 0.000 0.000 0.000
#>  3:    Q20 0.000 0.000 0.000
#>  4:    Q30 0.000 0.000 0.000
#>  5:    Q40 0.000 0.000 0.000
#>  6:   Mean 0.333 0.054 0.063
#>  7: Median 0.000 0.000 0.000
#>  8:    Q60 0.000 0.000 0.000
#>  9:    Q70 0.000 0.000 0.000
#> 10:    Q80 0.400 0.080 0.085
#> 11:    Q90 1.200 0.217 0.242
#> 12:    Q95 1.600 0.251 0.298
#> 13:    Q99 1.920 0.279 0.343
#> 14:    Max 2.000 0.286 0.354
#> 
#> $cumdistr_d1
#>       cat   cnt       pct
#>    <char> <int>     <num>
#> 1:      0     7 0.7777778
#> 2:      1     8 0.8888889
#> 3:      2     9 1.0000000
#> 
#> $cumdistr_d2
#>            cat   cnt       pct
#>         <char> <int>     <num>
#> 1:    [0,0.02]     7 0.7777778
#> 2: (0.02,0.05]     7 0.7777778
#> 3:  (0.05,0.1]     7 0.7777778
#> 4:   (0.1,0.2]     8 0.8888889
#> 5:   (0.2,0.3]     9 1.0000000
#> 6:   (0.3,0.4]     9 1.0000000
#> 7:   (0.4,0.5]     9 1.0000000
#> 8:   (0.5,Inf]     9 1.0000000
#> 
#> $cumdistr_d3
#>            cat   cnt       pct
#>         <char> <int>     <num>
#> 1:    [0,0.02]     7 0.7777778
#> 2: (0.02,0.05]     7 0.7777778
#> 3:  (0.05,0.1]     7 0.7777778
#> 4:   (0.1,0.2]     7 0.7777778
#> 5:   (0.2,0.3]     8 0.8888889
#> 6:   (0.3,0.4]     9 1.0000000
#> 7:   (0.4,0.5]     9 1.0000000
#> 8:   (0.5,Inf]     9 1.0000000
#> 
#> $false_zero
#> [1] 1
#> 
#> $false_nonzero
#> [1] 0
#> 
#> $exclude_zeros
#> [1] TRUE
#> 

# use all cells
ck_cnt_measures(orig = orig, pert = pert, exclude_zeros = FALSE)
#> $overview
#>     noise   cnt        pct
#>    <fctr> <int>      <num>
#> 1:     -2     1 0.08333333
#> 2:     -1     1 0.08333333
#> 3:      0     9 0.75000000
#> 4:      1     1 0.08333333
#> 
#> $measures
#>       what    d1    d2    d3
#>     <char> <num> <num> <num>
#>  1:    Min 0.000 0.000 0.000
#>  2:    Q10 0.000 0.000 0.000
#>  3:    Q20 0.000 0.000 0.000
#>  4:    Q30 0.000 0.000 0.000
#>  5:    Q40 0.000 0.000 0.000
#>  6:   Mean 0.333 0.124 0.131
#>  7: Median 0.000 0.000 0.000
#>  8:    Q60 0.000 0.000 0.000
#>  9:    Q70 0.000 0.000 0.000
#> 10:    Q80 0.800 0.160 0.171
#> 11:    Q90 1.000 0.277 0.340
#> 12:    Q95 1.450 0.607 0.645
#> 13:    Q99 1.890 0.921 0.929
#> 14:    Max 2.000 1.000 1.000
#> 
#> $cumdistr_d1
#>       cat   cnt       pct
#>    <char> <int>     <num>
#> 1:      0     9 0.7500000
#> 2:      1    11 0.9166667
#> 3:      2    12 1.0000000
#> 
#> $cumdistr_d2
#>            cat   cnt       pct
#>         <char> <int>     <num>
#> 1:    [0,0.02]     9 0.7500000
#> 2: (0.02,0.05]     9 0.7500000
#> 3:  (0.05,0.1]     9 0.7500000
#> 4:   (0.1,0.2]    10 0.8333333
#> 5:   (0.2,0.3]    11 0.9166667
#> 6:   (0.3,0.4]    11 0.9166667
#> 7:   (0.4,0.5]    11 0.9166667
#> 8:   (0.5,Inf]    12 1.0000000
#> 
#> $cumdistr_d3
#>            cat   cnt       pct
#>         <char> <int>     <num>
#> 1:    [0,0.02]     9 0.7500000
#> 2: (0.02,0.05]     9 0.7500000
#> 3:  (0.05,0.1]     9 0.7500000
#> 4:   (0.1,0.2]     9 0.7500000
#> 5:   (0.2,0.3]    10 0.8333333
#> 6:   (0.3,0.4]    11 0.9166667
#> 7:   (0.4,0.5]    11 0.9166667
#> 8:   (0.5,Inf]    12 1.0000000
#> 
#> $false_zero
#> [1] 1
#> 
#> $false_nonzero
#> [1] 0
#> 
#> $exclude_zeros
#> [1] FALSE
#> 

# for an application on a perturbed object, see ?cellkey_pkg