The function measures the disclosure risk for weighted or unweighted data. It computes the individual risk (and household risk if reasonable) and the global risk. It also computes a risk threshold based on a global risk value.

Prints a 'measure_risk'-object

Prints a 'ldiversity'-object

measure_risk(obj, ...)

ldiversity(obj, ldiv_index = NULL, l_recurs_c = 2, missing = -999, ...)

# S3 method for measure_risk
print(x, ...)

# S3 method for ldiversity
print(x, ...)

Arguments

obj

Object of class sdcMicroObj-class

...

see arguments below

data:

Input data, a data.frame.

keyVars:

names (or indices) of categorical key variables (for data-frame method)

w:

name of variable containing sample weights

hid:

name of the clustering variable, e.g. the household ID

max_global_risk:

Maximal global risk for threshold computation

fast_hier:

If TRUE, a fast approximation is computed when household data are provided.

ldiv_index

indices (or names) of the variables used for l-diversity

l_recurs_c

l-Diversity Constant

missing

an integer value to be used as missing value in the C++ routine

x

Output of measure_risk() or ldiversity()

Value

A modified sdcMicroObj-class object or a list with the following elements:

global_risk_ER:

expected number of re-identifications.

global_risk:

global risk (sum of individual risks).

global_risk_pct:

global risk in percent.

Res:

matrix with the risk, the frequency in the sample and the grossed-up frequency in the population (and the hierarchical risk) for each observation.

global_threshold:

for a given max_global_risk the threshold for the risk of observations.

max_global_risk:

the input max_global_risk of the function.

hier_risk_ER:

expected number of re-identifications with household structure.

hier_risk:

global risk with household structure (sum of individual risks).

hier_risk_pct:

global risk with household structure in percent.

ldiversity:

Matrix with Distinct_Ldiversity, Entropy_Ldiversity and Recursive_Ldiversity for each sensitive variable.

Prints risk-information into the console

Information on L-Diversity Measures in the console

Details

To be used when the risk of disclosure for individuals within a family is considered to be statistically independent.

Internally, the functions freqCalc() and indivRisk() are used for estimation.

Measuring individual risk: The individual risk approach is based on so-called super-population models. In such models, population frequency counts are modeled given a certain distribution. The distribution of the population frequency counts given the sample frequency counts is modeled by assuming a negative binomial distribution, which is used for the estimation of the individual risk. The extensive theory can be found in Skinner (1998); the approximation formulas used for the individual risk are described in Franconi and Polettini (2004).
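As an illustration of how such an approximation behaves, the following is a minimal sketch in base R. It assumes the approximation commonly used for keys with larger sample frequencies, r = p / (fk - (1 - p)) with p = fk / Fk; the helper name `approx_risk_large_fk` is hypothetical and this is not a copy of the package's internal code. With the `fk` and `Fk` values shown for the weighted example in the Examples section, it yields values that agree with the printed `risk` column.

```r
# Hypothetical helper (not part of sdcMicro): approximate individual risk
# for keys with a larger sample frequency fk, given the estimated
# population frequency Fk. The formula is an assumption of this sketch.
approx_risk_large_fk <- function(fk, Fk) {
  pk <- fk / Fk          # estimated sampling fraction within the key
  pk / (fk - (1 - pk))   # approximation of the individual risk
}

# fk and Fk as printed by head(resw$Res) in the Examples section
fk <- c(359, 351, 161, 150)
Fk <- c(35900, 35100, 16100, 15000)

risk_approx <- approx_risk_large_fk(fk, Fk)
print(signif(risk_approx, 7))
```

Note that with unweighted data (Fk equal to fk, so p = 1) the same expression reduces to 1 / fk, which matches the unweighted example output as well.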

Measuring hierarchical risk: If “hid”, the index of the variable holding information on the hierarchical cluster structure (e.g., individuals clustered in households), is provided, the hierarchical risk is additionally estimated. Note that the risk of re-identifying an individual within a household may also affect the probability of disclosure of other members in the same household. Thus, the household or cluster structure of the data must be taken into account when estimating disclosure risks. It is commonly assumed that the risk of re-identification of a household is the risk that at least one member of the household can be disclosed. Thus, this probability can be estimated from the individual risks as 1 minus the probability that no member of the household can be identified.
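The household-level rule can be sketched in base R as follows. This is a minimal illustration under the independence assumption, not the package's internal code; the vectors `risk` and `hid` and all their values are invented for the example.

```r
# Toy individual risks and household IDs (invented values).
risk <- c(0.10, 0.20, 0.05, 0.30, 0.30)
hid  <- c(1, 1, 2, 3, 3)

# Household risk: 1 minus the probability that no member is re-identified,
# treating the members' individual risks as independent.
hh_risk <- tapply(risk, hid, function(r) 1 - prod(1 - r))

# Attach the household risk to every member of the household.
risk_per_record <- unname(hh_risk[as.character(hid)])
print(risk_per_record)
```

For the first household this gives 1 - 0.9 * 0.8 = 0.28, which is larger than either member's individual risk.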

Global risk: The sum of the individual risks in the dataset gives the expected number of re-identifications, which serves as a measure of the global risk.
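A minimal sketch of this aggregation, using invented individual risks. The variable names mirror the elements listed under Value, but the computation of the percentage as the expected number divided by the number of records is an assumption of this sketch.

```r
# Toy individual risks for four records (invented values).
risk <- c(0.002, 0.010, 0.050, 0.250)

# Expected number of re-identifications: sum of the individual risks.
global_risk_ER <- sum(risk)

# The same quantity relative to the number of records, in percent
# (assumed definition for this sketch).
global_risk_pct <- 100 * global_risk_ER / length(risk)

print(c(ER = global_risk_ER, pct = global_risk_pct))
```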

l-Diversity: If “ldiv_index” is not NULL, i.e. if the indices of sensitive variables are specified, various measures for l-diversity are calculated. l-diversity is an extension of the well-known k-anonymity approach in which the uniqueness in sensitive variables is also evaluated for each pattern spanned by the key variables.
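To make the idea concrete, the following base R sketch computes only the distinct l-diversity, i.e. the number of different values of a sensitive variable within each pattern of the key variables. The data frame `toy` and its column names are invented, and this is an illustration of the concept rather than the routine used by ldiversity().

```r
# Toy data: two key variables and one sensitive variable (invented values).
toy <- data.frame(
  region       = c("a", "a", "a", "b", "b", "c"),
  sex          = c(1, 1, 1, 2, 2, 1),
  income_class = c("low", "mid", "low", "high", "high", "mid")
)

# One group label per record: the pattern spanned by the key variables.
pattern <- interaction(toy$region, toy$sex, drop = TRUE)

# Distinct l-diversity: number of distinct sensitive values per pattern,
# repeated for every record that shares the pattern.
distinct_l <- ave(
  seq_len(nrow(toy)), pattern,
  FUN = function(i) length(unique(toy$income_class[i]))
)
print(distinct_l)
```

Records whose pattern has a distinct l-diversity of 1 reveal their sensitive value to anyone who can place a person in that pattern, even if the pattern itself satisfies k-anonymity.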

References

Franconi, L. and Polettini, S. (2004) Individual risk estimation in mu-Argus: a review. Privacy in Statistical Databases, Lecture Notes in Computer Science, 262--272. Springer

Machanavajjhala, A. and Kifer, D. and Gehrke, J. and Venkitasubramaniam, M. (2007) l-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl. Discov. Data, 1(1)

Templ, M. (2017) Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4

Templ, M. and Kowarik, A. and Meindl, B. (2015) Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67 (4), 1--36. doi:10.18637/jss.v067.i04

See also

freqCalc, indivRisk

measure_risk

Author

Alexander Kowarik, Bernhard Meindl, Matthias Templ, Bernd Prantner, minor parts of IHSN C++ source

Examples

## measure_risk with sdcMicro objects:
data(testdata)
# \donttest{
sdc <- createSdcObj(testdata,
  keyVars=c('urbrur','roof','walls','water','electcon'),
  numVars=c('expend','income','savings'), w='sampling_weight')

## risk is already estimated and available in...
names(sdc@risk)
#> [1] "global"     "individual" "numeric"   

## measure risk on data frames or matrices:
res <- measure_risk(testdata,
  keyVars=c("urbrur","roof","walls","water","sex"))
print(res)
#> 
#> --------------------------
#> 165 obs. with higher risk as the main part
#> Expected no. of re-identifications:
#> 93
#> (2.03%)
#> Threshold:0.03
#>  (for maximal global risk0.01)
#> --------------------------
head(res$Res)
#>             risk  fk  Fk
#> [1,] 0.002785515 359 359
#> [2,] 0.002849003 351 351
#> [3,] 0.002785515 359 359
#> [4,] 0.002785515 359 359
#> [5,] 0.006211180 161 161
#> [6,] 0.006666667 150 150
resw <- measure_risk(testdata,
  keyVars=c("urbrur","roof","walls","water","sex"),w="sampling_weight")
print(resw)
#> 
#> --------------------------
#> 0 obs. with higher risk as the main part
#> Expected no. of re-identifications:
#> 1.53
#> (0.03%)
#> Threshold:Inf
#>  (for maximal global risk0.01)
#> --------------------------
head(resw$Res)
#>              risk  fk    Fk
#> [1,] 2.793218e-05 359 35900
#> [2,] 2.857061e-05 351 35100
#> [3,] 2.793218e-05 359 35900
#> [4,] 2.793218e-05 359 35900
#> [5,] 6.249609e-05 161 16100
#> [6,] 6.710959e-05 150 15000
res1 <- ldiversity(testdata,
  keyVars=c("urbrur","roof","walls","water","sex"),ldiv_index="electcon")
print(res1)
#> --------------------------
#> L-Diversity Measures 
#> --------------------------
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   2.000   3.000   2.374   3.000   3.000 
head(res1)
#>      electcon_Distinct_Ldiversity electcon_Entropy_Ldiversity
#> [1,]                            3                    1.765424
#> [2,]                            3                    1.855345
#> [3,]                            3                    1.765424
#> [4,]                            3                    1.765424
#> [5,]                            2                    1.172525
#> [6,]                            2                    1.130836
#>      electcon_Recursive_Ldiversity MultiEntropy_Ldiversity
#> [1,]                             1                       0
#> [2,]                             1                       0
#> [3,]                             1                       0
#> [4,]                             1                       0
#> [5,]                             1                       0
#> [6,]                             1                       0
#>      MultiRecursive_Ldiversity
#> [1,]                         0
#> [2,]                         0
#> [3,]                         0
#> [4,]                         0
#> [5,]                         0
#> [6,]                         0
res2 <- ldiversity(testdata,
  keyVars=c("urbrur","roof","walls","water","sex"),ldiv_index=c("electcon","relat"))
print(res2)
#> --------------------------
#> L-Diversity Measures 
#> --------------------------
#>  electcon_Distinct_Ldiversity relat_Distinct_Ldiversity
#>  Min.   :1.000                Min.   :1.000            
#>  1st Qu.:2.000                1st Qu.:5.000            
#>  Median :3.000                Median :5.000            
#>  Mean   :2.374                Mean   :5.524            
#>  3rd Qu.:3.000                3rd Qu.:7.000            
#>  Max.   :3.000                Max.   :8.000            
head(res2)
#>      electcon_Distinct_Ldiversity electcon_Entropy_Ldiversity
#> [1,]                            3                    1.765424
#> [2,]                            3                    1.855345
#> [3,]                            3                    1.765424
#> [4,]                            3                    1.765424
#> [5,]                            2                    1.172525
#> [6,]                            2                    1.130836
#>      electcon_Recursive_Ldiversity relat_Distinct_Ldiversity
#> [1,]                             1                         5
#> [2,]                             1                         8
#> [3,]                             1                         5
#> [4,]                             1                         5
#> [5,]                             1                         5
#> [6,]                             1                         4
#>      relat_Entropy_Ldiversity relat_Recursive_Ldiversity
#> [1,]                 2.276001                          2
#> [2,]                 2.997907                          2
#> [3,]                 2.276001                          2
#> [4,]                 2.276001                          2
#> [5,]                 2.209202                          2
#> [6,]                 2.240133                          2
#>      MultiEntropy_Ldiversity MultiRecursive_Ldiversity
#> [1,]                       0                         0
#> [2,]                       0                         0
#> [3,]                       0                         0
#> [4,]                       0                         0
#> [5,]                       0                         0
#> [6,]                       0                         0

# measure risk with household risk
resh <- measure_risk(testdata,
  keyVars=c("urbrur","roof","walls","water","sex"),w="sampling_weight",hid="ori_hid")
print(resh)
#> 
#> --------------------------
#> 0 obs. with higher risk as the main part
#> Expected no. of re-identifications:
#> 1.53
#> (0.03%)
#> Threshold:Inf
#>  (for maximal global risk0.01)
#> --------------------------
#> --------------------------
#> Hierarchical risk 
#> --------------------------
#> Expected no. of re-identifications:
#> 7.18
#> (0.16% )

# change max_global_risk
rest <- measure_risk(testdata,
  keyVars=c("urbrur","roof","walls","water","sex"),
  w="sampling_weight",max_global_risk=0.0001)
print(rest)
#> 
#> --------------------------
#> 0 obs. with higher risk as the main part
#> Expected no. of re-identifications:
#> 1.53
#> (0.03%)
#> Threshold:0
#>  (for maximal global risk0)
#> --------------------------

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
## -> when using `createSdcObj()`, the risks are already computed internally,
## so it is not required to explicitly run `sdc <- measure_risk(sdc)`
# }