Microaggregation

Function to perform various methods of microaggregation.

microaggregation(
  obj,
  variables = NULL,
  aggr = 3,
  strata_variables = NULL,
  method = "mdav",
  weights = NULL,
  nc = 8,
  clustermethod = "clara",
  measure = "mean",
  trim = 0,
  varsort = 1,
  transf = "log"
)

Arguments

obj: either an object of class sdcMicroObj-class or a data.frame
variables: variables to microaggregate. If NULL and obj is of class sdcMicroObj, all numerical key variables are used by default. If NULL and obj is a data.frame, all columns are used by default.
aggr: aggregation level (default=3)
strata_variables: for data.frames, by-variables for applying microaggregation only within strata defined by the variables. For sdcMicroObj-class-objects, the stratification-variable defined in slot @strataVar is used. This slot can be changed any time using strataVar<-.
method: microaggregation method, one of pca, rmd, onedims, single, simple, clustpca, pppca, clustpppca, mdav, clustmcdpca, influence, mcdpca (default=mdav)
weights: sampling weights. If obj is of class sdcMicroObj the vector of sampling weights is chosen automatically. If weights are specified, a weighted version of the aggregation measure is chosen automatically, e.g. weighted median or weighted mean.
nc: number of clusters, if the chosen method performs cluster analysis
clustermethod: clustering method, if the chosen method performs cluster analysis
measure: aggregation statistic, one of mean, median, trim, onestep (default=mean)
trim: trimming percentage, if measure=trim
varsort: variable for sorting, if method=single
transf: transformation for data x

Value

If ‘obj’ is of class sdcMicroObj-class, the corresponding slots are filled, such as manipNumVars, risk and utility. If ‘obj’ is of class “data.frame”, an object of class “micro” with the following elements is returned:

x: original data
mx: the microaggregated dataset
method: method
aggr: aggregation level
measure: proximity measure for aggregation

Details

The “official” definition of microaggregation can be found at https://research.cbs.nl/casc/glossary.htm:

Records are grouped based on a proximity measure of variables of interest, and the same small groups of records are used in calculating aggregates for those variables. The aggregates are released instead of the individual record values.
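The definition above can be illustrated for a single variable with a few lines of base R. This is only a minimal sketch under simplifying assumptions: the helper micro_sketch is hypothetical and not part of sdcMicro, proximity is simply the ordering of the values, and a short remainder is merged into the last group.

```r
# Hypothetical helper (not sdcMicro code): univariate microaggregation
# by sorting the values and averaging within fixed-size groups.
micro_sketch <- function(x, aggr = 3) {
  n <- length(x)
  ord <- order(x)
  # consecutive groups of size aggr over the sorted values;
  # a remainder shorter than aggr is merged into the last full group
  n_groups <- max(1, n %/% aggr)
  grp <- pmin((seq_len(n) - 1) %/% aggr + 1, n_groups)
  out <- numeric(n)
  # replace every value by its group mean, restoring the original record order
  out[ord] <- ave(x[ord], grp, FUN = mean)
  out
}

x <- c(10, 1, 7, 3, 12, 2, 8)
mx <- micro_sketch(x, aggr = 3)
mx
#> [1] 9.25 2.00 9.25 2.00 9.25 2.00 9.25
```

Every released value is shared by at least aggr records, and the overall mean of the variable is unchanged.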

The recommended method is “rmd” which forms the proximity using multivariate distances based on robust methods. It is an extension of the well-known method “mdav”. However, when computational speed is important, method “mdav” is the preferable choice.

While very different concepts can be used for the proximity measure, the aggregation itself is naturally done with the arithmetic mean. Nevertheless, other measures of location can be used for aggregation, especially when the group size for aggregation is larger than 3. Since the median seems to be unsuitable for microaggregation because it is highly robust, the other included measures can be chosen instead. If data from a complex sample survey are microaggregated, the corresponding sampling weights should be specified so that the values are aggregated by either the weighted arithmetic mean or the weighted median.
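To make the choice of aggregation measure concrete, the following base R sketch compares several measures of location on one group of five records containing an outlier. The values and the weights w are invented for illustration, and the base R functions used here are stand-ins rather than the code that microaggregation() runs internally.

```r
# One aggregation group of size 5 with a single outlying value
group <- c(20, 22, 25, 27, 400)
# Hypothetical sampling weights for the five records
w <- c(1, 1, 2, 2, 1)

stats <- c(
  mean     = mean(group),               # arithmetic mean, pulled up by the outlier
  median   = median(group),             # highly robust
  trimmed  = mean(group, trim = 0.2),   # drops the smallest and largest value
  weighted = weighted.mean(group, w)    # weighted arithmetic mean
)
round(stats, 2)
#>     mean   median  trimmed weighted 
#>    98.80    25.00    24.67    78.00 
```

Note that base R's trim is a fraction between 0 and 0.5, which is not necessarily the same scale as the trim argument documented above.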

This function also contains a method with which the data can be clustered using a variety of different clustering algorithms. Clustering observations before applying microaggregation might be useful. Note that the data are automatically standardised before clustering.

The usage of clustering method ‘Mclust’ requires package mclust02, which must be loaded first. The package is not loaded automatically, since the package is not under GPL but comes with a different licence.

There are also some projection methods for microaggregation included. The robust version ‘pppca’ and ‘clustpppca’ (which clusters first) are fast implementations and almost always provide the best results.

Univariate statistics are preserved best with the individual ranking method (called ‘onedims’ here, although it is often named ‘individual ranking’), but multivariate statistics are strongly affected.
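The effect described above can be sketched in base R by microaggregating every column with its own ordering. The helper aggregate_sorted and the simulated data are hypothetical and only mimic the idea behind ‘onedims’; they are not the sdcMicro implementation.

```r
# Hypothetical helper (not sdcMicro code): sort one variable,
# form consecutive groups of size aggr and replace values by group means.
aggregate_sorted <- function(x, aggr = 3) {
  ord <- order(x)
  n_groups <- max(1, length(x) %/% aggr)
  grp <- pmin((seq_along(x) - 1) %/% aggr + 1, n_groups)
  out <- numeric(length(x))
  out[ord] <- ave(x[ord], grp, FUN = mean)
  out
}

# Simulated data with two negatively correlated variables
set.seed(1)
d <- data.frame(a = rnorm(30))
d$b <- -d$a + rnorm(30, sd = 0.5)

# Individual ranking: each column gets its own ordering and grouping
d_ir <- as.data.frame(lapply(d, aggregate_sorted, aggr = 3))

# Column means are preserved exactly (up to floating point error)
colMeans(d) - colMeans(d_ir)

# The correlation is not preserved exactly; compare before and after
c(original = cor(d$a, d$b), microaggregated = cor(d_ir$a, d_ir$b))
```

Because the groups differ from column to column, a record's masked values no longer come from the same set of neighbours, which is why multivariate structure can change.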

With method ‘simple’ one can apply microaggregation directly to the (unsorted) data. It is useful as a benchmark for comparison with other methods, i.e. it answers the question of how much better it is to sort the data before aggregation.
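The benchmark idea can be sketched in base R for a single simulated variable. The sum of squared differences between original and aggregated values is used here as an assumed, simple information-loss measure; it is not necessarily the measure reported by sdcMicro.

```r
set.seed(42)
x <- runif(60, 0, 100)
aggr <- 3
# group labels 1, 1, 1, 2, 2, 2, ... for consecutive blocks of size aggr
block <- rep(seq_len(length(x) / aggr), each = aggr)

# "simple"-style aggregation: groups follow the given (unsorted) row order
simple <- ave(x, block, FUN = mean)

# sorted aggregation: groups follow the ordering of the values
ord <- order(x)
sorted <- numeric(length(x))
sorted[ord] <- ave(x[ord], block, FUN = mean)

# assumed information-loss measure: sum of squared deviations from the original
loss <- c(simple = sum((x - simple)^2), sorted = sum((x - sorted)^2))
loss
```

The gap between the two losses indicates how much the sorting step contributes for this variable.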

Note

If only one variable is specified, mafast is applied and the argument method is ignored. The argument measure is ignored for methods mdav and rmd.

References

Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.

Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php

Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.

Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4

Templ, M. and Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67 (4), 1–36, 2015.

Author

Matthias Templ, Bernhard Meindl

For method “mdav”: This work is being supported by the International Household Survey Network (IHSN) and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.

Author for the integration of the code for mdav in R: Alexander Kowarik.

Examples

data(testdata)
# donttest because these examples have CPU time larger than 2.5 times the
# elapsed time, due to the use of data.table and multicore computation.
# \donttest{
m <- microaggregation(
  obj = testdata[1:100, c("expend", "income", "savings")],
  method = "mdav",
  aggr = 4
)
summary(m)
#> $meansx
#>      expend             income            savings       
#>  Min.   : 1106874   Min.   :    2897   Min.   :  11751  
#>  1st Qu.:25977689   1st Qu.:27750000   1st Qu.:2620342  
#>  Median :45716872   Median :44850000   Median :4771488  
#>  Mean   :48440371   Mean   :49180278   Mean   :4798498  
#>  3rd Qu.:69426340   3rd Qu.:70650000   3rd Qu.:6940269  
#>  Max.   :98685205   Max.   :99600000   Max.   :9984098  
#> 
#> $meansxm
#>      expend             income            savings       
#>  Min.   :14471827   Min.   : 4482460   Min.   : 872137  
#>  1st Qu.:22752145   1st Qu.:24675000   1st Qu.:2353262  
#>  Median :42916487   Median :46850000   Median :5116959  
#>  Mean   :48440371   Mean   :49180278   Mean   :4798498  
#>  3rd Qu.:71888065   3rd Qu.:65000000   3rd Qu.:7020042  
#>  Max.   :91918606   Max.   :93725000   Max.   :9407783  
#> 
#> $amean
#> [1] 0
#> 
#> $amedian
#> [1] 0.1782512
#> 
#> $aonestep
#> [1] 0
#> 
#> $devvar
#> [1] 0.3747106
#> 
#> $amad
#> [1] 0.5343033
#> 
#> $acov
#> [1] 0.1873553
#> 
#> $arcov
#> [1] NA
#> 
#> $acor
#> [1] 0.25935
#> 
#> $arcor
#> [1] NA
#> 
#> $acors
#> [1] 0.6424174
#> 
#> $adlm
#> [1] 0.1699611
#> 
#> $adlts
#> [1] NA
#> 
#> $apcaload
#> [1] 0.6853318
#> 
#> $apppcaload
#> [1] 2.27684
#> 
#> $totalsOrig
#>     expend     income    savings 
#> 4844037117 4918027777  479849813 
#> 
#> $totalsMicro
#> numeric(0)
#> 
#> $atotals
#> [1] 0
#> 
#> $pmtotals
#> [1] 0
#> 
#> $util1
#> [1] 53.39995
#> 
#> $deigenvalues
#> [1] 0.0596172
#> 
#> $risk0
#> [1] 0
#> 
#> $risk1
#> [1] 0.32
#> 
#> $risk2
#> [1] 0
#> 
#> $wrisk1
#> [1] 0.9830069
#> 
#> $wrisk2
#> [1] 0
#> 

## for objects of class sdcMicroObj:
## no stratification because `@strataVar` is `NULL`
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight"
)
sdc <- microaggregation(
  obj = sdc,
  variables = c("expend", "income")
)

## with stratification using variable `"relat"`
strataVar(sdc) <- "relat"
sdc <- microaggregation(
  obj = sdc,
  variables = "savings"
)
# }