Microaggregation for numerical and categorical key variables based on a distance similar to the Gower Distance

The microaggregation is based on distances computed in a manner similar to the Gower distance. The distance function distinguishes between the variable types factor, ordered, numerical and mixed (semi-continuous variables with a fixed probability mass at a constant value, e.g. 0).
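
To make the role of the four variable types concrete, here is a minimal base-R sketch of a Gower-like distance between two records. It is an illustration only, not the code used by sdcMicro or VIM: the helper name gower_like, the toy data and the exact per-type rules (0/1 mismatch for factors, scaled rank difference for ordered factors, range-scaled absolute difference for numeric variables, and a 0/1 rule at the constant for mixed variables) are assumptions.

```r
# Hypothetical helper (not part of sdcMicro or VIM): Gower-like distance
# between rows i and j of a data frame, averaged over all columns.
gower_like <- function(df, i, j, mixed = character(0),
                       mixed.constant = numeric(0),
                       weights = rep(1, ncol(df))) {
  d <- numeric(ncol(df))
  for (k in seq_along(df)) {
    x <- df[[k]]
    name <- names(df)[k]
    if (name %in% mixed) {
      # semi-continuous: compare "at the constant" status first
      const <- mixed.constant[match(name, mixed)]
      at_i <- x[i] == const
      at_j <- x[j] == const
      if (at_i != at_j) {
        d[k] <- 1
      } else if (at_i && at_j) {
        d[k] <- 0
      } else {
        r <- diff(range(x[x != const]))
        d[k] <- if (r > 0) abs(x[i] - x[j]) / r else 0
      }
    } else if (is.ordered(x)) {
      # ordered factor: difference of level positions, scaled to [0, 1]
      codes <- as.integer(x)
      span <- nlevels(x) - 1
      d[k] <- if (span > 0) abs(codes[i] - codes[j]) / span else 0
    } else if (is.factor(x)) {
      # unordered factor: simple mismatch
      d[k] <- as.numeric(x[i] != x[j])
    } else {
      # numeric: absolute difference scaled by the observed range
      r <- diff(range(x))
      d[k] <- if (r > 0) abs(x[i] - x[j]) / r else 0
    }
  }
  sum(weights * d) / sum(weights)
}

# Toy data (assumed for illustration)
toy <- data.frame(
  sex    = factor(c("m", "f", "f")),
  edu    = factor(c("low", "mid", "high"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  age    = c(20, 40, 60),
  income = c(0, 0, 1000)
)
d12 <- gower_like(toy, 1, 2, mixed = "income", mixed.constant = 0)
d13 <- gower_like(toy, 1, 3, mixed = "income", mixed.constant = 0)
```

In this sketch, ordered factors are tested before plain factors because an ordered factor also satisfies is.factor.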

microaggrGower(
  obj,
  variables = NULL,
  aggr = 3,
  dist_var = NULL,
  by = NULL,
  mixed = NULL,
  mixed.constant = NULL,
  trace = FALSE,
  weights = NULL,
  numFun = mean,
  catFun = VIM::sampleCat,
  addRandom = FALSE
)

Arguments

obj: sdcMicroObj-class-object or a data.frame
variables: character vector with names of variables to be aggregated (Default for sdcMicroObj is all keyVariables and all numeric key variables)
aggr: aggregation level (default=3)
dist_var: character vector with variable names for distance computation
by: character vector with variable names to split the dataset before performing microaggregation (Default for sdcMicroObj is strataVar)
mixed: character vector with names of mixed variables
mixed.constant: numeric vector with the same length as mixed, giving the constant value at which each mixed variable has its probability mass
trace: TRUE/FALSE for some console output
weights: numerical vector with length equal to the number of variables for distance computation
numFun: function to be used to aggregate numerical variables
catFun: function to be used to aggregate categorical variables
addRandom: TRUE/FALSE if a random value should be added for the distance computation.

Value

The function returns the updated sdcMicroObj if obj is an sdcMicroObj, or the altered data frame if obj is a data.frame.

Details

The function sampleCat samples a level with probabilities corresponding to the frequency of each level among the nearest neighbours (NNs). The function maxCat chooses the level with the most occurrences and picks one at random if the maximum is not unique.
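
The difference between the two aggregation rules can be sketched in base R as follows. The functions sample_cat_sketch and max_cat_sketch are hypothetical stand-ins written for illustration; they are not the VIM implementations and may differ from them in detail.

```r
# Hypothetical stand-in for sampleCat: draw one level with probability
# proportional to its frequency among the neighbours.
sample_cat_sketch <- function(x) {
  tab <- table(x)
  sample(names(tab), size = 1, prob = as.numeric(tab) / sum(tab))
}

# Hypothetical stand-in for maxCat: take the most frequent level,
# breaking ties at random.
max_cat_sketch <- function(x) {
  tab <- table(x)
  top <- names(tab)[tab == max(tab)]
  if (length(top) > 1) sample(top, size = 1) else top
}

set.seed(1)
nn <- factor(c("a", "a", "b"))    # levels observed among three neighbours
mode_level <- max_cat_sketch(nn)  # "a" is the unique most frequent level
drawn_level <- sample_cat_sketch(nn)  # "a" with probability 2/3, "b" with 1/3
```

A mode-based rule is deterministic whenever the mode is unique, whereas a sampling rule keeps some of the original variability of the categorical variable.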

Note

Within each by-group all pairwise distances are computed; splitting the data into more by-groups therefore significantly decreases computation time and memory consumption.
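
A rough count shows why. The sketch below assumes that the cost is driven by the number of unordered record pairs within each group; the helper n_pairs and the group sizes are made up for illustration.

```r
# Number of unordered pairs among n records
n_pairs <- function(n) n * (n - 1) / 2

group_sizes <- c(50, 50, 50, 50)           # four by-groups, 200 records in total

pairs_all <- n_pairs(sum(group_sizes))     # no by-groups: 200 * 199 / 2 = 19900
pairs_by  <- sum(n_pairs(group_sizes))     # with by-groups: 4 * 1225 = 4900
pairs_by / pairs_all                       # roughly a quarter of the work
```

With k by-groups of similar size, the number of pairs shrinks by roughly a factor of k.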

Author

Alexander Kowarik

Examples


data(testdata,package="sdcMicro")
testdata <- testdata[1:200,]
# \donttest{
for(i in c(1:7,9)) testdata[,i] <- as.factor(testdata[,i])
test <- microaggrGower(testdata,variables=c("relat","age","expend"),
  dist_var=c("age","sex","income","savings"),by=c("urbrur","roof"))
#>         age      income     savings         age      income     savings 
#>        4.00  3071751.00    21073.89       61.00 93600000.00  9391037.00 
#>          age       income      savings          age       income      savings 
#>        1.000     2897.484    11751.200       70.000 99600000.000  9984098.000 

for(i in c(1:7,9)) testdata[,i] <- as.ordered(testdata[,i])
sdc <- createSdcObj(testdata,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')

sdc <- microaggrGower(sdc)
#> Warning: The number of unique values in the ordinal variables in data.x
#>               does not correspond to the values given in levOrders
#> Warning: The number of unique values in the ordinal variables in data.y
#>               does not correspond to the values given in levOrders
#>       expend       income      savings       expend       income      savings 
#>  1106874.000     2897.484    11751.200 98766142.000 99600000.000  9984098.000 
# }