Targeted Record Swapping — recordSwap • sdcMicro

Applies targeted record swapping on micro data, considering the identification risk of each record as well as the geographic topology.

recordSwap(data, ...)

# S3 method for class 'sdcMicroObj'
recordSwap(data, ...)

# Default S3 method
recordSwap(
  data,
  hid,
  hierarchy,
  similar,
  swaprate = 0.05,
  risk = NULL,
  risk_threshold = 0,
  k_anonymity = 3,
  risk_variables = NULL,
  carry_along = NULL,
  return_swapped_id = FALSE,
  log_file_name = "TRS_logfile.txt",
  seed = NULL,
  ...
)

Arguments

data: must be either a micro data set in the form of a `data.table` or `data.frame`, or an `sdcMicroObj` object, see `createSdcObj()`.
...: parameters passed to `recordSwap.default()`
hid: column index or column name in `data` which refers to the household identifier.
hierarchy: column indices or column names of variables in `data` which refer to the geographic hierarchy in the micro data set. For instance county > municipality > district.
similar: vector or list of integer vectors or column names containing similarity profiles, see details for more explanations.
swaprate: double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations.
risk: either column indices or column names in `data`, or a `data.table`, `data.frame` or `matrix` indicating the risk of each record at each hierarchy level. If `risk` is supplied, the swapping procedure does not use the k-anonymity rule but the supplied risk values instead. Every member of a household is expected to carry the maximum risk value found in that household. If this condition is not satisfied, the risk values are automatically adjusted to comply with it.
risk_threshold: single numeric value indicating when a household is considered "high risk", i.e. when this household must be swapped. Only used when `risk` is not `NULL`. The risk threshold marks households that have to be swapped, but be aware that households with a risk below the threshold, yet still high enough, may be swapped as well. Only households with risk set to 0 are never swapped. Risk and risk threshold must be greater than or equal to 0.
k_anonymity: integer defining the threshold for high risk households (counts < k) when the k-anonymity rule is used.
risk_variables: column indices or column names of variables in `data` which will be considered for estimating the risk. Only used when k-anonymity rule is applied.
carry_along: column indices or column names (as in the examples) of additional variables to swap in addition to the hierarchy variables. These variables do not interfere with the procedure of finding a record to swap with or with calculating risk. This parameter is only used at the end of the procedure when swapping the hierarchies. The variables used as `carry_along` should be at household level. If they are detected to be at individual level (different values within `hid`), a warning is given.
return_swapped_id: boolean, if `TRUE` the output includes an additional column showing the `hid` with which a record was swapped. The new column will have the name `paste0(hid,"_swapped")`.
log_file_name: character, path for writing a log file. The log file contains a list of household IDs (`hid`) which could not be swapped and is only created if any such households exist.
seed: integer defining the seed for the random number generator, for reproducibility. If `NULL`, a random seed will be set using `sample(1e5,1)`.
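
The household condition described for `risk` can be illustrated with a few lines of base R. This is only a sketch of the idea, not the adjustment code used inside the package; the toy data and the column names `risk_nuts1` and `risk_nuts2` are assumptions made for illustration.

```r
# Toy data: two households with a per-record risk at two hierarchy levels.
# Column names are illustrative and not taken from sdcMicro.
dat <- data.frame(
  hid        = c(1, 1, 2, 2, 2),
  risk_nuts1 = c(0.10, 0.30, 0.00, 0.05, 0.02),
  risk_nuts2 = c(0.20, 0.10, 0.40, 0.40, 0.10)
)
risk_cols <- c("risk_nuts1", "risk_nuts2")

# Assign every household member the maximum risk found in the household,
# separately for each hierarchy level.
for (col in risk_cols) {
  dat[[col]] <- ave(dat[[col]], dat$hid, FUN = max)
}
dat
```

After this step each `hid` has a single risk value per hierarchy level, which is the form the `risk` argument is described as expecting.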

Value

`data.table` with swapped records.

Details

The procedure accepts a `data.frame` or `data.table` containing all information necessary for the record swapping, e.g. the columns referred to by the parameters `hid`, `similar` and `hierarchy`. First, the micro data in `data` is ordered by `hid` and the identification risk is calculated for each record at each hierarchy level. Currently only counts are used as identification risk, and the inverse of the counts is used as sampling probability. NOTE: It will be possible to supply an identification risk for each record and hierarchy level, which will be passed down to the C++ function. This is, however, not fully implemented.

With the parameter `k_anonymity` a k-anonymity rule is applied to define risky households at each hierarchy level. A household is flagged as risky if counts < `k_anonymity` at any hierarchy level, and such a household needs to be swapped across that hierarchy level. For instance, given a geographic hierarchy of NUTS1 > NUTS2 > NUTS3, the counts are calculated for each geographic variable and the defined `risk_variables`. If the counts for a record fall below `k_anonymity` for hierarchy county (NUTS1, NUTS2, ...), then this record needs to be swapped across counties. Setting `k_anonymity = 0` disables this feature and no risky households are defined.
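
The counting rule can be sketched in base R as follows. This is an illustration under simplifying assumptions, not the package's C++ implementation: each household has a single record, the area codes are nested (a `nuts2` code implies its `nuts1`), and the toy data is made up.

```r
# Toy data: one record per household, nested area codes.
dat <- data.frame(
  hid      = 1:6,
  nuts1    = c("A", "A", "A", "A", "B", "B"),
  nuts2    = c("A1", "A1", "A2", "A2", "B1", "B1"),
  ageGroup = c("young", "young", "young", "old", "old", "old"),
  stringsAsFactors = FALSE
)
hierarchy      <- c("nuts1", "nuts2")
risk_variables <- "ageGroup"
k_anonymity    <- 2

# For each hierarchy level, count the records that share the same area
# and the same combination of risk variables.
counts <- sapply(hierarchy, function(h) {
  key <- do.call(paste, c(dat[c(h, risk_variables)], sep = "|"))
  ave(rep(1, nrow(dat)), key, FUN = sum)
})

# A record is risky at a level if its count falls below k_anonymity.
risky     <- counts < k_anonymity
risky_any <- rowSums(risky) > 0
cbind(dat["hid"], counts, risky_any)
```

In this toy data household 4 is the only "old" record in area A, so it is risky at the `nuts1` level, while household 3 is only risky at the `nuts2` level.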

After that, the targeted record swapping is applied, starting from the highest and moving to the lowest hierarchy level, cycling through all possible geographic areas at each hierarchy level, e.g. every county, every municipality in every county, etc.

At each geographic area, a set of records to be swapped is created. In all but the lowest hierarchy level, this set consists ONLY of records which do not fulfil the k-anonymity rule and have not already been swapped. Those records are swapped with records not belonging to the same geographic area which have not already been swapped beforehand. Swapping refers to the interchange of the geographic variables defined in `hierarchy`. When a record is swapped, all other records containing the same `hid` are swapped as well.
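
What "interchange of geographic variables" means for whole households can be sketched in base R. The helper `swap_households()` is hypothetical and written only for this illustration; it is not part of sdcMicro, and it assumes the hierarchy values are constant within each `hid`.

```r
# Toy data: households 1 and 3 have two members each.
dat <- data.frame(
  hid   = c(1, 1, 2, 3, 3),
  nuts1 = c("A", "A", "B", "B", "B"),
  nuts2 = c("A1", "A1", "B1", "B2", "B2"),
  hsize = c(2, 2, 1, 2, 2),
  stringsAsFactors = FALSE
)
hierarchy <- c("nuts1", "nuts2")

# Hypothetical helper: exchange the hierarchy values of two households.
# All records sharing a hid move together; other columns stay untouched.
swap_households <- function(data, hid_a, hid_b, hierarchy) {
  rows_a <- data$hid == hid_a
  rows_b <- data$hid == hid_b
  geo_a <- data[which(rows_a)[1], hierarchy, drop = FALSE]
  geo_b <- data[which(rows_b)[1], hierarchy, drop = FALSE]
  for (h in hierarchy) {
    data[rows_a, h] <- geo_b[[h]]
    data[rows_b, h] <- geo_a[[h]]
  }
  data
}

swapped <- swap_households(dat, hid_a = 1, hid_b = 3, hierarchy = hierarchy)
swapped
```

Only the columns named in `hierarchy` change; household 2 and the `hsize` column are left as they were.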

At the lowest hierarchy level, in every geographic area, the set of records to be swapped consists of all records which do not fulfil the k-anonymity rule, plus as many additional records as are needed for the proportion of swapped records in the geographic area to be consistent with the `swaprate`. If, due to the k-anonymity condition, more records have already been swapped in this geographic area than the `swaprate` requires, then only the records which do not fulfil the k-anonymity rule are swapped.

Using the parameter `similar` one can define similarity profiles. `similar` needs to be a list of vectors, with each list entry containing column indices of `data`. These entries are used when searching for donor households, meaning that for a specific record the set of all donor records consists of records which have the same values in `similar[[1]]`. It is important to note that these variables can only be variables related to households (not persons!). If no suitable donor can be found, the next similarity profile, `similar[[2]]`, is used, and the set of all donors then consists of all records which have the same values in the columns given by `similar[[2]]`. This procedure continues until a donor record is found or all similarity profiles have been used.
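
The fallback through similarity profiles can be sketched in base R. The function `find_donors()`, its `area` argument and the `tenure` column are hypothetical names introduced for this illustration; the actual donor search in the package is implemented in C++ and may differ in detail.

```r
# Toy household-level data.
dat <- data.frame(
  hid    = 1:5,
  nuts1  = c("A", "B", "B", "B", "A"),
  hsize  = c(4, 3, 4, 4, 4),
  tenure = c("own", "own", "rent", "rent", "own"),
  stringsAsFactors = FALSE
)
# First try to match on hsize and tenure, then fall back to hsize only.
similar <- list(c("hsize", "tenure"), "hsize")

# Hypothetical helper: return the hids of all donors for the first
# similarity profile that yields at least one match.
find_donors <- function(data, target_hid, similar, area = "nuts1") {
  target <- data[data$hid == target_hid, , drop = FALSE][1, ]
  # Donors must come from a different geographic area.
  pool <- data[data[[area]] != target[[area]], , drop = FALSE]
  for (profile in similar) {
    match_all <- rep(TRUE, nrow(pool))
    for (v in profile) {
      match_all <- match_all & pool[[v]] == target[[v]]
    }
    if (any(match_all)) return(pool$hid[match_all])
  }
  integer(0)  # no donor found with any profile
}

find_donors(dat, target_hid = 1, similar = similar)
```

For household 1 no household outside area A matches on both `hsize` and `tenure`, so the second profile is used and households 3 and 4 are returned as possible donors.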

`swaprate` sets the proportion of households to be swapped, where a single swap counts as swapping 2 households: the sampled household and the corresponding donor. Prior to the procedure, the swap rate is applied at the lowest hierarchy level to determine the target number of swapped households in each of the lowest-level areas. If the target numbers are not whole numbers, they are randomly rounded up or down such that the total number of households swapped is consistent with the swap rate.
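
One way to picture the random rounding of target numbers is the base R sketch below. The household counts are made up, and the specific scheme (rounding down everywhere, then rounding up randomly chosen areas with probability proportional to the fractional part) is an assumption chosen for illustration; the package may round differently.

```r
set.seed(1)
# Assumed number of households in each lowest-level area.
n_households <- c(A1 = 130, A2 = 55, B1 = 215)
swaprate <- 0.05

exact  <- n_households * swaprate   # fractional targets per area
target <- floor(exact)              # start by rounding everything down

# Number of areas that must be rounded up so that the total matches
# the overall swap rate.
n_up <- round(sum(exact)) - sum(target)
frac <- exact - target
if (n_up > 0) {
  up <- sample(names(exact), size = n_up, prob = frac)
  target[up] <- target[up] + 1
}
target
sum(target)
```

Each area's target differs from its exact value by less than one household, while the overall total stays at 5% of the 400 households.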

Author

Johannes Gussenbauer

Examples

# generate 10000 dummy households
library(data.table)
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- sdcMicro::createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05 # 5%
similar <- list(c("hsize"))
hier <- c("nuts1", "nuts2")
risk_variables <- c("ageGroup", "national")
hid <- "hid"

## apply record swapping
#dat_s <- recordSwap(
#  data = dat,
#  hid = hid,
#  hierarchy = hier,
#  similar = similar,
#  swaprate = swaprate,
#  k_anonymity = k_anonymity,
#  risk_variables = risk_variables,
#  carry_along = NULL,
#  return_swapped_id = TRUE,
#  seed = seed
#)
#
## number of swapped households
#dat_s[hid != hid_swapped, uniqueN(hid)]
#
## hierarchies are not consistently swapped
#dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
#
## use parameter carry_along
#dat_s <- recordSwap(
#  data = dat,
#  hid = hid,
#  hierarchy = hier,
#  similar = similar,
#  swaprate = swaprate,
#  k_anonymity = k_anonymity,
#  risk_variables = risk_variables,
#  carry_along = c("nuts3", "lau2"),
#  return_swapped_id = TRUE,
#  seed = seed)
#
#dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]