Implements Global Distance-Based Record Linkage (GDBRL; Herranz et al., 2015). The method links records in an original dataset to records in an anonymized/protected dataset by computing pairwise distances on selected linkage variables and then finding the minimum-total-distance one-to-one matching with the Hungarian algorithm.
This corresponds to an attacker scenario in which the adversary knows the original data, or equivalent external information, for the linkage variables and uses the released protected dataset to infer the most plausible global record-to-record matching.
A `data.frame` containing the original data.
A `data.frame` containing the anonymized/protected data.
Character vector of variable names used for record linkage. These variables must exist in both `x` and `y`.
Character string specifying the distance metric. One of `"gower"` (default), `"euclidean"`, or `"manhattan"`.
Optional numeric vector of variable weights passed to [cluster::daisy()]. Must have length `length(vars)`. If `NULL`, equal weights are used.
Optional single character string naming the identifier column in `x`. If `NULL`, row numbers are used as truth IDs.
Optional single character string naming the identifier column in `y`. If `NULL`, row numbers are used as truth IDs.
Logical; if `TRUE`, the full pairwise distance matrix is returned.
Character string specifying how to handle missing values in linkage variables. One of:
retain missing values and compute pairwise distances using the subset of linkage variables observed for each record pair, as handled by [cluster::daisy()]. Distances for different record pairs may therefore be based on different numbers of variables. Missing values are not treated as a separate category and do not contribute directly to the corresponding variable-specific distance.
stop if linkage variables contain any missing values.
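As a rough illustration of the option that retains missing values, the following sketch calls [cluster::daisy()] directly on a small, made-up data frame with one missing value. The data and object names are hypothetical, and the sketch shows only how `daisy()` treats `NA` under Gower distance, not how `recordLinkage()` prepares its inputs.

```r
library(cluster)  # daisy()

# Hypothetical toy data: record 2 has no value for variable b.
df <- data.frame(
  a = c(1, 3, 5),
  b = c(10, NA, 30)
)

# Gower distance: each variable is scaled by its observed range, and a
# record pair is compared only on the variables observed for both records.
d <- as.matrix(daisy(df, metric = "gower"))

# Pairs involving record 2 are based on variable a alone,
# while the pair (1, 3) is based on both a and b.
d
```

Distances that involve record 2 are therefore averaged over one variable rather than two, which is why distances for different record pairs are not always directly comparable.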
Numeric tolerance used to determine tied minimum distances.
An object of class `"recordLinkage"` with elements:
A `data.frame` of matched pairs with columns `x_row`, `y_row`, `x_id`, `y_id`, `distance`, `correct_match`, and `n_best`
Number of correctly linked records
Proportion of correctly linked records
Mean matched distance
Total matched distance
Optional full pairwise distance matrix
The matched call
The distance measure can be chosen via `distance`. Gower distance is suitable for mixed-type quasi-identifiers, including numeric, factor, character, and logical variables. Variables of class factor are treated as nominal variables, while variables of class ordered are treated as ordinal variables. Euclidean and Manhattan distances are supported for purely numeric linkage variables. The Hungarian algorithm finds the global minimum-cost one-to-one assignment.
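The two steps described above (pairwise distances, then a global one-to-one assignment) can be sketched with [cluster::daisy()] and [clue::solve_LSAP()] directly. This is a minimal sketch under stated assumptions, not necessarily how `recordLinkage()` works internally: stacking `x` and `y` before computing distances, and the object names `stacked`, `cost`, `assignment`, and `matched`, are illustrative choices.

```r
library(cluster)  # daisy()
library(clue)     # solve_LSAP()

x <- data.frame(age = c(23, 40, 35), sex = factor(c("f", "m", "f")))
y <- data.frame(age = c(24, 39, 35), sex = factor(c("f", "m", "f")))

# Assumption: stack both data sets so that variable ranges and factor
# levels are shared when the Gower distance is computed.
stacked <- rbind(x, y)
d_full <- as.matrix(daisy(stacked, metric = "gower"))

# Keep only the x-by-y block: rows are records in x, columns records in y.
n <- nrow(x)
cost <- d_full[seq_len(n), n + seq_len(nrow(y)), drop = FALSE]

# Hungarian algorithm: minimum-total-distance one-to-one assignment.
# Element i of the result is the row of y linked to row i of x.
assignment <- as.integer(solve_LSAP(cost))

matched <- data.frame(
  x_row = seq_len(n),
  y_row = assignment,
  distance = cost[cbind(seq_len(n), assignment)]
)
matched
sum(matched$distance)
```

Because the assignment minimizes the total distance, an individual record in `x` is not always linked to its nearest record in `y`.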
The returned matches also report `n_best`: for each record in `x`, the number of records in `y` that attain the minimum distance in that record's row of the pairwise distance matrix. This row-wise minimum need not equal the distance of the globally assigned pair. If multiple optimal assignments exist, the chosen solution depends on the deterministic behavior of [clue::solve_LSAP()] for the supplied cost matrix.
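The counting rule for `n_best` can be illustrated in base R. The distance matrix, the tolerance value, and the exact comparison rule (absolute difference no larger than the tolerance) are assumptions made for this sketch; the function's own tie rule may differ in detail.

```r
# Hypothetical 3 x 3 distance matrix:
# rows are records in x, columns are records in y.
d <- matrix(
  c(0.10, 0.10, 0.50,
    0.40, 0.20, 0.30,
    0.25, 0.60, 0.25),
  nrow = 3, byrow = TRUE
)
tol <- 1e-8  # hypothetical tolerance for treating distances as tied

# Minimum distance in each row (one value per record in x).
row_min <- apply(d, 1, min)

# Count, per row, the columns whose distance lies within tol of the row minimum.
# d - row_min subtracts row_min[i] from every entry of row i.
n_best <- rowSums(abs(d - row_min) <= tol)
n_best
```

A value greater than 1 indicates that several records in `y` are equally close candidates for that record in `x`.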
For strict global assignment, `nrow(x)` must equal `nrow(y)`. If `x_id` and `y_id` are not supplied, row order is treated as the truth for evaluating correct matches.
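The following base R sketch shows why identifiers matter when the protected data are not in the same row order as the original data. All objects are hypothetical: `assignment[i]` stands for the row of `y` linked to row `i` of `x`, and the comparison logic is an assumption about how correct matches could be evaluated.

```r
# Hypothetical identifiers: y holds the same units as x in a different order.
x_id <- c("a", "b", "c")
y_id <- c("c", "a", "b")

# Hypothetical assignment: row i of x is linked to row assignment[i] of y.
assignment <- c(2L, 3L, 1L)

# Evaluation against identifiers: a link is correct if both IDs agree.
correct_by_id <- x_id == y_id[assignment]

# Evaluation against row order: a link is correct if the row numbers agree.
correct_by_row <- seq_along(assignment) == assignment

sum(correct_by_id)   # links counted as correct when IDs are the truth
sum(correct_by_row)  # links counted as correct when row order is the truth
```

Here every link is correct by identifier, yet none would be counted as correct if row order were taken as the truth.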
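Results depend on both the matching direction (which dataset is passed as `x` and which as `y`) and the row order of the input data frames: `n_best` is computed per record in `x`, and row order can determine which of several tied optimal assignments is returned.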
Herranz, J., Nin, J., Rodríguez, P., and Tassa, T. (2015). Revisiting distance-based record linkage for privacy-preserving release of statistical datasets. Data & Knowledge Engineering, 100, 78–93. doi:10.1016/j.datak.2015.07.009
Hornik, K. (2005). A CLUE for cluster ensembles. Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2026). cluster: Cluster Analysis Basics and Extensions. https://CRAN.R-project.org/package=cluster
x <- data.frame(
  id = c(1, 2, 3),
  age = c(23, 40, 35),
  sex = factor(c("f", "m", "f")),
  region = c("A", "B", "A"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  id = c(1, 2, 3),
  age = c(24, 39, 35),
  sex = factor(c("f", "m", "f")),
  region = c("A", "B", "B"),
  stringsAsFactors = FALSE
)
out <- recordLinkage(
  x = x,
  y = y,
  vars = c("age", "sex", "region"),
  distance = "gower",
  x_id = "id",
  y_id = "id"
)
out
#> <recordLinkage>
#> Correct matches: 3/3
#> Correct match percent: 100.00%
#> Mean distance: 0.124183
out$matches
#>   x_row y_row x_id y_id   distance correct_match n_best
#> 1     1     1    1    1 0.01960784          TRUE      1
#> 2     2     2    2    2 0.01960784          TRUE      1
#> 3     3     3    3    3 0.33333333          TRUE      1