Record linkage via Global Distance-Based Record Linkage

Implements Global Distance-Based Record Linkage (GDBRL; Herranz et al., 2015). The method links records in an original dataset to records in an anonymized/protected dataset by computing pairwise distances on selected linkage variables and then finding the one-to-one matching with minimum total distance using the Hungarian algorithm.

This corresponds to an attacker scenario in which the adversary knows the original data, or equivalent external information, for the linkage variables and uses the released protected dataset to infer the most plausible global record-to-record matching.

recordLinkage(
  x,
  y,
  vars,
  distance = c("gower", "euclidean", "manhattan"),
  weights = NULL,
  x_id = NULL,
  y_id = NULL,
  return_matrix = FALSE,
  na_action = c("ignore", "fail"),
  tol = sqrt(.Machine$double.eps)
)

Arguments

x

A `data.frame` containing the original data.

y

A `data.frame` containing the anonymized/protected data.

vars

Character vector of variable names used for record linkage. These variables must exist in both `x` and `y`.

distance

Character string specifying the distance metric. One of `"gower"` (default), `"euclidean"`, or `"manhattan"`.

weights

Optional numeric vector of variable weights passed to [cluster::daisy()]. Must have length `length(vars)`. If `NULL`, equal weights are used.

x_id

Optional single character string naming the identifier column in `x`. If `NULL`, row numbers are used as truth IDs.

y_id

Optional single character string naming the identifier column in `y`. If `NULL`, row numbers are used as truth IDs.

return_matrix

Logical; if `TRUE`, the full pairwise distance matrix is returned in the `distance_matrix` element of the result.

na_action

Character string specifying how to handle missing values in linkage variables. One of:

`"ignore"`: retain missing values and compute each pairwise distance from the linkage variables that are observed for both records of the pair, as handled by [cluster::daisy()]. Distances for different record pairs may therefore be based on different numbers of variables. Missing values are not treated as a separate category and do not contribute directly to the corresponding variable-specific distance.
`"fail"`: stop if the linkage variables contain any missing values.

tol

Numeric tolerance used when deciding whether distances are tied at the row-wise minimum (see `n_best` in Details).
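The `"ignore"` behaviour described for `na_action` can be illustrated by calling [cluster::daisy()] directly. The following is a minimal sketch on made-up data (the data frame `toy` and the check `has_missing` are hypothetical); it is not the internal code of `recordLinkage()`.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data: record 2 has a missing value in `a`.
toy <- data.frame(
  a = c(1, NA, 3, 5),
  b = c(2, 4, 6, 10)
)

# Gower distance: each numeric variable is scaled by its observed range.
d <- as.matrix(daisy(toy, metric = "gower"))

# Pair (1, 3): both variables are observed, so both contribute.
d[1, 3]

# Pair (1, 2): `a` is missing for record 2, so only `b` contributes.
d[1, 2]

# A check in the spirit of na_action = "fail" (assumed, not the package's code).
has_missing <- anyNA(toy[, c("a", "b")])
has_missing
```

Because the two pairs rest on different numbers of variables, their distances are not strictly comparable, which is worth keeping in mind when interpreting matches on data with many missing values.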

Value

An object of class `"recordLinkage"` with elements:

matches: A `data.frame` with one row per matched pair, containing the columns `x_row`, `y_row`, `x_id`, `y_id`, `distance`, `correct_match`, and `n_best`
correct_matches: Number of correctly linked records
correct_match_rate: Proportion of correctly linked records
mean_distance: Mean matched distance
total_distance: Total matched distance
distance_matrix: The full pairwise distance matrix, returned when `return_matrix = TRUE`
call: The matched call

Details

The distance measure can be chosen via `distance`. Gower distance is suitable for mixed-type quasi-identifiers, including numeric, factor, character, and logical variables. Variables of class factor are treated as nominal variables, while variables of class ordered are treated as ordinal variables. Euclidean and Manhattan distances are supported for purely numeric linkage variables. The Hungarian algorithm finds the global minimum-cost one-to-one assignment.
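The two steps described above (pairwise distances, then a global assignment) can be sketched with [cluster::daisy()] and [clue::solve_LSAP()]. This is a hedged illustration, not the function's actual implementation: in particular, stacking `x` and `y` before computing Gower distances, so that both share the same variable ranges, is an assumption made for this sketch, and the object names (`stacked`, `cost`, `assignment`) are hypothetical.

```r
library(cluster)  # daisy()
library(clue)     # solve_LSAP()

x <- data.frame(
  age    = c(23, 40, 35),
  sex    = factor(c("f", "m", "f")),
  region = factor(c("A", "B", "A"))
)
y <- data.frame(
  age    = c(24, 39, 35),
  sex    = factor(c("f", "m", "f")),
  region = factor(c("A", "B", "B"))
)

# Assumption: stack both data sets so ranges and factor levels are shared.
stacked <- rbind(x, y)
d <- as.matrix(daisy(stacked, metric = "gower"))

# Keep only the cross block: rows are records of x, columns records of y.
n <- nrow(x)
cost <- d[seq_len(n), n + seq_len(n), drop = FALSE]

# Hungarian algorithm: one-to-one assignment with minimum total distance.
assignment <- as.integer(solve_LSAP(cost))

# Distance of each matched pair.
matched_distance <- cost[cbind(seq_len(n), assignment)]

data.frame(
  x_row    = seq_len(n),
  y_row    = assignment,
  distance = matched_distance
)
```

The cost matrix must be square for a strict one-to-one assignment, which is why the sketch uses data sets with the same number of rows.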

Alongside the global assignment, the `matches` element reports `n_best`: for each record in `x`, the number of records in `y` whose distance equals the row-wise minimum of the pairwise distance matrix, with ties determined using `tol`. A value greater than 1 means the nearest candidate is not unique. Because the assignment is global, the row-wise minimum need not be the distance of the matched pair. If multiple optimal assignments exist, the chosen solution depends on the deterministic behavior of [clue::solve_LSAP()] for the supplied cost matrix.
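As a rough illustration of how such a tie count can be obtained, the base R sketch below counts, for each row of a made-up distance matrix, the entries within a tolerance of the row minimum. The matrix `cost` is hypothetical, and the exact comparison used inside `recordLinkage()` may differ.

```r
# Hypothetical 3 x 3 distance matrix: rows are records of x, columns records of y.
cost <- matrix(
  c(0.10, 0.10, 0.50,
    0.40, 0.20, 0.30,
    0.25, 0.60, 0.25),
  nrow = 3, byrow = TRUE
)

# Same default tolerance as the `tol` argument.
tol <- sqrt(.Machine$double.eps)

# Minimum distance in each row.
row_min <- apply(cost, 1, min)

# Count the entries in each row that lie within `tol` of the row minimum.
# `row_min` is recycled down the columns, so entry [i, j] is compared
# with the minimum of row i.
n_best <- rowSums(abs(cost - row_min) <= tol)
n_best
```

Rows 1 and 3 each have two candidates at the minimum, so a nearest-neighbour attacker could not single out one record for them.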

For strict global assignment, `nrow(x)` must equal `nrow(y)`. If `x_id` and `y_id` are not supplied, records are assumed to correspond by row position (row `i` of `x` is the true match of row `i` of `y`) when evaluating correct matches.
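The difference between evaluating against identifiers and evaluating against row position can be sketched in base R. The identifiers and the assignment below are made up for illustration and do not come from `recordLinkage()`.

```r
# Hypothetical identifiers; y is a shuffled copy of x.
x_id <- c("a", "b", "c")
y_id <- c("c", "a", "b")

# Hypothetical assignment: the y row matched to each x row.
assignment <- c(2L, 3L, 1L)

# With identifiers: a link is correct when the matched IDs agree.
correct_with_ids <- x_id == y_id[assignment]

# Without identifiers: row i of x is assumed to belong to row i of y.
correct_by_row <- assignment == seq_along(assignment)

c(with_ids = mean(correct_with_ids), by_row = mean(correct_by_row))
```

When the protected data have been reordered, supplying `x_id` and `y_id` is therefore needed for the match rate to be meaningful.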

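Results can depend on the matching direction (which data frame is passed as `x` and which as `y`) and on the row order of the input data frames, for example when several optimal assignments exist or when row positions serve as identifiers.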

References

Herranz, J., Nin, J., Rodríguez, P., and Tassa, T. (2015). Revisiting distance-based record linkage for privacy-preserving release of statistical datasets. Data & Knowledge Engineering, 100, 78–93. doi:10.1016/j.datak.2015.07.009

Hornik, K. (2005). A CLUE for cluster ensembles. Journal of Statistical Software, 14(12). doi:10.18637/jss.v014.i12

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2026). cluster: Cluster Analysis Basics and Extensions. https://CRAN.R-project.org/package=cluster

Examples

x <- data.frame(
  id = c(1, 2, 3),
  age = c(23, 40, 35),
  sex = factor(c("f", "m", "f")),
  region = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

y <- data.frame(
  id = c(1, 2, 3),
  age = c(24, 39, 35),
  sex = factor(c("f", "m", "f")),
  region = c("A", "B", "B"),
  stringsAsFactors = FALSE
)

out <- recordLinkage(
  x = x,
  y = y,
  vars = c("age", "sex", "region"),
  distance = "gower",
  x_id = "id",
  y_id = "id"
)

out
#> <recordLinkage>
#> Correct matches:       3/3
#> Correct match percent: 100.00%
#> Mean distance:         0.124183
out$matches
#>   x_row y_row x_id y_id   distance correct_match n_best
#> 1     1     1    1    1 0.01960784          TRUE      1
#> 2     2     2    2    2 0.01960784          TRUE      1
#> 3     3     3    3    3 0.33333333          TRUE      1