MEC with blocking

Adam Struzik

1 Setup

Load required packages.

library(automatedRecLin)
library(data.table)

options("text2vec.mc.cores" = 1L)

2 Data

We use the full example Census and Customer Information System (CIS) datasets from McLeod et al. (2011). The goal is to link records from CIS to records from Census.

data("census", package = "automatedRecLin")
data("cis", package = "automatedRecLin")
setDT(census)
setDT(cis)

NROW(cis)
#> [1] 24613
NROW(census)
#> [1] 25343

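The person_id variable identifies the correct linkage. We use this information only to evaluate the result. Before linking, we replace missing values with empty strings, remove hyphens from pername1, and build the table of true matches as pairs of row indices (a in CIS, b in Census).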

cis[is.na(cis)] <- ""
census[is.na(census)] <- ""

cis[, pername1 := gsub("-", "", pername1)]
census[, pername1 := gsub("-", "", pername1)]

true_matches <- merge(
  x = cis[, .(a = .I, person_id)],
  y = census[, .(b = .I, person_id)],
  by = "person_id"
)[, .(a, b)]

NROW(true_matches)
#> [1] 24043
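
To illustrate how the index pairs are formed, here is a minimal sketch on made-up data. It uses base R data frames rather than data.table so that it runs without extra packages; the objects cis_toy, census_toy and toy_matches are hypothetical and not part of the package. Only identifiers present in both tables yield a pair, which is why the number of true matches can be smaller than either dataset.

```r
# Toy records: only p1 and p3 appear in both tables.
cis_toy    <- data.frame(person_id = c("p3", "p1", "p9"))
census_toy <- data.frame(person_id = c("p1", "p2", "p3"))

# Attach row indices (a for CIS, b for Census), then inner-join on person_id.
toy_matches <- merge(
  x = data.frame(a = seq_len(nrow(cis_toy)), person_id = cis_toy$person_id),
  y = data.frame(b = seq_len(nrow(census_toy)), person_id = census_toy$person_id),
  by = "person_id"
)[, c("a", "b")]

# Two pairs remain: CIS row 2 with Census row 1, and CIS row 1 with Census row 3.
print(toy_matches)
```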

3 MEC with blocking

We compare forename and surname (pername1 and pername2) using the Jaro-Winkler distance and model both comparisons with the continuous parametric MEC method. Sex and the date-of-birth components use the default binary method. The address fields enumcap and enumpc are not compared; they are used only for blocking, together with the six comparison variables.
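
To make the name comparison concrete, the following is a minimal base R sketch of a Jaro-Winkler distance, taken here as one minus the Jaro-Winkler similarity. It assumes that this is the quantity jarowinkler_complement() returns; the package's own implementation, scaling factor and edge-case handling may differ. The function names jaro_similarity_sketch and jw_complement_sketch are hypothetical.

```r
# Jaro similarity between two strings (illustrative, not the package's code).
jaro_similarity_sketch <- function(s, t) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(t, "")[[1]]
  la <- length(a)
  lb <- length(b)
  if (la == 0 && lb == 0) return(1)
  if (la == 0 || lb == 0) return(0)

  # Characters match only if they are equal and close enough in position.
  window <- max(floor(max(la, lb) / 2) - 1, 0)
  a_hit <- logical(la)
  b_hit <- logical(lb)
  for (i in seq_len(la)) {
    lo <- max(1, i - window)
    hi <- min(lb, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!b_hit[j] && a[i] == b[j]) {
        a_hit[i] <- TRUE
        b_hit[j] <- TRUE
        break
      }
    }
  }

  m <- sum(a_hit)
  if (m == 0) return(0)
  # Matched characters that appear in a different order, counted in pairs.
  transpositions <- sum(a[a_hit] != b[b_hit]) / 2
  (m / la + m / lb + (m - transpositions) / m) / 3
}

# Jaro-Winkler distance: 1 - similarity, with a bonus for a shared prefix
# of up to four characters (p = 0.1 is the conventional scaling factor).
jw_complement_sketch <- function(s, t, p = 0.1) {
  sim <- jaro_similarity_sketch(s, t)
  a <- strsplit(s, "")[[1]]
  b <- strsplit(t, "")[[1]]
  prefix <- 0
  for (k in seq_len(min(length(a), length(b), 4))) {
    if (a[k] != b[k]) break
    prefix <- prefix + 1
  }
  1 - (sim + prefix * p * (1 - sim))
}

jw_complement_sketch("MARTHA", "MARTHA")  # 0: identical strings
jw_complement_sketch("MARTHA", "MARHTA")  # about 0.0389: one transposition
```

Values near 0 indicate near-identical names, which is why a continuous model is a natural fit for these two variables.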

variables <- c(
  "pername1", "pername2", "sex",
  "dob_day", "dob_mon", "dob_year"
)

comparators <- list(
  "pername1" = jarowinkler_complement(),
  "pername2" = jarowinkler_complement()
)

methods <- list(
  "pername1" = "continuous_parametric",
  "pername2" = "continuous_parametric"
)

blocking_variables <- c(variables, "enumcap", "enumpc")

Run blocked MEC. The model is trained on sampled blocks chosen to contain at least the requested number of pairs (min_training_pairs) and to meet the lower bound on nonmatches (min_training_nonmatches).

set.seed(1)

result <- mec_blocking(
  A = cis,
  B = census,
  variables = variables,
  comparators = comparators,
  methods = methods,
  blocking_variables = blocking_variables,
  blocking_sep = "",
  controls_blocking = list(seed = 1, n_threads = 1),
  min_training_pairs = 1000,
  min_training_nonmatches = 1000,
  block_sampling_seed = 1,
  nonmatch_sample_size = 100000,
  nonmatch_sampling_seed = 1,
  true_matches = true_matches
)

result
#> Blocked MEC record linkage based on the following variables:  
#> pername1, pername2, sex, dob_day, dob_mon, dob_year.
#> ========================================================
#> Number of final blocks: 23726.
#> Training rule: threshold_sampling.
#> Number of training blocks: 14741.
#> Number of training pairs: 15741.
#> Training nonmatch lower bound: 1000.
#> ========================================================
#> The algorithm predicted 23718 matches.
#> The first 6 predicted matches are:
#>        a     b block ratio / 1000
#>    <int> <int> <num>        <num>
#> 1:  8152     1     1    138413.65
#> 2:  8584     2     2    598056.47
#> 3: 20590     3     3    999258.68
#> 4: 18456     4     4     51129.87
#> 5: 17257     5     5    316432.73
#> 6: 19868     6     6    316432.73
#> ========================================================
#> Estimated false link rate (FLR): 0.0033 %.
#> Estimated missing match rate (MMR): 0.0033 %.
#> ========================================================
#> Blocking diagnostics:
#> Known matches: 24043.
#> Known matches retained by blocking: 23688.
#> Known matches missed by blocking: 355.
#> Blocking MMR: 1.4765 %.
#> Candidate pairs retained: 25343 of 623767259.
#> Candidate pair reduction: 99.9959 %.
#> ========================================================
#> Evaluation metrics:
#> FLR (%) MMR (%) 
#>  0.1560  1.5056
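
The evaluation metrics can be illustrated on made-up data. The sketch below assumes that FLR is the share of predicted links that are not true matches and that MMR is the share of true matches that were not predicted; this reading is an assumption, not taken from the package documentation. The helper pair_key and the toy tables are hypothetical.

```r
# Identify each (a, b) pair by a single string key.
pair_key <- function(pairs) paste(pairs$a, pairs$b, sep = "_")

# Toy data: five true matches, four predicted links, three of them correct.
truth     <- data.frame(a = 1:5, b = 1:5)
predicted <- data.frame(a = c(1, 2, 3, 6), b = c(1, 2, 3, 4))

false_links    <- sum(!pair_key(predicted) %in% pair_key(truth))
missed_matches <- sum(!pair_key(truth) %in% pair_key(predicted))

flr <- false_links / nrow(predicted)   # 1 of 4 predicted links is wrong
mmr <- missed_matches / nrow(truth)    # 2 of 5 true matches are missing

# Under the same definitions, 37 false links among 23718 predictions and
# 362 missed matches among 24043 known matches give the printed rates.
flr_check <- round(100 * 37 / 23718, 4)
mmr_check <- round(100 * 362 / 24043, 4)
c(flr = flr, mmr = mmr, flr_check = flr_check, mmr_check = mmr_check)
```

Note that this MMR counts matches lost at the blocking stage as well as those lost at classification, so it cannot fall below the blocking MMR.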

4 Blocking efficiency and linkage results

The full Cartesian product contains 623,767,259 record pairs. Blocking reduces this to 25,343 candidate pairs, while retaining 98.52% of known links. The final linkage set contains 23,718 predicted matches.

Step      Result
Training  threshold_sampling on 14,741 blocks
Blocking  23,688 of 24,043 known links retained
Linkage   FLR = 0.16%; MMR = 1.51%
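
The blocking figures quoted in this section can be recomputed from the counts printed earlier. This is a plain arithmetic check in base R; the variable names are chosen for illustration only.

```r
# Counts taken from the printed output above.
n_cis           <- 24613
n_census        <- 25343
candidate_pairs <- 25343
known_matches   <- 24043
retained        <- 23688

# Size of the full comparison space and the reduction achieved by blocking.
all_pairs <- n_cis * n_census
reduction <- 100 * (1 - candidate_pairs / all_pairs)

# Share of known matches that survive blocking, and its complement.
retained_share <- 100 * retained / known_matches
blocking_mmr   <- 100 - retained_share

all_pairs                 # 623767259
round(reduction, 4)       # 99.9959
round(retained_share, 2)  # 98.52
round(blocking_mmr, 4)    # 1.4765
```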