Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An
Application to the California Great Registers
by
Brendan S. McVeigh, Bradley T. Spahn, Jared S. Murray
2019
Abstract
Probabilistic record linkage (PRL) is the process of determining which
records in two databases correspond to the same underlying entity in the
absence of a unique identifier. Bayesian solutions to this problem provide a
powerful mechanism for propagating uncertainty about the links between
records via the posterior distribution. However, computational considerations
severely limit the practical applicability of existing Bayesian approaches. We
propose a new computational approach, providing both a fast algorithm for
deriving point estimates of the linkage structure that properly account for
one-to-one matching and a restricted MCMC algorithm that samples from an
approximate posterior distribution. Our advances make it possible to perform
Bayesian PRL for larger problems, and to assess the sensitivity of results to
varying prior specifications. We demonstrate the methods on a subset of an
OCR'd dataset, the California Great Registers, a collection of 57 million voter
registrations from 1900 to 1968 that comprise the only panel data set of party
registration collected before the advent of scientific surveys.
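
To make the idea of a point estimate that respects one-to-one matching concrete, the following is a minimal sketch and not the authors' algorithm. It assumes Fellegi-Sunter style agreement weights with made-up m- and u-probabilities, hypothetical field names ("name", "age"), and toy records, and it finds the best one-to-one linkage by brute force, which is only feasible for very small files.

```python
import math
from itertools import permutations

# Hypothetical probabilities that a field agrees given a true match (m)
# or a true non-match (u). These values are illustrative assumptions.
M_PROBS = {"name": 0.95, "age": 0.85}
U_PROBS = {"name": 0.05, "age": 0.20}


def match_weight(rec_a, rec_b):
    """Sum of per-field log-likelihood ratios (match vs. non-match)."""
    weight = 0.0
    for field in M_PROBS:
        m, u = M_PROBS[field], U_PROBS[field]
        if rec_a[field] == rec_b[field]:
            weight += math.log(m / u)
        else:
            weight += math.log((1 - m) / (1 - u))
    return weight


def best_one_to_one(file_a, file_b):
    """Exhaustively search for the one-to-one linkage with the largest total weight.

    Pairs with non-positive weight are left unlinked. Assumes
    len(file_a) <= len(file_b); brute force is for toy inputs only.
    """
    weights = [[match_weight(a, b) for b in file_b] for a in file_a]
    best_total, best_links = -math.inf, []
    for perm in permutations(range(len(file_b)), len(file_a)):
        # Each record in file_a is paired with a distinct record in file_b.
        links = [(i, j) for i, j in enumerate(perm) if weights[i][j] > 0]
        total = sum(weights[i][j] for i, j in links)
        if total > best_total:
            best_total, best_links = total, links
    return best_links


# Toy records (invented for illustration).
file_a = [{"name": "smith", "age": 34}, {"name": "jones", "age": 51}]
file_b = [
    {"name": "jones", "age": 51},
    {"name": "smith", "age": 34},
    {"name": "smith", "age": 60},
]
links = best_one_to_one(file_a, file_b)
print(links)
```

Here the first "smith" in file_a has positive weight against both "smith" records in file_b, and the one-to-one constraint together with the total weight picks the pairing that agrees on both fields. At realistic scale an exhaustive search is impossible, which is the kind of computational limit the abstract refers to.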
arXiv:1905.05337v1