An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

by Janani Balaji, Faizan Javed, Mayank Kejriwal, Chris Min, Sam Sander, Ozgur Ozturk

Released as an article.

2016  

Abstract

Entity Resolution, also called record linkage or deduplication, refers to the process of identifying and merging duplicate versions of the same entity into a unified representation. The standard practice is to use a Rule-based or Machine Learning-based model that compares entity pairs and assigns a score to represent each pair's Match/Non-Match status. However, performing an exhaustive pair-wise comparison on all pairs of records leads to quadratic matcher complexity; hence, a Blocking step is performed before the Matching step to group similar entities into smaller blocks that the matcher can then examine exhaustively. Several blocking schemes have been developed to efficiently and effectively block the input dataset into manageable groups. At CareerBuilder (CB), we perform deduplication on massive datasets of people profiles collected from disparate sources with varying informational content. We observed that employing a single blocking technique did not cover all possible scenarios due to the multi-faceted nature of our data sources. In this paper, we describe our ensemble approach to blocking that combines two different blocking techniques to leverage their respective strengths.
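To illustrate the general idea behind ensemble blocking described in the abstract, the following is a minimal Python sketch, not the paper's actual method: it assumes two hypothetical blocking keys (exact email, and first-name-plus-city), unions the candidate pairs each scheme produces, and passes only those pairs to the matcher instead of comparing all record pairs exhaustively. All field names and key functions here are illustrative assumptions.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical profile records; field names are illustrative only.
    records = [
        {"id": 1, "name": "Jane Doe", "email": "jane.doe@example.com", "city": "Atlanta"},
        {"id": 2, "name": "Jane D.",  "email": "jane.doe@example.com", "city": "Atlanta"},
        {"id": 3, "name": "John Roe", "email": "jroe@example.com",     "city": "Austin"},
    ]

    def email_key(rec):
        """Blocking key #1: exact (lower-cased) email, when present."""
        email = rec.get("email")
        return email.lower() if email else None

    def name_city_key(rec):
        """Blocking key #2: first name token plus city, tolerant of name variants."""
        name, city = rec.get("name"), rec.get("city")
        if not name or not city:
            return None
        return f"{name.split()[0].lower()}|{city.lower()}"

    def block(records, key_fn):
        """Group record ids into blocks under one blocking function."""
        blocks = defaultdict(list)
        for rec in records:
            key = key_fn(rec)
            if key is not None:  # records missing the field simply opt out of this scheme
                blocks[key].append(rec["id"])
        return blocks

    def candidate_pairs(blocks):
        """Exhaustive pairs within each block only, avoiding the quadratic all-pairs cost."""
        pairs = set()
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
        return pairs

    # Ensemble step: union the candidates from both blocking schemes, then
    # hand the (much smaller) pair set to the matcher.
    candidates = candidate_pairs(block(records, email_key)) | candidate_pairs(block(records, name_city_key))
    print(candidates)  # e.g., {(1, 2)}

The union step is what lets one scheme compensate for the other when a record is sparse in a given field, which is the motivation the abstract gives for combining two techniques.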

Archived Files and Locations

application/pdf  1.1 MB
file_qec5irr6bnfwhgvb4xwpcwvfia
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2016-09-20
Version   v1
Language   en
arXiv  1609.06265v1