Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity. Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is approached in two steps: blocking and matching. Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set of candidate record pairs, reducing the workload of the matcher. During matching, a more expensive pairwise matcher produces the final set of matching record pairs. Existing benchmark datasets for blocking and matching are rather small, both in terms of the Cartesian product A × B (the number of record pairs that would have to be compared) and in terms of vocabulary size (the number of unique tokens that need to be indexed). If blockers are evaluated only on these small datasets, effects resulting from a high number of records or from a large vocabulary may be missed.
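The two-step pipeline described above can be illustrated with a minimal sketch. It assumes token-overlap blocking through an inverted index and a Jaccard-similarity matcher; both are simple stand-ins chosen for illustration, not the blockers or matchers evaluated in the benchmark. The function names, the threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return set(text.lower().split())


def token_blocking(dataset_a, dataset_b):
    """Cheap blocking step: keep (id_a, id_b) pairs sharing at least one token."""
    # Inverted index over B: token -> ids of the records that contain it.
    # Its size grows with the vocabulary, i.e. the number of unique tokens.
    index = defaultdict(set)
    for id_b, text in dataset_b.items():
        for token in tokenize(text):
            index[token].add(id_b)

    candidates = set()
    for id_a, text in dataset_a.items():
        for token in tokenize(text):
            for id_b in index.get(token, ()):
                candidates.add((id_a, id_b))
    return candidates


def jaccard_match(dataset_a, dataset_b, candidates, threshold=0.5):
    """Stand-in for the expensive matcher: Jaccard similarity on token sets."""
    matches = set()
    for id_a, id_b in candidates:
        tokens_a = tokenize(dataset_a[id_a])
        tokens_b = tokenize(dataset_b[id_b])
        # Candidates share at least one token, so the union is never empty.
        score = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
        if score >= threshold:
            matches.add((id_a, id_b))
    return matches


# Hypothetical toy records, not taken from the benchmark.
dataset_a = {
    "a1": "Canon EOS 80D camera",
    "a2": "Logitech MX mouse",
}
dataset_b = {
    "b1": "canon eos 80d dslr camera body",
    "b2": "Dell XPS laptop",
    "b3": "Logitech keyboard",
}

candidates = token_blocking(dataset_a, dataset_b)
matches = jaccard_match(dataset_a, dataset_b, candidates)

print("Cartesian product size:", len(dataset_a) * len(dataset_b))
print("Candidate pairs after blocking:", sorted(candidates))
print("Matches:", sorted(matches))
```

On these toy records the blocker passes only two of the six possible pairs to the matcher. With far more records and a far larger vocabulary, both the inverted index and the candidate set grow considerably, which is the kind of effect that small benchmark datasets may not expose.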
WDC Block is a new blocking benchmark that provides much larger datasets with a larger vocabulary and thus requires blockers that address these scalability challenges. Additionally, we provide three development sets of different sizes (~1K, ~5K, and ~20K pairs) for experimenting with different amounts of training data for the blockers.