WDC Block: A large Blocking Benchmark released

We are happy to announce the release of Web Data Commons Block (WDC Block), a large blocking benchmark. WDC Block is based on product data that was extracted in 2020 from 3,259 e-shops that marked up product offers within their HTML pages using the schema.org vocabulary. The benchmark is available in three sizes (small, medium, large). The largest variant of WDC Block features a maximum Cartesian product of 200 billion pairs. We also provide training sets of different sizes for evaluating supervised blockers.

Motivation:

Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity. Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is usually approached in two steps: blocking and matching. Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set of candidate record pairs, which reduces the workload of the matcher. During matching, a more expensive pair-wise matcher produces the final set of matching record pairs. Existing benchmark datasets for blocking and matching are rather small, both with respect to the size of the Cartesian product A × B that results from comparing all records and with respect to the vocabulary size (the number of unique tokens that need to be indexed). If blockers are evaluated only on these small datasets, effects resulting from a high number of records or from a large vocabulary may be missed.
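
To make the blocking step concrete, the following is a minimal sketch of one simple and common blocking technique, token blocking with an inverted index. It is an illustration only, not the method evaluated in the benchmark. The function names, the `min_shared_tokens` parameter, and the toy records are assumptions made for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a record and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def token_blocking(records_a, records_b, min_shared_tokens=1):
    """Return candidate pairs (a_id, b_id) that share enough tokens.

    Hypothetical helper for illustration: records are dicts mapping
    an id to a plain-text description.
    """
    # Inverted index: token -> ids of records in B containing that token.
    index = defaultdict(set)
    for b_id, text in records_b.items():
        for token in tokenize(text):
            index[token].add(b_id)

    candidates = set()
    for a_id, text in records_a.items():
        # Count how many tokens each record in B shares with this record.
        shared = defaultdict(int)
        for token in tokenize(text):
            for b_id in index.get(token, ()):
                shared[b_id] += 1
        for b_id, count in shared.items():
            if count >= min_shared_tokens:
                candidates.add((a_id, b_id))
    return candidates


# Toy data (invented for this sketch, not taken from WDC Block).
records_a = {
    "a1": "Canon EOS 80D camera body",
    "a2": "Samsung Galaxy S10 128GB",
}
records_b = {
    "b1": "canon eos 80d dslr",
    "b2": "Galaxy S10 Samsung smartphone",
    "b3": "Logitech MX Master mouse",
}

candidates = token_blocking(records_a, records_b, min_shared_tokens=2)

# Fraction of the Cartesian product A x B that the blocker pruned away.
cartesian_size = len(records_a) * len(records_b)
reduction_ratio = 1 - len(candidates) / cartesian_size
print(sorted(candidates), reduction_ratio)
```

Only the candidate pairs returned by the blocker are passed on to the more expensive matcher. The inverted index grows with the number of unique tokens, which is why the vocabulary size matters for scalability in addition to the number of records.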

WDC Block is a new blocking benchmark that provides much larger datasets with a larger vocabulary and thus requires blockers that address these scalability challenges. Additionally, we provide three development sets of different sizes (~1K pairs, ~5K pairs, and ~20K pairs) to experiment with different amounts of training data for the blockers.

More Information:

More information about the WDC Block benchmark can be found on the WDC Block website, which also offers the benchmark and the accompanying paper for public download.
