WDC Products: Multi-Dimensional Entity Matching Benchmark released

We are happy to announce the release of the multi-dimensional WDC Products benchmark for entity matching. WDC Products is based on product data that was extracted in 2020 from 3259 e-shops that mark up product offers within their HTML pages using the schema.org vocabulary. It contains 11715 product offers describing a total of 2162 product entities from various product categories. The benchmark provides 27 variants, each consisting of pre-defined training, validation and test splits, which allow for an evaluation of entity matching systems along three dimensions: (i) amount of corner cases, (ii) fraction of unseen entities in the test set, and (iii) development set size.

Motivation:

The research focus in the field of entity matching for data integration has moved to matching based on embeddings and deep neural networks, mainly in the form of the Transformer architecture. While these methods have brought significant improvements over earlier methods, some challenges remain, e.g. the need for sizable amounts of training data and reduced performance on entities that are not represented in the training set.

WDC Products is the first benchmark that employs non-generated real-world data to assess the performance of matching systems along three dimensions: (i) amount of corner cases, (ii) fraction of unseen entities in the test set, and (iii) development set size. The multi-dimensional design of WDC Products provides 27 variants consisting of pre-defined training, validation and test splits, which allow for the evaluation of entity matching systems along each of these dimensions as well as their combinations in a controlled environment. Furthermore, the benchmark is available in a pair-wise binary and a multi-class formulation. We have performed a set of baseline experiments using recent entity matching systems in order to showcase the suitability of the WDC Products benchmark. The evaluation confirms the difficulty of the benchmark and shows that the multi-dimensional design is useful for identifying the strengths and weaknesses of the systems.
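
As a rough illustration of how three dimensions can yield 27 variants, the sketch below enumerates every combination under the assumption that each dimension has three levels (3 x 3 x 3 = 27). The level labels and dimension keys are hypothetical placeholders chosen for this example; this announcement does not state the actual level values or file naming used by the benchmark.

```python
from itertools import product

# Hypothetical placeholders: the announcement names the three dimensions
# but not their levels, so "low/medium/high" is an assumption made here.
LEVELS = ("low", "medium", "high")
DIMENSIONS = ("corner_cases", "unseen_entities", "development_set_size")


def enumerate_variants(dimensions=DIMENSIONS, levels=LEVELS):
    """Return one dict per combination of levels across all dimensions."""
    return [
        dict(zip(dimensions, combo))
        for combo in product(levels, repeat=len(dimensions))
    ]


variants = enumerate_variants()
print(len(variants))
```

Evaluating a matching system on each of these combinations separately is what would make it possible to attribute a change in performance to a single dimension while the other two are held fixed.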

More Information:

More information about the WDC Products benchmark can be found on the WDC Products website, which also offers the benchmark for public download as well as the accompanying paper.
