Research on entity matching for data integration has shifted towards embeddings and deep neural network-based matching, mainly in the form of the Transformer architecture. While these methods have brought significant improvements over earlier approaches, some challenges remain, for example the need for sizable amounts of training data and reduced performance on entities that are not represented in the training set.
WDC Products is the first benchmark that employs non-generated real-world data to assess the performance of matching systems along three dimensions: (i) the amount of corner cases, (ii) the fraction of unseen entities in the test set, and (iii) the development set size. The multi-dimensional design of WDC Products provides 27 variants consisting of pre-defined training, validation, and test splits, which allows entity matching systems to be evaluated along these dimensions, as well as their combinations, in a controlled environment. Furthermore, the benchmark is available in both a pair-wise binary and a multi-class formulation. We performed a set of baseline experiments with recent entity matching systems to showcase the suitability of the WDC Products benchmark. The evaluation confirms the difficulty of the benchmark and shows that the multi-dimensional design is useful for identifying the strengths and weaknesses of the systems.
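To make the structure concrete, the following is a minimal sketch, not the benchmark's actual tooling. It assumes that the 27 variants arise from three levels per dimension (3 x 3 x 3), and the level names "low", "medium", and "high" are placeholders. The helper names `enumerate_variants` and `pairwise_labels` are hypothetical as well. The second helper only illustrates how the two formulations relate: a multi-class label assigns each offer to a product, while a pair-wise label states whether two offers refer to the same product. How the benchmark actually selects its pairs is not specified here.

```python
from itertools import combinations, product

# Placeholder names: the text states three dimensions and 27 variants,
# which is consistent with three levels per dimension. The real level
# definitions are not given in this section.
DIMENSIONS = ("corner_cases", "unseen_entities", "dev_set_size")
LEVELS = ("low", "medium", "high")


def enumerate_variants():
    """Return one dict per combination of levels across the three dimensions."""
    return [
        dict(zip(DIMENSIONS, combo))
        for combo in product(LEVELS, repeat=len(DIMENSIONS))
    ]


def pairwise_labels(offers):
    """Derive pair-wise binary labels from multi-class labels.

    offers: list of (offer_id, product_id) tuples, where product_id is
    the multi-class label. Returns (offer_a, offer_b, is_match) for every
    unordered pair; is_match is 1 if both offers share a product_id.
    """
    return [
        (a_id, b_id, int(a_prod == b_prod))
        for (a_id, a_prod), (b_id, b_prod) in combinations(offers, 2)
    ]


if __name__ == "__main__":
    variants = enumerate_variants()
    print(len(variants))  # 27

    # Toy data, invented for illustration only.
    toy_offers = [("o1", "p1"), ("o2", "p1"), ("o3", "p2")]
    for row in pairwise_labels(toy_offers):
        print(row)
```

A matching system would then be trained and evaluated once per variant, so that results can be compared along one dimension while the other two are held fixed.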
More information about the WDC Products benchmark can be found on the WDC Products website, which also offers the benchmark for public download as well as the accompanying paper.