Ralph Peeters has successfully defended his PhD Thesis

The key contributions of the thesis are threefold: First, Dr. Ralph Peeters has developed novel benchmarks for measuring the performance of entity matching methods. Second, he has proposed two novel entity matching methods which exploit entity group information and are the first to apply multi-task learning as well as supervised contrastive learning to entity matching. Third, he has investigated the application of generative large language models to entity matching, as well as the capability of large language models to explain matching decisions and to analyze matching errors.
The examination committee consisted of Prof. Heiko Paulheim, Prof. Felix Naumann (Universität Potsdam), Prof. Simone Ponzetto, and Prof. Christian Bizer.
The full thesis can be found here.
Abstract:
Entity matching is the task of identifying records that refer to the same entity across different datasets. It is a critical step in the data integration process. Supervised entity matching methods typically frame the problem as a binary classification task over record pairs. These methods require labeled record pairs, consisting of matches and non-matches, for training. Key challenges in entity matching include high heterogeneity among records referring to the same entity, scarcity of training data, and the continuous emergence of unseen entities in real-world applications. This thesis introduces two novel benchmarks for product matching, created using semantically annotated product identifiers on the Web as distant supervision. These benchmarks, sourced from thousands of e-shops, are among the largest and most diverse publicly available product matching datasets. They enable a fine-grained evaluation of entity matching methods across different entity matching challenges.

The thesis presents two new neural approaches for entity matching based on pre-trained language models, which achieve state-of-the-art results on multiple benchmarks. Unlike existing methods, both approaches exploit entity group information alongside binary matching labels during training. The first method, JointBERT, employs a dual-objective fine-tuning strategy. The second method, RSupCon, uses supervised contrastive learning and establishes new state-of-the-art results on multiple benchmarks, proving particularly effective on smaller training sets. In addition, the thesis explores the usefulness of multilingual Transformers for improving product matching performance in low-resource languages.

The thesis further investigates generative large language models for entity matching, comparing them with pre-trained language models. The investigations include an analysis of prompting techniques, such as zero-shot inference, in-context learning, and rule-based prompting, as well as fine-tuning for entity matching. The results highlight the potential of large language models to match or exceed the performance of fine-tuned pre-trained language models while requiring little or no training data. The experiments also demonstrate that large language models generalize better to unseen entities than pre-trained language models.

The thesis also examines the explainability of matching decisions, introducing two methods for aggregating local explanations into global insights. The first method, based on LIME explanations, is broadly applicable to matching classifiers. The second method uses large language models to produce structured explanations that can be automatically parsed and aggregated. Finally, the thesis introduces a method for automating error analysis using large language models. This approach allows for the automatic generation of error classes, which can help data engineers improve their entity matching pipelines.
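To illustrate the pair-classification framing described in the abstract, the following is a minimal sketch of scoring a candidate record pair with a pre-trained Transformer cross-encoder. The model name, the attribute serialization, and the example records are illustrative assumptions, not the exact setup used in the thesis; in practice the classification head would first be fine-tuned on labeled match/non-match pairs.

```python
# Minimal sketch: entity matching as binary classification over record pairs
# with a pre-trained Transformer cross-encoder. Model choice and serialization
# format are illustrative assumptions, not the thesis's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def serialize(record: dict) -> str:
    # Flatten a record into "[COL] attribute [VAL] value" segments before encoding.
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

left = {"title": "Apple iPhone 12 64GB black", "brand": "Apple"}
right = {"title": "iPhone 12 (64 GB) - Black", "brand": "Apple"}

# Encode both serialized records as one sequence pair and classify it.
inputs = tokenizer(serialize(left), serialize(right), return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
match_probability = torch.softmax(logits, dim=-1)[0, 1].item()  # untrained head: for illustration only
print(f"P(match) = {match_probability:.3f}")
```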
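The supervised contrastive learning used by RSupCon builds on the supervised contrastive loss of Khosla et al. (2020), with entity group ids playing the role of class labels. The sketch below is a generic PyTorch version of that loss under this assumption; it is not the RSupCon implementation from the thesis, and the temperature value is just a common default.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Generic supervised contrastive loss (Khosla et al., 2020).

    embeddings: (N, D) batch of record embeddings; labels: (N,) entity group ids.
    Records sharing an entity group id are pulled together, all others pushed apart.
    """
    z = F.normalize(embeddings, dim=1)                        # unit-length embeddings
    sim = z @ z.T / temperature                               # scaled cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other records with the same entity group label as the anchor.
    positive_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = positive_mask.sum(dim=1)
    valid = pos_counts > 0                                    # anchors with at least one positive
    mean_log_prob_pos = log_prob.masked_fill(~positive_mask, 0.0).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

# Example: a batch of 4 embeddings coming from 2 entity groups.
emb = torch.randn(4, 128)
groups = torch.tensor([0, 0, 1, 1])
print(supervised_contrastive_loss(emb, groups))
```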
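For the prompting techniques mentioned in the abstract, a zero-shot prompt simply asks the model for a match decision, while in-context learning prepends a handful of labeled demonstration pairs. The sketch below only assembles the prompt strings; `call_llm` is a hypothetical placeholder for whichever chat/completion API is used, and the prompt wording is an illustrative assumption rather than the exact prompts evaluated in the thesis.

```python
# Sketch of zero-shot and in-context prompts for entity matching with a generative LLM.
# The wording is illustrative; `call_llm` is a hypothetical placeholder, not a real API.

def zero_shot_prompt(left: str, right: str) -> str:
    return (
        "Do the following two product descriptions refer to the same real-world product? "
        "Answer only with 'Yes' or 'No'.\n"
        f"Product 1: {left}\n"
        f"Product 2: {right}\n"
        "Answer:"
    )

def in_context_prompt(left: str, right: str,
                      demonstrations: list[tuple[str, str, str]]) -> str:
    # demonstrations: (left, right, "Yes"/"No") pairs shown to the model before the query.
    shots = "\n\n".join(
        f"Product 1: {l}\nProduct 2: {r}\nAnswer: {label}" for l, r, label in demonstrations
    )
    return shots + "\n\n" + zero_shot_prompt(left, right)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in the chat/completion client of your choice.
    raise NotImplementedError

if __name__ == "__main__":
    demos = [("Apple iPhone 12 64GB black", "iPhone 12 (64 GB) - Black", "Yes"),
             ("Apple iPhone 12 64GB black", "Samsung Galaxy S21 128GB", "No")]
    print(in_context_prompt("Logitech MX Master 3 mouse", "MX Master 3 Advanced Wireless Mouse", demos))
```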