Alexander Brinkmann has successfully defended his PhD Thesis

The key contributions of the thesis are threefold: First, Dr. Alexander Brinkmann proposes a novel blocking method that applies supervised contrastive learning to improve the blocking step within entity resolution pipelines. Second, he analyzes the potential of large language models for product attribute value extraction and attribute value normalization. Third, by extracting the WDC Schema.org Dataset Series from the Common Crawl, Dr. Brinkmann shows that schema.org data can be employed as distant supervision for machine learning tasks such as product classification and entity resolution.
The examination committee consisted of Prof. Roee Shraga (Worcester Polytechnic Institute, USA), Prof. Heiko Paulheim, Prof. Simone Ponzetto, and Prof. Christian Bizer.
Abstract:
The increasing prevalence of e-commerce platforms has transformed online shopping experiences by offering personalized product recommendations, dynamic pricing strategies, and seamless product discovery. However, effective product data integration across diverse web sources remains a challenge due to semantic inconsistencies and variations in product data quality. To address these challenges, this thesis explores advanced deep-learning techniques for integrating and normalizing product data found on the web. The thesis makes several contributions to the field of product data integration. First, it introduces the WDC Schema.org Dataset Series, a publicly available dataset derived from the Common Crawl, facilitating the analysis of schema.org adoption on the Web and providing distant supervision for machine learning tasks such as product classification and entity resolution. Second, it introduces WDC-Block, a benchmark dataset for evaluating blocking techniques for entity resolution. WDC-Block has a Cartesian product of 200 billion record pairs, making it 166 thousand times larger than existing benchmarks. Third, the thesis develops SC-Block, a supervised contrastive learning-based blocking method that exploits existing training data. Entity resolution pipelines with SC-Block run up to 4 times faster than pipelines with other state-of-the-art blocking methods. Fourth, the thesis advances hierarchical product classification with pre-trained language models by leveraging domain-specific self-supervised pre-training on schema.org product annotations. Fifth, the thesis investigates the potential of large language models (LLMs) for product attribute value extraction, demonstrating that few-shot learning techniques with GPT-4 outperform existing pre-trained language model baselines.
Sixth, to further support research in this area, the thesis introduces WDC-PAVE, a benchmark dataset designed to evaluate attribute value extraction and normalization tasks, addressing the limitation of existing benchmark datasets that evaluate extraction and normalization in isolation. Finally, the thesis examines automated self-refinement techniques for LLM-based attribute value extraction, finding that error-based prompt rewriting and self-correction increase computational costs but do not significantly improve extraction performance. The findings of this thesis contribute to the advancement of data-driven e-commerce applications by improving the integration, classification, and normalization of product data from diverse web sources. The introduced datasets and methods provide a foundation for future research in product data integration to enhance the efficiency of e-commerce platforms.