Title of the Talk:
GPT-4 versus BERT: Which Foundation Model is More Suitable for Web Data Integration?
The Web contains vast amounts of structured data in the form of HTML tables, schema.org annotations, and datasets accessible via data repositories. The automated integration of data from large numbers of Web data sources is a long-standing research challenge, as integration requires dealing with several difficult tasks such as schema matching, entity matching, and data indexing for retrieval. Most state-of-the-art methods for these tasks rely on variants of the BERT transformer model fine-tuned on significant amounts of task-specific training data. In this talk, Christian Bizer will critically review BERT-based data integration methods and question their robustness to out-of-distribution entities. He will compare the performance of BERT-based methods with the results of GPT-4-based data integration methods and will argue that GPT-4-based methods are more training-data efficient and more robust to unseen entities.