Large-Scale Data Integration Seminar (FSS 2023)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale entity matching, schema matching, table annotation, data cleansing, data fusion, data search, and data exploration. The specific focus of the FSS 2023 edition of the seminar is Data Integration using Large Language Models.

Organization

This seminar is organized by Prof. Dr. Christian Bizer, Keti Korini, Ralph Peeters, and Alexander Brinkmann.
The seminar is open to master's students of the Data Science and Business Informatics programs.
Slides of the kickoff session, including the organizational information about the seminar

Goals

In this seminar, you will

read, understand, and explore scientific literature
critically summarize the state-of-the-art concerning your topic
give a presentation about your topic (before the submission of the report)

Requirements

Attending Web Data Integration and Data Mining I before the seminar is strongly recommended
Report and presentation language: English

Schedule

Please register for the seminar via the centrally-coordinated seminar registration in Portal2
After you have been accepted into the seminar, please email us your three preferred topics from the list below.
We will assign topics to students according to their preferences.
Attend the kickoff meeting in which we will discuss general requirements for the reports and presentations as well as answer initial questions about the topics
You will be assigned a mentor, who provides guidance and one-to-one meetings
Work individually throughout the semester: explore literature, create a presentation, and write a report
Give your presentation in a block seminar towards the end of the semester
Write and submit your seminar thesis by July 2023.

Topics

1. Entity Matching using Domain Adaptation

N. Kirielle, P. Christen, and T. Ranbaduge, “TransER: Homogeneous Transfer Learning for Entity Resolution,” in Proceedings of the 25th International Conference on Extending Database Technology, 2022, pp. 118–130
M. Trabelsi, J. Heflin, and J. Cao, “DAME: Domain Adaptation for Matching Entities,” in Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, New York, NY, USA, Feb. 2022, pp. 1016–1024
J. Tu et al., “Domain Adaptation for Deep Entity Resolution,” in Proceedings of the 2022 International Conference on Management of Data, New York, NY, USA, Jun. 2022, pp. 443–457
More references and benchmarks: Papers with Code: Entity Resolution

2. Experimental Topic: Evaluating ChatGPT on the Task of Entity Matching

P. Wang et al.: PromptEM: Prompt-tuning for low-resource generalized entity matching. Proceedings of the VLDB Endowment. Volume 16, Issue 2, pp 369–378. November 2022.
Avanika Narayan et al.: Can Foundation Models Wrangle Your Data? arXiv:2205.09911 [cs.LG] (2022)
A. Venkatesh et al., “On Evaluating and Comparing Open Domain Dialog Systems.” arXiv:1801.03625 [cs], Dec. 2018.
A. Srivastava et al., “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models”. arXiv:2206.04615 [cs], June 2022.
Q. Dong et al., “A Survey for In-context Learning”. arXiv:2301.00234 [cs], Dec. 2022.
N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Transactions on Knowledge Discovery from Data, vol. 15, no. 3, p. 52:1–52:37, Apr. 2021.
A. Primpeli and C. Bizer, “Profiling Entity Matching Benchmark Tasks,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, New York, NY, USA, Oct. 2020, pp. 3101–3108.

3. Deep Learning for Blocking

S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan. 2021. Deep learning for blocking in entity matching: a design space exploration. Proceedings of the VLDB Endowment 14, 11 (July 2021), 2459–2472.
W. Zhang, H. Wei, B. Sisman, L. Dong, C. Faloutsos, and D. Page. 2020. AutoBlock: A Hands-off Blocking Framework for Entity Matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM ’20), Association for Computing Machinery, New York, NY, USA, 744–752.
R. Wang, Y. Li, and J. Wang, “Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation.” arXiv:2207.04122 [cs], Jul. 08, 2022.

4. Deep Learning for Table Search

G. Fan, J. Wang, Y. Li, D. Zhang, and R. Miller. 2023. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. arXiv.
A. Bogatu, A. A. Fernandes, N. W. Paton, and A. Konstantinou. 2020. Dataset Discovery in Data Lakes. In IEEE 36th International Conference on Data Engineering (ICDE), 709–720.
A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu. 2012. Finding Related Tables. In SIGMOD.

5. Representation Learning for Missing Value Imputation

Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems 2, (March 2020), 307–325.
Avanika Narayan et al.: Can Foundation Models Wrangle Your Data? arXiv:2205.09911 [cs.LG] (2022)

Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2022. TURL: Table Understanding through Representation Learning. SIGMOD Rec. 51, 1 (June 2022), 33–40.
J. Yoon, J. Jordon, and M. van der Schaar. 2018. GAIN: Missing Data Imputation using Generative Adversarial Nets. In Proceedings of the 35th International Conference on Machine Learning, PMLR, 5689–5698.
Ihab F. Ilyas and Theodoros Rekatsinas. 2022. Machine Learning and Data Cleaning: Which Serves the Other? J. Data and Information Quality 14, 3 (September 2022), 1–11.

6. Experimental Topic: Evaluating ChatGPT on the Task of Missing Value Imputation for Knowledge Graph Completion

A. Venkatesh et al., “On Evaluating and Comparing Open Domain Dialog Systems.” arXiv:1801.03625 [cs], Dec. 2018.
A. Srivastava et al., “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models”. arXiv:2206.04615 [cs], June 2022.
Q. Dong et al., “A Survey for In-context Learning”. arXiv:2301.00234 [cs], Dec. 2022.
Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems 2, (March 2020), 307–325.
Avanika Narayan et al.: Can Foundation Models Wrangle Your Data? arXiv:2205.09911 [cs.LG] (2022)
https://paperswithcode.com/task/knowledge-graph-completion

7. Schema Matching using Deep Learning

Zhang, Jing, et al. “SMAT: An attention-based deep learning solution to the automation of schema matching.” European Conference on Advances in Databases and Information Systems. Springer, Cham, 2021.
Shraga, Roee, Avigdor Gal, and Haggai Roitman. “ADnEV: Cross-domain schema matching using deep similarity matrix adjustment and evaluation.” Proceedings of the VLDB Endowment 13.9 (2020): 1401–1415.
Koutras, Christos, et al. “REMA: Graph Embeddings-based Relational Schema Matching.” EDBT/ICDT Workshops. 2020.
Rahm, E., Bernstein, P. A survey of approaches to automatic schema matching. The VLDB Journal 10 (2001), 334–350.

2. Experimental Topic: Evaluating ChatGPT on the Task of Schema Matching/Table Annotation

Avanika Narayan et al.: Can Foundation Models Wrangle Your Data? arXiv:2205.09911 [cs.LG] (2022)
A. Srivastava et al., “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models”. arXiv:2206.04615 [cs], June 2022.
Q. Dong et al., “A Survey for In-context Learning”. arXiv:2301.00234 [cs], Dec. 2022.
Korini K, Peeters R, Bizer C., “SOTAB: The WDC Schema.org Table Annotation Benchmark”. Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), CEUR-WS.org. 2022.
https://paperswithcode.com/task/table-annotation

8. Cell Entity Annotation in Tabular Data

X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, Nov. 2020, pp. 307–319
Huynh, V.P., Liu, J., Chabot, Y., Labbé, T., Monnin, P. and Troncy, R., DAGOBAH: Enhanced Scoring Algorithms for Scalable Annotations of Tabular Data. In SemTab@ISWC, Nov. 2020, pp. 27–39.
Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Jiang, F., Gordon, A. and Lin, C.Y. LinkingPark: An automatic semantic table interpretation system. Journal of Web Semantics, 74, 2022, p.100733.
More references and benchmarks: Papers with Code: Cell Entity Annotation

9. Deep Tabular Learning for Domain-Specific Prediction Tasks

Yoon, Jinsung, et al. “VIME: Extending the success of self- and semi-supervised learning to tabular domain.” Advances in Neural Information Processing Systems 33 (2020).
Somepalli, Gowthami, et al. “SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training.” arXiv preprint arXiv:2106.01342 (2021).
Gharibshah, Zhabiz, and Xingquan Zhu. “Local Contrastive Feature Learning for Tabular Data.” Proceedings of the 31st ACM International Conference on Information & Knowledge Management (2022).
Stefan Hegselmann, et al. “TabLLM: Few-shot Classification of Tabular Data with Large Language Models” arXiv:2210.10723 [cs.CL] (2022).
Borisov, Vadim, Tobias Leemann, et al. “Deep neural networks and tabular data: A survey.” IEEE Transactions on Neural Networks and Learning Systems (2022).

10. Information Extraction for E-Commerce Product Data

Xinyang Zhang, Chenwei Zhang, Xian Li, Xin Luna Dong, Jingbo Shang, Christos Faloutsos, and Jiawei Han. 2022. OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision. In Proceedings of the ACM Web Conference 2022, ACM, Virtual Event, Lyon France, 3153–3161.
Huimin Xu, Wenting Wang, Xin Mao, Xinyu Jiang, and Man Lan. 2019. Scaling up Open Tagging from Tens to Thousands: Comprehension Empowered Attribute Value Extraction from Product Title. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 5214–5223.
Qifan Wang, et al. 2020: Learning to Extract Attribute Value from Product via Question Answering: A Multi-task Approach. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Gilad Fuchs and Yoni Acriche. 2022. Product Titles-to-Attributes As a Text-to-Text Task. In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics, Dublin, Ireland, 91–98.

11. Experimental Topic: Evaluating GPT3 on the Task of Product Information Extraction

A. Venkatesh et al., “On Evaluating and Comparing Open Domain Dialog Systems.” arXiv:1801.03625 [cs], Dec. 2018.
A. Srivastava et al., “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models”. arXiv:2206.04615 [cs], June 2022.
Q. Dong et al., “A Survey for In-context Learning”. arXiv:2301.00234 [cs], Dec. 2022.
Li Yang et al.: MAVE: A Product Dataset for Multi-source Attribute Value Extraction. WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022.
P. Petrovski, et al.: The WDC Gold Standards for Product Feature Extraction and Product Matching. In E-Commerce and Web Technologies: 17th International Conference, EC-Web 2016.
OpenAI Playground Example: https://beta.openai.com/playground/p/default-parse-data
Aleph Alpha Playground Example: https://app.aleph-alpha.com/jumpstart/text-to-table

12. Experimental Topic: Combining WebAPIs and Large Language Models for Question Answering via In-Context Learning

Omar Khattab, et al.: Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv:2212.14024 [cs.CL], Dec. 2022.
Q. Dong et al., “A Survey for In-context Learning”. arXiv:2301.00234 [cs], Dec. 2022.
Christopher Potts: Stanford online seminar – GPT-3 & Beyond. Starting from minute 28:13, Jan 2023.
Example Task: Ask ChatGPT or GPT3 questions about restaurants or hotels in Mannheim using TripAdvisor data and in-context learning.

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.