Large-Scale Data Integration Seminar (FSS 2021)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search and data exploration, and data profiling.

The specific focus of the 2021 edition of the seminar is Data Integration using Deep Learning. The seminar topics are organized along the following survey/vision paper: Thirumuruganathan, Tang, Ouzzani, Doan: Data Curation with Deep Learning (PDF). EDBT 2020.

Organization

This seminar is organized by Prof. Dr. Christian Bizer, Anna Primpeli, Alexander Brinkmann, and Ralph Peeters.
The seminar is open to master's students of the Data Science and Business Informatics programs.
Slides of the kick-off meeting (PDF, 618 kB)

Goals

In this seminar, you will

Read, understand, and explore scientific literature
Summarize a current research topic in a concise report (10–12 pages)
Give a presentation about your topic (before the submission of the report)

Kick-Off Meeting

The kick-off meeting will take place online on 17 March 2021 at 11:00.

Requirements

Attending Web Data Integration and Data Mining I before the seminar is strongly recommended
Report and presentation language: English

Schedule

Please register for the seminar via the centrally-coordinated seminar registration
After you have been accepted into the seminar, please email us your three preferred topics from the list below.
We will assign topics to students according to their preferences
Attend the kick-off meeting, in which we will discuss general requirements for the reports and presentations and answer initial questions about the topics
You will be assigned a mentor, who provides guidance and one-to-one meetings
Work individually throughout the semester: explore literature, create a presentation, and write a report
Give your presentation in a block seminar towards the end of the semester

Topics

1. Entity Matching using Deep Learning

Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan, “Deep entity matching with pre-trained language models,” in Proceedings of the VLDB Endowment, vol. 14, Sep. 2020.
J. Shao, Q. Wang, A. Wijesinghe, and E. Rahm, “ErGAN: Generative Adversarial Networks for Entity Resolution,” arXiv:2012.10004 [cs], Dec. 2020.
Z. Wang, B. Sisman, H. Wei, X. L. Dong, and S. Ji, “CorDEL: A Contrastive Deep Learning Approach for Entity Linkage,” arXiv:2009.07203 [cs], Sep. 2020.

2. Entity Matching using Transformers

Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan, “Deep entity matching with pre-trained language models,” in Proceedings of the VLDB Endowment, vol. 14, 2020.
K.-S. Teong, L.-K. Soon, and T. T. Su, “Schema-Agnostic Entity Matching using Pre-trained Language Models,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020.
R. Peeters, C. Bizer, and G. Glavas, “Intermediate Training of BERT for Product Matching,” in DI2KG Workshop @ VLDB, 2020.

3. Schema Matching using Deep Learning

X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: Table Understanding through Representation Learning,” arXiv:2006.14806 [cs], Jun. 2020.
J. Chen, E. Jimenez-Ruiz, I. Horrocks, and C. Sutton, “ColNet: Embedding the Semantics of Web Tables for Column Type Prediction,” arXiv:1811.01304 [cs], Nov. 2018.
M. Hulsebos et al., “Sherlock: A Deep Learning Approach to Semantic Data Type Detection,” arXiv:1905.10688 [cs, stat], May 2019.

4. Data Imputation using Deep Learning

X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: Table Understanding through Representation Learning,” arXiv:2006.14806 [cs], Jun. 2020.
R. Wu, A. Zhang, I. F. Ilyas, and T. Rekatsinas, “Attention-based Learning for Missing Data Imputation in HoloClean,” in Proceedings of the 3rd MLSys Conference, 2020.
J. Yoon, J. Jordon, and M. van der Schaar, “GAIN: Missing Data Imputation using Generative Adversarial Nets,” arXiv:1806.02920 [cs, stat], Jun. 2018.

5. Representation Learning for Table Understanding

X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: Table Understanding through Representation Learning,” arXiv:2006.14806 [cs], Jun. 2020.
R. Cappuzzo, P. Papotti, and S. Thirumuruganathan, “Local Embeddings for Relational Data Integration,” arXiv:1909.01120 [cs], Sep. 2019.
A. Y. L. Sim and A. Borthwick, “Record2Vec: Unsupervised Representation Learning for Structured Records,” in 2018 IEEE International Conference on Data Mining, Nov. 2018.

6. Hierarchical Classification using Deep Learning

C. Silla and A. Freitas: “A survey of hierarchical classification across different application domains” in Data Mining and Knowledge Discovery, 2011.
K. Kowsari, D. E. Brown, M. Heidarysafa, K. J. Meimandi, M. S. Gerber and L. E. Barnes: “HDLTex: Hierarchical Deep Learning for Text Classification” in Proceedings of 16th IEEE International Conference on Machine Learning and Applications, 2017.
D. Gao, W. Yang, H. Zhou, Y. Wei, Y. Hu and H. Wang: “Deep Hierarchical Classification for Category Prediction in E-commerce System” in Proceedings of the 3rd Workshop on e-Commerce and NLP (ECNLP 3), 2020.

7. Table Search using Deep Learning

X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: Table Understanding through Representation Learning,” arXiv:2006.14806 [cs], Jun. 2020.
S. Zhang and K. Balog, “Web Table Extraction, Retrieval and Augmentation: A Survey,” ACM Transactions on Intelligent Systems and Technology, vol. 11, pages 1–35. ACM, 2020.
M. Trabelsi, Z. Chen, B. D. Davison and J. Heflin, “A Hybrid Deep Model for Learning to Rank Data Tables,” in Proceedings of the IEEE International Conference on Big Data, 2020.

8. Weakly-Supervised and Transfer Learning

S. Thirumuruganathan, S. A. P. Parambath, M. Ouzzani, N. Tang, and S. Joty, “Reuse and adaptation for entity resolution through transfer learning,” arXiv:1809.11084, 2018.
C. Bizer, A. Primpeli, and R. Peeters, “Using the semantic web as a source of training data,” Datenbank-Spektrum, vol. 19, no. 2, 2019.
S. N. Negahban, B. I. Rubinstein, and J. G. Gemmell, “Scaling multiple-source entity resolution using statistically efficient transfer learning,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2224–2228. ACM, 2012.

9. Deep Active Learning

J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa, “Low-resource Deep Entity Resolution with Transfer and Active Learning,” arXiv:1906.08042, 2019.
Y. Nafa et al., “Active Deep Learning on Entity Resolution by Risk Sampling,” arXiv:2012.12960, 2020.
O. Siméoni et al., “Rethinking deep active learning: Using unlabeled data at model training,” arXiv:1911.08177, 2019.

10. Active Transfer Learning

J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa, “Low-resource Deep Entity Resolution with Transfer and Active Learning,” arXiv:1906.08042, 2019.
E. Gavves, T. Mensink, T. Tommasi, C. G. Snoek, and T. Tuytelaars, “Active transfer learning with zero-shot priors: Reusing past datasets for future tasks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
X. Wang, T.-K. Huang, and J. Schneider, “Active transfer learning under model shift,” in International Conference on Machine Learning, PMLR, 2014.

Students are free to suggest additional topics of their choice that are related to large-scale data integration.

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.