Large-Scale Data Integration Seminar (FSS 2022)

This seminar covers topics related to integrating data from large numbers of independent data sources. The FSS 2022 edition of the seminar focuses on data lakes, which are large, uncurated collections of data with varying degrees of structure. The seminar will review the concept of data lakes and their use cases, and will cover methods for profiling data lakes, searching within data lakes, and integrating and cleansing data from data lakes.

Organization

Goals

In this seminar, you will

  • read, understand, and explore scientific literature
  • critically summarize the state-of-the-art concerning your topic
  • give a presentation about your topic (before the submission of the report)

Requirements

Schedule

  1. Please register for the seminar via the centrally coordinated seminar registration in Portal2.
  2. After you have been accepted into the seminar, please email us your three preferred topics from the list below.
    We will assign topics to students according to their preferences.
  3. Attend the kickoff meeting, in which we will discuss the general requirements for the reports and presentations and answer initial questions about the topics.
  4. You will be assigned a mentor, who provides guidance in one-to-one meetings.
  5. Work individually throughout the semester: explore literature, create a presentation, and write a report.
  6. Give your presentation in a block seminar towards the end of the semester.
  7. Write and submit your seminar thesis by July 2022.

Topics

1. Data Lakes: Concepts, Functionalities, Examples

  • Hai, Rihan, Christoph Quix, and Matthias Jarke. “Data lake concept and systems: a survey.” arXiv preprint arXiv:2106.09592 (2021).
  • Sawadogo, Pegdwendé, and Jérôme Darmont. “On data lake architectures and metadata management.” Journal of Intelligent Information Systems 56.1 (2021): 97–120.
  • Nargesian, Fatemeh, et al. “Data lake management: challenges and opportunities.” Proceedings of the VLDB Endowment 12.12 (2019): 1986–1989.

2. Comparison of Data Lake Management Platforms

  • Hai, Rihan, Christoph Quix, and Matthias Jarke. “Data lake concept and systems: a survey.” arXiv preprint arXiv:2106.09592 (2021).
  • Sawadogo, Pegdwendé, and Jérôme Darmont. “On data lake architectures and metadata management.” Journal of Intelligent Information Systems 56.1 (2021): 97–120.

3. Data Lake Profiling

  • Hai, Rihan, Christoph Quix, and Matthias Jarke. “Data lake concept and systems: a survey.” arXiv preprint arXiv:2106.09592 (2021).
  • Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
  • Mohamed Ellefi, et al.: RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.

4. Dataset Search within Data Lakes

  • Trabelsi, et al.: Improved Table Retrieval Using Multiple Context Embeddings for Attributes. Big Data 2019.
  • Chapman, Adriane, et al. “Dataset search: a survey.” The VLDB Journal 29.1 (2020): 251–272.
  • Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19), Association for Computing Machinery, New York, NY, USA, 847–864.

5. Entity Search within Data Lakes

  • Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, and David Page. 2020. AutoBlock: A Hands-off Blocking Framework for Entity Matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM ’20), Association for Computing Machinery, New York, NY, USA, 744–752.
  • Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, and AnHai Doan. Deep Learning for Blocking in Entity Matching: A Design Space Exploration. 14.
  • N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Trans. Knowl. Discov. Data, vol. 15, no. 3, p. 52:1–52:37, Apr. 2021

6. Metadata for Dataset Search

  • Chapman, Adriane, et al. “Dataset search: a survey.” The VLDB Journal 29.1 (2020): 251–272.
  • Alon Halevy, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Goods: Organizing Google’s Datasets. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16), Association for Computing Machinery, New York, NY, USA, 795–806
  • Omar Benjelloun, Shiyu Chen, and Natasha Noy. 2020. Google Dataset Search by the Numbers. In The Semantic Web – ISWC 2020 (Lecture Notes in Computer Science), Springer International Publishing, Cham, 667–682.

7. Entity Matching using Deep Learning

  • N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Trans. Knowl. Discov. Data, vol. 15, no. 3, p. 52:1–52:37, Apr. 2021
  • C. Ge, P. Wang, L. Chen, X. Liu, B. Zheng, and Y. Gao, “CollaborER: A Self-supervised Entity Resolution Framework Using Multi-features Collaboration,” arXiv:2108.08090 [cs], Sep. 2021
  • M. Loster, I. Koumarelas, and F. Naumann, “Knowledge Transfer for Entity Resolution with Siamese Neural Networks,” J. Data and Information Quality, vol. 13, no. 1, p. 2:1–2:25, Jan. 2021

8. Schema Matching using Deep Learning (Annotating table columns using a Knowledge Base)

  • Y. Suhara et al., “Annotating Columns with Pre-trained Language Models,” arXiv:2104.01785 [cs], Apr. 2021
  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, pp. 307–319, Nov. 2020.
  • E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, and K. Srinivas, “SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems,” in The Semantic Web, Cham, 2020, pp. 514–530

9. Schema Matching using Deep Learning (Matching columns across multiple tables)

  • R. Shraga, A. Gal, and H. Roitman, “ADnEV: cross-domain schema matching using deep similarity matrix adjustment and evaluation,” Proc. VLDB Endow., vol. 13, no. 9, pp. 1401–1415, May 2020
  • J. Zhang, B. Shin, J. D. Choi, and J. C. Ho, “SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching,” in Advances in Databases and Information Systems, Cham, 2021, pp. 260–274
  • E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” The VLDB Journal, vol. 10, no. 4, pp. 334–350, 2001

10. Embedding Methods for Tabular Data

  • H. Iida, D. Thai, V. Manjunatha, and M. Iyyer, “TABBIE: Pretrained Representations of Tabular Data,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, Jun. 2021, pp. 3446–3456
  • D. Wang, P. Shiralkar, C. Lockard, B. Huang, X. L. Dong, and M. Jiang, “TCN: Table Convolutional Network for Web Table Interpretation,” in Proceedings of the Web Conference 2021, New York, NY, USA, Apr. 2021, pp. 4020–4032
  • Z. Wang et al., “TUTA: Tree-based Transformers for Generally Structured Table Pre-training,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, NY, USA, Aug. 2021, pp. 1780–1790
  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, pp. 307–319, Nov. 2020.

 


Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy/Ives: Principles of Data Integration. Morgan Kaufmann, 2012.