Large-Scale Data Integration Seminar (FSS 2020)
This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search and data exploration, and data profiling.
Organization
- This seminar is organized by Prof. Dr. Christian Bizer, Anna Primpeli, and Ralph Peeters.
- The seminar is open to master's students of the Data Science and Business Informatics programs.
- Slides from the Kick Off Meeting: tbd
Goals
In this seminar, you will
- Read, understand, and explore scientific literature
- Summarize a current research topic in a concise report (10–12 pages)
- Give a presentation about your topic (before the submission of the report)
Kick-Off Meeting
The kick-off meeting will take place on Monday, 2 March, in room C1.01 (building B6, entrance C).
You can download the slides of the kick-off meeting here.
Requirements
- Attending Web Data Integration and Data Mining I before the seminar is strongly recommended
- Report and presentation language: English
Schedule
- Please register for the seminar via the centrally-coordinated seminar registration.
- After you have been accepted into the seminar, please email us your three preferred topics from the list below. We will assign topics to students according to their preferences.
- Attend the kick-off meeting on Monday, 2 March at 10:00 in room B6 C1.01 (library room).
- You will be assigned a mentor, who provides guidance and one-to-one meetings
- Work individually throughout the semester: explore literature, create a presentation, and write a report
- Give your presentation in a block seminar towards the end of the semester
Topics
- Embedding Methods for Tabular Data
- R. Cappuzzo, P. Papotti, and S. Thirumuruganathan, “Local Embeddings for Relational Data Integration,” arXiv:1909.01120 [cs], Sep. 2019.
- A. Y. L. Sim and A. Borthwick, “Record2Vec: Unsupervised Representation Learning for Structured Records,” in 2018 IEEE International Conference on Data Mining (ICDM), 2018
- Transfer Learning for Entity Resolution
- S. N. Negahban, B. I. Rubinstein, and J. G. Gemmell: Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2224–2228. ACM, 2012.
- Thirumuruganathan, Saravanan, Shameem A. Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq Joty: Reuse and Adaptation for Entity Resolution through Transfer Learning. arXiv preprint arXiv:1809.11084, 2018.
- Matching Numeric Attributes using Symbolic and Subsymbolic Features
- Phuc Nguyen, et al: EmbNum+: Effective, Efficient, and Robust Semantic Labeling for Numerical Values. New Generation Computing Journal, 2019.
- Sebastian Neumaier, et al.: Multi-level Semantic Labelling of Numerical Values. International Semantic Web Conference, 2016.
- Profiling of Relational Data for Data Integration
- Abedjan, Ziawasch, Lukasz Golab, and Felix Naumann. “Profiling relational data: a survey.” The VLDB Journal—The International Journal on Very Large Data Bases 24.4 (2015): 557–581.
- Naumann, Felix. “Data profiling revisited.” ACM SIGMOD Record 42.4 (2014): 40–49.
- Generating Synthetic Benchmark Datasets for Entity Resolution
- Ioannou, Ekaterini and Yannis Velegrakis. EMBench++: Data for a thorough benchmarking of matching-related methods. Semantic Web 10, 435–450, 2019.
- Saveta, Tzanina, et al.: Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data. Proceedings of the 24th International Conference on World Wide Web. 2015
- Explaining Entity Resolution Methods
- A. Ebaid, S. Thirumuruganathan, W. G. Aref, A. Elmagarmid, and M. Ouzzani, “EXPLAINER: Entity Resolution Explanations,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019.
- S. Thirumuruganathan, M. Ouzzani, and N. Tang, “Explaining Entity Resolution Predictions: Where are we and What needs to be done?” HILDA'19 (2019).
- Active Learning for Entity Resolution
- Chen, X. et al.: Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER). In: Advances in Databases and Information Systems. 2019.
- Ngomo, A.-C.N., Lyko, K.: Eagle: Efficient active learning of link specifications using genetic programming. In: Extended Semantic Web Conference. 2012.
- Dataset Search
- Chapman, A., Simperl, E., Koesten, L.: Dataset Search: A Survey. The VLDB Journal, 2019.
- Trabelsi, et al.: Improved Table Retrieval Using Multiple Context Embeddings for Attributes. In: Big Data 2019.
- Profiling Semantic Annotations for Dataset Search
- Natasha Noy: Discovering Millions of Datasets on the Web. 2019
- Web Data Commons Schema.org Data extracted from November 2019 Common Crawl.
- Deployment of JSON-LD and Microdata: Alternative vs Joint Usage
- Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
- Web Data Commons Schema.org Data extracted from November 2019 Common Crawl.
Students are free to suggest additional topics of their choice that are related to large-scale data integration.
Getting started
The following books are good starting points for getting an overview of the topic of large-scale data integration:
- Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
- Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.