
Large-Scale Data Integration Seminar (FSS 2020)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search and data exploration, and data profiling.

Organization

Goals

In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10-12 pages)
  • Give a presentation about your topic (before the submission of the report)

Kick-Off Meeting

The kick-off meeting will take place on Monday, March 2nd, in room C1.01 (building B6, entrance C).

You can download the slides of the kick-off meeting here.

 

Requirements

Schedule

  1. Please register for the seminar via the centrally-coordinated seminar registration.
  2. After you have been accepted into the seminar, please email us your three preferred topics from the list below.
    We will assign topics to students according to their preferences.
  3. Attend the kick-off meeting on Monday, March 2nd, at 10:00 in room B6 C1.01 (library room).
  4. You will be assigned a mentor who provides guidance in one-to-one meetings
  5. Work individually throughout the semester: explore literature, create a presentation, and write a report
  6. Give your presentation in a block seminar towards the end of the semester

Topics

  1. Embedding Methods for Tabular Data
    • R. Cappuzzo, P. Papotti, and S. Thirumuruganathan, “Local Embeddings for Relational Data Integration,” arXiv:1909.01120 [cs], Sep. 2019.
    • A. Y. L. Sim and A. Borthwick, “Record2Vec: Unsupervised Representation Learning for Structured Records,” in 2018 IEEE International Conference on Data Mining (ICDM), 2018.
  2. Transfer Learning for Entity Resolution
    • S. N. Negahban, B. I. Rubinstein, and J. G. Gemmell: Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2224–2228. ACM, 2012.
    • Thirumuruganathan, Saravanan, Shameem A. Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq Joty: Reuse and Adaptation for Entity Resolution through Transfer Learning. arXiv preprint arXiv:1809.11084, 2018.
  3. Matching Numeric Attributes using Symbolic and Subsymbolic Features
    • Phuc Nguyen, et al: EmbNum+: Effective, Efficient, and Robust Semantic Labeling for Numerical Values. New Generation Computing Journal, 2019.
    • Sebastian Neumaier, et al.: Multi-level Semantic Labelling of Numerical Values. International Semantic Web Conference, 2016.
  4. Profiling of Relational Data for Data Integration
    • Abedjan, Ziawasch, Lukasz Golab, and Felix Naumann. “Profiling relational data: a survey.” The VLDB Journal—The International Journal on Very Large Data Bases 24.4 (2015): 557-581.
    • Naumann, Felix. “Data profiling revisited.” ACM SIGMOD Record 42.4 (2014): 40-49.
  5. Generating Synthetic Benchmark Datasets for Entity Resolution
    • Ioannou, Ekaterini and Yannis Velegrakis. EMBench++: Data for a thorough benchmarking of matching-related methods. Semantic Web 10, 435-450, 2019.
    • Saveta, Tzanina, et al.: Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data. Proceedings of the 24th International Conference on World Wide Web. 2015.
  6. Explaining Entity Resolution Methods
    • A. Ebaid, S. Thirumuruganathan, W. G. Aref, A. Elmagarmid, and M. Ouzzani, “EXPLAINER: Entity Resolution Explanations,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019.
    • S. Thirumuruganathan, M. Ouzzani, and N. Tang, “Explaining Entity Resolution Predictions: Where are we and What needs to be done?” HILDA'19 (2019).
  7. Active Learning for Entity Resolution
    • Chen, X. et al.: Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER). In: Advances in Databases and Information Systems. 2019.
    • Ngomo, A.-C.N., Lyko, K.: Eagle: Efficient active learning of link specifications using genetic programming. In: Extended Semantic Web Conference. 2012.
  8. Dataset Search
    • Chapman, A., Simperl, E., Koesten, L.: Dataset Search: A Survey. The VLDB Journal, 2019.
    • Trabelsi, et al.: Improved Table Retrieval Using Multiple Context Embeddings for Attributes. In: Big Data 2019.
  9. Profiling Semantic Annotations for Dataset Search
  10. Deployment of JSON-LD and Microdata: Alternative vs Joint Usage
    • Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
    • Web Data Commons: Schema.org data extracted from the November 2019 Common Crawl.

 Students are free to suggest additional topics of their choice that are related to large-scale data integration.


Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.