
Large-Scale Data Integration Seminar (FSS2018)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search, and data profiling.



In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10-12 pages)
  • Give a presentation about your topic (before the submission of the report)


  • Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
  • The report has to be written using LaTeX
  • Report and presentation have to be in English


  1. Select your preferred topics and register before February 16th
  2. Attend the kickoff meeting on February 27th at 10:00 in room B6 C1.01 (library room)
  3. You will be assigned a mentor, who provides guidance in one-to-one meetings
  4. Work individually throughout the semester: explore literature, create a presentation, and write a report
  5. Give your presentation in a block seminar towards the end of the semester

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.


Explore the list of topics below and select at least 3 topics of your preference. Send a ranked list of your selected topics via email to anna(at) by February 16, 2018. We will confirm your registration and assign you one of your preferred topics (if possible) in the week of February 19, 2018.

  • 1. Collective Instance Matching

    • Doan/Halevy: Principles of Data Integration. Pages 198ff, Morgan Kaufmann, 2012.
    • Christophides/Efthymiou/Stefanidis: Entity resolution in the web of data. Pages 55-72, Synthesis Lectures on the Semantic Web, 2015.
  • 2. Holistic Schema Matching

    • Mulwad, Varish, Tim Finin, and Anupam Joshi: Semantic message passing for generating linked data from tables. International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
    • He, Yeye, et al.: Automatic discovery of attribute synonyms using query logs and table corpora. Proceedings of the 25th International Conference on World Wide Web, 2016.
  • 3. Data Search for Table Extension

    • Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012.
    • Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
  • 4. Truth Discovery for Knowledge Base Completion

    • Li, Gao: A Survey on Truth Discovery. ACM SIGKDD Explorations Newsletter, 2015.
    • Dong: Knowledge-based trust: estimating the trustworthiness of web sources. VLDB 2015.
  • 5. Wrapper Induction for Knowledge Base Completion

    • Bühmann L. et al.: Web-Scale Extension of RDF Knowledge Bases from Templated Websites. Proceedings of the International Semantic Web Conference (ISWC), 2014.
    • Furche, Tim, et al.: DIADEM: domain-centric, intelligent, automated data extraction methodology. Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
  • 6. Set Completion using Semi-Structured Web Data

    • Wang and Cohen: Language-independent set expansion of named entities using the web. Seventh IEEE International Conference on Data Mining (ICDM), 2007.
    • Lidong Bing, Wai Lam, and Tak-Lam Wong: Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), 2013.
  • 7. Query Strategies for Active Learning

    • Settles, Burr: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1-114.
    • Isele, Robert, Anja Jentzsch, and Christian Bizer: Active learning of expressive linkage rules for the web of data. Proceedings of the International Conference on Web Engineering, 2012.
  • 8. Profiling JobPosting Data on the Web

    • Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
    • Anna Primpeli: WebDataCommons Data Extracted from November 2017 Common Crawl.
  • 9. Corporate Data Lakes

    • Alon Halevy, Flip Korn, Natalya F. Noy, et al.: Goods: Organizing Google's Datasets. SIGMOD, 2016.
    • I. Terrizzano, P. M. Schwarz, et al.: Data wrangling: The challenging journey from the wild to the lake. CIDR, 2015.
  • 10. Dataspace Profiling

    • Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
    • Mohamed Ellefi, et al.: RDF Dataset Profiling - a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.

 Students are free to suggest additional topics of their choice that are related to large-scale data integration.

Presentation Schedule

The final seminar presentations will take place on Friday, 04.05, and Monday, 07.05, in room A305, B6 Building A.

We have assigned the following timeslots:

Topic                                                     Timeslot
Collective Instance Matching                              Friday, 04.05, 10:30 - 10:55
Holistic Schema Matching                                  Friday, 04.05, 10:55 - 11:20
Active Learning for Entity Resolution - Blocking          Friday, 04.05, 11:20 - 11:45
Active Learning for Entity Resolution - Query Strategies  Friday, 04.05, 11:45 - 12:10
Data Search for Table Extension                           Friday, 04.05, 12:10 - 12:35
Dataspace Profiling                                       Monday, 07.05, 10:15 - 10:40
Wrapper Induction for Knowledge Base Completion           Monday, 07.05, 10:40 - 11:05
Set Completion using Semi-Structured Web Data             Monday, 07.05, 11:05 - 11:30
Truth Discovery for Knowledge Base Completion             Monday, 07.05, 11:30 - 11:55