Large-Scale Data Integration Seminar (FSS2018)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search, and data profiling.



In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10–12 pages)
  • Give a presentation about your topic (before the submission of the report)


Requirements

  • Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
  • The report has to be written using LaTeX
  • Report and presentation have to be in English


Organization

  1. Select your preferred topics and register before 16 February
  2. Attend the kickoff meeting on 27 February at 10:00 in room B6 C1.01 (library room)
  3. You will be assigned a mentor, who provides guidance and one-to-one meetings
  4. Work individually throughout the semester: explore literature, create a presentation, and write a report
  5. Give your presentation in a block seminar towards the end of the semester

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.


Explore the list of topics below and select at least 3 topics of your preference. Send a ranked list of your selected topics via email to anna by 16 February 2018. We will confirm your registration and, if possible, assign you one of your preferred topics in the week of 19 February 2018.

  • 1. Collective Instance Matching

    • Doan/Halevy: Principles of Data Integration. Pages 198ff, Morgan Kaufmann, 2012.
    • Christophides/Efthymiou/Stefanidis: Entity resolution in the web of data. Pages 55–72, Synthesis Lectures on the Semantic Web, 2015.
  • 2. Holistic Schema Matching

    • Mulwad, Varish, Tim Finin, and Anupam Joshi: Semantic message passing for generating linked data from tables. International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
    • He, Yeye, et al.: Automatic discovery of attribute synonyms using query logs and table corpora. Proceedings of the 25th International Conference on World Wide Web, 2016.
  • 3. Data Search for Table Extension

    • Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012.
    • Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
  • 4. Truth Discovery for Knowledge Base Completion

    • Li, Gao: A Survey on Truth Discovery. KDD SIGKDD Explorations Newsletter, 2015.
    • Dong: Knowledge-based trust: estimating the trustworthiness of web sources. VLDB 2015.
  • 5. Wrapper Induction for Knowledge Base Completion

    • Bühmann L. et al.: Web-Scale Extension of RDF Knowledge Bases from Templated Websites. Proceedings of the International Semantic Web Conference (ISWC), 2014.
    • Furche, Tim, et al.: DIADEM: domain-centric, intelligent, automated data extraction methodology. Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
  • 6. Set Completion using Semi-Structured Web Data

    • Wang and Cohen: Language-independent set expansion of named entities using the web. Seventh IEEE International Conference on Data Mining (ICDM), 2007.
    • Bing, Lidong, Wai Lam, and Tak-Lam Wong: Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), 2013.
  • 7. Query Strategies for Active Learning

    • Settles, Burr: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1–114.
    • Isele, Robert, Anja Jentzsch, and Christian Bizer: Active learning of expressive linkage rules for the web of data. Proceedings of the International Conference on Web Engineering, 2012.
  • 8. Profiling JobPosting Data on the Web

    • Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
    • Anna Primpeli: WebDataCommons Data Extracted from November 2017 Common Crawl.
  • 9. Corporate Data Lakes

    • Alon Halevy, Flip Korn, Natalya F. Noy, et al.: Goods: Organizing Google's Datasets. SIGMOD, 2016.
    • I. Terrizzano, P. M. Schwarz, et al.: Data wrangling: The challenging journey from the wild to the lake. CIDR, 2015.
  • 10. Dataspace Profiling

    • Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
    • Mohamed Ellefi, et al.: RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.

Students are free to suggest additional topics of their choice that are related to large-scale data integration.

Presentation Schedule

The final seminar presentations will take place on Friday 04.05 and Monday 07.05 in room A305, B6 Building A.

We have assigned the following timeslots:

  • Collective Instance Matching: Friday, 04.05, 10:30 – 10:55
  • Holistic Schema Matching: Friday, 04.05, 10:55 – 11:20
  • Active Learning for Entity Resolution – Blocking: Friday, 04.05, 11:20 – 11:45
  • Active Learning for Entity Resolution – Query Strategies: Friday, 04.05, 11:45 – 12:10
  • Data Search for Table Extension: Friday, 04.05, 12:10 – 12:35
  • Dataspace Profiling: Monday, 07.05, 10:15 – 10:40
  • Wrapper Induction for Knowledge Base Completion: Monday, 07.05, 10:40 – 11:05
  • Set Completion using Semi-Structured Web Data: Monday, 07.05, 11:05 – 11:30
  • Truth Discovery for Knowledge Base Completion: Monday, 07.05, 11:30 – 11:55