Large-Scale Data Integration Seminar (FSS2018)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search, and data profiling.

Organization

Goals

In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10–12 pages)
  • Give a presentation about your topic (before the submission of the report)

Requirements

  • Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
  • The report has to be written using LaTeX
  • Report and presentation have to be in English

Schedule

  1. Select your preferred topics and register before 16 February
  2. Attend the kickoff meeting on 27 February at 10:00 in room B6 C1.01 (library room)
  3. You will be assigned a mentor, who provides guidance and one-to-one meetings
  4. Work individually throughout the semester: explore literature, create a presentation, and write a report
  5. Give your presentation in a block seminar towards the end of the semester

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.

Registration

Explore the list of topics below and select at least three topics of your preference. Send a ranked list of your selected topics via email to annamail-informatik.uni-mannheim.de by 16 February 2018. We will confirm your registration and, if possible, assign you one of your preferred topics in the week of 19 February 2018.

  • 1. Collective Instance Matching

    • Doan/Halevy: Principles of Data Integration. Pages 198ff, Morgan Kaufmann, 2012.
    • Christophides/Efthymiou/Stefanidis: Entity resolution in the web of data. Pages 55–72, Synthesis Lectures on the Semantic Web, 2015.
  • 2. Holistic Schema Matching

    • Mulwad, Varish, Tim Finin, and Anupam Joshi: Semantic message passing for generating linked data from tables. International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
    • He, Yeye, et al.: Automatic discovery of attribute synonyms using query logs and table corpora. Proceedings of the 25th International Conference on World Wide Web, 2016.
  • 3. Data Search for Table Extension

    • Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012.
    • Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
  • 4. Truth Discovery for Knowledge Base Completion

    • Li, Gao: A Survey on Truth Discovery. KDD SIGKDD Explorations Newsletter, 2015.
    • Dong: Knowledge-based trust: estimating the trustworthiness of web sources. VLDB 2015.
  • 5. Wrapper Induction for Knowledge Base Completion

    • Bühmann L. et al.: Web-Scale Extension of RDF Knowledge Bases from Templated Websites. Proceedings of the International Semantic Web Conference (ISWC), 2014.
    • Furche, Tim, et al.: DIADEM: domain-centric, intelligent, automated data extraction methodology. Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
  • 6. Set Completion using Semi-Structured Web Data

    • Wang and Cohen: Language-independent set expansion of named entities using the web. Seventh IEEE International Conference on Data Mining (ICDM), 2007.
    • Lidong Bing, Wai Lam, and Tak-Lam Wong: Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), 2013.
  • 7. Query Strategies for Active Learning

    • Settles, Burr: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1–114.
    • Isele, Robert, Anja Jentzsch, and Christian Bizer: Active learning of expressive linkage rules for the web of data. Proceedings of the International Conference on Web Engineering, 2012.
  • 8. Profiling schema.org JobPosting Data on the Web

    • Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
    • Anna Primpeli: WebDataCommons Schema.org Data Extracted from November 2017 Common Crawl.
  • 9. Corporate Data Lakes

    • Alon Halevy, Flip Korn, Natalya F. Noy, et al.: Goods: Organizing Google's Datasets. SIGMOD, 2016.
    • I. Terrizzano, P. M. Schwarz, et al.: Data wrangling: The challenging journey from the wild to the lake. CIDR, 2015.
  • 10. Dataspace Profiling

    • Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
    • Mohamed Ellefi, et al.: RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.

Students are free to suggest additional topics of their choice that are related to large-scale data integration.


Presentation Schedule

The final seminar presentations will take place on Friday 04.05 and Monday 07.05 in room A305, B6 Building A.

We have assigned the following timeslots:

Topic                                                     Timeslot
Collective Instance Matching                              Friday, 04.05, 10:30 – 10:55
Holistic Schema Matching                                  Friday, 04.05, 10:55 – 11:20
Active Learning for Entity Resolution – Blocking          Friday, 04.05, 11:20 – 11:45
Active Learning for Entity Resolution – Query Strategies  Friday, 04.05, 11:45 – 12:10
Data Search for Table Extension                           Friday, 04.05, 12:10 – 12:35

Dataspace Profiling                                       Monday, 07.05, 10:15 – 10:40
Wrapper Induction for Knowledge Base Completion           Monday, 07.05, 10:40 – 11:05
Set Completion using Semi-Structured Web Data             Monday, 07.05, 11:05 – 11:30
Truth Discovery for Knowledge Base Completion             Monday, 07.05, 11:30 – 11:55