Large-Scale Data Integration Seminar (FSS2018)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search, and data profiling.

Organization

This seminar is organized by Prof. Dr. Christian Bizer, Anna Primpeli, Oliver Lehmberg, and Yaser Oulabi.
The seminar is open to master's students of the Data Science and Business Informatics programs.

Goals

In this seminar, you will

Read, understand, and explore scientific literature
Summarize a current research topic in a concise report (10–12 pages)
Give a presentation about your topic (before the submission of the report)

Requirements

Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
The report has to be written using LaTeX
Report and presentation have to be in English

Schedule

Select your preferred topics and register before 16 February
Attend the kickoff meeting on 27 February at 10:00 in room B6 C1.01 (library room)
You will be assigned a mentor, who provides guidance and one-to-one meetings
Work individually throughout the semester: explore literature, create a presentation, and write a report
Give your presentation in a block seminar towards the end of the semester

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.

Registration

Explore the list of topics below and select at least 3 topics of your preference. Send a ranked list of your selected topics via email to annainformatik.uni-mannheim.de by 16 February 2018. We will confirm your registration and assign you one of your preferred topics (if possible) in the week of 19 February 2018.

1. Collective Instance Matching
Doan/Halevy: Principles of Data Integration. Pages 198ff, Morgan Kaufmann, 2012.
Christophides/Efthymiou/Stefanidis: Entity resolution in the web of data. Pages 55–72, Synthesis Lectures on the Semantic Web, 2015.
2. Holistic Schema Matching
Mulwad, Varish, Tim Finin, and Anupam Joshi: Semantic message passing for generating linked data from tables. International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
He, Yeye, et al.: Automatic discovery of attribute synonyms using query logs and table corpora. Proceedings of the 25th International Conference on World Wide Web, 2016.
3. Data Search for Table Extension
Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012.
Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
4. Truth Discovery for Knowledge Base Completion
Li, Gao: A Survey on Truth Discovery. KDD SIGKDD Explorations Newsletter, 2015.
Dong: Knowledge-based trust: estimating the trustworthiness of web sources. VLDB 2015.
5. Wrapper Induction for Knowledge Base Completion
Bühmann L. et al.: Web-Scale Extension of RDF Knowledge Bases from Templated Websites. Proceedings of the International Semantic Web Conference (ISWC), 2014.
Furche, Tim, et al.: DIADEM: domain-centric, intelligent, automated data extraction methodology. Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
6. Set Completion using Semi-Structured Web Data
Wang and Cohen: Language-independent set expansion of named entities using the web. Seventh IEEE International Conference on Data Mining (ICDM), 2007.
Lidong Bing, Wai Lam, and Tak-Lam Wong: Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), 2013.
7. Query Strategies for Active Learning
Settles, Burr: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1–114.
Isele, Robert, Anja Jentzsch, and Christian Bizer: Active learning of expressive linkage rules for the web of data. Proceedings of the International Conference on Web Engineering, 2012.
8. Profiling schema.org JobPosting Data on the Web
Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
Anna Primpeli: WebDataCommons Schema.org Data Extracted from November 2017 Common Crawl.
9. Corporate Data Lakes
Alon Halevy, Flip Korn, Natalya F. Noy, et al.: Goods: Organizing Google's Datasets. SIGMOD, 2016.
I. Terrizzano, P. M. Schwarz, et al.: Data wrangling: The challenging journey from the wild to the lake. CIDR, 2015.
10. Dataspace Profiling
Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
Mohamed Ellefi, et al.: RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.

Students are free to suggest additional topics of their choice that are related to large-scale data integration.

Presentation Schedule

The final seminar presentations will take place on Friday 04.05 and Monday 07.05 in room A305, B6 Building A.

We have assigned the following timeslots:

Topic	Timeslot
Collective Instance Matching	Friday, 04.05, 10:30 – 10:55
Holistic Schema Matching	Friday, 04.05, 10:55 – 11:20
Active Learning for Entity Resolution – Blocking	Friday, 04.05, 11:20 – 11:45
Active Learning for Entity Resolution – Query Strategies	Friday, 04.05, 11:45 – 12:10
Data Search for Table Extension	Friday, 04.05, 12:10 – 12:35

Dataspace Profiling	Monday, 07.05, 10:15 – 10:40
Wrapper Induction for Knowledge Base Completion	Monday, 07.05, 10:40 – 11:05
Set Completion using Semi-Structured Web Data	Monday, 07.05, 11:05 – 11:30
Truth Discovery for Knowledge Base Completion	Monday, 07.05, 11:30 – 11:55
