Large-Scale Data Integration Seminar (FSS2018)
This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search, and data profiling.
Organization
- This seminar is organized by Prof. Dr. Christian Bizer, Anna Primpeli, Oliver Lehmberg, Yaser Oulabi.
- The seminar is open to master's students of the Data Science and Business Informatics programs.
Goals
In this seminar, you will
- Read, understand, and explore scientific literature
- Summarize a current research topic in a concise report (10–12 pages)
- Give a presentation about your topic (before the submission of the report)
Requirements
- Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
- The report has to be written using LaTeX
- Report and presentation have to be in English
Schedule
- Select your preferred topics and register before 16 February
- Attend the kickoff meeting on 27 February at 10:00 in room B6 C1.01 (library room)
- You will be assigned a mentor, who provides guidance and one-to-one meetings
- Work individually throughout the semester: explore literature, create a presentation, and write a report
- Give your presentation in a block seminar towards the end of the semester
Getting started
The following books are good starting points for getting an overview of the topic of large-scale data integration:
- Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
- Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.
Registration
Explore the list of topics below and select at least 3 topics of your preference. Send a ranked list of your selected topics via email to annainformatik.uni-mannheim.de by 16 February 2018. We will confirm your registration and assign you one of your preferred topics (if possible) in the week of 19 February 2018.
1. Collective Instance Matching
- Doan/Halevy: Principles of Data Integration. Pages 198ff, Morgan Kaufmann, 2012.
- Christophides/Efthymiou/Stefanidis: Entity resolution in the web of data. Pages 55–72, Synthesis Lectures on the Semantic Web, 2015.
2. Holistic Schema Matching
- Mulwad, Varish, Tim Finin, and Anupam Joshi: Semantic message passing for generating linked data from tables. International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
- He, Yeye, et al.: Automatic discovery of attribute synonyms using query logs and table corpora. Proceedings of the 25th International Conference on World Wide Web, 2016.
3. Data Search for Table Extension
- Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012.
- Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
4. Truth Discovery for Knowledge Base Completion
- Li, Gao: A Survey on Truth Discovery. SIGKDD Explorations Newsletter, 2015.
- Dong: Knowledge-based trust: estimating the trustworthiness of web sources. VLDB 2015.
5. Wrapper Induction for Knowledge Base Completion
- Bühmann L. et al.: Web-Scale Extension of RDF Knowledge Bases from Templated Websites. Proceedings of the International Semantic Web Conference (ISWC), 2014.
- Furche, Tim, et al.: DIADEM: domain-centric, intelligent, automated data extraction methodology. Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
6. Set Completion using Semi-Structured Web Data
- Wang and Cohen: Language-independent set expansion of named entities using the web. Seventh IEEE International Conference on Data Mining (ICDM), 2007.
- Lidong Bing, Wai Lam, and Tak-Lam Wong: Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), 2013.
7. Query Strategies for Active Learning
- Settles, Burr: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1–114.
- Isele, Robert, Anja Jentzsch, and Christian Bizer: Active learning of expressive linkage rules for the web of data. Proceedings of the International Conference on Web Engineering, 2012.
8. Profiling schema.org JobPosting Data on the Web
- Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
- Anna Primpeli: WebDataCommons Schema.org Data Extracted from November 2017 Common Crawl.
9. Corporate Data Lakes
- Alon Halevy, Flip Korn, Natalya F. Noy, et al.: Goods: Organizing Google's Datasets. SIGMOD, 2016.
- I. Terrizzano, P. M. Schwarz, et al.: Data wrangling: The challenging journey from the wild to the lake. CIDR, 2015.
10. Dataspace Profiling
- Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
- Mohamed Ellefi, et al.: RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.
Students are free to suggest additional topics of their choice that are related to large-scale data integration.
Presentation Schedule
The final seminar presentations will take place on Friday 04.05 and Monday 07.05 in room A305, B6 Building A.
We have assigned the following timeslots:
| Topic | Timeslot |
|---|---|
| Collective Instance Matching | Friday, 04.05, 10:30 – 10:55 |
| Holistic Schema Matching | Friday, 04.05, 10:55 – 11:20 |
| Active Learning for Entity Resolution – Blocking | Friday, 04.05, 11:20 – 11:45 |
| Active Learning for Entity Resolution – Query Strategies | Friday, 04.05, 11:45 – 12:10 |
| Data Search for Table Extension | Friday, 04.05, 12:10 – 12:35 |
| Dataspace Profiling | Monday, 07.05, 10:15 – 10:40 |
| Wrapper Induction for Knowledge Base Completion | Monday, 07.05, 10:40 – 11:05 |
| Set Completion using Semi-Structured Web Data | Monday, 07.05, 11:05 – 11:30 |
| Truth Discovery for Knowledge Base Completion | Monday, 07.05, 11:30 – 11:55 |
