Large-Scale Data Integration Seminar (FSS2019)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search and data exploration, and data profiling.

Organization

This seminar is organized by Prof. Dr. Christian Bizer, Anna Primpeli, and Yaser Oulabi.
The seminar is open to master's students of the Data Science and Business Informatics programs.
Slides from the Kick Off Meeting (PDF, 763 kB)

Goals

In this seminar, you will

Read, understand, and explore scientific literature
Summarize a current research topic in a concise report (10–12 pages)
Give a presentation about your topic (before the submission of the report)

Requirements

Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
The report has to be written using LaTeX
Report and presentation have to be in English

Schedule

Please register for the seminar via the centrally coordinated seminar registration, which is being used for the first time this semester.
After you have been accepted into the seminar, please email us your three preferred topics from the list below.
We will assign topics to students according to their preferences.
Attend the kickoff meeting on Monday, 25 February at 10:00 in room B6 C1.01 (library room).
You will be assigned a mentor, who will provide guidance in one-to-one meetings
Work individually throughout the semester: explore literature, create a presentation, and write a report
Give your presentation in a block seminar towards the end of the semester

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.

Topics

Product Data Matching using Embeddings and Deep Neural Networks
- Mudgal, S. et al.: Deep Learning for Entity Matching: A Design Space Exploration. In: Proceedings of the 2018 International Conference on Management of Data – SIGMOD '18. pp. 19–34. ACM Press, Houston, TX, USA (2018).
- Shah, K., Kopru, S., Ruvini, J.D.: Neural Network based Extreme Classification and Similarity Models for Product Matching. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). pp. 8–15
Transfer Learning of Matching Knowledge
- S. N. Negahban, B. I. Rubinstein, and J. G. Gemmell. Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2224–2228. ACM, 2012.
- Thirumuruganathan, Saravanan, Shameem A. Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq Joty. “Reuse and Adaptation for Entity Resolution through Transfer Learning.” arXiv preprint arXiv:1809.11084 (2018).
Best Practices for Building Training and Evaluation Sets for Entity Matching
- Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: QDB/MUD. pp. 3–12 (2008).
- Bianco, G.D. et al.: A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication. IEEE Transactions on Knowledge and Data Engineering. 27, 9, 2305–2319 (2015).
Blocking for Large-scale N:M Entity Matching
- Christophides, Vassilis, Vasilis Efthymiou, and Kostas Stefanidis. “Entity resolution in the web of data.” Synthesis Lectures on the Semantic Web 5.3 (2015): 55–72.
- G. Papadakis, E. Ioannou, C. Niederee, T. Palpanas, and W. Nejdl. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In Proc. of the 5th Int. Conf. on Web search and data mining, ACM, 2012.
A Comparison of Two Families of Active Learning Query Strategies: Committee-Based and Structure-Based (Density-Weighted) Methods
- Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. 6, 1, 1–114 (2012). Chapters 3 and 5.
- Settles, Burr, and Mark Craven. “An analysis of active learning strategies for sequence labeling tasks.” Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008.
Hierarchical Categorization of Products using Embeddings and Deep Neural Networks
- Silla, Carlos N., and Alex A. Freitas. “A survey of hierarchical classification across different application domains.” Data Mining and Knowledge Discovery 22.1–2 (2011): 31–72.
- Xiong, Tengke, and Putra Manggala. “Hierarchical Classification with Hierarchical Attention Networks.” (2018).
Product Taxonomy Matching and Integration
- Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag, Berlin Heidelberg (2013).
- Park, S., Kim, W.: Ontology Mapping Between Heterogeneous Product Taxonomies in an Electronic Commerce Environment. Int. J. Electronic Commerce. 12, 69–87 (2007).
Truth Discovery on the Web
- Zhao, Bo, et al. “A bayesian approach to discovering truth from conflicting sources for data integration.” Proceedings of the VLDB Endowment 5.6 (2012): 550–561.
- Li, Qi, et al. “Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.
Data Space Profiling
- Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling. Morgan & Claypool Synthesis Lecture in Computer Science, 2018.
- Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In Proceedings of the 13th International Semantic Web Conference. 2014.
Profiling schema.org Event Data on the Web
- Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
- Anna Primpeli: WebDataCommons Schema.org Data Extracted from November 2018 Common Crawl.

Students are free to suggest additional topics of their choice that are related to large-scale data integration.