Large-Scale Data Integration Seminar (FSS2019)
This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search and data exploration, and data profiling.
Organization
- This seminar is organized by Prof. Dr. Christian Bizer, Anna Primpeli, Yaser Oulabi.
- The seminar is open to master's students of the Data Science and Business Informatics programs.
- Slides from the Kick Off Meeting
Goals
In this seminar, you will
- Read, understand, and explore scientific literature
- Summarize a current research topic in a concise report (10–12 pages)
- Give a presentation about your topic (before the submission of the report)
Requirements
- Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
- The report has to be written using LaTeX
- Report and presentation have to be in English
Schedule
- Please register for the seminar via the centrally coordinated seminar registration, which is being used for the first time this semester.
- After you have been accepted into the seminar, please email us your three preferred topics from the list below. We will assign topics to students according to their preferences.
- Attend the kickoff meeting on Monday, 25 February at 10:00 in room B6 C1.01 (library room).
- You will be assigned a mentor who provides guidance in one-to-one meetings
- Work individually throughout the semester: explore literature, create a presentation, and write a report
- Give your presentation in a block seminar towards the end of the semester
Getting started
The following books are good starting points for getting an overview of the topic of large-scale data integration:
- Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
- Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.
Topics
- Product Data Matching using Embeddings and Deep Neural Networks
- Mudgal, S. et al.: Deep Learning for Entity Matching: A Design Space Exploration. In: Proceedings of the 2018 International Conference on Management of Data – SIGMOD '18. pp. 19–34. ACM Press, Houston, TX, USA (2018).
- Shah, K., Kopru, S., Ruvini, J.D.: Neural Network based Extreme Classification and Similarity Models for Product Matching. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). pp. 8–15.
- Transfer Learning of Matching Knowledge
- S. N. Negahban, B. I. Rubinstein, and J. G. Gemmell. Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2224–2228. ACM, 2012.
- Thirumuruganathan, Saravanan, Shameem A. Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq Joty. “Reuse and Adaptation for Entity Resolution through Transfer Learning.” arXiv preprint arXiv:1809.11084 (2018).
- Best Practices for Building Training and Evaluation Sets for Entity Matching
- Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: QDB/MUD. pp. 3–12 (2008).
- Bianco, G.D. et al.: A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication. IEEE Transactions on Knowledge and Data Engineering. 27, 9, 2305–2319 (2015).
- Blocking for Large-scale N:M Entity Matching
- Christophides, Vassilis, Vasilis Efthymiou, and Kostas Stefanidis. “Entity resolution in the web of data.” Synthesis Lectures on the Semantic Web 5.3 (2015): 55–72.
- G. Papadakis, E. Ioannou, C. Niederee, T. Palpanas, and W. Nejdl. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In Proc. of the 5th Int. Conf. on Web search and data mining, ACM, 2012.
- A Comparison of Two Families of Active Learning Query Strategies: Committee-Based and Structure-Based (Density-Weighted) Methods
- Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. 6, 1, 1–114 (2012) (Chapters 3 & 5).
- Settles, Burr, and Mark Craven. “An analysis of active learning strategies for sequence labeling tasks.” Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008.
- Hierarchical Categorization of Products using Embeddings and Deep Neural Networks
- Silla, Carlos N., and Alex A. Freitas. “A survey of hierarchical classification across different application domains.” Data Mining and Knowledge Discovery 22.1–2 (2011): 31–72.
- Xiong, Tengke, and Putra Manggala. “Hierarchical Classification with Hierarchical Attention Networks.” (2018).
- Product Taxonomy Matching and Integration
- Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag, Berlin Heidelberg (2013).
- Park, S., Kim, W.: Ontology Mapping Between Heterogeneous Product Taxonomies in an Electronic Commerce Environment. Int. J. Electronic Commerce. 12, 69–87 (2007).
- Truth Discovery on the Web
- Zhao, Bo, et al. “A bayesian approach to discovering truth from conflicting sources for data integration.” Proceedings of the VLDB Endowment 5.6 (2012): 550–561.
- Li, Qi, et al. “Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation.” Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
- Data Space Profiling
- Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling. Morgan & Claypool Synthesis Lecture in Computer Science, 2018.
- Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In Proceedings of the 13th International Semantic Web Conference. 2014.
- Profiling schema.org Event Data on the Web
- Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
- Anna Primpeli: WebDataCommons Schema.org Data Extracted from November 2018 Common Crawl.
Students are free to suggest additional topics of their choice that are related to large-scale data integration.