Data integration is one of the key challenges in most IT projects, and it is estimated that data scientists spend about 80% of their time on it. Within the enterprise context, data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications or data analysis projects. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, job portals, location-based mashups, or data search engines.
In this course, students will learn techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course covers the following topics:
Heterogeneity and Distributedness
The Data Integration Process
Structured Data on the Web
Data Exchange Formats
Schema Mapping and Data Translation
Identity Resolution
Data Quality Assessment
Data Fusion
The course consists of a lecture and accompanying practical projects. The lecture (IE670) covers the theory and methods of web data integration and concludes with a written exam (3 ECTS). In the projects (IE683), students gain experience with web data integration methods by applying them to a real-world use case of their choice. Students work on their projects in teams and present their results in a written report and an oral presentation (together 3 ECTS). While the lecture and the project can be attended in separate years, it is highly recommended to attend both in the same semester, as the schedules of the lecture and the project are aligned with each other.
Week | Wednesday | Thursday |
---|---|---|
4.9.2019 | Lecture: Introduction to Web Data Integration | Lecture: Structured Data on the Web |
11.9.2019 | Lecture: Data Exchange Formats | Lecture: Data Exchange Formats |
18.9.2019 | Lecture: Schema Mapping | Lecture: Schema Mapping |
25.9.2019 | Project: Introduction to Student Projects | Project: Introduction to MapForce |
2.10.2019 | Project: Feedback about Project Outlines | - Holiday - |
9.10.2019 | Project Work: Data Translation | Lecture: Identity Resolution |
16.10.2019 | Lecture: Identity Resolution | Project: Identity Resolution |
23.10.2019 | Project Work: Identity Resolution | Project Work: Identity Resolution |
30.10.2019 | Project Work: Identity Resolution | - Holiday - |
6.11.2019 | Lecture: Data Quality and Data Fusion | Lecture: Data Quality and Data Fusion |
13.11.2019 | Project: Data Fusion | Project Work: Data Fusion |
20.11.2019 | Project Work: Data Fusion | Project Work: Data Fusion |
27.11.2019 | Project Work: Data Fusion | Project Work: Data Fusion |
4.12.2019 | Presentation of project results | Presentation of project results |
Exercise 1: Data Exchange Formats
Exercise 2: Schema Mapping
Exercise 3: Identity Resolution
Exercise 4: Data Fusion
Video recordings of the Web Data Integration lectures from HWS2015 are available here.
AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007. (Free PDF Version)
Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Serge Abiteboul et al.: Web Data Management. Cambridge University Press, 2012.
Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
Felix Naumann: An Introduction to Duplicate Detection. Morgan & Claypool, 2012.
Peter Christen: Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.