Web Data Integration (HWS2018)
Data integration is one of the key challenges within most IT projects. Within the enterprise context, data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications or data analysis projects. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, job portals, location-based mashups, or data search engines.
In the course, students will learn techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:
Heterogeneity and Distributedness
The Data Integration Process
Structured Data on the Web
Data Exchange Formats
Schema Mapping and Data Translation
Identity Resolution
Data Quality Assessment
Data Fusion
The course consists of a lecture as well as accompanying practical projects. The lecture (IE670) covers the theory and methods of web data integration and concludes with a written exam (3 ECTS). In the projects (IE683), students will gain experience with web data integration methods by applying them within a real-world use case of their choice. Students will work on their projects in teams and will report their results in a written report as well as an oral presentation (together 3 ECTS). While the lecture and the project can be attended in separate years, it is highly recommended to attend both in the same semester, as the schedules of the lecture and the project are aligned with each other.
Exam Review
- The exam review will take place on 1 March 2019 at 14:00 in room B6 C101.
Time and Location
- Wednesday, 15:30–17:00. Building: B6, Room: A 101 (Starting: 5.9.2018)
- Thursday, 10:15–11:45. Building: B6, Room: A 101 (Starting: 6.9.2018)
- The exercises take place in rooms A 101 (Group 1) and A 305 (Group 2)
ECTS
- 3 ECTS: Lecture with written exam (IE670)
- 3 ECTS: Project with report and presentation (IE683)
Lecture Videos
Video recordings of the Web Data Integration lectures from HWS2015 are available here.
Outline
Week | Wednesday | Thursday
5.9.2018 | Lecture: Introduction to Web Data Integration | Lecture: Structured Data on the Web
12.9.2018 | Lecture: Data Exchange Formats | Lecture: Data Exchange Formats
19.9.2018 | Lecture: Schema Mapping | Lecture: Schema Mapping
26.9.2018 | Project: Introduction to Student Projects | Project: Introduction to MapForce
3.10.2018 | - Holiday - | Project Work: Data Translation
10.10.2018 | Project: Feedback about Project Outlines | Lecture: Identity Resolution
17.10.2018 | Lecture: Identity Resolution | Project: Identity Resolution
24.10.2018 | Project Work: Identity Resolution | Project Work: Identity Resolution
31.10.2018 | Project Work: Identity Resolution | - Holiday -
7.11.2018 | Lecture: Data Quality and Data Fusion | Lecture: Data Quality and Data Fusion
14.11.2018 | Project: Data Fusion | Project Work: Data Fusion
21.11.2018 | Project Work: Data Fusion | Project Work: Data Fusion
28.11.2018 | Project Work: Data Fusion | Project Work: Data Fusion
5.12.2018 | Presentation of project results | Presentation of project results
Slides
- Slide set: Introduction and Course Outline
- Slide set: Types of Structured Data on the Web
- Slide set: Data Exchange Formats – Part 1
- Slide set: Data Exchange Formats – Part 2
- Exercise: Data Exchange Formats (Data, Code, Solution)
- Slide set: Schema Mapping and Data Translation
- Slide set: Introduction to the Student Projects
- Exercise: Schema Mapping (Data, Solution)
- Slide set: Identity Resolution
- Exercise: Identity Resolution (Slides, Code, Solution)
- Slide set: Data Fusion
- Exercise: Data Fusion (Slides, Code, Solution)
Registration and Participation
- The lecture and the projects are open to students of the Mannheim Master in Data Science and Master Business Informatics.
- The lecture (IE670) has no limit on the number of participants.
- The projects (IE683) are limited to 60 participants in total (30+30).
- Please register for the projects (IE683) via Portal2.
- The registration will close on Wednesday, 29 August 2018, 10:15 am.
- Once the registration is closed, we will assign the places in the projects, giving preference to students in higher semesters rather than to students who registered early, as was done in previous semesters.
Requirements
- Programming skills in Java are required for the projects as we are going to use the Winte.r framework.
Course Evaluation
- HWS2017 results of the evaluation of the course by the participants.
- HWS2015 results of the evaluation of the course by the participants.
- HWS2014 results of the evaluation of the course by the participants.
- HWS2013 results of the evaluation of the course by the participants.
Tools
Literature
AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007. (Free PDF Version)
Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Serge Abiteboul et al.: Web Data Management. Cambridge University Press, 2012.
Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
Felix Naumann: An Introduction to Duplicate Detection. Morgan & Claypool, 2012.
Peter Christen: Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.