Web Data Integration (HWS2022)
Data integration is one of the key challenges in many IT projects, and it is estimated that data scientists spend about 80% of their time on data integration and cleansing. In the enterprise context, data integration techniques are applied whenever data from separate sources needs to be combined for new applications or data analysis projects. Within the context of the Web, data integration lays the foundation for taking advantage of the ever-growing number of publicly accessible data sources and enables applications such as product comparison portals, job portals, or data search engines.
In the course, students will learn and experiment with techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:
Heterogeneity and Distributedness
The Data Integration Process
Structured Data on the Web
Data Exchange Formats
Schema Mapping and Data Translation
Identity Resolution
Data Quality Assessment
Data Fusion
The course consists of a lecture as well as accompanying practical projects. The lecture (IE670) covers the theory and methods of web data integration and concludes with a written exam (3 ECTS). In the projects (IE683), students will gain experience with web data integration methods by applying them within a real-world use case of their choice. Students will work on their projects in teams and will present their results in a written report as well as an oral presentation (together 3 ECTS). While the lecture and the project can be attended in separate years, it is highly recommended to attend both in the same semester, as the schedules of the lecture and the project are aligned with each other.
Exam Review
The exam review will take place on the 1st of March 2023 at 2:00 pm. Please register for the review by the 24th of February by sending an email to Alexander.
Time and Location
- Wednesday, 15:30–17:00. Location: A5 C015 (Starting: 7.9.2022)
- Thursday, 10:15–11:45. Location: B6 A101 (Starting: 8.9.2022)
ECTS
- 3 ECTS: Lecture with written exam (IE670)
- 3 ECTS: 70 % project report, 30 % presentation (IE683)
Requirements
- Programming skills in Java are required for the projects as we are going to use the Winte.r framework.
Registration and Participation
- The lecture and the projects are open to students of the Mannheim Master in Data Science and Master Business Informatics.
- The lecture (IE670) has no limit on the number of participants, but you still need to register for the course in Portal2.
- The projects (IE683) are limited to a total of 75 participants. Registration for the projects (IE683) is done via Portal2. The registration period is 15 August to 5 September 2022.
Outline
Week | Wednesday (Room: A5 C015) | Thursday (Room: B6 A101) |
---|---|---|
07.9.2022 | Lecture: Introduction to Web Data Integration | Lecture: Structured Data on the Web |
14.9.2022 | Lecture: Data Exchange Formats | Lecture: Data Exchange Formats |
21.9.2022 | Lecture: Schema Mapping | Lecture: Schema Mapping |
28.9.2022 | Project: Introduction to Student Projects | Exercise: Introduction to MapForce |
05.10.2022 | Project: Feedback about Project Outlines | Coaching: Schema Mapping |
12.10.2022 | Project Work: Schema Mapping | Lecture: Identity Resolution |
19.10.2022 | Lecture: Identity Resolution | Exercise: Identity Resolution |
26.10.2022 | Project Work: Identity Resolution | Coaching: Identity Resolution |
02.11.2022 | Project Work: Identity Resolution | Coaching: Identity Resolution |
09.11.2022 | Lecture: Data Quality and Data Fusion | Lecture: Data Quality and Data Fusion |
16.11.2022 | Exercise: Data Quality and Data Fusion | Project Work: Data Quality and Data Fusion |
23.11.2022 | Project Work: Data Quality and Fusion | Coaching: Data Quality and Fusion |
30.11.2022 | Project Work: Data Quality and Fusion | Coaching: Data Quality and Fusion |
07.12.2022 | Presentation of Project Results | Presentation of Project Results |
Slides
- Slideset: Introduction and Course Organization
- Slideset: Types of Structured Data on the Web
- Slideset: Data Exchange Formats Part 1
- Slideset: Data Exchange Formats Part 2
- Slideset: Schema Mapping and Data Translation
- Slideset: Introduction to the Student Projects (IE683)
- Slideset: Identity Resolution
- Slideset: Data Fusion
- Example Exam Questions
- Slideset: Student Project Presentations
Exercises
Exercise 1: Data Exchange Formats (Solution)
Exercise 2: Schema Mapping (Solution)
Exercise 3: Identity Resolution (Solution)
Exercise 4: Data Fusion (Solution)
Lecture Videos
Video recordings of the Web Data Integration lectures from HWS2019 are available here.
Course Evaluation
- HWS2017 results of the evaluation of the course by the participants.
- HWS2015 results of the evaluation of the course by the participants.
- HWS2014 results of the evaluation of the course by the participants.
- HWS2013 results of the evaluation of the course by the participants.
Tools
Literature
AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007. (Free PDF Version)
Peter Christen: Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.