Web Data Integration (HWS2025)

Data integration is a key challenge in many IT projects. It is estimated that data scientists spend about 80% of their time on data integration and preparation. In the enterprise context, data integration techniques are applied whenever data from separate sources must be combined for new applications or analytical purposes. In the context of the Web, data integration lays the foundation for taking advantage of the ever-growing number of publicly accessible data sources and enables applications such as product comparison portals, job search platforms, and finance and real-estate data analysis.

In the course, students will learn and experiment with techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:

  1. Heterogeneity and Distributedness
  2. The Data Integration Process
  3. Structured Data on the Web
  4. Data Exchange Formats
  5. Information Extraction
  6. Schema Mapping and Data Translation
  7. Identity Resolution
  8. Data Quality Assessment
  9. Data Fusion

The course consists of a lecture and accompanying practical projects. The lecture (IE670) covers the theory and methods of web data integration and concludes with a written exam (3 ECTS). In the projects (IE683), students gain practical experience with web data integration methods by applying them to a real-world use case of their choice. Students work on their projects in teams and report their results in a written report and an oral presentation (3 ECTS in total). While the lecture and the project can be attended in separate years, it is highly recommended to attend both in the same semester, as the schedules of the lecture and the project are aligned with each other.

Time and Location

  • Wednesday, 15:30–17:00. Location: B6 D007 Garden House (Starting: 3.9.2025)
  • Thursday, 13:45–15:15. Location: B6 D007 Garden House (Starting: 4.9.2025)

ECTS

  • 3 ECTS: Lecture with written exam (IE670)
  • 3 ECTS: 70 % project report, 30 % presentation (IE683)

Outline and Course Material

Week | Wednesday (Room: B6 D007) | Thursday (Room: B6 D007)
03.09.2025 | Lecture: Introduction to Web Data Integration (Slides) | Lecture: Structured Data on the Web (Slides)
10.09.2025 | Lecture: Data Exchange Formats – Part 1 (Slides) | Lecture: Data Exchange Formats – Part 2
17.09.2025 | Exercise: JSON, XML, and Information Extraction | Lecture: Data Profiling (Slides)
24.09.2025 | Lecture: Schema Mapping (Slides) | Project: Introduction to Student Projects (Slides)
01.10.2025 | Exercise: Introduction to MapForce (Task) | Coaching: Schema Mapping
08.10.2025 | Project: Feedback about Project Outlines | Lecture: Identity Resolution (Slides)
15.10.2025 | Lecture: Identity Resolution (Slides) | Exercise: Identity Resolution (Task)
22.10.2025 | Project Work: Identity Resolution | Coaching: Identity Resolution
29.10.2025 | Project Work: Identity Resolution | Coaching: Identity Resolution
05.11.2025 | Lecture: Data Quality and Data Fusion (Slides) | Lecture: Data Quality and Data Fusion (Slides)
12.11.2025 | Exercise: Data Quality and Data Fusion (Task) | Project Work: Data Quality and Data Fusion
19.11.2025 | Project Work: Data Quality and Fusion | Coaching: Data Quality and Fusion
26.11.2025 | Project Work: Data Quality and Fusion | Coaching: Data Quality and Fusion
03.12.2025 | Presentation of Project Results | Presentation of Project Results
XX.12.2025 | Final Exam |

Requirements

  • Programming skills in Python and experience with the pandas and scikit-learn libraries are required for the exercises and projects.

Registration and Participation

  • The lecture and the projects are open to students of the Mannheim Master in Data Science and the Master in Business Informatics.
  • The lecture (IE670) has no limit on the number of participants, but you still need to register for the course in Portal2.
  • The projects (IE683) are limited to a total of 60 participants. Registration for the projects (IE683) is organized by the program management (Studiengangsmanagement) and is done via Portal2.