Web Data Integration (HWS2020)

Data integration is one of the key challenges in many IT projects, and it is estimated that data scientists spend about 80% of their time on data integration and cleansing. Within the enterprise context, data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications or data analysis projects. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, job portals, location-based mashups, and data search engines.

In the course, students will learn and experiment with techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:

  1. Heterogeneity and Distributedness

  2. The Data Integration Process

  3. Structured Data on the Web

  4. Data Exchange Formats

  5. Schema Mapping and Data Translation

  6. Identity Resolution

  7. Data Quality Assessment

  8. Data Fusion

The course consists of a lecture and accompanying practical projects. The lecture (IE670) covers the theory and methods of web data integration and concludes with a written exam (3 ECTS). In the projects (IE683), students gain experience with web data integration methods by applying them to a real-world use case of their choice. Students work on their projects in teams and report their results in a written report and an oral presentation (together 3 ECTS). While the lecture and the project can be attended in separate years, it is highly recommended to attend both in the same semester, as the schedules of the lecture and the project are aligned with each other.

Time and Location

  • Wednesday, 15:30-17:00. Location: WIM-ZOOM Room 6 (Starting: 30.9.2020)
  • Thursday, 10:15-11:45. Location: WIM-ZOOM Room 6 (Starting: 1.10.2020)

Instructors

ECTS

  • 3 ECTS: Lecture with written exam (IE670)
  • 3 ECTS: 70 % project report, 30 % presentation (IE683)

Requirements

  • Programming skills in Java are required for the projects, as we are going to use the WInte.r framework.

Registration and Participation

  • The lecture and the projects are open to students of the Mannheim Master in Data Science and Master Business Informatics.
  • The lecture (IE670) has no limit on the number of participants, and no registration is required to attend it.
  • The projects (IE683) are restricted to a total of 60 participants (30+30).
  • The registration for the projects (IE683) is done via Portal2. 
  • Once registration has closed, we will assign the places in the projects giving preference to students in higher semesters rather than to students who register early, as in the previous semesters.

Outline

The sessions set in bold will take place live via ZOOM.
For the other sessions, we will provide video recordings.

Week       | Wednesday                                     | Thursday
30.9.2020  | Lecture: Introduction to Web Data Integration | Lecture: Structured Data on the Web
7.10.2020  | Lecture: Data Exchange Formats                | Q&A: Data Exchange Formats
14.10.2020 | Lecture: Schema Mapping                       | Q&A: Schema Mapping
21.10.2020 | Project: Introduction to Student Projects     | Project: Preparation of Project Outlines
28.10.2020 | Project: Feedback about Project Outlines      | Exercise: Introduction to MapForce
4.11.2020  | Project: Schema Mapping                       | Lecture: Identity Resolution
11.11.2020 | Q&A: Identity Resolution                      | Exercise: Identity Resolution
18.11.2020 | Project Work: Identity Resolution             | Lecture: Data Quality and Data Fusion
25.11.2020 | Q&A: Data Quality and Data Fusion             | Exercise: Data Quality and Data Fusion
2.12.2020  | Project Work: Data Quality and Fusion         | Project Work: Data Quality and Fusion
9.12.2020  | Presentation of Project Results               | Presentation of Project Results