Web Data Integration (HWS2023)

Data integration is one of the key challenges in many IT projects, and it is estimated that data scientists spend about 80% of their time on data integration and data preparation. In the enterprise context, data integration techniques are applied whenever data from separate sources needs to be combined for new applications or data analysis projects. In the context of the Web, data integration lays the foundation for taking advantage of the ever-growing number of publicly accessible data sources and enables applications such as product comparison portals, job portals, or data search engines.

In the course, students will learn and experiment with techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:

  1. Heterogeneity and Distributedness

  2. The Data Integration Process

  3. Structured Data on the Web

  4. Data Exchange Formats

  5. Schema Mapping and Data Translation

  6. Identity Resolution

  7. Data Quality Assessment

  8. Data Fusion
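
To give a flavour of how topics 5, 6, and 8 fit together, the following is a minimal toy sketch and not the method or tooling used in the course. The product records, the attribute mapping, the 0.9 similarity threshold, and all function names are made up for illustration. The sketch translates one source into the schema of the other, matches records by name similarity, and fuses each matched pair by taking the first non-missing value.

```python
from difflib import SequenceMatcher

# Two toy sources describing products with different schemas (invented data).
source_a = [
    {"name": "Acme Phone X", "price_eur": None},
    {"name": "Globex Tablet 10", "price_eur": 329.0},
]
source_b = [
    {"title": "ACME Phone X", "cost": 489.0},
    {"title": "Initech Laptop", "cost": 899.0},
]

# Schema mapping: correspondences from source B's attributes to source A's.
MAPPING_B = {"title": "name", "cost": "price_eur"}


def translate(record, mapping):
    """Data translation: rename attributes according to the mapping."""
    return {mapping.get(key, key): value for key, value in record.items()}


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(records_a, records_b, threshold=0.9):
    """Identity resolution: index pairs whose names are similar enough."""
    return [
        (i, j)
        for i, a in enumerate(records_a)
        for j, b in enumerate(records_b)
        if similarity(a["name"], b["name"]) >= threshold
    ]


def fuse(a, b):
    """Data fusion: keep A's value unless it is missing, then fall back to B."""
    return {key: a[key] if a[key] is not None else b.get(key) for key in a}


translated_b = [translate(record, MAPPING_B) for record in source_b]
matches = match(source_a, translated_b)
fused = [fuse(source_a[i], translated_b[j]) for i, j in matches]
print(fused)
```

Comparing every record of one source with every record of the other is quadratic in the number of records, which is acceptable only for toy inputs like these.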

The course consists of a lecture and accompanying practical projects. The lecture (IE670) covers the theory and methods of web data integration and concludes with a written exam (3 ECTS). In the projects (IE683), students will gain experience with web data integration methods by applying them within a real-world use case of their choice. Students will work on their projects in teams and will present their results in a written report and an oral presentation (together 3 ECTS). While the lecture and the project can be attended in separate years, it is highly recommended to attend both in the same semester, as the schedules of the lecture and the project are aligned with each other.

Exam Review

The exam review will take place on 28 February 2024 at 12:00. Please register for the review by 24 February by sending an email to Alexander Brinkmann.

Time and Location

  • Wednesday, 15:30–17:00. Location: B6 A101 (Starting: 13.9.2023)
  • Thursday, 10:15–11:45. Location: B6 A101 (Starting: 7.9.2023)

ECTS

  • 3 ECTS: Lecture with written exam (IE670)
  • 3 ECTS: 70 % project report, 30 % presentation (IE683)

Requirements

Outline and Course Material

Week | Wednesday (Room: B6 A101) | Thursday (Room: B6 A101)
06.09.2023 | no lecture | Lecture: Introduction to Web Data Integration (Slides)
13.09.2023 | Lecture: Structured Data on the Web (Slides) | Lecture: Data Exchange Formats (Slides)
20.09.2023 | Lecture: Data Exchange Formats (Slides, Exercise, Solution) | Lecture: Schema Mapping (Slides)
27.09.2023 | Lecture: Schema Mapping | Project: Introduction to Student Projects (Slides)
04.10.2023 | Exercise: Introduction to MapForce (Exercise, Solution) | Coaching: Schema Mapping
11.10.2023 | Project: Feedback about Project Outlines | Lecture: Identity Resolution (Slides)
18.10.2023 | Lecture: Identity Resolution | Exercise: Identity Resolution (Exercise, Solution)
25.10.2023 | Project Work: Identity Resolution | Coaching: Identity Resolution
02.11.2023 | Public Holiday | Coaching: Identity Resolution
08.11.2023 | Lecture: Data Quality and Data Fusion (Slides, Questions) | Lecture: Data Quality and Data Fusion
15.11.2023 | Exercise: Data Quality and Data Fusion (Exercise, Solution) | Project Work: Data Quality and Data Fusion
22.11.2023 | Project Work: Data Quality and Fusion | Coaching: Data Quality and Fusion
30.11.2023 | Project Work: Data Quality and Fusion | Coaching: Data Quality and Fusion
06.12.2023 | Presentation of Project Results | Presentation of Project Results
18.12.2023 | Final Exam (A5 B144, 12:30 – 13:30)