Large-Scale Data Integration Seminar (HWS 2021)

This seminar covers topics related to integrating data from large numbers of independent data sources. The specific focus of the HWS 2021 edition of the seminar is the content and dynamics of the Web of Data. Structured data is published on the Web in various formats, including (HTML) tables, Web APIs, Linked Data, and HTML-embedded data, as well as in the form of knowledge graphs such as Wikidata or DBpedia. See this slideset for additional information and some statistics about the deployment of the different formats.

The goal of the HWS 2021 edition of the seminar is to empirically analyze the content and the dynamics of the Web of Data. To this end, participants will first review the existing empirical findings concerning a specific type of structured data on the Web. They will then identify gaps in the state-of-the-art knowledge about the adoption, content, and usage of that type of data and will try to partly fill one of these gaps by gathering additional statistics or performing a small automated content analysis.
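To give a rough idea of what "gathering additional statistics" about HTML-embedded data could look like, the following minimal sketch counts how often each schema.org type occurs in a set of web pages. It is only an illustration under several assumptions: it considers JSON-LD markup only (Microdata and RDFa are ignored), the names JsonLdExtractor, iter_types, count_schema_types, and SAMPLE_PAGES are hypothetical, and the two sample pages are made up for the example.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def iter_types(node):
    """Yield every string-valued @type found anywhere in a parsed JSON-LD structure."""
    if isinstance(node, dict):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list):
            for t in types:
                if isinstance(t, str):
                    yield t
        for value in node.values():
            yield from iter_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_types(item)


def count_schema_types(html_pages):
    """Count @type occurrences over an iterable of HTML strings."""
    counts = Counter()
    for html in html_pages:
        parser = JsonLdExtractor()
        parser.feed(html)
        parser.close()
        for block in parser.blocks:
            try:
                data = json.loads(block)
            except json.JSONDecodeError:
                continue  # skip blocks that are not valid JSON
            counts.update(iter_types(data))
    return counts


# Made-up sample pages, for illustration only.
SAMPLE_PAGES = [
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Hotel", "name": "Example Inn",'
    ' "address": {"@type": "PostalAddress", "addressLocality": "Mannheim"}}'
    '</script></head></html>',
    '<html><body><script type="application/ld+json">'
    '[{"@type": "Product", "name": "A"}, {"@type": ["Product", "Thing"]}]'
    '</script><script type="application/ld+json">{not valid json</script>'
    '</body></html>',
]

if __name__ == "__main__":
    for type_name, n in count_schema_types(SAMPLE_PAGES).most_common():
        print(type_name, n)
```

An actual analysis would run over a much larger page sample and would likely also need to cover the other embedding syntaxes; the sketch only shows the general shape of such a count.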

Organization

Goals

In this seminar, you will

  • Read, understand, and explore scientific literature
  • Critically summarize the state-of-the-art concerning your topic
  • Fill a gap in the state of the art by gathering new statistics or performing a content analysis
  • Give a presentation about your topic (before the submission of the report)

Requirements

Schedule

  1. Please register for the seminar via the centrally-coordinated seminar registration
  2. After you have been accepted into the seminar, please email us your three preferred topics from the list below.
    We will assign topics to students according to their preferences
  3. Attend the kickoff meeting in which we will discuss general requirements for the reports and presentations as well as answer initial questions about the topics
  4. You will be assigned a mentor, who provides guidance and one-to-one meetings
  5. Work individually throughout the semester: explore literature, create a presentation, and write a report
  6. Give your presentation in a block seminar towards the end of the semester

Topics

1. Schema.org Local Business/Hotel Data – Structure, Dynamics, and Applications

  • Meusel, Robert, Christian Bizer, and Heiko Paulheim. „A web-scale study of the adoption and evolution of the schema.org vocabulary over time.“ Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 2015.
  • Kärle E., Fensel A., Toma I., Fensel D. (2016) Why Are There More Hotels in Tyrol than in Austria? Analyzing Schema.org Usage in the Hotel Domain. In: Information and Communication Technologies in Tourism 2016.
  • https://developers.google.com/search/docs/advanced/structured-data/local-business

2. Schema.org Product Data – Structure, Dynamics, and Applications

  • Meusel, Robert, Christian Bizer, and Heiko Paulheim. „A web-scale study of the adoption and evolution of the schema.org vocabulary over time.“ Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 2015.
  • Peeters, Ralph, et al. „Using schema.org annotations for training and maintaining product matchers.“ Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics. 2020.
  • https://developers.google.com/search/docs/advanced/structured-data/product

3. Schema.org Event Data – Structure, Dynamics, and Applications

  • Meusel, Robert, Christian Bizer, and Heiko Paulheim. „A web-scale study of the adoption and evolution of the schema.org vocabulary over time.“ Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 2015.
  • Foley J, Bendersky M, Josifovski V (2015) Learning to extract local events from the web. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 423–432
  • https://developers.google.com/search/docs/advanced/structured-data/event

4. Schema.org Data Set Metadata – Structure, Dynamics, and Applications

  • Brickley, Dan, Matthew Burgess, and Natasha Noy. „Google Dataset Search: Building a search engine for datasets in an open Web ecosystem.“ The World Wide Web Conference. 2019.
  • Benjelloun, Omar, Shiyu Chen, and Natasha Noy. „Google dataset search by the numbers.“ International Semantic Web Conference. Springer, Cham, 2020.
  • Chapman, Adriane, et al. „Dataset search: a survey.“ The VLDB Journal 29.1 (2020): 251-272.
  • https://developers.google.com/search/docs/advanced/structured-data/dataset

5. Schema.org Job Posting Data – Structure, Dynamics, and Applications

  • Meusel, Robert, Christian Bizer, and Heiko Paulheim. „A web-scale study of the adoption and evolution of the schema.org vocabulary over time.“ Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 2015.
  • https://developers.google.com/search/docs/advanced/structured-data/job-posting

6. Schema.org Course Data – Structure, Dynamics, and Applications

  • Meusel, Robert, Christian Bizer, and Heiko Paulheim. „A web-scale study of the adoption and evolution of the schema.org vocabulary over time.“ Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 2015.
  • https://developers.google.com/search/docs/advanced/structured-data/course

7. Schema.org Q&A Data – Structure, Dynamics, and Applications

  • Meusel, Robert, Christian Bizer, and Heiko Paulheim. „A web-scale study of the adoption and evolution of the schema.org vocabulary over time.“ Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 2015.
  • https://developers.google.com/search/docs/advanced/structured-data/qapage

8. Linked Data – Topics, Dynamics, Best Practices, and Applications

  • Debattista, Jeremy, et al. „Evaluating the quality of the LOD cloud: An empirical investigation.“ Semantic Web 9.6 (2018): 859-901.
  • Debattista, Jeremy, et al. „Is the LOD cloud at risk of becoming a museum for datasets? Looking ahead towards a fully collaborative and sustainable LOD cloud.“ Companion Proceedings of The 2019 World Wide Web Conference. 2019.
  • Herrera, et al. „BTC-2019: The 2019 Billion Triple Challenge Dataset.“ Proceedings of the International Semantic Web Conference. 2019.
  • Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. „Adoption of the linked data best practices in different topical domains.“ International Semantic Web Conference. Springer, Cham, 2014.

9. Web APIs – Topics, Dynamics, and Applications

  • M. Maleshkova, C. Pedrinaci, and J. Domingue, “Investigating web APIs on the world wide web,” in Proc. of European Conference on Web Services (ECOWS), 2010.
  • Sohan, S. M., Craig Anslow, and Frank Maurer. „A case study of web API evolution.“ 2015 IEEE World Congress on Services. IEEE, 2015.
  • https://www.programmableweb.com/api-research

10. Open Data Repositories – Topics, Dynamics, and Applications

  • Neumaier S, Umbrich J, Polleres A. Automated quality assessment of metadata across open data portals. Journal of Data and Information Quality (JDIQ). 2016 Oct 25;8(1):1-29.
  • Kindling, Maxi, et al. „The landscape of research data repositories in 2015: A re3data analysis.“ D-Lib Magazine 23.3/4 (2017).
  • Hulsebos M, Demiralp Ç, Groth P. GitTables: A Large-Scale Corpus of Relational Tables. arXiv preprint arXiv:2106.07258. 2021.

11. Knowledge Graphs – Topics, Dynamics, and Applications

  • Färber, Michael, et al. „Linked data quality of dbpedia, freebase, opencyc, wikidata, and yago.“ Semantic Web 9.1 (2018): 77-129.
  • Heist, Paulheim: „Knowledge Graphs on the Web – An Overview.“ arXiv preprint arXiv:2003.00719 (2020).
  • Shenoy, Kartik, et al. „A Study of the Quality of Wikidata.“ arXiv preprint arXiv:2107.00156 (2021).

 

 Students are free to suggest additional topics of their choice that are related to large-scale data integration.


Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.