Large-Scale Data Integration Seminar (FSS 2023)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale entity matching, schema matching, table annotation, data cleansing, data fusion, data search, and data exploration. The specific focus of the FSS 2023 edition of the seminar is data integration using large language models.

Organization

Goals

In this seminar, you will

  • read, understand, and explore scientific literature
  • critically summarize the state-of-the-art concerning your topic
  • give a presentation about your topic (before the submission of the report)

Requirements

Schedule

  1. Register for the seminar via the centrally coordinated seminar registration in Portal2.
  2. After you have been accepted into the seminar, email us your three preferred topics from the list below.
    We will assign topics to students according to their preferences.
  3. Attend the kickoff meeting, in which we will discuss the general requirements for the reports and presentations and answer initial questions about the topics.
  4. You will be assigned a mentor, who provides guidance in one-to-one meetings.
  5. Work individually throughout the semester: explore the literature, create a presentation, and write a report.
  6. Give your presentation in a block seminar towards the end of the semester.
  7. Write and submit your seminar thesis by July 2023.

Topics

1. Entity Matching using Domain Adaptation

  • N. Kirielle, P. Christen, and T. Ranbaduge, “TransER: Homogeneous Transfer Learning for Entity Resolution,” in Proceedings of the 25th International Conference on Extending Database Technology, 2022, pp. 118–130
  • M. Trabelsi, J. Heflin, and J. Cao, “DAME: Domain Adaptation for Matching Entities,” in Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, New York, NY, USA, Feb. 2022, pp. 1016–1024
  • J. Tu et al., “Domain Adaptation for Deep Entity Resolution,” in Proceedings of the 2022 International Conference on Management of Data, New York, NY, USA, Jun. 2022, pp. 443–457
  • More references and benchmarks: Papers with Code: Entity Resolution

2. Empirical Topic: Evaluating ChatGPT on the Task of Entity Matching

  • A. Venkatesh et al., “On Evaluating and Comparing Open Domain Dialog Systems.” arXiv:1801.03625 [cs], Dec. 2018.
  • Q. Dong et al., “A Survey for In-context Learning.” arXiv:2301.00234 [cs], Dec. 2022.
  • N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Transactions on Knowledge Discovery from Data, vol. 15, no. 3, pp. 52:1–52:37, Apr. 2021.
  • A. Primpeli and C. Bizer, “Profiling Entity Matching Benchmark Tasks,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, New York, NY, USA, Oct. 2020, pp. 3101–3108.
  • help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api

3. Deep Learning for Blocking

  • S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan. 2021. Deep learning for blocking in entity matching: a design space exploration. Proceedings of the VLDB Endowment 14, 11 (July 2021), 2459–2472.
  • W. Zhang, H. Wei, B. Sisman, L. Dong, C. Faloutsos, and D. Page. 2020. AutoBlock: A Hands-off Blocking Framework for Entity Matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM ’20), Association for Computing Machinery, New York, NY, USA, 744–752. 
  • R. Wang, Y. Li, and J. Wang, “Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation.” arXiv:2207.04122 [cs], Jul. 08, 2022.

4. Deep Learning for Table Search

  • G. Fan, J. Wang, Y. Li, D. Zhang, and R. Miller. 2023. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. arXiv.
  • A. Bogatu, A. A. Fernandes, N. W. Paton, and A. Konstantinou. 2020. Dataset Discovery in Data Lakes. In IEEE 36th International Conference on Data Engineering (ICDE), 709–720.
  • A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu. 2012. Finding Related Tables. In SIGMOD.

5. Representation Learning for Missing Value Imputation

  • Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems 2 (March 2020), 307–325.
  • Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2022. TURL: Table Understanding through Representation Learning. SIGMOD Rec. 51, 1 (June 2022), 33–40.
  • J. Yoon, J. Jordon, and M. van der Schaar. 2018. GAIN: Missing Data Imputation using Generative Adversarial Nets. In Proceedings of the 35th International Conference on Machine Learning, PMLR, 5689–5698.
  • Ihab F. Ilyas and Theodoros Rekatsinas. 2022. Machine Learning and Data Cleaning: Which Serves the Other? J. Data and Information Quality 14, 3 (September 2022), 1–11.

6. Empirical Topic: Evaluating ChatGPT on the Task of Missing Value Imputation

  • A. Venkatesh et al., “On Evaluating and Comparing Open Domain Dialog Systems.” arXiv:1801.03625 [cs], Dec. 2018.
  • Q. Dong et al., “A Survey for In-context Learning.” arXiv:2301.00234 [cs], Dec. 2022.
  • Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems 2 (March 2020), 307–325.
  • Ihab F. Ilyas and Theodoros Rekatsinas. 2022. Machine Learning and Data Cleaning: Which Serves the Other? J. Data and Information Quality 14, 3 (September 2022), 1–11.
  • help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api

7. Schema Matching using Deep Learning

  • Zhang, Jing, et al. “SMAT: An attention-based deep learning solution to the automation of schema matching.” European Conference on Advances in Databases and Information Systems. Springer, Cham, 2021.
  • Shraga, Roee, Avigdor Gal, and Haggai Roitman. “ADnEV: Cross-domain schema matching using deep similarity matrix adjustment and evaluation.” Proceedings of the VLDB Endowment 13.9 (2020): 1401–1415.
  • Koutras, Christos, et al. “REMA: Graph Embeddings-based Relational Schema Matching.” EDBT/ICDT Workshops. 2020.
  • Rahm, E., Bernstein, P. A survey of approaches to automatic schema matching. The VLDB Journal 10 (2001), 334–350.

8. Cell Entity Annotation in Tabular Data

  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, Nov. 2020, pp. 307–319
  • Huynh, V.P., Liu, J., Chabot, Y., Labbé, T., Monnin, P. and Troncy, R., DAGOBAH: Enhanced Scoring Algorithms for Scalable Annotations of Tabular Data. In SemTab@ISWC, Nov. 2020, pp. 27–39.
  • Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Jiang, F., Gordon, A. and Lin, C.Y. LinkingPark: An automatic semantic table interpretation system. Journal of Web Semantics, 74, 2022, p. 100733.
  • More references and benchmarks: Papers with Code: Cell Entity Annotation

9. Deep Tabular Learning for Domain-Specific Prediction Tasks

  • Yoon, Jinsung, et al. “VIME: Extending the success of self- and semi-supervised learning to tabular domain.” Advances in Neural Information Processing Systems 33 (2020): 11033–11043.
  • Somepalli, Gowthami, et al. “SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training.” arXiv preprint arXiv:2106.01342 (2021).
  • Gharibshah, Zhabiz, and Xingquan Zhu. “Local Contrastive Feature Learning for Tabular Data.” Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.
  • Borisov, Vadim, Tobias Leemann, et al. “Deep neural networks and tabular data: A survey.” IEEE Transactions on Neural Networks and Learning Systems (2022).

10. Information Extraction for E-Commerce Product Data

  • Xinyang Zhang, Chenwei Zhang, Xian Li, Xin Luna Dong, Jingbo Shang, Christos Faloutsos, and Jiawei Han. 2022. OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision. In Proceedings of the ACM Web Conference 2022, ACM, Virtual Event, Lyon, France, 3153–3161.
  • Huimin Xu, Wenting Wang, Xin Mao, Xinyu Jiang, and Man Lan. 2019. Scaling up Open Tagging from Tens to Thousands: Comprehension Empowered Attribute Value Extraction from Product Title. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 5214–5223.
  • Gilad Fuchs and Yoni Acriche. 2022. Product Titles-to-Attributes As a Text-to-Text Task. In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics, Dublin, Ireland, 91–98.

11. Empirical Topic: Evaluating GPT-3 on the Task of Product Information Extraction

  • A. Venkatesh et al., “On Evaluating and Comparing Open Domain Dialog Systems.” arXiv:1801.03625 [cs], Dec. 2018.
  • Q. Dong et al., “A Survey for In-context Learning.” arXiv:2301.00234 [cs], Dec. 2022.
  • Li Yang et al.: MAVE: A Product Dataset for Multi-source Attribute Value Extraction. WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022.
  • OpenAI Playground Example: https://beta.openai.com/playground/p/default-parse-data

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.