Large-Scale Data Integration Seminar (HWS 2022)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale identity resolution, schema matching, table annotation, data fusion, data search, and data exploration. The specific focus of the HWS 2022 edition of the seminar is Data Integration using Deep Learning. The topics of the seminar are organized around this survey/vision paper: Thirumuruganathan, Tang, Ouzzani, Doan: Data Curation with Deep Learning. EDBT 2020.

Organization

Goals

In this seminar, you will

  • read, understand, and explore scientific literature
  • critically summarize the state-of-the-art concerning your topic
  • give a presentation about your topic (before the submission of the report)

Requirements

Schedule

  1. Please register for the seminar via the centrally-coordinated seminar registration in Portal2.
  2. After you have been accepted into the seminar, please email us your three preferred topics from the list below.
    We will assign topics to students according to your preferences.
  3. Attend the kickoff meeting, in which we will discuss the general requirements for the reports and presentations and answer initial questions about the topics.
  4. You will be assigned a mentor, who provides guidance and one-to-one meetings.
  5. Work individually throughout the semester: explore the literature, create a presentation, and write a report.
  6. Give your presentation in a block seminar towards the end of the semester.
  7. Write and submit your seminar thesis by January 2023.

Topics

1. Entity Matching using Deep Learning

  • S. Mudgal et al., “Deep Learning for Entity Matching: A Design Space Exploration,” in Proceedings of the 2018 International Conference on Management of Data, New York, NY, USA, 2018, pp. 19–34.
  • N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Transactions on Knowledge Discovery from Data, vol. 15, no. 3, p. 52:1-52:37, Apr. 2021.
  • Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan, “Deep entity matching with pre-trained language models,” Proceedings of the VLDB Endowment, vol. 14, no. 1, pp. 50–60, Sep. 2020.
  • More references and benchmarks: Papers with Code: Entity Resolution

2. Entity Matching using Contrastive Learning

  • M. Almagro, D. Jiménez, D. Ortego, E. Almazán, and E. Martínez, “Block-SCL: Blocking Matters for Supervised Contrastive Learning in Product Matching.” arXiv:2207.02008 [cs], Jul. 05, 2022.
  • R. Wang, Y. Li, and J. Wang, “Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation.” arXiv:2207.04122 [cs], Jul. 08, 2022.
  • R. Peeters and C. Bizer, “Supervised Contrastive Learning for Product Matching.” in Companion Proceedings of the Web Conference 2022, Lyon, France, April 2022.

3. Entity Matching using Domain Adaptation

  • N. Kirielle, P. Christen, and T. Ranbaduge, “TransER: Homogeneous Transfer Learning for Entity Resolution,” in Proceedings of the 25th International Conference on Extending Database Technology, 2022, pp. 118–130.
  • M. Trabelsi, J. Heflin, and J. Cao, “DAME: Domain Adaptation for Matching Entities,” in Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, New York, NY, USA, Feb. 2022, pp. 1016–1024.
  • J. Tu et al., “Domain Adaptation for Deep Entity Resolution,” in Proceedings of the 2022 International Conference on Management of Data, New York, NY, USA, Jun. 2022, pp. 443–457.

4. Active Learning for Entity Matching

  • J. Huang, W. Hu, Z. Bao, Q. Chen, and Y. Qu, “Deep entity matching with adversarial active learning,” The VLDB Journal, Apr. 2022.
  • A. Jain, S. Sarawagi, and P. Sen, “Deep indexed active learning for matching heterogeneous entity representations,” Proceedings of the VLDB Endowment, vol. 15, no. 1, pp. 31–45, Sep. 2021.
  • A. Bogatu, N. W. Paton, M. Douthwaite, S. Davie, and A. Freitas, “Cost-effective Variational Active Entity Resolution,” in 2021 IEEE 37th International Conference on Data Engineering, Apr. 2021, pp. 1272–1283.
  • More references and benchmark results: Papers with Code: MusicBrainz20K

5. Deep Learning for Blocking

  • S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan. 2021. Deep learning for blocking in entity matching: a design space exploration. Proceedings of the VLDB Endowment 14, 11 (July 2021), 2459–2472.
  • W. Zhang, H. Wei, B. Sisman, L. Dong, C. Faloutsos, and D. Page. 2020. AutoBlock: A Hands-off Blocking Framework for Entity Matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM ’20), Association for Computing Machinery, New York, NY, USA, 744–752. 
  • R. Wang, Y. Li, and J. Wang, “Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation.” arXiv:2207.04122 [cs], Jul. 08, 2022.

6. Column Type Annotation in Tabular Data

  • Suhara, Y., Li, J., Li, Y., Zhang, D., Demiralp, Ç., Chen, C. and Tan, W.C. “Annotating columns with pre-trained language models.” In Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 1493–1503
  • Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç. and Tan, W.C. Sato: Contextual semantic type detection in tables. arXiv preprint arXiv:1911.06311. 2019
  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, Nov. 2020, pp. 307–319
  • More references and benchmarks: Papers with Code: Column Type Annotation

7. Column Pair Annotation in Tabular Data

  • Suhara, Y., Li, J., Li, Y., Zhang, D., Demiralp, Ç., Chen, C. and Tan, W.C. “Annotating columns with pre-trained language models.” In Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 1493–1503
  • D. Wang, P. Shiralkar, C. Lockard, B. Huang, X. L. Dong, and M. Jiang, “TCN: Table Convolutional Network for Web Table Interpretation,” in Proceedings of the Web Conference 2021, New York, NY, USA, Apr. 2021, pp. 4020–4032
  • Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Jiang, F., Gordon, A. and Lin, C.Y. LinkingPark: An automatic semantic table interpretation system. Journal of Web Semantics, 74, 2022, p.100733.
  • More references and benchmarks: Papers with Code: Columns Property Annotation

8. Cell Entity Annotation in Tabular Data

  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, Nov. 2020, pp. 307–319 
  • Huynh, V.P., Liu, J., Chabot, Y., Labbé, T., Monnin, P. and Troncy, R., DAGOBAH: Enhanced Scoring Algorithms for Scalable Annotations of Tabular Data. In SemTab@ISWC, Nov. 2020, pp. 27–39.
  • Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Jiang, F., Gordon, A. and Lin, C.Y. LinkingPark: An automatic semantic table interpretation system. Journal of Web Semantics, 74, 2022, p.100733.
  • More references and benchmarks: Papers with Code: Cell Entity Annotation

9. Representation Learning for Tabular Data

  • D. Wang, P. Shiralkar, C. Lockard, B. Huang, X. L. Dong, and M. Jiang, “TCN: Table Convolutional Network for Web Table Interpretation,” in Proceedings of the Web Conference 2021, New York, NY, USA, Apr. 2021, pp. 4020–4032
  • H. Iida, D. Thai, V. Manjunatha, and M. Iyyer, “TABBIE: Pretrained Representations of Tabular Data,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, Jun. 2021, pp. 3446–3456
  • Z. Wang et al., “TUTA: Tree-based Transformers for Generally Structured Table Pre-training,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, NY, USA, Aug. 2021, pp. 1780–1790
  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, Nov. 2020, pp. 307–319

10. Representation Learning for Data Cleansing/Missing Value Imputation

  • Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. Proceedings of Machine Learning and Systems 2 (March 2020), 307–325.
  • Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2022. TURL: Table Understanding through Representation Learning. SIGMOD Rec. 51, 1 (June 2022), 33–40. 
  • Ihab F. Ilyas and Theodoros Rekatsinas. 2022. Machine Learning and Data Cleaning: Which Serves the Other? J. Data and Information Quality 14, 3 (September 2022), 1–11.

11. Information Extraction for E-Commerce Product Data

  • Xinyang Zhang, Chenwei Zhang, Xian Li, Xin Luna Dong, Jingbo Shang, Christos Faloutsos, and Jiawei Han. 2022. OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision. In Proceedings of the ACM Web Conference 2022, ACM, Virtual Event, Lyon, France, 3153–3161.
  • Huimin Xu, Wenting Wang, Xin Mao, Xinyu Jiang, and Man Lan. 2019. Scaling up Open Tagging from Tens to Thousands: Comprehension Empowered Attribute Value Extraction from Product Title. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 5214–5223.
  • Gilad Fuchs and Yoni Acriche. 2022. Product Titles-to-Attributes As a Text-to-Text Task. In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics, Dublin, Ireland, 91–98.

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.