1. Data Lakes: Concepts, Functionalities, Examples
- Hai, Rihan, Christoph Quix, and Matthias Jarke. “Data lake concept and systems: a survey.” arXiv preprint arXiv:2106.09592 (2021).
- Sawadogo, Pegdwendé, and Jérôme Darmont. “On data lake architectures and metadata management.” Journal of Intelligent Information Systems 56.1 (2021): 97–120.
- Nargesian, Fatemeh, et al. “Data lake management: challenges and opportunities.” Proceedings of the VLDB Endowment 12.12 (2019): 1986–1989.
2. Comparison of Data Lake Management Platforms
- Hai, Rihan, Christoph Quix, and Matthias Jarke. “Data lake concept and systems: a survey.” arXiv preprint arXiv:2106.09592 (2021).
- Sawadogo, Pegdwendé, and Jérôme Darmont. “On data lake architectures and metadata management.” Journal of Intelligent Information Systems 56.1 (2021): 97–120.
3. Data Lake Profiling
- Hai, Rihan, Christoph Quix, and Matthias Jarke. “Data lake concept and systems: a survey.” arXiv preprint arXiv:2106.09592 (2021).
- Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
- Mohamed Ellefi, et al.: RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.
4. Dataset Search within Data Lake
- Trabelsi, et al.: Improved Table Retrieval Using Multiple Context Embeddings for Attributes. Big Data 2019.
- Chapman, Adriane, et al. “Dataset search: a survey.” The VLDB Journal 29.1 (2020): 251–272.
- Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19), Association for Computing Machinery, New York, NY, USA, 847–864.
5. Entity Search within Data Lakes
- Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, and David Page. 2020. AutoBlock: A Hands-off Blocking Framework for Entity Matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM ’20), Association for Computing Machinery, New York, NY, USA, 744–752.
- Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, and AnHai Doan. Deep Learning for Blocking in Entity Matching: A Design Space Exploration. 14.
- N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Trans. Knowl. Discov. Data, vol. 15, no. 3, p. 52:1–52:37, Apr. 2021
6. Metadata for Dataset Search
- Chapman, Adriane, et al. “Dataset search: a survey.” The VLDB Journal 29.1 (2020): 251–272.
- Alon Halevy, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Goods: Organizing Google’s Datasets. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16), Association for Computing Machinery, New York, NY, USA, 795–806
- Omar Benjelloun, Shiyu Chen, and Natasha Noy. 2020. Google Dataset Search by the Numbers. In The Semantic Web – ISWC 2020 (Lecture Notes in Computer Science), Springer International Publishing, Cham, 667–682.
7. Entity Matching using Deep Learning
- N. Barlaug and J. A. Gulla, “Neural Networks for Entity Matching: A Survey,” ACM Trans. Knowl. Discov. Data, vol. 15, no. 3, p. 52:1–52:37, Apr. 2021
- C. Ge, P. Wang, L. Chen, X. Liu, B. Zheng, and Y. Gao, “CollaborER: A Self-supervised Entity Resolution Framework Using Multi-features Collaboration,” arXiv:2108.08090 [cs], Sep. 2021
- M. Loster, I. Koumarelas, and F. Naumann, “Knowledge Transfer for Entity Resolution with Siamese Neural Networks,” J. Data and Information Quality, vol. 13, no. 1, p. 2:1–2:25, Jan. 2021
8. Schema Matching using Deep Learning (Annotating table columns using a Knowledge Base)
- Y. Suhara et al., “Annotating Columns with Pre-trained Language Models,” arXiv:2104.01785 [cs], Apr. 2021
- X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, pp. 307–319, Nov. 2020.
- E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, and K. Srinivas, “SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems,” in The Semantic Web, Cham, 2020, pp. 514–530
9. Schema Matching using Deep Learning (Matching columns across multiple tables)
- R. Shraga, A. Gal, and H. Roitman, “ADnEV: cross-domain schema matching using deep similarity matrix adjustment and evaluation,” Proc. VLDB Endow., vol. 13, no. 9, pp. 1401–1415, May 2020
- J. Zhang, B. Shin, J. D. Choi, and J. C. Ho, “SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching,” in Advances in Databases and Information Systems, Cham, 2021, pp. 260–274
- E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” The VLDB Journal, vol. 10, no. 4, pp. 334–350, 2001.
10. Embedding Methods for Tabular Data
- H. Iida, D. Thai, V. Manjunatha, and M. Iyyer, “TABBIE: Pretrained Representations of Tabular Data,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, Jun. 2021, pp. 3446–3456
- D. Wang, P. Shiralkar, C. Lockard, B. Huang, X. L. Dong, and M. Jiang, “TCN: Table Convolutional Network for Web Table Interpretation,” in Proceedings of the Web Conference 2021, New York, NY, USA, Apr. 2021, pp. 4020–4032
- Z. Wang et al., “TUTA: Tree-based Transformers for Generally Structured Table Pre-training,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, NY, USA, Aug. 2021, pp. 1780–1790
- X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu, “TURL: table understanding through representation learning,” Proc. VLDB Endow., vol. 14, no. 3, pp. 307–319, Nov. 2020.