Paper on LLM-based Table Annotation presented at ADBIS 2025

Keti Korini presented our paper "Evaluating Knowledge Generation and Self-refinement Strategies for LLM-Based Column Type Annotation" at the 29th European Conference on Advances in Databases and Information Systems (ADBIS 2025) in Tampere, Finland.

Abstract:  

Understanding the semantics of columns in relational tables is an important pre-processing step for indexing data lakes in order to provide rich data search. An approach to establishing such understanding is column type annotation (CTA), where the goal is to annotate table columns with terms from a given vocabulary. This paper experimentally compares knowledge generation and self-refinement strategies for LLM-based CTA. The strategies include using LLMs to generate term definitions, error-based refinement of term definitions, and fine-tuning using examples and term definitions. We evaluate these strategies along two dimensions: effectiveness, measured as F1 performance, and efficiency, measured in terms of token usage and cost. Our experiments show that using training data to generate label definitions outperforms using the same data as demonstrations for in-context learning for two out of three datasets using OpenAI models. The experiments further show that using the LLMs to refine label definitions brings an average increase of 3.9% F1 in most setups compared to the performance of the non-refined definitions. Combining fine-tuned models with self-refined term definitions results in the overall highest performance for gpt-4o, outperforming zero-shot prompting of fine-tuned models by at least 3% F1. The cost analysis shows that self-refinement via prompting is more cost-efficient than fine-tuning for use cases in which smaller numbers of tables need to be annotated.
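To illustrate the definition-based prompting strategy the abstract describes, the following is a minimal, hypothetical sketch of how a zero-shot CTA prompt could be assembled from generated label definitions and sample column values. The function name, the vocabulary, and the prompt wording are illustrative assumptions, not the paper's actual setup.

```python
def build_cta_prompt(column_values, label_definitions):
    """Assemble a zero-shot CTA prompt that embeds label definitions.

    Hypothetical sketch: the real prompts and vocabulary in the paper
    may differ; this only shows the general shape of the strategy.
    """
    # List each vocabulary term together with its (LLM-generated) definition.
    defs = "\n".join(
        f"- {label}: {definition}"
        for label, definition in label_definitions.items()
    )
    # Show a sample of the column's values to be annotated.
    values = ", ".join(column_values)
    return (
        "Annotate the table column below with exactly one label "
        "from the vocabulary.\n"
        f"Vocabulary with definitions:\n{defs}\n"
        f"Column values: {values}\n"
        "Answer with the label only."
    )

# Illustrative vocabulary with two invented label definitions.
definitions = {
    "country": "The name of a sovereign state, e.g. 'France'.",
    "currency": "The name or ISO code of a monetary unit, e.g. 'EUR'.",
}
prompt = build_cta_prompt(["Germany", "Finland", "Japan"], definitions)
print(prompt)
```

The resulting string would then be sent to the LLM; the self-refinement strategy iterates on the definitions passed in, based on annotation errors observed on training data.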

Link to the paper (pre-print on arXiv)

Presentation slides

Link to conference website
