Best Paper Award at iiWAS 2024 |

We are pleased to announce that the paper “ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction” by Alexander Brinkmann, Roee Shraga, and Christian Bizer was awarded as best paper at the 26th International Conference on Information Integration and Web Intelligence (iiWAS).

The abstract and a link to the paper can be found below:

ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction

In order to facilitate features such as faceted product search and product comparison, e-commerce platforms require accurately structured product data, including precise attribute/value pairs. However, vendors oftentimes provide unstructured product descriptions consisting only of an offer title and a textual description. Consequently, extracting attribute values from titles and descriptions is vital for e-commerce platforms. State-of-the-art attribute value extraction (AVE) methods based on pre-trained language models (PLMs), such as BERT, face two drawbacks (i) the methods require significant amounts of task-specific training data and (ii) the fine-tuned models have problems with generalising to unseen attribute values that were not part of the training data. This paper explores the potential of using large language models (LLMs) as a more training data-efficient and more robust alternative to existing AVE methods. We propose different prompt templates for describing the target attributes of the extraction to the LLM, covering both zero-shot and few-shot scenarios. In the zero-shot scenario, textual and JSON-based target schema representations of the attributes are compared. In the few-shot scenario, we investigate (i) the provision of example attribute values, (ii) the selection of in-context demonstrations, (iii) shuffled ensembling to prevent position bias, and (iv) fine-tuning the LLM. We evaluate the prompt templates in combination with hosted LLMs, such as GPT-3.5 and GPT-4, and open-source LLMs which can be run locally. We compare the performance of the LLMs to the PLM-based methods SU-OpenTag, AVEQA, and MAVEQA. The highest average F1-score of 86% was achieved by GPT-4 using an ensemble of shuffled prompts that integrated a comprehensive target schema containing attribute descriptions and example values with demonstrations. Llama-3-70B performs only 3% worse than GPT-4, making it a competitive open-source alternative. Given the same training data, this prompt/GPT-4 combination outperforms the best PLM baseline by an average of 6% F1-score. Fine-tuning GPT-3.5 results in comparable performance to GPT-4 but harms the LLM’s ability to generalize.

Please find the slides for the talk at iiWAS2024 here: Slides.

Back