Background
Understanding the semantics of table elements is a central prerequisite for many data integration and data discovery tasks, such as knowledge base completion or dataset search. Table annotation is the task of annotating a table with terms from a knowledge graph, database schema, or vocabulary. Table annotation includes several subtasks, such as column type annotation, column property annotation, cell entity annotation, row annotation, and table type detection. Table annotation has attracted considerable attention in the research community in recent years, and there are active benchmarking campaigns on table annotation, such as the SemTab challenge.
The WDC SOTAB Benchmark
The SOTAB benchmark complements the set of publicly available table annotation benchmarks with a new benchmark that covers a wide range of entity types and offers a large amount of heterogeneous training data for these types from many independent data sources.
SOTAB features two annotation tasks: column type annotation (CTA) and column property annotation (CPA). The goal of SOTAB's CTA task is to annotate the columns of a table with one of 91 different Schema.org types, such as telephone, duration, location, or organization. The goal of the CPA task is to annotate pairs of table columns with one of 176 Schema.org properties, such as gtin13, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 59,548 tables annotated for CTA and 48,379 tables annotated for CPA. The data in the tables originates from 74,215 different websites. The tables are split into training, validation, and test sets for both tasks. The tables cover 17 popular Schema.org types, including Product, LocalBusiness, Event, and JobPosting. An evaluation of the SOTAB benchmark using the state-of-the-art table annotation systems TURL and DODUO has shown that the benchmark is quite challenging for current systems.
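To make the difference between the two tasks concrete, the following minimal Python sketch represents CTA and CPA annotations for a toy table and scores a set of predictions. It is an illustration only: the class names, the toy table, the choice of column 0 as the subject column, and the simple accuracy measure are assumptions made for this example and do not reflect SOTAB's actual file format or evaluation protocol. Only the label names (organization, telephone, startDate) are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnTypeAnnotation:
    """CTA: one label per annotated column (hypothetical structure)."""
    table_id: str
    column: int
    label: str


@dataclass(frozen=True)
class ColumnPropertyAnnotation:
    """CPA: one label per annotated pair of columns (hypothetical structure)."""
    table_id: str
    subject_column: int
    object_column: int
    label: str


# Toy table, invented for illustration; columns are stored as lists of cells.
toy_table = {
    "table_id": "toy_event_table",
    "columns": [
        ["Jazz Night", "Book Fair"],             # column 0: event name
        ["2022-10-23", "2022-11-05"],            # column 1: start date
        ["City Jazz Club", "Public Library"],    # column 2: organizer
        ["+49 621 000000", "+49 621 111111"],    # column 3: phone number
    ],
}

# Gold annotations: CTA labels single columns, CPA labels column pairs.
cta_gold = [
    ColumnTypeAnnotation("toy_event_table", 2, "organization"),
    ColumnTypeAnnotation("toy_event_table", 3, "telephone"),
]
cpa_gold = [
    # Assumption: column 0 acts as the subject column of the pair.
    ColumnPropertyAnnotation("toy_event_table", 0, 1, "startDate"),
]

# Hypothetical system output: one CTA label is wrong, the CPA label is right.
cta_pred = [
    ColumnTypeAnnotation("toy_event_table", 2, "organization"),
    ColumnTypeAnnotation("toy_event_table", 3, "duration"),
]
cpa_pred = [
    ColumnPropertyAnnotation("toy_event_table", 0, 1, "startDate"),
]


def accuracy(gold, predicted):
    """Fraction of gold annotations that were predicted exactly."""
    if not gold:
        return 0.0
    predicted_set = set(predicted)
    return sum(1 for g in gold if g in predicted_set) / len(gold)


cta_accuracy = accuracy(cta_gold, cta_pred)
cpa_accuracy = accuracy(cpa_gold, cpa_pred)
print("CTA accuracy:", cta_accuracy)
print("CPA accuracy:", cpa_accuracy)
```

The sketch shows the structural difference between the tasks: a CTA annotation is keyed by a single column, whereas a CPA annotation is keyed by a pair of columns.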
SOTAB is provided for public download on the following webpage, which also offers additional information about the benchmark: http://webdatacommons.org/structureddata/sotab/
The Dataset Track of the SemTab Challenge at the ISWC 2022 Conference
The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) has been organized as part of the International Semantic Web Conference since 2019. The challenge aims at benchmarking systems that deal with the tabular data to KG matching problem, so as to facilitate their comparison on the same basis and the reproducibility of the results. This year, the challenge consisted of an Accuracy Track for benchmarking matching systems as well as a Dataset Track, which awards prizes to challenging benchmark datasets. We are happy that SOTAB has won the Dataset Track of the SemTab 2022 Challenge. A video of the SOTAB presentation at ISWC 2022 as well as a poster about SOTAB from the ISWC poster session are available for download. "Results of SemTab 2022" summarizes the results of the SemTab Accuracy and Dataset Tracks.
General Information about the Web Data Commons Project
Since 2012, the Web Data Commons project has been extracting structured data from the Common Crawl, the largest web corpus available to the public, and providing the extracted data for public download to support researchers in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the Web Data Commons project also provides large hyperlink graphs, large web table corpora, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the Web Data Commons project can be found at http://webdatacommons.org/