WDC Schema.org Table Corpus released

We are happy to announce the release of the WDC Schema.org Table Corpus.

The corpus consists of ~4.2 million relational tables describing a wide range of entities including products, people, organizations, places, and events. The tables use the schema.org vocabulary as a shared schema.

The three schema.org classes covered by the largest number of tables are Product (~2 million tables with ~232 million rows in total), Person (~922,000 tables with ~6.6 million rows in total), and LocalBusiness (~466,000 tables with ~7.4 million rows in total). Overall, 13 of the 43 classes are covered by more than 10,000 tables each, and another 7 classes are covered by more than 1,000 tables each.

The use cases of the corpus include the supervised and self-supervised training of methods for data integration tasks such as entity matching, schema matching, table augmentation, cell filling, and data search.

Background

Many websites embed structured data describing products, people, organizations, places, and events into their HTML pages to help search engines understand and display their content. The schema.org vocabulary is widely used as a shared schema for the embedded data. The Web Data Commons project regularly extracts schema.org annotations from the Common Crawl, a large public web corpus, and offers them for public download in the form of RDF dumps.

The Schema.org Table Corpus

The WDC Schema.org Table Corpus was generated by grouping the extracted data from the December 2020 version of the WDC schema.org data sets into relational tables. A single table contains all entities of a specific schema.org class that have been extracted for a specific host after a set of filtering steps. The column values of a table originate from the extracted values of schema.org attributes of the extracted entities.
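The grouping rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the WDC pipeline itself: the record layout (dicts with `host`, `class`, and `properties` keys) and the function name `group_into_tables` are assumptions made for this example, and the filtering steps mentioned above are omitted.

```python
from collections import defaultdict


def group_into_tables(entities):
    """Build one table per (host, schema.org class) pair.

    `entities` is an iterable of dicts with the hypothetical keys
    "host", "class" and "properties" (attribute name -> value).
    """
    grouped = defaultdict(list)
    for entity in entities:
        grouped[(entity["host"], entity["class"])].append(entity["properties"])

    tables = {}
    for key, rows in grouped.items():
        # Columns are the union of all attributes seen, in first-seen order.
        columns = []
        for row in rows:
            for attribute in row:
                if attribute not in columns:
                    columns.append(attribute)
        # Attributes missing for an entity become empty cells (None).
        tables[key] = {
            "columns": columns,
            "rows": [[row.get(column) for column in columns] for row in rows],
        }
    return tables


# Made-up example records, not taken from the corpus.
entities = [
    {"host": "shop.example", "class": "Product",
     "properties": {"name": "Lamp", "price": "19.99"}},
    {"host": "shop.example", "class": "Product",
     "properties": {"name": "Desk", "brand": "Acme"}},
    {"host": "shop.example", "class": "Person",
     "properties": {"name": "Jane Doe"}},
]
tables = group_into_tables(entities)
```

In this sketch, the two Product entities from the same host end up in one table whose columns are the union of their attributes, while the Person entity forms a separate table.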

The overall size of all tables in zipped form is ~50 GB. For each of the 43 schema.org classes, we offer three separate download files (in JSON format): one containing the 100 largest tables, one containing the remaining tables with at least 3 rows, and one containing all smaller tables. The table downloads are accompanied by easy-to-view samples and further files containing in-depth profiling statistics for the tables.
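The three-way split of the download files can be illustrated with a short sketch. It assumes that "largest" is measured by row count and that each table is represented only by a name and its number of rows; the function name `split_for_download` and its parameters are hypothetical and chosen for this example.

```python
def split_for_download(row_counts, top_n=100, min_rows=3):
    """Split the tables of one class into three download groups.

    `row_counts` maps a table name to its number of rows (an assumed
    representation). Returns three lists of table names: the top_n
    largest tables, the remaining tables with at least min_rows rows,
    and all smaller remaining tables.
    """
    ranked = sorted(row_counts.items(), key=lambda item: item[1], reverse=True)
    top = [name for name, _ in ranked[:top_n]]
    rest = ranked[top_n:]
    medium = [name for name, rows in rest if rows >= min_rows]
    small = [name for name, rows in rest if rows < min_rows]
    return top, medium, small


# Made-up sizes; top_n is lowered to 2 so the toy example fills all groups.
sizes = {"a.example": 50, "b.example": 7, "c.example": 3,
         "d.example": 2, "e.example": 1}
top, medium, small = split_for_download(sizes, top_n=2)
```

Every table lands in exactly one of the three groups, so the three files of a class together contain all of its tables.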

For more information and statistics about the corpus, and to download it, please visit

http://webdatacommons.org/structureddata/schemaorgtables/

General Information about the Web Data Commons Project

Since 2012, the Web Data Commons project has been extracting structured data from the Common Crawl, the largest web corpus available to the public, and providing the extracted data for public download in order to support researchers in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the Web Data Commons project also provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, and a collection of hypernyms extracted from billions of web pages for public download. General information about the Web Data Commons project can be found at

http://webdatacommons.org/

Have fun with the new corpus!

Cheers,

Ralph Peeters and Christian Bizer
