WebDataCommons releases 86.3 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 15.3 million websites

The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the September 2020 version of the Common Crawl, which covers 3.4 billion HTML pages originating from 34.5 million websites (pay-level domains). For the extraction of structured data, the newest version 2.4 of the Any23 library was used.

In summary, we found structured data within 1.7 billion HTML pages out of the 3.4 billion pages contained in the crawl (50%). These pages originate from 15.3 million different pay-level domains out of the 34.5 million pay-level domains covered by the crawl (44.3%). Last year, we found structured data in only 37% of the pages and on 37.2% of the pay-level domains.

Approximately 7.8 million of the 2020 websites use Microdata, 7.6 million websites use JSON-LD, and 3.3 million websites make use of RDFa. Microformats are used by more than 4 million websites within the crawl.

 

Statistics about the December 2020 Release:

Basic statistics about the December 2020 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at: 

http://webdatacommons.org/structureddata/2020-12/stats/stats.html

 

Markup Format Adoption

The page below provides an overview of trends in the adoption of the different markup formats as well as of widely used schema.org classes between 2012 and 2020:

http://webdatacommons.org/structureddata/#toc3

Comparing the statistics from the new 2020 release to the statistics about the 2019 release of the data sets

http://webdatacommons.org/structureddata/2019-12/stats/stats.html

we can observe that although the overall number of pages in the crawl is 38.9% larger than in the crawl used for the 2019 release, the corresponding growth in terms of domains is only 7.9%. This indicates that the crawl corpus used this year is much deeper than the one used last year, that is, it contains more pages per domain. Nevertheless, more and more websites annotate their content: the number of domains with annotated data grew by more than 28% within the year.

JSON-LD is the markup format with the largest growth in adopting domains (>50%). The trend towards JSON-LD becomes even more obvious on certain domains, such as hotels.com and yahoo.com, which have switched from Microdata to JSON-LD as their dominant markup format.

Concerning vocabulary adoption, schema.org remains the dominant vocabulary. More concretely, the classes schema:WebPage, schema:Product, schema:Rating, schema:Organization and schema:Person saw a major increase in adoption compared to 2019 (>40%). Looking at the richness of JSON-LD descriptions, we notice that the average number of triples per URL has grown from 29 in 2019 to 41 in 2020 and has now reached a level of detail similar to that of the Microdata annotations (avg. 39 triples per URL).

 

Download

The overall size of the December 2020 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 86.3 billion RDF quads. For download, we split the data into 21,346 files with a total size of 1.9 TB.

http://webdatacommons.org/structureddata/2020-12/stats/how_to_get_the_data.html

In addition, we have created separate files for over 43 different schema.org classes. Each file includes all quads extracted from pages that use the specific schema.org class.

http://webdatacommons.org/structureddata/2020-12/stats/schema_org_subsets.html
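As a rough illustration of how a downloaded file might be inspected, the following Python sketch counts how often each class appears as the object of an rdf:type statement. It rests on assumptions that are not stated in this announcement: that the files are gzip-compressed N-Quads with one quad per line, and that the fourth term of each quad identifies the page the data was extracted from. The file name `example_part.gz` and all function names are hypothetical, and the regular expression is a simplification rather than a complete N-Quads parser.

```python
import gzip
import re
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Optional

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified term pattern (assumption: well-formed input, one quad per line).
TERM = re.compile(
    r'<[^>]*>'                                            # IRI
    r'|_:[^\s]+'                                          # blank node
    r'|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'  # literal
)


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # assumed to be the URL of the source page


def parse_quad(line: str) -> Optional[Quad]:
    """Split one N-Quads line into four terms; return None if it does not fit."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    terms = TERM.findall(line)
    if len(terms) != 4:
        return None
    return Quad(*terms)


def count_types(lines: Iterable[str]) -> Counter:
    """Count the objects of rdf:type statements, i.e. the classes in use."""
    counts: Counter = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None and quad.predicate == RDF_TYPE:
            counts[quad.obj] += 1
    return counts


def read_gzip_lines(path: str) -> Iterator[str]:
    """Stream a gzip-compressed text file line by line without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


if __name__ == "__main__":
    # Two hand-written sample quads stand in for a real file here.
    sample = [
        '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        '<http://schema.org/Product> <http://example.com/p1> .',
        '_:n1 <http://schema.org/name> "Blue shirt"@en <http://example.com/p1> .',
    ]
    for rdf_class, count in count_types(sample).most_common():
        print(rdf_class, count)
```

For a real file, `count_types(read_gzip_lines("example_part.gz"))` would stream the data instead of reading it into memory, which matters given the size of the individual files.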

 

Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing and maintaining their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 

 

General Information about the WebDataCommons Project

The WebDataCommons project has extracted structured data from the Common Crawl, the largest web corpus available to the public, every year since 2012. It provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at 

http://webdatacommons.org/


Have fun with the new data set. 

Cheers, 
Anna Primpeli, Alexander Brinkmann and Chris Bizer
