WebDataCommons releases 82.1 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 14.6 million websites

The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2021 version of the Common Crawl covering 3.2 billion HTML pages which originate from 35.4 million websites (pay-level domains).

In summary, we found structured data within 1.5 billion HTML pages out of the 3.2 billion pages contained in the crawl (47.4%). These pages originate from 14.6 million different pay-level domains out of the 35.4 million pay-level domains covered by the crawl (41.1%).

Approximately 8.3 million websites provide structured data using the JSON-LD syntax, 7.8 million websites use the Microdata markup format to annotate structured data within their pages, while less than one million websites were found to use the RDFa markup format.

Statistics about the October 2021 Release:

Basic statistics about the October 2021 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used along with each markup format are found at:

http://webdatacommons.org/structureddata/2021-12/stats/stats.html

Markup Format Adoption

The WebDataCommons project has been extracting structured data from the Common Crawl yearly since 2010. The October 2021 release marks 11 years of monitoring the adoption of structured data on the Web. This allows us to spot trends in the adoption of different markup formats as well as in the usage of specific classes and properties. A short overview of these trends is provided on the page below:

http://webdatacommons.org/structureddata/#toc3

The first WDC release in 2010 revealed that only 5.7% of the examined webpages contained structured data. In 2021, we found structured data within 47.4% of the examined webpages, indicating a huge growth in adoption over the last decade. The two markup formats that saw the largest increase in adoption are Microdata and JSON-LD. By 2021, Microdata and JSON-LD dominate over RDFa and Microformats. More concretely, in the 2010 release Microdata was found in less than 1% of the websites containing structured data, while in the newest 2021 release its relative adoption is more than 53%. JSON-LD has been monitored by the WebDataCommons project since 2015 and was initially found in 21% of the websites deploying markup annotations. In 2021, more than 57% of the websites were found to use this markup format, which makes JSON-LD the most widely adopted markup format. In contrast, the relative adoption of RDFa and Microformats (hCard) has decreased over the last decade from 22% and 66% to 4.9% and 28.5%, respectively.

Looking at the richness of the Microdata and JSON-LD annotations, which we approximate by the average number of triples per webpage, we see an overall increasing trend with some small fluctuations between the years for the Microdata format. On average, we extracted 21 Microdata triples from each webpage in 2010. The number of triples per page increased to 38 in 2016, followed by a slight decrease to 36 triples per webpage in 2021. The growth in the richness of JSON-LD annotations is even more significant, with the average number of triples per webpage continuously increasing from 10 in 2015 to 47 in 2021. This indicates that JSON-LD data provides a higher level of detail in comparison to Microdata annotations.

The schema.org vocabulary remains the most popular in the context of Microdata and JSON-LD. It is used for annotating navigation elements within webpages, using classes such as BreadcrumbList, SearchAction and SiteNavigationElement, as well as the main content of a page, using classes like Product, LocalBusiness, and JobPosting. We observe a rapidly increasing adoption of several content classes: over the past four years, the number of websites providing Product annotations increased from 594K to 2.5M (334% growth), the number of websites annotating LocalBusiness entities increased from 386K to 727K (88% growth), and the adoption of the JobPosting class increased from 7K websites to 43K (514% growth).
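The growth figures above are relative increases over the starting count. As a minimal sketch (the function name growth_percent is our own, and the inputs are the rounded website counts quoted in this announcement, so results computed from them can deviate from the published percentages, which are based on exact counts):

```python
def growth_percent(old: float, new: float) -> float:
    """Relative increase from old to new, expressed in percent."""
    if old <= 0:
        raise ValueError("old must be positive")
    return (new - old) / old * 100.0


# Rounded website counts taken from the text above.
job_posting = growth_percent(7_000, 43_000)     # JobPosting: 7K -> 43K
product = growth_percent(594_000, 2_500_000)    # Product: 594K -> 2.5M (rounded)

print(f"JobPosting growth: {job_posting:.0f}%")
print(f"Product growth (from rounded counts): {product:.0f}%")
```

For JobPosting the rounded counts reproduce the reported value. For Product the input of 2.5M is itself rounded, so the result is only an approximation of the reported growth.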

Finally, we observe that an increasing number of websites explicitly annotate entity identifiers, such as product identifiers, as well as other identifying attributes such as telephone numbers or geo coordinates for local businesses. Schema.org provides different terms for annotating different types of product identifiers, with schema:Product/sku being the most popular among them. Over the past four years, the relative adoption of the schema:Product/sku property has increased from 21% to 55%. The properties schema:LocalBusiness/telephone and schema:LocalBusiness/geo have also seen comparable growth in the last four years, from 64% to 76% and from 6% to 22.5%, respectively. This confirms our previous observation on the increasing richness of the annotations.

Download

The overall size of the October 2021 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 82.1 billion RDF quads. For download, we split the data into 21,346 files with a total size of 1.6 TB.

http://webdatacommons.org/structureddata/2021-12/stats/how_to_get_the_data.html
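For readers who want a first look at the downloaded files, the following is a minimal sketch of reading quads line by line. It assumes the files are gzip-compressed N-Quads (one "subject predicate object graph ." statement per line); please check the download page for the actual file format. The names Quad, parse_quad, read_quads and count_predicates are our own, and the regular expression is a simplification of the N-Quads grammar, not a full parser.

```python
import gzip
import re
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # in this sketch: the URL of the page the statement came from


# Simplified N-Quads pattern: IRIs in <...>, blank nodes as _:label,
# literals in double quotes with an optional datatype or language tag.
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'
    r'(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)\s+'
    r'(<[^>]*>)\s*\.\s*$'
)


def parse_quad(line: str) -> Optional[Quad]:
    """Parse one line; return None if it does not match the simplified pattern."""
    m = QUAD_RE.match(line)
    return Quad(*m.groups()) if m else None


def read_quads(path: str) -> Iterator[Quad]:
    """Stream quads from a gzip-compressed file (assumed format), skipping bad lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            quad = parse_quad(line)
            if quad is not None:
                yield quad


def count_predicates(quads: Iterable[Quad]) -> Counter:
    """Count how often each predicate IRI occurs."""
    return Counter(q.predicate for q in quads)


# Small in-memory demonstration with made-up example lines.
sample = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://example.com/p1> .',
    r'_:b0 <http://schema.org/name> "Example \"Shop\""@en <http://example.com/p1> .',
    'not a quad',
]
quads = [q for q in map(parse_quad, sample) if q is not None]
print(count_predicates(quads).most_common())
```

Streaming with a generator keeps memory usage flat, which matters given the overall size of the corpus. For a downloaded file, count_predicates(read_quads(path)) would work the same way, where path is whatever local file name you chose. For production use, a dedicated RDF library is likely a better choice than this regular expression.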

In addition, we have created separate files for 44 different schema.org classes, each containing all quads extracted from pages that use the respective schema.org class.

http://webdatacommons.org/structureddata/2021-12/stats/schema_org_subsets.html

Lots of thanks to:
+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
+ the Any23 project for providing and maintaining their great library of structured data parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.

General Information about the WebDataCommons Project

Since 2010, the WebDataCommons project has extracted structured data yearly from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at

http://webdatacommons.org/

Have fun with the new data set.

Cheers,
Anna Primpeli, Alexander Brinkmann and Chris Bizer
