Resources

The Data and Web Science group offers the following resources for public download:

Open Data
Open Source Software
Benchmarks

1. Open Data

DBpedia – DBpedia is a community effort to extract structured information from Wikipedia editions in over 90 languages and to make the resulting knowledge base available on the Web. DBpedia website
OPIEC – OPIEC is an Open Information Extraction (OIE) corpus constructed from Wikipedia, containing more than 341M triples. OPIEC website
Web Data Commons – Microdata, RDFa, JSON-LD, and Microformat Corpus – The Web Data Commons project extracts all Microformat, Microdata, JSON-LD, and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the public, and provides the extracted data for download. Project website
Web Data Commons – Web Table Corpora – The Web Data Commons project extracts relational Web tables from the Common Crawl and provides two large table corpora for download. Link
Web Data Commons – WebIsA Database – WebIsADb is a publicly available database containing more than 400 million hypernymy relations extracted from the Common Crawl. Link, Linked Open Data version
Web Data Commons – DBkWik – DBkWik is a consolidated knowledge graph created from thousands of Wikis. Link
Web Data Commons – Product Data Corpus – We provide a large training set and a gold standard for product matching for public download. The training dataset consists of more than 26 million product offers originating from 79 thousand websites. Link

2. Open Source Software

AnyBURL – AnyBURL (Anytime Bottom Up Rule Learning) is a rule learner designed for the use case of Knowledge Base Completion. Website.
ALCOMO – ALCOMO (Applying Logical Constraints to Match Ontologies) is a debugging system that transforms incoherent alignments into coherent alignments by selecting a coherent subset. Website.
D2RQ Platform – The D2RQ Platform is a system for accessing relational databases as virtual, read-only RDF graphs. D2RQ website.
DESQ – A general-purpose system for frequent sequence mining. It features a simple and intuitive pattern expression language to express various pattern mining tasks and provides efficient and scalable algorithms for mining. GitHub page.
ExtractGPT – A framework for product attribute value extraction using Large Language Models combined with different prompting approaches. GitHub page.
LibKGE – LibKGE is a PyTorch-based library for efficient training, evaluation, and hyperparameter optimization of knowledge graph embeddings (KGE). It is highly configurable, easy to use, and extensible. GitHub page.
MatchGPT – A framework for entity matching using Large Language Models combined with different prompting approaches. GitHub page.
MinIE – An Open Information Extraction system which provides compact extractions with semantic annotations, including information about polarity, modality, attribution, and quantities. GitHub page.
MELT – MELT is a Maven-based framework for developing, tuning, evaluating, and packaging ontology matching systems. GitHub page.
PyDI – Data Integration Framework – A Python framework for end-to-end data integration. The PyDI framework implements symbolic as well as LLM-based methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. GitHub page.
RapidMiner Linked Open Data Extension – The RapidMiner Linked Open Data Extension allows using data from Linked Open Data both as an input for data mining as well as for enriching existing datasets with background knowledge. Project Website.
Silk – The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Silk website.
WDC Extraction Framework – The Extraction Framework is used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, Web graphs, and HTML tables from web crawls. See the Web Data Commons website.
WInte.r Data Integration Framework – WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. WInte.r website

3. Benchmarks

WDC Products: A Multi-Dimensional Entity Matching Benchmark – WDC Products supports evaluating entity matching systems along combinations of three dimensions: (i) amount of corner cases, (ii) amount of unseen entities, and (iii) development set size. Website.
WDC Schema.org Table Annotation Benchmark (SOTAB) – The SOTAB benchmark can be used to compare Column Type Annotation and Column Property Annotation systems using a large set of tables containing heterogeneous data from the Web. Website.
WebMall – Multi-Shop Web Agent Benchmark – The benchmark supports evaluating Web agents in a multi-shop e-commerce scenario. Website.
Berlin SPARQL Benchmark (BSBM) – The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of storage systems that expose SPARQL endpoints. BSBM website.
SW4ML Benchmark – The SW4ML benchmark collects a number of machine learning datasets with links to Semantic Web datasets for benchmarking Semantic Web-based machine learning approaches. SW4ML website.
T2D Gold Standard for Evaluating Web Table Matching Systems – The T2D gold standard provides a rich set of correspondences between a public Web table corpus and the DBpedia knowledge base. T2D website.
SV-IDENT 2022 Benchmark – The benchmark includes the data that has been used in the 2022 Shared Task on “Survey Variable Identification in Social Science Publications”. SV-IDENT website.
Speaker Attribution Benchmark (SpkAtt-2023) - The SpkAtt-2023 Benchmark provides a corpus of German parliamentary debates, manually annotated for speaker attribution, that has been used in the 2023 Shared Task on Speaker Attribution in Newswire and Parliamentary Debates. SpkAtt-2023 website.