Resources

The Data and Web Science group offers the following resources for public download:

  1. Open Data
  2. Open Source Software
  3. Benchmarks

1. Open Data

  • DBpedia - DBpedia is a community effort to extract structured information from Wikipedia editions in over 90 languages and to make the resulting knowledge base available on the Web. DBpedia website
  • OPIEC - OPIEC is an Open Information Extraction (OIE) corpus constructed from Wikipedia, containing more than 341M triples. OPIEC website
  • Web Data Commons - Microdata, RDFa, JSON-LD, and Microformat Corpus - The Web Data Commons project extracts all Microformat, Microdata, JSON-LD, and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download. Project website
  • Web Data Commons - Web Table Corpora - The Web Data Commons project extracts relational Web tables from the Common Crawl and provides two large table corpora for download. Link
  • Web Data Commons - Hyperlink Graph - The project provides a large hyperlink graph for public download and analyses the topology of the graph. Link
  • Web Data Commons - WebIsA Database - WebIsADb is a publicly available database containing more than 400 million hypernymy relations extracted from the Common Crawl. Link, Linked Open Data version
  • Web Data Commons - DBkWik - DBkWik is a consolidated knowledge graph created from thousands of Wikis. Link
  • Web Data Commons - Product Data Corpus - We provide a large training set and a gold standard for Product Matching for public download. The training dataset consists of more than 26 million product offers originating from 79 thousand websites. Link

2. Open Source Software

  • AnyBURL - AnyBURL (Anytime Bottom Up Rule Learning) is a rule learner designed for the use case of Knowledge Base Completion. Website.
  • ALCOMO - ALCOMO (Applying Logical Constraints to Match Ontologies) is a debugging system that transforms incoherent alignments into coherent alignments by selecting a coherent subset. Website.
  • WInte.r Data Integration Framework - WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. WInte.r website
  • D2RQ Platform - The D2RQ Platform is a system for accessing relational databases as virtual, read-only RDF graphs. D2RQ website.
  • DESQ - A general-purpose system for frequent sequence mining. It features a simple and intuitive pattern expression language to express various pattern mining tasks and provides efficient and scalable algorithms for mining. GitHub page.
  • MinIE - An Open Information Extraction system which provides compact extractions with semantic annotations, including information about polarity, modality, attribution, and quantities. GitHub page.
  • MELT - MELT is a Maven-based framework for developing, tuning, evaluating, and packaging ontology matching systems. GitHub page.
  • RapidMiner Linked Open Data Extension - The RapidMiner Linked Open Data Extension allows using data from Linked Open Data both as an input for data mining and for enriching existing datasets with background knowledge. Project Website.
  • Silk - The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Silk website.
  • WDC Extraction Framework - The Extraction Framework is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from web crawls. See Web Data Commons website.

3. Benchmarks

  • Berlin SPARQL Benchmark (BSBM) - The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of storage systems that expose SPARQL endpoints. BSBM website.
  • SW4ML Benchmark - The SW4ML benchmark collects a number of machine learning datasets with links to Semantic Web datasets for benchmarking Semantic Web-based machine learning approaches. SW4ML website.
  • T2D Gold Standard for Evaluating Web Table Matching Systems - The T2D Gold Standard provides a rich set of correspondences between a public Web table corpus and the DBpedia knowledge base. T2D website.
  • WDC Training Dataset and Gold Standard for Large-Scale Product Matching - The training dataset consists of more than 26 million product offers originating from 79 thousand websites. The gold standard consists of 2000 pairs of offers.
  • WDC Gold Standard for Product Matching and Product Feature Extraction - In order to support the evaluation and comparison of product feature extraction and product matching methods, we have created two public gold standards for these tasks. Website.