A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web

In recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular due to a lack of publicly available, acknowledged benchmark datasets. Here, we present a collection of 22 benchmark datasets of different sizes, derived from existing Semantic Web datasets as well as from external classification and regression problems linked to datasets in the Linked Open Data cloud. Such a collection can be used for quantitative performance testing and systematic comparisons of approaches and, because of the number of datasets, also makes it possible to assess the statistical significance of the findings.

The datasets, as well as a detailed description for each of them, can be found here.

Datasets

Our dataset collection consists of 22 datasets divided into three categories:

  1. Existing datasets that are commonly used in machine learning experiments
  2. Datasets that were generated from official observations
  3. Datasets generated from existing RDF datasets

 

Each of the datasets in the first two categories is initially linked to DBpedia. There are two main reasons for this: (1) DBpedia is a cross-domain knowledge base and is therefore usable for datasets from very different topical domains, and (2) tools like DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. Moreover, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from DBpedia. We use the initial DBpedia links to retrieve, for each entity, external links to YAGO and Wikidata. Such links can be exploited to systematically evaluate how relevant the data of different LOD datasets is for different learning tasks.
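
The link retrieval step described above could look roughly like the following sketch. It is an illustration only: the page does not state which predicate or tooling was used, so the use of owl:sameAs, the namespace prefixes for Wikidata and YAGO, and the helper names build_same_as_query and group_links are all assumptions. The query string would be sent to a DBpedia SPARQL endpoint; the sketch stops short of any network call.

```python
# Hypothetical helpers. The owl:sameAs predicate and the namespace
# prefixes below are assumptions, not taken from the dataset description.
TARGET_NAMESPACES = {
    "wikidata": "http://www.wikidata.org/entity/",
    "yago": "http://yago-knowledge.org/resource/",
}


def build_same_as_query(dbpedia_uri: str) -> str:
    """Return a SPARQL query listing the owl:sameAs targets of one entity."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        f"SELECT ?target WHERE {{ <{dbpedia_uri}> owl:sameAs ?target }}"
    )


def group_links(targets):
    """Group target URIs by the external dataset they point to.

    URIs that match none of the known namespaces are ignored.
    """
    grouped = {name: [] for name in TARGET_NAMESPACES}
    for uri in targets:
        for name, prefix in TARGET_NAMESPACES.items():
            if uri.startswith(prefix):
                grouped[name].append(uri)
    return grouped
```

Given the result rows of such a query, group_links separates the Wikidata and YAGO links for one entity, so that features from each target dataset can be evaluated separately.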

Existing ML datasets
Name      #Instances  Source  Task                             Licence
Auto MPG  371         UCI ML  Regression                       pending
AAUP      960         JSE     Regression/Classification (c=3)  pending
Auto 93   93          JSE     Regression                       pending
Zoo       101         UCI ML  Classification (c=3)             pending

Generated datasets from official observations
Name                         #Instances  Source         Task                             Licence
Forbes                       1,585       Forbes         Regression/Classification (c=2)  pending
Cities                       212         Mercer         Regression/Classification (c=3)  pending
Facebook Books               1,600       Facebook       Regression/Classification (c=2)  pending
Facebook Movies              1,600       Facebook       Regression/Classification (c=2)  pending
Metacritic Albums            1,600       Metacritic     Regression/Classification (c=2)  pending
Metacritic Movies            2,000       Metacritic     Regression/Classification (c=2)  pending
HIV Deaths Country           114         WHO            Regression/Classification (c=2)  Open
Traffic Accidents Country    146         WHO            Regression/Classification (c=2)  Open
Energy Savings Country       162         WorldBank      Regression/Classification (c=2)  Open
Inflation Country            160         WorldBank      Regression/Classification (c=2)  Open
Scientific Journals Country  160         WorldBank      Regression/Classification (c=2)  Open
Unemployment French Region   26          SemStats 2013  Regression/Classification (c=2)  pending
Endangered Species           301         a-z-animals    Regression/Classification (c=2)  pending
Drug-Food Interaction        2,000       FinkiLOD       Classification (c=2)             odc-by

Datasets generated from existing RDF datasets
Name   #Instances  Task                   Licence
AIFB   176         Classification (c=4)   CC-BY
AM     1,000       Classification (c=11)  cc-by-sa
MUTAG  340         Classification (c=2)   CC-BY
BGS    146         Classification (c=2)   Open

Link Quality Evaluation

To evaluate the quality of the DBpedia links, we randomly selected at least 100 instances from each dataset (for datasets with fewer than 100 instances, we selected all instances) and manually evaluated the correctness of the links.

Link quality evaluation
Dataset                      #Test Links  #Correct Links  Precision (%)
Auto MPG                     100          100             100.00
AAUP                         100          100             100.00
Auto 93                      93           93              100.00
Zoo                          101          99              98.01
Forbes                       100          100             100.00
Cities                       100          100             100.00
Facebook Books               100          98              98.00
Facebook Movies              130          130             100.00
Metacritic Albums            100          100             100.00
Metacritic Movies            130          128             98.46
HIV Deaths Country           114          114             100.00
Traffic Accidents Country    146          146             100.00
Energy Savings Country       162          162             100.00
Inflation Country            160          160             100.00
Scientific Journals Country  160          160             100.00
Unemployment French Region   26           26              100.00
Endangered Species           100          100             100.00
Drug-Food Interaction        100          100             100.00
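
The sampling and precision computation behind this table can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure actually used: the default sample size of 100, the seed parameter, and the function names sample_for_evaluation and precision are hypothetical, and the table shows that for some datasets more than 100 links were checked. The manual judgement of each link is outside the scope of the code.

```python
import random


def sample_for_evaluation(instances, sample_size=100, seed=None):
    """Return all instances of a small dataset, else a random sample.

    sample_size=100 is an assumed default; seed only makes the draw
    reproducible.
    """
    instances = list(instances)
    if len(instances) <= sample_size:
        return instances
    return random.Random(seed).sample(instances, sample_size)


def precision(correct_links, tested_links):
    """Percentage of manually confirmed links, rounded to two decimals."""
    if tested_links <= 0:
        raise ValueError("tested_links must be positive")
    return round(100.0 * correct_links / tested_links, 2)
```

For example, precision(128, 130) gives 98.46, which matches the Metacritic Movies row.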

Dataset Download

The datasets, as well as a detailed description for each of them, can be found here.

Citation

If you use the collection of datasets in your research, please cite the following paper:

  • Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference, 2016