A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web

In recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular due to a lack of publicly available, acknowledged benchmark datasets. Here, we present a collection of 22 benchmark datasets of different sizes, derived from existing Semantic Web datasets as well as from external classification and regression problems linked to datasets in the Linked Open Data cloud. Such a collection can be used to conduct quantitative performance testing and systematic comparisons of approaches. Because of the number of datasets, it also allows the statistical significance of the findings to be determined.

The datasets, as well as a detailed description for each of them, can be found here.

Datasets

Our dataset collection consists of 22 datasets divided into three categories:

Existing datasets that are commonly used in machine learning experiments
Datasets that were generated from official observations
Datasets generated from existing RDF datasets

Each of the datasets in the first two categories is initially linked to DBpedia. There are two main reasons for this: (1) DBpedia is a cross-domain knowledge base and can therefore be used with datasets from very different topical domains, and (2) tools like DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. Moreover, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from DBpedia. In fact, we use the initial DBpedia links to retrieve, for each entity, external links to YAGO and Wikidata. Such links could be exploited to systematically evaluate how relevant the data of different LOD datasets is in different learning tasks.
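As an illustration of how external links could be followed from a DBpedia entity, here is a minimal Python sketch. It is not the procedure used to build the collection. It assumes that the external links are expressed as owl:sameAs statements and that YAGO and Wikidata entities can be recognised by the URI prefixes listed in TARGET_PREFIXES. Both function names are hypothetical, and the query is only built as a string, not sent to any endpoint.

```python
from typing import Dict, Iterable, List

# Assumed URI prefixes for the two target datasets (not specified in the text).
TARGET_PREFIXES: Dict[str, str] = {
    "wikidata": "http://www.wikidata.org/entity/",
    "yago": "http://yago-knowledge.org/resource/",
}


def build_same_as_query(dbpedia_uri: str) -> str:
    """Build a SPARQL query for all owl:sameAs targets of one DBpedia entity."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        f"SELECT ?target WHERE {{ <{dbpedia_uri}> owl:sameAs ?target }}"
    )


def group_external_links(targets: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only links into the assumed target datasets, grouped by dataset."""
    grouped: Dict[str, List[str]] = {name: [] for name in TARGET_PREFIXES}
    for uri in targets:
        for name, prefix in TARGET_PREFIXES.items():
            if uri.startswith(prefix):
                grouped[name].append(uri)
    return grouped


if __name__ == "__main__":
    print(build_same_as_query("http://dbpedia.org/resource/Berlin"))
    # Illustrative result list, as it might come back from an endpoint.
    print(group_external_links([
        "http://www.wikidata.org/entity/Q64",
        "http://example.org/unrelated",
    ]))
```

The query string could be sent to any SPARQL endpoint that serves DBpedia data; the grouping step then discards links that point to datasets other than the two of interest.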

Existing ML datasets
Name	#Instances	Source	Task	Licence
Auto MPG	371	UCI ML	Regression	pending
AAUP	960	JSE	Regression/Classification (c=3)	pending
Auto 93	93	JSE	Regression	pending
Zoo	101	UCI ML	Classification (c=3)	pending

Generated datasets from official observations
Name	#Instances	Source	Task	Licence
Forbes	1,585	Forbes	Regression/Classification (c=2)	pending
Cities	212	Mercer	Regression/Classification (c=3)	pending
Facebook Books	1,600	Facebook	Regression/Classification (c=2)	pending
Facebook Movies	1,600	Facebook	Regression/Classification (c=2)	pending
Metacritic Albums	1,600	Metacritic	Regression/Classification (c=2)	pending
Metacritic Movies	2,000	Metacritic	Regression/Classification (c=2)	pending
HIV Deaths Country	114	WHO	Regression/Classification (c=2)	Open
Traffic Accidents Country	146	WHO	Regression/Classification (c=2)	Open
Energy Savings Country	162	WorldBank	Regression/Classification (c=2)	Open
Inflation Country	160	WorldBank	Regression/Classification (c=2)	Open
Scientific Journals Country	160	WorldBank	Regression/Classification (c=2)	Open
Unemployment French Region	26	SemStats 2013	Regression/Classification (c=2)	pending
Endangered Species	301	a-z-animals	Regression/Classification (c=2)	pending
Drug-Food Interaction	2,000	FinkiLOD	Classification (c=2)	ODC-BY

Datasets generated from existing RDF datasets
Name	#Instances	Task	Licence
AIFB	176	Classification (c=4)	CC-BY
AM	1,000	Classification (c=11)	CC-BY-SA
MUTAG	340	Classification (c=2)	CC-BY
BGS	146	Classification (c=2)	Open

Link Quality Evaluation

To evaluate the quality of the DBpedia links, we randomly selected at least 100 instances from each dataset (for datasets with fewer than 100 instances, we selected all instances) and manually evaluated the correctness of the links.

Link quality evaluation
Dataset	#Test Links	#Correct Links	Precision (%)
Auto MPG	100	100	100.00
AAUP	100	100	100.00
Auto 93	93	93	100.00
Zoo	101	99	98.02
Forbes	100	100	100.00
Cities	100	100	100.00
Facebook Books	100	98	98.00
Facebook Movies	130	130	100.00
Metacritic Albums	100	100	100.00
Metacritic Movies	130	128	98.46
HIV Deaths Country	114	114	100.00
Traffic Accidents Country	146	146	100.00
Energy Savings Country	162	162	100.00
Inflation Country	160	160	100.00
Scientific Journals Country	160	160	100.00
Unemployment French Region	26	26	100.00
Endangered Species	100	100	100.00
Drug-Food Interaction	100	100	100.00
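
The sampling rule and the precision column can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the script used for the evaluation: the function names, the fixed random seed, and the choice to draw exactly the minimum number of instances are all hypothetical (the text only says "at least 100"). The correctness judgement itself is manual and is represented here only by the count of correct links.

```python
import random
from typing import List, Sequence


def sample_for_review(instances: Sequence[str], minimum: int = 100,
                      seed: int = 0) -> List[str]:
    """Pick the instances whose links are checked by hand.

    Datasets with at most `minimum` instances are reviewed in full;
    larger ones are sampled uniformly without replacement.
    """
    if len(instances) <= minimum:
        return list(instances)
    # A seeded generator keeps the sample reproducible (an assumption).
    return random.Random(seed).sample(list(instances), minimum)


def precision_percent(correct: int, tested: int) -> float:
    """Share of tested links judged correct, in percent, rounded to 2 places."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    if not 0 <= correct <= tested:
        raise ValueError("correct must be between 0 and tested")
    return round(100.0 * correct / tested, 2)


if __name__ == "__main__":
    small = [f"region_{i}" for i in range(26)]
    large = [f"movie_{i}" for i in range(2000)]
    print(len(sample_for_review(small)), len(sample_for_review(large)))
    print(precision_percent(128, 130))
```

For example, 128 correct links out of 130 tested gives 100 * 128 / 130, which rounds to 98.46, matching the Metacritic Movies row.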

Dataset Download

The datasets, as well as a detailed description for each of them, can be found here.

Citation

If you use the collection of datasets in your research, please cite the following paper:

Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference, 2016