In recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular because of a lack of publicly available, acknowledged benchmark datasets. Here, we present a collection of 22 benchmark datasets of different sizes, derived from existing Semantic Web datasets as well as from external classification and regression problems linked to datasets in the Linked Open Data cloud. Such a collection of datasets can be used to conduct quantitative performance testing and systematic comparisons of approaches. Because of the number of datasets, it also allows the statistical significance of the findings to be determined.
The datasets, as well as a detailed description for each of them, can be found here.
Our dataset collection consists of 22 datasets divided into three categories:
Each of the datasets in the first two categories is initially linked to DBpedia. There are two main reasons for this: (1) DBpedia is a cross-domain knowledge base, usable for datasets from very different topical domains, and (2) tools like DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. Moreover, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from DBpedia. In fact, we use the initial DBpedia links to retrieve external links for each entity to YAGO and Wikidata. Such links could be exploited to systematically evaluate the relevance of the data from different LOD datasets in different learning tasks.
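The retrieval of external links from a DBpedia entity can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the collection: it assumes the links are exposed as `owl:sameAs` statements, and the namespace prefixes in `TARGET_PREFIXES` as well as the helper names `build_same_as_query` and `group_external_links` are hypothetical. Sending the query to a SPARQL endpoint is deliberately left out.

```python
from typing import Dict, Iterable, List

# Assumed namespace prefixes of the two target datasets (illustrative only).
TARGET_PREFIXES = {
    "yago": "http://yago-knowledge.org/resource/",
    "wikidata": "http://www.wikidata.org/entity/",
}


def build_same_as_query(dbpedia_uri: str) -> str:
    """Return a SPARQL query for all owl:sameAs targets of one DBpedia entity."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        f"SELECT ?target WHERE {{ <{dbpedia_uri}> owl:sameAs ?target }}"
    )


def group_external_links(targets: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only links into the target datasets, grouped by dataset name."""
    grouped: Dict[str, List[str]] = {name: [] for name in TARGET_PREFIXES}
    for uri in targets:
        for name, prefix in TARGET_PREFIXES.items():
            if uri.startswith(prefix):
                grouped[name].append(uri)
    return grouped
```

The query string would be run once per linked entity, and the returned URIs passed to `group_external_links` to separate YAGO from Wikidata links and discard links to any other dataset.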
Name | #Instances | Source | Task | Licence |
---|---|---|---|---|
Auto MPG | 371 | UCI ML | Regression | pending |
AAUP | 960 | JSE | Regression/ | pending |
Auto 93 | 93 | JSE | Regression | pending |
Zoo | 101 | UCI ML | Classification (c=3) | pending |
Name | #Instances | Source | Task | Licence |
---|---|---|---|---|
Forbes | 1,585 | Forbes | Regression/ | pending |
Cities | 212 | Mercer | Regression/ | pending |
Facebook Books | 1,600 | | Regression/ | pending |
Facebook Movies | 1,600 | | Regression/ | pending |
Metacritic Albums | 1,600 | Metacritic | Regression/ | pending |
Metacritic Movies | 2,000 | Metacritic | Regression/ | pending |
HIV Deaths Country | 114 | WHO | Regression/ | Open |
Traffic Accidents Country | 146 | WHO | Regression/ | Open |
Energy Savings Country | 162 | WorldBank | Regression/ | Open |
Inflation Country | 160 | WorldBank | Regression/ | Open |
Scientific Journals Country | 160 | WorldBank | Regression/ | Open |
Unemployment French Region | 26 | SemStats 2013 | Regression/ | pending |
Endangered Species | 301 | a-z-animals | Regression/ | pending |
Drug-Food Interaction | 2,000 | FinkiLOD | Classification (c=2) | ODC-BY |
Name | #Instances | Task | Licence |
---|---|---|---|
AIFB | 176 | Classification (c=4) | CC-BY |
AM | 1,000 | Classification (c=11) | CC-BY-SA |
MUTAG | 340 | Classification (c=2) | CC-BY |
BGS | 146 | Classification (c=2) | Open |
To evaluate the quality of the DBpedia links, we randomly selected at least 100 instances from each dataset (for datasets with fewer than 100 instances, we selected all instances) and manually evaluated the correctness of their links.
Dataset | #Test Links | #Correct Links | Precision (%) |
---|---|---|---|
Auto MPG | 100 | 100 | 100.00 |
AAUP | 100 | 100 | 100.00 |
Auto 93 | 93 | 93 | 100.00 |
Zoo | 101 | 99 | 98.02 |
Forbes | 100 | 100 | 100.00 |
Cities | 100 | 100 | 100.00 |
Facebook Books | 100 | 98 | 98.00 |
Facebook Movies | 130 | 130 | 100.00 |
Metacritic Albums | 100 | 100 | 100.00 |
Metacritic Movies | 130 | 128 | 98.46 |
HIV Deaths Country | 114 | 114 | 100.00 |
Traffic Accidents Country | 146 | 146 | 100.00 |
Energy Savings Country | 162 | 162 | 100.00 |
Inflation Country | 160 | 160 | 100.00 |
Scientific Journals Country | 160 | 160 | 100.00 |
Unemployment French Region | 26 | 26 | 100.00 |
Endangered Species | 100 | 100 | 100.00 |
Drug-Food Interaction | 100 | 100 | 100.00 |
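The precision values in the table above are the percentage of tested links that were judged correct. The sketch below illustrates the sampling rule and the precision computation. The function names `sample_for_evaluation` and `link_precision`, the fixed `seed`, and the choice to draw exactly `minimum` instances are assumptions made for illustration (the evaluation itself used at least 100 instances per dataset, and the correctness judgments were made manually).

```python
import random
from typing import List, Sequence


def sample_for_evaluation(
    instances: Sequence[str], minimum: int = 100, seed: int = 0
) -> List[str]:
    """Draw `minimum` instances at random, or all of them if the dataset is smaller."""
    if len(instances) <= minimum:
        return list(instances)
    # A seeded generator keeps the sample reproducible (the seed is an assumption).
    return random.Random(seed).sample(list(instances), minimum)


def link_precision(n_correct: int, n_tested: int) -> float:
    """Percentage of tested links judged correct, rounded to two decimals."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return round(100.0 * n_correct / n_tested, 2)
```

For example, `link_precision(128, 130)` returns `98.46`, which matches the Metacritic Movies row.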
If you use the collection of datasets in your research, please cite the following paper: