Example: Discovering Wrong Links between Datasets

Note: An in-depth description of the approach behind this process is given in the following paper: Heiko Paulheim. Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection. In: Third International Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM 2014).

Links between datasets are an essential ingredient of the Linked Open Data cloud. Because creating links manually hardly scales, automatic tools such as Silk are often used to create links heuristically. Simple heuristics can create large numbers of links with reasonable effort, at the price of some of those links being incorrect.

This example shows how to use outlier detection for finding erroneous links. It combines operators from the Linked Open Data extension with those from the Anomaly Detection extension. In the example, we read links between the EventMedia dataset and DBpedia, and identify errors among those links. The process is available from myExperiment.

The overall process has three steps: first, the links between the two datasets are read; second, features are created from the types of each of the two linked resources; and third, suspicious links are identified using outlier detection.

The SPARQL data importer reads the list of links from the EventMedia dataset to DBpedia.
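The query used by the importer is not reproduced in this text. As a rough sketch of what such a query might look like, the following Python helper builds a SPARQL SELECT query for links into the DBpedia resource namespace. The use of `owl:sameAs` as the link predicate, the helper name `build_link_query`, and the variable names `?source` and `?target` are assumptions made for illustration, not taken from the actual process.

```python
def build_link_query(target_prefix="http://dbpedia.org/resource/", limit=None):
    """Build a SPARQL SELECT query for links into a target namespace.

    Assumption: links are expressed with owl:sameAs; the real process
    may use a different predicate or query shape.
    """
    query = (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?source ?target WHERE {\n"
        "  ?source owl:sameAs ?target .\n"
        # Keep only links whose target lies in the given namespace.
        f'  FILTER(STRSTARTS(STR(?target), "{target_prefix}"))\n'
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


query = build_link_query(limit=100)
print(query)
```

The resulting string could be pasted into a SPARQL importer or sent to an endpoint that serves the EventMedia data; the endpoint itself is deliberately left out here.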

Next, we create features for the direct types of both linked resources. For the EventMedia resources, we use the Direct Types generator. For DBpedia, we want to include only types from the DBpedia ontology, so we use the custom SPARQL generator.
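To illustrate the idea of type features outside RapidMiner, the following sketch turns each link into a binary vector with one column per type, keeping the types of the two sides in separate columns and dropping DBpedia types outside the DBpedia ontology namespace. The function name, the `src:`/`tgt:` column prefixes, and the toy resources and types are hypothetical; the generators in the extension may represent features differently.

```python
DBO = "http://dbpedia.org/ontology/"  # DBpedia ontology namespace


def build_features(links, source_types, target_types):
    """Return (columns, matrix): one binary row per link, one column per type."""
    rows = []
    for src, tgt in links:
        # Types of the linking resource, used as they are.
        cols = {"src:" + t for t in source_types.get(src, ())}
        # Types of the DBpedia resource, restricted to the DBpedia ontology.
        cols |= {"tgt:" + t for t in target_types.get(tgt, ()) if t.startswith(DBO)}
        rows.append(cols)
    columns = sorted(set().union(*rows))
    matrix = [[1 if c in row else 0 for c in columns] for row in rows]
    return columns, matrix


# Hypothetical toy data: two venues linked to two DBpedia resources.
links = [
    ("ex:venue/1", "http://dbpedia.org/resource/Club_A"),
    ("ex:venue/2", "http://dbpedia.org/resource/River_B"),
]
source_types = {"ex:venue/1": ["ex:Venue"], "ex:venue/2": ["ex:Venue"]}
target_types = {
    "http://dbpedia.org/resource/Club_A": [DBO + "Building", "http://schema.org/Place"],
    "http://dbpedia.org/resource/River_B": [DBO + "River"],
}

columns, matrix = build_features(links, source_types, target_types)
```

Each row of `matrix` then describes one link by the combination of types on both ends, which is the pattern the outlier detection step works on.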

Finally, the Local Outlier Factor operator from the Anomaly Detection extension is used to find links whose pattern of types deviates from the overall pattern. The result is a list of links with outlier scores; sorting by score in descending order puts the most suspicious links first.
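For readers who want to see the scoring step in isolation, here is a minimal pure-Python sketch of the Local Outlier Factor. It is not the implementation used by the Anomaly Detection extension. It assumes Euclidean distance on the binary type vectors, and it adds a small constant `eps` (an assumption of this sketch) so that groups of identical vectors, which are common with type features, do not cause a division by zero. The toy feature rows are made up.

```python
import math


def lof_scores(points, k, eps=1e-10):
    """Return one Local Outlier Factor per point; higher means more unusual."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][others[k - 1]]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # Keep every point within the k-distance, so ties are included.
        neighbors.append([j for j in others if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    density = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        density.append(len(reach) / (sum(reach) + eps))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(density[j] for j in neighbors[i]) / len(neighbors[i]) / density[i]
        for i in range(n)
    ]


# Toy data: four links share one type pattern, the last one deviates.
features = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1]]
scores = lof_scores(features, k=2)

# Most suspicious links first.
ranking = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
```

In this toy run the deviating row ends up first in `ranking`, while the four identical rows score close to 1. Because of the `eps` guard, the absolute score of the deviating row is very large, so the values here are only meaningful for ranking.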

Looking at the results, the top 5 links contain three links that are actually wrong:

  • The first two links refer to rivers in DBpedia, but to music clubs of the same name in EventMedia
  • The fifth link connects a DBpedia concept to a non-dereferenceable URI

The other two links in the top 5 point to a bridge and a library. Both are rare as event locations and are therefore flagged as outliers, even though the links are correct.