RapidMiner Linked Open Data Extension
Winner of the Semantic Web Challenge 2014
The RapidMiner Linked Open Data Extension is an extension to the open source data mining software RapidMiner. It allows using data from Linked Open Data both as an input for data mining and for enriching existing datasets with background knowledge. The RapidMiner Linked Open Data Extension is based on the earlier FeGeLOD framework, which has since been discontinued.
Possible uses include:
- Importing data from a Linked Data source, such as Eurostat, into RapidMiner, and analyzing it using RapidMiner operators
- Adding data about population, GDP, and literacy from Eurostat to a data set of countries
- Adding data about turnover and number of employees from DBpedia to a data set of companies
- Identifying potentially wrong links between datasets in Linked Open Data
- Predicting the fuel consumption of cars
- Building a hybrid recommender system using Linked Open Data
- Using Graph Kernels for Feature Generation
Unlike many related approaches, the RapidMiner Linked Open Data Extension can work in a completely unsupervised fashion, which means that almost no knowledge of the data source used or of technologies such as RDF and SPARQL is required to use it.
Download
The RapidMiner Linked Open Data Extension is available from the RapidMiner marketplace.
To install the extension, go to the “Help”->“Updates and Extensions” menu in RapidMiner, and search for “Linked Open Data”.
Operators
The extension provides three main categories of operators:
- Data importers that load data from Linked Open Data into RapidMiner for further processing
- Linkers that create links from a given dataset to a dataset in Linked Open Data (e.g., linking a CSV file to DBpedia)
- Generators that gather data from Linked Open Data and add it as attributes in the data set at hand
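To illustrate what a linker does, here is a minimal sketch in Python. It is not the extension's implementation: the function names `guess_dbpedia_uri` and `link_rows` are hypothetical, and the approach (deriving a DBpedia resource URI from a label by replacing spaces with underscores) is only one simple strategy that assumes the label matches the resource name exactly.

```python
from urllib.parse import quote

# Common prefix of DBpedia resource URIs.
DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(label: str) -> str:
    """Guess a DBpedia resource URI from a plain-text label (hypothetical helper).

    Assumes the label matches the resource name exactly, which will not
    hold for ambiguous or misspelled labels.
    """
    # Collapse whitespace and join words with underscores.
    name = "_".join(label.strip().split())
    # Percent-encode anything that is not safe inside a URI path segment.
    return DBPEDIA_RESOURCE_PREFIX + quote(name, safe="_(),'")


def link_rows(rows, label_column):
    """Return copies of the rows with an added 'dbpedia_uri' attribute."""
    return [
        dict(row, dbpedia_uri=guess_dbpedia_uri(row[label_column]))
        for row in rows
    ]


if __name__ == "__main__":
    # Rows as they might come from a CSV file of countries.
    countries = [{"name": "Germany"}, {"name": "United Kingdom"}]
    for row in link_rows(countries, "name"):
        print(row["name"], "->", row["dbpedia_uri"])
```

The resulting link column is what a generator would then use to look up additional attributes for each row.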
There are different kinds of generators in the extension, such as
- Adding data attributes, such as population
- Adding types, such as “G20 country”
- Adding aggregated relations, such as number of companies located in a city
- Adding arbitrary data using customizable SPARQL statements
The operators provided by the Linked Open Data Extension can be used in conjunction with built-in RapidMiner operators as well as other extensions to build powerful data mining processes.
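As a rough illustration of the generator types listed above, the following Python sketch builds the kind of SPARQL statements that could retrieve a data attribute or an aggregated relation for a linked entity. The function names are hypothetical, and the DBpedia property URIs used in the example (`populationTotal`, `locationCity`) are assumptions chosen for illustration. The queries actually issued by the extension may differ.

```python
def data_property_query(entity_uri: str, property_uri: str) -> str:
    """Build a query for the values of one data attribute of an entity."""
    return (
        "SELECT ?value WHERE { "
        f"<{entity_uri}> <{property_uri}> ?value . "
        "}"
    )


def incoming_count_query(entity_uri: str, property_uri: str) -> str:
    """Build a query counting the subjects that point to an entity.

    This corresponds to an aggregated relation, e.g. the number of
    companies located in a city.
    """
    return (
        "SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE { "
        f"?s <{property_uri}> <{entity_uri}> . "
        "}"
    )


if __name__ == "__main__":
    # Example URIs are assumptions, not taken from the extension.
    print(data_property_query(
        "http://dbpedia.org/resource/Germany",
        "http://dbpedia.org/ontology/populationTotal",
    ))
    print(incoming_count_query(
        "http://dbpedia.org/resource/Mannheim",
        "http://dbpedia.org/ontology/locationCity",
    ))
```

Each query string would be sent to a SPARQL endpoint once per linked row, and the returned value added to the data set as a new attribute.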
Documentation
All operators, as well as example workflows, are described in the user manual.
Publications
The extension itself, as well as the underlying algorithms, are described in:
- Petar Ristoski: Towards Linked Open Data enabled Data Mining: Strategies for Feature Generation, Propositionalization, Selection, and Consolidation. In: Extended Semantic Web Conference, 2015.
- Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, and Christian Bizer. Data Mining with Background Knowledge from the Web. In: RapidMiner World, 2014.
- Heiko Paulheim. Exploiting Linked Open Data as Background Knowledge in Data Mining. In: CEUR workshop proceedings DMoLD 2013 : Proceedings of the International Workshop on Data Mining on Linked Data, with Linked Data Mining Challenge collocated with ECMLPKDD 2013; 1–10. RWTH, Aachen, 2013.
- Heiko Paulheim and Johannes Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data. In: International Conference on Web Intelligence, Mining, and Semantics (WIMS), 2012.
The following publications discuss various applications that use the RapidMiner LOD Extension (or its predecessor FeGeLOD):
- Identifying wrong links in Linked Open Data: Heiko Paulheim. Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection. In: Third International Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM 2014).
- Explaining statistical data:
  - Heiko Paulheim. Generating Possible Interpretations for Statistics from Linked Open Data. In: 9th Extended Semantic Web Conference, ESWC 2012; 560–574. Springer, Berlin [u.a.], 2012.
  - Petar Ristoski and Heiko Paulheim. Analyzing Statistics with Background Knowledge from Linked Open Data. In: First International Workshop on Semantic Statistics (SemStats 2013).
  - Petar Ristoski and Heiko Paulheim. Visual Analysis of Statistical Data on Maps using Linked Open Data. In: 12th Extended Semantic Web Conference, ESWC 2015; Posters and Demos.
- Classifying Tweets: Axel Schulz, Petar Ristoski and Heiko Paulheim. I See a Car Crash: Real-time Detection of Small Scale Incidents in Microblogs. In: Workshop on Social Media and Linked Data for Emergency Response (SMILE), 2013.
- Predicting the location of Twitter users: Axel Schulz, Aristotelis Hadjakos, Heiko Paulheim, Johannes Nachtwey, Max Mühlhäuser. A Multi-Indicator Approach for Geolocalization of Tweets. In: International AAAI Conference on Weblogs and Social Media (ICWSM 2013).
- Classifying event information extracted from Wikipedia: Daniel Hienert, Dennis Wegener and Heiko Paulheim. Automatic Classification and Relationship Extraction for Multi-Lingual and Multi-Granular Events from Wikipedia. In: Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2012); 1–10. RWTH, Aachen, 2012.
- Schema Matching for Linked Data: Frederik Janssen, Faraz Fallahi, Jan Nößner and Heiko Paulheim. Towards Rule Learning Approaches to Instance-based Ontology Matching. In: Proceedings of the First International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data; 13–18. RWTH, Aachen, 2012.
Support & Community
If you use the RapidMiner LOD extension, you may want to join the Google Group at
https://groups.google.com/forum/#!forum/rmlod
or contact the user community via its mailing list:
rmlod@googlegroups.com
Team
Project lead:
Current team:
- Christian Bizer
- Evgeny Mitichkin
- Petar Ristoski
Past contributors:
- Raad Bahmani
- Johannes Fürnkranz
- Alexander Gabriel
- Simon Holthausen