Example: Adding Background Information from DBpedia with Multiple Generators
This example analyzes a statistics file about burnout rates in German DAX companies, obtained from here. The input CSV file can be downloaded here. The workflow is also available on myExperiment.
The example uses the preconfigured DBpedia endpoint to retrieve background information. It shows how multiple generators can be combined to enhance a dataset.
This picture shows the overall workflow in RapidMiner. Below, the individual steps are explained in detail.
The basic workflow design for combining multiple generators is as follows:
- Multiply both outputs of the linker. This ensures that both the data table and the list of attributes containing the links are available to the different generators.
- Join the results using RapidMiner's built-in Join operator. The join requires an ID attribute to be set. This can be done either during data import or with RapidMiner's built-in Set Role operator. The example uses the latter variant.
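The multiply-then-join pattern can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions: the functions `generator_a` and `generator_b`, the attribute name `Company_link`, and the derived feature values are hypothetical stand-ins and do not reproduce the actual operators.

```python
import copy

def set_id_role(rows, id_attribute):
    """Index rows by an ID attribute (stand-in for the Set Role operator)."""
    return {row[id_attribute]: row for row in rows}

def inner_join(left, right):
    """Merge rows that share the same ID (stand-in for the Join operator)."""
    joined = {}
    for key, left_row in left.items():
        if key in right:
            merged = dict(left_row)
            merged.update(right[key])
            joined[key] = merged
    return joined

# Hypothetical stand-ins for two generators; each adds one made-up feature.
def generator_a(rows):
    return [dict(row, feature_a=len(row["Company"])) for row in rows]

def generator_b(rows):
    return [dict(row, feature_b=row["Company"].upper()) for row in rows]

# Linked data table; "Company_link" is an assumed name for the link attribute.
linked = [
    {"Company": "Henkel",
     "Company_link": "http://dbpedia.org/resource/Henkel"},
    {"Company": "Deutsche Telekom",
     "Company_link": "http://dbpedia.org/resource/Deutsche_Telekom"},
]

# "Multiply": give each generator its own copy of the linked table.
branch_a = copy.deepcopy(linked)
branch_b = copy.deepcopy(linked)

result = inner_join(
    set_id_role(generator_a(branch_a), "Company"),
    set_id_role(generator_b(branch_b), "Company"),
)
print(result["Henkel"])
```

Each joined row carries the original attributes plus the features from both branches, which is the shape the Join operator is expected to produce once an ID attribute is set.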
To link the company dataset to DBpedia, we use the pattern-based linker. It generates URIs based on a pattern given by the user. Specifically, it uses the value of the field “Link to merge with” and concatenates it with the value of the attribute specified by the user. In the example shown below, the attribute value “Henkel” in the attribute “Company” would lead to the link “http://dbpedia.org/resource/Henkel”.
Two more expert parameters are available: “URL encoding” performs encoding of special characters. “Use DBpedia link format” performs some special string operations used for the link format in DBpedia. For example, blanks are replaced by underscores, so that “Deutsche Telekom” leads to the link “http://dbpedia.org/resource/Deutsche_Telekom”.
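The link construction described above amounts to string concatenation with two optional transformations. The following minimal sketch illustrates the idea; the function `build_link` and its parameter names are hypothetical, and the real operator may apply further DBpedia-specific string operations that are not shown here.

```python
from urllib.parse import quote

def build_link(prefix, value, url_encoding=False, dbpedia_format=False):
    """Concatenate a link prefix with an attribute value."""
    local_name = value
    if dbpedia_format:
        # Blanks become underscores, as in the DBpedia link format.
        local_name = local_name.replace(" ", "_")
    if url_encoding:
        # Percent-encode special characters; underscores stay untouched.
        local_name = quote(local_name, safe="_")
    return prefix + local_name

PREFIX = "http://dbpedia.org/resource/"

print(build_link(PREFIX, "Henkel"))
# http://dbpedia.org/resource/Henkel
print(build_link(PREFIX, "Deutsche Telekom", dbpedia_format=True))
# http://dbpedia.org/resource/Deutsche_Telekom
```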
We use two specific generators in this example.
The Direct Types generator creates a boolean feature for each direct type. For example, the company Henkel has the (YAGO) type “ChemicalCompanies” in DBpedia, so an attribute for that type is created, which is true for the instance Henkel (and all other companies to which it applies), false for the other companies.
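Conceptually, this is a one-hot encoding of the type sets. A minimal sketch, assuming the types have already been retrieved for each entity: the function name, the `type_` prefix, and the second entity with its type are made-up placeholders, and only the Henkel type is taken from the text above.

```python
def direct_type_features(types_by_entity):
    """Create one boolean attribute per type seen in the dataset."""
    all_types = sorted({t for types in types_by_entity.values() for t in types})
    return {
        entity: {f"type_{t}": (t in types) for t in all_types}
        for entity, types in types_by_entity.items()
    }

# "CompanyB" and "PlaceholderType" are illustrative placeholders.
types_by_entity = {
    "Henkel": {"ChemicalCompanies"},
    "CompanyB": {"PlaceholderType"},
}

features = direct_type_features(types_by_entity)
print(features["Henkel"])
# {'type_ChemicalCompanies': True, 'type_PlaceholderType': False}
```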
The Data Properties generator creates features for all data properties, i.e., mostly numeric values. In this example, those include netIncome, assets, and numberOfEmployees.
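One way to picture this step is as a pivot from (entity, property, literal) triples to one column per property. The sketch below assumes the triples are already available. The entity names and values are made-up placeholders, and the handling of missing or non-numeric values is an assumption rather than the generator's documented behavior.

```python
def data_property_features(triples):
    """Pivot (entity, property, literal) triples into one column per property."""
    properties = sorted({prop for _, prop, _ in triples})
    table = {}
    for entity, prop, literal in triples:
        # Properties without a value for an entity stay None (missing).
        row = table.setdefault(entity, {p: None for p in properties})
        try:
            row[prop] = float(literal)
        except ValueError:
            row[prop] = literal  # keep non-numeric literals as strings
    return table

# Placeholder triples; the values are not taken from DBpedia.
triples = [
    ("CompanyA", "numberOfEmployees", "1000"),
    ("CompanyA", "netIncome", "5.0E7"),
    ("CompanyB", "numberOfEmployees", "250"),
]

table = data_property_features(triples)
print(table["CompanyB"])
# {'netIncome': None, 'numberOfEmployees': 250.0}
```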
The outcome of both generators is joined using the Join operator. From the joined output, we can compute a correlation matrix and examine it for attributes that are correlated with the percentage of women employed in the company, according to our initial question. The correlation matrix is depicted in this picture.
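To make the correlation step concrete, the sketch below computes a correlation matrix by hand, assuming pairwise Pearson correlation. The column names and values are toy placeholders, not figures from the dataset. Boolean type attributes can be included by coding them as 0 and 1.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations between all columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy columns; "target" stands in for the attribute of interest.
columns = {
    "target": [1, 2, 3, 4],
    "feature_up": [2, 4, 6, 8],
    "feature_down": [8, 6, 4, 2],
}

matrix = correlation_matrix(columns)
for name, value in matrix["target"].items():
    print(name, round(value, 2))
```

Reading the row for the target attribute gives the correlation of every generated feature with it. Note that this sketch divides by zero for a constant column, so such columns would need to be filtered out first.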
The correlation matrix yields several findings regarding our target attribute (i.e., the percentage of women employed in a company). Examples include:
- Companies with a large number of employees have a low percentage of female employees.
- Companies with a high operating income have a low percentage of female employees.
- Sportswear companies have a high percentage of female employees, while car manufacturers have a low percentage.
The first two findings use data from the Data Properties generator (i.e., numerical facts about the entities), while the third uses data from the Direct Types generator (companies have the type car manufacturer or sportswear company).
To wrap up: this example has shown how to add data about entities, such as companies, from DBpedia, using different generators at once.