This example analyzes a statistics file about burnout rates in German DAX companies, obtained from here. The input CSV file can be downloaded here. The workflow is also available on myExperiment.
The example uses the preconfigured DBpedia endpoint to retrieve background information. It shows how multiple generators can be combined to enrich a dataset.
This picture shows the overall workflow in RapidMiner. Below, the individual steps are explained in detail.
The basic workflow design for combining multiple generators is as follows:
To link the company dataset to DBpedia, we use the pattern-based linker. It generates URIs based on a pattern given by the user. Specifically, it uses the value of the field “Link to merge with” and concatenates it with the value of the attribute specified by the user. In the example shown below, the attribute value “Henkel” in the attribute “Company” would lead to the link “http://dbpedia.org/resource/Henkel”.
Two more expert parameters are available: “URL encoding” performs encoding of special characters. “Use DBpedia link format” performs some special string operations used for the link format in DBpedia. For example, blanks are replaced by underscores, so that “Deutsche Telekom” leads to the link “http://dbpedia.org/resource/Deutsche_Telekom”.
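The link construction described above can be sketched in a few lines of Python. This is only an illustration, not the extension's implementation: the function name `build_link` and its parameters are hypothetical, and the order in which the operator applies the two options is an assumption.

```python
from urllib.parse import quote


def build_link(prefix, value, dbpedia_format=True, url_encode=True):
    """Concatenate a URI prefix with an attribute value (hypothetical helper)."""
    local_name = value.strip()
    if dbpedia_format:
        # DBpedia resource names use underscores instead of blanks.
        local_name = local_name.replace(" ", "_")
    if url_encode:
        # Percent-encode special characters; underscores are left untouched.
        local_name = quote(local_name, safe="_")
    return prefix + local_name


# Prefix as it would be entered in the field "Link to merge with".
PREFIX = "http://dbpedia.org/resource/"

if __name__ == "__main__":
    for company in ["Henkel", "Deutsche Telekom"]:
        print(build_link(PREFIX, company))
```

With both options switched on, the two company names from the text map to the links given above. Switching off the DBpedia link format while keeping URL encoding would percent-encode the blank instead of replacing it.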
We use two specific generators in this example.
The Direct Types generator creates a boolean feature for each direct type. For example, the company Henkel has the (YAGO) type “ChemicalCompanies” in DBpedia, so an attribute for that type is created. It is true for Henkel and for every other company that has this type, and false for the remaining companies.
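The idea of turning types into boolean features can be illustrated with a small Python sketch. It assumes the direct types of each entity have already been retrieved from the endpoint; the function name `direct_type_features` is hypothetical, and apart from Henkel and “ChemicalCompanies” the sample entries are made-up placeholders, not DBpedia facts.

```python
def direct_type_features(types_by_entity):
    """Build one boolean feature per type that occurs for any entity."""
    all_types = sorted({t for types in types_by_entity.values() for t in types})
    return {
        entity: {t: (t in types) for t in all_types}
        for entity, types in types_by_entity.items()
    }


# "ExampleCompany" and "ExampleType" are placeholders for illustration only.
sample_types = {
    "Henkel": {"ChemicalCompanies"},
    "ExampleCompany": {"ExampleType"},
}

if __name__ == "__main__":
    for entity, row in direct_type_features(sample_types).items():
        print(entity, row)
```

Every entity receives a value for every type seen in the dataset, so the resulting table has no missing values for these attributes.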
The Data Properties generator creates features for all data properties, i.e., mostly numeric values. In this example, those include netIncome, assets, and numberOfEmployees.
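A comparable sketch for data properties is shown below. It assumes that the property/value pairs of an entity are already available as strings; the function name `data_property_features`, the property `exampleLabel`, and all values are invented for illustration and are not taken from DBpedia or from the extension.

```python
def data_property_features(pairs):
    """Turn (property, literal) pairs into a feature row (hypothetical helper)."""
    row = {}
    for prop, value in pairs:
        try:
            # Numeric literals become numeric features.
            row[prop] = float(value)
        except ValueError:
            # Everything else is kept as a nominal (string) feature.
            row[prop] = value
    return row


# Made-up literals; real values would come from the SPARQL endpoint.
sample_pairs = [
    ("numberOfEmployees", "1000"),
    ("netIncome", "2.5e6"),
    ("exampleLabel", "some text"),
]

if __name__ == "__main__":
    print(data_property_features(sample_pairs))
```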
The outcome of both generators is joined using the Join operator. From the joined output, we can compute a correlation matrix and examine it for attributes that are correlated with the percentage of women employed in the company, according to our initial question. The correlation matrix is depicted in this picture.
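The join and the correlation step can also be sketched outside RapidMiner. The following Python example is a simplified stand-in for the Join and correlation operators, under the assumptions that both tables are keyed by company, that an inner join is intended, and that Pearson correlation is used. All names (`join_on_key`, `pearson`, `correlations_with`, the column `target`) and all values are hypothetical.

```python
from math import sqrt


def join_on_key(left, right):
    """Inner join of two {key: feature-dict} tables on their shared keys."""
    return {k: {**left[k], **right[k]} for k in left if k in right}


def pearson(xs, ys):
    """Pearson correlation; NaN if undefined (too few points or zero variance)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def correlations_with(table, target):
    """Correlate every other column with the target column."""
    columns = sorted({c for row in table.values() for c in row} - {target})
    result = {}
    for c in columns:
        rows = [r for r in table.values() if c in r and target in r]
        # Booleans are converted to 0.0 / 1.0 so that type features can be used.
        result[c] = pearson([float(r[c]) for r in rows],
                            [float(r[target]) for r in rows])
    return result


# Made-up data: three placeholder companies, one type feature, one numeric feature.
type_features = {"A": {"isTypeX": True}, "B": {"isTypeX": False}, "C": {"isTypeX": True}}
data_features = {
    "A": {"numberOfEmployees": 100.0, "target": 1.0},
    "B": {"numberOfEmployees": 200.0, "target": 2.0},
    "C": {"numberOfEmployees": 300.0, "target": 3.0},
}

if __name__ == "__main__":
    joined = join_on_key(type_features, data_features)
    print(correlations_with(joined, "target"))
```

Only the row of the correlation matrix that belongs to the target attribute is computed here, since that is the row examined in the text.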
The correlation matrix yields several findings about our target attribute (i.e., the percentage of women employed in a company). Examples include:
The first two findings use data from the Data Properties generator (i.e., numerical facts about the entities), while the third uses data from the Direct Types generator (companies have the type car manufacturer or sportswear company).
To wrap up: this example has shown how to add data about entities, such as companies, from DBpedia, using different generators at once.