# Exercise 1: Data Preprocessing

In this exercise we will mainly focus on data preprocessing. Additionally, we will do some basic classification. If you are unfamiliar with the basics of these topics, take a look at the lectures and exercises of [Data Mining I](https://dws.informatik.uni-mannheim.de/en/teaching/courses-for-master-candidates/ie-500-data-mining/).

For a quick reference of how to work with pandas, you can use [this cheat sheet](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

## Task 1: The Data Set
In the following we will work with the Data Mining Cup Data Set of 2010:
- Download the data set from https://www.data-mining-cup.com/reviews/dmc-2010/
- Familiarize yourself with the task and the features
- Use the `pandas`-library to import the training data as a DataFrame
- Have an initial look at the data set. Are the features parsed correctly?

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)  # show up to 100 columns (the full option name avoids ambiguous matches in newer pandas versions)

In [2]:
# Use the pandas library to import the training data.
# Take a look at the training data to find out which of the import methods of pandas fits best.
# (https://pandas.pydata.org/pandas-docs/stable/api.html#input-output)

# --- TODO ---
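
As a hedged illustration of what to look for when checking whether features are parsed correctly, the sketch below reads a small in-memory stand-in for the training file. The column names, the `;` delimiter, and the `?` missing-value marker are assumptions made for this example only; inspect the downloaded file to find the real ones.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded training file. The real file's
# delimiter, column names, and missing-value marker may differ.
raw = io.StringIO(
    "customer_id;order_date;weight;label\n"
    "1;2008-12-01;1.5;0\n"
    "2;2008-12-03;?;1\n"
    "3;2009-01-15;0.7;0\n"
)

sample = pd.read_csv(
    raw,
    sep=";",                     # assumed delimiter
    na_values=["?"],             # assumed missing-value marker
    parse_dates=["order_date"],  # parse date columns during import
)

# dtypes shows at a glance whether numbers and dates were parsed as such
print(sample.dtypes)
print(sample.head())
```

If a numeric column shows up with dtype `object`, an unrecognised missing-value marker or a wrong delimiter is a likely cause.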

## Task 2: Data Visualisation
Now, we inspect the data in order to find out what kinds of problems we need to tackle during preprocessing. Most importantly, we want to answer the following questions:
- Which features have a high correlation with each other and are candidates for removal?
- Which features are the most important ones (i.e. correlate best with the label)?
- What other special characteristics can be found for the features of the data set? (Keep the last lecture in mind!)

In [3]:
# Compute the correlation between features (or between a feature and the label) with pandas.
# Here are some hints for a visualisation of the correlations:
# https://stackoverflow.com/questions/29432629/correlation-matrix-using-pandas

# --- TODO ---
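
The first two questions above could be approached roughly as follows. This is a minimal sketch on synthetic data, not on the DMC data set: the frame `toy`, its column names, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (assumption): "a_copy" is nearly redundant with "a",
# "noise" is unrelated to the label.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),
    "noise": rng.normal(size=n),
})
toy["label"] = (a + rng.normal(scale=0.5, size=n) > 0).astype(int)

corr = toy.corr()  # pairwise Pearson correlation of all numeric columns

# Features ranked by absolute correlation with the label
label_corr = corr["label"].drop("label").abs().sort_values(ascending=False)
print(label_corr)

# Feature pairs with very high mutual correlation (candidates for removal).
# Only the upper triangle is kept so that each pair appears once.
features = corr.drop(index="label", columns="label")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pairs = upper.stack()
redundant = pairs[pairs.abs() > 0.9]  # 0.9 is an arbitrary example threshold
print(redundant)
```

Pearson correlation only captures linear relationships, so a low value does not prove that a feature is useless.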

In [4]:
# What else could be important in the data set? Try to think of topics treated in the lecture!
# Check out the preprocessing-documentation of sklearn for additional ideas:
# https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

# --- TODO ---
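
A few quick checks that often reveal such characteristics are sketched below: the label distribution, the share of missing values per column, and the value ranges of the features. The tiny frame `toy` and its columns are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up example data (assumption), not taken from the DMC data set
toy = pd.DataFrame({
    "price": [10.0, 250.0, np.nan, 40.0, 35.0, 1200.0],
    "items": [1, 2, 1, np.nan, 3, 1],
    "label": [0, 0, 0, 0, 1, 0],
})

# Label distribution: a strongly skewed result indicates class imbalance
class_share = toy["label"].value_counts(normalize=True)
print(class_share)

# Share of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# Value ranges: very different scales may call for normalisation
value_ranges = toy.drop(columns="label").agg(["min", "max"])
print(value_ranges)
```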

## Task 3: Classification
Before we do any preprocessing, we first build our classification pipeline. We then use it during the preprocessing to evaluate whether our modifications have an impact on the performance of the classification.
- Complete the `evaluate_classification` function.
  - Use NaiveBayes and DecisionTree as classification algorithms
  - Use 10-fold cross-validation
  - Print accuracy, precision, recall, and F1-measure
- Use `evaluate_classification` to get some baseline results

In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

In [6]:
# Complete the following function which evaluates the performance of the NaiveBayes and DecisionTree classifiers.

def evaluate_classification(X, y):
    # --- TODO ---
    pass
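
One possible shape for this function is sketched below; it is not the reference solution. It assumes a binary label encoded as 0/1 and uses `cross_validate` instead of the imported `cross_val_score`, because `cross_validate` accepts several metrics at once. The shuffling, the fixed `random_state`, and the returned dictionary are choices made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

SCORING = ["accuracy", "precision", "recall", "f1"]


def evaluate_classification(X, y):
    """Print and return mean 10-fold CV scores for NaiveBayes and DecisionTree."""
    # Stratified folds keep the label distribution similar in every fold
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    classifiers = {
        "NaiveBayes": GaussianNB(),
        "DecisionTree": DecisionTreeClassifier(random_state=42),
    }
    results = {}
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, y, cv=cv, scoring=SCORING)
        # cross_validate stores each metric under the key "test_<metric>"
        results[name] = {m: scores[f"test_{m}"].mean() for m in SCORING}
        print(name, ", ".join(f"{m}={v:.3f}" for m, v in results[name].items()))
    return results


# Usage on synthetic data (assumption: stands in for the real features/label)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_results = evaluate_classification(X_demo, y_demo)
```

Returning the scores in addition to printing them makes it easier to compare preprocessing variants programmatically later on.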

## Task 4: Basic Preprocessing
We start with some initial preprocessing steps. Check whether the following modifications improve your results:
- Experiment with different feature sets (i.e. only use some features, or discard some irrelevant features)
- Tackle the problem of imbalanced label distributions (you might need an adapted version of `evaluate_classification`)
- Impute missing values using the methods of the lecture (default, min, max, avg, ...)

In [8]:
# Compute the performance metrics for feature subsets. Pick the features based on insights from the Data Visualisation.

# --- TODO ---

In [9]:
# Create an adapted version of 'evaluate_classification' to tackle the problem of imbalanced label distributions.
# You can either balance the training data within each cross-validation fold (which is cumbersome)
# or check whether the classification algorithms have their own mechanisms for dealing with imbalanced labels.

# --- TODO ---
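
For the second option, scikit-learn offers built-in knobs: `DecisionTreeClassifier` accepts `class_weight="balanced"`, and `GaussianNB` accepts explicit `priors`. The sketch below compares these settings on synthetic imbalanced data; the data, the `max_depth=5` limit, and the uniform priors are assumptions for illustration, and whether the scores actually improve depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 90% negatives and 10% positives (assumption)
X_imb, y_imb = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
candidates = {
    "DecisionTree (default)": DecisionTreeClassifier(max_depth=5, random_state=42),
    "DecisionTree (balanced)": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=42
    ),
    "NaiveBayes (data priors)": GaussianNB(),
    "NaiveBayes (uniform priors)": GaussianNB(priors=[0.5, 0.5]),
}

imbalance_results = {}
for name, clf in candidates.items():
    scores = cross_validate(clf, X_imb, y_imb, cv=cv, scoring=["recall", "f1"])
    imbalance_results[name] = {
        "recall": scores["test_recall"].mean(),
        "f1": scores["test_f1"].mean(),
    }
    print(name, imbalance_results[name])
```

Accuracy is deliberately left out here, since it can look good on imbalanced data even when the minority class is rarely predicted.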

In [10]:
# Are there missing values in the data set? Try to impute them using different mechanisms.
# Imputation with pandas: https://pandas.pydata.org/pandas-docs/stable/missing_data.html

# --- TODO ---
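
The imputation strategies named in the task (default, min, max, avg) can be expressed with `DataFrame.fillna`, as in this minimal sketch. The frame `toy` and the default value 0 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up example data (assumption) with one missing value per column
toy = pd.DataFrame({
    "weight": [1.0, np.nan, 3.0, 8.0],
    "points": [0.0, 5.0, np.nan, 5.0],
})

# Passing a Series to fillna fills each column with its own statistic
strategies = {
    "default": toy.fillna(0),      # fixed default value (0 is an arbitrary choice)
    "min": toy.fillna(toy.min()),
    "max": toy.fillna(toy.max()),
    "avg": toy.fillna(toy.mean()),
}

for name, imputed in strategies.items():
    print(name)
    print(imputed)
```

Note that statistics computed on the full data set also use rows that later end up in test folds. Wrapping `sklearn.impute.SimpleImputer` in a `Pipeline` would compute them on the training folds only.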

## Task 5: Feature Generation
Now we generate additional features to improve the classification results. Try to find features that are useful in this shopping scenario. Check whether the generated features improve the performance of your classification.
- Generate new features from the existing date features (ask yourself which times/days/months you usually shop)
- Apply PCA (Principal Component Analysis) to transform the feature space

In [11]:
# Generate features from existing date features. Try to think of features that define the actual shopping behavior.
# Use pandas to deal with dates: https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties

# --- TODO ---
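
The `.dt` accessor of a datetime column exposes the properties needed for such features. The sketch below derives a few candidates from two hypothetical date columns; the column names `order_date` and `delivery_date` and the dates themselves are invented for this example.

```python
import pandas as pd

# Hypothetical date columns (assumption); the real data set uses other names
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2008-12-24", "2009-01-03", "2009-03-16"]),
    "delivery_date": pd.to_datetime(["2008-12-29", "2009-01-05", "2009-03-20"]),
})

# Calendar features of a single date
orders["order_weekday"] = orders["order_date"].dt.dayofweek  # Monday is 0
orders["order_month"] = orders["order_date"].dt.month
orders["order_is_weekend"] = orders["order_weekday"] >= 5

# Difference between two dates, expressed in whole days
orders["delivery_days"] = (orders["delivery_date"] - orders["order_date"]).dt.days

print(orders)
```

Differences between two dates are often more informative for a classifier than the raw dates themselves, which most algorithms cannot use directly.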

In [12]:
# Transform the feature space using PCA and check whether the results improve.
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

# --- TODO ---
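
A hedged sketch of how PCA could be combined with a classifier: the pipeline below standardises the features first (PCA is sensitive to feature scales), keeps enough components to explain 95% of the variance, and evaluates the result with cross-validation. The synthetic data, the 95% threshold, and the choice of `GaussianNB` with the F1 score are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Inside a pipeline, scaler and PCA are fitted on the training folds only
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),  # keep 95% of the variance
    GaussianNB(),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
pca_scores = cross_val_score(pca_model, X_demo, y_demo, cv=cv, scoring="f1")
print("mean F1 with PCA:", pca_scores.mean())

# Fit once on all data to see how many components were kept
pca_model.fit(X_demo, y_demo)
n_kept = pca_model.named_steps["pca"].n_components_
print("components kept:", n_kept)
```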

## Task 6: Optimisation
Finally, try to optimise your results by tweaking individual parts of the classification process. How much better can the results get?

In [13]:
# OPEN-ENDED QUESTION - try to improve your results as much as possible!
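
One of many possible starting points is a hyperparameter search. The sketch below runs `GridSearchCV` over a small, arbitrarily chosen grid for the decision tree; the synthetic data, the grid values, and the F1 score as the optimisation target are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small example grid; useful ranges depend on the actual data set
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean F1:", search.best_score_)
```

The best score found this way is optimistically biased, because the same folds are used for selecting and for evaluating the parameters.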