# Exercise 4: Ensembles

This exercise is about ensembles. To get familiar with ensembles in scikit-learn, refer to the respective [part in the documentation](https://scikit-learn.org/stable/modules/ensemble.html).

## Task 1: Warm-up
To get you started with ensemble learning, have a look at the `dart` data set that is provided in ILIAS. It contains the positions of darts thrown by four different people. First, we learn two basic classifiers on this data set. In a second step, we apply stacking to improve the performance.

- Train a k-NN classifier (with `n_neighbors=1`) and an SVM (with `C=5`), and check their accuracy on the test set.
- Inspect the classifications on the training data. What can you say about the decision boundaries?
- Combine the two classifiers by stacking (use a decision tree as meta learner). Does it improve the accuracy on the test set?
- Create a new attribute `distance_from_centre`. How does the accuracy change?

<sub>This task was originally published in https://gormanalysis.com/guide-to-model-stacking-i-e-meta-ensembling/</sub>
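The stacking and feature-engineering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the reference solution: it uses synthetic data instead of the `dart` files, the column names `x` and `y` are hypothetical, and the board centre is assumed to be at (0, 0). It relies on `sklearn.ensemble.StackingClassifier`, which is available in scikit-learn 0.22 and later.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the dart data: four throwers, each aiming at a different spot.
centres = np.array([[0.5, 0.5], [-0.5, 0.5], [-0.5, -0.5], [0.5, -0.5]])
labels = rng.integers(0, 4, size=400)
points = centres[labels] + rng.normal(scale=0.4, size=(400, 2))
X = pd.DataFrame(points, columns=["x", "y"])  # hypothetical column names
y = pd.Series(labels, name="thrower")

# Engineered attribute: Euclidean distance from the board centre (assumed to be (0, 0)).
X["distance_from_centre"] = np.hypot(X["x"], X["y"])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Base classifiers as in the task; the decision tree acts as meta learner.
base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=1)),
    ("svm", SVC(C=5)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=DecisionTreeClassifier(random_state=0),
    cv=5,  # the meta learner is trained on out-of-fold predictions of the base learners
)
stack.fit(X_tr, y_tr)

stack_accuracy = accuracy_score(y_te, stack.predict(X_te))
print(f"stacking accuracy: {stack_accuracy:.3f}")
```

Training the meta learner on out-of-fold predictions (the `cv` argument) matters here because a 1-NN classifier reproduces its own training labels perfectly, so in-sample predictions would tell the meta learner nothing about how reliable the base classifiers are.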

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_train = pd.read_csv('dart/dart_train.csv', sep=',').drop(columns='ID')
X_train, y_train = df_train.drop(columns='Competitor'), df_train['Competitor']

df_test = pd.read_csv('dart/dart_test.csv', sep=',').drop(columns='ID')
X_test, y_test = df_test.drop(columns='Competitor'), df_test['Competitor']

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train a k-NN and an SVM and check their performance on the test set.

# --- TODO ---

In [3]:
# Inspect the classifications on the training data (using a scatter plot).

# --- TODO ---

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

# Combine the two classifiers using stacking with a decision tree as meta learner. Measure the accuracy on the test set.
# Stacking: scikit-learn provides sklearn.ensemble.StackingClassifier since version 0.22. With older versions, you can use mlens -> http://ml-ensemble.com/
# HINT: mlens can only work with numerical labels. You can use the LabelEncoder to transform your labels.

# --- TODO ---

In [5]:
# Create the attribute 'distance_from_centre' and measure the accuracy on the test set (for base classifiers and stacking)

# --- TODO ---

## Task 2: Data Mining Cup 2006

Now we apply ensembles to the data set of the Data Mining Cup 2006, which you already know from the second exercise. The task is to predict the attribute `gms_greater_avg` as precisely as possible. We again use accuracy as the performance metric.

Please make sure to understand the principles of Voting, Bagging, Boosting, and Stacking before you work on this task.
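As a small reminder of the voting principle, the following sketch contrasts hard (majority) voting with soft (probability-averaging) voting for a single sample. The three probability vectors are made up for illustration and do not come from any trained model.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (rows: classifiers, columns: classes 0 and 1).
probs = np.array([
    [0.90, 0.10],  # classifier 1 is very sure about class 0
    [0.40, 0.60],  # classifier 2 slightly prefers class 1
    [0.45, 0.55],  # classifier 3 slightly prefers class 1
])

# Hard voting: every classifier casts one vote for its most likely class.
hard_votes = probs.argmax(axis=1)
hard_winner = np.bincount(hard_votes, minlength=probs.shape[1]).argmax()

# Soft voting: average the probabilities first, then pick the most likely class.
soft_winner = probs.mean(axis=0).argmax()

print("hard voting picks class", hard_winner)
print("soft voting picks class", soft_winner)
```

In this example the two schemes disagree: two of three classifiers vote for class 1, but the averaged probabilities favour class 0 because the first classifier is much more confident than the other two.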

- Build a baseline with several classifiers (like k-NN, Decision Tree, SVM, Naive Bayes, Neural Network, ...).
- Use Bagging to improve the performance of the classifiers. How do they perform compared to a Random Forest?
- Try out boosting techniques (like AdaBoost and XGBoost). Do they work better?
- Finally, try to get the best results using various kinds of ensembles. You could try to combine several classifiers with Voting or Stacking!
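One possible way to organise such a comparison is to collect the models in a dictionary and evaluate them in a loop. The sketch below does this on synthetic data from `make_classification`, not on the DMC 2006 data, and all hyperparameters are arbitrary choices for illustration. XGBoost is left out because it lives in a separate package; its `XGBClassifier` follows the same `fit`/`predict` interface and could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem as a stand-in for the real data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "decision tree (baseline)": DecisionTreeClassifier(random_state=42),
    # Bagging: many trees, each trained on a bootstrap sample of the training set.
    "bagged decision trees": BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42
    ),
    # Random forest: bagged trees plus random feature subsets at each split.
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # Boosting: weak learners trained sequentially on reweighted samples.
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
    # Voting: average the predicted probabilities of heterogeneous classifiers.
    "soft voting": VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier()),
            ("nb", GaussianNB()),
            ("tree", DecisionTreeClassifier(random_state=42)),
        ],
        voting="soft",
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name:<26} accuracy = {scores[name]:.3f}")
```

A single train/test split gives noisy estimates; for a fairer ranking, `sklearn.model_selection.cross_val_score` could replace the manual `fit`/`predict` step.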

In [6]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('dmc2006/dmc2006_train.txt', sep='\t', encoding='cp1252').drop(columns=['auct_id', 'gms', 'listing_title', 'listing_subtitle', 'listing_start_date', 'listing_end_date'])
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='gms_greater_avg'), df['gms_greater_avg'], test_size=.3, random_state=42)

In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Make baseline predictions for the classifiers k-NN, Decision Tree, SVM, Naive Bayes, Neural Network

# --- TODO ---

In [8]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

# Use Bagging to improve the performance. Additionally, check how a Random Forest classifier performs.

# --- TODO ---

In [9]:
from sklearn.ensemble import AdaBoostClassifier

# Try out AdaBoost and XGBoost
# XGBoost: scikit-learn does not implement XGBoost. Use this -> https://xgboost.readthedocs.io/en/latest/index.html

# --- TODO ---

In [10]:
# Now try to improve your results by applying other ensemble techniques (e.g. Voting, Stacking)
# or anything else that comes to your mind. How much can you improve the results?

# OPEN-ENDED QUESTION - try to improve your results as much as possible!