# Exercise 3: Anomaly Detection

This exercise focuses on anomaly detection. To get familiar with anomaly detection in scikit-learn, refer to the relevant [section of the documentation](https://scikit-learn.org/stable/modules/outlier_detection.html).

## Task 1: Get to know the techniques
In this task we use an artificial data set with only two features to experiment with anomaly detection techniques. In addition to the "correct" data points that form clusters, there are 27 random data points. Use the techniques from scikit-learn to spot the outliers.
- Visualise and inspect the data. Can you already see the outliers?
- Use the following techniques to spot outliers and visualise the result: LocalOutlierFactor, OneClassSVM, IsolationForest
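
As a starting point, the three detectors listed above can be tried out along the following lines. This is only a sketch: it uses synthetic stand-in data (two Gaussian clusters plus 27 uniformly scattered points) instead of `artificial.txt`, and the parameter values (`n_neighbors=20`, `nu` and `contamination` set to the share of noise points) are assumptions to be tuned, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the exercise data (assumption): two Gaussian
# clusters plus 27 uniformly scattered points.
rng = np.random.default_rng(0)
inliers = np.vstack([
    rng.normal(loc=(-2.0, -2.0), scale=0.3, size=(150, 2)),
    rng.normal(loc=(2.0, 2.0), scale=0.3, size=(150, 2)),
])
noise = rng.uniform(low=-4.0, high=4.0, size=(27, 2))
X_demo = np.vstack([inliers, noise])

# Share of scattered points; passed to each detector as a hint.
outlier_share = len(noise) / len(X_demo)

detectors = {
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=outlier_share),
    "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=outlier_share),
    "IsolationForest": IsolationForest(contamination=outlier_share, random_state=0),
}

# fit_predict returns +1 for inliers and -1 for outliers.
labels = {name: det.fit_predict(X_demo) for name, det in detectors.items()}

for name, pred in labels.items():
    print(f"{name}: {np.sum(pred == -1)} points flagged as outliers")
```

To visualise a result, the predicted labels can be passed as colours to a scatter plot, for example `plt.scatter(X_demo[:, 0], X_demo[:, 1], c=labels["IsolationForest"])`.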

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pass the separator by keyword; recent pandas versions no longer accept it positionally.
df = pd.read_csv('artificial.txt', sep='\t').drop(columns='ID')

In [2]:
# Visualise the data with a scatter plot. Can you spot the outliers?

# --- TODO ---

In [3]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

# Use the anomaly detection methods of scikit-learn to automatically detect the outliers.
# You might want to vary the parameters of the techniques for better results. Visualise the results!

# --- TODO ---

## Task 2: Evaluation of anomaly detection

In this task we work with a data set about breast cancer. It contains various features from cancer diagnostics. The majority of the examples come from non-cancer patients (label 'B'), and 20 examples come from patients with cancer (label 'M'). Because of their low frequency, the cancer examples can be treated as outliers and we can use the previously mentioned techniques to detect them.

- Apply the techniques learned in the previous task to this data set.
- Use a ROC curve to visualise the performance of every technique (outliers should be treated as the 'true' class).
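
A ROC curve needs a continuous score per sample rather than the hard -1/+1 labels. The sketch below shows one way to obtain such scores and feed them to `roc_curve`. It runs on synthetic stand-in data (300 'B' points and 20 widely scattered 'M' points, both assumptions), not on `breast_cancer_outliers.csv`, and the detector parameters are illustrative. Note that scikit-learn's anomaly scores are higher for *normal* points, so they are negated here to make the outliers the positive class.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic stand-in (assumption): a compact majority class 'B' and a
# small, widely scattered class 'M'.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 5)),
    rng.normal(0.0, 4.0, size=(20, 5)),
])
y_demo = np.array(["B"] * 300 + ["M"] * 20)

# Outliers ('M') are the positive class.
y_true = (y_demo == "M").astype(int)

# Negate the scores so that a higher value means "more anomalous".
scores = {
    "LocalOutlierFactor": -LocalOutlierFactor(n_neighbors=20).fit(X_demo).negative_outlier_factor_,
    "OneClassSVM": -OneClassSVM(gamma="scale", nu=0.1).fit(X_demo).score_samples(X_demo),
    "IsolationForest": -IsolationForest(random_state=0).fit(X_demo).score_samples(X_demo),
}

roc = {}
for name, score in scores.items():
    fpr, tpr, _ = roc_curve(y_true, score)
    roc[name] = (fpr, tpr, roc_auc_score(y_true, score))
    print(f"{name}: AUC = {roc[name][2]:.3f}")
```

Each `(fpr, tpr)` pair can then be drawn with `plt.plot(fpr, tpr, label=name)`. On the real data, scaling the features first (for example with `StandardScaler`) may matter, particularly for the one-class SVM.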

In [4]:
df = pd.read_csv('breast_cancer_outliers.csv', sep=';').drop(columns='id')
X, y = df.drop(columns='class'), df['class']

In [5]:
# Apply the anomaly detection techniques and draw a ROC curve of their performance.
# How to work with ROC-curves in scikit-learn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

# --- TODO ---

## Task 3: Anomaly detection for preprocessing

We want to use anomaly detection to exclude outliers from the training of a classifier in order to improve its performance. We use a modified version of the Iris data set that contains data-entry errors. After training a simple classifier for the task, we try to improve it by removing outliers (to simplify the task, outliers are marked in the training set).

- Train a decision tree classifier on the training set and evaluate its performance on the test set
- Try to improve the performance by removing outliers during the _training phase_
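
The two steps above could look roughly as follows. This is a sketch under several assumptions: it uses scikit-learn's bundled Iris data with artificially corrupted training rows instead of the `iris_shuffled_*.csv` files, the helper name `evaluate_tree` is hypothetical, macro-averaging is one possible choice for the multi-class metrics, and `IsolationForest` with `contamination=0.1` is just one of the detectors that could be used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_tree(X_tr, y_tr, X_te, y_te):
    """Fit a decision tree and return macro-averaged precision, recall and F1."""
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return {
        "precision": precision_score(y_te, pred, average="macro", zero_division=0),
        "recall": recall_score(y_te, pred, average="macro"),
        "f1": f1_score(y_te, pred, average="macro"),
    }


iris = load_iris(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

# Simulate data-entry errors (assumption): scale 10 training rows by 10.
rng = np.random.default_rng(0)
bad_rows = rng.choice(len(X_tr), size=10, replace=False)
X_tr = X_tr.copy()
X_tr.iloc[bad_rows] = X_tr.iloc[bad_rows].to_numpy() * 10.0

baseline = evaluate_tree(X_tr, y_tr, X_te, y_te)

# Keep only the training rows the detector considers inliers (+1).
# The test set is left untouched.
keep = IsolationForest(contamination=0.1, random_state=0).fit_predict(X_tr) == 1
filtered = evaluate_tree(X_tr[keep], y_tr[keep], X_te, y_te)

print("all training rows:", baseline)
print("outliers removed: ", filtered)
```

Whether the filtered model actually scores better depends on the data and the detector, so the two results should be compared rather than assumed. Since the exercise marks outliers in the `is_outlier` column, the detector's choices can also be checked against that column.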

In [7]:
df_train = pd.read_csv('iris_shuffled_train.csv', sep=';').drop(columns='id')
X_train, y_train = df_train.drop(columns=['label', 'is_outlier']), df_train['label']
df_test = pd.read_csv('iris_shuffled_test.csv', sep=';').drop(columns='id')
X_test, y_test = df_test.drop(columns='label'), df_test['label']

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Complete this function which trains a decision tree on the training set and evaluates it on the test set

def evaluate_decision_tree(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series):
    # --- TODO ---
    pass

evaluate_decision_tree(X_train, y_train, X_test, y_test)

In [9]:
# Now change df_train by removing outliers with the techniques given in the previous tasks. Does this improve the results?

# --- TODO ---

## Task 4: Learn to recognise what you know!

In this final task, we assume that our training data does not contain any outliers. We therefore want to learn a model that represents this data as well as possible in order to find outliers in the test data set. We apply a one-class SVM to the Shuttle data set.

- Load the training data from `shuttle_train.csv` and the test data from `shuttle_test.csv`
- Learn a one-class SVM that represents the training data as well as possible (you can use cross-validation to evaluate your model)
- Apply your model to the test data which contains outliers. Are you satisfied with the performance?
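
One way to read the cross-validation hint above: since the training data is assumed to be clean, a model can be judged by how many held-out training points it accepts as inliers. The sketch below illustrates this on synthetic stand-in data, not the Shuttle files. The helper `inlier_acceptance`, the candidate `nu` values and the `StandardScaler` preprocessing step are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in (assumption): clean training data, and test data
# that mixes 100 normal points with 20 far-away outliers.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(0.0, 1.0, size=(400, 3))
X_test_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 3)),
    rng.uniform(5.0, 8.0, size=(20, 3)),
])
y_test_demo = np.array([1] * 100 + [-1] * 20)  # +1 inlier, -1 outlier


def make_model(nu):
    """Scale the features, then fit an RBF one-class SVM."""
    return make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", gamma="scale", nu=nu))


def inlier_acceptance(nu, X, n_splits=5):
    """Mean fraction of held-out (clean) points that the model accepts."""
    rates = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X):
        model = make_model(nu).fit(X[train_idx])
        rates.append(np.mean(model.predict(X[val_idx]) == 1))
    return float(np.mean(rates))


cv_rates = {nu: inlier_acceptance(nu, X_train_demo) for nu in (0.01, 0.05, 0.1)}
best_nu = max(cv_rates, key=cv_rates.get)

# Refit on all training data and apply the model to the test set.
pred = make_model(best_nu).fit(X_train_demo).predict(X_test_demo)
inlier_recall = float(np.mean(pred[y_test_demo == 1] == 1))
outlier_recall = float(np.mean(pred[y_test_demo == -1] == -1))

print("CV inlier acceptance per nu:", cv_rates)
print(f"test inlier recall: {inlier_recall:.2f}, outlier recall: {outlier_recall:.2f}")
```

Selecting `nu` by inlier acceptance alone tends to favour very permissive models, because no outliers are available during training to penalise them. The check on the test data is therefore the part that shows whether outliers are actually caught.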

In [10]:
# Import the data set

# --- TODO ---

In [11]:
# Learn a one-class SVM model for the training data and evaluate its performance

# --- TODO ---

In [12]:
# Apply your model to the test data to find the outliers

# --- TODO ---