{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 2: Regression\n",
"\n",
"This exercise focuses on regression. We start with a simple example to get used to the methods and then move on to larger data sets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 1: Predicting Auction Prices\n",
"In this task we work with the auction data set, which you can find on ILIAS. The data set has three columns: the price, the age of the clock, and the number of bidders. Try to predict the price of the offered clocks from the other two columns by following these steps:\n",
"- Visualise and inspect the data. Can you already guess which attributes are good predictors?\n",
"- Measure the performance of a linear regression using cross-validation and the root mean squared error (RMSE)\n",
"- Apply other, more sophisticated regression models to the problem and evaluate their performance"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"df = pd.read_csv('auction.txt', sep='\\t')\n",
"X, y = df.drop(columns='Price'), df['Price']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Visualise and inspect the data as learned in the previous exercise.\n",
"\n",
"# --- TODO ---"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_validate\n",
"\n",
"# Complete this function, which evaluates each regression estimator on a data set with cross-validation\n",
"# and prints its RMSE score (see the next cell for an example of how this function will be called).\n",
"# 'estimators' maps a display name to an estimator instance.\n",
"def evaluate_regression(X: pd.DataFrame, y: pd.Series, estimators: dict):\n",
"    # --- TODO ---\n",
"    pass"
]
},
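{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cross-validation and RMSE measurement asked for in this task could look roughly like the sketch below. It runs on synthetic data rather than the auction data. The helper name `cv_rmse`, the choice of five folds, and the synthetic data are assumptions made for illustration and are not part of the exercise.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_validate\n",
"\n",
"\n",
"def cv_rmse(estimator, X, y, cv=5):\n",
"    # Hypothetical helper: returns one RMSE value per fold.\n",
"    # scikit-learn maximises scores, so the MSE is reported negated.\n",
"    result = cross_validate(estimator, X, y, cv=cv,\n",
"                            scoring='neg_mean_squared_error')\n",
"    return np.sqrt(-result['test_score'])\n",
"\n",
"\n",
"# Synthetic data (assumption): y depends linearly on two features plus noise.\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(100, 2))\n",
"y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=100)\n",
"\n",
"fold_rmse = cv_rmse(LinearRegression(), X_demo, y_demo)\n",
"print('RMSE per fold:', fold_rmse.round(3))\n",
"print('Mean RMSE:', fold_rmse.mean().round(3))\n",
"```\n",
"\n",
"Taking the square root per fold and then averaging is one of several reasonable conventions. Whichever you pick, apply it consistently so that models remain comparable."
]
},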
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"\n",
"def evaluate_linreg(X: pd.DataFrame, y: pd.Series):\n",
"    evaluate_regression(X, y, {'Linear Regression': linear_model.LinearRegression()})\n",
"\n",
"evaluate_linreg(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Now apply more sophisticated regression models to the problem (use the 'evaluate_regression' function!)\n",
"# HINT - You can take a look at the sklearn documentation for Supervised Learning to find appropriate regression models:\n",
"# https://scikit-learn.org/stable/supervised_learning.html\n",
"\n",
"# --- TODO ---"
]
},
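{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of comparing several regression models, the sketch below evaluates three scikit-learn estimators on a synthetic, non-linear target. The selection of models, their hyperparameters, and the data are assumptions chosen for the example. Which models suit the auction data is for you to find out.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.linear_model import LinearRegression, Ridge\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Synthetic data (assumption): the target is a non-linear function of two features.\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(-3, 3, size=(200, 2))\n",
"y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=200)\n",
"\n",
"# Same shape as the 'estimators' argument used above: display name -> estimator.\n",
"estimators = {\n",
"    'Linear Regression': LinearRegression(),\n",
"    'Ridge': Ridge(alpha=1.0),\n",
"    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),\n",
"}\n",
"\n",
"rmse_by_model = {}\n",
"for name, estimator in estimators.items():\n",
"    # The scorer returns negated MSE values, one per fold.\n",
"    mse = -cross_val_score(estimator, X_demo, y_demo, cv=5,\n",
"                           scoring='neg_mean_squared_error')\n",
"    rmse_by_model[name] = np.sqrt(mse).mean()\n",
"    print(name, 'mean RMSE:', round(rmse_by_model[name], 3))\n",
"```\n",
"\n",
"On this synthetic target the linear models cannot capture the curvature, so a tree ensemble scores better. On a small, nearly linear data set the ranking may well be reversed."
]
},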
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 2: Predicting Car Prices\n",
"\n",
"Next we work with car data from [this data set](http://archive.ics.uci.edu/ml/datasets/Automobile). You can either download it from ILIAS or fetch it directly from the website. The goal is to predict the price of a car from its remaining attributes.\n",
"\n",
"- Import the data (use reasonable data types and check that missing values are imported correctly)\n",
"- Apply the regression workflow from the previous task, together with the preprocessing methods from the previous exercise"
]
},
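{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the import step, here is a small sketch that reads a header-less, comma-separated sample from memory. The two rows are made up, the column subset is shortened, and the use of `?` as the missing-value marker is an assumption that you should confirm by looking at the raw file.\n",
"\n",
"```python\n",
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Two made-up rows (assumption) standing in for the real file.\n",
"sample = io.StringIO('''audi,gas,four,109.1,?\n",
"bmw,diesel,?,101.2,16430\n",
"''')\n",
"\n",
"columns = ['make', 'fuel-type', 'num-of-doors', 'wheel-base', 'price']\n",
"\n",
"# names= supplies the missing header; na_values= turns the marker into NaN,\n",
"# which lets pandas infer a numeric dtype for 'price'.\n",
"cars = pd.read_csv(sample, names=columns, na_values='?')\n",
"\n",
"# Store text attributes as categoricals instead of generic objects.\n",
"categorical = ['make', 'fuel-type', 'num-of-doors']\n",
"cars = cars.astype({col: 'category' for col in categorical})\n",
"\n",
"print(cars.dtypes)\n",
"print(cars.isna().sum())\n",
"```\n",
"\n",
"Without `na_values`, a column that contains the marker would be read as text, which is a quick way to spot missing values that were not imported correctly."
]
},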
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the car data set. Are the chosen data types reasonable? Are missing values imported correctly?\n",
"\n",
"column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']\n",
"\n",
"# --- TODO ---"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Apply the regression workflow to this data set. You may want to reuse the previously defined evaluation functions\n",
"# Do preprocessing methods of the previous exercise improve the results of your model? What about new attributes?\n",
"\n",
"# --- TODO ---"
]
},
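{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to combine preprocessing with the regression workflow is a scikit-learn pipeline, sketched below on made-up data. The column names, the imputation strategy, and the choice of encoder are assumptions for the example, not a prescribed solution.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"\n",
"# Made-up data (assumption): one numeric and one categorical attribute.\n",
"rng = np.random.RandomState(0)\n",
"n = 120\n",
"demo = pd.DataFrame({\n",
"    'engine_size': rng.uniform(60, 300, size=n),\n",
"    'fuel': rng.choice(['gas', 'diesel'], size=n),\n",
"})\n",
"price = (50.0 * demo['engine_size']\n",
"         + 2000.0 * (demo['fuel'] == 'diesel')\n",
"         + rng.normal(0, 500, size=n))\n",
"demo.loc[::10, 'engine_size'] = np.nan  # simulate missing values\n",
"\n",
"# Numeric columns: impute, then scale. Categorical columns: one-hot encode.\n",
"preprocess = ColumnTransformer([\n",
"    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()),\n",
"     ['engine_size']),\n",
"    ('cat', OneHotEncoder(handle_unknown='ignore'), ['fuel']),\n",
"])\n",
"model = make_pipeline(preprocess, LinearRegression())\n",
"\n",
"mse = -cross_val_score(model, demo, price, cv=5,\n",
"                       scoring='neg_mean_squared_error')\n",
"rmse = np.sqrt(mse)\n",
"print('Mean RMSE:', rmse.mean().round(1))\n",
"```\n",
"\n",
"Because the preprocessing lives inside the pipeline, it is refitted on the training part of every fold, so no information from the held-out fold leaks into the imputation or scaling."
]
},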
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 3: DMC 2006 Auction Prices\n",
"\n",
"Let's get serious: The [Data Mining Cup 2006 task](https://www.data-mining-cup.com/reviews/dmc-2006/) is again about auction prices. It was not originally designed as a regression task, but we will still try to predict the GMS (price) for each auction article.\n",
"\n",
"- Read the description of the data set and import it correctly\n",
"- Apply the methods of the previous task to get a competitive RMSE score\n",
"- In addition to the previous methods, think about different strategies for attribute selection to improve your results!\n",
"\n",
"_Hint: The encoding of the data set is `cp1252`._"
]
},
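{
"cell_type": "markdown",
"metadata": {},
"source": [
"Attribute selection can be explored in many ways. The sketch below shows one of them, univariate selection with `SelectKBest`, on synthetic data in which only two of thirty attributes carry signal. The data, the value of `k`, and the helper name `mean_cv_rmse` are assumptions for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.feature_selection import SelectKBest, f_regression\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Synthetic data (assumption): 30 attributes, only the first two matter.\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.normal(size=(60, 30))\n",
"y_demo = 4.0 * X_demo[:, 0] - 4.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=60)\n",
"\n",
"\n",
"def mean_cv_rmse(model):\n",
"    # Hypothetical helper: mean RMSE over five cross-validation folds.\n",
"    mse = -cross_val_score(model, X_demo, y_demo, cv=5,\n",
"                           scoring='neg_mean_squared_error')\n",
"    return np.sqrt(mse).mean()\n",
"\n",
"\n",
"rmse_all = mean_cv_rmse(LinearRegression())\n",
"# Selection sits inside the pipeline, so it only sees each fold's training part.\n",
"rmse_selected = mean_cv_rmse(\n",
"    make_pipeline(SelectKBest(f_regression, k=2), LinearRegression()))\n",
"\n",
"print('All attributes:     ', round(rmse_all, 3))\n",
"print('Two best attributes:', round(rmse_selected, 3))\n",
"```\n",
"\n",
"Selecting attributes on the full data set before cross-validating would make the score look better than it really is, which is why the selector is part of the pipeline here."
]
},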
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPEN-ENDED QUESTION - try to improve your results as much as possible!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}