"\u001b[1;32m~\\AppData\\Local\\Temp/ipykernel_14112/2328129668.py\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mgroupby\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"smoker\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"expenses\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mplot\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkind\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'bar'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"We already have a converted dataframe with only integer numbers which is great. Furthermore, we need to separate our\n",
" dataframe into 2 parts, where the first part is the dataframe excluding the value we want to predict(charges) with the given data.\n",
" And the other part is just the output value."
"## Preprocessing\n",
"We already have a converted dataframe with only integer numbers which is great.\n",
"Next we will define our prediction target and the features which will be used for the prediction.\n",
"We will also scale the features so that a feature which has a larger number does not have more of an impact than features with smaller numbers. For this we will use the MinMaxScaler."
"The last preprocessing task is to scale our data. We are scaling the data because the algorithm can't work well with different scales of numbers per column. We are using the MinMaxScaler for that which will scale all numerical data between 0 and 1."
"We will split the data into a training and a test dataset. The training dataset is used for training the model and the test dataset is used to verify the predictions.\n",
"If we didn't split the data, we wouldn't be able to assure that our model is actually working and not just overfitting our training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": 15,
"id": "3bf48dc4",
"metadata": {},
"outputs": [],
"source": [
"pipeline = Pipeline([\n",
" ('minmax', MinMaxScaler()),\n",
"])"
"train, test = sklearn.model_selection.train_test_split(df,test_size=0.1,random_state=42)"
In this notebook we will take a look at an example of linear and polynomial regression.
We will do some preprocessing to see how the typical problems of a regression task can be solved.
Afterwards we will train the algorithm and try to find the best possible polynomial degree, i.e. the one with the lowest RMSE, which results in better predictions.
The next step is for demonstration purposes only. Many datasets are incomplete, so we want to show how missing data can be handled. Since our dataset has no missing values, we will intentionally drop some.
%% Cell type:code id: tags:
``` python
# overwrite the first 50 age values with NaN to simulate missing data
for x in range(50):
    df.loc[x, "age"] = np.nan
```
%% Cell type:markdown id: tags:
As we can see below, the column `age` now has only 1288 non-null values, which means 50 values are missing.
%% Cell type:code id: tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1288 non-null float64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB
%% Cell type:markdown id: tags:
If we want to be able to do the linear regression, we have two options.
The first is to drop all columns with non-numerical values. The second and much better option, which we also use below, is to convert the categorical data into numbers.
We will also fill the missing values with the mean of their column; otherwise we could not proceed with the linear regression, because the algorithm cannot work with NaN values.
%% Cell type:code id: tags:
``` python
df["age"]=df["age"].fillna(df["age"].mean())
df["sex"]=df["sex"].map({'male':0,'female':1})
df["smoker"]=df["smoker"].map({'no':0,'yes':1})
df["region"]=df["region"].map({'southeast':0,
'southwest':1,'northeast':2,'northwest':3})
df.head()
```
%% Output
age sex bmi children smoker region charges
0 39.29736 1 27.900 0 1 1 16884.92400
1 39.29736 0 33.770 1 0 0 1725.55230
2 39.29736 0 33.000 3 0 0 4449.46200
3 39.29736 0 22.705 0 0 3 21984.47061
4 39.29736 0 28.880 0 0 3 3866.85520
%% Cell type:code id: tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null float64
1 sex 1338 non-null int64
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null int64
5 region 1338 non-null int64
6 charges 1338 non-null float64
dtypes: float64(3), int64(4)
memory usage: 73.3 KB
%% Cell type:markdown id: tags:
Now that our data has been converted and no longer contains missing values, we will plot the mean of our dependent variable `charges` as a function of each independent variable.
The reason for these plots is to get an understanding of the data we are working with and to figure out which features matter most for the regression algorithm.
The columns `age`, `smoker` and `bmi` have a huge impact on our dependent variable, while `region` and `children` have less of an impact. The column `sex` doesn't seem to have much impact on `charges` at all and is not necessary, which is why we drop it below.
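%% Cell type:markdown id: tags:
As a sketch, one of these plots can be produced with a group-by over a feature; the same pattern applies to the other columns. The column names are taken from the dataframe above:
%% Cell type:code id: tags:
``` python
# mean of the dependent variable per smoker category;
# repeat with the other feature columns to reproduce the remaining plots
df.groupby("smoker")["charges"].mean().plot(kind='bar')
```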
%% Cell type:markdown id: tags:
## Preprocessing
We already have a converted dataframe containing only numerical values, which is great. Next we will define our prediction target and the features that will be used for the prediction.
The last preprocessing task is to scale our data, because the algorithm can't deal well with differently scaled columns: a feature with larger numbers would otherwise have more of an impact than features with smaller numbers. For this we use the MinMaxScaler, which scales all numerical data to values between 0 and 1.
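%% Cell type:markdown id: tags:
A minimal sketch of this step, reusing the scaling pipeline from the notebook source; the variable names `X` and `y` are assumptions, and `sex` is dropped as announced above:
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# prediction target and features; "sex" is dropped as discussed above
y = df["charges"]
X = df.drop(columns=["charges", "sex"])

# scaling pipeline: maps every feature into the range [0, 1]
pipeline = Pipeline([
    ('minmax', MinMaxScaler()),
])
```
%% Cell type:markdown id: tags: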
In the next step we are going to split the data into `X` and `y` sets: the `X` datasets contain the features used to predict the values in the `y` datasets. We then split both into a training and a test dataset. The training dataset is used for training the model and the test dataset is used to verify the predictions. If we didn't split the data, we couldn't be sure that our model actually works and isn't just overfitting the training data.
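%% Cell type:markdown id: tags:
A sketch of the split, using the parameters from the notebook source (`test_size=0.1`, `random_state=42`); fitting the scaler on the training data only is an idiomatic choice, not something the source spells out:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# fit the scaler on the training set only, then apply it to both sets
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
```
%% Cell type:markdown id: tags: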
Now we are going to train our model using the scaled and preprocessed training data. To see how our model weighed each column, we print the coefficients and the intercept afterwards. With the coefficients and intercept we can demonstrate how the model predicts the target variable internally, using the formula you can find in the linear regression chapter of the mdbook.
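%% Cell type:markdown id: tags:
A minimal sketch of the training step (the variable name `model` is an assumption):
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# internally, a prediction is just the linear formula y = X @ coef + intercept
manual_prediction = X_test[:1] @ model.coef_ + model.intercept_
print("manual prediction:", manual_prediction)
```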
%% Cell type:markdown id: tags:
To see how our model performed on the training data, we calculate the root mean squared error (RMSE) and the accuracy. We also plot the first 20 predictions against the actual values to get a visual understanding of how good our prediction is.
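%% Cell type:markdown id: tags:
A sketch of the evaluation, assuming the variable names from the cells above:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE and accuracy (R² score) on the training data
y_pred = model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
print("RMSE:", rmse, "accuracy:", model.score(X_train, y_train))
```
%% Cell type:markdown id: tags: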
Now we want to find out whether our data follows a linear or a polynomial relationship. To answer this question we plot the RMSE for different polynomial degrees and check which degree predicts our data best.
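%% Cell type:markdown id: tags:
A sketch of the degree search; the range of degrees tried here is an assumption:
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

# fit one model per polynomial degree and record its RMSE on the test set
degrees = range(1, 6)
rmses = []
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    poly_model = LinearRegression().fit(X_train_poly, y_train)
    y_pred = poly_model.predict(X_test_poly)
    rmses.append(np.sqrt(mean_squared_error(y_test, y_pred)))

plt.plot(degrees, rmses)
plt.xlabel("polynomial degree")
plt.ylabel("rmse")
```
%% Cell type:markdown id: tags: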
Based on the plot we can see that the best polynomial degree for our data is 3. We will use this degree and plot the first 20 datapoints again.
`LinearRegression` has a `score` method which calculates the R² score, but `PolynomialFeatures` does not provide one, so we compute the R² score explicitly to see our accuracy improvement.
As a last step we are doing the same prediction as earlier but with our new polynomial degree. There is a nice improvement with our polynomial prediction.
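%% Cell type:markdown id: tags:
A sketch of this final step, under the assumptions above (degree 3, variable names as before):
%% Cell type:code id: tags:
``` python
from sklearn.metrics import r2_score

# transform the features with the best degree found above
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

poly_model = LinearRegression().fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)

print("R2 score:", r2_score(y_test, y_pred_poly))
```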