Commit e5dfa5db authored by Tobias Stein

rewrote some portions

parent 5f759236
1 merge request !108: added polynomial regression and edited linear regression example
%% Cell type:markdown id: tags:
# Linear Regression
In this notebook we take a look at an example of linear and polynomial regression.
We start with some preprocessing steps to see how typical problems in a regression task can be solved.
Afterwards we train the algorithm and search for the best polynomial degree, i.e. the one that yields the lowest RMSE and therefore the best predictions.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
Read the CSV file into a dataframe.
`df.head()` shows the first few entries of the dataset and gives us some insight into the kind of data each column contains.
%% Cell type:code id: tags:
``` python
df = pd.read_csv("../data/Insurance/insurance.csv", sep=",")
df.head()
```
%% Output
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
%% Cell type:code id:509d93f0 tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
%% Cell type:markdown id: tags:
The next step is for demonstration purposes only. Many real-world datasets are incomplete, so we introduce some missing data here to show how to handle it. Since our dataset has no missing values, we intentionally drop the first 50 `age` values.
%% Cell type:code id: tags:
``` python
# Intentionally remove the first 50 age values to simulate missing data.
for x in range(50):
    df.loc[x, "age"] = np.nan
```
%% Cell type:markdown id: tags:
As we can see below, the `age` column now has only 1288 non-null values, i.e. 50 values are missing.
%% Cell type:code id: tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1288 non-null float64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB
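%% Cell type:markdown id: tags:
As a quick alternative check, `df.isna().sum()` gives the per-column count of missing values directly; a minimal sketch:
%% Cell type:code id: tags:
``` python
# Count the missing values per column; `age` should show 50 after the drop above.
df.isna().sum()
```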
%% Cell type:markdown id: tags:
To be able to run the linear regression we have two options.
The first is getting rid of all columns with non-numerical values. The second, much better option, which we use below, is converting the categorical data into numbers.
We also fill the missing values with the mean of the `age` column; otherwise we can't proceed, because the algorithm cannot work with NaN values.
%% Cell type:code id: tags:
``` python
df["age"] = df["age"].fillna(df["age"].mean())
df["sex"] = df["sex"].map({'male': 0, 'female': 1})
df["smoker"] = df["smoker"].map({'no': 0, 'yes': 1})
df["region"] = df["region"].map({'southeast': 0,
'southwest': 1,'northeast': 2, 'northwest': 3})
df.head()
```
%% Output
age sex bmi children smoker region charges
0 39.29736 1 27.900 0 1 1 16884.92400
1 39.29736 0 33.770 1 0 0 1725.55230
2 39.29736 0 33.000 3 0 0 4449.46200
3 39.29736 0 22.705 0 0 3 21984.47061
4 39.29736 0 28.880 0 0 3 3866.85520
%% Cell type:code id: tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null float64
1 sex 1338 non-null int64
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null int64
5 region 1338 non-null int64
6 charges 1338 non-null float64
dtypes: float64(3), int64(4)
memory usage: 73.3 KB
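%% Cell type:markdown id: tags:
As an aside, mapping `region` to the ordinal numbers 0 to 3 imposes an artificial order on the regions. A common alternative is one-hot encoding, for example via `pd.get_dummies`; a minimal sketch on a fresh copy of the raw data (not used in the rest of this notebook):
%% Cell type:code id: tags:
``` python
# One-hot encode the categorical columns instead of mapping them to integers.
raw = pd.read_csv("../data/Insurance/insurance.csv", sep=",")
pd.get_dummies(raw, columns=["sex", "smoker", "region"]).head()
```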
%% Cell type:markdown id: tags:
Now that our data has been converted and no longer contains missing values, we will plot the mean of our dependent variable `charges` as a function of each independent variable.
%% Cell type:code id: tags:
``` python
df.groupby("smoker")["expenses"].mean().plot(kind='bar')
df.groupby("smoker")["charges"].mean().plot(kind='bar')
```
%% Output
<AxesSubplot:xlabel='smoker'>
%% Cell type:code id: tags:
``` python
df.groupby("children")["expenses"].mean().plot(kind='bar')
df.groupby("children")["charges"].mean().plot(kind='bar')
```
%% Output
<AxesSubplot:xlabel='children'>
%% Cell type:code id: tags:
``` python
df.groupby("region")["expenses"].mean().plot(kind='bar')
df.groupby("region")["charges"].mean().plot(kind='bar')
```
%% Output
<AxesSubplot:xlabel='region'>
%% Cell type:code id: tags:
``` python
df.groupby("bmi")["expenses"].mean().plot()
df.groupby("bmi")["charges"].mean().plot()
```
%% Output
<AxesSubplot:xlabel='bmi'>
%% Cell type:code id: tags:
``` python
df.groupby("sex")["expenses"].mean().plot(kind='bar')
df.groupby("sex")["charges"].mean().plot(kind='bar')
```
%% Output
<AxesSubplot:xlabel='sex'>
%% Cell type:code id: tags:
``` python
df.groupby("age")["expenses"].mean().plot()
df.groupby("age")["charges"].mean().plot()
```
%% Output
<AxesSubplot:xlabel='age'>
%% Cell type:markdown id: tags:
We plot the data above to get an understanding of what we are working with and to figure out which features matter most for the regression algorithm.
The columns `age`, `smoker` and `bmi` have a huge impact on our dependent variable, while `region` and `children` matter less. The column `sex` doesn't seem to influence `charges` much, so we leave it out of the feature set below.
%% Cell type:markdown id: tags:
## Preprocessing
We already have a converted dataframe containing only numbers, which is great.
Next we define our prediction target and the features that will be used for the prediction.
We also scale the features, so that a feature with large values does not have more impact than a feature with small values. For this we use the `MinMaxScaler`, which scales all numerical data to the range between 0 and 1.
%% Cell type:code id:6fbef232 tags:
``` python
predict = "expenses"
predict = "charges"
features = ["age","smoker","bmi","children","region"]
df_X = np.array(df.drop([predict],1))
df_X = df[features]
df_Y = np.array(df[predict])
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])
```
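%% Cell type:markdown id: tags:
For reference, the `MinMaxScaler` maps each value $x$ of a column to $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$, which places every feature in the range $[0, 1]$. A minimal sketch on a toy column:
%% Cell type:code id: tags:
``` python
# Toy example: 18 is the minimum and maps to 0.0, 64 is the maximum and maps to 1.0.
toy = np.array([[18.0], [40.0], [64.0]])
MinMaxScaler().fit_transform(toy)
```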
%% Cell type:markdown id:8e4bb356 tags:
We will split the data into a training and a test dataset. The training dataset is used for training the model, and the test dataset is used to verify the predictions.
If we didn't split the data, we couldn't be sure that our model actually works and is not just overfitting the training data.
%% Cell type:code id:3bf48dc4 tags:
``` python
# Hold out 10% of the data for testing.
train, test = train_test_split(df, test_size=0.1, random_state=42)
```
%% Cell type:markdown id: tags:
In the next step we split the data into `X` and `y` sets. The `X` datasets contain the features used to predict the targets in the `y` datasets.
%% Cell type:code id:99bcbdbc tags:
``` python
# Features and targets for the training and the test set.
X_train = train[features]
y_train = train[predict]
X_test = test[features]
y_test = test[predict]
```
%% Cell type:markdown id: tags:
Now we train our model on the scaled and preprocessed training data. To see how the model weighs each column, we print the coefficients and the intercept afterwards.
%% Cell type:code id: tags:
``` python
# Fit an ordinary least squares model and predict the test set.
linear = linear_model.LinearRegression()
linear.fit(X_train, y_train)
predictions = linear.predict(X_test)
print("linear coefficients for: age, smoker, bmi, children, region")
print(linear.coef_)
print("intercept: ", linear.intercept_)
```
%% Output
linear coefficients for: age, smoker, bmi, children, region
[11918.86648376 23898.14003977 12188.17064225 2244.89376338
761.84404289]
intercept: -2758.2257893886454
%% Cell type:markdown id: tags:
With the coefficients and the intercept we can demonstrate how the model computes a prediction internally, using the formula from the linear regression chapter of the mdbook.
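For reference, with intercept $\beta_0$ and coefficients $\beta_1, \dots, \beta_5$ for the scaled features $x_1, \dots, x_5$, the model predicts

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$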
%% Cell type:code id: tags:
``` python
# Reproduce the model's prediction for one training sample by hand.
# .iloc selects by position; plain [100] would look for the index label 100.
row = X_train.iloc[100]
predictValue = (linear.intercept_
                + row["age"] * linear.coef_[0]
                + row["smoker"] * linear.coef_[1]
                + row["bmi"] * linear.coef_[2]
                + row["children"] * linear.coef_[3]
                + row["region"] * linear.coef_[4])
print("predicted value : ", predictValue)
print("actual value : ", y_train.iloc[100])
```
%% Cell type:markdown id: tags:
To see how our model performs we calculate the root mean squared error (RMSE) and the R² score on the test data. We also plot the first 20 predictions against the actual values to get a visual feeling for how good the predictions are.
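For reference, for $n$ test samples with actual values $y_i$, predictions $\hat{y}_i$ and mean $\bar{y}$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$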
%% Cell type:code id: tags:
``` python
# R^2 score and RMSE on the test set.
acc = linear.score(X_test, y_test)
print("r2 score : ", acc)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("Root mean squared error : ", rmse)

# Compare the first 20 actual values with the predictions.
plt.plot(y_test[0:20].values)  # .values drops the index so both lines share the x-axis 0..19
plt.plot(predictions[0:20])
plt.legend(["test values", "predictions"])
plt.show()
```
%% Cell type:markdown id: tags:
Next we want to find out whether our data is better described by a linear or a polynomial relation. To answer this question we plot the RMSE for several polynomial degrees and look for the degree that predicts our data best.
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import PolynomialFeatures

# Fit a polynomial model for degrees 1 to 5 and collect the RMSE on the test set.
rmselist = []
for degree in range(1, 6):
    polynomial_features = PolynomialFeatures(degree=degree)
    x_poly_train = polynomial_features.fit_transform(X_train)
    x_poly_test = polynomial_features.transform(X_test)  # only transform the test set, never fit on it
    model = linear_model.LinearRegression()
    model.fit(x_poly_train, y_train)
    y_poly_pred = model.predict(x_poly_test)
    rmselist.append(np.sqrt(mean_squared_error(y_test, y_poly_pred)))
print(rmselist)
plt.plot(range(1, 6), rmselist)
plt.xlabel("polynomial degree")
plt.ylabel("RMSE")
plt.show()
```
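%% Cell type:markdown id: tags:
To make clear what `PolynomialFeatures` actually feeds into the linear model, here is a minimal sketch on toy data (not our insurance features): for two inputs $(a, b)$ and degree 2 it generates the columns $1, a, b, a^2, ab, b^2$.
%% Cell type:code id: tags:
``` python
# Toy example: expand the single sample (a, b) = (2, 3) to degree 2.
# Expected columns: 1, a, b, a^2, a*b, b^2 -> [1, 2, 3, 4, 6, 9]
toy = np.array([[2.0, 3.0]])
PolynomialFeatures(degree=2).fit_transform(toy)
```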
%% Cell type:markdown id: tags:
Based on the plot we can see that the best polynomial degree for our data is 3. We will use this degree and plot the first 20 datapoints again.
`LinearRegression` has a `score` method which calculates the R² score, but `PolynomialFeatures` has no score function of its own, so we use the `r2_score` function directly to see the accuracy improvement.
%% Cell type:code id: tags:
``` python
# Expand the features to degree 3 and refit the linear model.
polynomial_features = PolynomialFeatures(degree=3)
x_poly_train = polynomial_features.fit_transform(X_train)
x_poly_test = polynomial_features.transform(X_test)
model = linear_model.LinearRegression()
model.fit(x_poly_train, y_train)
y_poly_pred = model.predict(x_poly_test)
```
%% Cell type:code id: tags:
``` python
# RMSE and R^2 score of the degree-3 model on the test set.
rmse = np.sqrt(mean_squared_error(y_test, y_poly_pred))
print("rmse : ", rmse)
print("r2 score : ", r2_score(y_test, y_poly_pred))
plt.plot(y_test[0:20].values)
plt.plot(y_poly_pred[0:20])
plt.legend(["test values", "predictions"])
plt.show()
```
%% Cell type:markdown id: tags:
As a last step we repeat the manual prediction from earlier, this time with the degree-3 polynomial features. The polynomial prediction shows a nice improvement.
%% Cell type:code id: tags:
``` python
# With 5 input features, degree 3 produces 56 polynomial features,
# so we sum over all of them instead of writing each term out.
predictValue = model.intercept_ + np.dot(x_poly_train[100], model.coef_)
print("predicted value : ", predictValue)
print("actual value : ", y_train.iloc[100])
```