Commit 036cbb9a authored by Daniel Müller

Merge branch 'fix-notebooks' into 'main'

Fix LSTM loading

See merge request !125
parents 9de86178 4ac3db27
%% Cell type:markdown id:291f0068 tags:
# Long Short-Term Memory
This notebook shows an example of a long short-term memory (LSTM) neural network built with TensorFlow. As a dataset we are going to use the Corona_NLP dataset, which you can find in the data folder. It contains tweets about the coronavirus together with an estimate of the author's sentiment toward the topic.
The following sentiment classes exist:
>['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive']
## Imports
For this task we require quite a few imports. Some of the more notable ones are `re`, which is used in the preprocessing step, as well as `tensorflow.keras.models.load_model` and `tensorflow.keras.callbacks.CSVLogger`, which are used to save and load the model and its training history from disk. For this we also define a `BASE_PATH`, the folder where the model and logfile will be saved.
%% Cell type:code id:1f1d51cb-44b6-4d7e-97aa-b1e699e6338e tags:
``` python
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import re
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, GlobalAveragePooling1D, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import CSVLogger
from sklearn.metrics import confusion_matrix
BASE_PATH = "./saved_models/"
```
%% Cell type:markdown id:6Sp0upFjpIL5 tags:
## Loading the dataset and analysis
We start with the usual steps: loading our dataset and displaying a few example entries.
%% Cell type:code id:61c32169-b9a2-4868-aeac-9a44f108e623 tags:
``` python
train = pd.read_csv("../data/Corona_NLP/Corona_NLP_train.csv", encoding="latin1", engine="python")
test = pd.read_csv("../data/Corona_NLP/Corona_NLP_test.csv", encoding="latin1", engine="python")
```
%% Cell type:code id:43bdce54-3e08-4af8-ad69-1b0b5b68f184 tags:
``` python
train.head(5)
```
%% Output
UserName ScreenName Location TweetAt \
0 3799 48751 London 16-03-2020
1 3800 48752 UK 16-03-2020
2 3801 48753 Vagabonds 16-03-2020
3 3802 48754 NaN 16-03-2020
4 3803 48755 NaN 16-03-2020
OriginalTweet Sentiment
0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral
1 advice Talk to your neighbours family to excha... Positive
2 Coronavirus Australia: Woolworths to give elde... Positive
3 My food stock is not the only one which is emp... Positive
4 Me, ready to go at supermarket during the #COV... Extremely Negative
%% Cell type:markdown id:n2SNLzfwqCuG tags:
As our first preprocessing step we drop the columns we are not interested in: the username, screen name, and location of the author, and the time of the tweet.
%% Cell type:code id:d0501221-8bee-45b7-8ecf-1d5b483680cb tags:
``` python
# drop unnecessary columns
unnecessary_columns=["UserName", "ScreenName", "TweetAt", "Location"]
train.drop(unnecessary_columns, axis=1, inplace=True)
test.drop(unnecessary_columns, axis=1, inplace=True)
train.head(5)
```
%% Output
OriginalTweet Sentiment
0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral
1 advice Talk to your neighbours family to excha... Positive
2 Coronavirus Australia: Woolworths to give elde... Positive
3 My food stock is not the only one which is emp... Positive
4 Me, ready to go at supermarket during the #COV... Extremely Negative
%% Cell type:markdown id:KPw5uQWC0PyB tags:
Let's check whether there are any missing values in the dataframes that we would have to drop or replace.
%% Cell type:code id:Aqri2Hm60u6Q tags:
``` python
print(f"train data null values:\n{train.isnull().sum()}\n")
print(f"test data null values:\n{test.isnull().sum()}")
```
%% Output
train data null values:
OriginalTweet 0
Sentiment 0
dtype: int64
test data null values:
OriginalTweet 0
Sentiment 0
dtype: int64
%% Cell type:markdown id:DTm-2S531NWq tags:
There aren't any missing values, so we don't have to do any preprocessing to replace or drop them.
Next, let's take a look at the classes we are trying to predict.
%% Cell type:code id:edef1814-645e-4a5d-b303-ebf2a9dde1c4 tags:
``` python
# show variations of sentiment
train['Sentiment'].unique()
```
%% Output
array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
'Extremely Positive'], dtype=object)
%% Cell type:markdown id:3acb201f tags:
Before plotting, we sort a temporary dataframe by sentiment so that the plot shows the most negative sentiment on the left and the most positive on the right.
%% Cell type:code id:32592de7-07f4-411b-a953-d9f9fb6a0c66 tags:
``` python
mapping = {'Extremely Negative':0,'Negative':1,'Neutral':2,'Positive':3,'Extremely Positive': 4} # Assign sentiment class to a value
tmp = train.sort_values(by='Sentiment', key=lambda col: col.map(mapping)) # sort by the value
sns.histplot(data=tmp, x="Sentiment")
plt.xticks(rotation=30)
plt.show()
```
%% Output
%% Cell type:markdown id:UQr4hGqe1hLe tags:
All in all the data is fairly balanced, although the extreme opinions appear less often than the moderate ones. If we have trouble predicting those correctly, one option would be to merge each extreme class into its moderate counterpart, as sketched below.
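As a minimal sketch (not applied in the rest of this notebook), such a merge could map the five string labels down to three, assuming the dataframes still contain the original label strings:
``` python
# Hypothetical class merge: collapse the five sentiment labels into three.
# Shown only to illustrate the idea; the rest of the notebook keeps all five classes.
merge_map = {
    'Extremely Negative': 'Negative',
    'Negative': 'Negative',
    'Neutral': 'Neutral',
    'Positive': 'Positive',
    'Extremely Positive': 'Positive',
}
# train['Sentiment'] = train['Sentiment'].map(merge_map)
# test['Sentiment'] = test['Sentiment'].map(merge_map)
```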
## Preprocessing
The 5 classes need to be encoded as numbers so that our model can handle them properly.
Aside from that we don't do a lot of preprocessing. The only other step is removing URLs and Twitter mentions, as those are not very helpful for predicting the sentiment. The following function encodes the Sentiment values in the dataframe and applies the `preprocess_string` function to every entry.
%% Cell type:code id:bf84a59f-0e6d-4e95-ab78-3065cb0e34a4 tags:
``` python
def preprocess_dataframe(df: pd.DataFrame):
    def preprocess_string(s):
        '''
        Remove URLs and twitter mentions from a single tweet.
        '''
        s = re.sub(r'http\S+', '', s)
        s = re.sub("(@[a-zA-Z_0-9]+)", "", s)
        return s
    # encode sentiment as integers
    mapping = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3, 'Extremely Positive': 4}
    df['Sentiment'] = df['Sentiment'].map(mapping, na_action=None)
    # preprocess the tweets
    df['OriginalTweet'] = [preprocess_string(s) for s in df.OriginalTweet]
    return df
```
%% Cell type:code id:0c95f6f2-f9ef-417d-ab2c-b7206495ea82 tags:
``` python
train = preprocess_dataframe(train)
test = preprocess_dataframe(test)
```
%% Cell type:code id:e4665e4f-080b-4ee3-94ad-2c0471943a19 tags:
``` python
train.head()
```
%% Output
OriginalTweet Sentiment
0 and and 2
1 advice Talk to your neighbours family to excha... 3
2 Coronavirus Australia: Woolworths to give elde... 3
3 My food stock is not the only one which is emp... 3
4 Me, ready to go at supermarket during the #COV... 0
%% Cell type:markdown id:RQZNBWi83HOD tags:
The data as it is now is not yet ready to be handed over to the model. First we need to do some additional preprocessing. We start by converting our features and targets into NumPy arrays because Keras cannot handle dataframes directly. We also create and fit a tokenizer on both datasets so that it learns all words that occur. After tokenizing the tweets we apply padding to bring all tweets to a uniform length.
%% Cell type:code id:d436471f-6349-47c1-82b9-58c842204b6d tags:
``` python
# convert features and targets into numpy arrays
y = np.array(train['Sentiment'])
x = np.array(train['OriginalTweet'])
y_test = np.array(test['Sentiment'])
x_test = np.array(test['OriginalTweet'])
# create and fit Tokenizer on the train and test set.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)
tokenizer.fit_on_texts(x_test)
sequences_train = tokenizer.texts_to_sequences(x)
sequences_test = tokenizer.texts_to_sequences(x_test)
# find length of longest tweet in train and test set and set padded_size to the maximum.
padded_size = max(max([len(tweet) for tweet in sequences_train]), max([len(tweet) for tweet in sequences_test]))
x = pad_sequences(sequences_train, maxlen=padded_size)
x_test = pad_sequences(sequences_test, maxlen=padded_size)
y = np.reshape(y, (-1,1))
y_test = np.reshape(y_test, (-1,1))
```
%% Cell type:markdown id:eupqf9UJDs37 tags:
After padding we can take a look at what our tweets look like now. The first tweet in our array was reduced to a simple `and and` by the preprocessing. Each `and` is now represented by a 4 in the tokenized and padded sequence. The padding value used is 0, and it is added in front of the tweet (pre-padding, the Keras default).
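As a quick sanity check, a small helper like the following (a sketch using the tokenizer's `index_word` lookup, not part of the original pipeline) turns a padded sequence back into words:
``` python
# Sketch: decode a padded sequence back into words to verify the tokenization.
# The padding value 0 has no entry in index_word, so it is skipped.
def decode_sequence(seq):
    return " ".join(tokenizer.index_word[int(token)] for token in seq if token != 0)

# decode_sequence(x[0])  # expected to give "and and" for the first tweet
```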
%% Cell type:code id:a4584bb4-09d1-48fe-a190-b5fc44d8f82f tags:
``` python
x[0]
```
%% Output
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4])
%% Cell type:markdown id:Bvr1sRMeEpWW tags:
## Building and training the model
Now that the preprocessing is done we can finally start building the model. This process usually requires a bit of experimentation to achieve good accuracy, so feel free to change some variables or add some layers and see if you can get a better result.
As this model takes quite a bit of time to train, it is automatically saved after training completes. When running this notebook a second time you can set the `LOAD_MODEL` variable to `True` to load the pretrained model from disk instead of retraining it.
%% Cell type:code id:yBxNn2GTa26v tags:
``` python
# Switch this to True or False to enable/disable loading of the model from disk
# Make sure to run the entire notebook at least once before setting this to True.
LOAD_MODEL = False
model = None
if LOAD_MODEL:
    model = load_model(BASE_PATH + 'long_short_term_memory_model.h5')
else:
    dropout_rate = 0.4
    num_classes = train['Sentiment'].nunique()
    num_words = len(tokenizer.word_index) + 1
    model = Sequential([
        Embedding(input_dim=num_words, output_dim=5, input_length=padded_size),
        Bidirectional(LSTM(16, return_sequences=True)),
        Dropout(dropout_rate),
        LSTM(16, return_sequences=True),
        GlobalAveragePooling1D(),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=Adam(learning_rate=0.01, decay=0.001),
                  metrics=['accuracy'])
print(model.summary())
```
%% Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 65, 5) 273365
bidirectional (Bidirectiona (None, 65, 32) 2816
l)
dropout (Dropout) (None, 65, 32) 0
lstm_1 (LSTM) (None, 65, 16) 3136
global_average_pooling1d (G (None, 16) 0
lobalAveragePooling1D)
dense (Dense) (None, 5) 85
=================================================================
Total params: 279,402
Trainable params: 279,402
Non-trainable params: 0
_________________________________________________________________
None
%% Cell type:code id:2965e0d3-7649-4e04-b44f-950f4b3a2373 tags:
``` python
history = None
epochs = 5
if LOAD_MODEL:
    # load the training history written by the CSVLogger in a previous run
    history = pd.read_csv(BASE_PATH + 'lstm_training.log', sep=',', engine='python')
else:
    # log the training history to disk so it can be reloaded later
    csv_logger = CSVLogger(BASE_PATH + 'lstm_training.log', separator=',', append=False)
    history = model.fit(x, y, epochs=epochs, verbose='auto', validation_data=(x_test, y_test), callbacks=[csv_logger]).history
    model.save(BASE_PATH + 'long_short_term_memory_model.h5')
```
%% Output
Epoch 1/5
1287/1287 [==============================] - 43s 31ms/step - loss: 1.1219 - accuracy: 0.5346 - val_loss: 0.8971 - val_accuracy: 0.6651
Epoch 2/5
1287/1287 [==============================] - 39s 30ms/step - loss: 0.6193 - accuracy: 0.7790 - val_loss: 0.7436 - val_accuracy: 0.7351
Epoch 3/5
1287/1287 [==============================] - 37s 29ms/step - loss: 0.4338 - accuracy: 0.8569 - val_loss: 0.7290 - val_accuracy: 0.7615
Epoch 4/5
1287/1287 [==============================] - 37s 29ms/step - loss: 0.3273 - accuracy: 0.8970 - val_loss: 0.7934 - val_accuracy: 0.7551
Epoch 5/5
1287/1287 [==============================] - 38s 30ms/step - loss: 0.2597 - accuracy: 0.9215 - val_loss: 0.8643 - val_accuracy: 0.7370
%% Cell type:markdown id:nUjXgdOMFwD1 tags:
## Evaluation
After completing the training we can plot the accuracy and loss curves using pyplot.
%% Cell type:code id:26128add-7843-4533-8829-0fc6f03fe1a4 tags:
``` python
# plot accuracy and val_accuracy
acc = history['accuracy']
val_acc = history['val_accuracy']
loss = history['loss']
val_loss = history['val_loss']
epochs_range = range(1, epochs+1)
plt.subplot(1,2,1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='best')
plt.xlim(1,len(val_acc)+1)
plt.title('Accuracy')
plt.subplot(1,2,2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='best')
plt.xlim(1,len(val_acc)+1)
plt.title('Loss')
plt.show()
```
%% Output
%% Cell type:code id:bbbc20bf tags:
``` python
labels=['Extremely Negative','Negative','Neutral','Positive','Extremely Positive']
prediction = model.predict(np.array(x_test))
prediction = [np.argmax(p) for p in prediction]
mat = confusion_matrix(y_true=y_test, y_pred=prediction)
sns.heatmap(mat,annot=True, cmap='Blues',fmt='d', xticklabels=labels, yticklabels=labels)
plt.xlabel('predicted')
plt.ylabel('true')
```
%% Output
Text(32.99999999999999, 0.5, 'true')
%% Cell type:markdown id:9058966c tags:
Although the plot shows quite a high training accuracy, this is not what we are mainly interested in. More important is the validation accuracy, which drops off after the 3rd epoch. Continuing the training after that point will not improve the model any further but rather lead to overfitting; a common remedy is to stop training early, as sketched below.
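A minimal sketch of how such early stopping could be added with Keras' built-in `EarlyStopping` callback; it assumes the same `model` and data as above and was not part of the training run shown here:
``` python
# Sketch: stop training once the validation loss stops improving.
# Assumes the same model and data as above; not used in the run shown in this notebook.
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_loss',         # watch the validation loss
    patience=2,                 # allow 2 epochs without improvement before stopping
    restore_best_weights=True,  # roll back to the best epoch's weights
)
# history = model.fit(x, y, epochs=20, validation_data=(x_test, y_test),
#                     callbacks=[early_stopping]).history
```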
## User input
The next part is an experiment: we are going to use the model we just trained to predict the sentiment of user-provided strings.
%% Cell type:code id:124900d0 tags:
``` python
user_input = ''
```
%% Cell type:code id:e9fcdd36 tags:
``` python
def predict_input(input: str) -> str:
    def format_string(string: str) -> np.ndarray:
        # Create a dataframe so we can reuse the previous preprocessing function
        tmp_df = pd.DataFrame({'OriginalTweet': [string], 'Sentiment': ['Neutral']})
        tmp_df = preprocess_dataframe(tmp_df)
        x = np.array(tmp_df['OriginalTweet'])
        # Reuse the tokenizer fitted during training without refitting it;
        # refitting would add new word indices beyond the embedding size.
        # Words the tokenizer has never seen are simply dropped.
        seq = tokenizer.texts_to_sequences(x)
        return pad_sequences(seq, maxlen=padded_size)  # reuse padded_size from training
    input = format_string(input)
    pred = model.predict(input)
    classes = {0: 'Extremely Negative', 1: 'Negative', 2: 'Neutral', 3: 'Positive', 4: 'Extremely Positive'}
    return classes[np.argmax(pred)]
```
%% Cell type:code id:3c576520 tags:
``` python
print(predict_input(user_input))
```
%% Output
Neutral