Commit 036cbb9a authored by Daniel Müller

Merge branch 'fix-notebooks' into 'main'

Fix LSTM loading

See merge request !125
parents 9de86178 4ac3db27
%% Cell type:markdown id:291f0068 tags:
# Long Short-Term Memory
This notebook shows an example of a long short-term memory (LSTM) neural network built with TensorFlow. As a dataset we are going to use the Corona_NLP dataset, which you can find in the data folder. It contains tweets about the coronavirus together with an estimate of the author's sentiment toward the topic.
The following sentiment classes exist:
>['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive']
## Imports
For this task we require quite a few imports. Some of the more notable ones are `re`, which is used in the preprocessing step, as well as `tensorflow.keras.models.load_model` and `tensorflow.keras.callbacks.CSVLogger`, which are used to save and load the model and its training history from disk. For this we also define a `BASE_PATH`, the folder where the model and logfile will be saved.
%% Cell type:code id:1f1d51cb-44b6-4d7e-97aa-b1e699e6338e tags:
``` python
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import re
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, GlobalAveragePooling1D, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import CSVLogger
from sklearn.metrics import confusion_matrix
BASE_PATH = "./saved_models/"
```
%% Cell type:markdown id:6Sp0upFjpIL5 tags:
## Loading the dataset and analysis
We start with the usual steps: loading our dataset and displaying a few example entries.
%% Cell type:code id:61c32169-b9a2-4868-aeac-9a44f108e623 tags:
``` python
train = pd.read_csv("../data/Corona_NLP/Corona_NLP_train.csv", encoding="latin1", engine="python")
test = pd.read_csv("../data/Corona_NLP/Corona_NLP_test.csv", encoding="latin1", engine="python")
```
%% Cell type:code id:43bdce54-3e08-4af8-ad69-1b0b5b68f184 tags:
``` python
train.head(5)
```
%% Output
UserName ScreenName Location TweetAt \
0 3799 48751 London 16-03-2020
1 3800 48752 UK 16-03-2020
2 3801 48753 Vagabonds 16-03-2020
3 3802 48754 NaN 16-03-2020
4 3803 48755 NaN 16-03-2020
OriginalTweet Sentiment
0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral
1 advice Talk to your neighbours family to excha... Positive
2 Coronavirus Australia: Woolworths to give elde... Positive
3 My food stock is not the only one which is emp... Positive
4 Me, ready to go at supermarket during the #COV... Extremely Negative
%% Cell type:markdown id:n2SNLzfwqCuG tags:
As our first preprocessing step we drop the columns we are not interested in: the username, screen name, and location of the author, and the time of the tweet.
%% Cell type:code id:d0501221-8bee-45b7-8ecf-1d5b483680cb tags:
``` python
# drop unnecessary columns
unnecessary_columns=["UserName", "ScreenName", "TweetAt", "Location"]
train.drop(unnecessary_columns, axis=1, inplace=True)
test.drop(unnecessary_columns, axis=1, inplace=True)
train.head(5)
```
%% Output
OriginalTweet Sentiment
0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral
1 advice Talk to your neighbours family to excha... Positive
2 Coronavirus Australia: Woolworths to give elde... Positive
3 My food stock is not the only one which is emp... Positive
4 Me, ready to go at supermarket during the #COV... Extremely Negative
%% Cell type:markdown id:KPw5uQWC0PyB tags:
Let's check whether there are any missing values in the dataframes that we would have to drop or replace.
%% Cell type:code id:Aqri2Hm60u6Q tags:
``` python
print(f"train data null values:\n{train.isnull().sum()}\n")
print(f"test data null values:\n{test.isnull().sum()}")
```
%% Output
train data null values:
OriginalTweet 0
Sentiment 0
dtype: int64
test data null values:
OriginalTweet 0
Sentiment 0
dtype: int64
%% Cell type:markdown id:DTm-2S531NWq tags:
There aren't any missing values, so we don't have to do any preprocessing to replace or drop them.
Next, let's take a look at the classes we are trying to predict.
%% Cell type:code id:edef1814-645e-4a5d-b303-ebf2a9dde1c4 tags:
``` python
# show variations of sentiment
train['Sentiment'].unique()
```
%% Output
array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
'Extremely Positive'], dtype=object)
%% Cell type:markdown id:3acb201f tags:
Before plotting, we sort a temporary dataframe by sentiment so that the plot shows the most negative sentiment on the left and the most positive on the right.
%% Cell type:code id:32592de7-07f4-411b-a953-d9f9fb6a0c66 tags:
``` python
mapping = {'Extremely Negative':0,'Negative':1,'Neutral':2,'Positive':3,'Extremely Positive': 4} # Assign sentiment class to a value
tmp = train.sort_values(by='Sentiment', key=lambda col: col.map(mapping)) # sort by the value
sns.histplot(data=tmp, x="Sentiment")
plt.xticks(rotation=30)
plt.show()
```
%% Output
%% Cell type:markdown id:UQr4hGqe1hLe tags:
All in all the data is fairly balanced, although the extreme opinions appear less often than the moderate ones. If we have trouble predicting those correctly, one option would be to merge each extreme class into its moderate counterpart, as sketched below.
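As a minimal sketch (not applied in the rest of this notebook), such a merge could map the five string labels down to three, assuming the dataframes still contain the original label strings:
``` python
# Hypothetical class merge: collapse the five sentiment labels into three.
# Shown only to illustrate the idea; the rest of the notebook keeps all five classes.
merge_map = {
    'Extremely Negative': 'Negative',
    'Negative': 'Negative',
    'Neutral': 'Neutral',
    'Positive': 'Positive',
    'Extremely Positive': 'Positive',
}
# train['Sentiment'] = train['Sentiment'].map(merge_map)
# test['Sentiment'] = test['Sentiment'].map(merge_map)
```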
## Preprocessing
The 5 classes need to be encoded as numbers so that our model can handle them properly.
Aside from that we don't do a lot of preprocessing. The only other step is removing URLs and Twitter mentions, as those are not very helpful for predicting the sentiment. The following function encodes the Sentiment values in the dataframe and applies the `preprocess_string` function to every entry.
%% Cell type:code id:bf84a59f-0e6d-4e95-ab78-3065cb0e34a4 tags:
``` python
def preprocess_dataframe(df: pd.DataFrame):
    def preprocess_string(s):
        '''
        Remove URLs and twitter mentions from a single tweet.
        '''
        s = re.sub(r'http\S+', '', s)
        s = re.sub("(@[a-zA-Z_0-9]+)", "", s)
        return s
    # encode sentiment as integers
    mapping = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3, 'Extremely Positive': 4}
    df['Sentiment'] = df['Sentiment'].map(mapping, na_action=None)
    # preprocess the tweets
    df['OriginalTweet'] = [preprocess_string(s) for s in df.OriginalTweet]
    return df
```
%% Cell type:code id:0c95f6f2-f9ef-417d-ab2c-b7206495ea82 tags:
``` python
train = preprocess_dataframe(train)
test = preprocess_dataframe(test)
```
%% Cell type:code id:e4665e4f-080b-4ee3-94ad-2c0471943a19 tags:
``` python
train.head()
```
%% Output
OriginalTweet Sentiment
0 and and 2
1 advice Talk to your neighbours family to excha... 3
2 Coronavirus Australia: Woolworths to give elde... 3
3 My food stock is not the only one which is emp... 3
4 Me, ready to go at supermarket during the #COV... 0
%% Cell type:markdown id:RQZNBWi83HOD tags:
The data as it is now is not yet ready to be handed over to the model. First we need to do some additional preprocessing. We start by converting our features and targets into NumPy arrays because Keras cannot handle dataframes directly. We also create and fit a tokenizer on both datasets so that it learns all words that occur. After tokenizing the tweets we apply padding to bring all tweets to a uniform length.
%% Cell type:code id:d436471f-6349-47c1-82b9-58c842204b6d tags:
``` python
# convert features and targets into numpy arrays
y = np.array(train['Sentiment'])
x = np.array(train['OriginalTweet'])
y_test = np.array(test['Sentiment'])
x_test = np.array(test['OriginalTweet'])
# create and fit Tokenizer on the train and test set.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)
tokenizer.fit_on_texts(x_test)
sequences_train = tokenizer.texts_to_sequences(x)
sequences_test = tokenizer.texts_to_sequences(x_test)
# find length of longest tweet in train and test set and set padded_size to the maximum.
padded_size = max(max([len(tweet) for tweet in sequences_train]), max([len(tweet) for tweet in sequences_test]))
x = pad_sequences(sequences_train, maxlen=padded_size)
x_test = pad_sequences(sequences_test, maxlen=padded_size)
y = np.reshape(y, (-1,1))
y_test = np.reshape(y_test, (-1,1))
```
%% Cell type:markdown id:eupqf9UJDs37 tags:
After padding we can take a look at what our tweets look like now. The first tweet in our array was reduced to a simple `and and` by the preprocessing. Each `and` is now represented by a 4 in the tokenized and padded sequence. The padding value used is 0, and it is added in front of the tweet (pre-padding, the Keras default).
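As a quick sanity check, a small helper like the following (a sketch using the tokenizer's `index_word` lookup, not part of the original pipeline) turns a padded sequence back into words:
``` python
# Sketch: decode a padded sequence back into words to verify the tokenization.
# The padding value 0 has no entry in index_word, so it is skipped.
def decode_sequence(seq):
    return " ".join(tokenizer.index_word[int(token)] for token in seq if token != 0)

# decode_sequence(x[0])  # expected to give "and and" for the first tweet
```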
%% Cell type:code id:a4584bb4-09d1-48fe-a190-b5fc44d8f82f tags:
``` python
x[0]
```
%% Output
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4])
%% Cell type:markdown id:Bvr1sRMeEpWW tags:
## Building and training the model
Now that the preprocessing is done we can finally start building the model. This process usually requires a bit of experimentation to achieve good accuracy, so feel free to change some variables or add some layers and see if you can get a better result.
As this model takes quite a bit of time to train, it is automatically saved after training completes. When running this notebook a second time you can set the `LOAD_MODEL` variable to `True` to load the pretrained model from disk instead of retraining it.
%% Cell type:code id:yBxNn2GTa26v tags:
``` python
# Switch this to True or False to enable/disable loading of the model from disk
# Make sure to run the entire notebook at least once before setting this to True.
LOAD_MODEL = False
model = None
if LOAD_MODEL:
    model = load_model(BASE_PATH + 'long_short_term_memory_model.h5')
else:
    dropout_rate = 0.4
    num_classes = train['Sentiment'].nunique()
    num_words = len(tokenizer.word_index) + 1
    model = Sequential([
        Embedding(input_dim=num_words, output_dim=5, input_length=padded_size),
        Bidirectional(LSTM(16, return_sequences=True)),
        Dropout(dropout_rate),
        LSTM(16, return_sequences=True),
        GlobalAveragePooling1D(),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=Adam(learning_rate=0.01, decay=0.001),
                  metrics=['accuracy'])
print(model.summary())
```
%% Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 65, 5) 273365
bidirectional (Bidirectiona (None, 65, 32) 2816
l)
dropout (Dropout) (None, 65, 32) 0
lstm_1 (LSTM) (None, 65, 16) 3136
global_average_pooling1d (G (None, 16) 0
lobalAveragePooling1D)
dense (Dense) (None, 5) 85
=================================================================
Total params: 279,402
Trainable params: 279,402
Non-trainable params: 0
_________________________________________________________________
None
%% Cell type:code id:2965e0d3-7649-4e04-b44f-950f4b3a2373 tags:
``` python
history = None
epochs = 5
if LOAD_MODEL:
    # load the training history written by the CSVLogger in a previous run
    history = pd.read_csv(BASE_PATH + 'lstm_training.log', sep=',', engine='python')
else:
    # log the training history to disk so it can be reloaded later
    csv_logger = CSVLogger(BASE_PATH + 'lstm_training.log', separator=',', append=False)
    history = model.fit(x, y, epochs=epochs, verbose='auto', validation_data=(x_test, y_test), callbacks=[csv_logger]).history
    model.save(BASE_PATH + 'long_short_term_memory_model.h5')
```
%% Output
Epoch 1/5
1287/1287 [==============================] - 43s 31ms/step - loss: 1.1219 - accuracy: 0.5346 - val_loss: 0.8971 - val_accuracy: 0.6651
Epoch 2/5
1287/1287 [==============================] - 39s 30ms/step - loss: 0.6193 - accuracy: 0.7790 - val_loss: 0.7436 - val_accuracy: 0.7351
Epoch 3/5
1287/1287 [==============================] - 37s 29ms/step - loss: 0.4338 - accuracy: 0.8569 - val_loss: 0.7290 - val_accuracy: 0.7615
Epoch 4/5
1287/1287 [==============================] - 37s 29ms/step - loss: 0.3273 - accuracy: 0.8970 - val_loss: 0.7934 - val_accuracy: 0.7551
Epoch 5/5
1287/1287 [==============================] - 38s 30ms/step - loss: 0.2597 - accuracy: 0.9215 - val_loss: 0.8643 - val_accuracy: 0.7370
%% Cell type:markdown id:nUjXgdOMFwD1 tags:
## Evaluation
After completing the training we can plot the accuracy and loss curves using pyplot.
%% Cell type:code id:26128add-7843-4533-8829-0fc6f03fe1a4 tags:
``` python
# plot accuracy and val_accuracy
acc = history['accuracy']
val_acc = history['val_accuracy']
loss = history['loss']
val_loss = history['val_loss']
epochs_range = range(1, epochs+1)
plt.subplot(1,2,1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='best')
plt.xlim(1,len(val_acc)+1)
plt.title('Accuracy')
plt.subplot(1,2,2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='best')
plt.xlim(1,len(val_acc)+1)
plt.title('Loss')
plt.show()
```
%% Output
%% Cell type:code id:bbbc20bf tags:
``` python
labels=['Extremely Negative','Negative','Neutral','Positive','Extremely Positive']
prediction = model.predict(np.array(x_test))
prediction = [np.argmax(p) for p in prediction]
mat = confusion_matrix(y_true=y_test, y_pred=prediction)
sns.heatmap(mat,annot=True, cmap='Blues',fmt='d', xticklabels=labels, yticklabels=labels)
plt.xlabel('predicted')
plt.ylabel('true')
```
%% Output
Text(32.99999999999999, 0.5, 'true')
%% Cell type:markdown id:9058966c tags:
Although the plot shows quite a high training accuracy, this is not what we are mainly interested in. More important is the validation accuracy, which drops off after the 3rd epoch. Continuing the training after that point will not improve the model any further but rather lead to overfitting; a common remedy is to stop training early, as sketched below.
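A minimal sketch of how such early stopping could be added with Keras' built-in `EarlyStopping` callback; it assumes the same `model` and data as above and was not part of the training run shown here:
``` python
# Sketch: stop training once the validation loss stops improving.
# Assumes the same model and data as above; not used in the run shown in this notebook.
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_loss',         # watch the validation loss
    patience=2,                 # allow 2 epochs without improvement before stopping
    restore_best_weights=True,  # roll back to the best epoch's weights
)
# history = model.fit(x, y, epochs=20, validation_data=(x_test, y_test),
#                     callbacks=[early_stopping]).history
```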
## User input
The next part is an experiment: we are going to use the model we just trained to predict the sentiment of user-provided strings.
%% Cell type:code id:124900d0 tags:
``` python
user_input = ''
```
%% Cell type:code id:e9fcdd36 tags:
``` python
def predict_input(input: str) -> str:
    def format_string(string: str) -> np.ndarray:
        # Create a dataframe so we can reuse the previous preprocessing function
        tmp_df = pd.DataFrame({'OriginalTweet': [string], 'Sentiment': ['Neutral']})
        tmp_df = preprocess_dataframe(tmp_df)
        x = np.array(tmp_df['OriginalTweet'])
        # Reuse the tokenizer fitted during training without refitting it;
        # refitting would add new word indices beyond the embedding size.
        # Words the tokenizer has never seen are simply dropped.
        seq = tokenizer.texts_to_sequences(x)
        return pad_sequences(seq, maxlen=padded_size)  # reuse padded_size from training
    input = format_string(input)
    pred = model.predict(input)
    classes = {0: 'Extremely Negative', 1: 'Negative', 2: 'Neutral', 3: 'Positive', 4: 'Extremely Positive'}
    return classes[np.argmax(pred)]
```
%% Cell type:code id:3c576520 tags:
``` python
print(predict_input(user_input))
```
%% Output
Neutral