{
"cell_type": "markdown",
"id": "291f0068",
"metadata": {},
"source": [
"# Long Short-Term Memory\n",
"This notebook walks through an example of a long short-term memory (LSTM) neural network built with TensorFlow. As a dataset we use the Corona_NLP dataset, which you can find in the data folder. It contains tweets about the coronavirus together with an estimate of each author's sentiment toward the topic.\n",
"\n",
"The following sentiment classes exist:\n",
">['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive']\n",
"\n",
"## Imports\n",
"For this task we require quite a few imports. Some of the more unusual ones are `re`, which is used in the preprocessing step, and `load_model` and `CSVLogger` from Keras, which let us save and load the model and its training history to and from disk. For this we also define a `BASE_PATH`, the folder where the model and log file are saved."
]
},
{
"cell_type": "code",
"id": "1f1d51cb-44b6-4d7e-97aa-b1e699e6338e",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
"id": "1f1d51cb-44b6-4d7e-97aa-b1e699e6338e",
"outputId": "5181c0bb-a008-4094-964a-9f03e8b86d70"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import re\n",
"import tensorflow as tf\n",
"from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, GlobalAveragePooling1D, Dropout\n",
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.optimizers import Adam\n",
"from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
"from tensorflow.keras.preprocessing.text import Tokenizer\n",
"# import from tensorflow.keras throughout instead of mixing it with the standalone keras package\n",
"from tensorflow.keras.models import load_model\n",
"from tensorflow.keras.callbacks import CSVLogger\n",
"# confusion_matrix is used in the evaluation section\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"# folder in which the trained model and the training log are stored\n",
"BASE_PATH = \"./saved_models/\""
]
},
{
"cell_type": "markdown",
"id": "6Sp0upFjpIL5",
"metadata": {
"id": "6Sp0upFjpIL5"
},
"source": [
"## Loading the dataset and analysis\n",
"We start with the usual first step: loading the dataset and displaying a few example entries."
]
},
{
"cell_type": "code",
"id": "61c32169-b9a2-4868-aeac-9a44f108e623",
"metadata": {
"id": "61c32169-b9a2-4868-aeac-9a44f108e623"
},
"outputs": [],
"source": [
"train = pd.read_csv(\"../data/Corona_NLP/Corona_NLP_train.csv\", encoding=\"latin1\",engine=\"python\")\n",
"test = pd.read_csv(\"../data/Corona_NLP/Corona_NLP_test.csv\", encoding=\"latin1\",engine=\"python\")"
]
},
{
"cell_type": "code",
"id": "43bdce54-3e08-4af8-ad69-1b0b5b68f184",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
"id": "43bdce54-3e08-4af8-ad69-1b0b5b68f184",
"outputId": "3523d045-620d-4a9f-dd65-35a3b677a176"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UserName</th>\n",
" <th>ScreenName</th>\n",
" <th>Location</th>\n",
" <th>TweetAt</th>\n",
" <th>OriginalTweet</th>\n",
" <th>Sentiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3799</td>\n",
" <td>48751</td>\n",
" <td>London</td>\n",
" <td>16-03-2020</td>\n",
" <td>@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...</td>\n",
" <td>Neutral</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3800</td>\n",
" <td>48752</td>\n",
" <td>UK</td>\n",
" <td>16-03-2020</td>\n",
" <td>advice Talk to your neighbours family to excha...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3801</td>\n",
" <td>48753</td>\n",
" <td>Vagabonds</td>\n",
" <td>16-03-2020</td>\n",
" <td>Coronavirus Australia: Woolworths to give elde...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3802</td>\n",
" <td>48754</td>\n",
" <td>NaN</td>\n",
" <td>16-03-2020</td>\n",
" <td>My food stock is not the only one which is emp...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3803</td>\n",
" <td>48755</td>\n",
" <td>NaN</td>\n",
" <td>16-03-2020</td>\n",
" <td>Me, ready to go at supermarket during the #COV...</td>\n",
" <td>Extremely Negative</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
"text/plain": [
" UserName ScreenName Location TweetAt \\\n",
"0 3799 48751 London 16-03-2020 \n",
"1 3800 48752 UK 16-03-2020 \n",
"2 3801 48753 Vagabonds 16-03-2020 \n",
"3 3802 48754 NaN 16-03-2020 \n",
"4 3803 48755 NaN 16-03-2020 \n",
"\n",
" OriginalTweet Sentiment \n",
"0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral \n",
"1 advice Talk to your neighbours family to excha... Positive \n",
"2 Coronavirus Australia: Woolworths to give elde... Positive \n",
"3 My food stock is not the only one which is emp... Positive \n",
"4 Me, ready to go at supermarket during the #COV... Extremely Negative "
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head(5)"
]
},
{
"cell_type": "markdown",
"id": "n2SNLzfwqCuG",
"metadata": {
"id": "n2SNLzfwqCuG"
},
"source": [
"As our first preprocessing step we drop the columns we are not interested in: the username and screen name, the author's location, and the time of the tweet."
]
},
{
"cell_type": "code",
"id": "d0501221-8bee-45b7-8ecf-1d5b483680cb",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
"id": "d0501221-8bee-45b7-8ecf-1d5b483680cb",
"outputId": "2d7a9e7b-21d5-4cf8-9a0b-bb7807ed5225"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>OriginalTweet</th>\n",
" <th>Sentiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...</td>\n",
" <td>Neutral</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>advice Talk to your neighbours family to excha...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Coronavirus Australia: Woolworths to give elde...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>My food stock is not the only one which is emp...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Me, ready to go at supermarket during the #COV...</td>\n",
" <td>Extremely Negative</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
"text/plain": [
" OriginalTweet Sentiment\n",
"0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral\n",
"1 advice Talk to your neighbours family to excha... Positive\n",
"2 Coronavirus Australia: Woolworths to give elde... Positive\n",
"3 My food stock is not the only one which is emp... Positive\n",
"4 Me, ready to go at supermarket during the #COV... Extremely Negative"
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop unnecessary columns\n",
"unnecessary_columns=[\"UserName\", \"ScreenName\", \"TweetAt\", \"Location\"]\n",
"train.drop(unnecessary_columns, axis=1, inplace=True)\n",
"test.drop(unnecessary_columns, axis=1, inplace=True)\n",
"train.head(5)"
]
},
{
"cell_type": "markdown",
"id": "KPw5uQWC0PyB",
"metadata": {
"id": "KPw5uQWC0PyB"
},
"source": [
"Let's check whether the dataframes contain any missing values that we would have to drop or replace."
]
},
{
"cell_type": "code",
"id": "Aqri2Hm60u6Q",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
"id": "Aqri2Hm60u6Q",
"outputId": "23fa3815-8968-42ef-891a-beb37ebdb0ee"
},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"train data null values:\n",
"OriginalTweet 0\n",
"Sentiment 0\n",
"dtype: int64\n",
"\n",
"test data null values:\n",
"OriginalTweet 0\n",
"Sentiment 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(f\"train data null values:\\n{train.isnull().sum()}\\n\")\n",
"print(f\"test data null values:\\n{test.isnull().sum()}\")"
]
},
{
"cell_type": "markdown",
"id": "DTm-2S531NWq",
"metadata": {
"id": "DTm-2S531NWq"
},
"source": [
"There are no missing values, so we don't need to replace or drop anything.\n",
"Next, let's take a look at the classes we are trying to predict."
]
},
{
"cell_type": "code",
"id": "edef1814-645e-4a5d-b303-ebf2a9dde1c4",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
"id": "edef1814-645e-4a5d-b303-ebf2a9dde1c4",
"outputId": "36a0a7ee-40f0-4390-a402-f167c6883eac"
},
"outputs": [
"data": {
"text/plain": [
"array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',\n",
" 'Extremely Positive'], dtype=object)"
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show variations of sentiment\n",
"train['Sentiment'].unique()"
]
},
{
"cell_type": "markdown",
"id": "3acb201f",
"metadata": {},
"source": [
"Before plotting, we sort a temporary dataframe by sentiment so that the plot shows the most negative sentiment on the left and the most positive on the right."
]
},
{
"cell_type": "code",
"id": "32592de7-07f4-411b-a953-d9f9fb6a0c66",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
"id": "32592de7-07f4-411b-a953-d9f9fb6a0c66",
"outputId": "eaf06407-90b2-47a1-da1a-cc6903fe28f1"
},
"outputs": [
"image/png": "",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"mapping = {'Extremely Negative':0,'Negative':1,'Neutral':2,'Positive':3,'Extremely Positive': 4} # Assign sentiment class to a value\n",
"tmp = train.sort_values(by='Sentiment', key=lambda col: col.map(mapping)) # sort by the value \n",
"sns.histplot(data=tmp, x=\"Sentiment\")\n",
"plt.xticks(rotation=30)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "UQr4hGqe1hLe",
"metadata": {
"id": "UQr4hGqe1hLe"
},
"source": [
"All in all the data is fairly balanced, although the extreme opinions appear less often than the moderate ones. If the model has trouble predicting them correctly, one option is to merge each extreme class with its moderate counterpart, leaving three classes: negative, neutral and positive.\n",
"## Preprocessing\n",
"The 5 classes need to be encoded as numbers so that our model can handle them.\n",
"Aside from that we do very little preprocessing: we only remove URLs and Twitter mentions, as these are not very helpful for predicting the sentiment. The following function encodes the Sentiment values in the dataframe and applies the `preprocess_string` function to every tweet."
]
},
{
"cell_type": "code",
"id": "bf84a59f-0e6d-4e95-ab78-3065cb0e34a4",
"metadata": {
"id": "bf84a59f-0e6d-4e95-ab78-3065cb0e34a4"
},
"outputs": [],
"source": [
"def preprocess_dataframe(df: pd.DataFrame) -> pd.DataFrame:\n",
"    def preprocess_string(s: str) -> str:\n",
"        '''\n",
"        Remove URLs and Twitter mentions from a single tweet.\n",
"        '''\n",
"        s = re.sub(r'http\\S+', '', s)\n",
"        s = re.sub(r'@[a-zA-Z_0-9]+', '', s)\n",
"        return s\n",
"\n",
"    # encode the sentiment classes as integers from 0 (most negative) to 4 (most positive)\n",
"    mapping = {'Extremely Negative':0,'Negative':1,'Neutral':2,'Positive':3,'Extremely Positive': 4}\n",
"    df['Sentiment'] = df['Sentiment'].map(mapping)\n",
"    # strip URLs and mentions from every tweet\n",
"    df['OriginalTweet'] = [preprocess_string(s) for s in df.OriginalTweet]\n",
"    return df"
]
},
{
"cell_type": "code",
"id": "0c95f6f2-f9ef-417d-ab2c-b7206495ea82",
"metadata": {
"id": "0c95f6f2-f9ef-417d-ab2c-b7206495ea82"
},
"outputs": [],
"source": [
"train = preprocess_dataframe(train)\n",
"test = preprocess_dataframe(test)"
]
},
{
"cell_type": "code",
"id": "e4665e4f-080b-4ee3-94ad-2c0471943a19",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "e4665e4f-080b-4ee3-94ad-2c0471943a19",
"outputId": "cda0a2a3-ac53-49ad-fb69-1aa49ffac7ed"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>OriginalTweet</th>\n",
" <th>Sentiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>and and</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>advice Talk to your neighbours family to excha...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Coronavirus Australia: Woolworths to give elde...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>My food stock is not the only one which is emp...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Me, ready to go at supermarket during the #COV...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
"text/plain": [
" OriginalTweet Sentiment\n",
"0 and and 2\n",
"1 advice Talk to your neighbours family to excha... 3\n",
"2 Coronavirus Australia: Woolworths to give elde... 3\n",
"3 My food stock is not the only one which is emp... 3\n",
"4 Me, ready to go at supermarket during the #COV... 0"
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "markdown",
"id": "RQZNBWi83HOD",
"metadata": {
"id": "RQZNBWi83HOD"
},
"source": [
"The data is not yet ready to be handed to the model; some additional preprocessing is needed first. We start by converting our features and targets into numpy arrays, since that is the format we will pass to Keras. We then create a tokenizer and fit it on both datasets so that it learns every word that occurs. After tokenizing the tweets we apply padding so that all tweets have the same length."
]
},
{
"cell_type": "code",
"id": "d436471f-6349-47c1-82b9-58c842204b6d",
"metadata": {
"id": "d436471f-6349-47c1-82b9-58c842204b6d"
},
"outputs": [],
"source": [
"# convert features and targets into numpy arrays\n",
"y = np.array(train['Sentiment'])\n",
"x = np.array(train['OriginalTweet'])\n",
"\n",
"y_test = np.array(test['Sentiment'])\n",
"x_test = np.array(test['OriginalTweet'])\n",
"\n",
"# create and fit Tokenizer on the train and test set.\n",
"tokenizer = Tokenizer()\n",
"tokenizer.fit_on_texts(x)\n",
"tokenizer.fit_on_texts(x_test)\n",
"\n",
"sequences_train = tokenizer.texts_to_sequences(x)\n",
"sequences_test = tokenizer.texts_to_sequences(x_test)\n",
"\n",
"# find length of longest tweet in train and test set and set padded_size to the maximum.\n",
"padded_size = max(max([len(tweet) for tweet in sequences_train]), max([len(tweet) for tweet in sequences_test]))\n",
"\n",
"x = pad_sequences(sequences_train, maxlen=padded_size)\n",
"x_test = pad_sequences(sequences_test, maxlen=padded_size)\n",
"\n",
"y = np.reshape(y, (-1,1))\n",
"y_test = np.reshape(y_test, (-1,1))"
]
},
{
"cell_type": "markdown",
"id": "eupqf9UJDs37",
"metadata": {
"id": "eupqf9UJDs37"
},
"source": [
"After padding we can look at what our tweets have become. The first tweet in our array was reduced to a simple `and and` by the preprocessing. Each `and` is now represented by a 4 in the tokenized and padded sequence. The padding value is 0, and it was added in front of the tweet."
]
},
{
"cell_type": "code",
"id": "a4584bb4-09d1-48fe-a190-b5fc44d8f82f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
"id": "a4584bb4-09d1-48fe-a190-b5fc44d8f82f",
"outputId": "9bcca2c1-a930-406d-fdfc-2522ea8d2c3b"
},
"outputs": [
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4])"
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x[0]"
]
},
{
"cell_type": "markdown",
"id": "Bvr1sRMeEpWW",
"metadata": {
"id": "Bvr1sRMeEpWW"
},
"source": [
"## Building and training the model\n",
"Now that the preprocessing is done we can finally start building the model. Reaching a good accuracy usually takes some experimentation, so feel free to change some variables or add some layers and see if you can get a better result.\n",
"\n",
"As this model takes quite a while to train, it is saved automatically once training completes. When running this notebook a second time you can set the `LOAD_MODEL` variable to `True` to load the trained model from disk instead of retraining it."
]
},
{
"cell_type": "code",
"id": "yBxNn2GTa26v",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
"id": "yBxNn2GTa26v",
"outputId": "20069693-08ba-43b0-9f46-6ce2db44a6f9"
},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
" Layer (type) Output Shape Param # \n",
"=================================================================\n",
" bidirectional (Bidirectiona (None, 65, 32) 2816 \n",
" l) \n",
" global_average_pooling1d (G (None, 16) 0 \n",
" lobalAveragePooling1D) \n",
" \n",
"=================================================================\n",
"Total params: 279,402\n",
"Trainable params: 279,402\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n"
]
}
],
"source": [
"# Switch this to True or False to enable/disable loading of the model from disk\n",
"# Make sure to run the entire notebook at least once before setting this to True.\n",
"LOAD_MODEL = False\n",
"\n",
"model = None\n",
"if LOAD_MODEL:\n",
"    model = load_model(BASE_PATH+'long_short_term_memory_model.h5')\n",
"else:\n",
"    dropout_rate = 0.4\n",
"\n",
"    # number of distinct sentiment classes\n",
"    num_classes = train['Sentiment'].nunique()\n",
"    # +1 because index 0 is reserved for padding\n",
"    num_words = len(tokenizer.word_index)+1\n",
"\n",
"    model = Sequential([\n",
"        Embedding(input_dim=num_words,output_dim=5, input_length=padded_size),\n",
"        Bidirectional(LSTM(16, return_sequences=True)),\n",
"        Dropout(dropout_rate),\n",
"        LSTM(16, return_sequences=True),\n",
"        GlobalAveragePooling1D(),\n",
"        Dense(num_classes, activation='softmax'),\n",
"    ])\n",
"    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),\n",
"                  optimizer=Adam(learning_rate=0.01, decay=0.001),\n",
"                  metrics=['accuracy'])\n",
"    print(model.summary())"
]
},
{
"cell_type": "code",
"id": "2965e0d3-7649-4e04-b44f-950f4b3a2373",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
"id": "2965e0d3-7649-4e04-b44f-950f4b3a2373",
"outputId": "f0c59874-855d-4e08-af51-e684f6531a0b"
},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n",
"1287/1287 [==============================] - 43s 31ms/step - loss: 1.1219 - accuracy: 0.5346 - val_loss: 0.8971 - val_accuracy: 0.6651\n",
"1287/1287 [==============================] - 39s 30ms/step - loss: 0.6193 - accuracy: 0.7790 - val_loss: 0.7436 - val_accuracy: 0.7351\n",
"1287/1287 [==============================] - 37s 29ms/step - loss: 0.4338 - accuracy: 0.8569 - val_loss: 0.7290 - val_accuracy: 0.7615\n",
"1287/1287 [==============================] - 37s 29ms/step - loss: 0.3273 - accuracy: 0.8970 - val_loss: 0.7934 - val_accuracy: 0.7551\n",
"1287/1287 [==============================] - 38s 30ms/step - loss: 0.2597 - accuracy: 0.9215 - val_loss: 0.8643 - val_accuracy: 0.7370\n"
],
"source": [
"epochs = 5  # number of training epochs (the logged run used 5)\n",
"history = None\n",
"if LOAD_MODEL:\n",
"    # restore the training history that CSVLogger wrote during an earlier run\n",
"    history = pd.read_csv(BASE_PATH+'lstm_training.log', sep=',', engine='python')\n",
"else:\n",
"    csv_logger = CSVLogger(BASE_PATH+'lstm_training.log', separator=',', append=False)\n",
"    history=model.fit(x,y, epochs = epochs, verbose = 'auto', validation_data=(x_test, y_test), callbacks=[csv_logger]).history\n",
"    model.save(BASE_PATH+'long_short_term_memory_model.h5')"
]
},
{
"cell_type": "markdown",
"id": "nUjXgdOMFwD1",
"metadata": {
"id": "nUjXgdOMFwD1"
},
"source": [
"## Evaluation\n",
"After completing the training we can plot the accuracy and loss curves using pyplot."
]
},
{
"cell_type": "code",
"id": "26128add-7843-4533-8829-0fc6f03fe1a4",
"metadata": {
"base_uri": "https://localhost:8080/",
"height": 281
"id": "26128add-7843-4533-8829-0fc6f03fe1a4",
"outputId": "770d716c-44d8-40dd-9901-75179a54c0dc"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot accuracy and val_accuracy\n",
"acc = history['accuracy']\n",
"val_acc = history['val_accuracy']\n",
"\n",
"loss = history['loss']\n",
"val_loss = history['val_loss']\n",
"\n",
"# one x value per recorded epoch, starting at 1\n",
"epochs_range = range(1, len(acc)+1)\n",
"\n",
"plt.subplot(1,2,1)\n",
"plt.plot(epochs_range, acc, label='Training Accuracy')\n",
"plt.plot(epochs_range, val_acc, label='Validation Accuracy')\n",
"plt.legend(loc='best')\n",
"plt.xlim(1,len(val_acc)+1)\n",
"plt.title('Training Accuracy')\n",
"\n",
"plt.subplot(1,2,2)\n",
"plt.plot(epochs_range, loss, label='Training Loss')\n",
"plt.plot(epochs_range, val_loss, label='Validation Loss')\n",
"plt.legend(loc='best')\n",
"plt.xlim(1,len(val_acc)+1)\n",
"plt.title('Training Loss')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "bbbc20bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(32.99999999999999, 0.5, 'predicted')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"labels=['Extremely Negative','Negative','Neutral','Positive','Extremely Positive']\n",
"prediction = model.predict(x_test)\n",
"# pick the most probable class for every tweet\n",
"prediction = np.argmax(prediction, axis=1)\n",
"# rows of the confusion matrix are the true classes, columns the predicted ones\n",
"mat = confusion_matrix(y_true=y_test, y_pred=prediction)\n",
"sns.heatmap(mat,annot=True, cmap='Blues',fmt='d', xticklabels=labels, yticklabels=labels)\n",
"plt.xlabel('predicted')\n",
"plt.ylabel('true')"
]
},
{
"cell_type": "markdown",
"id": "9058966c",
"metadata": {},
"source": [
"Although the curves show a high training accuracy (about 0.92 after the 5th epoch), that is not the number we are interested in. What matters more is the validation accuracy, which peaks at about 0.76 in the 3rd epoch and then drops off while the validation loss rises. Training beyond that point no longer improves the model and instead leads to overfitting.\n",
"## User input\n",
"The next part is an experiment. We are going to use the model we just trained to predict the sentiment of strings entered by the user."
]
},
{
"cell_type": "code",
"id": "124900d0",
"metadata": {},
"outputs": [],
"source": [
"user_input = ''"
]
},
{
"cell_type": "code",
"id": "e9fcdd36",
"metadata": {},
"outputs": [],
"source": [
"def predict_input(text: str)->str:\n",
"    def format_string(string: str)->np.ndarray:\n",
"        # wrap the string in a dataframe so we can reuse the previous preprocessing function;\n",
"        # the 'Neutral' sentiment is only a placeholder\n",
"        tmp_df = pd.DataFrame({'OriginalTweet':[string],'Sentiment':['Neutral']})\n",
"        tmp_df = preprocess_dataframe(tmp_df)\n",
"        x = np.array(tmp_df['OriginalTweet'])\n",
"        # reuse the tokenizer from training without refitting it, so the word indices stay unchanged\n",
"        seq = tokenizer.texts_to_sequences(x)\n",
"        return pad_sequences(seq, maxlen=padded_size) # reusing padded_size from training the model\n",
"\n",
"    pred = model.predict(format_string(text))\n",
"    classes = {0: 'Extremely Negative', 1: 'Negative', 2:'Neutral', 3:'Positive',4:'Extremely Positive'}\n",
"    return classes[np.argmax(pred)]"
]
},
{
"cell_type": "code",
"id": "3c576520",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Neutral\n"
]
}
],
"source": [
"print(predict_input(user_input))"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "long_short_term_memory.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
}
},
"nbformat": 4,
"nbformat_minor": 5