This notebook will show you an example of a long short-term memory (LSTM) neural network with TensorFlow. As a dataset we are going to use the Corona_NLP dataset, which you can find in the data folder. It contains tweets about the coronavirus along with an estimate of the author's sentiment toward the topic.
For this task we require quite a few imports. The more notable ones are `re`, which will be used in the preprocessing step, as well as `keras.models.load_model` and `keras.callbacks.CSVLogger`, which allow saving and loading the model and its training history from disk. For this we also define a `BASE_PATH`, the folder where the model and logfile will be saved.
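Below is a minimal sketch of what these imports and constants could look like. The exact import paths depend on your Keras/TensorFlow version (they may live under `tensorflow.keras`), and the CSV file names, encoding, and `BASE_PATH` value are assumptions based on the usual layout of the Corona_NLP dataset.

``` python
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from keras.models import Sequential, load_model
from keras.callbacks import CSVLogger
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Folder where the trained model and the training log will be saved
BASE_PATH = './'

# File names and encoding are assumptions; adjust them to your copy of the dataset
train = pd.read_csv('data/Corona_NLP_train.csv', encoding='latin-1')
test = pd.read_csv('data/Corona_NLP_test.csv', encoding='latin-1')
train.head()
```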
%% Output

1 advice Talk to your neighbours family to excha... Positive
2 Coronavirus Australia: Woolworths to give elde... Positive
3 My food stock is not the only one which is emp... Positive
4 Me, ready to go at supermarket during the #COV... Extremely Negative
%% Cell type:markdown id:n2SNLzfwqCuG tags:
As our first preprocessing step we will drop some columns that we are not interested in, such as the username, screen name, and location of the author and the time of the tweet.
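A short sketch of what that could look like; the column names are taken from the usual Corona_NLP CSV layout and may differ in your copy of the data.

``` python
# Drop author- and time-related columns, keeping only the tweet text and its sentiment
train = train.drop(columns=['UserName', 'ScreenName', 'Location', 'TweetAt'])
test = test.drop(columns=['UserName', 'ScreenName', 'Location', 'TweetAt'])
```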
We are going to sort a temporary dataframe by sentiment before plotting, so that the plot shows the most negative sentiment on the left and the most positive on the right.
``` python
mapping = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3, 'Extremely Positive': 4}  # Assign each sentiment class a value
tmp = train.sort_values(by='Sentiment', key=lambda col: col.map(mapping))  # sort by that value
sns.histplot(data=tmp,x="Sentiment")
plt.xticks(rotation=30)
plt.show()
```
%% Output
%% Cell type:markdown id:UQr4hGqe1hLe tags:
All in all the data is pretty balanced, although the extreme opinions appear less often than the more moderate ones. If we have trouble predicting those correctly, it might be an option to merge each extreme class with its moderate counterpart to form a single class each.
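If that ever becomes necessary, the merge could be as simple as the sketch below. This is purely illustrative; the notebook keeps all five classes.

``` python
# Collapse the extreme classes into their moderate counterparts (illustration only)
merged = train['Sentiment'].replace({'Extremely Negative': 'Negative',
                                     'Extremely Positive': 'Positive'})
merged.value_counts()
```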
## Preprocessing
The 5 classes we have will need to be encoded as numbers so that our model will be able to handle them properly.
Aside from that we aren't doing a lot of preprocessing. The only things we do are removing URLs and Twitter mentions, as those are not very helpful when predicting the sentiment. This function will encode the Sentiment values in the dataframe and apply the `preprocess_string` function to every entry.
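The notebook's exact implementation is not shown in this excerpt; a minimal sketch of the two helpers could look as follows. The regular expressions for URLs and mentions are assumptions, and `test` is assumed to be the second dataframe loaded above.

``` python
def preprocess_string(string: str) -> str:
    # Remove URLs and Twitter @mentions, which carry little sentiment information
    string = re.sub(r'https?://\S+', '', string)  # assumed URL pattern
    string = re.sub(r'@\w+', '', string)          # assumed mention pattern
    return string.strip()


def preprocess_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Encode the five sentiment classes as numbers (reusing the mapping defined above)
    df['Sentiment'] = df['Sentiment'].map(mapping)
    # Clean every tweet with preprocess_string
    df['OriginalTweet'] = df['OriginalTweet'].apply(preprocess_string)
    return df


train = preprocess_dataframe(train)
test = preprocess_dataframe(test)
train.head()
```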
%% Output

1 advice Talk to your neighbours family to excha... 3
2 Coronavirus Australia: Woolworths to give elde... 3
3 My food stock is not the only one which is emp... 3
4 Me, ready to go at supermarket during the #COV... 0
%% Cell type:markdown id:RQZNBWi83HOD tags:
The data as it is now is not ready to be handed over to the model yet. First we need to do some additional preprocessing. We will start by converting our features and targets into numpy arrays, because Keras cannot handle dataframes. We will also create and fit a tokenizer on both datasets so that it learns all the words that occur. After tokenizing the tweets we will apply padding to make all tweets a uniform length.
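A rough sketch of that step, assuming the `train`/`test` dataframes from above; the choice of padding length is an assumption.

``` python
# Convert features and targets to numpy arrays, since Keras cannot handle dataframes
x_train = np.array(train['OriginalTweet'])
y_train = np.array(train['Sentiment'])
x_test = np.array(test['OriginalTweet'])
y_test = np.array(test['Sentiment'])

# Fit the tokenizer on both datasets so that it learns every word that occurs
tokenizer = Tokenizer()
tokenizer.fit_on_texts(np.concatenate((x_train, x_test)))

# Turn each tweet into a sequence of word indices and pad them to a uniform length
train_seq = tokenizer.texts_to_sequences(x_train)
test_seq = tokenizer.texts_to_sequences(x_test)
padded_size = max(len(seq) for seq in train_seq)  # assumed padding length
x_train = pad_sequences(train_seq, maxlen=padded_size)
x_test = pad_sequences(test_seq, maxlen=padded_size)
```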
After padding we can take a look at how our tweets look now. The first tweet in our array was reduced to a simple `and and` by the preprocessing. Each `and` is now represented by a 4 in the padded and tokenized sequence. The padding character used is 0, and it was added in front of the tweet.
Now that the preprocessing is done we can finally start building the model. This process usually requires a bit of experimentation to achieve a good accuracy so feel free to change some variables or add some layers and see if you can get a better result.
As this model takes quite a bit of time to train, it is automatically saved after the training completes. When running this notebook a second time you can set the `LOAD_MODEL` variable to `True` to load the pretrained model from disk instead of retraining it.
%% Cell type:code id:yBxNn2GTa26v tags:
``` python
# Switch this to True or False to enable/disable loading of the model from disk
# Make sure to run the entire notebook at least once before setting this to True.
LOAD_MODEL = False
```
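The architecture and training code are not shown in this excerpt. The sketch below is one way such a model could be built, trained with a `CSVLogger`, and saved to `BASE_PATH`; the layer sizes, number of epochs, and file names are assumptions, not the notebook's original values.

``` python
import os

MODEL_PATH = os.path.join(BASE_PATH, 'lstm_model.h5')    # assumed file name
LOG_PATH = os.path.join(BASE_PATH, 'training_log.csv')   # assumed file name

if LOAD_MODEL:
    # Load the previously trained model and its logged training history from disk
    model = load_model(MODEL_PATH)
    history = pd.read_csv(LOG_PATH)
else:
    # Embedding -> LSTM -> Dense classifier over the five sentiment classes
    model = Sequential([
        Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64),
        LSTM(64),
        Dense(5, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train,
              validation_data=(x_test, y_test),
              epochs=5,
              callbacks=[CSVLogger(LOG_PATH)])
    model.save(MODEL_PATH)
    history = pd.read_csv(LOG_PATH)
```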
Although the graph shows quite a high training accuracy, that is not what we are interested in. More important is the validation accuracy, which drops off after the 3rd epoch. Continuing training beyond that point will not improve the model any further but rather lead to overfitting.
## User input
The next part is an experiment. We are going to use the model that we just trained to try to predict the sentiment of strings entered by the user.
%% Cell type:code id:124900d0 tags:
``` python
user_input = ''
```
%% Cell type:code id:e9fcdd36 tags:
``` python
def predict_input(input: str) -> str:
    def format_string(string: str) -> np.array:
        # Creating a dataframe so we can reuse the previous preprocessing function
        tmp_df = pd.DataFrame({'OriginalTweet': [string], 'Sentiment': ['Neutral']})
        tmp_df = preprocess_dataframe(tmp_df)
        x = np.array(tmp_df['OriginalTweet'])
        tokenizer.fit_on_texts(x)  # reusing the tokenizer from training the model
        seq = tokenizer.texts_to_sequences(x)
        return pad_sequences(seq, maxlen=padded_size)  # reusing padded_size from training the model