Sentiment analysis through speech means determining the state of mind of the speaker, that is, recognizing the emotions a person expresses. The technique is among the most widely used marketing tools in the world: content and services can be personalized based on a user's interests, so the application has great potential that many companies could harness.
There are two main representations of emotion:
- Categorical:
This places an emotion into one of a few discrete categories, viz. anger, happiness, sadness and neutral.
- Attributes:
Here we look at the qualities of the emotion a person expresses.
- Activation: tells whether a person is excited or calm.
- Expectation: predictable or unpredictable.
- Dominance/Power: weak or strong.
- Valence: whether an emotion is positive or negative.
The following picture provides a broad view of how emotions can be classified based on attributes.
As we can see, the horizontal or x-axis represents positivity or negativity of emotion while the vertical axis shows how aroused/excited or calm a person is. Different classes of emotions fall under these quadrants. For example, happy is highly excited and very positive while angry is negative and highly aroused.
We will be building a Convolutional Neural Network (CNN) model for sentiment analysis through speech in this article, so before we go any deeper, let us first understand what a CNN is.
Convolutional Neural Networks:
If I showed you an image of an animal, how would you know which animal it is? The answer, obviously, would be: "based on its features, our brain classifies the animal as one or the other". For example, if it has a long neck, it might be a giraffe.
A CNN works similarly to our brain when it comes to classifying something. The data, say an image of a person smiling, is fed to the CNN, and after processing it gives an output stating that the person is happy.
So how does a CNN do it?
A picture is represented as a set of pixels. For a black and white image, the pixels form a 2D array with values ranging from 0 for black to 255 for white, the values in between representing shades of grey. If the image is colored, it is represented as a 3D tensor, with the third dimension holding the red, green and blue (RGB) channel values.
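This representation can be sketched in NumPy (the pixel values below are made up for illustration):

```python
import numpy as np

# A 4x4 grayscale image: one value per pixel, 0 (black) to 255 (white)
gray = np.array([
    [0, 64, 128, 255],
    [32, 96, 160, 224],
    [8, 72, 136, 200],
    [16, 80, 144, 208],
], dtype=np.uint8)
print(gray.shape)  # (4, 4): height x width

# A 4x4 color image: three channel values (R, G, B) per pixel
color = np.zeros((4, 4, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]  # top-left pixel is pure red
print(color.shape)  # (4, 4, 3): height x width x channels
```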
This matrix is then passed through the following steps:
This is the convolution function. It is an integral of the product of two functions, in which one function determines how the shape of the other is modified.
In the diagram given below, we can see an input image which is passed through a feature detector and gets convolved. So how does that happen?
The feature detector (also called a filter or kernel) can be a matrix of 3×3, 5×5, 7×7 or whatever size the user is comfortable with. Take this filter and place it over the image, i.e., over the pixels of the image represented in matrix form.
Next, multiply each value in the filter with the value at the same position in the input matrix, as indicated in the function given above. Sum the n×n products obtained this way into a single value and note it down in another matrix.
Now slide the filter across the input image. The number of pixels the filter moves at each step is called the stride; the stride could be 1 pixel, 2 pixels, etc. The final matrix obtained is called a feature map, or a convolved image.
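The steps above can be sketched in plain NumPy as a minimal no-padding, stride-1 convolution (the image and filter values are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, multiplying element-wise and summing
    at each position, to produce a feature map (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])
fm = convolve2d(image, kernel)
print(fm)  # a 3x3 feature map: note the size reduction from 5x5
```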
That is, the size of our image is reduced, which makes it easier to process. Some information is certainly lost, but we don't look at every single detail in real life either. As discussed, we consider only a few features when classifying things, so in the virtual world too we will look only at certain features rather than at each pixel.
A feature map thus helps us to get rid of unnecessary details and preserve only important features.
Similarly, multiple feature maps are created using different filters and lots of features are preserved. The model, through its training, learns which feature is more important.
Pooling works on the concept of spatial invariance. Consider a situation where we feed images of the same object taken from different angles to a network. How is it going to recognize the object? It will look for the most important distinguishing features in the images. Pooling brings a certain amount of flexibility into the model: if a feature is present in an image, the model should be able to identify and classify it irrespective of its position.
There are different types of pooling, for example min pooling, max pooling, etc., but in this article we will look at how to apply max pooling.
It is similar to convolution, except that no multiplication is performed; instead, we pick out the maximum value. Take the feature map obtained above and place a 2×2 filter on it. Out of the four values covered by the filter, select the largest and write it into another matrix. Then move the filter by a certain stride, as in the previous step. The result is a pooled feature map.
Pooling also helps us avoid overfitting.
So we have our pooled feature map. We take the numbers row by row and we put them into one long column. This gives us the data in vector form and that’s flattening.
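Max pooling and flattening can be sketched together in NumPy (the feature-map values are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 1, 4],
])
pooled = max_pool(feature_map)
print(pooled)            # [[4. 5.] [6. 4.]]
flat = pooled.flatten()  # row by row into one long vector
print(flat)              # [4. 5. 6. 4.]
```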
How is emotion classified?
Gather the data, here a speech signal. Publicly available speech emotion datasets can be used for this.
- Since our audio data is a long time series, we divide it into small windows based on time frames.
- We select the features to take into account while building the ML model, e.g. energy levels, pitch, etc. We will use the Librosa library in Python to extract features from the dataset, specifically MFCCs (Mel Frequency Cepstral Coefficients).
- Next, we build the model.
- We train the model over the dataset for a number of epochs and plot the results.
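The windowing step mentioned above can be sketched as follows (the frame length and hop size are illustrative choices, not values from the original code):

```python
import numpy as np

def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D signal into overlapping windows (frames)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.stack([
        signal[i * hop_length : i * hop_length + frame_length]
        for i in range(n_frames)
    ])

signal = np.arange(10)  # stand-in for audio samples
frames = frame_signal(signal, frame_length=4, hop_length=2)
print(frames.shape)  # (4, 4): four overlapping frames of four samples each
```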
# extracting features from the audio files
Syntax: class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Pandas is a Python library, imported here as pd. A pandas DataFrame is a 2-dimensional tabular data structure which may contain different types of data. The first line of code creates a dataframe containing the features of the audio dataset; this data will be used to train the model for sentiment analysis through speech.
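A minimal sketch of such a feature dataframe (the column names and values are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Each row holds the extracted features and the label for one audio file.
# 'feature' would normally be the MFCC vector computed by librosa.
df = pd.DataFrame({
    "feature": [[-451.2, 71.8, 3.4], [-380.5, 64.1, 8.9]],
    "label": ["happy", "angry"],
})
print(df.shape)  # (2, 2): two clips, a feature column and a label column
```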
In the next lines of code,
Syntax: librosa.core.load(path, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')
The audio file is loaded as a floating-point time series.
offset indicates the time (in seconds) after which reading starts, while duration represents how much audio is loaded.
res_type selects the resampling algorithm; 'kaiser_fast' is a faster variant that trades away a little quality.
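Putting these parameters together, the loading and feature-extraction step might look like the following sketch (the file path is a placeholder, and the clip length, offset and number of MFCCs are assumptions, not values from the original code):

```python
import numpy as np
import librosa

# Load 2.5 s of audio, skipping the first 0.5 s, with the fast resampler.
y, sr = librosa.load("audio_file.wav", duration=2.5, offset=0.5,
                     res_type="kaiser_fast")

# Compute 13 MFCCs per frame, then average over time into one
# fixed-length feature vector for the clip.
mfccs = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T, axis=0)
print(mfccs.shape)  # (13,)
```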
Next, we split the data into training and test sets and then build the model.
We build a CNN model for this classification problem as it gives better accuracy than LSTM (Long Short Term Memory) and MLP models.
We train the model for 1000 epochs. One epoch means the complete dataset has undergone forward propagation and backpropagation once, i.e., the weights have been updated over every batch of the training set (here batch size = 32, meaning the weights are updated after every 32 samples). To know more about forward propagation, backpropagation and how to create a neural network, visit …………….
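The relationship between samples, batch size and weight updates can be checked with a quick calculation (the training-set size here is illustrative):

```python
import math

n_samples = 320   # illustrative training-set size
batch_size = 32
epochs = 1000

# One weight update per batch; one epoch = one full pass over the data.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch)  # 10 weight updates per epoch
print(total_updates)    # 10000 updates over the full training run
```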
import numpy as np
numpy is a general-purpose library in Python, mainly used for array processing and scientific computation.
from keras.models import Sequential
Keras is another Python library; it runs on top of TensorFlow and is used for building neural networks. There are two ways to build a Keras model. The first is a Sequential model, built by stacking layers one upon the other; the other is the functional API (Model). The latter is more complicated and messier, so the former method is usually preferred.
from keras.layers import Dense, Flatten, Activation, Dropout
Dense describes a fully connected neural network layer. A fully connected network is one in which each node is connected to every node in the next layer.
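What a dense (fully connected) layer computes can be sketched as a matrix multiplication (the weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # output of the flatten step: 4 features
W = rng.standard_normal((4, 3))   # every input connects to every output node
b = np.zeros(3)                   # one bias per output node

dense_output = x @ W + b          # one value per node in the next layer
print(dense_output.shape)  # (3,)
```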
As the name suggests, Flatten is used to flatten the output of the convolutional layers so that it can enter our fully connected network.
Dropout is used to prevent our network from overfitting. An overfitted model makes more errors when introduced to a new dataset because it memorizes the training data rather than generalizing from it. Dropout(0.1) means 10% of the layer's units are randomly dropped during training.
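Dropout can be sketched as a random mask applied during training (rate 0.1, with a seeded generator for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 0.1
activations = np.ones(10)

# Each unit is zeroed with probability `rate`; survivors are scaled up
# by 1/(1 - rate) so the expected sum is unchanged (inverted dropout).
mask = rng.random(10) >= rate
dropped = activations * mask / (1 - rate)
print(int(mask.sum()), "of 10 units kept this pass")
```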
from keras.layers import MaxPooling2D, Conv2D
MaxPooling is used to keep only the most important features, thereby helping reduce the number of parameters.
Let us now decipher the given code:
We use the add() function to add layers to the model. We will be building a 1-dimensional convolutional neural network, hence Conv1D. Here, 128 stands for the number of filters (nodes) in that layer and depends on the size of the dataset.
Next, we have kernel size 5; since this is a 1-D convolution, the filter spans 5 values (in a 2-D convolution this would correspond to a 5×5 matrix).
Padding is usually done to fill up the unused spaces in the matrix so that the size of each matrix is the same. Padding can be "valid", meaning no padding; "causal", which gives causal convolutions for temporal data; or "same", which ensures that the length of the output is the same as that of the input.
In a sequential model, the first layer tells the model about the shape (dimensions) of the input layer hence it takes the parameter input_shape.
In the article on neural networks, we saw that each sample, x carries a weight w and when it reaches the next layer during forward propagation, the product of x and w is summed up and an activation function is applied to it in order to make sure that its output lies in a fixed range.
Here, to the input layer, we have applied relu, the rectified linear unit activation function. Relu outputs 0 for any negative input and passes positive inputs through unchanged, so its output stays at 0 up to a threshold and then increases linearly with the input.
For example, consider that you are going to buy a car, and let the age of the car be one of the parameters determining its price. A 10-year-old car gains nothing from its age, but a 100-year-old car would be an antique of historical importance, so its price would shoot sky-high. That is how relu behaves: flat at zero up to a point, then increasing.
We have added 18 layers to our neural network. Apart from the input and output layers, the number of layers a user adds is found by trial and error: whichever number gives the best accuracy, we settle for that many layers.
In the final output layer, the sigmoid activation function is used. This function gives output in the range 0 to 1 and is used when one wants the output to represent a probability.
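The two activation functions discussed above can be sketched directly in NumPy:

```python
import numpy as np

def relu(x):
    # 0 for negative inputs; positive inputs pass through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(0))  # 0.5
```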
An optimizer must be specified when compiling the model.
The gradient descent algorithm may take larger steps in one direction, say x, and smaller ones in another, say y. If the algorithm took steps in only one direction it would converge faster, and that is where RMSprop steps in: it restricts oscillations in the vertical direction. It is usually a good choice for RNNs. lr is the learning rate, which is advised to be kept small, as a large rate can increase the training error rather than decrease it.
The decay rate is used to gradually decrease the learning rate as training progresses.
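Since the original code listing is not reproduced here, the following is only a sketch of what such a model could look like; the input shape, number of layers, layer sizes, learning rate and decay are assumptions rather than the article's actual values, and the exact optimizer argument names depend on the Keras version:

```python
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Activation, Dropout, Flatten, Dense
from keras.optimizers import RMSprop

num_classes = 4          # assumed: anger, happiness, sadness, neutral
input_shape = (216, 1)   # assumed: MFCC frames per clip, one channel

model = Sequential()
model.add(Conv1D(128, 5, padding="same", input_shape=input_shape))
model.add(Activation("relu"))
model.add(Dropout(0.1))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(128, 5, padding="same"))
model.add(Activation("relu"))
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(num_classes))
model.add(Activation("sigmoid"))

# Small learning rate; decay shrinks it gradually over training.
opt = RMSprop(learning_rate=1e-5, decay=1e-6)
model.compile(loss="categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])
model.summary()
```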
Thus, after compiling and training the model, we can plot the results.
Sentiment analysis through speech may seem like magic, but it is achievable with reasonable accuracy, and many things can be done with it; for example, it can be used in classrooms and offices to analyze the behavior of students.
Humans perceive emotionally intelligent devices as more intelligent, and conversations with such devices feel more natural. This technology finds use in many varied fields like health, education and entertainment, hence it has huge market potential in the years to come.