Emotion Detection via Voice and Speech Recognition

Emotion detection from voice signals is a difficult challenge, yet it is needed for human-computer interaction (HCI). In the speech emotion recognition literature, various well-known speech analysis and classification methods have been used to extract emotions from signals, and deep learning strategies have recently been proposed and discussed as a workable alternative to conventional methods. Several recent studies have employed these methods to identify speech-based emotions. This review examines the databases used, the emotions collected, and the contributions to speech emotion recognition. The research team created the Speech Emotion Recognition Project, which recognizes human speech emotions, and developed it using Python 3.6 with PyCharm as the IDE. The RAVDESS dataset was used since it contains eight distinct emotions expressed by all speakers.


PROBLEM STATEMENT
The manuscript deals with the exploration and normalization of the data. As a performance measure for conversational analysis, Speech Emotion Recognition (SER) may be used to categorize calls based on emotions and assess customer happiness, which enables businesses to enhance their services. The challenge is to design automated software for this purpose.

MOTIVATION OF STUDY
In this era of technology, one of the major concerns is understanding a person's emotions. To address this problem, this research work analyses a person's speech and determines the emotion conveyed. The motivation for this paper is to face the problem that is emotion itself.
Human beings, as a species of higher intelligence, show various emotions, so it is necessary to understand the emotions conveyed in speech. The basic human emotions can be categorized as happiness, sadness, fear, disgust, anger, and surprise; these can be further classified into complex emotions such as awe, guilt, and envy. It is therefore in our interest to understand these emotions.

OBJECTIVES OF RESEARCH
Emotion recognition is an area of voice recognition that is expanding in acceptance and reputation. This work attempts to apply deep learning to detect sentiments from the data, even though methods already exist for understanding sentiment using a machine learning approach.
In this research project, the research team has built a model that can recognize emotions from sound files using a supervised learning algorithm known as an MLP Classifier. The objective of speech emotion recognition is to detect the presence of emotions such as frustration or annoyance in the speaker's voice by using the librosa library in Python and the RAVDESS dataset.

SCOPE OF STUDY
This project works on how audio files can be used to detect emotions. Various audio files are processed and searched, yielding different sets of emotions such as sad, happy, and nervous. One purpose of this study is to make human-system interaction more effective.
As of now, the system detects emotion from single audio files but not grouped audio files. Voice detection could be implemented more accurately by cleaning audio files that are mixed with background disturbances, and by removing the pauses within audio files that lower the accuracy of the results. Features of different types of voices from different domains can also be loaded to produce the most effective emotion output from voice.

STATE OF THE ART AND TOPIC ORGANIZATION
This study first gives a general idea of speech emotion recognition and how it is applicable in smart cities. The author team explained how the speech recognition system works and the various emotions that can be detected. For the endorsement of this study, the author team reviewed eight research papers on the concerned topics.
The author team has described the methodology, presenting the different methods used for the study. This study used an ML-based critical analysis of the data obtained from the RAVDESS database. The manuscript includes tabular and graphical representations of the data using graphs and spectrograms.
The recommendation section, one of the most critical parts of the research, presents recommendations for particular applications to address the issues and limitations identified in the appraisal. The novelty section refers to the elements that are new in the manuscript. Finally, the conclusion section represents the final assessment and describes the overall findings of the study.

ETHICAL COMMITTEE AND FUNDING
The experiments do not include any human-related interventions, so no ethical constraints have been violated. Although the subjects performing the study were humans, the study does not violate any health-related measures. The project is not funded by any agency.

ROLE OF AUTHORS
Dr. Rohit Rastogi acted as team leader and coordinated among all co-authors. He prepared the topic introduction and background study, structured the manuscript, and ensured the quality of the research. Mr. Shubham did the analysis part. Mr. Sarthak and Mr. Tushar performed the backend implementation tasks, which consisted of downloading the dataset and assembling it at the right path location. Mr. Shubham handled the task of splitting the dataset into training and testing parts. Mr. Sarthak performed the initialization of the MLP Classifier and trained the model. Mr. Shubham completed the graph-related work and tested the accuracy of the model to come to a final conclusion.

INTRODUCTION
Speech is the primary tool used by humans to interact and convey information. The interesting question is what type of information is really delivered: detecting emotion is among the most crucial marketing tactics in the world today. For this reason, the research team decided to work on a project in which researchers can control numerous AI-related applications by gauging someone's mood merely from their voice.
In this article, the research team has tried to combine prosody, the non-verbal aspects of language that allow people to convey or understand emotion, with deep learning in order to create a model that understands human emotions through speech.
Examples include the ability of call centers to play music during tense exchanges. Another example may be a smart automobile that slows down when the driver is scared or furious. Because of this, this kind of application has a lot of promise in the smart world and might potentially increase consumer safety while benefiting businesses.

Speech Recognition and Worldwide Applications
Today, the most common application of speech recognition is in mobile devices. Speech recognition has become a crucial element of many smartphones currently on the market, from voice calling to asking Siri what the weather will be like on Monday.
Speech recognition technology is also used for voice calling, speech-to-text conversion, call routing, and voice search. Users can also utilize speech recognition in computer word processing programs like Google Docs or Microsoft Word, where they can dictate what they wish to appear as text and edit it.
It is similar to being able to listen to someone, recognize words and phrases, and then translate them into sentences that help you grasp what they are saying (Xiaobo, B. et al., 2018) (as per Figure 1).

Emotion Recognition and 21st Century Lifestyle
Healthcare, marketing, fraud detection, and manufacturing are just a few of the industries that might benefit from the application of emotion recognition technologies. In urgent care centers, where individuals don't make appointments, healthcare practitioners can utilize emotion detection AI to prioritize individual patients by monitoring facial expressions in the reception room. The most uncomfortable people could be given top priority, while those who are less ill would have to wait for a gap in service.
Before introducing a product, managers like to discover how it will perform. By evaluating the facial movements of testers using their goods or viewing their commercials, emotion recognition technology may help businesses get more out of focus groups. Emotion recognition technology is also very useful for the automotive industry. Automobiles that warn drivers when they fall asleep or start to drift off might help avoid serious accidents. The warning might also be set off by intense emotions like road rage. In the case of vehicles with self-driving or autopilot functions, this may be extremely useful: the autopilot can take over while informing the driver if the human operator becomes extremely emotional or fatigued.
When a consumer submits a claim, insurance firms employ voice analysis to determine whether they are being truthful. Up to 30% of consumers have acknowledged lying to their auto insurance provider to obtain coverage, according to independent polls (Morre, S. et al., 2018) (as per Figure 2).

Speech Recognition with Emotion Recognition and Accuracy Standards
Emotion recognition from speech has been gaining popularity in recent years. It has huge potential and advantages in various sectors such as teaching, banking, and many more. Speech recognition with emotion recognition has two major parts: feature extraction from speech and an emotion machine classifier.
"The problem of Speech Emotion recognition is solved by classifying an order, in which input is a sequence whose length varies and a single output is obtained" (L. Kerkeni et al., 2018).
In recent years, researchers have proposed various classification algorithms for Speech Emotion Recognition (SER), such as the Hidden Markov Model (HMM), Support Vector Machine (SVM), and Neural Networks (NN). The researchers use a Multilayer Perceptron (MLP) for speech emotion recognition in this study. The various classification algorithms have different accuracy standards for different emotions and different datasets (as per Figure 3).
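As a minimal sketch of the MLP approach named above, the classifier can be initialized with scikit-learn; the hyperparameter values here are illustrative assumptions, not the study's exact settings.

```python
# Illustrative initialization of scikit-learn's MLPClassifier for SER.
# All hyperparameter values below are assumptions for demonstration only.
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(300,),   # one hidden layer; the size is an assumption
    alpha=0.01,                  # L2 regularization strength
    batch_size=256,
    learning_rate="adaptive",    # keep the rate adaptive while loss decreases
    max_iter=500,
)
```

Once feature vectors and emotion labels are available, `model.fit(X_train, y_train)` trains the network and `model.predict(X_test)` returns predicted emotions.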

Smart City Designs and Emotion Capture Scenario
In this age, cities are becoming more advanced and smarter. Smart cities include smart infrastructure, smart healthcare, smart technology, and smart energy. For a city to achieve the status of a smart city, IoT is necessary to connect various things. Speech emotion recognition can play a vital role in making the system faster and more efficient by recognizing emotion.
Speech emotion recognition has various applications in a smart city. Using speech emotion recognition, a customer's emotion can be determined after a service, which could be helpful in obtaining a review. SER can also be helpful in a distress scenario: if a person is frightened, this can be recognized and action can be taken in times of emergency such as a burglary or an accident.
"SER serves an important role in developing smart services which are applicable in surveillance, healthcare, audio forensics, affective computing, and human-machine interaction" (Badshah, A.M., et al., 2019) (as per Figure 4).

Knowledge Pyramid and Knowledge Extractions by Speech Recognition-Based Systems
For any system, knowledge management is needed to work efficiently, and it is required to ensure that the right things are available at the right place. The knowledge pyramid is a representation of the relation between four different factors: Data, Information, Knowledge, and Wisdom.
In this system, data is a set of different types of voices. Information consists of analyzing the data for further processing by the system. Knowledge consists of how one can apply the collected information to achieve desired goals. The last is wisdom: what one is capable of doing now and what can be achieved in the future (Demicran, S. et al., 2014) (as per Figure 5).

Impact of AI, ML, and Big Data Based Systems in Global Development Index
In this fast-growing world, there is a requirement for models that help one keep pace with the world. AI and ML help to enhance creativity and can help to manage large data in the proper way.
The development of mobile phone software and various devices has made a huge change in the medical field. AI, ML, and Big Data management are also contributing facilities in wireless-based applications that help reach remote areas. As the population rises, countries are also funding and investing in these types of modern technologies. ML is the part of AI that is capable of learning automatically and improves its functionality over time. With the help of these modern techniques, performance-measuring devices or indicators can be developed that make one's work easy.
AI can take up tasks that involve high risk and work in places where humans cannot stay for a long time. It is also estimated that AI and ML may contribute an additional increase in global GDP of 1.2% annually (Z. Khan, et al., 2020) (as per Figure 6).

LITERATURE REVIEW
To improve the understanding and importance of speech emotion recognition, our team reviewed various papers. In one study key to understanding SER, the speech signal was used to extract standard emotional speech characteristics such as Perceptual Linear Prediction (PLP) cepstral coefficients. For training and recognition purposes, that text selected 1200 phrases containing the four fundamental emotions of grief, rage, surprise, and happiness, using 40% and 60% of the voice data for training and testing, respectively. The study put forward a technique for realizing the emotional elements that were automatically retrieved from the text. To extract voice emotion characteristics, a 5-layer deep network was trained using DBNs.
They will carry out more research on voice emotion identification using DBNs in the future and enlarge the training data set; their ultimate goal is to research how to increase the accuracy of voice emotion identification (C. Huang et al., 2015). These tests produced more steady, accurate, and robust recognition results in challenging situations with changing language and speaker as well as other environmental aberrations. It is simple to identify emotions like pleasure, happiness, sadness, surprise, neutrality, boredom, disgust, fear, and rage, but when real-time emotion identification is sought, it becomes challenging to accomplish (R.A. Khalil et al., 2019).
L. Kerkeni, et al., (2018) demonstrated how using different features and databases in Speech Emotion Recognition (SER) can produce different results and accuracy.
They used Mel-Frequency Cepstrum Coefficients (MFCC) and Modulation Spectral Features (MSF) for feature extraction from speech. The Berlin and Spanish databases were used. The classification algorithms used were Multivariate Linear Regression (MLR), Support Vector Machine (SVM), and Recurrent Neural Networks (RNN).
For training the model, 70% of the data was used, and for testing, 30%. From the Berlin database using the MLR classifier, the average emotion detection rates for the MS, MFCC, and MFCC+MS features were 60.70%, 67.10%, and 75.90%; for the same features on the Spanish database, the results were 70.60%, 76.08%, and 82.41%. From the Berlin database using the SVM classifier, the average detection rates for the MS, MFCC, and MFCC+MS features were 63.30%, 56.60%, and 59.50%; for the Spanish database with the same features, the results were 77.63% and 70.69%. From the Berlin database using the RNN classifier, the average emotion detection rates for the MS, MFCC, and MFCC+MS features were 66.32%, 69.55%, and 58.51%; for the same features on the Spanish database, the results were 82.30%, 86.56%, and 90.05%.
Based on the above results, the RNN classifier with the MFCC+MS feature extractor gives the highest accuracy of 90.05% for the Spanish database. It is too early to determine which system is best for speech emotion recognition, but the Fourier transform method is the most used method in speech recognition (L. Kerkeni et al., 2018).
B.A. Malik and his team (2017) proposed a feature learning system powered by a discriminative CNN which uses spectrograms to recognize emotion from speech.
A short-term Fourier transform is used to generate spectrograms from the speech input. Convolutional Neural Networks were used for classification of the spectrograms, and emotion was predicted using a majority voting scheme over multiple spectrograms. The Berlin Emotional Database was used for training and evaluation.
The accuracies for emotion prediction using a CNN with rectangular kernels on the Berlin Emotional Database were highest for angry, at 99.32%, and lowest for happy, at 52.45%. If additional labelled data were provided and a much deeper CNN with rectangular kernels could be trained, the suggested approach could be improved even more. Using spectrograms, an experiment showed that rectangular kernels and max pooling processes are better suited for SER (Badshah, A.M. et al., 2017).
H. Aouani and Y. Ben Ayed (2020) demonstrated how one can recognise emotions from speech and visualise the output. Many benefits of speech emotion recognition are explained, in fields such as education, automobiles, security, communication, and health.
Their emotion recognition system uses parameters such as 39 MFCC coefficients, HNR, ZCR, and TEO with Support Vector Machines, together with an auto-encoder. There is also the Harmonic-to-Noise Rate (HNR) parameter. SVM is used for classification of emotions, and the auto-encoder helps in the feature selection method.
The RML Emotion Database, which contains 720 emotion expression samples taken from the Ryerson Multimedia Lab, was used in testing. This dataset was furnished in six different languages. After a series of experiments, they achieved a better identification rate. Firstly, they presented the performance of a system based on a fusion of HNR; secondly, the application of auto-encoder dimensions. The results show the effectiveness of this system in achieving good results (Aouani, H. et al., 2020).
Woo, B.S. and team (2021) emphasized the need to understand emotions from speech or voice for better understanding, as emotions are generally accompanied by different changes in one's body. To obtain better results, they required a high-quality speech database, so they constructed a Korean Emotional Speech Database (K-EmoDB) and used it with an RNN network. To check sudden changes in voice, it was loaded with MFCC, chroma, spectral, harmonic, and other features. The feature extraction tool Essentia, available as a free open-source tool, is used for harmonic feature extraction. An LSTM model is used so that they can recognize emotion from speech; generally, emotions can only be recognized when one listens to the complete audio or speech.
The first experiment using the K-EmoDB database achieved approximately 65.89% accuracy, and in the second experiment, when LSTM was used, the accuracy was 62.63%. They noted that using an RNN model can increase the performance, although they are still dealing with the problem of recognizing a wider variety of emotions (Woo, B. et al., 2021).

Name of Algorithms Used
The algorithms used here include MFCC (Mel-Frequency Cepstral Coefficients), chroma feature extraction, and the mel spectrogram (Mel Spectrogram Frequency).

Types of Databases
Natural, Simulated (Acted), and Elicited (Induced) emotional speech databases are the three types of databases utilized to construct speech emotion recognition systems:
• Natural Database: The majority of natural speech datasets come from talk shows, contact centre recordings, radio conversations, and other similar sources. These unscripted talks are sometimes referred to as real-life speeches. The data is more difficult to process.
• Simulated Database: Professional or semi-professional actors record performed speech databases in sound-proof studios. Compared to other ways, creating such a database is quite simple.
• Elicited Database: Elicited speech databases are made by putting speakers in a simulated emotional scenario that can elicit a variety of emotions. The emotions are near to real ones, despite not being fully evoked.

Dataset
The study team scoured the internet and discovered many datasets, some of which are listed below:
1. Ryerson Audio-Visual Database of Emotional Speech and Song.
2. Crowd-sourced Emotional Multimodal Actors Dataset.
The research team uses the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). It can be downloaded for free and features 24 professional actors (12 female and 12 male). The author team personally checked and confirmed that the emotions recorded by the voice artists in the RAVDESS dataset are authentic.

RAVDESS Data Set Attributes and URL
Both in speech and in song, there are expressions of calmness, happiness, joy, sadness, anger, fear, surprise, and contempt (as per Figure 7). Speech-emotion-recognition-ravdess-data.zip (Google Drive).
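RAVDESS encodes each clip's metadata in its file name: per the dataset's documented naming convention, names consist of seven hyphen-separated numeric fields, and the third field is the emotion code. A small sketch of extracting the label (the example file name is illustrative):

```python
# Parse the emotion label out of a RAVDESS file name.
# Per the dataset's naming convention, the third hyphen-separated
# field is the emotion code (01-08).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path):
    basename = path.split("/")[-1]   # e.g. "03-01-05-01-02-01-12.wav"
    code = basename.split("-")[2]    # third field is the emotion code
    return EMOTIONS[code]
```

This kind of parsing is how labels for training can be obtained without any separate annotation file.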

Metadata and Sample Dataset
Pitch, energy, and intensity are all key factors in expressing the emotional content of speech. The speech attributes used to determine the emotions were:
• Pitch: provides the wave's greatest peak, allowing us to gauge the emotional condition.
• Energy: the most effective variable for recognizing emotions.
• Intensity: utilized to determine the physical energy and volume of speech (as per Figure 8).

Dataset Size (Storage Space)
This large dataset consists of 7356 files, which 247 individuals assessed 10 times each for emotional sincerity, intensity, and validity. The complete dataset comes in at 24.8 GB and includes 24 actors, but the author team used only the audio data available in the RAVDESS dataset.

Functional and Non-Functional Requirements
Functional requirements are product features or functions that developers must implement to enable users to accomplish their tasks, so it is important to make them clear both for the development team and the stakeholders. Generally, functional requirements describe system behavior under specific conditions. Non-functional requirements (NFRs) are a set of specifications that describe the system's operational capabilities and constraints and attempt to improve its functionality. These are the requirements that outline how well the system will operate, including things like speed, security, reliability, data integrity, etc.

Hardware Requirement
Processing Power: 1.7 GHz or above
Memory: 2 GB or above
Storage: 100 MB or above for applications; 600 MB or above for databases
Sound: sound card required, plus a microphone for capturing speech

Software Requirement
IDE: any IDE which supports Python
Frontend: Kivy, a Python framework for creating UIs
Backend: Python 3.6 or higher
Additionally installed libraries: librosa, soundfile, sklearn, numpy, pickle, and pyaudio.

Network Requirement
A cloud or distributed environment can be used, or the system can be used by a single user.

OS Requirement
Windows 10 is used for back-end work, but other versions like Windows 7, 8, or 11 will also work. It provides the environment for running the Python software and downloading the dataset. Other operating systems, such as Linux and macOS, can also be used.

Database Requirement
A database is required so that the information can easily be stored and accessed later without any problem. The research team uses the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Structured Query Language (SQL) is a programming language used to develop and query databases.

Storage Requirement
The system's storage is required to hold the audio files downloaded from the RAVDESS dataset. The basic storage requirement is 100 MB or above for applications, plus an additional 25 GB for the dataset.

Front End
For the front-end part, the research team is looking to design an application for this model. The research team is considering Kivy GUI, a Python-based framework, for developing the application.

Back End
For the backend, the Python programming language is used. Python is a high-level, interpreted, general-purpose language widely used in machine learning. Additional libraries are used: librosa for analyzing audio, soundfile for reading sound files from datasets, NumPy for numerical calculations, pickle for serializing and deserializing data objects, pyaudio for taking audio from the user's microphone, and sklearn, which contains ML algorithms.
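The pickle usage mentioned above can be sketched as a pair of helpers for persisting and reloading a trained classifier; the file name `ser_model.pkl` is an arbitrary choice, not a name from the project.

```python
# Minimal sketch of persisting a trained classifier with pickle,
# as the text describes. The default path is an arbitrary assumption.
import pickle

def save_model(model, path="ser_model.pkl"):
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path="ser_model.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```

Saving the fitted model this way lets the front-end application predict emotions without retraining on every launch.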

Steps of Execution
1. Load the data as input speech.
2. Extract features from the speech.
3. Split the dataset for training and testing.
4. Initialize the MLP classifier and train the model.
5. Get emotions from the input data as output.
6. Check the accuracy of the model.
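The six execution steps above can be sketched end-to-end as follows. Synthetic feature vectors stand in for real RAVDESS audio so the sketch is self-contained; in the actual project, steps 1-2 would use librosa on each audio file.

```python
# End-to-end sketch of the execution steps. Synthetic feature vectors
# replace real RAVDESS audio so the example runs on its own.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)

# Steps 1-2: stand-in for loading speech and extracting features.
X = rng.rand(200, 180)            # 200 clips, 180 features each
y = rng.randint(0, 8, size=200)   # 8 RAVDESS emotion classes

# Step 3: split the dataset for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 4: initialize the MLP classifier and train the model.
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)

# Steps 5-6: predict emotions from the test data and check accuracy.
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
```

On random features the accuracy is near chance; on real extracted features the same pipeline yields the figures reported later in the manuscript.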

Different Diagrams
As presented below.

Flow Chart of the Activity
In this flowchart diagram, the dataset contains sound files that load training data into the feature extractor. Live data is first pre-processed and then its features are extracted. For the training model, features are extracted and saved for the classifier, and then the analysis identifies emotion from speech (as per Figure 9).

ER Diagram
In this ER diagram, the dataset contains sound files which load training data into the feature extractor. Features are extracted and saved for the classifier to train the model. Testing data is used on the model for accuracy standards, speech from the user is entered to recognize emotion, and the user data is sent back for training the model (as per Figure 10).

Use Case Diagram
The use case diagram explains that whenever a user speaks, the input is identified as either live user voice or well-trained sample audio. Sample test audio goes directly to the classifier; otherwise, the live data goes through pre-processing for feature extraction (MFCC) before features are extracted. Features are taken from the training model, kept for the classifier, and then analysis is used to determine the emotion in the speech, which the system detects and displays to the user (as per Figure 15). In the Python code, accuracy is used as the performance metric for the conclusions of this study. Accuracy is the proportion of predictions that the model got right; the higher the accuracy of the classifier, the better it is at predicting the correct emotion.
Figure 19 is the confusion matrix of the Multi-Layer Perceptron (MLP) Classifier. In the confusion matrix, the Y-axis represents the actual emotions and the X-axis represents the values predicted by the model. According to the figure, the most accurately predicted emotions are angry and happy (as per Figure 19).
Figure 20 is the confusion matrix of the Support Vector Machine (SVM) Classifier. The Y-axis represents the actual emotions and the X-axis represents the values predicted by the model; according to the figure, the most accurately predicted emotions are angry and happy (as per Figure 20). The confusion matrix of the Gaussian Naive Bayes (GNB) Classifier follows the same layout, with the actual emotions on the Y-axis and the predicted values on the X-axis. Figure 22 is the confusion matrix of the Decision Tree Classifier, with the actual emotions on the Y-axis and the predicted values on the X-axis; the most accurately predicted emotions are calm and fearful (as per Figure 22).
Figure 23 is the confusion matrix of the Random Forest Classifier, with the actual emotions on the Y-axis and the predicted values on the X-axis; the most accurately predicted emotions are calm and fear (as per Figure 23).
Figure 24 is the confusion matrix of the Support Vector Machine (SVM) Classifier, with the actual emotions on the Y-axis and the predicted values on the X-axis; the most accurately predicted emotion is calm (as per Figure 24).
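Confusion matrices like the ones described above can be produced directly with scikit-learn; the labels below are illustrative, not the study's full emotion set.

```python
# How a confusion matrix of actual vs. predicted emotions can be built.
# The toy labels here are illustrative, not the study's data.
from sklearn.metrics import confusion_matrix

y_true = ["angry", "happy", "calm", "angry", "happy", "calm"]
y_pred = ["angry", "happy", "calm", "happy", "happy", "angry"]
labels = ["angry", "calm", "happy"]

# Row i = actual emotion, column j = predicted emotion;
# the diagonal counts correct predictions.
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

The matrix can then be rendered as a heatmap (e.g. with matplotlib or seaborn) to obtain figures like those in the manuscript.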

Discussions
From the above implemented visualizations, one can easily compare the results within this work and with existing efforts in this domain (please refer to Table 1 and Table 2).
The Decision Tree classifier works by creating two kinds of nodes, a decision node and a leaf node, and these nodes further produce output. The researcher team applied the algorithms to the collected dataset and also performed real-time testing on voice, which gives the desired output in the form of emotions. The MLP classifier had a higher average recall and F1-score, while the Extra Trees classifier had a higher average precision; a number of comparison charts have been presented in support of this fact (as per Figures 25 and 26). The conversion of speech to emotion is conducted by running the algorithms on a dataset of different audio files. The algorithms used were SVM, Gaussian Naive Bayes, MLP, Random Forest, Decision Tree, and Extra Trees. The accuracies obtained were 75.00%, 51.04%, 64.58%, 52.60%, 65.10%, and 67.28% for MLP, Support Vector Machine, GNB, Decision Tree, Random Forest, and the Extra Trees classifier respectively. Out of these classifiers, the MLP classifier had the highest accuracy in predicting correct information (as per Table 1).
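The six-classifier comparison described above can be sketched as a loop over scikit-learn estimators. Synthetic features stand in for the extracted audio features, so the accuracies produced here will not match the paper's figures; the hyperparameters are scikit-learn defaults.

```python
# Sketch of the six-classifier comparison, run on synthetic features.
# Accuracies from this toy data will not match the paper's figures.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

classifiers = {
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "SVM": SVC(),
    "GNB": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
}

rng = np.random.RandomState(0)
X, y = rng.rand(200, 20), rng.randint(0, 8, 200)   # stand-in features/labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each classifier and record its test-set accuracy.
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in classifiers.items()}
```

Running the same loop over the real RAVDESS feature vectors is what produces the accuracy table reported in this section.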

RECOMMENDATIONS
This paper focuses on emotion recognition from speech, but emotion can be recognized in other ways too. Using natural language processing, speech from the user can be translated into text to understand it and find concealed emotions conveyed by the words. Most human emotions can be easily seen on a person's face, which can be used to derive facial features and recognize emotions. Alongside facial emotion recognition, body language can also be taken as a good way to recognize emotions.

Limitations
This paper uses a particular dataset. As everyone has a different accent, it is difficult for the system to interpret everyone's emotions, which may result in errors. For emotion recognition to work well, the speech should be in English, ideally as the speaker's regular mother tongue. The system cannot detect all types of emotion accurately. The majority of speech emotion databases fall short when it comes to mimicking emotions in a genuine and understandable way.

Future Directions
• Future research should be conducted on big datasets with longer time durations.
• Future research should consider real-time voice input from people without delay.
• In order to make the programme more user-friendly, future research should train the application in more languages, such as Hindi, Urdu, etc.

CONCLUSION
Through this manuscript, the author team successfully created a Speech Emotion Recognition System that detects emotion from speech entered by the user. The team used Python 3.8 to create their research portal. The RAVDESS dataset was used as it contains 8 different emotions from all speakers. The UI was created in the Kivy Python framework. In this project, the librosa library is used to analyse audio files and extract features for emotion recognition, pyaudio to record sound from the user, numpy for calculations, and the sklearn library, which contains tools for machine learning. Kivy was used for the front end and Python for the backend. It is a ready-to-install application.
The Multi-Layer Perceptron is an algorithm which works as a neural network, whereas SVM classifies by finding an optimal separating hyperplane. The Gaussian Naive Bayes algorithm is based on applying Bayes' theorem with strong independence assumptions. Random Forest creates a number of decision trees on the provided dataset so that emotions can be predicted.
The manuscript points upcoming research towards execution on big datasets with longer durations of speech content. It also emphasizes that future research should consider real-time voice input from people without delay and, in order to make the programme more user-friendly, should train the application in more languages, such as Hindi, Urdu, and regional languages of the South Asian continent.

Figure 4. Emotion prediction from speech (Source: 10.1007/s11042-017-5292-7)

R.A. Khalil et al. (2019) demonstrated how deep learning algorithms like DBM, RNN, DBN, CNN, and AE have received a lot of attention in recent years. These deep learning techniques and their layer-wise architectures are demonstrated through the categorization of a variety of natural emotions, including enjoyment, happiness, sorrow, calm, shock, boredom, hate, terror, and anger. The work identifies some constructive directions for enhanced SER systems. SER methods based on CNNs and RNNs are investigated. The deep hierarchical CNN structure for extracting features has been merged with LSTM network layers. According to research, CNNs with a time-distributed network produce more accurate findings. Similarly, a system called PCA-DCNNs-SER based on a deep convolutional network (D-CNN) that employed audio data as input is provided.
Figure 8. Sample data set distribution

Figure 10. ER diagram describing connection with dataset

Figure 14. Level 2 data flow diagram for the working procedure of extracting emotions from voice

Figure 15. Use case diagram for extracting emotion from user speech

Figure 25. Comparison chart of the precision of the six algorithms

Figure 27. Comparison chart of the F1-scores of the six algorithms